September is here again and the kids are back to school.

We thought we'd also go 'back to basics' and explain how retailers can simplify their data extraction process.

Web scraping is a way of extracting data from websites. Rich data extraction ensures that the most comprehensive product information is extracted from the retailer’s ecommerce site.

This ensures that the data remains accurate and up-to-date and leaves less room for error.

Why is web scraping important?

If retailers want to increase their product visibility and display their product inventory across the various channels, data extraction is essential. There are several ways of extracting site data, but one of the most common is screen scraping.

Screen scraping is carried out by a crawler that is sent onto an ecommerce site to capture specific data. This extracted data is then put together to create a product data feed.

Why are websites scraped?

Scraping makes data extraction much easier for retailers. Most of them have complex CMS systems, so their website is usually the only place where all of their product information comes together.

How can retailers improve their site so it’s easy to scrape?

Use IDs and classes within tags

If a website uses IDs and classes within its tags, it’s much easier to write XPaths (expressions in XPath, the query language for selecting nodes), which are used to navigate through HTML.
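As a rough illustration of why IDs and classes help, here is a minimal sketch using Python's standard-library ElementTree, which supports a limited subset of XPath. The page fragment, the `product` ID and the `product-name` and `price` classes are made up for this example and do not come from any real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product page fragment (not from a real site).
PAGE = """
<html>
  <body>
    <div id="product">
      <h1 class="product-name">Blue Kettle</h1>
      <span class="price">24.99</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# The id pins down the product container; the classes pin down each field.
product = root.find(".//div[@id='product']")
name = product.find("./h1[@class='product-name']").text
price = product.find("./span[@class='price']").text

print(name, price)
```

Each path is short and says what it is looking for, so it keeps working if other elements are added around the product block.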

Don’t use tables for structuring

If a site’s structure includes a lot of tables, it becomes more difficult to scrape. This is because there are unlikely to be IDs and classes within the table’s data. 

Not only this, but when tables are used, the XPaths can become much longer and are therefore more likely to break.
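To make the fragility concrete, the sketch below uses Python's standard-library ElementTree on two invented versions of a table-based page. The markup and the added "Colour" row are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical table-based layout with no ids or classes.
BEFORE = """
<html><body>
  <table>
    <tr><td>Name</td><td>Blue Kettle</td></tr>
    <tr><td>Price</td><td>24.99</td></tr>
  </table>
</body></html>
"""

# The same page after a row is added above the price.
AFTER = """
<html><body>
  <table>
    <tr><td>Name</td><td>Blue Kettle</td></tr>
    <tr><td>Colour</td><td>Blue</td></tr>
    <tr><td>Price</td><td>24.99</td></tr>
  </table>
</body></html>
"""

# With nothing to anchor on, the path has to count rows and cells.
POSITIONAL_PRICE = "./body/table/tr[2]/td[2]"


def extract(page: str, path: str) -> str:
    """Return the text of the first node matching path."""
    root = ET.fromstring(page.strip())
    return root.find(path).text


print(extract(BEFORE, POSITIONAL_PRICE))  # 24.99
print(extract(AFTER, POSITIONAL_PRICE))   # Blue
```

The path still matches after the change, but it now returns the colour instead of the price, which is arguably worse than returning nothing.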

Don’t use unnecessary AJAX

AJAX (asynchronous JavaScript and XML) content tends to load separately from the page's initial HTML, meaning that it can be missed in the scraping process.

Even once the initial HTML has loaded, further content can still appear afterwards. A crawler can be set to wait for AJAX content to load before scraping, but any AJAX can still dramatically increase the scraping time.
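The waiting step can be pictured as a simple polling loop. This is only a sketch: `wait_for` and `fake_price_lookup` are hypothetical helpers written for this example, and the "page" is simulated rather than loaded in a real browser.

```python
import time
from typing import Callable, Optional


def wait_for(fetch: Callable[[], Optional[str]],
             timeout: float = 5.0,
             interval: float = 0.5,
             sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Poll fetch() until it returns a value or the timeout is used up."""
    waited = 0.0
    while True:
        value = fetch()
        if value is not None:
            return value
        if waited >= timeout:
            return None
        sleep(interval)
        waited += interval


# Stand-in for a page whose price only appears on the third look,
# as if it had been filled in by an AJAX call.
attempts = {"count": 0}


def fake_price_lookup() -> Optional[str]:
    attempts["count"] += 1
    return "24.99" if attempts["count"] >= 3 else None


# The sleep is replaced with a no-op so the example runs instantly.
price = wait_for(fake_price_lookup, sleep=lambda seconds: None)
print(price, "after", attempts["count"], "attempts")
```

With a real half-second interval, those extra attempts add up quickly when repeated across thousands of product pages.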

Avoid using sessions

Unnecessary sessions make it difficult to deep link products and can also make the website difficult to scrape. 

This is most common on travel websites, where search URLs sometimes include a session that times out or expires after a period of time.
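One way to see the problem is to compare a session-bound URL with the stable deep link underneath it. The sketch below uses Python's standard `urllib.parse`. The parameter names in `SESSION_PARAMS` and the example URL are assumptions, and stripping the parameter only helps if the site still serves the page without it.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names for session parameters; real sites use many different ones.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}


def strip_session(url: str) -> str:
    """Return url with any query parameters that look like session ids removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in SESSION_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_session("https://example.com/hotel?id=42&sid=abc123"))
```

A site that never puts the session in the URL gives every product the stable form of the link from the start.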

Be consistent

Crawlers are programmed to recognise each type of webpage based on its structure; if site pages are inconsistent, the crawler will return invalid results. 

For example, if the crawler expects to find the product price under a particular HTML class or ID and the retailer introduces a new product page where the price sits under an unfamiliar class or ID, the product is likely to be overlooked.
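The scenario above might look like the following sketch, again using Python's standard-library ElementTree. Both page fragments and the `price` and `cost` class names are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# What the crawler was configured to look for (assumed for this example).
PRICE_PATH = ".//span[@class='price']"


def extract_price(page: str) -> Optional[str]:
    """Return the price text, or None if the expected element is missing."""
    node = ET.fromstring(page).find(PRICE_PATH)
    return node.text if node is not None else None


usual_page = "<div><h1>Blue Kettle</h1><span class='price'>24.99</span></div>"
new_page = "<div><h1>Red Kettle</h1><span class='cost'>19.99</span></div>"

for page in (usual_page, new_page):
    price = extract_price(page)
    print(price if price is not None else "no price found, product skipped")
```

The second page is perfectly valid, but because its price sits under a different class the crawler has no way of recognising it.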

Make sure your website is compliant

All websites should comply with W3C standards, which set out the rules that developers need to adhere to. It’s also best to have well-formed HTML so that XPaths can be created easily.

For example, if an HTML tag is not closed off properly, it can affect the structure of the site.
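A quick way to see the effect of an unclosed tag is to run the markup through a strict parser. This sketch uses Python's standard-library XML parser, which is stricter than a browser and will reject some markup that is valid HTML. The two fragments are invented for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


closed = "<div><p>In stock</p><span class='price'>24.99</span></div>"
unclosed = "<div><p>In stock<span class='price'>24.99</span></div>"

print(is_well_formed(closed))    # True
print(is_well_formed(unclosed))  # False
```

More forgiving HTML parsers will guess where the missing `</p>` belongs, and a different guess means a different tree for the XPath to walk.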

Keep your website accessible

Even in its simplest form, your website should be compatible with each of the various internet browsers. 

So, even if a user has content blockers switched on, the website should still load. This also makes it much easier to scrape the product data.

Published 10 September, 2013 by Robert Durkin

Robert Durkin is Chairman and Co-Founder of FusePump and a contributor to Econsultancy. You can follow him on Twitter.

Comments (7)

Sandy Sunny, CEO at AngelHart Marketing Services LLC.

Scraping is very essential, Especially for major e-commerce comparison sites.

about 3 years ago

Samantha

Yes, I agree with you as well. Scraping process must be consistent without any clutter.

about 3 years ago

Dan Mitroi

Hi Robert,
Is great to see so much good info here about web scraping.
The big key to stand in today search engines is indeed a compliant website.

about 3 years ago

Robert Durkin, Founder at FusePump (WPP)

Thanks for all your comments!

about 3 years ago

Ernests Stals, CEO, co-founder at Reach.ly

I'm quite surprised with this article and way it addresses scraping problem. It's not about html it self but about ways you pass data. For couple of years there have been joint Google, Yahoo and other company initiative called Schema.org which standardize way how to pass product info with html meta tags. By implementing this you get two gains - rich snippets in Search result page AND easy way for developers to scrape your site.
Don't take just my word for it - http://econsultancy.com/lv/blog/62899-why-you-need-to-schema-now-not-later
If you are using Magento, Bigcommerce, Shopify or other platform, just google for Schema for YourPlatform and you will find how to implement it or even get some plugin which does that for you.
Also google for Goodrelations and Productontology

about 3 years ago

Chris Gedge

Scraping other peoples content ey. You've never heard of the Google Panda update then? Although this method is very scalable, it isn't going to do your organic rankings any favors.

about 3 years ago

Monty Richards, Customer & Community Support at Inspyder Software Inc.(www.inspyder.com)

Hi Chris,

Scraping is necessary to get the most accurate information from the suppliers, who often won't take the time to update all of the wholesalers and retailers who count on them.

Having said that, there's nothing stopping you from also adding original/unique content to be able to also please the search engines...but you have to start with specific/accurate data.

almost 3 years ago
