
Duplicate content has been causing major issues for online retailers for many years, primarily due to the negative impact it has on search engine rankings.

Due to the size and complexity of online retail websites, there are far more areas than on a typical site that need to be addressed and monitored in order to eliminate duplicate content issues.

Here are nine potential causes of duplicate content, along with resolutions to help you overcome them.

1. Duplicate content caused by faceted navigation

This issue is very common for ecommerce sites and is likely to be the most damaging on this list for SEO. A single category page on some retail websites could have over 100 variations of its URL, due to the many combinations of parameters for facets / filters.

Here is an example of how a duplicate content issue caused by faceted navigation could arise:

[Image: duplicate content from faceted navigation. Top version = unfiltered category page; bottom version = filtered version of the same page.]

The example above illustrates how a query string is appended to the existing URL to filter the results; however, the content on the page remains the same, resulting in duplicate content. Search engines will be able to crawl these duplicate pages and, in addition to the SEO issues caused by duplicate content, these pages will also eat up crawl budget.
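To put this in concrete terms (the URLs below are hypothetical), an unfiltered category page and a filtered version of it might look like this:

http://www.example.com/shoes
http://www.example.com/shoes?colour=brown&size=6

Both URLs return essentially the same content, so search engines see two competing versions of one page, and every additional filter combination multiplies the problem.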

I’ve seen plenty of websites held back by indexation issues caused by faceted navigation and have also seen Google send messages via Webmaster Tools. 

Preventing these pages from being indexed

There are a number of ways you can prevent search engines from accessing / indexing faceted navigation pages; here are the ones that I would recommend:

Meta robots rules

Assigning meta robots rules to filter pages is the best solution in my opinion and I’ve had the most success with this in the past. 

I would always recommend using the following meta robots tag:

<meta name="robots" content="noindex,follow">

The ‘noindex’ value tells search engines not to index the page, and the ‘follow’ value tells them to continue following the links on the page.

Parameter handling in Google Webmaster Tools

Although I’ve had mixed results with it in the past, the parameter handling tool within Google Webmaster Tools has definitely improved, and lots of SEOs I know use it as their primary way to address dynamic pages.

Webmaster Tools is generally pretty good at identifying the parameters on your website; however, you can also manually add additional ones if it doesn’t find the parameters you're looking to address.

[Image: parameter handling in Google Webmaster Tools.]

For the example above, you can see that I’ve told Google that the parameter sorts page content and have asked them not to crawl any of these URLs. There are a number of options for this, enabling you to have exceptions for pages you would like to be indexed.

Canonical tag

The canonical tag was introduced in 2009 to help webmasters tell search engines that a URL is a variation of another URL. The canonical tag can be used to tell search engines that filter pages are duplicate versions of the original category page.
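As a simple illustration (the URLs are hypothetical), a filtered page such as http://www.example.com/shoes?colour=brown could include the following tag in its <head> to point search engines back to the unfiltered category page:

<!-- placed on the filtered URL; example.com is a placeholder -->
<link rel="canonical" href="http://www.example.com/shoes">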

I would recommend using the meta robots rules over the canonical tag as I’ve seen plenty of examples where search engines have ignored the canonical tag and continued to index pages. 

2. Duplicate content caused by product ordering

Similar to faceted navigation, product-ordering (sort) parameters create duplicate variations of pages (with the same content and meta content), which can be accessed by search engines.

Often, these pages manage to slip under the radar for retailers, as they won’t generate the same volume of URLs as faceted navigation.

I would recommend adopting the same methods for resolving this issue as for duplicate content caused by faceted navigation: meta robots rules (recommended as the best option), the canonical tag, or parameter handling in Google Webmaster Tools.
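To illustrate with a hypothetical URL, a sorted version of a category page such as:

http://www.example.com/shoes?dir=asc&order=price

would simply carry the same meta robots tag in its <head> as a faceted page:

<meta name="robots" content="noindex,follow">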

3. Duplicate content caused by hierarchical URLs

A few years ago, hierarchical URLs were considered to be best practice for ecommerce websites, as they illustrate the structure of the website. Now, as SEO has evolved, hierarchical URLs can often be a cause of duplicate content issues, as they create multiple URL variations of the same product when it's featured within more than one category.

In most cases, these products will have the same or very similar content, which will prove detrimental to search engine rankings.

I would usually recommend, if possible, creating rewrite rules to change these pages to top-level product URLs. If you’re unable to do this, I would recommend using the canonical tag to pass value to the preferred page and to make it clear to search engines which one is the primary version.
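As an illustration (the URLs are hypothetical), a product reachable through two category paths:

http://www.example.com/mens/boots/leather-brogue-boot
http://www.example.com/sale/leather-brogue-boot

could include the following tag in the <head> of both versions, pointing to a single top-level product URL:

<!-- added to both category-path versions; placeholder URLs -->
<link rel="canonical" href="http://www.example.com/leather-brogue-boot">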

4. Duplicate content caused by search pages

Like faceted navigation, catalogue search pages are another prime example of a common duplicate content perpetrator, with lots of large and small retailers leaving them accessible to search engines.

I’ve seen scenarios where retailers have had well over 100,000 search pages indexed, which has caused significant issues for them with their rankings.

The easiest way to prevent search engines from accessing these pages is to block the directory in the robots.txt file.

Example: 

To block pages like this: /shop/catalogsearch/result/?q=testquery

You would add this line into the robots.txt file: Disallow: /shop/catalogsearch/

If these pages have already been indexed for your website, I would recommend requesting removal of the directory within Google Webmaster Tools – your pages will generally be removed from the index within 12 hours.

5. Duplicate content caused by internationalisation

Something that I see time and time again from retailers is the introduction of international versions of their websites before they’ve translated all of their content. The result of this is lots of duplicate versions of products and categories with slightly different URLs.

This is not always the case: some platforms will not manage international products with multi-site functionality (which would create replicas of the initial site architecture), so they are less likely to have this issue.

In my opinion, the only truly effective way to resolve this situation is to add the international content, although you could temporarily block access to the pages until the content has been added.
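If you do opt to block the untranslated sections temporarily, one way to do it is with robots.txt rules along these lines (the directory names below are hypothetical, and the rules should be removed once the translated content goes live):

# hypothetical untranslated country sections
User-agent: *
Disallow: /de/
Disallow: /fr/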

For those looking to launch international versions of their website, I would strongly recommend using university students to translate the content, as they’re quick and affordable. In the past we’ve had great results from posting adverts on university websites.

6. Duplicate content caused by pagination

Pagination is another really common source of duplicate content for online retailers, and before the introduction of the rel=next and prev tags it was seen as a big issue for SEO.

The rel=next and prev tags, which were introduced by Google in 2011, allow webmasters to tell search engines which pages form part of a paginated series and prevent them from being treated as duplicate content.
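As an example of the markup (with hypothetical URLs), page two of a paginated category would reference its neighbouring pages from its <head> like this:

<!-- on page 2 of a paginated category; example.com URLs are placeholders -->
<link rel="prev" href="http://www.example.com/shoes?p=1">
<link rel="next" href="http://www.example.com/shoes?p=3">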

7. Duplicate content caused by session IDs

Session IDs are one of the most annoying things to have to face in SEO, as URLs are created based on user sessions and can cause an unlimited number of new duplicate pages to be created and indexed.

Ecommerce websites commonly have issues with session IDs, as a unique ID is appended to the URL when there’s a change in the host name. So when users move from one subdomain to another (often because of an SSL certificate), a session ID is appended to the URL of the next page they visit.
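For illustration (the URL and parameter name are hypothetical), a session-ID version of a product page might look like this, with every new session producing yet another variation of the same page:

http://www.example.com/leather-brogue-boot?SID=a1b2c3d4e5f6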

Session IDs can be a complete nightmare to eliminate, but the best (and only real) solution is to resolve the issue properly and stop the session IDs from being created.

You can also use the parameter handling section of Google Webmaster Tools to tell search engines to ignore session IDs. 

8. Duplicate content caused by print pages

Often, more so with older ecommerce websites, there is an option on product pages to display a printer-friendly version of the page, which shows the same content on a different URL. These pages are duplicate versions of the product pages and therefore count as duplicate content.

In order to prevent these pages from being indexed, you need to either apply meta robots rules (noindex, follow) to the pages if they’re dynamic or disallow the directory in the robots.txt file.
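For example, if the printer-friendly versions all sit under a dedicated directory (the path below is hypothetical), a single robots.txt rule will cover them:

# hypothetical directory holding printer-friendly pages
Disallow: /print/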

9. Duplicate content caused by review pages

Customer reviews are displayed in different ways depending on the way the site has been built (or the platform it’s been built on). Some websites display all of the reviews on product pages and then have separate (often paginated) pages with just the reviews.

Here’s an example of this:

[Image: duplicate content caused by review pages.]

As you can see, the review pages contain the same customer review content but on a different URL. 

In order to prevent these pages from being indexed, you just need to disallow the directory in the robots.txt file, or if they’re dynamic, apply meta robots rules (noindex, follow).

If you have had any other issues with duplicate content, please feel free to ask questions within the comments below or email me at paul (@) gpmd.co.uk. 

Paul Rogers

Published 8 January, 2013 by Paul Rogers

Paul works as Digital Marketing Manager for Session Digital / Inviqa and is also the Co-Founder of MageSEO. He is a contributor to Econsultancy.

Comments (20)

Adam Palczewski, Global Digital Operations Director at Mindshare / WPP

Thanks for this post Paul. Just to challenge you even more, here is another big one I would add to the list.

<< 10. Duplicate Content caused by owning multiple brands selling the same products all using the same CMS platform. >>

Imagine 9 separate brands/urls with duplicate content... this is the challenge brands like ShopDirect have.

Regards,
Adam

almost 4 years ago

Michael

A great post and I totally agree with Adam's comment. Having worked on sites where there are duplicate content issues because of the product feeds that drive much of ecommerce, I know this frustration.

With so many sites accepting the feed, it's vital to ensure that your most important pages are edited and that they offer more than the other x hundred sites; standing out to both users looking at pages and bots indexing them.

almost 4 years ago

Alison

I see many parallels here with the problems experienced when trying to decide the best page-naming policy within a web analytics tool.
The need to know which page is viewed but also how it was reached takes up a lot of the effort in new implementations.

almost 4 years ago

Jan-Willem Bobbink

What about crawl budget and faceted navigation? Even if you noindex all those filtered pages, Google still needs to crawl them all. You should definitely use a different solution for that. Think of URLs with hashtags or simply use AJAX for filtering? Make sure you add an option to the CMS for making exceptions on that for keywords like "brown converse sneakers size 6" if there is enough volume to create a specific landing page for it.

almost 4 years ago

Paul Rogers, Digital Marketing Manager at Session Digital / Inviqa

Something that I forgot to mention in this article is that I'd recommend implementing all three solutions for faceted navigation and order pages to cover all of the bases.

Also, I'd recommend adding nofollow to links to pages that you don't want Google to access, like faceted navigation pages.

almost 4 years ago

Paul Rogers, Digital Marketing Manager at Session Digital / Inviqa

Hi all,

Adam:
Thanks for commenting, I agree, this is a nightmare, I used to work for a retail company that had relatively similar product ranges on around 10 websites and duplicate content was always an issue.

Michael:
Thanks for commenting, I agree, prioritising the products is key, I always used to try and get different people writing content for different sites, but it wasn't always possible. We also had a lot of legacy issues with the site and we prioritised the products by traffic.

Jan-Willem:
Thanks for commenting, I agree with the creation of landing pages, we have built a Magento module that automatically populates categories, which are created based on filters and product attributes. This has been really important in some of the work we've done for our clients.

I know that pages that have the noindex tag are still crawlable, it only prevents the pages from being indexed. I would use parameter handling to tell Google not to access the faceted pages too.

I agree that AJAX filtering is a great way of avoiding this issue, however it's a bit of a luxury for a lot of retail websites.

almost 4 years ago

Gole

Hi Paul, it is really a great read.

I have one more confusion. Suppose I have a product review site where i get user reviews on the products. I am showing each review on the product page and also creating separate pages for each review, also showing reviews on particular person's profile page on same site. I know it's not a good practice.

Can you please suggest what to do in this case

almost 4 years ago

Paul Rogers, Digital Marketing Manager at Session Digital / Inviqa

Hi Gole,

A lot depends on if there's any unique content on the pages with the duplicate review content. If not, I would probably suggest removing them, canonicalising them or redirecting them.

Since Panda, having lots of pages with little or no unique content is a risk.

almost 4 years ago

Richard Hatfield, Director at Allies Limited

Thanks for the post Paul. Some great points here that highlight the complexity of ecommerce. Should the retailers with significant Google PPC budgets have to 'jump through these hoops', or should Google, with their huge tech resources, be doing more to handle this problem for their customers? I am sure a solution to this would be mutually beneficial.

almost 4 years ago

Paul Rogers, Digital Marketing Manager at Session Digital / Inviqa

Hi Richard,

Thanks for your comment, it's a great point. I think they probably should be doing more, but more to the point, what they're doing now should've been done years ago.

They've improved the parameter handling resource in GWT a lot over the last 12 months, but it could still be a lot better. It should be easier for users to prevent Google from crawling pages and types of pages.

I'm hoping to see more improvements from GWT this year, with indexation being a big part of the changes.

Thanks,

Paul

almost 4 years ago

Richard Hatfield, Director at Allies Limited

Fingers crossed!

Rick

almost 4 years ago

Charles Smith

Here is another issue; I am a second hand book seller and upload our stock onto Amazon, plus we have the same stock on our own site.

Is there a way of preventing Google from treating our site as duplicate content?

almost 4 years ago

Paul Rogers, Digital Marketing Manager at Session Digital / Inviqa

Hi Charles,

Google tends to look at the original source of the content (the one that it crawls first) as the author. I've had issues in the past with copywriters writing content and adding it to their site as a case study before we published it.

The only real way of avoiding this is to write unique content for your own website and create separate content that can be distributed to other websites.

Unless you could get the other site to agree to add a canonical tag directed back to your website, which would be unlikely.

almost 4 years ago

Charles Smith

Thanks for the reply Paul.

We have 40,000 books online, so it's not possible to create unique content to try and avoid duplicate copy.

And a book title, its author and ISBN number can't be made unique - it has to be the same.

Are you saying that whichever site we upload our stock list to first will be seen as the original content, and the second site the duplicate?

So if we uploaded to our site first and then waited a few days and uploaded the same stock to Amazon, would the Amazon site be regarded by Google as duplicate, and so not indexed?

I am sure Amazon would have something to say about that.

almost 4 years ago

Paul Rogers, Digital Marketing Manager at Session Digital / Inviqa

Whenever I've dealt with Amazon for retailing, they've always specified that the content needs to be unique.

Google does usually attribute the content to the original author, as this prevents people from stealing each other's content to improve their rankings.

You're more than welcome to email me if you have any additional questions (paul @ gpmd.co.uk)

almost 4 years ago

Ashley Friedlein, Founder, Econsultancy & President, Centaur Marketing (Staff)

Interesting post Paul. We (on this site) have to deal with the internationalisation issue.

From July of last year we started serving different URLs for every page of the site for *every* country in the world using subdirectories i.e. on the same domain. This is so that we have uniquely addressable URLs for each country as we begin to make the content different for each country. However, until we get round to that, most of the content is identical.

The way we deal with duplicate content is to:
1. Canonicalise URLs to the 'master' version which has no country-level sub-directory.
2. But then use Sitemaps to 'tell' Google about the different country versions using the "hreflang=" attributes, despite the fact the sites are actually all currently in English.

We had hoped that in the SERPs Google would show the country-level URLs. It did do a bit for a while but now seems to show the master/canonical URL only. This doesn't matter too much because we auto-redirect users once they click to the relevant country URL anyway.

Some data / charts to look at:
1. Econsultancy's SEO traffic for the five months preceding the change and the five months after it (http://assets.econsultancy.com/images/0002/7834/Econsultancy_SEO_visits_2012_H1_and_H2_comparison.jpg). As you can see, not much impact either way.

2. From GWT a chart showing a big increase in the no. of URLs Google indexed on our domain over last year: http://assets.econsultancy.com/images/0002/7832/Econsultancy_URLs_indexed_2012.jpg

3. BUT, GWT showing that Google crawled a *lot* more URLs but didn't index them - we're presuming because it knows that they are duplicate: http://assets.econsultancy.com/images/0002/7833/Econsultancy_URLs_crawled_but_not_indexed_2012.jpg

As a broader point, our experience is that duplicate content certainly doesn't do your search rankings any good, but equally it seems to be rare that it does a lot of damage? So I do wonder sometimes whether (if you're using Sitemaps and GWT to 'educate' Google about your site) it is better to leave Google to figure stuff out rather than spend too much time worrying about duplicate content? In theory duplicate content can weaken your internal links and tidying up can help you 'link sculpt'. Whilst I'm sure this helps, I do wonder whether the effort is worth the return vs other things you could be doing (e.g. creating a better website and content in the first place).

almost 4 years ago

Paul Rogers, Digital Marketing Manager at Session Digital / Inviqa

Hi Ashley, thanks for your comment.

We've had very similar situations with clients that have multi-national sites, where they've introduced new countries before they've had the content to add.

I would say that duplicate content can be a much bigger issue for ecommerce sites - I've worked with a number of clients who have seen significant drops in traffic (and have received messages from Google) due to dynamic pages being indexed. Duplicate content has also become a lot more important to look out for post-panda.

Thanks again for your comment; I found hearing about what you're doing at Econsultancy really interesting.

Thanks,

Paul

almost 4 years ago

Jan

Are you missing one issue?
The non-www and www versions of the website.

Jan

almost 4 years ago

Paul Rogers, Digital Marketing Manager at Session Digital / Inviqa

Hi Jan,

Thanks for your comment. I didn't include all of the duplicate content issues that could impact websites as I wanted to keep it fairly specific to ecommerce.

I know there are a few in there that would be relevant to non-retail sites too, but they're all potential issues that should be eliminated by retailers.

almost 4 years ago

diana s, seo manager at q

thank you for this very interesting post. regarding faceted navigation, do you recommend (in addition to meta robots, rel canonical, GWT and nofollow) blocking those pages via a robots.txt regex line?

thank you

about 3 years ago
