Canonicalization and Duplicate Content

Most websites have multiple URLs that lead to the same or very similar versions of their pages. There are a couple of common reasons for this: the site owner may be split-testing different page elements, or tracking traffic from other sources, such as social media sites.

A few questions arise from this. Which versions get indexed and shown on Google’s results pages? What does Google look at when determining which version is the main, or “canonical,” one that ranks on SERPs? Will site owners be penalized for having duplicate content pages?

The process of choosing that main version is called canonicalization (also known as normalization or standardization). Today we will answer these questions and give you a clearer understanding of how it works. We will cover the basics, the signals Google looks at, and example scenarios involving multiple site versions and duplicate content. This is introductory information only, so please consult white label SEO services if you need more information or help with canonicalization issues.

Let’s jump in!

The Canonical Tag

A canonical tag is a snippet of code (a link element with rel="canonical") that you place in the “<head>” section of your page’s HTML or send in an HTTP header. It tells search engines which version of a URL you want to rank: the one Google shows to searchers. When search engines crawl a website and come across duplicate or similar content, these tags make clear which URL version you prefer to have indexed.

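This is what it looks like, with https://example.com/page/ standing in for your preferred URL:

<link rel="canonical" href="https://example.com/page/" />

(Make sure the tag sits inside the <head> section and is closed correctly.)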
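
For sites that build pages from templates, both forms of the canonical signal (the tag in the page head and the HTTP header) can be generated rather than typed by hand. The following is a minimal Python sketch; the function names and the example.com URLs are illustrative assumptions and not part of any library or CMS.

```python
from html import escape


def canonical_link_element(url: str) -> str:
    """Return the <link> element to place inside a page's <head>."""
    # escape() protects characters such as & and " inside the href attribute.
    return f'<link rel="canonical" href="{escape(url, quote=True)}" />'


def canonical_http_header(url: str) -> str:
    """Return the equivalent HTTP response header line.

    The header form is handy for files that have no <head>, such as PDFs.
    """
    return f'Link: <{url}>; rel="canonical"'


if __name__ == "__main__":
    preferred = "https://example.com/shoes/"  # hypothetical preferred URL
    print(canonical_link_element(preferred))
    print(canonical_http_header(preferred))
```

Whichever form you use, the URL should be the absolute address of the version you want indexed.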

However, the canonical tag is only one of several signals Google checks. Google weighs multiple factors, and the canonical tag may even be overridden in favor of a different signal.

Canonicalization Signals

So how does Google determine which URL is the “canonical” version?

Google’s John Mueller explains that there are two general guidelines when picking the canonical URL:

Site preference – what the site tells Google it wants the canonical URL to be
User preference – what URL Google determines is more beneficial for the searcher

The things Google looks at regarding site preference:

Canonical tag (link rel canonical)
Which URL is in the sitemap file
Internal linking
Redirects
HTTPS URLs
URLs that look better/cleaner

Mueller says Google factors in all of these elements and chooses as canonical the URL that the signals favor most. He also advises site owners who have a preference for which URLs are shown to searchers to apply that preference consistently across their websites. Again, white label local SEO is helpful if you have limited resources in this area.
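
One way to act on the advice about consistency is to audit whether your own signals agree with each other. The sketch below is a hypothetical helper, not a model of how Google weighs signals: the PageSignals record and its field names are assumptions made for illustration. It simply counts how many of a page's signals point at each URL, so that disagreement is easy to spot.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageSignals:
    """Hypothetical record of the site-preference signals found for one page."""
    canonical_tag: Optional[str] = None
    sitemap_url: Optional[str] = None
    redirect_target: Optional[str] = None
    internal_link_targets: list = field(default_factory=list)


def signal_votes(signals: PageSignals) -> dict:
    """Count how many signals point at each URL variant."""
    votes = Counter()
    for url in (signals.canonical_tag, signals.sitemap_url, signals.redirect_target):
        if url:
            votes[url] += 1
    # Every internal link counts as one vote for the URL it targets.
    votes.update(signals.internal_link_targets)
    return dict(votes)


def is_consistent(signals: PageSignals) -> bool:
    """True when every signal that is present names the same URL."""
    return len(signal_votes(signals)) <= 1


if __name__ == "__main__":
    page = PageSignals(
        canonical_tag="https://example.com/a/",
        sitemap_url="https://example.com/a/",
        internal_link_targets=["https://example.com/a/", "http://example.com/a"],
    )
    print(signal_votes(page))
    print("consistent:", is_consistent(page))
```

In this example one internal link still points at the HTTP version without a trailing slash, which is the kind of mixed message the consistency advice is meant to prevent.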

Other factors in the canonicalization process include duplicates, external links, and Hreflang.

Duplicate Content

Duplicate or very similar content may exist for various reasons, intentional or otherwise, and can cause many issues when ranking. Canonicalization can mitigate these problems. While duplicate content won’t earn you a penalty per se, it doesn’t mean it’s entirely without consequences.

Mainly, duplicate content can delay the right pages from showing up on results pages. For instance, if you have two similar pages ranking for the same keywords, they could compete with each other, or Google may take time to determine which one to place on SERPs. Even when Google does decide, it may not show the one you prefer or the version you put more effort into. Google is getting better at identifying which pages offer the best user experience. Still, valuable content can sometimes get buried under excessive duplicates like a needle in a stack of toothpicks: similar in form but not in essence. Canonicalization practices simplify this process.

Google’s Duplicate Canonicalization Rules

In terms of URLs, Google will often choose a cleaner, shorter URL version over one that is longer and includes parameters. Also, Google will often prefer HTTPS to the HTTP version of a site.

When Google encounters duplicate content across several pages, it chooses one canonical version to index: the version it determines to be the best. All the pages it identifies as duplicates form a cluster. Signals pointing to any page within that cluster, such as links, are consolidated and attributed to the chosen canonical. Note that the canonical Google selects can still change over time depending on Google’s ranking and indexing factors.
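
The idea of clusters and a chosen representative can be illustrated with a toy model. The sketch below is only an assumption-laden illustration and not Google's actual algorithm: it groups pages whose HTML is identical after whitespace and case are collapsed, then picks one URL per group using the preferences mentioned above (HTTPS first, then no parameters, then the shorter URL). All function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def content_fingerprint(html: str) -> str:
    """Hash the page body after collapsing whitespace and case."""
    normalized = " ".join(html.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cluster_duplicates(pages: dict) -> list:
    """Group URLs whose content fingerprints match. `pages` maps URL -> HTML."""
    clusters = defaultdict(list)
    for url, html in pages.items():
        clusters[content_fingerprint(html)].append(url)
    return list(clusters.values())


def preference_key(url: str) -> tuple:
    """Lower tuples sort first: HTTPS, then no query string, then shorter."""
    parts = urlsplit(url)
    return (parts.scheme != "https", bool(parts.query), len(url))


def pick_representative(cluster: list) -> str:
    """Pick the URL this toy model would treat as canonical for the cluster."""
    return min(cluster, key=preference_key)


if __name__ == "__main__":
    pages = {
        "http://example.com/shoes": "<h1>Shoes</h1>",
        "https://example.com/shoes": "<h1>Shoes</h1>",
        "https://example.com/shoes?utm_source=social": "<h1>Shoes</h1>",
        "https://example.com/hats": "<h1>Hats</h1>",
    }
    for cluster in cluster_duplicates(pages):
        print(pick_representative(cluster), "<-", cluster)
```

A real search engine combines many more signals than these three, which is why the selected canonical can differ from the one a simple rule would predict.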

The following are a few examples of cases that are considered duplicate content on pages or even canonicalization issues:

URLs with “www.” vs. those without
URLs with and without capital letters – it’s recommended to use lowercase as much as possible.
URLs with and without trailing slashes “/” at the end of the web address
URLs for pages containing scraped or syndicated content – republishing scraped content without permission can amount to copyright infringement, while content syndication is acceptable if you link back to the original source. Syndication becomes a problem when Google chooses the republished URL as the canonical version over the original site, because the copy then appears in results instead of the original. This must be sorted out if it happens.
URLs with and without “index.html”
URL location variants containing the same content
URLs for mobile devices
URL versions redirecting from social media sites
URLs with parameters appended to the end, such as those for faceted navigation, tracking codes, session IDs, or content sorting, whether or not the parameters change the page’s content
Any pages that show the same full content as another page, including the main blog page, category pages, tag pages, paginated pages, and feed pages. These can confuse Google and cause the wrong canonical to be selected.

Remember, consistency is vital when it comes to site preference. Structure your URLs as uniformly as possible to minimize duplicate content and canonicalization issues.
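
Several of the variants listed above can be collapsed with a single normalization routine applied wherever your site generates links. The following is a minimal sketch under an assumed URL policy (HTTPS, no "www.", lowercase paths, trailing slash on directory-style paths, no "index.html", tracking parameters stripped). The policy, the function name, and the TRACKING_PARAMS list are assumptions for illustration; adapt them to whatever convention your site actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that never change page content on this site.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "sessionid",
}


def normalize_url(url: str) -> str:
    """Map common duplicate URL variants onto one preferred form."""
    parts = urlsplit(url)

    # Host names are case-insensitive; this policy also drops "www.".
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]

    # Paths can be case-sensitive on some servers, so lowercase them only
    # if the site really serves every page in lowercase.
    path = parts.path.lower()
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    # Add a trailing slash unless the last segment looks like a file name.
    if not path.endswith("/") and "." not in path.rsplit("/", 1)[-1]:
        path += "/"

    # Keep only parameters that are not on the tracking list.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit(("https", host, path, urlencode(kept), ""))


if __name__ == "__main__":
    for variant in (
        "http://WWW.Example.com/Shoes/index.html?utm_source=twitter",
        "https://example.com/shoes",
        "https://example.com/shoes?color=red&sessionid=abc",
    ):
        print(variant, "->", normalize_url(variant))
```

Parameters that do change the content, such as the color filter in the last example, are deliberately kept; whether those filtered pages should point to the unfiltered page as their canonical is a separate decision.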

Hreflang

Although hreflang helps Google serve the right regional version of a page, it does not prevent duplication issues on international sites. When several country versions share the same content, Google will by default pick one version as the canonical and then try to swap in the matching local version in search results. However, the swap does not always work, because the local version is not the canonical one. When it fails, users end up being served pages that are meant for users in a different country. Because Google’s systems cannot always sort this out properly, website owners are encouraged to publish localized pages with at least slightly different content (even if the difference is a simple translation).
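
For reference, hreflang annotations are link elements that list every regional alternate of a page. The sketch below generates them in Python; the function name, the locale codes, and the example.com URLs are assumptions for illustration, and the x-default entry is simply pointed at whichever version you nominate as the fallback.

```python
from html import escape


def hreflang_links(alternates: dict, default_lang: str) -> list:
    """Build hreflang <link> elements for a set of regional page versions.

    `alternates` maps a language-region code (for example "en-gb") to the
    absolute URL of that version. `default_lang` names the entry that is
    also used as the x-default fallback.
    """
    links = [
        f'<link rel="alternate" hreflang="{escape(lang)}" href="{escape(url)}" />'
        for lang, url in alternates.items()
    ]
    fallback = alternates[default_lang]
    links.append(
        f'<link rel="alternate" hreflang="x-default" href="{escape(fallback)}" />'
    )
    return links


if __name__ == "__main__":
    versions = {
        "en-us": "https://example.com/us/",
        "en-gb": "https://example.com/uk/",
    }
    for line in hreflang_links(versions, default_lang="en-us"):
        print(line)
```

The same full set of elements should appear on every version of the page, including a reference to the page itself, so that the annotations confirm each other.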

JavaScript Sites

For JavaScript sites, usually those built on app shell models, the initial HTML that is served before any scripts run can be nearly identical from one page to the next, and even to the code on entirely different websites. Because of this, the pages can sometimes get canonicalized as part of other page clusters (i.e., other websites) or to other pages on the same domain.
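
To see why app shell pages are easy to mistake for one another, you can compare the raw HTML of two routes before any JavaScript has run. The sketch below uses Python's standard difflib module; the SHELL template and the route titles are invented for illustration and do not describe any particular framework.

```python
from difflib import SequenceMatcher

# Hypothetical app shell: every route ships the same markup, and the real
# content is only filled into <div id="root"> once the script executes.
SHELL = (
    "<!doctype html><html><head><title>{title}</title></head>"
    '<body><div id="root"></div><script src="/app.js"></script></body></html>'
)


def raw_html_similarity(first: str, second: str) -> float:
    """Return a 0.0 to 1.0 similarity ratio between two unrendered documents."""
    return SequenceMatcher(None, first, second, autojunk=False).ratio()


if __name__ == "__main__":
    home = SHELL.format(title="Home")
    cart = SHELL.format(title="Cart")
    print(f"similarity before rendering: {raw_html_similarity(home, cart):.2f}")
```

If the unrendered documents are almost indistinguishable, putting some unique content in the initial HTML, such as the title, headings, and canonical tag, gives a crawler something to tell the pages apart.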

Remember that Google relies on algorithms and most likely runs duplicate detection automatically. This can be part of the problem. If Google flags pages as duplicates based on the initial HTML, before fully crawling and rendering them, it may be unable to tell them apart correctly because each one looks like another page. Rendering can then be delayed, because the page has already been tagged as a duplicate. As your white label experts, we can help with our white label digital advertising and SEO services.