What is Considered Duplicate Content by Crawlers?
-
I am asking this because I use a couple of site audit tools to crawl a site I work on every week, and they are showing duplicate content issues (I know there are a lot of them on this site), but some of what is flagged as duplicate content makes no sense.
For example, the following URLs were grouped together as duplicate content:
https://www.firefold.com/contact-us
https://www.firefold.com/sale
How are these pages duplicate content? I am confused about what site audit tools consider duplicate content.
Just FYI, this is data from Moz crawl diagnostics, but the SEMrush site auditor is giving me the same type of data.
Any help would be greatly appreciated.
Ryan
-
Yeah, I just started working on this site. I haven't used Moz Analytics much, so I just wanted to see how their crawler crawls pages.
And yes, I agree, there are a lot of BIG BIG BIG issues with this site.
I've got a large workload over the next few months, haha.
-
I would add that there is no text on any of those three pages - any "text" one would see there is actually just embedded in an image - which is a huge issue for a number of reasons:
- Search engines see that there's no text - a big no-no.
- You're getting practically no SEO value from the content that would be there, even if there isn't much.
- It's heavier this way - which makes load times slower.
I want to clarify that there are many bigger issues with these pages, but as your question concerns only duplicate content, I'll leave all of that out for the time being. To summarize, Google, Yahoo, and Bing are just seeing some duplicate banners, sidebars, etc. and then some images in the body of your pages. Hence, duplicate content.
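To make the point above concrete, here is a minimal sketch of what a text-only view of such pages might look like. It uses Python's standard `html.parser`; the template, the image file names, and the `visible_text` helper are all hypothetical and are not taken from the actual site or from any crawler's real implementation.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a text-only reader would see in the HTML."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical pages: same template, body "content" delivered as an image.
TEMPLATE = (
    "<html><body><nav>Home Sale Contact</nav>"
    "{body}"
    "<footer>Footer links</footer></body></html>"
)
page_a = TEMPLATE.format(body='<img src="contact-banner.jpg">')
page_b = TEMPLATE.format(body='<img src="sale-banner.jpg">')

print(visible_text(page_a))
print(visible_text(page_a) == visible_text(page_b))
```

In this toy case the two pages yield identical text, because the only thing that differs between them is an image the parser cannot read.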
-
Thanks for that information.
It makes sense looking at the data and pages from that perspective.
-
Hi Ryan!
Our crawler will flag pages that have at least 90% similarity across the entire source code of the page, not just the body.
The way to interpret the report is that the contact-us page has 35 duplicates, so "gabe" and "sale" are not dupes of each other in this section; each is only a duplicate of "contact-us". Those URLs might appear with their own duplicates of the same pages further down in the report.
While the pages do not appear to be similar on the front end, the issue is likely the amount of JavaScript code on those pages.
Our crawler cannot read JavaScript, so we are likely only able to see the template of the page. Other search tools are probably seeing the same thing, as this tool returns 79% similarity for the pages: http://www.freebulkseotools.com/similar-page-checker-tool.php
I can't provide much insight from a dev perspective but hope this helps!
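For anyone who wants to reproduce this kind of check locally, below is a rough sketch of a whole-source similarity comparison. It is an assumption-heavy illustration: the crawler's actual algorithm is not described in this thread, so Python's `difflib.SequenceMatcher` is swapped in as a stand-in, and the sample pages and function names are hypothetical. Only the 90% threshold comes from the explanation above.

```python
from difflib import SequenceMatcher

# The 90% figure quoted above; everything else here is an assumption.
DUPLICATE_THRESHOLD = 0.90


def source_similarity(html_a, html_b):
    """Similarity ratio (0.0 to 1.0) over the full page source."""
    return SequenceMatcher(None, html_a, html_b, autojunk=False).ratio()


def is_flagged_duplicate(html_a, html_b, threshold=DUPLICATE_THRESHOLD):
    """True when two sources are at least `threshold` similar."""
    return source_similarity(html_a, html_b) >= threshold


# Hypothetical pages: a large shared template whose real content would be
# filled in by JavaScript, so the raw source differs only by a few characters.
HEAD = "<html><head><script src='app.js'></script></head><body><nav>"
NAV = "<a href='/category'>Category</a>" * 20
FOOT = "</nav><footer>Shared footer</footer></body></html>"

contact_page = HEAD + NAV + "<div id='app' data-page='contact-us'></div>" + FOOT
sale_page = HEAD + NAV + "<div id='app' data-page='sale'></div>" + FOOT

print(round(source_similarity(contact_page, sale_page), 3))
print(is_flagged_duplicate(contact_page, sale_page))
```

Comparing the raw source this way treats navigation, scripts, and footer markup the same as body content, which is why template-heavy pages can score as near-duplicates even when they look different in a browser.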