GWT False Reporting or GoogleBot has weird crawling ability?
-
Hi I hope someone can help me.
I have launched a new website and am trying hard to make everything perfect. I have been using Google Webmaster Tools (GWT) to make sure everything is as it should be, but the crawl errors being reported do not match my site. I mark them as fixed, then check again the next day, and it reports the same or similar errors again.
Example:
http://www.mydomain.com/category/article/ (this would be a correct structure for the site).
GWT reports:
http://www.mydomain.com/category/article/category/article/ 404 (it does not exist, never has and never will). I have visited the pages listed as linking to this URL and they do not contain links in this form. I have checked the page source code, and all links from the given pages use the correct structure; it seems impossible to replicate this type of crawl.
This happens across most of the site. I have a few hundred pages, all ending in a trailing slash, and most pages of the site are reported in this manner, making it look like I have close to 1,000 404 errors, when I am not able to replicate this crawl using many different methods.
The site is using a .htaccess file with redirects and a rewrite condition.
Rewrite Condition:
Need to redirect when no trailing slash
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !\.(html|shtml)$
RewriteCond %{REQUEST_URI} !(.*)/$
RewriteRule ^(.*)$ /$1/ [L,R=301]

The above conditions force the trailing slash on folders.
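In Python terms, the decision those rules encode is roughly the following (a sketch only; the `%{REQUEST_FILENAME} !-f` file-existence check is omitted, since it needs access to the server's filesystem):

```python
def needs_trailing_slash_redirect(path):
    """Rough equivalent of the trailing-slash rewrite rules above.
    Returns the 301 target, or None if no redirect should happen."""
    if path.endswith("/"):
        return None                        # already ends in a slash
    if path.endswith((".html", ".shtml")):
        return None                        # .html/.shtml pages left untouched
    return path + "/"                      # redirect to the slashed URL

print(needs_trailing_slash_redirect("/category/article"))
# /category/article/
print(needs_trailing_slash_redirect("/article.html"))
# None
```

As written, the logic itself should not produce doubled paths like /category/article/category/article/ — it only ever appends a single slash to the requested path.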
Then we are using redirects in this manner:
Redirect 301 /article.html http://www.domain.com/article/
In addition to the above, we had a development site at http://dev.slimandsave.co.uk while I was building the new site. It was spidered without my knowledge until it was too late, so when I put the new site live I left the development domain in place (http://dev.domain.com) and redirected it like so:
<ifmodule mod_rewrite.c="">RewriteEngine on
RewriteRule ^ - [E=protossl]
RewriteCond %{HTTPS} on
RewriteRule ^ - [E=protossl:s]RewriteRule ^ http%{ENV:protossl}://www.domain.com%{REQUEST_URI} [L,R=301]</ifmodule>
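What the dev-domain rules above produce can be sketched like this (an illustrative sketch, not the actual server code): the `protossl` environment variable adds an "s" to the scheme when HTTPS is on, and `%{REQUEST_URI}` is carried over to the live domain unchanged.

```python
def dev_redirect(request_uri, https_on):
    """Sketch of the dev-domain redirect above: pick http/https based on
    whether HTTPS was on, and keep the request path unchanged."""
    protossl = "s" if https_on else ""
    return "http%s://www.domain.com%s" % (protossl, request_uri)

print(dev_redirect("/category/article/", False))
# http://www.domain.com/category/article/
print(dev_redirect("/category/article/", True))
# https://www.domain.com/category/article/
```

Since the path is passed through verbatim, this rule also should not double up path segments on its own.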
Is there anything that I have done that would cause this type of redirect 'loop'?
Any help greatly appreciated.
-
Yeah - do this!
-
Does anyone have any thoughts on this?
-
Sorry, I should also add that the URL structure that Google generates looks like this:
http://www.domain.com/category/article/
http://www.domain.com/category/article/same-category/differentarticle/
http://www.domain.com/category/article/same-category/another-different-article/
http://www.domain.com/category/article/another-different-category/differentarticle/
etc. It is as if the crawler gets to a category article and then moves sideways, appending the new path onto the current URL instead of replacing it, without dropping the suffix of the URL it is on.
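That "appended onto the current URL" pattern is exactly how a browser or crawler resolves a *relative* link (one with no leading slash) against a page whose URL ends in a trailing slash, so it is worth checking whether any links in the rendered HTML are relative rather than root-relative. Python's standard library demonstrates the resolution rule:

```python
from urllib.parse import urljoin

base = "http://www.domain.com/category/article/"

# A relative href resolves *under* the current trailing-slash directory:
print(urljoin(base, "same-category/different-article/"))
# http://www.domain.com/category/article/same-category/different-article/

# A root-relative href (leading slash) avoids the problem:
print(urljoin(base, "/same-category/different-article/"))
# http://www.domain.com/same-category/different-article/
```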
-
It doesn't sound like GWT is reporting falsely. You may want to check your trailing-slash URL rewrite: from what you describe, it sounds like URLs are being rewritten incorrectly somewhere, which generates the bad URLs that then show up in GWT.
Your 301 looks OK. If the dev site was spidered and indexed, add it to GWT as its own site, use the URL removal tool to remove it from the index, and then remove the site and redirect it.