Disallow statement - is this tiny anomaly enough to render Disallow invalid?

lzhao

Google site search (site:'hbn.hoovers.com') indicates 171,000 results for this subdomain. That is not a desired result - this site has 100% duplicate content. We don't want SEs spending any time here.

Robots.txt is set up mostly right to disallow all search engines from indexing this site. That asterisk at the end of the disallow statement looks pretty harmless - but could that be why the site has been indexed?

User-agent: *
Disallow: /*

lzhao

Interesting. I'd never heard that before.

We've never had GA or GWT on these mirror sites before, so it's hard to say what Google is doing these days.

But the goal is definitely to make them and their contents invisible to SEs. We'll get GWT on there and start removing URLs.

Thanks!

WilliamKammer

The additional asterisk shouldn't do you any harm, although standard practice seems to be just putting the "/".

Does it seem like Google is still crawling this subdomain when you look at webmasters crawl stats? While the disallow function in robots.txt will usually stop bots from crawling, it doesn't prevent them from indexing or keeping pages indexed that were before the disallow was put in place. If you want these pages removed from the index, you can request it through webmasters and also use meta robots noindex as opposed to the robots.txt file. Moz has a good article about it here: http://moz.com/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts

If you're just worried about bots crawling the subdomain, it's possible they've already stopped crawling it, but continue to index it due to history or additional indicators suggesting they should index it.

Welcome to the Q&A Forum

Browse the forum for helpful insights and fresh discussions about all things SEO.

Disallow statement - is this tiny anomaly enough to render Disallow invalid?

Browse Questions

Explore more categories

Related Questions

Disallow wildcard match in Robots.txt

H1 on responsive pages - Not enough room

Google's ability to crawl AJAX rendered content

Fetching & Rendering a non ranking page in GWT to look for issues

Can I disallow my subdomain for penguin recover?

Is Noindex Enough To Solve My Duplicate Content Issue?

How to disallow google and roger?

How do I use the Robots.txt "disallow" command properly for folders I don't want indexed?