Wildcarding Robots.txt for Particular Word in URL

EvansHunt

Hey All,

So I know that this isn't a standard robots.txt question. I'm aware of how to block or wildcard certain folders, but I'm wondering whether it's possible to block all URLs that contain a certain word?

We have a client that was hacked a year ago, and now they want us to help remove some of the pages that were being autogenerated with the word "viagra" in them. I saw this article and tried implementing it (https://builtvisible.com/wildcards-in-robots-txt/), and it seems that I've been able to remove some of the URLs (although I can't confirm that until I do a full pull of the SERPs for the domain). However, when I test certain URLs inside WMT, it still says that they are allowed, which makes me think that it's not working fully, or not working at all.

In this case, these are the lines I've added to the robots.txt:

Disallow: /*&viagra

Disallow: /*&Viagra

I know I have the option of individually requesting that URLs be removed from the index, but I want to see if anybody has ever had success wildcarding URLs that contain a certain word in their robots.txt? The individual URL route could be very tedious.

Thanks!

Jon

EvansHunt

Hey Paul,

Great answer. For some reason it totally slipped my mind that robots.txt is a crawling directive and not an indexing one. Yes, the pages return a 404 in the headers. I've grabbed a copy of the complete SERPs and will now manually disallow them.

Thanks!

Jon

ThompsonPaul

Thanks for the endorsement, Christy! Funny, I only just now saw Rand's recent WBF related to this topic, but I'm pleased to see my answer lines up exactly with his info.

P.

ThompsonPaul

You need to be aware, Jonathan, that there is absolutely nothing about a robots.txt disallow that will help remove a URL from the search engine indexes. Robots.txt is a crawling directive, NOT an indexing directive. In fact, in most cases, blocking URLs in robots.txt will actually cause them to remain in the index even longer.

I'm assuming you have cleaned up the site so the actual spam URLs no longer resolve. Those URLs should now result in a 404 error page. You must confirm they are actually returning the correct 404 code in the headers. As long as this is the case, it is a matter of waiting while the search engines crawl the spam URLs often enough to recognise they are really gone and remove them from the index. The problem with adding them to the robots.txt is that it actually tells the search engines NOT to crawl them, so they are unlikely to discover that the URLs lead to 404s, and hence the URLs may remain in the index even longer.
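For checking the status codes in bulk, something like the following could work. It is only a minimal sketch using Python's standard library; the function names `fetch_status` and `looks_removed` are made up for illustration, and it assumes the server answers HEAD requests the same way it answers GET requests, which is not guaranteed on every setup.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the final HTTP status code for a URL (hypothetical helper).

    urlopen follows redirects, so a spam URL that redirects to a live
    page will report that page's status (for example 200), not a 404.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as HTTPError; the code is what we want.
        return err.code


def looks_removed(status: int) -> bool:
    """True for the status codes that signal the page is gone (404 or 410)."""
    return status in (404, 410)
```

Running `looks_removed(fetch_status(url))` over the list of spam URLs pulled from the SERPs would flag any that still answer with something other than a 404 or 410, such as a 200 from a "soft 404" page or a redirect to the homepage.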

Unfortunately, you can't use a noindex tag on the offending pages, because the pages should no longer exist on the site. I don't think even a careful implementation of an X-Robots-Tag noindex directive in .htaccess would work, because the URLs should be resulting in a 404.

Make certain the problem URLs return a clean 404, use the Google Search Console Remove URLs tool for as many of them as you can (for example you can request removal for entire directories, if the spam happened to be built that way), and then be patient for the rest. But do NOT block them in robots.txt - you'll just prolong the agony and waste your time.

Hope that all makes sense?

Paul

Martijn_Scheijbeler

Hi Jon,

Why not just: Disallow: /viagra
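To compare how this rule and the two rules from the original question behave, here is a rough Python sketch of Google-style wildcard matching (`*` matches any run of characters, a trailing `$` anchors the end, and everything else is a case-sensitive prefix match). The helper names are hypothetical, the sample paths are invented, and the sketch ignores Allow rules and user-agent groups, so treat it as an approximation and not as how any search engine actually implements it.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a Disallow path with * and $ wildcards into a regex."""
    anchored = rule.endswith("$")  # a trailing $ pins the match to the end of the URL
    if anchored:
        rule = rule[:-1]
    # Each * matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path: str, rules: list[str]) -> bool:
    """True if any rule matches from the start of the path (case-sensitive)."""
    return any(rule_to_regex(rule).match(path) for rule in rules)


original_rules = ["/*&viagra", "/*&Viagra"]

# The original rules only match when "viagra" directly follows an ampersand.
print(is_disallowed("/index.php?p=1&viagra=cheap", original_rules))  # True
print(is_disallowed("/buy-viagra-online/", original_rules))          # False

# "/viagra" is a plain prefix, so it only matches paths that start with it.
print(is_disallowed("/viagra-pills", ["/viagra"]))                   # True
print(is_disallowed("/buy-viagra-online/", ["/viagra"]))             # False

# "/*viagra" matches the word anywhere in the path, in that exact case.
print(is_disallowed("/buy-viagra-online/", ["/*viagra"]))            # True
```

If the spam URLs do not literally contain `&viagra`, this may be why the tester reports them as allowed. Either way, matching a rule only stops crawling; it does not remove anything from the index.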

LesleyPaone

Jon,

I have never done it with a robots.txt. One easy way that I think you could do it would be at the page level: you could add a noindex, nofollow to the page itself.

You can generate it automatically too, and have it fire depending on the URL by doing a substring search on the URL. That will get them all for sure.
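A substring check like that could be sketched roughly as follows in Python. The names `BLOCKED_WORDS` and `robots_meta_for` are invented for illustration, and how the returned tag gets into the page template depends entirely on the site's platform. It also only helps for pages that still exist, return a 200, and are not blocked in robots.txt, since the crawler has to fetch the page to see the tag.

```python
from typing import Optional
from urllib.parse import unquote

# Hypothetical list of words that mark a URL as spam.
BLOCKED_WORDS = ("viagra",)

NOINDEX_TAG = '<meta name="robots" content="noindex, nofollow">'


def robots_meta_for(url: str) -> Optional[str]:
    """Return a noindex meta tag if the URL contains a blocked word, else None."""
    # Decode percent-escapes and lowercase so "Viagra" and "v%69agra" are caught too.
    decoded = unquote(url).lower()
    if any(word in decoded for word in BLOCKED_WORDS):
        return NOINDEX_TAG
    return None
```

Lowercasing before the search means one entry in the word list covers every capitalisation, which robots.txt rules cannot do because they are case-sensitive.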
