No indexing of URLs including a query string with robots.txt

HMK-NL

Dear all,

How can I block URLs/pages with query strings like page.html?dir=asc&order=name with robots.txt?

Thanks!

HMK-NL

Dear all, what is the best option? And are the options below good?

A:

Disallow sort-order (Only URLs with value = asc)

"A single URL may contain many parameters for each of which you can specify settings. More restrictive settings override less restrictive settings. For example, here are three parameters and their settings"

source: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1235687

B:

User-agent: Googlebot
Disallow: /*.=name$

for example www.sub.domain.com/collection.html?dir=desc&order=name

source: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449

Thanks!

kyleNeedham

You could always just use rel="canonical" which would be much better than completely blocking all URL parameters.

Matthew_Edgar

Hey,

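Should that second URL be www.sub.domain.com/collection/adresboeken.html?whatever=something? If so, note that Google treats the ? in a robots.txt rule as a literal character, so /collection/?* only matches URLs that begin with /collection/?. To say that anything within /collection/ with a query string should not be crawled, the pattern would be /collection/*? instead. Be aware that if adresboeken.html always has a query string, it may not get indexed.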

The other option I'd consider before using robots.txt is telling Google to ignore dir=desc&order=color in Google Webmaster Tools parameter handling. This is the best way to handle query string issues. (Assuming you are trying to influence Google. Clearly Google Webmaster Tools won't affect Bing!)

Another idea is to set a canonical URL on /collection/adresboeken.html referencing /collection/adresboeken.html without the query string. This tells the search engines that the query strings do not make a unique URL. (adresboeken.html?dir=desc&order=color is the same as adresboeken.html?dir=desc&order=price is the same as adresboeken.html?dir=asc&order=color is the same as adresboeken.html, and so on).
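To make the canonical idea concrete, here is a minimal Python sketch that derives the query-free URL and builds the link tag for the page head. The function names are made up for illustration, and the http:// scheme is an assumption, since the URLs in this thread are written without one.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Drop the query string and fragment; keep scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def canonical_link_tag(url: str) -> str:
    """Build the <link> element that would go in the page's <head>."""
    href = escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}" />'


# Assumed scheme (http://); the sorted variants are the ones named above.
base = "http://www.sub.domain.com/collection/adresboeken.html"
variants = [
    base + "?dir=desc&order=color",
    base + "?dir=desc&order=price",
    base + "?dir=asc&order=color",
    base,
]

for variant in variants:
    print(canonical_link_tag(variant))
```

Every variant ends up with the same href, which is the point: the sorted views all declare the plain adresboeken.html page as the one to index.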

I hope that helps. Thanks,
Matthew

cprasad

Hi,

Robots.txt works mainly with two directives: User-agent: and Disallow:

User-agent: the name of the robot you need to block

Disallow: the URL, folder, or URL pattern you need to block.
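As a rough illustration of how these two directives work together, the sketch below feeds a tiny robots.txt to Python's standard urllib.robotparser module. The /private/ folder, the www.example.com host and the ExampleBot agent name are all made up for the example. As far as I know, this parser only does plain prefix matching, so it is suited to simple folder rules rather than wildcard patterns.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group for all robots, one blocked folder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the given robot may fetch the URL under ROBOTS_TXT."""
    return parser.can_fetch(agent, url)


print(allowed("http://www.example.com/private/page.html"))
print(allowed("http://www.example.com/collection.html"))
```

The first URL sits inside the disallowed folder and is refused; the second is outside it and is allowed.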

As you asked in your question, you need to block a URL under a condition. But remember that robots.txt can have serious consequences if you do not use it correctly.

Anyway in your question, you wanted to block url/pages with query strings like page.html?dir=asc&order=name

so you have to use following:

User-agent: *
Disallow: /*?

The above will block all URLs containing a question mark (?) for all search robots that support the * wildcard. It will not block only page.html?dir=asc&order=name; it will also block comments.html?dir=asc&order=name

So use it carefully.
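To see how broad that rule is before deploying it, a small Python sketch can help. It assumes the wildcard behaviour Google documents ("*" matches any run of characters, a trailing "$" anchors the end of the URL, everything else including "?" is literal); other crawlers may behave differently. It is not a full robots.txt parser: it ignores Allow lines, user-agent groups and rule precedence, and the function names are made up for illustration.

```python
import re


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate one Disallow path rule into a compiled regex."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape each literal chunk, then let every '*' match any characters.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_blocked(rule: str, path: str) -> bool:
    """True if the rule matches the path (plus query) from its start."""
    return rule_to_regex(rule).match(path) is not None


paths = [
    "/page.html",
    "/page.html?dir=asc&order=name",
    "/comments.html?dir=asc&order=name",
]

for path in paths:
    print(path, is_blocked("/*?", path))

# A narrower, hypothetical alternative: only page.html's query-string URLs.
for path in paths:
    print(path, is_blocked("/page.html?", path))
```

Under these assumptions, /*? blocks both query-string URLs and leaves only /page.html crawlable, while the narrower /page.html? rule blocks only the page.html variant.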

I hope this is what you were looking for. If you need more help, just ask.

Regards

Prasad

HMK-NL

Dear all,

thanks for responding. Suppose I have pages like

1. www.sub.domain.com/collection.html, which exists and which I want indexed, and

2. www.sub.domain.com/collection.html?dir=desc&order=color, which I don't want indexed.

Is this the way to do it in the robots.txt?

Disallow: /collection/?*

Thanks!

Matthew_Edgar

Hi,

Here is an article explaining how to do this in robots.txt:
http://sanzon.wordpress.com/2008/04/29/advanced-usage-of-robotstxt-w-querystrings/

Depending on what you are trying to do, it might also be worth investigating parameter handling in Google Webmaster Tools:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1235687

Thanks,
Matthew
