Trying to reduce pages crawled to within 10K limit via robots.txt

AspenFasteners

Our site has far more pages than our 10K-page PRO account allows, and most of them are not SEO-worthy. In fact, only about 2,000 pages have SEO value. Limitations of the store software mean robots.txt is the only tool I can use to sculpt the rogerbot site crawl, and I am having trouble getting it to work. Our biggest problem is the 35K individual product pages and the related shopping cart links (at least another 35K); these aren't needed because they duplicate the SEO-worthy content in the product category pages.

The signature of a product page is that it is contained within a folder ending in -p. So I made the following addition to robots.txt:

User-agent: rogerbot
Disallow: /-p/

However, the latest crawl results show the 10K limit is still being exceeded. I went to Crawl Diagnostics and clicked on Export Latest Crawl to CSV. To my dismay I saw the report was overflowing with product page links:

e.g. www.aspenfasteners.com/3-Star-tm-Bulbing-Type-Blind-Rivets-Anodized-p/rv006-316x039354-coan.htm

The value in the column "Search Engine blocked by robots.txt" is FALSE. Does this mean blocked for all search engines? If so, it's correct. If it means "blocked for rogerbot", then the page shouldn't even be in the report, as the report seems to contain only 10K pages.

Any thoughts or hints on attaining my goal would REALLY be appreciated; I've been trying for weeks now. Honestly - virtual beers for everyone!

Carlo
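
To put a number on how much of the exported crawl is product pages, something like the following Python sketch could be run against the CSV export. It treats a URL as a product page when any folder in its path ends in -p, as described above. The column name `URL`, the helper names, and the second sample row (`some-category.htm`) are assumptions for illustration only; adjust them to match the real export.

```python
import csv
import io
from urllib.parse import urlparse


def is_product_page(url: str) -> bool:
    """True if any folder in the URL path ends in '-p'."""
    # The export lists URLs without a scheme, so add one before parsing.
    if "://" not in url:
        url = "http://" + url
    segments = urlparse(url).path.split("/")
    # Every segment except the last one is a folder name.
    return any(seg.endswith("-p") for seg in segments[:-1])


def count_product_pages(csv_file, url_column="URL"):
    """Return (product_page_rows, total_rows) for an open CSV file.

    url_column is an assumed header name, not taken from the real export.
    """
    reader = csv.DictReader(csv_file)
    total = product = 0
    for row in reader:
        total += 1
        if is_product_page(row[url_column]):
            product += 1
    return product, total


# Tiny in-memory stand-in for the exported file (second row is made up).
sample = io.StringIO(
    "URL\n"
    "www.aspenfasteners.com/3-Star-tm-Bulbing-Type-Blind-Rivets-Anodized-p/rv006-316x039354-coan.htm\n"
    "www.aspenfasteners.com/some-category.htm\n"
)
print(count_product_pages(sample))
```

For the real file, pass `open("export.csv", newline="")` (a hypothetical filename) instead of the in-memory sample.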

andresgmontero

Wow, thank you! Many of the robots.txt testers still show them as disallowed. Good to know, thank you!

AspenFasteners

Hi Andres!

Sorry, I thought I answered this earlier. If I understand correctly, wildcards ARE allowed, according to this reply to my question on the topic: http://www.seomoz.org/q/does-rogerbot-read-url-wildcards-in-robots-txt

Hope THIS reply sticks this time!

andresgmontero

Hi, as far as I know, wildcard characters (like "*") are not allowed there; each line must be an Allow or Disallow statement, a comment, or a blank line. So before you get angry at Roger for not listening to you, go to Google Webmaster Tools > Crawler Access and test the robots.txt file. Hope it works.
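
For what it's worth, here is a minimal Python sketch of how a wildcard-aware crawler would evaluate the rule from the question against the example product URL. It assumes the Google-style matching convention: a rule matches from the start of the URL path, `*` matches any run of characters, and a trailing `$` anchors the end of the path. Whether rogerbot follows exactly this convention is an assumption here, and `rule_matches` is a hypothetical helper, not part of any crawler.

```python
import re


def rule_matches(rule: str, path: str) -> bool:
    """Return True if a robots.txt Disallow pattern matches the URL path.

    Assumed convention: prefix match from the start of the path,
    '*' = any run of characters, trailing '$' = end of path.
    """
    if not rule:
        # An empty Disallow value blocks nothing.
        return False
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        pattern += "$"
    # re.match only matches at the start of the string, giving prefix semantics.
    return re.match(pattern, path) is not None


path = "/3-Star-tm-Bulbing-Type-Blind-Rivets-Anodized-p/rv006-316x039354-coan.htm"

print(rule_matches("/-p/", path))   # False: the path does not begin with "/-p/"
print(rule_matches("/*-p/", path))  # True: the wildcard spans the folder name
```

Under that convention, `Disallow: /-p/` only blocks paths that literally begin with `/-p/`, whereas `Disallow: /*-p/` covers any path containing a folder that ends in -p. This only helps if the crawler honors wildcards, which is the point under discussion above.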
