Robots.txt anomaly
-
Hi,
I'm monitoring a site that's had a new design relaunch and a new robots.txt added.
Over the week since launch, Webmaster Tools has shown a steadily increasing number of blocked URLs (now at 14).
The robots.txt file, though, has only 12 lines with the Disallow directive. Could this be happening because a single Disallow line can refer to more than one page/URL? They all look like single URLs, for example:
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
etc, etc
And is it normal for the number of robots.txt-blocked URLs reported in Webmaster Tools to increase steadily over time, as opposed to being identified straight away?
Thanks in advance for any help/advice/clarity on why this may be happening.
Cheers
Dan
-
Many thanks for that, Dan!
-
As far as I know, the important thing is that your feed shows up in feed readers. Can you subscribe to and view your RSS feed in a variety of different feed readers?
Yes, so long as the ? is only used in ways that would result in duplicate content, or content that would not be desirable to crawl, it will have that effect.
-Dan
-
Many thanks for your comments, Dan!
So it doesn't matter that the feed's not going to be crawled? Don't we usually want feeds to be crawled?
Blocking anything with a ? is surely good then, isn't it, since it prevents all the dupe content etc. one gets from search results?
Yes, my client's webmaster set it up.
-
Hi Dan
I see no reason to disallow the feed like that by default, unless there is some reason I don't know about. But it won't harm anything either.
The second part blocks URLs that contain a ? (question mark): Disallow: /? covers a query string on the homepage, and Disallow: /*? covers a ? anywhere in the URL. This would block anything that has a parameter in the URL - most commonly a search word, pagination, filtering settings etc.
As far as I'm aware this is not going to be damaging to the site, but it's not the default setting. Did someone set it up that way for you?
My robots.txt shows the default WordPress settings: http://www.evolvingseo.com/robots.txt
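To make the difference between the four rules concrete, here is a minimal sketch of a hand-written matcher. It assumes the wildcard convention Google documents for robots.txt (* matches any run of characters, $ anchors the end of the URL) and is not how any particular crawler is implemented. The function names and the sample paths are hypothetical; only the four rules come from this thread.

```python
import re

def rule_to_regex(rule):
    """Turn one robots.txt path rule into a regex matched from the start of the path."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with '.*' where a '*' stood.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(path, rules):
    """Return True if any rule matches the path (including its query string)."""
    return any(rule_to_regex(rule).match(path) for rule in rules)

# The four rules quoted in this thread.
RULES = ["/feed", "*/feed", "/?", "/*?"]

# Hypothetical paths, for illustration only.
for path in ["/feed/", "/category/seo/feed/", "/?s=robots", "/blog/?page=2", "/blog/my-post/"]:
    print(path, "->", "blocked" if is_blocked(path, RULES) else "allowed")
```

In this sketch, /? on its own only catches a query string directly on the root, while /*? also catches parameters on deeper URLs such as /blog/?page=2.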
-
Hi Dan
Yes, please find them below. Can you also confirm whether the bottom 2 lines refer to blocking internal search results?
Disallow: /feed
Disallow: */feed
Disallow: /?
Disallow: /*?
Many thanks
Dan
-
Hi Dan
Can you share the exact line disallowing RSS?
Thanks!
-Dan
-
Sorry, one more question: I see that the webmaster has disallowed the feeds in the robots.txt file. Is this normal/desirable? I would have thought one would want RSS feeds crawled by Google.
-
Nice one, cheers Jesse!
-
Your assumption is correct. The disallows you listed are directories, not pages. Therefore, anything within the plugins folder will be disallowed, and the same goes for the cache and themes folders.
So you may have multiple files (and I'm sure you do) within each of those folders.
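As a rough illustration of that prefix behaviour, here is a minimal sketch using Python's standard-library urllib.robotparser. The User-agent line, the example.com host and the file names are assumptions made up for the example; only the three Disallow lines come from the question.

```python
from urllib.robotparser import RobotFileParser

# The three rules quoted in the question; the User-agent line is assumed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs, for illustration only.
urls = [
    "http://example.com/wp-content/plugins/some-plugin/script.js",
    "http://example.com/wp-content/plugins/some-plugin/style.css",
    "http://example.com/wp-content/themes/my-theme/header.png",
    "http://example.com/about/",
]

# Each Disallow line is a path prefix, so one rule can block many URLs.
blocked = [url for url in urls if not parser.can_fetch("*", url)]
for url in blocked:
    print("blocked:", url)
```

Two of the sample URLs fall under the single plugins rule, which is why the count of blocked URLs can exceed the number of Disallow lines.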