Is robots.txt a must-have for a 150-page, well-structured site?
-
By looking in my logs I see dozens of 404 errors each day from different bots trying to load robots.txt. I have a small site (150 pages) with clean navigation that allows the bots to index the whole site (which they are doing). There are no secret areas I don't want the bots to find (the secret areas are behind a Login so the bots won't see them).
I have used rel=nofollow for internal links that point to my Login page.
Is there any reason to include a generic robots.txt file that contains "user-agent: *"?
I have a minor reason: to stop getting 404 errors and clean up my error logs so I can find other issues that may exist. But I'm wondering if not having a robots.txt file is the same as some default blank file (or 1-line file giving all bots all access)?
-
Thanks, Keri. No, it's a hand-built blog. No CMS.
I think Googlebot is doing a good job of indexing my site. The site is small, and when I search for my content I do find it in Google. I was pretty sure that Google worked the way you describe. So it sounds like sitemaps are an optional hint, and perhaps not needed for relatively small sites (a couple hundred pages of well-linked content). Thanks.
-
The phrase "blog entries" makes me ask: are you on a CMS like WordPress, or are the blog entries pages you are creating from scratch?
If you're on WP or another CMS, you'll want a robots.txt so that your admin, plugin, and other directories aren't indexed. On the plus side, WP (and other CMSs) have plugins that will generate a sitemap.xml file for you as you add pages.
Google will find pages if you don't have a site map, or forget to add them. The sitemap is a way to let Google know what is out there, but it a) isn't required for Google to index a page and b) won't force Google to index a page.
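If you do decide you want a sitemap on a hand-built site, it doesn't have to be maintained by hand. Here is a minimal sketch, assuming you can produce a list of your page URLs somehow (the `pages` list and the example.com addresses below are placeholders, not anything from your site). It uses only the Python standard library and writes the basic `urlset`/`url`/`loc` structure from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs for illustration only.
pages = [
    "http://www.example.com/",
    "http://www.example.com/blog/first-post.html",
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

You would save the output as sitemap.xml in your site root. Regenerating it whenever you publish keeps it from going stale, but as noted above, pages that are linked from already-indexed pages can still be found without it.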
-
Thanks, Keith. Makes sense.
So how important is an XML sitemap for a 150-page site with clean navigation? As near as I can tell (from the site: command), my whole site is already being indexed by Google. Does a sitemap buy me anything? What happens if my sitemap is partial (i.e. if I forget to add new pages to it but do link to them from my other indexed pages, will the new pages get indexed)? I'm a little worried about sitemap maintenance as I add new blog entries and so on...
-
Hi Mike...
I am sure that you are always going to get a range of opinions on this kind of question.
I think that for your site the answer may simply be that having a robots.txt file is more of a “belt and braces”, safe-harbour kind of thing. The same goes for, say, whether you should have a keywords meta tag: many say these pieces of code are of marginal value, but when you are competing head to head for a #1 listing (i.e. 35%+ of the clicks), you should use every option and weapon possible. Furthermore, if your site is likely to grow significantly or eventually have content/files that you may want excluded, it’s just a “tidy” thing to have had in place over time.
Also, don’t forget that best practice for a robots.txt file is to also include a pointer to your XML sitemap(s).
Here is an example from one of our sites...
User-agent: *
Disallow: /design_examples.xml
Disallow: /case_studies.xml

User-agent: Googlebot-Image
Disallow: /

Sitemap: http://www.sitetopleveldomain.com/sitemap.xml
In this example, two root files are specifically excluded from all bots. This site has also excluded the Google Images bot entirely, as they were getting a lot of traffic from image searches and then seeing the same copyrighted images turn up on a hundred junk sites. This doesn’t stop image scraping, but it certainly makes the images harder to find.
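If you want to sanity-check a file like this before uploading it, one option is Python's standard `urllib.robotparser`. The sketch below assumes the rules are written one directive per line with a blank line between groups. "SomeBot" and the page paths are made-up names for illustration. Real crawlers may interpret edge cases differently from this parser, so treat it as a rough check only.

```python
from urllib.robotparser import RobotFileParser

# The example rules, one directive per line.
RULES = """\
User-agent: *
Disallow: /design_examples.xml
Disallow: /case_studies.xml

User-agent: Googlebot-Image
Disallow: /

Sitemap: http://www.sitetopleveldomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

site = "http://www.sitetopleveldomain.com"

# A generic bot (hypothetical name) is blocked only from the two XML files.
generic_blocked = parser.can_fetch("SomeBot", site + "/case_studies.xml")
generic_allowed = parser.can_fetch("SomeBot", site + "/index.html")

# The image bot is blocked from everything.
image_bot = parser.can_fetch("Googlebot-Image", site + "/images/logo.png")

print(generic_blocked, generic_allowed, image_bot)  # False True False
print(parser.site_maps())  # requires Python 3.8+
```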
In relation to the “or 1-line file giving all bots all access” part of your question...
Some bots (most notably Google) support an additional field called "Allow:".
As the name suggests, "Allow:" lets you specifically indicate files/folders that CAN be crawled, which is mainly useful for carving out exceptions inside a path you have otherwise disallowed. It does not, by itself, exclude anything else. This field was not part of the original "robots.txt" protocol and so is not universally supported, so my suggestion would be to test it for your site for a week, as it might confuse some less intelligent crawlers.
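To make the all-access case and the Allow: behaviour concrete, here is a rough sketch using Python's standard `urllib.robotparser`. The bot name "AnyBot" and the /private/ paths are hypothetical. This particular parser applies the first rule that matches, so the Allow line is placed before the Disallow line; other crawlers may resolve conflicts differently, which is one more reason to test.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text):
    """Return a parser loaded with robots.txt rules given as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# An empty file, and the classic "allow everything" file (empty Disallow).
empty_file = parse_rules("")
allow_all = parse_rules("User-agent: *\nDisallow:\n")

# Allow used as an exception inside a disallowed folder (hypothetical paths).
with_exception = parse_rules(
    "User-agent: *\n"
    "Allow: /private/public-page.html\n"
    "Disallow: /private/\n"
)

print(empty_file.can_fetch("AnyBot", "/any/page.html"))                 # True
print(allow_all.can_fetch("AnyBot", "/any/page.html"))                  # True
print(with_exception.can_fetch("AnyBot", "/private/public-page.html"))  # True
print(with_exception.can_fetch("AnyBot", "/private/other.html"))        # False
print(with_exception.can_fetch("AnyBot", "/index.html"))                # True
```

As far as this parser is concerned, an empty file and the empty-Disallow file behave the same: everything may be fetched. The practical difference from having no file at all is mostly that the 404s disappear from your logs.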
So, in summary, my recommendation is to keep a simple robots.txt file, test whether the Allow: field works for you, and also ensure you have that pointer to your XML sitemap. Although wearing a belt and braces might not be a good look, at least your pants are unlikely to fall down.