Robot.txt help

Studio33

Hi,

We have a blog that is killing our SEO.

We need to disallow:

Disallow: /Blog/?tag*
Disallow: /Blog/?page*
Disallow: /Blog/category/*
Disallow: /Blog/author/*
Disallow: /Blog/archive/*
Disallow: /Blog/Account/
Disallow: /Blog/search*
Disallow: /Blog/search.aspx
Disallow: /Blog/error404.aspx
Disallow: /Blog/archive*
Disallow: /Blog/archive.aspx
Disallow: /Blog/sitemap.axd
Disallow: /Blog/post.aspx

But Allow everything below /Blog/Post

The disallow list keeps growing as we find issues. Rather than adding every area to disallow to our robots.txt, is there an easy way to just say Allow /Blog/Post and ignore the rest? How do we do that in robots.txt?

Thanks

evolvingSEO

These: http://screencast.com/t/p120RbUhCT

They appear on every page I looked at and take up the entire area "above the fold", so the content ends up "below the fold".

-Dan

Studio33

Thanks Dan, but which grey areas? What URL are you looking at?

evolvingSEO

Ahh. I see. You just need to "noindex" the pages you don't want in the index. As far as how to do that with blogengine, I am not sure, as I have never used it before.

But I think a bigger issue is the giant box areas at the top of every page. They are pushing your content way down. That's definitely hurting UX and making the site a little confusing. I'd suggest improving that as well.

-Dan

Studio33

Hi Dan, Yes sorry that's the one!

evolvingSEO

Hi There... that address does not seem to work for me. Should it be .net? http://www.dotnetblogengine.net/

-Dan

Studio33

Hi

The blog is www.dotnetblogengine.com

The content is only on the blog once; it is just that it can be accessed in lots of different ways.

evolvingSEO

Andrew

I doubt that one thing made your rankings drop so much. Also, what type of CMS are you on? Duplicate content like that should be controlled through indexation for the most part, but I am not recognizing that type of URL structure as any particular CMS?

Are just the title tags duplicate or the entire page content? Essentially, I would either change the content of the pages so they are not duplicate, or if that doesn't make sense I would just "noindex" them.

-Dan

Studio33

Hi Dan,

I am getting duplicate content errors in WMT like

www.mysite.com/Blog/?tag=ABC

www.mysite.com/Blog/?Page=1

This is because tag=ABC and page=1 are both different ways to get to www.mysite.com/Blog/Post/My-Blog-Post.aspx

To fix this I have removed the URLs www.mysite.com/Blog/?tag=ABC and www.mysite.com/Blog/?Page=1 from GWMT and set robots.txt up like this:

User-agent: *
Disallow: /Blog/
Allow: /Blog/post
Allow: /Blog/Post

I hope to solve the duplicate content issue to stop it happening again.

Since doing this my rankings in the SERPs have dropped massively. Is what I have done wrong or bad? How would I fix it?

Hope this makes sense. Thanks for your help on this; it's appreciated.

Andrew

evolvingSEO

Hi There

Where are they appearing in WMT? In crawl errors?

You can also control crawling of parameters within webmaster tools - but I am still not quite sure if you are trying to remove these from the index or just prevent crawling (and if preventing crawling, for what reason?) or both?

-Dan

Studio33

Hi Dan,

The issue is my blog had tagging switched on, and it caused canonicalization mayhem.

I switched it off, but the tags still appear in Google Webmaster Tools (GWMT). I removed the URLs via GWMT but they are still appearing. This has also caused me to plummet down the SERPs! I am hoping this is why my rankings have dropped, anyway! I am now trying to get to a point where Google just sees my blog posts and not the ?Tag or ?Author or any other parameter that is going to cause me canonicalization pain. In the meantime I am sat waiting for Google to bring me back up the SERPs when things settle down, but it has been 2 weeks now, so maybe something else is up?

evolvingSEO

I'm wondering why you want to block crawling of these URLs - I think what you're going for is to not index them, yes? If you block them from being crawled, they'll remain in the index. I would suggest considering robots meta noindex tags - unless you can describe in a little more detail what the issue is?
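For reference, a robots meta noindex tag is a `<meta name="robots" content="noindex">` element in the page's head. Below is a minimal Python sketch, using only the standard library, that checks whether a page's HTML carries one. The names `RobotsMetaFinder` and `has_noindex` and the sample HTML are illustrative assumptions, not BlogEngine output.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when an attribute has no value.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for part in attr_map.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


# Hypothetical tag listing page that should stay out of the index.
tag_page = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'
# Hypothetical post page with no robots meta tag.
post_page = "<html><head><title>My Blog Post</title></head><body></body></html>"

print(has_noindex(tag_page))
print(has_noindex(post_page))
```

A crawler can only read this tag if it is allowed to fetch the page, which is why a robots.txt Disallow on the same URL works against it.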

-Dan

G2W

Ok then you should be all set if your tests on GWMT did not indicate any errors.

Studio33

Thanks it goes straight to www.mysite.com/Blog

G2W

Yup, I understand that you want to see your main site. This is why I recommended blocking only /Blog and not / (your root domain).

However, many blogs have a landing page. Does yours? In other words, when you click on your blog link, does it take you straight to Blog/posts or is there another page in between, eg /Blog/welcome?

If it does not go straight into Blog/posts you would want to also allow the landing page.

Does that make sense?

Studio33

The structure is:

www.mysite.com - want to see everything at this level and below it

www.mysite.com/Blog - want to BLOCK everything at this level

www.mysite.com/Blog/posts - want to see everything at this level and below it

G2W

Well what Martijn (sorry, I spelled his name wrong before) and I were saying was not to forget to allow the landing page of your blog - otherwise this will not be indexed as you are disallowing the main blog directory.

Do you have a specific landing page for your blog or does it go straight into the /posts directory?

I'd say there's nothing wrong with allowing both Blog/Post and Blog/post just to be on the safe side...honestly not sure about case sensitivity in this instance.
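On the case-sensitivity point: Google's robots.txt documentation describes rules as case-sensitive, so /Blog/Post and /Blog/post are treated as different prefixes. A small Python sketch, assuming plain prefix matching and a hypothetical helper named `matching_rules`, shows which of the two Allow lines applies to each casing:

```python
def matching_rules(prefixes, path):
    """Return the rule prefixes that match `path`, compared case-sensitively."""
    return [prefix for prefix in prefixes if path.startswith(prefix)]


# The two Allow paths discussed in this thread.
allow_rules = ["/Blog/post", "/Blog/Post"]

print(matching_rules(allow_rules, "/Blog/post/my-blog-post.aspx"))
print(matching_rules(allow_rules, "/Blog/Post/My-Blog-Post.aspx"))
print(matching_rules(allow_rules, "/blog/post/my-blog-post.aspx"))
```

Listing both casings covers both spellings of "Post", but a URL that also lowercases /blog/ would match neither line. The robots.txt format has no flag for case-insensitive matching, so each casing the site actually serves needs its own line.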

Studio33

"We're getting closer David, but after reading the question again I think we both miss an essential point ;-)" What was the essential point you missed? Sorry, I don't understand. I don't want to make a mistake in my robots.txt, so I would like to be 100% sure of what you are saying.

Studio33

Thanks guys so I have

User-agent: *
Disallow: /Blog/
Allow: /Blog/post
Allow: /Blog/Post

That works. My home page also works. Is there anything wrong with including both uppercase "Post" and lowercase "post"? It is lowercase on the site, but I want the uppercase "P" just in case. Is there a way to make the entry case-insensitive?

Thanks

G2W

Correct, Martijin. Good catch!

Martijn_Scheijbeler

There was a reason that I said he should test this!

We're getting closer David, but after reading the question again I think we both miss an essential point ;-). We now also exclude the robots from crawling the 'homepage' of the blog. If you have this homepage, don't forget to also Allow it.

G2W

Well, no point in a blog that hurts your seo

I respectfully disagree with Martijin; I believe what you would want to do is disallow the Blog directory itself, not the whole site. It would seem that if you use Disallow: / and Allow: /Blog/Post, you are telling search engines not to crawl anything on your site except for /Blog/Post.

I'd recommend:

User-agent: *
Disallow: /Blog/
Allow: /Blog/Post

This should block off the entire Blog directory except for your post subdirectory. As Maritijin stated; always test before you make real changes to your robots.txt.
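To compare these rules with the Disallow: / variant, here is a minimal Python sketch of the precedence Google documents for robots.txt: the longest matching path wins, and Allow wins a tie. The function name `is_allowed` and the sample paths are illustrative assumptions, it handles plain prefixes only (no `*` or `$` wildcards), and it is no substitute for testing in Google Webmaster Tools.

```python
def is_allowed(rules, path):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) pairs, e.g. ("Disallow", "/Blog/").
    The longest matching prefix wins; on a tie, Allow wins.
    Matching is case-sensitive, and wildcards are not handled.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is crawlable
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Rules that block only the blog directory, except for posts.
blog_only = [("Disallow", "/Blog/"), ("Allow", "/Blog/Post")]
# Rules that block the entire site, except for posts.
whole_site = [("Disallow", "/"), ("Allow", "/Blog/Post")]

for path in ["/", "/Blog/", "/Blog/?tag=ABC", "/Blog/Post/My-Blog-Post.aspx"]:
    print(path, is_allowed(blog_only, path), is_allowed(whole_site, path))
```

Under these assumptions the post URL is crawlable with both rule sets, but only the first leaves the home page "/" crawlable. Python's built-in urllib.robotparser applies rules in file order rather than by longest match, which is why the sketch does not use it.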

Martijn_Scheijbeler

That would be something like this. Please test it within Google Webmaster Tools to check that it works, because I don't want to screw up your whole site. What this does is disallow your complete site and allow just the /Blog/Post URLs.

User-agent: *
Disallow: /
Allow: /Blog/Post
