Blocking https from being crawled

Sean_Dawes

I have an ecommerce site where https is being crawled for some pages. I'm wondering if the solution below will fix the issue.

www.example.com will be my domain

In the nav there is a login page, www.example.com/login, which redirects to https://www.example.com/login

If I just disallow /login in the robots.txt file, wouldn't the crawler stop following the redirect and stop indexing those https pages?

The redirect part is what I am questioning.

Sean_Dawes

Correct. Once /login gets redirected to https://www.example.com/login, all nav links etc. are https.

What I ended up doing was blocking /login in robots.txt, adding canonicals on the https pages, and nofollowing the /login link in the nav that redirects.

Will see what happens now.

Dr-Pete

So, the "/login" page gets redirected to https: and then every link on that page goes secure and Google crawls them all? I think blocking the "/login" page is a perfectly good way to go here - cut the crawl path, and you'll cut most of the problem.
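If it helps to sanity-check a rule like this before deploying it, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt body and the `is_allowed` helper are assumptions for illustration, not the site's actual file. A crawler that honors the rule never requests /login, so it never sees the redirect to https.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt body for http://www.example.com/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
"""


def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # The login URL is blocked; the home page is still crawlable.
    print(is_allowed(ROBOTS_TXT, "http://www.example.com/login"))
    print(is_allowed(ROBOTS_TXT, "http://www.example.com/"))
```

Note that this parser only matches on the URL path and does prefix matching, so a rule for /login would also cover a path such as /login-help. It does not model how a real crawler reaches the https pages through other links.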

You could request removal of "/login" in Google Webmaster Tools, too. Sometimes, I find that robots.txt isn't great at removing pages that are already indexed. I would definitely add the canonical as well, if it's feasible. Cutting the path may not remove the pages that have already been indexed with https:.

Sorry, I'd actually reverse that:

(1) Add the canonicals, and let Google sweep up the duplicates

(2) A few weeks later, block the "/login" page

Sounds counter-intuitive, but if you block the crawl path to the https: pages first, then Google won't crawl the canonical tags on those versions. Use canonical to clean up the index, and then block the page to prevent future problems.
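As a rough illustration of step (1), here is a small Python sketch that builds the canonical tag for a page served over https so that it points at the http version. The `canonical_link_tag` helper and the /widgets path are hypothetical; how the tag gets into your templates depends on your platform.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link_tag(url: str) -> str:
    """Build a <link rel="canonical"> tag pointing at the http version of `url`."""
    parts = urlsplit(url)
    # Force the http scheme; keep host, path and query; drop any fragment.
    http_url = urlunsplit(("http", parts.netloc, parts.path or "/", parts.query, ""))
    return f'<link rel="canonical" href="{escape(http_url, quote=True)}">'


if __name__ == "__main__":
    # Hypothetical product page that Google reached over https.
    print(canonical_link_tag("https://www.example.com/widgets?color=red"))
```

The same tag would be emitted on both the http and https versions of a page, so whichever one gets crawled points back to http.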

Sean_Dawes

Gotcha. Yeah, I commented above that I was going to add a canonical as well as a noindex in the meta, but I was curious how robots.txt handled the redirect that was happening.

Thanks for your help.

Sean_Dawes

Yeah, I was going to nofollow the link in the nav and add a meta tag, but I was curious how the robots.txt file would handle this since the URL is a redirect.

Thanks for your input.

NakulGoyal

Are the pages being crawled under https the same pages that are available under http as well? If yes, can you just add a canonical tag on these pages pointing to the http version? That should fix it. And if your login page is the entry point, your fix will help as well. But then, as Rebekah said, what if somebody is linking to your https page? I would suggest you look into adding a canonical tag on these pages pointing to http, if that makes sense and is doable.

RebekahMay

You can disallow the https portion in robots.txt, but remember robots.txt isn't always a surefire way of keeping an area of your site from being crawled. If you have other important content to crawl from the secured page, be careful you are not blocking robots from there.
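Crawlers fetch robots.txt separately for each protocol and host, so blocking only the https side generally means serving a different file at https://www.example.com/robots.txt. Below is a minimal Python sketch of that idea. The two file bodies, the `robots_for_scheme` and `can_crawl` helpers, and the /cart path are all assumptions for illustration; how you actually serve a per-protocol file depends on your server setup.

```python
from urllib.robotparser import RobotFileParser

# Assumed bodies: http blocks only the login page, https blocks everything.
HTTP_ROBOTS = "User-agent: *\nDisallow: /login\n"
HTTPS_ROBOTS = "User-agent: *\nDisallow: /\n"


def robots_for_scheme(scheme: str) -> str:
    """Pick the robots.txt body to serve for the request's scheme."""
    return HTTPS_ROBOTS if scheme.lower() == "https" else HTTP_ROBOTS


def can_crawl(scheme: str, path: str, agent: str = "*") -> bool:
    """Check `path` against the robots.txt that would be served on `scheme`."""
    parser = RobotFileParser()
    parser.parse(robots_for_scheme(scheme).splitlines())
    return parser.can_fetch(agent, f"{scheme}://www.example.com{path}")


if __name__ == "__main__":
    for scheme in ("http", "https"):
        for path in ("/login", "/cart"):
            print(scheme, path, can_crawl(scheme, path))
```

Blocking all of https this way also stops crawlers from reading any canonical or noindex tags on those pages, so it is worth weighing against the canonical-first approach suggested elsewhere in this thread.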

If this is linked from other places on the web, and the link doesn't include nofollow, search engines may still crawl the page. Can you change the link in your navigation to nofollow as well? I would also add a meta noindex tag to the page itself, and a canonical tag to the https version.
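To confirm those tags actually made it into the rendered page, a quick check along these lines may help. It is only a sketch using Python's standard `html.parser`; the `PageAudit` class name and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class PageAudit(HTMLParser):
    """Collect the meta robots noindex flag, canonical URL and nofollow links."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.canonical = None
        self.nofollow_links = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            self.noindex = "noindex" in (attr.get("content") or "").lower()
        elif tag == "link" and "canonical" in rel:
            self.canonical = attr.get("href")
        elif tag == "a" and "nofollow" in rel:
            self.nofollow_links.append(attr.get("href"))


# Made-up sample markup for the login page.
SAMPLE = (
    '<html><head><meta name="robots" content="noindex">'
    '<link rel="canonical" href="http://www.example.com/login"></head>'
    '<body><a rel="nofollow" href="/login">Login</a></body></html>'
)

if __name__ == "__main__":
    audit = PageAudit()
    audit.feed(SAMPLE)
    print(audit.noindex, audit.canonical, audit.nofollow_links)
```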
