Working out exactly how Google is crawling my site if I have loooots of pages

soeren.hofmayer

I am trying to work out exactly how Google is crawling my site, including its entry points and the path it takes from there. The site has millions of pages, hundreds of thousands of them indexed. I have simple log files with a timestamp and the URL that Googlebot requested. Unfortunately there are hundreds of thousands of entries even for one day, and because the site is so large I am finding it hard to work out the spider's paths. Is there any way to work this out simply using the log files and Excel or other tools?

Also, I was expecting the bot to go through each level almost instantaneously, e.g. main page --> category page --> subcategory page (with the same timestamp), but this does not appear to be the case. Does the bot follow a path right through to the deepest level it can reach, or is allowed to reach, for that crawl and then return to the higher-level category pages at a later time? Any help would be appreciated.
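For logs of this shape, a minimal sketch in Python (rather than Excel) could sort the hits by time and count them by URL depth and by top-level section, which makes it easier to see whether the bot really walks level by level. It assumes each log line is an ISO-style timestamp, whitespace, then the URL. That line format and the names `parse_line`, `path_depth` and `summarize` are illustrative assumptions, not taken from the actual logs.

```python
from collections import Counter
from datetime import datetime
from urllib.parse import urlparse


def parse_line(line):
    """Split one 'timestamp URL' line into (datetime, URL path).

    Assumed format: '2012-03-01T10:00:05 http://example.com/cat/page.html'
    """
    stamp, url = line.split(maxsplit=1)
    return datetime.fromisoformat(stamp), urlparse(url.strip()).path


def path_depth(path):
    """Number of non-empty path segments: '/' is 0, '/cat/sub/' is 2."""
    return len([segment for segment in path.split("/") if segment])


def summarize(lines):
    """Return hits in time order plus hit counts by depth and by section."""
    hits = sorted(parse_line(line) for line in lines if line.strip())
    by_depth = Counter(path_depth(path) for _, path in hits)
    by_section = Counter(
        (path.strip("/").split("/")[0] or "(home)") for _, path in hits
    )
    return hits, by_depth, by_section
```

Reading `hits` in order shows the sequence of requests, while `by_depth` and `by_section` shrink hundreds of thousands of lines to a handful of counts. For a real log file, passing the open file object as `lines` should work, since it yields one line at a time.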

Cheers

wazza1985

Can you explain to me how you did your site map for this please?

eyepaq

I've run into the same issue on a site with 40k+ pages. That is far from your overall page count, but the flow may be the same overall.

The site I was working on had a structure about 5 levels deep. Some of the areas in the last level were out of reach and didn't get indexed. More than that, even a few areas on level 2 were not present in the Google index, and Googlebot didn't visit those either.

I created a large XML sitemap and a dynamic HTML sitemap with all the pages from the site and submitted the XML sitemap via Webmaster Tools, but that didn't solve the issue: the same areas stayed out of the index and didn't get hit. The huge HTML sitemap was impossible to follow from a user's point of view, so I didn't keep it online for long, but I am sure it can't work that way either.

What finally solved the issue was to spot the exact areas that were left out and identify the "head" of those pages, meaning the several pages that acted as gateways for each entire module. I then built a few outside links that pointed directly to those gateway pages and a few that pointed to the main internal pages of the modules that were left out.
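Spotting the left-out areas can be sketched in Python as a comparison between the full URL list (for example, the URLs in the XML sitemap) and the URLs Googlebot actually requested according to the logs. This is only an illustration under assumptions: both inputs are plain lists of absolute URLs written the same way, a "section" is simply the first path segment, and the names `section_of` and `uncrawled_by_section` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def section_of(url):
    """First path segment of the URL, or '(home)' for the root page."""
    first = urlparse(url).path.strip("/").split("/")[0]
    return first or "(home)"


def uncrawled_by_section(site_urls, crawled_urls):
    """Group known URLs by section and list the ones the bot never hit."""
    crawled = set(crawled_urls)
    report = defaultdict(lambda: {"total": 0, "missed": []})
    for url in site_urls:
        entry = report[section_of(url)]
        entry["total"] += 1
        if url not in crawled:
            entry["missed"].append(url)
    return dict(report)
```

Sections where `missed` is close to `total` are the candidates for the gateway-page treatment described above.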

Those pages gained authority fast, and within a few days we spotted Googlebot staying overnight.

All pages are now indexed and even ranking well.

If you can spot some entry pages that can lead the spider to the rest, you can try this approach; it should work for you too.

As for the links, I started with social network links, a few posts with links on the site's own blog (so those are internal links) and only a couple of outside links: articles with in-content links to those pages. Overall I think we are talking about 20-25 social network links (Twitter, Facebook, Digg, StumbleUpon and Delicious), about 10 blog posts published over a 2-3 day span and about 10 articles in outside sources.

Since you have far more pages, you will probably need more gateways, and that means more links. Overall it's not a very time-consuming exercise, and it can solve your issue... hopefully.
