Working out exactly how Google is crawling my site if I have loooots of pages
-
I am trying to work out exactly how Google is crawling my site including entry points and its path from there. The site has millions of pages and hundreds of thousands indexed. I have simple log files with a time stamp and URL that google bot was on. Unfortunately there are hundreds of thousands of entries even for one day and as it is a massive site I am finding it hard to work out the spiders paths. Is there any way using the log files and excel or other tools to work this out simply? Also I was expecting the bot to almost instantaneously go through each level eg. main page--> category page ---> subcategory page (expecting same time stamp) but this does not appear to be the case. Does the bot follow a path right through to the deepest level it can/allowed to for that crawl and then returns to the higher level category pages at a later time? Any help would be appreciated
Cheers
-
Can you explain to me how you did your site map for this please?
-
I've run into the same issue for a site with 40 k + pages - far from your overall page # but still .. maybe it's the same flow overall.
The site I was working on had a structure of about 5 level deep. Some of the areas within the last level were out of reach and they didn't get indexed. More then that even a few areas on level 2 were not present in the google index and the google boot didn't visit those either.
I've created a large xml site map and a dynamic html sitemap with all the pages from the site and submit it via webmaster tool (the xml sitemap that is) but that didn't solve the issue and the same areas were out of the index and didn't got hit. Anyway the huge html sitemap was impossible to follow from a user point of view so I didn't keep that online for long but I am sure it can't work that way either.
What i did that finally solved the issue was to spot the exact areas that were left out, identify the "head" of those pages - that means several pages that acted as gateway for the entire module and I've build a few outside links that pointed to those pages directly and a few that were pointed to main internal pages of those modules that were left out.
Those pages gain authority fast and only in a few days we've spotted the google boot staying over night
All pages are now indexed and even ranking well.
If you can spot some entry pages that can conduct the spider to the rest you can try this approach - it should work for you too.
As far as links I've started with social network links, a few posts with links within the site blog (so that means internal links) and only a couple of outside links - articles with content links for those pages. Overall I think we are talking about 20-25 social network links (twitter, facebook, digg, stumble and delic), about 10 blog posts published in a 2-3 days span and about 10 articles in outside sources.
Since you have a much larger # as far as pages you probably will need more gateways and that means more links - but overall it's not a very time consuming session and it can solve your issue... hopefully
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
A single page from site not ranking
Hello, We have a new site launched in March, that is ranking well in search for all of the pages, except one and we don't know why. This page it is optimised exactly the same way like the others, but still doesn't rank in Google. We have verified robots.txt for noffollow, noindex tags, we have verified if it was penalized by Google, but still didn't find nothing. Initially we had another site and was on the topic of this page, but we have redirected it to the new one. In case this old site was anytime in the past penalized by Google, could it be possible that the new page be influenced by this? Also, we have another site that ranks on the first position, that targets the same keywords like the page that does not rank. It was the first site we launched, so it is pretty much old, but we do not have duplicate content on them. Maybe Google doesn't like the fact that both target the same keywords and chooses to display only the old site? Please help us if you have any ideas or have been through such thing. Thank you!
Intermediate & Advanced SEO | | daniela.pirlogea0 -
Interest in optimise Google Crawl
Hello, I have an ecommerce site with all pages crawled and indexed by Google. But I have some pages with multiple urls like : www.sitename.com/product-name.html and www.sitename.com/category/product-name.html There is a canonical on all these pages linking to the simplest url (so Google index only one page). So the multiple pages are not indexed, but Google still comes crawling them. My question is : Did I have any interest in avoiding Google to crawl these pages or not ? My point is that Google crawl around 1500 pages a day on my site, but there are only 800 real pages and they are all indexed on Google. There is no particular issue, so is it interesting to make it change ? Thanks
Intermediate & Advanced SEO | | onibi290 -
Google does not want to index my page
I have a site that is hundreds of page indexed on Google. But there is a page that I put in the footer section that Google seems does not like and are not indexing that page. I've tried submitting it to their index through google webmaster and it will appear on Google index but then after a few days it's gone again. Before that page had canonical meta to another page, but it is removed now.
Intermediate & Advanced SEO | | odihost0 -
Reviews not pulling through to Google My Business page
OK, a local SEO question! We are working with a plumbing company. A search for (Google UK) shows the knowledge panel with 20+ reviews. This is good! However, if you search for "plumbers norwich" and look at the map, thecompany is on the third page and has no reviews. I've logged into Google My Business, and it says the profile is not up to date and only 70% complete with no reviews. This is odd, as there was a fully complete profile recently. Any ideas on how best to reconcile the two? Thanks!
Intermediate & Advanced SEO | | Ad-Rank1 -
Will I lose traffic from Google for re-directing a page?
I’m currently planning to a retire a discontinued product and put a 301 redirect to a related product (although not identical). The thing is, I’m still getting significant traffic from people searching for the old product by name. Would Google send this traffic to the new pages via the re-direct? Is Google likely to display the new page in place of the old page for similar queries or will it serve other content? I’d like to answer this question so that I can decide between the two following approaches: 1) Retiring the old page immediately and putting a 301 redirect to the new related pages. This will have the advantage of transferring the value of any link signals / referring traffic. Traffic will also land on the new pages directly without having to click through from another page. We would have a dynamic message telling users that the old product had been retired depending on whether they had visited out site before. 2) Keep the old product pages temporarily so that we don’t lose the traffic from the search engines. We would then change the old pages to advise users that the old product was now retired, but that we have other products that might solve their problems. When this organic traffic decreases over time, then we will proceed with the re-direct as above. I am worried though that the old product pages might outrank the new product pages. I’d really appreciate some advice with this. I’ve been reading lots of articles, but it seems like there are different opinions on this. I understand that I will lose between 10% - 15% of page rank as per the Matt Cutts video.
Intermediate & Advanced SEO | | RG_SEO0 -
Does Google make continued attempts to crawl an old page one it has followed a 301 to the new page?
I am curious about this for a couple of reasons. We have all dealt with a site who switched platforms and didn't plan properly and now have 1,000's of crawl errors. Many of the developers I have talked to have stated very clearly that the HTacccess file should not be used for 1,000's of singe redirects. I figured If I only needed them in their temporarily it wouldn't be an issue. I am curious if once Google follows a 301 from an old page to a new page, will they stop crawling the old page?
Intermediate & Advanced SEO | | RossFruin0 -
Issues with Google-Bot crawl vs. Roger-Bot
Greetings from a first time poster and SEO noob... I hope that this question makes sense... I have a small e-commerce site, I have had Roger-bot crawl the site and I have fixed all errors and warnings that Volusion will allow me to fix. Then I checked Webmaster Tools, HTML improvements section and the Google-bot sees different dupe. title tag issues that Roger-bot did not. so A few weeks back I changed the title tag for a product, and GWT says that I have duplicate title tags but there is only one live page for the product. GWT lists the dupe. title tags, but when I click on each they all lead to the same live page. I'm confused, what pages are these other title tags referring to? Does Google have more than one page for that product indexed due to me changing the title tag when the page had a different URL? Does this question make sense? 2) Is this issue a problem? 3) What can I do to fix it? Any help would be greatly appreciated Jeff
Intermediate & Advanced SEO | | IOSC0 -
Getting Google Authorship to Work
Hi I set up authorship about 10 days ago at the suggestion of this great forum but haven't seen anything happen yet. Could someone take a quick look and check I've done it right? You can see a typical post here: http://www.touristisrael.com/neve-tzedek-tel-aviv/354/ Thanks!
Intermediate & Advanced SEO | | ben10000