How does Google index pagination variables in Ajax snapshots? We're seeing random huge variables.
-
We're using the Google snapshot method to index dynamic Ajax content. Some of this content is from tables using pagination. The pagination is tracked with a var in the hash, something like:
#!home/?view_3_page=1
We're seeing all sorts of calls from Google now with huge numbers for these URL variables that we are not generating with our snapshots. Like this:
#!home/?view_3_page=10099089
These aren't trivial since each snapshot represents a server load, so we'd like these vars to only represent what's returned by the snapshots.
Is Google generating random numbers going fishing for content? If so, is this something we can control or minimize?
-
Thanks for the great replies all. Just to clarify, this is the page we're referencing:
http://www.knackhq.com/business-directory-user-demo/?escaped_fragment=
You can see the one pagination var "next" that points here:
http://www.knackhq.com/business-directory-user-demo/?escaped_fragment=home/?view_3_page=2
As you can see this is pretty simple. There's only one potential variable (the "prev" and "next" links) for introducing these huge numbers and that's pretty limited. We tested the Google URLs up and down the app and haven't seen anything that would send it fishing for larger numbers. But Google keeps hammering us with:
GET /business-directory-user-demo/?escaped_fragment=home/?view_3_page=1000251
For now we're trying to respond to those with 404s and hope they eventually die.
Unfortunately we can't avoid hashbangs.
-
This seems to do this only for parameters that it has decided "changes, re-orders, or narrows content." They may also crawl things that look like URLs in Javascript even when it's part of a function, but it doesn't seem like that's what's happening in this case.
Depending on the setup of the site, you can either manually configure the variable in WMT (don't do this if the parameter is material), write a clever robots.txt rule (e.g. to block anything after a number of digits after the parameter), or (the best solution) re-work the system to generate URLs that don't rely on parameters.
I'm not sure I understand why the server is rendering a page if the URL isn't supposed to exist. Depending on your server config, you may also be able to return a 404 and make a rule for which (valid) pages to render. From there you can just ignore the 404 errors until Google figures it out.
I think that's the best I can do without seeing the site.
-
I agree with Federico. I've seen Google go fishing with URL parameters (?param=xyz) and I've seen it with AJAX and hashbangs as well. How far they take this and when they choose to apply it doesn't seem to follow a consistent pattern . You can see some folks on StackExchange discussing this, too: http://webmasters.stackexchange.com/questions/25560/does-the-google-crawler-really-guess-url-patterns-and-index-pages-that-were-neve
-
Awesome, thanks for looking into it. We've gotten nowhere with any kind of answer.
-
Hi There
I'm an associate here at Moz, and have asked the other associates if they might know the answer, as this one's a little outside of my experience. Please follow up and let us know if you don't hear from anyone.
Thanks!
-Dan
-
We also noticed some weird crawls last year using random numbers at the end of the URL, checking in google webmaster tools we saw that most of those urls were reported as not found, checking from where the link came from google listed some of our URLs, but didn't had any link to those URLs google was trying to fetch. After 2 or 3 months those crawls stopped. We never knew from where Google got those URLs...
-
Hi Federico, thanks for the response.
Unfortunately this is an SEO solution for a third-party JavaScript product, so removing the hash isn't an option.
I'm still interested in knowing if this is a formal Google practice and if there's some way to control or mitigate this.
-
I think you are right. Google is fishing for content. I would find a solution to make those URL friendly by removing the hash and using some URL rewrite and pushState to paginate that content instead.
Here's a previous question that may help: http://moz.com/community/q/best-way-to-break-down-paginated-content
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Website dropped out from Google index
Howdy, fellow mozzers. I got approached by my friend - their website is https://www.hauteheadquarters.com She is saying that they dropped from google index over night - and, as you can see if you google their name, website url or even site: , most of the pages are not indexed. Home page is nowhere to be found - that's for sure. I know that they were indexed before. Google webmaster tools don't have any manual actions (at least yet). No sudden changes in content or backlink profile. robots.txt has some weird rule - disallow everything for EtaoSpider. I don't know if google would listen to that - robots checker in GWT says it's all good. Any ideas why that happen? Any ideas what I should check? P.S. Just noticed in GWT there was a huge drop in indexed pages within first week of August. Still no idea why though. P.P.S. Just noticed that there is noindex x-robots-tag in headers... Anyone knows where this can be set?
Intermediate & Advanced SEO | | DmitriiK0 -
How can I make a list of all URLs indexed by Google?
I started working for this eCommerce site 2 months ago, and my SEO site audit revealed a massive spider trap. The site should have been 3500-ish pages, but Google has over 30K pages in its index. I'm trying to find a effective way of making a list of all URLs indexed by Google. Anyone? (I basically want to build a sitemap with all the indexed spider trap URLs, then set up 301 on those, then ping Google with the "defective" sitemap so they can see what the site really looks like and remove those URLs, shrinking the site back to around 3500 pages)
Intermediate & Advanced SEO | | Bryggselv.no0 -
Google is not indexing an updated website
We just relaunched a website that has 5 years old, we maintain all the old URLs and articles but for some reason google is not picking up the new website https://www.navisyachts.com. In Google Webmaster Tools we can see the sitemap with over 1000 pages submitted but shows nothing as indexed. The site is loosing traffic rapidly and positions, from the SEO side all looks fine for me. What can be wrong? I’ll appreciate any help. The new website is built over Joomla 3.4, we have it here at MOZ and other than some minor details it doesn't show that something can be wrong with the website. Thank you.
Intermediate & Advanced SEO | | FWC_SEO0 -
Canonical Tags being indexed on paginated results?
On a website I'm working on which has a search feature with paginated results, all of the pages of the search results are set with a canonical tag back to the first page of the search results, however Google is indexing certain random pages within the result set. I can literally do a search in Google and find a deep page in the results, click on it and view source on that page and see that it has a canonical tag leading back to the first page of the set. Has anyone experienced this before? Why would Google not honor a canonical tag if it is set correctly? I've seen several SEO techniques for dealing with pagination, is there another solution that you all recommend?
Intermediate & Advanced SEO | | IrvCo_Interactive0 -
How can Google index a page that it can't crawl completely?
I recently posted a question regarding a product page that appeared to have no content. [http://www.seomoz.org/q/why-is-ose-showing-now-data-for-this-url] What puzzles me is that this page got indexed anyway. Was it indexed based on Google knowing that there was once content on the page? Was it indexed based on the trust level of our root domain? What are your thoughts? I'm asking not only because I don't know the answer, but because I know the argument is going to be made that if Google indexed the page then it must have been crawlable...therefore we didn't really have a crawlability problem. Why Google index a page it can't crawl?
Intermediate & Advanced SEO | | danatanseo0 -
Google Indexing Feedburner Links???
I just noticed that for lots of the articles on my website, there are two results in Google's index. For instance: http://www.thewebhostinghero.com/articles/tools-for-creating-wordpress-plugins.html and http://www.thewebhostinghero.com/articles/tools-for-creating-wordpress-plugins.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+thewebhostinghero+(TheWebHostingHero.com) Now my Feedburner feed is set to "noindex" and it's always been that way. The canonical tag on the webpage is set to: rel='canonical' href='http://www.thewebhostinghero.com/articles/tools-for-creating-wordpress-plugins.html' /> The robots tag is set to: name="robots" content="index,follow,noodp" /> I found out that there are scrapper sites that are linking to my content using the Feedburner link. So should the robots tag be set to "noindex" when the requested URL is different from the canonical URL? If so, is there an easy way to do this in Wordpress?
Intermediate & Advanced SEO | | sbrault740 -
Ajax Content Indexed
I used the following guide to implement the endless scroll https://developers.google.com/webmasters/ajax-crawling/docs/getting-started crawlers and correctly reads all URLs the command "site:" show me all indexed Url with #!key=value I want it to be indexed only the first URL, for the other Urls I would be scanned but not indexed like if there were the robots meta tag "noindex, follow" how I can do?
Intermediate & Advanced SEO | | wwmind1 -
Google swapped our website's long standing ranking home page for a less authoritative product page?
Our website has ranked for two variations of a keyword, one singular & the other plural in Google at #1 & #2 (for over a year). Keep in mind both links in serps were pointed to our home page. This year we targeted both variations of the keyword in PPC to a products landing page(still relevant to the keywords) within our website. After about 6 weeks, Google swapped out the long standing ranked home page links (p.a. 55) rank #1,2 with the ppc directed product page links (p.a. 01) and dropped us to #2 & #8 respectively in search results for the singular and plural version of the keyword. Would you consider this swapping of pages temporary, if the volume of traffic slowed on our product page?
Intermediate & Advanced SEO | | JingShack0