Site Spider / Crawler / Scraper Software
-
Short of coding up your own web crawler, does anyone know of, or have any experience with, a good piece of software that can crawl all the pages on a single domain?
(And potentially on linked domains 1 hop away...)
This could be either server-based or desktop-based.
Useful capabilities would include:
- Scraping (XPath parameters)
- Number of clicks from homepage (site architecture)
- HTTP headers
- Multithreading
- Use of proxies
- Robots.txt compliance option
- CSV output
- Anything else you can think of...
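For anyone who does end up rolling their own, a few of the items above (clicks from homepage, a robots.txt compliance option, CSV output) can be sketched with the Python standard library alone. This is only a minimal illustration, not how any of the tools in this thread work: the `crawl` and `to_csv` functions, the `fetch` callable, the `example-bot` user agent, and the `example.test` URLs are all hypothetical names made up for the example. XPath scraping, HTTP header capture, multithreading, and proxies are left out.

```python
import csv
import io
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, robots_lines=(), user_agent="example-bot", max_pages=100):
    """Breadth-first crawl of a single host; returns {url: clicks_from_start}.

    `fetch` is a hypothetical callable: it takes a URL and returns the page's
    HTML as text, or None on failure. Plug in your own HTTP client there.
    """
    robots = RobotFileParser()
    robots.parse(list(robots_lines))  # robots.txt content, passed in as lines
    host = urlparse(start_url).netloc
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc != host:
                continue  # stay on the starting host
            if link in depths or not robots.can_fetch(user_agent, link):
                continue  # already seen, or disallowed by robots.txt
            if len(depths) >= max_pages:
                return depths
            # BFS guarantees the first time we see a link is its shortest path.
            depths[link] = depths[url] + 1
            queue.append(link)
    return depths


def to_csv(depths):
    """Renders the crawl result as CSV text, shallowest pages first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", "clicks_from_home"])
    for url, depth in sorted(depths.items(), key=lambda item: (item[1], item[0])):
        writer.writerow([url, depth])
    return buffer.getvalue()


if __name__ == "__main__":
    # A tiny in-memory "site" standing in for real HTTP responses.
    pages = {
        "http://example.test/": '<a href="/a">A</a> <a href="/private/x">X</a>',
        "http://example.test/a": '<a href="/b">B</a>',
        "http://example.test/b": '<a href="/">Home</a>',
    }
    rules = ["User-agent: *", "Disallow: /private"]
    print(to_csv(crawl("http://example.test/", pages.get, rules)), end="")
```

Passing `fetch` in as a parameter keeps the crawl logic separate from the network layer, so proxy support or request throttling could live in that one function without touching the rest.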
Perhaps an opportunity for an additional SEOmoz tool here, since they do it already!
Cheers!
Note:
I've had a look at:
- Nutch: http://nutch.apache.org/
- Heritrix: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
- Scrapy: http://doc.scrapy.org/en/latest/intro/overview.html
- Mozenda (does scraping but doesn't appear extensible..)
Any experience or preferences with these or others?
-
Hey Alex,
Screaming Frog is hands down the best desktop crawling software and it has most of what you are looking for.
-Mike