Site Spider / Crawler / Scraper Software
-
Short of coding up your own web crawler, does anyone know of, or have any experience with, a good piece of software that runs through all the pages on a single domain?
(And potentially on linked domains one hop away.)
This could be either server or desktop based.
Useful capabilities would include:
- Scraping (XPath parameters)
- Number of clicks from homepage (site architecture)
- HTTP headers
- Multi-threading
- Use of proxies
- Robots.txt compliance option
- CSV output
- Anything else you can think of...
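For anyone weighing up the "code your own" route, here is a rough sketch of what a few of the items above involve: click depth from the homepage, a header value per page, robots.txt compliance and CSV output. It uses only the Python standard library. The names `crawl`, `to_csv` and `LinkParser` are made up for illustration, and the `fetch` callable is an assumption so the HTTP layer can be swapped out; XPath scraping, multi-threading and proxies are not covered.

```python
import csv
import io
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkParser(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, robots_txt="", user_agent="*", max_pages=100):
    """Breadth-first crawl of a single domain.

    fetch(url) is assumed to return (status, headers_dict, html_text).
    Returns rows of (url, clicks_from_home, status, content_type).
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())  # rules passed in as text
    domain = urlparse(start_url).netloc

    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, clicks from homepage)
    rows = []
    while queue and len(rows) < max_pages:
        url, depth = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # skip anything robots.txt disallows
        status, headers, html = fetch(url)
        rows.append((url, depth, status, headers.get("Content-Type", "")))

        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return rows


def to_csv(rows):
    """Serialise crawl rows to CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["url", "clicks_from_home", "status", "content_type"])
    writer.writerows(rows)
    return out.getvalue()
```

Because the crawl is breadth-first, the depth recorded for each page is the smallest number of clicks needed to reach it from the start URL. In practice `fetch` would be a thin wrapper around whatever HTTP client you prefer, which is also where proxy support would live.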
Perhaps an opportunity for an additional SEOmoz tool here, since they do it already!
Cheers!
Note:
I've had a look at:
- Nutch: http://nutch.apache.org/
- Heritrix: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
- Scrapy: http://doc.scrapy.org/en/latest/intro/overview.html
- Mozenda (does scraping but doesn't appear extensible)
Any experience or preferences with these or others?
-
Hey Alex,
Screaming Frog is hands down the best desktop crawling software and it has most of what you are looking for.
-Mike