What kind of data storage and processing is needed
-
Hi,
So after reading a few posts here I have realised it is a big deal to crawl the web and index all the links.
For that I appreciate seomoz.org's efforts.
I was wondering what kind of infrastructure they might need to get this done?
cheers,
Vishal
-
Thank you so much Kate for the explanation. It is quite helpful to better understand the process.
-
Hi vishalkhialani!
I thought I would answer your question with some detail that might satisfy your curiosity (although I know more detailed blog posts are in the works).
For Linkscape:
At the heart of our architecture is our own column-oriented data store, much like Vertica but far more specialized for our use case, particularly in the optimizations around compression and speed.
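To make the column-oriented idea concrete, here is a minimal sketch in Python. It is not the Linkscape store (which is custom C++ and not described in detail here); the function names, the sample rows, and the choice of run-length encoding are all assumptions for illustration. It shows why storing values column by column helps compression: repeated values end up next to each other.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot row dicts into one list per column (assumes every row has the same keys)."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# Hypothetical crawl records, invented for this example.
rows = [
    {"domain": "example.com", "status": 200},
    {"domain": "example.com", "status": 200},
    {"domain": "example.com", "status": 404},
    {"domain": "example.org", "status": 200},
]
columns = to_columns(rows)
encoded = {name: rle_encode(values) for name, values in columns.items()}
```

In this toy data the `domain` column shrinks from four values to two runs, which is the kind of saving a row-by-row layout would not expose.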
Each month we crawl between 1 and 2 petabytes of data, strip out the parts we care about (links, page attributes, etc.), compute a link graph of how all those sites link to one another (typically between 40 and 90 billion URLs), and then calculate our metrics using those results. Once we have all of that, we precompute lots of views of the data, which is what gets displayed in Open Site Explorer or retrieved via the Linkscape API. These resulting views of the data are over 12 terabytes (and this is all raw text compressed data, so it is a LOT of information). Making this fast and scalable is certainly a challenge.
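As a rough illustration of the "link graph, then metrics, then precomputed views" steps, here is a small Python sketch. It is only a toy under stated assumptions: the function names are hypothetical, the metric (a count of distinct external linking root domains) is picked as an example rather than taken from the actual pipeline, and the root-domain rule is deliberately naive.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def root_domain(url):
    """Naive root domain: the last two hostname labels (wrong for suffixes like .co.uk)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def build_link_graph(links):
    """Map each target URL to the set of source URLs that link to it."""
    inbound = defaultdict(set)
    for source, target in links:
        inbound[target].add(source)
    return inbound


def linking_root_domains(inbound):
    """Precompute, per target URL, the number of distinct external root domains linking in."""
    view = {}
    for target, sources in inbound.items():
        domains = {root_domain(source) for source in sources}
        domains.discard(root_domain(target))  # ignore internal links
        view[target] = len(domains)
    return view


# Hypothetical (source, target) pairs extracted from crawled pages.
links = [
    ("http://a.example.com/1", "http://target.org/"),
    ("http://b.example.com/2", "http://target.org/"),
    ("http://other.net/x", "http://target.org/"),
    ("http://target.org/about", "http://target.org/"),
]
view = linking_root_domains(build_link_graph(links))
```

The `view` dictionary stands in for a precomputed view: the expensive work happens once per index build, and serving a lookup afterwards is cheap.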
For the crawling, we operate 10-20 boxes that crawl all the time.
For processing, we spin up between 40 and 60 instances to create the link graph, metrics and views.
And the API serves the index from S3 (Amazon's cloud storage) with 150-200 instances (but this was only 10 a year ago, so we are seeing a lot of growth).
All of this is Linux and C++ (with some Python thrown in here and there).
For custom crawl:
We use similar crawling algorithms to Linkscape, only we keep the crawls per site, and also compute issues (like which pages are duplicates of one another). Then each of those crawls is processed and precomputed to be served quickly and easily within the web app (so calculating the aggregates and deltas you see in the overview sections).
We use S3 for archival of all old crawls and Cassandra for some of the details you see in detailed views; a lot of the overviews and aggregates are served from the web app db.
Most of the code here is Ruby, except for the crawling and issue processing which is C++. All of it runs on Linux.
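For a feel of what "which pages are duplicates of one another" can mean in code, here is a minimal Python sketch. It only catches exact duplicates after whitespace and case normalization, which is an assumption made for this example; the real issue processing is C++ and its matching rules are not described here. The function names and sample pages are hypothetical.

```python
import hashlib
from collections import defaultdict


def fingerprint(text):
    """Hash the page text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Group URLs whose normalized text is identical; return groups with 2+ URLs."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical crawl of one site: URL path -> extracted page text.
pages = {
    "/shoes": "Red  shoes on sale",
    "/shoes?ref=home": "red shoes on sale",
    "/hats": "Blue hats",
}
duplicates = find_duplicates(pages)
```

Because each page is reduced to one fingerprint, the grouping is a single pass over the crawl, and the per-site duplicate counts can be stored as one of the precomputed aggregates.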
Hope that helps explain! Definitely let me know if you have more questions though!
Kate
-
It is nowhere near that many. I attached an image of when I saw Rand moving the server to the new building. I think this may be the reason why there have been so many issues with the Linkscape crawl recently.
-
@keri and @Ryan
Will ask them. My guess is around a thousand server instances.
-
Good answer from Ryan, and I caution that even then you may not get a direct answer. It might be similar to asking Google just how many servers they have. SEOmoz is fairly open with information, but that may be a bit beyond the scope of what they are willing to answer.
-
A question of this nature would probably be best as your one private question per month. That way you will be sure to receive a direct reply from an SEOmoz staff member. You could also try the help desk, but it may be a stretch.
All I can say is it takes tremendous amounts of resources. Google does it very well, but we all know they have over 30 billion in revenue generated annually.
There are numerous crawl programs available, but the problem is the server hardware to run them.
I am only responding because I think your question may otherwise go unanswered and I wanted to point you in a direction where you can receive some info.