What kind of data storage and processing is needed
-
Hi,
So after reading a few posts here, I have realised it is a big deal to crawl the web and index all the links.
For that, I appreciate seomoz.org's efforts.
I was wondering what kind of infrastructure they might need to get this done?
cheers,
Vishal
-
Thank you so much, Kate, for the explanation. It really helps me better understand the process.
-
Hi vishalkhialani!
I thought I would answer your question with some detail that might satisfy your curiosity (although I know more detailed blog posts are in the works).
For Linkscape:
At the heart of our architecture is our own column oriented data store - much like Vertica, although far more specialized for our use case - particularly in terms of the optimizations around compression and speed.
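SEOmoz's store is custom C++ and its internals aren't public, but the compression angle Kate mentions is easy to illustrate. This toy Python sketch (all names here are hypothetical) shows one reason column orientation compresses link data so well: a sorted integer column, such as source-URL IDs, delta-encodes into tiny gap values.

```python
# Illustrative sketch only -- not SEOmoz's actual implementation.
# A sorted column of integer IDs is stored as gaps between neighbours,
# which compress far better than the raw values.

def delta_encode(column):
    """Store a sorted integer column as gaps between neighbours."""
    prev = 0
    gaps = []
    for value in column:
        gaps.append(value - prev)
        prev = value
    return gaps

def delta_decode(gaps):
    """Recover the original column from its gaps."""
    total = 0
    column = []
    for gap in gaps:
        total += gap
        column.append(total)
    return column

source_ids = [1000, 1003, 1004, 1050, 1051]   # hypothetical URL IDs
gaps = delta_encode(source_ids)
print(gaps)   # mostly small numbers, cheap to compress further
```

A row-oriented store interleaves unrelated fields, so runs like this never form; that is the kind of column-level optimization a Vertica-style engine exploits.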
Each month we crawl between 1 and 2 petabytes of data, strip out the parts we care about (links, page attributes, etc.), compute a link graph of how all those sites link to one another (typically between 40-90 billion URLs), and then calculate our metrics from those results. Once we have all of that, we precompute lots of views of the data, which is what gets displayed in Open Site Explorer or retrieved via the Linkscape API. These resulting views of the data are over 12 terabytes (and this is all compressed raw text, so it is a LOT of information). Making this fast and scalable is certainly a challenge.
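The "calculate our metrics" step runs on 40-60 machines over tens of billions of URLs. As a rough sketch of the shape of that computation, here is a PageRank-style iteration (the family of algorithms that link metrics like mozRank belong to) on a hypothetical four-page graph; the real pipeline is distributed C++, so this is purely illustrative.

```python
# Toy link-metric computation -- not SEOmoz's actual algorithm or code.
# Each page repeatedly shares a damped fraction of its score across its
# outgoing links until the scores settle.

def page_rank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = page_rank(graph)
print(max(ranks, key=ranks.get))   # "c" has the most inbound links and wins
```

Scaling this from four nodes to 90 billion is exactly where the multi-machine processing cluster comes in: the graph no longer fits on one box, so each iteration becomes a distributed pass over partitioned edge data.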
For the crawling, we operate 10-20 boxes that crawl all the time.
For processing, we spin up between 40-60 instances to create the link graph, metrics and views.
And the API serves the index from S3 (Amazon's cloud storage) with 150-200 instances (but this was only 10 a year ago, so we are seeing a lot of growth). All of this is Linux and C++ (with some Python thrown in here and there).
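The core loop on those crawl boxes is fetch a page, extract its links, and enqueue them. The production crawlers are C++, but the extraction step can be sketched with Python's standard-library HTML parser to show the idea:

```python
# Minimal illustration of the link-extraction half of a crawl loop.
# The real crawlers are C++ and handle far more (robots.txt, politeness,
# canonicalization); this only shows pulling hrefs out of fetched HTML.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/about">About</a> <a href="http://example.com">Out</a></body></html>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)   # ['/about', 'http://example.com']
```

Everything extracted here feeds the next stage: the links go into the frontier queue, and the page attributes go into the monthly processing run.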
For custom crawl:
We use similar crawling algorithms to Linkscape, only we keep the crawls per site and also compute issues (like which pages are duplicates of one another). Then each of those crawls is processed and precomputed to be served quickly and easily within the web app (so calculating the aggregates and deltas you see in the overview sections).
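The "which pages are duplicates of one another" computation can be sketched with word shingling and Jaccard similarity, a standard near-duplicate technique; whether the custom crawl uses exactly this method isn't stated, so treat it as an assumed, illustrative approach.

```python
# Hypothetical near-duplicate check -- a common technique, not necessarily
# the one SEOmoz's C++ issue-processing code actually uses.

def shingles(text, k=3):
    """Return the set of k-word shingles in a page's text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Overlap of two shingle sets: 1.0 means identical content."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

page_one = "welcome to our store browse the full catalog of widgets"
page_two = "welcome to our store browse the full catalog of gadgets"
similarity = jaccard(shingles(page_one), shingles(page_two))
print(round(similarity, 2))   # high overlap flags the pair as near-duplicates
```

A crawler would compute shingle sets per page and flag pairs above some similarity threshold; at scale this is usually done with sketches (e.g. MinHash) rather than full set comparisons.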
We use S3 for archival of all old crawls, Cassandra for some of the details you see in the detailed views, and a lot of the overviews and aggregates are served from the web app's database.
Most of the code here is Ruby, except for the crawling and issue processing which is C++. All of it runs on Linux.
Hope that helps explain! Definitely let me know if you have more questions though!
Kate -
It is nowhere near that many. I attached an image from when I saw Rand moving the server to the new building. I think this may be the reason why there have been so many issues with the Linkscape crawl recently.
-
@keri and @Ryan
Will ask them. My guess is around a thousand server instances.
-
Good answer from Ryan, and I caution that even then you may not get a direct answer. It might be similar to asking Google just how many servers they have. SEOmoz is fairly open with information, but that may be a bit beyond the scope of what they are willing to answer.
-
A question of this nature would probably be best as your one private question per month. That way you will be sure to receive a direct reply from a SEOmoz staff member. You could also try the help desk, but it may be a stretch.
All I can say is it takes tremendous amounts of resources. Google does it very well, but we all know they generate over $30 billion in revenue annually.
There are numerous crawl programs available, but the problem is the server hardware to run them.
I am only responding because I think your question may otherwise go unanswered and I wanted to point you in a direction where you can receive some info.