What kind of data storage and processing is needed
-
Hi,
So after reading a few posts here I have realised it is a big deal to crawl the web and index all the links.
For that I appreciate seomoz.org's efforts.
I was wondering what kind of infrastructure they might need to get this done?
cheers,
Vishal
-
Thank you so much Kate for the explanation. It is quite helpful to better understand the process.
-
Hi vishalkhialani!
I thought I would answer your question with some detail that might satisfy your curiosity (although I know more detailed blog posts are in the works).
For Linkscape:
At the heart of our architecture is our own column-oriented data store - much like Vertica, although far more specialized for our use case, particularly in terms of the optimizations around compression and speed.
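To make "column-oriented" and "compression" a bit more concrete, here is a minimal sketch in Python. It is not SEOmoz's code: the field names (`source_domain`, `target_domain`, `nofollow`) and the choice of run-length encoding are assumptions made purely for illustration. The idea it shows is that storing each column contiguously puts similar values next to each other, which tends to compress well.

```python
from itertools import groupby


def to_columns(rows):
    """Turn a list of row dicts into a dict of column lists."""
    keys = rows[0].keys()
    return {key: [row[key] for row in rows] for key in keys}


def run_length_encode(values):
    """Collapse runs of consecutive equal values into [value, count] pairs."""
    return [[value, len(list(group))] for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand [value, count] pairs back into the original list."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# Hypothetical link records, sorted by source domain.
rows = [
    {"source_domain": "a.com", "target_domain": "b.com", "nofollow": False},
    {"source_domain": "a.com", "target_domain": "c.com", "nofollow": False},
    {"source_domain": "a.com", "target_domain": "d.com", "nofollow": True},
    {"source_domain": "b.com", "target_domain": "a.com", "nofollow": True},
]

columns = to_columns(rows)
encoded = {name: run_length_encode(values) for name, values in columns.items()}
print(encoded["source_domain"])
```

Columns with long runs (here `source_domain`) shrink the most; a column of mostly distinct values (here `target_domain`) gains nothing from this particular encoding, which is why real column stores typically pick an encoding per column.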
Each month we crawl between 1-2 petabytes of data, strip out the parts we care about (links, page attributes, etc.), compute a link graph of how all those sites link to one another (typically between 40-90 billion URLs), and then calculate our metrics using those results. Once we have all of that, we precompute lots of views of the data, which is what gets displayed in Open Site Explorer or retrieved via the Linkscape API. These resulting views of the data are over 12 terabytes (and this is all raw text compressed data - so it is a LOT of information). Making this fast and scalable is certainly a challenge.
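The "link graph, then metrics" step can be sketched at toy scale as follows. This is only an illustration under stated assumptions: the post does not say which metrics are computed or how, so the example swaps in a generic PageRank-style power iteration, and the function names, the damping factor of 0.85, and the sample links are all hypothetical.

```python
def build_link_graph(links):
    """Build {url: set of outbound targets} from (source, target) pairs."""
    graph = {}
    for source, target in links:
        graph.setdefault(source, set())
        graph.setdefault(target, set())
        if source != target:  # ignore self-links
            graph[source].add(target)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Generic PageRank-style score via power iteration (illustrative only)."""
    n = len(graph)
    rank = {url: 1.0 / n for url in graph}
    for _ in range(iterations):
        # Rank held by pages with no outbound links is spread evenly.
        dangling = sum(rank[url] for url, outs in graph.items() if not outs)
        new_rank = {
            url: (1.0 - damping) / n + damping * dangling / n for url in graph
        }
        for url, outs in graph.items():
            for target in outs:
                new_rank[target] += damping * rank[url] / len(outs)
        rank = new_rank
    return rank


# Hypothetical crawl output: (linking page, linked page) pairs.
links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
graph = build_link_graph(links)
scores = pagerank(graph)
for url in sorted(scores, key=scores.get, reverse=True):
    print(url, round(scores[url], 4))
```

At 40-90 billion URLs nothing like this fits in one process's memory, which is presumably why the real pipeline needs dozens of instances and a specialized data store; the sketch only shows the shape of the computation.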
For the crawling, we operate 10-20 boxes that crawl all the time.
For processing, we spin up between 40-60 instances to create the link graph, metrics and views.
And the API serves the index from S3 (Amazon's cloud storage) with 150-200 instances (but this was only 10 a year ago, so we are seeing a lot of growth).
All of this is Linux and C++ (with some Python thrown in here and there).
For custom crawl:
We use similar crawling algorithms to Linkscape, only we keep the crawls per site, and also compute issues (like which pages are duplicates of one another). Then each of those crawls is processed and precomputed to be served quickly and easily within the web app (so calculating the aggregates and deltas you see in the overview sections).
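As a rough sketch of what "which pages are duplicates" and "deltas" could look like for a single site crawl: the example below groups URLs by a hash of their whitespace-normalized content and diffs issue counts between two crawls. The exact-hash approach, the function names, and the sample data are assumptions for illustration; the post does not describe the actual duplicate-detection method, which may well use fuzzier matching.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(text):
    """Hash the lowercased, whitespace-normalized page text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicate_groups(pages):
    """Return sorted lists of URLs whose content fingerprints match."""
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[content_fingerprint(body)].append(url)
    return sorted(sorted(urls) for urls in groups.values() if len(urls) > 1)


def crawl_delta(previous_counts, current_counts):
    """Per-issue change in count between two crawls of the same site."""
    issues = set(previous_counts) | set(current_counts)
    return {
        issue: current_counts.get(issue, 0) - previous_counts.get(issue, 0)
        for issue in issues
    }


# Hypothetical crawl of one site: URL -> extracted page text.
pages = {
    "/home": "Welcome to our shop",
    "/index.html": "welcome   to our SHOP",
    "/about": "About us",
}
duplicates = find_duplicate_groups(pages)
delta = crawl_delta(
    {"duplicate_content": 5, "missing_title": 2},
    {"duplicate_content": 3, "broken_link": 1},
)
print(duplicates)
print(delta)
```

Precomputing results like `duplicates` and `delta` once per crawl, rather than on each page view, is what lets the web app serve overview pages quickly.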
We use S3 for archival of all old crawls and Cassandra for some of the details you see in detailed views; a lot of the overviews and aggregates are served from the web app DB.
Most of the code here is Ruby, except for the crawling and issue processing which is C++. All of it runs on Linux.
Hope that helps explain! Definitely let me know if you have more questions though!
Kate
-
It is nowhere near that many. I attached an image of when I saw Rand moving the server to the new building. I think this may be the reason why there have been so many issues with the Linkscape crawl recently.
-
@keri and @Ryan
Will ask them. My guess is around a thousand server instances.
-
Good answer from Ryan, and I caution that even then you may not get a direct answer. It might be similar to asking Google just how many servers they have. SEOmoz is fairly open with information, but that may be a bit beyond the scope of what they are willing to answer.
-
A question of this nature would probably be best as your one private question per month. That way you will be sure to receive a direct reply from a SEOmoz staff member. You could also try the help desk, but it may be a stretch.
All I can say is it takes tremendous amounts of resources. Google does it very well, but we all know they have over 30 billion in revenue generated annually.
There are numerous crawl programs available, but the problem is the server hardware to run them.
I am only responding because I think your question may otherwise go unanswered and I wanted to point you in a direction where you can receive some info.