What kind of data storage and processing is needed
-
Hi,
So after reading a few posts here I have realised it is a big deal to crawl the web and index all the links.
For that I appreciate seomoz.org's efforts.
I was wondering what kind of infrastructure they might need to get this done?
cheers,
Vishal
-
Thank you so much Kate for the explanation. It is quite helpful to better understand the process.
-
Hi vishalkhialani!
I thought I would answer your question with some detail that might satisfy your curiosity (although I know more detailed blog posts are in the works).
For Linkscape:
At the heart of our architecture is our own column-oriented data store, much like Vertica but far more specialized for our use case, particularly in terms of the optimizations around compression and speed.
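As an illustration only (this is not SEOmoz's code, and the post does not describe the store's internals), here is a minimal Python sketch of why a column-oriented layout compresses well. The field names and sample records are hypothetical, and run-length encoding is just one simple compression scheme picked for the example.

```python
from itertools import groupby

def to_columns(rows):
    """Pivot row dicts into one list per field (assumes every row has the same keys)."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns

def rle_encode(values):
    """Run-length encode a column: [a, a, b] -> [(a, 2), (b, 1)]."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(pairs):
    """Inverse of rle_encode."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out

# Hypothetical link records, sorted by target domain so equal values sit together.
rows = [
    {"target_domain": "example.com", "nofollow": False, "status": 200},
    {"target_domain": "example.com", "nofollow": False, "status": 200},
    {"target_domain": "example.com", "nofollow": True, "status": 200},
    {"target_domain": "example.org", "nofollow": True, "status": 404},
]

columns = to_columns(rows)
encoded = {name: rle_encode(values) for name, values in columns.items()}
```

Because each column holds values of one kind, long runs of repeated values collapse into a single (value, count) pair, and a query that needs only one field can skip the others entirely.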
Each month we crawl between 1 and 2 petabytes of data, strip out the parts we care about (links, page attributes, etc.), compute a link graph of how all those sites link to one another (typically between 40 and 90 billion URLs), and then calculate our metrics from those results. Once we have all of that, we precompute lots of views of the data, which is what gets displayed in Open Site Explorer or retrieved via the Linkscape API. These resulting views of the data are over 12 terabytes (and this is all raw text data, compressed, so it is a LOT of information). Making this fast and scalable is certainly a challenge.
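To make the "link graph, then metrics" step concrete, here is a toy Python sketch. It is not the Linkscape pipeline: the post does not say which metrics are computed, so a simple inbound-link count and a textbook PageRank-style power iteration stand in as assumptions, and the URLs are made up. At tens of billions of URLs nothing like this fits in one machine's memory; the sketch only shows the shape of the computation.

```python
def build_link_graph(links):
    """Turn (source_url, target_url) pairs into {url: set of outlinked urls}."""
    graph = {}
    for source, target in links:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())  # pages with no outlinks are still nodes
    return graph

def inbound_counts(graph):
    """Count how many distinct pages link to each page."""
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return counts

def pagerank(graph, damping=0.85, iterations=50):
    """Textbook power iteration; a stand-in metric, not Moz's actual formula."""
    n = len(graph)
    if n == 0:
        return {}
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        # Rank held by pages with no outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p, targets in graph.items() if not targets)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in graph}
        for page, targets in graph.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical crawl output: (page, page it links to).
links = [
    ("a.com/", "b.com/"),
    ("a.com/", "c.com/"),
    ("b.com/", "c.com/"),
    ("d.com/", "c.com/"),
]
graph = build_link_graph(links)
counts = inbound_counts(graph)
ranks = pagerank(graph)
```

Here c.com/ has the most inbound links and ends up with the highest rank, which is the kind of result a precomputed view would then store for fast lookup.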
For the crawling, we operate 10-20 boxes that crawl all the time.
For processing, we spin up between 40 and 60 instances to create the link graph, metrics and views.
And the API serves the index from S3 (Amazon's cloud storage) with 150-200 instances (but this was only 10 a year ago, so we are seeing a lot of growth). All of this is Linux and C++ (with some Python thrown in here and there).
For custom crawl:
We use similar crawling algorithms to Linkscape, only we keep the crawls per site, and also compute issues (like which pages are duplicates of one another). Then each of those crawls is processed and precomputed to be served quickly and easily within the web app (so calculating the aggregates and deltas you see in the overview sections).
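For readers who want a feel for the two ideas in that paragraph, here is a small Python sketch (illustrative only, not the custom crawl code, which the post says is C++). It assumes duplicates are detected by hashing lightly normalised page content, which only catches exact duplicates, and that a "delta" is the change in an issue count between two crawls. The page contents and issue names are hypothetical.

```python
import hashlib
from collections import defaultdict

def content_fingerprint(html):
    """Hash whitespace-normalised, lower-cased content (exact duplicates only)."""
    normalised = " ".join(html.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def find_duplicates(pages):
    """pages: {url: content}. Returns sorted URL lists that share a fingerprint."""
    groups = defaultdict(list)
    for url, content in pages.items():
        groups[content_fingerprint(content)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]

def issue_deltas(previous, current):
    """Change in each issue count between two crawls: current minus previous."""
    issues = set(previous) | set(current)
    return {name: current.get(name, 0) - previous.get(name, 0)
            for name in sorted(issues)}

# Hypothetical crawl of one site.
pages = {
    "/a": "<p>Hello  World</p>",
    "/b": "<p>hello world</p>",
    "/c": "<p>Other</p>",
}
duplicates = find_duplicates(pages)

# Hypothetical issue counts from last week's and this week's crawl.
last_week = {"duplicate_content": 5, "missing_title": 2}
this_week = {"duplicate_content": 3, "broken_link": 1}
deltas = issue_deltas(last_week, this_week)
```

Computing these aggregates once per crawl and storing the result is what lets an overview page load without re-scanning the crawl data. Near-duplicate detection would need a different technique than a plain hash.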
We use S3 to archive all old crawls and Cassandra for some of the details you see in detailed views, while a lot of the overviews and aggregates are served from the web app DB.
Most of the code here is Ruby, except for the crawling and issue processing, which are C++. All of it runs on Linux.
Hope that helps explain! Definitely let me know if you have more questions though!
Kate
-
It is nowhere near that many. I attached an image of when I saw Rand moving the server to the new building. I think this may be the reason why there have been so many issues with the Linkscape crawl recently.
-
@keri and @Ryan
Will ask them. My guess is around a thousand server instances.
-
Good answer from Ryan, and I caution that even then you may not get a direct answer. It might be similar to asking Google just how many servers they have. SEOmoz is fairly open with information, but that may be a bit beyond the scope of what they are willing to answer.
-
A question of this nature would probably be best as your one private question per month. That way you will be sure to receive a direct reply from an SEOmoz staff member. You could also try the help desk, but it may be a stretch.
All I can say is it takes tremendous amounts of resources. Google does it very well, but we all know they have over 30 billion in revenue generated annually.
There are numerous crawl programs available, but the problem is the server hardware to run them.
I am only responding because I think your question may otherwise go unanswered and I wanted to point you in a direction where you can receive some info.