What kind of data storage and processing is needed
-
Hi,
So after reading a few posts here I have realised it is a big deal to crawl the web and index all the links.
For that I appreciate seomoz.org's efforts.
I was wondering what kind of infrastructure they might need to get this done?
cheers,
Vishal
-
Thank you so much Kate for the explanation. It is quite helpful to better understand the process.
-
Hi vishalkhialani!
I thought I would answer your question with some detail that might satisfy your curiosity (although I know more detailed blog posts are in the works).
For Linkscape:
At the heart of our architecture is our own column-oriented data store - much like Vertica, although far more specialized for our use case, particularly in its optimizations for compression and speed.
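For readers unfamiliar with column-oriented storage, here is a minimal Python sketch of the general idea. It is not SEOmoz's implementation (the real store is custom C++ and its internals are not described here); the function names, the sample rows, and the choice of run-length encoding as the compression scheme are all assumptions made for illustration. It shows why storing each field contiguously tends to compress well: repeated values sit next to each other.

```python
from itertools import groupby


def to_columns(rows, fields):
    """Pivot row dicts into one list per field (a column-oriented layout)."""
    return {field: [row[field] for row in rows] for field in fields}


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# Hypothetical sample data: one row per crawled page.
rows = [
    {"domain": "example.com", "status": 200},
    {"domain": "example.com", "status": 200},
    {"domain": "example.com", "status": 404},
    {"domain": "example.org", "status": 200},
]

columns = to_columns(rows, ["domain", "status"])
encoded = {name: rle_encode(col) for name, col in columns.items()}
```

Here the `domain` column collapses from four entries to two runs, and decoding any column returns the original values. A production store would likely combine several encodings (dictionary, delta, bit-packing) rather than rely on run-length encoding alone.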
Each month we crawl between 1 and 2 petabytes of data, strip out the parts we care about (links, page attributes, etc.), and compute a link graph of how all those sites link to one another (typically between 40 and 90 billion URLs). We then calculate our metrics using those results. Once we have all of that, we precompute lots of views of the data, which is what gets displayed in Open Site Explorer or retrieved via the Linkscape API. These resulting views of the data are over 12 terabytes (and this is all compressed raw text data, so it is a LOT of information). Making this fast and scalable is certainly a challenge.
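As a rough illustration of the "extract links, build a link graph, compute metrics" pipeline described above, here is a small Python sketch. The metric shown (a count of distinct linking hostnames) is a stand-in chosen for simplicity, not one of the actual Linkscape metrics, and the function names and sample URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_link_graph(links):
    """Map each target URL to the set of source URLs that link to it."""
    inbound = defaultdict(set)
    for source, target in links:
        if source != target:  # ignore pages that link to themselves
            inbound[target].add(source)
    return inbound


def linking_domains(inbound, url):
    """Count the distinct hostnames that link to the given URL."""
    return len({urlsplit(source).hostname for source in inbound.get(url, ())})


# Hypothetical (source, target) pairs as they might come out of a crawl.
links = [
    ("http://a.example/1", "http://c.example/"),
    ("http://a.example/2", "http://c.example/"),
    ("http://b.example/", "http://c.example/"),
    ("http://c.example/", "http://a.example/1"),
]

graph = build_link_graph(links)
```

In this sample, `http://c.example/` has three inbound links from two distinct hostnames. At tens of billions of URLs the graph would not fit in a single process's memory like this, which is presumably why the real system relies on a specialized store and many processing instances.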
For the crawling, we operate 10-20 boxes that crawl all the time.
For processing, we spin up between 40 and 60 instances to create the link graph, metrics and views.
And the API serves the index from S3 (Amazon's cloud storage) with 150-200 instances (this was only 10 a year ago, so we are seeing a lot of growth).
All of this is Linux and C++ (with some Python thrown in here and there).
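To get a feel for what those crawl figures imply, here is a back-of-the-envelope calculation in Python. It rests on several assumptions that are not in the post: a 30-day month, decimal petabytes (10^15 bytes), the 1-2 petabyte figure referring to data actually transferred, and load spread evenly across the crawl boxes around the clock.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month
PETABYTE = 10 ** 15                    # assumption: decimal petabytes


def per_box_rate_mb_s(petabytes, boxes):
    """Average sustained rate per crawl box in MB/s, with load spread evenly."""
    return petabytes * PETABYTE / SECONDS_PER_MONTH / boxes / 10 ** 6


low = per_box_rate_mb_s(1, 20)   # least data, most boxes
high = per_box_rate_mb_s(2, 10)  # most data, fewest boxes
print(f"{low:.1f} to {high:.1f} MB/s per box")  # prints "19.3 to 77.2 MB/s per box"
```

Under those assumptions each crawl box would need to sustain somewhere between roughly 19 and 77 MB/s continuously, which gives a sense of why crawling at this scale is bandwidth-bound as much as it is compute-bound.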
For custom crawl:
We use similar crawling algorithms to Linkscape, only we keep the crawls per site, and also compute issues (like which pages are duplicates of one another). Then each of those crawls is processed and precomputed to be served quickly and easily within the web app (for example, calculating the aggregates and deltas you see in the overview sections).
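One simple way to flag pages as duplicates of one another is to hash each page's normalized content and group URLs that share a hash. The Python sketch below shows that approach along with a trivial crawl-over-crawl delta. It is an assumption-laden illustration, not the custom crawl's actual logic (which is C++ and may well use fuzzier near-duplicate detection); the function names and sample pages are hypothetical.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(html):
    """Hash page content after lowercasing and collapsing whitespace."""
    normalized = " ".join(html.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Return groups of URLs whose normalized content is identical."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[content_fingerprint(html)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


def issue_delta(previous_count, current_count):
    """Change in an issue count between two crawls of the same site."""
    return current_count - previous_count


# Hypothetical crawl of one site: URL path -> fetched HTML.
pages = {
    "/shoes": "<h1>Shoes</h1>  <p>Our range</p>",
    "/shoes?ref=nav": "<h1>Shoes</h1> <p>Our range</p>",
    "/about": "<h1>About us</h1>",
}

duplicates = find_duplicates(pages)
```

Here `/shoes` and `/shoes?ref=nav` end up in the same group because they differ only in whitespace. Exact hashing misses pages that differ by a few words, so a real issue detector would likely need a similarity measure on top of this.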
We use S3 to archive all old crawls and Cassandra for some of the details you see in detailed views. A lot of the overviews and aggregates are served from the web app db.
Most of the code here is Ruby, except for the crawling and issue processing, which are C++. All of it runs on Linux.
Hope that helps explain! Definitely let me know if you have more questions though!
Kate
-
It is nowhere near that many. I attached an image of when I saw Rand moving the server to the new building. I think this may be the reason why there have been so many issues with the Linkscape crawl recently.
-
@keri and @Ryan
Will ask them. My guess is around a thousand server instances.
-
Good answer from Ryan, and I caution that even then you may not get a direct answer. It might be similar to asking Google just how many servers they have. SEOmoz is fairly open with information, but that may be a bit beyond the scope of what they are willing to answer.
-
A question of this nature would probably be best as your one private question per month. That way you will be sure to receive a direct reply from a SEOmoz staff member. You could also try the help desk but it may be a stretch.
All I can say is it takes tremendous amounts of resources. Google does it very well, but we all know they generate over 30 billion in revenue annually.
There are numerous crawl programs available, but the problem is the server hardware to run them.
I am only responding because I think your question may otherwise go unanswered and I wanted to point you in a direction where you can receive some info.