What kind of data storage and processing is needed
-
Hi,
So after reading a few posts here I have realised it is a big deal to crawl the web and index all the links.
For that I appreciate seomoz.org's efforts.
I was wondering what kind of infrastructure they might need to get this done?
cheers,
Vishal
-
Thank you so much Kate for the explanation. It is quite helpful to better understand the process.
-
Hi vishalkhialani!
I thought I would answer your question with some detail that might satisfy your curiosity (although I know more detailed blog posts are in the works).
For Linkscape:
At the heart of our architecture is our own column-oriented data store - much like Vertica, although far more specialized for our use case - particularly in terms of the optimizations around compression and speed.
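For readers unfamiliar with the idea, here is a minimal sketch of why a column-oriented layout compresses well. It is not how the Linkscape store actually works (that is not described here); the field names, the sample rows, and the choice of run-length encoding are all assumptions made purely for illustration.

```python
from itertools import groupby

def to_columns(rows, fields):
    """Pivot row-oriented records into one list per field (column layout)."""
    return {field: [row[field] for row in rows] for field in fields}

def rle_encode(values):
    """Run-length encode a column: consecutive repeats become (value, count)."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

def rle_decode(runs):
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]

# Hypothetical link records, sorted by target domain so repeats sit together.
rows = [
    {"target_domain": "a.example", "status": 200},
    {"target_domain": "a.example", "status": 200},
    {"target_domain": "a.example", "status": 404},
    {"target_domain": "b.example", "status": 200},
]
columns = to_columns(rows, ["target_domain", "status"])
encoded_domains = rle_encode(columns["target_domain"])
```

Because each column holds values of one kind, long runs of repeats are common, which is what makes simple schemes like this effective on sorted data.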
Each month we crawl between 1 and 2 petabytes of data, strip out the parts we care about (links, page attributes, etc.) and then compute a link graph of how all those sites link to one another (typically between 40 and 90 billion URLs) and then calculate our metrics using those results. Once we have all of that we then precompute lots of views of the data, which is what gets displayed in Open Site Explorer or retrieved via the Linkscape API. These resulting views of the data are over 12 terabytes (and this is all compressed raw text data - so it is a LOT of information). Making this fast and scalable is certainly a challenge.
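As a toy illustration of the steps described above (extract links, build a link graph, precompute a view), here is a small sketch. It assumes the crawl output is already reduced to a mapping from each page to the URLs it links to, and it uses a plain inbound-link count as the "metric"; the real metrics and view formats are not described in this thread, so the function names and data shapes are hypothetical.

```python
from collections import defaultdict

def build_link_graph(pages):
    """pages: dict mapping a source URL to the list of URLs it links to."""
    outlinks = defaultdict(set)
    inlinks = defaultdict(set)
    for source, targets in pages.items():
        outlinks[source]  # touching the key registers every crawled page as a node
        for target in targets:
            if target == source:
                continue  # ignore self-links
            outlinks[source].add(target)
            inlinks[target].add(source)
            outlinks[target]  # link targets become nodes even if never crawled
    return outlinks, inlinks

def precompute_inlink_view(inlinks):
    """Precompute a per-URL view: inbound link count plus sorted linking pages."""
    return {
        url: {"count": len(sources), "sources": sorted(sources)}
        for url, sources in inlinks.items()
    }

# Hypothetical three-page crawl.
crawl = {
    "a.example/": ["b.example/", "c.example/"],
    "b.example/": ["c.example/"],
    "c.example/": ["a.example/", "c.example/"],
}
outlinks, inlinks = build_link_graph(crawl)
view = precompute_inlink_view(inlinks)
```

Precomputing the view once means a lookup at serving time is a single dictionary read rather than a graph traversal, which is the general idea behind serving from precomputed data.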
For the crawling, we operate 10-20 boxes that crawl all the time.
For processing, we spin up between 40 and 60 instances to create the link graph, metrics and views.
And the API serves the index from S3 (Amazon's cloud storage) with 150-200 instances (but this was only 10 a year ago, so we are seeing a lot of growth).
All of this is Linux and C++ (with some Python thrown in here and there).
For custom crawl:
We use similar crawling algorithms to Linkscape, only we keep the crawls per site, and also compute issues (like which pages are duplicates of one another). Then each of those crawls is processed and precomputed to be served quickly and easily within the web app (so calculating the aggregates and deltas you see in the overview sections).
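To make the "duplicates" and "deltas" ideas concrete, here is a rough sketch. It assumes duplicates means pages whose text is identical after whitespace and case normalization, and that a delta is the change in issue counts between two crawls of the same site; the actual issue definitions used by the custom crawl are not given here, so treat the function names and inputs as hypothetical.

```python
import hashlib
from collections import defaultdict

def find_duplicates(pages):
    """pages: dict of URL -> body text. Returns groups of URLs with identical bodies."""
    groups = defaultdict(list)
    for url, body in pages.items():
        # Collapse whitespace and case so trivial differences do not hide duplicates.
        normalized = " ".join(body.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]

def issue_delta(previous_counts, current_counts):
    """Change in each issue count between two crawls (current minus previous)."""
    issues = sorted(set(previous_counts) | set(current_counts))
    return {
        issue: current_counts.get(issue, 0) - previous_counts.get(issue, 0)
        for issue in issues
    }

# Hypothetical crawl of a three-page site.
site = {"/a": "Hello   World", "/b": "hello world", "/c": "Something else"}
duplicate_groups = find_duplicates(site)
delta = issue_delta({"duplicate": 2}, {"duplicate": 1, "missing_title": 3})
```

Hashing only catches exact matches after normalization; near-duplicate detection would need a different technique, which this sketch does not attempt.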
We use S3 for archival of all old crawls and Cassandra for some of the details you see in detailed views, and a lot of the overviews and aggregates are served with the web app db.
Most of the code here is Ruby, except for the crawling and issue processing, which is C++. All of it runs on Linux.
Hope that helps explain! Definitely let me know if you have more questions though!
Kate
-
It is nowhere near that many. I attached an image of when I saw Rand moving the server to the new building. I think this may be the reason why there have been so many issues with the Linkscape crawl recently.
-
@keri and @Ryan
Will ask them. My guess is around a thousand server instances.
-
Good answer from Ryan, and I caution that even then you may not get a direct answer. It might be similar to asking Google just how many servers they have. SEOmoz is fairly open with information, but that may be a bit beyond the scope of what they are willing to answer.
-
A question of this nature would probably be best as your one private question per month. That way you will be sure to receive a direct reply from a SEOmoz staff member. You could also try the help desk but it may be a stretch.
All I can say is it takes tremendous amounts of resources. Google does it very well, but we all know they have over 30 billion in revenue generated annually.
There are numerous crawl programs available, but the problem is the server hardware to run them.
I am only responding because I think your question may otherwise go unanswered and I wanted to point you in a direction where you can receive some info.