Ways to analyze a 1M rows dataset of search queries
-
Hi,
I have this large dataset, about 1 million search queries with visits, bounce rate and a few other metrics. I'm trying to explore this data to find keyword "buckets" (such as include product name, location name, transactional objective, informational, etc.), as well as explore the density of certain keywords (keywords as in instances of a single word amongst all queries)
My idea was to use Excel and a macro to split all queries in separate words (also clearing punctuation and uppercase/lowercase), then storing this word in a new worksheet, adding to another column the visit counts from the row where the word was extracted (as to give a sense of weight). Before adding the word to the new worksheet, the script will look if the word already existed, if so it would just add the current value of visits to the existing visit counts etc.
In the end it will create sort of a "dictionary" of all the keywords in all search queries ranked by weight (= visits from search query including this keyword)
This would help me get started I believe, because I can't segment and analyze 1M raw search queries...
My issue is: this VBA has been running on my (fast) PC for the last 24hr and it doesn't seem to get to an end. Obviously excel+VBA is not the best way to do text mining and manipulation in such a large dataset (although it's just a 30mb file)
What would you do if you had this dataset and would like to mine the text/semantic as I am doing? Any idea of tools? process?
I'm considering dumping this data into a MySQL db and doing the processing through PHP (the only backend language I'm versed in), and getting the "summified" data stored into another table, which I'll then be able to export to a Excel for analysis. But I'm afraid that I'll be facing memory limit issues and such...
In the meantime, I'm definitely interested into knowing what you guys would do if you had this data and wanted to simply start exploring its constituencies
Thanks!
-
Yeah, Access can process any number of rows. It's Microsoft's database program. You can upload data, and then create queries. They have a design view where you can construct queries in a WYSIWYG fashion, or if you want, you can write your own SQL.
-
Thanks a lot John!
I'm going to try this out tonight!
So, I assume, Access won't have the same processing limitations with 1 million rows, will it?
Once I'll be done with the "discovery phase" I'm going through with this keyword list, I'll definitely use Advanced filters (in Excel) as you recommend to understand keyword groups in details
-
I had a similar problem going through my search query reports. If you're already familiar with VB you could do this with a Microsoft Access database rather than setting up a MySQL one w/PHP. I've been working on creating an Access database that I can import my data into, and have it spit out all sorts of useful info (for example negative keywords and placements), but it's only in its early stages right now.
If you just want to see it for a few terms and don't mind doing it one at a time, in the past I've filtered data like this in Excel without VB using advanced filters. I found that using advanced filters rather than VB sped up the process quite a bit; I'd imagine because it's an innate Excel function. Using 4 filters you can match whole words in the queries. For example, to find queries with "blah", you'd set a filter for "blah", "* blah", "blah " and " blah *". Then you can use the Subtotal command to do calculations over the visible rows and calculate the data.
More about advanced filters: http://office.microsoft.com/en-us/excel-help/filter-by-using-advanced-criteria-HP005200178.aspx
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Can't see backlinks in Search Console
Hello, We have 7 sites and have noticed that for one site, we don't see any backlink info in Search Console even if it's been linked for over 8 months. Other tools show thousands of backlinks, but Search Console is still pending. I also see very little info in regards to Performance even if we are getting thousands of hits a day. Could this be a sign of a bigger problem? In summary, the site is up and running, getting hits, getting backlinks, but Search Console still looks like it's a new site with no activity several months after being installed.
Reporting & Analytics | | CJolicoeur0 -
Are these Search Console crawl errors a major concern to new client site?
We recently (4/1) went live with a new site for a client of ours. The client site was originally Point2 before they made the switch to a template site with Real Estate Webmasters. Now when I look into the Search Console I am getting the following Crawl Errors: 111 Server Errors (photos) 104 Soft 404s (blogs, archives, tags) 6,229 Not Found (listings) I have a few questions. The server errors I know not a lot about so I generally ignore. My main concerns are the 404s and not found. The 404s are mostly tags and blog archives which I wonder if I should leave alone or do 301s for each to /blog. For not found, these are all the previous listings from the IDX. My assumption is these will naturally fall away after some time, as the new ones have already indexed. But I wonder what I should be doing here and which will be affecting me. When we launched the new site there was a large spike in clicks ( 250% increase) which has now tapered off to an average of ~85 clicks versus ~160 at time of launch. Not sure if the Crawl Errors have any effect, I'm guessing not so much right now. I'd appreciate your insights Mozzers!
Reporting & Analytics | | localwork0 -
Search traffic hit after switching magazine to subdomain
Hi there, I'm fairly new at all of this and would appreciate any help with understanding why our website has taken a hit in traffic. We curate an online magazine, which was previously accessible through: forensicoutreach.com. It was receiving about 2,000+ unique visitors per day up until a week ago, when we changed a few things. However, the magazine doesn't reflect what our business does, so we created a product-focused web presence on forensicoutreach.com, and moved the magazine (which everyone loved) to library.forensicoutreach.com (DA 37, PA 1). We thought separating the properties was a good idea, but now I'm not so sure. Our traffic on library.forensicoutreach.com is 1,500 (so 500 less than usual!) and our main property has about 56 unique visits a day. It's pretty substantial. A few questions: 1. If we move the library to forensicoutreach.com/library, will that make any difference? 2. Where did we go wrong here and how can we fix it? Any help would be appreciated. Thanks.
Reporting & Analytics | | shivaniseos1 -
Delays in Search Console Data
We are seeing big delays in search console data this month. In the past we have seen delays of a few days, but I have never noticed it being this long before. At the moment we have no data since 23rd February. Have I just never noticed this before or is this particularly long? Is anyone else seeing the same thing?
Reporting & Analytics | | Wagada0 -
Improving Search Click through Rate
We are having a problem on our website with click through rates. We are getting between 100-150k impressions through search but we are only getting between 500-1000 clicks to the site. What strategies have you used in the past to help improve your click through rates? Thanks!
Reporting & Analytics | | pdangermond2 -
Google Search Results inconsistent from different computers
Recently after some optimization activities - I do not see much movement in search rankings - my client is seeing the results on page 1 position 3 and I see page 2 for the same keyword. How does Google change ranking based on past searches and how can I get an accurate picture of what the actual rank is?
Reporting & Analytics | | devonkrusich0 -
Tracing Google Analytics 'goal' back to original search phrase
I added Goals to my Google Analytics tracking. It's working; I get visitors who have completed Goals showing up in the reporting. My question is: Is it possible to trace backwards from a completed Goal to the original search phrase a user entered in Google to come to my site (for those who entered from Google.com via organic search result)? I'm trying to answer the question of which search phrases are resulting in completed Goals (as opposed to bouncing off the site or just any behaviour other than completing a Goal). It seems like this should be one of Analytics' default reports -- help identify which search phrases are converting well. It's probably there and I'm just not seeing it... Thanks.
Reporting & Analytics | | scanlin0 -
Search within search? Weird google URLs
Good morning afternoon, how are you guys doing today? I'm experiencing a few Panda issues I'm trying to fix, and I was hoping I could get some help here about one of my problems. I used Google analytics to extract pages people land on after a Google search. I'm trying to identify thin pages that potentially harm my website as a whole. It turns out I have a bunch of pages in the likes of the following: /search?cd=15&hl=en&ct=clnk&gl=uk&source=www.google .co.uk, and so on for a bunch of countries (.fi, .com, .sg, .pk, and so on, maybe 50 of them) My question is: what are those pages? their stats are awful, usually 1 visitor, 100% bounce rate, and 0 links. Do you think they can explain my dramatic drop in traffic following Panda? If so, what should I do with them? NOINDEX? Deletion? What would you suggest? I also have a lot of links in the likes of the following: /google-search?cx=partner-pub-6553421918056260:armz8yts3ql&cof=FORID:10&ie=ISO-8859-1&sa=Search&siteurl=www.mysite.com/content/article They lead to custom search pages. What should I do with them? Almost two weeks ago, Dr. Pete posted an article untitled Fat Panda and Thin Content in which he deals with "search within search" and how they might be targeted by Panda. Do you think this is the issue I'm facing? Any suggestion/help would be much appreciated! Thanks a lot and have a great day 🙂
Reporting & Analytics | | Ericc220