November 11, 2010

A Philosophical Look at LDA

This YouMoz entry was submitted by one of our community members. The author’s views are entirely their own (excluding an unlikely case of hypnosis) and may not reflect the views of Moz.

I’m going to make a comment here that may shock a lot of people. I hate to say it, but it may shock a lot of people who are very good at using statistics.

Statistical models are not answers – they are models.

Models are built as replicas of a system. They may look similar, but they are not perfect and the resemblance can always be improved.

Now that I have got that shock out of the way, here is my motivation. Rand recently posted on a new correlation value for Latent Dirichlet Allocation (LDA) to Google ranking. The first value was 0.32, then recognition of a mistake in the calculations led to a figure of 0.17. Now, it seems, the number should be around 0.06 - 0.1. Meanwhile, other people working with other datasets have found different results. As SEOs we are after a clear number - we want the answer "LDA correlates to Google's rankings with a score of 0.XX". Unfortunately, we are unlikely to ever get that answer.
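To make the idea of a "correlation to ranking" concrete, here is a minimal sketch. It assumes a Spearman-style rank correlation between a page's position in the results and some per-page relevance score; the exact calculation behind the figures quoted above is not spelled out in this post, and the positions and scores below are invented purely for illustration.

```python
def ranks(values):
    """Return 1-based ranks for values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: positions 1-5 for two made-up queries, each with an
# invented relevance score per page. None of these numbers are real.
positions = [1, 2, 3, 4, 5]
scores_query_a = [0.9, 0.4, 0.7, 0.2, 0.5]
scores_query_b = [0.3, 0.8, 0.6, 0.7, 0.1]

# A negative value means higher scores go with better (lower-numbered) positions.
print(spearman(positions, scores_query_a))
print(spearman(positions, scores_query_b))
```

Run it on the two made-up queries and you get two different numbers from the same method, which is the whole problem in miniature: the figure depends on which results you sample, not only on how you calculate it.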

A Binary View on Probability

Rand and Danny Sullivan’s debate in the comments on Rand’s original post about LDA was caused by the fact that people, especially those for whom mathematics is not their adopted language, have pretty binary views on the subject: they either trust statisticians implicitly or are highly cynical of the results they produce. Both responses are due to a lack of understanding and a fundamental position on the relevance of things that we do not understand. When we are uncertain about something, we like to probe it to find out whether it is good or bad. Think of all the questions you asked last time you bought an oven, or a car. You probed each one to find out whether it was suitable for the task it claimed to perform and efficient in performing that task. The problem is that with physical objects like these we can see examples, road tests and samples; with something abstract like statistical methodology we have to rely on those who do understand it – and they are the people we didn’t trust in the first place.

Buying a car - clearer results than modeling with statistics

It's easy enough to find out whether a car does what it's supposed to, but statistics can be a little more difficult. Image credit, Cartoon Stock

String and Sealing Wax

What a lot of people forget is that these models are not perfect and they weren’t built to be perfect. In the same way that the models you might have built of a Spitfire or Mustang when you were young could not fly, they are just representations of the system and the best you could make at the time. You might then have improved them – add an engine and some RF wizardry and you can make them fly. But you still can’t fit in them – they’re just models. To be perfect models of a plane, they would have to be a plane.

Another good example is gravity. Isaac Newton published his law of gravitation, which said that all objects with mass attract each other in a certain way. This model worked then and it works now – we send satellites into orbit and predict the motions of galaxies with it. But it had problems: it could not explain the observed changes in Mercury’s orbit and it did not say what gravity actually was, only what it did. Albert Einstein refined the model with his famous General Theory of Relativity. This told us what gravity is and its relationship with mass – gravity tells space-time to bend, mass says by how much – and it solved the problem of Mercury’s orbit. It even predicted that gravity causes light to bend. It was a much better model, one that we can use to look out far beyond the solar system. But it’s still not perfect – the space probe Pioneer 10 is nowhere near where it should be and General Relativity alone gets the age of the universe completely wrong. It’s a flawed model. My opinion (I wrote my master’s dissertation on this) is that this is because Einstein’s model of space-time is wrong. He said it’s flat, or slightly curved, while the model I think improves on it says that it is actually far from flat. This model improves on relativity in many ways and even provides links with another great model, quantum mechanics. It even fulfils Einstein’s dream of describing everything in the universe through geometry. But it’s still not perfect or, I hasten to add, properly verified.

The point here is that the real world is damned complex. Physicists and statisticians, like people who make model aeroplanes, represent it to the best of their abilities and those abilities are constantly evolving. The LDA model is the same: Rand and Ben started with the assumption that all keywords have the same ranking factors affecting their SERPs in the same way. This model is simplified and it will not be a complete model of reality, just like Newton's model of gravity, but we can still use it in our everyday lives. If we want to be more precise, we must take into account how competitive a term is - refining our model. This is evolution in action.

From Ape to Homo Sapiens

In his comments on Rand's original post on LDA, Danny Sullivan mentions people in the late ‘90s using “theme pyramids” and their association with LDA. He’s completely right in how he’s introduced them, but not right to dismiss LDA because of the association of these two techniques. Theme pyramids are the equivalent of Aristotle or Newton – a sensible starting place – while LDA is crinkled space-time – a very good idea but something we’re pretty sceptical of because it’s unproven. We can even chart how the model of content relevance has evolved over the years:

Theme pyramids
Term Frequency–Inverse Document Frequency (TF-IDF)
Latent Semantic Indexing (LSI)
Probabilistic LSI (pLSI)
Latent Dirichlet Allocation (LDA)
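To give a flavour of the early rungs of this ladder, here is a minimal sketch of the frequency against inverse-document-frequency idea from the second entry in the list. It uses one common textbook weighting (term share within the document multiplied by the log of the inverse document frequency); it is an assumption for illustration, not a claim about what any search engine actually computes, and the three tiny "documents" are invented.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term in each document.

    docs is a list of non-empty token lists. Returns one {term: weight}
    dict per document, using tf = count / document length and
    idf = log(number of documents / documents containing the term).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented toy corpus for illustration only.
docs = [
    ["seo", "links", "seo"],
    ["seo", "content"],
    ["seo", "topics"],
]

for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A word that appears in every document ("seo" here) gets a weight of zero, however often it is repeated, while rarer words stand out. The later models in the list try to go beyond this kind of simple word counting.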

Each model improves on the last, but is not perfect. Even Rand, replying to the comment from Danny that started their debate on the original LDA post, says "We think it's interesting, given the relatively high correlation (compared to link metrics) to try it out, but we haven't suggested conclusive results." In his latest post, Rand says, "more polished results may still be several months away" and Ben in his update was scathing of his own work, saying "I think 0.17 really might not be the last word.... Just treat all of this as suspect until we know more."

Aristotle's periodic table - wrong, but not completely

Aristotle's periodic table. Just because it was wrong doesn't mean all periodic tables are wrong. As a starting place it served us well. Image credit, University of Virginia

A Conclusion - Perhaps

The point here is that in their recent debate, both Rand and Danny are right and both are wrong. The SEOmoz team seem, to my mind at least, to have made a pretty convincing case for LDA. The methodology used in the original SEOmoz study was the same as in the Bing vs Google study and seems sound as a starting point, but:

Only in the case of the ranks between 1 and 10
Only if Wikipedia is taken as the de facto corpus of all English words and their proper, contextual usage
Only if we look at keywords, not key phrases
Only if all phrases are equally competitive.

Beyond these restrictions, who knows?

The problem is that more data is needed and it needs to be analysed in a less naive way. Ideally, we should look at sites ranking 1 to 100. Even Rand said in the Google vs Bing post that “Ben [Hendrickson] & I both feel that... we should gather the first 3-5 pages of results, not just the 1st page.” The problem with this is that while there will be 10 times the number of results to review, the work will be 100 times greater. It would also make more sense to train the algo on a thesaurus and Wikipedia combined, and possibly even the whole Oxford University Press range of specialist dictionaries. But again all this would take a huge amount of resources, as would expanding the model to include phrases.

We should also define exactly what we mean, numerically, by a "competitive" keyword and see how this affects our end results. We need to figure out stress-tests for the model, how local and universal search are incorporated, and many other factors. Until we do this, we will only have a simplistic model - but it could require the SEO equivalent of Einstein to work out.

So, the great debate: is LDA a pile of codswallop or a pile of gold? Neither – it’s something in between, as is any statistical model. How much gold and how much fish is in the pile we can only learn from collecting more data and putting in more resource. It is also not the be-all and end-all, it is simply the best that we have. There will be a better model devised, with fewer assumptions and restrictions, and until then LDA is probably the most accurate. But then this is true of any new scientific or mathematical model.

My own view, which I set out in a comment on Rand’s first post, is that the next stage of evolution needs to be introducing the Zipf-Mandelbrot law, which models language usage much more accurately than simple Bayesian statistics and allows phrases as well as words to be analysed. We also need to take a website, or category of a website, and apply LDA to a page as a document within a corpus. Unfortunately, both of these will require a leap beyond my mathematical abilities and a lot of processing power.
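For anyone who has not met it, the Zipf-Mandelbrot law says that the frequency of the k-th most common word is proportional to 1 / (k + q)^s, where q and s are constants fitted to the language sample. Here is a minimal sketch of that formula; the parameter values are placeholders I have picked for illustration, not values fitted to any real corpus, and this does not show how the law would be combined with LDA.

```python
def zipf_mandelbrot(n_ranks, q, s):
    """Expected relative frequency for ranks 1..n_ranks.

    Each rank k gets weight 1 / (k + q) ** s, and the weights are
    normalised so that they sum to 1.
    """
    weights = [1.0 / (k + q) ** s for k in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Placeholder parameters, chosen only to show the shape of the curve.
frequencies = zipf_mandelbrot(n_ranks=10, q=2.7, s=1.0)
for rank, freq in enumerate(frequencies, start=1):
    print(rank, round(freq, 4))
```

With q set to 0 this reduces to the plain Zipf law; the extra constant mainly flattens the curve for the most common words.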
