December 31, 2007

Semantic Search Revisited and Reintroduced


This YouMoz entry was submitted by one of our community members. The author’s views are entirely their own (excluding an unlikely case of hypnosis) and may not reflect the views of Moz.

This is my first blog entry, so be gentle with me. It’s rather long in the hope it passes for confidence and depth in my subject, whereas I really just got too excited and couldn’t stop typing!

The topic that has been on my mind recently is the fascinating potential of search engines to step up to the next level of intelligence. Now, I'm not talking about personalisation such as user-defined content; there's nothing particularly impressive to the SE community about serving content you know the user will find more useful. What I'm keen to delve into goes right back to the first steps of the search engine dance: the indexation.

SEMANTIC SEARCH: PART I

What is Semantic Search?

Semantic search is defined by the ability of the search engine to cognitively recognise and index content based on the actual sentence structure and meaning (or, in other words, the ability of the search engine to actually READ your content).

But doesn't it read my content already?

No, it doesn't. Beyond the most basic rules, the majority of search engines have no clue as to what your words actually mean. It may as well be indexing Klingon and mindlessly serving it back to users via the final SERPs.

We do know that conjunctions and articles such as AND, OR, A, and THE are recognised and usually disregarded unless used as operators (the search engines most likely run a simple match against a list of omitted words before the query is processed). It's even a possibility that some initial measures are included in the algorithms to recognise what is a verb, a noun, an adjective, an adverb, etc. But that's where it currently ends…you may need to cast your mind back (way back, for some of us) to your English language lessons in school.
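To make the "list of omitted words" idea concrete, here is a minimal sketch in Python. The stop-word list and the rule that an upper-case AND or OR counts as an operator are my own assumptions for illustration; no search engine publishes its actual list or rules.

```python
# Hypothetical stop-word list; real engines' lists are not public.
STOP_WORDS = {"a", "an", "the", "and", "or"}

# Assumption for this sketch: upper-case AND / OR are meant as operators.
OPERATORS = {"AND", "OR"}


def strip_stop_words(query):
    """Drop stop words from a query, keeping explicit operators."""
    kept = []
    for token in query.split():
        if token in OPERATORS:
            kept.append(token)          # keep the operator untouched
        elif token.lower() not in STOP_WORDS:
            kept.append(token.lower())  # keep ordinary words, lower-cased
    return kept


print(strip_stop_words("the cat and the hat"))  # ['cat', 'hat']
print(strip_stop_words("cats OR dogs"))         # ['cats', 'OR', 'dogs']
```

Notice that nothing here knows what a cat or a hat is. The filter only compares strings against a list, which is rather the point.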

Semantic search currently follows two major lines of research:

1. Pre-Tagged Content

By designing a web page with tags throughout the content, it is possible to tell a search engine what the page is about. Imagine the XML version of using highlighter pens throughout a newspaper to pick out verbs, nouns, semantics, syntax, etc. This is a very basic overview and, whilst you would have a rather colourful tabloid by the end of the exercise, it would be theoretically possible for a machine to understand the actual meaning and assign ultra-relevance to the content.

There are currently around 8 different ways to mark up a page of content, ranging from simple XML (designed to define relationships between page elements) to Web Ontology Language (OWL). By combining these, it's possible to present a page of super-rich content and meaning to advanced spiders.

There is a fantastic benefit to this, in that page content can then be 'read' by all manner of computer programs and not just search engines. It would potentially open the doors for a seamless triangle of Human - Program - Internet interactivity and communication.
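As a rough illustration of a program 'reading' pre-tagged content, the sketch below parses one tagged sentence with Python's standard XML library. The tag and attribute names (sentence, noun, role, sense and so on) are invented for this example and do not come from any real markup standard such as OWL.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these tag and attribute names are made up.
TAGGED = """
<sentence>
  <noun role="agent">Nena</noun>
  <verb sense="possess">has</verb>
  <article>a</article>
  <adjective>giant</adjective>
  <adjective>red</adjective>
  <noun role="theme">balloon</noun>
</sentence>
"""


def read_tags(xml_text):
    """Return (tag, word, attributes) for each tagged word, in order."""
    root = ET.fromstring(xml_text.strip())
    return [(el.tag, el.text, dict(el.attrib)) for el in root]


for tag, word, attrs in read_tags(TAGGED):
    print(f"{word}: {tag} {attrs}")
```

The program learns that 'balloon' is a noun only because the page says so, which is exactly why the tagging has to be both complete and honest.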

However, I see four major downsides to this method:

Each and every element of content must be tagged or defined somewhere, essentially doubling a designer’s workload.
As each page essentially tells the search engine what it is trying to say, the quality and thoroughness of the tagging will vary from website to website with the skill of the author.
You still need a search engine capable of processing the semantic tagging and sorting that into something that satisfies Joe Bloggs's search query.
And of course, it's an open invitation for spam manipulation and its better behaved cousin, SEO (hey, I'd rather freeware apps couldn't do my job, thank you very much.)

2. Semantic-Based Search Engines

This is the line of research that has me quite excited--search engines that can actually read a page, understand each individual element (semantics), establish relationships and structure (syntax), and then derive the human meaning of the paragraphs (more semantics!).

Semantic search engines are designed to read and understand both the user’s search query and the content of web pages without the need for additional tagging, and to some extent without relying on the source code like traditional search spiders.

Hakia is one such search engine that is making good inroads in this area. If you can remember the original selling point of Ask(jeeves).com, where it encouraged you to use real, human questions such as “who is the prime minister of Micronesia,” you get the idea. The problem with Ask.com was that it still just matched old-school keywords. Semantic search engines such as Hakia claim to actually understand your query: they know what a ‘prime minister’ is, where and what ‘Micronesia’ is, and that you are asking WHO.

SEMANTIC SEARCH: PART II

What Does This Mean to SEOs and the Search Industry?

Now the current Hakia is still light years away from putting us out of work. It still can’t tell me what the word Hakia means and it still seems to match some keywords like its peers, but the Hakia Lab promises some tantalising glimpses of what it could eventually become. Interestingly, you can query their algorithm directly to see how it reads the words and chooses the best canonical meaning for the context. I asked it to show me how it understood ‘Nena has a giant, red balloon’…

Fuzzy Logic approximation of words with search importance weights -

nena [PR] = Agent, weight=100

has [SP] = Stop symbol, weight=00

a [Z1] = Immaterial, weight=00

giant [A1] = State - strong, weight=40

red [A1] = State - strong, weight=40

balloon [N1] = Instrument/Theme - strong, weight=60

You can see it has correctly understood GIANT to be an adjective when it could have been a possible noun, and BALLOON to be a noun when it could have been a possible verb. Interestingly, it has also given more weight to NENA, then BALLOON, and then an equal amount to the adjectives RED and GIANT, which makes sense. I would have liked it to understand the word HAS and therefore the concept of possession, but it’s not yet at that stage of development.
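For anyone who wants to play with those weights, here is a small sketch that parses the output and ranks the words by importance. The parsing pattern and the function names are my own; this only post-processes the text Hakia Lab returned and says nothing about how Hakia computes the weights internally.

```python
import re

# The Hakia Lab output for the example query, copied as text.
HAKIA_OUTPUT = """\
nena [PR] = Agent, weight=100
has [SP] = Stop symbol, weight=00
a [Z1] = Immaterial, weight=00
giant [A1] = State - strong, weight=40
red [A1] = State - strong, weight=40
balloon [N1] = Instrument/Theme - strong, weight=60
"""

# word [CODE] = role description, weight=NN
LINE = re.compile(r"^(\w+) \[(\w+)\] = (.+), weight=(\d+)$")


def parse(text):
    """Turn each output line into (word, code, role, weight)."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            word, code, role, weight = match.groups()
            rows.append((word, code, role, int(weight)))
    return rows


def ranked_terms(rows):
    """Drop zero-weight words and sort the rest, heaviest first."""
    kept = [row for row in rows if row[3] > 0]
    return [row[0] for row in sorted(kept, key=lambda r: r[3], reverse=True)]


print(ranked_terms(parse(HAKIA_OUTPUT)))
# ['nena', 'balloon', 'giant', 'red']
```

The zero-weight words (HAS and A) fall away entirely, so the possession relationship is lost before any ranking happens.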

Hakia have demonstrated that their engine can potentially interpret your query more closely to your intention than Google, Ask, Live, etc., but this is just the tip of the semantic search iceberg – it’s what they do with this technology and how it finally manifests itself in the SERPs that is of utmost importance.

Should We SEOs be Worried Yet?

Many of us in the search industry have learned that trying to fight the tide is a fruitless waste of energy – unfortunately, the reluctance to embrace change has set some digital marketers back years behind their peers. The belief that social media would never be more than geeky ICQ chatrooms or Yahoo profiles, and that nobody would ever use fiddly mobile phones to access the internet has taught the entire SEM industry some valuable lessons.

Should we be worried that there will come a time when manipulation of the source code is no longer a deciding factor in SEO? Probably not – if you’re doing your job right, you should be utilising a daunting variety of techniques including keyword research, architecture, URL simplification, linking strategies, 404s, canonicalisation, 301s, and, of course, the ultimate in optimisation: page content.

I’m confident there will always be a way to play the system: otherwise, I’m back to email and banner marketing (and the general look of disgust from my internet-savvy friends). Semantic search will require more copywriting skills from SEOs as we learn how the syntax (sentence structure and order) is processed by the search engines. It’s highly likely we will become walking thesauruses with our brains full of synonyms. SEO may come to rely not upon the understanding of how to fake or highlight relevance, but on the understanding of the word, the sentence, and the paragraph.

We might even become so socially un-integratable that we have to form our own colonies and live underground, although I sincerely hope not.

Further Reading:

SEOmoz

Rand wrote an article in April about Google’s current semantic abilities.

There’s an article called Fuzzy Set Theory & Semantic Connectivity, which expands upon the concepts of semantics such as weighting, fuzzy logic, and perceived proximity.

Mr. Michael Martinez, hardcore SEO, also wrote a blog entry in 2006, but I couldn’t quite get my head around some of the academic concepts.

Further Afield

Phill Midwinter wrote a very good article back in March titled Is Google a Semantic Search Engine?, and whilst I don’t agree with some of his points, it does open up the subject for debate and is well worth reading.

Bufferzone also wrote about semantic search engines at Webmasterworld.com back in 2004, so the concept and discussions are certainly not new.

What I’ve tried to achieve here is a grounded introduction and overview to a subject that can get very theoretical, opinionated, and woolly. Anyway, thanks for reading and I hope to see lots of interesting comments and opinions... and best wishes for the new year!
