Searching for Context

What is a search engine? Search engines are relatively dumb, aren't they? You search for a word and the search engine tells you which pages that word is on. The secret (for now) seems to be the ordering: which pages are most relevant for the word(s)?
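To make the "dumb" part concrete, here is a minimal sketch of keyword lookup with an inverted index. The page names and text are made up for illustration, and the function names are my own. Notice that it returns matching pages with no ordering at all, which is exactly the part that needs to be solved.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page names containing it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

def search(index, query):
    """Return the pages containing every word in the query (unordered)."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first set so intersecting does not modify the index.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical pages, made up for illustration.
pages = {
    "a.html": "jazz album reviews",
    "b.html": "rock album store",
    "c.html": "jazz festival dates",
}
index = build_index(pages)
jazz_albums = search(index, "jazz album")
```

The index knows which words are on which pages and nothing else. It has no idea that "album" on a store page means something different from "album" in a review.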

It's hard to do this if the server doesn't understand the context of the Internet's pages and of the user's search. What makes the problem harder is that users have been conditioned to use dumb search engines with dumb queries, removing important context from their searches.

Google innovated by counting links to a page (PageRank) to improve search results ordering. But does this really improve results? It sounds more like a popularity contest than a measure of relevance, though it was still better than what came before. The next step in search engines will involve context: understanding the web, not just indexing keywords and calculating PageRank.
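For the curious, the link-counting idea can be sketched in a few lines. This is a simplified textbook version of the published PageRank iteration, not Google's production system. The tiny link graph is invented, and the damping factor of 0.85 is the commonly cited default.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: three pages link to "c", and "c" links to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

Here "c" ends up on top because the most pages link to it, and "a" beats "b" only because the popular page links to it. Nothing in the calculation looks at what any page is actually about.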

How?

Google has already started to do this with its music search. When it parses pages from Amazon or other online stores, it sorts the page's data out and gives it context. It's easy to find an album title on Amazon because it's always in the same place, with the same HTML tags on it.

The Google Music Search can be a little more intelligent when parsing a web page from a music store because it knows the page's structure and can deduce context from that structure.
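Here is a rough sketch of what "deducing context from structure" could look like. I have no idea how Google actually does it; the HTML, the id values (albumTitle, albumArtist) and the class name below are all invented for illustration and are not any real store's markup.

```python
from html.parser import HTMLParser

class AlbumParser(HTMLParser):
    """Collect text from elements whose id appears in a known field map."""

    # Hypothetical mapping from element ids to the meaning of their text.
    FIELDS = {"albumTitle": "title", "albumArtist": "artist"}

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in self.FIELDS:
            self._current = self.FIELDS[element_id]

    def handle_data(self, data):
        if self._current is not None:
            self.record[self._current] = self.record.get(self._current, "") + data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field,
        # so text inside nested tags is not handled.
        self._current = None

# Made-up page in the assumed store layout.
page = (
    '<html><body><h1 id="albumTitle">Kind of Blue</h1>\n'
    '<span id="albumArtist">Miles Davis</span></body></html>'
)
parser = AlbumParser()
parser.feed(page)
record = parser.record
```

Because the parser knows where the title and artist live on the page, it ends up with labeled data instead of a bag of keywords. The catch is that it only works for sites whose layout it already knows.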

The strange thing about this kind of search engine crawling is that it seems to violate many websites' "Terms of Service" agreements. In a world where duplication is free, original content is key. So websites try to protect their content by legally restricting what people (or machines) can do with it.

For example, Google's Terms of Service say you can't re-use their search results for commercial purposes. But that's exactly what Google is doing with other websites' data: Google crawls a website, stores the entire site on Google's servers, indexes the keywords and calculates PageRanks for every page.

The difference is that Google's "service" is the ordering of the results, not data. Google's search engine doesn't have any original content.

Why are Amazon and the other music stores allowing Google to repurpose their music data? Because it's driving consumers to their sites, even though technically the data re-use may be a violation of their Terms of Service. It's an interesting implicit agreement (or explicit, who knows): if you drive relevant traffic to us, you can re-use our public data.

Search engines are going to try to do this more: take a website's data and understand it. In the process they will be *using* this data, data that isn't theirs -- it's just floating around in the open, bound by loose laws and implicit agreements to play nice.

For example, an online store could simply reject access to the Googlebot user agent if it doesn't like what Google is doing with its data. The downside is that the site wouldn't appear on Google any more. The upside is that its data might seem to be "protected".
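One conventional way to turn a crawler away is a robots.txt file, which asks a named user agent to stay out. It is a request that well-behaved crawlers honor, not a technical barrier. The sketch below uses Python's standard urllib.robotparser to check such rules; the robots.txt content and the store.example URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out Googlebot, allow everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL on a reserved example domain.
url = "http://store.example/albums/1"
googlebot_allowed = parser.can_fetch("Googlebot", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)
```

With these rules Googlebot is told to stay away while other crawlers are still welcome, which is the trade described above: the data looks "protected", and the store drops out of Google's results.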

Google can throw their weight around like this because they have influence ... but what if a smaller website tried to do the same thing? Would Amazon take kindly to it? We'll soon see: the search engine that beats Google is going to do it with context.

Posted at March 31, 2006 at 10:06 AM EST
Last updated March 31, 2006 at 10:06 AM EST
Comments

One thing that we can draw from this is that if the computer crawling your site cannot understand your site, then it will not index your site. If your entire webpage is all Flash, or requires JavaScript to make the content display, then it won't be indexed very well. Conversely, if you use standard, properly formatted HTML, it is much more likely that the search engines will be able to understand the content of your website, and that people will be driven to your site.

» Posted by: Kibbee at April 1, 2006 07:54 PM

I just want to be more specific: content is different from context. Context is what the content is actually about.

» Posted by: Ryan at April 1, 2006 11:20 PM

Yes, but good usage of standards can also help the search engine figure out the context. Using alt attributes on images, marking up which text is the header, and giving proper descriptive titles to your pages can help the search engine figure out what it is that your page is talking about. If the search engine can't pick out what the important parts of the page are, and distinguish the content from the filler, then it's going to have a hard time figuring out the context.

» Posted by: Kibbee at April 2, 2006 02:52 PM