About
I'm Ryan Lowe, a Software Engineering graduate living in Ottawa, Canada. I like agile software development and Ruby on Rails.
I write this blog in Canadian English and don't use a spell checker. Typos happen.
Searching for Context
What is a search engine? They are relatively dumb, aren't they? You search for a word and the search engine tells you which pages that word is on. The secret (for now) seems to be the ordering: which pages are most relevant for the word(s)? That's hard to do if the server doesn't understand the context of the Internet's pages and of the user's search. What makes the problem harder is that users have been conditioned to use dumb search engines with dumb queries, removing important context from their searches.

Google innovated by counting links to a page (PageRank) to improve the ordering of search results. But does this really improve results? It sounds more like a popularity contest than an improvement in relevance, but it was still better than what came before. No, the next step in search engines will involve context: understanding the web, not just indexing keywords and calculating PageRank.

How? Google has already started to do this with its music search. When it parses pages from Amazon or other online stores, it sorts the page's data out and gives it context. It's easy to find an album title on Amazon because it's always in the same place, with the same HTML tags on it. The Google Music Search can be a little more intelligent when parsing a web page from a music store because it knows the page's structure and can deduce context from that structure.

The strange thing about this kind of search engine crawling is that it seems to violate many websites' "Terms of Service" agreements. In a world where duplication is free, original content is key. So websites try to protect their content by legally restricting what people (or machines) can do with it. For example, Google's Terms of Service say you can't re-use their search results for commercial purposes. But that's exactly what Google is doing with other websites' data: Google crawls a website, stores the entire site on Google's servers, indexes the keywords and calculates PageRank for every page.
The difference is that Google's "service" is the ordering of the results, not data. Google's search engine doesn't have any original content.

Why are Amazon and the other music stores allowing Google to repurpose their music data? Because it's driving consumers to their sites, even though technically the data re-use may be a violation of their Terms of Service. It's an interesting implicit agreement (or explicit, who knows): if you drive relevant traffic to us, you can re-use our public data.

Search engines are going to try to do this more: take a website's data and understand it. In the process they will be *using* this data, data that isn't theirs -- it's just floating around in the open, bound by loose laws and implicit agreements to play nice. For example, an online store could simply reject access to the Googlebot user agent if it doesn't like what Google is doing with its data. The downside is that the site wouldn't appear on Google any more. The upside is that its data might seem to be "protected".

Google can throw its weight around like this because it has influence ... but what if a smaller website tried to do the same thing? Would Amazon take kindly to it? We'll soon see: the search engine that beats Google is going to do it with context.

Posted at March 31, 2006 at 10:06 AM EST
Last updated March 31, 2006 at 10:06 AM EST

Comments
One thing that we can draw from this is that if the computer crawling your site cannot understand your site, then it will not index your site. If your entire webpage is all Flash, or requires JavaScript to make the content display, then it won't be indexed very well. Conversely, if you use standard properly formatted HTML, it is much more likely that the search engines will be able to understand the content of your website, and that people will be driven to your site.
» Posted by: Kibbee at April 1, 2006 07:54 PM

I just want to be more specific: content is different from context. Context is what the content is actually about.
» Posted by: Ryan at April 1, 2006 11:20 PM

Yes, but good usage of standards can also help the search engine figure out the context. Usage of alt tags on images, specifying what text is the header, and giving proper descriptive titles to your pages can help the search engine figure out what it is that your page is talking about. If the search engine can't pick out what the important parts of the page are, and distinguish the content from the filler, then it's going to have a hard time figuring out the context.
» Posted by: Kibbee at April 2, 2006 02:52 PM
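
To make the point about standard markup concrete, here is a minimal sketch of how a crawler might pull context hints (page title, main heading, image alt text) out of well-structured HTML. It uses only Python's standard `html.parser` module. The `ContextExtractor` class, the `extract_context` function and the sample store page are all made up for illustration; this is not how Google or any real search engine does it.

```python
from html.parser import HTMLParser


class ContextExtractor(HTMLParser):
    """Hypothetical helper: collects title, <h1> text and image alt text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.alt_texts = []
        self._current = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._current = tag
            if tag == "h1":
                self.headings.append("")
        elif tag == "img":
            # Images without an alt attribute give the crawler nothing to read.
            alt = dict(attrs).get("alt")
            if alt:
                self.alt_texts.append(alt)

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h1":
            self.headings[-1] += data


def extract_context(html):
    """Return the context hints found in an HTML string as a dict."""
    parser = ContextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "headings": [h.strip() for h in parser.headings],
        "alt_texts": parser.alt_texts,
    }


# Invented sample page standing in for a music store's album page.
page = """
<html><head><title>Kind of Blue - Example Music Store</title></head>
<body>
<h1>Kind of Blue</h1>
<img src="cover.jpg" alt="Album cover for Kind of Blue">
<img src="spacer.gif">
<p>Filler text the crawler cannot easily classify.</p>
</body></html>
"""

print(extract_context(page))
```

The second image has no alt text, so it contributes nothing, and the paragraph is ignored because nothing in the markup says what it is. A page built entirely in Flash or rendered by JavaScript would give a parser like this an empty result.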