Saturday, September 09, 2006

Context Sensitive Grammar tips

Context Sensitive Grammar

This type of context-based ambiguity is at the heart of a major problem with Internet search engines. Unless they understand the context of your queries, they cannot serve up the "best" results.

Some experts suggest that the issue of context-sensitive searching can only be fixed by allowing the search engines to "spy" on us and collect our query history.

Researchers know that understanding the context of a query is critical to making search engines effective. Search engine quality used to be expressed solely in terms of “precision and recall”, but in the past few months Google started word-stemming and synonym expansion, techniques that often lead to context-related false positives.
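To make "precision and recall" concrete, here is a minimal Python sketch. The document IDs and the two result lists are made up for illustration; they are not taken from any real search engine. Precision is the share of returned results that are relevant, and recall is the share of relevant documents that were returned.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document IDs the engine returned
    relevant:  document IDs a human judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments for a single query.
relevant_docs = ["d1", "d2", "d3"]

# Exact-match search: few results, all of them on target.
exact = ["d1", "d2"]

# Same query after synonym expansion: finds d3, but drags in
# three off-context documents as well.
expanded = ["d1", "d2", "d3", "d7", "d8", "d9"]

print(precision_recall(exact, relevant_docs))
print(precision_recall(expanded, relevant_docs))
```

In this toy case the expanded query raises recall (it finds every relevant document) while precision drops to one half, which is the false-positive trade-off described above.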

I’ve been playing with web-videos, and I have some great tips for using Google to find technical information.

Why is context-sensitive grammar critical on the Web?

For a simple example of context-sensitive queries, let's consider a web query for the heteronymous keyword "bass".

If you are Bubba the fisherman, you expect to see search results that include bass lures and bass-related fishing tools.

On the other hand, if you are an aspiring rock star, a search for "bass" should include references to bass guitars and famous bass players.
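One way to picture how a search engine might choose between the two meanings of "bass" is to compare the words in a user's earlier queries against a list of cue words for each sense. The sketch below is only an illustration of that idea in Python: the `SENSES` table, the cue words, and the `pick_sense` function are all invented for this example and do not describe how any real search engine works.

```python
# Hypothetical sense inventory: each sense of a keyword is tagged
# with a few cue words that tend to show up alongside it.
SENSES = {
    "bass": {
        "fish": {"fishing", "lure", "lake", "boat", "tackle"},
        "music": {"guitar", "band", "amp", "strings", "rock"},
    }
}


def pick_sense(keyword, history):
    """Guess which sense of `keyword` a user means.

    history: list of the user's earlier query strings.
    Returns the best-matching sense name, or None when the
    history gives no hint at all.
    """
    senses = SENSES.get(keyword, {})
    history_terms = {word for query in history for word in query.lower().split()}
    # Score each sense by how many of its cue words appear in the history.
    scores = {name: len(cues & history_terms) for name, cues in senses.items()}
    if not scores or max(scores.values()) == 0:
        return None
    return max(scores, key=scores.get)


bubba = ["best fishing boat", "lake tackle shop"]
rocker = ["rock band amp"]

print(pick_sense("bass", bubba))
print(pick_sense("bass", rocker))
print(pick_sense("bass", []))
```

With no history to go on, the function returns `None`, which is exactly the situation a search engine is in when it knows nothing about the person typing the query.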

But heteronyms are problematic, especially when we consider the new synonym expansion feature of Google.

Heteronyms and Synonym Expansion

Homonyms are words that are spelled and pronounced the same but have different meanings. We also have heteronyms, which are words that are spelled the same but have different pronunciations and different meanings:

Bow - After a formal bow, he drew back his bow, the arrow striking the bow of the ship.
Excuse - Please excuse me while I think of an excuse.
Polish - Tell the Polish cleaners to polish the floor.
Minute - The button was so minute that it was a minute before I found it.
Wind - Hopefully the wind will be strong enough to wind the windmill.
Record - It's the referee's job to record the new world record.
Dove – The dove dove for the popcorn.
Lead – The lead weight was too heavy for the lead rope.
Moped – Joe moped about until his moped was repaired.
Pussy – The pussy cat has a pussy wound on her leg.

Ludicrous Synonym Expansion

The Princeton WordNet (from the Princeton Cognitive Science program) is a common tool for performing synonym expansion. I have worked with WordNet, and you must be very careful to control the “distance” of related words. I turned on full expansion and searched a large legal database for the "F" word, and I was surprised to get hits on "congress".

Huh? I was confused about the relationship, so I searched WordNet and discovered that “a congress is a union of two bodies”.
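The idea of controlling the "distance" of related words can be sketched as a breadth-first walk over a graph of related terms that stops after a fixed number of hops. The Python below uses a tiny hand-made `RELATED` table, not real WordNet data, and the `expand` function is a hypothetical helper written for this example rather than part of any WordNet API.

```python
from collections import deque

# Toy "related word" graph, invented for illustration.
RELATED = {
    "union": ["congress", "marriage", "labor union"],
    "congress": ["legislature", "union"],
    "legislature": ["congress", "parliament"],
    "marriage": ["union", "wedding"],
}


def expand(term, max_distance):
    """Return {word: hops} for every word within max_distance hops of term."""
    distances = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        # Do not follow links out of words already at the distance limit.
        if distances[current] == max_distance:
            continue
        for neighbor in RELATED.get(current, []):
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return distances


for limit in (0, 1, 2, 3):
    print(limit, sorted(expand("union", limit)))
```

At a distance of one hop, "union" picks up "congress" and "marriage"; by two and three hops it has wandered into "legislature" and "parliament". Each extra hop adds recall but moves further from what the searcher meant, which is how full expansion ends up returning surprising hits.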

Don't be a Pussy

It’s these heteronyms that are the bane of synonym expansion. Let’s take our last heteronym keyword example, "pussy", a word with a multitude of meanings:

Doctor – To a physician, a search with "pussy" would expand to include synonyms "pustule", "infected" and "festering".
Grandma – When a search by Granny contains "pussy", appropriate synonyms might include "kitty", "cat", and "feline".
Teenaged Boy – When a teenager searches for “pussy”, we might take the vulgar derivation and include synonyms such as "snatch", "fur pie" and "carpet" (a contraction of the derogatory Lesbian noun, "carpet muncher").

You get the idea . . .

Context sensitive phrases

Back in the 1980s, when I was teaching grad school at the University of New Mexico, I evaluated a natural language interface for database queries. When I entered the database query "How long has John Doe been with us?", it could not understand the context of my question and asked "Are you requesting date_of_birth or date_of_hire?" Of course, any human would have understood the context of my query and responded with the hire date . . .

Language colloquialisms also affect context-sensitive grammar. For example, the 1960s Flintstones TV jingle "We'll have a gay old time" takes on new meanings in the 21st century.

When good searches go Bad

Back in 1997 when the movie "Titanic" came out, I tried to show my daughter how to use a search engine. We went to HotBot, and she entered "Titanic". To my horror, the results were loaded with references to "titanic tatas".

But how much have things changed in the last decade? Consider these other innocent examples where old Aunt Mable uses "that internet thang" and gets an unexpected anatomy lesson:

* Mable is looking for information on the new civic center and she enters the query "massive erection".

* Mable is looking for a shipper for her legendary homemade confections and she enters the query "fudge packer".

* Mable has 30 Persian cats and does a search on "grooming my hairy pussy".

What can be done to prevent these "bad context" queries? Is the only solution to allow the search engines to track our behavior? Google already controls the “glue” that ties the web together, and many folks are concerned about a Google monopoly and privacy.

Context sensitive queries and privacy

Google says that they can only understand our "context" if we allow them to monitor our web search activity. They note that search engines are not psychic, and they need to know our query "context" in order to serve up the "appropriate" search results.

I allow Google to collect my searches, and they reward me by providing instant "keyword suggestions" of related queries in the search entry tab. I wonder if everyone gets "herpes", "hernia" and "heroin" for this query?

Some folks say that if you don't allow the search engines to spy on you, then you can expect to continue to get annoying, off-base and sometimes offensive search results.

Privacy advocates scream "1984" when we talk about allowing search engines to keep histories of our search queries, and it's true that law-abiding companies like Google will hand over your search history if they are compelled by a subpoena.

But I'm more concerned about the web stalkers who prey on know-it-all kids like my son and daughter. Despite their book-learning, they still have little real-world experience, and they don't understand how dangerous the web can be. This is one reason that I co-wrote my book "Web Stalkers: How to protect yourself from Internet Criminals and Psychopaths".