|
Cover Page Introduction About Search Engines Structured Data Latent Semantic Indexing How LSI Works A Quick Example The Term-Document Matrix Applications of LSI Multi-Dimensional Scaling Applications of MDS Further Reading | < previous next > LSI EXAMPLE - INDEXING A DOCUMENTPutting Theory into Practice While a discussion of the mathematics behind singular value decomposition is beyond the scope of our article, it's worthwhile to follow the process of creating a term-document matrix in some detail, to get a feel for what goes on behind the scenes. Here we will process a sample wire story to demonstrate how real-life texts get converted into the numerical representation we use as input for our SVD algorithm. The first step in the chain is obtaining a set of documents in electronic form. This can be the hardest thing about LSI - there are all too many interesting collections not yet available online. In our experimental database, we download wire stories from an online newspaper with an AP news feed. A script downloads each day's news stories to a local disk, where they are stored as text files. Let's imagine we have downloaded the following sample wire story, and want to incorporate it in our collection:
The first thing we do is strip all formatting from the article, including capitalization, punctuation, and extraneous markup (like the dateline). LSI pays no attention to word order, formatting, or capitalization, so can safely discard that information. Our cleaned-up wire story looks like this:
The next thing we want to do is pick out the content words in our article. These are the words we consider semantically significant - everything else is clutter. We do this by applying a stop list of commonly used English words that don't carry semantic meaning. Using a stop list greatly reduces the amount of noise in our collection, as well as eliminating a large number of words that would make the computation more difficult. Creating a stop list is something of an art - they depend very much on the nature of the data collection. You can see our full wire stories stop list here.
Removing these stop words leaves us with an abbreviated version of the article containing content words only:
However, one more important step remains before our document is ready for indexing. Notice how many of our content words are plural nouns (grants, nations) and inflected verbs (pushing, refused). It doesn't seem very useful to have each inflected form of a content word be listed seperately in our master word list - with all the possible variants, the list would soon grow unwieldy. More troubling is that LSI might not recognize that the different variant forms were actually the same word in disguise. We solve this problem by using a stemmer. Stemming While LSI itself knows nothing about language (we saw how it deals exclusively with a mathematical vector space), some of the preparatory work needed to get documents ready for indexing is very language-specific. We have already seen the need for a stop list, which will vary entirely from language to language and to a lesser extent from document collection to document collection. Stemming is similarly language-specific, derived from the morphology of the language. For English documents, we use an algorithm called the Porter stemmer to remove common endings from words, leaving behind an invariant root form. Here are some examples of words before and after stemming: information -> inform presidency -> presid presiding -> presid happiness -> happi happily -> happi discouragement -> discourag battles -> battl And here is our sample story as it appears to the stemmer:
Note that at this point we have reduced the original natural-language news story to a series of word stems. All of the information carried by punctuation, grammar, and style is gone - all that remains is word order, and we will be doing away with even that by transforming our text into a word list. It is striking that so much of the meaning of text passages inheres in the number and choice of content words, and relatively little in the way they are arranged. This is very counterintuitive, considering how important grammar and writing style are to human perceptions of writing. Having stripped, pruned, and stemmed our text, we are left with a flat list of words:
administrat
amount
assist
bank
boost
bush
countri (2)
direct
europ
express
grant (2)
increas
irritat
loan
nation (3)
o'neill
paul
plan
poor (2)
propos
push
refus
rich
secretar
treasuri
US
world
This is the information we will use to generate our term-document matrix, along with a similar word list for every document in our collection. < previous next > |

© 2002 National Institute for Technology in Liberal Education. For more info, contact the author.