Latent Semantic Indexing (LSI): What it is, how it works and what it means




By Michael Marshall

LSI is a method for automatically indexing and classifying documents. It examines all the words in all the documents of a corpus and calculates similarity measurements between documents and between individual terms. It can gauge which documents in a corpus are relevant to a search phrase even if that search phrase does not appear in a document. Measuring relevancy is a key component of a search engine’s ranking algorithm. When search engines use it, LSI can have a significant impact on the ranking of your web pages.

How can a search engine tell the difference between relevant information and irrelevant information? Some search engines use LSI to achieve this goal. LSI helps improve a search engine’s performance in three significant tasks: recall, precision, and ranking. Recall is getting all of the relevant information available for your search. Precision is getting only the information that is relevant to your search. Ranking is getting all the information ordered in a meaningful way – from the most relevant to the least, for example.
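To make the first two tasks concrete, here is a minimal sketch in Python of how precision and recall could be computed for a single search. The function name precision_recall and the document IDs are made up for illustration; they do not come from any particular search engine.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document IDs the search engine returned
    relevant:  document IDs that are actually relevant to the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set  # relevant documents that were returned
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical search: four documents returned, three relevant documents exist.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc1", "doc3", "doc5"]
precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this made-up search, half of what came back was relevant (precision), and two of the three relevant documents were found (recall).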

When you query a search engine that uses LSI, the search engine examines similarity values calculated for every content word. This method examines the document collection as a whole and knows which documents are semantically close or distant based on the relationships between all the words in each document and all the words in the rest of the collection. LSI does not require an exact match to the query phrase to find relevant documents.
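Here is a rough sketch, in Python with NumPy, of the kind of calculation described above. It assumes the textbook formulation of LSI (a term-document count matrix reduced by a truncated singular value decomposition). The tiny corpus, the choice of k = 2, and the helper names project_query and cosine are illustrative assumptions, not how any particular search engine implements it.

```python
import numpy as np

# Hypothetical five-document corpus: three about vehicles, two about baking.
docs = [
    "car engine repair",
    "automobile engine repair",  # never mentions "car"
    "car automobile dealer",
    "recipe flour sugar",
    "recipe sugar butter",
]

# Term-document count matrix: one row per term, one column per document.
vocab = sorted({word for doc in docs for word in doc.split()})
row = {word: i for i, word in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        A[row[word], j] += 1

# Truncated SVD: keep only the k strongest latent dimensions.
k = 2
U, _, _ = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]

# Project every document into the k-dimensional latent space (one row each).
doc_vecs = (U_k.T @ A).T


def project_query(query):
    """Project a query into the same latent space as the documents."""
    q = np.zeros(len(vocab))
    for word in query.split():
        if word in row:  # words outside the vocabulary are ignored
            q[row[word]] += 1
    return U_k.T @ q


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


q_vec = project_query("car")
sims = [cosine(q_vec, d) for d in doc_vecs]
for doc, sim in sorted(zip(docs, sims), key=lambda pair: -pair[1]):
    print(f"{sim:6.3f}  {doc}")
```

In this toy corpus, the query "car" scores highly against "automobile engine repair" even though that document never uses the word "car", because the two share a context (engine, repair). The baking documents score near zero. Real systems work with far larger corpora, usually weighted rather than raw counts, and many more dimensions.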

What this means for the search engine optimization (SEO) specialist and anyone with a website who wants high visibility in the search engines is that every word on your web page is important, not just the keyphrase(s). It is the right combination of all the words in your content that really matters here. What you do with your keyphrase(s) is still important but now you must go beyond that . . . way beyond that. You’ve got to have the right context to support your keyphrase(s).

Because LSI correlates surprisingly well with how we as humans might classify a document collection, writing content that performs well under LSI analysis is not like writing contrived, robotic-styled verbiage for a machine. It involves giving proper attention both to persuasive, well-written copy and to semantics. It is a delicate balance of art and science.

Writing with LSI in mind forces you to write more relevant, more compelling content. This is good for the search engines because it increases the quality of the content in their databases. This is good for your business because you’ll have content that generates more traffic and more conversions. A proper solution draws on principles from multiple disciplines (computer science, information theory, and human psychology) that dovetail very nicely with time-tested marketing principles.

The combined insights from three well-known figures in history, a theologian, a painter, and a mathematician, can explain why.

* What does a 5th-century theologian (St. Augustine of Hippo) know about lasting joy and human desire that can help you transform your website into one that converts more visitors to customers?!
* What does a 19th-century painter (Claude Monet) know about beauty and imagination that will help you make your website irresistible to the search engines?!
* What does a 20th-century mathematician (Benoit Mandelbrot) know about the link between words and chaos theory that will help you both persuade your visitor and satisfy the search engines in one stroke?!

If this trio, Augustine, Monet, and Mandelbrot, sat down to dinner and had a discussion about how search engines do what they do, about search engine marketing, and about your website, here is what you might hear in that conversation. You might hear Augustine talk about why, when you write with LSI in mind, you will also have content that is compelling to your human audience. You might hear Mandelbrot talk about why the math and artificial intelligence behind LSI track so well with how humans use words and organize documents. You might hear Monet talk about the importance of creating a great context to support and enhance your keyphrase(s).

Augustine’s Law

It is our nature to be attracted to that which is beyond the ability of our minds to fully comprehend. This is what Roy H. Williams calls Augustine’s Law. In On the Trinity, Augustine comments that the seeker in Psalm 105 finds the joy described therein only “. . . when one has been able to find how incomprehensible that is which he was seeking . . .” Examples of some phenomena in nature toward which we are drawn are ocean waves, cloud formations, mountains, lightning, and snowflakes. All these examples share something in common. The elegant order in each can be described mathematically by chaos theory. Unpredictability, information theory (which informs the term weighting commonly used in LSI), and chaos are very closely related. Augustine’s Law would hold that writing copy with these principles in mind will add to the appeal of your message.

Mandelbrot’s Fractals

Computers use mathematical equations to produce images called fractals, which are maps of chaotic systems (e.g. population fluctuations, chemical reactions, and clouds). Mandelbrot applied this kind of mathematics to the variations in stock market prices and, more important to our topic, to the probabilities of word occurrences in language. Suffice it to say, the way we use language can be described by mathematical equations similar to those that describe other chaotic systems. This is why something seemingly as mathematical and abstract as the principles and concepts underlying LSI tracks so well with how humans use words and organize documents.
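One widely cited equation of this kind is the Zipf-Mandelbrot law, in which the probability of the word at frequency rank r is proportional to 1 / (r + b)^a. The short Python sketch below simply evaluates that formula. The parameter values a = 1.0 and b = 2.7 and the vocabulary size of 1,000 are illustrative assumptions, not figures fitted to any real corpus.

```python
def zipf_mandelbrot(num_words, a=1.0, b=2.7):
    """Return probabilities for ranks 1..num_words, each proportional to
    1 / (rank + b) ** a and normalized so that they sum to 1.

    The defaults for a and b are illustrative, not fitted values.
    """
    weights = [1.0 / (rank + b) ** a for rank in range(1, num_words + 1)]
    total = sum(weights)
    return [w / total for w in weights]


probs = zipf_mandelbrot(1000)
for rank in (1, 2, 10, 100, 1000):
    print(f"rank {rank:4d}: p = {probs[rank - 1]:.5f}")
```

The point of the sketch is the shape of the curve: a handful of very common words carry much of the probability, followed by a long tail of rare ones.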

Monet’s Impressionism

The term Impressionism was derived from a painting by Claude Monet, Impression, Sunrise (1872). In this style, Monet would capture the ever-changing effects of sunlight on his surroundings, and the technique allowed him to be responsive both to the character and texture of an object in nature and to the impact of light on its surfaces. He was able to engage the imagination because he realized the importance of context in his painting technique. The color of an object is modified by the light in which it is seen, by reflections from other objects, and by its contrast with juxtaposed colors.

Similarly, the color (sense of meaning) of a word is modified by the context in which it is seen, by reflections from words near it, and by its contrast with words juxtaposed to it. If you write with words the way Monet painted with colors, you will engage the imagination of your audience as well. Roy H. Williams speaks of this with regard to traditional marketing; it is even more important with regard to search engines using LSI. Since LSI can help tell you what that context should be for a word or phrase, Monet would highly recommend it as a great tool to support and enhance keyphrase(s) in search engine marketing. Doing so would please the search engines and captivate your audience.

What this means

Latent Semantic Indexing (LSI) is a highly beneficial technique for search engines to use for improving recall, precision, and ranking. In a future article, I will discuss in more detail how LSI is actually used within search engine algorithms. As an added benefit, by using LSI, search engines provide an incentive for web copywriters and SEO professionals alike to produce better content in their web pages. This, in turn, increases the quality of a search engine’s database.

In your business, is visibility in the search engines and your online presence important to you? Then you need to understand the impact LSI has on your search marketing goals. You need to experience the benefits a proper understanding of LSI can deliver when it becomes an integral part of your search marketing efforts. At Fortune Interactive, we have technology, staff, and expertise unique in the industry to help your search marketing efforts and your business reap those benefits.

Michael Marshall is co-founder and Vice President of Technology at Fortune Interactive and has over 17 years of experience in information technology covering a wide range of specialties, including web design, software engineering, e-commerce solutions, artificial intelligence, and Internet marketing. He is a contributor to “Building Your Business With Google for Dummies” by Brad Hill (Wiley Publishing) and a regular, featured speaker at Ultra Advanced SEO Symposiums (www.ultraadvancedsymposium.com/), a meeting of select masters of the search engine marketing industry. Michael is also a certified instructor at the D.C. Search Engine Academy (http://www.thedcacademy.com/). He has degrees in Linguistics, Philosophy and Theology and is presently a Philosophy PhD student at the University of Virginia focusing in the area of semantics.