Porter Stemming Algorithm

Stemming refers to reducing words to their stems. Wikipedia says, “Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.” (Wikipedia, Stemming).

Word stemming is used extensively as part of search algorithms to increase the number of relevant, though not identical, word results that can be recovered. Using stemming, a search for the word “nursing” will also match the word “nurse” because both have a stem of “nurs.”

While working on a “live search” system for our website, I realized that the results being returned were not optimal for precisely this reason. We run a (very expensive) search package for our normal searching systems, but this was a one-off search for a small feature of the site, so I didn’t have access to the full search index and its capabilities. In searching for a reasonable stand-in for that stemming system, I came across the Porter Stemmer.

The Porter Stemmer is a word stemming algorithm developed in the late ’70s by Martin Porter, which has gained a good deal of attention over the years. On the site linked above, you can find implementations of this algorithm in 17 languages!

What I decided to do was implement the stemming portion in JavaScript, as part of the “live search,” and then let our back-end do its normal SQL search based on the word stems. Because stems generated by the Porter Stemmer are almost invariably truncated versions of the input words, and because our SQL-based matching uses simple T-SQL wildcards (WHERE foo LIKE '%bar%') , it works quite well without any back-end additions.

I used the JavaScript implementation of the Porter Stemmer written by someone identified only as Andargor, and bootstrapped it into the page with a custom Prototype-based class that I call, simply, Stemmer.

You can give it a try and view the code on my Porter Stemmer test page.

My class wrapper includes options to turn stemming off for quoted words and for capitalized words. Instantiating the class is dreadfully easy, the only required parameter is the name of the stemming function. In the Porter Stemmer JavaScript implementation, this function is called “stemWord”.

var stemmer = new Stemmer({
    stemQuoted: false,
    stemCaps: true,
    stemFunction: stemWord
});  

Providing the word-stemming function as a parameter allows you to use any stemming implementation you wish without messing around inside of the Stemmer class itself. The other two options, stemQuoted and stemCaps, are optional and default to false and true, respectively, causing quoted words not to be stemmed, but capitalized words to be stemmed as usual.

Just for giggles, there is also a function within my class called stemUpdate, which automatically updates the innerHTML of the named element with the stemmed result when it’s called. I originally used it for the demonstration page, and though I question its overall usefulness, it’s there for you. The syntax is:

var stemmer = new Stemmer({ stemFunction: stemWord });
stemmer.stemUpdate('element_or_ID', 'words to stem');  

Just getting the stemmed result back is as easy as:

var stemmer = new Stemmer({ stemFunction: stemWord });
var result = stemmer.stem('a phrase to stem');  

Hopefully this will be useful to someone else out there!

0 Responses to “Porter Stemming Algorithm”


  1. No Comments

Leave a Reply