Monthly Archive for April, 2007

They What Said It Best

Brian Guthrie blogged today about an article in The Economist about the VT tragedy. The Economist is a publication I have a great deal of respect for, but in this case they’re flying their European socialist flag as high as ever. Here is the quote from the article that Brian used:

When it comes to most dangerous products—be they drugs, cigarettes or fast cars—this newspaper advocates a more [classically] liberal approach than the American government does. But when it comes to handguns, automatic weapons and other things specifically designed to kill people, we believe control is necessary, not least because the failure to deal with such violent devices often means that other freedoms must be curtailed. Instead of a debate about guns, America is now having a debate about campus security.

Later in the same article, they state:

Had powerful guns not been available to him, the deranged Cho would have killed fewer people, and perhaps none at all.

Let us not confuse the availability of guns with the legality of gun ownership. For reference, please note the availability of illegal drugs. Tightening regulations is one thing, but when reasonable firearms for self-defense (e.g. Glocks, non-automatic weapons, non-explosive weapons) are withheld from all but the police, you merely catalyze the growth and ubiquity of the weapons black market, which peddles its wares primarily to those with the urge to do real harm.

I’m a proponent of gun control. I do believe that the way we manage the sale and distribution of (especially) handguns in this country deserves to be revisited. But I also believe that the ability of an individual to buy a gun for self-defense, not having any intention or expectation to use it in public, is a boon to the overall safety of the country (or any country). In my mind, the critical issue to be addressed is how to better regulate the sale of arms to people like Cho, who clearly demonstrated (non-criminal) characteristics that would not inspire trust in any gun seller. Because those characteristics were not readily available to the gun seller, the seller had no way to know what his intentions might be.

The issue of gun control is divisive, as it sets forth premises that are so broad, socially, that there is scarcely a way to determine their accuracy. It was precisely these types of anecdotal premises that spurred the “war on drugs,” and as history has demonstrated, politicians are seldom very adept sociologists.

Porter Stemming Algorithm

Stemming refers to reducing words to their stems. Wikipedia says, “Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.” (Wikipedia, Stemming).

Word stemming is used extensively as part of search algorithms to increase the number of relevant, though not identical, word results that can be recovered. Using stemming, a search for the word “nursing” will also match the word “nurse” because both have a stem of “nurs.”

While working on a “live search” system for our website, I realized that the results being returned were not optimal for precisely this reason. We run a (very expensive) search package for our normal searching systems, but this was a one-off search for a small feature of the site, so I didn’t have access to the full search index and its capabilities. In searching for a reasonable stand-in for that stemming system, I came across the Porter Stemmer.

The Porter Stemmer is a word stemming algorithm developed in the late ’70s by Martin Porter, which has gained a good deal of attention over the years. On the site linked above, you can find implementations of this algorithm in 17 languages!

What I decided to do was implement the stemming portion in JavaScript, as part of the “live search,” and then let our back-end do its normal SQL search based on the word stems. Because stems generated by the Porter Stemmer are almost invariably truncated versions of the input words, and because our SQL-based matching uses simple T-SQL wildcards (WHERE foo LIKE '%bar%'), it works quite well without any back-end additions.
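To make the hand-off concrete, here is a minimal sketch of how a list of stems could be turned into wildcard conditions. The helper names (escapeLike, buildLikeClause) and the ? placeholder style are assumptions made for illustration; the real back-end query code is not shown in this post.

```javascript
// Hypothetical helper: neutralize T-SQL LIKE wildcards (%, _ and [)
// inside a stem by wrapping each one in square brackets.
function escapeLike(stem) {
  return stem.replace(/[\[%_]/g, function (ch) {
    return "[" + ch + "]";
  });
}

// Hypothetical helper: build one "column LIKE ?" condition per stem and
// collect the matching '%stem%' values as separate query parameters.
function buildLikeClause(column, stems) {
  var conditions = [];
  var params = [];
  stems.forEach(function (stem) {
    conditions.push(column + " LIKE ?");
    params.push("%" + escapeLike(stem) + "%");
  });
  return { sql: conditions.join(" AND "), params: params };
}

// Example: the stems for a search such as "nursing homes".
var clause = buildLikeClause("foo", ["nurs", "home"]);
console.log(clause.sql);    // foo LIKE ? AND foo LIKE ?
console.log(clause.params); // [ '%nurs%', '%home%' ]
```

Keeping the stems in a parameter list rather than pasting them into the SQL string is a design choice of this sketch; it avoids quoting problems when a search term contains an apostrophe.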

I used the JavaScript implementation of the Porter Stemmer written by someone identified only as Andargor, and bootstrapped it into the page with a custom Prototype-based class that I call, simply, Stemmer.

You can give it a try and view the code on my Porter Stemmer test page.

My class wrapper includes options to turn stemming off for quoted words and for capitalized words. Instantiating the class is dreadfully easy; the only required parameter is the name of the stemming function. In the Porter Stemmer JavaScript implementation, this function is called “stemWord”.

var stemmer = new Stemmer({
    stemQuoted: false,
    stemCaps: true,
    stemFunction: stemWord
});  

Providing the word-stemming function as a parameter allows you to use any stemming implementation you wish without messing around inside of the Stemmer class itself. The other two options, stemQuoted and stemCaps, are optional and default to false and true, respectively, causing quoted words not to be stemmed, but capitalized words to be stemmed as usual.
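For readers who want a feel for how such a wrapper might behave, here is a rough sketch in plain JavaScript. It is not the Prototype-based Stemmer class itself: the names SketchStemmer, stemOne and toyStem are invented for this example, and toyStem is only a stand-in for stemWord so the snippet runs by itself.

```javascript
// Sketch only: a plain-JavaScript stand-in, not the Prototype-based Stemmer class.
function SketchStemmer(options) {
  if (!options || typeof options.stemFunction !== "function") {
    throw new Error("stemFunction is required");
  }
  this.stemFunction = options.stemFunction;
  // Defaults follow the description above: quoted words are left alone,
  // capitalized words are stemmed.
  this.stemQuoted = options.stemQuoted === undefined ? false : options.stemQuoted;
  this.stemCaps = options.stemCaps === undefined ? true : options.stemCaps;
}

// Stem one bare word, honoring the stemCaps option.
SketchStemmer.prototype.stemOne = function (word) {
  if (!this.stemCaps && /^[A-Z]/.test(word)) {
    return word;
  }
  return this.stemFunction(word);
};

// Stem a whole phrase, treating each "double-quoted run" as a single token.
SketchStemmer.prototype.stem = function (phrase) {
  var self = this;
  var tokens = phrase.match(/"[^"]*"|\S+/g) || [];
  return tokens.map(function (token) {
    var isQuoted = token.length >= 2 &&
      token.charAt(0) === '"' &&
      token.charAt(token.length - 1) === '"';
    if (!isQuoted) {
      return self.stemOne(token);
    }
    if (!self.stemQuoted) {
      return token;
    }
    // Stem each word inside the quotes, then put the quotes back.
    var inner = token.slice(1, -1).split(/\s+/).filter(Boolean);
    return '"' + inner.map(function (w) { return self.stemOne(w); }).join(" ") + '"';
  }).join(" ");
};

// Toy stand-in for stemWord; it only strips a trailing "ing" and is NOT the Porter algorithm.
function toyStem(word) {
  return word.replace(/ing$/, "");
}

var sketch = new SketchStemmer({ stemFunction: toyStem });
console.log(sketch.stem('Running "jumping jacks" nursing'));
```

Passing the stemming function in as an option is what lets the toy function be swapped for a real implementation without touching the wrapper.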

Just for giggles, there is also a function within my class called stemUpdate, which automatically updates the innerHTML of the named element with the stemmed result when it’s called. I originally used it for the demonstration page, and though I question its overall usefulness, it’s there for you. The syntax is:

var stemmer = new Stemmer({ stemFunction: stemWord });
stemmer.stemUpdate('element_or_ID', 'words to stem');  

Just getting the stemmed result back is as easy as:

var stemmer = new Stemmer({ stemFunction: stemWord });
var result = stemmer.stem('a phrase to stem');  

Hopefully this will be useful to someone else out there!

XMP and Ruby

I am an amateur photographer in (most of) my spare time. When I’m not actually making photographs or blogging about photography, I tend to explore some of the eccentric intersections of photography and technology. One of those intersections is Adobe’s XMP, or Extensible Metadata Platform specification. XMP was conceived as a drop-in replacement for IPTC (literally “International Press Telecommunications Council,” but also used improperly to refer to the IPTC’s IIM, or Information Interchange Model), and allows you to store a bunch of information in the headers of graphics files.

IIM was put together around 1991, but its development was frozen in 1997 after XML started to take hold. XMP uses a subset of the XML format RDF (Resource Description Framework; I feel like there are too many acronyms in this post already), so it’s a lot easier (and more fun) to work with… once you get it out of the file.

In order to extract the XMP data, I used a little tool available from a kind soul at the W3C. Specifically, I compiled the jpeg-xmp-2.2 package (yes, in OS X; it built with about 1,000 warnings but works fine) and got my filthy mitts on the rdjpgxmp tool, which simply reads the XMP data from a JPEG and spits it out in raw RDF format.

Here’s a sample of what my RDF file looks like. That’s a dump of the XMP from a photo I exported from Adobe Lightroom with my basic copyright information attached. As you can see, there is a lot more than just copyright information embedded in it. They manage to work in some of the stuff typically carried by Exif as well as the “IPTC Core,” the junk that was formerly only found in the IIM payload, which Adobe has graciously provided a synchronization process to preserve.

The output of rdjpgxmp has a ton of blank lines at the end for some reason, so we can use a bit of simple bash trickery to do away with that.

bash $ rdjpgxmp YourJPEGFile.jpg | grep -v "^ *$"  

The -v switch tells grep to only display lines that don’t match the pattern, and if you look closely you’ll see that there’s a space in there: the pattern matches any line that is empty or contains nothing but spaces.
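If the pattern is hard to read, the same filter can be spelled out in a few lines of JavaScript. This is only an illustration of what the grep call does; the function name dropBlankLines is made up for the example.

```javascript
// Keep only the lines that are NOT empty or made up solely of spaces,
// which is the effect of grep -v "^ *$".
function dropBlankLines(text) {
  return text.split("\n").filter(function (line) {
    return !/^ *$/.test(line);
  }).join("\n");
}

console.log(dropBlankLines("<rdf:RDF>\n\n   \n</rdf:RDF>\n"));
```

Note that the pattern only covers spaces, so a line containing just a tab would survive both this function and the grep version.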

So, now that we have a fairly clean RDF output from the shell, we can get it into a Ruby script and mess around with it. I just used a simple shell execute to get the output of rdjpgxmp and then used REXML to get down to the title of the photograph.

require 'rexml/document'
include REXML

xmlstring = `~/bin/rdjpgxmp YourJPEGFile.jpg | grep -v "^ *$"`
doc = Document.new(xmlstring)

$stdout.print doc.root.elements["//dc:title/rdf:Alt/rdf:li/text()"]
$stdout.print "\n"  

Pretty sweet, yes? Using REXML’s fabulous support for XPath syntax, it’s easy to get right down to the correct element. Now it would be a simple thing to write scripts to pull different bits of data, insert them into databases, rename the files themselves (perhaps based on their XMP title?), or do anything else you can think of. There is also a companion tool, wrjpgxmp, that allows you to write to the XMP data. There is a wealth of potential here.


Maximum URL Length

My colleague at work is faced with handling URLs of a particularly great length. In our system, we don’t usually encounter URLs longer than 250 characters, and even our custom URL shortening system only accommodates URLs of 1,000 characters or fewer. The question then arose: how long is too long when it comes to URLs? Should we be expected to store URLs as long as a SQL text field (2,147,483,647 characters)?

So I hopped around on the Internets (all three of them!) for a while, and found this neat reference on Boutell.com. This individual tested entering URLs of different lengths into the major browsers and also requesting them from both major web servers to see what would happen. Here’s the best part, for your enjoyment and edification.

The default limit [ in Microsoft IIS -ed ] is 16,384 characters (yes, Microsoft’s web server accepts longer URLs than Microsoft’s web browser). This is configurable.

I guess Microsoft’s inter-departmental communications aren’t quite as robust as one would hope. Who has a 16,384-character URL, anyway?
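For what it’s worth, checking a URL against the two limits mentioned in this post takes only a few lines. The sketch below is an assumption-laden illustration: the constant and function names are invented, and the only numbers used are the 1,000-character shortener limit and the 16,384-character IIS default quoted above.

```javascript
// Limits taken from this post; the names are hypothetical.
var SHORTENER_LIMIT = 1000;    // our URL shortening system
var IIS_DEFAULT_LIMIT = 16384; // IIS default, per the Boutell.com quote

// Report a URL's length and which of the two limits it fits within.
function urlLengthReport(url) {
  return {
    length: url.length,
    fitsShortener: url.length <= SHORTENER_LIMIT,
    fitsIisDefault: url.length <= IIS_DEFAULT_LIMIT
  };
}

console.log(urlLengthReport("http://example.com/page.cfm?id=7"));
```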

Sharpton Misunderstands Free Speech (Again)

When asked to comment (for the six thousandth time) on the Don Imus debacle, the Rev. Al Sharpton, described by CNN as a “civil rights activist,” said (and I quote):

Somewhere we must draw the line in what is tolerable in mainstream media. We cannot keep going through offending us and then apologizing and then acting like it never happened. Somewhere we’ve got to stop this.

Stop what, Reverend? Stop negotiating the response to each infraction of your personal rules of permissible speech? Stop determining what people are allowed to say on the radio and on TV through democratic and free-market-based consensus?

I suppose we should ask the Reverend to make us a list of what people should be permitted to say and censor everything else. Yes, that sounds like the American way to do it.

On the horizon I can barely make out the death of satire.

Per-Site Styles in Firefox

I try very hard not to be a buzzkill; my cynical personality makes it hard for me not to trash talk things that offend my sense of aesthetics, but if I did that whenever I felt the urge, I fear I’d never do anything else. As a web developer in 2007, as much as it pains me to say so, there is still an awful lot to feel queasy about.

You don’t see a lot of “under construction” graphics around these days, let alone animated ones, so at least we have that to be thankful for. We have certainly come a long way since:

www.geocities.com/city/alley/dumpster/MYHOMEP~1.HTM

Still, the Internet (like most other places, unfortunately) is riddled with tasteless, colorblind people. Fortunately for those of us who are total hardcore nerds, even other people’s lack of taste doesn’t have to affect our web browsing experience thanks to a little something called (or at least, what I call) per-site styles.

The Anniversary of the Repeal of Prohibition

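Today, April 7th, is the 74th anniversary of the beginning of the end of Prohibition: on this day in 1933 the Cullen-Harrison Act took effect and beer became legal again, months before the 21st Amendment finished the job. Prohibition was that scourge on the freedom of Americans to indulge themselves in the firewater and forget the sorrows that daily consume them.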

I, for one, will be celebrating this important day the way all nerds should: drinking myself into a hole from which I can only escape by following that most terrible of all roads, the Road to Hangover.

Have a good one, nerds!

DRM Is Less Smart Than Eating Lightbulbs

After reading Brian Guthrie’s article about Apple’s new iTunes non-DRM offering and the half-baked editorial from The Economist decrying this move as naive, I had to write something. First, I agree with what Brian said:

… I would go one step further and claim that, for those who are [avid downloaders], digital music has always been chiefly about convenience. I predict that the existence of a legal DRM-free method for obtaining the same, weighed against the threat of arbitrary and capricious lawsuits, will be enough to tip the scales.

I have been an enthusiastic patron of the iTunes Music Store since its inception—even despite the DRM issues—and I wholeheartedly side with Brian in his opinion that non-DRM offerings will simply sweeten the deal for those of us with absolutely no patience for buying stuff. My preference is to click some sort of a “button” and have my chosen merchandise handed to me by someone in a snappy uniform (snappy uniform optional) or, in the case of non-tangibles, download it instantly.

Furthermore, anyone who thinks that Digital Rights Management is effective or (even more frighteningly) anything less than a perversion of our legal system should read more Lawrence Lessig. For the sake of piling yet another anti-DMCA rant onto the mounting heap already gathered throughout the Internet, I will now recount a metaphor I made up while explaining DRM and the DMCA to my parents.

The biggest problem with DRM is that it stifles fair use; even though copyright law, as written by Congress and interpreted by the courts, should give anyone the right to manipulate copyrighted material to create what we call “derivative work,” DRM attempts to prevent that. Andy Warhol famously duplicated the ubiquitous Campbell’s Soup can as a statement about mass production and American consumerism. According to copyright law, it was (and is) completely legal for Andy Warhol to make huge silkscreen prints of the Campbell’s Soup can design because they were different enough, altered enough, to be considered “derivative.”

Now, let’s say there is some fantastical technology that could prevent people from making reproductions, photographs, copies, or otherwise duplicating the Campbell’s Soup can label. Some sort of a forcefield, perhaps. It is still within your rights to duplicate the label under the rules of fair use (if you are duplicating it in a derivative work as described in copyright law), but under the DMCA it becomes illegal to circumvent that fantastical copy protection technology.

The DMCA doesn’t protect copyrighted works. At least, not directly. The DMCA protects the technology that protects copyrighted works (chiefly Digital Rights Management), even in cases where the scope of protection created by the technology exceeds that of the copyright laws themselves. The DMCA is a clever and nefarious way to extend the reach of copyright law without actually changing a word of it.

The Economist is all wrong; DRM is worthless, music downloaders are sure to be largely made up of lazy and otherwise law-abiding people, and the new iTunes non-DRM offerings are already making my mouth water. I think being able to freely back up, restore, and play the music I buy in the way I prefer is worth a paltry $0.30.

Microsoft Log Parser 2.2

Thanks to Aaron Johnson’s pretty swell article about the Microsoft Log Parser (included in the IIS admin. toolkit and also available for separate download), I am now well on my way to Microsoft log-parsing heaven. If such a place exists.

Here at work we are faced with a similar problem to Aaron’s (no, the other one) where we are required to generate server log-based statistics that aren’t covered by standard web stats packages like AWStats et al. We have to count occurrences of specific URL variables within the requests and things of that nature. Hopefully Log Parser 2.2 with its completely awesome SQL interpreter will be able to give us the data we need in a format we can use.
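To show the kind of count we are after, here is a small JavaScript sketch that tallies the values of one URL variable across a list of requested URIs. It does not use Log Parser at all; the function name and the sample URIs are made up, and reading the URIs out of the actual IIS log files is left out.

```javascript
// Count how often each value of a given query-string variable appears
// in a list of request URIs. Returns a Map of value -> count.
function countQueryVariable(requestUris, name) {
  var counts = new Map();
  requestUris.forEach(function (uri) {
    var queryStart = uri.indexOf("?");
    if (queryStart === -1) {
      return; // no query string on this request
    }
    var params = new URLSearchParams(uri.slice(queryStart + 1));
    params.getAll(name).forEach(function (value) {
      counts.set(value, (counts.get(value) || 0) + 1);
    });
  });
  return counts;
}

// Hypothetical sample requests.
var sample = [
  "/page.cfm?id=7&ref=home",
  "/page.cfm?id=7",
  "/other.cfm?id=9",
  "/plain.cfm"
];
console.log(countQueryVariable(sample, "id"));
```

A Map is used instead of a plain object so that odd values such as "constructor" cannot collide with built-in property names.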

Yes, we use ColdFusion here (for the time being), and yes, it does suck. A lot. Its largest points of failure are interoperability and extensibility (mostly due to a widespread apathy in the programming community; it supports extension through Java classes/beans/whatever, but not many people spend the time to build the kinds of things PHP has available), and also huge, gaping language deficiencies. But more about that later.

FP!

Welcome to my fresh new blog where I will wax philosophical on topics relating to my frustration with ColdFusion, the geekery of bash one-liners in OS X, and my friends’ exciting website projects.

Nobody is ever going to read this thing, but I can try, right?