Archive for the 'Uncategorized' Category

Dexter: See-Through

“I’m good with parents. The key is to simply think of them as aliens from a distant universe.” —Dexter Morgan

The Other Coast

I spent the last week in California, first in Yosemite National Park and then in Santa Cruz. My friend and I went specifically to photograph, but I made sure we spent time slacking off, too. Santa Cruz is such a beautiful place, I can’t imagine why anyone would want to leave. I mean, aside from the cost of living being among the highest in the country and the place being overrun with tourists year ’round…

This year happens to be the centennial celebration of the “Boardwalk” amusement park—one of the main attractions to the city outside of surfing—so there was much ado, much pomp, much circumstance. Also, the place was engorged with visitors from Friday afternoon until Sunday when we left. They were riding rides, eating food, pushing strollers, the whole nine yards. The weather was absolutely gorgeous and it pained me to get off the plane yesterday at Logan International in Boston to a cheery, 47-degree morning.

I managed to eat six times my weight in food (by informal calculation), walk tens of miles in only a couple of short days, take about 2,000 photographs all told, and pretty badly sunburn both of my arms and my neck (go me!). I posted a few photos from the UCSC arboretum on my photo blog, and I’ll be posting a lot more very soon.

They What Said It Best

Brian Guthrie blogged today about an article in The Economist about the VT tragedy. The Economist is a publication I have a great deal of respect for, but in this case they’re flying their European socialist flag as high as ever. Here is the quote from the article that Brian used:

When it comes to most dangerous products—be they drugs, cigarettes or fast cars—this newspaper advocates a more [classically] liberal approach than the American government does. But when it comes to handguns, automatic weapons and other things specifically designed to kill people, we believe control is necessary, not least because the failure to deal with such violent devices often means that other freedoms must be curtailed. Instead of a debate about guns, America is now having a debate about campus security.

Later in the same article, they state:

Had powerful guns not been available to him, the deranged Cho would have killed fewer people, and perhaps none at all.

Let us not confuse the availability of guns with the legality of gun ownership. For reference, please note the availability of illegal drugs. Tightening regulations is one thing, but when reasonable firearms for self-defense (e.g. Glocks, non-automatic weapons, non-explosive weapons) are withheld from all but the police, you merely catalyze the growth and ubiquity of the weapons black market, which peddles its wares primarily to those with the urge to do real harm.

I’m a proponent of gun control. I do believe that the way we manage the sale and distribution of (especially) handguns in this country deserves to be revisited. But I also believe that the ability of an individual to buy a gun for self-defense, without any intention or expectation of using it in public, is a boon to the overall safety of the country (or any country). In my mind, the critical issue to be addressed is how to better regulate the sale of arms to people like Cho, who clearly demonstrated (non-criminal) characteristics that would not inspire trust in any gun seller. Because information about those characteristics was not readily available to the gun seller, the seller had no way to know what his intentions might be.

The issue of gun control is divisive, as it sets forth premises that are so broad, socially, that there is scarcely a way to determine their accuracy. It was precisely these types of anecdotal premises that spurred the “war on drugs,” and as history has demonstrated, politicians are seldom very adept sociologists.

Porter Stemming Algorithm

Stemming refers to reducing words to their stems. Wikipedia says, “Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.” (Wikipedia, Stemming).

Word stemming is used extensively as part of search algorithms to increase the number of relevant, though not identical, word results that can be recovered. Using stemming, a search for the word “nursing” will also match the word “nurse” because both have a stem of “nurs.”
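To make the idea concrete, here is a toy sketch of stem-based matching. It is not the Porter algorithm, which applies several ordered rule phases with extra conditions; the suffix list and the names toyStem and sameStem are made up purely for illustration.

```javascript
// Toy suffix-stripper for illustration only. This is NOT the Porter
// algorithm; the suffix list below is a hypothetical, simplified rule set.
function toyStem(word) {
    var w = word.toLowerCase();
    // Longer suffixes are tried first so "es" wins over "s".
    var suffixes = ['ing', 'es', 'ed', 'e', 's'];
    for (var i = 0; i < suffixes.length; i++) {
        var suffix = suffixes[i];
        // Only strip when at least three characters of stem remain.
        if (w.length - suffix.length >= 3 && w.slice(-suffix.length) === suffix) {
            return w.slice(0, -suffix.length);
        }
    }
    return w;
}

// Two words "match" when they reduce to the same stem.
function sameStem(a, b) {
    return toyStem(a) === toyStem(b);
}
```

With these rules, both “nursing” and “nurse” come out as “nurs”, so sameStem treats them as a match even though neither word contains the other.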

While working on a “live search” system for our website, I realized that the results being returned were not optimal for precisely this reason. We run a (very expensive) search package for our normal searching systems, but this was a one-off search for a small feature of the site, so I didn’t have access to the full search index and its capabilities. In searching for a reasonable stand-in for that stemming system, I came across the Porter Stemmer.

The Porter Stemmer is a word stemming algorithm developed in the late ’70s by Martin Porter, which has gained a good deal of attention over the years. On the site linked above, you can find implementations of this algorithm in 17 languages!

What I decided to do was implement the stemming portion in JavaScript, as part of the “live search,” and then let our back-end do its normal SQL search based on the word stems. Because stems generated by the Porter Stemmer are almost invariably truncated versions of the input words, and because our SQL-based matching uses simple T-SQL wildcards (WHERE foo LIKE '%bar%'), it works quite well without any back-end additions.
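As a rough sketch of that hand-off, a stem can be turned into a “contains” pattern before it goes to the back-end. The function names escapeLike and toLikePattern are hypothetical, and the sketch assumes the pattern is passed to the query as a bound parameter rather than pasted into the SQL string.

```javascript
// Escape the characters T-SQL LIKE treats specially by wrapping each one
// in square brackets, e.g. "%" becomes "[%]" and "[" becomes "[[]".
function escapeLike(term) {
    return term.replace(/[\[%_]/g, '[$&]');
}

// Build a "contains" pattern from a stem. Assumption: the result is bound
// as a query parameter, never concatenated into the SQL text.
function toLikePattern(stem) {
    return '%' + escapeLike(stem) + '%';
}
```

The stem “nurs” becomes the pattern %nurs%, which matches rows containing either “nursing” or “nurse” because the stem is a substring of both.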

I used the JavaScript implementation of the Porter Stemmer written by someone identified only as Andargor, and bootstrapped it into the page with a custom Prototype-based class that I call, simply, Stemmer.

You can give it a try and view the code on my Porter Stemmer test page.

My class wrapper includes options to turn stemming off for quoted words and for capitalized words. Instantiating the class is dreadfully easy: the only required parameter is the name of the stemming function. In the Porter Stemmer JavaScript implementation, this function is called “stemWord”.

var stemmer = new Stemmer({
    stemQuoted: false,
    stemCaps: true,
    stemFunction: stemWord
});  

Providing the word-stemming function as a parameter allows you to use any stemming implementation you wish without messing around inside of the Stemmer class itself. The other two options, stemQuoted and stemCaps, are optional and default to false and true, respectively, causing quoted words not to be stemmed, but capitalized words to be stemmed as usual.
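For anyone curious how options like these could fit together, here is a minimal sketch in plain JavaScript. It is a stand-in, not the actual Prototype-based class: the name SketchStemmer, the tokenizing regular expression, and the way quoted phrases are handled are all assumptions.

```javascript
// Hypothetical stand-in for the wrapper described above; the real class is
// Prototype-based and may work differently inside.
function SketchStemmer(options) {
    if (!options || typeof options.stemFunction !== 'function') {
        throw new Error('stemFunction is required');
    }
    this.stemFunction = options.stemFunction;
    // Defaults as described: quoted words are left alone, capitalized words are stemmed.
    this.stemQuoted = options.stemQuoted === undefined ? false : options.stemQuoted;
    this.stemCaps = options.stemCaps === undefined ? true : options.stemCaps;
}

SketchStemmer.prototype.stem = function (phrase) {
    var self = this;
    // A token is either a double-quoted run or a run of non-whitespace.
    var tokens = phrase.match(/"[^"]*"|\S+/g) || [];
    return tokens.map(function (token) {
        var quoted = token.length > 1 &&
            token.charAt(0) === '"' &&
            token.charAt(token.length - 1) === '"';
        if (quoted) {
            if (!self.stemQuoted) {
                return token; // leave the quoted phrase untouched
            }
            // Stem the words inside the quotes, then restore the quotes.
            return '"' + self.stem(token.slice(1, -1)) + '"';
        }
        var first = token.charAt(0);
        var capitalized = first !== first.toLowerCase();
        if (capitalized && !self.stemCaps) {
            return token; // leave capitalized words untouched
        }
        return self.stemFunction(token);
    }).join(' ');
};
```

Because the stemming function is injected, the sketch never needs to know which algorithm it is calling; passing stemWord or any other single-word stemmer works the same way.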

Just for giggles, there is also a function within my class called stemUpdate, which automatically updates the innerHTML of the named element with the stemmed result when it’s called. I originally used it for the demonstration page, and though I question its overall usefulness, it’s there for you. The syntax is:

var stemmer = new Stemmer({ stemFunction: stemWord });
stemmer.stemUpdate('element_or_ID', 'words to stem');  

Just getting the stemmed result back is as easy as:

var stemmer = new Stemmer({ stemFunction: stemWord });
var result = stemmer.stem('a phrase to stem');  

Hopefully this will be useful to someone else out there!

XMP and Ruby

I am an amateur photographer in (most of) my spare time. When I’m not actually making photographs or blogging about photography, I tend to explore some of the eccentric intersections of photography and technology. One of those intersections is Adobe’s XMP, or Extensible Metadata Platform specification. XMP was conceived as a drop-in replacement for IPTC (literally “International Press Telecommunications Council,” but also used improperly to refer to the IPTC’s IIM, or Information Interchange Model), and allows you to store a bunch of information in the headers of graphics files.

IIM was put together around 1991, but its development was frozen in 1997 after XML started to take hold. XMP uses a subset of the XML format RDF (Resource Description Framework; I feel like there are too many frameworks in this post already), so it’s a lot easier (and more fun) to work with… Once you get it out of the file.

In order to extract the XMP data, I used a little tool available from a kind soul at the W3C. Specifically, I compiled the jpeg-xmp-2.2 package (yes, in OS X; it built with about 1,000 warnings but works fine) and got my filthy mitts on the rdjpgxmp tool, which simply reads the XMP data from a JPEG and spits it out in raw RDF format.

Here’s a sample of what my RDF file looks like. That’s a dump of the XMP from a photo I exported from Adobe Lightroom with my basic copyright information attached. As you can see, there is a lot more than just copyright information embedded in it. They manage to work in some of the stuff typically carried by Exif as well as the “IPTC Core,” the junk that was formerly only found in the IIM payload, which Adobe has graciously provided a synchronization process to preserve.

The output of rdjpgxmp has a ton of blank lines at the end for some reason, so we can use a bit of simple bash trickery to do away with that.

bash $ rdjpgxmp YourJPEGFile.jpg | grep -v "^ *$"  

The -v switch tells grep to only display lines that don’t match the pattern, and if you look closely you’ll see that there’s a space in there, so the pattern matches lines that are empty or contain nothing but spaces.
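If the shell isn’t handy, the same filter is easy to mimic elsewhere. Here is a small JavaScript sketch of it; the function name dropBlankLines is made up, and it assumes the text uses plain newline line endings.

```javascript
// Same idea as `grep -v "^ *$"`: keep every line that is NOT made up
// solely of zero or more spaces. Assumes "\n" line endings.
function dropBlankLines(text) {
    return text.split('\n').filter(function (line) {
        return !/^ *$/.test(line);
    }).join('\n');
}
```

Like the grep pattern, this only treats spaces as blank, so a line holding just a tab would survive.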

So, now that we have a fairly clean RDF output from the shell, we can get it into a Ruby script and mess around with it. I just used a simple shell execute to get the output of rdjpgxmp and then used REXML to get down to the title of the photograph.

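require 'rexml/document'
include REXML

# Run rdjpgxmp and strip the blank lines; the backticks capture the output.
xmlstring = `~/bin/rdjpgxmp YourJPEGFile.jpg | grep -v "^ *$"`
doc = Document.new(xmlstring)

# XPath down to the text of the first dc:title entry and print it.
$stdout.print doc.root.elements["//dc:title/rdf:Alt/rdf:li/text()"]
$stdout.print "\n"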

Pretty sweet, yes? Using REXML’s fabulous support for XPath syntax, it’s easy to get right down to the correct element. Now it would be a simple thing to write scripts to pull different bits of data, insert them into databases, rename the files themselves (perhaps based on their XMP title?), or do anything else you can think of. There is also a companion tool, wrjpgxmp that allows you to write to the XMP data. There is a wealth of potential here.

Here’s more information about everything:

Maximum URL Length

My colleague at work is faced with handling URLs of a particularly great length. In our system, we don’t usually encounter URLs longer than 250 characters, and even our custom URL shortening system only accommodates URLs 1,000 or fewer characters in length. The question then arose of how long is too long when it comes to URLs? Should we be expected to store URLs as long as a SQL text field (2,147,483,647 characters)?

So I hopped around on the Internets (all three of them!) for a while, and found this neat reference on Boutell.com. This individual tested entering URLs of different lengths into the major browsers and also requesting them from both major web servers to see what would happen. Here’s the best part, for your enjoyment and edification.

The default limit [ in Microsoft IIS -ed ] is 16,384 characters (yes, Microsoft’s web server accepts longer URLs than Microsoft’s web browser). This is configurable.

I guess Microsoft’s inter-departmental communications aren’t quite as robust as one would hope. Who has a 16,384 character URL, anyway?
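For what it’s worth, checking a URL against a set of length caps before storing it takes only a few lines. This is just a sketch: the two limits are the figures mentioned above (our shortener’s 1,000 and the quoted IIS default of 16,384), while the names LIMITS and exceededLimits are invented for the example.

```javascript
// Limits taken from the figures in this post; both are configurable in
// practice, so treat these values as placeholders.
var LIMITS = {
    shortener: 1000,   // our URL shortening system
    iisDefault: 16384  // default IIS limit quoted above
};

// Return the names of every limit the URL exceeds (empty array if none).
function exceededLimits(url) {
    return Object.keys(LIMITS).filter(function (name) {
        return url.length > LIMITS[name];
    });
}
```

An empty result means the URL fits everywhere; anything else names the systems that would reject or truncate it.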

Sharpton Misunderstands Free Speech (Again)

When asked to comment (for the six thousandth time) on the Don Imus debacle, the Rev. Al Sharpton, described by CNN as a “civil rights activist,” said (and I quote):

Somewhere we must draw the line in what is tolerable in mainstream media. We cannot keep going through offending us and then apologizing and then acting like it never happened. Somewhere we’ve got to stop this.

Stop what, Reverend? Stop negotiating the response to each infraction of your personal rules of permissible speech? Stop determining what people are allowed to say on the radio and on TV through democratic and free-market-based consensus?

I suppose we should ask the Reverend to make us a list of what people should be permitted to say and censor everything else. Yes, that sounds like the American way to do it.

On the horizon I can barely make out the death of satire.

The Anniversary of the Repeal of Prohibition

Today, April 7th, is the 74th anniversary of the repeal of prohibition, that scourge on the freedom of Americans to indulge themselves in the firewater and forget the sorrows that daily consume them.

I, for one, will be celebrating this important day the way all nerds should: drinking myself into a hole from which I can only escape by following that most terrible of all roads, the Road to Hangover.

Have a good one, nerds!

Microsoft Log Parser 2.2

Thanks to Aaron Johnson’s pretty swell article about the Microsoft Log Parser (included in the IIS admin. toolkit and also available for separate download), I am now well on my way to Microsoft log-parsing heaven. If such a place exists.

Here at work we are faced with a similar problem to Aaron’s (no, the other one) where we are required to generate server log-based statistics that aren’t covered by standard web stats packages like AWStats et al. We have to count occurrences of specific URL variables within the requests and things of that nature. Hopefully Log Parser 2.2 with its completely awesome SQL interpreter will be able to give us the data we need in a format we can use.
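To show the kind of count I mean, here is a minimal sketch in plain JavaScript. It is not Log Parser’s query language, just the same idea by hand: the function name countQueryValues is made up, and it assumes the query-string field has already been pulled out of each log line.

```javascript
// Count how often each value of one query-string variable appears.
// `queries` is an array of raw query strings such as "page=home&ref=rss";
// extracting that field from each log line is left to the caller.
function countQueryValues(queries, name) {
    // No prototype, so values like "constructor" cannot collide with
    // inherited properties.
    var counts = Object.create(null);
    queries.forEach(function (query) {
        var value = new URLSearchParams(query).get(name);
        if (value !== null) {
            counts[value] = (counts[value] || 0) + 1;
        }
    });
    return counts;
}
```

Requests that don’t carry the variable at all are simply skipped, so the totals only reflect hits where it was present.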

Yes, we use ColdFusion here (for the time being), and yes, it does suck. A lot. Its largest points of failure are interoperability and extensibility (mostly due to a widespread apathy in the programming community; it supports extension through Java classes/beans/whatever, but not many people spend the time to build the kinds of things PHP has available), and also huge, gaping language deficiencies. But more about that later.

FP!

Welcome to my fresh new blog where I will wax philosophical on topics relating to my frustration with ColdFusion, the geekery of bash one-liners in OS X, and my friends’ exciting website projects.

Nobody is ever going to read this thing, but I can try, right?