OpenFST patch for building with CMake.

May 12th, 2008

Here is a patch against OpenFST (beta-20070801) that adds CMake build support. It pretty much duplicates the existing makefiles. I tested on Mac OS X 10.5.2 (Leopard) and Ubuntu 8.04 (64-bit). While in principle builds for Visual Studio on Windows can now be generated, the OpenFST code itself is not yet portable. I didn’t try Cygwin, but it might work.

(Note: The p2 patch adds header installation to the earlier version.)

No comment

May 12th, 2008

Painful chuckle with the last post. How many ups and downs since then, eh? Suffice to say that my hopes and predictions about both the Super Bowl and Super Tuesday were, uh, wrong (not that the former mattered too much to me). At this point, I just want the damn primaries to end. Please?

Permission to dream

February 3rd, 2008

Too much work to do, and today’s the one time of the year I allow myself to waste a day on watching football (ok, one of two times, I watch USC bowl games, too :-). But a brief post anyway.

The terrible thing about being leftward-leaning in this country is watching the quadrennial self-destruction of the Democratic Party, nominating yet another New England intellectual who can dish off stacks of well-reasoned policy documents, but can’t connect with anyone.

The one living exception is Bill, but we’ve been reminded all too well in the past few weeks of his character and the divisive environment he helps create and thrives upon. And for those nostalgic for the Clinton Nineties, are they really worth reliving? Or just better than the past seven years?

We’ve been told for at least four years that Hillary was coming, and that she was inevitable, which I resented right off the bat. Add to that the fact that, while she certainly possesses her husband’s brains and probably better policy, she possesses none of his charisma and common touch, but has definitely acquired a good measure of his sleaze and self-righteousness (ok, maybe she already had some of the latter) through years of osmosis and simple association. At times they both seem indignant and even incredulous that anyone would dare challenge their birthright to return to the throne.

Now, we as the Good Democratic Herd are supposed to make the usual calculations and triangulations and support The Party’s Anointed. Vote HHH, not RFK. Mondale and Dukakis, Gore and Kerry and … Billary. And while some years there have been other, more inspiring candidates, who may or may not have done better in November, the Fear always whispered, “No, too X, too Y. We need neutral, we need safe.”

Well, we may not win this one, either. I’m not saying Obama is flawless or the Chosen One, but this is the first candidate in my life that I’m genuinely excited about, not just tacking away from this or that. And so I’m elated that this month, finally, millions of others are feeling the permission to dream:

[Image: Dem-Polls-Jan2008]

Isolationist policy

December 16th, 2007

Word analysis in isolating languages such as Vietnamese and Chinese is generally viewed as a problem of segmentation: given a sentence of tokens we aim to find the boundaries and segment the sequence into words.

From a morphology perspective, however, we might say that the text is over-segmented into a string of morphs, so that our task is to group morphs into words. The nice aspect of this view is that it fits what I would call sequential models of morphological analysis for non-isolating languages, where we have a lattice of possible segmentations (from some hypothesis generator) and we aim to find the best path. The isolating case just has a more constrained graph topology, with N+1 states for the N tokens and possible arcs limited by the maximum length, in tokens, of words in the language.
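
A minimal sketch of the lattice view in Python, under stated assumptions: arc weights are costs (say, negative log probabilities), and the function names, the toy lexicon, and its numbers are all made up for illustration. It is not any particular system's algorithm, just a shortest-path search over the N+1 states.

```python
import math

def best_segmentation(tokens, arc_cost, max_len):
    """Lowest-cost path through the segmentation lattice.

    States 0..N sit between tokens; an arc (i, j) groups tokens[i:j]
    into one word, with j - i <= max_len. Assumes at least one path
    with finite cost exists.
    """
    n = len(tokens)
    best = [math.inf] * (n + 1)   # best[j]: cheapest cost to reach state j
    back = [0] * (n + 1)          # back[j]: predecessor state on that path
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            cost = best[i] + arc_cost(tuple(tokens[i:j]))
            if cost < best[j]:
                best[j], back[j] = cost, i
    # Follow back-pointers from the final state to recover the words.
    words, j = [], n
    while j > 0:
        i = back[j]
        words.append(tuple(tokens[i:j]))
        j = i
    return list(reversed(words)), best[n]


# Hypothetical toy lexicon: each entry is the cost of one candidate word.
TOY_COSTS = {
    ("a",): 2.0, ("b",): 2.0, ("c",): 2.0,
    ("a", "b"): 1.0, ("b", "c"): 3.5,
}

def toy_cost(word):
    # Unknown groupings get infinite cost, i.e. no arc in the lattice.
    return TOY_COSTS.get(word, math.inf)

words, total = best_segmentation(["a", "b", "c"], toy_cost, max_len=2)
print(words, total)
```

The `max_len` bound is what keeps the graph topology constrained: each state has at most `max_len` incoming arcs, so the search is linear in the number of tokens.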

It’s also interesting that we run into the same type of problem here as with the analysis of non-isolating languages. Specifically, if the arc weights are something like probabilities, we will be biased toward longer words and thus fewer arcs, unless the costs of longer arcs are suitably balanced. With token classes and some labeled training data, we certainly might build highly accurate models. For a more unsupervised approach, however, seeking some intrinsic measure for an optimal segmentation, we run into the usual issue that there is no universally correct trade-off between coarse and fine analyses. We are always left with parameters, hidden or overt, that guide our models one way or another toward our preferences, which are, to a large degree, quite arbitrary.
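
To make the bias concrete, here is a small Python sketch with made-up probabilities. The `length_penalty` parameter is a hypothetical knob, standing in for the kind of hidden or overt parameter described above; nothing here picks the "right" value for it.

```python
import math

def path_cost(segmentation, word_prob, length_penalty=0.0):
    """Negative log probability of a segmentation, plus an optional
    penalty on every token beyond the first in each word."""
    cost = 0.0
    for word in segmentation:
        cost += -math.log(word_prob[word])
        cost += length_penalty * (len(word) - 1)
    return cost

# Made-up probabilities for illustration only. The two-token word is
# half as likely as either single-token word.
word_prob = {("a",): 0.2, ("b",): 0.2, ("a", "b"): 0.1}

coarse = [("a", "b")]        # one long arc
fine = [("a",), ("b",)]      # two short arcs

for penalty in (0.0, 1.5):
    c = path_cost(coarse, word_prob, penalty)
    f = path_cost(fine, word_prob, penalty)
    print(penalty, "coarse" if c < f else "fine")
```

With no penalty the single long arc wins simply because it multiplies in one probability instead of two; a large enough penalty flips the preference, and where to set it is exactly the arbitrary part.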

At long last: Mac copy, KDE paste

November 10th, 2007

I happen to be crazed enough to run KDE on my Mactop. Actually, I run it for one application: Kile. Yes, Kile. Though I would prefer to use something like LyX, the main reason I suffer LaTeX is the convenience of having your papers, and maybe even your thesis, formatted to specification automatically. LyX doesn’t work on LaTeX files directly, however, so I am paranoid about whether it will format everything exactly the same as LaTeX, and anyway I find LaTeX hacking in LyX a bit clumsy. So at some point I switched to working with LaTeX directly, but the general philosophy of LaTeX editor user interface design seems to be: Let’s stuff in as many thousands of little toolbar buttons as we can for all the obscure commands, environments, and, especially, mathematical symbols.

Except for Kile. Which still sucks since it doesn’t have WYSIWYG or at least real-time preview, but at least they figured out that dynamic command completion is really frigging useful. So that’s what I use.

But there’s one big problem: I can’t copy from a Mac application to Kile. Or any KDE application for that matter. As the following post discusses

http://lists.macosforge.org/pipermail/macports-users/2007-July/004451.html

you can paste to other X11 applications using the mouse middle button (option-click) (see http://developer.apple.com/qa/qa2001/qa1232.html), but it doesn’t work for KDE apps. Unfortunately the post hasn’t been answered, but I finally figured out a not-so-odious work-around: Just run xclipboard.

Yes, old, crusty, xclipboard in all of its Athena widget glory scoops up your native Mac copying activity and magically makes it available to KDE applications with the usual Control-v.

So that’s why today’s UNIX GUI distros still dump hundreds of old and often redundant X applications. Sometimes you actually need them!

Sneaking some thoughts

November 9th, 2007

The Boobear is napping, so a few moments of rumination are allowed.

I’ve banned myself from working on morphology learning right now, because it’s clearly risky, with a lot of work needed, and that looming January 10th ACL deadline says I need to get cracking on the morphemic MT paper until it’s nearing a finished state. But I feel anxious to spend some time at least thinking about the morphology, so I’ll grant an exception…

(On the other hand, the MT paper looks more hopeful for EMNLP than ACL; the morphology work is more of an ACL flavor, but I’m losing confidence that it will be ready in time.)

As I’ve already noted, the decision to do morphology is fraught with a different kind of peril compared to the quixotic grammar work. While the morphology is much more likely to achieve results in the nearer term, it is also much more likely to achieve similar results for others, and in fact there have been a gazillion papers on this in the past 8-10 years. I need to do more prior-work reading, but it seems probable that even the weakly supervised space is a little crowded already. And the state-of-the-art unsupervised systems are crafted by hardy Finns who live and breathe agglutinative morphology every day, and if my (cheating) system can’t even match that, well, why bother?

Plan B

November 2nd, 2007

Current processing cycles are being devoted to the following basic question: Should I try to straddle two difficult topics, morphology and syntax, for my impending quals, or go for expediency and stick with one, staying the course on morphology?

Put more cynically, should I cling to that last idealistic drop of PhD motivation in my body, the drive to do something novel and exciting, the last, tenuous hope for a home run that will make the last 9 1/2 innings of drudgery seem worthwhile? Or just accept that those dreams are done and that now all I want is the paper reward, that piece of parchment suitable for framing and the little acronym that says: resistiré.

The Dreamer declaims the following:

  1. These ideas are exciting! They are novel, with nice linguistic foundations (albeit unorthodox), and could be a strong development in unsupervised and low-resource grammar learning, and in MT.
  2. The high bar for the quals is a bit self-imposed.
    1. The two morphology chapters plus the syntax smoothing (probably feasible for the spring) are sufficient for the quals, so I can still meet that deadline.
    2. With the smoothing completed, the grammar transformations are mostly done.
    3. Then I spend most of the final year on unsupervised learning, with the MT results limited to the most straightforward applications of it.
  3. If not syntax, what then? What novel work would you do in morphology to fill out a thesis? Especially since everyone and their cousin has taken a pass at it!

To which The Pragmatist retorts:

  1. They are exciting, but extremely speculative and risky. If you’d developed them in year two or even three, that would have been a great time to try something big. But we’re starting year five now, and it’s time to finish, not to finesse.
  2. Yes, but then you push more work to do after the quals, and do you really want to be here past May 2009?
    1. A bit hopeful, assuming mountains of SpeechLinks work don’t come crashing down. Also, no chance for a COLING paper, because it’s pretty clear that the current papers will occupy me fully through January 10th.
    2. Yes, but again no small piece of work. 6 months is a safe estimate, so that takes us through the NAACL deadline, without starting on the unsupervised learning, which is harder!
    3. I’d call it 18 months after the quals. Want to stay through December?
  3. Ah, you have me there a bit, but I can come up with something. Just watch me….

Containing the morphology problem

October 29th, 2007

Like every aspect of human languages (or maybe better, human intelligence), modeling morphology has turned out to be much trickier than at first thought. Thus the extensive list of background work…

The current list of modeling and learning possibilities:

  1. EM.
    • pro: Simple; flows from nice, generative model.
    • con: Greatly biased toward classes with fewer inflections (especially atomic), since the affix vector distribution is less sparse. Different hypotheses within a (class, stem) group do contribute to each other’s likelihood, but this doesn’t translate across classes.
    • potential extensions:
      1. Putting a prior on classes may help, and very simple to try. Could go Bayesian, but I am skeptical of how much more that will help.
      2. Need to experiment with initialization.
      3. Higher-order model will help, but trickier in the setting where we analyze only a few word classes and assume all other words are atomic, which makes prior context less informative. One solution for that might be hierarchical class splitting of atomics, cf. Petrov.
      4. Try to model some of the more global effects that we want. To balance the varying affix sparseness, we might want to focus on maximizing the (class, stem) likelihood, and after that, we have differing expectations of a specific affix vector based on the size of the space for that class. There’s also a global expectation that the ratios between different affix values will remain roughly constant across a class. Not sure how to model these effects with a pure stochastic model; a really scary option is to try to adapt M-estimation to unsupervised learning and use the EM model as q0.
  2. Contrastive Estimation
    • pro: Allows arbitrary features, in principle may avoid the sparse affix space problem.
    • con: The only obvious neighborhood function, using inflections of the (class, stem) hypotheses, will generate many valid words. Uncertain how this will work, then, especially having the tagging problem at the same time.
    • possible extensions:
      1. Initialization and regularization will be very important.
      2. Multiple neighborhoods, maximized jointly.
      3. Alternate objective functions?
      4. Adding support and/or conflict sets.
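
A rough Python illustration of the EM bias noted under item 1. It assumes a factorization of the form p(class) p(stem | class) p(affix vector | class) and a uniform distribution over each class's affix vectors; both are simplifying assumptions for the sake of the arithmetic, not necessarily the actual model, and all the numbers are made up.

```python
import math

def analysis_log_prob(p_class, p_stem_given_class, n_affix_vectors):
    """Log probability of one (class, stem, affix vector) analysis under an
    assumed factorization p(class) * p(stem | class) * p(affixes | class),
    with p(affixes | class) uniform over the class's affix vectors."""
    return (math.log(p_class)
            + math.log(p_stem_given_class)
            - math.log(n_affix_vectors))

# Made-up numbers: identical class and stem probabilities, so only the
# size of the affix space differs between the two analyses.
atomic = analysis_log_prob(0.5, 0.01, n_affix_vectors=1)   # no inflections
verb = analysis_log_prob(0.5, 0.01, n_affix_vectors=50)    # 50 affix vectors

# The gap equals log(50): the head start the atomic analysis gets.
print(atomic - verb)
```

Under these assumptions the atomic analysis starts out ahead by the log of the other class's affix-space size, before the data has had any say, which is one way to read why a prior on classes might help.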

Desiderata for a morphology learner

October 26th, 2007

Definitions

First, we define some concepts:

  • stem
    The underlying base of a non-compound word, possibly corresponding to the word’s lemma form or root, but we view the choice of stem as a modeling decision. We use the word “stem” and not “root,” to include more shallow types of analysis that may seek only to identify the affixes (and what’s left behind), and not to recover the fundamental word root.
  • stem class
    The type of the stem, often related to the familiar notion of part-of-speech, which determines which affixes may join with it. For more limited morphological analysis, we may constrain the stem and final word classes to be equal.
  • surface affix
    What is normally referred to as an affix, realized as zero or more characters of text.
  • functional affix
    An abstract affix that expresses a single underlying function of a surface affix, e.g. Number, Person, Tense, Aspect, etc., which may determine the final word class. Note that multiple functional affixes often map to a single surface affix.
  • affix vector
    A value in the space (either surface or functional) of all possible affix combinations allowed in a language.
  • affix position
    A set of one or more affixes that are mutually dependent and thus modeled jointly, occupying a single index in an affix vector. The concept of a position is also useful for specifying the mapping between functional and surface affixes. For example, in Spanish we may have multiple function positions for person, tense, etc. mapping to a single, “verb suffix” surface position.
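
The definitions above could be laid out as a data structure roughly like the following Python sketch. The class, field, and function names are hypothetical, and the tiny suffix table covers just two Spanish present-tense endings for illustration.

```python
from dataclasses import dataclass

@dataclass
class Analysis:
    stem: str                  # underlying base; the choice is a modeling decision
    stem_class: str            # determines which affixes may attach
    surface_affixes: tuple     # surface affix vector: one entry per surface position
    functional_affixes: dict   # functional affix vector: position -> value

# Hypothetical table for a single "verb suffix" surface position.
# One surface affix realizes several functional affixes at once.
VERB_SUFFIX = {
    "o": {"Person": "1", "Number": "sg", "Tense": "present"},
    "as": {"Person": "2", "Number": "sg", "Tense": "present"},
}

def analyze_verb(stem, suffix):
    """Build an analysis with one surface position and several
    functional positions mapped onto it."""
    return Analysis(stem, "verb", (suffix,), dict(VERB_SUFFIX[suffix]))

def surface_form(analysis):
    """Concatenate the stem with its surface affixes."""
    return analysis.stem + "".join(analysis.surface_affixes)

hablo = analyze_verb("habl", "o")
print(surface_form(hablo), hablo.functional_affixes)
```

Keeping the surface and functional vectors as separate fields mirrors the distinction drawn in the definitions: the surface vector here has a single position, while the functional vector has three.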

Research like it’s 1999

October 22nd, 2007

After a long time away, I finally dragged myself back to ISI for an NL seminar. The talk itself, by Slav Petrov, was really interesting, about his work on hierarchically split PCFGs. Basically you can view it as an extension of Klein and Manning’s unlexicalized parsing, except the classes are learned automatically by splitting them repeatedly. For the best performance, many optimizations are needed, such as pruning unnecessary splits (merging them back), smoothing rule probabilities over unsplit classes, and searching coarse-to-fine with a tiered set of models projected back up from the finest-split model. And AFAIK it’s now the best generative model for parsing English. Very cool.

Inevitably, though, haunting ISI again (or the other way around) gets me thinking about the many twists and turns that have taken me where I am, which is spinning my wheels in the mud after four frigging years and doing research that I most logically should be pursuing back there. So all that, in the middle of a major research funk, along with seeing yet another really smart student from Berkeley/Stanford/Penn doing really smart stuff, was plenty to get me downtrodden. But for the final pièce de résistance, I was talking with a certain NL professor whom I had seriously considered working with, and I made a joke about how it would have been a whole lot easier path to doing MT work (mind you, I don’t really consider myself an MT researcher, but it was a joke) if I’d just stayed at ISI. To which he replied, “Yeah, because we’re years ahead of [your lab].”

Ouch. Thanks.

But now, after a few days of recovery, I’m not so worried. In fact, I’ve decided it’s my angle, my brilliant niche. I’m going to do historical NLP! I’m going to do the very best of 2003-style NLP. Or maybe 1999. I can beat anything they did back then. Just watch me go.