Archive for October, 2007

Containing the morphology problem

Monday, October 29th, 2007

Like every aspect of human languages (or maybe better, human intelligence), modeling morphology has turned out to be much trickier than at first thought. Thus the extensive list of background work…

The current list of modeling and learning possibilities:

  1. EM.
    • pro: Simple; flows from nice, generative model.
    • con: Strongly biased toward classes with fewer inflections (especially atomic ones), since their affix vector distributions are less sparse. Different hypotheses within a (class, stem) group do contribute to each other’s likelihood, but this doesn’t carry over across classes.
    • potential extensions:
      1. Putting a prior on classes may help, and is very simple to try. Could go Bayesian, but I am skeptical of how much more that would help.
      2. Need to experiment with initialization.
      3. A higher-order model will help, but it is trickier in the setting where we analyze only a few word classes and assume all other words are atomic, which makes the prior context less informative. One solution for that might be hierarchical class splitting of atomics, cf. Petrov.
      4. Try to model some of the more global effects that we want. To balance the varying affix sparseness, we might want to focus on maximizing the (class, stem) likelihood, and after that, we have differing expectations of a specific affix vector based on the size of the space for that class. There’s also a global expectation that the ratios between different affix values will remain roughly constant across a class. Not sure how to model these effects with a pure stochastic model; a really scary option is to try to adapt M-estimation to unsupervised learning and use the EM model as q0.
  2. Contrastive Estimation
    • pro: Allows arbitrary features, in principle may avoid the sparse affix space problem.
    • con: The only obvious neighborhood function, using inflections of the (class, stem) hypotheses, will generate many valid words. It is unclear how well this will work, especially with the tagging problem to solve at the same time.
    • possible extensions:
      1. Initialization and regularization will be very important.
      2. Multiple neighborhoods, maximized jointly.
      3. Alternate objective functions?
      4. Adding support and/or conflict sets.
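
To make the EM bias toward atomic classes concrete, here is a minimal Python sketch. It is not the actual model: it assumes a uniform distribution over each class's surface affixes, ignores stem probabilities entirely, and uses made-up names (`AFFIXES`, `affix_prob`, `posterior`) and a toy English example. Under those assumptions an atomic analysis always gets affix probability 1, while a class with four affixes gets 1/4, so the E-step posterior leans atomic unless a class prior pushes back.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Analysis:
    """One hypothesis for a word: a stem class, a stem, and a surface affix."""
    stem_class: str
    stem: str
    affix: str  # "" means no affix


# Hypothetical toy inventory: the atomic class allows only the empty affix.
AFFIXES: Dict[str, List[str]] = {
    "ATOMIC": [""],
    "VERB": ["", "s", "ed", "ing"],
}


def affix_prob(stem_class: str, affix: str) -> float:
    """Assumed uniform affix distribution within a class."""
    options = AFFIXES[stem_class]
    return 1.0 / len(options) if affix in options else 0.0


def posterior(analyses: List[Analysis], class_prior: Dict[str, float]) -> List[float]:
    """E-step style posterior over competing analyses of a single word."""
    scores = [
        class_prior[a.stem_class] * affix_prob(a.stem_class, a.affix)
        for a in analyses
    ]
    total = sum(scores)
    return [s / total for s in scores]


candidates = [
    Analysis("ATOMIC", "walked", ""),
    Analysis("VERB", "walk", "ed"),
]

uniform = posterior(candidates, {"ATOMIC": 0.5, "VERB": 0.5})
skewed = posterior(candidates, {"ATOMIC": 0.2, "VERB": 0.8})
print(uniform, skewed)
```

With equal class priors the atomic reading of "walked" takes about 0.8 of the posterior; a 0.2/0.8 prior in favor of the verb class brings the two readings level. That is the sense in which a class prior is a cheap first thing to try, though it does nothing about the other global effects listed above.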

Desiderata for a morphology learner

Friday, October 26th, 2007

Definitions

First, we define some concepts:

  • stem
    The underlying base of a non-compound word, possibly corresponding to the word’s lemma form or root, but we view the choice of stem as a modeling decision. We use the word “stem” and not “root,” to include more shallow types of analysis that may seek only to identify the affixes (and what’s left behind), and not to recover the fundamental word root.
  • stem class
    The type of the stem, often related to the familiar notion of part-of-speech, which determines which affixes may join with it. For more limited morphological analysis, we may constrain the stem and final word classes to be equal.
  • surface affix
    What is normally referred to as an affix, realized as zero or more characters of text.
  • functional affix
    An abstract affix that expresses a single underlying function of a surface affix, e.g. Number, Person, Tense, Aspect, etc., which may determine the final word class. Note that multiple functional affixes often map to a single surface affix.
  • affix vector
    A value in the space (either surface or functional) of all possible affix combinations allowed in a language.
  • affix position
    A set of one or more affixes that are mutually dependent and thus modeled jointly, occupying a single index in an affix vector. The concept of a position is also useful for specifying the mapping between functional and surface affixes. For example, in Spanish we may have multiple function positions for person, tense, etc. mapping to a single, “verb suffix” surface position.
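
As a rough illustration of how these definitions fit together, here is a small Python sketch of the Spanish case: three functional positions (person, number, tense) mapping onto a single surface "verb suffix" position. Everything here is an assumption made for the example, including the names (`Word`, `AR_VERB_SUFFIX`, `VERB_AR`), the choice of "habl" as the stem, and the tiny three-entry table for regular -ar verbs in the present tense.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# A functional affix vector with three positions: (person, number, tense).
FunctionalVector = Tuple[str, str, str]

# Hypothetical mapping from functional vectors to the one surface
# "verb suffix" position, for regular -ar verbs only.
AR_VERB_SUFFIX: Dict[FunctionalVector, str] = {
    ("1", "sg", "pres"): "o",
    ("1", "pl", "pres"): "amos",
    ("3", "pl", "pres"): "an",
}


@dataclass(frozen=True)
class Word:
    stem: str                     # a modeling decision, not necessarily the root
    stem_class: str               # determines which affixes may attach
    functional: FunctionalVector  # one value per functional position

    def surface_affix(self) -> str:
        """Several functional affixes collapse into one surface affix."""
        return AR_VERB_SUFFIX[self.functional]

    def surface(self) -> str:
        return self.stem + self.surface_affix()


word = Word(stem="habl", stem_class="VERB_AR", functional=("1", "pl", "pres"))
print(word.surface())
```

The point of keeping the functional vector separate from the surface suffix is that a learner could choose to model either space; a shallower analysis would just store the surface affix and skip the table.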


Research like it’s 1999

Monday, October 22nd, 2007

After a long time away, I finally dragged myself back to ISI for an NL seminar. The talk itself, by Slav Petrov, was really interesting, about his work on hierarchically split PCFGs. Basically you can view it as an extension of Klein and Manning’s unlexicalized parsing, except the classes are learned automatically by splitting them repeatedly. Getting the best performance takes several optimizations: pruning unnecessary splits (merging them back), smoothing rule probabilities over unsplit classes, and searching coarse-to-fine with a tiered set of models projected back up from the finest-split model. And AFAIK it’s now the best generative model for parsing English. Very cool.

Inevitably, though, haunting ISI again (or the other way around) gets me thinking about the many twists and turns that have taken me where I am, which is spinning my wheels in the mud after four frigging years and doing research that I most logically should be pursuing back there. So all that, in the middle of a major research funk, along with seeing yet another really smart student from Berkeley/Stanford/Penn doing really smart stuff, was plenty to get me downtrodden. But for the final pièce de résistance, I was talking with a certain NL professor whom I had seriously considered working with, and I made a joke about how it would have been a whole lot easier path to doing MT work (mind you, I don’t really consider myself an MT researcher, but it was a joke) if I’d just stayed at ISI. To which he replied, “Yeah, because we’re years ahead of [your lab].”

Ouch. Thanks.

But now, after a few days of recovery, I’m not so worried. In fact, I’ve decided it’s my angle, my brilliant niche. I’m going to do historical NLP! I’m going to do the very best of 2003-style NLP. Or maybe 1999. I can beat anything they did back then. Just watch me go.

Resurrection

Monday, October 15th, 2007

Yeah, I’m still here. Four years later. In school. Going nowhere.

But this is the monk-like phase of the PhD program when you sequester yourself and churn out nonsense to show how well educated you are in the art of writing nonsense.

Only I can’t bring myself to write. Anything.

So the blog is back. Maybe I’ll get back in the habit of writing, and the hooey will flow and flow.

I did fancy myself to be a writer once.

Once upon a time…