Archive for the ‘morfo’ Category

OpenFST patch for building with CMake.

Monday, May 12th, 2008

Here is a patch against OpenFST (beta-20070801) that adds CMake build support. It pretty much duplicates the existing makefiles. I tested on Mac OS X 10.5.2 (Leopard) and Ubuntu (8.04 64-bit). While in principle builds for Visual Studio on Windows can now be generated, the OpenFST code itself is not yet portable. I didn’t try Cygwin, but it might work.

(Note: The p2 patch adds header installation to the earlier version.)

Isolationist policy

Sunday, December 16th, 2007

Word analysis in isolating languages such as Vietnamese and Chinese is generally viewed as a problem of segmentation: given a sentence of tokens we aim to find the boundaries and segment the sequence into words.

From a morphology perspective, however, we might say that the text is over-segmented into a string of morphs, so that our task is to group morphs into chunks of words. The appeal of this view is that it falls neatly under what I would call sequential models of morphological analysis for non-isolating languages, where we have a lattice of possible segmentations (from some hypothesis generator) and we aim to find the best path. The isolating case just has a more constrained graph topology, with N+1 states for the N tokens and possible arcs limited by the maximum length, in tokens, of words in the language.
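To make the lattice view concrete, here is a minimal sketch of a best-path search over the N+1 states. Everything in it is an assumption made for illustration: the function name `best_segmentation`, the toy lexicon, and the cost values are hypothetical, and costs are simply added along a path.

```python
import math

def best_segmentation(tokens, arc_cost, max_len):
    """Return (words, cost) for the lowest-cost path through the lattice.

    States 0..N sit between tokens; an arc (start, end) groups
    tokens[start:end] into one word and spans at most max_len tokens.
    """
    n = len(tokens)
    best = [math.inf] * (n + 1)   # best[i]: cheapest cost to reach state i
    back = [0] * (n + 1)          # back[i]: start state of the last arc
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            cost = best[start] + arc_cost(tuple(tokens[start:end]))
            if cost < best[end]:
                best[end] = cost
                back[end] = start
    # Follow the back-pointers from the final state to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(tuple(tokens[start:end]))
        end = start
    words.reverse()
    return words, best[n]

# Hypothetical toy lexicon: lower cost means a more plausible word.
LEXICON = {("a", "b"): 1.0, ("c",): 1.0, ("d",): 1.0, ("c", "d"): 1.5}

def toy_cost(word):
    # Unknown single tokens get a flat penalty; unknown groups are disallowed.
    return LEXICON.get(word, 3.0 if len(word) == 1 else math.inf)

words, cost = best_segmentation(["a", "b", "c", "d"], toy_cost, max_len=2)
print(words, cost)
```

The single-token fallback in `toy_cost` keeps every state reachable, so the back-pointer walk always recovers a complete path.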

It’s also interesting that we run into the same type of problem here as with the analysis of non-isolating languages. Specifically, if the arc weights are something like probabilities, each additional arc multiplies in another factor below one, so we will be biased toward longer words and thus fewer arcs, unless the costs of longer arcs are suitably balanced. With token classes and some labeled training data, we certainly might build highly accurate models. For a more unsupervised approach, however, seeking some intrinsic measure for an optimal segmentation, we run into the usual issue that there is no universally correct trade-off between coarse and fine analyses. We are always left with parameters, hidden or overt, that guide our models one way or another toward our preferences, which are, to a large degree, quite arbitrary.
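A small numeric illustration of that bias, under assumed numbers: the probabilities and the per-arc bonus below are invented for the example, and `arc_bonus` is just one hypothetical way of exposing the trade-off as an overt parameter.

```python
import math

def path_score(arc_probs, arc_bonus=0.0):
    """Log-probability of a path, plus an optional constant bonus per arc."""
    return sum(math.log(p) + arc_bonus for p in arc_probs)

one_long_word = [0.05]        # a single, fairly improbable long word
two_short_words = [0.2, 0.2]  # two words, each four times more probable

# With plain probabilities the single long arc wins: 0.05 > 0.2 * 0.2 = 0.04.
plain_long = path_score(one_long_word)
plain_short = path_score(two_short_words)

# A per-arc bonus above log(0.05 / 0.04) = log(1.25) flips the preference.
bonus = 0.3
tuned_long = path_score(one_long_word, bonus)
tuned_short = path_score(two_short_words, bonus)

print(plain_long > plain_short, tuned_short > tuned_long)
```

Nothing in the data says which value of the bonus is right, which is exactly the arbitrariness described above.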

Sneaking some thoughts

Friday, November 9th, 2007

The Boobear is napping, so a few moments of rumination are allowed.

I’ve banned myself from working on morphology learning right now, because it’s clearly risky, with a lot of work needed, and that looming January 10th ACL deadline says I need to get cracking on the morphemic MT paper until it’s nearing a finished state. But I feel anxious to spend some time at least thinking about the morphology, so I’ll grant an exception…

(On the other hand, the MT paper looks more hopeful for EMNLP than ACL; the morphology work is more of an ACL flavor, but I’m losing confidence that it will be ready in time.)

As I’ve already noted, the decision to do morphology is fraught with a different kind of peril compared to the quixotic grammar work. While the morphology is much more likely to achieve results in the nearer term, it is also much more likely to achieve similar results for others, and in fact there have been a gazillion papers on this in the past 8-10 years. I need to do more prior-work reading, but it seems probable that even the weakly supervised space is a little crowded already. And the state-of-the-art unsupervised systems are crafted by hardy Finns who live and breathe agglutinative morphology every day, and if my (cheating) system can’t even match that, well, why bother?

(more…)

Plan B

Friday, November 2nd, 2007

Current processing cycles are being devoted to the following basic question: Should I try to straddle two difficult topics, morphology and syntax, for my impending quals, or go for expediency and stick with one, staying the course on morphology?

Put more cynically, should I cling to that last idealistic drop of PhD motivation in my body, the drive to do something novel and exciting, the last, tenuous hope for a home run that will make the last 9 1/2 innings of drudgery seem worthwhile? Or just accept that those dreams are done and that now all I want is the paper reward, that piece of parchment suitable for framing and the little acronym that says: resistiré.

The Dreamer declaims the following:

  1. These ideas are exciting! They are novel, with nice linguistic foundations (albeit unorthodox), and could be a strong development in unsupervised and low-resource grammar learning, and in MT.
  2. The high bar for the quals is a bit self-imposed.
    1. The two morphology chapters plus the syntax smoothing (probably feasible for the spring) are sufficient for the quals, so I can still meet that deadline.
    2. With the smoothing completed, the grammar transformations are mostly done.
    3. Then I spend most of the final year on unsupervised learning, with the MT results limited to the most straightforward applications of it.
  3. If not syntax, what then? What novel work would you do in morphology to fill out a thesis? Especially since everyone and their cousin has taken a pass at it!

To which The Pragmatist retorts:

  1. They are exciting, but extremely speculative and risky. If you’d developed them in year two or even three, that would have been a great time to try something big. But we’re starting year five now, and it’s time to finish, not to finesse.
  2. Yes, but then you push more work to do after the quals, and do you really want to be here past May 2009?
    1. A bit hopeful, assuming mountains of SpeechLinks work don’t come crashing down. There is also no chance for a COLING paper, because it’s pretty clear that the current papers will occupy me fully through January 10th.
    2. Yes, but again no small piece of work. 6 months is a safe estimate, so that takes us through the NAACL deadline, without starting on the unsupervised learning, which is harder!
    3. I’d call it 18 months after the quals. Want to stay through December?
  3. Ah, you have me there a bit, but I can come up with something. Just watch me….

(more…)

Containing the morphology problem

Monday, October 29th, 2007

Like every aspect of human languages (or maybe better, human intelligence), modeling morphology has turned out to be much trickier than at first thought. Thus the extensive list of background work…

The current list of modeling and learning possibilities:

  1. EM.
    • pro: Simple; flows from a nice generative model.
    • con: Greatly biased toward classes with fewer inflections (especially atomic), since the affix vector distribution is less sparse. Different hypotheses within a (class, stem) group do contribute to each other’s likelihood, but this doesn’t translate across classes.
    • potential extensions:
      1. Putting a prior on classes may help, and is very simple to try. Could go Bayesian, but I am skeptical of how much more that would help.
      2. Need to experiment with initialization.
      3. Higher-order model will help, but trickier in the setting where we analyze only a few word classes and assume all other words are atomic, which makes prior context less informative. One solution for that might be hierarchical class splitting of atomics, cf. Petrov.
      4. Try to model some of the more global effects that we want. To balance the varying affix sparseness, we might want to focus on maximizing the (class, stem) likelihood, and after that, we have differing expectations of a specific affix vector based on the size of the space for that class. There’s also a global expectation that the ratios between different affix values will remain roughly constant across a class. Not sure how to model these effects with a pure stochastic model; a really scary option is to try to adapt M-estimation to unsupervised learning and use the EM model as q0.
  2. Contrastive Estimation
    • pro: Allows arbitrary features, in principle may avoid the sparse affix space problem.
    • con: The only obvious neighborhood function, using inflections of the (class, stem) hypotheses, will generate many valid words. Uncertain how this will work, then, especially having the tagging problem at the same time.
    • possible extensions:
      1. Initialization and regularization will be very important.
      2. Multiple neighborhoods, maximized jointly.
      3. Alternate objective functions?
      4. Adding support and/or conflict sets.
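The concern about the contrastive estimation neighborhood can be seen in a toy sketch. The suffix list, the observed vocabulary, and the function name `inflection_neighborhood` are all hypothetical. The point is only that neighbors generated by re-inflecting a (class, stem) hypothesis are often real words, whereas contrastive estimation treats the neighborhood as implicit negative evidence.

```python
def inflection_neighborhood(stem, suffixes):
    """All surface forms obtained by attaching each suffix of the class to the stem."""
    return {stem + suffix for suffix in suffixes}

# Hypothetical suffix set for one Spanish verb class (present tense only).
AR_VERB_SUFFIXES = ["o", "as", "a", "amos", "an"]

# Hypothetical observed vocabulary.
observed = {"hablo", "hablas", "habla", "hablamos", "hablan", "canto"}

neighbors = inflection_neighborhood("habl", AR_VERB_SUFFIXES)
valid = neighbors & observed

# Here every generated neighbor is itself an attested word.
print(len(valid), "of", len(neighbors), "neighbors are observed words")
```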

Desiderata for a morphology learner

Friday, October 26th, 2007

Definitions

First, we define some concepts:

  • stem
    The underlying base of a non-compound word, possibly corresponding to the word’s lemma form or root, but we view the choice of stem as a modeling decision. We use the word “stem” and not “root,” to include shallower types of analysis that may seek only to identify the affixes (and what’s left behind), and not to recover the fundamental word root.
  • stem class
    The type of the stem, often related to the familiar notion of part-of-speech, which determines which affixes may join with it. For more limited morphological analysis, we may constrain the stem and final word classes to be equal.
  • surface affix
    What is normally referred to as an affix, realized as zero or more characters of text.
  • functional affix
    An abstract affix that expresses a single underlying function of a surface affix, e.g. Number, Person, Tense, Aspect, etc., which may determine the final word class. Note that multiple functional affixes often map to a single surface affix.
  • affix vector
    A value in the space (either surface or functional) of all possible affix combinations allowed in a language.
  • affix position
    A set of one or more affixes that are mutually dependent and thus modeled jointly, occupying a single index in an affix vector. The concept of a position is also useful for specifying the mapping between functional and surface affixes. For example, in Spanish we may have multiple function positions for person, tense, etc. mapping to a single, “verb suffix” surface position.
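As a rough sketch of how these definitions might fit together as data structures, the following uses the Spanish example: three functional positions map jointly to a single “verb suffix” surface position. The class and function names, the position labels, and the small suffix table are assumptions made for illustration, not a fixed design.

```python
from dataclasses import dataclass

# Hypothetical position inventories; each tuple index is one affix position.
FUNCTIONAL_POSITIONS = ("person", "number", "tense")
SURFACE_POSITIONS = ("verb_suffix",)

# Mapping from a functional affix vector to a surface affix vector
# (toy table covering a few present-tense forms of one verb class).
SURFACE_MAP = {
    ("1", "sg", "pres"): ("o",),
    ("2", "sg", "pres"): ("as",),
    ("3", "sg", "pres"): ("a",),
    ("1", "pl", "pres"): ("amos",),
    ("3", "pl", "pres"): ("an",),
}

@dataclass(frozen=True)
class Analysis:
    stem: str
    stem_class: str
    functional: tuple  # one value per entry in FUNCTIONAL_POSITIONS
    surface: tuple     # one value per entry in SURFACE_POSITIONS

def realize(stem, stem_class, functional):
    """Build an analysis by looking up the surface vector for a functional vector."""
    return Analysis(stem, stem_class, functional, SURFACE_MAP[functional])

def word_form(analysis):
    """Concatenate the stem with its surface affixes (suffixes only in this sketch)."""
    return analysis.stem + "".join(analysis.surface)

example = realize("habl", "verb_ar", ("1", "pl", "pres"))
print(word_form(example))
```

Here several functional affixes collapse into one surface affix, matching the note above that multiple functional affixes often map to a single surface affix.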

(more…)