Sneaking some thoughts

The Boobear is napping, so a few moments of rumination are allowed.

I’ve banned myself from working on morphology learning right now, because it’s clearly risky, with a lot of work needed, and that looming January 10th ACL deadline says I need to get cracking on the morphemic MT paper until it’s nearing a finished state. But I feel anxious to spend some time at least thinking about the morphology, so I’ll grant an exception…

(On the other hand, the MT paper looks more hopeful for EMNLP than ACL; the morphology work is more of an ACL flavor, but I’m losing confidence that it will be ready in time.)

As I’ve already noted, the decision to do morphology is fraught with a different kind of peril compared to the quixotic grammar work. While the morphology is much more likely to achieve results in the nearer term, it is also much more likely to achieve similar results for others, and in fact there have been a gazillion papers on this in the past 8-10 years. I need to do more prior-work reading, but it seems probable that even the weakly supervised space is a little crowded already. And the state-of-the-art unsupervised systems are crafted by hardy Finns who live and breathe agglutinative morphology every day, and if my (cheating) system can’t even match that, well, why bother?


Contingent on doing more background reading, I think the following areas are important to focus on:

  • Model analysis
    • It looks like (Creutz and Lagus, 2007) have done a good job analyzing the different types of information that models need to capture, but there may be some room left, especially if it leads to a successful model in the end.
    • One major issue for generative modeling: the entropy of the affix distribution differs from class to class. A hack solution is to scale affix probabilities for classes with more allowed values, or perhaps to use some Bayesian technique. Another possibility is to model the letters directly, along with their likelihood of generating a separate morpheme, so that, e.g., we assign a likelihood to “ly” being part of the stem given that it follows the characters “sal” vs. “slow.”
  • Global effects
    • Closely related is the issue of how to model the more global aspects of a morphological system, e.g. the frequency ratios and the expectation that a stem of one class will not be seen with morphemes of another class.
    • It is possible that the (successful) use of contrastive estimation would be the clincher for the work, perhaps with the addition of support sets (though this needs better theoretical justification).
  • Joint tagging and analysis
    • It seems that much of the focus in morphology learning is on providing the correct segmentation, without identifying the type of the word and the functional roles of the segmented morphemes.
    • Having now finished the (Creutz and Lagus, 2007) paper, it’s clear that (a) state-of-the-art unsupervised morphology is still a large number of heuristics generally tuned for a few languages (here, Finnish); and (b) the focus is very much on segmentation of the surface characters and less on identifying the word type(s), affix functions, etc. So it’s good that this is my focus.
  • Morphology and applications
    • A good deal of the motivation for my approach is for the practical use of morphology in applications such as MT, parsing, and (need to add) IR.
    • Aside from showing performance improvements with morphology, the thing to emphasize is the importance of adapting the morphological analysis to the problem, both in the structure (how to segment words) and the amount (which types of words and affixes to segment). Good to do: Apply two different segmentation structures–one traditional linguistic, one not–to a problem and compare results.
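
To make the letter-level idea under "Model analysis" concrete, here is a minimal sketch, not a worked-out model. It estimates how likely a morpheme boundary is after a given run of preceding characters, using a tiny hand-segmented word list. Everything in it is an assumption for illustration: the toy words, the "+" boundary marker, the add-alpha smoothing, and the names `boundary_probability` and `p_boundary`.

```python
from collections import Counter


def boundary_probability(segmented_words, context_len=4, alpha=1.0):
    """Estimate P(boundary | preceding characters) from '+'-segmented words.

    Hypothetical helper: counts how often a morpheme boundary follows each
    character context and returns a smoothed probability lookup function.
    """
    boundary = Counter()  # times a boundary followed this context
    total = Counter()     # times this context was seen word-internally

    for word in segmented_words:
        morphs = word.split("+")
        surface = "".join(morphs)

        # Character offsets in the surface form where a boundary falls.
        cuts = set()
        pos = 0
        for morph in morphs[:-1]:
            pos += len(morph)
            cuts.add(pos)

        # Every word-internal position is a candidate boundary.
        for i in range(1, len(surface)):
            ctx = surface[max(0, i - context_len):i]
            total[ctx] += 1
            if i in cuts:
                boundary[ctx] += 1

    def prob(ctx):
        ctx = ctx[-context_len:]
        # Add-alpha smoothing over the two outcomes (boundary / no boundary).
        return (boundary[ctx] + alpha) / (total[ctx] + 2 * alpha)

    return prob


# Toy data (assumed): "slow" takes suffixes, "sal" is always stem-internal.
toy_words = ["slow+ly", "slow+er", "slow+est", "sally", "salad", "salt"]
p_boundary = boundary_probability(toy_words, context_len=4)

print(p_boundary("slow"))  # boundary likely, so a following "ly" looks like a suffix
print(p_boundary("sal"))   # boundary unlikely, so a following "ly" stays in the stem
```

On this toy data the context "slow" gets a much higher boundary probability than "sal", which is the distinction the "sal" vs. "slow" example is after. A real model would need far more data and would condition on more than a fixed-length left context.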

(Boobear woke up before I finished this yesterday. This is enough for now.)
