Containing the morphology problem
Like every aspect of human language (or maybe better, human intelligence), morphology has turned out to be much trickier to model than it first seemed. Hence the extensive list of background work…
The current list of modeling and learning possibilities:
- EM.
  - pro: Simple; flows from a nice generative model.
  - con: Strongly biased toward classes with fewer inflections (especially atomic ones), since their affix vector distributions are less sparse. Different hypotheses within a (class, stem) group do contribute to each other’s likelihood, but this doesn’t carry over across classes.
  - potential extensions:
    - Putting a prior on classes may help, and is very simple to try. Could go fully Bayesian, but I am skeptical of how much more that would help.
    - Need to experiment with initialization.
    - A higher-order model will help, but it is trickier in the setting where we analyze only a few word classes and assume all other words are atomic, which makes the prior context less informative. One solution might be hierarchical class splitting of the atomics, cf. Petrov.
    - Try to model some of the more global effects that we want. To balance the varying affix sparseness, we might focus first on maximizing the (class, stem) likelihood, and only then apply expectations about a specific affix vector, which differ with the size of the affix space for that class. There’s also a global expectation that the ratios between different affix values will remain roughly constant across a class. Not sure how to model these effects with a pure stochastic model; a really scary option is to try to adapt M-estimation to unsupervised learning and use the EM model as q0.
- Contrastive Estimation.
  - pro: Allows arbitrary features; in principle may avoid the sparse affix space problem.
  - con: The only obvious neighborhood function, the inflections of the (class, stem) hypotheses, will generate many valid words, so the neighborhood is not reliably negative evidence. Uncertain how well this will work, then, especially when solving the tagging problem at the same time.
  - potential extensions:
    - Initialization and regularization will be very important.
    - Multiple neighborhoods, maximized jointly.
    - Alternate objective functions?
    - Adding support and/or conflict sets.
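
To make the neighborhood concern under Contrastive Estimation concrete, here is a minimal Python sketch of an inflection-based neighborhood and the contrastive objective it would feed. Everything named here is an assumption for illustration: the toy `AFFIXES` table, the single indicator feature per (class, affix) pair, and the simplification to one fixed (class, stem) hypothesis per word instead of a sum over hypotheses and tags.

```python
import math

# Hypothetical toy paradigms: the affixes each class may attach to a stem.
AFFIXES = {"verb": ["", "s", "ed", "ing"], "noun": ["", "s"]}


def neighborhood(word_class, stem):
    """Surface forms produced by inflecting one (class, stem) hypothesis."""
    return [stem + affix for affix in AFFIXES[word_class]]


def score(weights, word_class, stem, word):
    """Log-linear score using one assumed indicator feature per (class, affix)."""
    affix = word[len(stem):]
    return weights.get((word_class, affix), 0.0)


def contrastive_log_likelihood(weights, word_class, stem, word):
    """log p(word | neighborhood of word) for a single fixed hypothesis."""
    neighbors = neighborhood(word_class, stem)
    if word not in neighbors:
        raise ValueError("observed word must belong to its own neighborhood")
    scores = [score(weights, word_class, stem, n) for n in neighbors]
    # Log-sum-exp over the neighborhood, shifted by the max for stability.
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return score(weights, word_class, stem, word) - log_z


# Every neighbor of "walked" under the verb paradigm is itself a valid word.
print(neighborhood("verb", "walk"))
# With all weights at zero, the observed form gets probability 1/4.
print(math.exp(contrastive_log_likelihood({}, "verb", "walk", "walked")))
```

In this toy case all four neighbors are attested words, so training that moves probability mass toward "walked" and away from the rest also penalizes forms the corpus probably contains. That is the worry stated in the con above, and it is why the choice of neighborhood (or of several neighborhoods maximized jointly) matters so much here.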