<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>i/o</title>
	<atom:link href="http://lemurz.org/i/o/feed/" rel="self" type="application/rss+xml" />
	<link>http://lemurz.org/i/o</link>
	<description></description>
	<pubDate>Sat, 27 Sep 2008 19:41:12 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.5</generator>
	<language>en</language>
			<item>
		<title>De-sucking Textmate indentation</title>
		<link>http://lemurz.org/i/o/2008/09/27/de-sucking-textmate-indentation/</link>
		<comments>http://lemurz.org/i/o/2008/09/27/de-sucking-textmate-indentation/#comments</comments>
		<pubDate>Sat, 27 Sep 2008 18:53:28 +0000</pubDate>
		<dc:creator>yozhik</dc:creator>
		
		<category><![CDATA[pewter]]></category>

		<guid isPermaLink="false">http://lemurz.org/i/o/2008/09/27/de-sucking-textmate-indentation/</guid>
		<description><![CDATA[I really, really like Textmate. It is fast, powerful, and incredibly easy to script and customize.
But it has a few incredibly annoying weaknesses. The worst of these is an abysmal indentation model. (The second worst is terrible support for large files and projects, but I haven&#8217;t solved that yet.)
Happily, I ran across this nice post [...]]]></description>
			<content:encoded><![CDATA[<p>I really, really like Textmate. It is fast, powerful, and incredibly easy to script and customize.</p>
<p>But it has a few incredibly annoying weaknesses. The worst of these is an abysmal indentation model. (The second worst is terrible support for large files and projects, but I haven&#8217;t solved that yet.)</p>
<p>Happily, I ran across this <a href="http://gragusa.wordpress.com/2007/11/11/textmate-emacs-like-indentation-for-r-files/">nice post</a> on indenting R files using Emacs in batch mode from Textmate. A very simple and effective idea. (Emacs aficionados are of course scoffing: why not use Emacs? Why not? Because I still haven&#8217;t found the button to zap-phrase-long-counter-intuitive-command-names-out-of-the-1970s-and-into-a-modern-ui.)</p>
<p>I did a <a href='http://lemurz.org/i/o/wp-content/uploads/2008/09/tidyrb.txt' title='tidy.rb'>reworking in Ruby</a> that can handle any file type for which an Emacs mode is installed; the file type is given as a command-line argument to the script. So just put the Ruby script somewhere (it probably should go in a Textmate support directory, but I just put it in a personal script dir), and then create a &#8216;Tidy&#8217; command in Textmate for each desired language, passing an appropriate file ending to tell Emacs the language. For example:</p>
<p>/path/to/tidy.rb cpp   # C++</p>
<p>/path/to/tidy.rb java   # Java</p>
<p>/path/to/tidy.rb ml      # OCaml</p>
<p>Voilà!</p>
<p>Again, the script (you&#8217;ll need to rename and make executable): <a href='http://lemurz.org/i/o/wp-content/uploads/2008/09/tidyrb.txt' title='tidy.rb'>tidy.rb</a></p>
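<p>For the curious, the core idea fits in a few lines. What follows is a minimal Python sketch of the approach, not the tidy.rb script itself: it assumes Emacs is on the PATH, and the function names and temp-file handling are my own illustration. It writes the text to a temporary file with the given ending so Emacs picks the mode, re-indents the whole buffer in batch mode, and reads the result back.</p>

```python
import os
import subprocess
import tempfile


def emacs_indent_command(path):
    """Build the Emacs batch-mode command that re-indents `path` in place."""
    return [
        "emacs", "--batch", path,
        "--eval", "(indent-region (point-min) (point-max) nil)",
        "-f", "save-buffer",
    ]


def tidy(text, ending):
    """Re-indent `text` using the Emacs mode associated with `ending`."""
    # The file ending (e.g. "cpp", "java", "ml") is what tells Emacs the language.
    fd, path = tempfile.mkstemp(suffix="." + ending)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        subprocess.run(emacs_indent_command(path), check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        with open(path) as handle:
            return handle.read()
    finally:
        # Emacs may leave a backup file next to the original.
        for leftover in (path, path + "~"):
            if os.path.exists(leftover):
                os.remove(leftover)
```

<p>A Textmate command would then feed the document in on standard input and print the return value of <code>tidy</code>, with the file ending taken from the command line as in the examples above.</p>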
]]></content:encoded>
			<wfw:commentRss>http://lemurz.org/i/o/2008/09/27/de-sucking-textmate-indentation/feed/</wfw:commentRss>
		</item>
		<item>
		<title>OpenFST patch for building with CMake.</title>
		<link>http://lemurz.org/i/o/2008/05/12/openfst-patch-for-building-with-cmake/</link>
		<comments>http://lemurz.org/i/o/2008/05/12/openfst-patch-for-building-with-cmake/#comments</comments>
		<pubDate>Mon, 12 May 2008 21:39:48 +0000</pubDate>
		<dc:creator>yozhik</dc:creator>
		
		<category><![CDATA[morfo]]></category>

		<category><![CDATA[pewter]]></category>

		<guid isPermaLink="false">http://lemurz.org/i/o/2008/05/12/openfst-patch-for-building-with-cmake/</guid>
		<description><![CDATA[Here is a patch against OpenFST (beta-20070801) that adds CMake build support. It pretty much duplicates the existing make files. I tested on Mac OS 10.5.2 (Leopard) and Ubuntu (8.04 64-bit). While in principle builds for Visual Studio on Windows can now be generated, the OpenFST code itself is not yet portable. I didn&#8217;t try [...]]]></description>
			<content:encoded><![CDATA[<p>Here is a <a href="/io/wp-content/uploads/2008/05/OpenFst-beta-20080422-cmakep2.patch.gz">patch</a> against <a href="http://www.openfst.org/">OpenFST</a> (beta-20070801) that adds CMake build support. It pretty much duplicates the existing make files. I tested on Mac OS 10.5.2 (Leopard) and Ubuntu (8.04 64-bit). While in principle builds for Visual Studio on Windows can now be generated, the OpenFST code itself is not yet portable. I didn&#8217;t try Cygwin, but it might work.</p>
<p>(Note: The p2 patch adds header installation to the earlier version.)</p>
]]></content:encoded>
			<wfw:commentRss>http://lemurz.org/i/o/2008/05/12/openfst-patch-for-building-with-cmake/feed/</wfw:commentRss>
		</item>
		<item>
		<title>No comment</title>
		<link>http://lemurz.org/i/o/2008/05/12/no-comment/</link>
		<comments>http://lemurz.org/i/o/2008/05/12/no-comment/#comments</comments>
		<pubDate>Mon, 12 May 2008 21:24:06 +0000</pubDate>
		<dc:creator>yozhik</dc:creator>
		
		<category><![CDATA[polis]]></category>

		<guid isPermaLink="false">http://lemurz.org/i/o/2008/05/12/no-comment/</guid>
		<description><![CDATA[Painful chuckle with the last post. How many ups and downs since then, eh? Suffice to say that my hopes and predictions about both the Super Bowl and Super Tuesday were, uh, wrong (not that the former mattered too much to me). At this point, I just want the damn primaries to end. Please?
]]></description>
			<content:encoded><![CDATA[<p>Painful chuckle with the last post. How many ups and downs since then, eh? Suffice to say that my hopes and predictions about both the Super Bowl and Super Tuesday were, uh, wrong (not that the former mattered too much to me). At this point, I just want the damn primaries to end. Please?</p>
]]></content:encoded>
			<wfw:commentRss>http://lemurz.org/i/o/2008/05/12/no-comment/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Permission to dream</title>
		<link>http://lemurz.org/i/o/2008/02/03/permission-to-dream/</link>
		<comments>http://lemurz.org/i/o/2008/02/03/permission-to-dream/#comments</comments>
		<pubDate>Sun, 03 Feb 2008 20:27:49 +0000</pubDate>
		<dc:creator>yozhik</dc:creator>
		
		<category><![CDATA[polis]]></category>

		<guid isPermaLink="false">http://lemurz.org/i/o/2008/02/03/permission-to-dream/</guid>
		<description><![CDATA[Too much work to do, and today&#8217;s the one time of the year I allow myself to waste a day on watching football (ok, one of two times, I watch USC bowl games, too :-). But a brief post anyway.
The terrible thing about being leftward-leaning in this country is to watch the quadrennial self-destruction of [...]]]></description>
			<content:encoded><![CDATA[<p>Too much work to do, and today&#8217;s the one time of the year I allow myself to waste a day on watching football (ok, one of two times, I watch USC bowl games, too :-). But a brief post anyway.</p>
<p>The terrible thing about being leftward-leaning in this country is watching the quadrennial self-destruction of the Democratic Party as it nominates yet another New England intellectual who can dish off stacks of well-reasoned policy documents, but can&#8217;t connect with anyone.</p>
<p>The one living exception is Bill, but we&#8217;ve been reminded all too well in the past few weeks of his character and the divisive environment he helps create and thrives upon. And for those nostalgic for the Clinton Nineties, are they really worth reliving? Or just <a href="http://www.latimes.com/news/columnists/la-oe-brooks24jan24,1,4797003.column">better than the past seven years</a>?</p>
<p>We&#8217;ve been told for at least four years that Hillary was coming, and that she was inevitable, which I resented right off the bat. Add to that the fact that, while she certainly possesses her husband&#8217;s brains and probably better policy, she possesses none of his charisma and common touch, but has definitely acquired a good measure of his sleaze and self-righteousness (ok, maybe she already had some of the latter) through years of osmosis and simple association. At times they both seem indignant and even incredulous that anyone would dare challenge their birthright to return to the throne.</p>
<p>Now, we as the Good Democratic Herd are supposed to make the usual calculations and triangulations and support The Party&#8217;s Anointed. Vote HHH, not RFK. Mondale and Dukakis, Gore and Kerry and &#8230; Billary. And while some years there have been other, more inspiring candidates, who may or may not have done better in November, the Fear always whispered, &#8220;No, too X, too Y. We need neutral, we need safe.&#8221;</p>
<p>Well, we may not win this one, either. I&#8217;m not saying Obama is flawless or the Chosen One, but this is the first candidate in my life that I&#8217;m genuinely excited about, not just tacking away from this or that. And so I&#8217;m elated that this month, finally, millions of others are feeling the permission to dream:</p>
<p style="text-align: center"><img src="http://lemurz.org/i/o/wp-content/uploads/2008/02/020308dailyupdategraph2.gif" alt="Dem-Polls-Jan2008" /></p>
]]></content:encoded>
			<wfw:commentRss>http://lemurz.org/i/o/2008/02/03/permission-to-dream/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Isolationist policy</title>
		<link>http://lemurz.org/i/o/2007/12/16/isolationist-policy/</link>
		<comments>http://lemurz.org/i/o/2007/12/16/isolationist-policy/#comments</comments>
		<pubDate>Sun, 16 Dec 2007 23:31:42 +0000</pubDate>
		<dc:creator>yozhik</dc:creator>
		
		<category><![CDATA[morfo]]></category>

		<guid isPermaLink="false">http://lemurz.org/i/o/2007/12/16/isolationist-policy/</guid>
		<description><![CDATA[Word analysis in isolating languages such as Vietnamese and Chinese is generally viewed as a problem of segmentation: given a sentence of tokens we aim to find the boundaries and segment the sequence into words.
From a morphology perspective, however, we might say that the text is over-segmented into a string of morphs, so that our [...]]]></description>
			<content:encoded><![CDATA[<p>Word analysis in isolating languages such as Vietnamese and Chinese is generally viewed as a problem of segmentation: given a sentence of tokens we aim to find the boundaries and segment the sequence into words.</p>
<p>From a morphology perspective, however, we might say that the text is <em>over</em>-segmented into a string of morphs, so that our task is to group morphs into chunks of words. The appeal of this view is that it falls neatly under what I would call sequential models of morphological analysis for non-isolating languages, where we have a lattice of possible segmentations (from some hypothesis generator) and we aim to find the best path. The isolating case just has a more constrained graph topology, with N+1 states for the N tokens and possible arcs limited by the maximum length, in tokens, of words in the language.</p>
<p>It&#8217;s also interesting that we run into the same type of problem here as with the analysis of non-isolating languages. Specifically, if the arc weights are something like probabilities, we will be biased toward longer words and thus fewer arcs, unless the costs of longer arcs are suitably balanced. With token classes and some labeled training data, we certainly might build highly accurate models. For a more unsupervised approach, however, seeking some intrinsic measure for an optimal segmentation, we run into the usual issue that there is no universally correct trade-off between coarse and fine analyses. We are always left with parameters, hidden or overt, that guide our models one way or another toward our preferences, which are, to a large degree, quite arbitrary.</p>
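<p>To make the constrained-lattice view concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the toy lexicon, its made-up probabilities, and the per-token length penalty standing in for whatever balancing of longer arcs one prefers. States 0 through N sit between the N tokens, an arc from state i to state j treats the tokens between them as one word, and dynamic programming finds the lowest-cost path.</p>

```python
import math


def best_segmentation(tokens, word_prob, max_len, length_penalty=0.0):
    """Lowest-cost path through a lattice with len(tokens) + 1 states."""
    n = len(tokens)
    cost = [0.0] + [math.inf] * n   # cost[j]: best cost of reaching state j
    back = [0] * (n + 1)            # back[j]: start state of the last arc into j
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            word = tuple(tokens[i:j])
            if word not in word_prob:
                continue
            # Arc cost: negative log probability, plus a charge for every
            # token beyond the first that the arc swallows.
            arc = -math.log(word_prob[word]) + length_penalty * (j - i - 1)
            if cost[j] > cost[i] + arc:
                cost[j], back[j] = cost[i] + arc, i
    if math.isinf(cost[n]):
        raise ValueError("no path covers the whole sentence")
    words, j = [], n
    while j > 0:
        words.append(tokens[back[j]:j])
        j = back[j]
    return words[::-1]


# Toy lexicon with made-up probabilities over abstract tokens.
LEXICON = {
    ("a",): 0.1, ("b",): 0.1, ("c",): 0.1,
    ("a", "b"): 0.05, ("b", "c"): 0.02,
}
SENTENCE = ["a", "b", "c"]

unbalanced = best_segmentation(SENTENCE, LEXICON, max_len=2)
balanced = best_segmentation(SENTENCE, LEXICON, max_len=2, length_penalty=2.0)
```

<p>With no penalty the two-token word wins simply because one arc is cheaper than two; with a penalty of 2.0 the path flips to three single-token words. Which of the two is &#8220;right&#8221; is exactly the arbitrary preference discussed above.</p>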
]]></content:encoded>
			<wfw:commentRss>http://lemurz.org/i/o/2007/12/16/isolationist-policy/feed/</wfw:commentRss>
		</item>
		<item>
		<title>At long last: Mac copy, KDE paste</title>
		<link>http://lemurz.org/i/o/2007/11/10/at-long-last-mac-copy-kde-paste/</link>
		<comments>http://lemurz.org/i/o/2007/11/10/at-long-last-mac-copy-kde-paste/#comments</comments>
		<pubDate>Sat, 10 Nov 2007 22:24:36 +0000</pubDate>
		<dc:creator>yozhik</dc:creator>
		
		<category><![CDATA[pewter]]></category>

		<guid isPermaLink="false">http://lemurz.org/i/o/2007/11/10/at-long-last-mac-copy-kde-paste/</guid>
		<description><![CDATA[I happen to be crazed enough to run KDE on my Mactop. Actually, I run it for one application: Kile. Yes, Kile. Though I would prefer to use something like LyX, the main reason I suffer LaTeX is the convenience of having your papers and, maybe even, thesis formatted to specification automatically. LyX doesn&#8217;t work [...]]]></description>
			<content:encoded><![CDATA[<p>I happen to be crazed enough to run KDE on my Mactop. Actually, I run it for one application: <a href="http://kile.sourceforge.net/">Kile</a>. Yes, Kile. Though I would prefer to use something like <a href="http://www.lyx.org/">LyX</a>, the main reason I suffer LaTeX is the convenience of having your papers and, maybe even, thesis formatted to specification automatically. LyX doesn&#8217;t work on LaTeX files directly, however, so I am paranoid about whether it will format everything exactly the same as LaTeX, and anyway I find LaTeX hacking in LyX a bit clumsy. So at some point I switched to working with LaTeX directly, but the general philosophy of LaTeX editor user interface design seems to be: Let&#8217;s stuff in as many thousands of little toolbar buttons as we can for all the obscure commands, environments, and, especially, mathematical symbols.</p>
<p>Except for Kile. Which still sucks since it doesn&#8217;t have WYSIWYG or at least real-time preview, but at least they figured out that dynamic command completion is really frigging useful. So that&#8217;s what I use.</p>
<p>But there&#8217;s one big problem: I can&#8217;t copy from a Mac application to Kile. Or any KDE application for that matter. As the following post discusses</p>
<p><a href="http://lists.macosforge.org/pipermail/macports-users/2007-July/004451.html">http://lists.macosforge.org/pipermail/macports-users/2007-July/004451.html</a></p>
<p>you can paste to other X11 applications using the mouse middle button (option-click) (see <a href="http://developer.apple.com/qa/qa2001/qa1232.html">http://developer.apple.com/qa/qa2001/qa1232.html</a>), but it doesn&#8217;t work for KDE apps. Unfortunately the post hasn&#8217;t been answered, but I finally figured out a not-so-odious work-around: Just run xclipboard.</p>
<p>Yes, old, crusty, xclipboard in all of its <a href="http://en.wikipedia.org/wiki/Xaw">Athena widget</a> glory scoops up your native Mac copying activity and magically makes it available to KDE applications with the usual Control-v.</p>
<p>So that&#8217;s why today&#8217;s UNIX GUI distros still dump hundreds of old and often redundant X applications. Sometimes you actually need them!</p>
]]></content:encoded>
			<wfw:commentRss>http://lemurz.org/i/o/2007/11/10/at-long-last-mac-copy-kde-paste/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Sneaking some thoughts</title>
		<link>http://lemurz.org/i/o/2007/11/09/sneaking-some-thoughts/</link>
		<comments>http://lemurz.org/i/o/2007/11/09/sneaking-some-thoughts/#comments</comments>
		<pubDate>Fri, 09 Nov 2007 16:29:17 +0000</pubDate>
		<dc:creator>yozhik</dc:creator>
		
		<category><![CDATA[morfo]]></category>

		<guid isPermaLink="false">http://lemurz.org/i/o/2007/11/09/sneaking-some-thoughts/</guid>
		<description><![CDATA[The Boobear is napping, so a few moments of rumination are allowed.
I&#8217;ve banned myself from working on morphology learning right now, because it&#8217;s clearly risky, with a lot of work needed, and that looming January 10th ACL deadline says I need to get cracking on the morphemic MT paper until it&#8217;s nearing a finished state. [...]]]></description>
			<content:encoded><![CDATA[<p align="left">The Boobear is napping, so a few moments of rumination are allowed.</p>
<p>I&#8217;ve banned myself from working on morphology learning right now, because it&#8217;s clearly risky, with a lot of work needed, and that looming January 10th ACL deadline says I need to get cracking on the morphemic MT paper until it&#8217;s nearing a finished state. But I feel anxious to spend some time at least thinking about the morphology, so I&#8217;ll grant an exception&#8230;</p>
<p>(On the other hand, the MT paper looks more hopeful for EMNLP than ACL; the morphology work is more of an ACL flavor, but I&#8217;m losing confidence that it will be ready in time.)</p>
<p>As I&#8217;ve already noted, the decision to do morphology is fraught with a different kind of peril compared to the quixotic grammar work. While the morphology is much more likely to achieve results in the nearer term, it is also much more likely to achieve similar results for others, and in fact there have been a gazillion papers on this in the past 8-10 years. I need to do more prior-work reading, but it seems probable that even the weakly supervised space is a little crowded already. And the state-of-the-art unsupervised systems are crafted by hardy Finns who live and breathe agglutinative morphology every day, and if my (cheating) system can&#8217;t even match that, well, why bother?</p>
<p><span id="more-10"></span><br />
Contingent on doing more background reading, I think the following areas are important to focus on:</p>
<ul>
<li><strong>Model analysis</strong>
<ul>
<li>It looks like (Creutz and Lagus, 2007) have done a good job analyzing the different types of information that models need to capture, but there may be some room left, especially if it leads to a successful model in the end.</li>
<li>One major issue for generative modeling: Differing entropy of affix space distributions for different classes. Hack solution is to scale affix probabilities where there are more allowed values, or perhaps some Bayesian technique. Another possibility is to model the letters directly, and their likelihood to generate a separate morpheme, so that, e.g., we assign a likelihood to the &#8220;ly&#8221; suffix being part of the stem given that it comes after the characters &#8220;sal&#8221; vs. &#8220;slow.&#8221;</li>
</ul>
</li>
<li><strong>Global effects</strong>
<ul>
<li>Closely related is the issue of how to model the more global aspects of a morphological system, e.g. the frequency ratios, expectation that a stem of one class will not be seen with morphemes of another class.</li>
<li>It is possible that the (successful) use of contrastive estimation would be the clincher for the work, perhaps with the addition of support sets (though this needs better theoretical justification).</li>
</ul>
</li>
<li><strong>Joint tagging and analysis</strong>
<ul>
<li>It seems that much of the focus in morphology learning is on providing the correct segmentation, without identifying the type of the word and the functional roles of the segmented morphemes.</li>
<li>Having now finished the (Creutz and Lagus, 2007) paper, it&#8217;s clear that  (a) state-of-the-art unsupervised morphology is still a  large number of heuristics generally tuned for a few languages (here, Finnish); and (b) the focus is very much on segmentation of the surface characters and less on identifying the word type(s), affix functions, etc. So it&#8217;s good that this is my focus.</li>
</ul>
</li>
<li><strong>Morphology and applications</strong>
<ul>
<li>A good deal of the motivation for my approach is for the practical use of morphology in applications such as MT, parsing, and (need to add) IR.</li>
</ul>
<ul>
<li>Aside from showing performance improvements with morphology, the thing to emphasize is the importance of adapting the morphological analysis to the problem, both in the structure (how to segment words) and the amount (which types of words and affixes to segment). Good to do: Apply two different segmentation structures&#8211;one traditional linguistic, one not&#8211;to a problem and compare results.</li>
</ul>
</li>
</ul>
<p>(Boobear woke up before I finished this yesterday. This is enough for now.)</p>
]]></content:encoded>
			<wfw:commentRss>http://lemurz.org/i/o/2007/11/09/sneaking-some-thoughts/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Plan B</title>
		<link>http://lemurz.org/i/o/2007/11/02/plan-b/</link>
		<comments>http://lemurz.org/i/o/2007/11/02/plan-b/#comments</comments>
		<pubDate>Fri, 02 Nov 2007 21:40:32 +0000</pubDate>
		<dc:creator>yozhik</dc:creator>
		
		<category><![CDATA[morfo]]></category>

		<category><![CDATA[phooey]]></category>

		<guid isPermaLink="false">http://lemurz.org/i/o/2007/11/02/plan-b/</guid>
		<description><![CDATA[Current processing cycles are being devoted to the following basic question: Should I try to straddle two difficult topics, morphology and syntax, for my impending quals, or go for expediency and stick with one, staying the course on morphology?
Put more cynically, should I cling to that last idealistic drop of PhD motivation in my body, [...]]]></description>
			<content:encoded><![CDATA[<p>Current processing cycles are being devoted to the following basic question: Should I try to straddle two difficult topics, morphology and syntax, for my impending quals, or go for expediency and stick with one, staying the course on morphology?</p>
<p>Put more cynically, should I cling to that last idealistic drop of PhD motivation in my body, the drive to do something novel and exciting, the last, tenuous hope for a home run that will make the last 9 1/2 innings of drudgery seem worthwhile?  Or just accept that those dreams are done and that now all I want is the paper reward, that piece of parchment suitable for framing and the little acronym that says: resistiré.</p>
<p>The Dreamer declaims the following:</p>
<ol>
<li>These ideas are exciting! They are novel, with nice linguistic foundations (albeit unorthodox), and could be a strong development in unsupervised and low-resource grammar learning, and in MT.</li>
<li>The high bar for the quals is a bit self-imposed.
<ol>
<li>The two morphology chapters plus the syntax smoothing (probably feasible for the spring) are sufficient for the quals, so I can still meet that deadline.</li>
<li>With the smoothing completed, the grammar transformations are mostly done.</li>
<li>Then I spend most of the final year on unsupervised learning, with the MT results limited to the most straightforward applications of it.</li>
</ol>
</li>
<li>If not syntax, what then? What novel work would you do in morphology to fill out a thesis? Especially since everyone and their cousin has taken a pass at it!</li>
</ol>
<p>To which The Pragmatist retorts:</p>
<ol>
<li>They are exciting, but extremely speculative and risky. If you&#8217;d developed them in year two or even three, that would have been a great time to try something big. But we&#8217;re starting year <em>five</em> now, and it&#8217;s time to finish, not to finesse.</li>
<li>Yes, but then you push more work to do after the quals, and do you <em>really</em> want to be here past May 2009?
<ol>
<li>A bit hopeful, assuming mountains of SpeechLinks work don&#8217;t come crashing down. Also no chance for a COLING paper, because it&#8217;s pretty clear that the current papers will occupy me fully through January 10th.</li>
<li>Yes, but again no small piece of work. 6 months is a safe estimate, so that takes us through the NAACL deadline, without starting on the unsupervised learning, which is harder!</li>
<li>I&#8217;d call it 18 months after the quals. Want to stay through December?</li>
</ol>
</li>
<li>Ah, you have me there a bit, but I can come up with something. Just watch me&#8230;.</li>
</ol>
<p><span id="more-9"></span></p>
<h3>The Pragmatist&#8217;s Plan B</h3>
<p>The mantra of Plan B is this: My hammer is morphology, so what can I nail? Put more concretely, we find languages with interesting morphology, model them, and evaluate the performance of the segmentation and its use in applications.</p>
<p>A little brainstorming:</p>
<ol>
<li>Models
<ol>
<li>Basic concatenative morphology</li>
<li>Functional and surface affixes</li>
<li>Agglutinative morphology</li>
<li>Template morphology</li>
<li>Affix ordering (e.g. Chintang)</li>
<li>Phonological modeling</li>
</ol>
</li>
<li>Learning
<ol>
<li>EM and variants</li>
<li>Log-linear: contrastive, etc.</li>
<li>Full semi-supervised, i.e. combine small supervised model with unsupervised.</li>
<li>Using Y&amp;W approaches to discover patterns, in addition to specification.</li>
</ol>
</li>
<li>MT
<ol>
<li>Segmentation and training techniques (IN PROGRESS).</li>
<li>Factored morphological models.</li>
<li>Translation with isolating language, mapping morphemes to particles (Vietnamese, Chinese).</li>
</ol>
</li>
<li>Other applications
<ol>
<li>Dependency parsing? Train McDonald on segmented data.</li>
<li>Joint tagging and segmentation.</li>
</ol>
</li>
</ol>
<p>Now let&#8217;s translate this into chapters for the quals:</p>
<ol>
<li>Models of Morphological Systems</li>
<li>Learning Morphology
<ol>
<li>EM and variants</li>
<li>Contrastive estimation</li>
<li>Partial specification (Y&amp;W proposals)</li>
</ol>
</li>
<li>Translating Morphemes
<ol>
<li>Persian and Spanish, maybe Czech (but what to offer over G&amp;M).</li>
</ol>
</li>
<li>Proposed: Factored Translation
<ol>
<li>Basic improvements</li>
<li>Isolating languages: identifying particles, etc.</li>
</ol>
</li>
<li>Proposed: Other Applications
<ol>
<li>Joint tagging</li>
<li>Dependency parsing</li>
</ol>
</li>
</ol>
<p>I&#8217;m not sure I like this division of the work, especially separating the models from the learning. Let&#8217;s try another:</p>
<ol>
<li>Learning Morphology by Specification
<ol>
<li>Basic concatenative model, with stem change, orders 1-3.</li>
<li>EM and variants.</li>
<li>Contrastive estimation (note: this requires generation!)</li>
</ol>
</li>
<li>Beyond Concatenative Morphology
<ol>
<li>Agglutinative, templatic, permutative.</li>
<li>Functional and surface affix modeling.</li>
<li>Phonological modeling.</li>
</ol>
</li>
<li>Translating Morphemes
<ol>
<li>Persian work, with patterns and above models.</li>
<li>Some other exciting language(s).</li>
</ol>
</li>
<li>Proposed: Learning from Partial Specifications
<ol>
<li>Use hypothesis proposals from Y&amp;W models and others, extend specification.</li>
<li>Also semi-supervised training, with small annotated data set.</li>
</ol>
</li>
<li>Proposed: Factored Translation
<ol>
<li>Basic improvements on morpheme translation.</li>
<li>Isolating languages: Identifying particles, etc.</li>
</ol>
</li>
<li>Proposed: Other Applications
<ol>
<li>Dependency parsing of morphemes (McDonald plus certain attachment constraints).</li>
<li>Joint tagging and segmentation (anything really to be done here?).</li>
</ol>
</li>
</ol>
<p>Ok, I think that&#8217;s better. Call it a plan.</p>
]]></content:encoded>
			<wfw:commentRss>http://lemurz.org/i/o/2007/11/02/plan-b/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Containing the morphology problem</title>
		<link>http://lemurz.org/i/o/2007/10/29/containing-the-morphology-problem/</link>
		<comments>http://lemurz.org/i/o/2007/10/29/containing-the-morphology-problem/#comments</comments>
		<pubDate>Mon, 29 Oct 2007 18:42:03 +0000</pubDate>
		<dc:creator>yozhik</dc:creator>
		
		<category><![CDATA[morfo]]></category>

		<guid isPermaLink="false">http://lemurz.org/i/o/2007/10/29/containing-the-morphology-problem/</guid>
		<description><![CDATA[Like every aspect of human languages (or maybe better, human intelligence), modeling morphology has turned out to be much trickier than at first thought. Thus the  extensive list of background work&#8230;
The current list of modeling and learning possibilities:

EM.

pro: Simple; flows from nice, generative model.
con:&#160;Greatly biased toward classes with fewer inflections (especially atomic), since the [...]]]></description>
			<content:encoded><![CDATA[<p>Like every aspect of human languages (or maybe better, human <em>intelligence</em>), modeling morphology has turned out to be much trickier than at first thought. Thus the  extensive list of background work&#8230;</p>
<p>The current list of modeling and learning possibilities:</p>
<ol>
<li>EM.
<ul>
<li><strong>pro:</strong> Simple; flows from nice, generative model.</li>
<li><strong style="margin-left: -4px">con:</strong>&nbsp;Greatly biased toward classes with fewer inflections (especially atomic), since the affix vector distribution is less sparse. Different hypotheses within a (class, stem) group do contribute to each other&#8217;s likelihood, but this doesn&#8217;t translate across classes.</li>
<li><strong>potential extensions:</strong>
<ol>
<li>Putting a prior on classes may help, and very simple to try. Could go Bayesian, but I am skeptical of how much more that will help.</li>
<li>Need to experiment with initialization.</li>
<li>Higher-order model will help, but trickier in the setting where we analyze only a few word classes and assume all other words are atomic, which makes prior context less informative. One solution for that might be hierarchical class splitting of atomics, cf. Petrov.</li>
<li>Try to model some of the more global effects that we want. To balance the varying affix sparseness, we might want to focus on maximizing the (class, stem) likelihood, and after that, we have differing expectations of a specific affix vector based on the size of the space for that class. There&#8217;s also a global expectation that the ratios between different affix values will remain roughly constant across a class. Not sure how to model these effects with a pure stochastic model; a really scary option is to try to adapt M-estimation to unsupervised learning and use the EM model as q<sub>0</sub>.</li>
</ol>
</li>
</ul>
</li>
<li>Contrastive Estimation
<ul>
<li><strong>pro:</strong> Allows arbitrary features, in principle may avoid the sparse affix space problem.</li>
<li><strong>con:</strong> The only obvious neighborhood function, using inflections of the (class, stem) hypotheses, will generate many valid words. Uncertain how this will work, then, especially having the tagging problem at the same time.</li>
<li><strong>possible extensions:</strong>
<ol>
<li>Initialization and regularization will be very important.</li>
<li>Multiple neighborhoods, maximized jointly.</li>
<li>Alternate objective functions?</li>
<li>Adding support and/or conflict sets.</li>
</ol>
</li>
</ul>
</li>
</ol>
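<p>The size of the EM bias toward sparse-free classes is easy to see with a back-of-the-envelope calculation. This Python sketch is only an illustration under stated assumptions: the class and stem probabilities are made up, the affix vector is drawn uniformly from the class&#8217;s affix space, and the function name is hypothetical.</p>

```python
import math


def analysis_log_likelihood(p_class, p_stem, affix_space_size):
    """log P(class) + log P(stem | class) + log P(affix vector | class),
    with the affix vector drawn uniformly from the class's affix space."""
    return math.log(p_class) + math.log(p_stem) - math.log(affix_space_size)


# Two competing hypotheses for one word, with equal (made-up) class and stem
# probabilities; only the size of the affix space differs.
atomic = analysis_log_likelihood(0.5, 0.01, affix_space_size=1)
inflected = analysis_log_likelihood(0.5, 0.01, affix_space_size=50)

# The atomic analysis is ahead by log(50) before the data has any say.
gap = atomic - inflected
```

<p>Any fix, whether a class prior or rescaled affix probabilities, has to make up that gap for every token the inflected class wants to claim.</p>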
]]></content:encoded>
			<wfw:commentRss>http://lemurz.org/i/o/2007/10/29/containing-the-morphology-problem/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Desiderata for a morphology learner</title>
		<link>http://lemurz.org/i/o/2007/10/26/desiderata-for-a-morphology-learner/</link>
		<comments>http://lemurz.org/i/o/2007/10/26/desiderata-for-a-morphology-learner/#comments</comments>
		<pubDate>Fri, 26 Oct 2007 23:27:04 +0000</pubDate>
		<dc:creator>yozhik</dc:creator>
		
		<category><![CDATA[morfo]]></category>

		<guid isPermaLink="false">http://lemurz.org/i/o/2007/10/26/desiderata-for-a-morphology-learner/</guid>
		<description><![CDATA[Definitions 
First, we define some concepts:

stem
The underlying base of a non-compound word, possibly corresponding to the word&#8217;s lemma form or root, but we view the choice of stem as a modeling decision. We use the word &#8220;stem&#8221; and not &#8220;root,&#8221; to include more shallow types of analysis that may seek only to identify the affixes [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Definitions </strong></p>
<p>First, we define some concepts:</p>
<ul>
<li><strong>stem</strong><br />
The underlying base of a non-compound word, possibly corresponding to the word&#8217;s lemma form or root, but we view the choice of stem as a modeling decision. We use the word &#8220;stem&#8221; and not &#8220;root,&#8221; to include more shallow types of analysis that may seek only to identify the affixes (and what&#8217;s left behind), and not to recover the fundamental word root.</li>
<li><strong>stem class</strong><br />
The type of the stem, often related to the familiar notion of part-of-speech, which determines which affixes may join with it. For more limited morphological analysis, we may constrain the stem and final word classes to be equal.</li>
<li><strong>surface affix</strong><br />
What is normally referred to as an affix, realized as zero or more characters of text.</li>
<li><strong>functional affix</strong><br />
An abstract affix that expresses a single underlying function of a surface affix, e.g. Number, Person, Tense, Aspect, etc., which may determine the final word class. Note that multiple functional affixes often map to a single surface affix.</li>
<li><strong>affix vector</strong><br />
A value in the space (either surface or functional) of all possible affix combinations allowed in a language.</li>
<li><strong>affix position</strong><br />
A set of one or more affixes that are mutually dependent and thus modeled jointly, occupying a single index in an affix vector. The concept of a position is also useful for specifying the mapping between functional and surface affixes. For example, in Spanish we may have multiple function positions for person, tense, etc. mapping to a single, &#8220;verb suffix&#8221; surface position.</li>
</ul>
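<p>As a rough illustration of these definitions, here is a minimal Python sketch. The names <code>Stem</code>, <code>FUNCTIONAL_POSITIONS</code>, <code>VERB_SUFFIX</code> and <code>surface_vector</code> are hypothetical, and the tiny Spanish table (present tense of a regular -ar verb) is only there to show several functional positions collapsing into a single surface position.</p>

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Stem:
    text: str        # what is left once the affixes are removed, e.g. "habl"
    stem_class: str  # determines which affixes may attach, e.g. "V"


# Functional affix vector: one index (position) per underlying function.
FUNCTIONAL_POSITIONS = ("person", "number", "tense")

# Hypothetical mapping from a functional vector to the single
# "verb suffix" surface position (regular -ar verbs, present tense only).
VERB_SUFFIX: Dict[Tuple[str, str, str], str] = {
    ("1", "sg", "pres"): "o",
    ("2", "sg", "pres"): "as",
    ("3", "sg", "pres"): "a",
    ("1", "pl", "pres"): "amos",
}


def surface_vector(functional: Tuple[str, str, str]) -> Tuple[str, ...]:
    """Map a functional affix vector to a one-position surface affix vector."""
    return (VERB_SUFFIX[functional],)


stem = Stem("habl", "V")
word = stem.text + "".join(surface_vector(("1", "pl", "pres")))
```

<p>Here three functional affixes (person, number, tense) are realized by one surface affix, which is the many-to-one mapping described under <em>functional affix</em>.</p>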
<p><span id="more-5"></span><br />
<strong>Word Formation </strong></p>
<p>Next, we describe a simplified model of word formation that, as suggested above, ignores stem-to-word class changes from affixes, which is sufficient for many applications, such as smoothing to address data sparsity. This framework allows us to model, for example, &#8220;working&#8221; as verbal stem &#8220;work&#8221; plus progressive suffix &#8220;ing,&#8221; but not &#8220;worker&#8221; as &#8220;work&#8221; plus nominalizing suffix &#8220;er.&#8221; Voilà:</p>
<ol>
<li>Select a word class <em>t</em>, given the utterance context.</li>
<li>Select a word stem <em>s</em> in <em>t</em>, given the utterance context.</li>
<li>Select a vector of functional affixes <strong><em>f</em></strong>, given <em>(t, s)</em> and the utterance context.</li>
<li>Given <em>(t, s, <strong>f</strong>)</em>, select a stem transformation (possibly identity) to produce the final stem <em>s&#8217;</em>.</li>
<li>Given <em>(t, s&#8217;, <strong>f</strong>)</em>, map <strong><em>f</em></strong> to a surface affix vector <strong><em>a</em></strong>.</li>
<li>Generate the final word <em>w</em> from <em>(t, s&#8217;, <strong>a</strong>)</em>. For most familiar languages, this process is deterministic, but some languages allow affixes to permute with each other and even with the stem.</li>
</ol>
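<p>The six steps above can be sketched as a toy generator. Everything here is an assumption made for illustration: the lexicon tables, the function names, the naive rule that drops a final <code>e</code> before <code>ing</code>, and the fact that the utterance context is ignored and replaced by uniform random choices.</p>

```python
import random

# Toy tables, all hypothetical.
STEMS = {"V": ["work", "make"]}
FUNCTIONAL = {"V": [("bare",), ("progressive",)]}
SURFACE = {"bare": "", "progressive": "ing"}


def transform_stem(t, s, f):
    """Step 4: naive e-drop before "ing" (make to mak), otherwise identity."""
    if f == ("progressive",) and s.endswith("e"):
        return s[:-1]
    return s


def to_surface(t, s_prime, f):
    """Step 5: map each functional affix to its surface form."""
    return tuple(SURFACE[x] for x in f)


def realize(t, s_prime, a):
    """Step 6: deterministic suffix concatenation."""
    return s_prime + "".join(a)


def generate(rng):
    t = rng.choice(sorted(STEMS))   # step 1: word class (context ignored)
    s = rng.choice(STEMS[t])        # step 2: stem in that class
    f = rng.choice(FUNCTIONAL[t])   # step 3: functional affix vector
    s_prime = transform_stem(t, s, f)
    a = to_surface(t, s_prime, f)
    return realize(t, s_prime, a)


sample = generate(random.Random(0))
```

<p>A real model would replace each <code>rng.choice</code> with a distribution conditioned on the context and on the earlier choices, and the e-drop rule is far too crude for real English (it would turn a stem like singe into singing).</p>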
<p>Now we extend this process with a few modifications to handle affixes that change the class of the stem:</p>
<ol>
<li>Select a word class <em>t</em>, given the utterance context.</li>
<li>Select a stem class <em>c</em> such that <em>t</em> is derivable, that is, <em>c</em> admits one or more affix vectors that will produce a final word class <em>t</em> when adjoined to a stem of class <em>c</em>.</li>
<li>Select a word stem <em>s</em> in <em>c</em>, given the utterance context.</li>
<li>Select a vector of functional affixes <strong><em>f</em></strong>, given <em>(t, c, s)</em> and the utterance context, such that <strong><em>f</em></strong> maps the stem class <em>c</em> to the final word class <em>t</em>.</li>
<li>Given <em>(t, c, s, <strong>f</strong>)</em>, select a stem transformation (possibly identity) to produce the final stem <em>s&#8217;</em>.</li>
<li>Given <em>(t, c, s&#8217;, <strong>f</strong>)</em>, map <strong><em>f</em></strong> to a surface affix vector <strong><em>a</em></strong>.</li>
<li>Generate the final word <em>w</em> from <em>(t, c, s&#8217;, <strong>a</strong>)</em>.</li>
</ol>
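<p>The two new constraints in the extended process (step 2, choosing a stem class from which <em>t</em> is derivable, and step 4, restricting <strong><em>f</em></strong> to vectors that map <em>c</em> to <em>t</em>) can be sketched with a lookup table. The table <code>CLASS_MAP</code> and both function names are hypothetical.</p>

```python
# (stem class, functional affix vector) maps to the final word class.
# Hypothetical toy table: "agentive" stands for a nominalizer like "er".
CLASS_MAP = {
    ("V", ()): "V",
    ("V", ("progressive",)): "V",
    ("V", ("agentive",)): "N",
    ("N", ()): "N",
    ("N", ("plural",)): "N",
}


def derivable_stem_classes(t):
    """Step 2: stem classes c with at least one affix vector yielding class t."""
    return sorted({c for (c, f), out in CLASS_MAP.items() if out == t})


def admissible_vectors(t, c):
    """Step 4: affix vectors that map stem class c to final word class t."""
    return sorted(f for (cc, f), out in CLASS_MAP.items() if cc == c and out == t)


noun_sources = derivable_stem_classes("N")
verb_to_noun = admissible_vectors("N", "V")
```

<p>With this table a noun can come from either a noun stem or a verb stem, and the only vector that takes a verb stem to a noun is the agentive one, which is what lets the model treat a word like worker as work plus a nominalizing suffix.</p>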
<p><strong>Desiderata</strong></p>
<p>Finally, the promised desiderata for a morphology learner:</p>
<ol>
<li>If an analysis for word <em>w</em> with the stem <em>(c, s)</em> is correct, we expect to observe a number of other words that also have an analysis with <em>(c, s)</em>, but with different, valid affix vectors.</li>
<li>Conversely, the hypothesis that a word is atomic (unanalyzed) is an implicit claim that the word will not appear as the stem in other words with affixes.</li>
<li>Homographs make things more difficult, as one word token may appear with different classes, some atomic and some not, e.g. &#8220;get caught up in the ins and outs of something&#8221; or Persian &#8220;(در در(ها&#8221; (&#8220;dr dr(hA)&#8221; = &#8220;at the door(s)&#8221;, an atomic preposition plus inflecting noun). Thus context must be part of the model and learning process, and a word&#8217;s marginal hypothesis distribution should never be too skewed, especially between atomic and analyzed classes.</li>
<li>It would be nice to be able to integrate a discovery process, such as the method of Yarowsky and Wicentowski, with (possibly partial) knowledge of the morphological system.</li>
<li>Another aspect of Y&amp;W that would be nice: Modeling the expectation of regularity over the relative frequency between different related word pairs, e.g. we expect <code>count</code>(walk)/<code>count</code>(walking) to be roughly equal to <code>count</code>(singe)/<code>count</code>(singeing).</li>
</ol>
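<p>Two of these desiderata lend themselves to a quick numerical check. The sketch below is not the Y&amp;W method itself, just one simple way to express the ideas: <code>stem_support</code> counts how many affixed forms of a hypothesized stem are attested (the first desideratum), and <code>ratio_gap</code> compares smoothed log frequency ratios of two word pairs (the last one). The function names, the smoothing constant and the counts are all made up for illustration.</p>

```python
import math
from collections import Counter


def stem_support(vocab, stem, suffixes):
    """Number of listed suffixes that yield an attested word with this stem."""
    return sum(1 for suf in suffixes if stem + suf in vocab)


def log_ratio(counts, base, inflected, smoothing=1.0):
    """log(count(base) / count(inflected)), smoothed to avoid division by zero."""
    return math.log((counts[base] + smoothing) / (counts[inflected] + smoothing))


def ratio_gap(counts, pair_a, pair_b):
    """Absolute difference of log ratios; close to 0 when the pairs behave alike."""
    return abs(log_ratio(counts, *pair_a) - log_ratio(counts, *pair_b))


# Invented counts, for illustration only.
counts = Counter({"walk": 200, "walking": 100,
                  "singe": 20, "singeing": 10, "sing": 500})

gap_good = ratio_gap(counts, ("walk", "walking"), ("singe", "singeing"))
gap_bad = ratio_gap(counts, ("walk", "walking"), ("sing", "singeing"))
```

<p>With these counts the correct pairing (singe, singeing) gives a much smaller gap than the wrong pairing (sing, singeing), which is the kind of evidence a learner could use to prefer one stem hypothesis over another.</p>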
]]></content:encoded>
			<wfw:commentRss>http://lemurz.org/i/o/2007/10/26/desiderata-for-a-morphology-learner/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
