Wednesday, March 19, 2014

Multiwords GSoC project braindump

I started writing an email for one of Apertium's potential GSoC students, trying to outline the different things that had been tried, why they all suck, and why extending the compiler is better. It ended up being a braindump (I think in footnotes - it's a problem), so rather than inflict it on him as is, I thought I'd put it here, give him a summary, and let him read it, if he dares.


I'm not going to talk too much about the hows and whys of multiwords, because you ought to know that by now, having decided to work on this problem. It should be enough to split them into four basic categories:

  • Uninflected ('in order to')
  • Final inflection ('fruit fly'/'fruit flies')
  • Inner inflection ('passer by'/'passers by')
  • Multiple inflection ('force majeure'/'forces majeures')[1]
Apertium (rather, lttoolbox) handles the first three just fine, essentially as 'words with spaces', and has its own tokenisation system to handle this. Which is quite nice, and makes it all the more frustrating when you have to use basically anything else that processes words. I'll come back to that, but at this point I would put a reference, if I could find it.

The way the multiply inflected entries are currently handled is essentially to add them as full form entries. This is a bit of an inconvenience for languages like Spanish (maximum 4 entries, usually), a major inconvenience for languages like Polish (14 entries).

The only way of generating multiword entries that I've seen in Apertium that worked was for Esperanto -- and it would only work for Esperanto (or other artificial languages), because Esperanto is regular to the point of having no exceptions (at least, that I'm aware of).

It was basically a shell script that did:
echo "<e><p><l>${word1}<b/>${word2}</l><r>${word1}<b/>${word2}<s n=\"n\"/><s n=\"sg\"/></r></p></e>"

Other languages are much less regular, so we would want to generate them based on the contents of the dictionary. Running something similar through the morphological generator wouldn't work well enough, because other languages also tend to have alternate forms. We would need to swap 'LR' and 'RL', make a temporary generator from that, check whether the entry has more than one output form, and generate an entry for each alternate form (or combination, since each word of the multiword can potentially have alternate forms). It would work, but it's horrible enough that, despite being relatively simple to do, nobody has bothered to do it.
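To make the shape of that expansion concrete, here's a minimal sketch in Python. The words, forms, and tags are invented for illustration; in reality the form lists would come out of the temporary generator described above, not be hard-coded.

```python
from itertools import product

def multiword_entries(word_forms, tags):
    """For each word of the multiword, word_forms holds the list of its
    alternate surface forms (as a temporary generator might produce them);
    emit one full-form <e> entry per combination of forms."""
    # The lemma side uses the first (canonical) form of each word.
    lemma = "<b/>".join(forms[0] for forms in word_forms)
    entries = []
    for combo in product(*word_forms):
        surface = "<b/>".join(combo)
        entries.append(f"<e><p><l>{surface}</l><r>{lemma}{tags}</r></p></e>")
    return entries

# Hypothetical alternate forms: the second word has two variants.
for e in multiword_entries([["fuerza"], ["mayor", "maior"]],
                           '<s n="n"/><s n="sg"/>'):
    print(e)
```

The combination count is exactly why this gets ugly: every extra alternate form multiplies the number of entries.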

While I'm waffling, I think I'll relate how I came to be involved in MT, so you'll hopefully understand my point of view. I once spent a few hours watching a girl (rather, the girl) translating a manual, word by word, using a dictionary. When I asked why she didn't use MT, she said they were all crap. So I started to look for something open source, and thus customisable, that could help. I found Moses. After playing with that long enough to figure out that basic SMT was[2] just not feasible for inflected languages, I looked again, and found Apertium. Now, the first thing that you might take away from this is that I basically got into MT to impress a girl[3]; the more fundamental thing is that I see MT as primarily being a translator's tool[4], and my target user was decidedly non-expert: I want it to be as simple as possible, or if it can't be simple, it should at least be consistent, to match what you expect.

Apertium came close enough (I won't pretend it's anywhere near simple enough) to spend several years of my life on, so I did. When I started, the prevailing view, even among the linguists, was that the linguistic aspects were secondary to translation: it didn't matter whether the analysis was linguistically correct, only that the translation was: Apertium had its priorities right. I mention this because it may become relevant: a lot of linguists (generally, those whose primary areas are not translation) will complain at length about the sort of things that are considered 'words' in Apertium.

I happily plugged away, porting over my shoddy, home-made morphological analyser[5] to lttoolbox for a while -- a couple of months, I think -- before I hit the problem of multiple inflection.

I tried to do something like this:
<e><i>foo</i><par n="foo__adj"/><i><b/>bar</i><par n="bar__n"/></e>
which is obviously wrong, because the output is something like this:
^foo<adj> bar<n>$

which the tagger turns into:
^foo bar<adj><n>$

but it gets worse when there's inflection information on both parts:
(foo<adj><sg>, foo<adj><pl>) (bar<n><sg>, bar<n><pl>)

overgenerates as:
^foo<adj><sg> bar<n><sg>$
^foo<adj><pl> bar<n><sg>$
^foo<adj><sg> bar<n><pl>$
^foo<adj><pl> bar<n><pl>$

(in my case, multiply by seven cases for nouns/adjectives and five genders for adjectives)
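The overgeneration is just the Cartesian product of the two analysis sets, and what we actually want is that product filtered by agreement. A toy sketch (the tag parsing here is deliberately naive, nothing like the real lttoolbox stream handling):

```python
from itertools import product

adj  = ["foo<adj><sg>", "foo<adj><pl>"]
noun = ["bar<n><sg>", "bar<n><pl>"]

def num(analysis):
    # last tag of the analysis, e.g. 'sg' from 'foo<adj><sg>' (toy parsing)
    return analysis.rstrip(">").rsplit("<", 1)[1]

# Without an agreement check, every combination comes out:
overgenerated = [f"^{a} {n}$" for a, n in product(adj, noun)]
# With one, only the matching pairs survive:
agreeing = [f"^{a} {n}$" for a, n in product(adj, noun) if num(a) == num(n)]

print(len(overgenerated))  # 4
print(len(agreeing))       # 2
```

With seven cases and five genders on top of number, the unfiltered product gets very large very quickly.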

I mention this not just as (potential) mentor, pointing out my own beginner's mistake, but because it's one of the more common beginner's mistakes: a not insignificant proportion of people expect that this would work.


Freeling

The first thing I looked at was Freeling, because it had a tool for multiwords, and it's open source. It's also very, very basic.

The dictionary format is something like this (horrible, positional tags omitted):

dirección<n><f><$num> postal<adj><*> dirección_postal<n><f><$num>

The problem is, there's no concordance check: there's no way to say that $num must match between the noun and the adjective for this to be a multiword, which is not so important for Spanish, but is for Polish.

My current favourite example for why this sucks ass is 'panna młoda'[7], because it's easier to come up with examples that don't look completely contrived.
     dał        pannie          młodego psa
[he] gave [the] young woman [a] young   dog

The words don't agree, so a multiword must not be formed from them.

Old GSoC project

I wouldn't have mentioned it (or remembered it), except you brought it up. Its one advantage over the Freeling thing was that it checked that the candidate words agree grammatically.
  • It used lists of strings, so the size of the dictionary is a huge factor in runtime speed; 
  • it worked by computing the Cartesian product of the lists of analyses, so the number of analyses would also greatly affect runtime speed (but this is true of any similar tool).
Concordance is not possible in a regular finite state machine (concordance requires memory), which the student pointed out. I thought it looked like a prototype that was cobbled together in a couple of days -- maybe I'm being unfair, but I don't think the result of that project looked anywhere close to three months' worth of work.

The really crappy idea

The first idea used FST-based matching, similarly to apertium-transfer (I say 'similarly', I ripped most of the code straight out of it), but with one slight change:

For simplicity's sake, let's say that what is matched by apertium-transfer looks like this:
'^' <any_char>* <any_tag>* '$'

I changed it to match:
'/' <dictionary entry part> <dictionary tags> ('/' or '$')

so it would match one analysis of each token. It would then perform the same sort of concordance as the GSoC project, but at least matching tokens wasn't painfully slow.
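As a rough approximation of that per-analysis pattern, written as a plain regex rather than the actual FST code: each analysis runs from a '/' up to the next '/' or '$'. (This is a toy model of the stream format, for illustration only.)

```python
import re

# Toy model of an analysed token in the stream:
#   '^' surface '/' analysis ('/' analysis)* '$'
# One match per analysis: the lemma part, then the run of tags.
analysis = re.compile(r"/([^/$<]+)((?:<[^>]+>)*)")

token = "^panna/panna<n><f><sg>/panna<n><f><voc>$"
print(analysis.findall(token))
# [('panna', '<n><f><sg>'), ('panna', '<n><f><voc>')]
```

Matching one analysis at a time like this is what made the token matching fast compared to the string-list approach.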

The 'special' tags were for concordance, and were specified (exactly) like def-attr in transfer, but processed differently:

<def-attr n="gen">
 <attr-item tag="m"/>
 <attr-item tag="f"/>
</def-attr>
<def-attr n="num">
 <attr-item tag="sg"/>
 <attr-item tag="pl"/>
</def-attr>

    <l>foo<s n="adj"/><m n="gen"/><m n="num"/><j/>bar<s n="n"/><o n="gen"/><o n="num"/></l>
    <r>foo<b/>bar<s n="n"/><m n="gen"/><m n="num"/></r>
The attr-items were handled like a paradigm in the bidix; "gen" would look like:

<pardef n="gen">
  <e><p><l><s n="m"/></l><r></r></p></e>
  <e><p><l><s n="f"/></l><r></r></p></e>
</pardef>
This paradigm equivalent was inserted in place of <m> (for match), while for <o> a 'special' tag (<%gen%>) was inserted. A map of the def-attrs was written at the head of the dictionary; at runtime it was converted to a regex (like in transfer).

On matching input, the % tags were extracted from the output, then agreement was checked using the regexes, with the matching tags replacing the % tag. <j> on the right was +, as usual, with the intention being that it should be possible to add entries for ambiguity between multiword and not; <j> on the left was $, optional <b/>, and ^, to match token boundaries.
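A toy sketch of that agreement step, with the def-attr sets hard-coded as regex alternations (which is roughly what transfer compiles them to; the analyses and tag sets here are made up):

```python
import re

# def-attr sets compiled to alternations:
attrs = {"gen": re.compile(r"<(m|f)>"), "num": re.compile(r"<(sg|pl)>")}

def agree(left, right, attr):
    """Return the shared value of attr if both analyses carry the same
    one, else None. The returned value is what would replace the
    % tag in the output."""
    lm = attrs[attr].search(left)
    rm = attrs[attr].search(right)
    if lm and rm and lm.group(1) == rm.group(1):
        return lm.group(1)
    return None

print(agree("foo<adj><f><sg>", "bar<n><f><sg>", "gen"))  # f
print(agree("foo<adj><f><sg>", "bar<n><m><sg>", "gen"))  # None
```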

At a shallow look, this seems pretty reasonable (well, it seemed reasonable enough to me to waste more time than I'd care to admit on it). What makes it fail hard is the case when there's another lemma in one set of analyses with tags that agree with the other, so it could output forms 'from nowhere'.

The slightly less crappy idea

As I said, concordance isn't possible in a regular finite state machine, because that requires memory. So, add memory! -- this is how back references in regular expressions work.

I added a pair of state queues, prevq and curq, and changed the way analysis happened to handle ambiguous input. I'll keep looking for the code, but what I remember of the modifications is:

/ was treated as end of token (stepped over, as $ is). If the / was not at the beginning of the token, and there was a final state at this point, a boolean was set to true (the marking of the last buffer position only happened on $; the current analysis was copied into a string in the first iteration of prevq). If the current state still had live states (alive_states.size != 0), it was pushed to curq.

When it read ^, it stepped over it, then appended everything until / to a merged_lemma (if there was anything between $ and ^, a single space was added). If prevq was empty, the current state was set to the initial state; otherwise, for each analysis, it iterated over the states in prevq until the next / (basically a breadth-first search).

When it read $, if the current state was final or if the boolean had been set, it marked the position of the buffer. Alive states were handled as with /. If curq was not empty, prevq was set to curq, and curq reset.
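Stripped of the buffer and stream handling, the queue logic amounts to a breadth-first search over one analysis per token. A heavily simplified sketch, with a hypothetical predicate standing in for the transducer's alive-state check:

```python
def match_tokens(token_analyses, alive):
    """BFS over one analysis per token: prevq holds the partial matches
    that survived the previous token; curq collects the new ones."""
    prevq = [()]
    for analyses in token_analyses:
        curq = []
        for path in prevq:
            for a in analyses:
                candidate = path + (a,)
                if alive(candidate):  # stand-in for the alive-state check
                    curq.append(candidate)
        prevq = curq
    return prevq

# Toy 'transducer': alive while the number tags seen so far all agree.
def agrees(path):
    nums = {a.rstrip(">").rsplit("<", 1)[1] for a in path}
    return len(nums) <= 1

tokens = [["foo<adj><sg>", "foo<adj><pl>"],
          ["bar<n><sg>", "bar<n><pl>"]]
print(match_tokens(tokens, agrees))
```

The point of the queues is that non-agreeing partial matches die as soon as they fail, instead of being fully expanded and filtered afterwards.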

The agreement would happen in the compiler (I never wrote one), which is what made me realise that this is just not good enough:

First, the fact that three out of four of the types of multiwords I mentioned would be handled in analysis, while the fourth would be farmed out to a separate tool: not consistent.
Second, it wouldn't be significantly more convenient for languages that already handle these multiwords with full form entries, so it would not be used.
Third, expanding the analyses just to collapse them again is always going to be slower than just doing the correct analysis the first time around.[8]
Fourth, if the agreement can be handled in a compiler, as would be necessary here, why not just take the extra step and keep one single compiler?

All this leads me to believe that the only reasonable way forward is to extend the compiler to make this case more convenient.

The crappy dictionary extension

I had thought, just after the beginner's mistake, that it might be possible to do this with special treatment of <par>, with the addition of a special type of <pardef> to specify how the multiword is composed:

<pardef n="n-adj" type="mw">
    <l><w/><s n="n"/><s n="sg"/><j/><w/><s n="adj"/><s n="sg"/></l>
    <r><w/><b/><w/><s n="n"/><s n="sg"/></r>
</pardef>

<e><i>pan</i><par n="pan/na__n" proc="notags"/><p><l>młod</l><r>młoda</r></p><par n="młod/y__adj" proc="none"/><par n="n-adj" proc="agreement"/></e>

...where notags means only the textual part of <r> elements would be used, none means the <r> parts would be discarded entirely, and agreement... I hope that speaks for itself.

This is horrible, horrible, horrible. Don't do this. Consistency is important, but this is false consistency -- it treats different things as though they were the same.

Things to take from this:
  • You need access to the forms you're generating from. Here, it would have been a cache, filled when reading <pardefs>: map< lemma, map< string, pair<string, string> > > -- pardef to (tags to (left, right)).
  • Specifying multiwords is different, so a different dictionary is probably better than trying to shoe-horn it in.

For me, all roads lead back to the Esperanto-ish way: generating multiwords from a (separate!) template, with word forms filled in from the dictionary. To get things going, it might be better to just use strings (see lt-expand), but the aim would be to create a second, in-memory FST for 'analysis-as-generation' (or vice-versa), maybe by compiling each entry twice, or using a second pass over the dictionary.
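To give a feel for the string-based first step, here's a sketch. The paradigm names are the ones from the example above, but the expanded form tables and the template function are invented; the real thing would pull forms from the dictionary (lt-expand style) rather than a hard-coded dict, and handle more than number agreement.

```python
# Hypothetical expanded paradigms: number -> surface form.
paradigms = {
    "pan/na__n":   {"sg": "panna", "pl": "panny"},
    "młod/y__adj": {"sg": "młoda", "pl": "młode"},
}

def expand_template(word_pars, pos_tags):
    """Fill a multiword template from the paradigms, producing one
    full-form entry per agreeing number value."""
    # Lemma side: the canonical (sg) form of each word.
    lemma = "<b/>".join(paradigms[p]["sg"] for p in word_pars)
    entries = []
    for n in ("sg", "pl"):
        # Agreement is built in: every slot gets the same number value.
        surface = "<b/>".join(paradigms[p][n] for p in word_pars)
        entries.append(f'<e><p><l>{surface}</l>'
                       f'<r>{lemma}{pos_tags}<s n="{n}"/></r></p></e>')
    return entries

for e in expand_template(["pan/na__n", "młod/y__adj"], '<s n="n"/><s n="f"/>'):
    print(e)
```

The in-memory 'analysis-as-generation' FST would replace the hard-coded dict, but the template-filling logic would stay essentially this shape.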

[1] Note how I had to stoop to using a French term to get an 'English' example.
[2] I say 'was', I mean 'was, is, and will be for the foreseeable future'.
[3] If you were wondering: why yes! I have sought psychiatric help :)
[4] ...and not, e.g., as a means of showcasing a morphological analyser.
[5] It had originally been a lemmatiser, so I could look up words, but I found that it helped to learn case endings, which helped me to learn the (her) language better. As an aside to an aside to an aside: learning a language to impress girls is a good idea; learning one to impress a girl is a bad idea. I think I impressed every Polish woman I met except the one I was trying[6] to impress.
[6] When I say this, it's usually picked up as some sort of judgement of her, when I intend it to be of me. 'Trying' is the key word.
[7] 'panna' is something like 'mademoiselle' in French: as a way of addressing a young and/or unmarried woman, it's like 'miss' in English. By itself, it's often translated as 'young woman', which becomes a problem with 'młoda panna', because 'młoda' means 'young'. 'panna młoda', however, (in that order only) means 'bride'.
[8] My Dad did a lot of carpentry when I was young. When I was in Cub Scouts, I thought I'd try for the woodwork badge, with his help. What followed was a day that I remember as my Dad screaming "measure twice, cut once" from morning to evening. I've done my best to avoid carpentry since, and this is cutting twice, which makes me anxious.