Yet another uninformative blog title

Wednesday, March 19, 2014

Multiwords GSoC project braindump

I started writing an email for one of Apertium's potential GSoC students, trying to outline the different things that had been tried, why they all suck, and why extending the compiler is better, but it ended up being a braindump (I think in footnotes - it's a problem), so rather than inflict it on him as is, I thought I'd put it here, give him a summary, and let him read, if he dares.

----

I'm not going to talk too much about the hows and whys of multiwords, because you ought to know that by now, having decided to work on this problem. It should be enough to split them into four basic categories:

Uninflected ('in order to')
Final inflection ('fruit fly'/fruit flies')
Inner inflection ('passer by'/'passers by')
Multiple inflection ('force majeure'/'forces majeures')[1]

Apertium (rather, lttoolbox) handles the first three just fine, essentially as 'words with spaces', and has its own tokenisation system to handle this. Which is quite nice, and makes it all the more frustrating when you have to use basically anything else that processes words. I'll come back to that, but at this point I would put a reference, if I could find it.

The way the multiply inflected entries are currently handled is essentially to add them as full form entries. This is a bit of an inconvenience for languages like Spanish (maximum 4 entries, usually), a major inconvenience for languages like Polish (14 entries).

The only way of generating multiword entries that I've seen in Apertium that worked was for Esperanto -- and it would only work for Esperanto (or other artificial languages), because Esperanto is regular to the point of having no exceptions (at least, that I'm aware of).

It was basically a shell script that did:

echo "<e><p><l>${word1}<b/>${word2}</l><r>${word1}<b/>${word2}<s n=\"n\"/><s n=\"sg\"/></r></p></e>"

etc.

Other languages are much less regular, so we would want to generate them based on the contents of the dictionary. Running something similar through the morphological generator wouldn't work well enough, because other languages also tend to have alternate forms, so we would need to swap 'LR' and 'RL', make a temporary generator from that, check if the entry has more than one output form, and generate an entry for each alternate form (or combination, because each word of the multiword can potentially have alternate forms). It would work, but it's horrible enough that, despite being relatively simple to do, nobody has bothered to do it.

While I'm waffling, I think I'll relate how I came to be involved in MT, so you'll hopefully understand my point of view. I once spent a few hours watching a girl (rather, the girl) translating a manual, word by word, using a dictionary. When I asked why she didn't use MT, she said they were all crap. So I started to look for something open source, and thus customisable, that could help. I found Moses. After playing with that long enough to figure out that basic SMT was[2] just not feasible for inflected languages, I looked again, and found Apertium. Now, the first thing that you might take away from this is that I basically got into MT to impress a girl[3], the more fundamental thing is that I see MT as primarily being a translator's tool[4], and my target user was decidedly non-expert: I want it to be as simple as possible, or if it can't be simple, it should at least be consistent, to match what you expect.

Apertium came close enough (I won't pretend it's anywhere near simple enough) to spend several years of my life on, so I did. When I started, the prevailing view, even among the linguists, was that the linguistic aspects were secondary to translation: it didn't matter if something was correct, only that the translation was: Apertium had its priorities right. I mention this because it may become relevant: a lot of linguists (generally, whose primary areas are not translation) will complain at length about the sort of things that are considered 'words' in Apertium.

I happily plugged away, porting over my shoddy, home-made morphological analyser[5] to lttoolbox for a while -- a couple of months, I think -- before I hit the problem of multiple inflection.

I tried to do something like this:

<e><i>foo</i><par n="foo__adj"><i><b/>bar</i><par n="bar__n"></e>

which is obviously wrong, because the output is something like this:

^foo<adj> bar<n>$

which the tagger turns into:

^foo bar<adj><n>$

but it gets worse when there's inflection information on both parts:

(foo<adj><sg>, foo<adj><pl>) (bar<n><sg>, bar<n><pl>)

overgenerates as:

^foo<adj><sg> bar<n><sg>$
^foo<adj><pl> bar<n><sg>$
^foo<adj><sg> bar<n><pl>$
^foo<adj><pl> bar<n><pl>$

(in my case, multiply by seven cases for nouns/adjectives and five genders for adjectives)

I mention this not just as (potential) mentor, pointing out my own beginner's mistake, but because it's one of the more common beginner's mistakes: a not insignificant proportion of people expect that this would work.

Freeling

The first thing I looked at was Freeling, because it had a tool for multiwords, and it's open source. It's also very, very basic.

The dictionary format is something like this (horrible, positional tags omitted):

dirección<n><f><$num> postal<adj><*> dirección_general<n><f><$num>

The problem is, there's no concordance check: there's no way to say that $num must match between the noun and the adjective for this to be a multiword, which is not so important for Spanish, but is for Polish.

My current favourite example for why this sucks ass is 'panna młoda'[7], because it's easier to come up with examples that don't look completely contrived.

E.g.:

     dał        pannie          młodego psa
[he] gave [the] young woman [a] young   dog

The words don't agree, so a multiword must not be formed from them.

Old GSoC project

I wouldn't have mentioned it (or remembered it), except you brought it up. Its one advantage over the Freeling thing was that it checked that the candidate words agree grammatically.

It used lists of strings, so the size of the dictionary is a huge factor in runtime speed;
it worked by computing the Cartesian product of the lists of analyses, so the number of analyses would also greatly affect runtime speed (but this is true of any similar tool).

Concordance is not possible in a regular finite state machine (concordance requires memory), which the student pointed out. I thought it looked like a prototype that was cobbled together in a couple of days -- maybe I'm being unfair, but I don't think the result of that project looked anywhere close to three months' worth of work.

The really crappy idea

The first idea used FST-based matching, similarly to apertium-transfer (I say 'similarly', I ripped most of the code straight out of it), but with one slight change:

For simplicity's sake, let's say that what is matched by apertium-transfer looks like this:

'^' <any_char>* <any_tag>* '$'

I changed it to match:

'/' <dictionary entry part> <dictionary tags> ('/' or '$')

so it would match one analysis of each token. It would then perform the same sort of concordance as the GSoC project, but at least matching tokens wasn't painfully slow.

The 'special' tags were for concordance, and were specified (exactly) like def-attr in transfer, but processed differently:

<def-attr n="gen">
 <attr-item tag="m"/>
 <attr-item tag="f"/>
</def-attr>
<def-attr n="num">
 <attr-item tag="sg"/>
 <attr-item tag="pl"/>
</def-attr>

  <e><p>
    <l>foo<s n="adj"/><m n="gen"/><m n="num"/><j/>bar<s n="n"/><o n="gen"/><o n="num"/></l>
    <r>foo<b/>bar<s n="n"/><m n="gen"/><m n="num"/></r>
  </p></e>

The attr-items were handled like a paradigm in the bidix, "gen" would look like:

<pardef n="gen">
  <e><p><l><s n="m"/></l><r></r></p></e>
  <e><p><l><s n="f"/></l><r></r></p></e>
</pardef>

this paradigm equivalent was inserted in place of <m> (for match), while 'special' tags were inserted for <o>: <%gen%>. A map of the def-attrs was written at the head of the dictionary, at runtime it was converted to a regex (like in transfer).

On matching input, the % tags were extracted from the output, then agreement was checked using the regexes, with the matching tags replacing the % tag. <j> on the right was +, as usual, with the intention being that it should be possible to add entries for ambiguity between multiword and not; <j> on the left was $, optional <b/>, and ^, to match token boundaries.

At a shallow look, this seems pretty reasonable (well, it seemed reasonable enough to me to waste more time that I'd care to admit on it). What makes it fail hard is the case when there's another lemma in one set of analyses with tags that agree with the other, so it could output forms 'from nowhere'.

The slightly less crappy idea

As I said, concordance isn't possible in a regular finite state machine, because that requires memory. So, add memory! -- this is how back references in regular expressions work.

I added a pair of state queues, prevq and curq, and changed the way analysis happened to handle ambiguous input. I'll keep looking for the code, but what I remember of the modifications is:

/ was treated as end of token (stepped over as $). If the / was not at the beginning of the token, and there was a final state at this point, a boolean was set to true (the marking of the last buffer position only happened on $; the current analysis was copied into a string in the first iteration of prevq). If the current state had alive states.size != 0, then the current state was pushed to curq.

When it read ^, it stepped over it, then appended everything until / to a merged_lemma (if there was anything between $ and ^, a single space was added). If prevq was empty, current state was set to initial state, otherwise, for each analysis, iterate over the states in prevq until the next / (basically a breadth first search).

When it read $, if the current state was final or if the boolean had been set, it marked the position of the buffer. Alive states were handled as with /. If curq was not empty, prevq was set to curq, and curq reset.

The agreement would happen in the compiler (I never wrote one), which is what made me realise that this is just not good enough:

First, the fact that three out of four of the types of multiwords I mentioned would be handled in analysis, while the fourth would be farmed out to a separate tool: not consistent.
Second, it wouldn't be significantly more convenient for languages that already handle these multiwords with full form entries, so it would not be used.
Third, expanding the analyses just to collapse them again is always going to be slower than just doing the correct analysis the first time around.[8]
Fourth, if the agreement can be handled in a compiler, as would be necessary here, why not just take the extra step, and keep one single compiler.

All this leads me to believe that the only reasonable way forward is to extend the compiler to make this case more convenient.

The crappy dictionary extension

I had thought, just after the beginner's mistake, that it might be possible to do this with special treatment of <par>, with the addition of a special type of <pardef> to specify how the multiword is composed:

<pardef n="n-adj" type="mw">
  <e><p>
    <l><w/><s n="n"/><s n="sg"/><j/><w/><s n="adj"/><s n="sg"/></l>
    <r><w/><b/><w/><s n="n"/><s n="sg"/></r>
  </p></e>
</pardef>

<e><i>pan</i><par n="pan/na__n" proc="notags"/><p><l>młod</l><r>młoda</r></e><par n="młod/y__adj" proc="none"/><par n="n-adj" proc="agreement"/></e>

...where notags means only the textual part of <r> elements would be used, none means the <r> parts would be discarded entirely, and agreement... I hope that speaks for itself.

This is horrible, horrible, horrible. Don't do this. Consistency is important, but this is false consistency -- it treats different things as though they were the same.

Things to take from this:
You need access to the forms you're generating from. Here, it would have been a cache, filled when reading <pardefs>: map< lemma, map< string, pair<string, string> > >-- pardef to (tags to (left, right)).
specifying multiwords is different, so a different dictionary is probably better than trying to shoe-horn it in

For me, all roads lead back to the Esperanto-ish way: generating multiwords from a (separate!) template, with word forms filled in from the dictionary. To get things going, it might be better to just use strings (see lt-expand), but the aim would be to create a second, in-memory FST for 'analysis-as-generation' (or vice-versa), maybe by compiling each entry twice, or using a second pass over the dictionary.

[1] Note how I had to stoop to using a French term to get an 'English' example.

[2] I say 'was', I mean 'was, is, and will be for the foreseeable future'.

[3] If you were wondering: why yes! I have sought psychiatric help :)

[4] ...and not, e.g., as a means of showcasing a morphological analyser.

[5] It had originally been a lemmatiser, so I could look up words, but I found that it helped to learn case endings, which helped me to learn the (her) language better. As an aside to an aside to an aside: learning a language to impress girls is a good idea, to impress a is a bad idea. I think I impressed every Polish woman I met except the one I was trying[6] to impress.

[6] When I say this, it's usually picked up as some sort of judgement of her, when I intend it to be of me. 'Trying' is the key word.

[7] 'panna' is something like 'mademoiselle' in French: as a way of addressing a young and/or unmarried woman, it's like 'miss' in English. By itself, it's often translated as 'young woman', which becomes a problem with 'młoda panna', because 'młoda' means 'young'. 'panna młoda', however, (in that order only) means 'bride'.

[8] My Dad did a lot of carpentry when I was young. When I was in Cub Scouts, I thought I'd try for the woodwork badge, with his help. What followed was a day that I remember as my Dad screaming "measure twice, cut once" from morning to evening. I've done my best to avoid carpentry since, and this is cutting twice, which makes me anxious.

Thursday, March 1, 2012

RDFa data mirage

I click 'view source' on this page and I see this:

which, in my hallucinations, I imagine should contain this:
But the W3C's RDFa Distiller tells me it actually contains this:

(because everyone needs to know what CSS files a page is using, right?)
and URIBurner spits back this shit:

What
the
fuck?

Wednesday, November 2, 2011

I tried out DBpedia Spotlight with part of the introduction to "With Fire and Sword":

Now let us see what this western history was. In the middle of the ninth century Slav tribes of various denominations occupied the entire Baltic coast west of the Vistula; a line drawn from Lubeck to the Elbe, ascending the river to Magdeburg, thence to the western ridge of the Bohemian mountains, and passing on in a somewhat irregular course, leaving Carinthia and Styria on the east, gives the boundary between the Germans and the Slavs at that period. Very nearly in the centre of the territory north of Bohemia and the Carpathians lived one of a number of Slav tribes, the Polyane (or men of the plain), who occupied the region afterwards called Great Poland by the Poles, and now called South Prussia by the Germans. In this Great Poland political life among the Northwestern Slavs began in the second half of the ninth century. About the middle of the tenth, Mechislav (Mieczislaw), the ruler, received Christianity, and the modest title of Count of the German Empire. Boleslav the Brave, his son and successor, extended his territory to the upper Elbe, from which region its boundary line passed through or near Berlin, whence it followed the Oder to the sea. Before his death, in 1025, Boleslav wished to be anointed king by the Pope. The ceremony was denied him, therefore he had it performed by bishops at home. About a century later the western boundary was pushed forward by Boleslav Wry-mouth (1132-1139) to a point on the Baltic about half-way between Stettin and Lubeck. This was the greatest extension of Poland to the west. Between this line and the Elbe were Slav tribes; but the region had already become marken (marches) where the intrusive Germans were struggling for the lands and persons of the Slavs.

Corrected:

Now let us see what this western history was. In the middle of the ninth century Slav tribes of various denominations occupied the entire Baltic coast west of the Vistula; a line drawn from Lubeck to the Elbe, ascending the river to Magdeburg, thence to the western ridge of the Bohemian mountains, and passing on in a somewhat irregular course, leaving Carinthia and Styria on the east, gives the boundary between the Germans and the Slavs at that period. Very nearly in the centre of the territory north of Bohemia and the Carpathians lived one of a number of Slav tribes, the Polyane (or men of the plain), who occupied the region afterwards called Great Poland by the Poles, and now called South Prussia by the Germans. In this Great Poland political life among the Northwestern Slavs began in the second half of the ninth century. About the middle of the tenth, Mechislav (Mieczislaw), the ruler, received Christianity, and the modest title of Count of the German Empire. Boleslav the Brave, his son and successor, extended his territory to the upper Elbe, from which region its boundary line passed through or near Berlin, whence it followed the Oder to the sea. Before his death, in 1025, Boleslav wished to be anointed king by the Pope. The ceremony was denied him, therefore he had it performed by bishops at home. About a century later the western boundary was pushed forward by Boleslav Wry-mouth (1132-1139) to a point on the Baltic about half-way between Stettin and Lubeck. This was the greatest extension of Poland to the west. Between this line and the Elbe were Slav tribes; but the region had already become marken (marches) where the intrusive Germans were struggling for the lands and persons of the Slavs.

Friday, August 26, 2011

stupid-unknown-extractor

This is a small utility for extracting unknowns from a pair of tagged streams. It's based on that idea that, if each sentence has a single unknown word, then those words are likely to be translations of each other. It's stupid because it does nothing if there is more than one unknown.

Usage:

$ cat en.txt | apertium -d [path-to]/apertium-en-es/ en-es-tagger > en.tagged
$ cat es.txt | apertium -d [path-to]/apertium-en-es/ es-en-tagger > es.tagged
$ stupid-unknown-extractor en.tagged es.tagged

Sample output:

buckler	adarga	old<adj><sint> *buckler ,<cm> :: *adarga antiguo<adj><f><sg>
homespun	velludo	good<adj><sint><sup> *homespun .<sent> :: de<pr> *velludo para<pr>
gaunt	recia	*gaunt -<guio> :: complexión<n><f><sg> *recia ,<cm>

Compile:

g++ stupid_unknown_extractor.cc -o stupid-unknown-extractor

Code:

#include <iostream>
#include <string>
#include <cstdio>
#include <list>
#include <vector>

using namespace std;

//Set to true to also split at <cm>
bool split_cm = true;

inline bool
is_sent(wstring &in)
{
return ((in.size() > 6) && (in.compare(in.size()-6, 6, L"<sent>") == 0));
}

inline bool
is_cm(wstring &in)
{
return ((in.size() > 4) && (in.compare(in.size()-4, 4, L"<cm>") == 0));
}

inline bool
is_split (wstring &in)
{
if (split_cm)
return (is_sent(in) || is_cm(in));
else
return (is_sent(in));
}

wstring
read_word(FILE *input)
{
wstring out = L"";
wchar_t c;
bool inword = false;

while(!feof(input))
{
c = static_cast<wchar_t>(fgetwc(input));
if (!inword)
{
if (c == L'^')
{
inword = true;
}
if (c == L'\\')
{
c = static_cast<wchar_t>(fgetwc(input));
}
}
else
{
if (c == L'$')
{
return out;
}
if(c == L'\\')
{
out += L'\\';
c = static_cast<wchar_t>(fgetwc(input));
out += c;
}
else
{
out += c;
}
}
}

return L"";
}

void usage()
{
wcout << L"usage: stupid-unknown-extractor file1 file2 [output]" << endl;
}

bool
read_sentence (FILE* file, vector<wstring> &tokens)
{
wstring word;
while (!feof(file))
{
word = read_word(file);
tokens.push_back(word);
if (is_split(word))
{
return true;
}
}
return false;
}

vector<int>
unknown_indices(vector<wstring> sentence)
{
vector<int> index;
vector<wstring>::iterator it;
int count = 0;

for (it=sentence.begin(); it < sentence.end(); it++)
{
if ((*it)[0] == L'*')
{
index.push_back(count);
}
count++;
}
return index;
}

void
print_context(FILE* out, vector<wstring> &sent, int index)
{
wstring tmp;
if (index >= 1)
{
fputws(sent[index - 1].c_str(), out);
fputwc(L' ', out);
}
fputws(sent[index].c_str(), out);
fputwc(L' ', out);
fputws(sent[index + 1].c_str(), out);
}

void
print_unk(FILE* out, vector<wstring> &sent, int index)
{
if (sent.at(index)[0] != L'*')
{
wcerr << L"Error with unknown: " << sent.at(index) << endl;
}
fputws(sent.at(index).substr(1, sent.at(index).length() - 1).c_str(), out);
}

void
try_output(FILE* out, vector<wstring> &left, vector<wstring> &right)
{
vector<int> lvec;
vector<int> rvec;
lvec = unknown_indices(left);
rvec = unknown_indices(right);

if ((lvec.size() == 1) && (rvec.size() == 1))
{
print_unk(out, left, lvec.at(0));
fputwc(L'\t', out);
print_unk(out, right, rvec.at(0));
fputwc(L'\t', out);

// context
print_context(out, left, lvec.at(0));
fputws(L" :: ", out);
print_context(out, right, rvec.at(0));
fputwc(L'\n', out);
}
}

int main (int argc, char** argv)
{
FILE* left;
FILE* right;
FILE* out;

if (argc < 3 || argc > 4)
{
usage();
exit(1);
}
if (argc == 3)
{
out = stdout;
}
else
{
out = fopen(argv[3], "wb");
}

left = fopen(argv[1], "rb");
right = fopen(argv[2], "rb");

vector<wstring> sentl;
vector<wstring> sentr;

while (read_sentence(left, sentl) && read_sentence(right, sentr))
{
try_output(out, sentl, sentr);
sentl.clear();
sentr.clear();
}

fclose(left);
fclose(right);
fclose(out);
exit(0);
}

Tuesday, June 28, 2011

Tenancy agreement, GFDL

Saturday, May 7, 2011

Bitext in my pocket

Smoking is a bad habit blah blah blah but being involved in MT gives me a convenient excuse - parallel text

Thursday, May 5, 2011

The missing slide/The first Apertium nursery rhyme

In my first talk on day one of the workshop, I forgot the slide that explained the most important part of the dictionaries. I promised to include it the next day (but day one was so disastrous, and I presented everything so badly, that instead I just worked through the notes again, explaining things slowly and adding the new material at the end).

So, we had a meeting to discuss how things went (so badly wrong), when I decided not to use a presentation at all, and towards the end, someone said "it's getting late, we should go to sleep". I said I was too wired to sleep, and Juan Antonio offered to read me a nursery rhyme to help me sleep.

A few minutes later, I went outside to smoke, and thought... hey, I can explain this as a nursery rhyme... so I thought of something quickly, then looked for a cutesy template to use for the slide, and put it on my phone to show the others. They liked the rhyme, but not the picture -- thought it would be insultingly childish. They were right, so I skipped it.

After the workshop, we had the social evening (Chinese buffet), and everyone was talking, drinking, laughing, so I thought I'd show it to a few of the participants. They all loved it. Most of the women had congregated at one table, so I showed them all at the same time. Gema's assessment later was something like "with that slide, Jim chatted up every woman at the table. Including me!".

Day 3 went pretty smoothly (we'd made enough changes for day 2 to recover from day 1), but Tomas decided to rant about XML again (he had launched a ~30 minute discussion about it on day 2. He clearly dislikes XML.) and Felipe asked me if I had the text, I told him I'd put it on the wiki, and he asked me to put it on screen. Some of the women who had been at the restaurant spotted it, and laughed, but most eyes were on Tomas, so I took the first chance to interrupt him, and point at the screen as an example that "we don't think it's too hard to remember".

Then Felipe called on me to read it. Bastard. So I did, and, thankfully, everyone laughed. And applauded, which was odd.

So, that's the story of the first (and hopefully last) Apertium nursery rhyme.