Wednesday, November 2, 2011


I tried out DBpedia Spotlight with part of the introduction to "With Fire and Sword":
Now let us see what this western history was. In the middle of the ninth century Slav tribes of various denominations occupied the entire Baltic coast west of the Vistula; a line drawn from Lubeck to the Elbe, ascending the river to Magdeburg, thence to the western ridge of the Bohemian mountains, and passing on in a somewhat irregular course, leaving Carinthia and Styria on the east, gives the boundary between the Germans and the Slavs at that period. Very nearly in the centre of the territory north of Bohemia and the Carpathians lived one of a number of Slav tribes, the Polyane (or men of the plain), who occupied the region afterwards called Great Poland by the Poles, and now called South Prussia by the Germans. In this Great Poland political life among the Northwestern Slavs began in the second half of the ninth century. About the middle of the tenth, Mechislav (Mieczislaw), the ruler, received Christianity, and the modest title of Count of the German Empire. Boleslav the Brave, his son and successor, extended his territory to the upper Elbe, from which region its boundary line passed through or near Berlin, whence it followed the Oder to the sea. Before his death, in 1025, Boleslav wished to be anointed king by the Pope. The ceremony was denied him, therefore he had it performed by bishops at home. About a century later the western boundary was pushed forward by Boleslav Wry-mouth (1132-1139) to a point on the Baltic about half-way between Stettin and Lubeck. This was the greatest extension of Poland to the west. Between this line and the Elbe were Slav tribes; but the region had already become marken (marches) where the intrusive Germans were struggling for the lands and persons of the Slavs.
Corrected:

Now let us see what this western history was. In the middle of the ninth century Slav tribes of various denominations occupied the entire Baltic coast west of the Vistula; a line drawn from Lubeck to the Elbe, ascending the river to Magdeburg, thence to the western ridge of the Bohemian mountains, and passing on in a somewhat irregular course, leaving Carinthia and Styria on the east, gives the boundary between the Germans and the Slavs at that period. Very nearly in the centre of the territory north of Bohemia and the Carpathians lived one of a number of Slav tribes, the Polyane (or men of the plain), who occupied the region afterwards called Great Poland by the Poles, and now called South Prussia by the Germans. In this Great Poland political life among the Northwestern Slavs began in the second half of the ninth century. About the middle of the tenth, Mechislav (Mieczislaw), the ruler, received Christianity, and the modest title of Count of the German Empire. Boleslav the Brave, his son and successor, extended his territory to the upper Elbe, from which region its boundary line passed through or near Berlin, whence it followed the Oder to the sea. Before his death, in 1025, Boleslav wished to be anointed king by the Pope. The ceremony was denied him, therefore he had it performed by bishops at home. About a century later the western boundary was pushed forward by Boleslav Wry-mouth (1132-1139) to a point on the Baltic about half-way between Stettin and Lubeck. This was the greatest extension of Poland to the west. Between this line and the Elbe were Slav tribes; but the region had already become marken (marches) where the intrusive Germans were struggling for the lands and persons of the Slavs.

Friday, August 26, 2011

stupid-unknown-extractor

This is a small utility for extracting unknowns from a pair of tagged streams. It's based on the idea that, if each sentence in an aligned pair has a single unknown word, then those words are likely to be translations of each other. It's stupid because it does nothing if there is more than one unknown.
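The idea above can be sketched on two already-tokenised sentences. This is only an illustration, not the utility's actual code (that follows below): the names `unknowns` and `candidate_pair` are made up for this sketch, and it assumes unknown words are marked with a leading `*`, as in the tagger output shown in the sample.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collect the unknown words of a sentence.
// Assumes unknowns are the tokens that start with '*'.
static std::vector<std::wstring>
unknowns(const std::vector<std::wstring> &tokens)
{
  std::vector<std::wstring> found;
  for (std::size_t i = 0; i < tokens.size(); i++)
  {
    if (!tokens[i].empty() && tokens[i][0] == L'*')
    {
      found.push_back(tokens[i].substr(1)); // drop the '*'
    }
  }
  return found;
}

// Hypothetical helper: succeeds only when each side has exactly one
// unknown, in which case the two words are taken as a candidate pair.
bool
candidate_pair(const std::vector<std::wstring> &left,
               const std::vector<std::wstring> &right,
               std::pair<std::wstring, std::wstring> &out)
{
  std::vector<std::wstring> l = unknowns(left);
  std::vector<std::wstring> r = unknowns(right);
  if (l.size() != 1 || r.size() != 1)
  {
    return false; // the "stupid" part: give up on anything else
  }
  out = std::make_pair(l[0], r[0]);
  return true;
}
```

Anything with zero or several unknowns on either side is simply skipped, which keeps the output small but fairly clean.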

Usage:
$ cat en.txt | apertium -d [path-to]/apertium-en-es/ en-es-tagger > en.tagged
$ cat es.txt | apertium -d [path-to]/apertium-en-es/ es-en-tagger > es.tagged
$ stupid-unknown-extractor en.tagged es.tagged

Sample output:

buckler	adarga	old<adj><sint> *buckler ,<cm> :: *adarga antiguo<adj><f><sg>
homespun	velludo	good<adj><sint><sup> *homespun .<sent> :: de<pr> *velludo para<pr>
gaunt	recia	*gaunt -<guio> :: complexión<n><f><sg> *recia ,<cm>
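Each output line is tab-separated: the left unknown, the right unknown, then the two contexts joined by ` :: `. If you want to post-process the candidates, something like the following might do. It's a rough sketch: `Candidate` and `parse_candidate` are names invented here, and it assumes no token in the context itself contains the ` :: ` separator.

```cpp
#include <string>

// Hypothetical record for one line of output.
struct Candidate
{
  std::wstring left;          // unknown word from the first stream
  std::wstring right;         // unknown word from the second stream
  std::wstring left_context;  // neighbouring tokens, first stream
  std::wstring right_context; // neighbouring tokens, second stream
};

// Hypothetical parser: returns false if the line does not have the
// "left<TAB>right<TAB>context :: context" shape.
bool
parse_candidate(const std::wstring &line, Candidate &c)
{
  const std::wstring sep = L" :: ";

  std::wstring::size_type t1 = line.find(L'\t');
  if (t1 == std::wstring::npos)
  {
    return false;
  }
  std::wstring::size_type t2 = line.find(L'\t', t1 + 1);
  if (t2 == std::wstring::npos)
  {
    return false;
  }
  std::wstring::size_type s = line.find(sep, t2 + 1);
  if (s == std::wstring::npos)
  {
    return false;
  }

  c.left = line.substr(0, t1);
  c.right = line.substr(t1 + 1, t2 - t1 - 1);
  c.left_context = line.substr(t2 + 1, s - t2 - 1);
  c.right_context = line.substr(s + sep.size());
  return true;
}
```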

Compile:

g++ stupid_unknown_extractor.cc -o stupid-unknown-extractor

Code:

#include <iostream>
#include <string>
#include <cstdio>
#include <cstdlib>
#include <clocale>
#include <cwchar>
#include <vector>

using namespace std;

//Set to true to also split at <cm>
bool split_cm = true;

// True if the token ends with the sentence-end tag
inline bool
is_sent(const wstring &in)
{
  return ((in.size() > 6) && (in.compare(in.size()-6, 6, L"<sent>") == 0));
}

// True if the token ends with the comma tag
inline bool
is_cm(const wstring &in)
{
  return ((in.size() > 4) && (in.compare(in.size()-4, 4, L"<cm>") == 0));
}

inline bool
is_split(const wstring &in)
{
  if (split_cm)
    return (is_sent(in) || is_cm(in));
  else
    return (is_sent(in));
}

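// Read the next ^...$ unit from the stream; returns L"" at end of input
wstring
read_word(FILE *input)
{
  wstring out = L"";
  wint_t c;
  bool inword = false;

  // Test for WEOF directly: feof() only becomes true after a failed read,
  // so looping on it would append the WEOF value to the word
  while ((c = fgetwc(input)) != WEOF)
  {
    if (!inword)
    {
      if (c == L'^')
      {
        inword = true;
      }
      else if (c == L'\\')
      {
        // skip the escaped character
        fgetwc(input);
      }
    }
    else
    {
      if (c == L'$')
      {
        return out;
      }
      out += static_cast<wchar_t>(c);
      if (c == L'\\')
      {
        // keep the escaped character along with the backslash
        wint_t next = fgetwc(input);
        if (next == WEOF)
        {
          break;
        }
        out += static_cast<wchar_t>(next);
      }
    }
  }

  return L"";
}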

void usage()
{
  wcout << L"usage: stupid-unknown-extractor file1 file2 [output]" << endl;
}

// Read tokens up to and including the next split token;
// returns false if the input ends first
bool
read_sentence (FILE* file, vector<wstring> &tokens)
{
  wstring word;
  while (!feof(file))
  {
    word = read_word(file);
    tokens.push_back(word);
    if (is_split(word))
    {
      return true;
    }
  }
  return false;
}

vector<int>
unknown_indices(const vector<wstring> &sentence)
{
  vector<int> index;

  for (size_t i = 0; i < sentence.size(); i++)
  {
    // unknown words are marked with a leading '*'
    if (!sentence[i].empty() && sentence[i][0] == L'*')
    {
      index.push_back(static_cast<int>(i));
    }
  }
  return index;
}

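// Print the token at index with its immediate neighbours
void
print_context(FILE* out, vector<wstring> &sent, int index)
{
  if (index >= 1)
  {
    fputws(sent[index - 1].c_str(), out);
    fputwc(L' ', out);
  }
  fputws(sent[index].c_str(), out);
  // only print the following token if there is one
  if (index + 1 < static_cast<int>(sent.size()))
  {
    fputwc(L' ', out);
    fputws(sent[index + 1].c_str(), out);
  }
}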

// Print the unknown word without its leading '*'
void
print_unk(FILE* out, vector<wstring> &sent, int index)
{
  if (sent.at(index)[0] != L'*')
  {
    wcerr << L"Error with unknown: " << sent.at(index) << endl;
  }
  fputws(sent.at(index).substr(1).c_str(), out);
}

void
try_output(FILE* out, vector<wstring> &left, vector<wstring> &right)
{
  vector<int> lvec;
  vector<int> rvec;
  lvec = unknown_indices(left);
  rvec = unknown_indices(right);

  // only output when each side has exactly one unknown
  if ((lvec.size() == 1) && (rvec.size() == 1))
  {
    print_unk(out, left, lvec.at(0));
    fputwc(L'\t', out);
    print_unk(out, right, rvec.at(0));
    fputwc(L'\t', out);

    // context
    print_context(out, left, lvec.at(0));
    fputws(L" :: ", out);
    print_context(out, right, rvec.at(0));
    fputwc(L'\n', out);
  }
}

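int main (int argc, char** argv)
{
  FILE* left;
  FILE* right;
  FILE* out;

  // Use the environment's locale (e.g. UTF-8) so that the wide-character
  // I/O functions can decode and encode non-ASCII text
  setlocale(LC_ALL, "");

  if (argc < 3 || argc > 4)
  {
    usage();
    exit(1);
  }
  if (argc == 3)
  {
    out = stdout;
  }
  else
  {
    out = fopen(argv[3], "wb");
  }

  left = fopen(argv[1], "rb");
  right = fopen(argv[2], "rb");

  if (left == NULL || right == NULL || out == NULL)
  {
    wcerr << L"Error: could not open one of the files" << endl;
    exit(1);
  }

  vector<wstring> sentl;
  vector<wstring> sentr;

  while (read_sentence(left, sentl) && read_sentence(right, sentr))
  {
    try_output(out, sentl, sentr);
    sentl.clear();
    sentr.clear();
  }

  fclose(left);
  fclose(right);
  fclose(out);
  return 0;
}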

Saturday, May 7, 2011

Bitext in my pocket

Smoking is a bad habit, blah blah blah, but being involved in MT gives me a convenient excuse: parallel text.


Thursday, May 5, 2011

The missing slide/The first Apertium nursery rhyme

In my first talk on day one of the workshop, I forgot the slide that explained the most important part of the dictionaries. I promised to include it the next day (but day one was so disastrous, and I presented everything so badly, that instead I just worked through the notes again, explaining things slowly and adding the new material at the end).


So, we had a meeting to discuss how things went (so badly wrong), when I decided not to use a presentation at all, and towards the end, someone said "it's getting late, we should go to sleep". I said I was too wired to sleep, and Juan Antonio offered to read me a nursery rhyme to help me sleep.



A few minutes later, I went outside to smoke, and thought... hey, I can explain this as a nursery rhyme... so I thought of something quickly, then looked for a cutesy template to use for the slide, and put it on my phone to show the others. They liked the rhyme, but not the picture -- thought it would be insultingly childish. They were right, so I skipped it.



After the workshop, we had the social evening (Chinese buffet), and everyone was talking, drinking, laughing, so I thought I'd show it to a few of the participants. They all loved it. Most of the women had congregated at one table, so I showed them all at the same time. Gema's assessment later was something like "with that slide, Jim chatted up every woman at the table. Including me!".

Day 3 went pretty smoothly (we'd made enough changes for day 2 to recover from day 1), but Tomas decided to rant about XML again (he had launched a ~30 minute discussion about it on day 2; he clearly dislikes XML). Felipe asked me if I had the text. I told him I'd put it on the wiki, and he asked me to put it on screen. Some of the women who had been at the restaurant spotted it and laughed, but most eyes were on Tomas, so I took the first chance to interrupt him and point at the screen as an example that "we don't think it's too hard to remember".

Then Felipe called on me to read it. Bastard. So I did, and, thankfully, everyone laughed. And applauded, which was odd.

So, that's the story of the first (and hopefully last) Apertium nursery rhyme.