I click 'view source' on this page and I see this:
which, in my hallucinations, I imagine should contain this:
But the W3C's RDFa Distiller tells me it actually contains this:
(because everyone needs to know what CSS files a page is using, right?)
and URIBurner spits back this shit:
What
the
fuck?
Thursday, March 1, 2012
Wednesday, November 2, 2011
I tried out DBpedia Spotlight with part of the introduction to "With Fire and Sword":
Now let us see what this western history was. In the middle of the ninth century Slav tribes of various denominations occupied the entire Baltic coast west of the Vistula; a line drawn from Lubeck to the Elbe, ascending the river to Magdeburg, thence to the western ridge of the Bohemian mountains, and passing on in a somewhat irregular course, leaving Carinthia and Styria on the east, gives the boundary between the Germans and the Slavs at that period. Very nearly in the centre of the territory north of Bohemia and the Carpathians lived one of a number of Slav tribes, the Polyane (or men of the plain), who occupied the region afterwards called Great Poland by the Poles, and now called South Prussia by the Germans. In this Great Poland political life among the Northwestern Slavs began in the second half of the ninth century. About the middle of the tenth, Mechislav (Mieczislaw), the ruler, received Christianity, and the modest title of Count of the German Empire. Boleslav the Brave, his son and successor, extended his territory to the upper Elbe, from which region its boundary line passed through or near Berlin, whence it followed the Oder to the sea. Before his death, in 1025, Boleslav wished to be anointed king by the Pope. The ceremony was denied him, therefore he had it performed by bishops at home. About a century later the western boundary was pushed forward by Boleslav Wry-mouth (1132-1139) to a point on the Baltic about half-way between Stettin and Lubeck. This was the greatest extension of Poland to the west. Between this line and the Elbe were Slav tribes; but the region had already become marken (marches) where the intrusive Germans were struggling for the lands and persons of the Slavs.
Corrected:
Now let us see what this western history was. In the middle of the ninth century Slav tribes of various denominations occupied the entire Baltic coast west of the Vistula; a line drawn from Lubeck to the Elbe, ascending the river to Magdeburg, thence to the western ridge of the Bohemian mountains, and passing on in a somewhat irregular course, leaving Carinthia and Styria on the east, gives the boundary between the Germans and the Slavs at that period. Very nearly in the centre of the territory north of Bohemia and the Carpathians lived one of a number of Slav tribes, the Polyane (or men of the plain), who occupied the region afterwards called Great Poland by the Poles, and now called South Prussia by the Germans. In this Great Poland political life among the Northwestern Slavs began in the second half of the ninth century. About the middle of the tenth, Mechislav (Mieczislaw), the ruler, received Christianity, and the modest title of Count of the German Empire. Boleslav the Brave, his son and successor, extended his territory to the upper Elbe, from which region its boundary line passed through or near Berlin, whence it followed the Oder to the sea. Before his death, in 1025, Boleslav wished to be anointed king by the Pope. The ceremony was denied him, therefore he had it performed by bishops at home. About a century later the western boundary was pushed forward by Boleslav Wry-mouth (1132-1139) to a point on the Baltic about half-way between Stettin and Lubeck. This was the greatest extension of Poland to the west. Between this line and the Elbe were Slav tribes; but the region had already become marken (marches) where the intrusive Germans were struggling for the lands and persons of the Slavs.
Friday, August 26, 2011
stupid-unknown-extractor
This is a small utility for extracting unknowns from a pair of tagged streams. It's based on the idea that, if each sentence in a pair has a single unknown word, then those words are likely to be translations of each other. It's stupid because it does nothing unless each side has exactly one unknown.
$ cat en.txt | apertium -d [path-to]/apertium-en-es/ en-es-tagger > en.tagged
$ cat es.txt | apertium -d [path-to]/apertium-en-es/ es-en-tagger > es.tagged
$ stupid-unknown-extractor en.tagged es.tagged
Sample output:
buckler adarga old<adj><sint> *buckler ,<cm> :: *adarga antiguo<adj><f><sg>
homespun velludo good<adj><sint><sup> *homespun .<sent> :: de<pr> *velludo para<pr>
gaunt recia *gaunt -<guio> :: complexión<n><f><sg> *recia ,<cm>
Compile:
g++ stupid_unknown_extractor.cc -o stupid-unknown-extractor
Code:
#include <clocale>
#include <cstdio>
#include <cstdlib>
#include <cwchar>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

//Set to true to also split at <cm>
bool split_cm = true;

// True if the token ends with the end-of-sentence tag
inline bool
is_sent(const wstring &in)
{
  return ((in.size() >= 6) && (in.compare(in.size()-6, 6, L"<sent>") == 0));
}

// True if the token ends with the comma tag
inline bool
is_cm(const wstring &in)
{
  return ((in.size() >= 4) && (in.compare(in.size()-4, 4, L"<cm>") == 0));
}

inline bool
is_split (const wstring &in)
{
  if (split_cm)
    return (is_sent(in) || is_cm(in));
  else
    return (is_sent(in));
}

// Read the next ^...$ unit from the stream, without the delimiters.
// Returns an empty string at end of input.
wstring
read_word(FILE *input)
{
  wstring out = L"";
  wint_t next;
  bool inword = false;

  // Test the value returned by fgetwc rather than feof(), so that the
  // end-of-file marker is never appended to a word
  while ((next = fgetwc(input)) != WEOF)
  {
    wchar_t c = static_cast<wchar_t>(next);
    if (!inword)
    {
      if (c == L'^')
      {
        inword = true;
      }
      else if (c == L'\\')
      {
        // Skip the escaped character, so an escaped ^ does not open a word
        fgetwc(input);
      }
    }
    else
    {
      if (c == L'$')
      {
        return out;
      }
      if (c == L'\\')
      {
        // Keep escape sequences intact
        out += L'\\';
        next = fgetwc(input);
        if (next == WEOF)
        {
          break;
        }
        out += static_cast<wchar_t>(next);
      }
      else
      {
        out += c;
      }
    }
  }
  return L"";
}

void usage()
{
  wcout << L"usage: stupid-unknown-extractor file1 file2 [output]" << endl;
}

// Read tokens up to and including the next split point.
// Returns false if the input ends (or fails) before one is found.
bool
read_sentence (FILE* file, vector<wstring> &tokens)
{
  wstring word;
  while (!feof(file) && !ferror(file))
  {
    word = read_word(file);
    if (word.empty())
    {
      // Nothing was read: do not store an empty token
      continue;
    }
    tokens.push_back(word);
    if (is_split(word))
    {
      return true;
    }
  }
  return false;
}

// Positions of the tokens the tagger marked as unknown (leading *)
vector<int>
unknown_indices(const vector<wstring> &sentence)
{
  vector<int> index;
  vector<wstring>::const_iterator it;
  int count = 0;
  for (it = sentence.begin(); it != sentence.end(); ++it)
  {
    if (!it->empty() && (*it)[0] == L'*')
    {
      index.push_back(count);
    }
    count++;
  }
  return index;
}

// Print the unknown with one token of context on each side, where present
void
print_context(FILE* out, vector<wstring> &sent, int index)
{
  if (index >= 1)
  {
    fputws(sent[index - 1].c_str(), out);
    fputwc(L' ', out);
  }
  fputws(sent[index].c_str(), out);
  if (index + 1 < static_cast<int>(sent.size()))
  {
    fputwc(L' ', out);
    fputws(sent[index + 1].c_str(), out);
  }
}

// Print the unknown word without its leading *
void
print_unk(FILE* out, vector<wstring> &sent, int index)
{
  if (sent.at(index)[0] != L'*')
  {
    wcerr << L"Error with unknown: " << sent.at(index) << endl;
  }
  fputws(sent.at(index).substr(1).c_str(), out);
}

void
try_output(FILE* out, vector<wstring> &left, vector<wstring> &right)
{
  vector<int> lvec;
  vector<int> rvec;
  lvec = unknown_indices(left);
  rvec = unknown_indices(right);
  // Only emit a pair when each side has exactly one unknown
  if ((lvec.size() == 1) && (rvec.size() == 1))
  {
    print_unk(out, left, lvec.at(0));
    fputwc(L'\t', out);
    print_unk(out, right, rvec.at(0));
    fputwc(L'\t', out);
    // context
    print_context(out, left, lvec.at(0));
    fputws(L" :: ", out);
    print_context(out, right, rvec.at(0));
    fputwc(L'\n', out);
  }
}

int main (int argc, char** argv)
{
  FILE* left;
  FILE* right;
  FILE* out;

  // Use the environment's locale so that the wide-character functions can
  // decode the input (this assumes the locale matches the files' encoding)
  setlocale(LC_ALL, "");

  if (argc < 3 || argc > 4)
  {
    usage();
    exit(1);
  }
  if (argc == 3)
  {
    out = stdout;
  }
  else
  {
    out = fopen(argv[3], "wb");
  }
  left = fopen(argv[1], "rb");
  right = fopen(argv[2], "rb");
  if (left == NULL || right == NULL || out == NULL)
  {
    wcerr << L"Error: could not open one of the files" << endl;
    exit(1);
  }

  vector<wstring> sentl;
  vector<wstring> sentr;
  while (read_sentence(left, sentl) && read_sentence(right, sentr))
  {
    try_output(out, sentl, sentr);
    sentl.clear();
    sentr.clear();
  }
  fclose(left);
  fclose(right);
  fclose(out);
  exit(0);
}
Tuesday, June 28, 2011
Saturday, May 7, 2011
Bitext in my pocket
Smoking is a bad habit blah blah blah but being involved in MT gives me a convenient excuse - parallel text

Thursday, May 5, 2011
The missing slide/The first Apertium nursery rhyme
In my first talk on day one of the workshop, I forgot the slide that explained the most important part of the dictionaries. I promised to include it the next day (but day one was so disastrous, and I presented everything so badly, that instead I just worked through the notes again, explaining things slowly and adding the new material at the end).
So we had a meeting to discuss how things had gone (so badly wrong), during which I decided not to use a presentation at all. Towards the end, someone said "it's getting late, we should go to sleep". I said I was too wired to sleep, and Juan Antonio offered to read me a nursery rhyme to help me sleep.
A few minutes later, I went outside to smoke, and thought... hey, I can explain this as a nursery rhyme... so I thought of something quickly, then looked for a cutesy template to use for the slide, and put it on my phone to show the others. They liked the rhyme, but not the picture -- thought it would be insultingly childish. They were right, so I skipped it.

After the workshop, we had the social evening (Chinese buffet), and everyone was talking, drinking, laughing, so I thought I'd show it to a few of the participants. They all loved it. Most of the women had congregated at one table, so I showed them all at the same time. Gema's assessment later was something like "with that slide, Jim chatted up every woman at the table. Including me!".
Day 3 went pretty smoothly (we'd made enough changes for day 2 to recover from day 1), but Tomas decided to rant about XML again (he had launched a ~30 minute discussion about it on day 2; he clearly dislikes XML). Felipe asked me if I had the text; I told him I'd put it on the wiki, and he asked me to put it on screen. Some of the women who had been at the restaurant spotted it and laughed, but most eyes were on Tomas, so I took the first chance to interrupt him and point at the screen as an example that "we don't think it's too hard to remember".
Then Felipe called on me to read it. Bastard. So I did, and, thankfully, everyone laughed. And applauded, which was odd.
So, that's the story of the first (and hopefully last) Apertium nursery rhyme.
Wednesday, August 18, 2010
How Google Translate turns millions into a billion
I was using some stories from the Spanish Wikinews to see how Apertium's Spanish to English translation compares against other MT systems, and noticed something a little entertaining in this article with Google Translate:
El complejo industrial es uno de los cuatro más grandes de China, y se estima que tiene reservas de 19 millones de barriles.
became:
The industrial complex is one of the four largest in China, and is estimated to have reserves of 19 billion barrels.
Although I work with rule-based MT, I also use SMT tools, so let's see if I can explain this.
The really high-level overview of statistical MT is that it uses existing translations to provide new ones: by looking at thousands of sentences that are (assumed to be - important point!) translations of each other, software (usually GIZA++) slowly determines which words are probable translations of each other.
One common technique to make the bilingual text more general is to replace all numbers with a pseudo-word, like __NUMBER__, in the training text, and to pass them through unchanged at run time; that is, to assume that numbers don't need to be translated.
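To make the idea concrete, here is a minimal sketch of that kind of number masking. It is only an illustration: the function name mask_numbers and the rule for what counts as a number (digits, with '.' or ',' allowed between digits) are my own assumptions, not what any particular SMT toolkit does.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: replace every number in the text with a pseudo-word.
// A "number" here is a run of digits, optionally containing '.' or ','
// when the separator is directly followed by another digit.
std::string mask_numbers(const std::string &text,
                         const std::string &placeholder = "__NUMBER__")
{
  std::string out;
  std::size_t i = 0;
  while (i < text.size())
  {
    if (std::isdigit(static_cast<unsigned char>(text[i])))
    {
      // Consume the whole number
      while (i < text.size())
      {
        unsigned char c = static_cast<unsigned char>(text[i]);
        bool separator = (c == '.' || c == ',') && i + 1 < text.size() &&
                         std::isdigit(static_cast<unsigned char>(text[i + 1]));
        if (std::isdigit(c) || separator)
        {
          ++i;
        }
        else
        {
          break;
        }
      }
      out += placeholder;
    }
    else
    {
      out += text[i++];
    }
  }
  return out;
}
```

With this, both '2.000 millones' and '2 billion' are reduced to the pseudo-word plus a scale word, so the aligner only ever sees 'millones' lining up with 'billion'.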
However, that's not always true, because of the long and short scales. In English-speaking countries, we generally write '2 billion', whereas in Spanish-speaking countries, they write '2.000 millones'... well, I think it should now be clear what happened.
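As a rough sketch of what treating the number and the scale word together looks like, the toy functions below read a Spanish-formatted count of 'millones' and express it in English short-scale terms. The names parse_spanish_integer and millones_to_english are made up for this example, and this is not how Apertium (or any other system) actually implements it.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: parse an integer written with '.' as the
// thousands separator, e.g. "2.000" -> 2000.
long long parse_spanish_integer(const std::string &s)
{
  long long value = 0;
  for (char c : s)
  {
    if (c >= '0' && c <= '9')
    {
      value = value * 10 + (c - '0');
    }
    else if (c != '.')
    {
      throw std::invalid_argument("unexpected character in number");
    }
  }
  return value;
}

// Hypothetical helper: express a count of "millones" (10^6) in English.
// 1.000 millones = 10^9 = one short-scale billion; anything that is not
// a whole number of billions is simply left in millions.
std::string millones_to_english(long long millones)
{
  if (millones >= 1000 && millones % 1000 == 0)
  {
    return std::to_string(millones / 1000) + " billion";
  }
  return std::to_string(millones) + " million";
}
```

So millones_to_english(parse_spanish_integer("2.000")) gives "2 billion", while 19 millones stays "19 million": the number and the scale word have to change together, which is exactly what passing numbers through unchanged cannot do.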
This is why I like RBMT. We can't approach SMT in terms of (so-called) fluency, but at least we can avoid this sort of stupid inaccuracy.