Wednesday, August 18, 2010

How Google Translate turns millions into a billion

I was using some stories from the Spanish Wikinews to see how Apertium's Spanish-to-English translation compares with other MT systems, and noticed something a little entertaining in what Google Translate made of this article:

El complejo industrial es uno de los cuatro más grandes de China, y se estima que tiene reservas de 19 millones de barriles.


The industrial complex is one of the four largest in China, and is estimated to have reserves of 19 billion barrels.

Although I work with rule-based MT, I also use SMT tools, so let's see if I can explain this.

The really high-level overview of statistical MT is that it uses existing translations to produce new ones: by looking at thousands of sentences that are translations of each other (or are assumed to be, which is an important point!), software (usually GIZA++) gradually works out which words are probable translations of each other.
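To make "gradually works out which words are probable translations" a bit more concrete, here is a minimal sketch of IBM Model 1, the simplest of the word-alignment models that GIZA++ implements. It is a toy, not what any production system runs: the three-sentence corpus is invented, the function name `train_ibm_model1` is my own, and the usual NULL word is left out for brevity.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(e, f)] = P(English word e | foreign word f) with EM."""
    english_vocab = {e for _, english in pairs for e in english}
    # Start uniform over every word pair that co-occurs in some sentence pair.
    t = {}
    for foreign, english in pairs:
        for e in english:
            for f in foreign:
                t[(e, f)] = 1.0 / len(english_vocab)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of e being aligned to f
        total = defaultdict(float)  # expected count of f being aligned to anything
        for foreign, english in pairs:
            for e in english:
                # Share each English word among the foreign words in its sentence,
                # in proportion to the current translation probabilities.
                norm = sum(t[(e, f)] for f in foreign)
                for f in foreign:
                    share = t[(e, f)] / norm
                    count[(e, f)] += share
                    total[f] += share
        # Re-estimate the probabilities from the expected counts.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t

# Invented toy corpus, tokenised on whitespace.
corpus = [
    ("la casa".split(), "the house".split()),
    ("la mesa".split(), "the table".split()),
    ("una mesa".split(), "a table".split()),
]
t = train_ibm_model1(corpus)
```

After a few iterations, `t[("house", "casa")]` ends up higher than `t[("the", "casa")]`, even though 'casa' only ever appears next to both words: 'the' is better explained by 'la', so 'house' is what is left over for 'casa'. The model only knows what co-occurs in the training text, which is why the "assumed to be translations" caveat matters.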

One common technique to make the bilingual text more general is to replace all numbers with a pseudo-word, like __NUMBER__, in the training text, and to pass them through unchanged at run time; that is, to assume that numbers don't need to be translated.
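As a rough sketch of what that preprocessing might look like (the regular expression and the function names `mask_numbers` and `unmask_numbers` are my own illustration, not taken from any particular toolkit):

```python
import re

# Matches plain integers and numbers with '.' or ',' separators: 19, 2.000, 3,5
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
PLACEHOLDER = "__NUMBER__"

def mask_numbers(sentence):
    """Replace every number with the placeholder.

    Returns the masked sentence and the original numbers, in order.
    """
    numbers = NUMBER_RE.findall(sentence)
    masked = NUMBER_RE.sub(PLACEHOLDER, sentence)
    return masked, numbers

def unmask_numbers(translated, numbers):
    """Put the original numbers back, left to right, completely unchanged."""
    for number in numbers:
        translated = translated.replace(PLACEHOLDER, number, 1)
    return translated

masked, numbers = mask_numbers("reservas de 19 millones de barriles")
# masked is now "reservas de __NUMBER__ millones de barriles"
```

Note that `unmask_numbers` assumes the numbers come out in the same order they went in, and that the digits never need to change. The second assumption is the one that matters here.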

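However, that's not always true, because of the long and short scales. In English-speaking countries, we generally write '2 billion', whereas in Spanish-speaking countries, they write '2.000 millones'... well, I think it should now be clear what happened: once the numbers are replaced, the training text contains plenty of '__NUMBER__ millones' lined up with '__NUMBER__ billion', so 'billion' presumably becomes a probable translation of 'millones', and because the number itself is passed through unchanged, '19 millones' comes out as '19 billion'.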
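The arithmetic can be spelled out in a few lines. This is only a sketch under my own assumptions: the helper names are invented, it handles just the '<number> millones' pattern, and it treats '.' as a Spanish thousands separator.

```python
import re

# A number with optional Spanish-style thousands separators: 19, 2.000, 19.000
SPANISH_NUMBER = r"\d+(?:\.\d{3})*"

def mask(phrase):
    """Replace the number with the placeholder, as in SMT preprocessing."""
    return re.sub(SPANISH_NUMBER, "__NUMBER__", phrase)

def millones_to_english(phrase):
    """Translate '<number> millones' using the actual value, not just the word."""
    match = re.fullmatch(rf"({SPANISH_NUMBER}) millones", phrase)
    if match is None:
        raise ValueError(f"not a '<number> millones' phrase: {phrase!r}")
    value = int(match.group(1).replace(".", "")) * 10**6
    # Use 'billion' only when the value is a whole number of short-scale billions.
    if value >= 10**9 and value % 10**9 == 0:
        return f"{value // 10**9} billion"
    return f"{value // 10**6:,} million"

# After masking, these two phrases are indistinguishable...
same_after_masking = mask("2.000 millones") == mask("19 millones")
# ...but their correct translations use different scale words.
big = millones_to_english("2.000 millones")
small = millones_to_english("19 millones")
```

Here `big` is '2 billion' and `small` is '19 million', while both inputs mask to '__NUMBER__ millones'. Choosing the right scale word needs the value of the number, which is exactly the information the placeholder throws away.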

This is why I like RBMT. We can't approach SMT in terms of (so-called) fluency, but at least we can avoid this sort of stupid inaccuracy.