Wednesday, August 18, 2010

How Google Translate turns millions into a billion

I was using some stories from the Spanish Wikinews to see how Apertium's Spanish to English translation compares against other MT systems, and noticed something a little entertaining in this article with Google Translate:

El complejo industrial es uno de los cuatro más grandes de China, y se estima que tiene reservas de 19 millones de barriles.


became:

The industrial complex is one of the four largest in China, and is estimated to have reserves of 19 billion barrels.


Although I work with rule-based MT, I also use statistical MT (SMT) tools, so let's see if I can explain this.

The really high-level overview of statistical MT is that it uses existing translations to provide new ones: by looking at thousands of sentence pairs that are (assumed to be - important point!) translations of each other, software (usually GIZA++) gradually works out which words are probable translations of which.

One common technique to make the bilingual text more general is to replace all numbers with a pseudo-word, like __NUMBER__, in the training text, and to pass them through unchanged at run time; that is, to assume that numbers don't need to be translated.
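To make that concrete, here is a minimal sketch of the idea in Python. It is only an illustration: the function names and the regular expression are my own assumptions, not what any particular SMT toolkit (or Google) actually does.

```python
import re

# Placeholder token as named in the post; the regex is an assumption:
# digits, optionally with '.' or ',' separators inside (e.g. 2.000 or 19,5).
PLACEHOLDER = "__NUMBER__"
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def mask_numbers(sentence):
    """Replace each number with the placeholder.

    Returns the masked sentence and the list of numbers that were removed.
    """
    numbers = NUMBER_RE.findall(sentence)
    return NUMBER_RE.sub(PLACEHOLDER, sentence), numbers


def unmask_numbers(sentence, numbers):
    """Put the saved numbers back, left to right, without changing them."""
    remaining = iter(numbers)
    # If there are more placeholders than numbers, leave the extras as they are.
    return re.sub(
        re.escape(PLACEHOLDER),
        lambda match: next(remaining, PLACEHOLDER),
        sentence,
    )


masked, saved = mask_numbers("reservas de 19 millones de barriles")
restored = unmask_numbers("reserves of __NUMBER__ billion barrels", saved)
```

Note that the number itself survives the round trip untouched; whatever the system does to the words around it is a separate matter.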

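However, that's not always true, because of the long and short scales. In English-speaking countries, we generally write '2 billion', whereas in Spanish-speaking countries, they write '2.000 millones'. Once the numbers are replaced, the training text pairs '__NUMBER__ millones' with '__NUMBER__ billion', so 'millones' presumably ends up looking like a probable translation of 'billion'... well, I think it should now be clear what happened.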
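A toy illustration of how this could go wrong, again in Python. The three sentence pairs are invented for the example, and simple co-occurrence counting is a crude stand-in for real word alignment (it is not how GIZA++ works); the point is only that the difference between '2.000' and '2' disappears once both become the placeholder.

```python
import re
from collections import Counter

# Same assumed number pattern as a typical masking step would use.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def mask(sentence):
    """Replace every number with the placeholder token."""
    return NUMBER_RE.sub("__NUMBER__", sentence)


# Invented toy corpus: two pairs use the '2.000 millones' = '2 billion' pattern.
corpus = [
    ("2.000 millones de barriles", "2 billion barrels"),
    ("5.000 millones de euros", "5 billion euros"),
    ("19 millones de barriles", "19 million barrels"),
]


def cooccurrences(pairs, source_word):
    """Count target-side words in masked pairs whose source contains source_word."""
    counts = Counter()
    for source, target in pairs:
        if source_word in mask(source).split():
            counts.update(mask(target).split())
    return counts


counts = cooccurrences(corpus, "millones")
print(counts["billion"], counts["million"])
```

In this made-up corpus 'millones' turns up next to 'billion' more often than next to 'million', and nothing in the masked text says that the number should change as well.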

This is why I like rule-based MT (RBMT). We can't approach SMT in terms of (so-called) fluency, but at least we can avoid this sort of stupid inaccuracy.