The web is full of pages in a multitude of languages. Some of these pages happen to have the answer that a user of the Yandex search engine is looking for. To let users read an appliance manual or a news story originally published in a language they don’t understand, Yandex has offered a web page translation option in its search results since 2009, based on translation technology provided by PROMT.
Early in 2011, Yandex implemented a proprietary machine translation technology. Currently, the system can translate any type of text from English or Ukrainian into Russian and from Russian into either of these languages.
Yandex’s machine translation is based on statistical regularities rather than on sets of rules. A statistical system need not be aware of the rules of a natural language at all. For such a system, to ‘learn’ a language means to compare hundreds of thousands of parallel texts: originals and their translations. These could be, for example, the different language versions of the same website. Initially, the system scans the internet for parallel texts using web page addresses that differ only in a language-marking segment, such as «en» or «us» for the English version and «ru» for the Russian one.
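The address-based search for parallel pages can be illustrated with a minimal sketch. It assumes that the language marker is a whole path segment and that the marker list contains only the three values mentioned above; the function names and the example.com addresses are hypothetical, and the real crawler is certainly more elaborate.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed marker list: the text mentions «en» and «us» for English, «ru» for Russian.
LANG_MARKERS = {"en": "en", "us": "en", "ru": "ru"}

def url_key(url):
    """Return (language, address with the language segment masked), or None."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    for i, segment in enumerate(segments):
        lang = LANG_MARKERS.get(segment.lower())
        if lang:
            masked = segments[:i] + ["*"] + segments[i + 1:]
            return lang, parts.netloc + "/".join(masked)
    return None

def candidate_pairs(urls):
    """Group addresses that differ only in the language segment."""
    groups = defaultdict(dict)
    for url in urls:
        key = url_key(url)
        if key:
            lang, masked = key
            groups[masked][lang] = url
    return [(g["en"], g["ru"]) for g in groups.values() if "en" in g and "ru" in g]

urls = [
    "http://example.com/en/manual.html",
    "http://example.com/ru/manual.html",
    "http://example.com/ru/news.html",  # no English counterpart
]
print(candidate_pairs(urls))
# [('http://example.com/en/manual.html', 'http://example.com/ru/manual.html')]
```

Pages paired this way are only candidates; their content still has to be checked before they are treated as parallel texts.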
To identify texts as parallel, the system builds a list of unique characteristics for each new pair of texts it ‘learns’. These characteristics include rare words, numbers and special characters used in a specific sequence. Each new document is compared against the existing set of characteristics created from the previously ‘learnt’ texts.
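As a rough illustration of why such characteristics are useful, the sketch below extracts only numbers and a few special characters, in order of appearance, and compares two texts by that sequence. The feature set, the comma normalization and the use of difflib are assumptions made for the example; the text does not say how Yandex actually builds or compares these characteristics.

```python
import re
from difflib import SequenceMatcher

def fingerprint(text):
    """Numbers and selected special characters, in order of appearance."""
    tokens = re.findall(r"\d+(?:[.,]\d+)*|[%$€№@#&]", text)
    # Treat decimal commas and decimal points alike (3,5 and 3.5).
    return [token.replace(",", ".") for token in tokens]

def similarity(text_a, text_b):
    """Similarity of the two fingerprints, from 0.0 to 1.0."""
    return SequenceMatcher(None, fingerprint(text_a), fingerprint(text_b)).ratio()

english = "Model X200 weighs 3.5 kg and costs $199."
russian = "Модель X200 весит 3,5 кг и стоит $199."
unrelated = "Вчера в городе прошёл дождь, 15 мая."

print(similarity(english, russian) > similarity(english, unrelated))  # True
```

Features like these survive translation largely unchanged, which is what makes them usable for matching texts in different languages.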
The current quality standards for machine translation require a system to process hundreds of millions of phrases in many different languages. Because translation is a resource-intensive process that demands a lot of disk space or a large amount of RAM, machine translation systems of this kind are few and far between.
Yandex’s machine translation system has three key components: a translation model, a language model and a decoder.
The translation model is a list of all the words and phrases known to the system in one language, together with all possible translations of each word or phrase into another language and a probability value for each translation. There is a translation model for each pair of languages the system can process. To create a translation model, the system first finds parallel texts, then finds pairs of matching phrases within these texts, and only then finds pairs of matching words or word combinations.
To build a translation model for a pair of languages, say, Russian and English, the system analyzes pairs of phrases in both of these languages:
«London stands on the river Thames» — «Лондон стоит на берегу реки Темзы»
«Crossing the river by the Tower Bridge you can see the Tower of London» — «Пересекая реку по Тауэрскому мосту, можно увидеть Тауэр»
When the system has seen only the first pair, it doesn’t have enough information to find statistical patterns. At this point the word stands, river or on is as good a candidate as the word London for translating Лондон into English.
But the words river and река, which appear in a different context in the second pair of phrases, become more probable equivalents of each other. Now the system knows that, at least, river is a better translation for река than it is for Лондон.
The system constantly performs this compare-and-match process on millions of phrases in hundreds of thousands of texts.
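The effect described above can be reproduced with IBM Model 1, a standard textbook word-alignment method. This is a swapped-in illustration, not a description of Yandex’s own algorithm. The toy corpus is an assumption: it uses short, hand-normalized phrases so that река appears in the same form in both pairs.

```python
from collections import defaultdict

def train_model1(pairs, iterations=5):
    """Estimate t[(e, f)]: probability of English word e given Russian word f."""
    english_vocab = {e for english, _ in pairs for e in english}
    t = defaultdict(lambda: 1.0 / len(english_vocab))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for english, russian in pairs:
            for e in english:
                # Share one unit of evidence for e among the Russian words.
                z = sum(t[(e, f)] for f in russian)
                for f in russian:
                    share = t[(e, f)] / z
                    count[(e, f)] += share
                    total[f] += share
        t = defaultdict(float, {(e, f): c / total[f] for (e, f), c in count.items()})
    return t

# Hypothetical, hand-normalized toy corpus.
first_pair = ("london stands on the river".split(), "лондон стоит на река".split())
second_pair = ("the river is wide".split(), "река широкая".split())

t_one = train_model1([first_pair])
print(t_one[("river", "лондон")], t_one[("london", "лондон")])  # equal: no pattern yet

t_two = train_model1([first_pair, second_pair])
print(t_two[("river", "река")] > t_two[("river", "лондон")])  # True
```

With a single pair every English word remains an equally likely match for Лондон. Once the second pair is added, river becomes a better match for река than for Лондон, which is the shift the text describes.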
The system compares not only single words, but also sequences of two, three, four or five words. The translation model for each pair of languages processed by Yandex’s machine translation system contains over a billion pairs of words and phrases.
To create a language model, another component of Yandex’s machine translation technology, the system scans hundreds of thousands of texts in a language and lists all the words and word combinations it finds in them, together with their frequencies. This is the system’s knowledge of the target language.
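A frequency list of this kind can be sketched as a simple n-gram counter. The whitespace tokenization, the maximum length of five words and the two-sentence corpus are assumptions made for the example.

```python
from collections import Counter

def ngram_counts(sentences, max_n=5):
    """Count every word sequence of length 1..max_n in the given sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

corpus = ["быть или не быть", "не быть"]  # hypothetical miniature corpus
counts = ngram_counts(corpus)
print(counts[("быть",)])        # 3
print(counts[("не", "быть")])   # 2
```

A sequence that never occurs in the corpus simply has a count of zero, which is how the model signals an unlikely word combination.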
The actual translation is done by the decoder. For every phrase in the source text, it finds potentially matching phrases or their combinations in the translation model and ranks these translations by probability. It may happen that for the English phrase ‘to be or not to be’, the Russian match with the highest probability is быть или не бывает (to be or is not), while быть или не быть (to be or not to be) is only second best. The system then evaluates all the candidate translations by how frequently they occur in its language model. In this case, the language model will clearly show that быть или не быть (to be or not to be) has a higher frequency than быть или не бывает (to be or is not). Finally, the decoder chooses the version with the best combination of probability, taken from the translation model, and frequency, taken from the language model.
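The final choice can be sketched as a weighted sum of two log scores. The log-linear combination, the add-one smoothing, the lm_weight parameter and all the numbers below are assumptions for illustration; the text only says that the decoder combines translation-model probability with language-model frequency.

```python
import math

def best_translation(candidates, lm_counts, lm_weight=1.0):
    """Pick the candidate with the best combined score.

    candidates: list of (phrase, translation-model probability)
    lm_counts:  phrase -> frequency in the language model
    """
    total = sum(lm_counts.values())

    def score(candidate):
        phrase, tm_probability = candidate
        # Add-one smoothing keeps unseen phrases from producing log(0).
        lm_probability = (lm_counts.get(phrase, 0) + 1) / (total + len(lm_counts))
        return math.log(tm_probability) + lm_weight * math.log(lm_probability)

    return max(candidates, key=score)[0]

# Hypothetical values: the translation model slightly prefers the wrong phrase.
candidates = [("быть или не бывает", 0.40), ("быть или не быть", 0.35)]
lm_counts = {"быть или не быть": 900, "быть или не бывает": 2}

print(best_translation(candidates, lm_counts))               # быть или не быть
print(best_translation(candidates, lm_counts, lm_weight=0))  # быть или не бывает
```

Setting lm_weight to zero shows what happens without the language model: the phrase with the highest translation probability wins even though it is not fluent Russian.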
In addition to isolated text, Yandex’s machine translation system can process entire web pages. A user who enters a web address at translate.yandex.ru first sees the original web page. The user’s browser then parses the page’s HTML code and sends only the text to the server, paragraph by paragraph. An English text, for instance, turns into Russian right before the user’s eyes.
In contrast to other systems, which send the server a link to the whole web page, the Yandex system sends only text. If all a server receives is a web address, it may not be able to process exactly the page that the user sees (for example, pages that require authorization). With the Yandex translation system, the end user sees exactly the page they accessed, only with all words and phrases translated into another language. In addition, the user does not have to wait until everything on the page is translated; they can read as the translation proceeds.
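The paragraph-by-paragraph approach can be sketched as follows. In the real service this work happens in the user’s browser; the Python version below only illustrates the idea. The class and function names are hypothetical, only <p> elements are treated as paragraphs, and translate stands for whatever call sends one paragraph to the server.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the visible text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs

def translate_page(html, translate):
    """Yield (original, translation) one paragraph at a time."""
    for paragraph in extract_paragraphs(html):
        yield paragraph, translate(paragraph)

page = "<html><body><h1>Title</h1><p>First <b>bold</b> text.</p><script>var x;</script><p>Second.</p></body></html>"
print(extract_paragraphs(page))  # ['First bold text.', 'Second.']
```

Because translate_page is a generator, each paragraph becomes available as soon as it is translated, so the reader does not wait for the whole page.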
Progress in statistical machine translation
One of the benefits of the statistical method of machine translation is that it progresses together with the language. A statistics-based machine translation system adds a new word, or phrase, or form, or spelling to its language model the moment it finds it. The faster an innovation or a change spreads in a language, the sooner it will appear in the system’s language and translation models.
To improve translation quality, the databases of the Yandex machine translation system are regularly updated. Every new update is tested with BLEU (Bilingual Evaluation Understudy), an algorithm that evaluates the quality of machine-translated text. A set of texts translated with Yandex’s system is compared against a reference set of translations. Additions to the system’s databases that do not improve translation quality are rejected.
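For illustration, here is a minimal sketch of a BLEU-style score with one reference translation per sentence: clipped n-gram precision for n from 1 to max_n, combined as a geometric mean and multiplied by a brevity penalty. Whitespace tokenization and the absence of smoothing are simplifications, and the acceptance rule at the end is an assumption based on the description above.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with a single reference per candidate sentence."""
    matches = [0] * max_n
    totals = [0] * max_n
    candidate_length = reference_length = 0
    for candidate, reference in zip(candidates, references):
        c, r = candidate.split(), reference.split()
        candidate_length += len(c)
        reference_length += len(r)
        for n in range(1, max_n + 1):
            c_ngrams, r_ngrams = ngrams(c, n), ngrams(r, n)
            # Counter intersection clips each n-gram count to the reference count.
            matches[n - 1] += sum((c_ngrams & r_ngrams).values())
            totals[n - 1] += sum(c_ngrams.values())
    if candidate_length == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    if candidate_length > reference_length:
        brevity = 1.0
    else:
        brevity = math.exp(1 - reference_length / candidate_length)
    return brevity * math.exp(log_precision)

def accept_update(old_score, new_score):
    """Hypothetical rule: keep an update only if the score improves."""
    return new_score > old_score

reference = ["быть или не быть"]
old = bleu(["быть или не бывает"], reference, max_n=2)
new = bleu(["быть или не быть"], reference, max_n=2)
print(accept_update(old, new))  # True
```

The example uses max_n=2 only because the sentences are very short; the usual setting is four.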