Machine Translation

In early 2011, Yandex implemented its own machine translation system. Yandex.Translate works with the major European languages, in both directions – for example, from English into Spanish and back.

Yandex’s machine translation system is statistical: its translations are based not on language rules, of which the system is not even aware, but on statistics. To learn a language, the system compares hundreds of thousands of parallel texts, that is, texts containing the same information in different languages. These might be, say, the different language versions of a company website. First, the system identifies parallel texts by their web addresses, which usually differ only in their country or language codes, such as “en” for English or “es” for Spanish.
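The address-matching step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list of language codes, the function names and the example addresses are hypothetical, and the text does not say how the real system parses addresses.

```python
import re
from collections import defaultdict

# Hypothetical list of language codes; the real list is not given in the text.
LANG_CODES = {"en", "es", "de", "fr", "ru"}


def url_key(url):
    """Return (language-neutral key, language code), or (None, None) if no code is found."""
    # Split on common URL delimiters but keep them, so the key can be rebuilt.
    parts = re.split(r"([/.?=&_-])", url)
    for i, part in enumerate(parts):
        if part.lower() in LANG_CODES:
            # Replace the first language code with a placeholder.
            key = "".join(parts[:i] + ["*"] + parts[i + 1:])
            return key, part.lower()
    return None, None


def find_parallel_candidates(urls):
    """Group addresses that differ only in their language code."""
    groups = defaultdict(dict)
    for url in urls:
        key, lang = url_key(url)
        if key is not None:
            groups[key][lang] = url
    # Keep only groups that exist in at least two languages.
    return [by_lang for by_lang in groups.values() if len(by_lang) > 1]


urls = [
    "https://example.com/en/about",
    "https://example.com/es/about",
    "https://example.com/en/contact",
    "https://other.org/news",
]
print(find_parallel_candidates(urls))
```

Pages grouped this way are only candidates: an address pattern suggests that two pages are versions of each other, but it does not prove that their content matches.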

For every text it studies, the system makes a list of unique characteristics. These may be rarely used words, numbers or special symbols found in a text in a certain sequence. When the system has gathered a large enough volume of texts with such characteristics, it begins to use them in its search for parallel texts – comparing those characteristics in new texts with those in texts it has already studied.
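As a rough illustration of comparing such characteristics, the Python sketch below reduces each text to the sequence of numbers and a few special symbols it contains, then measures how similar two sequences are. It is a simplification: rarely used words are left out, and the choice of symbols, the function names and the use of difflib are assumptions rather than a description of the actual system.

```python
import re
from difflib import SequenceMatcher


def fingerprint(text):
    """Numbers and a few special symbols, in the order they appear in the text."""
    return re.findall(r"\d+(?:[.,]\d+)*|[%$€@#&]", text)


def similarity(text_a, text_b):
    """Similarity of two fingerprints, from 0.0 (nothing shared) to 1.0 (identical)."""
    return SequenceMatcher(None, fingerprint(text_a), fingerprint(text_b)).ratio()


english = "Founded in 1997, the firm grew 25% in 2010."
spanish = "Fundada en 1997, la empresa creció un 25% en 2010."
unrelated = "Call us at 555 today."

print(similarity(english, spanish))
print(similarity(english, unrelated))
```

Because numbers and symbols usually survive translation unchanged, two versions of the same page tend to share the same sequence even though their words differ.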

To meet current quality standards for machine translation, the system has to study hundreds of millions of phrases in different languages. This requires considerable resources, including a great deal of hard disk space and RAM, which explains why machine translation systems are few and far between today.

Language learning

The three key components of Yandex’s machine translation system are a translation model, a language model and a decoder.

The translation model is a list of all the words and phrases known to the system in one language, with all their possible translations into another language and the probability of each translation. Each pair of languages has its own list. To create a translation model, the system first finds matching (parallel) texts, then finds pairs of matching phrases within these texts, and only then finds pairs of matching words or word combinations.

When the system is exposed to the first pair of parallel phrases in different languages, e.g. English and Spanish, it doesn’t have enough information to find statistical patterns for translation.

The London Bridge is a bridge in London (England), which crosses the river Thames between the City of London and Southwark. – El Puente de Londres es un puente en Londres (Inglaterra) que cruza el río Támesis, entre City of London y Southwark.

At this point, any word in the second phrase is equally likely to be a good translation of any word in the first phrase. After processing the second pair of parallel phrases, however, the system recalculates the probability for some of the words.

It is situated between the bridges of "Cannon Street Railway" and "Tower Bridge". – Se sitúa entre los puentes de "Cannon Street Railway" y "Tower Bridge".

Now the word ‘puentes’ is more likely to be the equivalent of ‘bridges’, because it was already a candidate, albeit in a different grammatical form (‘puente’), for a good translation of ‘bridge’ in the first pair. In this manner, the system repeats the comparing-and-matching process on millions of phrases in hundreds of thousands of texts.
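The text does not name the algorithm behind this recalculation. One classic technique that behaves as described is expectation-maximisation in the style of IBM Model 1, sketched below in Python. Every word pair starts with an equal probability, and each pass over the data shifts probability towards pairs that keep appearing together. The function name and the toy phrases are illustrative assumptions. Unlike the ‘puente’ and ‘puentes’ example, the sketch treats every word form as a separate token.

```python
from collections import defaultdict


def train_word_translation(pairs, iterations=10):
    """Estimate P(target word | source word) from parallel phrases.

    pairs: list of (source_tokens, target_tokens).
    Returns a dict mapping (source_word, target_word) to a probability.
    """
    target_vocab = {tw for _, tgt in pairs for tw in tgt}
    uniform = 1.0 / len(target_vocab)
    # Before any evidence is seen, every co-occurring pair is equally likely.
    t = {(sw, tw): uniform for src, tgt in pairs for sw in src for tw in tgt}

    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in pairs:
            for tw in tgt:
                # Share one unit of evidence for tw among the source words,
                # in proportion to the current probabilities.
                norm = sum(t[(sw, tw)] for sw in src)
                for sw in src:
                    share = t[(sw, tw)] / norm
                    count[(sw, tw)] += share
                    total[sw] += share
        # Re-estimate the probabilities from the collected evidence.
        t = {pair: c / total[pair[0]] for pair, c in count.items()}
    return t


pairs = [
    (["the", "bridge"], ["el", "puente"]),
    (["the", "river"], ["el", "río"]),
]
t = train_word_translation(pairs)
for pair in sorted(t):
    print(pair, round(t[pair], 3))
```

With only the first pair, ‘bridge’ would be as likely to mean ‘el’ as ‘puente’. The second pair shows ‘el’ again next to ‘the’ but without ‘bridge’, so probability moves from ‘bridge – el’ to ‘bridge – puente’.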

The system compares not only single words, but also sequences of two, three, four or five words. The translation model for each pair of languages processed by Yandex’s machine translation system contains hundreds of millions of pairs of words and phrases.

To create a language model, another component of Yandex’s machine translation technology, the system scans hundreds of thousands of texts in the target language (the language into which a text is translated) and puts together a list of the most frequently used words and word combinations, together with how often each is used. This is the system’s knowledge of the target language.
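A list of words and word combinations with their frequencies can be sketched as a simple n-gram count. The Python below is a minimal illustration, not the system’s actual data structure: the function name, the three-word limit and the tiny corpus are assumptions.

```python
from collections import Counter


def build_language_model(sentences, max_n=3):
    """Count every sequence of 1 to max_n consecutive words in the given sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


corpus = [
    "ser o no ser",
    "ser o no ser esa es la cuestión",
    "estar o no estar",
]
lm = build_language_model(corpus)
print(lm[("ser", "o", "no")], lm[("estar", "o", "no")])
```

A real model is built from far more text and stores frequencies more compactly, but the principle is the same: combinations seen often in the target language receive higher counts.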

Translation process

The actual translation is performed by the decoder. For every sentence of the source text, it selects all the translation options, combining phrases from the translation model, and sorts them in descending order of probability.

In the English-Spanish translation model, of all possible options, the one with the highest probability of being a good equivalent for William Shakespeare’s famous ‘to be or not to be’ would be ‘ser o no ser’, with the combination ‘estar o no estar’ a close second.

The decoder uses the language model to evaluate all the variations, and finds that ‘ser o no ser’ occurs more frequently than ‘estar o no estar’. As a result, it chooses the sentence that combines the highest probability according to the translation model with the highest frequency of use according to the language model.
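The way the two models are combined is not spelled out in the text. A common approach, assumed here, is to add the logarithms of the two scores and pick the candidate with the highest sum. The Python sketch below does this for the Shakespeare example. All the numbers are invented for illustration, and the function name and the smoothing scheme are hypothetical.

```python
import math


def choose_translation(candidates, lm_counts, smoothing=1.0):
    """Pick the candidate with the best combined score.

    candidates: dict mapping a translation to its translation-model probability.
    lm_counts: dict mapping a translation to its frequency in target-language texts.
    """
    total = sum(lm_counts.values()) + smoothing * len(candidates)

    def score(item):
        text, tm_prob = item
        # Smoothing keeps unseen candidates from getting a probability of zero.
        lm_prob = (lm_counts.get(text, 0) + smoothing) / total
        return math.log(tm_prob) + math.log(lm_prob)

    return max(candidates.items(), key=score)[0]


# Invented numbers: the translation model rates the two options almost equally,
# while the language model has seen one of them far more often.
candidates = {"ser o no ser": 0.40, "estar o no estar": 0.38}
lm_counts = {"ser o no ser": 120, "estar o no estar": 3}
print(choose_translation(candidates, lm_counts))
```

Working with logarithms avoids multiplying many small probabilities together, which matters once sentences are assembled from many phrases.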

Yandex’s machine translation system translates not only free-form texts, but also whole web pages. When a user enters a web address on translate.yandex.com, the page first opens in the original language. The user’s browser then parses the page’s HTML code and sends the text to the translator’s server paragraph by paragraph. The web page changes from, say, English into Spanish right before the user’s eyes. The user doesn’t even have to wait for the whole text to be translated: the first paragraphs can be read while the rest is still being processed.

Machine dictionary

Besides working with texts, Yandex.Translate can be used for individual words. The service has a full-fledged dictionary with detailed entries for words and set expressions. These entries are created on the basis of the same statistical data, but in this case with attention to the rules of the language.

Unlike the translation model, where words and word combinations are incorporated in any form, for the machine dictionary the system selects only basic forms of words, for example, singular nouns in the nominative case or verbs in the infinitive, and set expressions. The system performs morphological and syntactic analysis. It identifies parts of speech and the basic form of words, determines the limits of word combinations, and figures out which word an adjective relates to or what objects the verbs in a sentence take. This information helps the system sift out incomplete combinations. For example, ‘at the same time – al mismo tiempo’ and ‘chemistry – química’ are included in the machine dictionary, while ‘the same time it – al mismo tiempo se’ and ‘chemistry – de química’ are not.

The machine dictionary works with large volumes of parallel texts, and as a result the dictionary entries are very detailed. However, it is important that there be no mistakes or typos in the translation, so an algorithm based on machine learning checks all potential translation pairs and filters out those that are unreliable. As a result, ‘always – siempre’ and ‘wifi – conexión inalámbrica’ make it into the dictionary, while ‘always – siempe’ and ‘wifi – radio’ don’t.

Translations that are close in meaning are grouped together with the help of a dictionary of synonyms. This is also a statistical dictionary based on parallel texts. It includes words that are often translated identically or that form combinations with the same words. For example, the system identifies the Spanish words ‘perfectamente’ and ‘absolutamente’ as synonyms: they share the same translations (‘completely’, ‘wholly’, ‘entirely’, etc.) and are often encountered in the same contexts (‘perfectamente limpio – perfectly clean’ and ‘absolutamente limpio – absolutely clean’).
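The first of these signals, shared translations, can be illustrated with a simple set-overlap measure. The Python sketch below is an assumption-laden simplification: it uses Jaccard overlap with a hypothetical threshold of 0.5, ignores shared contexts entirely, and works on a hand-made table of translations.

```python
def translation_overlap(translations_a, translations_b):
    """Share of translations two words have in common (Jaccard overlap)."""
    a, b = set(translations_a), set(translations_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_synonyms(translations, threshold=0.5):
    """Return pairs of words whose translation sets overlap strongly enough."""
    words = sorted(translations)
    return [
        (w1, w2)
        for i, w1 in enumerate(words)
        for w2 in words[i + 1:]
        if translation_overlap(translations[w1], translations[w2]) >= threshold
    ]


# Hand-made example data, not taken from a real translation model.
translations = {
    "perfectamente": {"completely", "wholly", "entirely", "perfectly"},
    "absolutamente": {"completely", "wholly", "entirely", "absolutely"},
    "siempre": {"always"},
}
print(find_synonyms(translations))
```

Here the two adverbs share three of their five distinct translations, which is enough to pair them, while ‘siempre’ shares none and stays on its own.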

In this way, the machine dictionary receives everything it needs to know about every word and expression: its basic form, part of speech, meaning and synonyms. It illustrates the translations with examples taken from those same parallel texts.

Development of statistical translation

One of the advantages of statistical machine translation is that it is alive, just like the language itself. That is to say, if something changes in a language, for example, people start spelling a certain word differently, the system notices this as soon as new texts are introduced. The faster a change catches on and becomes widely used in a language, the sooner it is incorporated into the translation and language models.

The system is regularly updated to improve the quality of translation. Every newly introduced change is checked using quality metrics for statistical machine translation.

Translations of specially selected texts are compared with standard reference samples, and if the quality of translation has worsened, the changes are scrapped.
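This kind of check can be sketched as a simple gate: translate a fixed test set with the old and the new version of the system, score both against the reference translations, and keep the change only if the score has not dropped. The text does not say which metrics are used, so the Python below substitutes a deliberately crude one, the share of translated words that also appear in the reference. All the names and the test data are hypothetical.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Share of words in the candidate that also occur in the reference."""
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    # A word is credited at most as many times as it occurs in the reference.
    matched = sum(min(n, ref_counts[w]) for w, n in Counter(cand).items())
    return matched / len(cand)


def average_score(translate, test_set):
    """Mean score of a translation function over (source, reference) pairs."""
    scores = [unigram_precision(translate(src), ref) for src, ref in test_set]
    return sum(scores) / len(scores)


def accept_update(old_translate, new_translate, test_set):
    """Keep the new version only if quality on the test set has not worsened."""
    return average_score(new_translate, test_set) >= average_score(old_translate, test_set)


test_set = [("to be or not to be", "ser o no ser")]


def current(source):
    return "ser o no ser"


def proposed(source):
    return "estar o no estar"


print(accept_update(current, proposed, test_set))
```

Production metrics compare longer word sequences and penalise overly short output, but the gate around them works the same way.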