Everything we see on the Internet is poor-quality machine translation

Brother

Scientists have identified the impact of African languages on online content.

A recent study conducted by the Amazon Web Services Artificial Intelligence Lab (AWS AI Lab) found that much of the content on the Internet, especially in languages spoken in Africa and the Global South, consists of machine-translated texts.

More than half of the sentences on the Internet have been translated into two or more languages, often with errors caused by poor-quality machine translation, which raises concerns about the training of large language models (LLMs).

AWS noted that interest in the topic arose after colleagues of the Amazon researchers, who work in machine translation and are native speakers of low-resource languages, pointed out that a large amount of the content in their native languages had been created using machine translation.

The study analyzed 6.38 billion sentences collected from the Internet. It found that 57.1% of the sentences were translated into three or more languages. This is especially true for languages spoken in Africa and other regions with a low volume of content, which leads to poor translation quality.

Sentences are translated into French more often than into low-resource languages because far more data is available in French. High-resource languages such as English or French had an average parallelism of 4 languages (each sentence has translated equivalents in three other languages), while low-resource languages such as the African languages Wolof and Xhosa had an average of 8.6. In addition, less common languages tended to have much worse translations.

Translated equivalents are words, phrases, or sentences in one language that have a counterpart in another language conveying the same meaning. For example, the English expression "good morning" corresponds to the Russian phrase "доброе утро". The phrases are not literally identical, but they convey the same message in the appropriate cultural and linguistic context.
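To make the "average parallelism" figure more concrete, here is a minimal Python sketch of how such a number could be computed. It is not the study's code: the toy data, the function names `average_parallelism` and `share_multiway`, and the idea of representing each group of mutual translations as a set of language codes are all assumptions made for illustration.

```python
from collections import defaultdict

def average_parallelism(tuples):
    """Map each language to the mean size of the translation groups it appears in."""
    sizes = defaultdict(list)
    for langs in tuples:
        for lang in langs:
            sizes[lang].append(len(langs))
    return {lang: sum(v) / len(v) for lang, v in sizes.items()}

def share_multiway(tuples, min_langs=3):
    """Fraction of sentences that sit in groups covering at least min_langs languages."""
    total = sum(len(t) for t in tuples)
    multi = sum(len(t) for t in tuples if len(t) >= min_langs)
    return multi / total

# Hypothetical toy data: each set lists the languages in which one sentence exists.
toy_tuples = [
    {"en", "fr"},
    {"en", "fr", "ru"},
    {"en", "fr", "ru", "wo", "xh"},
    {"en", "fr", "ru", "wo", "xh", "de"},
]

for lang, value in sorted(average_parallelism(toy_tuples).items()):
    print(lang, round(value, 2))
print("multi-way share:", share_multiway(toy_tuples))
```

In this toy data the low-resource languages (`wo`, `xh`) only appear in large groups, so their average comes out higher than that of English or French. That is the same pattern the article describes.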

The study also found that content with a high level of multi-way parallelism tends to consist of shorter, more predictable sentences of 5-10 words. Most of these sentences come from articles that the researchers characterized as low quality, requiring no special knowledge or effort to create.

The researchers emphasized that this selection of short sentences from low-quality articles is explained by the desire to generate advertising revenue through mass machine translation into low-resource languages. Such practices raise questions about the development of large language models in these languages.

The study says that modern AI requires huge amounts of training data, and that such problems with the quality and accuracy of machine translation can lead to less capable models that make many errors.
 