AI has been receiving a lot of attention in the media lately as it finds application in an increasing number of industries.

This is especially exciting since improvements in neural machine translation (‘NMT’) now mean that, for numerous language combinations, it’s often possible to get translations that are not only intelligible but actually quite good.

However, why is it still not perfect? Why is it so difficult to generate natural, native-sounding language? To understand that, we must consider how humans produce language as opposed to how machines do.

How does the human brain produce language?

Humans create variations of words based on their interactions, whether with friends, business contacts, or family; we all do it. Unlike machines, we understand words from their overall context.

Humans take numerous things into account when forming a sentence: our knowledge about a subject, the words’ relationship to our environment, and the company we’re in as we speak.

As we grow we continuously absorb images, developing emotions towards people and things that affect us. Consequently, we recognize sounds and voices and subconsciously associate them with those images, actions, and emotions. The older we get, the more adept we become at dissecting language, and our ability to express our thoughts coherently and efficiently improves.

Experiences are registered by our visual senses, then recorded and categorized. Emotions and intuition are triggered by chemicals that affect our nervous systems; we may then use words to express those emotions.

We see something that triggers an emotional reaction, and subconsciously cross-reference it with our own language “database” to identify how best to express ourselves. These things must be experienced, and experience takes time to develop. There aren’t any shortcuts. Simply remembering words isn’t enough. It’s what we associate with different variations of words that counts. 

How does AI translation differ?

For machines such as NMT systems, the starting point is the words and phrases that make up a source text, along with any relevant content material in their database. These are what the system bases its linguistic decisions about the best target text on, as opposed to the images, emotions and experiences that serve as primary reference points for humans.

While much can be achieved with highly advanced analysis of a source text, that analysis still misses the many environmental nuances which subconsciously guide the human linguistic decision-making process: implicit elements that we naturally incorporate into our language but that never appear in the words themselves.

The expression “reading between the lines” refers to the drawing of inferences — not from words directly, but from our knowledge, experience, or the context associated with a piece of text. We can then use this to better express ideas or thoughts in another language when cultural nuances differ, or we are constrained by the structure of our mother tongue.

If future NMT solutions could continuously consult enormous amounts of pre-existing, constantly updated, human-translated data, and from it analyse and incorporate the surrounding variables of a speaking environment, they could very well match human-level translations.


How do we distinguish the good translations from the bad?

One of the biggest challenges with machine translation is the ongoing quality assessment it must undergo to ensure human standards are maintained. Numerous rating systems have been developed over recent years to judge the accuracy of SMT (statistical machine translation) and NMT output. The more common ones are BLEU (Bilingual Evaluation Understudy), TER (Translation Edit Rate), and GTM (General Text Matcher).

Each has its pros and cons. BLEU splits a text into segments, compares the word sequences (n-grams) in each segment against one or more existing human reference translations, and scores how closely they overlap, penalizing candidates that are shorter than their references.
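To make this concrete, here is a minimal Python sketch of BLEU’s core idea: clipped n-gram precision combined with a brevity penalty. This is an illustration only; production implementations (such as sacreBLEU) add tokenization and smoothing details, and the sample sentences below are invented.

```python
from collections import Counter
import math

def modified_ngram_precision(candidate, references, n):
    # Count candidate n-grams, clipping each count at the maximum number
    # of times that n-gram appears in any single reference.
    cand_ngrams = Counter(zip(*[candidate[i:] for i in range(n)]))
    if not cand_ngrams:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for ng, c in Counter(zip(*[ref[i:] for i in range(n)])).items():
            max_ref_counts[ng] = max(max_ref_counts[ng], c)
    clipped = sum(min(c, max_ref_counts[ng]) for ng, c in cand_ngrams.items())
    return clipped / sum(cand_ngrams.values())

def bleu(candidate, references, max_n=4):
    # Geometric mean of the 1..max_n gram precisions, times a brevity
    # penalty that punishes candidates shorter than the closest reference.
    precisions = [modified_ngram_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) >= ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * geo_mean

candidate = "the cat sat on the mat".split()
references = ["the cat sat on the mat today".split()]
print(round(bleu(candidate, references), 3))  # ~0.846: perfect overlap, length penalty
```

Even this toy version shows BLEU’s character: it rewards surface overlap with a reference, so a fluent but freely worded translation can score poorly.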

TER supports linguists with post-editing. It takes the machine-translated target text, compares it to existing, accepted reference translations, and returns the minimum number of edits required to bring the target into line with a reference. This “edit rate” isn’t to be taken as an exact figure; rather, it gives the linguist an idea of the total effort required. Many of the suggested edits may not be serious or necessary, so this methodology isn’t ideal as a judge of the overall quality of machine translation output.
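The following simplified sketch illustrates the edit-rate idea, assuming plain word-level edit distance (insertions, deletions, substitutions) divided by reference length. Real TER also allows whole-phrase shifts, which this version omits, and the example sentences are invented.

```python
def ter(candidate, reference):
    # Simplified TER: word-level edit distance divided by reference length.
    m, n = len(candidate), len(reference)
    # dp[i][j] = minimum edits to turn candidate[:i] into reference[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if candidate[i - 1] == reference[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n] / n if n else 0.0

mt_output = "the contract must signed by both party".split()
reference = "the contract must be signed by both parties".split()
print(f"TER = {ter(mt_output, reference):.2f}")  # 2 edits / 8 words = 0.25
```

A lower score means less post-editing effort: here, one insertion (“be”) and one substitution (“party” to “parties”) would align the output with the reference.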

GTM uses several similarity metrics which check for “hits” (two words that match in the candidate and reference text) and matches of “runs” (adjacent sequences of matching words). It does so across numerous variations, factoring in all identified matches regardless of length.
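Here is a hypothetical simplification of that run-matching idea in Python: greedily match the longest common word runs, then compute an F-measure over the matched words. Real GTM computes a maximum matching and can weight longer runs more heavily via an exponent; this sketch uses the simplest setting (exponent 1) and invented sample sentences.

```python
def greedy_run_matching(candidate, reference):
    # Greedily match the longest common run of words, mark the matched
    # positions as used, and repeat until no common run remains.
    cand, ref = list(candidate), list(reference)
    matched = 0
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i in range(len(cand)):
            for j in range(len(ref)):
                k = 0
                while (i + k < len(cand) and j + k < len(ref)
                       and cand[i + k] is not None
                       and cand[i + k] == ref[j + k]):
                    k += 1
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break
        matched += best_len
        for k in range(best_len):
            cand[best_i + k] = None  # distinct sentinels so used slots
            ref[best_j + k] = "\0"   # can never match each other again
    return matched

def gtm_f1(candidate, reference):
    # F-measure over matched words (GTM with run-length exponent e = 1).
    hits = greedy_run_matching(candidate, reference)
    if hits == 0:
        return 0.0
    precision = hits / len(candidate)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)

cand = "the cat sat on the mat".split()
ref = "the cat was on the mat".split()
print(f"GTM F1 = {gtm_f1(cand, ref):.2f}")  # 5 of 6 words matched -> 0.83
```

With an exponent of 1, only the number of matched words affects the score; it is GTM’s run-length exponent, omitted here, that rewards output preserving longer stretches of the reference’s phrasing.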

TER and GTM are said to function better than other rating systems, as they look at numerous variations at different lengths, then provide a final metric indicating the effort required to improve the target text.

What does this mean for the future of AI translation?

For the foreseeable future, the difficulty for any language scoring system is that it must function without a single, objectively defined winning standard. There is no such thing as “the perfect, most accurate translation”. Anybody describing a translation in the superlative overlooks the fact that there are many accepted versions, all of them linguistically and, possibly, stylistically correct.

For now, automatic machine translation evaluation measures are less reliable than human evaluations and are still far from being able to entirely substitute human judgement. However, through the application of well-known evaluation metrics, machine translation will continue to help linguists get a better idea of the quality of different translations, especially for standard text that must stick close to the source (e.g. technical manuals, instructions and descriptions).

Only by combining smart technology with human input and decision-making can a final version be produced within a minimal time frame, at a reasonable price and, most importantly, at the expected level of quality.


Published 23 May, 2018 by Hannes Ben

Hannes Ben is Chief International Officer at Forward3D and Locaria and a contributor to Econsultancy.


Comments (2)

Brian Hennessy, Founder, Thread | Cofounder, Talkoot

Thanks for the article Hannes. I agree completely. One of the reasons language changes so quickly is that language is one of the critical ways subcultures distinguish themselves from the larger culture. Ever since the advent of mass media, the faster mass culture has been able to appropriate slang from subcultures, the quicker the language of those subcultures has changed.

Teens create their own language, in part, to stand apart from adults. AI is the ultimate appropriator of language. The faster machines learn to 'sound natural,' the faster people will redefine what it means to 'sound natural.' I predict it will be a never-ending arms race.

The primary way people adapt language is through metaphor. Probably the larger reason natural language might prove to be a snipe hunt is that persuasive storytelling is based heavily on metaphorical language. A persuasive metaphor is the unexpected combination of two unrelated meanings to create a new, unique third meaning. This is the very definition of a good idea, whether it be a turn of phrase, a business or a product. Computers are great at the raw processing it takes to turn that idea into a reality, once you have an idea. But they won't likely outpace the human brain, with its trillions of synaptic connections per inch, in creating unexpected new ideas.


Hannes Ben, Chief International Officer at Forward3D Group

Thanks for the insightful and detailed comment on my article.

I very much share your thoughts and opinions.

AI will be able to help us in generating content around a narrowly defined subject. But it will be up to humans to create pieces of text that cater to readers’ feelings. The AI won’t be wrong; it will just sound different.

HB

