
When it comes to the nuanced world of language translation, how do we discern a mediocre translation from a masterpiece?
The answer doesn’t come from a human expert but rather from an ingenious metric known as the BLEU score.
BLEU, which stands for Bilingual Evaluation Understudy, has revolutionized the field of machine translation by providing a method to quantitatively measure the quality of automated translations against high-quality human translations.
Imagine the challenge faced by machines in not only grasping the lexical tapestry of a language but also its cultural undertones and contextual nuances.
Translating “the spirit of the text,” so to speak, is a lofty goal that machine translation endeavors to achieve. The BLEU score is at the forefront of this challenge, acting as a critical yardstick for progress. It’s not just a number; it’s a benchmark that tells us how close a machine-generated translation is to achieving the fluency and understanding of a skilled human translator.
The beauty of the BLEU score lies in its simplicity and its profound implications. It uses a statistical approach to compare a machine’s translation output with that of a human, taking into account factors such as precision in word choice and sentence structure. By doing so, it helps developers fine-tune their algorithms, pushing the boundaries of what artificial intelligence can achieve in breaking down language barriers.
In this article, we’ll delve into the origins of the metric, how it works, its impact on the field of natural language processing, and how its limitations have been addressed.
The BLEU score was introduced in a seminal paper by Kishore Papineni and colleagues in 2002. Before its inception, the evaluation of machine translations was a subjective process, heavily reliant on human judgment. This posed a significant challenge for the development of machine translation systems, as human evaluations were not only time-consuming and costly but also fraught with inconsistency. The introduction of BLEU brought a revolutionary shift, offering an automated, quick, and objective way to evaluate the quality of translations produced by machines.
The BLEU score quantifies the quality of machine-translated text by comparing it with reference human translations at the level of word sequences (n-grams). The comparison is based on two key components:
- Modified n-gram precision: the fraction of n-grams in the candidate translation that also appear in the reference translations, with each n-gram counted at most as many times as it occurs in a reference (this “clipping” stops a system from scoring well by repeating common words).
- Brevity penalty (BP): a factor that penalizes candidate translations that are shorter than the references, so a system cannot score well by outputting only a few “safe” words.
The overall BLEU score is computed as follows:

BLEU = BP · exp( Σ (n = 1 to N) wn · log pn )

where pn is the modified precision for n-grams, wn is the weight given to each n-gram order (typically uniform, e.g., 0.25 each when N = 4), N is the maximum n-gram order considered (usually 4), and BP is the brevity penalty explained below.
The logarithm turns the product of precisions into a sum, and the exponential undoes the logarithm, so the exp(...) term is simply the weighted geometric mean of the n-gram precisions; as long as every pn is greater than zero, the resulting score is a positive number.
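To make the formula concrete, here is a minimal Python sketch that combines already-computed n-gram precisions and a brevity penalty into a BLEU score. The function name and arguments are illustrative for this article, not the API of any particular library:

```python
import math

def bleu_from_precisions(precisions, brevity_penalty, weights=None):
    """Combine pre-computed n-gram precisions and a brevity penalty into BLEU."""
    if weights is None:
        # Uniform weights wn = 1/N when none are given
        weights = [1.0 / len(precisions)] * len(precisions)
    if any(p == 0 for p in precisions):
        return 0.0  # log(0) is undefined; unsmoothed BLEU collapses to 0
    log_sum = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty * math.exp(log_sum)

# Example: the precisions and BP worked out later in this article.
print(bleu_from_precisions([1.0, 0.8, 0.5], brevity_penalty=0.846))  # about 0.62
```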
Now that we understand the precision term, let’s explore in more depth how the brevity penalty is computed.
As stated before, BP compensates for the length of the translation: if the candidate translation is shorter than the reference translation, BP will be less than 1; otherwise, it will be 1. It is calculated as follows:

BP = 1 if c > r
BP = exp(1 − r/c) if c ≤ r

where:
- c is the length (in words) of the candidate (machine-generated) translation
- r is the effective length of the reference translation
The reference length r is determined from the set of reference translations that serve as the standard or “gold” translations to which the machine-generated translation (the candidate translation) is compared. The goal is to select a reference length that best matches the length of the candidate translation to avoid unfairly penalizing or rewarding it solely based on length differences.
There are a few different strategies to choose the reference length r (sketched in code below):
- Closest reference length: for each candidate sentence, pick the reference whose length is closest to the candidate’s length (the “best match length” used in the original BLEU paper).
- Shortest reference length: always use the shortest reference, the most lenient choice for the brevity penalty.
- Average reference length: use the mean length of all the references as r.
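Here is a small, illustrative Python sketch of the brevity penalty with these strategies. The function name and the strategy keywords are assumptions made for this example, not the interface of any library:

```python
import math

def brevity_penalty(candidate_len, reference_lens, strategy="closest"):
    """Compute BP for one candidate given the lengths of its references."""
    if strategy == "closest":
        # Reference length closest to the candidate; ties go to the shorter one
        r = min(reference_lens, key=lambda ref: (abs(ref - candidate_len), ref))
    elif strategy == "shortest":
        r = min(reference_lens)
    else:  # "average"
        r = sum(reference_lens) / len(reference_lens)
    c = candidate_len
    return 1.0 if c > r else math.exp(1.0 - r / c)

print(brevity_penalty(6, [7]))  # candidate shorter than reference -> BP < 1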
BLEU scores range from 0 to 1 (or 0 to 100 when expressed in percentage terms), where a score of 1 indicates a perfect match with the reference translation.
Let’s take a simple example to illustrate how the BLEU score is calculated. Suppose we have the following sentences:
- Machine translation (MT): “The cat is on the mat.”
- Reference translation (RT): “The cat is sitting on the mat.”
Unigram Precision (1-gram matches):
MT has 6 unigrams, all of which are in the RT, so the unigram precision is 6/6 = 1.
Bigram Precision (2-gram matches):
MT has 5 bigrams (“the cat”, “cat is”, “is on”, “on the”, “the mat”); all of them except “is on” appear in the RT, so the bigram precision is 4/5.
Trigram Precision (3-gram matches):
MT has 4 trigrams (“the cat is”, “cat is on”, “is on the”, “on the mat”); only “the cat is” and “on the mat” appear in the RT, so the trigram precision is 2/4 = 0.5.
4-gram Precision (4-gram matches):
MT has 3 four-grams (“the cat is on”, “cat is on the”, “is on the mat”); none of them appear in the RT, since every four-gram of the RT contains “sitting”, so the four-gram precision is 0/3 = 0.
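These counts can be reproduced with a short Python sketch of clipped (modified) n-gram precision, assuming simple whitespace tokenization and a single reference:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision for one candidate/reference pair."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Each candidate n-gram counts at most as often as it appears in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

candidate = "the cat is on the mat".split()
reference = "the cat is sitting on the mat".split()
for n in range(1, 5):
    print(n, modified_precision(candidate, reference, n))  # 1.0, 0.8, 0.5, 0.0
```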
The length of MT is 6 words, while the RT is 7 words. Since the candidate is shorter than the reference, the brevity penalty applies: BP = exp(1 − 7/6) ≈ 0.846.
Because the four-gram precision is 0, its logarithm is undefined (log 0 tends to minus infinity), and the strict 4-gram BLEU score for this single short sentence collapses to 0. This is very common at the sentence level, which is why sentence-level scores are usually smoothed or computed over fewer n-gram orders. For illustration, let’s assume uniform weights wn = 1/3 over the unigram, bigram, and trigram precisions (p1 = 1, p2 = 0.8, p3 = 0.5).

The BLEU score is then calculated as:

BLEU = BP · exp( (1/3) · (log p1 + log p2 + log p3) )

Plugging in the values, first compute the logarithms: log(1) = 0, log(0.8) ≈ −0.2231, log(0.5) ≈ −0.6931.

Sum the log values and apply the weights: (1/3) · (0 − 0.2231 − 0.6931) ≈ −0.3054.

Finally, take the exponential and apply the brevity penalty: BLEU ≈ 0.846 × exp(−0.3054) ≈ 0.846 × 0.737 ≈ 0.62.

The BLEU score for the given example is therefore approximately 0.62.
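If NLTK is installed, this hand calculation can be reproduced in a few lines; the 1/3 weights mirror the trigram-level illustration above, and the tokenization is a plain lowercase word split:

```python
from nltk.translate.bleu_score import sentence_bleu

reference = "the cat is sitting on the mat".split()
candidate = "the cat is on the mat".split()

# Uniform weights over 1- to 3-grams, mirroring the hand calculation above.
score = sentence_bleu([reference], candidate, weights=(1/3, 1/3, 1/3))
print(f"BLEU: {score:.4f}")  # roughly 0.62

# With the default 4-gram weights and no smoothing, the zero four-gram
# precision drives the score to (essentially) zero, with a warning.
```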
To explain how BLEU can be interpreted, we can take the result of the example as a reference.
In the example we obtained a BLEU score of roughly 0.62, which is quite high, especially considering that human translations typically do not achieve a perfect score because there are many valid ways to translate the same sentence. Scores closer to 1 indicate that the translation is very similar to the reference translation and, in theory, should be of higher quality.
However, interpreting BLEU scores can be a bit more nuanced than “higher is better.”
In our example, where the machine translation was “The cat is on the mat.” and the reference translation was “The cat is sitting on the mat.”, we can consider it a good translation only up to a point.
Technically the cat is indeed on the mat, so nothing in the translation is wrong as far as it goes; but if we want to know what the cat is doing on the mat, the translation falls short because it is too general.
This is a clear example of how the usefulness of a given BLEU score really depends on the context and on the purpose of the translation.
The BLEU score has been instrumental in evaluating translation quality and a huge step forward for the field, but as with any metric, it has limitations. It doesn’t account well for semantic accuracy, grammatical correctness, or the flexibility of language use. Let’s look at these limitations in more depth:
- Semantic accuracy: BLEU only counts overlapping n-grams, so a translation can match many words yet distort the meaning, or convey the right meaning with different words and be penalized.
- Grammatical correctness: there is no explicit check of grammar or fluency beyond the local word order captured by short n-grams.
- Flexibility of language: synonyms and legitimate paraphrases that do not appear in the reference translations receive no credit.
- Sentence-level reliability: BLEU was designed as a corpus-level metric, and scores for individual short sentences (as in our example) can be unstable without smoothing.
Due to these limitations, BLEU is often used alongside other automatic metrics such as METEOR and ROUGE, or more advanced ones such as BERTScore, which attempt to account for some of BLEU’s shortcomings.
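As a rough illustration of what “alongside” can look like in practice, the sketch below (assuming the sacrebleu and bert-score packages are installed) reports corpus-level BLEU next to BERTScore for our example pair; the exact values will depend on package versions and the underlying model:

```python
import sacrebleu
from bert_score import score as bert_score

hypotheses = ["The cat is on the mat."]
references = ["The cat is sitting on the mat."]

# Corpus-level BLEU (sacreBLEU reports scores on a 0-100 scale).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# BERTScore compares contextual embeddings instead of surface n-grams.
P, R, F1 = bert_score(hypotheses, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```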