Correlation of Automatic and Manual Metrics
The credibility of automatic evaluation metrics rests on their correlation with reliable human judgments.
Correlation of metrics is the main subject of 15 publications; 12 are discussed here.
One study finds evidence in support of BLEU in a total of 124 evaluations across many European language pairs. Other evaluation campaigns continuously assess the correlation of human and automatic metrics, for instance the CESTA campaign (Surcin et al., 2005; Hamon et al., 2007). Yasuda et al. (2003) and Finch et al. (2004) investigate the required number of reference translations. Hamon and Mostefa (2008) find that the quality of the reference translations is not very important.
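Meta-evaluations of this kind typically report rank or linear correlation between system-level automatic scores and human judgments. The sketch below computes Pearson's r and Spearman's rho for a handful of hypothetical systems; the scores are made-up illustrative numbers, not data from any of the cited studies, and the hand-rolled correlation routines ignore ties for simplicity.

```python
from statistics import mean

def pearson(xs, ys):
    # Pearson's r: covariance normalized by the product of standard deviations.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(xs):
    # Assign ranks 1..n by sorted order (no tie handling in this toy version).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    # Spearman's rho is Pearson's r applied to the ranks.
    return pearson(ranks(xs), ranks(ys))

# Hypothetical system-level scores for five MT systems (assumed numbers).
bleu  = [0.28, 0.31, 0.25, 0.35, 0.22]  # automatic metric
human = [3.1, 3.4, 2.9, 3.8, 2.5]       # mean adequacy judgments (1-5 scale)

print(f"Pearson r    = {pearson(bleu, human):.3f}")
print(f"Spearman rho = {spearman(bleu, human):.3f}")
```

A high rank correlation means the metric orders systems the same way human judges do, which is the property the evaluation campaigns above are testing.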
Akiba et al. (2003) show strong correlation for BLEU, but only if the systems compared are of a similar type. BLEU tends to correlate less well when comparing human translators with machine translation systems (Culy and Riehemann, 2003; Popescu-Belis, 2003), or when comparing statistical and rule-based systems (Callison-Burch et al., 2006). Amigó et al. (2006) find that the relationship between manual metrics, which measure human acceptability, and automatic metrics, which check the similarity of system output against human translations, is more complex.
Cer et al. (2010) examine how tuning towards one metric affects the evaluation score measured with another metric. They find that tuning towards BLEU gives reasonable results according to manual judgment, compared to other, more recently proposed metrics (TER, METEOR).
Further publications on this topic:
- Le et al. (2016)
- Graham et al. (2015)
- Castilla et al. (2005)