Correlation of Automatic and Manual Metrics

The credibility of automatic evaluation metrics rests on their correlating with reliable human judgments.

Correlation Of Metrics is the main subject of 15 publications, 12 of which are discussed here.

Publications

Coughlin (2003) finds evidence in support of BLEU in a total of 124 evaluations for many European language pairs. Evaluation campaigns such as CESTA (Surcin et al., 2005; Hamon et al., 2007) continuously assess the correlation between human and automatic metrics. Yasuda et al. (2003) and Finch et al. (2004) investigate the required number of reference translations. Hamon and Mostefa (2008) find that the quality of the reference translations is not very important.
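
As an illustration of what assessing such a correlation can look like in practice, here is a minimal sketch that computes Pearson's r between system-level automatic metric scores and human judgment scores. It makes several assumptions: the choice of Pearson's r is ours and is not necessarily the coefficient used in the studies cited above, the function name `pearson` is hypothetical, and the score lists are invented for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented illustrative data: one entry per hypothetical MT system.
metric_scores = [0.21, 0.25, 0.28, 0.31, 0.35]  # e.g. an automatic metric
human_scores = [2.9, 3.1, 3.0, 3.6, 3.8]        # e.g. mean human rating

r = pearson(metric_scores, human_scores)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 would indicate that the automatic metric ranks and spaces the systems much as the human judges do. With only a handful of systems, as in this toy example, such a coefficient is a weak basis for conclusions.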

Akiba et al. (2003) show strong correlation for BLEU only if the systems are of similar type. BLEU tends to correlate less well when comparing human translators with machine translation systems (Culy and Riehemann, 2003; Popescu-Belis, 2003), or when comparing statistical and rule-based systems (Callison-Burch et al., 2006). Amigó et al. (2006) find that the relationship between manual metrics that measure human acceptability and automatic metrics that check the similarity of system output to human translations is somewhat more complex.

Cer et al. (2010) examine how tuning towards one metric affects the evaluation score measured with another metric. They find that tuning towards BLEU gives reasonable results according to manual judgment, compared to other, more recently proposed metrics (TER, METEOR).

Benchmarks

Discussion

New Publications

Ngoc-Tien Le and Christophe Servan and Benjamin Lecouteux and Laurent Besacier (2016): Better Evaluation of ASR in Speech Translation Context Using Word Embeddings, INTERSPEECH 2016. Mentioned in: Evaluation, Speech Translation, Correlation Of Metrics and Confidence Measures
@InProceedings{LeIS2016,
author = {Ngoc-Tien Le and Christophe Servan and Benjamin Lecouteux and Laurent Besacier},
title = {Better Evaluation of ASR in Speech Translation Context Using Word Embeddings},
booktitle = {INTERSPEECH 2016},
year = 2016
}
Le et al. (2016)
Graham, Yvette and Baldwin, Timothy and Mathur, Nitika (2015): Accurate Evaluation of Segment-level Machine Translation Metrics, Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
@InProceedings{graham-baldwin-mathur:2015:NAACL-HLT,
author = {Graham, Yvette and Baldwin, Timothy and Mathur, Nitika},
title = {Accurate Evaluation of Segment-level Machine Translation Metrics},
booktitle = {Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
month = {May--June},
address = {Denver, Colorado},
publisher = {Association for Computational Linguistics},
pages = {1183--1191},
url = {http://www.aclweb.org/anthology/N15-1124},
year = 2015
}
Graham et al. (2015)
Andre Castilla and Alice Bacic and Sergio Furuie (2005): Machine Translation on the Medical Domain: The Role of BLEU/NIST and METEOR in a Controlled Vocabulary Setting, Proceedings of the Tenth Machine Translation Summit (MT Summit X)
@InProceedings{Castilla:2005:MTS,
author = {Andre Castilla and Alice Bacic and Sergio Furuie},
title = {Machine Translation on the Medical Domain: The Role of {BLEU/NIST} and {METEOR} in a Controlled Vocabulary Setting},
url = {http://www.mt-archive.info/MTS-2005-Castilla.pdf},
googlescholar = {6067460913023773678},
booktitle = {Proceedings of the Tenth Machine Translation Summit (MT Summit X)},
month = {September},
address = {Phuket, Thailand},
year = 2005
}
Castilla et al. (2005)

MT Research Survey Wiki

A Comprehensive Survey of Neural and Statistical Machine Translation Research Publications