Correlation of Automatic and Manual Metrics
The credibility of automatic evaluation metrics rests on their correlating with reliable human judgments.
Correlation Of Metrics is the main subject of 15 publications. 12 are discussed here.
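As a minimal sketch of what "correlating with human judgments" means in practice, the snippet below computes a Pearson correlation coefficient between automatic metric scores and human scores at the system level. The function name `pearson` and all score values are hypothetical and invented for illustration. They are not taken from any of the publications listed here, and the individual studies may use other correlation statistics.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical system-level scores (one entry per MT system), illustration only.
metric_scores = [0.21, 0.25, 0.30, 0.34]  # e.g. an automatic metric
human_scores = [2.8, 3.1, 3.3, 3.9]       # e.g. mean human adequacy ratings

r = pearson(metric_scores, human_scores)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 would indicate that the automatic metric ranks and spaces the systems much as the human judges do. With only a handful of systems, such a coefficient is a rough indication rather than strong evidence.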
Publications
D. Coughlin (2003):
Correlating Automated and Human Assessments of Machine Translation Quality, Proceedings of the MT Summit IX
@inproceedings{Coughlin:2003,
author = {D. Coughlin},
title = {Correlating Automated and Human Assessments of Machine Translation Quality},
url = {http://mt-archive.info/MTS-2003-Coughlin.pdf},
booktitle = {Proceedings of the {MT} Summit IX},
year = 2003
}
Coughlin (2003) finds evidence in support of BLEU in a total of 124 evaluations for many European language pairs. Other evaluation campaigns continuously assess the correlation between human and automatic metrics, as in the CESTA campaign
Sylvain Surcin and Olivier Hamon and Anthony Hartley and Martin Rajman and Andrei Popescu-Belis and Widad Mustafa El Hadi and Ismaïl Timimi and Marianne Dabbadie and Khalid Choukri (2005):
Evaluation of Machine Translation with Predictive Metrics beyond BLEU/NIST: CESTA Evaluation Campaign, Proceedings of the Tenth Machine Translation Summit (MT Summit X)
@InProceedings{Surcin:2005:MTS,
author = {Sylvain Surcin and Olivier Hamon and Anthony Hartley and Martin Rajman and Andrei Popescu-Belis and Widad Mustafa El Hadi and Isma\"{i}l Timimi and Marianne Dabbadie and Khalid Choukri},
title = {Evaluation of Machine Translation with Predictive Metrics beyond {BLEU/NIST}: {CESTA} Evaluation Campaign},
url = {http://mt-archive.info/MTS-2005-Surcin.pdf},
googlescholar = {15915659083988352654},
booktitle = {Proceedings of the Tenth Machine Translation Summit (MT Summit X)},
month = {September},
address = {Phuket, Thailand},
year = 2005
}
(Surcin et al., 2005;
Olivier Hamon and Anthony Hartley and Andrei Popescu-Belis and Khalid Choukri (2007):
Assessing Human and Automated Quality Judgments in the French MT Evaluation Campaign CESTA, Proceedings of the MT Summit XI
@inproceedings{Hamon2:2007:MTSummit,
author = {Olivier Hamon and Anthony Hartley and Andrei Popescu-Belis and Khalid Choukri},
title = {Assessing Human and Automated Quality Judgments in the {F}rench {MT} Evaluation Campaign {CESTA}},
url = {http://www.mt-archive.info/MTS-2007-Hamon-2.pdf},
googlescholar = {8823586558914330133},
booktitle = {Proceedings of the {MT} Summit XI},
year = 2007
}
Hamon et al., 2007).
Keiji Yasuda and Fumiaki Sugaya and Toshiyuki Takezawa and Seiichi Yamamoto and Masuzo Yanagida (2003):
Automatic Evaluation for a Palpable Measure of a Speech Translation System's Capability, Proceedings of Meeting of the European Chapter of the Association of Computational Linguistics (EACL)
@InProceedings{Yasuda:2003,
author = {Keiji Yasuda and Fumiaki Sugaya and Toshiyuki Takezawa and Seiichi Yamamoto and Masuzo Yanagida},
title = {Automatic Evaluation for a Palpable Measure of a Speech Translation System's Capability},
booktitle = {Proceedings of Meeting of the European Chapter of the Association of Computational Linguistics (EACL)},
year = 2003
}
Yasuda et al. (2003);
Andrew Finch and Yasuhiro Akiba and Eiichiro Sumita (2004):
Using a Paraphraser to Improve Machine Translation Evaluation, Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP)
mentioned in Paraphrasing and Correlation Of Metrics
@inproceedings{Finch:2004,
author = {Andrew Finch and Yasuhiro Akiba and Eiichiro Sumita},
title = {Using a Paraphraser to Improve Machine Translation Evaluation},
booktitle = {Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP)},
year = 2004
}
Finch et al. (2004) investigate the required number of reference translations.
Hamon, Olivier and Mostefa, Djamel (2008):
The Impact of Reference Quality on Automatic MT Evaluation, Coling 2008: Companion volume: Posters and Demonstrations
@InProceedings{hamon-mostefa:2008:POSTERS,
author = {Hamon, Olivier and Mostefa, Djamel},
title = {The Impact of Reference Quality on Automatic {MT} Evaluation},
booktitle = {Coling 2008: Companion volume: Posters and Demonstrations},
month = {August},
address = {Manchester, UK},
publisher = {Coling 2008 Organizing Committee},
pages = {37--40},
url = {http://www.aclweb.org/anthology/C08-3010},
year = 2008
}
Hamon and Mostefa (2008) find that the quality of the reference translations has little impact on automatic evaluation.
Yasuhiro Akiba and Eiichiro Sumita and Hiromi Nakaiwa and Seiichi Yamamoto and Hiroshi G. Okuno (2003):
Experimental Comparison of MT Evaluation Methods: RED vs. BLEU, Proceedings of the MT Summit IX
@inproceedings{Akiba:2003,
author = {Yasuhiro Akiba and Eiichiro Sumita and Hiromi Nakaiwa and Seiichi Yamamoto and Hiroshi G. Okuno},
title = {Experimental Comparison of {MT} Evaluation Methods: {RED} vs. {BLEU}},
url = {http://www.mt-archive.info/MTS-2003-Akiba.pdf},
googlescholar = {11542011781753255001},
booktitle = {Proceedings of the {MT} Summit IX},
year = 2003
}
Akiba et al. (2003) show strong correlation for BLEU only if the systems being compared are of a similar type. BLEU tends to correlate less well when comparing human translators with machine translation systems
C. Culy and S. Riehemann (2003):
The Limits of N-gram Translation Evaluation Metrics, Proceedings of the MT Summit IX
@inproceedings{Culy:2003,
author = {C. Culy and S. Riehemann},
title = {The Limits of N-gram Translation Evaluation Metrics},
url = {http://www.mt-archive.info/MTS-2003-Culy.pdf},
booktitle = {Proceedings of the {MT} Summit IX},
year = 2003
}
(Culy and Riehemann, 2003;
Andrei Popescu-Belis (2003):
An Experiment in Comparative Evaluation: humans vs. computers, Proceedings of the MT Summit IX
@inproceedings{Popescu-Belis:2003,
author = {Andrei Popescu-Belis},
title = {An Experiment in Comparative Evaluation: humans vs. computers},
url = {http://www.mt-archive.info/MTS-2003-Popescu.pdf},
googlescholar = {6486419101431530559},
booktitle = {Proceedings of the {MT} Summit IX},
year = 2003
}
Popescu-Belis, 2003), or when comparing statistical and rule-based systems
Chris Callison-Burch and Miles Osborne and Philipp Koehn (2006):
Re-evaluating the Role of Bleu in Machine Translation Research, Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics
@InProceedings{Callison-Burch:2006:EACL,
author = {Chris Callison-Burch and Miles Osborne and Philipp Koehn},
title = {Re-evaluating the Role of Bleu in Machine Translation Research},
booktitle = {Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics},
month = {April},
address = {Trento, Italy},
year = 2006
}
(Callison-Burch et al., 2006).
Amigó, Enrique and Giménez, Jesús and Gonzalo, Julio and Màrquez, Lluís (2006):
MT Evaluation: Human-Like vs. Human Acceptable, Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions
@InProceedings{amigo-EtAl:2006:POS,
author = {Amig\'{o}, Enrique and Gim\'{e}nez, Jes\'{u}s and Gonzalo, Julio and M\`{a}rquez, Llu\'{\i}s},
title = {MT Evaluation: Human-Like vs. Human Acceptable},
booktitle = {Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions},
month = {July},
address = {Sydney, Australia},
publisher = {Association for Computational Linguistics},
pages = {17--24},
url = {http://www.aclweb.org/anthology/P/P06/P06-2003},
year = 2006
}
Amigó et al. (2006) find that the relationship between manual metrics, which measure human acceptability, and automatic metrics, which check the similarity of system output to human translations, is more complex than a direct correspondence.
Cer, Daniel and Manning, Christopher D. and Jurafsky, Daniel (2010):
The Best Lexical Metric for Phrase-Based Statistical MT System Optimization, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
@InProceedings{cer-manning-jurafsky:2010:NAACLHLT,
author = {Cer, Daniel and Manning, Christopher D. and Jurafsky, Daniel},
title = {The Best Lexical Metric for Phrase-Based Statistical {MT} System Optimization},
booktitle = {Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics},
month = {June},
address = {Los Angeles, California},
publisher = {Association for Computational Linguistics},
pages = {555--563},
url = {http://www.aclweb.org/anthology/N10-1080},
year = 2010
}
Cer et al. (2010) examine how tuning towards one metric affects the evaluation score measured with another metric. They find that tuning towards BLEU gives reasonable results according to manual judgment, compared with more recently proposed metrics (TER, METEOR).
Benchmarks
Discussion
Related Topics
New Publications
Ngoc-Tien Le and Christophe Servan and Benjamin Lecouteux and Laurent Besacier (2016):
Better Evaluation of ASR in Speech Translation Context Using Word Embeddings, INTERSPEECH 2016
mentioned in Evaluation, Speech Translation, Correlation Of Metrics and Confidence Measures
@InProceedings{LeIS2016,
author = {Ngoc-Tien Le and Christophe Servan and Benjamin Lecouteux and Laurent Besacier},
title = {Better Evaluation of ASR in Speech Translation Context Using Word Embeddings},
booktitle = {INTERSPEECH 2016},
year = 2016
}
Le et al. (2016)
Graham, Yvette and Baldwin, Timothy and Mathur, Nitika (2015):
Accurate Evaluation of Segment-level Machine Translation Metrics, Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
@InProceedings{graham-baldwin-mathur:2015:NAACL-HLT,
author = {Graham, Yvette and Baldwin, Timothy and Mathur, Nitika},
title = {Accurate Evaluation of Segment-level Machine Translation Metrics},
booktitle = {Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
month = {May--June},
address = {Denver, Colorado},
publisher = {Association for Computational Linguistics},
pages = {1183--1191},
url = {http://www.aclweb.org/anthology/N15-1124},
year = 2015
}
Graham et al. (2015)
Andre Castilla and Alice Bacic and Sergio Furuie (2005):
Machine Translation on the Medical Domain: The Role of BLEU/NIST and METEOR in a Controlled Vocabulary Setting, Proceedings of the Tenth Machine Translation Summit (MT Summit X)
@InProceedings{Castilla:2005:MTS,
author = {Andre Castilla and Alice Bacic and Sergio Furuie},
title = {Machine Translation on the Medical Domain: The Role of {BLEU/NIST} and {METEOR} in a Controlled Vocabulary Setting},
url = {http://www.mt-archive.info/MTS-2005-Castilla.pdf},
googlescholar = {6067460913023773678},
booktitle = {Proceedings of the Tenth Machine Translation Summit (MT Summit X)},
month = {September},
address = {Phuket, Thailand},
year = 2005
}
Castilla et al. (2005)