Statistical Significance
Do differences in evaluation scores on a given test set indicate real quality differences between the underlying machine translation systems? We would like to compute the statistical significance of these differences.
Statistical Significance is the main subject of 7 publications, 3 of which are discussed here.
Publications
For the commonly used BLEU score, there is no analytical method to determine statistical significance, so we need to rely on methods such as bootstrap resampling
Koehn, Philipp (2004):
Statistical Significance Tests for Machine Translation Evaluation, Proceedings of EMNLP 2004
@inproceedings{Koehn:2004,
author = {Koehn, Philipp},
title = {Statistical Significance Tests for Machine Translation Evaluation},
url = {http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Koehn.pdf},
booktitle = {Proceedings of EMNLP 2004},
editor = {Dekang Lin and Dekai Wu},
month = {July},
address = {Barcelona, Spain},
publisher = {Association for Computational Linguistics},
pages = {388--395},
year = 2004
}
(Koehn, 2004). For further comments on this technique and an alternative, see work by
Riezler, Stefan and Maxwell, John T. (2005):
On Some Pitfalls in Automatic Evaluation and Significance Testing for MT, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization
@InProceedings{riezler-maxwell:2005:MTSumm,
author = {Riezler, Stefan and Maxwell, John T.},
title = {On Some Pitfalls in Automatic Evaluation and Significance Testing for {MT}},
booktitle = {Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization},
month = {June},
address = {Ann Arbor, Michigan},
publisher = {Association for Computational Linguistics},
pages = {57--64},
url = {http://www.aclweb.org/anthology/W/W05/W05-0908},
year = 2005
}
Riezler and Maxwell (2005).
Paula Estrella and Olivier Hamon and Andrei Popescu-Belis (2007):
How Much Data is Needed for Reliable MT Evaluation? Using Bootstrapping to Study Human and Automatic Metrics, Proceedings of the MT Summit XI
@inproceedings{Estrella:2007:MTSummit,
author = {Paula Estrella and Olivier Hamon and Andrei Popescu-Belis},
title = {How Much Data is Needed for Reliable {MT} Evaluation? {U}sing Bootstrapping to Study Human and Automatic Metrics},
url = {http://www.mt-archive.info/MTS-2007-Estrella-1.pdf},
googlescholar = {12613178650441710662},
booktitle = {Proceedings of the {MT} Summit XI},
year = 2007
}
Estrella et al. (2007) examine the minimum size of the test set for a reliable comparison of different machine translation systems.
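The paired bootstrap resampling idea referred to above can be sketched as follows. This is a minimal illustration and not the implementation used in any of the cited papers: it assumes a toy corpus-level metric (clipped unigram precision) as a stand-in for BLEU, and the function names `unigram_stats`, `corpus_precision` and `paired_bootstrap`, the sample count and the example sentences are all hypothetical. The test set is resampled with replacement many times, both systems are scored on each resampled set, and the fraction of samples on which one system wins is reported.

```python
import random
from collections import Counter


def unigram_stats(hyp, ref):
    """Return (clipped unigram matches, hypothesis length) for one sentence."""
    hyp_counts = Counter(hyp.split())
    ref_counts = Counter(ref.split())
    # A Counter returns 0 for missing words, so unmatched words add nothing.
    matches = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    return matches, sum(hyp_counts.values())


def corpus_precision(stats):
    """Corpus-level clipped unigram precision (toy stand-in for BLEU)."""
    matches = sum(m for m, _ in stats)
    total = sum(n for _, n in stats)
    return matches / total if total else 0.0


def paired_bootstrap(stats_a, stats_b, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A outscores system B."""
    assert len(stats_a) == len(stats_b), "systems must share one test set"
    rng = random.Random(seed)
    n = len(stats_a)
    wins_a = 0
    for _ in range(n_samples):
        # Draw n sentence indices with replacement; use the SAME indices
        # for both systems so that the comparison is paired.
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = corpus_precision([stats_a[i] for i in idx])
        score_b = corpus_precision([stats_b[i] for i in idx])
        if score_a > score_b:
            wins_a += 1
    return wins_a / n_samples


# Hypothetical toy data: three reference sentences and two system outputs.
refs = ["the cat sat on the mat", "a dog barked loudly", "it is raining today"]
sys_a = ["the cat sat on a mat", "a dog barked loud", "it is raining today"]
sys_b = ["the dog sat on a mat", "a cat barked quietly", "it was sunny today"]

stats_a = [unigram_stats(h, r) for h, r in zip(sys_a, refs)]
stats_b = [unigram_stats(h, r) for h, r in zip(sys_b, refs)]
print("A better than B on fraction of samples:",
      paired_bootstrap(stats_a, stats_b))
```

If system A wins on, say, at least 95% of the resampled test sets, this is commonly read as A being better at the 95% level. With a real metric such as BLEU, the per-sentence statistics would be n-gram match counts and lengths rather than the unigram counts used here. A test set of three sentences is far too small for a meaningful result and serves only to make the sketch runnable.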
Benchmarks
Discussion
Related Topics
New Publications
Graham, Yvette and Mathur, Nitika and Baldwin, Timothy (2014):
Randomized Significance Tests in Machine Translation, Proceedings of the Ninth Workshop on Statistical Machine Translation
@InProceedings{graham-mathur-baldwin:2014:W14-33,
author = {Graham, Yvette and Mathur, Nitika and Baldwin, Timothy},
title = {Randomized Significance Tests in Machine Translation},
booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
month = {June},
address = {Baltimore, Maryland, USA},
publisher = {Association for Computational Linguistics},
pages = {266--274},
url = {http://www.aclweb.org/anthology/W14-3333},
year = 2014
}
Graham et al. (2014)
Graham, Yvette and Baldwin, Timothy (2014):
Testing for Significance of Increased Correlation with Human Judgment, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
@InProceedings{graham-baldwin:2014:EMNLP2014,
author = {Graham, Yvette and Baldwin, Timothy},
title = {Testing for Significance of Increased Correlation with Human Judgment},
booktitle = {Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
month = {October},
address = {Doha, Qatar},
publisher = {Association for Computational Linguistics},
pages = {172--176},
url = {http://www.aclweb.org/anthology/D14-1020},
year = 2014
}
Graham and Baldwin (2014)
- Efron and Tibshirani (1993)
Ying Zhang and Stephan Vogel (2010):
Significance tests of automatic machine translation evaluation metrics, Machine Translation
@article{MTJ:2010:Zhang,
author = {Ying Zhang and Stephan Vogel},
title = {Significance tests of automatic machine translation evaluation metrics},
pages = {51--65},
journal = {Machine Translation},
volume = {24},
number = {1},
month = {March},
year = 2010
}
Zhang and Vogel (2010)