N-Gram Matching Metrics
Good machine translation output matches not only single words of a reference translation but also larger chunks of text, which motivates the use of n-gram based metrics.
N-gram metrics are the main subject of 26 publications, 12 of which are discussed here.
Publications
The BLEU evaluation metric is based on n-grams, typically up to the order of four
Kishore Papineni and Salim Roukos and Todd Ward and Wei-Jing Zhu (2001):
BLEU: a Method for Automatic Evaluation of Machine Translation
@Techreport{BLEU,
author = {Kishore Papineni and Salim Roukos and Todd Ward and Wei-Jing Zhu},
title = {{BLEU}: a Method for Automatic Evaluation of Machine Translation},
url = {
http://www.mt-archive.info/IBM-2001-Papineni.pdf},
googlescholar = {9019091454858686906},
month = {September 17},
institution = {IBM Research Report},
number = {RC22176(W0109-022)},
year = 2001
}
(Papineni et al., 2001). Several variants of n-gram matching have been proposed: weighting n-grams based on their frequency
Babych, Bogdan and Hartley, Anthony (2004):
Extending the BLEU MT Evaluation Method with Frequency Weightings, Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume
@inproceedings{Babych:2004b,
author = {Babych, Bogdan and Hartley, Anthony},
title = {Extending the {BLEU} {MT} Evaluation Method with Frequency Weightings},
url = {
http://acl.ldc.upenn.edu/acl2004/main/pdf/349\_pdf\_2-col.pdf},
booktitle = {Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume},
month = {July},
address = {Barcelona, Spain},
pages = {621--628},
year = 2004
}
(Babych and Hartley, 2004), or other complexity metrics
Babych, Bogdan and Elliott, Debbie and Hartley, Anthony (2004):
Extending MT evaluation tools with translation complexity metrics, Proceedings of Coling 2004
@inproceedings{Babych:2004,
author = {Babych, Bogdan and Elliott, Debbie and Hartley, Anthony},
title = {Extending {MT} evaluation tools with translation complexity metrics},
url = {
http://acl.ldc.upenn.edu/C/C04/C04-1016.pdf},
booktitle = {Proceedings of Coling 2004},
month = {Aug 23--Aug 27},
address = {Geneva, Switzerland},
publisher = {COLING},
pages = {106--112},
year = 2004
}
(Babych et al., 2004). GTM is based on precision and recall
I. Dan Melamed and Ryan Green and Joseph P. Turian (2003):
Precision and Recall in Machine Translation, Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL)
@InProceedings{Melamed:2003b,
author = {I. Dan Melamed and Ryan Green and Joseph P. Turian},
title = {Precision and Recall in Machine Translation},
url = {
http://acl.ldc.upenn.edu/N/N03/N03-2021.pdf},
googlescholar = {12178522208641226376},
booktitle = {Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL)},
year = 2003
}
(Melamed et al., 2003;
Joseph P. Turian and I. Dan Melamed and Libin Shen (2003):
Evaluation of Machine Translation Using Maximum Matching, Proceedings of the MT Summit IX
@inproceedings{Turian:2003,
author = {Joseph P. Turian and I. Dan Melamed and Libin Shen},
title = {Evaluation of Machine Translation Using Maximum Matching},
booktitle = {Proceedings of the {MT} Summit IX},
year = 2003
}
Turian et al., 2003).
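As a rough illustration of the n-gram matching behind BLEU, the following Python sketch computes clipped n-gram precisions up to order four against a single reference and combines them with a geometric mean and a brevity penalty. The function names and the whitespace tokenization are assumptions made for this example; it is not the reference implementation, and it leaves out multiple references and smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hyp, ref, n):
    """Fraction of hypothesis n-grams found in the reference, with each
    n-gram's count clipped to its count in the reference."""
    hyp_counts = ngrams(hyp, n)
    ref_counts = ngrams(ref, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matched / total


def bleu_sketch(hyp, ref, max_n=4):
    """Single-reference, unsmoothed BLEU-style score (illustrative only)."""
    precisions = [clipped_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        # Without smoothing, one zero precision makes the geometric mean zero.
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: penalize hypotheses that are not longer than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1.0 - len(ref) / len(hyp))
    return bp * math.exp(log_avg)


reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on a mat".split()
score = bleu_sketch(hypothesis, reference)
```

Clipping keeps a hypothesis such as "the the the" from receiving full unigram credit against a reference that contains "the" only once. The early return for a zero precision is one reason sentence-level BLEU is usually smoothed in practice.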
Hiroshi Echizen-ya and Kenji Araki (2007):
Automatic Evaluation of Machine Translation based on Recursive Acquisition of an Intuitive Common Parts Continuum, Proceedings of the MT Summit XI
@inproceedings{Echizen-ya:2007:MTSummit,
author = {Hiroshi Echizen-ya and Kenji Araki},
title = {Automatic Evaluation of Machine Translation based on Recursive Acquisition of an Intuitive Common Parts Continuum},
url = {
http://www.eli.hokkai-s-u.ac.jp/~echi/MTS\_XI\_2007\_Echizen-ya.pdf},
googlescholar = {8640780652878745070},
booktitle = {Proceedings of the {MT} Summit XI},
year = 2007
}
Echizen-ya and Araki (2007) propose IMPACT, which is more sensitive to the longest matching n-grams.
A metric may benefit from using an explicit alignment of system output and reference while maintaining the advantages of n-gram based methods such as BLEU
Liu, Ding and Gildea, Daniel (2006):
Stochastic Iterative Alignment for Machine Translation Evaluation, Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions
@InProceedings{liu-gildea:2006:POS,
author = {Liu, Ding and Gildea, Daniel},
title = {Stochastic Iterative Alignment for Machine Translation Evaluation},
booktitle = {Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions},
month = {July},
address = {Sydney, Australia},
publisher = {Association for Computational Linguistics},
pages = {539--546},
url = {
http://www.aclweb.org/anthology/P/P06/P06-2070},
year = 2006
}
(Liu and Gildea, 2006) and from training such a metric to correlate with human judgment
Liu, Ding and Gildea, Daniel (2007):
Source-Language Features and Maximum Correlation Training for Machine Translation Evaluation, Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference
@InProceedings{liu-gildea:2007:main,
author = {Liu, Ding and Gildea, Daniel},
title = {Source-Language Features and Maximum Correlation Training for Machine Translation Evaluation},
booktitle = {Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference},
month = {April},
address = {Rochester, New York},
publisher = {Association for Computational Linguistics},
pages = {41--48},
url = {
http://www.aclweb.org/anthology/N/N07/N07-1006},
year = 2007
}
(Liu and Gildea, 2007).
Alon Lavie and Kenji Sagae and Shyamsundar Jayaraman (2004):
The significance of recall in automatic metrics for MT evaluation, Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA 2004)
@inproceedings{lavie:2004:AMTA,
author = {Alon Lavie and Kenji Sagae and Shyamsundar Jayaraman},
title = {The significance of recall in automatic metrics for {MT} evaluation},
url = {
http://cs.cmu.edu/afs/cs.cmu.edu/Web/People/alavie/papers/Recall-AMTA-04.pdf},
googlescholar = {14891464181717826572},
booktitle = {Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA 2004)},
pages = {134--143},
year = 2004
}
Lavie et al. (2004) emphasize the importance of recall and stemmed matches in evaluation, which led to the development of the METEOR metric
Banerjee, Satanjeev and Lavie, Alon (2005):
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization
@InProceedings{banerjee-lavie:2005:MTSumm,
author = {Banerjee, Satanjeev and Lavie, Alon},
title = {{METEOR}: An Automatic Metric for {MT} Evaluation with Improved Correlation with Human Judgments},
booktitle = {Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization},
month = {June},
address = {Ann Arbor, Michigan},
publisher = {Association for Computational Linguistics},
pages = {65--72},
url = {
http://www.aclweb.org/anthology/W/W05/W05-0909},
year = 2005
}
(Banerjee and Lavie, 2005;
Lavie, Alon and Agarwal, Abhaya (2007):
METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments, Proceedings of the Second Workshop on Statistical Machine Translation
@InProceedings{lavie-agarwal:2007:WMT,
author = {Lavie, Alon and Agarwal, Abhaya},
title = {{METEOR}: An Automatic Metric for {MT} Evaluation with High Levels of Correlation with Human Judgments},
booktitle = {Proceedings of the Second Workshop on Statistical Machine Translation},
month = {June},
address = {Prague, Czech Republic},
publisher = {Association for Computational Linguistics},
pages = {228--231},
url = {
http://www.aclweb.org/anthology/W/W07/W07-0234},
year = 2007
}
Lavie and Agarwal, 2007). Partial credit for stemmed matches may also be applied to BLEU and TER
Agarwal, Abhaya and Lavie, Alon (2008):
Meteor, M-BLEU and M-TER: Evaluation Metrics for High-Correlation with Human Rankings of Machine Translation Output, Proceedings of the Third Workshop on Statistical Machine Translation
@InProceedings{agarwal-lavie:2008:WMT,
author = {Agarwal, Abhaya and Lavie, Alon},
title = {Meteor, {M-BLEU} and {M-TER}: Evaluation Metrics for High-Correlation with Human Rankings of Machine Translation Output},
booktitle = {Proceedings of the Third Workshop on Statistical Machine Translation},
month = {June},
address = {Columbus, Ohio},
publisher = {Association for Computational Linguistics},
pages = {115--118},
url = {
http://www.aclweb.org/anthology/W/W08/W08-0312},
year = 2008
}
(Agarwal and Lavie, 2008).
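The role of recall and stemmed matches can be sketched in a few lines of Python. The example below counts exact word matches first, then gives credit to leftover words whose stems agree, and combines precision and recall in a harmonic mean weighted toward recall. The `toy_stem` suffix stripper, the function names, and the default `alpha=0.9` are assumptions for illustration only; the sketch omits the word-order penalty and the other matching modules of the actual METEOR metric.

```python
from collections import Counter


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a few English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def match_count(hyp, ref):
    """Count one-to-one word matches: exact matches first, then stem
    matches among the words that are still unmatched."""
    hyp_left = Counter(hyp)
    ref_left = Counter(ref)
    exact = hyp_left & ref_left          # multiset intersection
    hyp_left -= exact
    ref_left -= exact
    hyp_stems = Counter(toy_stem(w) for w in hyp_left.elements())
    ref_stems = Counter(toy_stem(w) for w in ref_left.elements())
    stemmed = hyp_stems & ref_stems
    return sum(exact.values()) + sum(stemmed.values())


def recall_weighted_f(hyp, ref, alpha=0.9):
    """Weighted harmonic mean of precision and recall.
    alpha close to 1 puts most of the weight on recall."""
    m = match_count(hyp, ref)
    if m == 0:
        return 0.0
    precision = m / len(hyp)
    recall = m / len(ref)
    return precision * recall / (alpha * precision + (1 - alpha) * recall)


hypothesis = "the cats sat".split()
reference = "the cat sat down".split()
f_score = recall_weighted_f(hypothesis, reference)
```

Here "cats" and "cat" do not match exactly but share a stem, so the hypothesis is credited with three matches. Because the weight favors recall, the missing word "down" lowers the score more than an extra hypothesis word would.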
Benchmarks
Discussion
Related Topics
New Publications
Zied Elloumi and Hervé Blanchon and Gilles Serasset and Laurent Besacier (2015):
METEOR for multiple target languages using DBnary, Machine Translation Summit XV
@inproceedings{MTS2015-Elloumi,
author = {Zied Elloumi and Hervé Blanchon and Gilles Serasset and Laurent Besacier},
title = {METEOR for multiple target languages using DBnary},
url = {
http://www.mt-archive.info/15/MTS-2015-Elloumi.pdf},
pages = {80--89},
booktitle = {Machine Translation Summit XV},
year = 2015
}
Elloumi et al. (2015)
Popović, Maja (2015):
chrF: character n-gram F-score for automatic MT evaluation, Proceedings of the Tenth Workshop on Statistical Machine Translation
@InProceedings{popovic:2015:WMT,
author = {Popovi\'{c}, Maja},
title = {chrF: character n-gram F-score for automatic {MT} evaluation},
booktitle = {Proceedings of the Tenth Workshop on Statistical Machine Translation},
month = {September},
address = {Lisbon, Portugal},
publisher = {Association for Computational Linguistics},
pages = {392--395},
url = {
http://aclweb.org/anthology/W15-3049},
year = 2015
}
Popović (2015)
Virpioja, Sami and Grönroos, Stig-Arne (2015):
LeBLEU: N-gram-based Translation Evaluation Score for Morphologically Complex Languages, Proceedings of the Tenth Workshop on Statistical Machine Translation
@InProceedings{virpioja-gronroos:2015:WMT,
author = {Virpioja, Sami and Gr\"{o}nroos, Stig-Arne},
title = {LeBLEU: N-gram-based Translation Evaluation Score for Morphologically Complex Languages},
booktitle = {Proceedings of the Tenth Workshop on Statistical Machine Translation},
month = {September},
address = {Lisbon, Portugal},
publisher = {Association for Computational Linguistics},
pages = {411--416},
url = {
http://aclweb.org/anthology/W15-3052},
year = 2015
}
Virpioja and Grönroos (2015)
Apidianaki, Marianna and Marie, Benjamin (2015):
METEOR-WSD: Improved Sense Matching in MT Evaluation, Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation
@InProceedings{apidianaki-marie:2015:SSST-9,
author = {Apidianaki, Marianna and Marie, Benjamin},
title = {METEOR-WSD: Improved Sense Matching in {MT} Evaluation},
booktitle = {Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation},
month = {June},
address = {Denver, Colorado, USA},
publisher = {Association for Computational Linguistics},
pages = {49--51},
url = {
http://www.aclweb.org/anthology/W15-1006},
year = 2015
}
Apidianaki and Marie (2015)
Libovický, Jindřich and Pecina, Pavel (2014):
Tolerant BLEU: a Submission to the WMT14 Metrics Task, Proceedings of the Ninth Workshop on Statistical Machine Translation
@InProceedings{libovicky-pecina:2014:W14-33,
author = {Libovick\'{y}, Jind\v{r}ich and Pecina, Pavel},
title = {Tolerant BLEU: a Submission to the WMT14 Metrics Task},
booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
month = {June},
address = {Baltimore, Maryland, USA},
publisher = {Association for Computational Linguistics},
pages = {409--413},
url = {
http://www.aclweb.org/anthology/W14-3353},
year = 2014
}
Libovický and Pecina (2014)
Chen, Boxing and Cherry, Colin (2014):
A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU, Proceedings of the Ninth Workshop on Statistical Machine Translation
@InProceedings{chen-cherry:2014:W14-33,
author = {Chen, Boxing and Cherry, Colin},
title = {A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU},
booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
month = {June},
address = {Baltimore, Maryland, USA},
publisher = {Association for Computational Linguistics},
pages = {362--367},
url = {
http://www.aclweb.org/anthology/W14-3346},
year = 2014
}
Chen and Cherry (2014)
Chiang, David and DeNeefe, Steve and Chan, Yee Seng and Ng, Hwee Tou (2008):
Decomposability of Translation Metrics for Improved Evaluation and Efficient Algorithms, Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing
@InProceedings{chiang-EtAl:2008:EMNLP,
author = {Chiang, David and DeNeefe, Steve and Chan, Yee Seng and Ng, Hwee Tou},
title = {Decomposability of Translation Metrics for Improved Evaluation and Efficient Algorithms},
booktitle = {Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing},
month = {October},
address = {Honolulu, Hawaii},
publisher = {Association for Computational Linguistics},
pages = {610--619},
url = {
http://www.aclweb.org/anthology/D08-1064},
year = 2008
}
Chiang et al. (2008)
Alon Lavie and Michael J. Denkowski (2009):
The Meteor metric for automatic evaluation of machine translation, Machine Translation
@article{MTJ:2009:Lavie2,
author = {Alon Lavie and Michael J. Denkowski},
title = {The {M}eteor metric for automatic evaluation of machine translation},
url = {
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/mteval-1/Papers/MT-Journal-2009/meteor-mtj-2009.pdf},
googlescholar = {15468685715273817238},
pages = {105--115},
journal = {Machine Translation},
volume = {23},
number = {2--3},
month = {September},
year = 2009
}
Lavie and Denkowski (2009)
Billy Wong and Chunyu Kit (2009):
ATEC: automatic evaluation of machine translation via word choice and word order, Machine Translation
@article{MTJ:2009:Wong,
author = {Billy Wong and Chunyu Kit},
title = {ATEC: automatic evaluation of machine translation via word choice and word order},
pages = {141--155},
journal = {Machine Translation},
volume = {23},
number = {2--3},
month = {September},
year = 2009
}
Wong and Kit (2009)
Li, Maoxi and Zong, Chengqing and Ng, Hwee Tou (2011):
Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
@InProceedings{li-zong-ng:2011:ACL-HLT2011,
author = {Li, Maoxi and Zong, Chengqing and Ng, Hwee Tou},
title = {Automatic Evaluation of {Chinese} Translation Output: Word-Level or Character-Level?},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {159--164},
url = {
http://www.aclweb.org/anthology/P11-2028},
year = 2011
}
Li et al. (2011)
Chen, Boxing and Kuhn, Roland (2011):
AMBER: A Modified BLEU, Enhanced Ranking Metric, Proceedings of the Sixth Workshop on Statistical Machine Translation
@InProceedings{chen-kuhn:2011:WMT,
author = {Chen, Boxing and Kuhn, Roland},
title = {AMBER: A Modified BLEU, Enhanced Ranking Metric},
booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
month = {July},
address = {Edinburgh, Scotland},
publisher = {Association for Computational Linguistics},
pages = {71--77},
url = {
http://www.aclweb.org/anthology/W11-2105},
year = 2011
}
Chen and Kuhn (2011)
Denkowski, Michael and Lavie, Alon (2011):
Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems, Proceedings of the Sixth Workshop on Statistical Machine Translation
@InProceedings{denkowski-lavie:2011:WMT,
author = {Denkowski, Michael and Lavie, Alon},
title = {Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems},
booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
month = {July},
address = {Edinburgh, Scotland},
publisher = {Association for Computational Linguistics},
pages = {85--91},
url = {
http://www.aclweb.org/anthology/W11-2107},
year = 2011
}
Denkowski and Lavie (2011)
Popović, Maja (2011):
Morphemes and POS tags for n-gram based evaluation metrics, Proceedings of the Sixth Workshop on Statistical Machine Translation
@InProceedings{popovic:2011:WMT,
author = {Popovi\'{c}, Maja},
title = {Morphemes and POS tags for n-gram based evaluation metrics},
booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
month = {July},
address = {Edinburgh, Scotland},
publisher = {Association for Computational Linguistics},
pages = {104--107},
url = {
http://www.aclweb.org/anthology/W11-2110},
year = 2011
}
Popović (2011)
Albrecht, Joshua and Hwa, Rebecca (2008):
The Role of Pseudo References in MT Evaluation, Proceedings of the Third Workshop on Statistical Machine Translation
@InProceedings{albrecht-hwa:2008:WMT,
author = {Albrecht, Joshua and Hwa, Rebecca},
title = {The Role of Pseudo References in {MT} Evaluation},
booktitle = {Proceedings of the Third Workshop on Statistical Machine Translation},
month = {June},
address = {Columbus, Ohio},
publisher = {Association for Computational Linguistics},
pages = {187--190},
url = {
http://www.aclweb.org/anthology/W/W08/W08-0330},
year = 2008
}
Albrecht and Hwa (2008)