N-Gram Matching Metrics

Good machine translation output matches not only single words of a reference translation but also larger chunks of text, which motivates the use of n-gram based metrics.

N-gram metrics are the main subject of 26 publications, 12 of which are discussed here.


The BLEU evaluation metric is based on n-grams, typically up to the order of four (Papineni et al., 2001). Several variants of n-gram matching have been proposed: weighting n-grams based on their frequency (Babych and Hartley, 2004), or other complexity metrics (Babych et al., 2004). GTM is based on precision and recall (Melamed et al., 2003; Turian et al., 2003). Echizen-ya and Araki (2007) propose IMPACT, which is more sensitive to the longest matching n-grams.
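
The n-gram matching at the core of BLEU can be sketched as follows. This is a simplified, single-reference, sentence-level illustration that assumes whitespace-tokenized input. It computes clipped n-gram precisions up to order four and combines them with a geometric mean, and it leaves out the brevity penalty and any smoothing used in the full metric. The function names (`ngrams`, `clipped_precision`, `ngram_precision_score`) are hypothetical and not taken from any published implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n):
    """Share of hypothesis n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the
    reference, so repeating a matching word does not inflate the score.
    """
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = ngrams(reference, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


def ngram_precision_score(hypothesis, reference, max_order=4):
    """Geometric mean of clipped precisions for orders 1..max_order.

    Assumption: no brevity penalty and no smoothing, so any order with
    zero matches drives the whole score to zero.
    """
    precisions = [clipped_precision(hypothesis, reference, n)
                  for n in range(1, max_order + 1)]
    if min(precisions) == 0.0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_order)


hypothesis = "the cat sat on the mat".split()
reference = "the cat sat on the red mat".split()
for order in range(1, 5):
    print(order, clipped_precision(hypothesis, reference, order))
print(ngram_precision_score(hypothesis, reference))
```

In this example every hypothesis word appears in the reference, yet the higher-order precisions drop because the missing word "red" breaks the longer chunks. That is the behavior that motivates going beyond single-word matching.
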
A metric may benefit from using an explicit alignment of system output and reference while maintaining the advantages of n-gram based methods such as BLEU (Liu and Gildea, 2006), and from being trained to correlate with human judgment (Liu and Gildea, 2007).
Lavie et al. (2004) emphasize the importance of recall and stemmed matches in evaluation, which led to the development of the METEOR metric (Banerjee and Lavie, 2005; Lavie and Agarwal, 2007). Partial credit for stemmed matches may also be applied to BLEU and TER (Agarwal and Lavie, 2008).
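
The two ideas in the last paragraph, emphasis on recall and partial credit for stemmed matches, can be illustrated with a small sketch. It is not the METEOR implementation: it works on bags of words without an explicit alignment, the suffix-stripping `toy_stem` function is a hypothetical stand-in for a real stemmer, and the values of `alpha` and `stem_weight` are illustrative assumptions rather than settings taken from the cited papers.

```python
from collections import Counter


def toy_stem(word):
    """Hypothetical crude stemmer: strip one common English suffix."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def match_credit(hypothesis, reference, stem_weight=0.5):
    """Total match credit: 1.0 per exact match, stem_weight per stem match.

    Exact matches are taken first; stem matches are then sought only
    among the words left unmatched on both sides.
    """
    hyp_left = Counter(hypothesis)
    ref_left = Counter(reference)
    exact = hyp_left & ref_left          # multiset intersection
    credit = float(sum(exact.values()))
    hyp_left -= exact
    ref_left -= exact

    hyp_stems = Counter()
    for word, count in hyp_left.items():
        hyp_stems[toy_stem(word)] += count
    ref_stems = Counter()
    for word, count in ref_left.items():
        ref_stems[toy_stem(word)] += count
    credit += stem_weight * sum((hyp_stems & ref_stems).values())
    return credit


def recall_weighted_f(hypothesis, reference, alpha=0.9, stem_weight=0.5):
    """Weighted harmonic mean of precision and recall.

    alpha close to 1 puts most of the weight on recall.
    """
    if not hypothesis or not reference:
        return 0.0
    credit = match_credit(hypothesis, reference, stem_weight)
    if credit == 0.0:
        return 0.0
    precision = credit / len(hypothesis)
    recall = credit / len(reference)
    return precision * recall / (alpha * precision + (1 - alpha) * recall)


reference = "the cat sat down".split()
print(match_credit("the cats sat".split(), reference))
print(recall_weighted_f("the cats sat".split(), reference))
```

With a recall-heavy weighting, a hypothesis that drops reference words is penalized more than one that adds extra words, and "cats" still earns partial credit against "cat".
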



New Publications

  • Elloumi et al. (2015)
  • Popović (2015)
  • Virpioja and Grönroos (2015)
  • Apidianaki and Marie (2015)
  • Libovický and Pecina (2014)
  • Chen and Cherry (2014)
  • Chiang et al. (2008)
  • Lavie and Denkowski (2009)
  • Wong and Kit (2009)
  • Li et al. (2011)
  • Chen and Kuhn (2011)
  • Denkowski and Lavie (2011)
  • Popović (2011)
  • Albrecht and Hwa (2008)