Manual Metrics

The most intuitively trustworthy evaluation of machine translation systems is to ask human judges. However, what to ask them is an open research question.

Manual Metrics is the main subject of 45 publications. 15 are discussed here.


King et al. (2003) present a large range of evaluation metrics for machine translation systems that go well beyond the translation quality measures to which we devoted the bulk of this chapter. Miller and Vanni (2005) propose clarity and coherence as manual metrics. Reeder (2004) shows the correlation between fluency and the number of words it takes to distinguish between human and machine translations.
Grading standards for essays from foreign language learners may be used for machine translation evaluation. Using these standards reveals that machine translation has trouble with basic levels, but scores relatively high on advanced categories (Reeder, 2006). A manual metric that can be automated is to ask for specific translation errors; the questions may be based on past errors (Uchimoto et al., 2007).
Vilar et al. (2007) argue for pairwise system comparisons as a metric, which leads to higher inter- and intra-annotator agreement (Callison-Burch et al., 2007). Bojar et al. (2011) present a critique of current manual evaluation practice in the WMT campaign, such as the handling of ties and the bias of annotators. Lopez (2012) points out inconsistencies in rankings produced by these evaluation campaigns. Koehn (2012) proposes a model that allows the simulation of these ranking evaluations and gives recommendations about the number of manual judgments needed to detect statistically significant differences.
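As an illustration of the two quantities mentioned above, the following minimal sketch scores systems from pairwise judgments and measures agreement between two annotators. It is not the procedure of any cited paper: the judgment data, the system names, and the function names `win_ratios` and `cohen_kappa` are hypothetical, ties are simply discarded (one of several possible conventions), and agreement is measured with Cohen's kappa as a common choice.

```python
from collections import Counter

# Hypothetical pairwise judgments: (system_a, system_b, outcome), where
# outcome is "a" (A judged better), "b" (B judged better) or "tie".
judgments = [
    ("sys1", "sys2", "a"),
    ("sys1", "sys2", "a"),
    ("sys1", "sys3", "tie"),
    ("sys2", "sys3", "b"),
    ("sys1", "sys3", "a"),
    ("sys2", "sys3", "a"),
]


def win_ratios(judgments):
    """Return wins / (wins + losses) per system; ties are discarded (assumption)."""
    wins, losses = Counter(), Counter()
    for sys_a, sys_b, outcome in judgments:
        if outcome == "a":
            wins[sys_a] += 1
            losses[sys_b] += 1
        elif outcome == "b":
            wins[sys_b] += 1
            losses[sys_a] += 1
    systems = set(wins) | set(losses)
    return {s: wins[s] / (wins[s] + losses[s]) for s in systems}


def cohen_kappa(labels_1, labels_2):
    """Cohen's kappa for two annotators who labelled the same items."""
    n = len(labels_1)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_1, labels_2)) / n
    # Chance agreement from each annotator's own label distribution.
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    expected = sum(counts_1[k] * counts_2[k] for k in counts_1) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    ranking = sorted(win_ratios(judgments).items(), key=lambda kv: -kv[1])
    for system, ratio in ranking:
        print(f"{system}: {ratio:.2f}")
    print(cohen_kappa(["a", "a", "b", "tie"], ["a", "b", "b", "tie"]))
```

How ties are treated changes the resulting scores, which is one reason the handling of ties is a point of criticism in manual evaluation practice.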
One goal of manual assessment is to get better insight into the types of errors systems make. Vilar et al. (2006) propose a taxonomy of error types, such as unknown word, incorrect word form, or long-range word order. Popović et al. (2006) and Popović and Ney (2007) introduce automatic metrics that correspond to some of these error categories. Popović and Burchardt (2011) refine their automatic analytical metrics to assess word order, morphology, deletion and insertion errors, and compare them against human judgments on these error categories.



New Publications

  • Ma et al. (2017)
  • Isabelle et al. (2017)
  • Lommel et al. (2014)
  • Graham et al. (2013)
  • Guzmán et al. (2015)
  • Macháček and Bojar (2015)
  • Costa et al. (2015)
  • Klejch et al. (2015)
  • Birch et al. (2016)
  • Abdelali et al. (2016)
  • Otani et al. (2016)
  • Graham et al. (2017)
  • Fomicheva and Specia (2016)
  • Herrmann et al. (2014)
  • Aranberri (2015)
  • Lo and Wu (2013)
  • Birch et al. (2013)
  • Bojar (2011)
  • Sakaguchi et al. (2014)
  • Bouamor et al. (2014)
  • Toral et al. (2013)
  • Hopkins and May (2013)
  • Gonzàlez et al. (2013)
  • Doherty et al. (2010)
  • Bentivogli et al. (2011)
  • Paul et al. (2012)
  • Zaidan and Callison-Burch (2010)
  • Hovy et al. (2002)
  • Zaidan (2011)
  • Henderson and Morgan (2005)
  • Boitet et al. (2006)