Evaluating Adequacy of MT Output without Reference

Project leaders: Yashar Mehdad, Matteo Negri, Marcello Federico


Evaluation is becoming an increasingly important component of Machine Translation (MT), as in other areas of Natural Language Processing (NLP). The quality of MT systems can currently be evaluated either by human judges or automatically, using metrics such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005) and TER (Snover et al., 2006), each with different associated resources and requirements. Due to time and cost constraints, automatic evaluation has received considerable attention in recent workshops and shared tasks (Callison-Burch et al., 2008), with the aim of overcoming the shortcomings of the methods currently employed for evaluating MT technology. However, several problems remain in current MT evaluation technology:

  • The poor performance of automatic metrics when no reference translations are available.
  • The difficulty of integrating the current metrics into an MT system.
  • The lack of information about the quality of an MT system's output (even in the case of post-editing), which could be relevant and interesting for human translators.
  • The lack of semantic information integrated into MT evaluation and MT systems, specifically at the multilingual level.

This gives rise to the need to overcome the problems mentioned above through the development of systems and algorithms that can judge the adequacy of MT output without any form of reference translation. Without more suitable approaches to address these difficulties, improving MT technology and broadening its applicability will remain out of reach.

In addition, two main features draw a border line between this project and Confidence Estimation (CE) as well as MT technology. In the first place, the challenge is distinct from MT, because the quality score is assigned to the output of the MT system without any information about the expected output or the MT system's core algorithms. Besides providing additional information about the output, this score can facilitate the efficient evaluation of MT with no manual effort. Likewise, by avoiding the complexity imposed by SMT technology (e.g. search), there is more room for exploiting semantic features such as Word Sense Disambiguation (WSD) and Semantic Role Labeling (SRL). On the other hand, in contrast with CE, which focuses more on the overall quality of the MT output, this score can focus on adequacy. This brings along more interesting issues concerning the semantic and structural aspects of MT. Moreover, the resulting technology can benefit other applications, including cross-language semantic similarity, cross-lingual textual entailment (Mehdad et al., 2010), lexical choice in machine translation (Bangalore et al., 2007), and cross-lingual content synchronization and merging. In general, moving in this direction can integrate more semantic information into MT, which in principle can help improve this technology.
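One simple instance of a reference-free adequacy feature of this kind is lexical coverage under a bilingual dictionary: how many source words have a known translation somewhere in the MT output. A minimal sketch, with a hypothetical toy dictionary and sentences (not project data):

```python
def dictionary_coverage(source, output, bilingual_dict):
    """Fraction of source tokens with at least one known translation
    appearing in the MT output (a simple reference-free adequacy cue)."""
    out_tokens = set(output.lower().split())
    src_tokens = source.lower().split()
    if not src_tokens:
        return 0.0
    covered = sum(
        1 for tok in src_tokens
        if bilingual_dict.get(tok, set()) & out_tokens
    )
    return covered / len(src_tokens)

# Hypothetical English-French toy dictionary
toy_dict = {"the": {"le", "la"}, "cat": {"chat"}, "sleeps": {"dort"}}

print(dictionary_coverage("the cat sleeps", "le chat dort", toy_dict))    # 1.0
print(dictionary_coverage("the cat sleeps", "le chien mange", toy_dict))  # ~0.33
```

Unlike BLEU or TER, such a feature compares the output only against the source, so no reference translation is needed.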

With this project, we hope to partially overcome these problems. We also hope that such a project makes it easier for researchers to become more active in bridging semantics and MT.

Proposed schedule:

  • Day 1: Survey of the relevant literature
  • Day 2:
    • Ideas
    • Data collection (WMT evaluation task)
    • Relevant features
    • Development task allocations for features
    • Learning algorithms
  • Day 3:
    • Preprocessing the data
    • Implementation tasks
    • Debugging and running on toy examples
  • Day 4:
    • Running the code on the data
    • Extracting features
    • Learning models
    • Evaluating correlation
  • Day 5:
    • Results
    • Conclusions
    • Future work


Project meeting day 1:

Attendees: Marcello Federico, Daniele Pighin, Hanna Bechara, Angeliki Lazaridou, Nikos Engonopoulos, Alina Petrova, Jose Camargo De Souza, Yashar Mehdad

Some literature to study:

  • John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. 2003. Confidence Estimation for Machine Translation. Final report, JHU/CLSP Summer Workshop.
  • Lucia Specia et al. 2009. Improving the Confidence of Machine Translation Quality Estimates. In Proceedings of MT Summit XII.
  • Maja Popović et al. 2011. Evaluating Without References: IBM1 Scores as Evaluation Metrics. In Proceedings of WMT 2011.
  • N. Bach et al. 2011. Goodness: A Method for Measuring Machine Translation Confidence. In Proceedings of ACL-HLT 2011.
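The Popović et al. entry above scores MT output with IBM Model 1 lexical probabilities and needs no reference translation. A minimal sketch of that idea, using a hypothetical hand-filled translation table in place of one estimated from parallel data:

```python
import math

def ibm1_log_score(source, output, t_table, null_prob=1e-4):
    """Average per-word log IBM-1 score of the MT output given the source.

    For each output word, average the lexical translation probabilities
    over all source words (plus a NULL slot), then average the log over
    the output length. Higher (less negative) is better.
    """
    src = source.lower().split()
    out = output.lower().split()
    total = 0.0
    for o in out:
        # uniform alignment over source words plus NULL
        p = sum(t_table.get((s, o), 0.0) for s in src) / (len(src) + 1)
        total += math.log(max(p, null_prob))  # floor for unseen words
    return total / len(out)

# Hypothetical t(target|source) entries
t = {("cat", "chat"): 0.9, ("the", "le"): 0.8, ("sleeps", "dort"): 0.7}

good = ibm1_log_score("the cat sleeps", "le chat dort", t)
bad = ibm1_log_score("the cat sleeps", "le chien mange", t)
print(good > bad)  # True: the adequate output scores higher
```

In the actual metric the table t would be trained with the usual IBM-1 EM procedure on parallel data rather than specified by hand.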

The issues discussed:

  • The difficulty of the task.
  • The problem with the dataset (WMT data: only a few hundred pairs available).
  • Processing tools in different languages are not very robust to noisy data (translation output).

Some solutions and discussions:

  • Using parallel data as training data, generating the negative examples using different methods.
  • Projecting the parser's results from the source side to the target side.
  • Using multilingual parsers for source and output.


  • Using L. Specia's dataset.
  • Extracting the global features that can not be implemented in MT decoders.
  • Reading the literature and suggesting the relevant features for the next meeting.
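The first solution above, generating negative examples from parallel data, could be sketched as follows; pairing each source with a randomly chosen other target is just one of the possible corruption methods:

```python
import random

def make_training_pairs(parallel, seed=0):
    """From (source, target) parallel pairs, build labelled examples:
    each true pair is positive (1); the same source paired with a
    randomly chosen other sentence's target is negative (0)."""
    rng = random.Random(seed)
    examples = []
    for i, (src, tgt) in enumerate(parallel):
        examples.append((src, tgt, 1))
        # pick a mismatched target from a different sentence pair
        j = rng.choice([k for k in range(len(parallel)) if k != i])
        examples.append((src, parallel[j][1], 0))
    return examples

# Toy parallel corpus for illustration
corpus = [("hello world", "hallo welt"),
          ("good morning", "guten morgen"),
          ("thank you", "danke")]
pairs = make_training_pairs(corpus)
print(len(pairs))  # 6: one positive and one negative per parallel pair
```

Other corruption strategies (word dropping, reordering, substituting from a noisy MT system) would produce harder negatives than random target swapping.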

Project meeting day 2

Attendees: Marcello Federico, Daniele Pighin, Hanna Bechara, Angeliki Lazaridou, Nikos Engonopoulos, Jose Camargo De Souza, Yashar Mehdad, Marco Turchi, Antonio Valerio

The issues discussed:

  • The difficulty of parsing machine translated sentences
  • The difficulty of drawing the border line between quality and adequacy
  • Feature sets
  • Learning algorithms

Proposed discussion:

  • Extracting features at different levels: surface, lexical, syntactic (including shallow), semantic (possibly)
  • Classifiers: we will start with SVM and can try different algorithms later.
  • We will focus on binary classification (1: positive; 2, 3 & 4: negative)
  • The tool developed by J. Gimenez could be useful for extracting some features (suggested by Daniele)
  • The aligned dataset can be used to start with (the alignment is provided by Daniele)
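The agreed setup can be sketched end-to-end: human scores are binarized (1 → positive; 2, 3 and 4 → negative, as above) and a linear classifier is trained on feature vectors. To keep the sketch dependency-free, a simple perceptron stands in here for the SVM package the project would actually use, and the feature vectors and scores are made up for illustration:

```python
def binarize(score):
    """Map a 1-4 human judgement to a binary label per the meeting's
    convention: 1 -> positive (+1); 2, 3 and 4 -> negative (-1)."""
    return 1 if score == 1 else -1

def train_perceptron(data, epochs=50):
    """Simple perceptron: a dependency-free stand-in for the SVM that
    the project would use via an off-the-shelf package."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        updated = False
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updated = True
        if not updated:  # converged: all training points correct
            break
    return w

# Made-up feature vectors (e.g. coverage, length ratio) plus a constant
# 1.0 bias feature, paired with hypothetical 1-4 human scores.
scored = [([0.9, 0.8, 1.0], 1), ([0.8, 0.9, 1.0], 1),
          ([0.1, 0.2, 1.0], 3), ([0.2, 0.1, 1.0], 4)]
train = [(x, binarize(s)) for x, s in scored]
w = train_perceptron(train)
print([1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
       for x, _ in train])  # [1, 1, -1, -1]
```

An SVM would additionally maximize the margin and allow kernels over richer feature combinations, which is why it was chosen as the starting point.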

Tasks:

- Feature extraction:

  • Hanna: surface based features
  • Angeliki and Nikos: WSD and topic modelling
  • Antonio: parsing and syntactic features
  • Jose: possibly semantic roles

- Dataset preparation: Daniele and Yashar
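The surface-based features assigned above might include simple length and punctuation statistics of a (source, output) pair; a hypothetical sketch (the feature names are illustrative, not the project's actual feature set):

```python
import string

def surface_features(source, output):
    """A few simple surface features of a (source, MT output) pair."""
    src_toks = source.split()
    out_toks = output.split()
    return {
        "src_len": len(src_toks),
        "out_len": len(out_toks),
        # the output/source length ratio is a classic quality cue
        "len_ratio": len(out_toks) / max(len(src_toks), 1),
        "src_punct": sum(c in string.punctuation for c in source),
        "out_punct": sum(c in string.punctuation for c in output),
        "out_avg_word_len": sum(map(len, out_toks)) / max(len(out_toks), 1),
    }

feats = surface_features("The cat sleeps.", "Le chat dort.")
print(feats["src_len"], feats["out_len"], feats["len_ratio"])  # 3 3 1.0
```

Features like these need no linguistic tools, so they stay robust on the noisy MT output that the parsers struggle with.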

I've received an email from Eleftherios Avramidis, and he mentioned another DFKI paper which is very closely related to our work: http://www.statmt.org/wmt11/pdf/WMT04.pdf He has kindly offered to help us with this project. He has already sent me the data (WMT 2008, 2009 and 2010), which they used for ranking.


- What has been done:

  • Dataset preparation: thanks to Daniele
  • Surface feature extraction: thanks to Hanna
  • Some syntactic and dependency features extracted: Yashar
  • Preliminary results as binary classification: Yashar

- Things to be done:

  • Topic modeling by Angeliki
  • WSD by Nikos
  • Parsing by Antonio

- Preliminary results

  • Dataset: 16K pairs (source: L. Specia)
  • ~7.5k: good quality (3&4), ~8.5k: bad quality (1&2)
  • ~50 features
  • Binary classification
  • Results: 66% accuracy using SVM.
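For context, the 66% figure can be compared with the majority-class baseline implied by the approximate class split reported above:

```python
# Approximate class counts from the preliminary dataset split above
good, bad = 7500, 8500
total = good + bad

# A classifier that always predicts the majority class ("bad") scores:
majority_baseline = bad / total
print(f"majority baseline: {majority_baseline:.1%}")  # 53.1%
print(f"SVM accuracy: 66.0% (+{66.0 - majority_baseline * 100:.1f} points)")
```

So the preliminary SVM is well above chance, leaving room to grow as the syntactic, WSD and topic features come in.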
Page last modified on September 10, 2011, at 08:55 AM