ANNOUNCEMENTS
- 2026-04-29: Detailed description of the task released
TASK OVERVIEW
The goal of this task is to predict a numerical quality score for each source–target segment pair in the evaluation set, which will cover the same data and set of language pairs used in the WMT 2026 General MT task. References will be provided as optional inputs for some but likely not all language pairs; note that whatever references do exist will consist of MT output that has been post-edited by humans and/or selected and reviewed automatically (pseudo-references). Submissions will be evaluated and ranked based on their predictions’ correlations with human-annotated ESA scores at both the segment and corpus levels.
Participants will also run their automatic score prediction systems on collected “challenge sets” that illustrate particular linguistic phenomena, domains, or even non-WMT language pairs of interest to the developers of the sets. The predicted scores will be returned to the developers of each set for further analysis. See the detailed page on the challenge set subtask for further information.
To broaden participation and facilitate a more comprehensive analysis of current MT evaluation capabilities, this year we are introducing an automatic opt-in for participants across subtasks. Submissions to Task 1 (error span detection and severity classification) will by default be converted to participate in this score-prediction task. Submissions to this task will by default be converted to participate in Task 3 (error-free segment detection). See the “Submissions” section below for details.
SCORE COMPUTATION
The ESA score is a direct segment-level assessment primed by the act of annotating precise error spans within the segment. Values fall between 0 and 100, with higher scores indicating higher translation quality. Gold-standard scores will be collected using a contrastive ESA methodology via the Pearmut annotation tool, with document context visible to the annotators.
DATA
Training and Development Data
You are welcome to build your system from any desired training data, foundation model, etc., and we therefore do not release any specific training or development corpora. However, labeled data from previous editions of the Metrics, QE, and Evaluation shared tasks is available to help you train and tune your system if you would like. These resources are summarized and linked below. Note, however, that the set of language pairs covered in prior years does not exactly match this year's: some of this year's pairs have prior training data, while others are zero-shot.
- ESA annotations from prior MT tasks
- DA and MQM annotations from prior QE tasks
- MQM annotations from prior Metrics tasks
- DA annotations from prior Metrics tasks
- Relative rank annotations from prior Metrics tasks
Test Data
The official test set will consist of a collection of documents, each divided into segments, so that systems can evaluate quality in context. Unlike in previous years, segments will be longer, multi-sentence units of text (in the 2024 data and earlier, each segment was a single sentence). Contents of the challenge sets will vary.
We expect to follow the JSON-lines data format from the General MT shared task this year. Likewise, we will retain detailed translation instructions provided by the General MT task for each segment. This metadata will express desired properties of a correct translation, such as the level of formality, adherence to a terminology glossary, replication of style or voice, etc. Submissions are encouraged to make use of the provided information to more accurately judge each translation’s quality.
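As a minimal sketch of how such a JSON-lines test set might be consumed, the snippet below reads one record per segment. The field names used here ("doc_id", "src", "tgt", "instructions") are assumptions for illustration only; the authoritative schema will be the one shipped with the released test set.

```python
import json

# Minimal sketch of reading a JSON-lines test set. The field names below are
# assumptions for illustration; check the released data for the real schema.
def read_segments(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

for record in read_segments("test.jsonl"):
    doc_id = record.get("doc_id")              # document grouping, for context
    src = record.get("src")                    # source segment (multi-sentence)
    tgt = record.get("tgt")                    # MT output to be scored
    instructions = record.get("instructions")  # translation-brief metadata
    # ... predict a quality score for (src, tgt), optionally conditioning on
    #     document context and the provided translation instructions
```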
Information on the availability and nature of reference translations per language pair will be updated here around the end of May.
SUBMISSIONS
Submissions will be collected via Codabench; check here later for the link. You may submit up to two system variants to this shared task per participating organization or research group.
Automatic Opt-In
As noted in the task overview, we are introducing an automatic opt-in for participants across subtasks.
Unless you choose to opt out, your submission to this Task 2 will be evaluated as a submission to Task 3 (detection of error-free segments) as well. For this, you will be asked upon submission to provide a custom threshold for your model’s scores as well as an indication of whether the boundary is exclusive (>) or inclusive (≥). Any segment where your model score passes the threshold will be classified as error-free; all others will be classified as containing errors.
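The sketch below illustrates how this thresholding could work on a list of segment scores. The function name and data layout are illustrative assumptions, not the official conversion script.

```python
# Illustrative sketch of the automatic Task 2 -> Task 3 conversion described
# above; not the official conversion code.
def to_error_free_labels(scores, threshold, inclusive=True):
    """Map segment-level quality scores to binary error-free labels.

    inclusive=True  corresponds to the >= boundary,
    inclusive=False corresponds to the strict > boundary.
    """
    if inclusive:
        return [score >= threshold for score in scores]
    return [score > threshold for score in scores]

# Example: with a threshold of 95 and an inclusive boundary, a segment scored
# exactly 95 is classified as error-free.
labels = to_error_free_labels([98.0, 95.0, 72.5], threshold=95, inclusive=True)
# -> [True, True, False]
```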
Submissions to Task 1 (automated error detection and span annotation) will likewise be automatically converted to participate in this Task 2. Task 1 participants may also submit up to two standalone systems to Task 2: this allows you to test specialized score-prediction techniques that may differ from your approach to error span detection. If, however, you wish for the same Task 1 system to be used for Task 2 via automatic conversion, there is no need to submit it twice.
EVALUATION
We will evaluate the quality of automatic score prediction on the official test set at both the segment level and the corpus (system) level. Details are forthcoming.
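As a rough illustration of what segment-level and system-level (corpus-level) correlation might look like, the sketch below computes Pearson correlations with SciPy. The choice of Pearson and the data layout are assumptions for demonstration only; they do not anticipate the official evaluation protocol.

```python
# Illustrative only: the official evaluation details are forthcoming, so the
# choice of Pearson correlation and the data layout here are assumptions.
from scipy.stats import pearsonr

def segment_level_correlation(predicted, gold):
    # Correlate predicted scores with human ESA scores across all segments.
    r, _ = pearsonr(predicted, gold)
    return r

def system_level_correlation(predicted_by_system, gold_by_system):
    # Correlate per-system mean scores (corpus level) instead of raw segments.
    systems = sorted(predicted_by_system)
    pred_means = [sum(predicted_by_system[s]) / len(predicted_by_system[s])
                  for s in systems]
    gold_means = [sum(gold_by_system[s]) / len(gold_by_system[s])
                  for s in systems]
    r, _ = pearsonr(pred_means, gold_means)
    return r
```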
We will distinguish original Task 2 submissions from converted Task 1 systems in the results and analysis tables.
Scored challenge sets will be returned to the individual challenge set developers for evaluation and analysis. Performance on the challenge sets will not be counted as part of the official results.
Note that the official results will be based on correlation with human judgements, which will not be complete until around September. At the time of test set submission, the Codabench leaderboard will show each system's correlation with automatically derived pseudo-gold-standard judgements, computed with simple metrics. This will allow participants to confirm that their files were uploaded in a valid format and that their system's performance is broadly reasonable. These displayed scores are for basic verification purposes only and do not reflect official results or rankings.
BASELINES
We will include in Codabench and in the official results a number of baseline systems. The exact selection will be announced later.