ANNOUNCEMENTS
- 2026-04-29: Detailed description of the task released
TASK OVERVIEW
The goal of this task is to identify translation errors and assign a severity label to each error. Given a source text, its translation, and optionally a reference translation (available only for a subset of language directions), participants are asked to:
- Detect error spans in the form of start and end indices,
- Classify each span according to its severity (severity descriptions might change):
  - Minor: Imperfections or stylistic issues that do not impact the core message (e.g., awkward phrasing).
  - Major: Errors that confuse the meaning, misrepresent the source, or distort the message (e.g., incorrect information, confusing wording).
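For illustration only, a single predicted error span might be represented as a record with character offsets into the translation and a severity label; the field names and offset convention below are hypothetical and do not reflect any official submission schema:

    # Hypothetical representation of one predicted error span (not the official format).
    # Offsets are assumed to be 0-based character indices with "end" exclusive.
    error_span = {
        "start": 27,          # index of the first character of the erroneous span
        "end": 41,            # index one past the last character of the span
        "severity": "major",  # one of "minor" or "major"
    }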
This task uses the same data and language pairs as the WMT 2026 General MT task (additional information below).
Participants are also expected to run their automatic systems on collected “challenge sets” that focus on particular linguistic phenomena, domains, or even non-WMT language pairs of interest to the developers of the sets. The predicted error spans will be returned to the developers of each set for further analysis. See the detailed page on the challenge set subtask for further information. We invite participants who encounter issues annotating the challenge sets to reach out to the organizing committee at wmt-qe-metrics-organizers@googlegroups.com.
To broaden participation and facilitate a more comprehensive analysis of current MT evaluation capabilities, this year we are also introducing an automatic opt-in for participants across tasks. Submissions to this task will automatically be converted into submissions to Task 2 (Segment-Level Quality Score Prediction) and Task 3 (Detection of Error-Free Segments). See the “Submission” section below for details.
DATA
Test Data
The ground-truth error spans are derived from the ESA-based human annotations collected in the WMT General MT shared task. This year, human annotations will be collected using a contrastive ESA methodology via the Pearmut annotation tool. For additional information and to check the official list of translation directions, please refer to the General MT task website.
The test set will consist of a collection of documents, each divided into segments, where a segment can be a single- or multi-sentence unit of text. Participants may choose to perform error detection using only the provided segments or to incorporate the broader document context.
References
References will be provided as optional inputs for some but likely not all language pairs; note that whatever references do exist will consist of MT output that has been post-edited by humans and/or selected and reviewed automatically (pseudo-references).
Translation (Evaluation) Instructions
We expect to follow the JSON-lines data format from the General MT shared task this year. Likewise, we will retain detailed translation instructions provided by the General MT task for each segment. This metadata will express desired properties of a correct translation, such as the level of formality, adherence to a terminology glossary, replication of style or voice, etc. Submissions are encouraged to use the provided information to more accurately judge each translation’s quality.
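Since the exact General MT schema is not specified here, the following is only a hypothetical illustration of how a JSON-lines record carrying translation instructions might be read; all field names are invented for the example:

    import json

    # Hypothetical JSON-lines record; the field names are illustrative only and
    # are not the official General MT schema.
    line = '{"doc_id": "d001", "seg_id": 3, "source": "...", "target": "...", "instructions": "Formal register; follow the provided glossary."}'
    record = json.loads(line)
    # The instruction metadata could then be passed to the error-detection system
    # alongside the source and translation.
    print(record["instructions"])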
In this evaluation task, only the translation is annotated. Thus, to indicate the presence of omission errors (content from the source that is missing from the translation), each translation will be followed by a final “[MISSING]” tag, which can be treated as a normal text span when detecting translation errors.
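As a minimal sketch of how the tag can be used (offsets and severity are illustrative), an omission can be flagged by predicting a span that covers the appended tag itself:

    # A translation with the omission placeholder appended, as described above.
    translation = "Der Hund schläft. [MISSING]"

    # Hypothetical prediction marking an omission: the span simply covers the
    # "[MISSING]" tag and is labeled like any other error span.
    start = translation.index("[MISSING]")
    omission_span = {"start": start,
                     "end": start + len("[MISSING]"),
                     "severity": "major"}  # severity choice is illustrative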
Training and Development Data
You are welcome to build your system from any desired training data, foundation model, etc. We therefore do not release any specific training or development corpora. However, labeled data from previous editions of WMT shared tasks is available to help you train and tune your system if you would like.
These resources are summarized below:
- ESA WMT25: github.com/wmt-conference/wmt25-general-mt
- ESA WMT24: github.com/wmt-conference/wmt24-news-systems
- MQM WMT20 to WMT24: github.com/google/wmt-mqm-human-evaluation
SUBMISSION
Submissions will be collected via Codabench. You may submit up to two system variants per participating organization or research group.
Automatic Opt-In
Unless you explicitly opt out, submissions to this task will also be automatically evaluated for Tasks 2 and 3.
- Task 2: We will automatically convert the predicted error spans and their severities for each segment into a numerical quality score using an MQM-style weighting scheme. Major and Minor errors will be assigned weights of 10 and 2.5 points, respectively. The score for each translation is computed by subtracting the total penalty from 100 (that is, the score of a perfect translation under ESA). The minimum possible score is 0; any additional penalties that would reduce the score below zero are ignored (see the conversion sketch at the end of this subsection).
- Task 3: We will automatically derive binary labels based on your predicted error spans. Segments where your system predicts no errors will be labeled as 1 (error-free), while segments with one or more predicted spans will be labeled 0 (contains errors).
Participants in Task 1 may either submit separately to Tasks 2 and 3 or have their Task 1 submissions automatically evaluated for those tasks. Participants may opt out of this automatic evaluation by explicitly stating that their Task 1 submission should be evaluated only for Task 1.
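As a rough sketch of the conversion described above (the penalty weights and clipping follow the description; the span representation itself is hypothetical):

    # Sketch of the automatic Task 2 / Task 3 conversion, assuming each predicted
    # span is a dict with a "severity" field ("major" or "minor").
    def task2_score(spans):
        # MQM-style penalties: Major = 10 points, Minor = 2.5 points.
        penalty = sum(10.0 if s["severity"] == "major" else 2.5 for s in spans)
        # A perfect translation scores 100 under ESA; the score is clipped at 0.
        return max(0.0, 100.0 - penalty)

    def task3_label(spans):
        # 1 = error-free (no predicted spans), 0 = contains at least one error.
        return 1 if len(spans) == 0 else 0

    # Example: two Major errors and one Minor error -> 100 - (2 * 10 + 2.5) = 77.5
    spans = [{"severity": "major"}, {"severity": "major"}, {"severity": "minor"}]
    assert task2_score(spans) == 77.5
    assert task3_label(spans) == 0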
EVALUATION
The primary evaluation metric will be a severity-weighted F1 score, where predicted error spans are evaluated jointly for span detection and severity classification.
Specifically, we will compute Precision, Recall, and F-score using MPP with micro-averaging (arxiv.org/abs/2603.19921), modified to incorporate error severity.
Given a set of hypothesis error spans and a set of ground-truth error spans, MPP comprises the following steps:
- Derive a one-to-one matching between hypothesis and ground-truth error spans.
- Compute P and R for each pair of matched errors based on their proportion of overlapping characters.
- Compute average P and R to derive segment-level (or corpus-level) Precision and Recall, which are used to compute the final F-score.
We modify the MPP formulation to incorporate severity information. Our severity-weighted MPP keeps the original MPP procedure of one-to-one span matching and overlap-based partial credit, but replaces equal per-span averaging with severity-aware weighting. Each predicted span contributes to precision in proportion to its severity, each gold span contributes to recall in proportion to its severity, and matched pairs receive overlap credit that is reduced when the predicted and gold severities disagree. As a result, minor false positives and false negatives are penalized less than major ones, while severity mismatches receive partial rather than full credit.
The weights of Minor and Major errors will be determined using the gold data when made available.
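To make the procedure concrete, here is a rough sketch of a severity-weighted, overlap-based precision/recall computation. It is not the official implementation: the matching below is greedy rather than optimal, the severity weights and the mismatch discount are placeholders (the official weights will be set from the gold data, as noted above), and spans are assumed to be (start, end, severity) tuples with exclusive end offsets:

    # Rough sketch of severity-weighted MPP under simplifying assumptions: greedy
    # one-to-one matching by character overlap (the official metric may use an
    # optimal matching), placeholder severity weights, and a placeholder discount
    # for severity mismatches. Spans are (start, end, severity) tuples, end exclusive.
    WEIGHT = {"minor": 1.0, "major": 2.0}   # placeholder weights, not the official ones
    MISMATCH_DISCOUNT = 0.5                 # placeholder partial credit for mismatched severity

    def overlap(a, b):
        # Number of characters shared by two spans.
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))

    def severity_weighted_mpp(pred, gold):
        # Greedily match each predicted span to the unused gold span with the most overlap.
        pairs, used = [], set()
        for p in pred:
            best, best_ov = None, 0
            for j, g in enumerate(gold):
                ov = overlap(p, g)
                if j not in used and ov > best_ov:
                    best, best_ov = j, ov
            if best is not None:
                used.add(best)
                pairs.append((p, gold[best], best_ov))

        def credit(p, g, ov, length):
            # Overlap-based partial credit, reduced when severities disagree.
            c = ov / length if length else 0.0
            return c * MISMATCH_DISCOUNT if p[2] != g[2] else c

        # Precision: each predicted span contributes in proportion to its severity weight.
        p_num = sum(WEIGHT[p[2]] * credit(p, g, ov, p[1] - p[0]) for p, g, ov in pairs)
        p_den = sum(WEIGHT[p[2]] for p in pred)
        # Recall: each gold span contributes in proportion to its severity weight.
        r_num = sum(WEIGHT[g[2]] * credit(p, g, ov, g[1] - g[0]) for p, g, ov in pairs)
        r_den = sum(WEIGHT[g[2]] for g in gold)

        precision = p_num / p_den if p_den else 1.0
        recall = r_num / r_den if r_den else 1.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

In this sketch, unmatched predicted spans lower precision and unmatched gold spans lower recall, each in proportion to its severity weight, which mirrors the behaviour described above.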
BASELINES
We will provide baseline systems based on xCOMET and LLM-as-a-judge approaches. (Additional baselines TBD.)