Shared Task: Automated Translation Quality Evaluation Systems

EMNLP 2026

ELEVENTH CONFERENCE ON
MACHINE TRANSLATION (WMT26)

28-29 October, 2026
Budapest, Hungary HOME

TRANSLATION TASKS:	GENERAL MT •︎ INDIC MT •︎ ARABIC-ASIAN MT •︎ CHINESE-SOUTHEAST ASIAN MT •︎ TERMINOLOGY •︎ MODEL COMPRESSION •︎ CREOLE MT •︎ VIDEO SUBTITLE TRANSLATION
EVALUATION TASKS:	MT TEST SUITES •︎︎ AUTOMATED MT EVALUATION
OTHER TASKS:	OPEN DATA •︎ MULTILINGUAL INSTRUCTION •︎ LIMITED RESOURCES LLM

ANNOUNCEMENTS

2026-05-18: Language pairs updated to match additions in the General MT task
2026-05-04: Detailed descriptions for Challenge sets (Task 4) added
2026-04-29: Detailed descriptions for Task 1, Task 2, and Task 3 added
2026-04-13: The shared task is announced

OVERVIEW

This shared task focuses on evaluating the performance of automated systems that assess the quality of language translation systems. It continues the WMT 2025 shared task that unified and consolidated the separate shared tasks on Machine Translation Metrics and Quality Estimation (QE) from previous years, under an updated structure designed to encourage the development and assessment of new state-of-the-art translation quality evaluation systems.

A primary focus of the task is on systems that can evaluate translation quality in context — where context is at the document or lengthy multi-segment level — even when the granularity of the generated quality assessments is at the word or segment levels. Content segmentation will still be provided as part of the input, but similar to last year and unlike previous years, these segments will be long multi-sentence units of text. Reference translations will also be provided as an optional (but not required) input parameter to the evaluation systems, thus covering both the classical MT metrics and translation-time QE scenarios. However, unlike previous years, reference translations will be generated by post-editing of MT or via synthetic generation and selection.

The shared task this year consists of three primary subtasks that address translation quality assessment from three perspectives: (1) segment-level translation error detection and span annotation, (2) segment-level quality score prediction, and (3) detection of error-free segments. Curated evaluation data sets will be provided for all three subtasks. These include test sets obtained from the General MT shared task as well as a collection of “challenge sets” that were developed by the organizers and members of the research community. A fourth subtask solicits the submission of these challenge sets.

Languages Covered

The list below provides the language pairs covered this year (which fully parallel the languages covered by the General MT shared task):

Czech to German
Czech to Ukrainian
Czech to Vietnamese [new]
Chinese (Simplified) to Japanese [new]
English to Arabic (Egyptian)
English to Armenian [new]
English to Belarusian [new]
English to Chinese (Simplified)
English to Chinese (Traditional Taiwan) [new]
English to Czech
English to Estonian
English to German
English to Icelandic
English to Indonesian [new]
English to Japanese
English to Kazakh [new]
English to Korean
English to Ladin (Italy) [new]
English to Ligurian (Italy) [new]
English to Russian
English to Northern Sámi [new]
English to Thai [new]
English to Ukrainian

Data and Submission

Human assessments of translation quality, collected by the General MT shared task using an ESA-based protocol, will act as the “gold standard” for our shared task evaluation. For details, see the “Human Evaluation” section of the General MT shared task description.

Training and validation datasets will be made available for each of our three subtasks. See the detailed task descriptions (once available) for the list of specific resources available for each of the tasks.

Submissions for Tasks 1, 2, and 3 (described below) will be automated and conducted via Codabench. We will provide an automated mechanism for easing the evaluation of submitted systems for all three tasks: direct submissions to Task 1 will by default also be evaluated on Tasks 2 and 3 via automations developed by the organizing committee. Similarly, direct submissions to Task 2 will by default also be evaluated on Task 3 via a similar automation. Participants have the option of opting out of this automated evaluation as well as submitting separate direct systems to the three tasks. Details of this automation process will be posted here at a later date.

Participants to any shared task will be expected to contribute a one-paragraph description of their system(s) for inclusion in our overall Findings paper, along with a four- to six-page system description paper for inclusion in the wider WMT conference proceedings. See the main WMT website for paper submission information.

TASKS ORGANIZED

Task 1: Segment-Level Error Detection and Span Annotation

Task 1 is a segment-level subtask where the goal is to detect translation errors and identify the precise span of each error within the target-side translation along with its severity. For this subtask we use the error spans obtained from the ESA human annotations generated for the General MT primary task as the target “gold standard.” Participants are asked to predict both the error spans (start and end indices) as well as the error severities (major or minor) within each segment. Submissions will be evaluated and ranked based on their ability to correctly identify the presence of errors, correctly mark the spans of any identified errors, and correctly identify the severity of each of these errors.

Information about the annotation specifics, the evaluation criteria, and the available training and development resources is available in the Task 1 detailed description.

Task 2: Segment-Level Quality Score Prediction

Task 2 is similar to the corresponding task from last year and is largely an updated version of similar tasks from previous years’ QE and Metrics tracks. The goal of the segment-level quality prediction subtask is to predict a quality score for each source–target segment pair in the evaluation set. Participants this year will be asked to predict the ESA score. Submissions will be evaluated and ranked based on their prediction correlations with human-annotated scores at both the segment and system levels.

Information about the annotation specifics, the evaluation criteria, and the available training and development resources is available in the Task 2 detailed description.

Task 3: Detection of Error-Free Segments

Task 3 is a new subtask focused on a concrete application of translation quality evaluation systems: detection of error-free segments. While traditional quality estimation focuses on continuous scores or fine-grained error spans, this task simplifies the objective to a binary classification: Is this translation error-free? The goal is to identify segments that can be published or consumed without further human intervention. Participants must predict a binary label (1 for Error-Free, 0 for Contains Errors) for a given set of source–target pairs. Submissions will be evaluated and ranked based on their alignment with gold labels derived from human annotations, measured by the Matthews Correlation Coefficient.

Information about the input and output specifics and the evaluation methodology is available in the Task 3 detailed description.

Task 4: Challenge Sets

While the first three tasks are focused on the development of stronger and better automated quality evaluation systems, the goal of this subtask is for participants to create test sets with challenging evaluation examples that current automated metrics and evaluation systems fail to identify or score correctly. This subtask is organized into three rounds:

Breaking Round: Challenge set participants (Breakers) create challenging examples for metrics. They then send their resulting challenge sets to the organizers.
Scoring Round: The challenge sets created by Breakers will be packaged along with the rest of the evaluation data and sent to all participants in the three previous tasks (the Builders) to score.
Analysis Round: Breakers will receive their data with all the metrics scores for analysis. They are encouraged to then submit an analysis paper describing their findings to the WMT 2026 conference.

This year we are inviting submissions of challenge sets for all (or any of the) three other subtasks (see detailed descriptions above). In addition, challenge sets can target languages beyond the official language pairs listed for Tasks 1–3, but note that evaluation system developers may opt out from evaluating languages other than the official ones. If you are interested in submitting a challenge set this year, the organizers request that you indicate your intentions by completing the sign-up form here.

More details about the challenge sets can be found in the Task 4 detailed description.

DEADLINES

Challenge set submission deadline

16 July 2026

Tasks 1, 2, and 3 test data release and submission opening

23 July 2026

Tasks 1, 2, and 3 submission deadline

30 July 2026

Scored challenge sets returned to creators for analysis

3 August 2026

WMT paper submission deadline

TBA (follows EMNLP)

WMT notification of acceptance

TBA (follows EMNLP)

WMT camera-ready submission deadline

TBA (follows EMNLP)

Conference

24–29 October 2026

All deadlines are in Anywhere on Earth (AoE) time.

CONTACT

Please contact the organizing committee at wmt-qe-metrics-organizers@googlegroups.com in the event of any questions or difficulties regarding any of the four subtasks.

The organizing committee consists of:

Alon Lavie
Archchana Sindhujan
Chi-kiu (Jackie) Lo
Chrysoula Zerva
Diptesh Kanojia
Eleftherios Avramidis
Fabio Barth
Fred Blain
Giorgos Filandrianos
Greg Hanneman
Lorenzo Proietti
Monishwaran Maheshwaran
Orfeas Menis Mastromichalakis
Shuoyang Ding
Stefano Perrella
Tom Kocmi
Vilém Zouhar

ELEVENTH CONFERENCE ON MACHINE TRANSLATION (WMT26)

Shared Task: Automated Translation Quality Evaluation Systems

ANNOUNCEMENTS

OVERVIEW

Languages Covered

Data and Submission

TASKS ORGANIZED

Task 1: Segment-Level Error Detection and Span Annotation

Task 2: Segment-Level Quality Score Prediction

Task 3: Detection of Error-Free Segments

Task 4: Challenge Sets

DEADLINES

CONTACT

ELEVENTH CONFERENCE ON
MACHINE TRANSLATION (WMT26)