EMNLP 2025

TENTH CONFERENCE ON
MACHINE TRANSLATION (WMT25)

November 5-9, 2025
Suzhou, China
 

ANNOUNCEMENTS

  • 2025-04-17: The shared task was announced

DESCRIPTION

This shared task focuses on evaluating the performance of automated systems that assess the quality of language translation systems. It unifies and consolidates the separate shared tasks on Machine Translation Metrics and Quality Estimation (QE) from previous years, under an updated structure designed to encourage the development and assessment of new state-of-the-art translation quality evaluation systems. A primary focus of the updated task is on systems that can evaluate translation quality in context – where context is at the document or lengthy multi-segment level – even when the granularity of the generated quality assessments is at the word or segment levels. Content segmentation into individual segments will be provided as part of the input, but unlike previous years, these segments will be long, multi-sentence units of text. Human reference translations (when available) will be provided as an optional (but not required) input parameter to the evaluation system, thus covering both the classical MT metrics and translation-time QE scenarios.

The shared task this year consists of three primary subtasks that address translation quality assessment from three perspectives: (1) segment-level quality score prediction, (2) word-level translation error detection and span annotation, and (3) quality-informed segment-level error correction. Curated evaluation data sets will be provided for all three subtasks. These include test sets obtained from the general MT task as well as a collection of “challenge sets” that were developed by the organizers and members of the research community. A fourth subtask solicits the submission of these challenge sets.

Languages covered

The list below provides the language pairs covered this year (which fully parallel those of the General MT task) and the quality annotation type that will be used as the target for each language pair:

  • Czech to Ukrainian (ESA)

  • Czech to German (ESA)

  • Japanese to Chinese (MQM)

  • English to Arabic (ESA)

  • English to Chinese (ESA)

  • English to Czech (ESA)

  • English to Estonian (ESA)

  • English to Icelandic (ESA)

  • English to Japanese (ESA)

  • English to Korean (MQM)

  • English to Russian (ESA)

  • English to Serbian (ESA)

  • English to Ukrainian (ESA)

No new training or validation datasets will be released for Task 1 (Segment-level quality score prediction) or Task 2 (Word-level error detection and span annotation). Participants are encouraged to use the datasets from previous years’ QE shared tasks, available at wmt-qe-task.github.io/wmt-qe-2023/, and from the Metrics shared task, available at www2.statmt.org/wmt24/metrics-task.html. For Task 3 (Quality-informed segment-level error correction), we will release CometKiwi quality annotations for the WMT 2023 Metrics and QE test sets, along with human reference translations for these test sets. Submissions for Tasks 1, 2 and 3 will be automated and conducted via Codabench.
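
For reference, segment-level CometKiwi scores of the kind that will be distributed for Task 3 can be produced with the open-source unbabel-comet package. The sketch below is illustrative only: the exact checkpoint and data format used by the organizers are not specified here, and the example inputs are invented.

    # pip install unbabel-comet
    from comet import download_model, load_from_checkpoint

    # Reference-free (QE) CometKiwi model; downloading it requires accepting
    # the model license on the Hugging Face Hub and being logged in.
    model_path = download_model("Unbabel/wmt22-cometkiwi-da")
    model = load_from_checkpoint(model_path)

    data = [
        {"src": "Dem Feuer konnte Einhalt geboten werden.",
         "mt": "The fire could be stopped."},
    ]
    output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
    print(output.scores)        # one quality score per segment
    print(output.system_score)  # average over all segments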

Tasks organised

Task 1: Segment-level quality score prediction

Task 1 is largely an updated version of similar tasks from previous years’ QE and Metrics tracks. The goal of the segment-level quality prediction subtask is to predict a quality score for each source–target segment pair in the evaluation set. The input to the system, however, is an entire document or sequence of segments, allowing systems to evaluate quality in context. Content segmentation into individual segments will be provided as part of the input, but unlike previous years, these segments will be long, multi-sentence units of text. Depending on the language pair, participants will be asked to predict either the Error Span Annotation (ESA) score or the Multidimensional Quality Metrics (MQM) score. Submissions will be evaluated and ranked based on the correlation of their predictions with these human-annotated scores at both the segment and system levels. Detailed information about the language pairs, the annotation specifics, and the available training and development resources will be announced here at a later date.
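
For orientation, the sketch below shows how segment- and system-level correlations of this kind are commonly computed. The official scoring scripts and the exact correlation statistics (e.g., Pearson, Kendall, or pairwise-accuracy variants) will be announced by the organizers, so treat this only as an illustration.

    # Minimal sketch: correlation of predicted scores with gold ESA/MQM scores.
    # `pred` and `gold` are parallel lists over (system, segment) pairs, and
    # `systems` names the MT system that produced each segment.
    from collections import defaultdict
    from scipy.stats import pearsonr

    def segment_level_corr(pred, gold):
        """Correlation over all (system, segment) pairs."""
        return pearsonr(pred, gold)[0]

    def system_level_corr(pred, gold, systems):
        """Correlation between per-system averages of predicted and gold scores."""
        agg_pred, agg_gold = defaultdict(list), defaultdict(list)
        for p, g, s in zip(pred, gold, systems):
            agg_pred[s].append(p)
            agg_gold[s].append(g)
        mean = lambda xs: sum(xs) / len(xs)
        names = sorted(agg_pred)
        return pearsonr([mean(agg_pred[s]) for s in names],
                        [mean(agg_gold[s]) for s in names])[0]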

Task 2: Word-level error detection and span annotation

Task 2 is a word-level subtask where the goal is to predict the precise span of each translation error along with its severity. For this subtask we use the error spans obtained from the MQM and ESA human annotations generated for the General MT primary task as the target “gold standard”. Participants are asked to predict both the error spans (start and end indices) and the error severities (major or minor) within each segment. The input to the system, however, is an entire document or sequence of segments, allowing systems to evaluate quality in context. As in Task 1, content segmentation into individual segments will be provided as part of the input, but unlike previous years, these segments will be long, multi-sentence units of text. Submissions will be evaluated and ranked based on their ability to correctly identify the presence of errors, correctly mark the spans of any identified errors, and correctly identify the severity of each of these errors. Detailed information about the language pairs, the annotation specifics, and the available training and development resources will be announced here at a later date.
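
As a rough picture of what a word-level prediction might look like, the sketch below marks one error span inside a target segment. The field names, the use of character (rather than token) offsets, and the matching rule are all assumptions; the official submission format and scoring details will be specified with the test data.

    # Hypothetical span representation for one target segment.
    from dataclasses import dataclass

    @dataclass
    class ErrorSpan:
        start: int     # index of the first character of the error
        end: int       # index one past the last character of the error
        severity: str  # "major" or "minor"

    segment = "Das Feuer konnte nicht gestoppt werden."
    predicted = [ErrorSpan(start=17, end=22, severity="major")]  # marks "nicht"

    # One possible matching rule: a predicted span counts as correct if it
    # overlaps a gold span of the same severity (official criteria may differ).
    def matches(a: ErrorSpan, b: ErrorSpan) -> bool:
        return a.start < b.end and b.start < a.end and a.severity == b.severity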

Task 3: Quality-informed segment-level error correction

Task 3 is designed to combine quality estimation and automated post-editing in order to correct the output of machine translation. This year, we invite participants to submit systems capable of automatically generating changes and corrections to machine-translated text, given the source, the MT-generated target, and a QE-generated quality annotation of the MT output. The objective is to explore how quality information (both error annotations and scores) can inform automated error correction. For instance, segment-level quality scores may help identify which segments require correction, while word-level annotations can be used for fine-grained, pinpointed corrections. QE annotations generated by CometKiwi will be provided as part of the input. Participants are free to use the provided CometKiwi annotations or to perform their own quality annotation and then generate a proposed correction. We encourage approaches that leverage quality explanations generated by large language models. The task is primarily focused on obtaining corrected output, and participants are allowed to submit systems with or without generated QE predictions or analysis. As training data, we will release CometKiwi quality annotations for the WMT 2023 Metrics and QE test sets, along with human reference translations for these test sets. Submissions will be evaluated and ranked based on the correlation and agreement between the corrections they generate and the human reference translations. Detailed information about the language pairs, the input and output specifics, and the evaluation methodology will be announced here at a later date.
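
To make the input/output relationship concrete, the sketch below shows a hypothetical shape for one Task 3 example. All field names, offsets, and score values are illustrative; the official input and output specification will be announced later.

    # Hypothetical Task 3 record: source, MT output, QE annotation, correction.
    example = {
        "src": "Le patient doit prendre le médicament deux fois par jour.",
        "mt":  "The patient must take the medicine two times per week.",
        "qe": {
            "score": 0.62,  # e.g. a CometKiwi-style segment score (scale illustrative)
            "error_spans": [
                {"start": 49, "end": 53, "severity": "major"},  # "week" should be "day"
            ],
        },
    }

    # A correction system consumes src, mt, and (optionally) the QE annotation
    # and produces an edited target segment:
    corrected = "The patient must take the medicine twice a day."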

Task 4: Challenge Sets

While the first three tasks are focused on the development of stronger and better automated quality evaluation systems, participants of this subtask are asked to build challenge sets that identify where automated metrics and quality evaluation systems fail! Inspired by the challenge sets of previous years (Metrics task, QE task), the goal of this subtask is for participants to create test sets with challenging evaluation examples that current automated metrics and evaluation systems do not capture well.

This subtask is organized into 3 rounds:

1) Breaking Round: Challenge set participants (Breakers) create challenging examples for metrics. They then send their resulting "challenge sets" to the organizers.

2) Scoring Round: The challenge sets created by Breakers will be packaged along with the rest of the evaluation data and sent to all participants in the three previous tasks (the Builders) to score.

3) Analysis Round: Breakers will receive their data with all the metrics scores for analysis. They are encouraged to then submit an analysis paper describing their findings to the WMT 2025 conference.

Challenge set types: This year we are inviting submissions of challenge sets for all three official subtasks (see detailed descriptions above):

  • Task 1: Segment-level quality score prediction

  • Task 2: Word-level error detection and span annotation

  • Task 3: Quality-informed segment-level error correction

In addition, we note that challenge sets can target languages beyond the language pairs covered by subtasks 1-3, e.g., investigating the performance of metrics on very low-resource MT (see, for example, AfriMTE from last year). If you are interested in submitting a challenge set this year, you are encouraged to contact us at wmt-metrics@googlegroups.com.
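
For orientation, challenge sets in previous Metrics editions were often built around contrastive pairs in which one translation contains a targeted error. The sketch below shows one hypothetical item of that kind; all field names and values are invented, so please coordinate the actual submission format with the organizers.

    # Hypothetical contrastive challenge-set item targeting Task 1.
    item = {
        "source": "Elle a rendu visite à son frère mardi.",
        "good_translation": "She visited her brother on Tuesday.",
        "incorrect_translation": "She visited her sister on Tuesday.",  # lexical error
        "reference": "She visited her brother on Tuesday.",
        "phenomenon": "mistranslation: brother vs. sister",
    }
    # A metric "fails" on this item if it scores the incorrect translation
    # at least as highly as the good one.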

DEADLINES

Challenge set submission deadline

5th July 2025

Tasks 1, 2 and 3 Test Data Release and Submission Opening

17th July 2025

Tasks 1, 2 and 3 Submission Deadline

24th July 2025

Scored challenge sets returned to creators for analysis

31st July 2025

Paper submission deadline to WMT

TBA (follows EMNLP)

WMT Notification of acceptance

TBA (follows EMNLP)

WMT Camera-ready deadline

TBA (follows EMNLP)

All deadlines are in AoE (Anywhere on Earth). Dates are specified with respect to EMNLP 2025.