EMNLP 2025

TENTH CONFERENCE ON
MACHINE TRANSLATION (WMT25)

November 5-9, 2025
Suzhou, China
 
TRANSLATION TASKS: GENERAL MT (NEWS) • INDIC MT • TERMINOLOGY • CREOLE MT • MODEL COMPRESSION
EVALUATION TASKS: MT TEST SUITES • (UNIFIED) MT EVALUATION
OTHER TASKS: OPEN DATA
MULTILINGUAL TASKS: MULTILINGUAL INSTRUCTION • LIMITED RESOURCES SLAVIC LLM

ANNOUNCEMENTS

  • Development data to be made available in June

  • Task details published in May

SUMMARY

Following last year’s QE-informed APE subtask, we invite participants to submit systems capable of automatically generating corrections to machine-translated text, given the source, the MT-generated target, and a QE-generated quality annotation of the MT. The objective is to explore how quality information (both error annotations and scores) can inform automated error correction. For instance, sentence-level quality scores may help identify which segments require correction, while span-level annotations can be used for fine-grained, pinpointed corrections. We encourage approaches that leverage quality explanations generated by large language models. Submissions will be evaluated and ranked on the quality of their corrections, rewarding systems that improve the translation with as few changes as possible.

LANGUAGE PAIRS

  • English to Chinese

  • English to Czech

  • English to Icelandic

  • English to Japanese

  • English to Russian

  • English to Ukrainian

TASK DESCRIPTION

The goal of this subtask is to generate corrected MT output with as few changes as possible, based on the provided quality annotations. Participants should use automated, human, or their own error annotations as part of the input.

DEVELOPMENT DATA

We will provide development data for the language pairs listed above, drawn from the WMT 2024 evaluation campaign and comprising 130k instances. Each instance contains the following:

  • source text

  • machine translation

  • mt score (based on either ESA or MQM)

  • error span annotations (with character indices into the translation and minor/major error severities)

For example:

src: "Come on, Tenuk, you've shapeshifted into a Thraki before!"
mt: „Vertu nú Tenuk, þú hefur breytt um lögun í Thraki áður!“
mt score: 41.0
error spans: [{"start_i":1,"end_i":14,"severity":"major"},{"start_i":26,"end_i":51,"severity":"minor"}]

You can expect the test data to be in the same format.
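For illustration, here is a minimal Python sketch that represents one instance in this format and extracts the flagged substrings via the character indices. The dictionary key names and the exclusive-end reading of end_i are assumptions for this sketch; the released data files may name and delimit the fields differently.

# A minimal sketch: one development instance, using the fields shown above.
# Key names and the exclusive-end interpretation of end_i are assumptions.
instance = {
    "src": "Come on, Tenuk, you've shapeshifted into a Thraki before!",
    "mt": "„Vertu nú Tenuk, þú hefur breytt um lögun í Thraki áður!“",
    "mt_score": 41.0,
    "error_spans": [
        {"start_i": 1, "end_i": 14, "severity": "major"},
        {"start_i": 26, "end_i": 51, "severity": "minor"},
    ],
}

# Slice the flagged substrings out of the machine translation.
for span in instance["error_spans"]:
    flagged = instance["mt"][span["start_i"]:span["end_i"]]
    print(span["severity"], "->", repr(flagged))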

EVALUATION METRICS

The primary evaluation metric for this shared task is ΔCOMET, which measures the improvement introduced by the correction system. Specifically, it is computed as the difference between the COMET score of the post-edited output and that of the original machine translation (MT) output, given the same source sentence:

\(\Delta \mathrm{COMET} = \mathrm{COMET}(\mathrm{source},\ \mathrm{improved\ translation}) - \mathrm{COMET}(\mathrm{source},\ \mathrm{original\ translation})\)
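As a sketch, ΔCOMET can be computed with the unbabel-comet package and a reference-free (QE) checkpoint. The specific model below is an assumption, since the task does not state which COMET variant is used for the official ranking.

from comet import download_model, load_from_checkpoint

# Reference-free COMET checkpoint (assumed; the official evaluation may
# use a different model, and this checkpoint is gated on Hugging Face).
model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

src = "Come on, Tenuk, you've shapeshifted into a Thraki before!"
original = "„Vertu nú Tenuk, þú hefur breytt um lögun í Thraki áður!“"
corrected = original  # replace with your system's corrected translation

scores = model.predict(
    [{"src": src, "mt": original}, {"src": src, "mt": corrected}],
    batch_size=2,
    gpus=0,
).scores
delta_comet = scores[1] - scores[0]  # improved minus original
print(f"ΔCOMET = {delta_comet:.4f}")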

In addition to ΔCOMET, we also consider the Gain-to-Edit Ratio to better understand the trade-off between quality improvement and the amount of change introduced by the correction system.

\(\mathrm{Gain\text{-}to\text{-}Edit\ Ratio} = \frac{\mathrm{Quality\ Improvement\ or\ Error\ Reduction}}{\mathrm{TER}(\mathrm{original\ translation},\ \mathrm{corrected\ translation})}\)
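Below is a sketch of this ratio, assuming the numerator is ΔCOMET and using sacrebleu's TER implementation; how the organizers instantiate the numerator, and which side of the TER comparison is treated as the reference, are assumptions here.

from sacrebleu.metrics import TER

ter = TER()

def gain_to_edit_ratio(delta_comet, original, corrected):
    # TER between the original and corrected translations; here the
    # corrected output is scored against the original as "reference".
    edit_score = ter.sentence_score(corrected, [original]).score
    if edit_score == 0.0:  # no-op edit: gain per edit is undefined
        return 0.0
    return delta_comet / edit_score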

BASELINES

To help participants gauge the performance of their systems, we will include multiple baseline systems in our evaluation. These baselines represent a range of approaches, from a simple do-nothing scenario to methods employing large language models for post-editing. The performance of these baselines will be reported alongside participant submissions.

The baselines we consider are the following four:

  • No-op: no edits made to the translation.

  • Translate from scratch: the source is retranslated, disregarding the provided MT and QE annotations.

  • One-shot: a baseline large language model prompted to post-edit the translation based on the supplied error spans (see the sketch after this list).

  • Iterative refining: a baseline large language model prompted to post-edit the error spans in iterative steps, from most to least severe.
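For illustration, here is a hypothetical prompt builder for the one-shot baseline; the wording and structure are assumptions, not the organizers' actual prompt.

def build_postedit_prompt(src, mt, error_spans):
    # Assemble a one-shot post-editing prompt from the QE error spans.
    # The prompt wording below is illustrative only.
    lines = [
        "Correct the machine translation below with as few changes as possible.",
        f"Source: {src}",
        f"Translation: {mt}",
        "Flagged errors:",
    ]
    for span in error_spans:
        fragment = mt[span["start_i"]:span["end_i"]]
        lines.append(f'- {span["severity"]} error: "{fragment}"')
    lines.append("Corrected translation:")
    return "\n".join(lines)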

All deadlines are in AoE (Anywhere on Earth). Dates are specified with respect to EMNLP 2025.