ANNOUNCEMENTS
- Development data to be made available in June
- Task details published [May]
SUMMARY
Following last year’s QE-informed APE subtask, we invite participants to submit systems capable of automatically generating changes and corrections to machine-translated text, given the source, the MT-generated target, and a QE-generated quality annotation of the MT. The objective is to explore how quality information (both error annotations and scores) can inform automated error correction. For instance, sentence-level quality scores may help identify which segments require correction, while span-level annotations can be used for fine-grained, pinpointed corrections. We encourage approaches that leverage quality explanations generated by large language models. The task is primarily focused on obtaining corrected output, and participants are allowed to submit systems with or without generated QE predictions or analysis. As training data, we will release CometKiwi quality annotations for the WMT 2023 Metrics and QE test sets, along with human reference translations for these test sets. Submissions will be evaluated and ranked based on the correlation and agreement between the corrections they generate and human reference translations. Detailed information about the language pairs, the input and output specifics, and the evaluation methodology is provided below.
LANGUAGE PAIRS
- English to Chinese
- English to Czech
- English to Icelandic
- English to Japanese
- English to Russian
- English to Ukrainian
TASK DESCRIPTION
The goal of this subtask is to generate corrected MT output based on the provided quality annotations. As part of the input, we provide QE annotations generated by CometKiwi. Participants are free to utilise the provided CometKiwi annotations or to perform their own quality annotation and then generate a proposed correction. We will provide baseline predictions using multiple baseline approaches: do nothing, single-pass LLM APE, and iterative LLM APE.
DEVELOPMENT DATA
We will provide development data for the language pairs mentioned above based on the WMT 2024 evaluation campaign, which includes 130k instances. Each instance contains the following:
- source text
- machine translation
- MT score (based on either ESA or MQM)
- error span annotations (with character indices into the translation and minor/major error severities)
As an example, see:
src: "Come on, Tenuk, you've shapeshifted into a Thraki before!"
mt: „Vertu nú Tenuk, þú hefur breytt um lögun í Thraki áður!“
mt score: 41.0
error spans: [{"start_i": 1, "end_i": 14, "severity": "major"}, {"start_i": 26, "end_i": 51, "severity": "minor"}]
You can expect the test data to be in the same format.
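For concreteness, the snippet below shows how an instance in this format can be represented and how the span indices map into the MT string. The dictionary layout mirrors the example above; the exact serialisation of the released files (e.g. JSON Lines) is an assumption made here for illustration.

# A single development instance in the format of the example above.
# The field names mirror the example; the exact serialisation of the
# released files (e.g. JSON Lines) is an assumption for illustration.
instance = {
    "src": "Come on, Tenuk, you've shapeshifted into a Thraki before!",
    "mt": "„Vertu nú Tenuk, þú hefur breytt um lögun í Thraki áður!“",
    "mt_score": 41.0,
    "error_spans": [
        {"start_i": 1, "end_i": 14, "severity": "major"},
        {"start_i": 26, "end_i": 51, "severity": "minor"},
    ],
}

# The start/end character indices point into the machine-translated string.
for span in instance["error_spans"]:
    marked = instance["mt"][span["start_i"]:span["end_i"]]
    print(f'{span["severity"]}: "{marked}"')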
EVALUATION METRICS
The primary evaluation metric for this shared task is ΔCOMET, which measures the improvement introduced by the correction system. Specifically, it is computed as the difference between the COMET score of the post-edited output and that of the original machine translation (MT) output, given the same source sentence:
\(\Delta \mathrm{COMET} = \mathrm{COMET}(\mathrm{source},\ \mathrm{improved\ translation}) - \mathrm{COMET}(\mathrm{source},\ \mathrm{original\ translation})\)
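As a rough sketch of how ΔCOMET can be computed with the publicly available unbabel-comet package: the reference-free checkpoint named below (Unbabel/wmt22-cometkiwi-da) is an assumption made for illustration, not necessarily the model used in the official evaluation.

from comet import download_model, load_from_checkpoint

# Reference-free QE model; the exact checkpoint used in the official
# evaluation is an assumption here.
model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

def delta_comet(sources, original_mts, corrected_mts):
    """Average COMET(source, corrected) minus COMET(source, original)."""
    original = [{"src": s, "mt": m} for s, m in zip(sources, original_mts)]
    corrected = [{"src": s, "mt": m} for s, m in zip(sources, corrected_mts)]
    score_original = model.predict(original, batch_size=8, gpus=0).system_score
    score_corrected = model.predict(corrected, batch_size=8, gpus=0).system_score
    return score_corrected - score_original

Note that a reference-free model scores each translation against the source only, matching the formula above, which does not involve a human reference.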
In addition to ΔCOMET, we also consider the Gain-to-Edit Ratio to better understand the trade-off between quality improvement and the amount of change introduced by the correction system.
\(\mathrm{Gain\text{-}to\text{-}Edit\ Ratio} = \frac{\mathrm{Quality\ Improvement\ or\ Error\ Reduction}}{\mathrm{TER}(\mathrm{original\ translation},\ \mathrm{corrected\ translation})}\)
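A corresponding sketch for the Gain-to-Edit Ratio, assuming ΔCOMET as the quality-improvement numerator and TER computed with sacrebleu; the organisers' exact choice of numerator may differ.

from sacrebleu.metrics import TER

ter_metric = TER()

def gain_to_edit_ratio(quality_improvement, original_mts, corrected_mts):
    # TER between the corrected output and the original MT measures how much
    # the system edited; the original translations act as the "references".
    ter = ter_metric.corpus_score(corrected_mts, [original_mts]).score
    if ter == 0.0:
        return 0.0  # no edits were made, so no gain can be attributed to editing
    return quality_improvement / ter

With the delta_comet helper sketched above, the ratio could be computed as gain_to_edit_ratio(delta_comet(srcs, mts, fixed), mts, fixed).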
BASELINES
To help participants gauge the performance of their systems, we will include multiple baseline systems in our evaluation. These baselines represent a range of approaches, from a simple do-nothing scenario to methods employing large language models for post-editing. The performance of these baselines will be reported alongside participant submissions.
We consider the following three baselines:
- No-op: no edits are made to the translation.
- One-shot LLM: a baseline large language model prompted to post-edit the translation based on the supplied error spans (a prompt sketch is given below).
- Iterative LLM refining: a baseline large language model prompted to post-edit the annotated error spans in iterative steps, from most to least severe.
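The following is one possible way such a post-editing prompt could be assembled from the annotations; both the prompt wording and the call_llm placeholder are illustrative and do not describe the organisers' actual baseline implementation.

def build_ape_prompt(src, mt, error_spans):
    """Assemble a post-editing prompt from the QE annotations (illustrative wording)."""
    lines = [
        "Improve the following machine translation, correcting the annotated errors.",
        f"Source: {src}",
        f"Translation: {mt}",
        "Annotated errors:",
    ]
    for span in error_spans:
        lines.append(f'- {span["severity"]} error: "{mt[span["start_i"]:span["end_i"]]}"')
    lines.append("Corrected translation:")
    return "\n".join(lines)

# call_llm is a placeholder for whichever LLM API a participant chooses;
# the iterative baseline would instead prompt once per span, most severe first.
# corrected = call_llm(build_ape_prompt(instance["src"], instance["mt"], instance["error_spans"]))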
All deadlines are in AoE (Anywhere on Earth). Dates are specified with respect to EMNLP 2025.