ANNOUNCEMENTS
- Development data to be made available in June
- Task details published [May]
SUMMARY
Following last year’s QE-informed APE subtask, we invite participants to submit systems capable of automatically generating changes and corrections to machine-translated text, given the source, the MT-generated target, and a QE-generated quality annotation of the MT. The objective is to explore how quality information (both error annotations and scores) can inform automated error correction. For instance, sentence-level quality scores may help identify which segments require correction, while span-level annotations can be used for fine-grained, pinpointed corrections. We encourage approaches that leverage quality explanations generated by large language models. The task is primarily focused on obtaining corrected output, and participants are allowed to submit systems with or without generated QE predictions or analysis. As training data, we will release CometKiwi quality annotations for the WMT 2023 Metrics and QE test sets, along with human reference translations for these test sets. Submissions will be evaluated and ranked based on the correlation and agreement between the corrections they generate and human reference translations. Detailed information about the language pairs, the input and output specifics, and the evaluation methodology is provided below.
LANGUAGE PAIRS
- English to Chinese
- English to Czech
- English to Icelandic
- English to Japanese
- English to Russian
- English to Ukrainian
TASK DESCRIPTION
The goal of this subtask is to generate corrected MT output based on the provided quality annotations. As part of the input, we provide QE annotations generated by CometKiwi. Participants are free to utilise the provided CometKiwi annotations or to perform their own quality annotation and then generate a proposed correction. We will provide baseline predictions using multiple baseline approaches: do nothing, single-pass LLM APE, and iterative LLM APE.
DEVELOPMENT DATA
We will provide development data for the language pairs mentioned above based on the WMT 2024 evaluation campaign, which includes 130k instances. Each instance contains the following:
- source text
- machine translation
- MT score (based on either ESA or MQM)
- error span annotations (with character indices into the translation and minor/major error severities)
As an example, see:
src: "Come on, Tenuk, you've shapeshifted into a Thraki before!"
mt: „Vertu nú Tenuk, þú hefur breytt um lögun í Thraki áður!“
mt score: 41.0
error spans: [{"start_i": 1, "end_i": 14, "severity": "major"}, {"start_i": 26, "end_i": 51, "severity": "minor"}]
You can expect the test data to be in the same format.
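For concreteness, the snippet below shows how an instance in this format can be represented and how the span indices map into the MT string. The dictionary layout mirrors the example above; the exact serialisation of the released files (e.g. JSON Lines) is an assumption made here for illustration.

# A single development instance in the format of the example above.
# The field names mirror the example; the exact serialisation of the
# released files (e.g. JSON Lines) is an assumption for illustration.
instance = {
    "src": "Come on, Tenuk, you've shapeshifted into a Thraki before!",
    "mt": "„Vertu nú Tenuk, þú hefur breytt um lögun í Thraki áður!“",
    "mt_score": 41.0,
    "error_spans": [
        {"start_i": 1, "end_i": 14, "severity": "major"},
        {"start_i": 26, "end_i": 51, "severity": "minor"},
    ],
}

# The start/end character indices point into the machine-translated string.
for span in instance["error_spans"]:
    marked = instance["mt"][span["start_i"]:span["end_i"]]
    print(f'{span["severity"]}: "{marked}"')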
EVALUATION METRICS
The primary evaluation metric for this shared task is ΔCOMET, which measures the improvement introduced by the correction system. Specifically, it is computed as the difference between the COMET score of the post-edited output and that of the original machine translation (MT) output, given the same source sentence:
\(\Delta \mathrm{COMET} = \mathrm{COMET}(\mathrm{source},\ \mathrm{improved\ translation}) - \mathrm{COMET}(\mathrm{source},\ \mathrm{original\ translation})\)
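As a rough sketch of how ΔCOMET can be computed with the publicly available unbabel-comet package: the reference-free checkpoint named below (Unbabel/wmt22-cometkiwi-da) is an assumption made for illustration, not necessarily the model used in the official evaluation.

from comet import download_model, load_from_checkpoint

# Reference-free QE model; the exact checkpoint used in the official
# evaluation is an assumption here.
model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

def delta_comet(sources, original_mts, corrected_mts):
    """Average COMET(source, corrected) minus COMET(source, original)."""
    original = [{"src": s, "mt": m} for s, m in zip(sources, original_mts)]
    corrected = [{"src": s, "mt": m} for s, m in zip(sources, corrected_mts)]
    score_original = model.predict(original, batch_size=8, gpus=0).system_score
    score_corrected = model.predict(corrected, batch_size=8, gpus=0).system_score
    return score_corrected - score_original

Note that a reference-free model scores each translation against the source only, matching the formula above, which does not involve a human reference.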
In addition to ΔCOMET, we also consider the Gain-to-Edit Ratio to better understand the trade-off between quality improvement and the amount of change introduced by the correction system.
\(\mathrm{Gain\text{-}to\text{-}Edit\ Ratio} = \frac{\mathrm{Quality\ Improvement\ or\ Error\ Reduction}}{\mathrm{TER}(\mathrm{original\ translation},\ \mathrm{corrected\ translation})}\)
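A corresponding sketch for the Gain-to-Edit Ratio, assuming ΔCOMET as the quality-improvement numerator and TER computed with sacrebleu; the organisers' exact choice of numerator may differ.

from sacrebleu.metrics import TER

ter_metric = TER()

def gain_to_edit_ratio(quality_improvement, original_mts, corrected_mts):
    # TER between the corrected output and the original MT measures how much
    # the system edited; the original translations act as the "references".
    ter = ter_metric.corpus_score(corrected_mts, [original_mts]).score
    if ter == 0.0:
        return 0.0  # no edits were made, so no gain can be attributed to editing
    return quality_improvement / ter

With the delta_comet helper sketched above, the ratio could be computed as gain_to_edit_ratio(delta_comet(srcs, mts, fixed), mts, fixed).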
BASELINES
To help participants gauge the performance of their systems, we will include multiple baseline systems in our evaluation. These baselines represent a range of approaches, from a simple do-nothing scenario to methods employing large language models for post-editing. The performance of these baselines will be reported alongside participant submissions.
We consider the following three baselines:
- No-op: no edits are made to the translation.
- One-shot LLM: a baseline large language model prompted to post-edit the translation based on the supplied error spans (a prompt sketch is given below).
- Iterative LLM refining: a baseline large language model prompted to post-edit the annotated error spans in iterative steps, from most to least severe.
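The following is one possible way such a post-editing prompt could be assembled from the annotations; both the prompt wording and the call_llm placeholder are illustrative and do not describe the organisers' actual baseline implementation.

def build_ape_prompt(src, mt, error_spans):
    """Assemble a post-editing prompt from the QE annotations (illustrative wording)."""
    lines = [
        "Improve the following machine translation, correcting the annotated errors.",
        f"Source: {src}",
        f"Translation: {mt}",
        "Annotated errors:",
    ]
    for span in error_spans:
        lines.append(f'- {span["severity"]} error: "{mt[span["start_i"]:span["end_i"]]}"')
    lines.append("Corrected translation:")
    return "\n".join(lines)

# call_llm is a placeholder for whichever LLM API a participant chooses;
# the iterative baseline would instead prompt once per span, most severe first.
# corrected = call_llm(build_ape_prompt(instance["src"], instance["mt"], instance["error_spans"]))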
All deadlines are in AoE (Anywhere on Earth). Dates are specified with respect to EMNLP 2025.