ANNOUNCEMENTS
- 2025-07-25: Participants are kindly requested to provide the details of their automatic evaluation systems via this form
- 2025-07-24: Codabench opened
- 2025-05-18: Detailed description of the task announced
DESCRIPTION
The goal of this subtask is to predict translation error spans along with their severity labels. As the target “gold standard”, we use the error spans obtained from the MQM and ESA human annotations generated for the General MT primary task. Participants are asked to predict both the error spans (start and end character indices) and the error severity (major or minor) for each segment. Minor issues are defined as those that do not impact meaning or usability, whereas major issues are those that impact meaning or usability but do not render the text unusable.
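For illustration, consider the following hypothetical error span (assuming Python-style character indexing with an exclusive end index; the task definition may count end indices differently):

    hypothesis = "The cat sat in the mat."
    start, end = 12, 14           # hypothetical character indices
    span = hypothesis[start:end]  # -> "in", marked here as a minor error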
Languages covered
The list below provides the language pairs covered this year (which fully parallel the general machine translation task) and the quality annotations that will be used as targets for each language pair:
- Czech to Ukrainian (ESA)
- Czech to German (ESA)
- Japanese to Chinese (MQM)
- English to Arabic (ESA)
- English to Chinese (ESA)
- English to Czech (ESA)
- English to Estonian (ESA)
- English to Icelandic (ESA)
- English to Italian (ESA)
- English to Japanese (ESA)
- English to Korean (MQM)
- English to Russian (ESA)
- English to Serbian (ESA)
- English to Ukrainian (ESA)
- English to Bhojpuri (ESA)
- English to Maasai (ESA)
Language pairs highlighted in green have development or training sets available from previous versions of the shared task (see below). All language pairs except for English-Maasai and English-Italian will also be provided with a reference translation.
TRAINING AND DEVELOPMENT SETS
For training and development, participants can use the MQM and ESA annotations released in the previous editions of the WMT shared tasks. Some repositories that contain these datasets are listed below:
- ESA: wmt24-humeval
We recommend using WMT24 as the development set.
BASELINES
We will provide xCOMET and LLM-as-a-judge baselines for this subtask.
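As a pointer, xCOMET error-span predictions can be obtained with the open-source comet library along the following lines (a sketch only; the checkpoint name and the exact output fields may differ across library versions):

    # pip install unbabel-comet
    from comet import download_model, load_from_checkpoint

    # XCOMET models predict a segment-level quality score plus error spans.
    model_path = download_model("Unbabel/XCOMET-XL")
    model = load_from_checkpoint(model_path)

    data = [{
        "src": "Boa noite, como vai?",        # source_segment
        "mt":  "Good morning, how are you?",  # hypothesis_segment
        "ref": "Good evening, how are you?",  # reference_segment (optional)
    }]
    output = model.predict(data, batch_size=8, gpus=1)
    print(output.scores)                # segment-level quality scores
    print(output.metadata.error_spans)  # per-segment error spans with severities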
EVALUATION
The primary evaluation metric will be an error-severity-weighted F1 score over the predicted spans.
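The exact definition of the metric will be provided by the organizers. As a rough illustration of the idea, the sketch below computes a character-level F1 in which each character of an error span contributes its severity weight (here, hypothetically, 1 for minor and 5 for major):

    # Illustrative sketch only: the weights and matching scheme are assumptions.
    def severity_weighted_f1(pred_spans, gold_spans, text_len, weights=None):
        """pred_spans / gold_spans: lists of (start, end, severity) tuples
        with Python-style exclusive end indices."""
        weights = weights or {"minor": 1.0, "major": 5.0}

        def to_char_weights(spans):
            w = [0.0] * text_len
            for start, end, sev in spans:
                for i in range(start, end):
                    w[i] = max(w[i], weights[sev])
            return w

        pred = to_char_weights(pred_spans)
        gold = to_char_weights(gold_spans)

        # Credit each character by the smaller weight where both sides mark
        # an error; characters marked by only one side count as (weighted)
        # false positives or false negatives.
        tp = sum(min(p, g) for p, g in zip(pred, gold))
        precision = tp / sum(pred) if sum(pred) else 0.0
        recall = tp / sum(gold) if sum(gold) else 0.0
        return 2 * precision * recall / (precision + recall) if tp else 0.0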
SUBMISSION FORMAT
For each submission you wish to make, upload a single zip file with the predictions and the system metadata.
For the metadata, we expect a two-line metadata.txt file: the first line must identify your team (either your Codabench username or your team name); the second line must contain a short description (2-3 sentences) of the system you used to generate your predictions. This description will not be shown to other participants. It is fine to reuse the same description across submissions/phases if you use the same model (e.g. a multilingual or multitask model).
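For illustration, a metadata.txt could look like this (team name and description are hypothetical):

    AwesomeQE
    A multilingual XLM-R-based span tagger fine-tuned on WMT24 ESA annotations; one model is used for all language pairs, with a second classification head predicting error severity.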
The test set is a plain-text, UTF-8-encoded TSV file containing columns 1-11 as defined below. For the predictions, we expect a single TSV file for each submitted system output, named predictions.tsv, with the added columns 12-14 as described below; an example encoding of these columns is given after the table. Columns 7-9 are optional in the submission.
#  | Field Name         | Explanation
---|--------------------|------------
1  | doc_id             | String identifier of the document to which the segment belongs
2  | segment_id         | Numeric index of the segment’s position among all the segments for a given language pair
3  | source_lang        | Code identifying the segment’s source language
4  | target_lang        | Code identifying the segment’s target language
5  | set_id             | String identifier of the portion of the test set to which the segment belongs
6  | system_id          | String identifier of the MT system that translated the segment
7  | source_segment     | String contents of the segment’s source side
8  | hypothesis_segment | String contents of the segment’s machine translation
9  | reference_segment  | String contents of the segment’s gold-standard translation
10 | domain_name        | String identifier of the domain of the test set to which the segment belongs
11 | method             | String indicating whether the segment is expected to be quality-scored according to ESA or MQM criteria
12 | start_indices      | Character-level start index of each predicted error span. For multiple error spans, separate the indices with a single space. If the segment has no errors, the value should be -1.
13 | end_indices        | Character-level end index of each predicted error span. For multiple error spans, separate the indices with a single space. If the segment has no errors, the value should be -1.
14 | error_types        | Severity label (minor or major) for each predicted error span. The number of labels should match the number of spans. If the segment has no error span, use no-error.
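For illustration (all span values hypothetical), the helper below encodes a list of predicted spans into columns 12-14:

    def encode_spans(spans):
        """Encode (start, end, severity) tuples into the
        start_indices / end_indices / error_types columns (sketch)."""
        if not spans:
            return "-1", "-1", "no-error"
        starts = " ".join(str(start) for start, _, _ in spans)
        ends = " ".join(str(end) for _, end, _ in spans)
        severities = " ".join(sev for _, _, sev in spans)
        return starts, ends, severities

    # encode_spans([(12, 18, "major"), (40, 52, "minor")])
    #   -> ("12 40", "18 52", "major minor")
    # encode_spans([])
    #   -> ("-1", "-1", "no-error")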
After finalizing all of their submissions on Codabench, participants are kindly requested to provide the details of their systems by filling in this form.