ANNOUNCEMENTS
- 2025-07-25: Participants are kindly requested to provide the details of their automatic evaluation systems via this form
- 2025-07-24: Codabench opened
- 2025-05-18: Detailed description of the task announced
DESCRIPTION
The goal of this subtask is to predict translation error spans along with their severity labels. As the target “gold standard”, we use the error spans obtained from the MQM and ESA human annotations generated for the General MT primary task. Participants are asked to predict both the error spans (start and end character indices) and the error severity (major or minor) for each segment. Minor issues are defined as those that do not impact meaning or usability, whereas major issues are those that impact meaning or usability but do not render the text unusable.
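For illustration, consider the following hypothetical error span (assuming Python-style character indexing with an exclusive end index; the task definition may count end indices differently):

    hypothesis = "The cat sat in the mat."
    start, end = 12, 14           # hypothetical character indices
    span = hypothesis[start:end]  # -> "in", marked here as a minor error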
Languages covered
The list below provides the language pairs covered this year (which fully parallel the general machine translation task) and the quality annotations that will be used as targets for each language pair:
- Czech to Ukrainian (ESA)
- Czech to German (ESA)
- Japanese to Chinese (MQM)
- English to Arabic (ESA)
- English to Chinese (ESA)
- English to Czech (ESA)
- English to Estonian (ESA)
- English to Icelandic (ESA)
- English to Italian (ESA)
- English to Japanese (ESA)
- English to Korean (MQM)
- English to Russian (ESA)
- English to Serbian (ESA)
- English to Ukrainian (ESA)
- English to Bhojpuri (ESA)
- English to Maasai (ESA)
Language pairs highlighted in green have development or training sets available from previous versions of the shared task (see below). All language pairs except for English-Maasai and English-Italian will also be provided with a reference translation.
TRAINING AND DEVELOPMENT SETS
For training and development, participants can use the MQM and ESA annotations released in the previous editions of the WMT shared tasks. Some repositories that contain these datasets are listed below:
- ESA: wmt24-humeval
We recommend using WMT24 as the development set.
BASELINES
We will provide xCOMET and LLM-as-a-judge baselines for this subtask.
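As a pointer, xCOMET error-span predictions can be obtained with the open-source comet library along the following lines (a sketch only; the checkpoint name and the exact output fields may differ across library versions):

    # pip install unbabel-comet
    from comet import download_model, load_from_checkpoint

    # XCOMET models predict a segment-level quality score plus error spans.
    model_path = download_model("Unbabel/XCOMET-XL")
    model = load_from_checkpoint(model_path)

    data = [{
        "src": "Boa noite, como vai?",        # source_segment
        "mt":  "Good morning, how are you?",  # hypothesis_segment
        "ref": "Good evening, how are you?",  # reference_segment (optional)
    }]
    output = model.predict(data, batch_size=8, gpus=1)
    print(output.scores)                # segment-level quality scores
    print(output.metadata.error_spans)  # per-segment error spans with severities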
EVALUATION
The primary evaluation metric will be an error-severity-weighted F1 score over the predicted spans.
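The exact definition of the metric will be provided by the organizers. As a rough illustration of the idea, the sketch below computes a character-level F1 in which each character of an error span contributes its severity weight (here, hypothetically, 1 for minor and 5 for major):

    # Illustrative sketch only: the weights and matching scheme are assumptions.
    def severity_weighted_f1(pred_spans, gold_spans, text_len, weights=None):
        """pred_spans / gold_spans: lists of (start, end, severity) tuples
        with Python-style exclusive end indices."""
        weights = weights or {"minor": 1.0, "major": 5.0}

        def to_char_weights(spans):
            w = [0.0] * text_len
            for start, end, sev in spans:
                for i in range(start, end):
                    w[i] = max(w[i], weights[sev])
            return w

        pred = to_char_weights(pred_spans)
        gold = to_char_weights(gold_spans)

        # Credit each character by the smaller weight where both sides mark
        # an error; characters marked by only one side count as (weighted)
        # false positives or false negatives.
        tp = sum(min(p, g) for p, g in zip(pred, gold))
        precision = tp / sum(pred) if sum(pred) else 0.0
        recall = tp / sum(gold) if sum(gold) else 0.0
        return 2 * precision * recall / (precision + recall) if tp else 0.0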
SUBMISSION FORMAT
For each submission you wish to make, upload a single zip file with the predictions and the system metadata.
For the metadata, we expect a two-line metadata.txt file: the first line must identify your team (either your Codabench username or your team name); the second line must contain a short description (2-3 sentences) of the system you used to generate your predictions. This description will not be shown to other participants. It is fine to reuse the same description across submissions/phases if you use the same model (e.g. a multilingual or multitask model).
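For illustration, a metadata.txt could look like this (team name and description are hypothetical):

    AwesomeQE
    A multilingual XLM-R-based span tagger fine-tuned on WMT24 ESA annotations; one model is used for all language pairs, with a second classification head predicting error severity.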
The test set is a plain-text, UTF-8-encoded TSV file containing columns 1-11 as defined below. For the predictions, we expect a single TSV file for each submitted system output, named predictions.tsv, with the added columns 12-14 as described below; an example encoding of these columns is given after the table. Columns 7-9 are optional in the submission.
#  | Field Name         | Explanation
---|--------------------|------------
1  | doc_id             | String identifier of the document to which the segment belongs
2  | segment_id         | Numeric index of the segment’s position among all the segments for a given language pair
3  | source_lang        | Code identifying the segment’s source language
4  | target_lang        | Code identifying the segment’s target language
5  | set_id             | String identifier of the portion of the test set to which the segment belongs
6  | system_id          | String identifier of the MT system that translated the segment
7  | source_segment     | String contents of the segment’s source side
8  | hypothesis_segment | String contents of the segment’s machine translation
9  | reference_segment  | String contents of the segment’s gold-standard translation
10 | domain_name        | String identifier of the domain of the test set to which the segment belongs
11 | method             | String indicating whether the segment is expected to be quality-scored according to ESA or MQM criteria
12 | start_indices      | Character-level start index of each predicted error span. For multiple error spans, separate the indices with a single space. If the segment has no errors, the value should be -1.
13 | end_indices        | Character-level end index of each predicted error span. For multiple error spans, separate the indices with a single space. If the segment has no errors, the value should be -1.
14 | error_types        | Severity label (minor or major) for each predicted error span. The number of labels should match the number of spans. If the segment has no error span, use no-error.
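For illustration (all span values hypothetical), the helper below encodes a list of predicted spans into columns 12-14:

    def encode_spans(spans):
        """Encode (start, end, severity) tuples into the
        start_indices / end_indices / error_types columns (sketch)."""
        if not spans:
            return "-1", "-1", "no-error"
        starts = " ".join(str(start) for start, _, _ in spans)
        ends = " ".join(str(end) for _, end, _ in spans)
        severities = " ".join(sev for _, _, sev in spans)
        return starts, ends, severities

    # encode_spans([(12, 18, "major"), (40, 52, "minor")])
    #   -> ("12 40", "18 52", "major minor")
    # encode_spans([])
    #   -> ("-1", "-1", "no-error")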
After finalizing all of their submissions on Codabench, participants are kindly requested to provide the details of their systems by filling in this form.