EMNLP 2024

NINTH CONFERENCE ON
MACHINE TRANSLATION (WMT24)

November 15-16, 2024
Miami, Florida, USA
 

Task description

This is a word-level subtask where the goal is to predict translation error spans, as opposed to the binary OK/BAD tags of previous word-level tasks.

For this task we will use the error spans obtained from the MQM annotations. Participants will be asked to predict both the error span (start and end indices) as well as the error severity (major or minor) for each segment.
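To make the prediction target concrete, here is a minimal, purely illustrative sketch in Python; the dictionary keys, the example sentence, and the span offsets are assumptions for clarity, not an official data schema of the task.

# Illustrative only: a minimal sketch of the prediction target for one segment.
# Keys, sentence, and offsets are assumptions, not an official schema.
segment_prediction = {
    "target": "Dies ist eine Beispielübersetzung mit einem Fehler.",
    # Character-level offsets into the target sentence (illustrative; see the
    # submission format below for the exact index and severity conventions).
    "error_spans": [
        {"start": 34, "end": 49, "severity": "minor"},
    ],
}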

Training and development data

This year, no training or validation datasets will be released, except for the English-Hindi MQM annotations. Instead, participants will use the datasets from the previous year's shared task, available at wmt-qe-task.github.io/wmt-qe-2023/

Language pair        Annotation    Training and development data
English to German    MQM           See the data from last year 🔗
English to Spanish   MQM           Zero shot. No training or development data will be released.
English to Hindi     MQM           Zero shot. No training or development data will be released.

Baselines

We will use CometKiwi as a baseline.
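For orientation, below is a minimal sketch of scoring segments with CometKiwi via the unbabel-comet package. This is not the official baseline setup: CometKiwi produces segment-level quality scores, and how the organizers adapt it to error-span prediction is not described here. The checkpoint name and arguments are assumptions based on publicly released models (some CometKiwi checkpoints on Hugging Face are gated).

# Unofficial sketch: segment-level scoring with CometKiwi (unbabel-comet package).
# The checkpoint name is an assumption; access to it may require accepting the
# model license on Hugging Face.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

data = [
    {"src": "This is a source sentence.", "mt": "Dies ist ein übersetzter Satz."},
]
# CometKiwi is reference-free: it scores (source, translation) pairs.
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)  # one segment-level quality score per input pair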

Evaluation

The primary evaluation metric will be the F1-score; we also plan to report precision and recall.
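As a rough illustration, a character-level precision/recall/F1 over predicted versus gold error spans can be computed as in the sketch below. The official scoring script may differ (for example, in how span boundaries or error severities are weighted); this only shows the general idea of span-overlap evaluation.

# Illustrative sketch of character-level precision/recall/F1 over error spans.
# Assumes inclusive [start, end] offsets; the official scorer may differ.
def span_chars(spans):
    """Expand [start, end] spans into a set of character positions."""
    chars = set()
    for start, end in spans:
        chars.update(range(start, end + 1))
    return chars

def span_f1(pred_spans, gold_spans):
    pred, gold = span_chars(pred_spans), span_chars(gold_spans)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Example: one predicted span partially overlapping one gold span.
print(span_f1(pred_spans=[(49, 70)], gold_spans=[(49, 60)]))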

Submission format

For the predictions, we expect a single TSV file for each submitted QE system output (submitted online in the respective Codalab competition), named predictions.txt, with the following format:

Line 1: <DISK FOOTPRINT (in bytes, without compression)>

Line 2: <NUMBER OF PARAMETERS>

Line 3: <NUMBER OF ENSEMBLED MODELS> (set to 1 if there is no ensemble)

Lines 4 to n, one line per test sample: <LANGUAGE PAIR> <METHOD NAME> <SEGMENT NUMBER> <TARGET SENTENCE> <ERROR START INDICES> <ERROR END INDICES> <ERROR TYPES>

Where:

  • LANGUAGE PAIR is the ID (e.g. en-de) of the language pair of the plain text translation file you are scoring. Follow the LP naming convention provided in the test set.

  • METHOD NAME is the name of your quality estimation method.

  • SEGMENT NUMBER is the line number of the plain text translation file you are scoring (starting at 0).

  • TARGET SENTENCE is the target sentence from which the error span indices were extracted. Use exactly the target sentence provided in the test set to ensure alignment with the gold labels.

  • ERROR START INDICES are the start indices (character level) of every extracted error span. For multiple error spans, separate the indices with a whitespace. If there are no errors, output -1.

  • ERROR END INDICES are the end indices (character level) of every extracted error span. For multiple error spans, separate the indices with a whitespace. If there are no errors, output -1.

  • ERROR TYPES indicates a minor or major error for each detected error span. The number of labels should match the number of error spans. If there is no error span in a segment, indicate this with no-error.

Each field should be delimited by a single tab (\t) character.
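To make the layout concrete, the sketch below writes a predictions.txt file in this format. It is not an official tool; the metadata values, system name, target sentences, and span offsets are placeholders.

# Minimal, unofficial sketch of writing predictions.txt in the expected layout.
# Metadata values (disk footprint, parameter count, ensemble size) and the
# example predictions are placeholders, not real system outputs.
header_lines = ["2409244995", "2280000000", "1"]  # footprint, #params, #models

predictions = [
    # (lang pair, method, segment no., target sentence, starts, ends, types)
    ("en-de", "my-qe-system", 0, "Dies ist ein Beispielsatz.", [-1], [-1], ["no-error"]),
    ("en-de", "my-qe-system", 1, "Satz mit einem Fehler.", [9, 15], [13, 20], ["minor", "major"]),
]

with open("predictions.txt", "w", encoding="utf-8") as f:
    for line in header_lines:
        f.write(line + "\n")
    for lp, method, seg, target, starts, ends, types in predictions:
        fields = [
            lp,
            method,
            str(seg),
            target,
            " ".join(str(i) for i in starts),
            " ".join(str(i) for i in ends),
            " ".join(types),
        ]
        f.write("\t".join(fields) + "\n")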

Output example

2409244995
2280000000
3
he-en <\t> example-ensemble <\t> 0 <\t> This is a sample translation without errors. <\t> -1 <\t> -1 <\t> no-error
he-en <\t> example-ensemble <\t> 1 <\t> This is a sample translation with a span that is considered major error and another span that is considered minor error. <\t> 49 97 <\t> 70 118 <\t> major minor …
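Before submitting, a quick sanity check of the file can help catch formatting mistakes. The sketch below is unofficial; the checks are assumptions derived from the format described above.

# Unofficial sanity check for a predictions.txt file in the format above.
# Assumes the three metadata lines come first, then one tab-separated
# prediction line per test segment.
def check_predictions(path):
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    assert len(lines) >= 3, "missing metadata lines"
    for row_no, line in enumerate(lines[3:], start=4):
        fields = line.split("\t")
        assert len(fields) == 7, f"line {row_no}: expected 7 tab-separated fields"
        starts, ends, types = fields[4].split(), fields[5].split(), fields[6].split()
        assert len(starts) == len(ends), f"line {row_no}: start/end count mismatch"
        if starts == ["-1"]:
            assert types == ["no-error"], f"line {row_no}: expected no-error"
        else:
            assert len(types) == len(starts), f"line {row_no}: severity count mismatch"

check_predictions("predictions.txt")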