Task description
Following previous years, we organise a sentence-level quality estimation subtask in which the goal is to predict a quality score for each source-target sentence pair. Depending on the language pair, participants will be asked to predict either the direct assessment (DA) score or the multi-dimensional quality metrics (MQM) score. For English-Hindi, participants can predict both scores.
Training and development data
This year, no training and validation datasets will be released, except for the English-Hindi MQM annotations. Instead, participants will use the datasets from the previous year's shared task, available at wmt-qe-task.github.io/wmt-qe-2023/
| Language pair | Annotation | Training and development data |
| --- | --- | --- |
| English to German | MQM | See the data from last year |
| English to Spanish | MQM | Zero shot. No training or development data will be released. |
| English to Hindi | DA | See the data from last year |
| English to Hindi | MQM | Zero shot. No training or development data will be released. |
| English to Gujarati | DA | See the data from last year |
| English to Telugu | DA | See the data from last year |
| English to Tamil | DA | See the data from last year |
Baselines
We will use the following baseline: CometKiwi
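For participants who want to reproduce the baseline locally, the following is a minimal sketch using the Unbabel COMET toolkit; the exact checkpoint name and prediction settings are assumptions, so please refer to the official CometKiwi release for the configuration used in the shared task.

```python
# Minimal sketch: scoring source-target pairs with CometKiwi via the
# Unbabel COMET toolkit (pip install unbabel-comet). The checkpoint name
# below is an assumption; the model may also require Hugging Face access approval.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-cometkiwi-da")  # reference-free QE model
model = load_from_checkpoint(model_path)

# CometKiwi is reference-free: each item needs only the source and the MT output.
data = [
    {"src": "The dog barks.", "mt": "Der Hund bellt."},
    {"src": "Good morning!", "mt": "Guten Morgen!"},
]

output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)  # one predicted quality score per segment
```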
Evaluation
We will use Spearman correlation as primary metric and also compute Kendall and Pearson correlations as secondary metrics.
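For illustration only (this is not the official scoring script), the three correlations can be computed with SciPy given a list of gold human scores and a list of system predictions:

```python
# Illustrative only: computing the primary (Spearman) and secondary
# (Kendall, Pearson) correlations between human scores and predictions.
from scipy.stats import spearmanr, kendalltau, pearsonr

gold = [0.85, 0.42, 0.97, 0.10]  # hypothetical human (DA/MQM) scores
pred = [0.80, 0.50, 0.90, 0.20]  # hypothetical system predictions

spearman, _ = spearmanr(gold, pred)
kendall, _ = kendalltau(gold, pred)
pearson, _ = pearsonr(gold, pred)
print(f"Spearman={spearman:.3f}  Kendall={kendall:.3f}  Pearson={pearson:.3f}")
```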
Following the previous edition, we will evaluate submitted models not only on their correlation with human scores, but also with respect to their robustness to a range of phenomena, from hallucinations and biases to localized errors, which can significantly impact real-world applications. To that end, the provided test sets will include a set of source-target segments with critical errors, for which no additional training data will be provided and which will not count towards the main evaluation. We thus aim to investigate whether submitted models are robust to cases such as significant deviations in meaning, hallucinations, etc.
Note: The evaluation for critical errors will be separate from the main evaluation of quality prediction performance and will not be included in the leaderboard.
Submission format
For the predictions, we expect a single TSV file for each submitted QE system output (submitted online in the respective CodaLab competition), named predictions.txt.
The file should be formatted with the first two lines indicating the model size, the third line indicating the number of ensembled models, and the remaining lines containing the predicted scores, one per line for each sentence, as follows:
Line 1: <DISK FOOTPRINT (in bytes, without compression)>
Line 2: <NUMBER OF PARAMETERS>
Line 3: <NUMBER OF ENSEMBLED MODELS> (set to 1 if there is no ensemble)
Lines 4 to n (one line per test segment): <LANGUAGE PAIR> <DA/MQM> <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE>
Where:
- LANGUAGE PAIR is the ID (e.g. en-de) of the language pair of the plain text translation file you are scoring. Follow the LP naming convention provided in the test set.
- DA/MQM indicates DA or MQM, depending on the type of the test data.
- METHOD NAME is the name of your quality estimation method.
- SEGMENT NUMBER is the line number of the plain text translation file you are scoring (starting at 0).
- SEGMENT SCORE is the predicted numerical (MQM/DA) score for the particular segment.
Each field should be delimited by a single tab (\t) character.
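As an illustration of the expected layout, the following sketch writes a predictions.txt file in the format above; all concrete values (disk footprint, parameter count, method name, scores) are hypothetical placeholders.

```python
# Sketch of writing a predictions.txt file in the format described above.
# All values below are hypothetical placeholders.
scores = [0.81, 0.64, 0.93]  # one predicted score per test segment, in order

with open("predictions.txt", "w", encoding="utf-8") as f:
    f.write("2300000000\n")  # Line 1: disk footprint in bytes, without compression
    f.write("560000000\n")   # Line 2: number of parameters
    f.write("1\n")           # Line 3: number of ensembled models (1 = no ensemble)
    for i, score in enumerate(scores):
        # Lines 4..n: <LANGUAGE PAIR> <DA/MQM> <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE>
        f.write("\t".join(["en-de", "MQM", "my-qe-method", str(i), f"{score:.4f}"]) + "\n")
```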