Task description
Following previous years, we organise a sentence-level quality estimation subtask in which the goal is to predict a quality score for each source-target sentence pair. Depending on the language pair, participants will be asked to predict either the direct assessment (DA) score or the multi-dimensional quality metrics (MQM) score. For English-Hindi, participants can predict both scores.
Training and development data
This year, no training and validation datasets will be released, except for the English-Hindi MQM annotations. Instead, participants will use the datasets from the previous year's shared task, available at wmt-qe-task.github.io/wmt-qe-2023/
| Language pair | Annotation | Training and development data |
| --- | --- | --- |
| English to German | MQM | See the data from last year |
| English to Spanish | MQM | Zero shot. No training or development data will be released. |
| English to Hindi | DA | See the data from last year |
| English to Hindi | MQM | Zero shot. No training or development data will be released. |
| English to Gujarati | DA | See the data from last year |
| English to Telugu | DA | See the data from last year |
| English to Tamil | DA | See the data from last year |
Baselines
We will use the following baseline: CometKiwi
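For participants who want to reproduce the baseline locally, the following is a minimal sketch using the Unbabel COMET toolkit; the exact checkpoint name and prediction settings are assumptions, so please refer to the official CometKiwi release for the configuration used in the shared task.

```python
# Minimal sketch: scoring source-target pairs with CometKiwi via the
# Unbabel COMET toolkit (pip install unbabel-comet). The checkpoint name
# below is an assumption; the model may also require Hugging Face access approval.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-cometkiwi-da")  # reference-free QE model
model = load_from_checkpoint(model_path)

# CometKiwi is reference-free: each item needs only the source and the MT output.
data = [
    {"src": "The dog barks.", "mt": "Der Hund bellt."},
    {"src": "Good morning!", "mt": "Guten Morgen!"},
]

output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)  # one predicted quality score per segment
```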
Evaluation
We will use Spearman correlation as primary metric and also compute Kendall and Pearson correlations as secondary metrics.
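For illustration only (this is not the official scoring script), the three correlations can be computed with SciPy given a list of gold human scores and a list of system predictions:

```python
# Illustrative only: computing the primary (Spearman) and secondary
# (Kendall, Pearson) correlations between human scores and predictions.
from scipy.stats import spearmanr, kendalltau, pearsonr

gold = [0.85, 0.42, 0.97, 0.10]  # hypothetical human (DA/MQM) scores
pred = [0.80, 0.50, 0.90, 0.20]  # hypothetical system predictions

spearman, _ = spearmanr(gold, pred)
kendall, _ = kendalltau(gold, pred)
pearson, _ = pearsonr(gold, pred)
print(f"Spearman={spearman:.3f}  Kendall={kendall:.3f}  Pearson={pearson:.3f}")
```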
Following the previous edition, we will evaluate submitted models not only on their correlation with human scores, but also with respect to their robustness to a range of phenomena, from hallucinations and biases to localized errors, which can significantly impact real-world applications. To that end, the provided test sets will include a set of source-target segments with critical errors, for which no additional training data will be provided and which will not count towards the main evaluation. We thus aim to investigate whether submitted models are robust to cases such as significant deviations in meaning, hallucinations, etc.
Note: The evaluation for critical errors will be separate from the main evaluation of quality prediction performance and will not be included in the leaderboard.
Submission format
For the predictions, we expect a single TSV file for each submitted QE system output (submitted online in the respective CodaLab competition), named predictions.txt.
The file should be formatted with the first two lines indicating the model size, the third line indicating the number of ensembled models, and the remaining lines containing the predicted scores, one per line for each sentence, as follows:
Line 1: <DISK FOOTPRINT (in bytes, without compression)>
Line 2: <NUMBER OF PARAMETERS>
Line 3: <NUMBER OF ENSEMBLED MODELS> (set to 1 if there is no ensemble)
Lines 4 to n (one line per test segment): <LANGUAGE PAIR> <DA/MQM> <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE>
Where:
- LANGUAGE PAIR is the ID (e.g. en-de) of the language pair of the plain text translation file you are scoring. Follow the LP naming convention provided in the test set.
- DA/MQM indicates DA or MQM, depending on the type of the test data.
- METHOD NAME is the name of your quality estimation method.
- SEGMENT NUMBER is the line number of the plain text translation file you are scoring (starting at 0).
- SEGMENT SCORE is the predicted numerical (MQM/DA) score for the particular segment.
Each field should be delimited by a single tab (\t) character.
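As an illustration of the expected layout, the following sketch writes a predictions.txt file in the format above; all concrete values (disk footprint, parameter count, method name, scores) are hypothetical placeholders.

```python
# Sketch of writing a predictions.txt file in the format described above.
# All values below are hypothetical placeholders.
scores = [0.81, 0.64, 0.93]  # one predicted score per test segment, in order

with open("predictions.txt", "w", encoding="utf-8") as f:
    f.write("2300000000\n")  # Line 1: disk footprint in bytes, without compression
    f.write("560000000\n")   # Line 2: number of parameters
    f.write("1\n")           # Line 3: number of ensembled models (1 = no ensemble)
    for i, score in enumerate(scores):
        # Lines 4..n: <LANGUAGE PAIR> <DA/MQM> <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE>
        f.write("\t".join(["en-de", "MQM", "my-qe-method", str(i), f"{score:.4f}"]) + "\n")
```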