Task description
This new task proposes to combine quality estimation (QE) and automatic post-editing (APE) in order to correct the output of machine translation. While evaluating MT output at scale is the primary objective of the QE shared task, this task aims to promote the combination of quality prediction and automatic correction.
In light of this, we invite participants to submit systems capable of automatically generating QE predictions for machine-translated text and the corresponding output corrections. The objective is to explore how quality estimates (possibly at different levels of granularity) can inform post-editing. For instance, global sentence-level QE annotations may guide more or less radical post-editing strategies, while word-level annotations can be used for fine-grained, pinpointed corrections. We also encourage approaches that leverage quality explanations generated by large language models. Although the task focuses on quality-informed APE, we also allow participants to submit APE output without QE predictions, so that the impact of their QE system can be assessed. Submissions without QE predictions will also be considered official (please refer to the submission format below).
The task will focus on two language pairs: English-Hindi and English-Tamil.
Training and development data
To build their systems, participants will be provided with manually annotated training and development data containing sentence-level QE annotations along with manual post-edits. The training and development data comprise, for each source sentence, its automatic translation from a “black-box” MT system unknown to participants, along with the corresponding manual quality annotation (in the form of direct assessment (DA) scores) and manual post-edit. The DA scores in the test data will be generated using COMET (WMT-21-COMET-DA). Participants are free to use their own QE systems to generate these scores.
Synthetic training data for APE for both English-Hindi and English-Tamil is available in our GitHub repository: github.com/WMT-QE-Task/wmt-qe-2024-data/tree/main/train_dev/Task_3.
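For participants who want to produce DA-style scores on their own data, the snippet below is a minimal sketch using the Unbabel comet library. The model identifier used here (wmt21-comet-da, matching the WMT-21-COMET-DA model mentioned above), the example sentences, and the exact shape of the returned object are assumptions and may differ depending on the comet version you install.

```python
# Minimal sketch: scoring (source, MT) pairs with a COMET DA model.
# Requires: pip install unbabel-comet
# The model name "wmt21-comet-da" is an assumption based on the
# WMT-21-COMET-DA reference above; check the comet model list if it fails.
from comet import download_model, load_from_checkpoint

model_path = download_model("wmt21-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {"src": "The weather is nice today.",   # illustrative example only
     "mt": "आज मौसम अच्छा है।",
     "ref": "आज मौसम सुहावना है।"},          # reference-based DA models expect a "ref" field
]

# In recent comet versions predict() returns an object with .scores
# (per-segment) and .system_score (corpus average); older versions
# return a (scores, system_score) tuple.
output = model.predict(data, batch_size=8, gpus=0)
print(output)
```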
Summary of data for 2024 shared task
| Language pair | Sentence-level annotation | Word-level annotation | Train data | Dev data |
|---|---|---|---|---|
| English-Hindi (En-Hi) | DA | Post-edits | DA annotated: 7000 (previously released); Post-edits: 7000 (new dataset) | DA annotated: 1000 (previously released); Post-edits: 1000 (new dataset) |
| English-Tamil (En-Ta) | DA | Post-edits | DA annotated: 7000 (previously released); Post-edits: 7000 (new dataset) | DA annotated: 1000 (previously released); Post-edits: 1000 (new dataset) |
Test data
This task releases two new test sets covering two language directions: English-Hindi and English-Tamil. Both test sets will consist of 1,000 (source, target) pairs.
Evaluation
The primary evaluation metric for this subtask is HTER; additionally, we will use COMET to further assess the corrected output. As in previous rounds of the WMT Automatic Post-Editing task, the HTER calculated between the original MT output and the human post-edits of the test set instances will be used as the baseline (i.e., the baseline corresponds to a system that leaves all test instances unmodified).
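As a rough, unofficial sanity check (the organizers' exact HTER scorer and its tokenization are not specified here), TER between your corrected output and the human post-edits can be computed with sacrebleu; when the references are post-edits, this corresponds to HTER. The sentences below are placeholders.

```python
# Unofficial sketch: approximate HTER with sacrebleu's TER implementation,
# using the human post-edits as references. The official scorer may apply
# different tokenization/normalization, so treat this only as a sanity check.
from sacrebleu.metrics import TER

system_outputs = ["corrected translation 1", "corrected translation 2"]  # your APE output
post_edits     = ["human post-edit 1",       "human post-edit 2"]        # released with the test set

ter = TER()
score = ter.corpus_score(system_outputs, [post_edits])
print(score)        # formatted TER result
print(score.score)  # numeric value; lower is better

# Baseline: leave the MT output unmodified and score it against the post-edits.
mt_outputs = ["original MT 1", "original MT 2"]
print(ter.corpus_score(mt_outputs, [post_edits]).score)
```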
Submission format
For the predictions we expect a single TSV file for each submitted system output (submitted online in the respective CodaLab competition), named predictions.txt.
The file should be formatted with the first two lines indicating the model size, the third line indicating the number of ensembled models, and the remaining lines containing the predictions, one line per test sentence, as follows:
Line 1: <DISK FOOTPRINT (in bytes, without compression)>
Line 2: <NUMBER OF PARAMETERS>
Line 3: <NUMBER OF ENSEMBLED MODELS> (set to 1 if there is no ensemble)
Lines 4 onwards (one line per test sample): <LANGUAGE PAIR> <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <BINARY WORD LEVEL TAGS> <CORRECTED OUTPUT>
Where:
- LANGUAGE PAIR is the ID (e.g., en-hi) of the language pair of the plain text translation file you are scoring. Follow the LP naming convention provided in the test set.
- METHOD NAME is the name of your quality estimation method.
- SEGMENT NUMBER is the line number of the plain text translation file you are scoring (starting at 0).
- SEGMENT SCORE is the predicted numerical (DA) score for the particular segment, or “NA” if a sentence-level QE system has not been used.
- BINARY WORD LEVEL TAGS is a Python list of tags, each either ‘OK’ (no issue) or ‘BAD’ (any issue). If a word-level system is not used, use “NA” as the value instead of the list.
- CORRECTED OUTPUT is the post-edited output generated by your system.
Each field should be delimited by a single tab (\t) character. Please note that the APE output must not contain any tab (\t) characters, as they would corrupt the file format; replace any tabs in the APE output with spaces before submission.
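The helper below is a minimal, hypothetical sketch of how a predictions.txt file in this format could be written. The method name, model statistics, and example predictions are placeholders to be replaced with your system's actual values; tabs inside the corrected output are replaced by spaces as required.

```python
# Hypothetical sketch: writing a predictions.txt file in the required format.
# All values below (disk footprint, parameter count, scores, tags, outputs)
# are placeholders; substitute your system's actual numbers and outputs.
disk_footprint_bytes = 2400000000   # line 1: model size on disk, in bytes, uncompressed
num_parameters       = 600000000    # line 2: number of model parameters
num_ensembled_models = 1            # line 3: set to 1 if there is no ensemble

predictions = [
    # (language pair, method name, segment number, DA score or "NA",
    #  word-level tags as a Python list or "NA", corrected output)
    ("en-hi", "myQEAPE", 0, 0.72, ["OK", "BAD", "OK"], "corrected Hindi output"),
    ("en-hi", "myQEAPE", 1, "NA", "NA", "another corrected output"),
]

with open("predictions.txt", "w", encoding="utf-8") as f:
    f.write(f"{disk_footprint_bytes}\n")
    f.write(f"{num_parameters}\n")
    f.write(f"{num_ensembled_models}\n")
    for lp, method, seg_id, score, tags, output in predictions:
        # Tabs are the field delimiter, so strip them from the APE output.
        output = output.replace("\t", " ")
        f.write(f"{lp}\t{method}\t{seg_id}\t{score}\t{tags}\t{output}\n")
```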