The WMT24 Metrics Shared Task will examine automatic evaluation metrics for machine translation. We will provide you with MT system outputs along with the source text and a human reference translation. We are looking for automatic metric scores for translations at both the system level and the segment level, and we will calculate the correlation of your scores with human judgements at each level.
We invite submissions of reference-free metrics in addition to reference-based metrics.
Have questions or suggestions? Feel free to contact us!
Task Description
We will provide you with the source sentences, the output of machine translation systems, and reference translations.
- Official results: Correlation with MQM scores at the sentence and system level for the following language pairs:
  - English → German
  - English → Spanish
  - Japanese → Chinese
- Secondary evaluation: Correlation with the official WMT human evaluation at the sentence and system level for all language pairs from the General MT task.
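The two evaluation levels above can be sketched in plain Python. This is an illustrative sketch with made-up scores, not the official evaluation code (see the mt-metrics-eval toolkit linked under Important Links); the simplified Kendall-style statistic here simply excludes tied pairs.

```python
# Illustrative sketch (NOT the official evaluation code): how system-level and
# segment-level correlations with human judgements can be computed.
# All metric and human scores below are made-up example numbers.
from itertools import combinations
from statistics import fmean

# scores[system] = per-segment scores for that system's translations
metric = {"sysA": [0.9, 0.7, 0.8], "sysB": [0.5, 0.6, 0.4], "sysC": [0.7, 0.5, 0.6]}
human  = {"sysA": [0.8, 0.6, 0.9], "sysB": [0.4, 0.5, 0.3], "sysC": [0.6, 0.5, 0.5]}
systems = sorted(metric)

def pearson(xs, ys):
    """Pearson correlation coefficient, written out in pure stdlib Python."""
    mx, my = fmean(xs), fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# System level: correlate the per-system average scores.
sys_r = pearson([fmean(metric[s]) for s in systems],
                [fmean(human[s]) for s in systems])

# Segment level: Kendall-style rank correlation over all segments pooled
# across systems (simplified: tied pairs are excluded entirely).
m = [x for s in systems for x in metric[s]]
h = [x for s in systems for x in human[s]]
conc = disc = 0
for (m1, h1), (m2, h2) in combinations(zip(m, h), 2):
    prod = (m1 - m2) * (h1 - h2)
    conc += prod > 0  # concordant: metric and human agree on the ranking
    disc += prod < 0  # discordant: they disagree
seg_tau = (conc - disc) / (conc + disc)

print(f"system-level Pearson: {sys_r:.3f}")
print(f"segment-level Kendall-style tau: {seg_tau:.3f}")
```

A metric that ranks translations exactly as humans do would score 1.0 at both levels; the segment-level statistic is usually the harder one to score well on.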
Important Dates
All deadlines are at the end of the day, Anywhere on Earth (AoE).
Challenge sets submission | July 11
System outputs ready to download | July 23
Metric outputs submission | July 30
Scored challenge sets given to participants for analysis | August 06
Paper submission deadline to WMT | August 20
Paper notification of acceptance | TBA
Paper camera-ready deadline | TBA
Conference | November 12-13
Important Links
mt-metrics-eval: MTME is a simple toolkit for computing correlations between metric scores and human judgements (think of it as sacreBLEU for metric developers). You can also use it to download the most recent test sets.
Training data and previous editions
The WMT Metrics shared task has taken place yearly since 2008. You may want to use data from previous editions to tune or train your metric. The following table provides links to the descriptions, the raw data, and the findings papers of the previous editions:
year | MQM | DA system level | DA segment level | relative ranking | paper | .bib
Subtasks
This year, we have two subtasks:
- Challenge Sets: While other participants focus on building stronger and better metrics, participants in this subtask build challenge sets that expose where metrics fail!
- Span Detection Task: We ask participants to produce error span annotations similar to MQM annotations.
Challenge Set Subtask
As in last year’s challenge sets subtask, the goal of this subtask is for participants to submit test sets with challenging evaluation examples that current metrics do not evaluate well. The subtask is organized into three rounds:
1) Breaking Round: Challenge set participants (Breakers) create challenging examples for metrics. They must send the resulting "challenge sets" to the organizers.
2) Scoring Round: The challenge sets created by Breakers will be sent to all Metrics participants (Builders) to score. Also, the organizers will score all the data with baseline metrics such as BLEU, chrF, BERTScore, COMET, BLEURT, Prism, and YiSi-1.
3) Analysis Round: Breakers will receive their data with all the metrics scores for analysis.
Challenge set submission format
There is a new challenge set format this year:
- One plain-text file with all the source segments (one segment per line)
- One plain-text file with all the reference segments
- One plain-text file with the translation output segments from each translation system (if the challenge set creator has only one system, they will submit one file; if they have two candidate systems, they will submit two separate files)
- One README file with the names of the authors and an e-mail contact
Please name the files:
- {challenge_set_name}.{langpair}.src.txt
- {challenge_set_name}.{langpair}.ref.txt
- {challenge_set_name}.{langpair}.hyp-1.txt
- {challenge_set_name}.readme.txt
'langpair' should follow the format of 'en-cs'. It is also possible to submit a (compressed) archive.
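As a quick sanity check before submitting, the naming scheme above can be validated with a few regular expressions. The helper below is hypothetical (not provided by the organizers), and the exact character classes for the set name and language pair are assumptions:

```python
# Hypothetical helper (not provided by the organizers): check that submission
# file names follow the {challenge_set_name}.{langpair}.{src|ref|hyp-N}.txt
# scheme described above. Character classes for names are an assumption.
import re

PATTERNS = [
    r"^[\w-]+\.[a-z]{2,3}-[a-z]{2,3}\.src\.txt$",
    r"^[\w-]+\.[a-z]{2,3}-[a-z]{2,3}\.ref\.txt$",
    r"^[\w-]+\.[a-z]{2,3}-[a-z]{2,3}\.hyp-\d+\.txt$",  # hyp-1, hyp-2, ...
    r"^[\w-]+\.readme\.txt$",
]

def valid_filename(fname: str) -> bool:
    """Return True if fname matches one of the expected submission patterns."""
    return any(re.match(p, fname) for p in PATTERNS)

files = [
    "my_set.en-cs.src.txt",
    "my_set.en-cs.ref.txt",
    "my_set.en-cs.hyp-1.txt",
    "my_set.readme.txt",
    "my_set.en-cs.hypothesis.txt",  # does not match the scheme
]
for f in files:
    print(f, valid_filename(f))
```

A submission with two candidate systems would then add a second hypothesis file named, e.g., my_set.en-cs.hyp-2.txt.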
Contrary to previous years, the translation outputs will not be labeled as "correct" or "incorrect" at submission time, nor will they be shuffled in a pairwise manner by the organizers. Challenge set developers are encouraged to organize their challenge set in whatever way best suits their analysis.
Challenge set registration
Please register your intended submission with contacts and a short description ahead of time here.
Challenge set submission
Please submit your challenge set(s) here, after having completed the registration above.
Span Detection Subtask
Recent metrics have been improving their interpretability by producing "error spans" similar to MQM annotations. In this subtask, we will look at such metrics and evaluate the quality of their predicted spans by comparing them with real MQM evaluations.
More details will be released soon!
Paper Describing Your Metric
You are invited to submit a short paper (4 to 6 pages) to WMT describing your metric and/or challenge set. Shared task submission description papers are non-archival, and you are not required to submit one. If you do not, we ask that you provide an appropriate reference describing your metric that we can cite in the overview paper.
Organizers
- David Adelani
- Eleftherios Avramidis
- Frédéric Blain
- Marianna Buchicchio
- Sheila Castilho
- Dan Deutsch
- George Foster
- Markus Freitag
- Tom Kocmi
- Alon Lavie
- Chi-kiu Lo
- Nitika Mathur
- Ricardo Rei
- Craig Stewart
- Brian Thompson
- Jiayi Wang
- Chrysoula Zerva
Sponsors
We would like to extend our gratitude to Google and Unbabel for generously sponsoring the MQM human annotations required to run this task.