The WMT24 Metrics Shared Task will examine automatic evaluation metrics for machine translation. We will provide you with MT system outputs along with the source text and a human reference translation. We are looking for automatic metric scores for translations at both the system level and the segment level, and we will calculate the correlation of your scores with human judgements at each level.
We invite submissions of reference-free metrics in addition to reference-based metrics.
Have questions or suggestions? Feel free to contact us!
Task Description
We will provide you with the source sentences, the output of machine translation systems, and reference translations.
- Official results: Correlation with MQM scores at the sentence and system level for the following language pairs:
  - English → German
  - English → Spanish
  - Japanese → Chinese
- Secondary evaluation: Correlation with the official WMT human evaluation at the sentence and system level for all language pairs from the General MT task.
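The two evaluation levels above can be sketched in plain Python. This is an illustrative sketch with made-up scores, not the official evaluation code (see the mt-metrics-eval toolkit linked under Important Links); the simplified Kendall-style statistic here simply excludes tied pairs.

```python
# Illustrative sketch (NOT the official evaluation code): how system-level and
# segment-level correlations with human judgements can be computed.
# All metric and human scores below are made-up example numbers.
from itertools import combinations
from statistics import fmean

# scores[system] = per-segment scores for that system's translations
metric = {"sysA": [0.9, 0.7, 0.8], "sysB": [0.5, 0.6, 0.4], "sysC": [0.7, 0.5, 0.6]}
human  = {"sysA": [0.8, 0.6, 0.9], "sysB": [0.4, 0.5, 0.3], "sysC": [0.6, 0.5, 0.5]}
systems = sorted(metric)

def pearson(xs, ys):
    """Pearson correlation coefficient, written out in pure stdlib Python."""
    mx, my = fmean(xs), fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# System level: correlate the per-system average scores.
sys_r = pearson([fmean(metric[s]) for s in systems],
                [fmean(human[s]) for s in systems])

# Segment level: Kendall-style rank correlation over all segments pooled
# across systems (simplified: tied pairs are excluded entirely).
m = [x for s in systems for x in metric[s]]
h = [x for s in systems for x in human[s]]
conc = disc = 0
for (m1, h1), (m2, h2) in combinations(zip(m, h), 2):
    prod = (m1 - m2) * (h1 - h2)
    conc += prod > 0  # concordant: metric and human agree on the ranking
    disc += prod < 0  # discordant: they disagree
seg_tau = (conc - disc) / (conc + disc)

print(f"system-level Pearson: {sys_r:.3f}")
print(f"segment-level Kendall-style tau: {seg_tau:.3f}")
```

A metric that ranks translations exactly as humans do would score 1.0 at both levels; the segment-level statistic is usually the harder one to score well on.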
Important Dates
All deadlines are at the end of the day, Anywhere on Earth (AoE).
Challenge sets submission | July 11
System outputs ready to download | July 23
Metric outputs submission | July 30
Scored challenge sets given to participants for analysis | August 06
Paper submission deadline to WMT | August 20
Paper notification of acceptance | TBA
Paper camera-ready deadline | TBA
Conference | November 12-13
Important Links
mt-metrics-eval: MTME is a simple toolkit for computing correlations between metric scores and human judgements (think of it as sacreBLEU for metric developers). You can also use it to download the most recent test sets.
Training data and previous editions
The WMT Metrics shared task has taken place yearly since 2008. You may want to use data from previous editions to tune or train your metric. The following table provides links to the descriptions, the raw data, and the findings papers of the previous editions:
year | MQM | DA system level | DA segment level | relative ranking | paper | .bib
Subtasks
This year, we have two subtasks:
- Challenge Sets: While other participants focus on building stronger and better metrics, participants in this subtask build challenge sets that expose where metrics fail!
- Span Detection Task: We ask participants to produce error span annotations similar to MQM annotations.
Challenge Set Subtask
As in last year’s challenge sets subtask, the goal of this subtask is for participants to submit test sets with challenging evaluation examples that current metrics do not evaluate well. The subtask is organized into three rounds:
1) Breaking Round: Challenge set participants (Breakers) create challenging examples for metrics. They must send the resulting "challenge sets" to the organizers.
2) Scoring Round: The challenge sets created by Breakers will be sent to all Metrics participants (Builders) to score. Also, the organizers will score all the data with baseline metrics such as BLEU, chrF, BERTScore, COMET, BLEURT, Prism, and YiSi-1.
3) Analysis Round: Breakers will receive their data with all the metrics scores for analysis.
Challenge set submission format
There is a new challenge set format this year:
- One plain-text file with all the source segments (one segment per line)
- One plain-text file with all the reference segments
- One plain-text file with the translation output segments from each translation system (if the challenge set creator has only one system, they will submit one file; if they have two candidate systems, they will submit two separate files)
- One README file with the names of the authors and an e-mail contact
Please name the files:
- {challenge_set_name}.{langpair}.src.txt
- {challenge_set_name}.{langpair}.ref.txt
- {challenge_set_name}.{langpair}.hyp-1.txt
- {challenge_set_name}.readme.txt
'langpair' should follow the format of 'en-cs'. It is also possible to submit a (compressed) archive.
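As a quick sanity check before submitting, the naming scheme above can be validated with a few regular expressions. The helper below is hypothetical (not provided by the organizers), and the exact character classes for the set name and language pair are assumptions:

```python
# Hypothetical helper (not provided by the organizers): check that submission
# file names follow the {challenge_set_name}.{langpair}.{src|ref|hyp-N}.txt
# scheme described above. Character classes for names are an assumption.
import re

PATTERNS = [
    r"^[\w-]+\.[a-z]{2,3}-[a-z]{2,3}\.src\.txt$",
    r"^[\w-]+\.[a-z]{2,3}-[a-z]{2,3}\.ref\.txt$",
    r"^[\w-]+\.[a-z]{2,3}-[a-z]{2,3}\.hyp-\d+\.txt$",  # hyp-1, hyp-2, ...
    r"^[\w-]+\.readme\.txt$",
]

def valid_filename(fname: str) -> bool:
    """Return True if fname matches one of the expected submission patterns."""
    return any(re.match(p, fname) for p in PATTERNS)

files = [
    "my_set.en-cs.src.txt",
    "my_set.en-cs.ref.txt",
    "my_set.en-cs.hyp-1.txt",
    "my_set.readme.txt",
    "my_set.en-cs.hypothesis.txt",  # does not match the scheme
]
for f in files:
    print(f, valid_filename(f))
```

A submission with two candidate systems would then add a second hypothesis file named, e.g., my_set.en-cs.hyp-2.txt.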
Contrary to previous years, the translation outputs will not be labeled as "correct" or "incorrect" at submission time, nor will they be shuffled in a pairwise manner by the organizers. Challenge set developers are encouraged to organize their challenge set in whatever way best suits their analysis.
Challenge set registration
Please register your intended submission with contacts and a short description ahead of time here.
Challenge set submission
Please submit your challenge set(s) here, after having completed the registration above.
Span Detection Subtask
Recent metrics have been improving their interpretability by producing "error spans" similar to MQM annotations. In this subtask, we will look at such metrics and evaluate the quality of their predicted spans by comparing them with real MQM evaluations.
More details will be released soon!
Paper Describing Your Metric
You are invited to submit a short paper (4 to 6 pages) to WMT describing your metric and/or challenge set. Shared task submission description papers are non-archival, and you are not required to submit one. If you do not, we ask that you provide an appropriate reference describing your metric that we can cite in the overview paper.
Organizers
- David Adelani
- Eleftherios Avramidis
- Frédéric Blain
- Marianna Buchicchio
- Sheila Castilho
- Dan Deutsch
- George Foster
- Markus Freitag
- Tom Kocmi
- Alon Lavie
- Chi-kiu Lo
- Nitika Mathur
- Ricardo Rei
- Craig Stewart
- Brian Thompson
- Jiayi Wang
- Chrysoula Zerva
Sponsors
We would like to extend our gratitude to Google and Unbabel for generously sponsoring the MQM human annotations required to run this task.