EMNLP 2024

NINTH CONFERENCE ON
MACHINE TRANSLATION (WMT24)

November 12-13, 2024
Miami, Florida, USA
 

The WMT24 Metrics Shared Task will examine automatic evaluation metrics for machine translation. We will provide you with MT system outputs along with the source text and a human reference translation. We are looking for automatic metric scores for translations at both the system level and the segment level. We will calculate the system- and segment-level correlations of your scores with human judgements.

We invite submissions of reference-free metrics in addition to reference-based metrics.

Have questions or suggestions? Feel free to contact us!

Task Description

We will provide you with the source sentences, the outputs of the machine translation systems, and the reference translations.

  1. Official results: Correlation with MQM scores at the sentence and system level for the following language pairs:

    • English → German

    • English → Spanish

    • Japanese → Chinese

  2. Secondary Evaluation: Correlation with the official WMT human evaluation at the sentence and system level for all language pairs from the General MT task.
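To make the two evaluation granularities concrete, here is a minimal sketch of how segment- and system-level correlations are typically computed. The scores below are invented for illustration, and Pearson is used here only as an example correlation statistic (the official evaluation may use other statistics):

```python
# Sketch of segment- vs. system-level correlation with human scores.
# All data below is invented toy data for illustration.
from statistics import mean

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Toy metric and human (e.g. MQM) scores: 2 systems x 3 segments.
metric = {"sysA": [0.9, 0.7, 0.8], "sysB": [0.5, 0.6, 0.4]}
human  = {"sysA": [0.8, 0.6, 0.9], "sysB": [0.4, 0.5, 0.3]}

# Segment level: correlate over all (system, segment) pairs.
seg_r = pearson(
    [s for sys in metric for s in metric[sys]],
    [s for sys in metric for s in human[sys]],
)

# System level: average the segment scores per system, then correlate.
sys_r = pearson(
    [mean(metric[sys]) for sys in metric],
    [mean(human[sys]) for sys in metric],
)

print(f"segment-level r = {seg_r:.3f}, system-level r = {sys_r:.3f}")
```

Note that with only two systems the system-level correlation is trivially ±1; in the real evaluation, many systems and segments are involved.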

Important Dates

All deadlines are at the end of the day, Anywhere on Earth (AoE).

Challenge sets submission: July 11
System outputs ready to download: July 23
Metric outputs submission: July 30
Scored challenge sets given to participants for analysis: August 6
Paper submission deadline to WMT: August 20
Notification of paper acceptance: TBA
Camera-ready deadline: TBA
Conference: November 12-13

mt-metrics-eval: MTME is a simple toolkit for computing correlations between metric scores and human judgements (think of it as the sacreBLEU for metric developers). It can also download the most recent test sets.

Training data and previous editions

The WMT Metrics shared task has taken place annually since 2008. You may want to use data from previous editions to tune or train your metric. The following table links to the descriptions, the raw data, and the findings papers of the previous editions:

year (available links among: MQM, DA system level, DA segment level, relative ranking, paper, .bib)

2023: 🔗 🔗 🔗
2022: 🔗 🔗 🔗 🔗 🔗
2021: 🔗 🔗 🔗 🔗 🔗
2020: 🔗 🔗 🔗 🔗 🔗
2019: 🔗 🔗 🔗 🔗
2018: 🔗 🔗 🔗 🔗
2017: 🔗 🔗 🔗 🔗
2016: 🔗 🔗 🔗 🔗
2015: 🔗 🔗 🔗
2014: 🔗 🔗 🔗
2013: 🔗 🔗 🔗
2012: 🔗 🔗 🔗
2011: 🔗 🔗 🔗
2010: 🔗 🔗 🔗
2009: 🔗 🔗 🔗
2008: 🔗 🔗 🔗

Subtasks

This year, we have two subtasks:

  1. Challenge Sets: While other participants are busy building stronger and better metrics, participants in this subtask build challenge sets that identify where metrics fail!

  2. Span Detection Task: We ask participants to produce error span annotations similar to those produced by MQM annotators.

Challenge Set Subtask

As in last year's challenge set subtask, the goal is for participants to submit test sets with challenging evaluation examples that current metrics do not evaluate well. This subtask is organized into three rounds:

1) Breaking Round: Challenge set participants (Breakers) create challenging examples for metrics. They must send the resulting "challenge sets" to the organizers.

2) Scoring Round: The challenge sets created by Breakers will be sent to all Metrics participants (Builders) to score. Also, the organizers will score all the data with baseline metrics such as BLEU, chrF, BERTScore, COMET, BLEURT, Prism, and YiSi-1.

3) Analysis Round: Breakers will receive their data with all the metrics scores for analysis.

Challenge set submission format

There is a new challenge set format this year:

  • One plain text file for all the source segments (one segment per line)

  • One plain text file for all the reference segments

  • One plain text file with the translation output segments from each translation system. (If the challenge set creator has only one system, they will submit one file; if they have two candidate systems, they will submit two separate files.)

  • One README file with the names of the authors and an e-mail contact

Please name the files challenge_set_name.src.txt, challenge_set_name.ref.txt, challenge_set_name.hyp-1.txt, challenge_set_name.readme.txt respectively. It is also possible to submit a (compressed) archive.
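A hypothetical sanity check before submitting: since the format is one segment per line, the source, reference, and all hypothesis files must have the same number of lines. The function name and directory layout below are assumptions for illustration, not part of the official submission tooling:

```python
# Hypothetical pre-submission check: verify that all challenge set
# files contain the same number of line-aligned segments.
from pathlib import Path

def check_challenge_set(directory, name, n_systems=1):
    """Return the segment count if src, ref and all hyp files agree,
    otherwise raise ValueError."""
    d = Path(directory)
    files = [d / f"{name}.src.txt", d / f"{name}.ref.txt"]
    files += [d / f"{name}.hyp-{i}.txt" for i in range(1, n_systems + 1)]
    counts = {f.name: len(f.read_text(encoding="utf-8").splitlines())
              for f in files}
    if len(set(counts.values())) != 1:
        raise ValueError(f"line counts differ: {counts}")
    return next(iter(counts.values()))
```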

Contrary to previous years, the translation outputs will not be labelled as "correct" or "incorrect" at submission time, nor will they be shuffled pairwise by the organizers. Challenge set developers are encouraged to organize their challenge sets in whatever way best suits their analysis.

Challenge set registration

Please register your intended submission with contacts and a short description ahead of time here.

Challenge set submission

Please submit your challenge set(s) here, after having completed the registration above.

Span Detection Subtask

Recent metrics have been improving their interpretability by producing "error spans" similar to MQM annotations. In this subtask, we will look at such metrics and evaluate the quality of those predictions by comparing them with human MQM annotations.
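As an illustration of what comparing predicted and gold error spans can look like, here is a simple character-level overlap F1 between two sets of spans. This is only a sketch under the assumption that spans are given as character offsets; it is not the official evaluation protocol, which has not yet been announced:

```python
# Illustrative (not the official protocol) character-level F1 between
# predicted and gold error spans, each given as (start, end) offsets.
def span_f1(pred, gold):
    """Character-level F1 between two lists of (start, end) spans."""
    pred_chars = {i for s, e in pred for i in range(s, e)}
    gold_chars = {i for s, e in gold for i in range(s, e)}
    if not pred_chars or not gold_chars:
        return 0.0
    overlap = len(pred_chars & gold_chars)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)
```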

More details will be released soon!

Paper Describing Your Metric

You are invited to submit a short paper (4 to 6 pages) to WMT describing your metric and/or challenge set. Shared task submission description papers are non-archival, and submitting one is optional. If you do not submit a paper, we ask that you provide an appropriate reference describing your metric that we can cite in the overview paper.

Organizers

  • David Adelani

  • Eleftherios Avramidis

  • Frédéric Blain

  • Marianna Buchicchio

  • Sheila Castilho

  • Dan Deutsch

  • George Foster

  • Markus Freitag

  • Tom Kocmi

  • Alon Lavie

  • Chi-kiu Lo

  • Nitika Mathur

  • Ricardo Rei

  • Craig Stewart

  • Brian Thompson

  • Jiayi Wang

  • Chrysoula Zerva

Sponsors

We would like to extend our gratitude to Google and Unbabel for generously sponsoring the MQM human annotations required to run this task.