EMNLP 2025

TENTH CONFERENCE ON
MACHINE TRANSLATION (WMT25)

November 5-9, 2025
Suzhou, China
 
TRANSLATION TASKS: GENERAL MT (NEWS) • INDIC MT • TERMINOLOGY • CREOLE MT • MODEL COMPRESSION
EVALUATION TASKS: MT TEST SUITES • (UNIFIED) MT EVALUATION
OTHER TASKS: OPEN DATA
MULTILINGUAL TASKS: MULTILINGUAL INSTRUCTION • LIMITED RESOURCES SLAVIC LLM

ANNOUNCEMENTS

  • 2025-05-17: Detailed description of the task announced

TASK DESCRIPTION

The goal of this task is to predict a quality score for each source–target segment pair in the evaluation set. Depending on the language pair, participants will be asked to predict the numerical score derived from the segment’s ESA or MQM annotation as provided by a human annotator. Submissions will be evaluated and ranked based on their predictions’ correlations with these human-annotated scores at both the segment and system levels.

We welcome submissions covering any or all of the language pairs used in the WMT 2025 General MT task, as follows:

  • The ESA score is a direct assessment primed by the act of annotating precise error spans within the segment. We solicit systems that predict segment-level ESA scores for test sets that we will publish in Czech–German, Czech–Ukrainian, English–Arabic, English–Bhojpuri, English–Chinese, English–Czech, English–Estonian, English–Icelandic, English–Japanese, English–Maasai, English–Russian, English–Serbian, and English–Ukrainian.

  • The MQM score is a direct assessment mathematically derived from the count and severity of precise error spans annotated within the segment. We solicit systems that predict segment-level MQM scores for test sets that we will publish in English–Korean and Japanese–Chinese.
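
For illustration, one common way to derive such a score from annotated spans is to sum severity penalties and negate the total. The Python sketch below assumes the severity weights often used in WMT MQM annotation (minor = 1, major = 5, non-translation = 25); the exact weighting and any normalisation applied to this year’s annotations may differ.

    # Illustrative only: derive a segment-level MQM-style score from annotated
    # error spans. The severity weights below are an assumption based on common
    # WMT practice; the weights used for this year's data may differ.
    SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "non-translation": 25.0}

    def mqm_segment_score(error_spans):
        """error_spans: list of dicts, e.g. {"start": 3, "end": 9, "severity": "major"}."""
        penalty = sum(SEVERITY_WEIGHTS[span["severity"]] for span in error_spans)
        return -penalty  # higher (less negative) = better quality

    # One minor and one major error give a score of -6.0:
    print(mqm_segment_score([{"start": 0, "end": 4, "severity": "minor"},
                             {"start": 10, "end": 18, "severity": "major"}]))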

The test sets for all language pairs except Czech–German and English–Czech will be provided with a reference translation, which submitted systems may therefore use as an additional input.

Participants will also run their automatic score prediction systems on collected “challenge sets” that illustrate particular linguistic phenomena, domains, or even non-WMT language pairs of interest to the developers of the sets. The predicted scores will be returned to the developers of each set for further analysis. See the detailed page on Metrics/QE challenge sets for further details.

Note that, for this task, the only required output is the segment-level numerical scores. Submitted systems are free, however, to derive their scores in any manner that proves useful. In particular, systems may first predict precise error spans (as in Task 2) and use them as the basis for computing the numerical score; systems that perform Tasks 1 and 2 jointly may be entered in both tracks.

TRAINING AND DEVELOPMENT DATA

You are welcome to build your system from any desired training data, foundation model, etc.; we therefore do not release any specific training data. However, labeled data from previous editions of the Metrics and/or QE shared tasks is available to help you train and tune your system if you would like; those resources are summarized and linked below. Note that the language pairs covered in prior years do not exactly match this year’s: some of this year’s pairs have training data available, while others must be handled zero-shot.

  • DA and MQM annotations from prior QE tasks: 2022, 2023, 2024

  • MQM annotations from prior Metrics tasks: 2020–2024

  • DA annotations from prior Metrics tasks: 2016, 2017, 2018, 2019, 2020, 2021, 2022

  • Relative rank annotations from prior Metrics tasks: 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015

For help in tuning, model selection, etc., we recommend using the human-annotated General MT 2024 test set as a development set. We will release an updated version of this set soon.
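
As a concrete starting point, the sketch below trains a very simple segment-level score predictor on prior-year labeled data and checks it on a development set with Pearson correlation. It is purely illustrative: the file names and TSV column layout (“src”, “mt”, “score”) are hypothetical placeholders for whichever prior-year release and development set you actually use, and the embedding-plus-ridge-regression model is only a minimal baseline, not a recommended system.

    # Illustrative only: a minimal score predictor built from multilingual
    # sentence embeddings and ridge regression. File names and the TSV column
    # layout ("src", "mt", "score") are hypothetical; adapt them to the
    # prior-year data and development set you download.
    import numpy as np
    import pandas as pd
    from scipy.stats import pearsonr
    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import Ridge

    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    def featurize(df):
        src = encoder.encode(df["src"].tolist())
        mt = encoder.encode(df["mt"].tolist())
        # Source embedding, translation embedding, and their absolute difference.
        return np.hstack([src, mt, np.abs(src - mt)])

    train = pd.read_csv("prior_year_train.tsv", sep="\t")
    dev = pd.read_csv("generalmt2024_dev.tsv", sep="\t")

    model = Ridge(alpha=1.0).fit(featurize(train), train["score"])
    dev_pred = model.predict(featurize(dev))
    print("Dev segment-level Pearson r = %.3f" % pearsonr(dev_pred, dev["score"])[0])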

SUBMISSION PROCEDURE AND FORMAT

Register your participation by creating an account on Codabench and joining the WMT 2025 segment-level score prediction competition. (The link will be forthcoming.)

At the start of the test week (24 July), you will receive input files for each language pair containing source–target segment pairs for scoring. The official test set will consist of a collection of documents, each divided into segments; each segment, however, will consist of multiple sentences rather than a single one. Contents of the challenge sets will vary.

Submit your systems to the shared task by uploading scored test sets to our Codabench competition by the end of the test week on 31 July. Details on the expected file format will be provided later. Received files will be automatically tracked on a leaderboard for verification purposes.

EVALUATION

We will evaluate the quality of automatic score prediction on the official test set at both the segment level and the system level. Evaluation will be carried out according to a combination of standard and more recently proposed meta-evaluation methodologies to assess correlation with human judgements: Pearson’s r, Spearman’s ρ, Kendall’s τ, soft pairwise accuracy, and tie-calibrated pairwise accuracy.
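
For orientation, the sketch below reproduces the basic correlation statistics with off-the-shelf tools and adds a simple system-level pairwise ranking accuracy. It is only an approximation of the official protocol: the exact grouping of segments, the handling of ties, and the soft and tie-calibrated pairwise accuracy variants follow the official meta-evaluation scripts and may differ from this simplified version.

    # Illustrative only: approximate segment- and system-level meta-evaluation.
    # The official protocol (grouping, tie handling, soft and tie-calibrated
    # pairwise accuracy) may differ from this simplified version.
    from itertools import combinations
    import numpy as np
    from scipy.stats import pearsonr, spearmanr, kendalltau

    def segment_level(pred, gold):
        return {"pearson": pearsonr(pred, gold)[0],
                "spearman": spearmanr(pred, gold)[0],
                "kendall": kendalltau(pred, gold)[0]}

    def system_pairwise_accuracy(pred_by_system, gold_by_system):
        """Both arguments map system name -> list of segment scores; systems
        are compared on their mean scores."""
        pred_mean = {s: np.mean(v) for s, v in pred_by_system.items()}
        gold_mean = {s: np.mean(v) for s, v in gold_by_system.items()}
        agree = total = 0
        for a, b in combinations(pred_mean, 2):
            total += 1
            agree += (pred_mean[a] > pred_mean[b]) == (gold_mean[a] > gold_mean[b])
        return agree / total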

Scored challenge sets will be returned to the individual challenge set developers for evaluation and analysis. Performance on the challenge sets will not be tracked on the Codabench leaderboard or counted as part of the official results.

Note that the official results will be based on correlation with human judgements, which will not be complete until around September. At the time of test set submission (24–31 July), the Codabench leaderboard will display each system’s correlation with pseudo-gold-standard judgements derived automatically using simple metrics. This will allow participants to confirm that their files were uploaded in a valid format and that their system’s performance is broadly reasonable; these scores serve basic verification purposes only and do not reflect the official results.
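
Participants who want a comparable sanity check before uploading can correlate their predictions with a simple reference-based metric locally, for the language pairs that ship with references. The sketch below uses chrF via sacrebleu as a stand-in; the actual metrics used for the leaderboard check are not specified here and may differ.

    # Illustrative only: local sanity check of predicted scores against a
    # chrF-based pseudo-gold, as a stand-in for the leaderboard's (unspecified)
    # simple metrics. Only applicable to pairs provided with references.
    import sacrebleu
    from scipy.stats import pearsonr

    def pseudo_gold_check(predicted_scores, hypotheses, references):
        chrf = [sacrebleu.sentence_chrf(hyp, [ref]).score
                for hyp, ref in zip(hypotheses, references)]
        r = pearsonr(predicted_scores, chrf)[0]
        return r  # a very low or negative value suggests an ordering/format bug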

System description papers must be submitted for initial review by 14 August and in camera-ready format by 25 September. Participants are therefore advised to begin preparation of their papers prior to the test week, using analysis based on a chosen development set rather than the official shared task results.

BASELINES

We will include a number of baseline systems in Codabench and in the official results; the exact selection will be announced later.