EMNLP 2025

TENTH CONFERENCE ON
MACHINE TRANSLATION (WMT25)

November 5-9, 2025
Suzhou, China
 
TRANSLATION TASKS: GENERAL MT (NEWS) • INDIC MT • TERMINOLOGY • CREOLE MT • MODEL COMPRESSION
EVALUATION TASKS: MT TEST SUITES • (UNIFIED) MT EVALUATION
OTHER TASKS: OPEN DATA
MULTILINGUAL TASKS: MULTILINGUAL INSTRUCTION • LIMITED RESOURCES SLAVIC LLM


DESCRIPTION

The goal of this subtask is to predict translation error spans along with their severity labels. For this subtask we use the error spans obtained from the MQM and ESA human annotations generated for the General MT primary task as the target "gold standard". Participants will be asked to predict both the error span (start and end indices) and the error severity (major or minor) for each segment. Minor issues are those that do not impact meaning or usability, whereas major issues are those that impact meaning or usability but do not render the text unusable.
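As an illustration of the task, a minimal sketch of what a per-segment prediction might look like. The record type and field names here are hypothetical (the official submission schema has not yet been released); the point is that each predicted span is a pair of character indices into the translated segment plus a severity label.

```python
from dataclasses import dataclass

# Hypothetical record for one predicted error span; the field names are
# illustrative, NOT the official submission schema (not yet released).
@dataclass
class ErrorSpan:
    start: int     # character index into the translation (inclusive)
    end: int       # character index into the translation (exclusive)
    severity: str  # "major" or "minor"

def extract_span(segment: str, span: ErrorSpan) -> str:
    """Return the text covered by a predicted span, validating the indices."""
    if not (0 <= span.start < span.end <= len(segment)):
        raise ValueError(f"span {span.start}:{span.end} out of range")
    if span.severity not in {"major", "minor"}:
        raise ValueError(f"unknown severity {span.severity!r}")
    return segment[span.start:span.end]

segment = "The cat sat on the mat ."
pred = ErrorSpan(start=8, end=11, severity="minor")
print(extract_span(segment, pred))  # -> sat
```

Validating indices against the segment before submission catches off-by-one errors early, since span boundaries are defined on the target-side text.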

Languages covered

The list below gives the language pairs covered this year (fully parallel to the General MT task) and the human quality annotations that will be used as targets for each language pair:

  • Czech to Ukrainian (ESA)

  • Czech to German (ESA)

  • Japanese to Chinese (MQM)

  • English to Arabic (ESA)

  • English to Chinese (ESA)

  • English to Czech (ESA)

  • English to Estonian (ESA)

  • English to Icelandic (ESA)

  • English to Japanese (ESA)

  • English to Korean (MQM)

  • English to Russian (ESA)

  • English to Serbian (ESA)

  • English to Ukrainian (ESA)

  • English to Bhojpuri (ESA)

  • English to Maasai (ESA)

Language pairs highlighted in green on the task page have development or training sets available from previous editions of the shared task (see below). All language pairs except Czech–German and English–Czech will also be provided with a reference translation.

TRAINING AND DEVELOPMENT SETS

For training and development, participants can use the MQM and ESA annotations released in the previous editions of the WMT shared tasks. Some repositories that contain these datasets are listed below:

We recommend using WMT24 as the development set.

BASELINES

We will provide xCOMET and LLM-as-a-judge baselines for this subtask.

Evaluation

The primary evaluation metric will be an error-severity-weighted F1 score over predicted spans. More details coming soon!
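Since the official metric definition has not been published, the sketch below only illustrates the general shape of a severity-weighted span F1: it assumes character-level span matching and MQM-style severity weights (minor = 1, major = 5), both of which are assumptions, not the organisers' specification.

```python
# Hedged sketch of a severity-weighted span F1. The official definition is
# not yet published; this ASSUMES character-level overlap and MQM-style
# severity weights (minor = 1, major = 5).

WEIGHTS = {"minor": 1.0, "major": 5.0}

def _char_weights(spans, length):
    """Map each character position to its severity weight (0 if unannotated)."""
    w = [0.0] * length
    for start, end, severity in spans:
        for i in range(start, min(end, length)):
            w[i] = max(w[i], WEIGHTS[severity])
    return w

def severity_weighted_f1(pred_spans, gold_spans, length):
    """F1 over weighted character overlap between predicted and gold spans."""
    pred_w = _char_weights(pred_spans, length)
    gold_w = _char_weights(gold_spans, length)
    # Credit each character by the smaller of the two weights, so a minor
    # prediction over a major gold error earns only partial credit.
    overlap = sum(min(p, g) for p, g in zip(pred_w, gold_w))
    pred_total, gold_total = sum(pred_w), sum(gold_w)
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(8, 11, "major")]
print(severity_weighted_f1([(8, 11, "major")], gold, 25))  # exact match -> 1.0
print(severity_weighted_f1([(8, 11, "minor")], gold, 25))  # severity mismatch penalised
```

A scorer along these lines rewards both locating the span and getting its severity right; the exact weighting and matching granularity will be defined by the organisers.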

Submission format

Will be added soon!