EMNLP 2025

TENTH CONFERENCE ON
MACHINE TRANSLATION (WMT25)

November 5-9, 2025
Suzhou, China
 
TRANSLATION TASKS: GENERAL MT (NEWS) •︎ INDIC MT •︎ TERMINOLOGY •︎ CREOLE MT •︎ MODEL COMPRESSION
EVALUATION TASKS: MT TEST SUITES •︎ (UNIFIED) MT EVALUATION
OTHER TASKS: OPEN DATA
MULTILINGUAL TASKS: MULTILINGUAL INSTRUCTION •︎ LIMITED RESOURCES SLAVIC LLM

Announcements and Important Dates

All dates are end of Anywhere on Earth (AoE).

Task details finalized, data snippets and evaluation measures released

7th May 2025

Task details finalized, dev data released

20th May 2025

Test data released and task starts

15th June 2025

Submission deadline

15th July 2025

Paper submission to WMT25

in-line with WMT25

Camera-ready submission to WMT25

in-line with WMT25

Conference in Suzhou, China

05-09 November 2025

Overview

Advances in neural MT and LLM-assisted translation over the last decade have brought near-human quality in general-domain translation, at least for high-resource languages. However, in specialized domains such as science, finance, or legal texts, where the correct and consistent use of special terms is crucial, the task is far from solved. The WMT 2025 Terminology Shared Task aims to assess the extent to which machine translation models can utilize additional information regarding the translation of terminology. Building on the two previous editions in 2021 and 2023, we are hosting the task again in 2025. The new test data cover a wider variety of test cases, are more consistent in domain for each translation direction, and are broader in language coverage.

Task Description 

We are arranging two tracks that differ by the size of the input texts and the terminologies. The evaluation criteria will be the same for both tracks.

Track №1: Sentence/Paragraph-Level Translation

Setup

You will be provided with chunks of input text (each one to several sentences long) and small terminology dictionaries containing only the terms present in the given chunk. A data snippet is shown below. Your system is expected to work both with explicit terminology from a dictionary and without any terminology. You are expected to run your system in three modes: given no terminology, given proper terminology, and given random terminology (see the explanation below).

Data Snippet - jsonl

[
    {
        "en": "At its February meeting, the Governing Council also established modalities for reducing Eurosystem securities holdings under the APP. This followed on from its December 2022 decision to no longer reinvest the principal payments from maturing securities in full from March onwards, so that the APP portfolio would decrease by a monthly average of €15 billion from March to June 2023, with the subsequent reduction pace to be determined later. Corporate bond reinvestments would be tilted more strongly towards issuers with a better climate performance. ",
        "es": "En su reunión de febrero, el Consejo de Gobierno también estableció las modalidades para reducir las tenencias de valores mantenidos por el Eurosistema en el marco del APP, tras la decisión adoptada en su reunión de diciembre de 2022 de dejar de reinvertir íntegramente el principal de los valores que fueran venciendo a partir de marzo, de modo que la cartera del APP disminuiría, en promedio, en 15mm de euros al mes entre marzo y junio de 2023, y su ritmo posterior se determinaría más adelante. Las reinversiones de bonos corporativos se inclinarían en mayor medida hacia emisores con mejor comportamiento climático. ",
        "proper_terms": {
            "Governing Council": "Consejo de Gobierno",
            "Eurosystem": "Eurosistema",
            "APP": "APP"
        },
        "random_terms": {
            "February": "febrero",
            "more strongly": "en mayor medida",
            "its": "su"
        }
    },
    {
        "en": "Open the consumption model containing the measures and attributes you want to include in your perspective, and click the  Perspectives tab.",
        "de": "Öffnen Sie das Verbrauchsmodell mit den Kennzahlen und Attribute, die Sie in Ihre Perspektive aufnehmen möchten, un wechseln Sie zur Registerkarte Perspektiven.",
        "proper_terms": {
            "consumption model": "Verbrauchsmodell"
        },
        "random_terms": {
            "include": "aufnehmen",
            "want": "möchten"
        }
    }
]
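As an illustration, the snippet above could be read roughly as follows. This is our own sketch, not part of the task release: it assumes, as in the example, that each entry is a JSON object whose non-terminology keys are the two language codes and that the source side is English; the function name is hypothetical.

```python
import json

TERM_FIELDS = {"proper_terms", "random_terms"}

def parse_entries(raw: str):
    """Parse a snippet file (a JSON array of entries) into
    (src_lang, tgt_lang, src_text, proper_terms, random_terms) tuples.
    Assumes the source language key is "en" and the other
    non-terminology key is the target language."""
    entries = []
    for item in json.loads(raw):
        langs = [k for k in item if k not in TERM_FIELDS]
        src = "en"
        tgt = next(k for k in langs if k != "en")
        entries.append((src, tgt, item[src],
                        item["proper_terms"], item["random_terms"]))
    return entries

sample = '''[
  {"en": "Open the consumption model.",
   "de": "Öffnen Sie das Verbrauchsmodell.",
   "proper_terms": {"consumption model": "Verbrauchsmodell"},
   "random_terms": {"Open": "Öffnen"}}
]'''

for src, tgt, text, proper, random_terms in parse_entries(sample):
    print(src, tgt, proper)
```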

Language Pairs

The language pairs include: 

  • en-de and de-en (English → German and vice versa)

  • en-ru and ru-en (English → Russian and vice versa)

  • en-es and es-en (English → Spanish and vice versa)

  • en-Hans and Hans-en (English → Simplified Chinese and vice versa)

In addition, we will announce further pairs involving lower-resource European languages (in the direction en→lowres) when the test set is published.

Domains: 

  • finance

  • information technology

Track №2: Document-Level Translation

The setup is similar to Track №1, with two exceptions: each input text is now a whole document, and the dictionaries correspond to the whole set of input texts (i.e. they are corpus-level). This brings the task closer to a real-life setup, where dictionaries exist independently of the texts, though it may complicate implementation, since solutions that store the whole dictionary will require more memory. In addition, in the whole-document setup, the consistent usage of terms becomes more important.

Data Overview

The data format will be similar to that of Track №1; the difference is the input length. For this track, we expect your systems to process inputs of 2k tokens, so that a whole text can be covered in one batch.

Language Pairs

  • en-zh-Hant (English → Traditional Chinese)

  • zh-Hant-en (Traditional Chinese → English)

Domains: 

  • finance

Evaluation 

Terminology Modes

You are expected to compare your system’s performance under three modes:

  1. No terminology: the system is only provided with input sentences/documents.

  2. Proper terminology: the system is provided with input texts (same as 1.) and dictionaries of the format {source_term: target_term}.

  3. Random terminology: the system is provided with input texts and translation dictionaries of the same format as in 2. The difference is that the dictionary items are not special terms but words randomly drawn from the input texts. This mode is of special interest because we want to measure to what extent proper term translations improve system performance (mode 2), as opposed to arbitrary broader input that does not contain the domain-specific terminology.

In the JSON files, you will be provided with both proper and random terminology dictionaries. For mode 1, ignore them; for modes 2 and 3, use the corresponding dictionary.
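The mode handling above can be sketched as a small helper. The function and mode names here are illustrative, not part of the task specification:

```python
def terms_for_mode(entry: dict, mode: str) -> dict:
    """Return the dictionary a system should receive for a given mode.
    Modes: "none" -> no terminology, "proper" -> proper_terms,
    "random" -> random_terms."""
    if mode == "none":
        return {}
    if mode == "proper":
        return entry["proper_terms"]
    if mode == "random":
        return entry["random_terms"]
    raise ValueError(f"unknown mode: {mode}")

entry = {"proper_terms": {"APP": "APP"}, "random_terms": {"its": "su"}}
print(terms_for_mode(entry, "proper"))  # {'APP': 'APP'}
```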

Metrics

The submissions will be evaluated based on:

  1. Overall Translation Quality: we will evaluate general aspects of machine translation outputs such as fluency, adequacy and grammaticality, using general-purpose automatic MT metrics such as BLEU or COMET. In addition, we will pay special attention to the grammaticality of the translated terms.

  2. Terminology Success Rate: this metric assesses the system's ability to accurately translate technical terms given the specialized vocabulary. It will be computed by checking the occurrences of the correct term translations (i.e. those present in the dictionary) in the system outputs. The goal is a higher success rate, which indicates adherence to the dictionary translations.
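One plausible way to compute such a success rate is a simple substring check over the outputs. This is our sketch only: the organizers' exact matching rules (e.g. case folding or handling of inflected forms) may differ.

```python
def term_success_rate(outputs, term_dicts):
    """Fraction of dictionary target terms that appear verbatim in the
    corresponding system output, over all segments.
    outputs: list of translated segments.
    term_dicts: parallel list of {source_term: target_term} dicts."""
    hits = total = 0
    for out, terms in zip(outputs, term_dicts):
        for tgt in terms.values():
            total += 1
            if tgt in out:
                hits += 1
    return hits / total if total else 0.0

outputs = ["El Consejo de Gobierno decidió.", "Das Modell ist neu."]
dicts = [{"Governing Council": "Consejo de Gobierno"},
         {"consumption model": "Verbrauchsmodell"}]
print(term_success_rate(outputs, dicts))  # 0.5
```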

  3. Terminology Consistency: for domains such as science or legal texts, the consistent use of an introduced term throughout the text is crucial. In other words, we want a system not only to pick a correct term in the target language, but to use it consistently once it is chosen. This will be evaluated by comparing all translations of a given source term in a text and measuring the percentage of deviations from the most frequent translation. This metric is more important for the Document-Level track, but it will be used for both tracks.
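Consistency could be approximated as follows. This is an illustrative sketch, not the official metric: it counts how often each source term's rendering deviates from its most frequent rendering in a document, averaged over terms.

```python
from collections import Counter

def consistency_deviation(translations_per_term):
    """translations_per_term maps a source term to the list of target
    renderings observed across a document. Returns the share of
    occurrences deviating from the majority rendering, averaged over
    terms (0.0 = perfectly consistent)."""
    if not translations_per_term:
        return 0.0
    deviations = []
    for renderings in translations_per_term.values():
        majority = Counter(renderings).most_common(1)[0][1]
        deviations.append(1 - majority / len(renderings))
    return sum(deviations) / len(deviations)

obs = {"consumption model":
       ["Verbrauchsmodell", "Verbrauchsmodell", "Konsummodell"]}
print(round(consistency_deviation(obs), 3))  # 0.333
```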

Using three metrics makes the comparison multidimensional. To reduce the dimensionality, we plan to use the Pareto frontier over Overall Translation Quality and Terminology Success Rate: the solutions that end up on the frontier will be considered optimal.
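Such a frontier over the two dimensions can be computed with a generic dominance check, sketched below under the assumption that both quality and success rate are higher-is-better scores:

```python
def pareto_frontier(points):
    """points: list of (quality, success_rate) tuples, both
    higher-is-better. Returns the subset of points not dominated by
    any other point (at least as good in both dimensions)."""
    frontier = []
    for p in points:
        dominated = any(q[0] >= p[0] and q[1] >= p[1] and q != p
                        for q in points)
        if not dominated:
            frontier.append(p)
    return frontier

# Hypothetical (quality, success_rate) scores for four systems.
systems = [(0.80, 0.90), (0.85, 0.70), (0.78, 0.95), (0.70, 0.60)]
print(pareto_frontier(systems))  # the last system is dominated
```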

Participation

As terminology translation is a highly applicable task, we encourage participation from both academic researchers and industrial practitioners. You may participate in any translation direction with any modeling approach you prefer, including but not limited to:

  • lexically constrained decoding,

  • large language model fine-tuning and/or prompting,

  • translation editing and refinement.

Participants will have the option to publish a system description paper (4 to 6 pages) at WMT. Otherwise, we kindly ask you to provide a brief description of your approach with your test submission.

Submission Guidelines

To be announced at test data publication.

Organizers

  •   Nathaniel Berger (Heidelberg University)

  •   Pinzhen Chen (University of Edinburgh & Aveni.ai)

  •   Xu Huang (Nanjing University)

  •   Arturo Oncevay (JP Morgan)

  •   Kirill Semenov (University of Zurich), main contact: firstname.lastname@uzh.ch

  •   Dawei Zhu

  •   Vilém Zouhar (ETH Zurich)     

Acknowledgements

To be updated.

F.A.Q.

  1. Do I have to submit results for all language pairs?

No, but we highly encourage it as it makes comparisons fairer. If you already have a system for a specific language pair, replicating it (without optimizing hyperparameters) for another language pair should be very easy and yield a good demonstration of your method. Also, since terminology consistency is an under-studied field, we are especially interested in understanding the replicability of systems' performance across languages from different families and writing systems. Thus, by submitting results for all language pairs, you have a chance to contribute new insight to the general understanding of terminology-assisted translation!