EMNLP 2025

TENTH CONFERENCE ON
MACHINE TRANSLATION (WMT25)

November 5-9, 2025
Suzhou, China
 
TRANSLATION TASKS: GENERAL MT (NEWS) •︎ INDIC MT •︎ TERMINOLOGY •︎ CREOLE MT •︎ MODEL COMPRESSION
EVALUATION TASKS: MT TEST SUITES •︎ (UNIFIED) MT EVALUATION
OTHER TASKS: OPEN DATA
MULTILINGUAL TASKS: MULTILINGUAL INSTRUCTION •︎ LIMITED RESOURCES SLAVIC LLM

Announcements and Important Dates

All dates are end of Anywhere on Earth (AoE).

Task details finalized, data snippets and evaluation measures released

7th May 2025

Task details finalized, dev data released

20th May 2025

Test data released and task starts

15th June 2025

Submission deadline

15th July 2025

Paper submission to WMT25

in-line with WMT25

Camera-ready submission to WMT25

in-line with WMT25

Conference in Suzhou, China

05-09 November 2025

Overview

Advances in neural MT and LLM-assisted translation over the last decade have brought near-human quality in general-domain translation, at least for high-resource languages. However, in specialized domains such as science, finance, or legal texts, where the correct and consistent use of special terms is crucial, the task is far from solved. The WMT 2025 Terminology Shared Task aims to assess the extent to which machine translation models can utilize additional information regarding the translation of terminology. Building on the two previous editions in 2021 and 2023, we are hosting the task again in 2025. The new test data cover a wider variety of test cases, are more consistent in domain for each translation direction, and are broader in language coverage.

Task Description 

We are arranging two tracks that differ by the size of the input texts and the terminologies. The evaluation criteria will be the same for both tracks.

Track №1: Sentence/Paragraph-Level Translation

Setup

You will be provided with chunks of input text (each one to several sentences long) and small terminology dictionaries containing only the terms present in the given chunk. A data snippet is shown below. Your system is expected to work both with explicit terminology from a dictionary and without any terminology. You are expected to run your system in three modes: given no terminology, given proper terminology, and given random terminology (see the explanation below).

Data Snippet - jsonl

[
    {
        "en": "At its February meeting, the Governing Council also established modalities for reducing Eurosystem securities holdings under the APP. This followed on from its December 2022 decision to no longer reinvest the principal payments from maturing securities in full from March onwards, so that the APP portfolio would decrease by a monthly average of €15 billion from March to June 2023, with the subsequent reduction pace to be determined later. Corporate bond reinvestments would be tilted more strongly towards issuers with a better climate performance. ",
        "es": "En su reunión de febrero, el Consejo de Gobierno también estableció las modalidades para reducir las tenencias de valores mantenidos por el Eurosistema en el marco del APP, tras la decisión adoptada en su reunión de diciembre de 2022 de dejar de reinvertir íntegramente el principal de los valores que fueran venciendo a partir de marzo, de modo que la cartera del APP disminuiría, en promedio, en 15mm de euros al mes entre marzo y junio de 2023, y su ritmo posterior se determinaría más adelante. Las reinversiones de bonos corporativos se inclinarían en mayor medida hacia emisores con mejor comportamiento climático. ",
        "proper_terms": {
            "Governing Council": "Consejo de Gobierno",
            "Eurosystem": "Eurosistema",
            "APP": "APP"
        },
        "random_terms": {
            "February": "febrero",
            "more strongly": "en mayor medida",
            "its": "su"
        }
    },
    {
        "en": "Open the consumption model containing the measures and attributes you want to include in your perspective, and click the  Perspectives tab.",
        "de": "Öffnen Sie das Verbrauchsmodell mit den Kennzahlen und Attribute, die Sie in Ihre Perspektive aufnehmen möchten, un wechseln Sie zur Registerkarte Perspektiven.",
        "proper_terms": {
            "consumption model": "Verbrauchsmodell"
        },
        "random_terms": {
            "include": "aufnehmen",
            "want": "möchten"
        }
    }
]
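As an illustration, the snippet above could be read roughly as follows. This is our own sketch, not part of the task release: it assumes, as in the example, that each entry is a JSON object whose non-terminology keys are the two language codes and that the source side is English; the function name is hypothetical.

```python
import json

TERM_FIELDS = {"proper_terms", "random_terms"}

def parse_entries(raw: str):
    """Parse a snippet file (a JSON array of entries) into
    (src_lang, tgt_lang, src_text, proper_terms, random_terms) tuples.
    Assumes the source language key is "en" and the other
    non-terminology key is the target language."""
    entries = []
    for item in json.loads(raw):
        langs = [k for k in item if k not in TERM_FIELDS]
        src = "en"
        tgt = next(k for k in langs if k != "en")
        entries.append((src, tgt, item[src],
                        item["proper_terms"], item["random_terms"]))
    return entries

sample = '''[
  {"en": "Open the consumption model.",
   "de": "Öffnen Sie das Verbrauchsmodell.",
   "proper_terms": {"consumption model": "Verbrauchsmodell"},
   "random_terms": {"Open": "Öffnen"}}
]'''

for src, tgt, text, proper, random_terms in parse_entries(sample):
    print(src, tgt, proper)
```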

Language Pairs

The language pairs include: 

  • en-de and de-en (English → German and vice versa)

  • en-ru and ru-en (English → Russian and vice versa)

  • en-es and es-en (English → Spanish and vice versa)

  • en-Hans and Hans-en (English → Simplified Chinese and vice versa)

In addition, we will announce further pairs involving lower-resource European languages (in the direction en→lowres) when the test set is published.

Domains: 

  • finance

  • information technology

Track №2: Document-Level Translation

The setup is similar to Track №1, with two exceptions: each input text is now a whole document, and the dictionaries correspond to the whole set of input texts (i.e. they are corpus-level). This brings the task closer to a real-life setup, where dictionaries exist independently of the texts, though it may complicate implementation, since solutions that store the whole dictionary will require more memory. In addition, in the whole-document setup, the consistent usage of terms becomes more important.

Data Overview

The data format will be similar to that of Track №1; the difference is the input length. For this track, we expect your systems to process inputs of 2k tokens, so that a whole text can be covered in one batch.

Language Pairs

  • en-zh-Hant (English → Traditional Chinese)

  • zh-Hant-en (Traditional Chinese → English)

Domains: 

  • finance

Evaluation 

Terminology Modes

You are expected to compare your system’s performance under three modes:

  1. No terminology: the system is only provided with input sentences/documents.

  2. Proper terminology: the system is provided with input texts (same as 1.) and dictionaries of the format {source_term: target_term}.

  3. Random terminology: the system is provided with input texts and translation dictionaries of the same format as in 2. The difference is that the dictionary items are not special terms but words randomly drawn from the input texts. This mode is of special interest because we want to measure to what extent proper term translations improve system performance (mode 2), as opposed to arbitrary broader input that does not contain the domain-specific terminology.

In the JSON files, you will be provided with both proper and random terminology dictionaries. For mode 1, ignore them; for modes 2 and 3, use the corresponding dictionary.
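The mode handling above can be sketched as a small helper. The function and mode names here are illustrative, not part of the task specification:

```python
def terms_for_mode(entry: dict, mode: str) -> dict:
    """Return the dictionary a system should receive for a given mode.
    Modes: "none" -> no terminology, "proper" -> proper_terms,
    "random" -> random_terms."""
    if mode == "none":
        return {}
    if mode == "proper":
        return entry["proper_terms"]
    if mode == "random":
        return entry["random_terms"]
    raise ValueError(f"unknown mode: {mode}")

entry = {"proper_terms": {"APP": "APP"}, "random_terms": {"its": "su"}}
print(terms_for_mode(entry, "proper"))  # {'APP': 'APP'}
```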

Metrics

The submissions will be evaluated based on:

  1. Overall Translation Quality: we will evaluate general aspects of machine translation outputs such as fluency, adequacy and grammaticality, using general-purpose automatic MT metrics such as BLEU or COMET. In addition, we will pay special attention to the grammaticality of the translated terms.

  2. Terminology Success Rate: this metric assesses the system's ability to accurately translate technical terms given the specialized vocabulary. It will be computed by checking the occurrences of the correct term translations (i.e. those present in the dictionary) in the system outputs. The goal is a higher success rate, which indicates adherence to the dictionary translations.
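One plausible way to compute such a success rate is a simple substring check over the outputs. This is our sketch only: the organizers' exact matching rules (e.g. case folding or handling of inflected forms) may differ.

```python
def term_success_rate(outputs, term_dicts):
    """Fraction of dictionary target terms that appear verbatim in the
    corresponding system output, over all segments.
    outputs: list of translated segments.
    term_dicts: parallel list of {source_term: target_term} dicts."""
    hits = total = 0
    for out, terms in zip(outputs, term_dicts):
        for tgt in terms.values():
            total += 1
            if tgt in out:
                hits += 1
    return hits / total if total else 0.0

outputs = ["El Consejo de Gobierno decidió.", "Das Modell ist neu."]
dicts = [{"Governing Council": "Consejo de Gobierno"},
         {"consumption model": "Verbrauchsmodell"}]
print(term_success_rate(outputs, dicts))  # 0.5
```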

  3. Terminology Consistency: for domains such as science or legal texts, the consistent use of an introduced term throughout the text is crucial. In other words, we want a system not only to pick a correct term in the target language, but to use it consistently once it is chosen. This will be evaluated by comparing all translations of a given source term in a text and measuring the percentage of deviations from the most frequent translation. This metric is more important for the Document-Level track, but it will be used for both tracks.
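Consistency could be approximated as follows. This is an illustrative sketch, not the official metric: it counts how often each source term's rendering deviates from its most frequent rendering in a document, averaged over terms.

```python
from collections import Counter

def consistency_deviation(translations_per_term):
    """translations_per_term maps a source term to the list of target
    renderings observed across a document. Returns the share of
    occurrences deviating from the majority rendering, averaged over
    terms (0.0 = perfectly consistent)."""
    if not translations_per_term:
        return 0.0
    deviations = []
    for renderings in translations_per_term.values():
        majority = Counter(renderings).most_common(1)[0][1]
        deviations.append(1 - majority / len(renderings))
    return sum(deviations) / len(deviations)

obs = {"consumption model":
       ["Verbrauchsmodell", "Verbrauchsmodell", "Konsummodell"]}
print(round(consistency_deviation(obs), 3))  # 0.333
```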

Using three metrics makes the comparison multidimensional. To reduce the dimensionality, we plan to use the Pareto frontier over Overall Translation Quality and Terminology Success Rate: the solutions that end up on the frontier will be considered optimal.
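Such a frontier over the two dimensions can be computed with a generic dominance check, sketched below under the assumption that both quality and success rate are higher-is-better scores:

```python
def pareto_frontier(points):
    """points: list of (quality, success_rate) tuples, both
    higher-is-better. Returns the subset of points not dominated by
    any other point (at least as good in both dimensions)."""
    frontier = []
    for p in points:
        dominated = any(q[0] >= p[0] and q[1] >= p[1] and q != p
                        for q in points)
        if not dominated:
            frontier.append(p)
    return frontier

# Hypothetical (quality, success_rate) scores for four systems.
systems = [(0.80, 0.90), (0.85, 0.70), (0.78, 0.95), (0.70, 0.60)]
print(pareto_frontier(systems))  # the last system is dominated
```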

Participation

As terminology translation is a highly applicable task, we encourage participation from both academic researchers and industrial practitioners. You may participate in any translation direction with any modeling approach you prefer, including but not limited to:

  • lexically constrained decoding,

  • large language model fine-tuning and/or prompting,

  • translation editing and refinement.

Participants will have the option to publish a system description paper (4 to 6 pages) at WMT. Otherwise, we kindly ask you to provide a brief description of your approach with your test submission.

Submission Guidelines

To be announced at test data publication.

Organizers

  •   Nathaniel Berger (Heidelberg University)

  •   Pinzhen Chen (University of Edinburgh & Aveni.ai)

  •   Xu Huang (Nanjing University)

  •   Arturo Oncevay (JP Morgan)

  •   Kirill Semenov (University of Zurich), main contact: firstname.lastname@uzh.ch

  •   Dawei Zhu

  •   Vilém Zouhar (ETH Zurich)     

Acknowledgements

To be updated.

F.A.Q.

  1. Do I have to submit results for all language pairs?

No, but we highly encourage it as it makes comparisons fairer. If you already have a system for a specific language pair, replicating it (without optimizing hyperparameters) for another language pair should be very easy and yield a good demonstration of your method. Also, since terminology consistency is an under-studied field, we are especially interested in understanding the replicability of systems' performance across languages from different families and writing systems. Thus, by submitting results for all language pairs, you have a chance to contribute new insight to the general understanding of terminology-assisted translation!