Terminology Translation Task

EMNLP 2026

ELEVENTH CONFERENCE ON
MACHINE TRANSLATION (WMT26)

28-29 October, 2026
Budapest, Hungary


Announcements

January 2026: Terminology shared task under preparation, will take place with WMT 2026
June 2026: We have published the main information about the shared task! Fill in our pre-registration form, and we’ll remind you of the important dates!
❗30th June 2026❗: We have published the test data and input/output guidelines for our data! You can find them here or in the respective sections below for Track 1 and Track 2. 🎉🎉This marks the start of our shared task!🎉🎉

More information will be announced soon.

Important Dates

All deadlines are at the end of the day, Anywhere on Earth (AoE).

Pre-registration form released

1st June 2026

Data snippets and evaluation measures released

5th June 2026

Task details finalized

~~15th June 2026~~ 25th June 2026

Test data released, task starts

~~25th June 2026~~ 30th June 2026

Validation code released, submission form opened

~~25th June 2026~~ 1st July 2026

Translation submission deadline

~~25th June 2026~~ 31st July 2026

Paper submission deadline

in line with WMT26

Camera-ready submission deadline

in line with WMT26

Conference in Budapest, Hungary

October 2026

Overview

While general-purpose MT systems have shown near-human performance, at least for high-resource languages, translation of terminology-heavy texts (science, technology, and legal texts) is still far from saturated. The results of last year’s shared task showed that it is particularly hard to balance general MT quality against accurate terminology translation. In addition, while systems perform almost perfectly at the sentence level, document-level terminology-aware MT leaves much to be desired. To assess progress in terminology-aware machine translation, we are organizing the fourth Terminology Shared Task, which will take place at WMT26. Compared to the previous three iterations (in 2021, 2023, and 2025), this year’s competition expands to lower-resource languages with rich morphology (including Polish and Basque). For continuity with last year’s shared task, one of the competition tracks follows the same setup (document-level MT with an explicit dictionary); at the same time, we have replaced the simpler sentence-level MT task with a more challenging "seed bitexts + document-level MT" track.

Task Description

We are organizing two tracks that differ in the format of the input data and in the number of steps required from participants.

Track №1: Document-Level Translation with Explicit Dictionary

You will be provided with 1) chunks of input text (each chunk is approximately 2000 words, i.e., a short document); and 2) terminology dictionaries for the whole set of input texts (i.e., they are corpus-level). A data snippet is shown below. Your system is expected to work both with explicit terminology from the dictionary and without any terminology. You are expected to run your system in three modes: with no terminology, with proper terminology, and with random terminology (see the explanation below).

Data Snippet:

You will have inputs in two files: documents to translate and dictionaries with terminologies.

Documents: JSON with a list of source texts for a given domain

[
    "1.0 Primary Actuation and Structural Support The powertrain regulates the foundational kinetic element of the entire air-fuel exchange mechanism with a camshaft.\nThe geometry of the camshaft governs torque delivery, peak horsepower, and idle stability by strictly controlling valve events.\nEngineers extract precise valve lifts from an eccentric cam, which is machined directly into the camshaft at meticulously calculated intervals...",
    ...
]

Dictionaries: JSON with two domain-specific term dictionaries: random and proper terminologies

    {
        "proper": {
            "camshaft": "wał rozrządu",
            "cam": "krzywka",
            ...
        },
        "random": {
            "entire": "cały",
            "Structural": "strukturalne",
            ...
        }
    }

Expected output: JSON with a list of target texts

[
    "1.1 Główne uruchomienie i wsparcie strukturalne Układ napędowy kontroluje podstawowy element kinetyczny w całym mechanizmie wymiany mieszanki powietrzno- paliwowej za pomocą wału rozrządu.\nGeometria wału rozrządu wpływa na sposób przekazania momentu obrotowego, maksymalną moc silnika oraz stabilność pracy na biegu jałowym poprzez dokładne sterowanie pracą zaworów.\nInżynierowie uzyskują dokładne uniesienie zaworów dzięki mimośrodowej krzywce, która jest skrawana bezpośrednio w wale rozrządu w dokładnie obliczonych odstępach...",
    ...
]

The detailed data description and links to files are provided in this Google Doc
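
As a rough illustration of the Track 1 workflow, here is a minimal Python sketch that runs a system in the three modes over the input format shown above. The names `run_track1`, `to_output_json`, and the mode keys are hypothetical, and `translate` stands in for your own system; the official file names and layout are defined in the linked description, not here.

```python
import json
from typing import Callable, Dict, List, Optional

# Hypothetical interface of a participant's system: one source document plus an
# optional {source_term: target_term} dictionary in, one translation out.
Translator = Callable[[str, Optional[Dict[str, str]]], str]


def run_track1(documents: List[str],
               dictionaries: Dict[str, Dict[str, str]],
               translate: Translator) -> Dict[str, List[str]]:
    """Translate every document in the three modes; return one list per mode."""
    # Mode key (our own naming) -> dictionary given to the system.
    # None means the system sees no terminology at all.
    modes = {
        "noterm": None,
        "proper": dictionaries["proper"],
        "random": dictionaries["random"],
    }
    outputs = {}
    for mode, terms in modes.items():
        translations = [translate(doc, terms) for doc in documents]
        # The expected output is a list with one target text per source text.
        assert len(translations) == len(documents)
        outputs[mode] = translations
    return outputs


def to_output_json(translations: List[str]) -> str:
    """Serialize target texts as a JSON list, keeping non-ASCII characters readable."""
    return json.dumps(translations, ensure_ascii=False, indent=4)
```

Keeping the three runs behind one function makes it harder to accidentally pass the proper dictionary in the no-terminology mode.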

Track №2: Document-Level Translation with Sample Bitexts

In this track, you will not be provided with the explicit term dictionary. Instead, you will be provided with two sets of texts:

The first group of sentences or paragraphs will be the "seed" bitexts: your systems can learn information about the relevant terms from these sentence pairs.
The second group comprises documents that contain the terms from the first group. Your systems are supposed to translate these documents using the same terminology and style.

You are not bound to a specific method for this task: you can first extract terms from the seeds and apply the same procedure as in Track №1 (input text + explicit dictionary), perform in-context learning using the example sentences, or use any other approach of your choice.

Data Snippets

Set 1: Sample bitexts rich in terminologies in JSON

[
    {
        "en": "1.0 Primary Actuation and Structural Support The powertrain regulates the foundational kinetic element of the entire air-fuel exchange mechanism with a camshaft.",
        "pl": "1.1 Główne uruchomienie i wsparcie strukturalne Układ napędowy kontroluje podstawowy element kinetyczny w całym mechanizmie wymiany mieszanki powietrzno- paliwowej za pomocą wału rozrządu."
    },
...
]

Set 2: Source texts in JSON to be translated

[
    "The geometry of the camshaft governs torque delivery, peak horsepower, and idle stability by strictly controlling valve events.\nEngineers extract precise valve lifts from an eccentric cam, which is machined directly into the camshaft at meticulously calculated intervals...",
    ...
]

Expected Output: JSON with target text translations

[
    "Geometria wału rozrządu wpływa na sposób przekazania momentu obrotowego, maksymalną moc silnika oraz stabilność pracy na biegu jałowym poprzez dokładne sterowanie pracą zaworów.\nInżynierowie uzyskują dokładne uniesienie zaworów dzięki mimośrodowej krzywce, która jest skrawana bezpośrednio w wale rozrządu w dokładnie obliczonych odstępach...",
    ...
]

The detailed data description and links to files are provided in this Google Doc
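
One of the approaches mentioned above, in-context learning from the seed bitexts, could be prototyped along these lines. This is only a sketch under our own assumptions: `select_seeds` and `build_prompt` are hypothetical helpers, the word-overlap ranking is a deliberately crude retrieval heuristic, and the prompt layout is arbitrary rather than prescribed by the task.

```python
import re
from typing import Dict, List, Set


def _words(text: str) -> Set[str]:
    """Lowercased word set; a crude stand-in for real tokenization."""
    return set(re.findall(r"\w+", text.lower()))


def select_seeds(seeds: List[Dict[str, str]], source_text: str,
                 src: str = "en", k: int = 3) -> List[Dict[str, str]]:
    """Pick the k seed pairs that share the most words with the source text."""
    source_words = _words(source_text)
    ranked = sorted(seeds,
                    key=lambda pair: len(_words(pair[src]) & source_words),
                    reverse=True)
    return ranked[:k]


def build_prompt(seeds: List[Dict[str, str]], source_text: str,
                 src: str = "en", tgt: str = "pl", k: int = 3) -> str:
    """Format the selected seed pairs as in-context examples before the source."""
    lines = []
    for pair in select_seeds(seeds, source_text, src, k):
        lines.append(f"{src}: {pair[src]}")
        lines.append(f"{tgt}: {pair[tgt]}")
    # The final, unanswered target slot is what the model is asked to complete.
    lines.append(f"{src}: {source_text}")
    lines.append(f"{tgt}:")
    return "\n".join(lines)
```

For 2000-word documents, you would likely select seeds per paragraph or sentence rather than once per document.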

Data Description

Language Pairs

The language pairs are the following:

es-eu (Spanish → Basque)
en-pl (English → Polish)
zh-Hant-en (Traditional Chinese → English)

The first two pairs (es-eu and en-pl) will be present in both tracks; the Chinese-English pair will only be included in Track 2.

Domains

The domains for each language pair are as follows:

es-eu: Engineering and Technology
en-pl: Engineering and Technology, Medicine
zh-Hant-en: Finance

For es-eu and en-pl pairs, we will use all their domains for both tracks.

Links

The data is arranged into tracks in this Google Drive.

Evaluation

Terminology Modes

To estimate the causal effect of the proper terminology, we distinguish between three modes of translation of the terminology-heavy texts:

1. No terminology: the system is only provided with input sentences/documents.
2. Proper terminology: the system is provided with input texts (same as in 1.) and dictionaries of the format {source_term: target_term}.
3. Random terminology: the system is provided with input texts and translation dictionaries of the same format as in 2. The difference is that the dictionary items are not special terms but words randomly drawn from the input texts. This mode is of special interest because we want to measure to what extent the proper term translations improve system performance (mode 2), as opposed to arbitrary additional input that does not contain domain-specific terminology.

For Track 1, you will be provided with both proper and random terminology dictionaries in the JSON files. Thus, for mode 1, you need to ignore them, and for modes 2 and 3, you need to use the corresponding dictionary. For Track 2, you are not provided with terminology dictionaries; therefore, you only compare modes 1 and 2.

Metrics

The submissions will be evaluated based on:

Overall Translation Quality: we will evaluate general aspects of the machine translation outputs, such as fluency (incl. grammaticality of terms) and adequacy. This includes two aspects of evaluation:
- General translation quality: measured by automatic metrics such as BLEU, chrF, and COMET.
- Grammaticality of term usage in context: since the target languages in our samples have rich morphology, we will assess how grammatically coherent the occurrences of the term translations are in the texts.
Terminology-Oriented Metrics: this group of metrics assesses the ability of the system to accurately translate technical terms given the specialized vocabulary. We will assess two aspects:
- Terminology Success Rate: measures the percentage of correct term translations in the target texts. This will be carried out by comparing the occurrences of the correct term translations (i.e., the ones present in the dictionary) to the output terms. A higher success rate indicates closer adherence to the dictionary translations.
- Terminology Consistency: for domains such as science or legal texts, the consistent use of an introduced term throughout the text is crucial. In other words, we want a system not only to pick a correct term in the target language, but also to use it consistently once it is chosen. This will be evaluated by comparing all translations of a given source term in a text and measuring the percentage of deviations from the most consistent translation.

Using two groups of metrics makes the comparison multidimensional. To handle this, we plan to rank systems by Pareto optimality with respect to Overall Translation Quality and Terminology Success Rate: the systems that end up on the Pareto frontier will be considered optimal.
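
For readers unfamiliar with Pareto-based ranking, the idea can be sketched in a few lines of Python. The function name `pareto_frontier` and the input layout (one quality score and one success rate per system, higher is better for both) are our assumptions; how the organizers aggregate the quality metrics into a single score is not specified here.

```python
from typing import Dict, List, Tuple


def pareto_frontier(systems: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of systems not dominated on (quality, term success rate).

    A system is dominated if some other system is at least as good on both
    scores and strictly better on at least one. Higher is better for both.
    """
    frontier = []
    for name, (quality, success) in systems.items():
        dominated = any(
            (q2 >= quality and s2 >= success) and (q2 > quality or s2 > success)
            for other, (q2, s2) in systems.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier
```

For example, with made-up scores A = (0.80, 0.60), B = (0.70, 0.90), and C = (0.65, 0.55), systems A and B are on the frontier, while C is dominated by A.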

Participation

As terminology translation is a highly applicable task, we encourage participation from both academic researchers and industrial practitioners. You may choose to participate in any translation direction with any modeling approach you prefer, including but not limited to:

lexically constrained decoding,
large language model fine-tuning and/or prompting,
translation editing and refinement,
multi-agent approaches.

Participants will have the option to publish a system description paper (4 to 6 pages) at WMT. In this case, the participants are expected to submit their system descriptions according to the WMT guidelines (see the main page).

Otherwise, we kindly ask you to provide a brief description of your approach alongside your test submission, e.g. as a txt, pdf, or md file.

Data

There is no dev data for this shared task.
The test data (for both tracks) will be published here soon.
The validation code will be published here soon.

Submission Guidelines

0. Please notify us about your participation prior to submission (optional)

This is not required, but we’d greatly appreciate it if you contacted us once your team decides to participate in the competition. This will give us a better understanding of our workload after submission, and we will be able to send you a gentle reminder before the deadline. The easiest way to do that is through our short Google Form, but you can also contact the organizers via email.

1. Check your submission files with the validation script

You can find the validation script here. Systems whose outputs are not compatible with the validation script will be desk-rejected.
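
Before running the official script, a quick local sanity check can catch the most basic format problems. The sketch below is not the validation script and does not replace it: `check_output` is a hypothetical helper that only verifies what the data snippets above imply, namely that the output is a JSON list of non-empty strings with one entry per source text.

```python
import json
from typing import List


def check_output(source_json: str, output_json: str) -> List[str]:
    """Return a list of problems found; an empty list means the basic checks pass."""
    sources = json.loads(source_json)
    try:
        outputs = json.loads(output_json)
    except json.JSONDecodeError as err:
        return [f"output is not valid JSON: {err}"]
    if not isinstance(outputs, list):
        return ["output must be a JSON list of target texts"]
    problems = []
    if len(outputs) != len(sources):
        problems.append(f"expected {len(sources)} texts, got {len(outputs)}")
    for i, text in enumerate(outputs):
        if not isinstance(text, str) or not text.strip():
            problems.append(f"item {i} is not a non-empty string")
    return problems
```

The function takes the raw file contents as strings, so reading the files (with UTF-8 encoding) is left to the caller.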

2. Write a description of your system (optional)

First, in the submission form (see 3), you will be required to provide a short text description of your system (4-6 sentences).

Additionally, we’d appreciate more detailed descriptions of your systems. You have several options for it:

If you are already submitting long papers about your system for WMT, please mention your submission name; the organizers will provide us with it.
If you have already published something about the system, feel free to attach links to such documents; you do not have to tailor your detailed description to our shared task specifically.
If you wish to submit a description on its own AND you want it to be published in the WMT proceedings, you are invited to submit a short system description paper (4 to 6 pages) to WMT describing your system. Please submit it according to the guidelines of the main conference (i.e., with respect to all deadlines, formats, etc.), as we are NOT responsible for handling the publications.
If you do not wish to publish your system details but still have something to say about your system, you are very welcome to attach the PDF in a free format. We will carefully consider it for our analysis, but we will NOT publish it in WMT proceedings (see the point above).

3. Submit your system via Google Forms

You can find the Google Form here.

Detailed instructions about the formats of inputs and the outputs, as well as submission guidelines, can be found here.
If your team already participated in the WMT general task, please use the same team name.
You are allowed to submit multiple systems. If you are submitting multiple systems, we kindly request that you submit each system individually. In addition, we need you to indicate the primary system.
Submissions should be uploaded by the deadline stated above.
If you have any questions regarding the submission, please contact the organizers with a private inquiry.

Organizers (in alphabetical order)

Nathaniel Berger (Amazon)
Adrian Charkiewicz (Laniqo & Adam Mickiewicz University in Poznań)
Pinzhen Chen (Queen’s University Belfast)
Thierry Etchegoyhen (Vicomtech)
Harritxu Gete Ugarte (Vicomtech)
Kamil Guttmann (Laniqo & Adam Mickiewicz University in Poznań)
Xu Huang (Nanjing University)
David Ponce (Vicomtech)
Artur Nowakowski (Laniqo & Adam Mickiewicz University in Poznań)
Frédéric Odermatt (44ai)
Arturo Oncevay (independent)
Kirill Semenov (University of Zurich), main contact: firstname.lastname@uzh.ch
Dawei Zhu (Amazon)
Vilém Zouhar (ETH Zurich)

F.A.Q.

1. Do I have to submit results for all languages and/or tracks?

No, but we highly encourage it. If you already have a system for one language pair, replicating it (without optimizing hyperparameters) for another language pair could be very easy and would be a good demonstration of your method. Also, since terminology consistency is an under-studied field, we are especially interested in how well system performance replicates across languages from different families and writing systems. Thus, by submitting results for all language pairs, you have a chance to contribute something new to the general understanding of terminology-assisted translation!

2. Do I have to register for the competition or follow any other resource?

You only need to follow this website; all updates (such as data releases) will be noted at the top of the webpage. For your (and our) convenience, we kindly ask you to fill in the pre-registration form so that we can notify you about important dates. When the test data is published, we will also attach the Google Form for registration in the competition.

3. I cannot see the dev set for this shared task.

Unfortunately, we are unable to publish the dev part of the dataset. However, you can already see the test data and its description here.

4. Can you share the code or the exact guidelines on how to compute the metrics for the submitted files?

We are in the process of updating the metrics compared to last year (for example, we are re-examining the Terminology Success Rate computation and developing a metric for term grammaticality evaluation). However, you can look at last year’s metric implementations; they will partially overlap with this year’s implementations.

5. Do we have constraints on data or models that we can use for our submitted systems?

No, contrary to the General MT task, we do not impose restrictions on the usage of particular resources or instruments within our task. However, for the sake of comparability and interpretation of systems, we’d appreciate it if you explicitly noted the data that you used for your submission.
