EMNLP 2026

ELEVENTH CONFERENCE ON
MACHINE TRANSLATION (WMT26)

November, 2026
Budapest, Hungary
 

ANNOUNCEMENTS

  • 2026-03-13: Language pairs are almost finalized. We added: Armenian, Belarusian, and Kazakh

  • 2026-02-02: Most details of WMT26 are finalized

  • 2026-02-02: Initial proposal of WMT26, currently under preparation.

DESCRIPTION

Formerly known as the News translation task, the General MT task focuses on the evaluation of general capabilities of machine translation (MT) systems. Its primary goal is to test performance across a wide range of languages, domains, genres, and modalities. Systems are evaluated by humans.

In addition to the main General MT task, there is also a Test suites subtask: design a challenge testset and analyze the system translations.

The goals of the shared translation task are:

  • To investigate the applicability of current MT techniques when translating across various domains and language pairs

  • To investigate the translation of low-resource, morphologically rich languages

  • To create publicly available testsets for machine translation evaluation

  • To generate up-to-date performance scores in order to provide a basis of comparison in future research

  • To offer newcomers a smooth start with hands-on experience in state-of-the-art machine translation methods

  • To investigate the usefulness of multilingual and third language resources

  • To assess the effectiveness of document-level approaches and uses of non-textual modalities

We hope that both beginners and established research groups will participate in this task.

Main changes of 2026

  • Instruction following - we will test systems on their ability to follow instructions by providing additional instructions with each sample. We are considering the following phenomena:

    • formal/informal voice

    • glossaries - having additional context in the instructions

    • structured translation - being able to translate document inside common formats such as JSON, HTML, CSV, …​ and producing valid output

    • style and expressions - when requested, accurately replicating the user’s non-standard emotional style, such as character elongation ("yuhuuuuuu"), internet slang ("tbh"), and expressive punctuation ("WHAT?!?!").

  • Human evaluation - one of the largest changes this year is for human evaluation, where we plan to run contrastive human evaluation.

  • All systems will be human evaluated - we will not rely on automatic evaluation to drop low-performing systems; instead, we will devise dynamic sampling that allows us to evaluate all submitted systems (we may remove obviously bad outlier systems from human evaluation)

  • We won’t release preliminary results based on automatic evaluation

  • Domains:

    • We are considering translation of infographics for the image context

    • The social domain will focus on multiple users conversing and will come with potentially useful instructions, such as terminology or a style guide for the expected translation

    • The speech domain is transformed into a spoken domain

  • Human references will be built with an updated translator brief that encourages post-editing of MT; many language pairs will not have a human reference at all

  • LLM benchmarking - for systems collected by the organizers, we will focus more on open-weight models and reduce the number of proprietary models, with the exception of a few language pairs for which we will collect a larger pool of systems for benchmarking purposes.

  • We replace the abstract submission with a requirement to fill in a model card poll (similar to last year’s). Only systems that fill in the poll will be accepted for participation. System paper submission is still required for all participants.

Language pairs

The list of language pairs that are going to be evaluated (only in the specified direction):

  • Czech to German

  • Czech to Ukrainian

  • Czech to Vietnamese (new)

  • Chinese, Simplified to Japanese (new)

  • English to Arabic, Egyptian

  • English to Armenian (new)

  • English to Belarusian (new)

  • English to Chinese, Simplified

  • English to Czech

  • English to Estonian

  • English to German

  • English to Icelandic

  • English to Indonesian (new)

  • English to Japanese

  • English to Kazakh (new)

  • English to Korean

  • English to Ladin, Italy (new)

  • English to Ligurian, Italy (new)

  • English to Northern Sámi (new)

  • English to Thai (new)

We are finalizing the list of languages; the following are among those under consideration:

  • Bhojpuri

  • Maasai, Kenya

  • Inari Sámi

IMPORTANT DATES

All dates are at the end of the day for Anywhere on Earth (AoE)

  • Finalized task details: February 2026

  • Final test suite source texts must reach us: 4th June 2026

  • Test data released: 18th June 2026

  • Translation submission deadline: 2nd July 2026

  • Translated test suites shipped back to test suite authors: 16th July 2026

  • System description paper submission & constrained weights release: TBA, August 2026

  • Camera-ready submission: TBA, September 2026

TASK DESCRIPTION

The task is to improve and evaluate current MT methods. We encourage broad participation: if you feel that your method is interesting but not state-of-the-art, then please participate in order to disseminate it and measure progress. Participants will use their systems to translate a testset of unseen documents. The translation quality is measured by human evaluation.

You may participate in any or all of the language pairs. To lower the entry barrier, we define a constrained track that allows comparing systems from the same category.

Each participant is required to submit a system description paper, which will describe how the model was built and highlight the ways in which their methods and data differ from the usual training pipeline. You should make it clear which tools and training sets you used. The title of the system description paper must contain the name of the system. See the main page for the link to the submission site.

Constrained/Unconstrained track

The General MT task has two separate tracks with different constraints on the models, allowing for better comparability. Constrained models compete only against other constrained models, while unconstrained models compete against all. All participants are required to submit a system description paper.

  • Constrained open weights systems

    • you are allowed to use any training data, models, and frameworks under any open source license that allows unrestricted use for non-commercial purposes (e.g. Apache, MIT, …​), making your work replicable by any research group

    • the final model’s total number of parameters must be smaller than 20B. See the suggested LLMs that fall into this category below. Intermediate steps may use larger models, e.g. for distillation. A mixture-of-experts model counts by its total parameter count, and an ensemble of several models counts as the sum of their parameters.

    • you are required to release the model weights under some open source license together with system submission.

  • Unconstrained track - this track does not place any limitations, nor does it require publishing the model weights. Closed systems such as GPT are part of this track.

Although we no longer restrict training data, we curated corpora that should cover the majority of available data. A non-exhaustive list of suggested LLMs that fall into the under-20B-parameter category:

Furthermore, we recommend using last year’s constrained systems, all of which have been published as open weights, including the top constrained system Hunyuan.
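As a rough illustration of the constrained-track size rule, the sketch below checks eligibility under the stated counting rules: an ensemble counts as the sum of its members' parameters, and a mixture-of-experts model counts by its total (not active) parameters. The helper and model entries are hypothetical, not an official tool:

```python
# Hypothetical eligibility check for the constrained track's 20B cap.
# Ensembles sum their members; MoE models count by TOTAL parameters.
LIMIT = 20e9  # 20B parameters

def total_parameters(models):
    """Sum the total parameter counts of all ensembled models."""
    return sum(m["total_params"] for m in models)

def is_constrained_eligible(models):
    """The combined system must stay under the 20B limit."""
    return total_parameters(models) < LIMIT

# Made-up example: a 9B dense model ensembled with a 14B-total MoE
# (only 3B active) exceeds the cap, since MoE counts by total size.
ensemble = [
    {"name": "dense-9b", "total_params": 9e9},
    {"name": "moe-14b-a3b", "total_params": 14e9},  # 3B active, 14B total
]
print(is_constrained_eligible(ensemble))  # False: 23B > 20B
```

Either model alone would fit under the cap; it is the sum rule for ensembles that disqualifies the pair.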

Domains

We plan the following domains: news, social, spoken, and TBA. There may be some differences in domains for non-English source languages. All testsets will be provided at the document level.

Instruction following. Modern LLM-based MT systems are capable of instruction following on top of regular translation tasks. We will test these capabilities by providing each sample with additional instructions on how it should be translated. Basic instructions were already provided with the WMT25 testset. It is up to participants whether they utilize the instructions; however, not following the instructions will be considered a translation error. Here are phenomena you may consider focusing on:

  • formal/informal voice - the instructions may go against the style of the source segment (e.g. translate a post written in slang in language Y into a formal style in language X), or utilize the T-V distinction

  • glossaries - instructions would provide additional context useful for translation

  • structured translation - being able to translate document inside common formats (such as JSON, HTML, CSV, …​) and producing valid output

  • style and expressions - when requested, accurately replicating the user’s non-standard emotional style, such as character elongation ("yuhuuuuuu"), internet slang ("tbh"), and expressive punctuation ("WHAT?!?!").
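As a minimal sketch of what the structured-translation requirement implies, a participant might sanity-check that a translated JSON document still parses and preserves the source keys, i.e. that only the values were translated. The helper and data below are made up for illustration and are not part of the official evaluation:

```python
import json

def structure_preserved(source_json: str, translated_json: str) -> bool:
    """Check that a translated JSON document parses and keeps the
    source's keys, so that only the values were translated."""
    try:
        src, tgt = json.loads(source_json), json.loads(translated_json)
    except json.JSONDecodeError:
        return False  # translation broke the format
    return isinstance(tgt, dict) and src.keys() == tgt.keys()

# Made-up example: an English product entry translated into Czech.
source = '{"title": "Red bicycle", "condition": "used"}'
good   = '{"title": "Červené kolo", "condition": "použité"}'
bad    = '{"title": "Červené kolo", "condition": "použité"'  # missing brace

print(structure_preserved(source, good))  # True
print(structure_preserved(source, bad))   # False
```

Analogous checks could be written for HTML or CSV; the point is that producing invalid output in the requested format counts as a translation error.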

Spoken dialogue domain. The spoken dialogue domain is similar to last year’s speech domain and will be released with the original video together with a machine-generated transcript. We will use only the original video as the source for human evaluation. All testsets will contain a transcript generated with a speech recognition model, which will contain errors; using the audio is therefore not required, but may be beneficial. As a useful development set, we recommend the WMT24 and WMT25 testsets, which were released together with videos.

Social domain. In addition to the textual content, we will provide a screenshot of the original social media page, which may or may not be useful for the translation. Human evaluation will use the screenshots when judging the quality of the translation. Our setup will be similar to last year’s, which can be used as a development set.

Test Suites subtask

This year’s shared task will also include the “Test suites” sub-task, which has been part of WMT since 2018. The participants of the test suite track will provide challenge testsets that will be used to test specific phenomena of machine translation. More details about the test suites are provided in a separate page.

DATA

Licensing. The data released for the General MT task can be freely used for research purposes, provided that you cite the shared task overview (findings) paper and respect any additional citation requirements on the individual datasets. For other uses of the data, you should consult the original owners of the datasets.

Useful training data. We aim to use publicly available datasets wherever possible. See details. You can download all corpora via command line:

pip install mtdata==0.4.3  # on python 3.9-3.11
wget https://www.statmt.org/wmt25/mtdata/mtdata.recipes.wmt25-constrained.yml
mtdata -no-pb cache -j 8 -ri "wmt25-*"
for id in wmt25-{eng-{ara,bho,ces,est,isl,jpn,kor,rus,srp,ukr,zho},ces-{ukr,deu},jpn-zho}; do
  mtdata get-recipe -i $id -o $id --compress --no-merge -j 8
done

Development data. To evaluate your system during development, we suggest using testsets from past WMT years. You can use the following testsets when developing your systems.

EVALUATION

Human Evaluation

We will evaluate all primary submissions with human evaluation using dynamic sampling, with the exception of broken outlier systems. If there is an extremely high number of submissions, we may remove the worst-performing unconstrained systems while keeping all constrained systems. We will also collect translations from online services and popular LLM systems, primarily focusing on open-weight models. The collection will happen during the evaluation period between the testset release and the translation deadline.

Human evaluation will be carried out using the Pearmut annotation tool, with the exact ESA-based annotation protocol to be specified later. The human evaluation will be carried out with document context. You can run the following to see an example of the human evaluation:

pip install pearmut
wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa.json
pearmut add esa.json
pearmut run

Testset Submission

TBA

CONTACT

For queries, please use the mailing list or contact Tom Kocmi.

Organizers

  • Tom Kocmi - kocmi@cohere.com

  • Ekaterina Artemova

  • Eleftherios Avramidis

  • Rachel Bawden

  • Ondřej Bojar

  • Sergey Dukanov

  • Mark Fishel

  • Markus Freitag

  • TG Gowda

  • Roman Grundkiewicz

  • Marzena Karpinska

  • Philipp Koehn

  • Kenton Murray

  • Masaaki Nagata

  • Stefano Perrella

  • Lorenzo Proietti

  • Martin Popel

  • Maja Popović

  • Sara Rajaee

  • Parker Riley

  • Mariya Shmatova

  • Steinþór Steingrímsson

  • Lisa Yankovskaya

  • Vilém Zouhar

Acknowledgements

This task would not have been possible without the sponsorship of testset translations and human evaluation, done with our partners. Namely (in alphabetical order): Alconost, Árni Magnússon Institute for Icelandic Studies, Charles University, Cohere, Google, Institute of the Estonian Language, Microsoft, NTT, Toloka, University of Tartu, University of Tokyo. Furthermore, we are grateful to Barry Haddow, Toshiaki Nakazawa, and Konstantin Dranch, as well as for the support from the UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee (Grant No 10039436).