EMNLP 2026

ELEVENTH CONFERENCE ON
MACHINE TRANSLATION (WMT26)

November, 2026
Budapest, Hungary
 

ANNOUNCEMENTS

  • 2026-03-13: Language pairs are almost finalized. We added: Armenian, Belarusian, and Kazakh

  • 2026-02-02: Most details of WMT26 are finalized

  • 2026-02-02: Initial proposal of WMT26, currently under preparation.

DESCRIPTION

Formerly known as the News translation task, the General MT task focuses on the evaluation of general capabilities of machine translation (MT) systems. Its primary goal is to test performance across a wide range of languages, domains, genres, and modalities. Systems are evaluated by humans.

In addition to the main General MT task, there is also a Test suites subtask: design a challenge testset and analyze the system translations.

The goals of the shared translation task are:

  • To investigate the applicability of current MT techniques when translating across various domains and language pairs

  • To investigate the translation of low-resource, morphologically rich languages

  • To create publicly available testsets for machine translation evaluation

  • To generate up-to-date performance scores in order to provide a basis of comparison in future research

  • To offer newcomers a smooth start with hands-on experience in state-of-the-art machine translation methods

  • To investigate the usefulness of multilingual and third language resources

  • To assess the effectiveness of document-level approaches and uses of non-textual modalities

We hope that both beginners and established research groups will participate in this task.

Main changes of 2026

  • Instruction following - we will test systems on their ability to follow instructions by providing additional instructions with each sample. We are considering the following phenomena:

    • formal/informal voice

    • glossaries - having additional context in the instructions

    • structured translation - being able to translate document inside common formats such as JSON, HTML, CSV, …​ and producing valid output

    • style and expressions - when requested, accurately replicating the user’s non-standard emotional style, such as character elongation ("yuhuuuuuu"), internet slang ("tbh"), and expressive punctuation ("WHAT?!?!").

  • Human evaluation - one of the largest changes this year is for human evaluation, where we plan to run contrastive human evaluation.

  • All systems will be human evaluated - we will not rely on automatic evaluation to drop low-performing systems; instead, we will devise dynamic sampling that allows us to evaluate all submitted systems (we may remove obviously bad outlier systems from human evaluation)

  • We won’t release preliminary results based on automatic evaluation

  • Domains:

    • We are considering translation of infographics for the image context

    • The social domain will focus on multiple users conversing and will come with potentially useful instructions, such as terminology or a style guide for the expected translation

    • The speech domain is transformed into a spoken domain

  • Human references will be built with an updated translator brief that encourages post-editing of MT; many language pairs will not have a human reference at all

  • LLM benchmarking - for systems collected by the organizers, we will focus more on open-weight models and reduce the number of proprietary models, with the exception of a few language pairs for which we will collect a larger pool of systems for benchmarking purposes.

  • We replace the abstract submission with a requirement to fill in a model card poll (similar to last year’s). Only systems that fill in the poll will be accepted for participation. System paper submission is still required for all participants.

Language pairs

The list of language pairs that are going to be evaluated (only in the specified direction):

  • Czech to German

  • Czech to Ukrainian

  • Czech to Vietnamese (new)

  • Chinese, Simplified to Japanese (new)

  • English to Arabic, Egyptian

  • English to Armenian (new)

  • English to Belarusian (new)

  • English to Chinese, Simplified

  • English to Czech

  • English to Estonian

  • English to German

  • English to Icelandic

  • English to Indonesian (new)

  • English to Japanese

  • English to Kazakh (new)

  • English to Korean

  • English to Ladin, Italy (new)

  • English to Ligurian, Italy (new)

  • English to Northern Sámi (new)

  • English to Thai (new)

We are finalizing the list of languages; the following are among those under consideration:

  • Bhojpuri

  • Maasai, Kenya

  • Inari Sámi

IMPORTANT DATES

All dates are at the end of the day for Anywhere on Earth (AoE)

  • Finalized task details: February 2026

  • Final test suite source texts must reach us: 4th June 2026

  • Test data released: 18th June 2026

  • Translation submission deadline: 2nd July 2026

  • Translated test suites shipped back to test suite authors: 16th July 2026

  • System description paper submission & constrained weights release: TBA, August 2026

  • Camera-ready submission: TBA, September 2026

TASK DESCRIPTION

The task is to improve and evaluate current MT methods. We encourage broad participation: if you feel that your method is interesting but not state-of-the-art, then please participate in order to disseminate it and measure progress. Participants will use their systems to translate a testset of unseen documents. The translation quality is measured by human evaluation.

You may participate in any or all of the language pairs. To lower the entry barrier, we define a constrained track that allows comparing systems from the same category.

Each participant is required to submit a system description paper, which will describe how the model was built and highlight the ways in which their methods and data differ from the usual training pipeline. You should make it clear which tools and training sets you used. The title of the system description paper must contain the name of the system. See the main page for the link to the submission site.

Constrained/Unconstrained track

The General MT task has two separate tracks with different constraints on the models, allowing for better comparability. Constrained models compete only against other constrained models, while unconstrained models compete against all. All participants are required to submit a system description paper.

  • Constrained open weights systems

    • you are allowed to use any training data, models, and frameworks under any open source license that allows unrestricted use for non-commercial purposes (e.g. Apache, MIT, …​), making your work replicable by any research group

    • the final model’s total number of parameters must be smaller than 20B. See the suggested LLMs that fall into this category below. Intermediate steps may use larger models, e.g. for distillation. A mixture-of-experts model counts by its total parameter count, and an ensemble of several models counts as the sum of their parameters.

    • you are required to release the model weights under some open source license together with system submission.

  • Unconstrained track - this track does not place any limitations, nor does it require publishing the model weights. Closed systems such as GPT are part of this track.

Although we no longer restrict training data, we curated corpora that should cover the majority of available data. A non-exhaustive list of suggested LLMs that fall into the under-20B-parameter category:

Furthermore, we recommend using last year’s constrained systems, all of which have been published as open weights, including the top constrained system Hunyuan.
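As a rough illustration of the constrained-track size rule, the sketch below checks eligibility under the stated counting rules: an ensemble counts as the sum of its members' parameters, and a mixture-of-experts model counts by its total (not active) parameters. The helper and model entries are hypothetical, not an official tool:

```python
# Hypothetical eligibility check for the constrained track's 20B cap.
# Ensembles sum their members; MoE models count by TOTAL parameters.
LIMIT = 20e9  # 20B parameters

def total_parameters(models):
    """Sum the total parameter counts of all ensembled models."""
    return sum(m["total_params"] for m in models)

def is_constrained_eligible(models):
    """The combined system must stay under the 20B limit."""
    return total_parameters(models) < LIMIT

# Made-up example: a 9B dense model ensembled with a 14B-total MoE
# (only 3B active) exceeds the cap, since MoE counts by total size.
ensemble = [
    {"name": "dense-9b", "total_params": 9e9},
    {"name": "moe-14b-a3b", "total_params": 14e9},  # 3B active, 14B total
]
print(is_constrained_eligible(ensemble))  # False: 23B > 20B
```

Either model alone would fit under the cap; it is the sum rule for ensembles that disqualifies the pair.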

Domains

We plan the following domains: news, social, spoken, and TBA. There may be some differences in domains for non-English source languages. All testsets will be provided at the document level.

Instruction following. Modern LLM-based MT systems are capable of instruction following on top of regular translation tasks. We will test these capabilities by providing each sample with additional instructions on how it should be translated. Basic instructions were already provided with the WMT25 testset. It is up to participants whether they utilize the instructions; however, not following the instructions will be considered a translation error. Here are phenomena you may consider focusing on:

  • formal/informal voice - the instructions may go against the style of the source segment (e.g. translate a post written in slang in language Y into a formal style in language X), or utilize the T-V distinction

  • glossaries - instructions would provide additional context useful for translation

  • structured translation - being able to translate document inside common formats (such as JSON, HTML, CSV, …​) and producing valid output

  • style and expressions - when requested, accurately replicating the user’s non-standard emotional style, such as character elongation ("yuhuuuuuu"), internet slang ("tbh"), and expressive punctuation ("WHAT?!?!").
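As a minimal sketch of what the structured-translation requirement implies, a participant might sanity-check that a translated JSON document still parses and preserves the source keys, i.e. that only the values were translated. The helper and data below are made up for illustration and are not part of the official evaluation:

```python
import json

def structure_preserved(source_json: str, translated_json: str) -> bool:
    """Check that a translated JSON document parses and keeps the
    source's keys, so that only the values were translated."""
    try:
        src, tgt = json.loads(source_json), json.loads(translated_json)
    except json.JSONDecodeError:
        return False  # translation broke the format
    return isinstance(tgt, dict) and src.keys() == tgt.keys()

# Made-up example: an English product entry translated into Czech.
source = '{"title": "Red bicycle", "condition": "used"}'
good   = '{"title": "Červené kolo", "condition": "použité"}'
bad    = '{"title": "Červené kolo", "condition": "použité"'  # missing brace

print(structure_preserved(source, good))  # True
print(structure_preserved(source, bad))   # False
```

Analogous checks could be written for HTML or CSV; the point is that producing invalid output in the requested format counts as a translation error.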

Spoken dialogue domain. The spoken dialogue domain is similar to last year’s speech domain and will be released with the original video together with a machine-generated transcript. We will use only the original video as the source for human evaluation. All testsets will contain a transcript generated with a speech recognition model, which will contain errors; using the audio is therefore not required, but may be beneficial. As a useful development set, we recommend the WMT24 and WMT25 testsets, which were released together with videos.

Social domain. In addition to the textual content, we will provide a screenshot of the original social media page, which may or may not be useful for the translation. Human evaluation will use the screenshots when judging the quality of the translation. Our setup will be similar to last year’s, which can be used as a development set.

Test Suites subtask

This year’s shared task will also include the “Test suites” sub-task, which has been part of WMT since 2018. The participants of the test suite track will provide challenge testsets that will be used to test specific phenomena of machine translation. More details about the test suites are provided in a separate page.

DATA

Licensing. The data released for the General MT task can be freely used for research purposes, provided that you cite the shared task overview (findings) paper and respect any additional citation requirements on the individual datasets. For other uses of the data, you should consult the original owners of the datasets.

Useful training data. We aim to use publicly available datasets wherever possible. See details. You can download all corpora via command line:

pip install mtdata==0.4.3  # on python 3.9-3.11
wget https://www.statmt.org/wmt25/mtdata/mtdata.recipes.wmt25-constrained.yml
mtdata -no-pb cache -j 8 -ri "wmt25-*"
for id in wmt25-{eng-{ara,bho,ces,est,isl,jpn,kor,rus,srp,ukr,zho},ces-{ukr,deu},jpn-zho}; do
  mtdata get-recipe -i $id -o $id --compress --no-merge -j 8
done

Development data. To evaluate your system during development, we suggest using testsets from past WMT years. You can use the following testsets when developing your systems.

EVALUATION

Human Evaluation

We will evaluate all primary submissions with human evaluation using dynamic sampling, with the exception of broken outlier systems. If there is an extremely high number of submissions, we may remove the worst-performing unconstrained systems while keeping all constrained systems. We will also collect translations from online services and popular LLM systems, primarily focusing on open-weight models. The collection will happen during the evaluation period between the testset release and the translation deadline.

Human evaluation will be carried out using the Pearmut annotation tool, with the exact ESA-based annotation protocol to be specified later. The human evaluation will be carried out with document context. You can run the following to see an example of the human evaluation:

pip install pearmut
wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa.json
pearmut add esa.json
pearmut run

Testset Submission

TBA

CONTACT

For queries, please use the mailing list or contact Tom Kocmi.

Organizers

  • Tom Kocmi - kocmi@cohere.com

  • Ekaterina Artemova

  • Eleftherios Avramidis

  • Rachel Bawden

  • Ondřej Bojar

  • Sergey Dukanov

  • Mark Fishel

  • Markus Freitag

  • TG Gowda

  • Roman Grundkiewicz

  • Marzena Karpinska

  • Philipp Koehn

  • Kenton Murray

  • Masaaki Nagata

  • Stefano Perrella

  • Lorenzo Proietti

  • Martin Popel

  • Maja Popović

  • Sara Rajaee

  • Parker Riley

  • Mariya Shmatova

  • Steinþór Steingrímsson

  • Lisa Yankovskaya

  • Vilém Zouhar

Acknowledgements

This task would not have been possible without the sponsorship of testset translations and human evaluation, done with our partners. Namely (in alphabetical order): Alconost, Árni Magnússon Institute for Icelandic Studies, Charles University, Cohere, Google, Institute of the Estonian Language, Microsoft, NTT, Toloka, University of Tartu, University of Tokyo. Furthermore, we are grateful to Barry Haddow, Toshiaki Nakazawa, and Konstantin Dranch, as well as for the support from the UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee (Grant No 10039436).