EMNLP 2026

ELEVENTH CONFERENCE ON
MACHINE TRANSLATION (WMT26)

28-29 October, 2026
Budapest, Hungary

Latest Updates

  • 22 June 2026: Thanks to a participant, we discovered a problem with our data mix on Hugging Face. We have fixed the issue and re-uploaded the dataset. If you downloaded the data before 22 June, please download the new version.

Important Dates

  • Task announcement: January

  • Train/dev/sample data release: 16 June

  • Test release and evaluation details: 01 July

  • System output submission: 01 August

  • Paper submission: 07 August, in line with WMT26

  • Paper notification and camera-ready: September, in line with WMT26

  • Conference: 28-29 October 2026

All deadlines are AoE (Anywhere on Earth).

Description

The Multilingual Instruction Shared Task (MIST), as its name suggests, evaluates model capability in following instructions across languages and tasks. The objective this year is to analyse methods that can produce small multilingual models.

Constraints

We do not impose any constraints other than that the final system’s total parameter count must be under 10B:

  • No limit on data.

  • No limit on the techniques for model development: training, fine-tuning, inference-time methods, etc.

  • No limit on model choice, except the total parameter count.

  • We ask you to share technical details in your paper, such as the architecture, training configuration, data, etc., for analysis.

Tasks

We will evaluate systems using three sub-tasks, to cover same-language and cross-lingual comprehension and generation. There may be additional user instructions, such as a required word limit or style, for each test question. A detailed description of the languages and sub-tasks is provided below.

We provide sample data consisting of input-output pairs in 27 languages. It is similar to our test sub-tasks and available on Hugging Face as wmt26-mist-sample; you can use it as a starting point for fine-tuning. You are also free to use any other data for model development.

Languages

The sample data above covers 27 languages: arb_Arab, ben_Beng, ces_Latn, ckb_Arab, deu_Latn, eng_Latn, fin_Latn, fra_Latn, hat_Latn, hin_Deva, ind_Latn, ita_Latn, jpn_Jpan, kor_Hang, mar_Deva, pes_Arab, por_Latn, rus_Cyrl, slk_Latn, spa_Latn, swh_Latn, tel_Telu, tha_Thai, tur_Latn, vie_Latn, yor_Latn, zho_Hans. In our test set, we plan to cover a subset of these languages plus two surprise languages, so stay tuned!
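The language tags above appear to follow a "language_Script" pattern (a three-letter language code, an underscore, and a four-letter script code). As a minimal sketch under that assumption, the snippet below splits the tags and groups them by script, which can help when deciding which languages to prioritise. The names SAMPLE_LANGS and group_by_script are illustrative and not part of any official tooling.

```python
from collections import defaultdict

# The 27 tags listed for the sample data, copied from the task description.
SAMPLE_LANGS = (
    "arb_Arab ben_Beng ces_Latn ckb_Arab deu_Latn eng_Latn fin_Latn "
    "fra_Latn hat_Latn hin_Deva ind_Latn ita_Latn jpn_Jpan kor_Hang "
    "mar_Deva pes_Arab por_Latn rus_Cyrl slk_Latn spa_Latn swh_Latn "
    "tel_Telu tha_Thai tur_Latn vie_Latn yor_Latn zho_Hans"
).split()


def group_by_script(codes):
    """Group language codes by script, assuming the 'lang_Script' tag pattern."""
    groups = defaultdict(list)
    for code in codes:
        lang, script = code.split("_", 1)  # e.g. "hin_Deva" -> ("hin", "Deva")
        groups[script].append(lang)
    return dict(groups)


by_script = group_by_script(SAMPLE_LANGS)
for script, langs in sorted(by_script.items()):
    print(f"{script}: {len(langs)} ({', '.join(langs)})")
```

Since the test set will also include two surprise languages, any code that relies on this list should tolerate tags it has not seen before.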

We encourage you to participate in all sub-tasks/languages to be featured in the overall leaderboard, but you may choose to focus on a subset of the tasks and languages.

Participation and Evaluation (To be updated)

We will release model inputs as test data, and we collect your model outputs as submissions. Submissions will be evaluated using a mix of automatic metrics and human evaluation. Further details on evaluation and submission will be announced closer to the test release.

If you are interested in participating, please fill in this EoI form so we can keep you updated. You may find the FAQs at the bottom of this page helpful. If you have any questions, please do not hesitate to get in touch with the organizers.

Individual Sub-tasks

Sub-task 1: Context-based question answering

Given a document in language X, the model is asked questions about the content of that document, also in language X.

Example 1. Example from BELEBELE (Arabic)

Passage: وحصل الفلم اللي شاركو بي رايان غوسلينغ وإيما ستون، ترشيحات بجميع الفئات الرئيسية. حصل جوسلينغ وستون ترشيحات لأفضل ممثل وممثلة على التوالي.

Question: أي جائزة ترشحت إلها أيما ستون؟

Answer: افضل ممثلة

Sub-task 2: Summarization from and to any language

The model is provided a document in language X and asked to summarize it in language Y.

Example 2. Example from CrossSum (Japanese → English)

Source (Japanese): 英オックスフォード大学の研究チームによると、低用量のデキサメタゾンは新型ウイルスとの戦いで画期的な突破口になる。新型コロナウイルスに対し、様々な既存の治療法の効果を試す世界的規模の臨床試験の一貫として、デキサメタゾンが試された。その結果、人工呼吸器を必要とする重症患者の致死率が3割下がり […​]

Target summary (English): Scientists say the cheap and widely available drug dexamethasone will help save the lives of patients who are seriously ill with coronavirus.

Sub-task 3: Open-ended generation

The model is given an open-ended question and asked to provide a helpful and natural response.

Example 3. Example from Aya Dataset (French)

Prompt: Quels président des États-Unis ne s’est jamais marié ?

Response: James Buchanan est le seul président qui ne s’est jamais marié.

Example 4. Example from WMT25 MIST OEG (Chinese)

Prompt: 如果你能改变中国初中教育体制中的一件事,你会选择改变什么?为什么?

Response: 如果我能改变中国初中教育体制中的一件事,我会选择减少应试压力,增加综合素质教育的比重。<br><br>原因如下:<br><br>1. 促进学生全面发展 <br>目前,初中教育过于侧重考试成绩,学生和老师都把大量时间和精力投入到刷题、应付考试上,忽视了对学生创新能力、实践能力、沟通能力等综合素质的培养。这样容易导致学生缺乏兴趣、创造力和社会适应能力。[…​]

Organizers

  • Pinzhen Chen

  • Patrícia Schmidtová, contact at schmidtova@ufal.mff.cuni.cz

  • Katia Artemova

  • Seth Aycock

  • Niyati Bafna

  • Tom Kocmi

  • Philipp Koehn

  • Danni Liu

  • Nam Luu

  • Sara Papi

  • Eduardo Sánchez

  • Mariya Shmatova

  • Vilém Zouhar

FAQs

  1. Can I participate only in a subset of the tasks and languages?

    • Yes, of course. We will discuss your solutions and performance for each sub-task and language. However, we might not be able to rank you fairly in the overall leaderboard. Another option might be treating the rest of the languages as "zero-shot".

  2. How does the parameter count limit apply to mixture-of-experts (MoE) models?

    • The total number of parameters in an MoE model cannot exceed 10B, regardless of the activated parameters in a forward pass.

    • For example, a 9B-A2B model is allowed, but a 35B-A2B model is not.

  3. How does the parameter count limit apply to an inference framework with multiple models?

    • The total number of parameters of all your deployed components cannot exceed 10B.

    • Example 1: if you have a single 4B model and use it for different purposes in the inference pipeline, the total parameter count is still 4B, which is allowed.

    • Example 2: if you have one 4B model and three different 1B models in an agentic framework, the total parameter count is 7B, which is allowed.

    • Example 3: if you have a fleet of three different 4B models, the total will be 12B parameters, which is not allowed.

  4. Can I prune/distill…​ from a >10B model?

    • Yes, it is allowed to involve any model in the development, as long as the final model is under 10B parameters.
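The parameter-count rule from FAQs 2 and 3 can be sketched as a simple self-check before submission. This is an illustration only, not an official checker: the function names, the component names, and the choice to key components by a unique model name (so that a model reused several times in a pipeline is counted once) are assumptions. The sketch uses a strict "under 10B" comparison, the more conservative reading of the constraint.

```python
PARAM_LIMIT = 10_000_000_000  # 10B, as stated in the constraints


def total_parameters(components):
    """Sum total (not activated) parameters over distinct deployed models.

    `components` maps a unique model name to its total parameter count.
    A model reused for several purposes appears once, so it is counted once.
    For MoE models, pass the total count, not the activated count.
    """
    return sum(components.values())


def within_budget(components, limit=PARAM_LIMIT):
    """Return True if the whole system is strictly under the limit."""
    return total_parameters(components) < limit


# FAQ 3, Example 1: one 4B model reused across the pipeline.
single = {"shared-4b": 4_000_000_000}
# FAQ 3, Example 2: one 4B model plus three different 1B models.
agentic = {"main-4b": 4_000_000_000, "tool-a-1b": 1_000_000_000,
           "tool-b-1b": 1_000_000_000, "tool-c-1b": 1_000_000_000}
# FAQ 3, Example 3: three different 4B models.
fleet = {"a-4b": 4_000_000_000, "b-4b": 4_000_000_000, "c-4b": 4_000_000_000}
# FAQ 2: MoE models are judged on total parameters, not activated ones.
moe_small = {"moe-9b-a2b": 9_000_000_000}
moe_large = {"moe-35b-a2b": 35_000_000_000}

for name, system in [("single", single), ("agentic", agentic),
                     ("fleet", fleet), ("moe_small", moe_small),
                     ("moe_large", moe_large)]:
    print(name, total_parameters(system), within_budget(system))
```

Models used only during development, such as a larger teacher for distillation, are not deployed components and would not be entered in the dictionary.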