Latest Updates
-
22 June 2026: Thanks to a participant, we discovered a problem with our data mix on Hugging Face. We have fixed the issue and re-uploaded the dataset. If you downloaded the data before 22 June, please download the new version here.
Important Dates
Task announcement | January
Train/dev/sample data release | 16 June
Test release and evaluation details | 01 July
System output submission | 01 August
Paper submission | 07 August, in line with WMT26
Paper notification and camera-ready | September, in line with WMT26
Conference | 28-29 October 2026
All deadlines are AoE (Anywhere on Earth).
Description
The Multilingual Instruction Shared Task (MIST), as its name suggests, evaluates model capability in following instructions across languages and tasks. The objective this year is to analyse methods that can produce small multilingual models.
Constraints
We do not impose any constraints other than that the final system’s total parameter count must be under 10B:
-
No limit on data.
-
No limit on the techniques for model development: training, fine-tuning, inference-time methods, etc.
-
No limit on model choice, except the total parameter count.
-
We ask you to share technical details in your paper, such as the architecture, training configuration, and data, for analysis.
Tasks
We will evaluate systems using three sub-tasks, to cover same-language and cross-lingual comprehension and generation. There may be additional user instructions, such as a required word limit or style, for each test question. A detailed description of the languages and sub-tasks is provided below.
We provide sample data consisting of input-output pairs in 27 languages. It is similar to our test sub-tasks and available on Hugging Face as wmt26-mist-sample; you can use it as a starting point for fine-tuning. You are also free to use any other data for model development.
Languages
In our sample data above, there are 27 languages: arb_Arab, ben_Beng, ces_Latn, ckb_Arab, deu_Latn, eng_Latn, fin_Latn, fra_Latn, hat_Latn, hin_Deva, ind_Latn, ita_Latn, jpn_Jpan, kor_Hang, mar_Deva, pes_Arab, por_Latn, rus_Cyrl, slk_Latn, spa_Latn, swh_Latn, tel_Telu, tha_Thai, tur_Latn, vie_Latn, yor_Latn, zho_Hans. In our test set, we plan to cover a subset of these languages plus two surprise languages. Stay tuned!
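The language tags above appear to follow the FLORES-style convention of an ISO 639-3 language code joined to an ISO 15924 script code by an underscore; that reading is an assumption, as the page does not state it. Under that assumption, a minimal Python sketch for splitting the tags and grouping languages by script could look like this (the helper names `split_tag` and `group_by_script` are hypothetical):

```python
from collections import defaultdict

# Tags copied from the language list above.
# Assumed format: <ISO 639-3 language code>_<ISO 15924 script code>.
LANGUAGE_TAGS = (
    "arb_Arab ben_Beng ces_Latn ckb_Arab deu_Latn eng_Latn fin_Latn fra_Latn "
    "hat_Latn hin_Deva ind_Latn ita_Latn jpn_Jpan kor_Hang mar_Deva pes_Arab "
    "por_Latn rus_Cyrl slk_Latn spa_Latn swh_Latn tel_Telu tha_Thai tur_Latn "
    "vie_Latn yor_Latn zho_Hans"
).split()


def split_tag(tag):
    """Split a tag such as 'arb_Arab' into (language_code, script_code)."""
    language, script = tag.split("_")
    return language, script


def group_by_script(tags):
    """Map each script code to the language codes written in that script."""
    groups = defaultdict(list)
    for tag in tags:
        language, script = split_tag(tag)
        groups[script].append(language)
    return dict(groups)


scripts = group_by_script(LANGUAGE_TAGS)
print(len(LANGUAGE_TAGS), "languages in", len(scripts), "scripts")
```

Grouping by script can be useful when deciding how to balance fine-tuning data, since the two surprise languages in the test set may or may not share a script with the sample languages.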
We encourage you to participate in all sub-tasks/languages to be featured in the overall leaderboard, but you may choose to focus on a subset of the tasks and languages.
Participation and Evaluation (To be updated)
We will release model inputs as test data, and we collect your model outputs as submissions. Submissions will be evaluated using a mix of automatic metrics and human evaluation. Further details on evaluation and submission will be announced closer to the test release.
Individual Sub-tasks
Sub-task 1: Context-based question answering
Given a document in language X, the model is asked questions about the content of that document, also in language X.
Passage: وحصل الفلم اللي شاركو بي رايان غوسلينغ وإيما ستون، ترشيحات بجميع الفئات الرئيسية. حصل جوسلينغ وستون ترشيحات لأفضل ممثل وممثلة على التوالي.
Question: أي جائزة ترشحت إلها أيما ستون؟
Answer: افضل ممثلة
Sub-task 2: Summarization from and to any language
The model is provided a document in language X and asked to summarize it in language Y.
Source (Japanese): 英オックスフォード大学の研究チームによると、低用量のデキサメタゾンは新型ウイルスとの戦いで画期的な突破口になる。新型コロナウイルスに対し、様々な既存の治療法の効果を試す世界的規模の臨床試験の一貫として、デキサメタゾンが試された。その結果、人工呼吸器を必要とする重症患者の致死率が3割下がり […]
Target summary (English): Scientists say the cheap and widely available drug dexamethasone will help save the lives of patients who are seriously ill with coronavirus.
Sub-task 3: Open-ended generation
The model is given an open-ended question and asked to provide a helpful and natural response.
Prompt: Quels président des États-Unis ne s’est jamais marié ?
Response: James Buchanan est le seul président qui ne s’est jamais marié.
Prompt: 如果你能改变中国初中教育体制中的一件事,你会选择改变什么?为什么?
Response: 如果我能改变中国初中教育体制中的一件事,我会选择减少应试压力,增加综合素质教育的比重。<br><br>原因如下:<br><br>1. 促进学生全面发展 <br>目前,初中教育过于侧重考试成绩,学生和老师都把大量时间和精力投入到刷题、应付考试上,忽视了对学生创新能力、实践能力、沟通能力等综合素质的培养。这样容易导致学生缺乏兴趣、创造力和社会适应能力。[…]
Organizers
-
Pinzhen Chen
-
Patrícia Schmidtová, contact at schmidtova@ufal.mff.cuni.cz
-
Katia Artemova
-
Seth Aycock
-
Niyati Bafna
-
Tom Kocmi
-
Philipp Koehn
-
Danni Liu
-
Nam Luu
-
Sara Papi
-
Eduardo Sánchez
-
Mariya Shmatova
-
Vilém Zouhar
FAQs
-
Can I participate only in a subset of the tasks and languages?
-
Yes, of course. We will discuss your solutions and performance for each sub-task and language. However, we might not be able to rank you fairly in the overall leaderboard. Another option is to treat the remaining languages as "zero-shot".
-
-
How does the parameter count limit apply to mixture-of-experts (MoE) models?
-
The total number of parameters in an MoE model cannot exceed 10B, regardless of the activated parameters in a forward pass.
-
For example, a 9B-A2B model is allowed, but a 35B-A2B model is not.
-
-
How does the parameter count limit apply to an inference framework with multiple models?
-
The total number of parameters of all your deployed components cannot exceed 10B.
-
Example 1: if you have a single 4B model and use it for different purposes in the inference pipeline, the total parameter count is still 4B, which is allowed.
-
Example 2: if you have one 4B model and three different 1B models in an agentic framework, the total parameter count is 7B, which is allowed.
-
Example 3: if you have a fleet of three different 4B models, the total will be 12B parameters, which is not allowed.
-
-
Can I prune/distill… from a >10B model?
-
Yes, you may involve any model in development, as long as the final model is under 10B parameters.
-
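The parameter-count rules in the FAQs above can be sketched as a small check. This is an illustration, not an official validation tool: the `Component` class, the function names, and the use of a name to identify a distinct set of weights are all hypothetical. It also assumes the limit is strict ("under 10B"); the page uses both "under" and "cannot exceed", so the boundary case is not settled here.

```python
from dataclasses import dataclass

B = 1_000_000_000
PARAM_LIMIT = 10 * B  # 10B total parameters, from the task constraints


@dataclass(frozen=True)
class Component:
    name: str          # hypothetical identifier for one distinct set of weights
    total_params: int  # all parameters, including inactive MoE experts


def total_deployed_params(components):
    """Sum total parameters over distinct components; reused weights count once."""
    unique = {c.name: c for c in components}
    return sum(c.total_params for c in unique.values())


def is_allowed(components, limit=PARAM_LIMIT):
    # Assumption: the limit is strict, so exactly 10B would be rejected.
    return total_deployed_params(components) < limit


# FAQ examples, restated as data.
reused_4b = [Component("m4", 4 * B), Component("m4", 4 * B)]  # same model, two roles
agentic = [Component("m4", 4 * B)] + [Component(f"s{i}", 1 * B) for i in range(3)]
fleet = [Component(f"m4-{i}", 4 * B) for i in range(3)]
moe_small = [Component("moe-9B-A2B", 9 * B)]    # total, not activated, parameters
moe_large = [Component("moe-35B-A2B", 35 * B)]

for label, system in [("reused 4B", reused_4b), ("4B + 3x1B", agentic),
                      ("3x4B fleet", fleet), ("9B-A2B", moe_small),
                      ("35B-A2B", moe_large)]:
    print(label, total_deployed_params(system) // B, "B", is_allowed(system))
```

Counting by distinct weights rather than by pipeline stage mirrors Example 1, where a single 4B model used for several purposes still counts as 4B.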