ANNOUNCEMENTS
- 2025-03-11: Shared task announced
DESCRIPTION
The Multilingual Instruction Shared Task (MIST) focuses on evaluating and advancing models capable of following instructions across multiple languages and diverse task types. A central objective is to establish a comprehensive evaluation framework that assesses multilingual models on a range of instruction-following capabilities. The selected tasks are:
- machine translation;
- linguistic reasoning;
- open-ended generation; and
- LLM-as-a-Judge, where large language models serve as evaluators of other models' outputs.
The general format will be to provide LLM outputs given fixed prompts. The tasks will be on a subset of languages covered by the General MT shared task, namely: Arabic, Bengali, Bhojpuri, Chinese, Czech, English, Estonian, Farsi (Persian), German, Greek, Hindi, Icelandic, Indonesian, Italian, Japanese, Kannada, Korean, Lithuanian, Maasai, Marathi, Romanian, Russian, Serbian, Swedish, Thai, Turkish, Ukrainian, and Vietnamese.
We encourage participation from both research groups and industry practitioners.
IMPORTANT DATES
All dates are at the end of the day, Anywhere on Earth (AoE). Participants are required to take part in both waves, as the second wave contains outputs of other systems to be judged.
- Finalized task details: end of April
- Test data for first wave released (first three tasks): 26th June
- Submission deadline for first wave: 3rd July
- Test data for second wave released (LLM-as-a-judge task): 10th July
- Submission deadline for second wave: 17th July
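To make the AoE convention concrete, a deadline such as the one for the first wave can be converted to UTC with the standard library (the exact timestamp below is only an illustration of the UTC-12 convention, not an official cutoff):

```python
from datetime import datetime, timedelta, timezone

# "Anywhere on Earth" (AoE) is UTC-12: a deadline passes only once the
# date has ended in every time zone on the planet.
AOE = timezone(timedelta(hours=-12))

# End of 3rd July AoE (the first-wave submission deadline above).
deadline_aoe = datetime(2025, 7, 3, 23, 59, 59, tzinfo=AOE)
deadline_utc = deadline_aoe.astimezone(timezone.utc)
print(deadline_utc.isoformat())  # 2025-07-04T11:59:59+00:00
```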
Individual subtasks
As a proxy for testing general multilingual capabilities, we will evaluate participant models on the following set of tasks. Each task will cover multiple languages from the list above.
Task: Machine Translation
The setup for the machine translation task in MIST is compatible with the General MT shared task and will be evaluated the same way. The exact prompt instructions will be announced later.
- As a devset, we recommend WMT24 and WMT24++, which are available on HuggingFace.
- The test set will be the same as for the General MT 2025 shared task and will be announced later; participants in the MIST task therefore automatically participate in General MT.
Task: Open-ended Generation
As part of this task, systems will be tested on open-ended questions. The outputs will be evaluated by both humans and LLM-as-a-judge models.
As a devset, we recommend mArenaHard.
Further details of this task will be announced later.
Task: Linguistic Reasoning
In this task, we probe the models for reasoning on linguistic puzzles. This will be tested with instructions in multiple languages.
- As a devset, we recommend Linguini, an English-only instruction dataset, in the format in which it is integrated into BIG-Bench Extra Hard.
Further details of this task will be announced later.
Task: LLM-as-a-judge
As part of this task, participating systems will be tested on how well they can serve as judges, assessing the quality of answers from the machine translation and open-ended generation subtasks.
- As a devset for machine translation, we recommend using the human-evaluated WMT24 data.
- The test set will comprise the submissions of participants in the machine translation and open-ended generation tasks.
Participation
Constrained/Unconstrained LLMs: As in the General MT task, we have two tracks with respect to LLM size and availability; this allows for fairer comparisons. Constrained models compete only against other constrained models, while unconstrained models compete against all.
- Constrained open-weights systems:
  - You are allowed to use any training data or models released under an open-source license that allows unrestricted use for non-commercial purposes (e.g. Apache, MIT, …), so that your work is replicable;
  - The final model's total number of parameters must be smaller than 20B (see suggested LLMs in this category below). Intermediate steps may use larger models, e.g. for distillation;
  - You are required to release the model weights under an open-source license.
- Unconstrained track:
  - No limitations and no requirement to publish the models. Closed systems such as GPT-4 fall into this track.
-
Here is a non-exhaustive list of suggested LLMs that fall into the under-20B-parameter category: Aya Expanse 8B, Aya 101 (13B), Cohere R 7B, Llama 7B, Llama 13B, Qwen 2.5 7B, Ministral 8B, Mistral 7B, EuroLLM.
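As a sanity check against the 20B cap, the parameter count of a decoder-only transformer can be estimated directly from its configuration. The formula below is a rough sketch (it assumes untied input/output embeddings and a SwiGLU feed-forward block, and ignores norms and biases); the example configuration approximates a Llama-7B-class model and is given for illustration only:

```python
def estimate_params(vocab: int, d_model: int, n_layers: int, d_ff: int) -> int:
    """Rough parameter estimate for a decoder-only transformer
    (untied embeddings, SwiGLU FFN; norms and biases ignored)."""
    embeddings = 2 * vocab * d_model   # input embedding + LM head
    attention = 4 * d_model * d_model  # Q, K, V, O projections per layer
    ffn = 3 * d_model * d_ff           # gate, up, and down projections per layer
    return embeddings + n_layers * (attention + ffn)

# A Llama-7B-like configuration (values assumed for illustration):
n = estimate_params(vocab=32000, d_model=4096, n_layers=32, d_ff=11008)
print(f"{n / 1e9:.2f}B parameters")  # ~6.74B, well under the 20B cap
```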
Participation format: Participants have to take part in all tasks to be eligible for comparisons and human evaluation. The prompts for all tasks will be fixed; however, system builders are allowed to define a system preamble. All details needed to replicate your prompting, such as the verbatim system preamble, the temperature, and the decoding algorithm, must be disclosed in the system description paper at WMT.
We will provide a package to test and evaluate your setup, ETA in late March/April.
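For instance, the replication details could be recorded in a machine-readable form alongside a submission, as in the sketch below. All field names and values here are hypothetical; the official submission package will define the actual format:

```python
import json

# Hypothetical replication disclosure; field names and values are examples only.
submission_config = {
    "system_preamble": "You are a helpful multilingual assistant.",
    "temperature": 0.0,
    "decoding": {"algorithm": "greedy", "max_new_tokens": 1024},
    "model": {"name": "my-constrained-llm", "parameters": "8B"},
}

print(json.dumps(submission_config, indent=2))
```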
CONTACT
For queries, please use the mailing list or contact Tom Kocmi.
Organizers
- Tom Kocmi - kocmi@cohere.com
- Sweta Agrawal
- Eleftherios Avramidis
- Ondřej Bojar
- Eleftheria Briakou
- Pinzhen Chen
- Marzieh Fadaee
- Natalia Fedorova
- Markus Freitag
- Roman Grundkiewicz
- Philipp Koehn
- Julia Kreutzer
- Saab Mansour
- Stefano Perrella
- Lorenzo Proietti
- Ricardo Rei
- Sebastian Ruder
- Eduardo Sánchez
- Patrícia Schmidtová
- Mariya Shmatova
- Sergei Tilga
- Vilém Zouhar
Acknowledgements
To be announced.