ANNOUNCEMENTS
- 2025-03-11: Shared task announced
DESCRIPTION
The Multilingual Instruction Shared Task (MIST) focuses on evaluating and advancing models capable of following instructions across multiple languages and diverse task types. A central objective is to establish a comprehensive evaluation framework that assesses multilingual models on a range of instruction-following capabilities. The selected tasks are:
- machine translation;
- linguistic reasoning;
- open-ended generation; and
- LLM-as-a-Judge, where large language models serve as evaluators of other models' outputs.
The general format will be to provide LLM outputs given fixed prompts. The tasks will be on a subset of languages covered by the General MT shared task, namely: Arabic, Bengali, Bhojpuri, Chinese, Czech, English, Estonian, Farsi (Persian), German, Greek, Hindi, Icelandic, Indonesian, Italian, Japanese, Kannada, Korean, Lithuanian, Maasai, Marathi, Romanian, Russian, Serbian, Swedish, Thai, Turkish, Ukrainian, and Vietnamese.
We encourage participation from both research groups and industry practitioners.
IMPORTANT DATES
All dates are at the end of the day, Anywhere on Earth (AoE). Participants are required to take part in both waves, as the second wave contains outputs of other systems to be judged.
- Finalized task details: end of April
- Test data for first wave released (first three tasks): 26th June
- Submission deadline for first wave: 3rd July
- Test data for second wave released (LLM-as-a-judge task): 10th July
- Submission deadline for second wave: 17th July
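To make the AoE convention concrete, a deadline such as the one for the first wave can be converted to UTC with the standard library (the exact timestamp below is only an illustration of the UTC-12 convention, not an official cutoff):

```python
from datetime import datetime, timedelta, timezone

# "Anywhere on Earth" (AoE) is UTC-12: a deadline passes only once the
# date has ended in every time zone on the planet.
AOE = timezone(timedelta(hours=-12))

# End of 3rd July AoE (the first-wave submission deadline above).
deadline_aoe = datetime(2025, 7, 3, 23, 59, 59, tzinfo=AOE)
deadline_utc = deadline_aoe.astimezone(timezone.utc)
print(deadline_utc.isoformat())  # 2025-07-04T11:59:59+00:00
```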
Individual subtasks
As a proxy for testing general multilingual capabilities, we will evaluate participant models on the following set of tasks. Each task will cover multiple languages from the list above.
Task: Machine Translation
The setup for the machine translation task in MIST is compatible with the General MT shared task and will be evaluated the same way. The exact prompt instructions will be announced later.
- As a devset, we recommend WMT24 and WMT24++, which are available on HuggingFace.
- The test set will be the same as for the General MT 2025 shared task and will be announced later; participants in the MIST task therefore automatically participate in General MT.
Task: Open-ended Generation
As part of this task, systems will be tested on open-ended questions. The outputs will be evaluated by both humans and LLM-as-a-judge models.
As a devset, we recommend mArenaHard.
Further details of this task will be announced later.
Task: Linguistic Reasoning
In this task, we probe the models for reasoning on linguistic puzzles. This will be tested with instructions in multiple languages.
- As a devset, we recommend Linguini, an English-only instruction dataset, in the format in which it is integrated into BIG-Bench Extra Hard.
Further details of this task will be announced later.
Task: LLM-as-a-judge
As part of this task, participating systems will be tested on how well they can serve as judges, assessing the quality of answers from the machine translation and open-ended generation subtasks.
- As a devset for machine translation, we recommend using the human-evaluated WMT24 data.
- The test set will comprise the submissions of participants in the machine translation and open-ended generation tasks.
Participation
Constrained/Unconstrained LLMs: As in the General MT task, we have two tracks with respect to LLM size and availability; this allows for fairer comparisons. Constrained models compete only against other constrained models, while unconstrained models compete against all.
- Constrained open-weights systems:
  - You are allowed to use any training data or models released under an open-source license that allows unrestricted use for non-commercial purposes (e.g. Apache, MIT, …), so that your work is replicable;
  - The final model's total number of parameters must be smaller than 20B (see suggested LLMs in this category below). Intermediate steps may use larger models, e.g. for distillation;
  - You are required to release the model weights under an open-source license.
- Unconstrained track:
  - No limitations and no requirement to publish the models. Closed systems such as GPT-4 fall into this track.
-
Here is a non-exhaustive list of suggested LLMs that fall into the under-20B-parameter category: Aya Expanse 8B, Aya 101 (13B), Cohere R 7B, Llama 7B, Llama 13B, Qwen 2.5 7B, Ministral 8B, Mistral 7B, EuroLLM.
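As a sanity check against the 20B cap, the parameter count of a decoder-only transformer can be estimated directly from its configuration. The formula below is a rough sketch (it assumes untied input/output embeddings and a SwiGLU feed-forward block, and ignores norms and biases); the example configuration approximates a Llama-7B-class model and is given for illustration only:

```python
def estimate_params(vocab: int, d_model: int, n_layers: int, d_ff: int) -> int:
    """Rough parameter estimate for a decoder-only transformer
    (untied embeddings, SwiGLU FFN; norms and biases ignored)."""
    embeddings = 2 * vocab * d_model   # input embedding + LM head
    attention = 4 * d_model * d_model  # Q, K, V, O projections per layer
    ffn = 3 * d_model * d_ff           # gate, up, and down projections per layer
    return embeddings + n_layers * (attention + ffn)

# A Llama-7B-like configuration (values assumed for illustration):
n = estimate_params(vocab=32000, d_model=4096, n_layers=32, d_ff=11008)
print(f"{n / 1e9:.2f}B parameters")  # ~6.74B, well under the 20B cap
```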
Participation format: Participants have to take part in all tasks to be eligible for comparisons and human evaluation. The prompts for all tasks will be fixed; however, system builders are allowed to define a system preamble. All details needed to replicate your prompting, such as the verbatim system preamble, the temperature, and the decoding algorithm, must be disclosed in the system description paper at WMT.
We will provide a package to test and evaluate your setup, ETA in late March/April.
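For instance, the replication details could be recorded in a machine-readable form alongside a submission, as in the sketch below. All field names and values here are hypothetical; the official submission package will define the actual format:

```python
import json

# Hypothetical replication disclosure; field names and values are examples only.
submission_config = {
    "system_preamble": "You are a helpful multilingual assistant.",
    "temperature": 0.0,
    "decoding": {"algorithm": "greedy", "max_new_tokens": 1024},
    "model": {"name": "my-constrained-llm", "parameters": "8B"},
}

print(json.dumps(submission_config, indent=2))
```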
CONTACT
For queries, please use the mailing list or contact Tom Kocmi.
Organizers
- Tom Kocmi - kocmi@cohere.com
- Sweta Agrawal
- Eleftherios Avramidis
- Ondřej Bojar
- Eleftheria Briakou
- Pinzhen Chen
- Marzieh Fadaee
- Natalia Fedorova
- Markus Freitag
- Roman Grundkiewicz
- Philipp Koehn
- Julia Kreutzer
- Saab Mansour
- Stefano Perrella
- Lorenzo Proietti
- Ricardo Rei
- Sebastian Ruder
- Eduardo Sánchez
- Patrícia Schmidtová
- Mariya Shmatova
- Sergei Tilga
- Vilém Zouhar
Acknowledgements
To be announced.