Announcements
| Please register here! |
📢 May 6, 2026 - Website released, and the task is announced!
| We are still updating the page. Please keep your eye on it! |
Overview 🧑🏫
We present the second shared task for Creole language machine translation, a continuation of last year’s task. We hope that this shared task may engage more researchers from Creole language-speaking communities, as well as those interested in Creole language technologies.
This shared task contains two categories of subtasks:
- Task 1: Models for Creole language translation
  - Subtask 1A: Models for machine translation (MT) of Creole languages
  - Subtask 1B: Models for language identification (LID) of Creole languages
- Task 2: Datasets for Creole language translation
  - Subtask 2A: Aligned bitexts for Creole language machine translation (MT)
  - Subtask 2B: Speech datasets for Creole languages
Subtask 1A: MT models for Creole Languages 🤖
We solicit MT systems that translate between any number of Creole languages and English or French. If a participant would like to create a system to translate between a Creole language and another language considered culturally relevant to the community in question, this may be arranged by contacting the task organizers.
The purpose of this subtask is to encourage the creation of state-of-the-art Creole language MT systems. Hence, teams will be allowed to use data from any source, as well as any pre-trained models or LLMs. (However, we will require teams to report the percentage of test set segments that were contained in the training data they use; we will provide the software to compute this when the evaluation period begins.) We encourage teams to experiment with LLM-based systems, as these are under-studied in the Creole language space.
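To illustrate the kind of figure teams will be asked to report, here is a minimal sketch of a train/test overlap check. It is not the organizers' software, which will be released with the evaluation period. The function names and the normalization choices (Unicode NFC, lowercasing, whitespace collapsing) are assumptions made for illustration, and the sentences are toy examples.

```python
import unicodedata


def normalize(segment: str) -> str:
    """Assumed normalization: Unicode NFC, lowercase, collapsed whitespace."""
    text = unicodedata.normalize("NFC", segment).lower()
    return " ".join(text.split())


def overlap_percentage(train_segments, test_segments) -> float:
    """Percentage of test segments that also occur in the training data."""
    if not test_segments:
        return 0.0
    train_seen = {normalize(s) for s in train_segments}
    hits = sum(1 for s in test_segments if normalize(s) in train_seen)
    return 100.0 * hits / len(test_segments)


# Toy data for illustration only.
train = ["Bonjou tout moun.", "Mèsi anpil!", "Good morning."]
test = ["bonjou  tout moun.", "Kote ou ye?", "Mèsi anpil!", "See you soon."]

print(f"{overlap_percentage(train, test):.1f}% of test segments appear in the training data")
```

Exact matching after light normalization is only one possible definition of "contained"; the official tool may use a stricter or looser criterion.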
PROVIDED RESOURCES
Baseline models: These will be provided shortly 🔜 (One will be CreoleM2M; the other a newly trained iteration of Kreyòl-MT)
Provided train data: These will be provided shortly 🔜
Eval data: These will be provided shortly 🔜
ACCEPTABLE LANGUAGE PAIRS
We will accept submissions for any of the Creole languages supported by Kreyòl-MT or CreoleM2M. For languages supported by Kreyòl-MT, translation into or out of English and/or French is permitted; for languages supported only by CreoleM2M, translation into or out of English is permitted. We will furnish test sets for each of these language pairs (at least, all those for which we receive at least one submission).
We will also accept submissions for Creole languages not yet supported by either Kreyòl-MT or CreoleM2M. In this case, we will require that participants submit an eval set meeting the requirements detailed in Subtask 2A as part of their submission, since we won’t have an eval set of our own for the language pair. (This is the same as the requirement for translation directions into or out of a language other than English and French, as mentioned earlier.)
Subtask 1B: LID models for Creole Languages
The existing datasets for Creole language MT are, for the most part, manually curated and organized. A common way to expand existing resources is through web-crawling. However, web-crawling multilingual data requires high-quality language identification (LID). We hence challenge participants to develop LID systems for Creole languages from existing data, which will be evaluated on private evaluation sets of an unknown genre.
More details to come
Subtask 2A: Data for Creole Language MT 📚
We solicit contributions to Creole language MT training and evaluation sets, in bitext formats with translations into any other language (though stronger submissions will be able to justify why the other language is relevant for the Creole language-speaking community).
DATA REQUIREMENTS
-
Participants must show that 100% of translations were either produced or post-edited by competent native or proficient speakers of the source and target languages. (Stronger quality assurances, such as using only native speakers or translating from scratch instead of post-editing, will lead to stronger submissions.)
-
We require a data card with each submitted data set.
-
Participants must be able to show that one of the languages in each submitted bitext is considered a Creole language, by citing adequate academic sources or other sufficiently convincing means.
-
If submitting training data, participants are strongly encouraged to use it to develop an MT system and evaluate this system on a test set (either a test set of their own creation, which must be submitted along with the training data, or a previously published test set). Participants are encouraged to show statistically significant (p < 0.05) improvements in chrF++ over the previous state-of-the-art open-source MT system for the language pair. (To do this they must identify the previous SOTA model and make a compelling case for why it would be considered SOTA.) We will provide software to assist with meeting this requirement when the evaluation period begins. If participants are not able to meet this requirement, they must provide other convincing evidence of the utility of their training set.
-
If submitting a test set, participants must use it to evaluate performance of an MT model and provide compelling evidence that the model’s performance on the test set aligns with conventional wisdom regarding the model’s performance in the translation direction.
Please direct any questions about these requirements to the task organizers.
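One common way to test whether a score difference is statistically significant is paired bootstrap resampling. The sketch below is an illustration under stated assumptions, not the software the organizers will provide. The names `paired_bootstrap_p_value` and `exact_match_rate` are hypothetical, and the toy exact-match metric merely stands in for chrF++ so that the example runs without extra dependencies.

```python
import random
from typing import Callable, Sequence

# A corpus-level metric: (hypotheses, references) -> score, higher is better.
Metric = Callable[[Sequence[str], Sequence[str]], float]


def exact_match_rate(hyps: Sequence[str], refs: Sequence[str]) -> float:
    """Toy stand-in metric. In practice one would plug in chrF++, e.g. with
    sacrebleu: lambda h, r: CHRF(word_order=2).corpus_score(h, [r]).score"""
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)


def paired_bootstrap_p_value(
    baseline_hyps: Sequence[str],
    new_hyps: Sequence[str],
    refs: Sequence[str],
    metric: Metric,
    n_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of resampled test sets on which the new system does NOT
    score strictly higher than the baseline (a one-sided p-value estimate)."""
    if not (len(baseline_hyps) == len(new_hyps) == len(refs)):
        raise ValueError("all inputs must have the same number of segments")
    rng = random.Random(seed)
    n = len(refs)
    not_better = 0
    for _ in range(n_resamples):
        # Draw segment indices with replacement; both systems share the draw.
        idx = [rng.randrange(n) for _ in range(n)]
        sampled_refs = [refs[i] for i in idx]
        base_score = metric([baseline_hyps[i] for i in idx], sampled_refs)
        new_score = metric([new_hyps[i] for i in idx], sampled_refs)
        if new_score <= base_score:
            not_better += 1
    return not_better / n_resamples


# Toy data: the baseline gets half of the segments right, the new system all.
refs = [f"segment {i}" for i in range(20)]
baseline = [r if i % 2 == 0 else "wrong" for i, r in enumerate(refs)]
improved = list(refs)

p_value = paired_bootstrap_p_value(baseline, improved, refs, exact_match_rate)
print(f"estimated p-value: {p_value:.3f}")
```

Under this scheme, an estimate below 0.05 would correspond to the significance threshold mentioned in the requirements. Resampling segment indices once per iteration and applying them to both systems is what makes the comparison paired.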
SEED DATASET
For interested teams, we provide a seed dataset that can be translated to create a test set.
It is simply the FLORES-200 English devtest set. Teams may choose to translate all 1k
segments into their Creole language of choice, or to opt for a subset of them (so long as
they can make a convincing case that the dataset size is sufficient for a high-quality
test set). Naturally, if teams also wish to translate the FLORES-200 dev set,
they are welcome to do so.
Here are instructions to download the seed dataset:
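from datasets import load_dataset

# FLORES-200 code for English in Latin script
flores_code = "eng_Latn"

# trust_remote_code=True is needed because the dataset is defined by a loading script
devtest_set = load_dataset(
    "facebook/flores",
    flores_code,
    split="devtest",
    trust_remote_code=True,
)  # roughly 1k segments (1,012 in the FLORES-200 devtest split)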
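For teams that opt to translate only a subset of the seed data, the following sketch shows one way to draw a reproducible sample and export it as a tab-separated sheet for translators. The helper names, the sheet layout, and the seed value are hypothetical. The example uses placeholder sentences so it runs without downloading anything; with the dataset loaded above, the input would presumably be `devtest_set["sentence"]` (the column name is an assumption).

```python
import csv
import io
import random


def sample_segments(sentences, k, seed=13):
    """Return a reproducible subset of (index, sentence) pairs in original order."""
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(sentences)), k))
    return [(i, sentences[i]) for i in chosen]


def write_translation_sheet(pairs, handle):
    """Write a TSV with an empty 'translation' column for translators to fill in."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["segment_id", "english", "translation"])
    for idx, sentence in pairs:
        writer.writerow([idx, sentence, ""])


# Placeholder sentences standing in for the English devtest segments.
sentences = [f"Example sentence {i}." for i in range(10)]

subset = sample_segments(sentences, k=4)
buffer = io.StringIO()  # swap for open("sheet.tsv", "w", newline="") to write a file
write_translation_sheet(subset, buffer)
print(buffer.getvalue())
```

Keeping the original segment index in the sheet makes it possible to realign the translations with the English source later, and fixing the seed lets others reproduce exactly which segments were selected.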
Subtask 2B: Creole language speech data 💬
Speech applications are broadly needed for Creole languages. However, Creole speech technology is still a nascent field with few resources. We hence challenge interested participants to submit datasets with Creole language speech data. Each dataset must:
-
Contain segments not exceeding 60 seconds of speech in a Creole language
-
Contain textual transcriptions or translations of such segments
-
Be large enough to train a speech recognition system that achieves a word error rate (WER) below 90% on held-out data, or a speech translation system that achieves a chrF++ score above 15.0 on held-out data
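For reference, WER is conventionally computed as the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a minimal illustration that assumes simple whitespace tokenization and no text normalization; the function names are hypothetical, and the official evaluation may differ in both respects.

```python
def edit_distance(ref_words, hyp_words) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses) -> float:
    """Corpus-level WER in percent: total edits / total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must be the same length")
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        total_edits += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return 100.0 * total_edits / total_words


# Toy English example: one substitution and one deletion over 8 reference words.
references = ["the cat sat on the mat", "hello world"]
hypotheses = ["the cat sit on mat", "hello world"]
print(f"WER: {corpus_wer(references, hypotheses):.1f}%")
```

Because insertions count as errors, WER can exceed 100%, so a lower score is better.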
More details to come
Task Information 📢
System / dataset report: For all subtasks, we’ll ask task participants to submit a 4-8 page paper to detail and publicize their contributions.
REGISTRATION
If you plan to participate, please register for the shared task using this form.
IMPORTANT DATES
| Date | Event |
|---|---|
| May 6, 2026 | Website released, and the task is announced! |
| May 6, 2026 | Team Registration Open |
| June 7, 2026 | All baseline models and provided training data released |
| July 18, 2026 | Beginning of the evaluation cycle (test sets released) |
| July 25, 2026 | End of the evaluation cycle |
| June 20, 2026 | Result Declaration to individual team |
| In line with WMT26 | System Paper Submission |
| November, 2026 | Under EMNLP Conference |
PAPER SUBMISSION
In line with WMT26
All deadlines are in AoE (Anywhere on Earth). Dates are specified with respect to EMNLP 2025.
ORGANIZERS
-
Nathaniel R. Robinson (Johns Hopkins University)
-
Raj Dabre (Google)
-
Rasul Dent (Inria Paris)
-
Claire Bizon Monroc (Inria Paris)
-
Kenton Murray (Johns Hopkins University)
CONTACT
-
Organizers: creolemtsharedtask@gmail.com
-
Shared task community: creole-mt-shared-task@googlegroups.com
CREOLE LANGUAGES SUPPORTED BY BASELINE MODELS
Here are the languages supported by Kreyòl-MT (from the paper).
And here are the languages supported by CreoleVal (from the paper).