Overview 🧑‍🏫
We present the first shared task for Creole language machine translation. We hope that this shared task may engage more researchers from Creole language-speaking communities, as well as those interested in Creole language technologies.
This shared task contains two subtasks:
- Contribute MT systems for Creole languages
- Contribute MT datasets for Creole languages
Systems Subtask 💻
We solicit MT systems that translate between any number of Creole languages and English or French. If a participant would like to create a system to translate between a Creole language and another language considered culturally relevant to the community in question, this may be arranged by contacting the task organizers.
System report: We’ll ask task participants to submit a 4-8 page paper to detail and publicize their contributions.
This subtask contains two submission tracks: constrained and unconstrained (detailed below).
TRAINING DATA
The established training data set for this subtask is available for download here. Participants are not permitted to use the data designated as test data in the current public splits, as we will be using these data as part of our evaluation.
Teams submitting a system for Haitian, Papiamento, or Sango: please contact the organizers, and we will send you the portions of training data for those languages that are still pending public release on LDC.
Eval data: Evaluation data for both of the tracks in this subtask (detailed below) will be released when the evaluation period begins.
CONSTRAINED TRACK
The purpose of this track is to explore better ways to model Creole MT with limited resources, and allow researchers to explore smarter configurations than the simple ones we used in past Creole language MT models.
In this track, participants will be able to use only the provided train sets to train their MT systems. Hence, this track will only accept submissions in one of the 40 Creole languages included in the official train set, with translation into or out of English and/or French.
Baseline model: The baseline model for this track, which participants will attempt to improve upon in their selected translation directions, will be the kreyol-mt-pubtrain model available on HuggingFace. Participants may use this model however they wish in the constrained track, including as an initialization for fine-tuning, but they will not be permitted to use any other pre-trained models for a constrained track submission.
UNCONSTRAINED TRACK
The purpose of this track is to encourage creation of state-of-the-art Creole language MT systems. In this track, teams will be allowed to use data from any source, and any pre-trained models or LLMs. (However, we will require teams to report the percentage of test set segments that were contained in the training data they use; we will provide the software to compute this when the evaluation period begins.)
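For teams who want a rough estimate before the official software is released, the overlap percentage described above could be sketched as follows. This is not the organizers' tool: the function names, the normalization steps (Unicode NFKC, case folding, whitespace collapsing), and the toy sentences are all assumptions made for illustration, and the official software may define a match differently.

```python
import unicodedata


def normalize(segment: str) -> str:
    """Normalize a segment so trivially different copies still match.

    Assumption: NFKC + case folding + whitespace collapsing. The official
    tool may use a stricter or looser notion of "the same segment".
    """
    text = unicodedata.normalize("NFKC", segment)
    return " ".join(text.casefold().split())


def overlap_percentage(train_segments, test_segments) -> float:
    """Percentage of test segments that also occur in the training data."""
    if not test_segments:
        return 0.0
    train_set = {normalize(s) for s in train_segments}
    hits = sum(1 for s in test_segments if normalize(s) in train_set)
    return 100.0 * hits / len(test_segments)


# Toy, made-up data purely for illustration.
train = ["Bonjou, kijan ou ye?", "Mwen renmen li liv.", "Li ale lekòl."]
test = ["bonjou,  kijan ou ye?", "Nou pral manje.", "Li ale lekòl.", "Yo ap chante."]

print(f"{overlap_percentage(train, test):.1f}% of test segments appear in the training data")
```

Exact matching after light normalization is the simplest possible definition; near-duplicate detection would need a fuzzier comparison.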
Baseline models: This track will have two baseline models to improve upon: the kreyol-mt model and the CreoleM2M model. These two models support slightly different sets of Creole languages, so it is possible that for a given Creole language only one baseline model will be applicable.
Eval data: The test set for this track will be the same as for the constrained track, with some additions to cover the broader set of potential Creole languages submitted.
Acceptable language pairs for the unconstrained track
We will accept submissions for any of the Creole languages supported by either of the baseline models. Translation into or out of English and/or French is permitted for languages supported by Kreyòl-MT, and translation into or out of English is permitted for languages supported only by CreoleM2M. We will furnish test sets for each of these language pairs (at least, all those for which we receive at least one submission).
We will also accept submissions for Creole languages not yet supported by either Kreyòl-MT or CreoleM2M. For these languages the baseline model will still be either Kreyòl-MT or CreoleM2M: whichever performs better zero-shot for the new language pair. In this case, we will require that participants submit an eval set meeting the requirements detailed in the data subtask as part of their submission, since we won’t have an eval set of our own for the language pair. (This is the same as the requirement for translation directions into or out of a language other than English and French, as mentioned earlier.)
Data Subtask 📚
We solicit contributions to Creole language MT training and evaluation sets, in bitext formats with translations into any other language (though stronger submissions will be able to justify why the other language is relevant for the Creole language-speaking community).
Dataset report: Again, we will ask task participants to submit a 4-8 page paper to detail and publicize their contributions.
DATA REQUIREMENTS
- Participants must show that 100% of translations were either translated or post-edited by competent native or proficient speakers of the source and target languages. (Stronger quality assurances, such as only native speakers or translation instead of post-editing, will lead to stronger submissions.)
- We require a data card with each submitted data set.
- Participants must be able to show that one of the languages in each submitted bitext is considered a Creole language, by citing adequate academic sources or other sufficiently convincing means.
- If submitting training data, it is strongly encouraged that participants use it to develop an MT system and evaluate this system on a test set (either a test set of their own creation, which must be submitted along with the training data, or a previously published test set). It is encouraged that participants show significant (p < 0.05) improvements in chrF++ over the previous state-of-the-art open-source MT system for the language pair. (To do this they must identify the previous SOTA model and make a compelling case for why it would be considered SOTA.) We will provide software to assist with meeting this requirement when the evaluation period begins. If participants are not able to meet this requirement, they must provide other convincing evidence of the utility of their training set.
- If submitting a test set, participants must use it to evaluate performance of an MT model and provide compelling evidence that the model’s performance on the test set aligns with conventional wisdom regarding the model’s performance in the translation direction.
Please direct any questions about these requirements to the task organizers.
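As a rough illustration of the significance check mentioned in the training data requirement (p < 0.05), one common approach is paired bootstrap resampling over test segments. The sketch below is an assumption about how such a check might look, not the organizers' software, and the task description does not say which test will be used. The function names are hypothetical, and `char_bigram_f1` is a toy stand-in metric, not chrF++. To use real chrF++, a callable wrapping sacrebleu's `CHRF(word_order=2).corpus_score(hyps, [refs]).score` could be passed in instead.

```python
import random
from collections import Counter
from typing import Callable, Sequence

Metric = Callable[[Sequence[str], Sequence[str]], float]


def char_bigram_f1(hyps: Sequence[str], refs: Sequence[str]) -> float:
    """Toy corpus-level character-bigram F1 in [0, 100]. NOT chrF++."""
    match = hyp_total = ref_total = 0
    for hyp, ref in zip(hyps, refs):
        h = Counter(hyp[i:i + 2] for i in range(len(hyp) - 1))
        r = Counter(ref[i:i + 2] for i in range(len(ref) - 1))
        match += sum((h & r).values())  # clipped bigram matches
        hyp_total += sum(h.values())
        ref_total += sum(r.values())
    if match == 0:
        return 0.0
    precision, recall = match / hyp_total, match / ref_total
    return 100.0 * 2 * precision * recall / (precision + recall)


def paired_bootstrap_p_value(
    metric: Metric,
    new_hyps: Sequence[str],
    old_hyps: Sequence[str],
    refs: Sequence[str],
    n_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of resampled test sets on which the new system does not beat the old one."""
    rng = random.Random(seed)
    n = len(refs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample segment indices with replacement; both systems see the same sample.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        new_score = metric([new_hyps[i] for i in idx], sample_refs)
        old_score = metric([old_hyps[i] for i in idx], sample_refs)
        if new_score <= old_score:
            not_better += 1
    return not_better / n_resamples


# Toy, made-up outputs: the "new" system is perfect, the "old" one is garbage.
refs = ["the cat sat", "a dog ran", "birds fly high", "fish swim deep"]
new_hyps = list(refs)
old_hyps = ["xxxx"] * len(refs)

p = paired_bootstrap_p_value(char_bigram_f1, new_hyps, old_hyps, refs)
print(f"p = {p:.3f}", "(significant at 0.05)" if p < 0.05 else "(not significant)")
```

Passing the metric as a callable keeps the resampling logic independent of any particular scoring library.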
SEED DATASET
We provide a seed dataset for teams interested in translating it into a test set. It is simply the FLORES-200 English devtest set. Teams may choose to translate all 1k segments into their Creole language of choice, or to opt for a subset of them (so long as they can make a convincing case that the dataset size is sufficient for a high-quality test set). Naturally, if teams wish to translate the FLORES-200 dev set as well, they are welcome to do so.
Here are instructions to download the seed dataset:
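# Requires the Hugging Face `datasets` package (pip install datasets).
from datasets import load_dataset

flores_code = "eng_Latn"  # FLORES-200 code for English in Latin script
devtest_set = load_dataset(
    "facebook/flores",
    flores_code,
    split="devtest",
    trust_remote_code=True,  # the dataset is loaded via a script hosted in its repo
)  # len ~= 1000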
Task Information 📢
REGISTRATION
If you plan to participate, please register for the shared task using this form.
IMPORTANT DATES
| Finalized task details | 2 May, 2025 |
| Release of training data | 2 May, 2025 |
| Release of test data | 5 July, 2025 |
| Outputs/data submission deadline | 19 July, 2025 |
| Paper submission deadline | 14 August, 2025 (follows WMT) |
| EMNLP Conference | 5-9 November, 2025 |
All deadlines are in AoE (Anywhere on Earth). Dates are specified with respect to EMNLP 2025.
ORGANIZERS
- Nathaniel R. Robinson (Johns Hopkins University)
- Heather Lent (Aalborg University)
- Raj Dabre (Google)
- Andre Coy (University of the West Indies)
- Rasul Dent (Inria Paris)
- Claire Bizon Monroc (Inria Paris)
- Stefan Watson (University of the West Indies)
CONTACT
- Organizers: nrobin38@jhu.edu
- Shared task community: creole-mt-shared-task@googlegroups.com
CREOLE LANGUAGES SUPPORTED BY BASELINE MODELS
Update coming soon….