Announcements
- June 27, 2023 - Data Track Submissions are available -- training data finalized!
- May 31, 2023 - Additional Details available
- April 19, 2023 - Website released!
Overview
Machine translation research has traditionally placed an outsized focus on a small number of languages, mostly belonging to the Indo-European family. Progress for many languages, some with millions of speakers, has been held back by data scarcity. An inspiring recent trend has been the increased attention paid to low-resource languages. However, these modelling efforts have been hindered by the lack of high-quality, standardised evaluation benchmarks.
For the third edition of the Large-Scale MT shared task, we aim to bring together the community on the topic of machine translation for a set of at least 26 African languages. We do so by using several high quality benchmarks, paired with a fair and rigorous evaluation procedure.
Task Description
The shared task will consist of three tracks.
- The Data track focuses on the contribution of novel corpora. Participants may submit monolingual, bilingual or multilingual datasets relevant to the training of MT models for any African languages (either focus or additional languages listed below, or even others).
- The Small Translation track will evaluate the performance of translation models on any of the language pairs covered by this year's African languages, including the languages introduced by Data track participants. Small-compute models and bilingual (or otherwise not massively multilingual) models are eligible for this track.
- The Multilingual Translation track will evaluate the performance of (potentially multilingual) translation models covering all of this year's languages. Translation will be evaluated to and from English and French, as well as to/from select African languages within particular geographical/cultural clusters.
- Depending on the number of submissions, we may distinguish between constrained submissions (where only the data listed on this page will be allowed, including submissions accepted to the Data track, as well as pre-trained models, provided they were open-sourced before the Data track submission deadline) and unconstrained ones.
Resources
- To facilitate work in the Data track, we have released LASER sentence encoders supporting 24 relevant languages. LASER is a sentence representation toolkit which enables the fast mining of parallel corpora (Heffernan et al., 2022). The encoders may be obtained here. A more recent update provides LASER encoders for all 200 languages covered by the NLLB model, including almost all languages that are part of our shared task.
- The validation and test data are based on several benchmarks:
- FLORES-200, a high quality benchmark which supports many-to-many evaluation in over two hundred languages.
- NTREX-128, another high quality benchmark that covers 128 languages.
- a novel benchmark built by the Decolonise Science project, that covers 6 African languages.
- Validation and validation-test datasets will be provided, while the final evaluation will be performed by us: participants will need to share their models, and we will obtain outputs on the test set.
- Submissions in the Constrained Translation track are only allowed to use data from the following sources. If you would like to use other sources, please submit to the Unconstrained Translation track.
- Data from the 2023 Data Track participants
- SALT: Sunbird African Language Technology -- a corpus for Acholi, Lugbara, Luganda, Runyankole, Ateso, and English
- common-parallel-corpora -- an n-way aligned multitext; nllb-seed and flores-200 extended to include N'ko (nqo_Nkoo), including translator edits
- nicolingua-0005-nqo-nmt-resources -- a parallel corpus for N'ko (including English and French), with additional training resources curated from the N'ko community
- OPUS
- Parallel corpora mined from crawled data
- Data from the 2022 Data Track participants
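The LASER encoders listed under Resources support mining by embedding sentences from two languages into a shared space and pairing nearest neighbours. The toy sketch below illustrates only the pairing step, using hand-written 2-d vectors in place of real LASER embeddings; the function name and threshold are illustrative, and production mining uses margin-based scoring (Heffernan et al., 2022) rather than plain cosine similarity:

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mine_pairs(src_embs, tgt_embs, threshold=0.8):
    """Greedy mining sketch: pair each source sentence embedding with its
    most similar target embedding, keeping pairs above a threshold."""
    pairs = []
    for i, s in enumerate(src_embs):
        scores = [cosine(s, t) for t in tgt_embs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs
```

In practice, margin-based scoring normalises each candidate's similarity by the average similarity of its nearest neighbours, which is much more robust to hub sentences than the raw cosine used here.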
Full list of Languages
Afrikaans - afr | Lingala - lin | Swati - ssw | Luba-Kasai - lua |
Amharic - amh | Luganda - lug | Tswana - tsn | Kikongo - kon |
Chichewa - nya | Luo - luo | Umbundu - umb | Ewe - ewe |
Nigerian Fulfulde - fuv | Northern Sotho - nso | Wolof - wol | Central Kanuri (Latin) - kau |
Hausa - hau | Oromo - orm | Xhosa - xho | Fon - fon |
Igbo - ibo | Shona - sna | Xitsonga - tso | Twi - twi |
Kamba - kam | Somali - som | Yoruba - yor | |
Kinyarwanda - kin | Swahili - swh | Zulu - zul | |
Colonial linguae francae: English - eng, French - fra
Additional languages (that we will be able to evaluate automatically, if covered by any submitted systems):
Akan, Bambara, Bemba, Chokwe, Southwestern Dinka, Dyula, Kabyle, Kabiye, Kikuyu, Kimbundu, Plateau Malagasy, Mossi, Nuer, Nyanja, Rundi, Sango, Southern Sotho, Tigrinya, Tamasheq, Tumbuka, Central Atlas Tamazight, Pulaar, Malagasy, North Ndebele, Shilha, Venda
Important Dates
- Release of (most of) the encoders for mining for the data track, April 18
- Data track submission deadline and model availability deadline, June 19
- Training data finalized, June 21
- Evaluation opens, July 13
- Evaluation period ends, July 21
- System description abstract paper, July 27
- Camera-ready version due, TBD - September
- Conference (EMNLP), TBD - December
Evaluation
Due to computational and budgetary constraints, human evaluation will be conducted on a small set of language pairs from the FLORES-101 dataset. You can download it using this script. Specifically, we will evaluate on the following 100 language pairs:
- Translation of the focus languages to and from the pivots [54 pairs]:
- to/from English: Afrikaans, Amharic, Chichewa, Central Kanuri, Nigerian Fulfulde, Hausa, Igbo, Kamba, Kinyarwanda, Luganda, Luo, Northern Sotho, Oromo, Shona, Somali, Swahili, Swati, Tswana, Twi, Umbundu, Xhosa, Xitsonga, Yoruba, Zulu
- to/from French: Kinyarwanda, Lingala, Swahili, Wolof, Fon, Ewe, Luba-Kasai, Kikongo
- An additional 66 pairs within geographical/cultural clusters, to be selected based on translator/annotator availability (specifics here):
- South/South East Africa: Afrikaans, Northern Sotho, Shona, Swati, Tswana, Xhosa, Xitsonga, Zulu
- Horn of Africa and Central/East Africa: Amharic, Oromo, Somali, Luo
- Nigeria and Gulf of Guinea: Nigerian Fulfulde, Hausa, Igbo, Yoruba, Central Kanuri
- Central-East Africa: Chichewa, Kinyarwanda, Lingala, Luganda, Swahili
- Coastal West Africa: Fon, Ewe, Twi, Wolof
- Central Africa (DRC): Kikongo, Luba-Kasai, Luganda, Swahili
- Accuracy measures: BLEU, chrF++, and potentially a version of COMET tuned on African languages
- Fairness measures: measures of cross-lingual fairness (more details forthcoming)
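To make the character-level metric concrete, here is a simplified, self-contained sketch of sentence-level chrF (mean character n-gram F2 over n = 1..6). Official scoring will use standard implementations such as sacreBLEU's chrF++, which additionally mixes in word 1- and 2-grams; this toy version omits that and is for illustration only:

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams with spaces removed, as chrF does."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF: mean character n-gram F-beta (0-100)."""
    f_scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # sentence shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        denom = beta ** 2 * prec + rec
        f_scores.append((1 + beta ** 2) * prec * rec / denom if denom else 0.0)
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0
```

Character-level matching is what makes chrF comparatively robust for morphologically rich languages, where word-level metrics like BLEU penalise near-miss inflections heavily.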
Participants are encouraged but not required to handle all language pairs. Submissions dealing with only a subset of pairs will be admissible.
We will also measure progress in terms of the languages and populations covered by the submitted systems, in a manner similar to Blasi et al. (2022).
Data Track
The Data track focuses on the contribution of novel corpora. Participants may submit monolingual, bilingual or multilingual datasets relevant to the training of MT models for this year’s set of languages, as well as for any other African language!
Data track: Submissions
- Data has to be submitted in its most raw form, with no pre-applied tokenization. Deadline: June 19th
- Data has to be submitted through this form. The form requires a link to the hosted version of the data.
- License - We encourage data submissions with a permissive license (e.g. CC-0) that will allow participants to use data in their model training.
- Quality Assurance: we will require a note from all participants on any quality assurance efforts. For example, describe the expertise of the translators (if new data is created), or the expertise of the annotators (if new data is mined and then manually checked), etc.
- For submissions expanding the set of languages (i.e. submissions on languages other than the above) we also recommend that the participants pre-define train-dev-test splits, and describe the process of obtaining the split as well as their characteristics.
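For participants defining their own splits, one simple way to make them reproducible is to assign each sentence to a split with a stable hash rather than random shuffling. A minimal sketch (the function name and split fractions are illustrative, not part of the submission requirements):

```python
import hashlib

def assign_split(sentence, dev_frac=0.01, test_frac=0.01):
    """Deterministically map a sentence to 'train', 'dev', or 'test'.

    The MD5 digest is stable across runs and machines, so the split can be
    reproduced exactly from the raw data alone.
    """
    bucket = int(hashlib.md5(sentence.encode("utf-8")).hexdigest(), 16) % 10_000
    if bucket < dev_frac * 10_000:
        return "dev"
    if bucket < (dev_frac + test_frac) * 10_000:
        return "test"
    return "train"
```

One caveat: hash-based assignment does not prevent near-duplicate sentences from landing in different splits, so deduplicating the corpus before splitting is still advisable.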
Data track: Evaluation
- There will not be an official evaluation metric for this track. Instead, we will document the data sources according to their usage as reported by the main track participants.
- Data will be ranked based on the improved coverage that they bring for the languages of the African continent and its people.
- Data will be ranked based on how many groups have used it to train systems in the evaluation. The more participants have used the data, the better ranked the data contribution.
Data track: Paper submission
The Data track will require either the submission of an extended abstract (2-4 pages) or a paragraph describing the dataset, together with a datasheet [example templates: [1] [2]]. Participants who submit datasets should ensure that the data is correctly credited, giving attribution not only to the data collectors but also to the people from whom the data was originally collected.
The deadline for this submission is the same as system description papers.
Contact
Interested in the task? Please join the WMT Google group for any further questions or comments.
Paper Submission
Your system paper submission should be prepared according to the WMT instructions and uploaded to START before July 27, 2023.
Organizers
Jade Abbott, MasakhaneNLP, LelapaAI
Idris Abdulmumin, Ahmadu Bello University Zaria, MasakhaneNLP
David Adelani, Saarland University, Masakhane NLP
Antonios Anastasopoulos, George Mason University
Vukosi Marivate, University of Pretoria, Masakhane NLP, Deep Learning Indaba
Marta R. Costa-jussà, Meta AI
Tajuddeen Gwadabe, MasakhaneNLP, Open Life Science
Jean Maillard, Meta AI
Shamsuddeen Hassan Muhammad, University of Porto, MasakhaneNLP
Holger Schwenk, Meta AI
Natalia Fedorova, Toloka AI
Sergey Koshelev, Toloka AI
Md Mahfuz ibn Alam, George Mason University
Jonathan Mbuya, George Mason University