Shared Task: Large-Scale Machine Translation Evaluation for African Languages

Announcements

  • June 27, 2023 - Data Track Submissions are available -- training data finalized!

  • April 19, 2023 - Website released!

  • May 31, 2023 - Additional Details available

Overview

Machine translation research has traditionally placed an outsized focus on a limited number of languages, mostly belonging to the Indo-European family. Progress for many languages, some with millions of speakers, has been held back by data scarcity. An inspiring recent trend has been the increased attention paid to low-resource languages. However, these modelling efforts have been hindered by the lack of high-quality, standardised evaluation benchmarks.

For the third edition of the Large-Scale MT shared task, we aim to bring together the community on the topic of machine translation for a set of at least 26 African languages. We do so by using several high-quality benchmarks, paired with a fair and rigorous evaluation procedure.

Task Description

The shared task will consist of three tracks.

  • The Data track focuses on the contribution of novel corpora. Participants may submit monolingual, bilingual or multilingual datasets relevant to the training of MT models for any African language (either the focus or additional languages listed below, or even others).
  • The Small Translation track will evaluate the performance of translation models on any of the language pairs among this year's African languages, including languages introduced by Data track participants. Small-compute models and bilingual models (or models that are otherwise not massively multilingual) are eligible for this track.
  • The Multilingual Translation track will evaluate the performance of (potentially multilingual) translation models covering all of this year's languages. Translation will be evaluated to and from English and French, as well as to and from select African languages within particular geographical/cultural clusters.
    • Depending on the number of submissions, we may distinguish between constrained submissions (where only the data listed on this page will be allowed, including submissions accepted to the Data track, as well as pre-trained models provided they were open-sourced before the Data track submission deadline) and unconstrained ones.

Resources

Full list of Languages

Focus languages (ones that we will evaluate on, including with human annotations):
Afrikaans - afr          Lingala - lin         Swati - ssw     Luba-Kasai - lua
Amharic - amh            Luganda - lug         Tswana - tsn    Kikongo - kon
Chichewa - nya           Luo - luo             Umbundu - umb   Ewe - ewe
Nigerian Fulfulde - fuv  Northern Sotho - nso  Wolof - wol     Central Kanuri (Latin) - kau
Hausa - hau              Oromo - orm           Xhosa - xho     Fon - fon
Igbo - ibo               Shona - sna           Xitsonga - tso  Twi - twi
Kamba - kam              Somali - som          Yoruba - yor
Kinyarwanda - kin        Swahili - swh         Zulu - zul

Colonial linguae francae: English - eng, French - fra

Additional languages (that we will be able to evaluate automatically, if covered by any submitted systems):
Akan, Bambara, Bemba, Chokwe, Southwestern Dinka, Dyula, Kabyle, Kabiye, Kikuyu, Kimbundu, Plateau Malagasy, Mossi, Nuer, Nyanja, Rundi, Sango, Southern Sotho, Tigrinya, Tamasheq, Tumbuka, Central Atlas Tamazight, Pulaar, Malagasy, North Ndebele, Shilha, Venda

Important Dates

  • Release of (most of) the encoders for mining for the data track, April 18
  • Data track submission deadline and model availability deadline, June 19
  • Training data finalized, June 21
  • Evaluation opens, July 13
  • Evaluation period ends, July 21
  • System description abstract paper, July 27
  • Camera-ready version due, TBD - September
  • Conference (EMNLP), TBD - December

Evaluation

Due to computational and budgetary constraints, human evaluation will be conducted on a small set of language pairs from the FLORES-101 dataset. You can download it using this script. Specifically, we will evaluate on the following 100 language pairs:

  • Translation from the focus languages to and from the pivots [54 pairs]:
    • to/from English: Afrikaans, Amharic, Chichewa, Central Kanuri, Nigerian Fulfulde, Hausa, Igbo, Kamba, Kinyarwanda, Luganda, Luo, Northern Sotho, Oromo, Shona, Somali, Swahili, Swati, Tswana, Twi, Umbundu, Xhosa, Xitsonga, Yoruba, Zulu
    • to/from French: Kinyarwanda, Lingala, Swahili, Wolof, Fon, Ewe, Luba-Kasai, Kikongo
  • An additional 46 pairs within geographical/cultural clusters, to be selected based on translator and annotator availability (specifics here):
    • South/South East Africa: Afrikaans, Northern Sotho, Shona, Swati, Tswana, Xhosa, Xitsonga, Zulu
    • Horn of Africa and Central/East Africa: Amharic, Oromo, Somali, Luo
    • Nigeria and Gulf of Guinea: Nigerian Fulfulde, Hausa, Igbo, Yoruba, Central Kanuri
    • Central-East Africa: Chichewa, Kinyarwanda, Lingala, Luganda, Swahili
    • Coastal West Africa: Fon, Ewe, Twi, Wolof
    • Central Africa (DRC): Kikongo, Luba-Kasai, Luganda, Swahili
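
To give a sense of the pool these clusters define, the following Python sketch enumerates every directed pair of distinct languages that share a cluster. The cluster membership and language codes are copied from the lists on this page; the names `CLUSTERS` and `candidate_pairs` are illustrative assumptions, and the sketch does not model the final selection, which depends on annotator availability.

```python
from itertools import permutations

# Cluster membership as listed above, written with the codes
# from the focus-language list on this page.
CLUSTERS = {
    "South/South East Africa": ["afr", "nso", "sna", "ssw", "tsn", "xho", "tso", "zul"],
    "Horn of Africa and Central/East Africa": ["amh", "orm", "som", "luo"],
    "Nigeria and Gulf of Guinea": ["fuv", "hau", "ibo", "yor", "kau"],
    "Central-East Africa": ["nya", "kin", "lin", "lug", "swh"],
    "Coastal West Africa": ["fon", "ewe", "twi", "wol"],
    "Central Africa (DRC)": ["kon", "lua", "lug", "swh"],
}


def candidate_pairs(clusters):
    """Return the set of directed (source, target) pairs within any cluster."""
    pairs = set()
    for languages in clusters.values():
        # permutations(..., 2) yields both directions of every distinct pair.
        pairs.update(permutations(languages, 2))
    return pairs


pairs = candidate_pairs(CLUSTERS)
print(len(pairs))
```

Luganda and Swahili appear in two clusters, so using a set keeps their two shared directions from being counted twice. With the lists above, the pool contains 130 candidate directions.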

Automatic Metrics: The systems will be evaluated on a suite of automatic metrics:
  • Accuracy measures: BLEU, chrF++, and potentially a version of COMET tuned on African languages
  • Fairness measures: measures of cross-lingual fairness (more details forthcoming)
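
For intuition about what a character-based accuracy measure rewards, here is a simplified sentence-level character n-gram F-score written in plain Python. It is only a sketch in the spirit of chrF: it omits the word n-grams that chrF++ adds and does not reproduce the details of any reference implementation, so its numbers are not comparable with official scores. The function names are illustrative assumptions.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of length n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-beta score in [0, 1]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision.
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


print(simple_chrf("the cat sat", "the cat sits"))
```

For scores intended for comparison with other systems, an established implementation such as sacreBLEU, which provides both BLEU and chrF++, is the safer choice.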

Participants are encouraged but not required to handle all language pairs. Submissions dealing with only a subset of pairs will be admissible.

We will also measure progress in terms of the languages and population covered by the participating systems, in a manner similar to Blasi et al. (2022).

Data Track

The Data track focuses on the contribution of novel corpora. Participants may submit monolingual, bilingual or multilingual datasets relevant to the training of MT models for this year’s set of languages, as well as for any other African language!

Data track: Submissions

  • Data has to be submitted in its rawest form, with no tokenization applied. Deadline: June 19.
  • Data has to be submitted through this form. The form requires a link to the hosted version of the data.
  • License - We encourage data submissions with a permissive license (e.g. CC-0) that will allow participants to use data in their model training.
  • Quality Assurance: we will require a note from all participants on any quality assurance efforts. For example, describe the expertise of the translators (if new data is created), or the expertise of the annotators (if new data is mined and then manually checked), etc.
  • For submissions expanding the set of languages (i.e. submissions on languages other than those listed above) we also recommend that participants pre-define train-dev-test splits, and describe the process of obtaining the splits as well as their characteristics.
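
One possible way to pre-define reproducible splits is sketched below in Python. The procedure (drop exact duplicates, shuffle with a fixed seed, slice) is an assumption for illustration and is not a required method; the function names, split sizes and seed are hypothetical. Word counts use whitespace splitting for reporting only, so the data itself stays raw.

```python
import random


def split_corpus(pairs, dev_size, test_size, seed=0):
    """Split (source, target) pairs into train/dev/test reproducibly."""
    # Drop exact duplicates so the same pair cannot land in two splits.
    unique = list(dict.fromkeys(pairs))
    if dev_size + test_size >= len(unique):
        raise ValueError("dev and test sizes leave no training data")
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    rng.shuffle(unique)
    return {
        "test": unique[:test_size],
        "dev": unique[test_size:test_size + dev_size],
        "train": unique[test_size + dev_size:],
    }


def describe(splits):
    """Summarise each split: segment count and source-side word count."""
    return {
        name: {
            "segments": len(rows),
            "source_words": sum(len(src.split()) for src, _ in rows),
        }
        for name, rows in splits.items()
    }


# Toy data standing in for a real corpus.
toy_pairs = [(f"source {i}", f"target {i}") for i in range(10)]
toy_splits = split_corpus(toy_pairs, dev_size=2, test_size=2, seed=13)
print(describe(toy_splits))
```

Reporting the seed and the output of a summary like `describe` alongside the data is one way to document how the splits were obtained and what they contain.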

Data track: Evaluation

  • There will not be an official evaluation metric for this track. Instead, we will document the data sources according to their usage as reported by the main track participants.
  • Data will be ranked based on the improved coverage that they bring for the languages of the African continent and its people.
  • Data will be ranked based on how many groups have used it to train systems in the evaluation. The more participants have used the data, the better ranked the data contribution.
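
The usage-based criterion can be pictured with a short Python sketch that counts how many groups report using each dataset. The team and dataset names are hypothetical, the tie-breaking by name is an assumption, and the coverage criterion listed above is not modelled here.

```python
from collections import Counter


def rank_datasets(usage_reports):
    """Rank datasets by the number of distinct groups that used them."""
    counts = Counter()
    for datasets in usage_reports.values():
        # set() ensures a group counts once per dataset, however often it lists it.
        counts.update(set(datasets))
    # Most-used first; ties broken alphabetically (an arbitrary choice).
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical usage reports from system description papers.
reports = {
    "team_a": ["corpus_x", "corpus_y"],
    "team_b": ["corpus_x"],
    "team_c": ["corpus_y", "corpus_x", "corpus_x"],
}
print(rank_datasets(reports))
```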

Data track: Paper submission

The Data track will require either an extended abstract (2-4 pages) or a paragraph describing the dataset, together with a datasheet [example templates: [1] [2]]. Participants who submit datasets should ensure that data is correctly credited by giving attribution not only to the data collectors but also to the people from whom the data was originally collected.

The deadline for this submission is the same as for system description papers.

Contact

Interested in the task? Please join the WMT Google group for any further questions or comments.

Paper Submission

Your system paper submission should be prepared according to the WMT instructions and uploaded to START before July 27, 2023.

Organizers

Jade Abbott, MasakhaneNLP, LelapaAI
Idris Abdulmumin, Ahmadu Bello University Zaria, MasakhaneNLP
David Adelani, Saarland University, Masakhane NLP
Antonios Anastasopoulos, George Mason University
Vukosi Marivate, University of Pretoria, Masakhane NLP, Deep Learning Indaba
Marta R. Costa-jussà, Meta AI
Tajuddeen Gwadabe, MasakhaneNLP, Open Life Science
Jean Maillard, Meta AI
Shamsuddeen Hassan Muhammad, University of Porto, MasakhaneNLP
Holger Schwenk, Meta AI
Natalia Fedorova, Toloka AI
Sergey Koshelev, Toloka AI
Md Mahfuz ibn Alam, George Mason University
Jonathan Mbuya, George Mason University