The MultiIndic22MT 2024 Shared Task.

EMNLP 2024

NINTH CONFERENCE ON
MACHINE TRANSLATION (WMT24)

November 15-16, 2024
Miami, Florida, USA

Introduction

In WAT 2018, we introduced the IndicMT task for the first time, covering 7 Indic languages. Over the years we have gradually added languages, and now, in WAT 2024, we are pleased to announce a multilingual Indic MT task spanning all 22 scheduled languages of India, which belong to 4 language families and are written in 12 scripts. The languages exhibit both genetic and contact relatedness (Kunchukuttan et al. 2020). Some of these languages are extremely low-resource. This diversity makes this language group ideal for studies in multilingual learning, language relatedness and low-resource MT. For the first time, this task will be hosted jointly with the WMT 2024 shared tasks.

Task Description

The task covers English and 22 Indic Languages, namely, Assamese, Bengali, Bodo, Dogri, Konkani, Gujarati, Hindi, Kannada, Kashmiri (Arabic script), Maithili, Malayalam, Marathi, Manipuri (Meitei script), Nepali, Oriya, Punjabi, Sanskrit, Santali, Sindhi (Devanagari script), Tamil, Telugu, Urdu. We will evaluate the submissions on 44 translation directions (English-Indic and Indic-English). We will also evaluate the performance of 5 Indic-Indic pairs: Bengali-Hindi, Tamil-Telugu, Hindi-Malayalam and Sindhi-Punjabi. We encourage the use of multilingualism and transfer-learning by leveraging monolingual data, backtranslation and (potentially) LLMs, to develop high quality systems.
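As a quick check on the direction count, the 44 English-centric directions follow from pairing each of the 22 languages with English in both directions. The sketch below is illustrative only: the function name is hypothetical, and it uses the language names from the list above rather than the official language codes.

```python
# The 22 Indic languages named in the task description.
INDIC_LANGUAGES = [
    "Assamese", "Bengali", "Bodo", "Dogri", "Konkani", "Gujarati", "Hindi",
    "Kannada", "Kashmiri", "Maithili", "Malayalam", "Marathi", "Manipuri",
    "Nepali", "Oriya", "Punjabi", "Sanskrit", "Santali", "Sindhi", "Tamil",
    "Telugu", "Urdu",
]


def english_centric_directions(languages):
    """Return (source, target) pairs for English-Indic and Indic-English."""
    to_indic = [("English", lang) for lang in languages]
    to_english = [(lang, "English") for lang in languages]
    return to_indic + to_english


directions = english_centric_directions(INDIC_LANGUAGES)
print(len(directions))  # 2 directions x 22 languages = 44
```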

Corpora

Evaluation data:
- Development set: We provide the FLORES-22 Indic dev set, an extended version of the FLORES-200 dev set spanning all 22 aforementioned languages.
- Public Test set: We will evaluate on the IN22-Gen and IN22-Conv test sets. These are English-original test sets. As in previous years, these test sets are public, and we trust participants not to fine-tune on them, in the interest of fairness and correctness.
- Hidden Test set: Additionally, we will evaluate on a hidden test set which is Indic-original. This spans 13 of the 22 languages and will be released a week before the shared task submission deadline.
- Update - 12th July, 2024: The hidden test set is now available at Source Original Test Set. Unlike for the public test sets, references are not provided. Please translate from Indic to English and then submit your translations as per the submission instructions below.
Training data:
- We recommend using the BPCC dataset (Gala et al., 2023) which spans all 22 languages. Additional details of the dataset are present in the github repo. Note that one may pivot via English to obtain Indic-Indic parallel corpora.
- We additionally recommend the use of monolingual data from Varta, IndicCorp v2 and Sangraha corpora.
Other data:
- Please check with the organizers once before using any other data sources.
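The note on pivoting via English for the training data can be illustrated as follows. This is a minimal sketch, not the organizers' procedure: it assumes each English-centric corpus is available as a list of (English, Indic) sentence pairs and that pairs are joined only when the English sides match exactly after stripping whitespace. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def pivot_via_english(en_x_pairs, en_y_pairs):
    """Build X-Y sentence pairs from English-X and English-Y pairs.

    Two sentences are paired when they are translations of the same
    English sentence (exact string match after stripping whitespace).
    """
    # Index the X side by its English sentence.
    x_by_english = defaultdict(list)
    for english, x_sentence in en_x_pairs:
        x_by_english[english.strip()].append(x_sentence)

    # Look up each English sentence of the Y corpus in that index.
    pivoted = []
    for english, y_sentence in en_y_pairs:
        for x_sentence in x_by_english.get(english.strip(), []):
            pivoted.append((x_sentence, y_sentence))
    return pivoted


# Toy placeholders standing in for real sentences.
en_hi = [("Hello.", "hi-1"), ("Thank you.", "hi-2")]
en_bn = [("Thank you.", "bn-2"), ("Good night.", "bn-3")]
print(pivot_via_english(en_hi, en_bn))
```

Exact matching is the simplest possible join; it only finds pairs where both corpora share the same English sentence, so the yield depends on how much the English sides overlap.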

Pre-trained models and training toolkits:

Participants are welcome to leverage the following publicly available models for fine-tuning or synthetic data generation:
- IndicTrans2 one-to-many - large hf, distilled hf, large fairseq, distilled fairseq
- IndicTrans2 many-to-one - large, distilled, large fairseq, distilled fairseq
- IndicTrans2 many-to-many - large, distilled, large fairseq, distilled fairseq
- mT5 - You may use models of any size. hf, github
- IndicBart - hf unified script, hf separate script, github
- VartaT5 - hf
- BLOOM - hf
- Gemma - hf
- If you want to use any other LLM, please feel free but let us know in advance.
For training you are free to use any toolkit you like but we list the following for convenience:
- HuggingFace scripts/notebooks might be the most convenient
- Fairseq v1, v2, lora and distillation fork
- YANMTT
- open-instruct and its fork for Indic languages if you plan to fine-tune LLMs.

Submission details

If you wish to participate, please fill in this form so that we can send you the hidden test sets when the time comes.
There are two types of submissions: Constrained and Unconstrained.
- Constrained submission: If you use only the data and models mentioned above, your submission will be considered constrained.
- Unconstrained submission: If you use any other data or models without confirmation from the organizers, your submission will be considered unconstrained.
Please submit your system translations to prajdabre@gmail.com and anoop.kunchukuttan@gmail.com.
- Participants may submit up to 2 systems, one Primary (ranked) and one Contrastive (unranked, optional).
When submitting results, please submit a single zip file with the following name: (Teamname)-(Constrained/Unconstrained)-(Primary/Contrastive)-(In22Gen/In22Conv/Hidden).zip.
- For example, if your team name is Garuda and you are making an Unconstrained, Primary submission on the Hidden test set, then the zip file will be: Garuda-Unconstrained-Primary-Hidden.zip.
- Inside the zip file there should be one file for each direction, named in the format: (srccode)-(tgtcode).txt.
- The codes should be as per the codes listed here. Thus if you are submitting Hindi to English and Punjabi to Sindhi (Devanagari script) translations then the file names will be: hin_Deva-eng_Latn.txt and pan_Guru-snd_Deva.txt.
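The naming rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, and writing one translation per line in each file is an assumption, since the instructions specify the file names but not the file layout.

```python
import zipfile
from pathlib import Path

# Allowed values, taken from the naming pattern in the instructions.
TRACKS = {"Constrained", "Unconstrained"}
KINDS = {"Primary", "Contrastive"}
TEST_SETS = {"In22Gen", "In22Conv", "Hidden"}


def archive_name(team, track, kind, test_set):
    """Return (Teamname)-(track)-(kind)-(test set).zip, validating each part."""
    if track not in TRACKS:
        raise ValueError(f"unknown track: {track}")
    if kind not in KINDS:
        raise ValueError(f"unknown submission kind: {kind}")
    if test_set not in TEST_SETS:
        raise ValueError(f"unknown test set: {test_set}")
    return f"{team}-{track}-{kind}-{test_set}.zip"


def direction_file(src_code, tgt_code):
    """Return (srccode)-(tgtcode).txt for one translation direction."""
    return f"{src_code}-{tgt_code}.txt"


def build_submission(out_dir, team, track, kind, test_set, translations):
    """Write a submission zip; `translations` maps (src, tgt) to a list of lines."""
    path = Path(out_dir) / archive_name(team, track, kind, test_set)
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for (src, tgt), lines in translations.items():
            # Assumption: one translation per line, UTF-8 encoded.
            zf.writestr(direction_file(src, tgt), "\n".join(lines) + "\n")
    return path


print(archive_name("Garuda", "Unconstrained", "Primary", "Hidden"))
print(direction_file("hin_Deva", "eng_Latn"))
```

Calling `build_submission(".", "Garuda", "Unconstrained", "Primary", "Hidden", {("hin_Deva", "eng_Latn"): lines})` would then produce Garuda-Unconstrained-Primary-Hidden.zip containing hin_Deva-eng_Latn.txt.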

Evaluation

We will use the SacreBLEU library for evaluation, with chrF as the primary metric and BLEU as the secondary metric.
- We recommend following this script for evaluation, since we will do the same.
We will also perform human evaluation of a few language pairs which will be decided soon.
We will provide rankings per translation direction for both automatic and human evaluation.

Timeline

Task and training/dev data announcement - 30th April, 2024
Test data release - 13th July, 2024
Run Submission and Abstract submission deadline - 28th July, 2024
System description/workshop paper submission deadline - TBA, 2024 (follow EMNLP/WMT page)
Notification of Acceptance - TBA, 2024 (follow EMNLP/WMT page)
Camera-ready - TBA, 2024 (follow EMNLP/WMT page)
Workshop Dates - follow EMNLP/WMT main page

Contact

Raj Dabre: prajdabre@gmail.com
Anoop Kunchukuttan: anoop.kunchukuttan@gmail.com
