ANNOUNCEMENTS
Test set results have been distributed among participants. Participants are encouraged to submit their research papers and follow the WMT 2024 deadlines.
Aug 7, 2024 - Run submission deadline (AoE)! Please don't forget to send a brief system description.
Monday July 29, 2024 - Test data released!
Feb 25, 2024 - Website released, and the task is announced!
We are still updating the page. Please keep an eye on it!
OVERVIEW AND TASK DESCRIPTION
Building upon the resounding success of “Shared Task: Low-Resource Indic Language Translation” in WMT 2023, which saw enthusiastic participation from around the world, we are excited to announce the “Shared Task: Low-Resource Indic Language Translation” in WMT 2024. Recent advances in machine translation (MT) have significantly improved performance. Techniques such as multilingual translation and transfer learning are expanding MT’s reach beyond well-resourced languages. Yet, extending coverage to diverse, low-resource languages remains a challenge due to the limited availability of parallel data for training robust systems. The WMT 2024 Indic Machine Translation Shared Task tackles this challenge by focusing on low-resource Indic languages from diverse language families. The focus will be on languages like Assamese (an Indo-Aryan language spoken mainly in the north-eastern Indian state of Assam), Mizo (a Sino-Tibetan language spoken primarily in the Mizoram state of India), Khasi (an Austroasiatic language spoken in Meghalaya, India), Manipuri (also known as Meiteilon, a Sino-Tibetan language and the official language of Manipur, India), and Nyishi (a Sino-Tibetan language of Arunachal Pradesh, India).
This year’s task features two categories:
Category 1: (Moderate Training Data Available)
-
en-as: English ⇔ Assamese
-
en-lus: English ⇔ Mizo
-
en-kha: English ⇔ Khasi
-
en-mni: English ⇔ Manipuri
-
en-nshi: English ⇔ Nyishi (CANCELED)
The English ⇔ Nyishi language pair is CANCELED this year due to an error in the training data; we will release this task next year.
Category 2: (Very Limited Training Data) (CANCELED)
-
en-bodo: English ⇔ Bodo
-
en-mrp: English ⇔ Mising
-
en-trp: English ⇔ Kokborok
The language pairs for Category 2 are CANCELED this year due to an error in the training data; we will release this task next year.
GOAL
The central objective is to develop MT systems that produce high-quality translations despite the constraints of data availability. Participants are encouraged to explore:
-
Monolingual Data Utilization: Leveraging monolingual data effectively for improved translation.
-
Multilingual Approaches: Investigating whether cross-lingual transfer benefits low-resource pairs.
-
Transfer Learning: Adapting models trained on richer language pairs to the target languages.
-
Innovative Techniques: Experimenting with novel methods specifically tailored for low-resource settings.
DEADLINES
Release of training/dev data: 25 May, 2024
Test data release: 29 July, 2024
Run submission deadline: 7 Aug, 2024
NOTE: A brief system description/abstract (mandatory) must be sent along with your system submission on 7 Aug, 2024.
System description/workshop paper submission deadline: 20 August, 2024 (follow EMNLP/WMT page)
Notification of acceptance: 20 September, 2024 (follow EMNLP/WMT page)
Camera-ready: 3 October, 2024 (follow EMNLP/WMT page)
Workshop dates: 15-16 November, 2024 (follow EMNLP/WMT main page)
All deadlines are in AoE (Anywhere on Earth). Dates are specified with respect to EMNLP 2024.
DATA
-
Assamese, Khasi, Mizo, Manipuri for WMT 2023: DOWNLOAD.
-
Nyishi: [DOWNLOAD LINK WILL BE ENABLED SOON]
CITATIONS
If you are using this data, please cite:
-
Santanu Pal, Partha Pakray, Sahinur Rahman Laskar, Lenin Laitonjam, Vanlalmuansangi Khenglawt, Sunita Warjri, Pankaj Kundan Dadure, and Sandeep Kumar Dash. Findings of the WMT 2023 Shared Task on Low-Resource Indic Language Translation. In Proceedings of the Eighth Conference on Machine Translation (WMT), pages 682–694, 2023.
-
Nabam Kakum, Sahinur Rahman Laskar, Koj Sambyo, and Partha Pakray. Neural machine translation for limited resources English-Nyishi pair. Sādhanā, Springer, 2023.
TEST SET OUTPUT SUBMISSION
The test data will be available at the same repository as the training data and it can be accessed using the same password sent via e-mail. You are allowed to submit 1 CONSTRAINT, 1 PRIMARY and up to 2 CONTRASTIVE systems for each language pair/translation direction.
Submission File Naming Convention
The submission file must be named as follows: Team Name_Submission Type_Language Pair_Target Language Output
Example:
-
Team Name: CNLP-NITS-PP
-
Submission Type: primary or contrastive (Primary: training is done strictly using the task data. Contrastive: training may use the task data and any outside data.)
-
Language Pair: en_to_as (English-Assamese: en_to_as)
-
Example File Name: CNLP-NITS-PP_primary_en_to_as.txt
Kindly ensure that there are no newlines after the last line in the submission file; otherwise evaluation and submission will fail.
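The naming convention and the no-trailing-newline rule above can be sketched in Python as follows. This is a minimal illustration and not an official tool from the organizers: the helper names `submission_filename` and `write_submission` are hypothetical, and the sketch assumes one translation per line, UTF-8 encoding, and a file name that follows the pattern of the example above.

```python
ALLOWED_TYPES = {"primary", "contrastive"}


def submission_filename(team: str, submission_type: str, src: str, tgt: str) -> str:
    """Build a name such as CNLP-NITS-PP_primary_en_to_as.txt (hypothetical helper)."""
    if submission_type not in ALLOWED_TYPES:
        raise ValueError(f"unknown submission type: {submission_type!r}")
    return f"{team}_{submission_type}_{src}_to_{tgt}.txt"


def write_submission(path: str, translations: list[str]) -> None:
    """Write one translation per line with no newline after the last line."""
    # Remove embedded line breaks so each translation stays on exactly one line.
    cleaned = [t.replace("\r", " ").replace("\n", " ").strip() for t in translations]
    # Joining with "\n" puts separators only BETWEEN lines, never after the last one.
    # Caveat: if the final translation is empty, the file will still end in "\n".
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(cleaned))
```

For example, `submission_filename("CNLP-NITS-PP", "primary", "en", "as")` yields the example file name shown above, and `write_submission` can then be called with that name and the list of system outputs.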
Submission Process:
Please place all the submission files and the abstract system description file (PDF) in a single zip file named after your team (<team name>.zip). Send the zip file to lrilt.wmt24@gmail.com (Subject: Team Name: Submission File for Shared Task: Low-Resource Indic Language Translation).
You can submit results only once for each PRIMARY and CONTRASTIVE submission type.
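As a rough sketch of the packaging step, the following Python snippet bundles the output files and the PDF abstract into `<team name>.zip`. It is an illustration under assumptions, not an official script: `build_submission_zip` is a hypothetical helper, and the trailing-newline check simply mirrors the requirement stated in the naming section.

```python
import zipfile
from pathlib import Path


def build_submission_zip(team: str, files: list, out_dir=".") -> Path:
    """Bundle submission files and the PDF abstract into <team>.zip (hypothetical helper)."""
    paths = [Path(f) for f in files]
    # Validate everything first so that no partial archive is left behind on error.
    for p in paths:
        if p.suffix == ".txt" and p.read_bytes().endswith((b"\n", b"\r")):
            raise ValueError(f"{p.name} has a newline after the last line")
    zip_path = Path(out_dir) / f"{team}.zip"
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for p in paths:
            # Store files flat, without their directory prefix.
            zf.write(p, arcname=p.name)
    return zip_path
```

The resulting archive is then attached to the submission email; the sending step itself is left out of the sketch.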
EVALUATION
Systems will undergo both automatic evaluation (using BLEU, TER, RIBES, COMET, ChrF) and human evaluation by native speakers for a comprehensive assessment of translation quality.
CONTACT
PAPER SUBMISSION
Your system paper submission should be prepared according to the WMT instructions and uploaded to START before TBA, 2024 (WMT MAIN PAGE).
ORGANIZERS
-
Santanu Pal, Wipro AI Lab, London, UK
-
Partha Pakray, National Institute of Technology, Silchar, India
-
Advaitha Vetagiri, National Institute of Technology, Silchar, India
-
Sandeep Kumar Dash, National Institute of Technology, Mizoram, India
-
Lenin Laitonjam, National Institute of Technology, Mizoram, India
-
Arnab Maji, North-Eastern Hill University, India
-
Lyngdoh Sarah, North-Eastern Hill University, India
-
Riyanka Manna, Amrita Vishwa Vidyapeetham, Amaravati Campus, Andhra Pradesh, India
TECHNICAL MEMBERS
-
Advaitha Vetagiri, National Institute of Technology, Silchar, India
-
Reddi Mohana Krishna, National Institute of Technology, Silchar, India
-
Shyambabu Pandey, National Institute of Technology, Silchar, India
-
Annepaka Yadagiri, National Institute of Technology, Silchar, India