EMNLP 2024

NINTH CONFERENCE ON
MACHINE TRANSLATION (WMT24)

November 15-16, 2024
Miami, Florida, USA
 

ANNOUNCEMENTS

  • 9th May 2024 - Added a relevant paragraph to the data section about the different procedures followed to obtain the FLORES+ datasets.

  • 22nd April 2024 - Added a language identifier for the target languages (Idiomata Cognitor) and a small clarification that you can use OPUS as a source of monolingual data in the constrained setting.

  • 18th April 2024 - All pending data (obviously excluding test sets) has been released. You may start training your systems!

  • 20th February 2024 - The shared task is announced.

INTRODUCTION

Spain has a diverse linguistic landscape that includes, beyond the widely recognized Spanish, other languages such as Basque, Catalan, and Galician. Although Spanish is obviously at the forefront in terms of the volume of resources available for training data-driven MT systems, the capabilities and richness of the other languages should not be underestimated. Basque, Catalan, and Galician, which might have been considered limited in resources in the past, actually possess a significant amount of data that facilitates their integration into modern MT technologies. In fact, these three languages have recently been included in the list of up to 100 select languages in well-known multilingual systems such as mBERT, XLM-R, mBART, mT5 or NLLB-200. However, Spain is home to additional languages with far fewer resources, especially in the form of bilingual data. This task focuses on three of them, namely, Aragonese, Aranese, and Asturian.

GOALS

The goals of this shared translation task are:

  • To push the boundaries of machine translation system development when resources are extremely scarce.

  • To explore the transferability among low-resource Romance languages when translating from Spanish.

  • To find the best way to use pre-trained models of any kind for the translation between Spanish and low-resource Romance languages.

  • To create publicly available corpora for machine translation development and evaluation.

We expect participation in this task from both newcomers and established research teams.

TASK DESCRIPTION

Participants will use their systems to translate a test set of unseen sentences. Note that, as described later, the Spanish source side of this test set is already available to participants because it is part of the FLORES+ dataset. The counterpart in each low-resource language, however, will be new and purposefully created for this shared task.

The translation quality will be exclusively measured via automatic evaluation metrics. You may participate in any or all of the language pairs. To have a common framework that allows for comparable results, and also to lower the barrier to entry, we provide a possible training set.

As in all WMT shared tasks, each participant is required to submit a paper describing the submission, which should highlight how your methods and data differ from the standard approach. Make clear which tools and which datasets you used. Moreover, each participant has to submit a one-page abstract of the system description one week after the system submission deadline. The abstract should contain, at a minimum, basic information about the system and the methods/data/tools used. See the main page for the link to the submission site.

Language pairs

The list of language pairs that are going to be evaluated (only specified direction) follows:

  • Spanish to Aragonese

  • Spanish to Aranese

  • Spanish to Asturian

Note that Aranese is a variety of Gascon, one of the two main dialects of Occitan. You may decide to use Occitan data as part of the training process for your systems, bearing in mind that some orthographic differences may exist.

Types of submissions

Submissions may fall into three different categories, depending on the corpora and models used and on the reproducibility of the results: constrained, open and closed.

Constrained submissions can only use the corpora distributed as part of this shared task (see below), but can make use of any existing publicly available pre-trained language models or translation models provided that they do not exceed 1B (one thousand million) parameters; for example, NLLB-200-600M fits within this limit. Developed systems may be bilingual or multilingual, and need not include all the languages of interest.

Open submissions can make use of any resource (corpora, pre-trained models, etc.), in any language, without size restriction, provided that they are publicly available under open source licenses to ensure reproducibility by third parties. Machine translation systems or large language models available online belong to this category if the resulting outputs are made available.

Closed submissions can be trained without limitations regarding the availability of resources (corpora, pre-trained models, etc.) used.

In any case, we encourage participants to estimate the energy (in kWh) that their training required and to include, in the paper describing the submission, a section discussing the possible trade-off between energy consumption and translation quality.
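
One rough way to obtain such an estimate, assuming you know how many GPUs were used, their average power draw, and the training time, is energy (kWh) = GPUs × average watts × hours × PUE / 1000. The sketch below is only an illustration: the function name `estimate_kwh` and the figures in the example are hypothetical, and PUE (power usage effectiveness, the datacenter overhead factor) defaults to 1.0 when unknown.

```shell
# Rough training-energy estimate in kWh (a sketch; every figure is an assumption).
# Usage: estimate_kwh NUM_GPUS AVG_WATTS_PER_GPU HOURS [PUE]
estimate_kwh() {
    # LC_ALL=C keeps the decimal separator a dot regardless of the user's locale.
    LC_ALL=C awk -v n="$1" -v w="$2" -v h="$3" -v pue="${4:-1.0}" \
        'BEGIN { printf "%.2f\n", n * w * h * pue / 1000 }'
}

# Hypothetical run: 4 GPUs averaging 250 W each for 12 hours, with a PUE of 1.5.
estimate_kwh 4 250 12 1.5   # prints 18.00
```

Measured power draw, where available, will be more reliable than a nominal figure such as the GPU's rated TDP.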

DATA

Training Data and Development Data

The training and development corpora for the constrained submissions are restricted to everything in OPUS, the recently released PILAR corpora, and the brand new FLORES+ dev sets for our target languages.

On the one hand, OPUS lists some (mostly uncurated) resources for Aragonese, Asturian, and Occitan. See, for example, the following bilingual resources:

Note that the fact that you can use any resource from OPUS includes the possibility of also collecting monolingual data by using the source or target side of any bilingual resource in OPUS.
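
As a minimal sketch of that idea, assuming you have downloaded an OPUS resource as line-aligned plain text (one sentence per line), one side of it can be cleaned into monolingual data as follows. The function name `mono_from_bitext` and the toy input are hypothetical; since most of these resources are uncurated, further filtering is likely to be needed in practice.

```shell
# Turn one side of a line-aligned bilingual corpus into monolingual data.
# Reads sentences on stdin (one per line) and writes the cleaned set to stdout.
mono_from_bitext() {
    # Drop blank lines, then keep only the first occurrence of each sentence.
    grep -v '^[[:space:]]*$' | awk '!seen[$0]++'
}

# Toy input standing in for the target side of an OPUS resource.
printf 'Bones tardes\n\nBones tardes\nMunches gracies\n' | mono_from_bitext
```

The `awk` idiom preserves the original order of the sentences, which `sort -u` would not.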

On the other hand, PILAR (Pan-Iberian Language Archival Resource) contains mostly monolingual data for Aranese, Aragonese and Asturian. The data in PILAR can be freely used for research purposes. If you do so, we ask that you cite the shared task overview paper and the URL of the repository, and respect any additional citation requirements. For other uses of the data, you should consult the original owners of the datasets.

Finally, to evaluate your systems during development, we suggest using the newly created manually-revised corpora for Aragonese, Aranese, and Asturian based on the multilingual FLORES+ dev sets (997 sentences for each language). This evaluation set closely mirrors the test set (based on the FLORES+ devtest sets) in terms of orthographic, grammatical, and domain aspects, making it an appropriate choice for use during development. Using this set will help ensure that your system is well-tuned to the nuances expected in the final evaluation. Both dev and devtest sentences have been reviewed by the respective language academies to ensure that they follow the current standard orthographic conventions.

It is important to note that the method used to create the FLORES+ Asturian dataset released for this task differs from the one used for Aragonese and Aranese. The Asturian sentences were originally obtained by Meta via professional translation from English, and we then asked the language academy to revise them. The Aragonese and Aranese sentences, in contrast, were first machine translated from the Spanish sentences using Apertium, then manually post-edited by specialists proficient in these languages, and finally revised by the respective academies. The use of MT systems is justified for two reasons: first, the two-step workflow of machine translation followed by post-editing is prevalent for these languages, with many existing texts being produced this way; second, linguists and translators for these languages are scarce, which made it difficult to source them and complete the task within the required timeframe.

You can find the FLORES+ dev sets in Aragonese, Aranese, and Asturian as part of the PILAR repository. Follow the instructions in the README file in order to download them.

Test Data

The test set will contain the 1012 sentences in the devtest split of the FLORES+ evaluation benchmark for multilingual machine translation. The Spanish sentences in FLORES+ (which you may download from the FLORES+ repository) will be professionally translated following contemporary orthographic conventions of the respective language academies as described below. The sentences in FLORES+ are human-produced translations of English sentences sampled in equal amounts from Wikinews (an international news source), Wikijunior (a collection of age-appropriate non-fiction books), and Wikivoyage (a travel guide).

Please note that the procedures described in the previous section for obtaining the FLORES+ data also apply here. In particular, note that the original FLORES+ set already contains sentences in Asturian; however, we will use a revised version of the Asturian files of FLORES+ for evaluation.

Importantly, participants cannot use any part, in any language, of the FLORES+ devtest.
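
As a precaution, you may want to check that no devtest sentence has leaked into your training data, for instance through a crawled corpus. The following is a minimal sketch that removes exact, whole-line matches only; the function name `drop_heldout` and the file names are hypothetical, and near-duplicates (different casing, spacing or punctuation) would not be caught.

```shell
# Remove from a training file every line that also appears verbatim in a held-out file.
# Usage: drop_heldout TRAIN_FILE HELDOUT_FILE > train.filtered
drop_heldout() {
    # -F: fixed strings, -x: whole-line match, -v: invert, -f: read patterns from file.
    grep -vxFf "$2" "$1"
}

# Toy demonstration with temporary files.
tmp=$(mktemp -d)
printf 'uno\ndos\ntres\n' > "$tmp/train.es"
printf 'dos\n' > "$tmp/devtest.es"
drop_heldout "$tmp/train.es" "$tmp/devtest.es"   # prints "uno" and "tres"
rm -r "$tmp"
```

This sketch works on a single monolingual file. For bilingual data, both sides of a pair would have to be dropped together so that the files stay line-aligned.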

Language Identification

In case you need to identify the language of sentences, we have created Idiomata Cognitor, a simple language identifier with high accuracy for the target languages and a few additional Romance languages.

Existing MT Systems

An interesting fact about our three low-resource languages is that they have open rule-based MT systems available for the Apertium framework. Apertium is a free/open-source rule-based architecture for MT that consists of a pipeline of modules performing part-of-speech disambiguation and tagging, lexical transfer, lexical selection, chunk-level or recursive structural transfer, and morphological generation. Use these systems at your own risk, as they may follow different orthographic conventions than those in our validation and test sets. If you are interested in the Apertium linguistic data, you may download them from their GitHub repository:

However, if you just want to use Apertium to generate synthetic data, you may simply install the engine and the data package for your languages of interest. For example, the following lines install the systems for Aragonese, Aranese and Asturian on Debian/Ubuntu and translate a sample sentence with each:

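# Add the Apertium nightly package repository
curl -sS https://apertium.projectjj.com/apt/install-nightly.sh | sudo bash
# List the available Apertium packages
apt search apertium
# Install the language pairs of interest
sudo apt-get install apertium-spa-arg apertium-es-ast apertium-oc-ca
# Spanish to Asturian
echo "La heroica ciudad dormía la siesta." | apertium spa-ast
# Spanish to Aragonese
echo "La heroica ciudad dormía la siesta." | apertium spa-arg
# Catalan to Aranese
echo "Des d’aquí, des de la meva finestra, no puc veure la mar." | apertium ca-oc_aran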
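
To scale this up from a single sentence to a whole file, the translator can be run as a filter over a monolingual corpus and its output pasted next to the source. The following is a sketch under stated assumptions: the function name `make_synthetic` and the file names are hypothetical, and it assumes the translator emits exactly one output line per input line, which the line-count check verifies before anything is written.

```shell
# Pair each source line with its machine translation to build a synthetic bitext.
# Usage: make_synthetic SRC_FILE OUT_TSV TRANSLATOR [ARGS...]
make_synthetic() {
    src=$1; out=$2; shift 2
    hyp=$(mktemp)
    # Run the translator as a stdin-to-stdout filter over the whole file.
    "$@" < "$src" > "$hyp" || { rm -f "$hyp"; return 1; }
    # Refuse to pair lines if the translator changed the number of lines.
    if [ "$(wc -l < "$src")" -ne "$(wc -l < "$hyp")" ]; then
        echo "line count mismatch; not writing $out" >&2
        rm -f "$hyp"; return 1
    fi
    # Tab-separated output: source sentence, then its translation.
    paste "$src" "$hyp" > "$out"
    rm -f "$hyp"
}

# Intended use (needs the Apertium packages installed; file names are hypothetical):
#   make_synthetic mono.es synthetic.es-arg.tsv apertium -u spa-arg
```

The `-u` option asks Apertium not to mark unknown words with an asterisk, which is usually preferable for training data. As noted above, the output may not follow the orthographic conventions of the validation and test sets.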

Besides official Apertium-based MT systems, there are a few additional MT systems available. As with Apertium, you may use them at your own risk, as they may follow different orthographic conventions than those in our validation and test sets:

  • The traduze system for Aragonese-Spanish.

  • The Softcatalà neural Aranese-Catalan system.

  • The eslema machine translation system for Asturian-Spanish.

Dictionaries

Dictionaries, whether monolingual or bilingual, may be a very relevant source of information complementary to sentences. The following list includes some of the available dictionaries:

  • The “Diccionari der aranés” by Institut d’Estudis Aranesi with definitions in Aranese and translations into Catalan provided by Enciclopèdia Catalana. A PDF version is also available.

  • The “Diccionariu de la Llingua Asturiana”, available online with queries limited to 500 results.

Orthographic Standards

The languages of interest in this task have historically exhibited diverse orthographies, and it is important to note that the released datasets might contain texts written in several of these orthographies. However, both our evaluation set and our test sets adhere to the contemporary standards supported by the respective language academies, which are reflected in the following documents:

EVALUATION

Participants will register and submit their translations through OCELoT, with submissions required to be in an XML-based format. Further details will be provided here.

DEADLINES

  • Release of training and development data for shared tasks: March, 2024

  • Test data released: 27th June

  • Translation submission deadline: 4th July

  • System description abstract paper: 11th July

  • Paper submission deadline: TBA

All deadlines are in AoE (Anywhere on Earth).

NEW DATA RELEASE

We strongly encourage participants to release any new resources they acquire as part of their research efforts and provide links in their description papers, since access to such data is crucial for continuing to build NLP systems for the low-resource languages of interest. We also urge participants to opt for the most permissive licenses possible.

CONTACT

For queries, please contact romance2024@dlsi.ua.es.

Organizers

  • Juan Antonio Pérez-Ortiz, Universitat d’Alacant

  • Felipe Sánchez-Martínez, Universitat d’Alacant

  • Antoni Oliver, Universitat Oberta de Catalunya

Acknowledgements

This task would not have been possible without the support and help in data acquisition and in the translation of the FLORES+ sentences from Academia Aragonesa de la Lengua (Instituto de l’Aragonés), Academia de la Llingua Asturiana and Institut d’Estudis Aranesi. This shared task is part of the R+D+i projects PID2021-127999NB-I00 ("LiLowLa: Lightweight neural translation technologies for low-resource languages") and PID2021-124663OB-I00 ("TAN-IBE: Neural Machine Translation for the Romance languages of the Iberian Peninsula"), both funded by the Spanish Ministry of Science and Innovation (MCIN), the Spanish Research Agency (AEI/10.13039/501100011033) and the European Regional Development Fund, A Way to Make Europe.