Shared task: Translation into Low-Resource Languages of Spain

EMNLP 2024

NINTH CONFERENCE ON
MACHINE TRANSLATION (WMT24)

November 15-16, 2024
Miami, Florida, USA


ANNOUNCEMENTS

23rd July 2024 - New details on the test data and the publication of the results have been added to the subsection Release of the Test Data.
14th July 2024 - More details about the extended abstract have been added to the subsection Extended Abstract.
5th July 2024 - The system submission platform is now open! Instructions on how to submit your translations have been added to the evaluation section under the Instructions for Submitting your Team Results to the Conference Task subsection.
27th June 2024 - The release of the source test data has been postponed to the 5th of July. This way, we also avoid saturation with other tasks in the submission system. However, note that the Spanish side of the test set is already available in the FLORES+ dataset, as indicated in the data section of this page. The deadline for the submission of your translations has been extended accordingly. Also, note that after submitting your translations, you will have to send a one-or-two-page PDF document with the description of your systems to the organizers via email.
18th June 2024 - The paragraph explaining constrained systems has been updated to clarify some aspects.
9th May 2024 - Added a relevant paragraph to the data section about the different procedures followed to obtain the FLORES+ datasets.
22nd April 2024 - Added a language identifier for the target languages (Idiomata Cognitor) and a small clarification that you can use OPUS as a source of monolingual data in the constrained setting.
18th April 2024 - All pending data (obviously excluding test sets) has been released. You may start training your systems!
20th February 2024 - The shared task is announced.

INTRODUCTION

In Spain, a diverse linguistic landscape exists, including, beyond the widely recognized Spanish, other languages such as Basque, Catalan, and Galician. Although Spanish is obviously at the forefront in terms of the volume of resources available for training data-driven MT systems, the capabilities and richness of the other languages should not be underestimated. Basque, Catalan, and Galician, which might have been considered limited in resources in the past, actually possess a significant amount of data that facilitate their integration into modern MT technologies. In fact, these three languages have been recently included among the list of up to 100 select languages in well-known multilingual systems such as mBERT, XLM-R, mBART, mT5 or NLLB-200. However, Spain is home to additional languages with much fewer resources, especially in the form of bilingual data. This task focuses on three of them, namely, Aragonese, Aranese, and Asturian.

GOALS

The goals of this shared translation task are:

To push the boundaries of machine translation system development when the amount of resources is extremely scarce.
To explore the transferability among low-resource Romance languages when translating from Spanish.
To find the best way to use pre-trained models of any kind for the translation between Spanish and low-resource Romance languages.
To create publicly available corpora for machine translation development and evaluation.

We expect participation in this task from both newcomers and established research teams.

TASK DESCRIPTION

Participants will use their systems to translate a test set of unseen sentences. As described later, the Spanish source side of this test set is already available to participants, since it is part of the FLORES+ dataset. The counterpart in each low-resource language, however, will be new and purposefully created for this shared task.

The translation quality will be exclusively measured via automatic evaluation metrics. You may participate in any or all of the language pairs. To have a common framework that allows for comparable results, and also to lower the barrier to entry, we provide a possible training set.

As in all WMT shared tasks, each participant is required to submit a paper describing the submission, which should highlight how your methods and data differ from the standard approach. Make clear which tools and which datasets you used. Moreover, each participant has to submit an extended abstract of one or two pages describing the system shortly after the system submission deadline (see the DEADLINES section for the date). The abstract should contain, at a minimum, basic information about the system and the methods, data, and tools used. See the main page for the link to the submission site.

Language pairs

The list of language pairs that are going to be evaluated (only specified direction) follows:

Spanish to Aragonese
Spanish to Aranese
Spanish to Asturian

Note that Aranese is a variety of Gascon, one of the two main dialects of Occitan. You may decide to use Occitan data as part of the training process for your systems, bearing in mind that some orthographic differences may exist.

Types of submissions

Submissions may fall into three different categories, depending on the corpora and models used and on the reproducibility of the results: constrained, open and closed. The organizers of the task reserve the right to decide the category of a submission in case of doubt.

Constrained submissions can only use the resources (corpora, dictionaries, Apertium-based systems or data, and orthographic conventions) linked as part of this page (see below). They can also make use of any existing publicly available pre-trained language models or translation models, provided they do not exceed the amount of 1B (one thousand million) parameters according to their model cards; for example, NLLB-200-600M fits within this limit. This size restriction also applies to neural systems used collaterally, for example, to obtain synthetic data. Developed systems may be bilingual or multilingual, not necessarily including all the languages of interest.

Open submissions can make use of any resource (corpora, pre-trained models, etc.), in any language, without size restriction, provided that they are publicly available under open source licenses to ensure reproducibility by third parties. Machine translation systems or large language models available online belong to this category if the resulting outputs are made available.

Closed submissions can be trained without limitations regarding the availability of resources (corpora, pre-trained models, etc.) used.

In any case, we encourage participants to estimate the amount of kWh that their training required and to include, in the paper describing the submission, a section discussing the possible trade-off between energy consumption and translation quality.
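As a rough illustration of how such an energy figure could be estimated, the following minimal sketch multiplies an assumed average power draw by the number of devices and the training time. The function name, the `pue` overhead factor, and the example numbers are illustrative assumptions, not something prescribed by the task; measured values from a power meter or from your cluster's monitoring tools are preferable when available.

```python
def estimate_training_kwh(avg_power_watts: float, num_devices: int,
                          hours: float, pue: float = 1.0) -> float:
    """Back-of-the-envelope energy estimate in kWh.

    avg_power_watts: assumed average power draw of one device (e.g. one GPU).
    num_devices:     number of devices used in parallel.
    hours:           wall-clock training time in hours.
    pue:             assumed overhead factor of the facility (1.0 = no overhead).
    """
    if avg_power_watts < 0 or num_devices < 0 or hours < 0 or pue < 1.0:
        raise ValueError("inputs must be non-negative and pue must be >= 1.0")
    watt_hours = avg_power_watts * num_devices * hours * pue
    return watt_hours / 1000.0  # Wh -> kWh


# Made-up example: 4 devices drawing 250 W on average for 10 hours.
print(estimate_training_kwh(250, 4, 10))  # 10.0
```

The estimate ignores CPU, memory, and storage consumption unless they are folded into the assumed average power, so it is best reported together with the assumptions behind it.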

DATA

Training Data and Development Data

The training and development corpora for constrained submissions are restricted to everything in OPUS, the recently released PILAR corpora, and the brand new FLORES+ dev sets for our target languages.

On the one hand, OPUS lists some (mostly uncurated) resources for Aragonese, Asturian, and Occitan. See, for example, the following bilingual resources:

Note that the fact that you can use any resource from OPUS includes the possibility of also collecting monolingual data by using the source or target side of any bilingual resource in OPUS.
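One possible way of turning one side of a bilingual resource into monolingual data is sketched below. It assumes the resource has been downloaded as plain text with one sentence per line (the file names in the usage note are hypothetical), and it only normalizes whitespace and removes empty lines and exact duplicates; any further cleaning, such as language identification, is left out.

```python
from typing import Iterable, Iterator


def monolingual_side(lines: Iterable[str]) -> Iterator[str]:
    """Yield cleaned, de-duplicated sentences from one side of a bitext."""
    seen = set()
    for line in lines:
        sentence = " ".join(line.split())  # collapse whitespace, drop newline
        if sentence and sentence not in seen:
            seen.add(sentence)
            yield sentence


def extract_to_file(side_path: str, out_path: str) -> int:
    """Write the cleaned sentences of side_path to out_path; return the count."""
    count = 0
    with open(side_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as out:
        for sentence in monolingual_side(src):
            out.write(sentence + "\n")
            count += 1
    return count
```

For example, `extract_to_file("corpus.ast", "mono.ast")` would keep only the Asturian side of a hypothetical aligned pair of files `corpus.es` and `corpus.ast`.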

On the other hand, PILAR (Pan-Iberian Language Archival Resource) contains mostly monolingual data for Aranese, Aragonese and Asturian. The data in PILAR can be freely used for research purposes. If you do so, we ask that you cite the shared task overview paper and the URL of the repository, and respect any additional citation requirements. For other uses of the data, you should consult the original owners of the datasets.

Finally, to evaluate your systems during development, we suggest using the newly created manually-revised corpora for Aragonese, Aranese, and Asturian based on the multilingual FLORES+ dev sets (997 sentences for each language). This evaluation set closely mirrors the test set (based on the FLORES+ devtest sets) in terms of orthographic, grammatical, and domain aspects, making it an appropriate choice for use during development. Using this set will help ensure that your system is well-tuned to the nuances expected in the final evaluation. Both dev and devtest sentences have been reviewed by the respective language academies to ensure that they follow the current standard orthographic conventions.

It is important to note that the method used to create the FLORES+ Asturian dataset released for this task differs from the one used for Aragonese and Aranese. The Asturian sentences were originally obtained by Meta via professional translation from English, and we then asked the academy to revise them, whereas the Aragonese and Aranese sentences were first machine translated from the Spanish sentences using Apertium, then manually post-edited by specialists proficient in these languages, and finally revised by the academies. The use of MT systems is justified for two reasons: firstly, the two-step workflow consisting of machine translation followed by post-editing is prevalent for these languages, with many existing texts being produced this way; secondly, sourcing linguists or translators for these languages proved challenging due to their scarcity, making it difficult to complete the task within the required timeframe.

You can find the FLORES+ dev sets in Aragonese, Aranese, and Asturian as part of the PILAR repository. Follow the instructions in the README file in order to download them.

Test Data

The test set will contain the 1012 sentences in the devtest split of the FLORES+ evaluation benchmark for multilingual machine translation. The Spanish sentences in FLORES+ (which you may download from the FLORES+ repository) will be professionally translated following contemporary orthographic conventions of the respective language academies as described below. The sentences in FLORES+ are human-produced translations of English sentences sampled in equal amounts from Wikinews (an international news source), Wikijunior (a collection of age-appropriate non-fiction books), and Wikivoyage (a travel guide).

Please note that the procedures described in the previous section for obtaining the FLORES+ data also apply here. In particular, note that the original FLORES+ set already contains sentences in Asturian; however, we will use a revised version of the Asturian files of FLORES+ for evaluation.

Importantly, participants cannot use any part, in any language, of the FLORES+ devtest.
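Since training data gathered from several sources may accidentally include FLORES+ devtest sentences, a simple overlap filter can serve as a safeguard. The following is a minimal sketch under the assumption that the devtest sentences (in whatever languages you have them) are loaded as a list of strings; the function names are illustrative. It only catches matches that are identical after Unicode, case, and whitespace normalization, so near-duplicates would need a fuzzier comparison.

```python
import unicodedata
from typing import Iterable, List, Tuple


def normalize(sentence: str) -> str:
    """Normalize Unicode form, case and whitespace before comparing."""
    folded = unicodedata.normalize("NFKC", sentence).casefold()
    return " ".join(folded.split())


def drop_devtest_overlap(pairs: Iterable[Tuple[str, str]],
                         devtest_sentences: Iterable[str]) -> List[Tuple[str, str]]:
    """Keep only the (source, target) pairs where neither side is in devtest."""
    forbidden = {normalize(s) for s in devtest_sentences}
    return [
        (src, tgt) for src, tgt in pairs
        if normalize(src) not in forbidden and normalize(tgt) not in forbidden
    ]
```

A pair is dropped if either of its sides matches, which errs on the side of caution.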

Language Identification

In case you need to identify the language of sentences, we have created Idiomata Cognitor, a simple language identifier with high accuracy for the target languages and a few additional Romance languages.

Existing MT Systems

An interesting fact about our three low-resource languages is that they have open rule-based MT systems available for the Apertium framework. Apertium is a free/open-source rule-based architecture for MT that consists of a pipeline of modules performing part-of-speech disambiguation and tagging, lexical transfer, lexical selection, chunk-level or recursive structural transfer, and morphological generation. Use these systems at your own risk, as they may follow different orthographic conventions than those in our validation and test sets. If you are interested in the Apertium linguistic data, you may download them from their GitHub repository:

However, if you just want to use Apertium to generate synthetic data, you may simply install the engine and the data package for your languages of interest. For example, the following lines install the systems for Aragonese, Aranese and Asturian, and translate a sentence in Debian/Ubuntu:

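# Add the Apertium nightly package repository
curl -sS https://apertium.projectjj.com/apt/install-nightly.sh | sudo bash
# List the available Apertium packages
apt search apertium
# Install the language pairs for Aragonese, Asturian and Aranese (Occitan)
sudo apt-get install apertium-spa-arg apertium-spa-ast apertium-oc-ca
# Spanish to Asturian and Spanish to Aragonese
echo "La heroica ciudad dormía la siesta." | apertium spa-ast
echo "La heroica ciudad dormía la siesta." | apertium spa-arg
# Catalan to Aranese
echo "Des d’aquí, des de la meva finestra, no puc veure la mar." | apertium ca-oc_aran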
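To produce synthetic data in bulk, the shell commands above can be wrapped in a small script. The sketch below is one possible way to do it, assuming the `apertium` command is installed and on the PATH; the function names are illustrative. It calls the engine once per sentence, which is slow but keeps each source sentence aligned with its translation.

```python
import subprocess
from typing import Callable, Iterable, List, Tuple


def run_apertium(text: str, mode: str) -> str:
    """Pipe text through `apertium MODE`, as in the shell commands above."""
    result = subprocess.run(
        ["apertium", mode],
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if apertium fails
    )
    return result.stdout


def synthetic_pairs(sources: Iterable[str], mode: str,
                    translate: Callable[[str, str], str] = run_apertium
                    ) -> List[Tuple[str, str]]:
    """Return (source, machine translation) pairs, skipping empty lines."""
    pairs = []
    for sentence in sources:
        sentence = sentence.strip()
        if not sentence:
            continue
        pairs.append((sentence, translate(sentence + "\n", mode).strip()))
    return pairs
```

For instance, `synthetic_pairs(open("mono.es", encoding="utf-8"), "spa-ast")` would translate a hypothetical Spanish file into Asturian. The `translate` parameter exists only so that the engine can be swapped or mocked. Keep in mind the caveat above about Apertium output possibly following different orthographic conventions.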

Besides official Apertium-based MT systems, there are a few additional MT systems available. As with Apertium, you may use them at your own risk, as they may follow different orthographic conventions than those in our validation and test sets:

The traduze system for Aragonese-Spanish.
The Softcatalà neural Aranese-Catalan system.
The eslema machine translation system for Asturian-Spanish.

Dictionaries

Dictionaries, whether monolingual or bilingual, may be a very relevant source of information complementary to sentences. The following list includes some of the available dictionaries:

The “Diccionari der aranés” by Institut d’Estudis Aranesi with definitions in Aranese and translations into Catalan provided by Enciclopèdia Catalana. A PDF version is also available.
The “Diccionariu de la Llingua Asturiana” is available online with queries limited to 500 results.

Orthographic Standards

The languages of interest in this task have historically exhibited diverse orthographies, and it is important to note that the released datasets might contain texts written in several of these orthographies. However, both our evaluation set and our test sets adhere to the contemporary standards supported by the respective language academies, as reflected in the following documents:

“Normes ortogràfiques” by Academia de la Llingua Asturiana.
“Ortografía de l’aragonés” by Academia Aragonesa de la Lengua.
“Gramatica der occitan aranés” published by Institut d’Estudis Aranesi.

EVALUATION

Participants will register and submit their translations through OCELoT, with submissions required to be in an XML-based format. The evaluation will primarily be conducted automatically using the BLEU and chrF++ metrics.
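To give an intuition of what a character-based metric measures, here is a simplified, sentence-level character n-gram F-score in the spirit of chrF. It is only a sketch written for illustration: it is not chrF++ (which also includes word n-grams), it will not reproduce the scores computed by the organizers, and the function names and defaults (n-grams up to length 6, beta = 2) are assumptions. Use an established evaluation tool for any reported results.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_char_fscore(hypothesis: str, reference: str,
                       max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    # F-beta: with beta = 2, recall weighs more than precision.
    return 100.0 * (1 + b2) * p * r / (b2 * p + r)
```

An exact match scores 100, a hypothesis sharing no characters with the reference scores 0, and partial matches fall in between.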

Instructions for Submitting your Team Results to the Conference Task

Submissions can be made via this OCELoT website.

In order to do this, it is necessary to register the team here. The account verification process is manual, so there will be a waiting period after registration before you can start submitting. If it takes more than 24 hours, please contact the organizers at romance2024@dlsi.ua.es.

Until that time, the submission form will not be available for you.

Once the registration process is complete, the submission form is available here.

The "test set" dropdown menu allows you to choose the reference set for evaluating the submission:

romance2024.spa-arn test set (spa-arn) for Aranese
romance2024.spa-ast test set (spa-ast) for Asturian
romance2024.spa-arg test set (spa-arg) for Aragonese

The file with the hypotheses must be in the format generated by wmt-format-tools. This same tool allows you to extract the source sentences from the test set available in the downloads tab. We suggest that you use the tools to prepare your files, but for your convenience, this is an excerpt of an XML file with the expected format:

<?xml version='1.0' encoding='utf-8'?>
<dataset id="romance-translation-test">
  <collection id="romance">
    <doc origlang="spa" id="Flores-spa_Latn.devtest">
      <src lang="spa">
        <p>
          <seg id="1">«Actualmente, tenemos ratones de cuatro meses de edad...</seg>
          <seg id="2">La investigación todavía...</seg>
          ...
        </p>
      </src>
      <hyp system="Apertium" lang="arg">
        <p>
          <seg id="1">«Actualment, tenemos ratos de cuatro meses d'edat...</seg>
          <seg id="2">La investigación encara...</seg>
          ...
        </p>
      </hyp>
    </doc>
  </collection>
</dataset>
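If you want to inspect or prototype this structure yourself, the following minimal sketch builds a file with the same element layout as the excerpt above, using only the Python standard library. It is not a replacement for wmt-format-tools, which remains the suggested way of preparing submissions. The function name is illustrative, and the attribute values are copied from the excerpt on the assumption that they apply to the whole file.

```python
import xml.etree.ElementTree as ET
from typing import List


def build_submission(sources: List[str], hypotheses: List[str],
                     system: str, target_lang: str) -> ET.ElementTree:
    """Build an XML tree mirroring the excerpt: one doc with src and hyp."""
    if len(sources) != len(hypotheses):
        raise ValueError("sources and hypotheses must have the same length")
    dataset = ET.Element("dataset", id="romance-translation-test")
    collection = ET.SubElement(dataset, "collection", id="romance")
    doc = ET.SubElement(collection, "doc", origlang="spa",
                        id="Flores-spa_Latn.devtest")
    src_p = ET.SubElement(ET.SubElement(doc, "src", lang="spa"), "p")
    hyp_p = ET.SubElement(
        ET.SubElement(doc, "hyp", system=system, lang=target_lang), "p")
    # Segment ids start at 1 and must match between source and hypothesis.
    for i, (src, hyp) in enumerate(zip(sources, hypotheses), start=1):
        ET.SubElement(src_p, "seg", id=str(i)).text = src
        ET.SubElement(hyp_p, "seg", id=str(i)).text = hyp
    return ET.ElementTree(dataset)


def write_submission(tree: ET.ElementTree, path: str) -> None:
    """Write the tree as UTF-8 with an XML declaration."""
    tree.write(path, encoding="utf-8", xml_declaration=True)
```

Special characters such as `&` or `<` in the sentences are escaped automatically when the tree is written.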

The "Is primary" option you will find in the form is not relevant for this task. You may leave it unchecked.

Lastly, it is very important to choose the type of submission (constrained, closed, or open). There is a dropdown menu for this in the "Primary submissions" section of the Team submissions tab. In this same tab, you will be able to fill in the details of the final article that describes your systems later on.

Release of the Test Data

The test data for all target languages was released on July 23rd and is now included in the FLORES+ file of the PILAR repository. Please follow the instructions in the README file to download and decrypt it. The test set for this task corresponds to the devtest split of FLORES+.

Regarding the publication of the overall results, we cannot commit to a specific date yet. In the past, some shared tasks have waited until the conference to publish results, and this may be a possibility for us as well. Stay tuned to this page for updates.

DEADLINES

Release of training and development data for shared tasks: March 2024
Test data released and submission platform opening: 5th July
Translation submission deadline: 12th July
System description extended abstract: 15th July
Paper submission deadline: 20th August

All deadlines are in AoE (Anywhere on Earth).

Extended Abstract

The extended abstract, with a maximum length of two pages, should be sent in PDF format to romance2024@dlsi.ua.es.

The objective of the extended abstract is to provide an overview of the data used, the training procedures, and the models employed for each submission. This document of one or two pages, which will not be published, aims to give the organizers of the shared task a preliminary idea of each system so that they can start writing the findings paper, which will briefly describe all of them. For this reason, state clearly the authorship, the corresponding submission ids, the type of each submission (constrained, open, or closed), and, optionally, if available, the amount of kWh consumed during training.

Additionally, each participant will have to present a full article with specific details of their findings by the deadline shown above, and this article will be published. This initial extended abstract can obviously serve as a starting point for the full article.

There is no specific template required for the extended abstract, and it is not necessary to include all the details at this stage. However, please ensure that enough information is provided to allow the organizers to understand the main points of your submission appropriately. Although the wording should be similar to a scientific paper, there is no need for an introduction, conclusions, results, previous work, or references, unless they involve techniques that are not well-established and which are really relevant for your work. Simply state the main points on your datasets, training procedures, and models.

NEW DATA RELEASE

We strongly encourage participants to release any new resources they acquire as part of their research efforts and provide links in their description papers, since access to such data is crucial for continuing to build NLP systems for the low-resource languages of interest. We also urge participants to opt for the most permissive licenses possible.

CONTACT

For queries, please contact romance2024@dlsi.ua.es.

Organizers

Juan Antonio Pérez-Ortiz, Universitat d’Alacant
Felipe Sánchez-Martínez, Universitat d’Alacant
Aarón Galiano Jiménez, Universitat d’Alacant
Antoni Oliver, Universitat Oberta de Catalunya

Acknowledgements

This task would not have been possible without the support and help in data acquisition and in the translation of the FLORES+ sentences from Academia Aragonesa de la Lengua (Instituto de l’Aragonés), Academia de la Llingua Asturiana and Institut d’Estudis Aranesi. This shared task is part of the R+D+i projects PID2021-127999NB-I00 ("LiLowLa: Lightweight neural translation technologies for low-resource languages") and PID2021-124663OB-I00 ("TAN-IBE: Neural Machine Translation for the Romance languages of the Iberian Peninsula"), both funded by the Spanish Ministry of Science and Innovation (MCIN), the Spanish Research Agency (AEI/10.13039/501100011033) and the European Regional Development Fund, A Way to Make Europe.
