Open Language Data Initiative

EMNLP 2024

NINTH CONFERENCE ON
MACHINE TRANSLATION (WMT24)

November 15-16, 2024
Miami, Florida, USA

ANNOUNCEMENTS

1st March 2024 - The shared task is announced
20th May 2024 - Additional information on the submission format is released

INTRODUCTION

The Open Language Data Initiative (OLDI) aims to empower language communities to contribute to key datasets. These datasets are essential for expanding the reach of language technology to more language varieties.

Progress in translation quality has largely been directed at high-resource languages. Recently, focus has started to shift to under-served languages, and foundational datasets such as FLORES and NTREX have made it easier to develop and evaluate MT models for an increasing number of languages. The high impact of these components left some in the research community wondering: how do we add more languages to these existing open-source datasets?

GOALS

The goal of this shared task is to expand OLDI’s open datasets to more languages. In particular, we are soliciting contributions to the following:

The MT evaluation dataset FLORES+.
The MT Seed dataset.
Other high-quality, human-verified monolingual text datasets in under-resourced languages.

Contributions may consist of either the addition of entirely new languages, varieties or dialects to the above datasets, or substantial improvements to existing datasets.

To describe and publicise their contributions, task participants will be asked to submit a 2-4 page paper to be presented at the WMT 2024 conference.

TASK DESCRIPTION

To help us to gauge interest and co-ordinate efforts, we ask prospective participants to email the organisers <info@oldi.org>.

FLORES+ and Seed contribution guidelines

Workflow: Contributing a new language to FLORES+ and Seed typically involves starting from the original English data and translating it. Starting from a different language is also possible, but this choice should be clearly documented. Translations should be performed, wherever possible, by qualified, native speakers of the target language. We strongly encourage verification of the data by at least one additional native speaker.
Dataset card: Dataset cards should be attached to new data submissions, detailing precise language information and the translation workflow that was employed. In particular, we ask participants to identify the language with both an ISO 639-3 individual language tag and a Glottocode. The script should be identified with an ISO 15924 script code.
Use of MT:
- The FLORES+ dataset is used to evaluate MT systems. For this reason, new contributions require human translation. Using or even referencing machine translation output is not allowed, and this includes post-editing.
- For Seed data, the use of post-edited machine-translated content is allowed, as long as all data is manually verified. Raw, unverified machine-translated outputs are not allowed. If using MT, you must ensure that the terms of service of the model you use allow re-using its outputs to train other machine translation models (as an example, popular commercial systems such as DeepL, Google Translate and ChatGPT disallow this).
Data validation: Participants are strongly encouraged to provide experimental validation of the quality of the data they are submitting. For Seed data contributions, where applicable, this may include training a simple MT model and evaluating it on FLORES+.
License: As both FLORES+ and Seed are open datasets released under CC BY-SA 4.0, new contributions must also be released under this same license. By contributing data to this shared task, participants agree to have this data released under these terms.

For further information, please consult the OLDI translation guidelines.
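
As an illustration of the three identifiers requested in the dataset card, the following is a minimal sketch of a format check. The field names (`iso_639_3`, `glottocode`, `iso_15924`) and the function `check_language_fields` are hypothetical and not part of any OLDI tooling. The patterns only test the shape of each code (three lowercase letters; four alphanumeric characters followed by four digits; four letters with an initial capital) and do not confirm that a code is actually registered.

```python
import re

# Shape-only patterns: they do not look the codes up in any registry.
ISO_639_3 = re.compile(r"[a-z]{3}")              # e.g. "eng"
GLOTTOCODE = re.compile(r"[a-z0-9]{4}[0-9]{4}")  # e.g. "stan1293"
ISO_15924 = re.compile(r"[A-Z][a-z]{3}")         # e.g. "Latn"


def check_language_fields(card: dict) -> list[str]:
    """Return a list of problems found in a (hypothetical) dataset card dict."""
    checks = {
        "iso_639_3": ISO_639_3,
        "glottocode": GLOTTOCODE,
        "iso_15924": ISO_15924,
    }
    problems = []
    for field, pattern in checks.items():
        value = card.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif not isinstance(value, str) or not pattern.fullmatch(value):
            problems.append(f"malformed {field}: {value!r}")
    return problems


# Illustrative values only (English, Latin script).
card = {"iso_639_3": "eng", "glottocode": "stan1293", "iso_15924": "Latn"}
print(check_language_fields(card))
```

A check of this kind catches common slips such as a two-letter ISO 639-1 code or a lowercase script code, but the codes themselves should still be confirmed against the official ISO 639-3, Glottolog and ISO 15924 listings.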

Monolingual contribution guidelines

The aim of this corpus is to collect high-quality, human-verified monolingual text in multiple under-resourced languages, for the purposes of training language identification systems, language models, backtranslation, and other related tools.

Workflow: Data must not be synthetic, such as MT or LLM output. Other than that, there are no restrictions on the provenance of the data provided it is clearly identified along with its license. In order to ensure the quality of the data, contributors must verify that it is in the claimed language and free of issues such as encoding problems. If at all possible, this should be done by having one or more native speakers manually check a sufficiently large representative sample of the whole dataset.
Data card: A data card should be attached to new data submissions, detailing precise language information and the translation workflow that was employed. In particular, we ask participants to identify the language with both an ISO 639-3 individual language tag and a Glottocode. The script should be identified with an ISO 15924 script code.
License: We encourage contributions under open licenses. At a minimum, data should be made available for research use.

For further information, please consult the OLDI monolingual contribution guidelines.
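
The checks for encoding problems and the manual review of a representative sample described above could be prepared along the following lines. This is a rough sketch under stated assumptions: the data is expected to be UTF-8 with one text unit per line, and the function names `find_encoding_issues` and `sample_for_review` are hypothetical. It flags only a few obvious problems and does not replace verification by native speakers.

```python
import random
import unicodedata


def find_encoding_issues(raw_lines):
    """Return (line_number, reason) pairs for lines with obvious encoding debris.

    `raw_lines` is a list of bytes objects, e.g. from open(path, "rb").readlines().
    """
    issues = []
    for number, raw in enumerate(raw_lines, start=1):
        try:
            line = raw.decode("utf-8")
        except UnicodeDecodeError:
            issues.append((number, "not valid UTF-8"))
            continue
        if "\ufffd" in line:
            issues.append((number, "contains U+FFFD replacement character"))
        elif any(unicodedata.category(ch) == "Cc" and ch not in "\t\r\n" for ch in line):
            issues.append((number, "contains control character"))
    return issues


def sample_for_review(lines, size, seed=0):
    """Draw a reproducible random sample of lines for manual checking."""
    rng = random.Random(seed)
    return rng.sample(lines, min(size, len(lines)))


raw = [b"Bon dia\n", b"caf\xe9\n", "mal\ufffdformat\n".encode("utf-8"), b"bell\x07\n"]
for number, reason in find_encoding_issues(raw):
    print(number, reason)
```

Fixing the random seed makes the review sample reproducible, so the same lines can be cited in the data card or the paper.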

Submission format

Submissions consist of two main components, both of which should be submitted by 20th August:

A dataset accompanied by its corresponding data card, following the guidelines above and the in-depth instructions on the OLDI website. These should be shared with the organisers via email at <info@oldi.org>.
A system paper, prepared according to the WMT instructions and submitted via START.

Papers should cover the following topics:

Language overview. Provide background information on the language, highlighting dialectal and spelling variations, as well as the specific variety used in the submission. If applicable, mention any particular attributes which need special consideration when developing NLP applications, e.g. morphology, commonly confused languages, writing system(s). For under-resourced languages, include a review of existing datasets, reference materials, and other relevant resources.
Data collection. Offer an in-depth description of the data acquisition process. For example, when submitting a translated dataset, provide details such as the source language, number of translators, their expertise (native speakers, proficiency, professional experience), and whether any portion was independently reviewed by third parties.
Experimental validation. Provide experimental validation of the submitted data’s quality. For a seed translation dataset, this might involve training a translation model on new data and evaluating it on existing benchmark data, comparing it to pre-existing models or those trained with pre-existing data (where applicable). For monolingual data, this could include training and cross-validating language models or language identification (LID) classifiers on the new data.
Data sample. Provide a short excerpt of the data available in the dataset to demonstrate its format and content. This should take up no more than half a page. If possible, provide a translation in English.
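
For the experimental validation of monolingual data, cross-validating a language identification classifier could look roughly like the sketch below. The classifier here is a toy character-trigram model with add-one smoothing, chosen only to keep the example self-contained; it is an assumption of this sketch, not a method prescribed by the task, and all function names are hypothetical. In practice, participants would likely substitute an established LID toolkit.

```python
import math
import random
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Character n-grams of the lowercased text, padded with spaces."""
    padded = f" {text.strip().lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Count n-grams per label from (text, label) pairs."""
    counts = defaultdict(Counter)
    for text, label in samples:
        counts[label].update(char_ngrams(text, n))
    return counts


def predict(counts, text, n=3):
    """Return the label whose add-one-smoothed n-gram model scores the text highest."""
    vocab_size = len({gram for c in counts.values() for gram in c}) + 1
    best_label, best_score = None, -math.inf
    for label, c in counts.items():
        total = sum(c.values())
        score = sum(math.log((c[gram] + 1) / (total + vocab_size))
                    for gram in char_ngrams(text, n))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(samples, k=5, n=3, seed=0):
    """Return mean accuracy over k folds of shuffled (text, label) pairs."""
    if len(samples) < k:
        raise ValueError("need at least k samples")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    correct = 0
    for i, held_out in enumerate(folds):
        # Train on every fold except the held-out one.
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train(training, n)
        correct += sum(predict(model, text, n) == label for text, label in held_out)
    return correct / len(shuffled)
```

A realistic setup would pair the newly contributed text with data from closely related or commonly confused languages, since accuracy against unrelated languages says little about data quality.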

DEADLINES

Paper and data submission deadline

20th August (follows WMT/EMNLP)

Notification of acceptance

20th September (follows WMT/EMNLP)

Conference

15th-16th November (follows WMT/EMNLP)

All deadlines are in AoE (Anywhere on Earth).

CONTACT

OLDI Organisers <info@oldi.org>

ORGANIZERS

Antonios Anastasopoulos, George Mason University
Laurie Burchell, University of Edinburgh
Christian Federmann, Microsoft
Jean Maillard, FAIR, Meta
Philipp Koehn, Johns Hopkins University
Skyler Wang, UC Berkeley, FAIR, Meta
