EMNLP 2024

NINTH CONFERENCE ON
MACHINE TRANSLATION (WMT24)

November 12-13, 2024
Miami, Florida, USA
 

ANNOUNCEMENTS

  • 1st March 2024 - The shared task is announced

INTRODUCTION

The Open Language Data Initiative (OLDI) aims to empower language communities to contribute to key datasets. These datasets are essential for expanding the reach of language technology to more language varieties.

Progress in translation quality has largely been directed at high-resource languages. Recently, the focus has started to shift to under-served languages, and foundational datasets such as FLORES and NTREX have made it easier to develop and evaluate MT models for an increasing number of languages. The high impact of these datasets has left some in the research community wondering: how do we add more languages to these existing open-source datasets?

GOALS

The goal of this shared task is to expand OLDI’s open datasets to more languages. In particular, we are soliciting contributions to the following:

  • The MT evaluation dataset FLORES+.

  • The MT Seed dataset.

  • Other high-quality, human-verified monolingual text datasets in under-resourced languages.

Contributions may consist of either the addition of entirely new languages, varieties or dialects to the above datasets, or substantial improvements to existing datasets.

To describe and publicise their contributions, task participants will be asked to submit a 2-4 page paper to be presented at the WMT 2024 conference.

TASK DESCRIPTION

To help us to gauge interest and co-ordinate efforts, we strongly encourage prospective participants to email the organisers <info@oldi.org> by the end of April.

FLORES+ and Seed contribution guidelines

  • Workflow: Contributing a new language to FLORES+ and Seed typically involves starting from the original English data and translating it. Starting from a different language is also possible, but this choice should be clearly documented. Translations should be performed, wherever possible, by qualified, native speakers of the target language. We strongly encourage verification of the data by at least one additional native speaker.

  • Dataset card: A dataset card should be attached to each new data submission, detailing precise language information and the translation workflow that was employed. In particular, we ask participants to identify the language with both an ISO 639-3 individual language tag and a Glottocode. The script should be identified with an ISO 15924 script code.

  • Use of MT:

    • The FLORES+ dataset is used to evaluate MT systems. For this reason, new contributions require human translation. Using or even referencing machine translation output is not allowed, and this includes post-editing.

    • For Seed data, the use of post-edited machine translated content is allowed, as long as all data is manually verified. Raw, unverified machine translated outputs are not allowed. If using MT, you must ensure that the terms of service of the model you use allow re-using its outputs to train other machine translation models (as an example, popular commercial systems such as DeepL, Google Translate and ChatGPT disallow this).

  • Data validation: Participants are strongly encouraged to provide experimental validation of the quality of the data they are submitting. For Seed data contributions, where applicable, this may include training a simple MT model and evaluating it on FLORES+.

  • License: As both FLORES+ and Seed are open datasets released under CC BY-SA 4.0, new contributions must also be released under this same license. By contributing data to this shared task, participants agree to have this data released under these terms.
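
As an illustration of the identifier requirements in the dataset card guideline above, the following is a minimal sketch of a shape check that a contributor might run before submitting. The dictionary layout and the field names (iso_639_3, glottocode, iso_15924, workflow) are assumptions made for this example and are not an official OLDI schema. The patterns only test that each code is well-formed (three lowercase letters, four alphanumerics plus four digits, and a four-letter title-case code respectively); they do not confirm that the code is actually registered.

```python
import re

# Shape checks only: a match means the identifier is well-formed,
# not that it exists in ISO 639-3, Glottolog or ISO 15924.
PATTERNS = {
    "iso_639_3": re.compile(r"[a-z]{3}"),             # e.g. "eng"
    "glottocode": re.compile(r"[a-z0-9]{4}[0-9]{4}"),  # e.g. "stan1293"
    "iso_15924": re.compile(r"[A-Z][a-z]{3}"),         # e.g. "Latn"
}


def check_card(card: dict) -> list[str]:
    """Return a list of problems found in a (hypothetical) dataset card dict."""
    problems = []
    for field, pattern in PATTERNS.items():
        value = str(card.get(field, ""))
        if not pattern.fullmatch(value):
            problems.append(f"{field}: {value!r} is missing or malformed")
    # The card should also describe the translation workflow in free text.
    if not str(card.get("workflow", "")).strip():
        problems.append("workflow: description is missing")
    return problems


if __name__ == "__main__":
    card = {
        "iso_639_3": "eng",
        "glottocode": "stan1293",
        "iso_15924": "Latn",
        "workflow": "Translated from English and checked by a second native speaker.",
    }
    print(check_card(card))
```

An empty list means the card passed these basic checks; a manual lookup of each code in the relevant registry is still needed.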

For further information, please consult the OLDI translation guidelines.

Monolingual contribution guidelines

The aim of this corpus is to collect high-quality, human-verified monolingual text in multiple under-resourced languages, for the purposes of training language identification systems, language models, backtranslation, and other related tools.

  • Workflow: Data must not be synthetic (for example, MT or LLM output). Other than that, there are no restrictions on the provenance of the data, provided it is clearly identified along with its license. To ensure the quality of the data, contributors must verify that it is in the claimed language and free of issues such as encoding problems. If at all possible, this should be done by having one or more native speakers manually check a sufficiently large, representative sample of the whole dataset.

  • Data card: A data card should be attached to each new data submission, detailing precise language information and the translation workflow that was employed. In particular, we ask participants to identify the language with both an ISO 639-3 individual language tag and a Glottocode. The script should be identified with an ISO 15924 script code.

  • License: We encourage contributions under open licenses. At a minimum, data should be made available for research use.
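
The verification step described in the workflow guideline above could be supported by a small script along the following lines. This is only a sketch under stated assumptions: the function names are hypothetical, the two automatic checks (Unicode replacement characters and stray control characters) cover only some common signs of encoding damage, and the sample size for manual review is left to the contributor. It does not replace a check by native speakers.

```python
import random
import unicodedata


def find_encoding_issues(lines):
    """Return (line_number, reason) pairs for lines with likely encoding damage."""
    issues = []
    for number, line in enumerate(lines, start=1):
        if "\ufffd" in line:
            # U+FFFD usually means bytes were decoded with the wrong encoding.
            issues.append((number, "contains U+FFFD replacement character"))
        elif any(
            unicodedata.category(ch) == "Cc" and ch not in "\t\n\r"
            for ch in line
        ):
            # Control characters other than tab and line breaks are suspicious.
            issues.append((number, "contains control character"))
    return issues


def sample_for_review(lines, k, seed=0):
    """Draw a reproducible random sample of lines for manual checking."""
    lines = list(lines)
    rng = random.Random(seed)
    return rng.sample(lines, min(k, len(lines)))


if __name__ == "__main__":
    corpus = ["Bon dia", "caf\ufffd", "tab\tis fine", "bell\x07"]
    for number, reason in find_encoding_issues(corpus):
        print(f"line {number}: {reason}")
    print(sample_for_review(corpus, k=2))
```

Fixing the random seed makes the review sample reproducible, so the same lines can be described in the accompanying data card.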

For further information, please consult the OLDI monolingual contribution guidelines.

DEADLINES

Indication of interest (recommended)

End of April 2024

Paper and data submission deadline

TBA around 12th August (follows EMNLP)

All deadlines are in AoE (Anywhere on Earth).

CONTACT

OLDI Organisers <info@oldi.org>

ORGANIZERS

  • Antonios Anastasopoulos, George Mason University

  • Laurie Burchell, University of Edinburgh

  • Christian Federmann, Microsoft

  • Jean Maillard, FAIR, Meta

  • Philipp Koehn, Johns Hopkins University

  • Skyler Wang, UC Berkeley, FAIR, Meta