Task description
This task aims to evaluate systems for the translation of documents from the biomedical domain.
The test data will consist of biomedical abstracts and will cover the following language pairs:
- English-French and French-English (en/fr, fr/en)
- English-German and German-English (en/de, de/en)
- English-Italian and Italian-English (en/it, it/en)
- English-Portuguese and Portuguese-English (en/pt, pt/en)
- English-Russian and Russian-English (en/ru, ru/en)
- English-Spanish and Spanish-English (en/es, es/en)
Data
NEW!!
We ask participants not to download the Medline database themselves to retrieve training data.
Submissions derived from a model trained on the whole of PubMed will not be considered in the evaluation.
Participants can rely on training (and development) data from various sources, for instance:
- The Biomedical Translation repository includes links to
parallel corpora of scientific publications (en/pt, en/es, en/fr, en/de, en/zh, en/it, en/ru), among others
- The UFAL Medical Corpus (formerly HimLCorpus) includes
medical text from various sources for many language pairs (en/es, en/de, en/fr, en/ro)
HimL test sets can be used as the development sets for some language pairs (en/es, en/de, en/fr, en/ro)
- The Khresmoi development data can be used for some language pairs (en/es, en/de, en/fr).
- The UNCorpus contains training data for some languages (en/es, en/fr, en/zh)
- The MeSpEn corpus contains many parallel documents of en/es
- The Scielo full text corpus for en, es and pt
- The Brazilian Theses and Dissertations corpus for en/pt
- Refactored versions of the WMT biomedical test sets (by Abdul Rauf and Yvon)
Participants are also free to use out-of-domain data.
Evaluation
Evaluation will be carried out both automatically and manually.
The automatic evaluation will make use of standard machine translation metrics.
Native speakers of each language will manually check the translation quality of a small sample of the submissions.
If necessary (depending on the number of submissions), we may also ask participants to support the manual evaluation.
We plan to release test sets for the following language pairs and sources:
- Scientific abstracts:
- French/English (both directions)
- German/English (both directions)
- Italian/English (both directions)
- Portuguese/English (both directions)
- Russian/English (both directions)
- Spanish/English (both directions)
NEW!!
The test sets will consist of titles and abstracts, i.e., one long text per document (not split into sentences).
Scientific abstracts:
The test set of Medline abstracts will be distributed as plain text files, with one title-and-abstract text per line:
TITLE_ABSTRACT_TEXT1
TITLE_ABSTRACT_TEXT2
...
TITLE_ABSTRACT_TEXTn
The submission files must use the same format, with the texts in the same order as in the original test set file:
translated_TITLE_ABSTRACT_TEXT1
translated_TITLE_ABSTRACT_TEXT2
...
translated_TITLE_ABSTRACT_TEXTn
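As a sketch, producing a submission in this format amounts to translating each line of the test file and writing the outputs in the same order. The `translate` function below is a placeholder standing in for the participant's own MT system, not part of the task:

```python
# Minimal sketch: read the plain-text test file (one title+abstract per line)
# and write one translated line per input line, preserving the order.

def translate(text: str) -> str:
    # Placeholder: replace this stub with calls to your actual MT system.
    return text

def write_submission(test_path: str, out_path: str) -> None:
    with open(test_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            # Keep the same order as the original test set file.
            fout.write(translate(line.rstrip("\n")) + "\n")
```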
Submission Requirements
Please note that, following the general WMT policy enforced in other tasks, we will release all participants'
submissions after this year's edition of the task to promote further studies.
Please register your team using this form.
You will receive an email with the confirmation of your registration.
The link to the submission site will be provided in this email.
Please register your team as soon as possible.
The Medline test files are available in the WMT'24 biomedical task Google Drive Folder.
Submission file names should consist of the original test file name preceded by the team identifier
(as registered in the form above) and the run number, as in these examples for the abstracts:
- The submission file for run 1 of the "ABC" team for the scientific abstracts from English into Spanish should be called
"ABC_run1_en2es_es.txt".
- The submission file for run 3 of the "ABC" team for the scientific abstracts from Spanish into English should be called
"ABC_run3_es2en_en.txt".
Each team will be allowed to submit up to 3 runs per test set.
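The naming scheme can be expressed programmatically; this sketch uses the pattern `<TEAM>_run<N>_<src>2<tgt>_<tgt>.txt`, inferred from the two examples above:

```python
# Sketch of the submission file naming pattern inferred from the examples:
# <team>_run<N>_<src>2<tgt>_<tgt>.txt

def submission_filename(team: str, run: int, src: str, tgt: str) -> str:
    return f"{team}_run{run}_{src}2{tgt}_{tgt}.txt"
```

For instance, `submission_filename("ABC", 1, "en", "es")` yields `ABC_run1_en2es_es.txt`, matching the first example.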
Please note that the submission form will include questions about the details of your methods.
Please provide as many details as possible; this is important for us.
The test sets are also available in OCELoT, and submissions can be sent to this system.
Please note that the XML format is used in the tool (check the format tools).
Important dates
Release of test data | June 27th, 2024 |
Results submission deadline | July 12th, 2024 (extended from July 4th) |
Paper submission deadline | TBA (follows EMNLP) |
Paper notification | TBA (follows EMNLP) |
Camera-ready version due | TBA (follows EMNLP) |
EMNLP conference | 12-13 November, 2024 |
All deadlines are in AoE (Anywhere on Earth).
Organisers
Rachel Bawden (University of Edinburgh, UK)
Giorgio Maria Di Nunzio (University of Padua, Italy)
Cristian Grozea (Fraunhofer Institute, Germany)
Antonio Jimeno Yepes (University of Melbourne, Australia)
Aurélie Névéol (Université Paris Saclay, CNRS, LISN, France)
Mariana Neves (German Federal Institute for Risk Assessment, Germany)
Roland Roller (DFKI, Germany)
Philippe Thomas (DFKI, Germany)
Federica Vezzani (University of Padua, Italy)
Maika Vicente Navarro (Maika Spanish Translator, Melbourne, Australia)
Dina Wiemann (Novartis, Switzerland)
Lana Yeganova (NCBI/NLM/NIH, USA)