Shared Task: General Machine Translation

EMNLP 2024

NINTH CONFERENCE ON
MACHINE TRANSLATION (WMT24)

November 15-16, 2024
Miami, Florida, USA


ANNOUNCEMENTS

2024-07-10: Due to data processing delays, the return of the test suites will be delayed until approximately July 16.
2024-06-28: Blind testsets released at data.statmt.org/wmt24/general-mt/wmt24_GeneralMT.zip
2024-06-26: Submission system is ready to accept translations at aka.ms/wmt24submissions
2024-06-19: The human evaluation protocol for this year will be ESA: arxiv.org/abs/2406.11580
2024-05-23: JParacrawl updated for JA-ZH.
2024-04-25: News Commentary version corrected in mtdata config (v18.1). Since v18.1 contains v16 data (which was used in the previous years), the models trained with v16 are also permitted.
2024-01-29: The shared task was announced

DESCRIPTION

Formerly known as the News translation task, the General MT task of WMT focuses on evaluating general MT capabilities. It tests machine translation across various domains, genres, and possibly modalities. All submitted systems will be scored and ranked by human judgement.

Goals

The goals of the shared translation task are:

To investigate the applicability of current MT techniques when translating into languages other than English, different domains, and modalities
To examine special challenges in translating between language families, including word order differences and morphology
To investigate the translation of low-resource, morphologically rich languages
To create publicly available corpora for machine translation and machine translation evaluation
To generate up-to-date performance numbers in order to provide a basis of comparison in future research
To offer newcomers a smooth start with hands-on experience in state-of-the-art machine translation methods
To investigate the usefulness of multilingual and third language resources
To assess the effectiveness of document-level approaches

We hope that both beginners and established research groups will participate in this task.

The main changes

We focus on English-to-X and X-to-Y language pairs (avoiding evaluation on X-to-English)
All testsets will be paragraph-level (one paragraph per line)
English testsets will include a speech domain in the form of audio and ASR output. Using the audio is not required; we provide an automatically generated transcript
We have redefined the constrained/unconstrained tracks
We extend support for system breakers via the Test suites sub-task
We investigate the literary domain
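
Because testsets are paragraph-level with one paragraph per line, a line-oriented system output has to line up exactly with the source. The following is a minimal sketch of a sanity check, not an official tool; the function names are our own and it assumes plain UTF-8 text files with one paragraph per line.

```python
def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines.

    Iterating over the file splits only on real newline characters, unlike
    str.splitlines(), which also splits on characters such as U+2028.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def check_alignment(source_lines, output_lines):
    """Return a list of problems; an empty list means the two sides line up."""
    problems = []
    if len(source_lines) != len(output_lines):
        problems.append(
            f"line count mismatch: {len(source_lines)} source vs {len(output_lines)} output"
        )
    # zip stops at the shorter list, so this is safe even after a mismatch.
    for number, (src, out) in enumerate(zip(source_lines, output_lines), start=1):
        if src.strip() and not out.strip():
            problems.append(f"line {number}: empty translation for non-empty source")
    return problems
```

Running check_alignment on the lines of a source file and of the corresponding system output before converting to the submission format catches accidental line splits or dropped paragraphs early.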

Language pairs

The following language pairs will be evaluated (only in the specified direction):

Czech to Ukrainian
Japanese to Chinese
English to Chinese
English to Czech
English to German
English to Hindi
English to Icelandic
English to Japanese
English to Russian
English to Spanish (Latin America)
English to Ukrainian

IMPORTANT DATES

All dates are at the end of the day for Anywhere on Earth (AoE)

Finalized training data/allowed models for constrained track: 29th February
Test suite source texts for pre-run with SoTA systems must reach us: 11th April
Final Test suite source texts must reach us: 12th June
Test data released: 27th June (end of day AoE)
Translation submission deadline: 4th July
System description abstract paper: 11th July
Translated test suites shipped back to test suites authors: approximately 16th July (postponed from 11th July)
System description submission: 20th August
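
Since all deadlines are end of day Anywhere on Earth, it can help to convert them to UTC or local time. The sketch below assumes that "end of the day" means 23:59:59 in the AoE zone (UTC-12); the helper name is our own.

```python
from datetime import datetime, timedelta, timezone

# Anywhere on Earth (AoE) corresponds to a fixed offset of UTC-12.
AOE = timezone(timedelta(hours=-12), name="AoE")


def aoe_deadline_in_utc(year, month, day):
    """Return the last second of the given AoE calendar day, expressed in UTC."""
    end_of_day = datetime(year, month, day, 23, 59, 59, tzinfo=AOE)
    return end_of_day.astimezone(timezone.utc)


# Translation submission deadline: 4th July 2024 AoE.
deadline = aoe_deadline_in_utc(2024, 7, 4)
print(deadline.isoformat())  # 2024-07-05T11:59:59+00:00
```

Calling deadline.astimezone() with no argument converts the same instant to the local time zone of the machine running the script.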

TASK DESCRIPTION

The task is to improve current MT methods. We encourage broad participation: if you feel that your method is interesting but not state-of-the-art, please participate in order to disseminate it and measure progress. Participants will use their systems to translate a test set of unseen paragraphs in the source language. Translation quality is measured by human evaluation and various automatic evaluation metrics.

You may participate in any or all of the language pairs. To have a common framework that allows for comparable results, and also to lower the barrier to entry, we provide a common training set. You are not limited to this training set, and you are not limited to the training set provided for your target language pair.

Each participant is required to submit a submission paper, which should highlight the ways in which their methods and data differ from the standard task. You should make it clear which tools and training sets you used. Each participant has to submit a one-page abstract of the system description one week after the system submission deadline. The abstract should contain, at a minimum, basic information about the system and the approaches/data/tools used, but it could be a full description paper or a draft that can later be modified for the final system description paper. See the Main page for the link to the submission site.

Constrained/Open/Closed track

The General MT task has three separate tracks with different constraints on the training of the models: constrained, open, and closed systems.

Constrained systems: only the specifically allowed training data and pretrained models may be used to train the translation models; for details, see below.
Open systems: you may use software and data under any open-source license that places no constraints on non-commercial use (e.g. Apache, MIT, …), so that your work can be replicated by any research group.
Closed systems: a track that places no limitations (ONLINE systems are in this category).

The closed systems will not directly compete with constrained and open systems. However, they will be human-evaluated for comparison purposes.

The limitations for the constrained systems track are as follows:

Allowed pretrained LMs and LLMs:
  Llama-2-7B, Llama-2-13B, Mistral-7B
  The following pretrained LMs in all publicly available model sizes: mBART, BERT, RoBERTa, XLM-RoBERTa, sBERT, LaBSE
You may ONLY use the training data allowed for this year (specified later on this page)
You may use any publicly available metric that was evaluated in past WMT Metrics shared tasks (for example: COMET, Bleurt, etc.)
You may use any basic linguistic tools (taggers, parsers, morphology analyzers, etc.)

If you’d like to propose another pretrained model to be added to the list of allowed models for the constrained track, write us an email (we may consider extending the constrained list until the end of February).

Domains

For English, we plan the following domains: news, social, speech, and literary. For other source languages, the list of domains may differ.

Speech domain

One of the four domains for English (and possibly Japanese) will be speech, presented as audio together with a machine-generated transcript. All testsets will contain the transcript, generated with a speech recognition model, so using the audio is not required but could improve performance.

Document-level MT

We are interested in the question of whether MT can be improved by using context beyond the paragraph, and to what extent state-of-the-art MT systems can produce translations that are correct "in-context". All of our development and test data contains full documents, and all our human evaluation will be in-context, in other words the evaluators will view the paragraph as well as its surrounding context when evaluating.

Our training data retains context and document boundaries wherever possible, in particular the following corpora retain the context intact:

Parallel: europarl, news-commentary, CzEng, Rapid
Monolingual: news-crawl (en, de and cs), europarl, news-commentary

Test Suites

This year’s shared task will also include the “Test suites” sub-task, which has been part of WMT since 2018. More details about the test suites are provided on a separate page.

DATA

Licensing of Data

The data released for the General MT task can be freely used for research purposes. We ask that you cite the WMT24 shared task overview findings paper and respect any additional citation requirements on the individual data sets. For other uses of the data, you should consult the original owners of the data sets.

Training Data

We aim to use publicly available datasets wherever possible.

Training data tables have been relocated to a separate page at ./mtdata.

You can download all corpora from the command line as follows:

pip install mtdata==0.4.2
wget https://www.statmt.org/wmt24/mtdata/mtdata.recipes.wmt24-constrained.yml
mtdata cache -j 8 -ri "wmt24-*"
for id in wmt24-eng-{ces,deu,spa,hin,isl,jpn,rus,ukr,zho} wmt24-{ces-ukr,jpn-zho}; do
  mtdata get-recipe -i $id -o $id --compress --no-merge -j 16
done
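
The loop above relies on brace expansion, which bash and zsh support but plain POSIX sh does not. As a small illustration, the following sketch (the constant and function names are our own) enumerates the recipe IDs that the expansion produces, one per evaluated language pair.

```python
# ISO 639-3 target codes taken from the brace expression in the shell loop.
ENGLISH_TARGETS = ["ces", "deu", "spa", "hin", "isl", "jpn", "rus", "ukr", "zho"]
NON_ENGLISH_PAIRS = ["ces-ukr", "jpn-zho"]


def recipe_ids():
    """Return the recipe IDs in the order in which the shell loop visits them."""
    english = [f"wmt24-eng-{target}" for target in ENGLISH_TARGETS]
    other = [f"wmt24-{pair}" for pair in NON_ENGLISH_PAIRS]
    return english + other


for recipe_id in recipe_ids():
    print(recipe_id)
```

The list contains eleven IDs, matching the eleven language pairs listed under Language pairs.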

Development Data

To evaluate your system during development, we suggest using test sets from past WMT years. For automatic evaluation, we recommend using sacreBLEU, which will automatically download previous WMT test sets for you. You may also want to consider the COMET automatic metric, which has been shown to correlate highly with human judgements. We also release other dev and test sets from previous years.

The 2024 test sets will be created from a sample of up to four domains with an equal number of words per domain. The sources of the test sets will be original text, whereas the targets will be human-produced translations.

WMT Development sets (together with 2024 testsets - HERE)
NTREX-128
Flores-200

Note that the dev data contains both forward and reverse translations (clearly marked).

EVALUATION

Primary systems (for which abstracts have been submitted) will be included in the human evaluation, for which we plan to use the ESA protocol (arxiv.org/abs/2406.11580). We will collect subjective judgments about the translation quality from annotators, taking the document context into account.

In the unlikely event of an unprecedented number of system submissions that we cannot evaluate, we may preselect the best-performing systems for human evaluation using automatic metrics (such as COMET); in that case, we will primarily remove closed systems from the evaluation. However, we believe this won’t be necessary and that all primary systems will be evaluated by humans.

Test Set Submission

Here are detailed instructions to make the submission process seamless. To participate, please follow the steps outlined below:

1) Register your team in OCELoT; the registration portal is already open at aka.ms/wmt24submissions

2) Get your team verified by sending an email with your affiliation to maja.popovic.166@gmail.com (please note that the verification process may take some time to complete)

3) Translate blind test sets: WMT24_GeneralMT.zip; testsets with audio are available HERE

4) After verification, submit translations to OCELoT (the deadline is 4th July AoE; please don’t leave it to the last minute, as we expect server slowness on the last day)

5) Before 11th July AoE, select the primary system in OCELoT; you may choose a single system for each language pair

6) Before 11th July AoE, submit a short abstract paper to SoftConf: www.softconf.com/emnlp2024/wmt/ (it should be a brief summary of your submission, which you later replace with a system description paper; you may already submit the system description paper if you want)

We will be updating this sheet to make the verification process for teams transparent and to avoid possible confusion from missing steps.

Detailed steps

The blind test data sources can be downloaded above.
The sources are in xml format. Scripts for converting xml to/from line-oriented text are available at github.com/wmt-conference/wmt-format-tools. It is important to use an xml parser to wrap/unwrap text in order to ensure correct escaping/de-escaping.
The sources contain the General MT test sets and additional testsets from “test suites” and other shared tasks, which will be used for further evaluation of the translation systems.
Challenge sets are a required part of the submission; not translating them may result in system disqualification. If it is impossible for you to translate all segments due to their size, reach out to us with a detailed explanation of the situation and we may grant you an exception.
Speech domain contains automatic transcripts, but we also provide audio files for multimodal systems. Do not produce any sentence or speaker segmentation for the speech domain, i.e. every document (audio) translation should be a single line of output.
Your translations must be submitted through OCELoT
You first need to register a team name with OCELoT. Your team will then need to be activated by the General MT task organisers before you can submit. Please send an email to Maja with your OCELoT team name and your institution/company details in order to get activated.
Translations should be “human-ready”, i.e. in the form in which text is normally published, so Latin-script languages should be recased and detokenised, Chinese and Japanese should be unsegmented, etc.
Submissions should be formatted in the WMT xml format, using the format tools linked above
You can make up to 7 submissions per language pair, per team. Each submission will be scored (ChrF) against a reference translation; the scores in OCELoT do not reflect actual system performance and are mainly for validation.
During the test week, all submissions will remain anonymous
Submissions should be uploaded by the deadline stated above
There may be server outages on the last day, so please do not leave the submission to the last minute. If the system is down, wait a few minutes and try again, or write us a message.
After submission, select a primary system for each of the language directions
To select primaries, log in to OCELoT, select the Team tab at the top, and click on the yellow "Team submissions" button.
When choosing a primary system, you will be asked to give a short (one paragraph) description of the system, and fill in a web form with some details of technologies used.
About a week after the submission deadline, once we have a final set of primary submissions, we will de-anonymise the primary submissions, and only the primary submissions. We will also release the references.
Each team must submit an abstract paper by the deadline stated above and later a full system paper describing its submission. Otherwise, its submission won’t be considered for the General MT task system ranking.
See the Main page for details on abstract and paper submission.
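
The advice above to use an xml parser for wrapping and unwrapping text can be illustrated with Python's standard library. This is only a sketch of why a parser matters: the element and attribute names below are simplified placeholders, not the actual WMT xml schema, and real submissions should be produced with the wmt-format-tools scripts linked above.

```python
import xml.etree.ElementTree as ET


def wrap_segments(lines, lang):
    """Build a small XML document with one <seg> element per input line."""
    root = ET.Element("doc", {"lang": lang})  # illustrative element names only
    for number, line in enumerate(lines, start=1):
        seg = ET.SubElement(root, "seg", {"id": str(number)})
        seg.text = line  # the serializer escapes &, < and > on output
    return ET.tostring(root, encoding="unicode")


def unwrap_segments(xml_text):
    """Parse the XML and return the de-escaped text of every <seg>."""
    root = ET.fromstring(xml_text)
    return [seg.text or "" for seg in root.iter("seg")]


lines = ["R&D costs rose <5% in Q1", 'She said "fine" & left']
xml_text = wrap_segments(lines, "en")
print(xml_text)
print(unwrap_segments(xml_text) == lines)
```

Building the same document by string concatenation would leave the raw & and < characters in place and produce a file that parsers reject, which is the failure the parser-based round trip avoids.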

The testset may be larger than 100 000 sentences due to the test suites. Translating the full blind test set is mandatory for participation in the General MT task.
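
To give a feel for the ChrF validation score mentioned in the steps above, here is a simplified character n-gram F-score in plain Python. It is a sketch under our own assumptions (whitespace removed, n-gram orders 1 to 6, beta of 2, orders skipped when a text is too short) and will not reproduce the exact numbers reported by OCELoT or sacreBLEU.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Return a simplified character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is shorter than n characters: skip this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


print(simple_chrf("the cat sat on the mat", "the cat sat on a mat"))
```

A score computed this way is useful only as a rough check that a submission is in the right language and aligned with the reference, which is also how the OCELoT scores are meant to be read.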

CONTACT

For queries, please use the mailing list or contact Tom Kocmi.

Organizers

Tom Kocmi - tomkocmi@microsoft.com
Eleftherios Avramidis
Rachel Bawden
Ondřej Bojar
Anton Dvorkovich
Christian Federmann
Mark Fishel
Markus Freitag
TG Gowda
Roman Grundkiewicz
Barry Haddow
Marzena Karpinska
Philipp Koehn
Benjamin Marie
Kenton Murray
Masaaki Nagata
Martin Popel
Maja Popović
Mariya Shmatova
Steinþór Steingrímsson
Vilém Zouhar

Acknowledgements

This task would not have been possible without our partners' sponsorship of test set translations and human evaluation, namely Microsoft, Charles University, Dubformer, Toloka, NTT, Google, Árni Magnússon Institute for Icelandic Studies, Custom.mt, Cohere, Together.ai, and Unbabel.
