EMNLP 2024


November 12-13, 2024
Miami, Florida, USA


  • 29th January - The shared task announced


The General MT task, formerly known as the News translation task of WMT, evaluates general machine translation capabilities across various domains, genres, and possibly modalities. All submitted systems will be scored and ranked by human judgement.


The goals of the shared translation task are:

  • To investigate the applicability of current MT techniques when translating into languages other than English, different domains, and modalities

  • To examine special challenges in translating between language families, including word order differences and morphology

  • To investigate the translation of low-resource, morphologically rich languages

  • To create publicly available corpora for machine translation and machine translation evaluation

  • To generate up-to-date performance numbers in order to provide a basis of comparison in future research

  • To offer newcomers a smooth start with hands-on experience in state-of-the-art machine translation methods

  • To investigate the usefulness of multilingual and third language resources

  • To assess the effectiveness of document-level approaches

We hope that both beginners and established research groups will participate in this task.

The main changes

  • we focus on language pairs English-to-X and X-to-Y (avoiding evaluation on X-to-English)

  • all testsets will be paragraph-level (one paragraph one line)

  • English testsets will contain a speech domain in the form of audio and ASR output. Using the audio is not required; we provide an automatically generated transcript

  • we have redefined constrained/unconstrained track

  • we extend support for system breakers via Test-suite subtask

  • we investigate literary domain

Language pairs

The following language pairs will be evaluated (only in the specified direction):
  • Czech to Ukrainian

  • Japanese to Chinese

  • EN to Chinese

  • EN to Czech

  • EN to German

  • EN to Hindi

  • EN to Icelandic

  • EN to Japanese

  • EN to Russian

  • EN to Spanish (Latin America)

  • EN to Ukrainian

To be confirmed
  • Estonian

  • possibly other


All dates are at the end of the day for Anywhere on Earth (AoE)

  • 29th February - Finalized training data/allowed models for constrained track

  • 11th April - Test suite source texts for pre-run with SoTA systems must reach us

  • 12th June - Final test suite source texts must reach us

  • 27th June (at the end of AoE) - Test data released

  • 4th July - Translation submission deadline

  • 11th July - System description abstract paper

  • 11th July - Translated test suites shipped back to test suite authors

  • System description submission



The task is to improve current MT methods. We encourage broad participation: if you feel that your method is interesting but not state-of-the-art, please participate in order to disseminate it and measure progress. Participants will use their systems to translate a test set of unseen paragraphs in the source language. Translation quality is measured by human evaluation and various automatic evaluation metrics.

You may participate in any or all of the language pairs. To have a common framework that allows for comparable results, and also to lower the barrier to entry, we provide a common training set. You are not limited to this training set, and you are not limited to the training set provided for your target language pair.

Each participant is required to submit a system description paper, which should highlight the ways in which their methods and data differ from the standard task. You should make it clear which tools and training sets you used. Each participant also has to submit a one-page abstract of the system description one week after the system submission deadline. The abstract should contain, at a minimum, basic information about the system and the approaches/data/tools used, but it can also be a full description paper or a draft that is later revised into the final system description paper. See the Main page for the link to the submission site.

Constrained/Open/Closed track

The General MT task has three separate tracks with different constraints on the training of the models: constrained, open, and closed systems.

  • Constrained systems - only specifically allowed training data and pretrained models may be used to train the translation models; for details, see below.

  • Open systems - you may use software and data under any open source license that places no constraints on use for non-commercial purposes (e.g. Apache, MIT, …), so that your work can be replicated by any research group.

  • Closed systems - a track that does not place any limitations (ONLINE systems are in this category).

Closed systems will not directly compete with constrained and open systems. However, they will be human evaluated for comparison purposes.

The limitations for the constrained systems track are as follows:
  • Allowed pretrained LMs and LLMs:

      • Llama-2-7B, Llama-2-13B, Mistral-7B

      • the following pretrained LMs in all publicly available model sizes: mBART, BERT, RoBERTa, XLM-RoBERTa, sBERT, LaBSE

  • You may ONLY use the training data allowed for this year (specified later on this page)

  • You may use any publicly available metric that was evaluated on past WMT Metrics shared tasks (for example: COMET, BLEURT, etc.)

  • You may use any basic linguistic tools (taggers, parsers, morphological analyzers, etc.)

If you’d like to propose another pretrained model to be added to the list of allowed models for the constrained track, send us an email (we may consider extending the constrained list until the end of February).


For English, we plan the following domains: news, social, speech, and literary. For other source languages, the list of domains may be different.

Speech domain

One of the four domains for English (and possibly Japanese) will be speech, presented as audio together with a machine-generated transcript. All testsets will contain the transcript, which will be generated with the Whisper model. Using the audio is therefore not required, but it could improve performance.

Document-level MT

We are interested in the question of whether MT can be improved by using context beyond the paragraph, and to what extent state-of-the-art MT systems can produce translations that are correct "in-context". All of our development and test data contains full documents, and all our human evaluation will be in-context, in other words the evaluators will view the paragraph as well as its surrounding context when evaluating.

Our training data retains context and document boundaries wherever possible, in particular the following corpora retain the context intact:

  • Parallel: europarl, news-commentary, CzEng, Rapid

  • Monolingual: news-crawl (en, de and cs), europarl, news-commentary

Test Suites

This year’s shared task will also include the “Test suites” sub-task, which has been part of WMT since 2018. More details about the test suites are provided in a separate page.


Licensing of Data

The data released for the General MT task can be freely used for research purposes. We ask that you cite the WMT24 shared task overview findings paper and respect any additional citation requirements on the individual data sets. For other uses of the data, you should consult the original owners of the data sets.

Training Data

We aim to use publicly available datasets wherever possible.

Training data tables have been relocated to a separate page at ./mtdata.

You can download all corpora from the command line as follows:

pip install mtdata==0.4.0
wget https://www.statmt.org/wmt24/mtdata/mtdata.recipes.wmt24-constrained.yml
for id in wmt24-eng-{ces,deu,spa,hin,isl,jpn,rus,ukr,zho} wmt24-{ces-ukr,jpn-zho}; do
  mtdata get-recipe -i "$id" -o "$id" --compress --no-merge -j 16
done

Development Data

To evaluate your system during development, we suggest using test sets from past WMT years. For automatic evaluation, we recommend sacreBLEU, which will automatically download previous WMT test sets for you. You may also want to consider the COMET automatic metric, which has been shown to correlate highly with human judgements. We also release other dev and test sets from previous years.

The 2024 test sets will be created from a sample of up to four domains with an equal number of words per domain. The sources of the test sets will be original text, whereas the targets will be human-produced translations.

Note that the dev data contains both forward and reverse translations (clearly marked).

We use an XML format for all dev, test and submission files. It is important to use an XML parser to wrap/unwrap text in order to ensure correct escaping/de-escaping. We will provide tools.
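To illustrate why an XML parser matters for escaping, here is a minimal Python sketch using only the standard library. It is not the official tooling: the element names (`doc`, `hyp`, `seg`), the attributes, and the helper functions `wrap_segments` and `unwrap_segments` are assumptions made for this example and may differ from the actual schema of the released files.

```python
import xml.etree.ElementTree as ET


def wrap_segments(segments, lang):
    """Wrap plain-text paragraphs into a small XML document (hypothetical layout)."""
    doc = ET.Element("doc", {"id": "example-doc"})  # illustrative element and id
    hyp = ET.SubElement(doc, "hyp", {"lang": lang})
    for i, text in enumerate(segments, start=1):
        seg = ET.SubElement(hyp, "seg", {"id": str(i)})
        # Assign raw text; the serializer escapes &, < and > when writing.
        seg.text = text
    return ET.tostring(doc, encoding="unicode")


def unwrap_segments(xml_string):
    """Parse the XML and return the de-escaped text of every <seg> element."""
    root = ET.fromstring(xml_string)
    return [seg.text or "" for seg in root.iter("seg")]


paragraphs = ['Tom & Jerry said "x < y" holds.', "R&D costs > 5%"]
xml_string = wrap_segments(paragraphs, "de")

# Special characters are escaped in the serialized form ...
assert "&amp;" in xml_string and "&lt;" in xml_string
# ... and restored exactly when the file is read back with a parser.
assert unwrap_segments(xml_string) == paragraphs
```

Writing the same paragraphs with plain string concatenation would leave the raw `&` and `<` in the file and make it unparseable, which is the failure the parser-based approach avoids.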


Primary systems (for which abstracts have been submitted) will be included in the human evaluation. We will collect subjective judgments about the translation quality from annotators, taking the document context into account.

In the unlikely event of an unprecedented number of system submissions that we cannot evaluate, we may decide to preselect systems for human evaluation using automatic metrics, or to not evaluate low-performing closed systems. However, we do not expect this to be necessary and expect all primary systems to be evaluated by humans.

Test Set Submission



For queries, please use the mailing list or contact Tom Kocmi.


  • Tom Kocmi - tomkocmi@microsoft.com

  • Eleftherios Avramidis

  • Rachel Bawden

  • Ondřej Bojar

  • Anton Dvorkovich

  • Christian Federmann

  • Mark Fishel

  • Markus Freitag

  • TG Gowda

  • Roman Grundkiewicz

  • Barry Haddow

  • Marzena Karpinska

  • Philipp Koehn

  • Benjamin Marie

  • Kenton Murray

  • Masaaki Nagata

  • Martin Popel

  • Maja Popović

  • Mariya Shmatova

  • Steinþór Steingrímsson

  • Vilém Zouhar


This task would not have been possible without the sponsorship of monolingual data, test set translation and evaluation from our partners, namely Microsoft, Charles University, Toloka, NTT, Dubformer, Google, … TBA.