Shared Task: General Machine Translation


  • 2023-09-04 - description papers for the test suites can be submitted as an exception until Friday September 8th, at 12:00 noon CEST.

  • 2023-07-27 - due to data preparation issues, test suites will be shipped back to participants one week late, on August 2nd

  • 2023-07-12 - submission week starts, see details in section Test Set Submission

  • 2023-05-16 - updated news-commentary

  • 2023-05-08 - added description page for test suites and test suite dates

  • 2023-04-12 - allowed to use NTREX-128 and Flores-200 as additional dev tests.

  • 2023-04-05 - deadlines published; added a call to donate mono data for testing

  • 2023-03-27 - published a list of pretrained models allowed for constrained track

  • 2023-03-21 - Parallel data for Hebrew. New version of news-commentary (v18)

  • 2023-03-15 - Japanese is run for both directions. Removed obsolete Gigawords corpus

  • 2023-03-08 - all language pairs finalized

  • 2023-02-20 - general translation task announced, some languages are to be confirmed


Formerly known as the News translation task, the WMT General MT task focuses on evaluating general MT capabilities. The main difference from the News shared task is that the test sets will contain several domains, likely news, user-generated (social), conversational, and e-commerce. All systems will be scored and ranked by human judgement.
The languages to be evaluated are listed below (pay attention to the translation direction):

Both directions
  • Chinese to/from English

  • German to/from English: document-level (the test set will not be split into sentences)

  • Hebrew to/from English: low-resource

  • Japanese to/from English

  • Russian to/from English

  • Ukrainian to/from English

Single direction
  • Czech to Ukrainian: non-English

  • English to Czech

We provide parallel corpora for all languages as training data, and additional resources for download.

The main changes

  • Not all languages are evaluated in both directions

  • We give a clearer definition of the constrained track with regard to pretrained models

  • We no longer use MTurk crowd workers for human evaluation

  • We make the submission process clearer to avoid dropping systems from the human evaluation, for example because of a forgotten abstract paper

  • If there are more participants than we can evaluate with humans, we may use an automatic metric to remove the worst-performing systems from the evaluation (we strongly hope this will not be needed)

  • Participants are expected to translate 100 000 or more sentences per language pair during the submission week. These translations are mandatory for the test-suite track.


The goals of the shared translation task are:

  • To investigate the applicability of current MT techniques when translating into languages other than English and across different domains

  • To examine special challenges in translating between language families, including word order differences and morphology

  • To investigate the translation of low-resource, morphologically rich languages

  • To create publicly available corpora for machine translation and machine translation evaluation

  • To generate up-to-date performance numbers in order to provide a basis of comparison in future research

  • To offer newcomers a smooth start with hands-on experience in state-of-the-art machine translation methods

  • To investigate the usefulness of multilingual and third language resources

  • To assess the effectiveness of document-level approaches

We hope that both beginners and established research groups will participate in this task.

Important Dates

All deadlines are at the end of the day, Anywhere on Earth (AoE).

Release of training data for shared tasks (by)


Test suite source texts must reach us: 19th June

Test data released: 13th July (at the end of AoE)

Translation submission deadline: 20th July

System description abstract paper: 27th July

Translated test suites shipped back to test suites authors: 2nd August (originally 27th July)

System description submission: 5th September

Test suite description submission: 8th September, 12:00 noon CEST

Task Description

We provide training data for all language pairs, and a common framework. The task is to improve current methods. We encourage broad participation: if you feel that your method is interesting but not state-of-the-art, please participate in order to disseminate it and measure progress. Participants will use their systems to translate a test set of unseen sentences in the source language. The translation quality is measured by a manual evaluation and various automatic evaluation metrics.

You may participate in any or all of the language pairs. To have a common framework that allows for comparable results, and also to lower the barrier to entry, we provide a common training set. You are not limited to this training set, and you are not limited to the training set provided for your target language pair. This means that multilingual systems are allowed, and classed as constrained as long as they use only data released for WMT23.

Each participant is required to submit a submission paper, which should highlight the ways in which your methods and data differ from the standard task. You should make it clear which tools and which training sets you used.
Each participant has to submit a one-page abstract of the system description one week after the system submission deadline. The abstract should contain, at a minimum, basic information about the system and the approaches, data, and tools used; it can also be a full description paper or a draft that is later revised into the final system description paper. See the Main page for the link to the submission site.

Constrained and Unconstrained track

The General MT task has two separate tracks with different constraints on model training: constrained and unconstrained. The constrained track specifies which training data and pretrained models may be used to train the translation models, while the unconstrained track allows participation with a system trained without any limitations.

The limitations for the constrained track are as follows:
  • You may only use the training data allowed for this year (specified later on this page)

  • You may use any publicly available metric that was evaluated on past WMT Metrics shared tasks (for example: COMET, BLEURT, etc.)

  • You may ONLY use the following listed pretrained models in all publicly available model sizes: mBART, BERT, RoBERTa, XLM-RoBERTa, sBERT, LaBSE

  • You may use any basic linguistic tools (taggers, parsers, morphological analyzers, etc.)

If you think any pretrained model should be added to the list of allowed models for the constrained track, write us an email (we may consider allowing it for next year).

Document-level MT

We are interested in the question of whether MT can be improved by using context beyond the sentence, and to what extent state-of-the-art MT systems can produce translations that are correct "in-context". All of our development and test data contain full documents, and all our human evaluation will be in-context; in other words, the evaluators will view the sentence as well as its surrounding context when evaluating.

Our training data retains context and document boundaries wherever possible. In particular, the following corpora retain the context intact:

  • Parallel: europarl, news-commentary, CzEng, Rapid

  • Monolingual: news-crawl (en, de and cs), europarl, news-commentary
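
As a rough illustration of how document boundaries can be used, the sketch below pairs each sentence with up to two preceding sentences from the same document, so context never leaks across documents. The function name `with_context`, the window size, and the ` <sep> ` separator are assumptions made for this example; they are not part of the WMT data or tools.

```python
from typing import List, Tuple


def with_context(
    documents: List[List[str]],
    window: int = 2,          # hypothetical number of preceding sentences to keep
    sep: str = " <sep> ",     # hypothetical separator between context sentences
) -> List[Tuple[str, str]]:
    """Pair every sentence with up to `window` preceding sentences of its own document."""
    examples = []
    for doc in documents:
        for i, sentence in enumerate(doc):
            # Slice only within the current document, so the first sentence
            # of each document gets an empty context.
            context = sep.join(doc[max(0, i - window):i])
            examples.append((context, sentence))
    return examples


docs = [["A.", "B.", "C."], ["D."]]
for context, sentence in with_context(docs):
    print(repr(context), "->", repr(sentence))
```

How the context is then fed to a model (concatenation, a separate encoder, and so on) is a design choice left to each system.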

Test Suites

This year’s shared task will also include the “Test suites” sub-task, which has been part of WMT since 2018. More details about the test suites are provided in a separate page.


Licensing of Data

The data released for the WMT23 General MT task can be freely used for research purposes. We ask that you cite the WMT23 shared task overview paper and respect any additional citation requirements on the individual data sets. For other uses of the data, you should consult the original owners of the data sets.

Training Data

We aim to use publicly available sources of data wherever possible.

Note that the released data is not tokenized and includes sentences of any length (including empty sentences). You may want to consider using Moses tools for tokenizing. These tools are available in the Moses git repository.
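
Because empty and very long lines are present, some filtering before training is usually worthwhile. The following is a minimal sketch, assuming the corpus has already been read into (source, target) string pairs; the function name `clean_pairs` and the 1000-character cut-off are hypothetical choices, not WMT requirements.

```python
from typing import Iterable, Iterator, Tuple


def clean_pairs(
    pairs: Iterable[Tuple[str, str]],
    max_chars: int = 1000,  # hypothetical cut-off, tune for your own setup
) -> Iterator[Tuple[str, str]]:
    """Yield sentence pairs whose sides are both non-empty and not overly long."""
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # drop pairs where either side is empty or whitespace-only
        if len(src) > max_chars or len(tgt) > max_chars:
            continue  # drop very long lines
        yield src, tgt


sample = [("Hello world.", "Hallo Welt."), ("", "Leer"), ("x" * 2000, "y")]
print(list(clean_pairs(sample)))
```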


You can download all corpora from the command line by following the detailed instructions here, except for the two datasets marked 'Register and Download' (CzEng2.0 and CCMT). Usage:

pip install mtdata==0.4.0
for ri in wmt23-{enzh,zhen,ende,deen,enhe,heen,enja,jaen,enru,ruen,encs,csuk,enuk,uken}; do
  mtdata get-recipe -ri "$ri" -o "$ri"
done

Parallel Training Data:


Europarl v10

ParaCrawl v9

Note that only the ticked language pairs are available for constrained participants, but the metadata (tmx files) may be used.

Common Crawl corpus

Same as last year. The fr-de version is here

News Commentary v18.1

CzEng 2.0

Register and download CzEng2.0. The new CzEng includes synthetic data, and includes all cs-en data supplied for the task. See the CzEng README for more details.

Yandex Corpus

Wiki Titles v3

UN Parallel Corpus V1.0

Register and download

Tilde MODEL corpus

de-en and cs-en contain document information.

CCMT Corpus

Register and download


We release the official version, with added language identification (from cld2).

Back-translated news

Back-translated news. The cs-en data is contained in CzEng. The zh-en and ru-en data was produced for the University of Edinburgh systems in 2017 and 2018.

Japanese-English Subtitle Corpus

Note: English side is lowercased.

The Kyoto Free Translation Task Corpus

TED Talks

From IWSLT 2017 Evaluation Campaign.

ELRC - EU acts in Ukrainian


Monolingual Training Data:

Corpus CS DE EN JA RU ZH HE UK Notes

News crawl

Large corpora of crawled news, collected since 2007. For de, cs, and en, versions are available with document boundaries and without sentence splitting.

News discussions

Corpora crawled from comment sections of online newspapers (no longer updated).

Europarl v10

Monolingual version of European parliament crawl. Superset of the parallel version.

News Commentary

Updated monolingual text from the news-commentary crawl. Superset of the parallel version. Use the latest version.

Common Crawl

Deduplicated with development and evaluation sentences removed. English was updated 31 January 2016 to remove bad UTF-8. Downloads can be verified with SHA512 checksums. More English is available.

Extended Common Crawl

Extended Common Crawl extracted from crawls up to April 2020.

UberText Corpus

Text crawled from Ukrainian periodicals

Leipzig Corpora

Leipzig Corpora Collection: From 100 to 200 Languages PDF

Legal Ukrainian

Legal Ukrainian: 69M token corpus in the legal sector; crawled from websites belonging to legislation, government, court, and parliament

Development Data

To evaluate your system during development, we suggest using test sets from past WMT years. For automatic evaluation, we recommend using sacreBLEU, which will automatically download previous WMT test sets for you. You may also want to consider the COMET automatic metric, which has been shown to correlate highly with human judgements. We also release other dev and test sets from previous years.

The 2023 test sets will be created from a sample of up to four domains (most likely news, e-commerce, user-generated, and conversational) with an equal number of sentences per domain. The sources of the test sets will be original text, whereas the targets will be human-produced translations.

Note that the dev data contains both forward and reverse translations (clearly marked).

We use an xml format (instead of the previous sgm format) for all dev, test and submission files. It is important to use an xml parser to wrap/unwrap text in order to ensure correct escaping/de-escaping. We will provide tools.
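
The escaping issue can be seen in a small round trip with Python's standard XML library. This is only an illustration: the `seg` element and `id` attribute are placeholder names chosen for the example, and the official conversion tools should be used for real dev, test, and submission files.

```python
import xml.etree.ElementTree as ET


def wrap_segment(text: str, seg_id: int) -> str:
    """Serialise one segment; the XML serialiser escapes &, < and > in the text."""
    seg = ET.Element("seg", {"id": str(seg_id)})  # placeholder element and attribute names
    seg.text = text
    return ET.tostring(seg, encoding="unicode")


def unwrap_segment(xml_string: str) -> str:
    """Parse one serialised segment and return its de-escaped text."""
    seg = ET.fromstring(xml_string)
    return seg.text or ""


raw = "Tom & Jerry: 3 < 5"
wrapped = wrap_segment(raw, 1)
print(wrapped)                  # the ampersand and angle bracket are escaped here
print(unwrap_segment(wrapped))  # and restored here
```

Writing the same text into a file with plain string formatting would leave the `&` and `<` unescaped and produce XML that a parser rejects.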

Test Set Submission

Here are detailed instructions to make the submission process seamless. To participate, please follow the steps outlined below:

1) Register your team in OCELoT; the registration portal is currently open via link

2) Get your team verified by sending an email with your affiliation to (please note that the verification process may require some time to complete)

3) Translate blind test sets:

4) After verification, submit translations to OCELoT (the deadline is 20th July AoE; please don’t leave it to the last minute)

5) Before 27th July AoE, please select the primary system in OCELoT; you may choose a single system for each language pair

6) Before 27th July AoE, submit a short abstract paper to SoftConf: (it should be a brief summary of your submission, which you later replace with a system description paper; you may already submit the system description paper if you want)

We will be updating this sheet to make the verification process for teams transparent and to avoid possible confusion from missing steps.

  • The blind test data sources are available here

  • The sources are in xml format. Scripts for converting xml to/from line-oriented text are available here

  • The sources contain the General MT test sets and additional testsets from “test suites” and other shared tasks, which will be used for further evaluation of the translation systems.

  • Your translations should be submitted through OCELoT

  • You first need to register a team name with OCELoT. Your team will then need to be activated by General MT task organisers before you can submit. Please send an email to Maja with your OCELoT team name and your institution/company details in order to get activated.

  • Translations should be “human-ready”, i.e. in the form that text is normally published, so latin-script languages should be recased and detokenised, Chinese and Japanese should be unsegmented, etc.

  • Submissions should be formatted in the WMT xml format, using the format tools linked above

  • You can make up to 7 submissions per language pair, per team. Each submission will be scored (ChrF) against a reference translation; the scores in OCELoT do not reflect actual system performance and are mainly for validation.

  • During the test week, all submissions will remain anonymous

  • Submissions should be uploaded by the deadline stated above

  • After submission, select a primary system for each of the language directions

  • To select primaries, log in to OCELoT, select the Team tab at the top, and click on the yellow "Team submissions" button.

  • When choosing a primary system, you will be asked to give a short (one paragraph) description of the system, and fill in a web form with some details of technologies used.

  • About a week after the submission week deadline, once we have a final set of primary submissions, we will de-anonymise the primary submissions, and only the primary submissions. We will also release the references.

  • Each team must submit an abstract paper by the deadline stated above and later a full system paper describing its submission. Otherwise, the submission won’t be considered for the General MT task

  • See the Main page for details on abstract and paper submission.
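
For intuition about the ChrF validation score mentioned in the list above, here is a simplified sentence-level sketch of the character n-gram F-score (n-gram orders 1 to 6, beta = 2, whitespace ignored). It rests on assumptions and will not reproduce the exact numbers reported by OCELoT or sacreBLEU; the function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF in [0, 100]; not a drop-in for sacreBLEU."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)  # average precision over n-gram orders
    r = sum(recalls) / len(recalls)        # average recall over n-gram orders
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(simple_chrf("the cat sat on the mat", "the cat sat on a mat"))
```

With beta = 2, recall is weighted more heavily than precision, which is why a hypothesis that omits content tends to be penalised more than one that adds some.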

The size of the test set may be larger than 100 000 sentences due to test suites. Translation of the full blind test set is mandatory for participating in General MT.

Mono data donation

Getting fresh and unseen data for blind testing is a challenging task, especially with LLMs that have unknown training sets. We are looking for partners who would be willing to donate data for the General MT test sets. Please also let us know if you know of a source of data available under a research-permissive license.

We are looking for data with paragraph/document context (not stand-alone sentences). The data must be originally written in the language (no translationese) and can be in any domain. We are looking for only hundreds of sentences.

We will translate the data ourselves and prepare them for this year’s General MT blind tests.

If you would be interested in donating data, please, contact


Primary systems (for which abstracts have been submitted) will be included in the human evaluation. We will collect subjective judgments about the translation quality from annotators, taking the document context into account.

In the unlikely event of an unprecedented number of system submissions that we couldn’t evaluate, we may decide to preselect systems for human evaluation by automatic metrics (especially not evaluating low-performing unconstrained systems). However, we believe this won’t be applied and all primary systems will be evaluated by humans.


For queries, please use the mailing list or contact Tom Kocmi.


  • Tom Kocmi

  • Eleftherios Avramidis

  • Rachel Bawden

  • Ondřej Bojar

  • Anton Dvorkovich

  • Christian Federmann

  • Mark Fishel

  • Markus Freitag

  • Thamme Gowda

  • Roman Grundkiewicz

  • Barry Haddow

  • Philipp Koehn

  • Benjamin Marie

  • Makoto Morishita

  • Kenton Murray

  • Masaaki Nagata

  • Toshiaki Nakazawa

  • Martin Popel

  • Maja Popović

  • Mariya Shmatova


We would like to thank Rebecca Knowles and Sergio Bruccoleri. This task would not have been possible without the sponsorship of monolingual data, test set translation, and evaluation from our partners, namely Microsoft, Charles University, Toloka, NTT, Dubformer, Google, Centific … TBA.