Shared Task: General Machine Translation

Announcements

2023-09-04 - description papers for the test suites can be submitted as an exception until Friday September 8th, at 12:00 noon CEST.
2023-07-27 - due to data preparation issues, test suites will be shipped back to participants one week late, on August 2nd
2023-07-12 - submission week starts, see details in section Test Set Submission
2023-05-16 - updated news-commentary
2023-05-08 - added description page for test suites and test suite dates
2023-04-12 - NTREX-128 and Flores-200 may be used as additional dev sets
2023-04-05 - deadlines published; added a call to donate mono data for testing
2023-03-27 - published a list of pretrained models allowed for constrained track
2023-03-21 - Parallel data for Hebrew. New version of news-commentary (v18)
2023-03-15 - Japanese is run for both directions. Removed obsolete Gigawords corpus
2023-03-08 - all language pairs finalized
2023-02-20 - general translation task announced, some languages are to be confirmed

Description

The General MT task, formerly known as the News translation task of WMT, focuses on evaluating general MT capabilities. The main difference from the News shared task is that the test sets will contain multiple domains, likely news, user-generated (social), conversational, and e-commerce. All systems will be scored and ranked by human judgement.
The languages to be evaluated are listed below (pay attention to the translation direction):

Both directions

Chinese to/from English
German to/from English: document-level (the test set will not be split into sentences)
Hebrew to/from English: low-resource
Japanese to/from English
Russian to/from English
Ukrainian to/from English

Single direction

Czech to Ukrainian: non-English
English to Czech

We provide parallel corpora for all languages as training data, and additional resources for download.

The main changes

Not all languages are evaluated in both directions
We give a clearer definition of the constrained track with regard to pretrained models
We are no longer going to use MTurk crowd workers for human evaluation
We have made the submission process clearer so that systems are not dropped from the human evaluation, for example because of a forgotten abstract paper
If there are more participants than we can evaluate with human annotators, we may use an automatic metric to remove the worst-performing systems from the evaluation (we strongly hope this will not be needed)
Participants are expected to translate 100 000 or more sentences per language pair during the submission week. These are mandatory for the test-suite track.

Goals

The goals of the shared translation task are:

To investigate the applicability of current MT techniques when translating into languages other than English and different domains
To examine special challenges in translating between language families, including word order differences and morphology
To investigate the translation of low-resource, morphologically rich languages
To create publicly available corpora for machine translation and machine translation evaluation
To generate up-to-date performance numbers in order to provide a basis of comparison in future research
To offer newcomers a smooth start with hands-on experience in state-of-the-art machine translation methods
To investigate the usefulness of multilingual and third language resources
To assess the effectiveness of document-level approaches

We hope that both beginners and established research groups will participate in this task.

Important Dates

All deadlines are at the end of the given day, Anywhere on Earth (AoE).

Event	Date
Release of training data for shared tasks (by)	March
Test suite source texts must reach us	19th June
Test data released	13th July (at the end of AoE)
Translation submission deadline	20th July
System description abstract paper	27th July 2nd August
Translated test suites shipped back to test suites authors	27th July
System description submission	5th September
Test suite description submission	8th September 12:00 noon CEST
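
Because the deadlines are stated as end of day Anywhere on Earth, it can help to convert them to another time zone. The following is a minimal sketch, assuming that "end of day AoE" means 23:59:59 at a fixed offset of UTC-12 and that the dates refer to 2023; the helper name `aoe_deadline` is illustrative and not part of any WMT tooling.

```python
from datetime import datetime, timedelta, timezone

# Anywhere on Earth (AoE) corresponds to a fixed offset of UTC-12.
AOE = timezone(timedelta(hours=-12), name="AoE")
# CEST (Central European Summer Time) is UTC+2.
CEST = timezone(timedelta(hours=2), name="CEST")


def aoe_deadline(year: int, month: int, day: int) -> datetime:
    """Return the last second of the given calendar day in AoE."""
    return datetime(year, month, day, 23, 59, 59, tzinfo=AOE)


# Assumption: the translation submission deadline (20th July) is in 2023.
deadline = aoe_deadline(2023, 7, 20)
print("AoE :", deadline.isoformat())
print("UTC :", deadline.astimezone(timezone.utc).isoformat())
print("CEST:", deadline.astimezone(CEST).isoformat())
```

Under these assumptions, the 20th July deadline falls just before noon UTC on 21st July.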

Task Description

We provide training data for all language pairs, and a common framework. The task is to improve current methods. We encourage a broad participation — if you feel that your method is interesting but not state-of-the-art, then please participate in order to disseminate it and measure progress. Participants will use their systems to translate a test set of unseen sentences in the source language. The translation quality is measured by a manual evaluation and various automatic evaluation metrics.

You may participate in any or all of the language pairs. To have a common framework that allows for comparable results, and also to lower the barrier to entry, we provide a common training set. You are not limited to this training set, and you are not limited to the training set provided for your target language pair. This means that multilingual systems are allowed, and classed as constrained as long as they use only data released for WMT23.

Each participant is required to submit a system description paper, which should highlight how your methods and data differ from the standard task. You should make clear which tools and which training sets you used.
Each participant must also submit a one-page abstract of the system description one week after the system submission deadline. The abstract should contain, at a minimum, basic information about the system and the approaches, data, and tools used; it may also be a full description paper or a draft that is later revised into the final system description paper. See the Main page for the link to the submission site.

Constrained and Unconstrained track

The General MT task has two separate tracks with different constraints on model training: constrained and unconstrained. The constrained track specifies exactly which training data and pretrained models may be used to train the translation models, while the unconstrained track allows participation with a system trained without any limitations.

The limitations for the constrained track are as follows:

You may only use the training data allowed for this year (specified later on this page)
You may use any publicly available metric that was evaluated on past WMT Metrics shared tasks (for example: COMET, BLEURT, etc.)
You may ONLY use the following listed pretrained models in all publicly available model sizes: mBART, BERT, RoBERTa, XLM-RoBERTa, sBERT, LaBSE
You may use any basic linguistic tools (taggers, parsers, morphological analyzers, etc.)

If you think any pretrained model should be added into the list of allowed models for constrained track, write us an email (we may consider allowing it for next year).

Document-level MT

We are interested in the question of whether MT can be improved by using context beyond the sentence, and to what extent state-of-the-art MT systems can produce translations that are correct "in-context". All of our development and test data contain full documents, and all our human evaluation will be in-context; in other words, the evaluators will see each sentence together with its surrounding context when evaluating.

Our training data retains context and document boundaries wherever possible. In particular, the following corpora retain the context intact:

Parallel: europarl, news-commentary, CzEng, Rapid
Monolingual: news-crawl (en, de and cs), europarl, news-commentary

Test Suites

This year’s shared task will also include the “Test suites” sub-task, which has been part of WMT since 2018. More details about the test suites are provided on a separate page.

Data

Licensing of Data

The data released for the WMT23 General MT task can be freely used for research purposes. We ask that you cite the WMT23 shared task overview paper and respect any additional citation requirements on the individual data sets. For other uses of the data, you should consult the original owners of the data sets.

Training Data

We aim to use publicly available sources of data wherever possible.

Note that the released data is not tokenized and includes sentences of any length (including empty sentences). You may want to consider using Moses tools for tokenizing. These tools are available in the Moses git repository.
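
Since the released corpora can contain empty or very long lines, a light filtering pass before training is a common first step. The following is a minimal sketch, assuming the data is available as aligned source/target string pairs; the function name `filter_pairs` and the `max_chars` and `max_ratio` thresholds are illustrative choices, not values recommended by the organisers.

```python
from typing import Iterable, Iterator, Tuple


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    max_chars: int = 1000,    # hypothetical length cap
    max_ratio: float = 3.0,   # hypothetical cap on the length ratio
) -> Iterator[Tuple[str, str]]:
    """Yield sentence pairs that are non-empty and not wildly mismatched."""
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs where either side is empty after trimming whitespace.
        if not src or not tgt:
            continue
        # Drop pairs where either side exceeds the length cap.
        if len(src) > max_chars or len(tgt) > max_chars:
            continue
        # Drop pairs whose character lengths differ by more than max_ratio.
        longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
        if longer / shorter > max_ratio:
            continue
        yield src, tgt


corpus = [
    ("Hello world.", "Hallo Welt."),
    ("", "Leerer Quellsatz."),
    ("Short.", "Ein sehr, sehr, sehr langer Satz ohne Entsprechung."),
]
kept = list(filter_pairs(corpus))
print(kept)
```

A character-based length ratio is crude for pairs with very different scripts, such as Chinese and English, so the threshold would need to be tuned per language pair.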

Download

You can download all corpora from the command line, following the detailed instructions here, except for the two datasets marked 'Register and Download' (CzEng2.0 and CCMT). Usage:

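# Install the pinned mtdata release
pip install mtdata==0.4.0
# Fetch the recipe definitions for the WMT23 constrained track
wget https://www.statmt.org/wmt23/mtdata/mtdata.recipes.wmt23-constrained.yml
# Download each translation direction into a directory named after its recipe
for ri in wmt23-{enzh,zhen,ende,deen,enhe,heen,enja,jaen,enru,ruen,encs,csuk,enuk,uken}; do
  mtdata get-recipe -ri "$ri" -o "$ri"
done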

Parallel Training Data:

File	CS-EN	DE-EN	JA-EN	RU-EN	ZH-EN	HE-EN	UK-EN	UK-CS	Notes
Europarl v10	✓	✓
ParaCrawl v9	✓	✓	✓	✓	✓		✓		Note that only the ticked language pairs are available for constrained participants, but the metadata (tmx files) may be used.
Common Crawl corpus	✓	✓		✓					Same as last year. The fr-de version is here
News Commentary v18.1	✓	✓	✓	✓	✓
CzEng 2.0	✓								Register and download CzEng2.0. The new CzEng includes synthetic data, and includes all cs-en data supplied for the task. See the CzEng README for more details.
Yandex Corpus				✓
Wiki Titles v3	✓	✓	✓	✓	✓
UN Parallel Corpus V1.0				✓	✓				Register and download
Tilde MODEL corpus	✓	✓		✓			✓		de-en and cs-en contain document information.
CCMT Corpus					✓				Register and download
WikiMatrix	✓	✓	✓	✓	✓	✓	✓	✓	We release the official version, with added language identification (from cld2).
Back-translated news	✓			✓	✓				Back-translated news. The cs-en data is contained in CzEng. The zh-en and ru-en data was produced for the University of Edinburgh systems in 2017 and 2018.
Japanese-English Subtitle Corpus			✓						Note: English side is lowercased.
The Kyoto Free Translation Task Corpus			✓
TED Talks			✓						From IWSLT 2017 Evaluation Campaign.
ELRC - EU acts in Ukrainian							✓	✓
OPUS						✓		✓

Monolingual Training Data:

Corpus	CS	DE	EN	JA	RU	ZH	HE	UK	Notes
News crawl	✓	✓	✓	✓	✓	✓		✓	Large corpora of crawled news, collected since 2007. For de, cs, and en versions are available with document boundaries, and without sentence-splitting.
News discussions			✓						Corpora crawled from comment sections of online newspapers (no longer updated).
Europarl v10	✓	✓	✓						Monolingual version of European parliament crawl. Superset of the parallel version.
News Commentary	✓	✓	✓	✓	✓	✓			Updated Monolingual text from news-commentary crawl. Superset of parallel version. Use the latest version.
Common Crawl	✓	✓	✓	✓	✓	✓			Deduplicated with development and evaluation sentences removed. English was updated 31 January 2016 to remove bad UTF-8. Downloads can be verified with SHA512 checksums. More English is available.
Extended Common Crawl	✓	✓		✓	✓	✓			Extended Common Crawl extracted from crawls up to April 2020.
UberText Corpus								✓	Text crawled from Ukrainian periodicals
Leipzig Corpora	✓	✓	✓	✓	✓	✓	✓	✓	Leipzig Corpora Collection: From 100 to 200 Languages PDF
Legal Ukrainian								✓	Legal Ukrainian: 69M token corpus in the legal sector; crawled from websites belonging to legislation, government, court, and parliament

Development Data

To evaluate your system during development, we suggest using test sets from past WMT years. For automatic evaluation, we recommend using sacreBLEU, which will automatically download previous WMT test sets for you. You may also want to consider the COMET automatic metric, which has been shown to correlate highly with human judgements. We also release other dev and test sets from previous years.

The 2023 test sets will be created from a sample of up to four domains (most likely news, e-commerce, user-generated, and conversational) with an equal number of sentences per domain. The sources of the test sets will be original text, whereas the targets will be human-produced translations.

Note that the dev data contains both forward and reverse translations (clearly marked).

We use an xml format (instead of the previous sgm format) for all dev, test and submission files. It is important to use an xml parser to wrap/unwrap text in order to ensure correct escaping/de-escaping. We will provide tools.
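
To illustrate why an XML parser matters here, the sketch below wraps a segment containing reserved characters and then reads it back. It uses Python's standard `xml.etree.ElementTree`; the element and attribute names (`doc`, `seg`, `id`) are placeholders chosen for illustration and do not claim to reproduce the exact WMT schema.

```python
import xml.etree.ElementTree as ET

# A segment containing characters that are reserved in XML.
text = 'Tom & Jerry said: "1 < 2" is true.'

# Wrap: build the element tree and let the serializer handle escaping.
doc = ET.Element("doc", {"id": "example-1"})  # placeholder element names
seg = ET.SubElement(doc, "seg", {"id": "1"})
seg.text = text
serialized = ET.tostring(doc, encoding="unicode")
print(serialized)  # '&' and '<' appear as entities in the serialized form

# Unwrap: parse the XML and read the text back, already de-escaped.
parsed = ET.fromstring(serialized)
recovered = parsed.find("seg").text
print(recovered)
```

Building the same file by plain string concatenation would leave the raw `&` and `<` in place and produce XML that a parser rejects.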

Test Set Submission

The detailed instructions below are meant to make the submission process seamless. To participate, please follow these steps:

1) Register your team in OCELoT; the registration portal is currently open at ocelot-wmt23.mteval.org/

2) Get your team verified by sending an email with your affiliation to maja.popovic.166@gmail.com (please note that the verification process may take some time to complete)

3) Translate the blind test sets: www2.statmt.org/wmt23/wmttest2023.src.zip

4) After verification, submit your translations to OCELoT (the deadline is 20th July AoE; please don’t leave it to the last minute)

5) Before 27th July AoE, select your primary system in OCELoT; you may choose a single system for each language pair

6) Before 27th July AoE, submit a short abstract paper to SoftConf: www.softconf.com/emnlp2023/wmt/ (this should be a brief summary of your submission, which you later replace with a system description paper; you may submit the full system description paper right away if you prefer)

We will keep this sheet updated to make the verification process transparent for teams and to avoid confusion over missed steps.

The blind test data sources are available here
The sources are in xml format. Scripts for converting xml to/from line-oriented text are available here
The sources contain the General MT test sets and additional test sets from “test suites” and other shared tasks, which will be used for further evaluation of the translation systems.
Your translations should be submitted through OCELoT
You first need to register a team name with OCELoT. Your team will then need to be activated by the General MT task organisers before you can submit. Please send an email to Maja with your OCELoT team name and your institution/company details in order to get activated.
Translations should be “human-ready”, i.e. in the form in which text is normally published, so Latin-script languages should be recased and detokenised, Chinese and Japanese should be unsegmented, etc.
Submissions should be formatted in the WMT xml format, using the format tools linked above
You can make up to 7 submissions per language pair, per team. Each submission will be scored (ChrF) against a reference translation; the scores in OCELoT do not reflect actual system performance and are mainly for validation.
During the test week, all submissions will remain anonymous
Submissions should be uploaded by the deadline stated above
After submission, select a primary system for each of the language directions
To select primaries, log in to OCELoT, select the Team tab at the top, and click on the yellow "Team submissions" button.
When choosing a primary system, you will be asked to give a short (one paragraph) description of the system, and fill in a web form with some details of technologies used.
About a week after the submission week deadline, once we have a final set of primary submissions, we will de-anonymise the primary submissions, and only the primary submissions. We will also release the references.
Each team must submit an abstract paper by the deadline stated above, and later a full system paper describing its submission. Otherwise, the submission won’t be considered for the General MT task
See the Main page for details on abstract and paper submission.
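
The ChrF validation score mentioned in the list above is a character n-gram F-score. The sketch below is a simplified, self-contained approximation intended for intuition only: it assumes n-gram orders 1 to 6, beta = 2, and whitespace removal, and it is not expected to reproduce exactly the numbers reported by OCELoT or sacreBLEU. The function names are illustrative.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_order: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF in the range 0..100."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: recall is weighted beta^2 times as much as precision.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


print(simple_chrf("the cat sat", "the cat sat"))
print(simple_chrf("the cat sat", "a dog ran"))
```

A score of this kind is useful mainly as a sanity check that a submission was uploaded in the right language and format, which matches how the OCELoT scores are described above.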

The test set may be larger than 100 000 sentences because of the test suites. Translating the full blind test set is mandatory for participating in General MT.

Mono data donation

Getting fresh and unseen data for blind testing is a challenging task, especially with LLMs whose training sets are unknown. We are looking for partners who would be willing to donate data for the General MT test sets. Please also let us know if you are aware of a source of data available under a research-permissive license.

We are looking for data with paragraph/document context (not stand-alone sentences). The data must be originally written in the language (no translationese) and may come from any domain. We only need hundreds of sentences.

We will translate the data ourselves and prepare it for this year’s General MT blind tests.

If you are interested in donating data, please contact tomkocmi@microsoft.com.

Evaluation

Primary systems (for which abstracts have been submitted) will be included in the human evaluation. We will collect subjective judgments about the translation quality from annotators, taking the document context into account.

In the unlikely event of an unprecedented number of system submissions that we couldn’t evaluate, we may decide to preselect systems for human evaluation by automatic metrics (especially not evaluating low-performing unconstrained systems). However, we believe this won’t be applied and all primary systems will be evaluated by humans.

Contact

For queries, please use the mailing list or contact Tom Kocmi.

Organizers

Tom Kocmi - tomkocmi@microsoft.com
Eleftherios Avramidis
Rachel Bawden
Ondřej Bojar
Anton Dvorkovich
Christian Federmann
Mark Fishel
Markus Freitag
Thamme Gowda
Roman Grundkiewicz
Barry Haddow
Philipp Koehn
Benjamin Marie
Makoto Morishita
Kenton Murray
Masaaki Nagata
Toshiaki Nakazawa
Martin Popel
Maja Popović
Mariya Shmatova

Acknowledgements

We would like to thank Rebecca Knowles and Sergio Bruccoleri. This task would not have been possible without the sponsorship of monolingual data, test set translation, and evaluation from our partners, namely Microsoft, Charles University, Toloka, NTT, Dubformer, Google, Centific … TBA.