EMNLP 2025

TENTH CONFERENCE ON
MACHINE TRANSLATION (WMT25)

November 5-9, 2025
Suzhou, China
 

ANNOUNCEMENTS

  • 2025-04-01: The test suite subtask was announced.

  • 2025-03-25: Changed the language direction for Bhojpuri and Maasai to out-of-English instead of to-English.

  • 2025-03-01: Added the WMT24++ test set and removed Irish from the multilingual subtask.

  • 2025-02-20: The shared task was announced.

DESCRIPTION

The General MT task, formerly known as the News translation task of WMT, focuses on evaluating the general capabilities of machine translation (MT) systems. Its primary goal is to test performance across a wide range of languages, domains, genres, and modalities. Systems are evaluated by human judges.

In addition to the main General MT task, there are two separate subtasks:

  • Test suite subtask - design a challenge test set and analyze the system translations

  • Multilingual subtask - develop a system that can translate into multiple languages simultaneously. Participating systems must translate all General MT language pairs plus 15 additional languages

The goals of the shared translation task are:

  • To investigate the applicability of current MT techniques when translating into languages other than English across various domains

  • To investigate the translation of low-resource, morphologically rich languages

  • To create publicly available test sets for machine translation evaluation

  • To generate up-to-date performance scores in order to provide a basis of comparison in future research

  • To offer newcomers a smooth start with hands-on experience in state-of-the-art machine translation methods

  • To investigate the usefulness of multilingual and third language resources

  • To assess the effectiveness of document-level approaches and uses of non-textual modalities

We hope that both beginners and established research groups will participate in this task.

Main changes of 2025

  • New modalities as an additional context: video, image

  • Redefined the constrained/open/closed tracks. There are now only two tracks:

    • constrained open weights track - any public training data is allowed, any public model smaller than 20B parameters is allowed, and the model must be publicly released

    • unconstrained track - no limitations; only a system description paper is required

  • New language pairs (English-Arabic, English-Estonian, English-Korean, English-Serbian, Czech-German, English-Bhojpuri, English-Maasai)

  • New Multilingual subtask - participating systems are required to translate ~30 language pairs

  • Test sets will focus on four different domains:

    • News - we employ difficulty sampling to select the most challenging articles

    • Speech - we provide the ASR transcript together with the original video

    • Social - in addition to the textual content, we provide a screenshot image that may contain useful context

    • Literary - a domain focusing on long-context translation

  • Providing a preamble prompt for each domain containing a translation brief describing the expected translation output, useful for instruction-tuned LLMs

  • The XML format will be replaced with JSON

  • Automatic system ranking will be based on a panel of LLM-as-a-judge evaluators to minimize metric bias
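Since the exact JSON schema has not yet been published, the sketch below only illustrates what a document-level JSONL entry could look like; all field names (`doc_id`, `domain`, `src_lang`, `src_text`, …) are assumptions for illustration, not the official WMT25 format.

```python
import json

# Hypothetical document-level entry; the real WMT25 schema may differ.
entry = {
    "doc_id": "news-0001",   # assumed field name
    "domain": "news",        # one of: news, social, speech, literary
    "src_lang": "en",
    "tgt_lang": "cs",
    "src_text": ["First sentence.", "Second sentence."],  # document kept as segments
}

# One JSON object per line (JSONL) is a common choice for document-level test sets.
line = json.dumps(entry, ensure_ascii=False)
parsed = json.loads(line)
print(parsed["doc_id"], parsed["domain"])
```

A submission in such a format would carry a parallel `tgt_text` list per document, keeping segment alignment intact.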

Language pairs

The language pairs to be evaluated (in the specified direction only):

  • Czech to Ukrainian

  • Czech to German (New)

  • Japanese to Chinese

  • English to Arabic (New)

  • English to Bhojpuri (New)

  • English to Chinese

  • English to Czech

  • English to Estonian (New)

  • English to Icelandic

  • English to Japanese

  • English to Korean (New)

  • English to Maasai (New)

  • English to Russian

  • English to Serbian (New)

  • English to Ukrainian

IMPORTANT DATES

All dates are at the end of the day for Anywhere on Earth (AoE)

  • Finalized task details: end of February

  • Final test suite source texts must reach us: 12th June

  • Test data released: 26th June

  • Translation submission deadline: 3rd July

  • System description abstract paper: 10th July

  • Translated test suites shipped back to test suite authors: 17th July

TASK DESCRIPTION

The task is to improve and evaluate current MT methods. We encourage a broad participation — if you feel that your method is interesting but not state-of-the-art, then please participate in order to disseminate it and measure progress. Participants will use their systems to translate a testset of unseen documents. The translation quality is measured by a human evaluation and various automatic evaluation metrics.

You may participate in any or all of the language pairs. If your system is multilingual, please participate in the Multilingual subtask. To lower the entry barrier, we define a constrained track that allows comparing systems within the same category.

Each participant is required to submit a submission paper describing how the model was built and highlighting the ways their methods and data differ from the usual training pipeline. You should make clear which tools and training sets you used. Each participant must submit an abstract of the system description, at least one page long, one week after the system submission deadline. The abstract must contain, at a minimum, basic information about the system and the approaches/data/tools used. It can be a full description paper or a draft that can later be expanded into the final system description paper. See the Main page for the link to the submission site.

Constrained/Unconstrained track

The General MT task has two separate tracks with different constraints on the models, allowing for better comparability. Constrained models compete only against other constrained models, while unconstrained models compete against all.

  • Constrained open weights systems

  • you may use any training data, models, and frameworks released under an open source license that allows unrestricted use for non-commercial purposes (e.g. Apache, MIT, …​), making your work replicable by any research group

  • the final model’s total number of parameters must be smaller than 20B. See the suggested LLMs that fall into this category below. Intermediate steps may use larger models, for example for distillation.

  • you are required to release the model weights under an open source license

  • Unconstrained track - a track that places no limitations and does not require publishing the model. Closed systems such as GPT are part of this track.

Although we no longer restrict training data, we have curated corpora that should cover the majority of the available data.

Here is a non-exhaustive list of suggested LLMs that fall under the 20B-parameter limit:

Domains

We plan the following domains: news, social, speech, and literary. There may be some differences in domains for non-English source languages. All test sets will be provided at the document level.

Speech domain

The speech domain will be released as the original video together with a machine-generated transcript. Only the original video will be used as the source for human evaluation. All test sets will contain a transcript generated with a speech recognition model, so using the audio is not required, but it could improve performance. You may use models such as Whisper to obtain a cleaner transcript.

As a useful development set, we recommend the WMT24 test set, which was released together with videos.

Social domain

In addition to the textual content, we will provide a screenshot of the original social media page, which may or may not be useful for the translation. Human evaluation, however, will use the screenshots when judging the quality of the translation.

Our setup will have screenshots similar to those released with WMT24++, which can be used as a development set.

Test Suites subtask

This year’s shared task will also include the “Test suites” subtask, which has been part of WMT since 2018. Participants in the test suite track will provide challenge test sets that will be used to probe specific phenomena of machine translation. More details about the test suites are provided on a separate page.

Multilingual subtask

As modern MT systems are able to translate into multiple languages, we define an additional list of languages and ask participants in the multilingual subtask to translate a test set covering all General MT languages plus the additional ones.

Systems from the multilingual subtask will also compete in the General MT track as usual, but are additionally evaluated automatically on the following languages and compete in a separate multilingual category.

The sources for translation will be identical to the English sources from the main task, while systems will additionally be evaluated on the following languages:

  • Bengali, German, Greek, Hindi, Indonesian, Italian, Kannada, Lithuanian, Marathi, Farsi (Persian), Romanian, Swedish, Thai, Turkish, Vietnamese

A development set containing all these languages in all investigated domains is available: WMT24++.

DATA

Licensing

The data released for the General MT task can be freely used for research purposes; we ask that you cite the shared task overview (findings) paper and respect any additional citation requirements on the individual datasets. For other uses of the data, please consult the original owners of the datasets.

Recommended training data

We aim to use publicly available datasets wherever possible. You can download all corpora via the command line:

pip install mtdata==0.4.3  # requires Python 3.9-3.11

# Fetch the recipe file listing all WMT25 constrained-track corpora
wget https://www.statmt.org/wmt25/mtdata/mtdata.recipes.wmt25-constrained.yml

# Cache all WMT25 datasets locally (8 parallel jobs)
mtdata -no-pb cache -j 8 -ri "wmt25-*"

# Download each language pair's corpora, compressed and without merging files
for id in wmt25-{eng-{ara,bho,ces,est,isl,kor,rus,srp,ukr,zho},ces-{ukr,deu},jpn-zho}; do
  mtdata get-recipe -i $id -o $id --compress --no-merge -j 8
done

Development Data

To evaluate your system during development, we suggest using test sets from past WMT years. For automatic evaluation, we recommend the COMET metric, which has been shown to correlate highly with human judgments. You can use the following test sets when developing your systems.
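As a rough sketch of scoring development translations with the COMET library (`pip install unbabel-comet`): the input is a list of dicts with "src", "mt", and "ref" keys. The model download and prediction calls are commented out here to keep the snippet lightweight; the checkpoint name and call pattern follow the COMET library, but check its documentation for your installed version.

```python
# Each segment is a dict of source, machine translation, and reference.
data = [
    {
        "src": "Dem Feuer konnte Einhalt geboten werden",
        "mt": "The fire could be stopped",
        "ref": "They were able to control the fire.",
    },
]

# Commented out to avoid downloading a large checkpoint in this sketch:
# from comet import download_model, load_from_checkpoint
# model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
# scores = model.predict(data, batch_size=8).scores  # one score per segment

print(len(data), sorted(data[0]))
```

Segment scores can then be averaged per document or per system to track progress across checkpoints.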

EVALUATION

Primary systems (for which abstracts have been submitted) will be included in the human evaluation. We will collect subjective judgments about translation quality from annotators, most likely based on the ESA protocol, taking the document context into account.

We are able to evaluate with humans only about 15-20 systems per language pair. For language pairs with more submissions than we can evaluate, we will prioritize constrained systems and remove from human evaluation the unconstrained systems with the lowest automatic performance, as well as clear outliers with low scores under automatic evaluation.
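The prioritization described above can be sketched as follows. This is an illustrative interpretation, not the organizers' actual selection procedure; the function name, capacity value, and example scores are all assumptions.

```python
def select_for_human_eval(systems, capacity=20):
    """Pick up to `capacity` systems for human evaluation: keep constrained
    systems first, then fill remaining slots with the best-scoring
    unconstrained systems. `systems` holds (name, is_constrained, auto_score)."""
    constrained = [s for s in systems if s[1]]
    unconstrained = sorted(
        (s for s in systems if not s[1]),
        key=lambda s: s[2],
        reverse=True,  # highest automatic score first
    )
    selected = constrained[:capacity]
    selected += unconstrained[: max(0, capacity - len(selected))]
    return [name for name, _, _ in selected]

# Hypothetical submissions with made-up automatic scores:
systems = [
    ("constrA", True, 0.78),
    ("constrB", True, 0.74),
    ("openX", False, 0.85),
    ("openY", False, 0.60),
    ("openZ", False, 0.40),
]
print(select_for_human_eval(systems, capacity=4))
```

With a capacity of 4, both constrained systems are kept and the two best unconstrained systems fill the remaining slots, while the low-scoring outlier is dropped.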

Testset Submission

TBA

CONTACT

For queries, please use the mailing list or contact Tom Kocmi.

Organizers

  • Tom Kocmi - kocmi@cohere.com

  • Eleftherios Avramidis

  • Rachel Bawden

  • Ondřej Bojar

  • Anton Dvorkovich

  • Sergey Dukanov

  • Mark Fishel

  • Markus Freitag

  • TG Gowda

  • Roman Grundkiewicz

  • Barry Haddow

  • Marzena Karpinska

  • Philipp Koehn

  • Howard Lakougna

  • Jessica Lundin

  • Benjamin Marie

  • Kenton Murray

  • Masaaki Nagata

  • Stefano Perrella

  • Lorenzo Proietti

  • Martin Popel

  • Maja Popović

  • Mariya Shmatova

  • Steinþór Steingrímsson

  • Lisa Yankovskaya

  • Vilém Zouhar

Acknowledgements

This task would not have been possible without the sponsorship of test set translations and human evaluation by our partners. Namely (in alphabetical order): Árni Magnússon Institute for Icelandic Studies, Charles University, Dubformer, Gates Foundation, Google, Microsoft, NTT, Tartu University, Toloka, and TBA.