The “Test Suites” sub-task will be part of WMT for the eighth consecutive year.
What’s new
- 2026-04-01:
  - New JSONL submission format
  - Possibility to specify the prompt
  - Possibility to submit language directions not included in the GenMT shared task
  - Explicit mention of the possibility to use LLMs for the creation and evaluation of test suites
What are the test suites?
Test suites are custom extensions to the standard General Translation Task (GenMT) test sets, constructed to focus on particular aspects of MT output and to evaluate those aspects in their own custom way.
The creation and evaluation of a test suite are entirely the responsibility of the test suite provider, e.g. you.
For our purposes, every test suite includes:
- a source-side test set (i.e., a set of paragraphs in one language),
- an evaluation service (not necessarily automatic, and not necessarily relying on reference translations) for a particular target language,
- new 2026! a custom prompt for each source-side item.
As opposed to the standard evaluation process, which employs randomly selected test sets, test suites focus on particular translation phenomena or cases that present specific challenges to MT. Additionally, unlike generic quality scores, test suites often produce separate fine-grained results for each case.
Given the massive presence of LLMs in WMT, test suites present a great opportunity to reveal weaknesses and serious flaws in LLM translations — issues that might otherwise be hidden within overall high-quality output.
Why are test suites important?
Over the last years, through advances in MT and more recently with the emergence of LLMs, translation quality has improved substantially. In many cases, evaluation based on generic test sets yields very good results and sometimes fails to discriminate machine output from human translations. Nevertheless, there are still cases where MT struggles: output may be fluent and surrounded by seemingly perfect translations, yet still be seriously misleading. In general evaluation methods, such flaws can become “hidden in the average” or simply be missed altogether. Test suites provide a flexible means of focusing on any specific problem and can be very effective at revealing such cases.
What do I have to do as a test suite participant?
- Prepare your test suites and provide us with the source texts (optionally with references).
- Receive the machine translations of your source texts and perform the evaluation according to your test suite methodology, whatever that is.
- Write and submit a test suite paper and a short summary of the key takeaways.
How does the test suite track work?
The test suite track is a sub-task of the GenMT task and has the following steps:
- 4th June: The test suite participants (i.e. you) submit the source side (and optional references) of the test suites, along with a README (participant & submission information), to the organizers of the Shared Task.
- 18th June: The shared task organizers (we) combine the test suites with the other test sets of the GenMT task and provide the combined source side to the developers of MT systems.
- 2nd July: The developers of MT systems (i.e. participants of the GenMT task) download the source side of the test sets, produce the machine translations with their systems, and upload them to the Shared Task organizers.
- 16th July: The shared task organizers (we) separate the test suite translations from the generic test set MT outputs and return them to the respective test suite participants (you).
- You (the test suite participant) receive the machine translations of your test suites from all participating systems.
- You analyze the outputs, produce numerical results and the corresponding discussion/conclusions, and write a description paper.
- Middle of August (please confirm at the main page):
  - You submit via e-mail a few sentences summarizing the main takeaways, so that we can reference your paper in the WMT Findings paper.
  - You submit the corresponding paper (“testsuite paper” in the SoftConf submission system) and go through the reviewing process. (Meanwhile, we will try to provide you with system descriptions for the participating systems to aid your analysis.)
- October: Your test suite paper is published in the WMT proceedings and is summarized and referenced in the overall findings paper.
How will the submission take place and what is the format?
You are encouraged to express your interest in participating in the sub-task ahead of time by filling in the participation form. The form is optional but helps us maintain an overview and manage our resources.
Test suites need to be uploaded to the file repository provided by the organizers. For every test suite, participants must create an archive (tar or zip) with a subfolder for each language pair. Each subfolder should contain:
- One README.md file with participant information (name of the test suite, name of the institution, contact email address).
- new 2026! A JSONL file (JSON Lines), where every line looks like:
  {"dataset_id":"testsuite_name","doc_id":"cs-de_DE_#_testsuite_name_#_id123","src_lang":"cs","tgt_lang":"de_DE","prompt":"Translate from cs to de_DE.","src_text":"Original Czech text...\n\nWith paragraph breaks encoded as \n\n.","ref_text":"Optional reference here otherwise null"}
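Such lines can be produced programmatically. A minimal Python sketch, assuming hypothetical item values (the suite name `my_testsuite`, item id `id123`) and following the doc_id convention `<src>-<tgt>_#_<testsuite_name>_#_<id>` shown in the example above:

```python
import json

def make_item(dataset_id, src_lang, tgt_lang, item_id, src_text,
              ref_text=None, prompt=None):
    # doc_id follows the convention "<src>-<tgt>_#_<testsuite>_#_<id>".
    if prompt is None:
        prompt = f"Translate from {src_lang} to {tgt_lang}."
    return {
        "dataset_id": dataset_id,
        "doc_id": f"{src_lang}-{tgt_lang}_#_{dataset_id}_#_{item_id}",
        "src_lang": src_lang,
        "tgt_lang": tgt_lang,
        "prompt": prompt,
        "src_text": src_text,
        "ref_text": ref_text,  # serialized as JSON null when no reference exists
    }

items = [make_item("my_testsuite", "cs", "de_DE", "id123",
                   "První odstavec.\n\nDruhý odstavec.")]

# One JSON object per line; json.dumps escapes the real newlines in
# src_text as \n, so each item stays on a single line of the file.
with open("cs-de_DE.jsonl", "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
```

Using `json.dumps` rather than hand-built strings guarantees valid JSON and correct escaping of quotes and paragraph breaks.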
Please note that all submitted source sentences will be made public for research purposes. You are free to specify a more permissive license.
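The per-language-pair folder layout can then be packed for upload with a short script. A sketch assuming a hypothetical suite called `my_testsuite` with a single cs→de_DE pair; the file names are illustrative, not prescribed by the organizers:

```python
import zipfile
from pathlib import Path

def build_archive(root: Path, archive: Path) -> None:
    # Pack every file under `root`, preserving the per-language-pair
    # subfolder structure inside the archive.
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                zf.write(path, arcname=path.relative_to(root).as_posix())

# Illustrative layout: one subfolder per language pair, holding the
# README.md and the JSONL file for that pair.
root = Path("my_testsuite")
pair = root / "cs-de_DE"
pair.mkdir(parents=True, exist_ok=True)
(pair / "README.md").write_text(
    "Test suite: my_testsuite\nInstitution: ...\nContact: ...\n",
    encoding="utf-8")
(pair / "cs-de_DE.jsonl").write_text("", encoding="utf-8")

build_archive(root, Path("my_testsuite.zip"))
```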
Are there any restrictions on participation in the test suite sub-task?
- new 2026! Test suite participants may submit test suites in any language pair. (However, note that the language directions of the GenMT Task have priority: GenMT participants may still choose to translate only their preferred subset of languages. Additionally, only the language directions of the GenMT Task will be presented in the aggregated results table in the main paper.)
- Participants are free to choose the phenomena or challenges they focus on and the metrics they will report.
- It is a bonus if a test suite shares some or all of its sentences or language phenomena across multiple language pairs.
- A limit of 1M tokens (per participant, including all language pairs) applies. Participants who are severely hindered by this limit are encouraged to contact the organizers.
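The token budget can be estimated before submitting. Since the call does not specify a tokenizer, the sketch below simply splits source texts on whitespace as a rough approximation; the item layout mirrors the JSONL format, and all values are hypothetical:

```python
TOKEN_LIMIT = 1_000_000  # per participant, across all language pairs

def approx_tokens(items):
    # Whitespace-split word count over the source texts only; this is
    # only an estimate, since no official tokenizer is specified.
    return sum(len(item["src_text"].split()) for item in items)

items = [
    {"src_text": "První odstavec.\n\nDruhý odstavec."},
    {"src_text": "Another source paragraph."},
]

total = approx_tokens(items)
assert total <= TOKEN_LIMIT
```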
Are the participants allowed to use LLMs for the test suites?
- new 2026! Test suite participants are allowed to use LLMs for test suite generation, augmentation, and evaluation, provided this is clearly described in the paper and the results do not introduce biases.
- If authors require additional resources, they are encouraged to apply for relevant grants, e.g. the Catalyst grant.