EMNLP 2026

ELEVENTH CONFERENCE ON
MACHINE TRANSLATION (WMT26)

November, 2026
Budapest, Hungary

TRANSLATION TASKS: GENERAL MT • INDIC MT • ARABIC-ASIAN MT • CHINESE-SOUTHEAST ASIAN MT • TERMINOLOGY • MODEL COMPRESSION • CREOLE MT • VIDEO SUBTITLE TRANSLATION
EVALUATION TASKS: MT TEST SUITES • AUTOMATED MT EVALUATION
OTHER TASKS: OPEN DATA • MULTILINGUAL INSTRUCTION • LIMITED RESOURCES LLM

Task description

While other participants focus on building stronger and better metrics, participants of this subtask build challenge sets that identify where metrics fail! Continuing the work on the challenge sets of previous years, the goal of this subtask is for participants to create test sets with challenging evaluation examples that current metrics do not evaluate well. The challenge sets can relate to any (or all) of the three other subtasks, i.e., explore challenges in error detection and span annotation, quality score prediction, or detection of error-free segments (see below).

This subtask is organized into 3 rounds:

  1. Breaking Round: Challenge set participants (Breakers) create challenging examples for metrics. They must send the resulting "challenge sets" to the organizers by July 16th.

  2. Scoring Round: The challenge sets created by Breakers will be included in the test sets of the three main subtasks and thus sent to all system participants (Builders) for scoring.

  3. Analysis Round: Breakers will receive their data with all the metric scores for analysis.

Challenge set types

Submission details

  • Challenge set participants can submit challenges for any or all of the other subtasks.

  • The submission format will follow the JSONL file structure used by WMT-GenMT. Minor adaptations to accommodate the needs of the Automated MT Evaluation tasks may occur. Details will be announced soon.

  • Challenge sets in all possible language pairs are welcome; however, system participants may choose to exclude language pairs, other than the official shared-task ones, that are not supported by their systems.

  • A limit of 1M tokens (per participant, including all language pairs) is introduced to facilitate processing by translation systems with high computational resource requirements. Participants who are severely hindered by the limit are encouraged to contact the organizers.

  • The shared task organisers will not shuffle or change the order of segments provided. The provided segments will be incorporated into the main test data of the respective subtask and provided to participants for scoring.

Note that, depending on the subtask(s) you intend to apply your challenge set to, we expect you to select the respective options in the registration form; we will then include your data in all selected subtasks.

Challenge set registration

To help us with the organization of the shared task, please (optionally) register your intended submission with contact details and a short description by July 3rd here.

Challenge set submission

You can upload your submission in an archive file (preferably .tar.gz, or .zip) until July 16th (AoE) here.