MT Evaluation Subtask 3: Detection of Error-Free Segments

EMNLP 2026

ELEVENTH CONFERENCE ON
MACHINE TRANSLATION (WMT26)

28-29 October, 2026
Budapest, Hungary

ANNOUNCEMENTS

2026-07-29: Submission deadline extended to 2 August
2026-07-25: Codabench link updated
2026-07-23: Test set released, further submission details
2026-07-20: Submission details updated
2026-06-05: Further task details updated
2026-04-29: Task details published

OVERVIEW

As part of this year’s evaluation campaign, we introduce a new task: Detection of Error-Free Segments. While traditional Quality Estimation (QE) focuses on continuous scores or fine-grained error spans, this task simplifies the objective to a binary classification: Is this translation error-free? The goal is to identify segments that can be published or consumed without further human intervention. This task is highly relevant for industrial MT pipelines, data filtering for LLM training, and automated post-editing workflows. We encourage participants to explore a variety of methods, from traditional data-filtering heuristics to calibrated QE model outputs and LLM-as-a-judge prompting.

TASK DESCRIPTION

Participants must predict a binary label (true for Error-Free, false for Contains Errors) for a given set of source-target pairs. To stay consistent with the real-world use cases this task is trying to emulate, we do not allow use of reference translations for this task.

A segment is considered "Error-Free" if it requires no corrections to be fluently understood and factually consistent with the source, and does not contain linguistic issues including but not limited to style, register, and locale errors. For this task, the ground truth is derived from human annotations conducted for the General MT shared task (contrastive ESA), where a segment is labeled as error-free if:

It meets a score threshold, AND
It contains no annotated error spans.

Following the human annotation guidelines provided to us, we set the score threshold to ≥85. Participants are advised that this threshold is sensitive to the annotation and score calibration scheme, and should not be used to convert scores from other sources to binary labels without verification.
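The labeling rule above can be sketched in a few lines. This is only an illustration: the `Annotation` container and its field names are hypothetical, since the actual layout of the human annotation data is not described here. Only the threshold of 85 and the two conditions come from the task description.

```python
from dataclasses import dataclass, field

SCORE_THRESHOLD = 85  # threshold stated in the task description


@dataclass
class Annotation:
    """Hypothetical container for one human annotation of a segment."""
    score: float
    error_spans: list = field(default_factory=list)


def is_error_free(annotation: Annotation, threshold: float = SCORE_THRESHOLD) -> bool:
    """Return True only if the score meets the threshold AND no spans are annotated."""
    meets_threshold = annotation.score >= threshold
    has_no_spans = len(annotation.error_spans) == 0
    return meets_threshold and has_no_spans
```

Note that both conditions must hold: a segment scored 95 with one annotated span is still labeled as containing errors under this rule.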

SUBMISSIONS

We encourage a diverse range of approaches, including but not limited to:

Thresholding QE Scores: If you are participating in the main Tasks 1 and 2, we will by default convert your submission into an entry for Task 3 (you may opt out). More details below.
Heuristics and Simple Classifiers: Utilizing source-target length ratios, Language Identification (LangID) checks, HTML/JSONL integrity checks, and/or surface-level fluency features.
Data Filtering Tools: Adapting established tools like Bicleaner or Zipporah.
LLM-as-a-Judge: Using Large Language Models to provide zero-shot or few-shot binary quality assessments.
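As a rough illustration of the "heuristics and simple classifiers" approach listed above, the sketch below flags a hypothesis as not error-free when it is empty, is an untranslated copy of the source, or has an implausible character-length ratio. The function name and the ratio bounds are hypothetical and untuned; character ratios in particular vary widely across language pairs and scripts, so real bounds would need to be set per language pair.

```python
def heuristic_label(src: str, hyp: str,
                    min_ratio: float = 0.5, max_ratio: float = 2.0) -> bool:
    """Return True (error-free) only if no simple heuristic fires.

    min_ratio and max_ratio are illustrative placeholders, not tuned values.
    """
    src, hyp = src.strip(), hyp.strip()
    if not src or not hyp:
        return False  # empty source or empty output
    if src == hyp:
        return False  # output is an untranslated copy of the source
    ratio = len(hyp) / len(src)  # character-length ratio, hypothesis over source
    return min_ratio <= ratio <= max_ratio
```

Passing these checks does not show that a translation is correct; a filter like this can only rule out some obviously bad outputs.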

Shared task participants are encouraged to check out updates published on the general MT shared task, especially after test sets are released on June 18th. A few highlights worth considering for this task:

Additional instructions will be provided with each sample, and failure to follow those instructions will be considered as an error.
Human evaluations are expected to be carried out in a contrastive manner, and all systems will be evaluated. A dynamic sampling mechanism will be in place that samples weaker systems progressively less often.
Multi-modal contexts for samples will be available to both general MT shared task participants and human evaluators.

TEST SET FORMAT

The official test set (downloadable from here) consists of a collection of documents, each divided into segments, to allow systems to evaluate quality in context. The segments, however, consist of longer multi-sentence units of text. We use one unified test set across Tasks 1, 2, and 3. It will use the following JSON-lines format:

{
    "item_id": "setID_###_srcLang_###_tgtLang_###_domain_###_docID_###_segID",
    "src": "sourceText",
    "ref": {
        "text": "referenceText",
        "type": "referenceType"
    },
    "hyps": {
        "systemName1": "translationText1",
        "systemName2": "translationText2",
        ...,
        "systemNameN": "translationTextN"
    },
    "resources": {
        "screenshot": "screenshotFile",
        "video": "videoFile",
        "asr": "asrSourceText"
    }
}
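A minimal sketch of reading this format for Task 3 follows. It assumes each record occupies a single line of the file (the example above is pretty-printed for readability) and deliberately never touches the `ref` field, since references are not permitted for this task. The helper name `iter_pairs` is hypothetical.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (item_id, system_name, source, hypothesis) for every hypothesis.

    Assumes one JSON object per line; the "ref" field is intentionally ignored.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for system_name, hypothesis in record["hyps"].items():
            yield record["item_id"], system_name, record["src"], hypothesis
```

With a file on disk, this could be used as `with open(path, encoding="utf-8") as f: pairs = list(iter_pairs(f))`, where `path` points to the downloaded test set.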

SUBMISSION FORMAT

Submissions will be collected via Codabench (link here). You may submit as many systems as Codabench permits, but at the end you may leave at most two system variants on the leaderboard. These will be considered the final submissions for your participating organization or research group.

The Codabench leaderboard will display F1 scores against an automatic pseudo-reference. This is only for sanity checking and does not affect final system rankings. Please do not attempt to hill-climb on this score.

Your submission to Codabench should have two files packed into a zip file:

Your predictions file prediction.jsonl
Your model card file model_card.json
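One way to build such an archive is sketched below. It assumes both files should sit at the root of the zip under exactly these names; the page does not state the expected internal layout, so check the Codabench instructions before relying on this. The function name and the default output name `submission.zip` are hypothetical.

```python
import zipfile
from pathlib import Path


def pack_submission(prediction_path: str, model_card_path: str,
                    out_path: str = "submission.zip") -> Path:
    """Write both files into a zip archive, stored at the archive root (assumed)."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(prediction_path, arcname="prediction.jsonl")
        archive.write(model_card_path, arcname="model_card.json")
    return Path(out_path)
```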

Your predictions file must be a JSON-lines file of predicted segment-level labels, using the following format:

{
    "item_id": "setID_###_srcLang_###_tgtLang_###_domain_###_docID_###_segID",
    "task3_pred": {
        "systemName1": booleanLabel1,
        "systemName2": booleanLabel2,
        ...,
        "systemNameN": booleanLabelN
    }
}

Please note that the labels must be JSON booleans (true for error-free, false for containing errors), not strings such as "True" or "False". For any language pair you attempt, you must provide labels for all hypotheses and all segments in the official set (you may opt out of the challenge set).
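The boolean requirement is easy to get wrong when labels pass through string formatting. The sketch below serializes one prediction record and rejects any non-boolean label before writing. The helper name `make_prediction_line` is hypothetical; the `item_id` and `task3_pred` keys follow the format shown above.

```python
import json


def make_prediction_line(item_id: str, labels: dict) -> str:
    """Serialize one record as a single JSON line, enforcing boolean labels."""
    for system_name, label in labels.items():
        if not isinstance(label, bool):
            raise TypeError(
                f"Label for {system_name!r} must be a bool, got {type(label).__name__}"
            )
    # json.dumps renders Python True/False as JSON true/false
    return json.dumps({"item_id": item_id, "task3_pred": labels}, ensure_ascii=False)
```

Because the check uses the built-in `bool` type, values such as `"True"`, `1`, or NumPy booleans are rejected and must be converted explicitly with `bool(...)` first.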

Your model card should also be a JSON file, filled from the model card template.

Although references are provided in the unified test set and the model card includes questions about reference usage, those are for other tasks. Task 3 does not permit use of references in the systems.

Auto Opt-In for Task 1 and 2 Submissions

To broaden participation and facilitate a more comprehensive analysis of current QE capabilities, we are introducing an automatic opt-in for participants of the other tasks. Unless you choose to opt out, your submissions to Tasks 1 and 2 will be evaluated for Task 3. Please refer to detailed descriptions of those tasks for how automatic opt-in works for your submission.

Participants are still highly encouraged to submit standalone entries specifically optimized for Task 3 in addition to these automatically generated entries. This allows you to test specialized filtering or classification logic that may differ from your primary QE models.

TRAINING DATA

Since this is a new task, we will not be providing any task-specific training/development set. There is no limit to what data and resources can be used for submission, but participants are encouraged to check out the linked datasets in Task 2 and consider using them to develop their system.

EVALUATION METRICS

Because the distribution of error-free vs. erroneous segments is likely skewed, we will use Matthews Correlation Coefficient (MCC) as the primary metric. We also plan to calculate precision and recall as secondary metrics for analysis.
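For local sanity checks, MCC, precision, and recall can be computed from the confusion counts as sketched below. This is not the official scorer. It assumes "error-free" (true) is the positive class, and it returns 0.0 when a denominator is zero, which is a common convention but an assumption here.

```python
import math
from typing import Sequence, Tuple


def confusion(gold: Sequence[bool], pred: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating True (error-free) as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)
    return tp, fp, fn, tn


def mcc(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """Matthews Correlation Coefficient; 0.0 if the denominator is zero (assumed)."""
    tp, fp, fn, tn = confusion(gold, pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def precision_recall(gold: Sequence[bool], pred: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall for the positive class; 0.0 where undefined (assumed)."""
    tp, fp, fn, _ = confusion(gold, pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Under this convention, a system that predicts the same label for every segment gets an MCC of 0.0 regardless of how skewed the label distribution is, which is why MCC is more informative than accuracy on imbalanced data.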

BASELINES

COMET-QE: A thresholded version of the reference-free COMET quality metric.
Bicleaner: Using scores from the Bicleaner toolkit.
Always-Negative: A "no-op" baseline that labels every segment as not error-free.
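The thresholded and always-negative baselines can be sketched as below. The threshold value is left as a parameter because the cut-off used by the organizers' baselines is not stated here, and the per-system score dictionary is a hypothetical input that would come from whichever scoring tool is used.

```python
def threshold_labels(scores: dict, threshold: float) -> dict:
    """Label a hypothesis error-free when its quality score meets the threshold.

    `scores` maps system name to a score (hypothetical input from a QE tool).
    """
    return {system: bool(score >= threshold) for system, score in scores.items()}


def always_negative(hyps: dict) -> dict:
    """No-op baseline: label every hypothesis as not error-free."""
    return {system: False for system in hyps}
```

Both functions return a mapping from system name to boolean label, matching the shape of the `task3_pred` field in the submission format.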
