ANNOUNCEMENTS
- 2026-04-29: Task details published
OVERVIEW
As part of this year’s evaluation campaign, we introduce a new task: Detection of Error-Free Segments. While traditional Quality Estimation (QE) focuses on continuous scores or fine-grained error spans, this task simplifies the objective to a binary classification: Is this translation error-free? The goal is to identify segments that can be published or consumed without further human intervention. This task is highly relevant for industrial MT pipelines, data filtering for LLM training, and automated post-editing workflows. We encourage participants to explore a variety of methods, from traditional data-filtering heuristics to calibrated QE model outputs and LLM-as-a-judge prompting.
TASK DESCRIPTION
Participants must predict a binary label (1 for Error-Free, 0 for Contains Errors) for a given set of source-target pairs. To stay consistent with the real-world use cases this task is trying to emulate, we do not allow use of reference translations for this task.
A segment is considered "Error-Free" if it requires no corrections to be fluently understood and factually consistent with the source, and does not contain linguistic issues including but not limited to style, register, and locale errors. For this task, the ground truth is derived from human annotations conducted for the General MT shared task (ESA-like), where a segment is labeled as error-free if:
- It meets a score threshold, AND
- It contains no annotated error spans.
Following the human annotation guidelines provided to us, we set the score threshold to ≥85. Participants are advised that this threshold is sensitive to the annotation and score-calibration scheme, and should not be used to convert scores from other sources to binary labels without verification.
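The labeling rule above can be sketched as follows. This is a minimal illustration, not the organizers' actual pipeline: the `Annotation` container, its field names, and the 0-100 score range are assumptions made for the example; only the ≥85 threshold and the "no annotated error spans" condition come from the task description.

```python
from dataclasses import dataclass, field

# Threshold stated in the task description (score >= 85).
SCORE_THRESHOLD = 85.0


@dataclass
class Annotation:
    """Hypothetical container for one human-annotated segment."""
    score: float  # ESA-like segment score (assumed here to lie in 0-100)
    error_spans: list = field(default_factory=list)  # annotated error spans


def is_error_free(annotation: Annotation, threshold: float = SCORE_THRESHOLD) -> int:
    """Return 1 if the score meets the threshold AND no error span is annotated."""
    meets_threshold = annotation.score >= threshold
    has_no_spans = len(annotation.error_spans) == 0
    return int(meets_threshold and has_no_spans)
```

Under this sketch, a segment scored 90 with one annotated span is labeled 0, as is a span-free segment scored 84; both conditions must hold for the label to be 1.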
SUBMISSIONS
We encourage a diverse range of approaches, including but not limited to:
- Thresholding QE Scores: If you are participating in the main Tasks 1 and 2, we will by default convert your submission into an entry for Task 3 (with the option to opt out). More details below.
- Heuristics and Simple Classifiers: Utilizing source-target length ratios, Language Identification (LangID) checks, HTML/JSONL integrity checks, and/or surface-level fluency features.
- Data Filtering Tools: Adapting established tools like Bicleaner or Zipporah.
- LLM-as-a-Judge: Using Large Language Models to provide zero-shot or few-shot binary quality assessments.
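As a rough illustration of the heuristic approach listed above, the sketch below flags a segment as containing errors when the target is empty or the character-length ratio falls outside a fixed band. The function name and the ratio bounds (0.5 and 2.0) are illustrative assumptions, not values prescribed by the task, and suitable bounds are likely to vary by language pair.

```python
def heuristic_label(source: str, target: str,
                    min_ratio: float = 0.5, max_ratio: float = 2.0) -> int:
    """Return 0 if a cheap surface check fails, otherwise 1.

    The ratio bounds are illustrative assumptions, not task-defined values.
    """
    src = source.strip()
    tgt = target.strip()

    # An empty source or target cannot be a usable translation pair.
    if not src or not tgt:
        return 0

    # Character-length ratio of target to source.
    ratio = len(tgt) / len(src)
    return int(min_ratio <= ratio <= max_ratio)
```

A check like this can only rule segments out: passing it says nothing about adequacy or fluency, so in practice it would be combined with stronger signals such as LangID or a QE score.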
As with other WMT shared tasks this year, we will use a JSON data schema for submissions. More information will be posted here closer to the test set release date.
Auto Opt-In for Task 1 and 2 Submissions
To broaden participation and facilitate a more comprehensive analysis of current QE capabilities, we are introducing an automatic opt-in for participants of the other tasks. Unless you choose to opt out, your submissions to Tasks 1 and 2 will also be evaluated for Task 3. Please refer to the detailed descriptions of those tasks for how the automatic opt-in works for your submission.
Participants are still highly encouraged to submit standalone entries specifically optimized for Task 3 in addition to these automatically generated entries. This allows you to test specialized filtering or classification logic that may differ from your primary QE models.
DATA
Since this is a new task, we will not provide a task-specific training or development set. There is no restriction on the data and resources that can be used for submissions, but participants are encouraged to look at the datasets linked in Task 2 and consider using them to develop their systems.
EVALUATION METRICS
Because the distribution of error-free vs. erroneous segments is likely skewed, we will use Matthews Correlation Coefficient (MCC) as the primary metric. We also plan to calculate precision and recall as secondary metrics for analysis.
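For reference, the metrics named above can be computed from the confusion counts as in the sketch below. It uses the standard definitions of MCC, precision, and recall with label 1 (Error-Free) as the positive class; the function names are illustrative, and returning 0.0 when a denominator is zero is a common convention assumed here, not a confirmed detail of the official scoring script.

```python
import math


def confusion_counts(gold, pred):
    """Count TP, TN, FP, FN with label 1 (Error-Free) as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(gold, pred):
    """Matthews Correlation Coefficient for binary labels."""
    tp, tn, fp, fn = confusion_counts(gold, pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Undefined case (e.g. all predictions identical); 0.0 is an assumed convention.
        return 0.0
    return (tp * tn - fp * fn) / denom


def precision_recall(gold, pred):
    """Precision and recall for the Error-Free class."""
    tp, _, fp, fn = confusion_counts(gold, pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

With this convention, a system that predicts 0 for every segment scores an MCC of 0.0 however skewed the label distribution is, which is why MCC is less forgiving of degenerate predictions than accuracy.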
BASELINES
- COMET-QE: A thresholded version of the reference-free COMET quality metric.
- Bicleaner: Using scores from the Bicleaner toolkit.
- Always-Negative: A "no-op" baseline that labels every segment as not error-free.