ANNOUNCEMENTS
- 2025-07-24: Test set released and Codabench submission site opened!
- 2025-07-23: Language pairs and some file-format details changed
- 2025-07-16: Limits on primary versus secondary submissions clarified
- 2025-07-15: Details of file format, submission procedure, and evaluation methodology announced
- 2025-05-17: Detailed description of the task announced
TASK DESCRIPTION
The goal of this task is to predict a quality score for each source–target segment pair in the evaluation set. Depending on the language pair, participants will be asked to predict the numerical score derived from the segment’s ESA or MQM annotation as provided by a human annotator. Submissions will be evaluated and ranked based on their predictions’ correlations with these human-annotated scores at both the segment and corpus levels.
We welcome submissions covering any or all of the language pairs used in the WMT 2025 General MT task, as per the following:
- The ESA score is a direct assessment primed by the act of annotating precise error spans within the segment. We solicit systems that predict segment-level ESA scores for test sets that we will publish in Czech–German, Czech–Ukrainian, English–Arabic, English–Bhojpuri, English–Chinese, English–Czech, English–Estonian, English–Icelandic, English–Italian, English–Japanese, English–Maasai, English–Russian, English–Serbian, and English–Ukrainian.
- The MQM score is a direct assessment mathematically derived from the count and severity of precise error spans annotated within the segment. We solicit systems that predict segment-level MQM scores for test sets that we will publish in English–Korean and Japanese–Chinese.
The test sets for all language pairs except for English–Italian and English–Maasai will be provided with a reference translation, which therefore can be used by submitted systems as an additional input.
Participants will also run their automatic score prediction systems on collected “challenge sets” that illustrate particular linguistic phenomena, domains, or even non-WMT language pairs of interest to the developers of the sets. The predicted scores will be returned to the developers of each set for further analysis. See the detailed page on the challenge set subtask for further details.
Note that, for this task, the required output is only the segment-level numerical scores. Submitted systems are free, however, to derive their scores in any manner that appears useful. In particular, systems may first predict precise error spans (as in Task 2) and/or post-edits (as in Task 3) and use them as the basis for computing the numerical score. Systems that perform multiple subtasks jointly may be entered in each corresponding track.
TRAINING AND DEVELOPMENT DATA
You are welcome to build your system from any desired training data, foundation model, etc. We therefore do not release any specific training data. However, labeled data from previous editions of the Metrics and/or QE shared tasks is available to help you train and tune your system if you would like. Those resources are summarized and linked below. Note, however, that the set of language pairs covered in prior years is not an exact match for this year’s: some of this year’s pairs have prior training data, while others must be handled zero-shot.
- DA and MQM annotations from prior QE tasks
- MQM annotations from prior Metrics tasks
- DA annotations from prior Metrics tasks
- Relative rank annotations from prior Metrics tasks
For help in tuning, model selection, etc., we recommend using the human-annotated General MT 2024 test set as a development set. We will release an updated version of this set soon.
TEST DATA
The official test set will consist of a collection of documents, each divided into segments, to allow systems to evaluate quality in context. The segments, however, will consist of longer multi-sentence units of text. Contents of the challenge sets will vary. See below for details on the file format.
Evaluation sets are now available on GitHub. The single file containing both the official and challenge sets is mteval-task1-test25.tsv.gz.
Test Set File Format
We will release the test and challenge sets as a plain-text UTF-8–encoded TSV file, containing one segment per line, using the schema below. Each field is separated by a single tab character. A header row will be included.
#  | Field Name         | Explanation
1  | doc_id             | String identifier of the document to which the segment belongs
2  | segment_id         | Numeric index of the segment’s position among all the segments for a given language pair and test set
3  | source_lang        | String code identifying the segment’s source language
4  | target_lang        | String code identifying the segment’s target language
5  | set_id             | String identifier of the portion of the test set to which the segment belongs
6  | system_id          | String identifier of the MT system that translated the segment
7  | source_segment     | String contents of the segment’s source side
8  | hypothesis_segment | String contents of the segment’s machine translation
9  | reference_segment  | String contents of the segment’s gold-standard translation
10 | domain_name        | String identifier of the domain of the test set to which the segment belongs
11 | method             | String indication of whether we expect the segment to be quality-scored according to ESA or MQM criteria
Note that some segments (source, hypothesis, and/or reference) will contain internal newline characters. These characters will be presented as escaped tokens, consisting of a literal backslash \ followed by a literal n, with a space on each side. You are free to process the components of such multi-line segments in any way you like, but your segment-level score should reflect the contents of the entire segment, including the presence and position of any internal newlines.
The reference_segment field will contain the string "NaN" when a reference is unavailable; it will contain an empty string when the reference is intentionally blank.
As a plain-text TSV with no quoting, the file containing the test set should be immediately interpretable by command-line tools such as cut and paste. If you wish to process the file with a more advanced toolkit, such as Pandas or the Python csv module, please ensure that you set the reading and writing properties appropriately. We provide sample code for this.
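For illustration, here is a minimal sketch of one way to load the test set with pandas; this is our own example, not the official sample code. The key points are to disable quoting (the file is an unquoted TSV) and to keep the literal string "NaN" in reference_segment from being parsed as a missing value.

import csv
import pandas as pd

# Read the unquoted, tab-separated test set. The .gz compression is inferred
# from the file name. keep_default_na=False preserves the literal string "NaN"
# in reference_segment as text, and quoting=csv.QUOTE_NONE prevents quote
# characters inside segments from being treated as field delimiters.
test = pd.read_csv(
    "mteval-task1-test25.tsv.gz",
    sep="\t",
    quoting=csv.QUOTE_NONE,
    keep_default_na=False,
    encoding="utf-8",
)

# Internal newlines appear as the escaped token " \n " (a backslash and an "n",
# with a space on each side); split on it if you need the individual lines.
first_hypothesis_lines = test.loc[0, "hypothesis_segment"].split(" \\n ")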
Submission File Formats
Your submission must include a file of segment-level scores, named segments.tsv, in UTF-8 plain-text format, matching the schema below. The new column overall should contain your system’s numerical quality score for each segment. Note that the segment contents (source, hypothesis, and reference) should not be included; all other columns should be copied directly from the test-set file. A sketch of assembling this file appears after the table below.
# | Field Name  | Explanation
1 | doc_id      | String identifier of the document to which the segment belongs
2 | segment_id  | Numeric index of the segment’s position among all the segments for a given language pair and test set
3 | source_lang | String code identifying the segment’s source language
4 | target_lang | String code identifying the segment’s target language
5 | set_id      | String identifier of the portion of the test set to which the segment belongs
6 | system_id   | String identifier of the MT system that translated the segment
7 | domain_name | String identifier of the domain of the test set to which the segment belongs
8 | method      | String indication of whether we expect the segment to be quality-scored according to ESA or MQM criteria
9 | overall     | Numeric automatic assessment of the segment’s quality for Task 1 submissions
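Continuing the reading example above, a minimal sketch of assembling segments.tsv; the function predict_score is a hypothetical placeholder for your own scoring model.

import csv

# Columns copied directly from the test set, in the order required by the schema.
META_COLS = ["doc_id", "segment_id", "source_lang", "target_lang",
             "set_id", "system_id", "domain_name", "method"]

segments = test[META_COLS].copy()

# predict_score() is a hypothetical stand-in for your own system.
segments["overall"] = [
    predict_score(src, hyp, ref)
    for src, hyp, ref in zip(test["source_segment"],
                             test["hypothesis_segment"],
                             test["reference_segment"])
]

# Write an unquoted, tab-separated file with a header row, as in the test set.
segments.to_csv("segments.tsv", sep="\t", index=False,
                quoting=csv.QUOTE_NONE, encoding="utf-8")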
Your submission may include a file of corpus-level scores, named systems.tsv. If your submission does not include this file, we will automatically compute corpus-level scores on your behalf as the arithmetic average of your segment-level scores for each unique MT system name and language pair in the test set. The corpus-level file, if present, must be in UTF-8 plain-text format, generally matching the schema of the segment-level file. The doc_id, segment_id, and domain_name columns in the corpus-level file must contain the string value all instead of the data from any individual segment. A sketch of the default averaging appears after the table below.
# | Field Name  | Explanation
1 | doc_id      | Must contain the literal string all
2 | segment_id  | Must contain the literal string all
3 | source_lang | String code identifying the source language being scored
4 | target_lang | String code identifying the target language being scored
5 | set_id      | String identifier of the portion of the test set being scored
6 | system_id   | String identifier of the MT system being scored
7 | domain_name | Must contain the literal string all
8 | method      | String indication of whether we expect the MT system to be quality-scored according to ESA or MQM criteria
9 | overall     | Numeric automatic assessment of the MT system’s quality for Task 1 submissions
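If you do provide systems.tsv yourself, the default behavior described above (simple arithmetic averaging of segment scores) can be reproduced along the following lines; this is our own sketch, not the organizers’ scoring code, and the exact grouping keys are our assumption based on the schema above.

import csv

# Average segment-level scores per MT system and language pair. Grouping also
# by set_id and method is an assumption made so those columns can be carried
# through to the output schema.
group_cols = ["source_lang", "target_lang", "set_id", "system_id", "method"]
systems = segments.groupby(group_cols, as_index=False)["overall"].mean()

# These columns must hold the literal string "all" at the corpus level.
systems["doc_id"] = "all"
systems["segment_id"] = "all"
systems["domain_name"] = "all"

# Reorder to match the corpus-level schema and write the file.
systems = systems[["doc_id", "segment_id", "source_lang", "target_lang",
                   "set_id", "system_id", "domain_name", "method", "overall"]]
systems.to_csv("systems.tsv", sep="\t", index=False,
               quoting=csv.QUOTE_NONE, encoding="utf-8")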
SUBMISSION PROCEDURE
You may submit up to three system outputs to the shared task: one primary submission and up to two secondary submissions. Primary systems will feature in our official results, while secondary systems will be included in supplemental analysis.
Step 1: Register your participation by creating an account on Codabench and joining the WMT 2025 segment-level score prediction competition.
Step 2: At the start of the test week (24 July), download the evaluation set file containing source–target segment pairs for scoring.
Step 3: Use your automatic system to score each of the segments (and, optionally, each of the systems) for each language pair in which you wish to participate. Please ensure that your output files conform to the expected names and formats as defined above. If you choose not to participate in a given language pair or on a given challenge set, remove its content entirely from your output files.
Step 4: Submit your system to the shared task by uploading scored test sets to our Codabench competition by the end of the test week (31 July). Upload a single zip file (of any name) containing your output TSVs, which must be at the root directory of the zip. To accommodate failed, erroneous, or mistaken uploads, you may submit up to five times per day, but no more than 10 times over the course of the test week.
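For example, a minimal sketch of packaging the output TSVs at the root of a zip archive (the archive name submission.zip is arbitrary, as noted above):

import zipfile

# Write the TSVs at the root of the archive (no enclosing directory).
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("segments.tsv", arcname="segments.tsv")
    zf.write("systems.tsv", arcname="systems.tsv")  # optional corpus-level file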
Step 5: Ensure that your zip file was properly processed in Codabench. Its status should show as "Finished" within a minute, and numerical scores should appear for it on the leaderboard for all language pairs in which you intended to participate. (Note that leaderboard scores are for verification purposes only and are computed against pseudo-gold-standard annotations. However, if the reported correlations are exceptionally weak, that may indicate an error in running your system or assembling your TSV file.) If there is a clear problem processing your submission, its status should show as "Failed"; a processing log should be available for more detailed inspection.
Step 6: Also by the end of the test week (31 July), provide details about your system by filling in the participant form. In particular, be sure to give the Codabench IDs of your designated primary submission and any secondary submissions. Completing this form is essential for proper participation and an accurate analysis of the shared task results.
Contact wmt-qe-metrics-organizers@googlegroups.com in the event of any questions or difficulties with the submission procedure.
EVALUATION
We will evaluate the quality of automatic score prediction on the official test set at both the segment level and the corpus (system) level. Evaluation will be carried out according to a combination of standard and more recently proposed meta-evaluation methodologies to assess correlation with human judgements.
At the segment level, we will use tie-calibrated pairwise accuracy (acc*eq) as our primary meta-evaluation metric; we will additionally compute Pearson’s r and Kendall’s τ as secondary metrics. These correlations with human judgements will be computed per source segment and then averaged over the entire test set.
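As a rough sketch of the primary segment-level criterion, under our reading of tie-calibrated pairwise accuracy (not the official definition): for all pairs of translations (i, j) of the same source, with human scores h and metric scores m, a tie threshold \(\varepsilon\) is applied to the metric and chosen to maximize agreement:

\[
\mathrm{acc}_{eq}(\varepsilon) = \frac{\bigl|\{(i,j) : \operatorname{sign}(h_i - h_j) = \operatorname{sign}_{\varepsilon}(m_i - m_j)\}\bigr|}{\bigl|\{(i,j)\}\bigr|},
\qquad
\mathrm{acc}^{*}_{eq} = \max_{\varepsilon} \, \mathrm{acc}_{eq}(\varepsilon),
\]

where \(\operatorname{sign}_{\varepsilon}\) treats metric differences with \(|m_i - m_j| \le \varepsilon\) as ties, so a pair counts as correct when the metric and the human agree on "better", "worse", or "tied".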
At the corpus level, we will use soft pairwise accuracy (SPA) as our primary meta-evaluation metric; we will again compute Pearson’s r and Kendall’s τ as secondary metrics. Because of the segment-level resampling required as part of the SPA computation, submissions that derive their corpus-level scores from a method other than an arithmetic averaging of the segment-level scores should be sure to indicate this fact on the participant form.
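Likewise, a rough sketch of soft pairwise accuracy under our reading (again, not the official definition): for each pair of MT systems A and B, let \(p^{h}_{AB}\) and \(p^{m}_{AB}\) be the probabilities, estimated by resampling segment-level scores, that A outperforms B according to the human and the automatic scores, respectively; SPA then rewards metrics whose confidence matches the human confidence:

\[
\mathrm{SPA} = \frac{1}{\binom{N}{2}} \sum_{A < B} \Bigl(1 - \bigl|\, p^{h}_{AB} - p^{m}_{AB} \,\bigr|\Bigr),
\]

where N is the number of MT systems being compared.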
Scored challenge sets will be returned to the individual challenge set developers for evaluation and analysis. Performance on the challenge sets will not be tracked on the Codabench leaderboard or counted as part of the official results.
Note that the official results will be based on correlation with human judgements, which will not be complete until around September. At the time of test set submission (24–31 July), we will provide on the Codabench leaderboard an expression of each system’s correlation with automatically derived pseudo-gold-standard judgements, using simple metrics. This will allow participants to confirm that their files were uploaded in a valid format and that the performance of their system was broadly reasonable. Displayed scores, however, are for these basic verification purposes only and do not reflect official results.
System description papers must be submitted for initial review by 14 August and in camera-ready format by 25 September. Participants are therefore advised to begin preparation of their papers prior to the test week, using analysis based on a chosen development set rather than the official shared task results.
BASELINES
We will include in Codabench and in the official results a number of baseline systems. The exact selection will be announced later.