Task description
While other participants are worried about building stronger and better metrics, participants of this subtask have to build challenge sets that identify where metrics fail! Inspired by the challenge sets of previous years (Link_to_metrics, Link_to_QE), the goal of this subtask is for participants to create test sets with challenging evaluation examples that current metrics do not evaluate well. The challenge sets can relate to any (or all) of the three main subtasks, i.e., explore challenges in sentence-level quality estimation, fine-grained error detection, or error correction (see below).
This subtask is organized into 3 rounds:
1. Breaking Round: Challenge set participants (Breakers) create challenging examples for metrics. They must send the resulting "challenge sets" to the organizers by July 5th.
2. Scoring Round: The challenge sets created by Breakers will be included in the test sets of the three main subtasks and thus sent to all participants (Builders) to score.
3. Analysis Round: Breakers will receive their data with all the metric scores for analysis.
Challenge set types:
This year, we are inviting submissions of challenge sets for all three official subtasks:
- Task 1: Segment-level quality score prediction
- Task 2: Word-level error detection and span annotation
- Task 3: Quality-informed segment-level error correction
Examples:
Example challenge sets for Task 1 from previous years can be found below.
- ACES: Translation Accuracy Challenge Sets for Evaluating Machine Translation Metrics
- Evaluating WMT 2024 Metrics Shared Task Submissions on AfriMTE
- MSLC24: Further Challenges for Metrics on a Wide Landscape of Translation Quality
- Fine-Tuned Machine Translation Metrics Struggle in Unseen Domains
- Preference-based challenge sets: last year's QE challenge sets, Section 7
Submission format
Breakers are responsible for generating the data for their challenge sets: providing their own source segments, one or more target hypotheses, optionally one or more sets of references, and segment-level metadata.
The updated challenge set format is described below.
- One plain text file for all the source segments (one segment per line)
- One plain text file for the translation output segments from each hypothesis. (If the challenge set creator has only one hypothesis, they will submit one file; if they have two candidate hypotheses, they will submit two separate files.)
- [Optional] One plain text file for each set of reference segments (one segment per line)
- One TSV file with the meta-information for each segment (one segment per line). Each line should contain two fields (see the example after this list):
  - Segment domain (domain-category) [it can be identical for all segments if there is no domain distinction]
  - Document id (doc_id) [if each segment is independent, please use a different document id for each line]
- One README file with team name, affiliation and an email contact that should match the registration.
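For concreteness, a few lines of the meta TSV could look like the following (the domain labels and document ids are invented for illustration; the two fields are tab-separated, and consecutive segments from the same document share a doc_id):

general	doc-1
general	doc-1
legal	doc-2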
Please name the files:
- {challenge_set_name}.{langpair}.src.txt
- {challenge_set_name}.{langpair}.hyp-1.txt
- {challenge_set_name}.{langpair}.meta.txt
- [optional] {challenge_set_name}.{langpair}.ref.txt
- {challenge_set_name}.readme.txt
respectively.
'langpair' should follow the pattern of 'en-cs' (using the official ISO 639-1 two-letter codes). It is also possible to submit a (compressed) archive. You can find here an example comprising part of the WMT QE 2024 challenge sets.
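As an illustration, here is a minimal Python sketch of how a Breaker might write out a challenge set in the layout and naming convention above. It is not an official tool: the challenge set name, language pair, segments, domains, and document ids are placeholders.

# Minimal sketch (not an official tool): write a challenge set in the expected layout.
# All names and segments below are placeholders, not data from the shared task.
from pathlib import Path

name, langpair = "my_challenge_set", "en-cs"  # placeholder values
out = Path(".")

sources = ["The cat sat on the mat.", "He signed the contract yesterday."]
hypotheses = {  # one file per hypothesis; keys become the hyp index
    1: ["Kočka seděla na rohožce.", "Včera podepsal smlouvu."],
}
references = ["Kočka seděla na rohožce.", "Smlouvu podepsal včera."]  # optional
meta = [("general", "doc-1"), ("legal", "doc-2")]  # (domain-category, doc_id) per segment

def write_lines(path, lines):
    # One segment (or TSV record) per line, as required by the submission format.
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")

write_lines(out / f"{name}.{langpair}.src.txt", sources)
for i, hyp in hypotheses.items():
    write_lines(out / f"{name}.{langpair}.hyp-{i}.txt", hyp)
write_lines(out / f"{name}.{langpair}.ref.txt", references)  # optional
write_lines(out / f"{name}.{langpair}.meta.txt",
            ["\t".join(fields) for fields in meta])
write_lines(out / f"{name}.readme.txt",
            ["Team: Example Team", "Affiliation: Example University",
             "Contact: contact@example.org"])

Running the sketch produces one file per required component, each with one segment per line and with the src, hyp, meta, and readme files sharing the same challenge set name.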
Challenge set developers are encouraged to organize their challenge set(s) in a way that best suits their analysis, as the shared task organisers will not shuffle or change the order of the segments provided. The provided segments will be incorporated into the main test data of the respective subtask and provided to participants for scoring.
Note that, depending on the subtask(s) you intend to apply your challenge set to, we expect you to select the respective options in the registration form, and we will incorporate your data into all selected subtasks.
Challenge set registration
Please register your intended submission with contacts and a short description by July 5th here.