CODABENCH LINK
Coming Soon!
DESCRIPTION
The goal of this subtask is to predict translation error spans along with their severity labels. For this subtask we use the error spans obtained from the MQM and ESA human annotations produced for the General MT primary task as the target “gold standard”. Participants will be asked to predict, for each segment, both the error spans (start and end indices) and the error severities (major or minor). Minor issues are those that do not impact meaning or usability, whereas major issues are those that impact meaning or usability but do not render the text unusable.
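For illustration, a prediction for one segment might look like the sketch below. The official submission format has not been announced yet (see "Submission format" below), so the field names and the character-level, end-exclusive indexing are our assumptions, not a specification.

    # Hypothetical shape of one segment's prediction. All field names and
    # the character-level, end-exclusive indexing are illustrative guesses;
    # the official submission format will be announced separately.
    prediction = {
        "segment_id": 17,
        "target": "Das ist ein Beispielsatz mit ein Fehler.",
        "errors": [
            # "ein" should be "einem": a minor agreement error
            {"start": 29, "end": 32, "severity": "minor"},
        ],
    }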
Languages covered
The list below provides the language pairs covered this year (which fully parallel the general machine translation task) and the quality annotations that will be used as targets for each language pair:
- Czech to Ukrainian (ESA)
- Czech to German (ESA)
- Japanese to Chinese (MQM)
- English to Arabic (ESA)
- English to Chinese (ESA)
- English to Czech (ESA)
- English to Estonian (ESA)
- English to Icelandic (ESA)
- English to Japanese (ESA)
- English to Korean (MQM)
- English to Russian (ESA)
- English to Serbian (ESA)
- English to Ukrainian (ESA)
- English to Bhojpuri (ESA)
- English to Maasai (ESA)
Language pairs highlighted in green have development or training sets available from previous versions of the shared task (see below). All language pairs except for Czech–German and English–Czech will also be provided with a reference translation.
TRAINING AND DEVELOPMENT SETS
For training and development, participants can use the MQM and ESA annotations released in the previous editions of the WMT shared tasks. Some repositories that contain these datasets are listed below:
- ESA: wmt24-humeval
We recommend using WMT24 as the development set.
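If you train your own span predictor on these annotations, a common preprocessing step is to project the gold spans onto per-character (or per-token) tags. Below is a minimal sketch, assuming the spans come as (start, end, severity) triples with character-level, end-exclusive indices into the translation; that reading follows the task description above but is not a confirmed specification.

    # Project gold error spans onto per-character severity tags, e.g. for
    # training a sequence labeller. Assumes character-level, end-exclusive
    # (start, end, severity) triples; adapt to the actual data format.
    def spans_to_char_tags(translation, spans):
        tags = ["ok"] * len(translation)
        for start, end, severity in spans:
            for i in range(start, end):
                tags[i] = severity  # "minor" or "major"
        return tags

    print(spans_to_char_tags("mit ein Fehler", [(4, 7, "minor")]))
    # characters 4-6 ("ein") come out tagged "minor", the rest "ok"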
BASELINES
We will provide xCOMET and LLM-as-a-judge baselines for this subtask.
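In the meantime, xCOMET can already be run through the open-source unbabel-comet package, which returns detected error spans (with severities) alongside segment scores. The snippet below follows the public Unbabel/XCOMET-XL model card and is only a sketch of how such a baseline might be run, not the organizers' baseline script; note that xCOMET also works without a reference, which matters for the pairs shipped without one.

    # Sketch of running xCOMET via the unbabel-comet package
    # (pip install unbabel-comet). Calls follow the public
    # Unbabel/XCOMET-XL model card; this is not the official baseline.
    from comet import download_model, load_from_checkpoint

    model = load_from_checkpoint(download_model("Unbabel/XCOMET-XL"))
    data = [{
        "src": "Boa noite, como posso ajudar?",
        "mt": "Good night, how can I help?",  # omit "ref" for QE mode
    }]
    output = model.predict(data, batch_size=1, gpus=0)
    print(output.scores)                # segment-level quality scores
    print(output.metadata.error_spans)  # detected spans with severities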
EVALUATION
The primary evaluation metric will be a severity-weighted F1 score over predicted error spans. More details coming soon!
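Until the official definition is published, the sketch below is only a guess at the metric's general shape: a span-level F1 in which a predicted span counts as correct when it matches a gold span and its severity exactly, and severities carry different weights. The exact matching rule and the weights (minor = 1, major = 5) are illustrative assumptions.

    # Guess at the general shape of a severity-weighted span F1.
    # The exact-match criterion and the weights are assumptions,
    # not the official scoring script.
    WEIGHTS = {"minor": 1.0, "major": 5.0}

    def weighted_f1(gold, pred):
        """gold, pred: lists of (start, end, severity) tuples for one segment."""
        gold_set, pred_set = set(gold), set(pred)
        tp = sum(WEIGHTS[s] for _, _, s in gold_set & pred_set)
        fp = sum(WEIGHTS[s] for _, _, s in pred_set - gold_set)
        fn = sum(WEIGHTS[s] for _, _, s in gold_set - pred_set)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)

    print(weighted_f1(gold=[(29, 32, "minor"), (0, 3, "major")],
                      pred=[(29, 32, "minor")]))
    # precision 1.0, recall 1/6, F1 ~0.29: the missed major error dominates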
SUBMISSION FORMAT
Will be added soon!