Subtask: Machine Translation Test Suites

This year’s shared task will also include the “Test suites” sub-task, which has been part of WMT since 2018.

What are test suites?

Test suites are custom extensions of the standard General Translation Task test sets, constructed to focus on particular aspects of MT output. Each test suite also evaluates these aspects in its own custom way.
The composition of a test suite and its evaluation are entirely up to the test suite provider, i.e. you. Test suites can therefore be seen as an “unshared task”, where every participant does something different but related.
For our purposes, every test suite includes:

  1. a source-side test set (i.e. a set of sentences in one language), and

  2. an evaluation service (not necessarily automatic, and not necessarily relying on any reference translations) for a particular target language.

As opposed to the standard evaluation process, which employs randomly selected test sets, test suites focus on particular translation phenomena or cases that pose specific challenges to MT. And instead of producing a single generic quality score, test suites often report separate fine-grained results for each case, as in the sketch below.
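
To make this concrete, a fine-grained evaluation might look like the following minimal Python sketch. The pass/fail criterion (a required substring) and the toy cases are purely illustrative, since each test suite defines its own evaluation:

    # Minimal sketch of a test-suite evaluation: instead of one generic
    # quality score, report a separate result per phenomenon. The pass/fail
    # criterion (a required substring) is purely illustrative.
    from collections import defaultdict

    def evaluate(cases, outputs):
        """cases: list of (phenomenon, required substring); outputs: MT output lines."""
        hits, totals = defaultdict(int), defaultdict(int)
        for (phenomenon, required), hypothesis in zip(cases, outputs):
            totals[phenomenon] += 1
            hits[phenomenon] += required.lower() in hypothesis.lower()
        return {p: hits[p] / totals[p] for p in totals}

    # Toy cases: the MT output must preserve a negation or an idiom keyword.
    cases = [("negation", "not"), ("negation", "never"), ("idiom", "rains")]
    outputs = ["It is not cold.", "He always agrees.", "It never rains here."]
    print(evaluate(cases, outputs))  # {'negation': 0.5, 'idiom': 1.0}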

Why are test suites important?

In recent years, with the steady improvement of MT and more recently the emergence of LLMs, translation quality has increased substantially. Evaluation based on generic test sets often yields very good results and sometimes fails to discriminate machine output from human translations. Nevertheless, there are still cases where MT struggles: an output may be fluent, and surrounded by other seemingly perfect translations, yet seriously misleading. In general evaluation methods, such flaws can get “hidden in the average” or simply get missed altogether. Test suites are a flexible means of focusing on any specific problem, and as such they can be very effective at revealing these cases.

How does the test suite track work?

The test suite track is a sub-task of the General Translation shared task at WMT. The test suite track has the following steps:

  1. The test suite participants (i.e. you) submit the source side (and optional references) of their test suites, along with a README (participant & submission information), to the organizers of the shared task (19th June)

  2. The shared task organizers (we) merge the test suites into the other test sets of the General Translation shared task and provide the combined source side to the developers of the MT systems (13th July)

  3. The developers of MT systems (i.e. participants of the General Translation shared task, “they”) download the source side of the test sets, produce machine translations with their systems, and upload them to the shared task organizers (20th July)

  4. The shared task organizers (we) separate the test suite translations from the generic test set MT outputs and provide them back to the respective test suite participants (you). (2nd August; postponed from 27th July)

  5. You (the test suite participant) receive the machine translations of your test suite from all participating systems. You analyze the outputs and produce numerical results (a loading sketch follows this list), then write the corresponding paper (“testsuite paper” in the SoftConf submission system). You also submit a few sentences with the main takeaways, so that we can reference your paper in the WMT Findings paper. (~September; see the main page)

  6. Your test suite paper gets published among the other shared task papers and is summarized and referenced in the overall findings paper.
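
As a rough sketch of step 5, assuming the translations come back as one plain-text file per system, line-aligned with your submitted source (all file and folder names below are hypothetical):

    # Hypothetical loader for the returned translations: one plain-text file
    # per MT system, each line-aligned with the submitted source file.
    from pathlib import Path

    def load_system_outputs(outputs_dir, expected_lines):
        systems = {}
        for path in sorted(Path(outputs_dir).glob("*.txt")):
            lines = path.read_text(encoding="utf-8").splitlines()
            assert len(lines) == expected_lines, f"{path.name}: line count mismatch"
            systems[path.stem] = lines  # system name -> its translations
        return systems

    source = Path("mysuite.src.txt").read_text(encoding="utf-8").splitlines()
    for name, hyps in load_system_outputs("system_outputs/", len(source)).items():
        print(f"{name}: {len(hyps)} translations received")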

What do test suite participants have to do?

  1. Prepare your test suites and provide us with the source texts (optionally with references)

  2. Receive the machine translations of your source texts and evaluate them in your test suite’s own style, whatever that is

  3. Write and submit a test suite paper and a short summary of the key takeaways

How will the submission take place and what is the format?

You are encouraged to register your interest in participating in the subtask ahead of time by filling in the participation form. The form is optional, but it helps us get an overview of the submissions and plan our resources.

Test suites need to be uploaded to a GitHub repository provided by the organizers. For every test suite, participants have to create a subfolder within the respective language-pair folder. This subfolder should contain:

  • One text file with the source side of the test suite (plain text format with one segment per line).

  • One README.md file with the participant information (name of the test suite, name of the institution, contact e-mail address).

  • Optionally: document IDs for every source sentence, in a separate file with one ID per line, aligned with the source file. Participating MT systems may benefit from this, e.g. by translating the whole sequence of sentences that share a document ID as a single document. If no document IDs are provided, each sentence is treated as independent of the others and will be shipped to the participants without the context of the surrounding sentences.

  • Optionally: reference translations in a separate text file (one segment per line, aligned with the source), which will be provided to the MT systems that participate in the opposite language direction. Please note that all submitted source sentences will be made public for research purposes.
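
Before uploading, it can be worth sanity-checking that the folder matches this layout and that the optional files line up with the source. A minimal sketch follows; the concrete file names are hypothetical, and only the layout described in the bullets above is prescribed:

    # Hypothetical pre-submission check for a test suite subfolder.
    from pathlib import Path

    def check_submission(folder, src_name="mysuite.src.txt"):
        folder = Path(folder)
        assert (folder / "README.md").exists(), "README.md with participant info is required"
        src = (folder / src_name).read_text(encoding="utf-8").splitlines()
        assert all(line.strip() for line in src), "one non-empty segment per line"
        # Optional files must align line-by-line with the source, if present.
        for optional in ("mysuite.docids.txt", "mysuite.ref.txt"):
            path = folder / optional
            if path.exists():
                n = len(path.read_text(encoding="utf-8").splitlines())
                assert n == len(src), f"{optional}: {n} lines vs {len(src)} segments"

    check_submission("en-de/mysuite/")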

Are there any restrictions on test suite participation?

Test suite participants can submit test suites in any of the language directions of the General Translation shared task. They are free to choose which phenomena or challenges they focus on and which metrics they report.

It is a bonus if a test suite shares some or all of its sentences or language phenomena across multiple language pairs.

There is no direct limit per test suite, but the total size of all test sets per language pair (including test suites) must not exceed 100,000 sentences. If this limit is exceeded, the organizers will ask participants with very large test suites to shorten them.