Subtask: MT Test Suites. "Help us break LLMs"

EMNLP 2024

NINTH CONFERENCE ON
MACHINE TRANSLATION (WMT24)

November 15-16, 2024
Miami, Florida, USA


The “Test suites” sub-task will again be part of WMT, for the seventh consecutive year.

What are the test suites?

Test suites are custom extensions to the standard General Translation Task test sets, constructed to focus on particular aspects of the MT output. Test suites also evaluate these aspects in their own custom way.
The composition of a test suite and its evaluation rely fully on the test suite provider, i.e. you. One can also call test suites an “unshared task”, where every participant is doing something different but related.
For our purpose, every test suite includes:

a source-side test set (i.e. a set of paragraphs in one language), and
an evaluation service (not necessarily automatic, not necessarily relying on any reference translations) for a particular target language.

As opposed to the standard evaluation process, which employs randomly selected test sets, test suites focus on particular translation phenomena or cases that pose specific challenges to MT. Additionally, instead of generic quality scores, test suites often produce separate fine-grained results for each case.

Since the use of LLMs for translation is becoming more popular, and we expect more LLMs to participate in WMT this year, the theme of this year’s test suite sub-task is "Help us break LLMs", i.e. to reveal weaknesses and serious flaws of LLMs when translating, hidden within the overall high-quality generation.

Why are test suites important?

In recent years, through the improvement of MT and lately with the emergence of LLMs, translation quality has improved a lot. In many cases, evaluation based on generic test sets yields very good results and sometimes fails to discriminate machine output from human translations. Nevertheless, there are still cases where MT struggles: an output that is fluent and surrounded by other seemingly perfect translations may still be seriously misleading. In general evaluation methods, such flaws can get “hidden in the average” or simply be missed altogether. Test suites are a flexible means of focusing on any specific problem, and as such, they can be very effective at revealing these cases.

How does the test suite track work?

The test suite track is a sub-task of the General Translation shared task at WMT and has the following steps:

12th June: The test suite participants (i.e. you) submit the source side (and optional references) of the test suites along with a README (participant & submission information) to the organizers of the Shared Task.
27th June: The shared task organizers (we) combine the test suites with the other test sets of the General Translation shared task and provide the combined source side to the developers of the MT systems.
4th July: The developers of MT systems (i.e. participants of the General Translation shared task) download the source side of the test sets, produce the machine translations with their systems and upload them to the shared task organizers.
16th July (originally 11th July): The shared task organizers (we) separate the test suite translations from the generic test set MT outputs and provide them back to the respective test suite participants (you).
- You (the test suite participant) receive the machine translations of your test suites by all participating systems.
- We will also try to provide you with system descriptions of the participating systems.
- You analyze the outputs and produce numerical results and the corresponding discussion/conclusions, and write a description paper.
20th August (please confirm on the main page):
- You submit the corresponding paper (“testsuite paper” in the SoftConf submission system).
- You submit by e-mail a few sentences with the main takeaways, so that we can reference your paper in the WMT Findings paper.
December: Your test suite paper is published in the WMT proceedings and is summarized and referenced in the overall findings paper.

New this year: pre-running of test suites on SoTA MT systems

7th May (originally 11th April): Test suite participants can optionally submit a preliminary version of the test suite. The shared task organizers will translate these test suites with pre-existing state-of-the-art systems and send the output back to the test suite participants. This way, test suite participants have more time to evaluate some of the outputs and present them in the description paper. The participants still have to submit the test suites by the compulsory deadline in June in order to have their test suite translated by the MT systems participating in WMT.

What do test suite participants have to do?

Prepare your test suites and provide us with the source texts (optionally with references).
Receive the machine translations of your source texts and evaluate them in your test suite's own style, whatever that is.
Write and submit a test suite paper and a short summary of the key takeaways.

How will the submission take place and what is the format?

You are encouraged to register your interest in participating in the sub-task ahead of time by filling in the participation form. This form is optional but helps us keep an overview and manage our resources.

Test suites need to be uploaded to a file repository provided by the organizers. For every test suite, the participants will have to create an archive (tar or zip) with one subfolder per language pair. Each subfolder should contain:

One text file with the source side of the test suite (plain text format with one segment per line; a segment may contain more than one sentence).
One README.md file with the participant information (name of the test suite, name of the institution, contact e-mail address).
Optionally: document IDs for every source segment in a separate file, one ID per line, aligned with the source segments, so that all segments which belong to the same document share the same ID. Participating MT systems may benefit from that, e.g. by treating the whole sequence of segments with the same document ID as context. Not providing document IDs means that each segment is independent of the others, and it will also be shipped to the participants without the context of surrounding segments.
Optionally: reference translations in a separate text file (one respective segment per line), which will be provided to the MT systems which participate in the opposite language direction. Please note that all source sentences submitted will be made public for research purposes.
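
The layout above can be checked and packaged with a short script. The following is only a minimal sketch, not an official tool: the file names `source.txt`, `docids.txt` and `reference.txt` are hypothetical (only `README.md` is named above), and the helper names `check_language_pair_folder` and `package_test_suite` are made up for illustration. It assumes a local folder with one subfolder per language pair (e.g. `en-de`), verifies that the optional files have exactly one line per source segment, and writes a zip archive.

```python
import zipfile
from pathlib import Path


def read_lines(path: Path) -> list[str]:
    """Return the lines of a UTF-8 text file (one segment per line)."""
    return path.read_text(encoding="utf-8").splitlines()


def check_language_pair_folder(folder: Path) -> int:
    """Validate one language-pair subfolder and return its segment count."""
    source = folder / "source.txt"  # hypothetical file name
    readme = folder / "README.md"
    if not source.is_file() or not readme.is_file():
        raise ValueError(f"{folder.name}: source.txt and README.md are required")
    segments = read_lines(source)
    if any(not segment.strip() for segment in segments):
        raise ValueError(f"{folder.name}: empty segment in source.txt")
    # Optional files must have exactly one line per source segment.
    for optional in ("docids.txt", "reference.txt"):  # hypothetical names
        path = folder / optional
        if path.is_file() and len(read_lines(path)) != len(segments):
            raise ValueError(f"{folder.name}: {optional} is not line-aligned")
    return len(segments)


def package_test_suite(root: Path, archive: Path) -> dict[str, int]:
    """Zip every language-pair subfolder of root; return segment counts."""
    counts: dict[str, int] = {}
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for folder in sorted(p for p in root.iterdir() if p.is_dir()):
            counts[folder.name] = check_language_pair_folder(folder)
            for file in sorted(folder.iterdir()):
                if file.is_file():
                    # Keep the language-pair subfolder inside the archive.
                    zf.write(file, arcname=f"{folder.name}/{file.name}")
    return counts
```

A call such as `package_test_suite(Path("my_suite"), Path("my_suite.zip"))` would return the number of segments per language pair, which can also help to keep an eye on the size of the test suite.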

Are there any restrictions on the participation of test suites?

The test suite participants can submit test suites in any of the language directions of the General Shared Task. They are free to choose what kind of phenomena or challenges they will focus on and the metrics they will report.

It is a bonus if a test suite shares some/all of the sentences or language phenomena across more language pairs.

There is no direct size limit per test suite, but the size of all test sets per language pair (including test suites) should not exceed 100,000 sentences. If this limit is exceeded, the organizers will ask participants with very big test suites to shorten them.
