Announcements
- 2025-04-02: Shared task announced
- 2025-04-02: Registration open via Google Group
Overview
We present a shared task on training LLMs with limited resources for three Slavic languages: Ukrainian (uk), Upper Sorbian (hsb), and Lower Sorbian (dsb).
The objective of this Shared Task is to develop and improve LLMs for these languages. We consider two tasks that are to be evaluated jointly: Machine Translation (MT) and Multiple-Choice Question Answering (QA).
Ukrainian has roughly 40 million first language (L1) speakers spread all over the world and is a mid-resource language in NLP.
Upper and Lower Sorbian are very low-resource Slavic minority languages spoken in the eastern part of Germany, with roughly 30,000 and 7,000 L1 speakers, respectively.
In this task, we aim to test and improve the performance of LLMs on these languages.
Task Description
Models will be tested jointly on two tasks: machine translation and multiple-choice question answering. Our main goal is to observe the synergy between the two tasks in an LLM: How does training for MT impact performance on a secondary task, here QA? Is it possible to improve machine translation while keeping question-answering capabilities stable?
We set this shared task in a restricted context with limited resources: the base LLM is fixed to the Qwen 2.5 family, with a maximum of 3B parameters.
For Machine Translation, we focus on the following directions, which are currently favoured by the respective communities:
- English to Ukrainian (en→uk; subset of the general MT test set)
- Czech to Ukrainian (cs→uk; subset of the general MT test set)
- German to Upper Sorbian (de→hsb)
- German to Lower Sorbian (de→dsb)
For Question Answering, we selected multiple choice datasets from education and language certification:
- For Ukrainian QA, we will use multiple-choice exam questions from the UNLP 2024 Shared Task on LLM Instruction-Tuning for Ukrainian, compiled from school graduation examinations on various subjects: language, literature, history, and other general topics.
- For Upper and Lower Sorbian QA, we base our evaluation on the actual language certificate, which follows the CEFR scheme. We will test the models on questions from levels A1 up to C1.
Submission Tracks
Participants may submit outputs for any of the following languages:
- Ukrainian MT & QA (translating both en→uk and cs→uk)
- Upper Sorbian MT & QA
- Lower Sorbian MT & QA
You MAY NOT submit outputs only for MT or only for QA. Only submissions that follow this rule will count towards the final leaderboard.
The submissions for the MT and QA tasks must be generated from the same model per language.
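To make the joint-submission rule concrete, here is a small sketch of how a team might sanity-check its own submission bundle before uploading. The data format (a dict per language with `mt_model` and `qa_model` fields) is hypothetical, for illustration only; it is not the official submission format.

```python
def validate_submission(entries):
    """Check the joint-submission rule: for every language track,
    a team must submit both MT and QA outputs, and both must be
    generated from the same model. `entries` maps a language code
    ('uk', 'hsb', 'dsb') to a dict with 'mt_model' and 'qa_model'
    keys naming the model used. (Hypothetical format.)"""
    errors = []
    for lang, sub in entries.items():
        mt, qa = sub.get("mt_model"), sub.get("qa_model")
        if mt is None or qa is None:
            errors.append(f"{lang}: both MT and QA outputs are required")
        elif mt != qa:
            errors.append(f"{lang}: MT and QA must use the same model")
    return errors

# A valid Ukrainian entry next to an MT-only Sorbian one (invalid):
print(validate_submission({
    "uk": {"mt_model": "qwen2.5-3b-ft", "qa_model": "qwen2.5-3b-ft"},
    "hsb": {"mt_model": "qwen2.5-3b-ft", "qa_model": None},
}))
```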
In order to enable participation even with limited computational resources, we constrain the base models to a maximum of 3B parameters. Base models are from the Qwen 2.5 family:
You are also permitted to use any of the quantized or unsloth'd versions found here (provided they have 3B parameters or fewer):
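As a quick sanity check on model choice, the sketch below heuristically tests a Hugging Face checkpoint name against the family/size constraint. The set of permitted sizes is an assumption for illustration, not an official list of allowed checkpoints.

```python
import re

# Qwen 2.5 checkpoints come in several sizes; under the shared-task
# rules only variants with 3B parameters or fewer are allowed.
# This size set is an assumption, not an official whitelist.
ALLOWED_SIZES_B = {0.5, 1.5, 3.0}

def is_allowed_base_model(checkpoint):
    """Heuristically check a checkpoint name such as
    'Qwen/Qwen2.5-1.5B-Instruct' against the constraint:
    Qwen 2.5 family, at most 3B parameters."""
    m = re.search(r"Qwen2\.5-(\d+(?:\.\d+)?)B", checkpoint)
    return m is not None and float(m.group(1)) in ALLOWED_SIZES_B

print(is_allowed_base_model("Qwen/Qwen2.5-3B"))           # True
print(is_allowed_base_model("Qwen/Qwen2.5-7B-Instruct"))  # False
```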
Training Data
We provide the following datasets:
Upper and Lower Sorbian Machine Translation Data
This year, the MT task will focus on two translation directions: German→Upper Sorbian and German→Lower Sorbian. Both language pairs were considered in the previous editions of WMT Shared Tasks on Unsupervised MT and Very Low Resource Supervised MT.
For Upper and Lower Sorbian-German Machine Translation, we provide the data from the WMT2022 Unsupervised MT and Very Low Resource Supervised MT Shared Task for both languages.
- The WMT22 datasets (including dev and validation sets) can be found here: Upper and Lower Sorbian MT data (parallel and monolingual data)
In addition, we provide the following new parallel corpora thanks to the Witaj-Sprachzentrum:
- Parallel (HSB-DE): [to be released]
- Parallel (DSB-DE): [to be released]
- Monolingual (HSB): [to be released]
- Monolingual (DSB): [to be released]
The Leipzig Corpora Collection (Goldhahn et al., 2012) contains monolingual corpora for both Sorbian languages:
- Upper Sorbian news (1999): corpora.uni-leipzig.de/en?corpusId=hsb_news_1999
- Upper Sorbian mixed (2012): corpora.uni-leipzig.de/en?corpusId=hsb_mixed_2012
- Upper Sorbian Wikipedia (2021): corpora.uni-leipzig.de/en?corpusId=hsb_wikipedia_2021 (may contain other languages and other noise)
- Lower Sorbian Wikipedia (2021): corpora.uni-leipzig.de/en?corpusId=dsb_wikipedia_2021 (may contain other languages and other noise)
As the two Sorbian languages belong to the West Slavic language family, Czech (for Upper Sorbian) and Polish (for Lower Sorbian) are two closely related, better-resourced languages. Both Czech and Polish have featured in previous MT shared tasks; cs→de is one of the language pairs in this year's general MT task.
Upper and Lower Sorbian QA Dataset
The Witaj-Sprachzentrum provides language certificates for both Upper and Lower Sorbian, from the A1 to C1 levels, according to the CEFR (Common European Framework of Reference for Languages) scheme. We will use a mix of questions from all five levels (A1, A2, B1, B2, and C1, from beginner to advanced) for our task.
These language certificates assess a candidate's language proficiency according to four pillars: listening comprehension, reading comprehension, written expression, and oral expression. For this Shared Task, we will use questions from the reading and listening parts, using reference transcriptions of the audio material for the latter.
For each language level, there are different exercise formats; this means that the question types at A1 differ from those at C1, for instance. While the beginner levels have true/false questions about a short text, the advanced exercises consist of multiple-choice questions with longer texts and statements.
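As an illustration, the sketch below shows one way a multiple-choice item with an optional reading passage could be rendered into a single plain-text prompt for an LLM. The layout (A/B/C labels, "Answer:" cue) is hypothetical, not the official task format.

```python
def format_mcq(question, choices, passage=None):
    """Render a multiple-choice item as a plain-text prompt,
    labelling the options A, B, C, ... A reading passage, when
    present, is shown first (as in the reading-comprehension
    exercises). This layout is illustrative only."""
    lines = []
    if passage:
        lines += ["Text:", passage, ""]
    lines.append("Question: " + question)
    for i, choice in enumerate(choices):
        lines.append(f"{chr(ord('A') + i)}) {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

print(format_mcq(
    "Which sentence is grammatically correct?",
    ["Option one", "Option two", "Option three", "Option four"],
))
```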
For QA, here is a small sample: [to be released].
Ukrainian Machine Translation Data
The MT task for Ukrainian focuses on the English→Ukrainian and Czech→Ukrainian directions. Please refer to the datasets from the Main Shared Task:
- Google WMT24++: huggingface.co/datasets/google/wmt24pp
- Google SMOL: huggingface.co/datasets/google/smol
Ukrainian QA Dataset
The questions are taken from the Ukrainian External Independent Evaluation (called ЗНО/ZNO in Ukrainian) and cover various subjects: language, literature, history, geography, and other general knowledge. The training data will be compiled from the following open-source datasets:
- UNLP 2024 Shared Task: huggingface.co/datasets/osyvokon/zno
- ZNO-EVAL 2024: github.com/NLPForUA/ZNO
- Cohere INCLUDE: huggingface.co/datasets/CohereForAI/include-base-44
Test Data
In the test phase, we will release closed test sets for all tasks.
Important Dates
- Finalized task details: end of April 2025
- Release of training data for shared tasks: end of April 2025
- Release of test data: end of June 2025
- Outputs submission deadline: early July 2025
- System description paper submission: TBA (follows EMNLP)
All deadlines are in AoE (Anywhere on Earth). Dates are specified with respect to EMNLP 2025.
Contact/Organisers
Main contact: join our Google Group
TUM Heilbronn:
- Daryna Dementieva
- Lukas Edman
- Alexander Fraser
- Kathy Hämmerl
- Marion Di Marco
- Shu Okabe
(All names are sorted in alphabetical order.)
Witaj-Sprachzentrum (for both Upper and Lower Sorbian):
- Beate Brězan
- Anita Hendrichowa
- Marko Měškank
- Tomaš Šołta (language certificate)
(All names are sorted in alphabetical order.)
Acknowledgements
We express our deepest gratitude to the UNLP 2024 Shared Task team:
- Roman Kyslyi
- Mariana Romanyshyn
- Oleksiy Syvokon
who kindly allowed us to re-use their data for this shared task. Please acknowledge their work by citing the following paper:
Mariana Romanyshyn, Oleksiy Syvokon, and Roman Kyslyi. 2024. The UNLP 2024 Shared Task on Fine-Tuning Large Language Models for Ukrainian. In Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024, pages 67–74, Torino, Italia. ELRA and ICCL.