Announcements
- 2025-04-02: Shared task announced
- 2025-04-02: Registration open via Google Group
Overview
We present a shared task on training LLMs with limited resources for three Slavic languages: Ukrainian (uk), Upper Sorbian (hsb), and Lower Sorbian (dsb).
The objective of this Shared Task is to develop and improve LLMs for these languages. We consider two tasks that are to be evaluated jointly: Machine Translation (MT) and Multiple-Choice Question Answering (QA).
Ukrainian has roughly 40 million first language (L1) speakers spread all over the world and is a mid-resource language in NLP.
Upper and Lower Sorbian are very low-resource Slavic minority languages spoken in the eastern part of Germany, with roughly 30,000 and 7,000 L1 speakers, respectively.
In this task, we aim to test and improve the performance of LLMs on these languages.
Task Description
Models will be tested jointly on two tasks: machine translation and multiple-choice question answering. Our main goal is to observe the synergy between the two tasks in an LLM: How does training for MT impact performance on a secondary task, here QA? Is it possible to improve machine translation while keeping question-answering capabilities stable?
We set this shared task in a restricted context with limited resources: the base LLM is fixed to the Qwen 2.5 family, with a maximum of 3B parameters.
For Machine Translation, we focus on the following directions, which are currently favoured by the respective communities:
- English to Ukrainian (en→uk; subset of the general MT test set)
- Czech to Ukrainian (cs→uk; subset of the general MT test set)
- German to Upper Sorbian (de→hsb)
- German to Lower Sorbian (de→dsb)
For Question Answering, we selected multiple choice datasets from education and language certification:
- For Ukrainian QA, we will use multiple-choice exam questions from the UNLP 2024 Shared Task on LLM Instruction-Tuning for Ukrainian, compiled from school graduation examinations on various subjects: language, literature, history, and other general topics.
- For Upper and Lower Sorbian QA, we base our evaluation on the actual language certificate, which follows the CEFR scheme. We will test the models on questions from levels A1 up to C1.
Submission Tracks
Participants may submit outputs for any of the following languages:
- Ukrainian MT & QA (translating both en→uk and cs→uk)
- Upper Sorbian MT & QA
- Lower Sorbian MT & QA
You MAY NOT submit outputs only for MT or only for QA. Only submissions that follow this rule will count towards the final leaderboard.
The submissions for the MT and QA tasks must be generated from the same model per language.
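To make the joint-submission rule concrete, here is a small sketch of how a team might sanity-check its own submission bundle before uploading. The data format (a dict per language with `mt_model` and `qa_model` fields) is hypothetical, for illustration only; it is not the official submission format.

```python
def validate_submission(entries):
    """Check the joint-submission rule: for every language track,
    a team must submit both MT and QA outputs, and both must be
    generated from the same model. `entries` maps a language code
    ('uk', 'hsb', 'dsb') to a dict with 'mt_model' and 'qa_model'
    keys naming the model used. (Hypothetical format.)"""
    errors = []
    for lang, sub in entries.items():
        mt, qa = sub.get("mt_model"), sub.get("qa_model")
        if mt is None or qa is None:
            errors.append(f"{lang}: both MT and QA outputs are required")
        elif mt != qa:
            errors.append(f"{lang}: MT and QA must use the same model")
    return errors

# A valid Ukrainian entry next to an MT-only Sorbian one (invalid):
print(validate_submission({
    "uk": {"mt_model": "qwen2.5-3b-ft", "qa_model": "qwen2.5-3b-ft"},
    "hsb": {"mt_model": "qwen2.5-3b-ft", "qa_model": None},
}))
```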
In order to enable participation even with limited computational resources, we constrain the base models to a maximum of 3B parameters. Base models are from the Qwen 2.5 family:
You are also permitted to use any of the quantized or unsloth'd versions found here (provided they have 3B parameters or fewer):
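As a quick sanity check on model choice, the sketch below heuristically tests a Hugging Face checkpoint name against the family/size constraint. The set of permitted sizes is an assumption for illustration, not an official list of allowed checkpoints.

```python
import re

# Qwen 2.5 checkpoints come in several sizes; under the shared-task
# rules only variants with 3B parameters or fewer are allowed.
# This size set is an assumption, not an official whitelist.
ALLOWED_SIZES_B = {0.5, 1.5, 3.0}

def is_allowed_base_model(checkpoint):
    """Heuristically check a checkpoint name such as
    'Qwen/Qwen2.5-1.5B-Instruct' against the constraint:
    Qwen 2.5 family, at most 3B parameters."""
    m = re.search(r"Qwen2\.5-(\d+(?:\.\d+)?)B", checkpoint)
    return m is not None and float(m.group(1)) in ALLOWED_SIZES_B

print(is_allowed_base_model("Qwen/Qwen2.5-3B"))           # True
print(is_allowed_base_model("Qwen/Qwen2.5-7B-Instruct"))  # False
```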
Training Data
We provide the following datasets:
Upper and Lower Sorbian Machine Translation Data
This year, the MT task will focus on two translation directions: German→Upper Sorbian and German→Lower Sorbian. Both language pairs were considered in the previous editions of WMT Shared Tasks on Unsupervised MT and Very Low Resource Supervised MT.
For Upper and Lower Sorbian-German Machine Translation, we provide the data from the WMT2022 Unsupervised MT and Very Low Resource Supervised MT Shared Task for both languages.
- The WMT22 datasets (including dev and validation sets) can be found here: Upper and Lower Sorbian MT data (parallel and monolingual data)
In addition, we provide the following new parallel corpora thanks to the Witaj-Sprachzentrum:
- Parallel (HSB-DE): [to be released]
- Parallel (DSB-DE): [to be released]
- Monolingual (HSB): [to be released]
- Monolingual (DSB): [to be released]
The Leipzig Corpora Collection (Goldhahn et al., 2012) contains monolingual corpora for both Sorbian languages:
- Upper Sorbian news (1999): corpora.uni-leipzig.de/en?corpusId=hsb_news_1999
- Upper Sorbian mixed (2012): corpora.uni-leipzig.de/en?corpusId=hsb_mixed_2012
- Upper Sorbian Wikipedia (2021): corpora.uni-leipzig.de/en?corpusId=hsb_wikipedia_2021 (may contain other languages and other noise)
- Lower Sorbian Wikipedia (2021): corpora.uni-leipzig.de/en?corpusId=dsb_wikipedia_2021 (may contain other languages and other noise)
As the two Sorbian languages belong to the West Slavic language family, Czech (for Upper Sorbian) and Polish (for Lower Sorbian) are two closely related, better-resourced languages. Both Czech and Polish have featured in previous MT shared tasks; cs→de is one of the language pairs in this year's general MT task.
Upper and Lower Sorbian QA Dataset
The Witaj-Sprachzentrum provides language certificates for both Upper and Lower Sorbian, from the A1 to C1 levels, according to the CEFR (Common European Framework of Reference for Languages) scheme. We will use a mix of questions from all five levels (A1, A2, B1, B2, and C1, from beginner to advanced) for our task.
These language certificates assess a candidate's language proficiency according to four pillars: listening comprehension, reading comprehension, written expression, and oral expression. For this Shared Task, we will use questions from the reading and listening parts, using reference transcriptions of the audio material for the latter.
For each language level, there are different exercise formats; this means that the question types at A1 differ from those at C1, for instance. While the beginner levels have true/false questions about a short text, the advanced exercises consist of multiple-choice questions with longer texts and statements.
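As an illustration, the sketch below shows one way a multiple-choice item with an optional reading passage could be rendered into a single plain-text prompt for an LLM. The layout (A/B/C labels, "Answer:" cue) is hypothetical, not the official task format.

```python
def format_mcq(question, choices, passage=None):
    """Render a multiple-choice item as a plain-text prompt,
    labelling the options A, B, C, ... A reading passage, when
    present, is shown first (as in the reading-comprehension
    exercises). This layout is illustrative only."""
    lines = []
    if passage:
        lines += ["Text:", passage, ""]
    lines.append("Question: " + question)
    for i, choice in enumerate(choices):
        lines.append(f"{chr(ord('A') + i)}) {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

print(format_mcq(
    "Which sentence is grammatically correct?",
    ["Option one", "Option two", "Option three", "Option four"],
))
```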
For QA, here is a small sample: [to be released].
Ukrainian Machine Translation Data
The MT task for Ukrainian focuses on the English→Ukrainian and Czech→Ukrainian directions. Please refer to the datasets from the Main Shared Task:
- Google WMT24++: huggingface.co/datasets/google/wmt24pp
- Google SMOL: huggingface.co/datasets/google/smol
Ukrainian QA Dataset
The questions are taken from the Ukrainian External Independent Evaluation (called ЗНО/ZNO in Ukrainian) and cover various subjects: language, literature, history, geography, and other general knowledge. The training data will be compiled from the following open-source datasets:
- UNLP 2024 Shared Task: huggingface.co/datasets/osyvokon/zno
- ZNO-EVAL 2024: github.com/NLPForUA/ZNO
- Cohere INCLUDE: huggingface.co/datasets/CohereForAI/include-base-44
Test Data
In the test phase, we will release closed test sets for all tasks.
Important Dates
- Finalized task details: end of April 2025
- Release of training data for shared tasks: end of April 2025
- Release of test data: end of June 2025
- Outputs submission deadline: early July 2025
- System description paper submission: TBA (follows EMNLP)
All deadlines are in AoE (Anywhere on Earth). Dates are specified with respect to EMNLP 2025.
Contact/Organisers
Main contact: join our Google Group
TUM Heilbronn:
- Daryna Dementieva
- Lukas Edman
- Alexander Fraser
- Kathy Hämmerl
- Marion Di Marco
- Shu Okabe
(All names are sorted in alphabetical order.)
Witaj-Sprachzentrum (for both Upper and Lower Sorbian):
- Beate Brězan
- Anita Hendrichowa
- Marko Měškank
- Tomaš Šołta (language certificate)
(All names are sorted in alphabetical order.)
Acknowledgements
We express our deepest gratitude to the UNLP 2024 Shared Task team:
- Roman Kyslyi
- Mariana Romanyshyn
- Oleksiy Syvokon
who kindly allowed us to re-use their data for this shared task. Please acknowledge their work by citing the following paper:
Mariana Romanyshyn, Oleksiy Syvokon, and Roman Kyslyi. 2024. The UNLP 2024 Shared Task on Fine-Tuning Large Language Models for Ukrainian. In Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024, pages 67–74, Torino, Italia. ELRA and ICCL.