This document lists the WMT25 General MT task datasets for the constrained track and gives instructions for downloading them using mtdata.
MTData
Setup
pip install mtdata==0.4.3 # on Python 3.9-3.11
Recipes Config File
Config file for CONSTRAINED track:
wget https://www.statmt.org/wmt25/mtdata/mtdata.recipes.wmt25-constrained.yml
See the Constrained Task Datasets section below for the dataset IDs selected for constrained evaluation. By default, mtdata loads recipe files matching the glob mtdata.recipes*.yml from the current directory (where mtdata commands are invoked). If the recipe YAML file is placed in a different directory, export MTDATA_RECIPES=/path/to/recipesdir.
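The lookup rule above can be illustrated with a small shell helper. This is only a sketch of the rule as described here, not mtdata's own implementation, and the function name list_recipe_files is hypothetical.

```shell
# Print the recipe files that match the glob described above.
# Looks in $MTDATA_RECIPES when it is set, otherwise in the current directory.
# (Illustrative helper; the name is made up and this is not mtdata code.)
list_recipe_files() {
    dir="${MTDATA_RECIPES:-.}"
    for f in "$dir"/mtdata.recipes*.yml; do
        # An unmatched glob stays literal, so check that the file exists.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running list_recipe_files after exporting MTDATA_RECIPES prints one path per line, which is a quick way to confirm that the variable points at the directory holding mtdata.recipes.wmt25-constrained.yml.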
List All Recipes
$ mtdata list-recipe -id | grep wmt25
The wmt25* IDs are all loaded from the mtdata.recipes.wmt25*.yml file.
Download Recipes
# example:
mtdata get-recipe -ri wmt25-eng-ces -o wmt25-eng-ces # -h|--help for help
# Optional: download and cache all datasets in parallel
mtdata -no-pb cache -j 8 -ri "wmt25-*"
for id in wmt25-{eng-{ara,bho,ces,est,isl,kor,rus,srp,ukr,zho},ces-{ukr,deu},jpn-zho}; do
    mtdata get-recipe -i "$id" -o "$id" --compress --no-merge -j 8
done
1. mtdata stores its cache under $HOME/.mtdata by default. To change this, either export MTDATA=/path/to/cache or create a symbolic link: ln -s /path/to/cache ~/.mtdata
2. mtdata uses pigz (if available in PATH) for faster compression and decompression. We recommend installing pigz using your package manager.
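Before starting a large download, it can help to confirm where the cache will live and whether pigz is on the PATH. The sketch below is a hedged illustration: the helper names are hypothetical, and reporting gzip as the fallback is an assumption made here, not a statement about mtdata's internals.

```shell
# Resolve the cache directory as described above: $MTDATA if set, else ~/.mtdata.
mtdata_cache_dir() {
    printf '%s\n' "${MTDATA:-$HOME/.mtdata}"
}

# Report which compressor is found on PATH: pigz when it is installed.
# Falling back to "gzip" is an assumption for this illustration.
available_compressor() {
    if command -v pigz >/dev/null 2>&1; then
        echo pigz
    else
        echo gzip
    fi
}

echo "cache directory: $(mtdata_cache_dir)"
echo "compressor: $(available_compressor)"
```

If the second line reports gzip, installing pigz with your package manager before downloading is likely to save time on the larger corpora.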
QE Score
Participants are allowed to score parallel training data using quality estimation (QE) metrics to identify and select high-quality parallel segments. Any suitable metric for the task may be used.
We provide starter tools based on PyMarian, a fast scorer optimized for large datasets. Additionally, scores precomputed with the wmt22-cometkiwi-da model are available for convenience.
URL="https://data.statmt.org/wmt25/general-mt/wmt25-QEscores/wmt25-all.wmt22-cometkiwi-da_fp16.score.tgz"
wget "$URL"
tar -xvf $(basename "$URL")
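Once scores are available, a common next step is to keep only the segment pairs above some threshold. The layout of the score archive is not described here, so the sketch below assumes a hypothetical three-column TSV stream (score, source, target); the function name filter_by_score and the file names in the usage note are likewise illustrative.

```shell
# Keep segment pairs whose QE score is at or above a threshold.
# stdin:  TSV lines of the assumed form score<TAB>source<TAB>target
# stdout: source<TAB>target for the retained pairs
# $1:     minimum score to keep
filter_by_score() {
    awk -F '\t' -v min="$1" '($1 + 0) >= (min + 0) { print $2 "\t" $3 }'
}
```

With line-aligned files, a call could look like: paste scores.txt train.src train.tgt | filter_by_score 0.8 > filtered.tsv. The value 0.8 is an arbitrary placeholder, not a recommended cutoff; a suitable threshold depends on the language pair and the metric.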
Optionally, the following script can be used to recompute the scores from scratch.
Computing QE Scores:
# Install prebuilt pymarian (linux only)
pip install pymarian==1.12.31
metric="wmt22-cometkiwi-da"
cmd="pymarian-eval --stdin --fields src mt --workspace -8000 --model $metric --mini-batch 128"
# Tip: increase batch size for GPUs that have more memory
# Other supported QE metrics: wmt20-comet-qe-da, wmt23-cometkiwi-da-xl, wmt23-cometkiwi-da-xxl
# Fp16 is faster and uses less memory for GPUs with Tensor Cores
metric+="_fp16" # adds "_fp16" to the file name to distinguish it from the normal one
cmd+=" --fp16"
# Test that the command works with the given args (especially batch size)
# Tip: add --debug flag to see details if the command crashes
mtdata echo Statmt-newstest_deen-2014-deu-eng | $cmd --debug
# run the command on all datasets downloaded above using "mtdata get-recipe"
for dir in ./wmt25-*; do
    langs=$(basename "$dir" | sed 's/wmt25-//g')
    mtdata score -l "$langs" -o "$dir" -n "$metric" --cmd "$cmd"
done
wmt22-cometkiwi-da is a gated model. Follow these steps to download it:
1. Go to huggingface.co/Unbabel/wmt22-cometkiwi-da-marian and accept the terms for the gated model.
2. Run "huggingface-cli login" and enter your token. The model will be cached locally after a successful download. See pymarian-eval --help for the cache location and other options.
Parallel Data Statistics
Langs | Dataset | Lines | Src Tokens | Tgt Tokens | Src Chars | Tgt Chars |
---|---|---|---|---|---|---|
ces-deu | OPUS | 136.54M | 1.47B | 1.61B | 10.65B | 11.42B |
ces-deu | LinguaTools-wikititles-2014 | 2.39M | 4.65M | 4.28M | 40.00M | 40.23M |
ces-deu | Tilde | 2.04M | 36.30M | 38.02M | 288.88M | 307.45M |
ces-deu | Facebook-wikimatrix-1 | 1.60M | 20.74M | 22.47M | 151.14M | 162.61M |
ces-deu | Statmt-news_commentary-18.1 | 244.83k | 4.82M | 5.45M | 37.02M | 41.07M |
ces-deu | (Total) | 142.82M | 1.54B | 1.68B | 11.16B | 11.97B |
ces-ukr | OPUS | 17.15M | 138.66M | 137.78M | 0.97B | 1.65B |
ces-ukr | Facebook-wikimatrix-1 | 848.96k | 10.43M | 10.07M | 75.97M | 127.31M |
ces-ukr | ELRC | 130.00k | 2.48M | 2.56M | 19.61M | 35.26M |
ces-ukr | (Total) | 18.13M | 151.57M | 150.41M | 1.07B | 1.81B |
eng-ara | OPUS | 304.22M | 4.65B | 4.21B | 28.48B | 44.10B |
eng-ara | Statmt-ccaligned-1 | 25.31M | 355.78M | 343.52M | 2.27B | 3.58B |
eng-ara | LinguaTools-wikititles-2014 | 4.82M | 11.15M | 10.91M | 84.51M | 129.17M |
eng-ara | Facebook-wikimatrix-1 | 1.97M | 38.55M | 35.77M | 242.74M | 376.25M |
eng-ara | Statmt-tedtalks-2_clean | 341.89k | 6.17M | 5.41M | 34.54M | 54.49M |
eng-ara | Statmt-news_commentary-18.1 | 193.67k | 8.94M | 11.70M | 57.33M | 127.15M |
eng-ara | (Total) | 336.86M | 5.07B | 4.61B | 31.17B | 48.37B |
eng-ces | OPUS | 237.54M | 2.85B | 2.48B | 17.02B | 17.82B |
eng-ces | ParaCrawl-paracrawl-9 | 50.63M | 692.12M | 626.34M | 4.33B | 4.68B |
eng-ces | Statmt-ccaligned-1 | 12.73M | 148.71M | 135.81M | 936.99M | 1.01B |
eng-ces | LinguaTools-wikititles-2014 | 4.81M | 11.36M | 9.67M | 83.77M | 81.29M |
eng-ces | Facebook-wikimatrix-1 | 2.09M | 33.56M | 29.66M | 206.82M | 216.62M |
eng-ces | Tilde | 2.09M | 42.26M | 38.26M | 276.52M | 303.75M |
eng-ces | ELRC | 1.96M | 37.18M | 33.00M | 243.79M | 262.52M |
eng-ces | EU | 1.92M | 34.27M | 30.09M | 222.84M | 232.92M |
eng-ces | Statmt-europarl-10 | 644.43k | 15.63M | 13.00M | 94.31M | 98.14M |
eng-ces | Statmt-wikititles-3 | 410.94k | 1.03M | 965.62k | 7.47M | 7.57M |
eng-ces | Statmt-news_commentary-18.1 | 265.37k | 5.71M | 5.19M | 36.22M | 39.81M |
eng-ces | Statmt-commoncrawl_wmt13-1 | 161.84k | 3.35M | 2.93M | 20.66M | 20.75M |
eng-ces | Neulab-tedtalks_train-1 | 103.09k | 2.10M | 1.77M | 10.58M | 10.39M |
eng-ces | (Total) | 315.37M | 3.88B | 3.40B | 23.49B | 24.78B |
eng-est | OPUS | 121.36M | 1.83B | 1.38B | 11.18B | 11.02B |
eng-est | ELRC | 9.09M | 201.49M | 144.73M | 1.29B | 1.25B |
eng-est | ParaCrawl-paracrawl-9 | 8.54M | 136.60M | 103.32M | 846.64M | 840.74M |
eng-est | Statmt-ccaligned-1 | 4.11M | 54.21M | 43.28M | 339.16M | 338.17M |
eng-est | Tilde | 2.06M | 41.65M | 30.28M | 272.67M | 271.35M |
eng-est | EU | 2.03M | 36.68M | 26.85M | 237.87M | 231.57M |
eng-est | Facebook-wikimatrix-1 | 955.55k | 15.41M | 11.78M | 96.18M | 95.33M |
eng-est | Statmt-europarl-7 | 649.59k | 15.68M | 11.21M | 94.64M | 91.44M |
eng-est | Neulab-tedtalks_train-1 | 10.74k | 215.97k | 171.65k | 1.09M | 1.04M |
eng-est | (Total) | 148.81M | 2.33B | 1.76B | 14.36B | 14.14B |
eng-isl | OPUS | 24.26M | 292.15M | 274.41M | 1.70B | 1.84B |
eng-isl | ParaCrawl-paracrawl-9 | 2.97M | 45.10M | 42.66M | 266.09M | 292.17M |
eng-isl | ParIce-eea_train-20.05 | 1.70M | 26.75M | 24.24M | 170.36M | 179.49M |
eng-isl | Statmt-ccaligned-1 | 1.19M | 18.63M | 17.80M | 115.58M | 124.36M |
eng-isl | Tilde | 420.71k | 6.31M | 6.10M | 41.71M | 45.26M |
eng-isl | ParIce-ema_train-20.05 | 399.09k | 6.13M | 5.94M | 40.41M | 43.90M |
eng-isl | Facebook-wikimatrix-1 | 313.88k | 5.66M | 4.77M | 34.53M | 34.04M |
eng-isl | Statmt-wikititles-3 | 50.18k | 98.99k | 88.35k | 722.24k | 763.33k |
eng-isl | EU | 4.72k | 54.43k | 52.31k | 369.04k | 398.50k |
eng-isl | (Total) | 31.31M | 400.87M | 376.06M | 2.37B | 2.56B |
eng-kor | OPUS | 138.12M | 1.64B | 1.31B | 9.84B | 12.28B |
eng-kor | Statmt-ccaligned-1 | 9.03M | 98.69M | 84.80M | 635.05M | 744.99M |
eng-kor | LinguaTools-wikititles-2014 | 4.83M | 11.62M | 9.32M | 84.86M | 90.51M |
eng-kor | ParaCrawl-paracrawl-1_bonus | 4.00M | 61.96M | 48.70M | 371.75M | 433.95M |
eng-kor | Facebook-wikimatrix-1 | 1.35M | 21.63M | 15.66M | 135.00M | 161.17M |
eng-kor | Neulab-tedtalks_train-1 | 205.64k | 4.29M | 2.97M | 21.55M | 26.31M |
eng-kor | ELRC | 3.27k | 67.72k | 45.95k | 424.80k | 471.77k |
eng-kor | (Total) | 157.54M | 1.84B | 1.48B | 11.09B | 13.74B |
eng-rus | OPUS | 479.12M | 7.32B | 6.39B | 44.88B | 83.67B |
eng-rus | Statmt-ccaligned-1 | 69.26M | 0.97B | 864.09M | 6.18B | 11.32B |
eng-rus | Statmt-backtrans_ruen-wmt20 | 39.36M | 746.47M | 596.28M | 4.47B | 7.75B |
eng-rus | LinguaTools-wikititles-2014 | 13.57M | 33.05M | 28.99M | 245.88M | 421.65M |
eng-rus | ParaCrawl-paracrawl-1_bonus | 5.38M | 101.31M | 80.41M | 632.54M | 1.06B |
eng-rus | Facebook-wikimatrix-1 | 5.20M | 86.79M | 76.48M | 537.73M | 0.97B |
eng-rus | Statmt-wikititles-3 | 1.19M | 3.13M | 2.88M | 22.80M | 39.34M |
eng-rus | Statmt-yandex-wmt22 | 1.00M | 21.25M | 18.68M | 130.99M | 250.76M |
eng-rus | Statmt-commoncrawl_wmt13-1 | 878.39k | 18.77M | 17.40M | 116.16M | 214.59M |
eng-rus | Statmt-news_commentary-18.1 | 377.66k | 8.72M | 8.11M | 55.68M | 112.13M |
eng-rus | Neulab-tedtalks_train-1 | 208.46k | 4.37M | 3.69M | 21.96M | 36.77M |
eng-rus | ELRC | 39.50k | 891.98k | 792.00k | 5.73M | 10.87M |
eng-rus | Tilde | 34.27k | 752.66k | 702.81k | 4.83M | 9.97M |
eng-rus | (Total) | 615.62M | 9.31B | 8.09B | 57.31B | 105.86B |
eng-srp | OPUS | 127.45M | 1.33B | 1.17B | 7.57B | 9.99B |
eng-srp | Statmt-ccaligned-1 | 1.99M | 38.73M | 34.34M | 235.07M | 399.09M |
eng-srp | Facebook-wikimatrix-1 | 1.21M | 20.95M | 18.81M | 129.91M | 209.19M |
eng-srp | Neulab-tedtalks_train-1 | 136.90k | 2.79M | 2.38M | 14.05M | 14.40M |
eng-srp | Tilde | 2.02k | 46.81k | 45.16k | 303.95k | 491.17k |
eng-srp | ELRC | 856 | 14.50k | 13.28k | 93.28k | 149.56k |
eng-srp | (Total) | 130.79M | 1.39B | 1.22B | 7.95B | 10.62B |
eng-ukr | OPUS | 151.87M | 2.68B | 2.33B | 16.50B | 29.37B |
eng-ukr | ParaCrawl-paracrawl-1_bonus | 13.35M | 505.83M | 487.47M | 3.28B | 6.04B |
eng-ukr | Statmt-ccaligned-1 | 8.55M | 119.38M | 104.10M | 755.38M | 1.33B |
eng-ukr | Facebook-wikimatrix-1 | 2.58M | 41.55M | 35.59M | 257.56M | 447.33M |
eng-ukr | ELRC | 1.16M | 16.65M | 13.15M | 110.37M | 194.76M |
eng-ukr | Neulab-tedtalks_train-1 | 108.50k | 2.25M | 1.94M | 11.33M | 18.45M |
eng-ukr | Tilde | 1.63k | 36.07k | 34.18k | 237.96k | 477.91k |
eng-ukr | (Total) | 177.62M | 3.36B | 2.97B | 20.92B | 37.40B |
eng-zho | OPUS | 221.88M | 3.25B | 392.85M | 19.99B | 17.76B |
eng-zho | Statmt-backtrans_enzh-wmt20 | 19.76M | 364.22M | 32.72M | 2.16B | 1.96B |
eng-zho | Statmt-ccaligned-1 | 15.18M | 155.93M | 42.42M | 1.04B | 1.13B |
eng-zho | ParaCrawl-paracrawl-1_bonus | 14.17M | 217.60M | 46.40M | 1.34B | 1.18B |
eng-zho | LinguaTools-wikititles-2014 | 6.66M | 16.16M | 7.79M | 118.50M | 112.12M |
eng-zho | Facebook-wikimatrix-1 | 2.60M | 49.87M | 5.00M | 311.07M | 277.84M |
eng-zho | Statmt-wikititles-3 | 921.96k | 2.37M | 973.44k | 17.82M | 16.28M |
eng-zho | Statmt-news_commentary-18.1 | 442.93k | 9.80M | 799.74k | 62.67M | 55.16M |
eng-zho | Neulab-tedtalks_train-1 | 5.54k | 95.63k | 23.52k | 476.98k | 399.81k |
eng-zho | ELRC | 2.98k | 91.23k | 7.36k | 591.36k | 644.17k |
eng-zho | (Total) | 281.63M | 4.07B | 528.99M | 25.05B | 22.49B |
jpn-zho | OPUS | 19.74M | 46.43M | 46.87M | 1.44B | 1.08B |
jpn-zho | KECL-paracrawl-2wmt24 | 4.60M | 27.88M | 29.51M | 0.97B | 704.98M |
jpn-zho | LinguaTools-wikititles-2014 | 1.66M | 1.97M | 1.97M | 35.18M | 27.48M |
jpn-zho | Facebook-wikimatrix-1 | 1.33M | 2.36M | 2.12M | 145.10M | 113.60M |
jpn-zho | KECL-paracrawl-2 | 83.89k | 552.50k | 633.77k | 18.86M | 14.11M |
jpn-zho | Neulab-tedtalks_train-1 | 5.16k | 19.57k | 22.30k | 490.89k | 375.98k |
jpn-zho | Statmt-news_commentary-18.1 | 1.62k | 2.59k | 2.17k | 272.83k | 197.25k |
jpn-zho | (Total) | 27.42M | 79.23M | 81.13M | 2.61B | 1.94B |
Download stats: without rounding and without grouping
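The counts in the table above are rounded and use k, M, and B suffixes, so a quick consistency check needs them expanded first. The helper below is a minimal sketch for that purpose; the name expand_count is hypothetical, and it assumes the suffixes mean thousand, million, and billion.

```shell
# Expand a rounded count such as 848.96k, 17.15M or 1.47B into a plain number.
# Assumes k = 1e3, M = 1e6, B = 1e9; values without a suffix pass through.
expand_count() {
    printf '%s\n' "$1" | awk '{
        n = $1 + 0                      # numeric prefix, e.g. 848.96
        s = substr($1, length($1), 1)   # last character: k, M, B or a digit
        if (s == "k") n *= 1e3
        else if (s == "M") n *= 1e6
        else if (s == "B") n *= 1e9
        printf "%.0f\n", n
    }'
}

# Sum the ces-ukr "Lines" column (OPUS, Facebook-wikimatrix-1, ELRC).
total=0
for c in 17.15M 848.96k 130.00k; do
    total=$((total + $(expand_count "$c")))
done
echo "$total"
```

The sum comes to roughly 18.13 million lines, which is consistent with the rounded (Total) row for ces-ukr. Because the inputs are already rounded, small differences from the unrounded statistics are expected.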
Monolingual Data Statistics
Lang | Dataset | Lines | Tokens | Chars |
---|---|---|---|---|
ces-deu/deu | Leipzig-news-2022_30k-deu | 30.00k | 464.10k | 3.33M |
ces-deu/deu | Leipzig-web-2021_100k-deu_DE | 100.00k | 1.53M | 11.64M |
ces-deu/deu | Statmt-news_commentary-18.1-deu | 507.81k | 11.03M | 83.23M |
ces-deu/deu | Leipzig-mixed_typical-2011_1m-deu | 999.93k | 6.60M | 48.91M |
ces-deu/deu | Leipzig-wikipedia-2021_1m-deu | 1.00M | 15.36M | 110.63M |
ces-deu/deu | Leipzig-comweb-2021_1m-deu | 1.00M | 15.83M | 115.76M |
ces-deu/deu | Leipzig-newscrawl-2020_1m-deu | 1.00M | 15.30M | 108.98M |
ces-deu/deu | Statmt-europarl-10-deu | 2.11M | 44.85M | 330.07M |
ces-deu/deu | Statmt-news_crawl-2023-deu | 38.36M | 894.96M | 6.42B |
ces-deu/deu | Statmt-commoncrawl-wmt22-deu | 2.87B | 53.65B | 392.80B |
ces-deu/deu | (Total) | 2.92B | 54.66B | 400.03B |
ces-ukr/ukr | Statmt-news_crawl-2023-ukr | 620.75k | 13.37M | 169.70M |
ces-ukr/ukr | Leipzig-newscrawl-2018_1m-ukr | 1.00M | 14.82M | 191.59M |
ces-ukr/ukr | Leipzig-wikipedia-2021_1m-ukr | 1.00M | 14.56M | 185.20M |
ces-ukr/ukr | Leipzig-web-2019_1m-ukr_UA | 1.00M | 14.96M | 199.08M |
ces-ukr/ukr | Leipzig-news-2022_1m-ukr | 1.00M | 13.59M | 174.03M |
ces-ukr/ukr | LangUk-fiction-1-ukr | 1.81M | 18.32M | 198.26M |
ces-ukr/ukr | LangUk-wiki_dump-1-ukr | 15.79M | 185.65M | 2.34B |
ces-ukr/ukr | LangUk-laws-1-ukr | 29.21M | 578.99M | 7.69B |
ces-ukr/ukr | LangUk-news-1-ukr | 31.02M | 461.45M | 5.94B |
ces-ukr/ukr | LangUk-ubercorpus-1-ukr | 48.62M | 665.42M | 8.48B |
ces-ukr/ukr | (Total) | 131.07M | 1.98B | 25.57B |
eng-ara/ara | Statmt-news_commentary-18.1-ara | 211.77k | 12.74M | 138.45M |
eng-ara/ara | Leipzig-news-2020_1m-ara | 1.00M | 22.67M | 247.86M |
eng-ara/ara | Leipzig-wikipedia-2021_1m-ara | 1.00M | 16.55M | 175.09M |
eng-ara/ara | Statmt-news_crawl-2023-ara | 21.67M | 569.84M | 6.31B |
eng-ara/ara | (Total) | 23.88M | 621.80M | 6.87B |
eng-ces/ces | Statmt-news_commentary-18.1-ces | 288.07k | 5.55M | 42.60M |
eng-ces/ces | Statmt-europarl-10-ces | 669.67k | 13.20M | 99.66M |
eng-ces/ces | Leipzig-newscrawl-2019_1m-ces | 1.00M | 13.11M | 94.72M |
eng-ces/ces | Leipzig-news-2022_1m-ces | 1.00M | 15.19M | 109.49M |
eng-ces/ces | Leipzig-web_public-2019_1m-ces_CZ | 1.00M | 14.62M | 104.97M |
eng-ces/ces | Leipzig-wikipedia-2021_1m-ces | 1.00M | 14.68M | 107.05M |
eng-ces/ces | Statmt-news_crawl-2023-ces | 9.04M | 192.03M | 1.37B |
eng-ces/ces | Statmt-commoncrawl-wmt22-ces | 333.48M | 5.30B | 39.19B |
eng-ces/ces | (Total) | 347.48M | 5.57B | 41.12B |
eng-est/est | Leipzig-news-2020_300k-est | 300.00k | 4.42M | 33.88M |
eng-est/est | Leipzig-newscrawl-2017_1m-est | 1.00M | 14.86M | 115.28M |
eng-est/est | Leipzig-web-2015_1m-est_EE | 1.00M | 14.39M | 107.60M |
eng-est/est | Statmt-news_crawl-2023-est | 1.36M | 19.22M | 148.74M |
eng-est/est | (Total) | 3.66M | 52.90M | 405.49M |
eng-isl/isl | Leipzig-news-2020_30k-isl | 30.00k | 524.84k | 3.70M |
eng-isl/isl | Leipzig-wikipedia-2021_100k-isl | 100.00k | 1.54M | 10.74M |
eng-isl/isl | Leipzig-newscrawl-2019_300k-isl | 300.00k | 5.28M | 37.03M |
eng-isl/isl | Leipzig-web-2020_1m-isl_IS | 1.00M | 16.64M | 113.42M |
eng-isl/isl | Leipzig-web_public-2019_1m-isl_IS | 1.00M | 16.57M | 113.62M |
eng-isl/isl | Statmt-news_crawl-2023-isl | 1.71M | 28.23M | 192.88M |
eng-isl/isl | (Total) | 4.14M | 68.78M | 471.39M |
eng-kor/kor | Leipzig-web-2020_1m-kor_KR | 1.00M | 15.78M | 170.12M |
eng-kor/kor | Leipzig-wikipedia-2021_1m-kor | 1.00M | 13.71M | 144.15M |
eng-kor/kor | Leipzig-news-2020_1m-kor | 1.00M | 15.20M | 160.73M |
eng-kor/kor | Statmt-news_crawl-2023-kor | 4.00M | 53.43M | 552.07M |
eng-kor/kor | (Total) | 7.00M | 98.13M | 1.03B |
eng-rus/rus | Statmt-news_commentary-18.1-rus | 449.48k | 8.91M | 123.24M |
eng-rus/rus | Leipzig-web-2017_1m-rus_GE | 1.00M | 15.17M | 196.18M |
eng-rus/rus | Leipzig-newscrawl_public-2018_1m-rus | 1.00M | 14.44M | 190.27M |
eng-rus/rus | Leipzig-news-2022_1m-rus | 1.00M | 14.43M | 190.39M |
eng-rus/rus | Leipzig-wikipedia-2021_1m-rus | 1.00M | 13.94M | 182.16M |
eng-rus/rus | Statmt-news_crawl-2023-rus | 22.61M | 462.59M | 6.08B |
eng-rus/rus | Statmt-commoncrawl-wmt22-rus | 1.17B | 18.88B | 239.86B |
eng-rus/rus | (Total) | 1.20B | 19.41B | 246.82B |
eng-srp/srp | Leipzig-news-2019_30k-srp | 30.00k | 541.17k | 6.14M |
eng-srp/srp | Leipzig-web-2016_300k-srp_ME | 300.00k | 5.91M | 71.32M |
eng-srp/srp | Leipzig-web-2016_1m-srp_RS | 1.00M | 17.89M | 206.37M |
eng-srp/srp | Leipzig-wikipedia-2021_1m-srp | 1.00M | 15.18M | 175.19M |
eng-srp/srp | Statmt-news_crawl-2023-srp | 15.51M | 374.90M | 2.69B |
eng-srp/srp | (Total) | 17.84M | 414.41M | 3.15B |
eng-ukr/ukr | Statmt-news_crawl-2023-ukr | 620.75k | 13.37M | 169.70M |
eng-ukr/ukr | Leipzig-newscrawl-2018_1m-ukr | 1.00M | 14.82M | 191.59M |
eng-ukr/ukr | Leipzig-wikipedia-2021_1m-ukr | 1.00M | 14.56M | 185.20M |
eng-ukr/ukr | Leipzig-web-2019_1m-ukr_UA | 1.00M | 14.96M | 199.08M |
eng-ukr/ukr | Leipzig-news-2022_1m-ukr | 1.00M | 13.59M | 174.03M |
eng-ukr/ukr | LangUk-fiction-1-ukr | 1.81M | 18.32M | 198.26M |
eng-ukr/ukr | LangUk-wiki_dump-1-ukr | 15.79M | 185.65M | 2.34B |
eng-ukr/ukr | LangUk-laws-1-ukr | 29.21M | 578.99M | 7.69B |
eng-ukr/ukr | LangUk-news-1-ukr | 31.02M | 461.45M | 5.94B |
eng-ukr/ukr | LangUk-ubercorpus-1-ukr | 48.62M | 665.42M | 8.48B |
eng-ukr/ukr | (Total) | 131.07M | 1.98B | 25.57B |
eng-zho/zho | Leipzig-news-2020_300k-zho | 300.00k | 344.32k | 42.51M |
eng-zho/zho | Statmt-news_commentary-18.1-zho | 541.52k | 947.43k | 65.61M |
eng-zho/zho | Leipzig-wikipedia-2018_1m-zho | 1.00M | 1.45M | 106.70M |
eng-zho/zho | Leipzig-tradnewscrawl-2011_1m-zho | 1.00M | 1.46M | 160.15M |
eng-zho/zho | Leipzig-web-2016_1m-zho_MO | 1.00M | 1.26M | 194.08M |
eng-zho/zho | Statmt-news_crawl-2023-zho | 5.53M | 10.19M | 1.04B |
eng-zho/zho | Statmt-commoncrawl-wmt22-zho | 1.67B | 3.39B | 131.85B |
eng-zho/zho | (Total) | 1.68B | 3.41B | 133.45B |
jpn-zho/zho | Leipzig-news-2020_300k-zho | 300.00k | 344.32k | 42.51M |
jpn-zho/zho | Statmt-news_commentary-18.1-zho | 541.52k | 947.43k | 65.61M |
jpn-zho/zho | Leipzig-wikipedia-2018_1m-zho | 1.00M | 1.45M | 106.70M |
jpn-zho/zho | Leipzig-tradnewscrawl-2011_1m-zho | 1.00M | 1.46M | 160.15M |
jpn-zho/zho | Leipzig-web-2016_1m-zho_MO | 1.00M | 1.26M | 194.08M |
jpn-zho/zho | Statmt-news_crawl-2023-zho | 5.53M | 10.19M | 1.04B |
jpn-zho/zho | Statmt-commoncrawl-wmt22-zho | 1.67B | 3.39B | 131.85B |
jpn-zho/zho | (Total) | 1.68B | 3.41B | 133.45B |
Download stats: without rounding
Constrained Task Datasets
The dataset IDs selected for the constrained task are as follows:
# Setup: pip install mtdata==0.4.3
# To list all the available datasets, use the following commands
# mtdata list -id -l <lang1>-<lang2> # parallel
# mtdata list -id -l <lang> # monolingual
# To get a dataset
# mtdata echo <data_id>
########## CES-UKR #########
- id: wmt25-ces-ukr
  langs: ces-ukr
  train:
    - Facebook-wikimatrix-1-ces-ukr
    - ELRC-acts_ukrainian-1-ces-ukr
    - OPUS-ccmatrix-v1-ces-ukr
    - OPUS-elrc_5179_acts_ukrainian-v1-ces-ukr
    - OPUS-elrc_wikipedia_health-v1-ces-ukr
    - OPUS-eubookshop-v2-ces-ukr
    - OPUS-gnome-v1-ces-ukr
    - OPUS-kde4-v2-ces-ukr
    - OPUS-multiccaligned-v1.1-ces-ukr
    - OPUS-multiparacrawl-v9b-ces-ukr
    - OPUS-opensubtitles-v2024-ces-ukr
    - OPUS-qed-v2.0a-ces-ukr
    - OPUS-ted2020-v1-ces-ukr
    - OPUS-tatoeba-v20220303-ces-ukr
    - OPUS-ubuntu-v14.10-ces-ukr
    - OPUS-xlent-v1.1-ces-ukr
    - OPUS-bible_uedin-v1-ces-ukr
    - OPUS-wikimedia-v20210402-ces-ukr
  mono_train: &mono_ukr
    - Statmt-news_crawl-2023-ukr
    - LangUk-news-1-ukr
    - LangUk-wiki_dump-1-ukr
    - LangUk-fiction-1-ukr
    - LangUk-ubercorpus-1-ukr
    - LangUk-laws-1-ukr
    - Leipzig-news-2022_1m-ukr
    - Leipzig-newscrawl-2018_1m-ukr
    - Leipzig-web-2019_1m-ukr_UA
    - Leipzig-wikipedia-2021_1m-ukr
###############CES-DEU########################
- id: wmt25-ces-deu
  langs: ces-deu
  train:
    - Statmt-news_commentary-18.1-ces-deu
    - Tilde-eesc-2017-ces-deu
    - Tilde-ema-2016-ces-deu
    - Tilde-ecb-2017-ces-deu
    - Tilde-rapid-2016-ces-deu
    - Facebook-wikimatrix-1-ces-deu
    - LinguaTools-wikititles-2014-ces-deu
    - OPUS-ccmatrix-v1-ces-deu
    - OPUS-dgt-v2019-ces-deu
    - OPUS-dgt-v4-ces-deu
    - OPUS-ecb-v1-ces-deu
    - OPUS-ecdc-v20160316-ces-deu
    - OPUS-elitr_eca-v1-ces-deu
    - OPUS-elrc_417_swedish_work_environ-v1-ces-deu
    - OPUS-elrc_ec_europa-v1-ces-deu
    - OPUS-elrc_emea-v1-ces-deu
    - OPUS-elrc_euipo_2017-v1-ces-deu
    - OPUS-elrc_europarl_covid-v1-ces-deu
    - OPUS-elrc_eur_lex-v1-ces-deu
    - OPUS-elrc_eu_publications-v1-ces-deu
    - OPUS-elrc_information_portal-v1-ces-deu
    - OPUS-elrc_antibiotic-v1-ces-deu
    - OPUS-elrc_presscorner_covid-v1-ces-deu
    - OPUS-elrc_vaccination-v1-ces-deu
    - OPUS-elrc_wikipedia_health-v1-ces-deu
    - OPUS-emea-v3-ces-deu
    - OPUS-eubookshop-v2-ces-deu
    - OPUS-euconst-v1-ces-deu
    - OPUS-europarl-v8-ces-deu
    - OPUS-gnome-v1-ces-deu
    - OPUS-globalvoices-v2018q4-ces-deu
    - OPUS-jrc_acquis-v3.0-ces-deu
    - OPUS-kde4-v2-ces-deu
    - OPUS-multiccaligned-v1.1-ces-deu
    - OPUS-multiparacrawl-v9b-ces-deu
    - OPUS-nllb-v1-ces-deu
    - OPUS-neulab_tedtalks-v1-ces-deu
    - OPUS-opensubtitles-v2024-ces-deu
    - OPUS-php-v1-ces-deu
    - OPUS-qed-v2.0a-ces-deu
    - OPUS-ted2020-v1-ces-deu
    - OPUS-tanzil-v1-ces-deu
    - OPUS-tatoeba-v20230412-ces-deu
    - OPUS-tildemodel-v2018-ces-deu
    - OPUS-ubuntu-v14.10-ces-deu
    #- OPUS-wikimatrix-v1-ces-deu # already added from source
    - OPUS-xlent-v1.2-ces-deu
    - OPUS-bible_uedin-v1-ces-deu
    - OPUS-wikimedia-v20230407-ces-deu
  mono_train: &mono_deu
    - Statmt-news_crawl-2023-deu
    - Statmt-europarl-10-deu
    - Statmt-news_commentary-18.1-deu
    - Statmt-commoncrawl-wmt22-deu
    - Leipzig-wikipedia-2021_1m-deu
    - Leipzig-comweb-2021_1m-deu
    - Leipzig-mixed_typical-2011_1m-deu
    - Leipzig-news-2022_30k-deu
    - Leipzig-newscrawl-2020_1m-deu
    - Leipzig-web-2021_100k-deu_DE
    # TODO: extended common crawl
########## JPN-ZHO ##########
- id: wmt25-jpn-zho
  langs: jpn-zho
  train:
    - Statmt-news_commentary-18.1-jpn-zho
    - KECL-paracrawl-2-zho-jpn
    - KECL-paracrawl-2wmt24-zho-jpn
    - Facebook-wikimatrix-1-jpn-zho
    - Neulab-tedtalks_train-1-jpn-zho
    - LinguaTools-wikititles-2014-jpn-zho
    - OPUS-ccmatrix-v1-jpn-zho
    - OPUS-gnome-v1-jpn-zho_CN
    - OPUS-kde4-v2-jpn-zho_CN
    - OPUS-multiccaligned-v1-jpn-zho_CN
    - OPUS-openoffice-v3-jpn-zho_CN
    - OPUS-opensubtitles-v2024-jpn-zho_CN
    - OPUS-php-v1-jpn-zho
    - OPUS-qed-v2.0a-jpn-zho
    - OPUS-ted2020-v1-jpn-zho
    - OPUS-tanzil-v1-jpn-zho
    - OPUS-ubuntu-v14.10-jpn-zho
    - OPUS-ubuntu-v14.10-jpn-zho_CN
    - OPUS-xlent-v1.1-jpn-zho
    - OPUS-bible_uedin-v1-jpn-zho
    - OPUS-wikimedia-v20210402-jpn-zho
  mono_train: &mono_zho
    - Statmt-news_crawl-2023-zho
    - Statmt-news_commentary-18.1-zho
    - Statmt-commoncrawl-wmt22-zho
    - Leipzig-wikipedia-2018_1m-zho
    - Leipzig-web-2016_1m-zho_MO
    - Leipzig-tradnewscrawl-2011_1m-zho
    - Leipzig-news-2020_300k-zho
    # TODO: extended common crawl (too big) https://data.statmt.org/wmt21/translation-task/cc-mono/
######### ENG-BHO ###########
- id: wmt25-eng-bho
  langs: eng-bho
  train:
    - OPUS-nllb-v1-bho-eng
    - OPUS-tatoeba-v20230412-bho-eng
    - OPUS-ubuntu-v14.10-bho-eng
    - OPUS-wikimedia-v20230407-bho-eng
  # mono_train: &mono_bho
  # NOTE: did not find any monolingual data for bho in mtdata and OPUS
######### ENG-MAS ###########
# TODO: did not find parallel data for mas-eng in mtdata and OPUS
# - id: wmt25-eng-mas
#   langs: eng-mas
#   train:
#   mono_train: &mono_mas
######### ENG-ARA ###########
- id: wmt25-eng-ara
  langs: eng-ara
  train:
    - Statmt-news_commentary-18.1-ara-eng
    - Statmt-tedtalks-2_clean-eng-ara
    - Statmt-ccaligned-1-ara_AR-eng
    - Facebook-wikimatrix-1-ara-eng
    - LinguaTools-wikititles-2014-ara-eng
    - OPUS-ccaligned-v1-ara-eng
    - OPUS-ccmatrix-v1-ara-eng
    - OPUS-elrc_3083_wikipedia_health-v1-ara-eng
    - OPUS-elrc_wikipedia_health-v1-ara-eng
    - OPUS-elrc_2922-v1-ara-eng
    - OPUS-eubookshop-v2-ara-eng
    - OPUS-gnome-v1-ara-eng
    - OPUS-globalvoices-v2018q4-ara-eng
    - OPUS-hplt-v2-ara-eng
    - OPUS-kde4-v2-ara-eng
    - OPUS-linguatools_wikititles-v2014-ara-eng
    - OPUS-multiccaligned-v1-ara-eng
    - OPUS-multihplt-v2-ara-eng
    - OPUS-multiun-v1-ara-eng
    - OPUS-nllb-v1-ara-eng
    - OPUS-opensubtitles-v2024-ara-eng
    - OPUS-qed-v2.0a-ara-eng
    - OPUS-ted2020-v1-ara-eng
    - OPUS-tatoeba-v20230412-ara-eng
    - OPUS-unpc-v1.0-ara-eng
    - OPUS-ubuntu-v14.10-ara-eng
    - OPUS-wikimatrix-v1-ara-eng
    - OPUS-wikipedia-v1.0-ara-eng
    - OPUS-xlent-v1.2-ara-eng
    - OPUS-bible_uedin-v1-ara-eng
    - OPUS-infopankki-v1-ara-eng
    - OPUS-tico_19-v20201028-ara-eng
    - OPUS-tldr_pages-v20230829-ara-eng
    - OPUS-wikimedia-v20230407-ara-eng
  mono_train:
    - Statmt-news_crawl-2023-ara
    - Statmt-news_commentary-18.1-ara
    - Leipzig-news-2020_1m-ara
    - Leipzig-wikipedia-2021_1m-ara
######### ENG-ZHO ###########
- id: wmt25-eng-zho
  langs: eng-zho
  train: # TODO: add all public data
    - Statmt-news_commentary-18.1-eng-zho
    - Statmt-wikititles-3-zho-eng
    - Statmt-ccaligned-1-eng-zho_CN
    - ParaCrawl-paracrawl-1_bonus-eng-zho
    - Facebook-wikimatrix-1-eng-zho
    - Neulab-tedtalks_train-1-eng-zho
    - ELRC-wikipedia_health-1-eng-zho
    - ELRC-hrw_dataset_v1-1-eng-zho
    - LinguaTools-wikititles-2014-eng-zho
    - OPUS-ccmatrix-v1-eng-zho
    - OPUS-elrc_3056_wikipedia_health-v1-eng-zho
    - OPUS-elrc_wikipedia_health-v1-eng-zho
    - OPUS-elrc_2922-v1-eng-zho
    - OPUS-eubookshop-v2-eng-zho
    - OPUS-gnome-v1-eng-zho_CN
    - OPUS-kde4-v2-eng-zho_CN
    - OPUS-linguatools_wikititles-v2014-eng-zho
    - OPUS-mdn_web_docs-v20230925-eng-zho_CN
    - OPUS-multiccaligned-v1-eng-zho_CN
    - OPUS-multiun-v1-eng-zho
    - OPUS-nllb-v1-eng-zho
    - OPUS-neulab_tedtalks-v1-eng-zho
    - OPUS-neulab_tedtalks-v1-eng-zho_CN
    - OPUS-openoffice-v3-eng_GB-zho_CN
    - OPUS-opensubtitles-v2024-eng-zho_CN
    - OPUS-php-v1-eng-zho
    - OPUS-qed-v2.0a-eng-zho
    - OPUS-spc-v1-eng-zho
    - OPUS-ted2020-v1-eng-zho
    - OPUS-ted2020-v1-eng-zho_CN
    - OPUS-tanzil-v1-eng-zho
    - OPUS-unpc-v1.0-eng-zho
    - OPUS-ubuntu-v14.10-eng-zho
    - OPUS-xlent-v1.2-eng-zho
    - OPUS-bible_uedin-v1-eng-zho
    - OPUS-infopankki-v1-eng-zho
    - OPUS-tico_19-v20201028-eng-zho
    - OPUS-tldr_pages-v20230829-eng-zho
    - OPUS-wikimedia-v20230407-eng-zho
  mono_train: *mono_zho
######### ENG-CES ###########
- id: wmt25-eng-ces
  langs: eng-ces
  train:
    - Statmt-commoncrawl_wmt13-1-ces-eng
    - Statmt-news_commentary-18.1-ces-eng
    - Statmt-wikititles-3-ces-eng
    - Statmt-europarl-10-ces-eng
    - Statmt-ccaligned-1-ces_CZ-eng
    - ParaCrawl-paracrawl-9-eng-ces
    - Tilde-eesc-2017-ces-eng
    - Tilde-ema-2016-ces-eng
    - Tilde-ecb-2017-ces-eng
    - Tilde-rapid-2019-ces-eng
    - Facebook-wikimatrix-1-ces-eng
    - Neulab-tedtalks_train-1-eng-ces
    - ELRC-information_portal_czech_president_czech_castle-1-ces-eng
    - ELRC-electronic_exchange_social_security_information-1-ces-eng
    - ELRC-euipo_2017-1-ces-eng
    - ELRC-czech_supreme_audit_office_2018_reports-1-ces-eng
    - ELRC-czech_supreme_audit_office_2008_2017_reports-1-ces-eng
    - ELRC-czech_supreme_audit_office_2003_2017_press_releases-1-ces-eng
    - ELRC-czech_supreme_audit_office_2018_press_releases-1-ces-eng
    - ELRC-emea-1-ces-eng
    - ELRC-vaccination-1-ces-eng
    - ELRC-eu_publications_medical_v2-1-ces-eng
    - ELRC-wikipedia_health-1-ces-eng
    - ELRC-antibiotic-1-ces-eng
    - ELRC-europarl_covid-1-ces-eng
    - ELRC-ec_europa_covid-1-ces-eng
    - ELRC-eur_lex_covid-1-ces-eng
    - ELRC-presscorner_covid-1-ces-eng
    - ELRC-scipar-1-ces-eng
    - ELRC-web_acquired_data_related_to_scientific_research-1-eng-ces
    - ELRC-hrw_dataset_v1-1-eng-ces
    - ELRC-cef_data_marketplace-1-eng-ces
    - EU-ecdc-1-eng-ces
    - EU-eac_forms-1-ces-eng
    - EU-eac_reference-1-ces-eng
    - EU-dcep-1-ces-eng
    - LinguaTools-wikititles-2014-ces-eng
    - OPUS-ccaligned-v1-ces-eng
    - OPUS-ccmatrix-v1-ces-eng
    - OPUS-dgt-v2019-ces-eng
    - OPUS-dgt-v4-ces-eng
    - OPUS-ecb-v1-ces-eng
    - OPUS-ecdc-v20160316-ces-eng
    - OPUS-elitr_eca-v1-ces-eng
    - OPUS-elrc_2012_euipo_2017-v1-ces-eng
    - OPUS-elrc_2404_czech_supreme_audit-v1-ces-eng
    - OPUS-elrc_2405_czech_supreme_audit-v1-ces-eng
    - OPUS-elrc_2406_czech_supreme_audit-v1-ces-eng
    - OPUS-elrc_2407_czech_supreme_audit-v1-ces-eng
    - OPUS-elrc_2713_emea-v1-ces-eng
    - OPUS-elrc_2749_vaccination-v1-ces-eng
    - OPUS-elrc_2874_eu_publications_medi-v1-ces-eng
    - OPUS-elrc_3062_wikipedia_health-v1-ces-eng
    - OPUS-elrc_3201_antibiotic-v1-ces-eng
    - OPUS-elrc_3292_europarl_covid-v1-ces-eng
    - OPUS-elrc_3463_ec_europa_covid-v1-ces-eng
    - OPUS-elrc_3564_eur_lex_covid-v1-ces-eng
    - OPUS-elrc_3605_presscorner_covid-v1-ces-eng
    - OPUS-elrc_40_information_portal_c-v1-ces-eng
    - OPUS-elrc_427_electronic_exchange_-v1-ces-eng
    - OPUS-elrc_5067_scipar-v1-ces-eng
    - OPUS-elrc_ec_europa-v1-ces-eng
    - OPUS-elrc_emea-v1-ces-eng
    - OPUS-elrc_euipo_2017-v1-ces-eng
    - OPUS-elrc_europarl_covid-v1-ces-eng
    - OPUS-elrc_eur_lex-v1-ces-eng
    - OPUS-elrc_eu_publications-v1-ces-eng
    - OPUS-elrc_information_portal-v1-ces-eng
    - OPUS-elrc_antibiotic-v1-ces-eng
    - OPUS-elrc_presscorner_covid-v1-ces-eng
    - OPUS-elrc_vaccination-v1-ces-eng
    - OPUS-elrc_wikipedia_health-v1-ces-eng
    - OPUS-elrc_2682-v1-ces-eng
    - OPUS-elrc_2922-v1-ces-eng
    - OPUS-elrc_2923-v1-ces-eng
    - OPUS-elrc_3382-v1-ces-eng
    - OPUS-emea-v3-ces-eng
    - OPUS-eubookshop-v2-ces-eng
    - OPUS-euconst-v1-ces-eng
    - OPUS-gnome-v1-ces-eng
    - OPUS-globalvoices-v2018q4-ces-eng
    - OPUS-jrc_acquis-v3.0-ces-eng
    - OPUS-kde4-v2-ces-eng
    - OPUS-multiccaligned-v1-ces-eng
    - OPUS-multiparacrawl-v7.1-ces-eng
    - OPUS-nllb-v1-ces-eng
    - OPUS-neulab_tedtalks-v1-ces-eng
    - OPUS-opensubtitles-v2024-ces-eng
    - OPUS-php-v1-ces-eng
    - OPUS-qed-v2.0a-ces-eng
    - OPUS-ted2020-v1-ces-eng
    - OPUS-tanzil-v1-ces-eng
    - OPUS-tatoeba-v20230412-ces-eng
    - OPUS-tildemodel-v2018-ces-eng
    - OPUS-ubuntu-v14.10-ces-eng
    - OPUS-wikipedia-v1.0-ces-eng
    - OPUS-xlent-v1.2-ces-eng
    - OPUS-bible_uedin-v1-ces-eng
    - OPUS-wikimedia-v20230407-ces-eng
  mono_train:
    - Statmt-news_crawl-2023-ces
    - Statmt-europarl-10-ces
    - Statmt-news_commentary-18.1-ces
    - Statmt-commoncrawl-wmt22-ces
    - Leipzig-news-2022_1m-ces
    - Leipzig-newscrawl-2019_1m-ces
    - Leipzig-wikipedia-2021_1m-ces
    - Leipzig-web_public-2019_1m-ces_CZ
    # TODO: extended common crawl (too big) https://data.statmt.org/wmt21/translation-task/cc-mono/
######### ENG-EST ###########
- id: wmt25-eng-est
  langs: eng-est
  train:
    - Statmt-europarl-7-est-eng
    - Statmt-ccaligned-1-eng-est_EE
    - ParaCrawl-paracrawl-9-eng-est
    - Tilde-eesc-2017-eng-est
    - Tilde-ema-2016-eng-est
    - Tilde-airbaltic-1-eng-est
    - Tilde-ecb-2017-eng-est
    - Tilde-rapid-2016-eng-est
    - Facebook-wikimatrix-1-eng-est
    - Neulab-tedtalks_train-1-eng-est
    - ELRC-estonian_cabinet_ministers-1-eng-est
    - ELRC-bank_estonia-1-eng-est
    - ELRC-legal_estonian_justice-1-eng-est
    - ELRC-estonian_foreign_affairs-1-eng-est
    - ELRC-parliament_estonia-1-eng-est
    - ELRC-finnish_information_bank-1-eng-est
    - ELRC-national_security_defence-1-eng-est
    - ELRC-akadeemia.ee-1-eng-est
    - ELRC-vp1992_2001.president.ee-1-eng-est
    - ELRC-vp2001_2006.president.ee-1-eng-est
    - ELRC-vp2006_2016.president.ee-1-eng-est
    - ELRC-president.ee-1-eng-est
    - ELRC-www.visitestonia.com-1-eng-est
    - ELRC-euipo_2017-1-eng-est
    - ELRC-estonian_classification_economic_activities-1-eng-est
    - ELRC-press_releases_foreign_affairs_estonia-1-eng-est
    - ELRC-emea-1-eng-est
    - ELRC-vaccination-1-eng-est
    - ELRC-eu_publications_medical_v2-1-eng-est
    - ELRC-wikipedia_health-1-eng-est
    - ELRC-antibiotic-1-eng-est
    - ELRC-europarl_covid-1-eng-est
    - ELRC-ec_europa_covid-1-eng-est
    - ELRC-www.kriis.ee-1-eng-est
    - ELRC-eur_lex_covid-1-eng-est
    - ELRC-presscorner_covid-1-eng-est
    - ELRC-nteu_tiera-1-eng-est
    - ELRC-nteu_tierb-1-eng-est
    - ELRC-scipar-1-eng-est
    - ELRC-web_acquired_data_related_to_scientific_research-1-eng-est
    - EU-ecdc-1-eng-est
    - EU-eac_forms-1-eng-est
    - EU-eac_reference-1-eng-est
    - EU-dcep-1-eng-est
    - OPUS-ccaligned-v1-eng-est
    - OPUS-ccmatrix-v1-eng-est
    - OPUS-dgt-v2019-eng-est
    - OPUS-dgt-v4-eng-est
    - OPUS-ecb-v1-eng-est
    - OPUS-ecdc-v20160316-eng-est
    - OPUS-elitr_eca-v1-eng-est
    - OPUS-elra_w0154-v1-eng-est
    - OPUS-elra_w0167-v1-eng-est
    - OPUS-elra_w0168-v1-eng-est
    - OPUS-elra_w0215-v1-eng-est
    - OPUS-elra_w0218-v1-eng-est
    - OPUS-elra_w0265-v1-eng-est
    - OPUS-elrc_1129_www.visitestonia.com-v1-eng-est
    - OPUS-elrc_2016_euipo_2017-v1-eng-est
    - OPUS-elrc_2457_estonian_classificat-v1-eng-est
    - OPUS-elrc_2461_press_releases_forei-v1-eng-est
    - OPUS-elrc_2723_emea-v1-eng-est
    - OPUS-elrc_2751_vaccination-v1-eng-est
    - OPUS-elrc_2882_eu_publications_medi-v1-eng-est
    - OPUS-elrc_3079_wikipedia_health-v1-eng-est
    - OPUS-elrc_3211_antibiotic-v1-eng-est
    - OPUS-elrc_3300_europarl_covid-v1-eng-est
    - OPUS-elrc_3471_ec_europa_covid-v1-eng-est
    - OPUS-elrc_3554_www.kriis.ee-v1-eng-est
    - OPUS-elrc_3572_eur_lex_covid-v1-eng-est
    - OPUS-elrc_3613_presscorner_covid-v1-eng-est
    - OPUS-elrc_393_estonian_cabinet_min-v1-eng-est
    - OPUS-elrc_411_bank_estonia-v1-eng-est
    - OPUS-elrc_4271_nteu_tiera-v1-eng-est
    - OPUS-elrc_429_legal_estonian_justi-v1-eng-est
    - OPUS-elrc_431_estonian_foreign_aff-v1-eng-est
    - OPUS-elrc_5067_scipar-v1-eng-est
    - OPUS-elrc_714_parliament_estonia-v1-eng-est
    - OPUS-elrc_717_finnish_information_-v1-eng-est
    - OPUS-elrc_770_national_security_de-v1-eng-est
    - OPUS-elrc_919_akadeemia.ee-v1-eng-est
    - OPUS-elrc_937_vp1992_2001.presiden-v1-eng-est
    - OPUS-elrc_938_vp2001_2006.presiden-v1-eng-est
    - OPUS-elrc_939_vp2006_2016.presiden-v1-eng-est
    - OPUS-elrc_940_president.ee-v1-eng-est
    - OPUS-elrc_ec_europa-v1-eng-est
    - OPUS-elrc_emea-v1-eng-est
    - OPUS-elrc_euipo_2017-v1-eng-est
    - OPUS-elrc_europarl_covid-v1-eng-est
    - OPUS-elrc_eur_lex-v1-eng-est
    - OPUS-elrc_eu_publications-v1-eng-est
    - OPUS-elrc_finnish_information-v1-eng-est
    - OPUS-elrc_antibiotic-v1-eng-est
    - OPUS-elrc_presscorner_covid-v1-eng-est
    - OPUS-elrc_vaccination-v1-eng-est
    - OPUS-elrc_wikipedia_health-v1-eng-est
    - OPUS-elrc_www.visitestonia.com-v1-eng-est
    - OPUS-elrc_2682-v1-eng-est
    - OPUS-elrc_2922-v1-eng-est
    - OPUS-elrc_2923-v1-eng-est
    - OPUS-elrc_3382-v1-eng-est
    - OPUS-emea-v3-eng-est
    - OPUS-eopc-v2022-eng-est
    - OPUS-eubookshop-v2-eng-est
    - OPUS-euconst-v1-eng-est
    - OPUS-europarl-v8-eng-est
    - OPUS-gnome-v1-eng-est
    - OPUS-hplt-v2-eng-est
    - OPUS-jrc_acquis-v3.0-eng-est
    - OPUS-kde4-v2-eng-est
    - OPUS-kdedoc-v1-eng_GB-est
    - OPUS-multiccaligned-v1-eng-est
    - OPUS-multihplt-v2-eng-est
    - OPUS-multiparacrawl-v7.1-eng-est
    - OPUS-nllb-v1-eng-est
    - OPUS-neulab_tedtalks-v1-eng-est
    - OPUS-opensubtitles-v2018-eng-est
    - OPUS-paracrawl-v9-eng-est
    - OPUS-qed-v2.0a-eng-est
    - OPUS-ted2020-v1-eng-est
    - OPUS-tatoeba-v20230412-eng-est
    - OPUS-tildemodel-v2018-eng-est
    - OPUS-ubuntu-v14.10-eng-est
    - OPUS-xlent-v1.2-eng-est
    - OPUS-bible_uedin-v1-eng-est
    - OPUS-infopankki-v1-eng-est
    - OPUS-wikimedia-v20230407-eng-est
  mono_train:
    - Statmt-news_crawl-2023-est
    - Leipzig-web-2015_1m-est_EE
    - Leipzig-news-2020_300k-est
    - Leipzig-newscrawl-2017_1m-est
######### ENG-ISL ###########
- id: wmt25-eng-isl
langs: eng-isl
train:
- Statmt-wikititles-3-isl-eng
- Statmt-ccaligned-1-eng-isl_IS
- ParaCrawl-paracrawl-9-eng-isl
- Tilde-eesc-2017-eng-isl
- Tilde-ema-2016-eng-isl
- Tilde-rapid-2016-eng-isl
- Facebook-wikimatrix-1-eng-isl
- ParIce-eea_train-20.05-eng-isl
- ParIce-ema_train-20.05-eng-isl
- EU-ecdc-1-eng-isl
- EU-eac_forms-1-eng-isl
- EU-eac_reference-1-eng-isl
- OPUS-ccmatrix-v1-eng-isl
- OPUS-elrc_2718_emea-v1-eng-isl
- OPUS-elrc_3206_antibiotic-v1-eng-isl
- OPUS-elrc_4295_www.malfong.is-v1-eng-isl
- OPUS-elrc_4324_government_offices_i-v1-eng-isl
- OPUS-elrc_4327_government_offices_i-v1-eng-isl
- OPUS-elrc_4334_rkiskaup_2020-v1-eng-isl
- OPUS-elrc_4338_university_iceland-v1-eng-isl
- OPUS-elrc_502_icelandic_financial_-v1-eng-isl
- OPUS-elrc_504_www.iceida.is-v1-eng-isl
- OPUS-elrc_505_www.pfs.is-v1-eng-isl
- OPUS-elrc_506_www.lanamal.is-v1-eng-isl
- OPUS-elrc_5067_scipar-v1-eng-isl
- OPUS-elrc_508_tilde_statistics_ice-v1-eng-isl
- OPUS-elrc_509_gallery_iceland-v1-eng-isl
- OPUS-elrc_510_harpa_reykjavik_conc-v1-eng-isl
- OPUS-elrc_511_bokmenntaborgin_is-v1-eng-isl
- OPUS-elrc_516_icelandic_medicines-v1-eng-isl
- OPUS-elrc_517_icelandic_directorat-v1-eng-isl
- OPUS-elrc_597_www.nordisketax.net-v1-eng-isl
- OPUS-elrc_718_statistics_iceland-v1-eng-isl
- OPUS-elrc_728_www.norden.org-v1-eng-isl
- OPUS-elrc_emea-v1-eng-isl
- OPUS-elrc_antibiotic-v1-eng-isl
- OPUS-elrc_www.norden.org-v1-eng-isl
- OPUS-elrc_www.nordisketax.net-v1-eng-isl
- OPUS-eubookshop-v2-eng-isl
- OPUS-hplt-v2-eng-isl
- OPUS-multiccaligned-v1-eng-isl
- OPUS-multihplt-v2-eng-isl
- OPUS-multiparacrawl-v7.1-eng-isl
- OPUS-opensubtitles-v2024-eng-isl
- OPUS-ted2020-v1-eng-isl
- OPUS-tatoeba-v20220303-eng-isl
- OPUS-ubuntu-v14.10-eng-isl
- OPUS-wikimatrix-v1-eng-isl
- OPUS-wikititles-v3-eng-isl
- OPUS-xlent-v1.1-eng-isl
- OPUS-wikimedia-v20210402-eng-isl
mono_train:
- Statmt-news_crawl-2023-isl
- Leipzig-web-2020_1m-isl_IS
- Leipzig-web_public-2019_1m-isl_IS
- Leipzig-news-2020_30k-isl
- Leipzig-newscrawl-2019_300k-isl
- Leipzig-wikipedia-2021_100k-isl
######### ENG-JPN ###########
- id: wmt24-eng-jpn
langs: eng-jpn
train: # TODO: add all public data
- Statmt-news_commentary-18.1-eng-jpn
- Statmt-wikititles-3-jpn-eng
- Statmt-ted-wmt20-eng-jpn
- Statmt-ccaligned-1-eng-jpn
- KECL-paracrawl-3-eng-jpn
- Facebook-wikimatrix-1-eng-jpn
- Phontron-kftt_train-1-eng-jpn
- StanfordNLP-jesc_train-1-eng-jpn
- Neulab-tedtalks_train-1-eng-jpn
- LinguaTools-wikititles-2014-eng-jpn
- OPUS-alt-v2019-eng-jpn
- OPUS-ccmatrix-v1-eng-jpn
- OPUS-eubookshop-v2-eng-jpn
- OPUS-gnome-v1-eng-jpn
- OPUS-globalvoices-v2018q4-eng-jpn
- OPUS-hplt-v2-eng-jpn
- OPUS-kde4-v2-eng-jpn
- OPUS-mdn_web_docs-v20230925-eng-jpn
- OPUS-multiccaligned-v1-eng-jpn
- OPUS-multihplt-v2-eng-jpn
- OPUS-nllb-v1-eng-jpn
- OPUS-neulab_tedtalks-v1-eng-jpn
- OPUS-openoffice-v3-eng_GB-jpn
- OPUS-opensubtitles-v2024-eng-jpn
- OPUS-php-v1-eng-jpn
- OPUS-qed-v2.0a-eng-jpn
- OPUS-ted2020-v1-eng-jpn
- OPUS-tanzil-v1-eng-jpn
- OPUS-tatoeba-v20230412-eng-jpn
- OPUS-ubuntu-v14.10-eng-jpn
- OPUS-wikimatrix-v1-eng-jpn
- OPUS-xlent-v1.2-eng-jpn
- OPUS-bible_uedin-v1-eng-jpn
- OPUS-tldr_pages-v20230829-eng-jpn
- OPUS-wikimedia-v20230407-eng-jpn
mono_train: &mono_jpn
- Statmt-news_crawl-2023-jpn
- Statmt-news_commentary-18.1-jpn
- Statmt-commoncrawl-wmt22-jpn
- Leipzig-web-2020_1m-jpn_JP
- Leipzig-comweb-2018_1m-jpn
- Leipzig-web_public-2019_1m-jpn_JP
- Leipzig-news-2020_100k-jpn
- Leipzig-newscrawl-2019_1m-jpn
- Leipzig-wikipedia-2021_1m-jpn
# TODO: Extended Common Crawl
######### ENG-KOR ###########
- id: wmt25-eng-kor
langs: eng-kor
train:
- Statmt-ccaligned-1-eng-kor_KR
- ParaCrawl-paracrawl-1_bonus-eng-kor
- Facebook-wikimatrix-1-eng-kor
- Neulab-tedtalks_train-1-eng-kor
- ELRC-wikipedia_health-1-eng-kor
- ELRC-hrw_dataset_v1-1-eng-kor
- LinguaTools-wikititles-2014-eng-kor
- OPUS-ccaligned-v1-eng-kor
- OPUS-ccmatrix-v1-eng-kor
- OPUS-elrc_3070_wikipedia_health-v1-eng-kor
- OPUS-elrc_wikipedia_health-v1-eng-kor
- OPUS-elrc_2922-v1-eng-kor
- OPUS-gnome-v1-eng-kor
- OPUS-globalvoices-v2018q4-eng-kor
- OPUS-hplt-v2-eng-kor
- OPUS-multihplt-v2-eng-kor
- OPUS-kde4-v2-eng-kor
- OPUS-linguatools_wikititles-v2014-eng-kor
- OPUS-mdn_web_docs-v20230925-eng-kor
- OPUS-multiccaligned-v1-eng-kor
- OPUS-nllb-v1-eng-kor
- OPUS-neulab_tedtalks-v1-eng-kor
- OPUS-opensubtitles-v2024-eng-kor
- OPUS-php-v1-eng-kor
- OPUS-paracrawl-v9-eng-kor
- OPUS-qed-v2.0a-eng-kor
- OPUS-ted2020-v1-eng-kor
- OPUS-tanzil-v1-eng-kor
- OPUS-tatoeba-v20230412-eng-kor
- OPUS-ubuntu-v14.10-eng-kor
- OPUS-wikimatrix-v1-eng-kor
- OPUS-xlent-v1.2-eng-kor
- OPUS-bible_uedin-v1-eng-kor
- OPUS-tldr_pages-v20230829-eng-kor
- OPUS-wikimedia-v20230407-eng-kor
mono_train:
- Statmt-news_crawl-2023-kor
- Leipzig-web-2020_1m-kor_KR
- Leipzig-news-2020_1m-kor
- Leipzig-wikipedia-2021_1m-kor
######### ENG-RUS ###########
- id: wmt25-eng-rus
langs: eng-rus
train: # TODO: add all public data
- Statmt-news_commentary-18.1-eng-rus
- Statmt-wikititles-3-rus-eng
- Statmt-ccaligned-1-eng-rus_RU
- Statmt-yandex-wmt22-eng-rus
- ParaCrawl-paracrawl-1_bonus-eng-rus
- Tilde-airbaltic-1-eng-rus
- Tilde-czechtourism-1-eng-rus
- Tilde-worldbank-1-eng-rus
- Facebook-wikimatrix-1-eng-rus
- Neulab-tedtalks_train-1-eng-rus
- ELRC-wikipedia_health-1-eng-rus
- ELRC-swps_university_social_sciences_humanities-1-eng-rus
- ELRC-scipar-1-eng-rus
- ELRC-web_acquired_data_related_to_scientific_research-1-eng-rus
- ELRC-hrw_dataset_v1-1-eng-rus
- LinguaTools-wikititles-2014-eng-rus
- OPUS-books-v1-eng-rus
- OPUS-ccaligned-v1-eng-rus
- OPUS-ccmatrix-v1-eng-rus
- OPUS-elrc_3075_wikipedia_health-v1-eng-rus
- OPUS-elrc_3855_swps_university_soci-v1-eng-rus
- OPUS-elrc_5067_scipar-v1-eng-rus
- OPUS-elrc_5183_scipar_ukraine-v1-eng-rus
- OPUS-elrc_wikipedia_health-v1-eng-rus
- OPUS-elrc_2922-v1-eng-rus
- OPUS-eubookshop-v2-eng-rus
- OPUS-gnome-v1-eng-rus
- OPUS-globalvoices-v2018q4-eng-rus
- OPUS-kde4-v2-eng-rus
- OPUS-kdedoc-v1-eng_GB-rus
- OPUS-mdn_web_docs-v20230925-eng-rus
- OPUS-multiccaligned-v1-eng-rus
- OPUS-multiparacrawl-v7.1-eng-rus
- OPUS-multiun-v1-eng-rus
- OPUS-nllb-v1-eng-rus
- OPUS-openoffice-v3-eng_GB-rus
- OPUS-opensubtitles-v2024-eng-rus
- OPUS-php-v1-eng-rus
- OPUS-qed-v2.0a-eng-rus
- OPUS-ted2020-v1-eng-rus
- OPUS-tanzil-v1-eng-rus
- OPUS-tatoeba-v20230412-eng-rus
- OPUS-unpc-v1.0-eng-rus
- OPUS-ubuntu-v14.10-eng-rus
- OPUS-wikipedia-v1.0-eng-rus
- OPUS-xlent-v1.2-eng-rus
- OPUS-ada83-v1-eng-rus
- OPUS-bible_uedin-v1-eng-rus
- OPUS-infopankki-v1-eng-rus
- OPUS-tico_19-v20201028-eng-rus
- OPUS-tldr_pages-v20230829-eng-rus
- OPUS-wikimedia-v20230407-eng-rus
mono_train: &mono_rus
- Statmt-news_crawl-2023-rus
- Statmt-news_commentary-18.1-rus
- Statmt-commoncrawl-wmt22-rus
- Leipzig-news-2022_1m-rus
- Leipzig-newscrawl_public-2018_1m-rus
- Leipzig-web-2017_1m-rus_GE
- Leipzig-wikipedia-2021_1m-rus
######### ENG-SRP ###########
- id: wmt25-eng-srp
langs: eng-srp
train:
- Statmt-ccaligned-1-eng-srp_RS
- Tilde-worldbank-1-eng-srp
- Facebook-wikimatrix-1-eng-srp
- Neulab-tedtalks_train-1-eng-srp
- ELRC-swedish_social_security-1-eng-srp
- ELRC-wikipedia_health-1-eng-srp
- OPUS-ccaligned-v1-eng-srp
- OPUS-ccmatrix-v1-eng-srp
- OPUS-elrc_3041_wikipedia_health-v1-eng-srp
- OPUS-elrc_416_swedish_social_secur-v1-eng-srp
- OPUS-elrc_wikipedia_health-v1-eng-srp
- OPUS-elrc_2922-v1-eng-srp
- OPUS-eubookshop-v2-eng-srp
- OPUS-gnome-v1-eng-srp
- OPUS-globalvoices-v2018q4-eng-srp
- OPUS-gourmet-v2-eng-srp
- OPUS-hplt-v2-eng-srp
- OPUS-kde4-v2-eng-srp
- OPUS-kdedoc-v1-eng_GB-srp
- OPUS-multiccaligned-v1-eng-srp
- OPUS-multihplt-v2-eng-srp
- OPUS-nllb-v1-eng-srp
- OPUS-neulab_tedtalks-v1-eng-srp
- OPUS-opensubtitles-v2024-eng-srp
- OPUS-qed-v2.0a-eng-srp
- OPUS-setimes-v2-eng-srp
- OPUS-tatoeba-v20230412-eng-srp
- OPUS-tildemodel-v2018-eng-srp
- OPUS-ubuntu-v14.10-eng-srp
- OPUS-wikimatrix-v1-eng-srp
- OPUS-xlent-v1.2-eng-srp
- OPUS-bible_uedin-v1-eng-srp
- OPUS-tldr_pages-v20230829-eng-srp
- OPUS-wikimedia-v20230407-eng-srp
mono_train:
- Statmt-news_crawl-2023-srp
# TODO: verify if _ME and _RS country codes are safe to mix
- Leipzig-web-2016_300k-srp_ME
- Leipzig-web-2016_1m-srp_RS
- Leipzig-news-2019_30k-srp
- Leipzig-wikipedia-2021_1m-srp
######### ENG-UKR ###########
- id: wmt25-eng-ukr
langs: eng-ukr
train: &para_eng_ukr # TODO: add all public data
- Statmt-ccaligned-1-eng-ukr_UA
- ParaCrawl-paracrawl-1_bonus-eng-ukr
- Tilde-worldbank-1-eng-ukr
- Facebook-wikimatrix-1-eng-ukr
- Neulab-tedtalks_train-1-eng-ukr
- ELRC-wikipedia_health-1-eng-ukr
- ELRC-french_polish_ukrainian-1-eng-ukr
- ELRC-acts_ukrainian-1-eng-ukr
- ELRC-official_parliament_ukraine_ukrainian_laws_en-1-eng-ukr
- ELRC-official_parliament_ukraine_abstracts_uk_laws-1-eng-ukr
- ELRC-official_parliament_ukraine_primary_legislation-1-eng-ukr
- ELRC-scipar_ukraine-1-eng-ukr
- ELRC-a_lexicon_named_entities_extracted_wikipedia-1-eng-ukr
- ELRC-ukrainian_legal_mt_test_set-1-eng-ukr
- ELRC-web_acquired_data_related_to_scientific_research-1-eng-ukr
- ELRC-hrw_dataset_v1-1-eng-ukr
- OPUS-ccaligned-v1-eng-ukr
- OPUS-ccmatrix-v1-eng-ukr
- OPUS-elrc_3043_wikipedia_health-v1-eng-ukr
- OPUS-elrc_5174_french_polish_ukrain-v1-eng-ukr
- OPUS-elrc_5179_acts_ukrainian-v1-eng-ukr
- OPUS-elrc_5180_official_parliament_-v1-eng-ukr
- OPUS-elrc_5181_official_parliament_-v1-eng-ukr
- OPUS-elrc_5182_official_parliament_-v1-eng-ukr
- OPUS-elrc_5183_scipar_ukraine-v1-eng-ukr
- OPUS-elrc_5214_a_lexicon_named-v1-eng-ukr
- OPUS-elrc_5217_ukrainian_legal_mt-v1-eng-ukr
- OPUS-elrc_wikipedia_health-v1-eng-ukr
- OPUS-elrc_2922-v1-eng-ukr
- OPUS-eubookshop-v2-eng-ukr
- OPUS-gnome-v1-eng-ukr
- OPUS-hplt-v2-eng-ukr
- OPUS-kde4-v2-eng-ukr
- OPUS-kdedoc-v1-eng_GB-ukr
- OPUS-macocu-v2-eng-ukr
- OPUS-multiccaligned-v1-eng-ukr
- OPUS-multihplt-v2-eng-ukr
- OPUS-multimacocu-v2-eng-ukr
- OPUS-nllb-v1-eng-ukr
- OPUS-neulab_tedtalks-v1-eng-ukr
- OPUS-opensubtitles-v2024-eng-ukr
- OPUS-paracrawl_bonus-v9-eng-ukr
- OPUS-qed-v2.0a-eng-ukr
- OPUS-summa-v1-eng-ukr
- OPUS-ted2020-v1-eng-ukr
- OPUS-tatoeba-v20230412-eng-ukr
- OPUS-ubuntu-v14.10-eng-ukr
- OPUS-wikimatrix-v1-eng-ukr
- OPUS-xlent-v1.2-eng-ukr
- OPUS-bible_uedin-v1-eng-ukr
- OPUS-tldr_pages-v20230829-eng-ukr
- OPUS-wikimedia-v20230407-eng-ukr
mono_train: *mono_ukr
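The dataset IDs in the recipes above appear to follow a `<group>-<name>-<version>-<lang1>-<lang2>` layout, with monolingual entries dropping the second language. The following is a minimal sketch, assuming that layout holds for every entry, of how a recipe's IDs could be split into fields and tallied by provider. The `DatasetId` type and `parse_dataset_id` helper are hypothetical names introduced for illustration; they are not part of mtdata.

```python
from collections import Counter
from typing import NamedTuple, Optional


class DatasetId(NamedTuple):
    """Fields of a dataset ID (hypothetical helper type, not an mtdata class)."""
    group: str
    name: str
    version: str
    lang1: str
    lang2: Optional[str]  # None for monolingual entries


def parse_dataset_id(did: str) -> DatasetId:
    """Split an ID on '-' into its fields.

    Assumption: no field contains a hyphen, which holds for the IDs
    listed in the recipes above but is not guaranteed in general.
    """
    parts = did.split("-")
    if len(parts) == 5:  # parallel: group-name-version-lang1-lang2
        return DatasetId(*parts)
    if len(parts) == 4:  # monolingual: group-name-version-lang
        return DatasetId(*parts, None)
    raise ValueError(f"unexpected dataset ID layout: {did!r}")


# A few IDs copied from the wmt25-eng-isl recipe.
sample = [
    "Statmt-ccaligned-1-eng-isl_IS",
    "OPUS-ccmatrix-v1-eng-isl",
    "OPUS-hplt-v2-eng-isl",
    "ParIce-eea_train-20.05-eng-isl",
    "Statmt-news_crawl-2023-isl",
]

parsed = [parse_dataset_id(d) for d in sample]
by_group = Counter(p.group for p in parsed)

# Print one line per provider with the number of datasets it contributes.
for group, count in sorted(by_group.items()):
    print(f"{group}\t{count}")
```

A tally like this can help spot how much of a recipe comes from OPUS mirrors of corpora that are also listed under their original provider, which is useful to know before deduplicating the downloaded data.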
Issues/Bugs
Please report issues and bugs via GitHub Issues at github.com/thammegowda/mtdata.