EMNLP 2025

TENTH CONFERENCE ON
MACHINE TRANSLATION (WMT25)

November 5-9, 2025
Suzhou, China
 

This document lists the WMT25 General MT task datasets for the constrained track and gives instructions for downloading them with mtdata.

MTData

Setup

pip install mtdata==0.4.3  # on Python 3.9-3.11

Recipes Config File

Config file for CONSTRAINED track:

wget https://www.statmt.org/wmt25/mtdata/mtdata.recipes.wmt25-constrained.yml

See the Constrained Task Datasets section below for the dataset IDs selected for the constrained evaluation. By default, mtdata loads files matching the glob mtdata.recipes*.yml from the current directory (the directory where mtdata commands are invoked). If the recipe YAML file is placed in a different directory, export MTDATA_RECIPES=/path/to/recipesdir.
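To see which recipe IDs a recipe file defines without invoking mtdata, the lookup described above can be approximated in plain shell. This is a minimal sketch and not part of mtdata: the helper names recipes_dir and list_recipe_ids are hypothetical, and the sed pattern assumes the top-level "- id: <name>" layout used by the WMT25 recipe file.

```shell
#!/usr/bin/env bash
# Directory searched for recipe files: $MTDATA_RECIPES if set, otherwise the
# current directory (mirrors the behaviour described in the text above).
recipes_dir() {
  printf '%s\n' "${MTDATA_RECIPES:-.}"
}

# Print the recipe IDs defined in every mtdata.recipes*.yml file found there.
# Commented-out entries ("# - id: ...") are skipped because the pattern is
# anchored at the start of the line.
list_recipe_ids() {
  local f
  for f in "$(recipes_dir)"/mtdata.recipes*.yml; do
    [ -e "$f" ] || continue          # glob matched nothing
    sed -n 's/^- id:[[:space:]]*//p' "$f"
  done
}

# Usage (after downloading the recipe file):
#   export MTDATA_RECIPES=/path/to/recipesdir
#   list_recipe_ids | grep wmt25
```

The authoritative listing is still the one produced by mtdata itself; this helper is only a quick sanity check that the file is where mtdata will look for it.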

List All Recipes

$ mtdata list-recipe -id | grep wmt25

All wmt25* IDs are loaded from the mtdata.recipes.wmt25*.yml file.

Download Recipes

Download a Recipe
# example:
mtdata get-recipe -ri wmt25-eng-ces -o wmt25-eng-ces # -h|--help for help
Download All Recipes
# Optional: download and cache all datasets in parallel
mtdata -no-pb cache -j 8 -ri "wmt25-*"
for id in wmt25-{eng-{ara,bho,ces,est,isl,kor,rus,srp,ukr,zho},ces-{ukr,deu},jpn-zho}; do
  mtdata get-recipe -i $id -o $id --compress --no-merge -j 8
done
1. mtdata stores its cache under $HOME/.mtdata by default. To change this, either export MTDATA=/path/to/cache or create a symbolic link: ln -s /path/to/cache ~/.mtdata
2. mtdata uses pigz (if available in PATH) for faster compression and decompression. We recommend installing pigz with your package manager.
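Before starting a bulk download, it can help to confirm what the loop will request and where the data will land. The sketch below is illustrative only: recipe_ids and cache_dir are hypothetical helper names, the brace expansion requires bash, and the cache location simply follows the rule stated in note 1.

```shell
#!/usr/bin/env bash
# Pre-flight check before a bulk download (illustrative only).

# Recipe IDs produced by the same brace expansion as the download loop.
recipe_ids() {
  printf '%s\n' wmt25-{eng-{ara,bho,ces,est,isl,kor,rus,srp,ukr,zho},ces-{ukr,deu},jpn-zho}
}

# Cache directory as described in note 1: $MTDATA if set, else ~/.mtdata.
cache_dir() {
  printf '%s\n' "${MTDATA:-$HOME/.mtdata}"
}

echo "Recipes to download: $(recipe_ids | wc -l | tr -d ' ')"
echo "Cache directory:     $(cache_dir)"
if command -v pigz >/dev/null 2>&1; then
  echo "pigz:                found"
else
  echo "pigz:                not found in PATH"
fi
```

Running recipe_ids on its own prints one ID per line, which makes it easy to drop a language pair from the loop before committing to the full download.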

QE Score

Participants are allowed to score parallel training data using quality estimation (QE) metrics to identify and select high-quality parallel segments. Any suitable metric for the task may be used.

We provide starter tools based on PyMarian, a fast scorer optimized for large datasets. Additionally, precomputed scores using the wmt22-cometkiwi-da model are available for convenience.

Precomputed QE Scores:
URL="https://data.statmt.org/wmt25/general-mt/wmt25-QEscores/wmt25-all.wmt22-cometkiwi-da_fp16.score.tgz"
wget "$URL"
tar -xvf $(basename "$URL")

Optionally, the following script can be used to recompute the scores.

Computing QE Scores:

# Install prebuilt pymarian (linux only)
pip install pymarian==1.12.31

metric="wmt22-cometkiwi-da"
cmd="pymarian-eval --stdin --fields src mt --workspace -8000 --model $metric --mini-batch 128"
# Tip: increase batch size for GPUs that have more memory
# Other supported QE metrics: wmt20-comet-qe-da, wmt23-cometkiwi-da-xl, wmt23-cometkiwi-da-xxl

# Fp16 is faster and uses less memory for GPUs with Tensor Cores
metric+="_fp16"  # adds "_fp16" to the file name to distinguish it from the normal one
cmd+=" --fp16"

# Test that the command works with the given args (especially batch size)
# Tip: add --debug flag to see details if the command crashes
mtdata echo Statmt-newstest_deen-2014-deu-eng | $cmd --debug

# run the command on all datasets downloaded above using "mtdata get-recipe"
for dir in ./wmt25-*; do
  langs=$(basename $dir | sed 's/wmt25-//g')
  mtdata score -l $langs -o $dir -n $metric --cmd "$cmd"
done

wmt22-cometkiwi-da is a gated model. To download it:
1. Go to huggingface.co/Unbabel/wmt22-cometkiwi-da-marian and accept the terms for the gated model.
2. Run "huggingface-cli login" and enter your token.
The model is cached locally after a successful download. See pymarian-eval --help for the cache location and other options.
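Once scores are available, selecting high-quality segments amounts to keeping the lines whose score clears a threshold. The following is a minimal sketch under stated assumptions: filter_by_qe is a hypothetical helper, the score file is assumed to hold one numeric score per line aligned with uncompressed source and target files, segments are assumed to contain no tab characters, and the threshold value is arbitrary. The actual layout of the files written by mtdata score may differ, so adapt the paths and columns accordingly.

```shell
#!/usr/bin/env bash
# filter_by_qe SRC TGT SCORES THRESHOLD OUT_PREFIX
# Keeps line i of SRC and TGT when line i of SCORES is >= THRESHOLD and
# writes the survivors to OUT_PREFIX.src and OUT_PREFIX.tgt.
filter_by_qe() {
  local src="$1" tgt="$2" scores="$3" threshold="$4" out="$5"
  # Create empty outputs up front so they exist even if nothing passes.
  : > "$out.src"
  : > "$out.tgt"
  # Join the three files column-wise (tab-separated), then filter on column 1.
  paste "$scores" "$src" "$tgt" |
    awk -F'\t' -v t="$threshold" -v out="$out" '
      $1 + 0 >= t + 0 {
        print $2 > (out ".src")
        print $3 > (out ".tgt")
      }'
}

# Usage (file names are placeholders):
#   filter_by_qe train.eng train.ces train.scores 0.8 train.filtered
```

A fixed threshold is the simplest policy; keeping the top fraction of segments per dataset is a common alternative and only changes how the cutoff is chosen.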

Parallel Data Statistics

Langs    Dataset                      Lines       Src Tokens  Tgt Tokens  Src Chars   Tgt Chars

ces-deu  OPUS                         136.54M     1.47B       1.61B       10.65B      11.42B
ces-deu  LinguaTools-wikititles-2014  2.39M       4.65M       4.28M       40.00M      40.23M
ces-deu  Tilde                        2.04M       36.30M      38.02M      288.88M     307.45M
ces-deu  Facebook-wikimatrix-1        1.60M       20.74M      22.47M      151.14M     162.61M
ces-deu  Statmt-news_commentary-18.1  244.83k     4.82M       5.45M       37.02M      41.07M
ces-deu  (Total)                      142.82M     1.54B       1.68B       11.16B      11.97B

ces-ukr  OPUS                         17.15M      138.66M     137.78M     0.97B       1.65B
ces-ukr  Facebook-wikimatrix-1        848.96k     10.43M      10.07M      75.97M      127.31M
ces-ukr  ELRC                         130.00k     2.48M       2.56M       19.61M      35.26M
ces-ukr  (Total)                      18.13M      151.57M     150.41M     1.07B       1.81B

eng-ara  OPUS                         304.22M     4.65B       4.21B       28.48B      44.10B
eng-ara  Statmt-ccaligned-1           25.31M      355.78M     343.52M     2.27B       3.58B
eng-ara  LinguaTools-wikititles-2014  4.82M       11.15M      10.91M      84.51M      129.17M
eng-ara  Facebook-wikimatrix-1        1.97M       38.55M      35.77M      242.74M     376.25M
eng-ara  Statmt-tedtalks-2_clean      341.89k     6.17M       5.41M       34.54M      54.49M
eng-ara  Statmt-news_commentary-18.1  193.67k     8.94M       11.70M      57.33M      127.15M
eng-ara  (Total)                      336.86M     5.07B       4.61B       31.17B      48.37B

eng-ces  OPUS                         237.54M     2.85B       2.48B       17.02B      17.82B
eng-ces  ParaCrawl-paracrawl-9        50.63M      692.12M     626.34M     4.33B       4.68B
eng-ces  Statmt-ccaligned-1           12.73M      148.71M     135.81M     936.99M     1.01B
eng-ces  LinguaTools-wikititles-2014  4.81M       11.36M      9.67M       83.77M      81.29M
eng-ces  Facebook-wikimatrix-1        2.09M       33.56M      29.66M      206.82M     216.62M
eng-ces  Tilde                        2.09M       42.26M      38.26M      276.52M     303.75M
eng-ces  ELRC                         1.96M       37.18M      33.00M      243.79M     262.52M
eng-ces  EU                           1.92M       34.27M      30.09M      222.84M     232.92M
eng-ces  Statmt-europarl-10           644.43k     15.63M      13.00M      94.31M      98.14M
eng-ces  Statmt-wikititles-3          410.94k     1.03M       965.62k     7.47M       7.57M
eng-ces  Statmt-news_commentary-18.1  265.37k     5.71M       5.19M       36.22M      39.81M
eng-ces  Statmt-commoncrawl_wmt13-1   161.84k     3.35M       2.93M       20.66M      20.75M
eng-ces  Neulab-tedtalks_train-1      103.09k     2.10M       1.77M       10.58M      10.39M
eng-ces  (Total)                      315.37M     3.88B       3.40B       23.49B      24.78B

eng-est  OPUS                         121.36M     1.83B       1.38B       11.18B      11.02B
eng-est  ELRC                         9.09M       201.49M     144.73M     1.29B       1.25B
eng-est  ParaCrawl-paracrawl-9        8.54M       136.60M     103.32M     846.64M     840.74M
eng-est  Statmt-ccaligned-1           4.11M       54.21M      43.28M      339.16M     338.17M
eng-est  Tilde                        2.06M       41.65M      30.28M      272.67M     271.35M
eng-est  EU                           2.03M       36.68M      26.85M      237.87M     231.57M
eng-est  Facebook-wikimatrix-1        955.55k     15.41M      11.78M      96.18M      95.33M
eng-est  Statmt-europarl-7            649.59k     15.68M      11.21M      94.64M      91.44M
eng-est  Neulab-tedtalks_train-1      10.74k      215.97k     171.65k     1.09M       1.04M
eng-est  (Total)                      148.81M     2.33B       1.76B       14.36B      14.14B

eng-isl  OPUS                         24.26M      292.15M     274.41M     1.70B       1.84B
eng-isl  ParaCrawl-paracrawl-9        2.97M       45.10M      42.66M      266.09M     292.17M
eng-isl  ParIce-eea_train-20.05       1.70M       26.75M      24.24M      170.36M     179.49M
eng-isl  Statmt-ccaligned-1           1.19M       18.63M      17.80M      115.58M     124.36M
eng-isl  Tilde                        420.71k     6.31M       6.10M       41.71M      45.26M
eng-isl  ParIce-ema_train-20.05       399.09k     6.13M       5.94M       40.41M      43.90M
eng-isl  Facebook-wikimatrix-1        313.88k     5.66M       4.77M       34.53M      34.04M
eng-isl  Statmt-wikititles-3          50.18k      98.99k      88.35k      722.24k     763.33k
eng-isl  EU                           4.72k       54.43k      52.31k      369.04k     398.50k
eng-isl  (Total)                      31.31M      400.87M     376.06M     2.37B       2.56B

eng-kor  OPUS                         138.12M     1.64B       1.31B       9.84B       12.28B
eng-kor  Statmt-ccaligned-1           9.03M       98.69M      84.80M      635.05M     744.99M
eng-kor  LinguaTools-wikititles-2014  4.83M       11.62M      9.32M       84.86M      90.51M
eng-kor  ParaCrawl-paracrawl-1_bonus  4.00M       61.96M      48.70M      371.75M     433.95M
eng-kor  Facebook-wikimatrix-1        1.35M       21.63M      15.66M      135.00M     161.17M
eng-kor  Neulab-tedtalks_train-1      205.64k     4.29M       2.97M       21.55M      26.31M
eng-kor  ELRC                         3.27k       67.72k      45.95k      424.80k     471.77k
eng-kor  (Total)                      157.54M     1.84B       1.48B       11.09B      13.74B

eng-rus  OPUS                         479.12M     7.32B       6.39B       44.88B      83.67B
eng-rus  Statmt-ccaligned-1           69.26M      0.97B       864.09M     6.18B       11.32B
eng-rus  Statmt-backtrans_ruen-wmt20  39.36M      746.47M     596.28M     4.47B       7.75B
eng-rus  LinguaTools-wikititles-2014  13.57M      33.05M      28.99M      245.88M     421.65M
eng-rus  ParaCrawl-paracrawl-1_bonus  5.38M       101.31M     80.41M      632.54M     1.06B
eng-rus  Facebook-wikimatrix-1        5.20M       86.79M      76.48M      537.73M     0.97B
eng-rus  Statmt-wikititles-3          1.19M       3.13M       2.88M       22.80M      39.34M
eng-rus  Statmt-yandex-wmt22          1.00M       21.25M      18.68M      130.99M     250.76M
eng-rus  Statmt-commoncrawl_wmt13-1   878.39k     18.77M      17.40M      116.16M     214.59M
eng-rus  Statmt-news_commentary-18.1  377.66k     8.72M       8.11M       55.68M      112.13M
eng-rus  Neulab-tedtalks_train-1      208.46k     4.37M       3.69M       21.96M      36.77M
eng-rus  ELRC                         39.50k      891.98k     792.00k     5.73M       10.87M
eng-rus  Tilde                        34.27k      752.66k     702.81k     4.83M       9.97M
eng-rus  (Total)                      615.62M     9.31B       8.09B       57.31B      105.86B

eng-srp  OPUS                         127.45M     1.33B       1.17B       7.57B       9.99B
eng-srp  Statmt-ccaligned-1           1.99M       38.73M      34.34M      235.07M     399.09M
eng-srp  Facebook-wikimatrix-1        1.21M       20.95M      18.81M      129.91M     209.19M
eng-srp  Neulab-tedtalks_train-1      136.90k     2.79M       2.38M       14.05M      14.40M
eng-srp  Tilde                        2.02k       46.81k      45.16k      303.95k     491.17k
eng-srp  ELRC                         856         14.50k      13.28k      93.28k      149.56k
eng-srp  (Total)                      130.79M     1.39B       1.22B       7.95B       10.62B

eng-ukr  OPUS                         151.87M     2.68B       2.33B       16.50B      29.37B
eng-ukr  ParaCrawl-paracrawl-1_bonus  13.35M      505.83M     487.47M     3.28B       6.04B
eng-ukr  Statmt-ccaligned-1           8.55M       119.38M     104.10M     755.38M     1.33B
eng-ukr  Facebook-wikimatrix-1        2.58M       41.55M      35.59M      257.56M     447.33M
eng-ukr  ELRC                         1.16M       16.65M      13.15M      110.37M     194.76M
eng-ukr  Neulab-tedtalks_train-1      108.50k     2.25M       1.94M       11.33M      18.45M
eng-ukr  Tilde                        1.63k       36.07k      34.18k      237.96k     477.91k
eng-ukr  (Total)                      177.62M     3.36B       2.97B       20.92B      37.40B

eng-zho  OPUS                         221.88M     3.25B       392.85M     19.99B      17.76B
eng-zho  Statmt-backtrans_enzh-wmt20  19.76M      364.22M     32.72M      2.16B       1.96B
eng-zho  Statmt-ccaligned-1           15.18M      155.93M     42.42M      1.04B       1.13B
eng-zho  ParaCrawl-paracrawl-1_bonus  14.17M      217.60M     46.40M      1.34B       1.18B
eng-zho  LinguaTools-wikititles-2014  6.66M       16.16M      7.79M       118.50M     112.12M
eng-zho  Facebook-wikimatrix-1        2.60M       49.87M      5.00M       311.07M     277.84M
eng-zho  Statmt-wikititles-3          921.96k     2.37M       973.44k     17.82M      16.28M
eng-zho  Statmt-news_commentary-18.1  442.93k     9.80M       799.74k     62.67M      55.16M
eng-zho  Neulab-tedtalks_train-1      5.54k       95.63k      23.52k      476.98k     399.81k
eng-zho  ELRC                         2.98k       91.23k      7.36k       591.36k     644.17k
eng-zho  (Total)                      281.63M     4.07B       528.99M     25.05B      22.49B

jpn-zho  OPUS                         19.74M      46.43M      46.87M      1.44B       1.08B
jpn-zho  KECL-paracrawl-2wmt24        4.60M       27.88M      29.51M      0.97B       704.98M
jpn-zho  LinguaTools-wikititles-2014  1.66M       1.97M       1.97M       35.18M      27.48M
jpn-zho  Facebook-wikimatrix-1        1.33M       2.36M       2.12M       145.10M     113.60M
jpn-zho  KECL-paracrawl-2             83.89k      552.50k     633.77k     18.86M      14.11M
jpn-zho  Neulab-tedtalks_train-1      5.16k       19.57k      22.30k      490.89k     375.98k
jpn-zho  Statmt-news_commentary-18.1  1.62k       2.59k       2.17k       272.83k     197.25k
jpn-zho  (Total)                      27.42M      79.23M      81.13M      2.61B       1.94B

Download stats: without rounding and without grouping
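Because the figures in the table are rounded and abbreviated (k, M, B), per-dataset values only approximately add up to the (Total) rows. The sketch below shows one way to expand the abbreviations and sum a column; expand_count and sum_counts are hypothetical helpers, and the suffixes are assumed to mean thousand, million and billion. For exact values, use the unrounded stats files instead.

```shell
#!/usr/bin/env bash
# expand_count VALUE: turn an abbreviated count such as 136.54M into an integer.
expand_count() {
  awk -v v="$1" 'BEGIN {
    n = v + 0                      # numeric prefix, e.g. 136.54
    s = substr(v, length(v))       # last character: k, M, B or a digit
    if (s == "k") n *= 1e3
    else if (s == "M") n *= 1e6
    else if (s == "B") n *= 1e9
    printf "%.0f\n", n
  }'
}

# sum_counts VALUE...: add up any number of abbreviated counts.
sum_counts() {
  local total=0 c
  for c in "$@"; do
    total=$(( total + $(expand_count "$c") ))
  done
  echo "$total"
}

# ces-deu line counts from the table; the table reports a total of 142.82M.
sum_counts 136.54M 2.39M 2.04M 1.60M 244.83k   # prints 142814830
```

The small gap between 142814830 and the reported 142.82M comes from rounding each row to two decimals before summing.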

Monolingual Data Statistics

Lang         Dataset                               Lines     Tokens    Chars

ces-deu/deu  Leipzig-news-2022_30k-deu             30.00k    464.10k   3.33M
ces-deu/deu  Leipzig-web-2021_100k-deu_DE          100.00k   1.53M     11.64M
ces-deu/deu  Statmt-news_commentary-18.1-deu       507.81k   11.03M    83.23M
ces-deu/deu  Leipzig-mixed_typical-2011_1m-deu     999.93k   6.60M     48.91M
ces-deu/deu  Leipzig-wikipedia-2021_1m-deu         1.00M     15.36M    110.63M
ces-deu/deu  Leipzig-comweb-2021_1m-deu            1.00M     15.83M    115.76M
ces-deu/deu  Leipzig-newscrawl-2020_1m-deu         1.00M     15.30M    108.98M
ces-deu/deu  Statmt-europarl-10-deu                2.11M     44.85M    330.07M
ces-deu/deu  Statmt-news_crawl-2023-deu            38.36M    894.96M   6.42B
ces-deu/deu  Statmt-commoncrawl-wmt22-deu          2.87B     53.65B    392.80B
ces-deu/deu  (Total)                               2.92B     54.66B    400.03B

ces-ukr/ukr  Statmt-news_crawl-2023-ukr            620.75k   13.37M    169.70M
ces-ukr/ukr  Leipzig-newscrawl-2018_1m-ukr         1.00M     14.82M    191.59M
ces-ukr/ukr  Leipzig-wikipedia-2021_1m-ukr         1.00M     14.56M    185.20M
ces-ukr/ukr  Leipzig-web-2019_1m-ukr_UA            1.00M     14.96M    199.08M
ces-ukr/ukr  Leipzig-news-2022_1m-ukr              1.00M     13.59M    174.03M
ces-ukr/ukr  LangUk-fiction-1-ukr                  1.81M     18.32M    198.26M
ces-ukr/ukr  LangUk-wiki_dump-1-ukr                15.79M    185.65M   2.34B
ces-ukr/ukr  LangUk-laws-1-ukr                     29.21M    578.99M   7.69B
ces-ukr/ukr  LangUk-news-1-ukr                     31.02M    461.45M   5.94B
ces-ukr/ukr  LangUk-ubercorpus-1-ukr               48.62M    665.42M   8.48B
ces-ukr/ukr  (Total)                               131.07M   1.98B     25.57B

eng-ara/ara  Statmt-news_commentary-18.1-ara       211.77k   12.74M    138.45M
eng-ara/ara  Leipzig-news-2020_1m-ara              1.00M     22.67M    247.86M
eng-ara/ara  Leipzig-wikipedia-2021_1m-ara         1.00M     16.55M    175.09M
eng-ara/ara  Statmt-news_crawl-2023-ara            21.67M    569.84M   6.31B
eng-ara/ara  (Total)                               23.88M    621.80M   6.87B

eng-ces/ces  Statmt-news_commentary-18.1-ces       288.07k   5.55M     42.60M
eng-ces/ces  Statmt-europarl-10-ces                669.67k   13.20M    99.66M
eng-ces/ces  Leipzig-newscrawl-2019_1m-ces         1.00M     13.11M    94.72M
eng-ces/ces  Leipzig-news-2022_1m-ces              1.00M     15.19M    109.49M
eng-ces/ces  Leipzig-web_public-2019_1m-ces_CZ     1.00M     14.62M    104.97M
eng-ces/ces  Leipzig-wikipedia-2021_1m-ces         1.00M     14.68M    107.05M
eng-ces/ces  Statmt-news_crawl-2023-ces            9.04M     192.03M   1.37B
eng-ces/ces  Statmt-commoncrawl-wmt22-ces          333.48M   5.30B     39.19B
eng-ces/ces  (Total)                               347.48M   5.57B     41.12B

eng-est/est  Leipzig-news-2020_300k-est            300.00k   4.42M     33.88M
eng-est/est  Leipzig-newscrawl-2017_1m-est         1.00M     14.86M    115.28M
eng-est/est  Leipzig-web-2015_1m-est_EE            1.00M     14.39M    107.60M
eng-est/est  Statmt-news_crawl-2023-est            1.36M     19.22M    148.74M
eng-est/est  (Total)                               3.66M     52.90M    405.49M

eng-isl/isl  Leipzig-news-2020_30k-isl             30.00k    524.84k   3.70M
eng-isl/isl  Leipzig-wikipedia-2021_100k-isl       100.00k   1.54M     10.74M
eng-isl/isl  Leipzig-newscrawl-2019_300k-isl       300.00k   5.28M     37.03M
eng-isl/isl  Leipzig-web-2020_1m-isl_IS            1.00M     16.64M    113.42M
eng-isl/isl  Leipzig-web_public-2019_1m-isl_IS     1.00M     16.57M    113.62M
eng-isl/isl  Statmt-news_crawl-2023-isl            1.71M     28.23M    192.88M
eng-isl/isl  (Total)                               4.14M     68.78M    471.39M

eng-kor/kor  Leipzig-web-2020_1m-kor_KR            1.00M     15.78M    170.12M
eng-kor/kor  Leipzig-wikipedia-2021_1m-kor         1.00M     13.71M    144.15M
eng-kor/kor  Leipzig-news-2020_1m-kor              1.00M     15.20M    160.73M
eng-kor/kor  Statmt-news_crawl-2023-kor            4.00M     53.43M    552.07M
eng-kor/kor  (Total)                               7.00M     98.13M    1.03B

eng-rus/rus  Statmt-news_commentary-18.1-rus       449.48k   8.91M     123.24M
eng-rus/rus  Leipzig-web-2017_1m-rus_GE            1.00M     15.17M    196.18M
eng-rus/rus  Leipzig-newscrawl_public-2018_1m-rus  1.00M     14.44M    190.27M
eng-rus/rus  Leipzig-news-2022_1m-rus              1.00M     14.43M    190.39M
eng-rus/rus  Leipzig-wikipedia-2021_1m-rus         1.00M     13.94M    182.16M
eng-rus/rus  Statmt-news_crawl-2023-rus            22.61M    462.59M   6.08B
eng-rus/rus  Statmt-commoncrawl-wmt22-rus          1.17B     18.88B    239.86B
eng-rus/rus  (Total)                               1.20B     19.41B    246.82B

eng-srp/srp  Leipzig-news-2019_30k-srp             30.00k    541.17k   6.14M
eng-srp/srp  Leipzig-web-2016_300k-srp_ME          300.00k   5.91M     71.32M
eng-srp/srp  Leipzig-web-2016_1m-srp_RS            1.00M     17.89M    206.37M
eng-srp/srp  Leipzig-wikipedia-2021_1m-srp         1.00M     15.18M    175.19M
eng-srp/srp  Statmt-news_crawl-2023-srp            15.51M    374.90M   2.69B
eng-srp/srp  (Total)                               17.84M    414.41M   3.15B

eng-ukr/ukr  Statmt-news_crawl-2023-ukr            620.75k   13.37M    169.70M
eng-ukr/ukr  Leipzig-newscrawl-2018_1m-ukr         1.00M     14.82M    191.59M
eng-ukr/ukr  Leipzig-wikipedia-2021_1m-ukr         1.00M     14.56M    185.20M
eng-ukr/ukr  Leipzig-web-2019_1m-ukr_UA            1.00M     14.96M    199.08M
eng-ukr/ukr  Leipzig-news-2022_1m-ukr              1.00M     13.59M    174.03M
eng-ukr/ukr  LangUk-fiction-1-ukr                  1.81M     18.32M    198.26M
eng-ukr/ukr  LangUk-wiki_dump-1-ukr                15.79M    185.65M   2.34B
eng-ukr/ukr  LangUk-laws-1-ukr                     29.21M    578.99M   7.69B
eng-ukr/ukr  LangUk-news-1-ukr                     31.02M    461.45M   5.94B
eng-ukr/ukr  LangUk-ubercorpus-1-ukr               48.62M    665.42M   8.48B
eng-ukr/ukr  (Total)                               131.07M   1.98B     25.57B

eng-zho/zho  Leipzig-news-2020_300k-zho            300.00k   344.32k   42.51M
eng-zho/zho  Statmt-news_commentary-18.1-zho       541.52k   947.43k   65.61M
eng-zho/zho  Leipzig-wikipedia-2018_1m-zho         1.00M     1.45M     106.70M
eng-zho/zho  Leipzig-tradnewscrawl-2011_1m-zho     1.00M     1.46M     160.15M
eng-zho/zho  Leipzig-web-2016_1m-zho_MO            1.00M     1.26M     194.08M
eng-zho/zho  Statmt-news_crawl-2023-zho            5.53M     10.19M    1.04B
eng-zho/zho  Statmt-commoncrawl-wmt22-zho          1.67B     3.39B     131.85B
eng-zho/zho  (Total)                               1.68B     3.41B     133.45B

jpn-zho/zho  Leipzig-news-2020_300k-zho            300.00k   344.32k   42.51M
jpn-zho/zho  Statmt-news_commentary-18.1-zho       541.52k   947.43k   65.61M
jpn-zho/zho  Leipzig-wikipedia-2018_1m-zho         1.00M     1.45M     106.70M
jpn-zho/zho  Leipzig-tradnewscrawl-2011_1m-zho     1.00M     1.46M     160.15M
jpn-zho/zho  Leipzig-web-2016_1m-zho_MO            1.00M     1.26M     194.08M
jpn-zho/zho  Statmt-news_crawl-2023-zho            5.53M     10.19M    1.04B
jpn-zho/zho  Statmt-commoncrawl-wmt22-zho          1.67B     3.39B     131.85B
jpn-zho/zho  (Total)                               1.68B     3.41B     133.45B

Download stats: without rounding

Constrained Task Datasets

The dataset IDs selected for the constrained task are as follows:

# Setup: pip install mtdata==0.4.3
# To list all the available datasets, use the following commands
#   mtdata list -id -l <lang1>-<lang2>   # parallel
#   mtdata list -id -l <lang>            # monolingual
# To get a dataset
#   mtdata echo <data_id>
########## CES-UKR #########
- id: wmt25-ces-ukr
  langs: ces-ukr
  train:
    - Facebook-wikimatrix-1-ces-ukr
    - ELRC-acts_ukrainian-1-ces-ukr
    - OPUS-ccmatrix-v1-ces-ukr
    - OPUS-elrc_5179_acts_ukrainian-v1-ces-ukr
    - OPUS-elrc_wikipedia_health-v1-ces-ukr
    - OPUS-eubookshop-v2-ces-ukr
    - OPUS-gnome-v1-ces-ukr
    - OPUS-kde4-v2-ces-ukr
    - OPUS-multiccaligned-v1.1-ces-ukr
    - OPUS-multiparacrawl-v9b-ces-ukr
    - OPUS-opensubtitles-v2024-ces-ukr
    - OPUS-qed-v2.0a-ces-ukr
    - OPUS-ted2020-v1-ces-ukr
    - OPUS-tatoeba-v20220303-ces-ukr
    - OPUS-ubuntu-v14.10-ces-ukr
    - OPUS-xlent-v1.1-ces-ukr
    - OPUS-bible_uedin-v1-ces-ukr
    - OPUS-wikimedia-v20210402-ces-ukr
  mono_train: &mono_ukr
    - Statmt-news_crawl-2023-ukr
    - LangUk-news-1-ukr
    - LangUk-wiki_dump-1-ukr
    - LangUk-fiction-1-ukr
    - LangUk-ubercorpus-1-ukr
    - LangUk-laws-1-ukr
    - Leipzig-news-2022_1m-ukr
    - Leipzig-newscrawl-2018_1m-ukr
    - Leipzig-web-2019_1m-ukr_UA
    - Leipzig-wikipedia-2021_1m-ukr

###############CES-DEU########################
- id: wmt25-ces-deu
  langs: ces-deu
  train:
    - Statmt-news_commentary-18.1-ces-deu
    - Tilde-eesc-2017-ces-deu
    - Tilde-ema-2016-ces-deu
    - Tilde-ecb-2017-ces-deu
    - Tilde-rapid-2016-ces-deu
    - Facebook-wikimatrix-1-ces-deu
    - LinguaTools-wikititles-2014-ces-deu
    - OPUS-ccmatrix-v1-ces-deu
    - OPUS-dgt-v2019-ces-deu
    - OPUS-dgt-v4-ces-deu
    - OPUS-ecb-v1-ces-deu
    - OPUS-ecdc-v20160316-ces-deu
    - OPUS-elitr_eca-v1-ces-deu
    - OPUS-elrc_417_swedish_work_environ-v1-ces-deu
    - OPUS-elrc_ec_europa-v1-ces-deu
    - OPUS-elrc_emea-v1-ces-deu
    - OPUS-elrc_euipo_2017-v1-ces-deu
    - OPUS-elrc_europarl_covid-v1-ces-deu
    - OPUS-elrc_eur_lex-v1-ces-deu
    - OPUS-elrc_eu_publications-v1-ces-deu
    - OPUS-elrc_information_portal-v1-ces-deu
    - OPUS-elrc_antibiotic-v1-ces-deu
    - OPUS-elrc_presscorner_covid-v1-ces-deu
    - OPUS-elrc_vaccination-v1-ces-deu
    - OPUS-elrc_wikipedia_health-v1-ces-deu
    - OPUS-emea-v3-ces-deu
    - OPUS-eubookshop-v2-ces-deu
    - OPUS-euconst-v1-ces-deu
    - OPUS-europarl-v8-ces-deu
    - OPUS-gnome-v1-ces-deu
    - OPUS-globalvoices-v2018q4-ces-deu
    - OPUS-jrc_acquis-v3.0-ces-deu
    - OPUS-kde4-v2-ces-deu
    - OPUS-multiccaligned-v1.1-ces-deu
    - OPUS-multiparacrawl-v9b-ces-deu
    - OPUS-nllb-v1-ces-deu
    - OPUS-neulab_tedtalks-v1-ces-deu
    - OPUS-opensubtitles-v2024-ces-deu
    - OPUS-php-v1-ces-deu
    - OPUS-qed-v2.0a-ces-deu
    - OPUS-ted2020-v1-ces-deu
    - OPUS-tanzil-v1-ces-deu
    - OPUS-tatoeba-v20230412-ces-deu
    - OPUS-tildemodel-v2018-ces-deu
    - OPUS-ubuntu-v14.10-ces-deu
    #- OPUS-wikimatrix-v1-ces-deu   # already added from source
    - OPUS-xlent-v1.2-ces-deu
    - OPUS-bible_uedin-v1-ces-deu
    - OPUS-wikimedia-v20230407-ces-deu
  mono_train: &mono_deu
    - Statmt-news_crawl-2023-deu
    - Statmt-europarl-10-deu
    - Statmt-news_commentary-18.1-deu
    - Statmt-commoncrawl-wmt22-deu
    - Leipzig-wikipedia-2021_1m-deu
    - Leipzig-comweb-2021_1m-deu
    - Leipzig-mixed_typical-2011_1m-deu
    - Leipzig-news-2022_30k-deu
    - Leipzig-newscrawl-2020_1m-deu
    - Leipzig-web-2021_100k-deu_DE
    # TODO: extended common crawl

########## JPN-ZHO ##########
- id: wmt25-jpn-zho
  langs: jpn-zho
  train:
    - Statmt-news_commentary-18.1-jpn-zho
    - KECL-paracrawl-2-zho-jpn
    - KECL-paracrawl-2wmt24-zho-jpn
    - Facebook-wikimatrix-1-jpn-zho
    - Neulab-tedtalks_train-1-jpn-zho
    - LinguaTools-wikititles-2014-jpn-zho
    - OPUS-ccmatrix-v1-jpn-zho
    - OPUS-gnome-v1-jpn-zho_CN
    - OPUS-kde4-v2-jpn-zho_CN
    - OPUS-multiccaligned-v1-jpn-zho_CN
    - OPUS-openoffice-v3-jpn-zho_CN
    - OPUS-opensubtitles-v2024-jpn-zho_CN
    - OPUS-php-v1-jpn-zho
    - OPUS-qed-v2.0a-jpn-zho
    - OPUS-ted2020-v1-jpn-zho
    - OPUS-tanzil-v1-jpn-zho
    - OPUS-ubuntu-v14.10-jpn-zho
    - OPUS-ubuntu-v14.10-jpn-zho_CN
    - OPUS-xlent-v1.1-jpn-zho
    - OPUS-bible_uedin-v1-jpn-zho
    - OPUS-wikimedia-v20210402-jpn-zho

  mono_train: &mono_zho
    - Statmt-news_crawl-2023-zho
    - Statmt-news_commentary-18.1-zho
    - Statmt-commoncrawl-wmt22-zho
    - Leipzig-wikipedia-2018_1m-zho
    - Leipzig-web-2016_1m-zho_MO
    - Leipzig-tradnewscrawl-2011_1m-zho
    - Leipzig-news-2020_300k-zho
    # TODO: extended common crawl (too big) https://data.statmt.org/wmt21/translation-task/cc-mono/

######### ENG-BHO ###########
- id: wmt25-eng-bho
  langs: eng-bho
  train:
    - OPUS-nllb-v1-bho-eng
    - OPUS-tatoeba-v20230412-bho-eng
    - OPUS-ubuntu-v14.10-bho-eng
    - OPUS-wikimedia-v20230407-bho-eng
# mono_train: &mono_bho
# NOTE: did not find any monolingual data for bho in mtdata or OPUS

######### ENG-MAS ###########
# TODO: did not find any parallel data for mas-eng in mtdata or OPUS
# - id: wmt25-eng-mas
#  langs: eng-mas
#  train:
#  mono_train: &mono_mas

######### ENG-ARA ###########
- id: wmt25-eng-ara
  langs: eng-ara
  train:
    - Statmt-news_commentary-18.1-ara-eng
    - Statmt-tedtalks-2_clean-eng-ara
    - Statmt-ccaligned-1-ara_AR-eng
    - Facebook-wikimatrix-1-ara-eng
    - LinguaTools-wikititles-2014-ara-eng
    - OPUS-ccaligned-v1-ara-eng
    - OPUS-ccmatrix-v1-ara-eng
    - OPUS-elrc_3083_wikipedia_health-v1-ara-eng
    - OPUS-elrc_wikipedia_health-v1-ara-eng
    - OPUS-elrc_2922-v1-ara-eng
    - OPUS-eubookshop-v2-ara-eng
    - OPUS-gnome-v1-ara-eng
    - OPUS-globalvoices-v2018q4-ara-eng
    - OPUS-hplt-v2-ara-eng
    - OPUS-kde4-v2-ara-eng
    - OPUS-linguatools_wikititles-v2014-ara-eng
    - OPUS-multiccaligned-v1-ara-eng
    - OPUS-multihplt-v2-ara-eng
    - OPUS-multiun-v1-ara-eng
    - OPUS-nllb-v1-ara-eng
    - OPUS-opensubtitles-v2024-ara-eng
    - OPUS-qed-v2.0a-ara-eng
    - OPUS-ted2020-v1-ara-eng
    - OPUS-tatoeba-v20230412-ara-eng
    - OPUS-unpc-v1.0-ara-eng
    - OPUS-ubuntu-v14.10-ara-eng
    - OPUS-wikimatrix-v1-ara-eng
    - OPUS-wikipedia-v1.0-ara-eng
    - OPUS-xlent-v1.2-ara-eng
    - OPUS-bible_uedin-v1-ara-eng
    - OPUS-infopankki-v1-ara-eng
    - OPUS-tico_19-v20201028-ara-eng
    - OPUS-tldr_pages-v20230829-ara-eng
    - OPUS-wikimedia-v20230407-ara-eng
  mono_train:
    - Statmt-news_crawl-2023-ara
    - Statmt-news_commentary-18.1-ara
    - Leipzig-news-2020_1m-ara
    - Leipzig-wikipedia-2021_1m-ara

######### ENG-ZHO ###########
- id: wmt25-eng-zho
  langs: eng-zho
  train:  # TODO: add all public data
    - Statmt-news_commentary-18.1-eng-zho
    - Statmt-wikititles-3-zho-eng
    - Statmt-ccaligned-1-eng-zho_CN
    - ParaCrawl-paracrawl-1_bonus-eng-zho
    - Facebook-wikimatrix-1-eng-zho
    - Neulab-tedtalks_train-1-eng-zho
    - ELRC-wikipedia_health-1-eng-zho
    - ELRC-hrw_dataset_v1-1-eng-zho
    - LinguaTools-wikititles-2014-eng-zho
    - OPUS-ccmatrix-v1-eng-zho
    - OPUS-elrc_3056_wikipedia_health-v1-eng-zho
    - OPUS-elrc_wikipedia_health-v1-eng-zho
    - OPUS-elrc_2922-v1-eng-zho
    - OPUS-eubookshop-v2-eng-zho
    - OPUS-gnome-v1-eng-zho_CN
    - OPUS-kde4-v2-eng-zho_CN
    - OPUS-linguatools_wikititles-v2014-eng-zho
    - OPUS-mdn_web_docs-v20230925-eng-zho_CN
    - OPUS-multiccaligned-v1-eng-zho_CN
    - OPUS-multiun-v1-eng-zho
    - OPUS-nllb-v1-eng-zho
    - OPUS-neulab_tedtalks-v1-eng-zho
    - OPUS-neulab_tedtalks-v1-eng-zho_CN
    - OPUS-openoffice-v3-eng_GB-zho_CN
    - OPUS-opensubtitles-v2024-eng-zho_CN
    - OPUS-php-v1-eng-zho
    - OPUS-qed-v2.0a-eng-zho
    - OPUS-spc-v1-eng-zho
    - OPUS-ted2020-v1-eng-zho
    - OPUS-ted2020-v1-eng-zho_CN
    - OPUS-tanzil-v1-eng-zho
    - OPUS-unpc-v1.0-eng-zho
    - OPUS-ubuntu-v14.10-eng-zho
    - OPUS-xlent-v1.2-eng-zho
    - OPUS-bible_uedin-v1-eng-zho
    - OPUS-infopankki-v1-eng-zho
    - OPUS-tico_19-v20201028-eng-zho
    - OPUS-tldr_pages-v20230829-eng-zho
    - OPUS-wikimedia-v20230407-eng-zho
  mono_train: *mono_zho

######### ENG-CES ###########
- id: wmt25-eng-ces
  langs: eng-ces
  train:
    - Statmt-commoncrawl_wmt13-1-ces-eng
    - Statmt-news_commentary-18.1-ces-eng
    - Statmt-wikititles-3-ces-eng
    - Statmt-europarl-10-ces-eng
    - Statmt-ccaligned-1-ces_CZ-eng
    - ParaCrawl-paracrawl-9-eng-ces
    - Tilde-eesc-2017-ces-eng
    - Tilde-ema-2016-ces-eng
    - Tilde-ecb-2017-ces-eng
    - Tilde-rapid-2019-ces-eng
    - Facebook-wikimatrix-1-ces-eng
    - Neulab-tedtalks_train-1-eng-ces
    - ELRC-information_portal_czech_president_czech_castle-1-ces-eng
    - ELRC-electronic_exchange_social_security_information-1-ces-eng
    - ELRC-euipo_2017-1-ces-eng
    - ELRC-czech_supreme_audit_office_2018_reports-1-ces-eng
    - ELRC-czech_supreme_audit_office_2008_2017_reports-1-ces-eng
    - ELRC-czech_supreme_audit_office_2003_2017_press_releases-1-ces-eng
    - ELRC-czech_supreme_audit_office_2018_press_releases-1-ces-eng
    - ELRC-emea-1-ces-eng
    - ELRC-vaccination-1-ces-eng
    - ELRC-eu_publications_medical_v2-1-ces-eng
    - ELRC-wikipedia_health-1-ces-eng
    - ELRC-antibiotic-1-ces-eng
    - ELRC-europarl_covid-1-ces-eng
    - ELRC-ec_europa_covid-1-ces-eng
    - ELRC-eur_lex_covid-1-ces-eng
    - ELRC-presscorner_covid-1-ces-eng
    - ELRC-scipar-1-ces-eng
    - ELRC-web_acquired_data_related_to_scientific_research-1-eng-ces
    - ELRC-hrw_dataset_v1-1-eng-ces
    - ELRC-cef_data_marketplace-1-eng-ces
    - EU-ecdc-1-eng-ces
    - EU-eac_forms-1-ces-eng
    - EU-eac_reference-1-ces-eng
    - EU-dcep-1-ces-eng
    - LinguaTools-wikititles-2014-ces-eng
    - OPUS-ccaligned-v1-ces-eng
    - OPUS-ccmatrix-v1-ces-eng
    - OPUS-dgt-v2019-ces-eng
    - OPUS-dgt-v4-ces-eng
    - OPUS-ecb-v1-ces-eng
    - OPUS-ecdc-v20160316-ces-eng
    - OPUS-elitr_eca-v1-ces-eng
    - OPUS-elrc_2012_euipo_2017-v1-ces-eng
    - OPUS-elrc_2404_czech_supreme_audit-v1-ces-eng
    - OPUS-elrc_2405_czech_supreme_audit-v1-ces-eng
    - OPUS-elrc_2406_czech_supreme_audit-v1-ces-eng
    - OPUS-elrc_2407_czech_supreme_audit-v1-ces-eng
    - OPUS-elrc_2713_emea-v1-ces-eng
    - OPUS-elrc_2749_vaccination-v1-ces-eng
    - OPUS-elrc_2874_eu_publications_medi-v1-ces-eng
    - OPUS-elrc_3062_wikipedia_health-v1-ces-eng
    - OPUS-elrc_3201_antibiotic-v1-ces-eng
    - OPUS-elrc_3292_europarl_covid-v1-ces-eng
    - OPUS-elrc_3463_ec_europa_covid-v1-ces-eng
    - OPUS-elrc_3564_eur_lex_covid-v1-ces-eng
    - OPUS-elrc_3605_presscorner_covid-v1-ces-eng
    - OPUS-elrc_40_information_portal_c-v1-ces-eng
    - OPUS-elrc_427_electronic_exchange_-v1-ces-eng
    - OPUS-elrc_5067_scipar-v1-ces-eng
    - OPUS-elrc_ec_europa-v1-ces-eng
    - OPUS-elrc_emea-v1-ces-eng
    - OPUS-elrc_euipo_2017-v1-ces-eng
    - OPUS-elrc_europarl_covid-v1-ces-eng
    - OPUS-elrc_eur_lex-v1-ces-eng
    - OPUS-elrc_eu_publications-v1-ces-eng
    - OPUS-elrc_information_portal-v1-ces-eng
    - OPUS-elrc_antibiotic-v1-ces-eng
    - OPUS-elrc_presscorner_covid-v1-ces-eng
    - OPUS-elrc_vaccination-v1-ces-eng
    - OPUS-elrc_wikipedia_health-v1-ces-eng
    - OPUS-elrc_2682-v1-ces-eng
    - OPUS-elrc_2922-v1-ces-eng
    - OPUS-elrc_2923-v1-ces-eng
    - OPUS-elrc_3382-v1-ces-eng
    - OPUS-emea-v3-ces-eng
    - OPUS-eubookshop-v2-ces-eng
    - OPUS-euconst-v1-ces-eng
    - OPUS-gnome-v1-ces-eng
    - OPUS-globalvoices-v2018q4-ces-eng
    - OPUS-jrc_acquis-v3.0-ces-eng
    - OPUS-kde4-v2-ces-eng
    - OPUS-multiccaligned-v1-ces-eng
    - OPUS-multiparacrawl-v7.1-ces-eng
    - OPUS-nllb-v1-ces-eng
    - OPUS-neulab_tedtalks-v1-ces-eng
    - OPUS-opensubtitles-v2024-ces-eng
    - OPUS-php-v1-ces-eng
    - OPUS-qed-v2.0a-ces-eng
    - OPUS-ted2020-v1-ces-eng
    - OPUS-tanzil-v1-ces-eng
    - OPUS-tatoeba-v20230412-ces-eng
    - OPUS-tildemodel-v2018-ces-eng
    - OPUS-ubuntu-v14.10-ces-eng
    - OPUS-wikipedia-v1.0-ces-eng
    - OPUS-xlent-v1.2-ces-eng
    - OPUS-bible_uedin-v1-ces-eng
    - OPUS-wikimedia-v20230407-ces-eng
  mono_train:
    - Statmt-news_crawl-2023-ces
    - Statmt-europarl-10-ces
    - Statmt-news_commentary-18.1-ces
    - Statmt-commoncrawl-wmt22-ces
    - Leipzig-news-2022_1m-ces
    - Leipzig-newscrawl-2019_1m-ces
    - Leipzig-wikipedia-2021_1m-ces
    - Leipzig-web_public-2019_1m-ces_CZ
    # TODO: extended common crawl (too big) https://data.statmt.org/wmt21/translation-task/cc-mono/

######### ENG-EST ###########
- id: wmt25-eng-est
  langs: eng-est
  train:
  - Statmt-europarl-7-est-eng
  - Statmt-ccaligned-1-eng-est_EE
  - ParaCrawl-paracrawl-9-eng-est
  - Tilde-eesc-2017-eng-est
  - Tilde-ema-2016-eng-est
  - Tilde-airbaltic-1-eng-est
  - Tilde-ecb-2017-eng-est
  - Tilde-rapid-2016-eng-est
  - Facebook-wikimatrix-1-eng-est
  - Neulab-tedtalks_train-1-eng-est
  - ELRC-estonian_cabinet_ministers-1-eng-est
  - ELRC-bank_estonia-1-eng-est
  - ELRC-legal_estonian_justice-1-eng-est
  - ELRC-estonian_foreign_affairs-1-eng-est
  - ELRC-parliament_estonia-1-eng-est
  - ELRC-finnish_information_bank-1-eng-est
  - ELRC-national_security_defence-1-eng-est
  - ELRC-akadeemia.ee-1-eng-est
  - ELRC-vp1992_2001.president.ee-1-eng-est
  - ELRC-vp2001_2006.president.ee-1-eng-est
  - ELRC-vp2006_2016.president.ee-1-eng-est
  - ELRC-president.ee-1-eng-est
  - ELRC-www.visitestonia.com-1-eng-est
  - ELRC-euipo_2017-1-eng-est
  - ELRC-estonian_classification_economic_activities-1-eng-est
  - ELRC-press_releases_foreign_affairs_estonia-1-eng-est
  - ELRC-emea-1-eng-est
  - ELRC-vaccination-1-eng-est
  - ELRC-eu_publications_medical_v2-1-eng-est
  - ELRC-wikipedia_health-1-eng-est
  - ELRC-antibiotic-1-eng-est
  - ELRC-europarl_covid-1-eng-est
  - ELRC-ec_europa_covid-1-eng-est
  - ELRC-www.kriis.ee-1-eng-est
  - ELRC-eur_lex_covid-1-eng-est
  - ELRC-presscorner_covid-1-eng-est
  - ELRC-nteu_tiera-1-eng-est
  - ELRC-nteu_tierb-1-eng-est
  - ELRC-scipar-1-eng-est
  - ELRC-web_acquired_data_related_to_scientific_research-1-eng-est
  - EU-ecdc-1-eng-est
  - EU-eac_forms-1-eng-est
  - EU-eac_reference-1-eng-est
  - EU-dcep-1-eng-est
  - OPUS-ccaligned-v1-eng-est
  - OPUS-ccmatrix-v1-eng-est
  - OPUS-dgt-v2019-eng-est
  - OPUS-dgt-v4-eng-est
  - OPUS-ecb-v1-eng-est
  - OPUS-ecdc-v20160316-eng-est
  - OPUS-elitr_eca-v1-eng-est
  - OPUS-elra_w0154-v1-eng-est
  - OPUS-elra_w0167-v1-eng-est
  - OPUS-elra_w0168-v1-eng-est
  - OPUS-elra_w0215-v1-eng-est
  - OPUS-elra_w0218-v1-eng-est
  - OPUS-elra_w0265-v1-eng-est
  - OPUS-elrc_1129_www.visitestonia.com-v1-eng-est
  - OPUS-elrc_2016_euipo_2017-v1-eng-est
  - OPUS-elrc_2457_estonian_classificat-v1-eng-est
  - OPUS-elrc_2461_press_releases_forei-v1-eng-est
  - OPUS-elrc_2723_emea-v1-eng-est
  - OPUS-elrc_2751_vaccination-v1-eng-est
  - OPUS-elrc_2882_eu_publications_medi-v1-eng-est
  - OPUS-elrc_3079_wikipedia_health-v1-eng-est
  - OPUS-elrc_3211_antibiotic-v1-eng-est
  - OPUS-elrc_3300_europarl_covid-v1-eng-est
  - OPUS-elrc_3471_ec_europa_covid-v1-eng-est
  - OPUS-elrc_3554_www.kriis.ee-v1-eng-est
  - OPUS-elrc_3572_eur_lex_covid-v1-eng-est
  - OPUS-elrc_3613_presscorner_covid-v1-eng-est
  - OPUS-elrc_393_estonian_cabinet_min-v1-eng-est
  - OPUS-elrc_411_bank_estonia-v1-eng-est
  - OPUS-elrc_4271_nteu_tiera-v1-eng-est
  - OPUS-elrc_429_legal_estonian_justi-v1-eng-est
  - OPUS-elrc_431_estonian_foreign_aff-v1-eng-est
  - OPUS-elrc_5067_scipar-v1-eng-est
  - OPUS-elrc_714_parliament_estonia-v1-eng-est
  - OPUS-elrc_717_finnish_information_-v1-eng-est
  - OPUS-elrc_770_national_security_de-v1-eng-est
  - OPUS-elrc_919_akadeemia.ee-v1-eng-est
  - OPUS-elrc_937_vp1992_2001.presiden-v1-eng-est
  - OPUS-elrc_938_vp2001_2006.presiden-v1-eng-est
  - OPUS-elrc_939_vp2006_2016.presiden-v1-eng-est
  - OPUS-elrc_940_president.ee-v1-eng-est
  - OPUS-elrc_ec_europa-v1-eng-est
  - OPUS-elrc_emea-v1-eng-est
  - OPUS-elrc_euipo_2017-v1-eng-est
  - OPUS-elrc_europarl_covid-v1-eng-est
  - OPUS-elrc_eur_lex-v1-eng-est
  - OPUS-elrc_eu_publications-v1-eng-est
  - OPUS-elrc_finnish_information-v1-eng-est
  - OPUS-elrc_antibiotic-v1-eng-est
  - OPUS-elrc_presscorner_covid-v1-eng-est
  - OPUS-elrc_vaccination-v1-eng-est
  - OPUS-elrc_wikipedia_health-v1-eng-est
  - OPUS-elrc_www.visitestonia.com-v1-eng-est
  - OPUS-elrc_2682-v1-eng-est
  - OPUS-elrc_2922-v1-eng-est
  - OPUS-elrc_2923-v1-eng-est
  - OPUS-elrc_3382-v1-eng-est
  - OPUS-emea-v3-eng-est
  - OPUS-eopc-v2022-eng-est
  - OPUS-eubookshop-v2-eng-est
  - OPUS-euconst-v1-eng-est
  - OPUS-europarl-v8-eng-est
  - OPUS-gnome-v1-eng-est
  - OPUS-hplt-v2-eng-est
  - OPUS-jrc_acquis-v3.0-eng-est
  - OPUS-kde4-v2-eng-est
  - OPUS-kdedoc-v1-eng_GB-est
  - OPUS-multiccaligned-v1-eng-est
  - OPUS-multihplt-v2-eng-est
  - OPUS-multiparacrawl-v7.1-eng-est
  - OPUS-nllb-v1-eng-est
  - OPUS-neulab_tedtalks-v1-eng-est
  - OPUS-opensubtitles-v2018-eng-est
  - OPUS-paracrawl-v9-eng-est
  - OPUS-qed-v2.0a-eng-est
  - OPUS-ted2020-v1-eng-est
  - OPUS-tatoeba-v20230412-eng-est
  - OPUS-tildemodel-v2018-eng-est
  - OPUS-ubuntu-v14.10-eng-est
  - OPUS-xlent-v1.2-eng-est
  - OPUS-bible_uedin-v1-eng-est
  - OPUS-infopankki-v1-eng-est
  - OPUS-wikimedia-v20230407-eng-est
  mono_train:
  - Statmt-news_crawl-2023-est
  - Leipzig-web-2015_1m-est_EE
  - Leipzig-news-2020_300k-est
  - Leipzig-newscrawl-2017_1m-est


######### ENG-ISL ###########
- id: wmt25-eng-isl
  langs: eng-isl
  train:
    - Statmt-wikititles-3-isl-eng
    - Statmt-ccaligned-1-eng-isl_IS
    - ParaCrawl-paracrawl-9-eng-isl
    - Tilde-eesc-2017-eng-isl
    - Tilde-ema-2016-eng-isl
    - Tilde-rapid-2016-eng-isl
    - Facebook-wikimatrix-1-eng-isl
    - ParIce-eea_train-20.05-eng-isl
    - ParIce-ema_train-20.05-eng-isl
    - EU-ecdc-1-eng-isl
    - EU-eac_forms-1-eng-isl
    - EU-eac_reference-1-eng-isl
    - OPUS-ccmatrix-v1-eng-isl
    - OPUS-elrc_2718_emea-v1-eng-isl
    - OPUS-elrc_3206_antibiotic-v1-eng-isl
    - OPUS-elrc_4295_www.malfong.is-v1-eng-isl
    - OPUS-elrc_4324_government_offices_i-v1-eng-isl
    - OPUS-elrc_4327_government_offices_i-v1-eng-isl
    - OPUS-elrc_4334_rkiskaup_2020-v1-eng-isl
    - OPUS-elrc_4338_university_iceland-v1-eng-isl
    - OPUS-elrc_502_icelandic_financial_-v1-eng-isl
    - OPUS-elrc_504_www.iceida.is-v1-eng-isl
    - OPUS-elrc_505_www.pfs.is-v1-eng-isl
    - OPUS-elrc_506_www.lanamal.is-v1-eng-isl
    - OPUS-elrc_5067_scipar-v1-eng-isl
    - OPUS-elrc_508_tilde_statistics_ice-v1-eng-isl
    - OPUS-elrc_509_gallery_iceland-v1-eng-isl
    - OPUS-elrc_510_harpa_reykjavik_conc-v1-eng-isl
    - OPUS-elrc_511_bokmenntaborgin_is-v1-eng-isl
    - OPUS-elrc_516_icelandic_medicines-v1-eng-isl
    - OPUS-elrc_517_icelandic_directorat-v1-eng-isl
    - OPUS-elrc_597_www.nordisketax.net-v1-eng-isl
    - OPUS-elrc_718_statistics_iceland-v1-eng-isl
    - OPUS-elrc_728_www.norden.org-v1-eng-isl
    - OPUS-elrc_emea-v1-eng-isl
    - OPUS-elrc_antibiotic-v1-eng-isl
    - OPUS-elrc_www.norden.org-v1-eng-isl
    - OPUS-elrc_www.nordisketax.net-v1-eng-isl
    - OPUS-eubookshop-v2-eng-isl
    - OPUS-hplt-v2-eng-isl
    - OPUS-multiccaligned-v1-eng-isl
    - OPUS-multihplt-v2-eng-isl
    - OPUS-multiparacrawl-v7.1-eng-isl
    - OPUS-opensubtitles-v2024-eng-isl
    - OPUS-ted2020-v1-eng-isl
    - OPUS-tatoeba-v20220303-eng-isl
    - OPUS-ubuntu-v14.10-eng-isl
    - OPUS-wikimatrix-v1-eng-isl
    - OPUS-wikititles-v3-eng-isl
    - OPUS-xlent-v1.1-eng-isl
    - OPUS-wikimedia-v20210402-eng-isl
  mono_train:
    - Statmt-news_crawl-2023-isl
    - Leipzig-web-2020_1m-isl_IS
    - Leipzig-web_public-2019_1m-isl_IS
    - Leipzig-news-2020_30k-isl
    - Leipzig-newscrawl-2019_300k-isl
    - Leipzig-wikipedia-2021_100k-isl

######### ENG-JPN ###########
- id: wmt25-eng-jpn
  langs: eng-jpn
  train: # TODO: add all public data
    - Statmt-news_commentary-18.1-eng-jpn
    - Statmt-wikititles-3-jpn-eng
    - Statmt-ted-wmt20-eng-jpn
    - Statmt-ccaligned-1-eng-jpn
    - KECL-paracrawl-3-eng-jpn
    - Facebook-wikimatrix-1-eng-jpn
    - Phontron-kftt_train-1-eng-jpn
    - StanfordNLP-jesc_train-1-eng-jpn
    - Neulab-tedtalks_train-1-eng-jpn
    - LinguaTools-wikititles-2014-eng-jpn
    - OPUS-alt-v2019-eng-jpn
    - OPUS-ccmatrix-v1-eng-jpn
    - OPUS-eubookshop-v2-eng-jpn
    - OPUS-gnome-v1-eng-jpn
    - OPUS-globalvoices-v2018q4-eng-jpn
    - OPUS-hplt-v2-eng-jpn
    - OPUS-kde4-v2-eng-jpn
    - OPUS-mdn_web_docs-v20230925-eng-jpn
    - OPUS-multiccaligned-v1-eng-jpn
    - OPUS-multihplt-v2-eng-jpn
    - OPUS-nllb-v1-eng-jpn
    - OPUS-neulab_tedtalks-v1-eng-jpn
    - OPUS-openoffice-v3-eng_GB-jpn
    - OPUS-opensubtitles-v2024-eng-jpn
    - OPUS-php-v1-eng-jpn
    - OPUS-qed-v2.0a-eng-jpn
    - OPUS-ted2020-v1-eng-jpn
    - OPUS-tanzil-v1-eng-jpn
    - OPUS-tatoeba-v20230412-eng-jpn
    - OPUS-ubuntu-v14.10-eng-jpn
    - OPUS-wikimatrix-v1-eng-jpn
    - OPUS-xlent-v1.2-eng-jpn
    - OPUS-bible_uedin-v1-eng-jpn
    - OPUS-tldr_pages-v20230829-eng-jpn
    - OPUS-wikimedia-v20230407-eng-jpn
  mono_train: &mono_jpn
    - Statmt-news_crawl-2023-jpn
    - Statmt-news_commentary-18.1-jpn
    - Statmt-commoncrawl-wmt22-jpn
    - Leipzig-web-2020_1m-jpn_JP
    - Leipzig-comweb-2018_1m-jpn
    - Leipzig-web_public-2019_1m-jpn_JP
    - Leipzig-news-2020_100k-jpn
    - Leipzig-newscrawl-2019_1m-jpn
    - Leipzig-wikipedia-2021_1m-jpn
    # TODO: Extended Common Crawl

######### ENG-KOR ###########
- id: wmt25-eng-kor
  langs: eng-kor
  train:
    - Statmt-ccaligned-1-eng-kor_KR
    - ParaCrawl-paracrawl-1_bonus-eng-kor
    - Facebook-wikimatrix-1-eng-kor
    - Neulab-tedtalks_train-1-eng-kor
    - ELRC-wikipedia_health-1-eng-kor
    - ELRC-hrw_dataset_v1-1-eng-kor
    - LinguaTools-wikititles-2014-eng-kor
    - OPUS-ccaligned-v1-eng-kor
    - OPUS-ccmatrix-v1-eng-kor
    - OPUS-elrc_3070_wikipedia_health-v1-eng-kor
    - OPUS-elrc_wikipedia_health-v1-eng-kor
    - OPUS-elrc_2922-v1-eng-kor
    - OPUS-gnome-v1-eng-kor
    - OPUS-globalvoices-v2018q4-eng-kor
    - OPUS-hplt-v2-eng-kor
    - OPUS-multihplt-v2-eng-kor
    - OPUS-kde4-v2-eng-kor
    - OPUS-linguatools_wikititles-v2014-eng-kor
    - OPUS-mdn_web_docs-v20230925-eng-kor
    - OPUS-multiccaligned-v1-eng-kor
    - OPUS-nllb-v1-eng-kor
    - OPUS-neulab_tedtalks-v1-eng-kor
    - OPUS-opensubtitles-v2024-eng-kor
    - OPUS-php-v1-eng-kor
    - OPUS-paracrawl-v9-eng-kor
    - OPUS-qed-v2.0a-eng-kor
    - OPUS-ted2020-v1-eng-kor
    - OPUS-tanzil-v1-eng-kor
    - OPUS-tatoeba-v20230412-eng-kor
    - OPUS-ubuntu-v14.10-eng-kor
    - OPUS-wikimatrix-v1-eng-kor
    - OPUS-xlent-v1.2-eng-kor
    - OPUS-bible_uedin-v1-eng-kor
    - OPUS-tldr_pages-v20230829-eng-kor
    - OPUS-wikimedia-v20230407-eng-kor
  mono_train:
    - Statmt-news_crawl-2023-kor
    - Leipzig-web-2020_1m-kor_KR
    - Leipzig-news-2020_1m-kor
    - Leipzig-wikipedia-2021_1m-kor

######### ENG-RUS ###########
- id: wmt25-eng-rus
  langs: eng-rus
  train: # TODO: add all public data
    - Statmt-news_commentary-18.1-eng-rus
    - Statmt-wikititles-3-rus-eng
    - Statmt-ccaligned-1-eng-rus_RU
    - Statmt-yandex-wmt22-eng-rus
    - ParaCrawl-paracrawl-1_bonus-eng-rus
    - Tilde-airbaltic-1-eng-rus
    - Tilde-czechtourism-1-eng-rus
    - Tilde-worldbank-1-eng-rus
    - Facebook-wikimatrix-1-eng-rus
    - Neulab-tedtalks_train-1-eng-rus
    - ELRC-wikipedia_health-1-eng-rus
    - ELRC-swps_university_social_sciences_humanities-1-eng-rus
    - ELRC-scipar-1-eng-rus
    - ELRC-web_acquired_data_related_to_scientific_research-1-eng-rus
    - ELRC-hrw_dataset_v1-1-eng-rus
    - LinguaTools-wikititles-2014-eng-rus
    - OPUS-books-v1-eng-rus
    - OPUS-ccaligned-v1-eng-rus
    - OPUS-ccmatrix-v1-eng-rus
    - OPUS-elrc_3075_wikipedia_health-v1-eng-rus
    - OPUS-elrc_3855_swps_university_soci-v1-eng-rus
    - OPUS-elrc_5067_scipar-v1-eng-rus
    - OPUS-elrc_5183_scipar_ukraine-v1-eng-rus
    - OPUS-elrc_wikipedia_health-v1-eng-rus
    - OPUS-elrc_2922-v1-eng-rus
    - OPUS-eubookshop-v2-eng-rus
    - OPUS-gnome-v1-eng-rus
    - OPUS-globalvoices-v2018q4-eng-rus
    - OPUS-kde4-v2-eng-rus
    - OPUS-kdedoc-v1-eng_GB-rus
    - OPUS-mdn_web_docs-v20230925-eng-rus
    - OPUS-multiccaligned-v1-eng-rus
    - OPUS-multiparacrawl-v7.1-eng-rus
    - OPUS-multiun-v1-eng-rus
    - OPUS-nllb-v1-eng-rus
    - OPUS-openoffice-v3-eng_GB-rus
    - OPUS-opensubtitles-v2024-eng-rus
    - OPUS-php-v1-eng-rus
    - OPUS-qed-v2.0a-eng-rus
    - OPUS-ted2020-v1-eng-rus
    - OPUS-tanzil-v1-eng-rus
    - OPUS-tatoeba-v20230412-eng-rus
    - OPUS-unpc-v1.0-eng-rus
    - OPUS-ubuntu-v14.10-eng-rus
    - OPUS-wikipedia-v1.0-eng-rus
    - OPUS-xlent-v1.2-eng-rus
    - OPUS-ada83-v1-eng-rus
    - OPUS-bible_uedin-v1-eng-rus
    - OPUS-infopankki-v1-eng-rus
    - OPUS-tico_19-v20201028-eng-rus
    - OPUS-tldr_pages-v20230829-eng-rus
    - OPUS-wikimedia-v20230407-eng-rus

  mono_train: &mono_rus
    - Statmt-news_crawl-2023-rus
    - Statmt-news_commentary-18.1-rus
    - Statmt-commoncrawl-wmt22-rus
    - Leipzig-news-2022_1m-rus
    - Leipzig-newscrawl_public-2018_1m-rus
    - Leipzig-web-2017_1m-rus_GE
    - Leipzig-wikipedia-2021_1m-rus

######### ENG-SRP ###########
- id: wmt25-eng-srp
  langs: eng-srp
  train:
    - Statmt-ccaligned-1-eng-srp_RS
    - Tilde-worldbank-1-eng-srp
    - Facebook-wikimatrix-1-eng-srp
    - Neulab-tedtalks_train-1-eng-srp
    - ELRC-swedish_social_security-1-eng-srp
    - ELRC-wikipedia_health-1-eng-srp
    - OPUS-ccaligned-v1-eng-srp
    - OPUS-ccmatrix-v1-eng-srp
    - OPUS-elrc_3041_wikipedia_health-v1-eng-srp
    - OPUS-elrc_416_swedish_social_secur-v1-eng-srp
    - OPUS-elrc_wikipedia_health-v1-eng-srp
    - OPUS-elrc_2922-v1-eng-srp
    - OPUS-eubookshop-v2-eng-srp
    - OPUS-gnome-v1-eng-srp
    - OPUS-globalvoices-v2018q4-eng-srp
    - OPUS-gourmet-v2-eng-srp
    - OPUS-hplt-v2-eng-srp
    - OPUS-kde4-v2-eng-srp
    - OPUS-kdedoc-v1-eng_GB-srp
    - OPUS-multiccaligned-v1-eng-srp
    - OPUS-multihplt-v2-eng-srp
    - OPUS-nllb-v1-eng-srp
    - OPUS-neulab_tedtalks-v1-eng-srp
    - OPUS-opensubtitles-v2024-eng-srp
    - OPUS-qed-v2.0a-eng-srp
    - OPUS-setimes-v2-eng-srp
    - OPUS-tatoeba-v20230412-eng-srp
    - OPUS-tildemodel-v2018-eng-srp
    - OPUS-ubuntu-v14.10-eng-srp
    - OPUS-wikimatrix-v1-eng-srp
    - OPUS-xlent-v1.2-eng-srp
    - OPUS-bible_uedin-v1-eng-srp
    - OPUS-tldr_pages-v20230829-eng-srp
    - OPUS-wikimedia-v20230407-eng-srp
  mono_train:
    - Statmt-news_crawl-2023-srp
    # TODO: verify if _ME and _RS country codes are safe to mix
    - Leipzig-web-2016_300k-srp_ME
    - Leipzig-web-2016_1m-srp_RS
    - Leipzig-news-2019_30k-srp
    - Leipzig-wikipedia-2021_1m-srp

######### ENG-UKR ###########
- id: wmt25-eng-ukr
  langs: eng-ukr
  train: &para_eng_ukr  # TODO: add all public data
    - Statmt-ccaligned-1-eng-ukr_UA
    - ParaCrawl-paracrawl-1_bonus-eng-ukr
    - Tilde-worldbank-1-eng-ukr
    - Facebook-wikimatrix-1-eng-ukr
    - Neulab-tedtalks_train-1-eng-ukr
    - ELRC-wikipedia_health-1-eng-ukr
    - ELRC-french_polish_ukrainian-1-eng-ukr
    - ELRC-acts_ukrainian-1-eng-ukr
    - ELRC-official_parliament_ukraine_ukrainian_laws_en-1-eng-ukr
    - ELRC-official_parliament_ukraine_abstracts_uk_laws-1-eng-ukr
    - ELRC-official_parliament_ukraine_primary_legislation-1-eng-ukr
    - ELRC-scipar_ukraine-1-eng-ukr
    - ELRC-a_lexicon_named_entities_extracted_wikipedia-1-eng-ukr
    - ELRC-ukrainian_legal_mt_test_set-1-eng-ukr
    - ELRC-web_acquired_data_related_to_scientific_research-1-eng-ukr
    - ELRC-hrw_dataset_v1-1-eng-ukr
    - OPUS-ccaligned-v1-eng-ukr
    - OPUS-ccmatrix-v1-eng-ukr
    - OPUS-elrc_3043_wikipedia_health-v1-eng-ukr
    - OPUS-elrc_5174_french_polish_ukrain-v1-eng-ukr
    - OPUS-elrc_5179_acts_ukrainian-v1-eng-ukr
    - OPUS-elrc_5180_official_parliament_-v1-eng-ukr
    - OPUS-elrc_5181_official_parliament_-v1-eng-ukr
    - OPUS-elrc_5182_official_parliament_-v1-eng-ukr
    - OPUS-elrc_5183_scipar_ukraine-v1-eng-ukr
    - OPUS-elrc_5214_a_lexicon_named-v1-eng-ukr
    - OPUS-elrc_5217_ukrainian_legal_mt-v1-eng-ukr
    - OPUS-elrc_wikipedia_health-v1-eng-ukr
    - OPUS-elrc_2922-v1-eng-ukr
    - OPUS-eubookshop-v2-eng-ukr
    - OPUS-gnome-v1-eng-ukr
    - OPUS-hplt-v2-eng-ukr
    - OPUS-kde4-v2-eng-ukr
    - OPUS-kdedoc-v1-eng_GB-ukr
    - OPUS-macocu-v2-eng-ukr
    - OPUS-multiccaligned-v1-eng-ukr
    - OPUS-multihplt-v2-eng-ukr
    - OPUS-multimacocu-v2-eng-ukr
    - OPUS-nllb-v1-eng-ukr
    - OPUS-neulab_tedtalks-v1-eng-ukr
    - OPUS-opensubtitles-v2024-eng-ukr
    - OPUS-paracrawl_bonus-v9-eng-ukr
    - OPUS-qed-v2.0a-eng-ukr
    - OPUS-summa-v1-eng-ukr
    - OPUS-ted2020-v1-eng-ukr
    - OPUS-tatoeba-v20230412-eng-ukr
    - OPUS-ubuntu-v14.10-eng-ukr
    - OPUS-wikimatrix-v1-eng-ukr
    - OPUS-xlent-v1.2-eng-ukr
    - OPUS-bible_uedin-v1-eng-ukr
    - OPUS-tldr_pages-v20230829-eng-ukr
    - OPUS-wikimedia-v20230407-eng-ukr
  mono_train: *mono_ukr

Issues/Bugs

Please report issues and bugs via GitHub Issues at github.com/thammegowda/mtdata.
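
Before filing a report about a recipe, it may help to confirm locally which dataset IDs the recipe actually lists. The following is a minimal sketch, not part of mtdata: `scan_recipes` and `report` are hypothetical helper names, the sample entries are made up, and the scan is line-based, so it only understands the simple layout used in the recipe file above. It does not expand YAML aliases such as `*mono_ukr`.

```python
import re
from collections import Counter

# Hypothetical helpers for illustration; mtdata itself parses recipes as YAML.
ID_LINE = re.compile(r"^- id\s*:\s*(\S+)")   # top-level "- id: <recipe-id>"
ITEM_LINE = re.compile(r"^\s+- (\S+)")       # indented "- <dataset-id>"


def scan_recipes(text):
    """Map each recipe id to the dataset IDs listed under it, in file order."""
    recipes = {}
    current = None
    for line in text.splitlines():
        line = line.split(" #", 1)[0].rstrip()  # drop trailing comments
        if line.lstrip().startswith("#"):
            continue  # full-line comment or section banner
        m = ID_LINE.match(line)
        if m:
            current = m.group(1)
            recipes[current] = []
            continue
        m = ITEM_LINE.match(line)
        if m and current is not None:
            recipes[current].append(m.group(1))
    return recipes


def report(recipes, prefix="wmt25-"):
    """Return (ids lacking the prefix, {recipe id: duplicated dataset IDs})."""
    odd_ids = [rid for rid in recipes if not rid.startswith(prefix)]
    dupes = {}
    for rid, items in recipes.items():
        repeated = sorted(d for d, n in Counter(items).items() if n > 1)
        if repeated:
            dupes[rid] = repeated
    return odd_ids, dupes


# Made-up sample in the same layout as the recipe file; not real dataset IDs.
SAMPLE = """\
######### DEMO ###########
- id: wmt25-aaa-bbb
  langs: aaa-bbb
  train:  # made-up entries for illustration
    - Demo-corpus-1-aaa-bbb
    - Demo-corpus-2-aaa-bbb
    - Demo-corpus-1-aaa-bbb
  mono_train:
    - Demo-mono-1-bbb
- id : demo-ccc-ddd
  langs: ccc-ddd
  train:
    - Demo-corpus-1-ccc-ddd
"""

if __name__ == "__main__":
    odd_ids, dupes = report(scan_recipes(SAMPLE))
    print("ids without prefix:", odd_ids)
    print("duplicated entries:", dupes)
```

To check a downloaded file instead of the sample, read its contents into a string and pass that to `scan_recipes`. Including the recipe id and the affected dataset IDs in a report should make the problem easier to reproduce.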