This document lists the WMT26 General MT task datasets for the constrained track
and explains how to download them using mtdata.
See also the General Translation Task page.
ANNOUNCEMENTS
- 2026-04-05: Initial WMT26 constrained recipe release.
MTData
Setup
mtdata 0.5.0-dev is not yet released on PyPI, so please install it from the
develop branch:
pip install "git+https://github.com/thammegowda/mtdata.git@develop" # Python 3.10+
Recipes Config File
Download the recipe config file for the constrained track:
wget https://www.statmt.org/wmt26/mtdata/mtdata.recipes.wmt26-constrained.yml
By default, mtdata loads files matching mtdata.recipes*.yml from the current
working directory. To keep the recipe file elsewhere, set MTDATA_RECIPES to the
directory that contains it:
export MTDATA_RECIPES=/path/to/recipesdir
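Before running mtdata, it can help to confirm that a recipe file is where the lookup described above expects it. The following is a minimal sketch, not mtdata's own logic: `list_recipe_files` is a hypothetical helper that prints the files matching mtdata.recipes*.yml in the directory given as its argument, falling back to MTDATA_RECIPES if set and to the current directory otherwise.

```shell
# Hypothetical helper (not part of mtdata): list candidate recipe files.
# Directory precedence: first argument, then $MTDATA_RECIPES, then ".".
# The function body runs in a subshell so its variables do not leak.
list_recipe_files() (
  dir="${1:-${MTDATA_RECIPES:-.}}"
  rc=1
  for f in "$dir"/mtdata.recipes*.yml; do
    # An unmatched glob stays literal, so check that the file really exists.
    if [ -f "$f" ]; then
      printf '%s\n' "$f"
      rc=0
    fi
  done
  # Non-zero exit status when no recipe file was found.
  exit "$rc"
)

# Example call (left commented out so sourcing this block has no side effects):
# list_recipe_files || echo "no recipe file found" >&2
```

A non-zero exit status means no matching file was found, which usually indicates that the wget step was run in a different directory or that MTDATA_RECIPES points elsewhere.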
List All Recipes
mtdata list-recipe -id | grep '^wmt26-'
Download Recipes
Download one recipe:
mtdata get-recipe -ri wmt26-eng-ces -o wmt26-eng-ces --compress --no-merge -j 8
Download all currently supported WMT26 recipes:
# optional: prefetch/cache datasets referenced by all WMT26 recipes
# this step parallelizes caching across all recipes and reduces total time
mtdata -no-pb cache -j 8 -ri "wmt26-*"
# materialize every supported WMT26 recipe into its own directory
for id in $(mtdata list-recipe -id | grep '^wmt26-'); do
  mtdata get-recipe -ri "$id" -o "$id" --compress --no-merge -j 8
done
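With many recipes, a single failed download should not hide the rest. The variant below is a sketch under stated assumptions: `fetch_recipes` is a hypothetical wrapper around the same get-recipe command, and `MTDATA_BIN` is an illustrative override (not an mtdata option) that lets the loop be dry-run, for example with `echo`. It keeps going after a failure and reports the failed IDs at the end.

```shell
# Hypothetical wrapper (not part of mtdata): materialize each recipe ID given
# as an argument, continue past failures, and report the failed IDs at the end.
# MTDATA_BIN is an illustrative override; it defaults to the real mtdata command.
fetch_recipes() (
  failed=""
  for id in "$@"; do
    if ! "${MTDATA_BIN:-mtdata}" get-recipe -ri "$id" -o "$id" --compress --no-merge -j 8; then
      failed="$failed $id"
    fi
  done
  if [ -n "$failed" ]; then
    echo "failed recipes:$failed" >&2
    exit 1
  fi
)

# Example calls (left commented out so sourcing this block downloads nothing):
# MTDATA_BIN=echo fetch_recipes wmt26-eng-ces wmt26-ces-ukr      # dry run
# fetch_recipes $(mtdata list-recipe -id | grep '^wmt26-')       # all recipes
```

Re-running the function with only the reported IDs retries the failed recipes without repeating the ones that already completed.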
WMT26 Recipe IDs
- wmt26-ces-ukr
- wmt26-ces-deu
- wmt26-jpn-zho
- wmt26-eng-ara
- wmt26-eng-zho
- wmt26-eng-ces
- wmt26-eng-est
- wmt26-eng-isl
- wmt26-eng-jpn
- wmt26-eng-kor
- wmt26-ces-vie
- wmt26-eng-hye
- wmt26-eng-bel
- wmt26-eng-zho_TW
- wmt26-eng-deu
- wmt26-eng-ind
- wmt26-eng-kaz
- wmt26-eng-lld
- wmt26-eng-lij_Latn
- wmt26-eng-sme
- wmt26-eng-tha
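Each ID above is the prefix wmt26- followed by the language pair. A small sketch for recovering the pair from an ID, for example to use it in a script: `recipe_langs` is a hypothetical helper, and it assumes every WMT26 recipe ID follows this naming pattern.

```shell
# Hypothetical helper: print the language pair encoded in a WMT26 recipe ID.
# Assumes IDs have the form wmt26-<src>-<tgt>, as in the list above.
recipe_langs() {
  case "$1" in
    wmt26-*-*) printf '%s\n' "${1#wmt26-}" ;;
    *) echo "not a WMT26 recipe ID: $1" >&2; return 1 ;;
  esac
}

# Example calls (left commented out so sourcing this block prints nothing):
# recipe_langs wmt26-eng-zho_TW      # prints eng-zho_TW
# recipe_langs wmt26-eng-lij_Latn    # prints eng-lij_Latn
```

Region and script suffixes such as _TW and _Latn are kept as part of the target language code.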
Constrained Task Datasets
The selected dataset IDs for the constrained track are as follows:
# Setup: pip install "git+https://github.com/thammegowda/mtdata.git@develop"
# To list all the available datasets, use the following commands
# mtdata list -id -l <lang1>-<lang2> # parallel
# mtdata list -id -l <lang> # monolingual
# To get a dataset
# mtdata echo <data_id>
########## CES-UKR ##########
- id: wmt26-ces-ukr
langs: ces-ukr
train:
- Facebook-wikimatrix-1-ces-ukr
- ELRC-acts_ukrainian-1-ces-ukr
- OPUS-ccmatrix-v1-ces-ukr
- OPUS-elrc_5179_acts_ukrainian-v1-ces-ukr
- OPUS-elrc_wikipedia_health-v1-ces-ukr
- OPUS-eubookshop-v2-ces-ukr
- OPUS-gnome-v1-ces-ukr
- OPUS-kde4-v2-ces-ukr
- OPUS-multiccaligned-v1.1-ces-ukr
- OPUS-multiparacrawl-v9b-ces-ukr
- OPUS-opensubtitles-v2024-ces-ukr
- OPUS-qed-v2.0a-ces-ukr
- OPUS-ted2020-v1-ces-ukr
- OPUS-ubuntu-v14.10-ces-ukr
- OPUS-bible_uedin-v1-ces-ukr
- OPUS-multihplt-v3-ces-ukr
- OPUS-neulab_tedtalks-v1-ces-ukr
- OPUS-nllb-v1-ces-ukr
- OPUS-tatoeba-v20230412-ces-ukr
- OPUS-tldr_pages-v20251124-ces-ukr
- OPUS-wikimedia-v20230407-ces-ukr
- OPUS-xlent-v1.2-ces-ukr
mono_train:
- Statmt-news_crawl-2023-ukr
- LangUk-news-1-ukr
- LangUk-wiki_dump-1-ukr
- LangUk-fiction-1-ukr
- LangUk-ubercorpus-1-ukr
- LangUk-laws-1-ukr
- Leipzig-news-2022_1m-ukr
- Leipzig-newscrawl-2018_1m-ukr
- Leipzig-web-2019_1m-ukr_UA
- Leipzig-wikipedia-2021_1m-ukr
########## CES-DEU ##########
- id: wmt26-ces-deu
langs: ces-deu
train:
- Statmt-news_commentary-18.1-ces-deu
- Tilde-eesc-2017-ces-deu
- Tilde-ema-2016-ces-deu
- Tilde-ecb-2017-ces-deu
- Tilde-rapid-2016-ces-deu
- Facebook-wikimatrix-1-ces-deu
- LinguaTools-wikititles-2014-ces-deu
- OPUS-ccmatrix-v1-ces-deu
- OPUS-dgt-v4-ces-deu
- OPUS-ecb-v1-ces-deu
- OPUS-ecdc-v20160316-ces-deu
- OPUS-elitr_eca-v1-ces-deu
- OPUS-elrc_417_swedish_work_environ-v1-ces-deu
- OPUS-elrc_ec_europa-v1-ces-deu
- OPUS-elrc_emea-v1-ces-deu
- OPUS-elrc_euipo_2017-v1-ces-deu
- OPUS-elrc_europarl_covid-v1-ces-deu
- OPUS-elrc_eur_lex-v1-ces-deu
- OPUS-elrc_eu_publications-v1-ces-deu
- OPUS-elrc_information_portal-v1-ces-deu
- OPUS-elrc_antibiotic-v1-ces-deu
- OPUS-elrc_presscorner_covid-v1-ces-deu
- OPUS-elrc_vaccination-v1-ces-deu
- OPUS-elrc_wikipedia_health-v1-ces-deu
- OPUS-emea-v3-ces-deu
- OPUS-eubookshop-v2-ces-deu
- OPUS-euconst-v1-ces-deu
- OPUS-gnome-v1-ces-deu
- OPUS-globalvoices-v2018q4-ces-deu
- OPUS-jrc_acquis-v3.0-ces-deu
- OPUS-kde4-v2-ces-deu
- OPUS-multiccaligned-v1.1-ces-deu
- OPUS-multiparacrawl-v9b-ces-deu
- OPUS-nllb-v1-ces-deu
- OPUS-neulab_tedtalks-v1-ces-deu
- OPUS-opensubtitles-v2024-ces-deu
- OPUS-php-v1-ces-deu
- OPUS-qed-v2.0a-ces-deu
- OPUS-ted2020-v1-ces-deu
- OPUS-tanzil-v1-ces-deu
- OPUS-tatoeba-v20230412-ces-deu
- OPUS-tildemodel-v2018-ces-deu
- OPUS-ubuntu-v14.10-ces-deu
- OPUS-xlent-v1.2-ces-deu
- OPUS-bible_uedin-v1-ces-deu
- OPUS-wikimedia-v20230407-ces-deu
- OPUS-tldr_pages-v20251124-ces-deu
mono_train:
- Statmt-news_crawl-2023-deu
- Statmt-europarl-10-deu
- Statmt-news_commentary-18.1-deu
- Statmt-commoncrawl-wmt22-deu
- Leipzig-wikipedia-2021_1m-deu
- Leipzig-comweb-2021_1m-deu
- Leipzig-mixed_typical-2011_1m-deu
- Leipzig-news-2022_30k-deu
- Leipzig-newscrawl-2020_1m-deu
- Leipzig-web-2021_100k-deu_DE
########## JPN-ZHO ##########
- id: wmt26-jpn-zho
langs: jpn-zho
train:
- Statmt-news_commentary-18.1-jpn-zho
- KECL-paracrawl-2wmt24-zho-jpn
- Facebook-wikimatrix-1-jpn-zho
- Neulab-tedtalks_train-1-jpn-zho
- LinguaTools-wikititles-2014-jpn-zho
- OPUS-ccmatrix-v1-jpn-zho
- OPUS-php-v1-jpn-zho
- OPUS-qed-v2.0a-jpn-zho
- OPUS-ted2020-v1-jpn-zho
- OPUS-tanzil-v1-jpn-zho
- OPUS-ubuntu-v14.10-jpn-zho
- OPUS-bible_uedin-v1-jpn-zho
- OPUS-alt-v20191206-jpn-zho
- OPUS-eubookshop-v2-jpn-zho
- OPUS-jparacrawl-v3.0-jpn-zho
- OPUS-nllb-v1-jpn-zho
- OPUS-opensubtitles-v2016-jpn-zho
- OPUS-tldr_pages-v20251124-jpn-zho
- OPUS-wikimedia-v20230407-jpn-zho
- OPUS-xlent-v1.2-jpn-zho
mono_train:
- Statmt-news_crawl-2023-zho
- Statmt-news_commentary-18.1-zho
- Statmt-commoncrawl-wmt22-zho
- Leipzig-wikipedia-2018_1m-zho
- Leipzig-web-2016_1m-zho_MO
- Leipzig-tradnewscrawl-2011_1m-zho
- Leipzig-news-2020_300k-zho
########## ENG-ARA ##########
- id: wmt26-eng-ara
langs: eng-ara
train:
- Statmt-news_commentary-18.1-ara-eng
- Statmt-tedtalks-2_clean-eng-ara
- Statmt-ccaligned-1-ara_AR-eng
- Facebook-wikimatrix-1-ara-eng
- LinguaTools-wikititles-2014-ara-eng
- OPUS-ccmatrix-v1-ara-eng
- OPUS-elrc_3083_wikipedia_health-v1-ara-eng
- OPUS-elrc_wikipedia_health-v1-ara-eng
- OPUS-elrc_2922-v1-ara-eng
- OPUS-eubookshop-v2-ara-eng
- OPUS-gnome-v1-ara-eng
- OPUS-globalvoices-v2018q4-ara-eng
- OPUS-hplt-v2-ara-eng
- OPUS-kde4-v2-ara-eng
- OPUS-multiccaligned-v1-ara-eng
- OPUS-multihplt-v2-ara-eng
- OPUS-multiun-v1-ara-eng
- OPUS-nllb-v1-ara-eng
- OPUS-opensubtitles-v2024-ara-eng
- OPUS-qed-v2.0a-ara-eng
- OPUS-ted2020-v1-ara-eng
- OPUS-tatoeba-v20230412-ara-eng
- OPUS-ubuntu-v14.10-ara-eng
- OPUS-wikipedia-v1.0-ara-eng
- OPUS-xlent-v1.2-ara-eng
- OPUS-bible_uedin-v1-ara-eng
- OPUS-infopankki-v1-ara-eng
- OPUS-tico_19-v20201028-ara-eng
- OPUS-wikimedia-v20230407-ara-eng
- OPUS-neulab_tedtalks-v1-ara-eng
- OPUS-opus100_train-1-ara-eng
- OPUS-tanzil-v1-ara-eng
- OPUS-ted2013-v1.1-ara-eng
- OPUS-tldr_pages-v20251124-ara-eng
- OPUS-unpc-v1.0-ara-eng
mono_train:
- Statmt-news_crawl-2023-ara
- Statmt-news_commentary-18.1-ara
- Leipzig-news-2020_1m-ara
- Leipzig-wikipedia-2021_1m-ara
########## ENG-ZHO ##########
- id: wmt26-eng-zho
langs: eng-zho
train:
- Statmt-news_commentary-18.1-eng-zho
- Statmt-wikititles-3-zho-eng
- Statmt-ccaligned-1-eng-zho_CN
- ParaCrawl-paracrawl-1_bonus-eng-zho
- Facebook-wikimatrix-1-eng-zho
- Neulab-tedtalks_train-1-eng-zho
- ELRC-wikipedia_health-1-eng-zho
- ELRC-hrw_dataset_v1-1-eng-zho
- LinguaTools-wikititles-2014-eng-zho
- OPUS-ccmatrix-v1-eng-zho
- OPUS-elrc_3056_wikipedia_health-v1-eng-zho
- OPUS-elrc_wikipedia_health-v1-eng-zho
- OPUS-elrc_2922-v1-eng-zho
- OPUS-eubookshop-v2-eng-zho
- OPUS-multiun-v1-eng-zho
- OPUS-nllb-v1-eng-zho
- OPUS-php-v1-eng-zho
- OPUS-qed-v2.0a-eng-zho
- OPUS-spc-v1-eng-zho
- OPUS-ted2020-v1-eng-zho
- OPUS-tanzil-v1-eng-zho
- OPUS-ubuntu-v14.10-eng-zho
- OPUS-xlent-v1.2-eng-zho
- OPUS-bible_uedin-v1-eng-zho
- OPUS-infopankki-v1-eng-zho
- OPUS-tico_19-v20201028-eng-zho
- OPUS-wikimedia-v20230407-eng-zho
- OPUS-alt-v20191206-eng-zho
- OPUS-opensubtitles-v2016-eng-zho
- OPUS-opus100_train-1-eng-zho
- OPUS-paracrawl_bonus-v9-eng-zho
- OPUS-ted2013-v1.1-eng-zho
- OPUS-tldr_pages-v20251124-eng-zho
- OPUS-unpc-v1.0-eng-zho
mono_train:
- Statmt-news_crawl-2023-zho
- Statmt-news_commentary-18.1-zho
- Statmt-commoncrawl-wmt22-zho
- Leipzig-wikipedia-2018_1m-zho
- Leipzig-web-2016_1m-zho_MO
- Leipzig-tradnewscrawl-2011_1m-zho
- Leipzig-news-2020_300k-zho
########## ENG-CES ##########
- id: wmt26-eng-ces
langs: eng-ces
train:
- Statmt-commoncrawl_wmt13-1-ces-eng
- Statmt-news_commentary-18.1-ces-eng
- Statmt-wikititles-3-ces-eng
- Statmt-europarl-10-ces-eng
- Statmt-ccaligned-1-ces_CZ-eng
- ParaCrawl-paracrawl-9-eng-ces
- Tilde-eesc-2017-ces-eng
- Tilde-ema-2016-ces-eng
- Tilde-ecb-2017-ces-eng
- Tilde-rapid-2019-ces-eng
- Facebook-wikimatrix-1-ces-eng
- Neulab-tedtalks_train-1-eng-ces
- ELRC-information_portal_czech_president_czech_castle-1-ces-eng
- ELRC-electronic_exchange_social_security_information-1-ces-eng
- ELRC-euipo_2017-1-ces-eng
- ELRC-czech_supreme_audit_office_2018_reports-1-ces-eng
- ELRC-czech_supreme_audit_office_2008_2017_reports-1-ces-eng
- ELRC-czech_supreme_audit_office_2003_2017_press_releases-1-ces-eng
- ELRC-czech_supreme_audit_office_2018_press_releases-1-ces-eng
- ELRC-emea-1-ces-eng
- ELRC-vaccination-1-ces-eng
- ELRC-eu_publications_medical_v2-1-ces-eng
- ELRC-wikipedia_health-1-ces-eng
- ELRC-antibiotic-1-ces-eng
- ELRC-europarl_covid-1-ces-eng
- ELRC-ec_europa_covid-1-ces-eng
- ELRC-eur_lex_covid-1-ces-eng
- ELRC-presscorner_covid-1-ces-eng
- ELRC-scipar-1-ces-eng
- ELRC-web_acquired_data_related_to_scientific_research-1-eng-ces
- ELRC-hrw_dataset_v1-1-eng-ces
- ELRC-cef_data_marketplace-1-eng-ces
- EU-ecdc-1-eng-ces
- EU-eac_forms-1-ces-eng
- EU-eac_reference-1-ces-eng
- EU-dcep-1-ces-eng
- LinguaTools-wikititles-2014-ces-eng
- OPUS-ccmatrix-v1-ces-eng
- OPUS-dgt-v4-ces-eng
- OPUS-ecb-v1-ces-eng
- OPUS-ecdc-v20160316-ces-eng
- OPUS-elitr_eca-v1-ces-eng
- OPUS-elrc_2012_euipo_2017-v1-ces-eng
- OPUS-elrc_2404_czech_supreme_audit-v1-ces-eng
- OPUS-elrc_2405_czech_supreme_audit-v1-ces-eng
- OPUS-elrc_2406_czech_supreme_audit-v1-ces-eng
- OPUS-elrc_2407_czech_supreme_audit-v1-ces-eng
- OPUS-elrc_2713_emea-v1-ces-eng
- OPUS-elrc_2749_vaccination-v1-ces-eng
- OPUS-elrc_2874_eu_publications_medi-v1-ces-eng
- OPUS-elrc_3062_wikipedia_health-v1-ces-eng
- OPUS-elrc_3201_antibiotic-v1-ces-eng
- OPUS-elrc_3292_europarl_covid-v1-ces-eng
- OPUS-elrc_3463_ec_europa_covid-v1-ces-eng
- OPUS-elrc_3564_eur_lex_covid-v1-ces-eng
- OPUS-elrc_3605_presscorner_covid-v1-ces-eng
- OPUS-elrc_40_information_portal_c-v1-ces-eng
- OPUS-elrc_427_electronic_exchange_-v1-ces-eng
- OPUS-elrc_5067_scipar-v1-ces-eng
- OPUS-elrc_ec_europa-v1-ces-eng
- OPUS-elrc_emea-v1-ces-eng
- OPUS-elrc_euipo_2017-v1-ces-eng
- OPUS-elrc_europarl_covid-v1-ces-eng
- OPUS-elrc_eur_lex-v1-ces-eng
- OPUS-elrc_eu_publications-v1-ces-eng
- OPUS-elrc_information_portal-v1-ces-eng
- OPUS-elrc_antibiotic-v1-ces-eng
- OPUS-elrc_presscorner_covid-v1-ces-eng
- OPUS-elrc_vaccination-v1-ces-eng
- OPUS-elrc_wikipedia_health-v1-ces-eng
- OPUS-elrc_2682-v1-ces-eng
- OPUS-elrc_2922-v1-ces-eng
- OPUS-elrc_2923-v1-ces-eng
- OPUS-elrc_3382-v1-ces-eng
- OPUS-emea-v3-ces-eng
- OPUS-eubookshop-v2-ces-eng
- OPUS-euconst-v1-ces-eng
- OPUS-gnome-v1-ces-eng
- OPUS-globalvoices-v2018q4-ces-eng
- OPUS-jrc_acquis-v3.0-ces-eng
- OPUS-kde4-v2-ces-eng
- OPUS-multiccaligned-v1-ces-eng
- OPUS-multiparacrawl-v7.1-ces-eng
- OPUS-nllb-v1-ces-eng
- OPUS-opensubtitles-v2024-ces-eng
- OPUS-php-v1-ces-eng
- OPUS-qed-v2.0a-ces-eng
- OPUS-ted2020-v1-ces-eng
- OPUS-tanzil-v1-ces-eng
- OPUS-tatoeba-v20230412-ces-eng
- OPUS-tildemodel-v2018-ces-eng
- OPUS-ubuntu-v14.10-ces-eng
- OPUS-wikipedia-v1.0-ces-eng
- OPUS-xlent-v1.2-ces-eng
- OPUS-bible_uedin-v1-ces-eng
- OPUS-wikimedia-v20230407-ces-eng
- OPUS-hplt-v3-ces-eng
- OPUS-multihplt-v3-ces-eng
- OPUS-opus100_train-1-ces-eng
- OPUS-tldr_pages-v20251124-ces-eng
mono_train:
- Statmt-news_crawl-2023-ces
- Statmt-europarl-10-ces
- Statmt-news_commentary-18.1-ces
- Statmt-commoncrawl-wmt22-ces
- Leipzig-news-2022_1m-ces
- Leipzig-newscrawl-2019_1m-ces
- Leipzig-wikipedia-2021_1m-ces
- Leipzig-web_public-2019_1m-ces_CZ
########## ENG-EST ##########
- id: wmt26-eng-est
langs: eng-est
train:
- Statmt-europarl-7-est-eng
- Statmt-ccaligned-1-eng-est_EE
- ParaCrawl-paracrawl-9-eng-est
- Tilde-eesc-2017-eng-est
- Tilde-ema-2016-eng-est
- Tilde-airbaltic-1-eng-est
- Tilde-ecb-2017-eng-est
- Tilde-rapid-2016-eng-est
- Facebook-wikimatrix-1-eng-est
- Neulab-tedtalks_train-1-eng-est
- ELRC-estonian_cabinet_ministers-1-eng-est
- ELRC-bank_estonia-1-eng-est
- ELRC-legal_estonian_justice-1-eng-est
- ELRC-estonian_foreign_affairs-1-eng-est
- ELRC-parliament_estonia-1-eng-est
- ELRC-finnish_information_bank-1-eng-est
- ELRC-national_security_defence-1-eng-est
- ELRC-akadeemia.ee-1-eng-est
- ELRC-vp1992_2001.president.ee-1-eng-est
- ELRC-vp2001_2006.president.ee-1-eng-est
- ELRC-vp2006_2016.president.ee-1-eng-est
- ELRC-president.ee-1-eng-est
- ELRC-www.visitestonia.com-1-eng-est
- ELRC-euipo_2017-1-eng-est
- ELRC-estonian_classification_economic_activities-1-eng-est
- ELRC-press_releases_foreign_affairs_estonia-1-eng-est
- ELRC-emea-1-eng-est
- ELRC-vaccination-1-eng-est
- ELRC-eu_publications_medical_v2-1-eng-est
- ELRC-wikipedia_health-1-eng-est
- ELRC-antibiotic-1-eng-est
- ELRC-europarl_covid-1-eng-est
- ELRC-ec_europa_covid-1-eng-est
- ELRC-www.kriis.ee-1-eng-est
- ELRC-eur_lex_covid-1-eng-est
- ELRC-presscorner_covid-1-eng-est
- ELRC-nteu_tiera-1-eng-est
- ELRC-nteu_tierb-1-eng-est
- ELRC-scipar-1-eng-est
- ELRC-web_acquired_data_related_to_scientific_research-1-eng-est
- EU-ecdc-1-eng-est
- EU-eac_forms-1-eng-est
- EU-eac_reference-1-eng-est
- EU-dcep-1-eng-est
- OPUS-ccmatrix-v1-eng-est
- OPUS-dgt-v4-eng-est
- OPUS-ecb-v1-eng-est
- OPUS-ecdc-v20160316-eng-est
- OPUS-elitr_eca-v1-eng-est
- OPUS-elra_w0154-v1-eng-est
- OPUS-elra_w0167-v1-eng-est
- OPUS-elra_w0168-v1-eng-est
- OPUS-elra_w0215-v1-eng-est
- OPUS-elra_w0218-v1-eng-est
- OPUS-elra_w0265-v1-eng-est
- OPUS-elrc_1129_www.visitestonia.com-v1-eng-est
- OPUS-elrc_2016_euipo_2017-v1-eng-est
- OPUS-elrc_2457_estonian_classificat-v1-eng-est
- OPUS-elrc_2461_press_releases_forei-v1-eng-est
- OPUS-elrc_2723_emea-v1-eng-est
- OPUS-elrc_2751_vaccination-v1-eng-est
- OPUS-elrc_2882_eu_publications_medi-v1-eng-est
- OPUS-elrc_3079_wikipedia_health-v1-eng-est
- OPUS-elrc_3211_antibiotic-v1-eng-est
- OPUS-elrc_3300_europarl_covid-v1-eng-est
- OPUS-elrc_3471_ec_europa_covid-v1-eng-est
- OPUS-elrc_3554_www.kriis.ee-v1-eng-est
- OPUS-elrc_3572_eur_lex_covid-v1-eng-est
- OPUS-elrc_3613_presscorner_covid-v1-eng-est
- OPUS-elrc_393_estonian_cabinet_min-v1-eng-est
- OPUS-elrc_411_bank_estonia-v1-eng-est
- OPUS-elrc_4271_nteu_tiera-v1-eng-est
- OPUS-elrc_429_legal_estonian_justi-v1-eng-est
- OPUS-elrc_431_estonian_foreign_aff-v1-eng-est
- OPUS-elrc_5067_scipar-v1-eng-est
- OPUS-elrc_714_parliament_estonia-v1-eng-est
- OPUS-elrc_717_finnish_information_-v1-eng-est
- OPUS-elrc_770_national_security_de-v1-eng-est
- OPUS-elrc_919_akadeemia.ee-v1-eng-est
- OPUS-elrc_937_vp1992_2001.presiden-v1-eng-est
- OPUS-elrc_938_vp2001_2006.presiden-v1-eng-est
- OPUS-elrc_939_vp2006_2016.presiden-v1-eng-est
- OPUS-elrc_940_president.ee-v1-eng-est
- OPUS-elrc_ec_europa-v1-eng-est
- OPUS-elrc_emea-v1-eng-est
- OPUS-elrc_euipo_2017-v1-eng-est
- OPUS-elrc_europarl_covid-v1-eng-est
- OPUS-elrc_eur_lex-v1-eng-est
- OPUS-elrc_eu_publications-v1-eng-est
- OPUS-elrc_finnish_information-v1-eng-est
- OPUS-elrc_antibiotic-v1-eng-est
- OPUS-elrc_presscorner_covid-v1-eng-est
- OPUS-elrc_vaccination-v1-eng-est
- OPUS-elrc_wikipedia_health-v1-eng-est
- OPUS-elrc_www.visitestonia.com-v1-eng-est
- OPUS-elrc_2682-v1-eng-est
- OPUS-elrc_2922-v1-eng-est
- OPUS-elrc_2923-v1-eng-est
- OPUS-elrc_3382-v1-eng-est
- OPUS-emea-v3-eng-est
- OPUS-eopc-v2022-eng-est
- OPUS-eubookshop-v2-eng-est
- OPUS-euconst-v1-eng-est
- OPUS-gnome-v1-eng-est
- OPUS-jrc_acquis-v3.0-eng-est
- OPUS-kde4-v2-eng-est
- OPUS-kdedoc-v1-eng_GB-est
- OPUS-multiccaligned-v1-eng-est
- OPUS-multiparacrawl-v7.1-eng-est
- OPUS-nllb-v1-eng-est
- OPUS-qed-v2.0a-eng-est
- OPUS-ted2020-v1-eng-est
- OPUS-tatoeba-v20230412-eng-est
- OPUS-tildemodel-v2018-eng-est
- OPUS-ubuntu-v14.10-eng-est
- OPUS-xlent-v1.2-eng-est
- OPUS-bible_uedin-v1-eng-est
- OPUS-infopankki-v1-eng-est
- OPUS-wikimedia-v20230407-eng-est
- OPUS-hplt-v3-eng-est
- OPUS-multihplt-v3-eng-est
- OPUS-opensubtitles-v2024-eng-est
- OPUS-opus100_train-1-eng-est
mono_train:
- Statmt-news_crawl-2023-est
- Leipzig-web-2015_1m-est_EE
- Leipzig-news-2020_300k-est
- Leipzig-newscrawl-2017_1m-est
########## ENG-ISL ##########
- id: wmt26-eng-isl
langs: eng-isl
train:
- Statmt-wikititles-3-isl-eng
- Statmt-ccaligned-1-eng-isl_IS
- ParaCrawl-paracrawl-9-eng-isl
- Tilde-eesc-2017-eng-isl
- Tilde-ema-2016-eng-isl
- Tilde-rapid-2016-eng-isl
- Facebook-wikimatrix-1-eng-isl
- ParIce-eea_train-20.05-eng-isl
- ParIce-ema_train-20.05-eng-isl
- EU-ecdc-1-eng-isl
- EU-eac_forms-1-eng-isl
- EU-eac_reference-1-eng-isl
- OPUS-ccmatrix-v1-eng-isl
- OPUS-elrc_2718_emea-v1-eng-isl
- OPUS-elrc_3206_antibiotic-v1-eng-isl
- OPUS-elrc_4295_www.malfong.is-v1-eng-isl
- OPUS-elrc_4324_government_offices_i-v1-eng-isl
- OPUS-elrc_4327_government_offices_i-v1-eng-isl
- OPUS-elrc_4334_rkiskaup_2020-v1-eng-isl
- OPUS-elrc_4338_university_iceland-v1-eng-isl
- OPUS-elrc_502_icelandic_financial_-v1-eng-isl
- OPUS-elrc_504_www.iceida.is-v1-eng-isl
- OPUS-elrc_505_www.pfs.is-v1-eng-isl
- OPUS-elrc_506_www.lanamal.is-v1-eng-isl
- OPUS-elrc_5067_scipar-v1-eng-isl
- OPUS-elrc_508_tilde_statistics_ice-v1-eng-isl
- OPUS-elrc_509_gallery_iceland-v1-eng-isl
- OPUS-elrc_510_harpa_reykjavik_conc-v1-eng-isl
- OPUS-elrc_511_bokmenntaborgin_is-v1-eng-isl
- OPUS-elrc_516_icelandic_medicines-v1-eng-isl
- OPUS-elrc_517_icelandic_directorat-v1-eng-isl
- OPUS-elrc_597_www.nordisketax.net-v1-eng-isl
- OPUS-elrc_718_statistics_iceland-v1-eng-isl
- OPUS-elrc_728_www.norden.org-v1-eng-isl
- OPUS-elrc_emea-v1-eng-isl
- OPUS-elrc_antibiotic-v1-eng-isl
- OPUS-elrc_www.norden.org-v1-eng-isl
- OPUS-elrc_www.nordisketax.net-v1-eng-isl
- OPUS-eubookshop-v2-eng-isl
- OPUS-multiccaligned-v1-eng-isl
- OPUS-multiparacrawl-v7.1-eng-isl
- OPUS-opensubtitles-v2024-eng-isl
- OPUS-ted2020-v1-eng-isl
- OPUS-ubuntu-v14.10-eng-isl
- OPUS-bible_uedin-v1-eng-isl
- OPUS-ecdc-v20160316-eng-isl
- OPUS-gnome-v1-eng-isl
- OPUS-hplt-v3-eng-isl
- OPUS-kde4-v2-eng-isl
- OPUS-macocu-v2-eng-isl
- OPUS-multihplt-v3-eng-isl
- OPUS-multimacocu-v2-eng-isl
- OPUS-nllb-v1-eng-isl
- OPUS-opus100_train-1-eng-isl
- OPUS-parice-v1-eng-isl
- OPUS-qed-v2.0a-eng-isl
- OPUS-tatoeba-v20230412-eng-isl
- OPUS-tildemodel-v2018-eng-isl
- OPUS-wikimedia-v20230407-eng-isl
- OPUS-xlent-v1.2-eng-isl
mono_train:
- Statmt-news_crawl-2023-isl
- Leipzig-web-2020_1m-isl_IS
- Leipzig-web_public-2019_1m-isl_IS
- Leipzig-news-2020_30k-isl
- Leipzig-newscrawl-2019_300k-isl
- Leipzig-wikipedia-2021_100k-isl
########## ENG-JPN ##########
- id: wmt26-eng-jpn
langs: eng-jpn
train:
- Statmt-news_commentary-18.1-eng-jpn
- Statmt-wikititles-3-jpn-eng
- Statmt-ted-wmt20-eng-jpn
- Statmt-ccaligned-1-eng-jpn
- KECL-paracrawl-3-eng-jpn
- Facebook-wikimatrix-1-eng-jpn
- Phontron-kftt_train-1-eng-jpn
- StanfordNLP-jesc_train-1-eng-jpn
- Neulab-tedtalks_train-1-eng-jpn
- LinguaTools-wikititles-2014-eng-jpn
- OPUS-ccmatrix-v1-eng-jpn
- OPUS-eubookshop-v2-eng-jpn
- OPUS-gnome-v1-eng-jpn
- OPUS-globalvoices-v2018q4-eng-jpn
- OPUS-hplt-v2-eng-jpn
- OPUS-kde4-v2-eng-jpn
- OPUS-mdn_web_docs-v20230925-eng-jpn
- OPUS-multiccaligned-v1-eng-jpn
- OPUS-multihplt-v2-eng-jpn
- OPUS-nllb-v1-eng-jpn
- OPUS-openoffice-v3-eng_GB-jpn
- OPUS-opensubtitles-v2024-eng-jpn
- OPUS-php-v1-eng-jpn
- OPUS-qed-v2.0a-eng-jpn
- OPUS-ted2020-v1-eng-jpn
- OPUS-tanzil-v1-eng-jpn
- OPUS-tatoeba-v20230412-eng-jpn
- OPUS-ubuntu-v14.10-eng-jpn
- OPUS-xlent-v1.2-eng-jpn
- OPUS-bible_uedin-v1-eng-jpn
- OPUS-wikimedia-v20230407-eng-jpn
- OPUS-alt-v20191206-eng-jpn
- OPUS-jesc-v20191205-eng-jpn
- OPUS-jparacrawl-v3.0-eng-jpn
- OPUS-kftt-v1.0-eng-jpn
- OPUS-openoffice-v2-eng-jpn
- OPUS-opus100_train-1-eng-jpn
- OPUS-tldr_pages-v20251124-eng-jpn
mono_train:
- Statmt-news_crawl-2023-jpn
- Statmt-news_commentary-18.1-jpn
- Statmt-commoncrawl-wmt22-jpn
- Leipzig-web-2020_1m-jpn_JP
- Leipzig-comweb-2018_1m-jpn
- Leipzig-web_public-2019_1m-jpn_JP
- Leipzig-news-2020_100k-jpn
- Leipzig-newscrawl-2019_1m-jpn
- Leipzig-wikipedia-2021_1m-jpn
########## ENG-KOR ##########
- id: wmt26-eng-kor
langs: eng-kor
train:
- Statmt-ccaligned-1-eng-kor_KR
- ParaCrawl-paracrawl-1_bonus-eng-kor
- Facebook-wikimatrix-1-eng-kor
- Neulab-tedtalks_train-1-eng-kor
- ELRC-wikipedia_health-1-eng-kor
- ELRC-hrw_dataset_v1-1-eng-kor
- LinguaTools-wikititles-2014-eng-kor
- OPUS-ccmatrix-v1-eng-kor
- OPUS-elrc_3070_wikipedia_health-v1-eng-kor
- OPUS-elrc_wikipedia_health-v1-eng-kor
- OPUS-elrc_2922-v1-eng-kor
- OPUS-gnome-v1-eng-kor
- OPUS-globalvoices-v2018q4-eng-kor
- OPUS-hplt-v2-eng-kor
- OPUS-multihplt-v2-eng-kor
- OPUS-kde4-v2-eng-kor
- OPUS-mdn_web_docs-v20230925-eng-kor
- OPUS-multiccaligned-v1-eng-kor
- OPUS-nllb-v1-eng-kor
- OPUS-opensubtitles-v2024-eng-kor
- OPUS-php-v1-eng-kor
- OPUS-qed-v2.0a-eng-kor
- OPUS-ted2020-v1-eng-kor
- OPUS-tanzil-v1-eng-kor
- OPUS-tatoeba-v20230412-eng-kor
- OPUS-ubuntu-v14.10-eng-kor
- OPUS-xlent-v1.2-eng-kor
- OPUS-bible_uedin-v1-eng-kor
- OPUS-wikimedia-v20230407-eng-kor
- OPUS-opus100_train-1-eng-kor
- OPUS-tldr_pages-v20251124-eng-kor
- OPUS-translatewiki-v20250101-eng-kor
mono_train:
- Statmt-news_crawl-2023-kor
- Leipzig-web-2020_1m-kor_KR
- Leipzig-news-2020_1m-kor
- Leipzig-wikipedia-2021_1m-kor
########## CES-VIE (new) ##########
- id: wmt26-ces-vie
langs: ces-vie
train:
- Facebook-wikimatrix-1-ces-vie
- Neulab-tedtalks_train-1-vie-ces
- OPUS-bible_uedin-v1-ces-vie
- OPUS-ccmatrix-v1-ces-vie
- OPUS-elrc_wikipedia_health-v1-ces-vie
- OPUS-gnome-v1-ces-vie
- OPUS-kde4-v2-ces-vie
- OPUS-multiccaligned-v1.1-ces-vie
- OPUS-nllb-v1-ces-vie
- OPUS-opensubtitles-v2024-ces-vie
- OPUS-qed-v2.0a-ces-vie
- OPUS-tatoeba-v20230412-ces-vie
- OPUS-ted2020-v1-ces-vie
- OPUS-ubuntu-v14.10-ces-vie
- OPUS-wikimedia-v20230407-ces-vie
- OPUS-xlent-v1.2-ces-vie
mono_train:
- Leipzig-web-2013_10k-vie_KH
- Leipzig-mixed-2014_1m-vie
- Leipzig-news-2020_1m-vie
- Leipzig-newscrwal-2011_1m-vie
- Leipzig-web-2015_1m-vie_VN
- Leipzig-wikipedia-2021_1m-vie
########## ENG-HYE (new) ##########
- id: wmt26-eng-hye
langs: eng-hye
train:
- Neulab-tedtalks_train-1-eng-hye
- OPUS-bible_uedin-v1-eng-hye
- OPUS-gnome-v1-eng-hye
- OPUS-kde4-v2-eng-hye
- OPUS-multiccaligned-v1-eng-hye
- OPUS-nllb-v1-eng-hye
- OPUS-opensubtitles-v2024-eng-hye
- OPUS-opus100_train-1-eng-hye
- OPUS-paracrawl_bonus-v9-eng-hye
- OPUS-qed-v2.0a-eng-hye
- OPUS-tatoeba-v20230412-eng-hye
- OPUS-ted2020-v1-eng-hye
- OPUS-ubuntu-v14.10-eng-hye
- OPUS-wikimedia-v20230407-eng-hye
- OPUS-xlent-v1.2-eng-hye
- Statmt-ccaligned-1-eng-hye_AM
mono_train:
- Leipzig-web-2017_1m-hye_AM
- Leipzig-community-2017-hy
- Leipzig-news-2021_30k-hye
- Leipzig-wikipedia-2021_1m-hye
########## ENG-BEL (new) ##########
- id: wmt26-eng-bel
langs: eng-bel
train:
- ELRC-wikipedia_health-1-bel-eng
- Facebook-wikimatrix-1-bel-eng
- Neulab-tedtalks_train-1-eng-bel
- OPUS-ccmatrix-v1-bel-eng
- OPUS-elrc_2922-v1-bel-eng
- OPUS-elrc_3046_wikipedia_health-v1-bel-eng
- OPUS-elrc_wikipedia_health-v1-bel-eng
- OPUS-eubookshop-v2-bel-eng
- OPUS-gnome-v1-bel-eng
- OPUS-hplt-v2-bel-eng
- OPUS-kde4-v2-bel-eng
- OPUS-kde4-v2-bel-eng_GB
- OPUS-multiccaligned-v1-bel-eng
- OPUS-multihplt-v2-bel-eng
- OPUS-nllb-v1-bel-eng
- OPUS-opensubtitles-v2024-bel-eng
- OPUS-opus100_train-1-bel-eng
- OPUS-qed-v2.0a-bel-eng
- OPUS-tatoeba-v20230412-bel-eng
- OPUS-ted2020-v1-bel-eng
- OPUS-ubuntu-v14.10-bel-eng
- OPUS-wikimedia-v20230407-bel-eng
- OPUS-xlent-v1.2-bel-eng
- Statmt-ccaligned-1-bel_BY-eng
mono_train:
- Leipzig-web-2013_1m-bel_BY
- Leipzig-web-2015_300k-bel_BY
- Leipzig-news-2020_100k-bel
- Leipzig-newscrawl-2015_1m-bel
- Leipzig-newscrawl-2017_300k-bel
- Leipzig-wikipedia-2021_300k-bel
########## ENG-ZHO_TW (new) ##########
- id: wmt26-eng-zho_TW
langs: eng-zho_TW
train:
- ELRC-hrw_dataset_v1-1-eng-zho
- ELRC-wikipedia_health-1-eng-zho
- Facebook-wikimatrix-1-eng-zho
- LinguaTools-wikititles-2014-eng-zho
- Neulab-tedtalks_train-1-eng-zho
- OPUS-bible_uedin-v1-eng-zho
- OPUS-ccmatrix-v1-eng-zho
- OPUS-elrc_2922-v1-eng-zho
- OPUS-elrc_3056_wikipedia_health-v1-eng-zho
- OPUS-elrc_wikipedia_health-v1-eng-zho
- OPUS-eubookshop-v2-eng-zho
- OPUS-infopankki-v1-eng-zho
- OPUS-multiun-v1-eng-zho
- OPUS-nllb-v1-eng-zho
- OPUS-opensubtitles-v2016-eng-zho
- OPUS-opus100_train-1-eng-zho
- OPUS-paracrawl_bonus-v9-eng-zho
- OPUS-php-v1-eng-zho
- OPUS-qed-v2.0a-eng-zho
- OPUS-spc-v1-eng-zho
- OPUS-tanzil-v1-eng-zho
- OPUS-ted2013-v1.1-eng-zho
- OPUS-ted2020-v1-eng-zho
- OPUS-tico_19-v20201028-eng-zho
- OPUS-ubuntu-v14.10-eng-zho
- OPUS-unpc-v1.0-eng-zho
- OPUS-wikimedia-v20230407-eng-zho
- OPUS-xlent-v1.2-eng-zho
- ParaCrawl-paracrawl-1_bonus-eng-zho
#- Statmt-backtrans_enzh-wmt20-eng-zho
- Statmt-ccaligned-1-eng-zho_TW
- Statmt-news_commentary-18.1-eng-zho
- Statmt-wikititles-3-zho-eng
- OPUS-alt-v20191206-eng-zho
- OPUS-gnome-v1-eng-zho_TW
- OPUS-gnome-v1-eng_AU-zho_TW
- OPUS-gnome-v1-eng_CA-zho_TW
- OPUS-gnome-v1-eng_GB-zho_TW
- OPUS-gnome-v1-eng_NZ-zho_TW
- OPUS-gnome-v1-eng_US-zho_TW
- OPUS-kde4-v2-eng-zho_TW
- OPUS-kde4-v2-eng_GB-zho_TW
- OPUS-kdedoc-v1-eng_GB-zho_TW
- OPUS-mdn_web_docs-v20230925-eng-zho_TW
- OPUS-multiccaligned-v1-eng-zho_TW
- OPUS-nllb-v1-eng-zho_TW
- OPUS-opensubtitles-v2024-eng-zho_TW
- OPUS-php-v1-eng-zho_TW
- OPUS-ted2020-v1-eng-zho_TW
- OPUS-tldr_pages-v20251124-eng-zho
- OPUS-ubuntu-v14.10-eng-zho_TW
- OPUS-ubuntu-v14.10-eng_AU-zho_TW
- OPUS-ubuntu-v14.10-eng_CA-zho_TW
- OPUS-ubuntu-v14.10-eng_GB-zho_TW
- OPUS-ubuntu-v14.10-eng_NZ-zho_TW
- OPUS-ubuntu-v14.10-eng_US-zho_TW
- OPUS-wikimedia-v20230407-eng-zho_TW
mono_train:
- Statmt-news_crawl-2023-zho
- Statmt-news_commentary-18.1-zho
- Statmt-commoncrawl-wmt22-zho
- Leipzig-web-2015_1m-zho_CN
- Leipzig-news-2007_2009_1m-zho
- Leipzig-news-2020_300k-zho
- Leipzig-simp_twweb-2014_300k-zho
- Leipzig-tradnewscrawl-2011_1m-zho
- Leipzig-wikipedia-2018_1m-zho
########## ENG-DEU (new) ##########
- id: wmt26-eng-deu
langs: eng-deu
train:
- EU-dcep-1-deu-eng
- EU-eac_forms-1-deu-eng
- EU-eac_reference-1-deu-eng
- EU-ecdc-1-eng-deu
- Facebook-wikimatrix-1-deu-eng
- LinguaTools-wikititles-2014-deu-eng
- Neulab-tedtalks_train-1-eng-deu
- OPUS-bible_uedin-v1-deu-eng
- OPUS-books-v1-deu-eng
- OPUS-ccaligned-v1-deu-eng
- OPUS-ccmatrix-v1-deu-eng
- OPUS-dgt-v4-deu-eng
- OPUS-ecb-v1-deu-eng
- OPUS-ecdc-v20160316-deu-eng
- OPUS-elitr_eca-v1-deu-eng
- OPUS-elra_w0143-v1-deu-eng
- OPUS-elra_w0197-v1-deu-eng_GB
- OPUS-elra_w0198-v1-deu-eng_GB
- OPUS-elra_w0199-v1-deu-eng_GB
- OPUS-elra_w0200-v1-deu-eng_GB
- OPUS-elra_w0201-v1-deu-eng
- OPUS-elra_w0301-v1-deu-eng
- OPUS-elrc_1077_euipo_law-v1-deu-eng
- OPUS-elrc_1086_information_portal_g-v1-deu-eng
- OPUS-elrc_1088_german_foreign_offic-v1-deu-eng
- OPUS-elrc_1089_german_foreign_offic-v1-deu-eng
- OPUS-elrc_1090_german_foreign_offic-v1-deu-eng
- OPUS-elrc_1092_euipo_list-v1-deu-eng
- OPUS-elrc_1117_cordis_news-v1-deu-eng
- OPUS-elrc_1121_cordis_results_brief-v1-deu-eng
- OPUS-elrc_1238_energy_report_city-v1-deu-eng
- OPUS-elrc_1240_austrian_research_te-v1-deu-eng
- OPUS-elrc_1241_2017_activity_report-v1-deu-eng
- OPUS-elrc_1243_vienna_environmental-v1-deu-eng
- OPUS-elrc_2014_euipo_2017-v1-deu-eng
- OPUS-elrc_2410_portal_oficial_turis-v1-deu-eng
- OPUS-elrc_2612_artigos_visitportuga-v1-deu-eng
- OPUS-elrc_2614_localidades_2007-v1-deu-eng
- OPUS-elrc_2616_museus_2007-v1-deu-eng
- OPUS-elrc_2622_arquitectura_2007-v1-deu-eng
- OPUS-elrc_2623_patrimnio_aores_2006-v1-deu-eng
- OPUS-elrc_2638_monumentos_2007-v1-deu-eng
- OPUS-elrc_2639_parques_e_reservas-v1-deu-eng
- OPUS-elrc_2641_praias_2007-v1-deu-eng
- OPUS-elrc_2682-v1-deu-eng
- OPUS-elrc_2714_emea-v1-deu-eng
- OPUS-elrc_2736_vaccination-v1-deu-eng
- OPUS-elrc_2875_eu_publications_medi-v1-deu-eng
- OPUS-elrc_2922-v1-deu-eng
- OPUS-elrc_2923-v1-deu-eng
- OPUS-elrc_3063_wikipedia_health-v1-deu-eng
- OPUS-elrc_3202_antibiotic-v1-deu-eng
- OPUS-elrc_3293_europarl_covid-v1-deu-eng
- OPUS-elrc_3382-v1-deu-eng
- OPUS-elrc_3464_ec_europa_covid-v1-deu-eng
- OPUS-elrc_3565_eur_lex_covid-v1-deu-eng
- OPUS-elrc_3606_presscorner_covid-v1-deu-eng
- OPUS-elrc_3852_development_funds_re-v1-deu-eng
- OPUS-elrc_401_swedish_labour_part2-v1-deu-eng
- OPUS-elrc_403_rights_arrested-v1-deu-eng
- OPUS-elrc_406_swedish_labour_part1-v1-deu-eng
- OPUS-elrc_416_swedish_social_secur-v1-deu-eng
- OPUS-elrc_417_swedish_work_environ-v1-deu-eng
- OPUS-elrc_4992_customer_support_mt-v1-deu-eng
- OPUS-elrc_5067_scipar-v1-deu-eng
- OPUS-elrc_5220_information_crime_vi-v1-deu-eng
- OPUS-elrc_621_federal_constitution-v1-deu-eng
- OPUS-elrc_630_bmvi_publications-v1-deu-eng
- OPUS-elrc_631_bmvi_website-v1-deu-eng
- OPUS-elrc_632_bmi_brochure_civil-v1-deu-eng
- OPUS-elrc_633_bmi_brochures_2016-v1-deu-eng
- OPUS-elrc_634_bmi_brochures_2011-v1-deu-eng
- OPUS-elrc_637_sip-v1-deu-eng
- OPUS-elrc_638_luxembourg.lu-v1-deu-eng
- OPUS-elrc_642_federal_foreign_berl-v1-deu-eng
- OPUS-elrc_774_presidency-v1-deu-eng
- OPUS-elrc_775_by_presidency_counci-v1-deu-eng
- OPUS-elrc_776_by_presidency_counci-v1-deu-eng
- OPUS-elrc_832_charter_values_citiz-v1-deu-eng
- OPUS-elrc_antibiotic-v1-deu-eng
- OPUS-elrc_arquitectura_2007-v1-deu-eng
- OPUS-elrc_artigos_visitportuga-v1-deu-eng
- OPUS-elrc_cordis_news-v1-deu-eng
- OPUS-elrc_cordis_results-v1-deu-eng
- OPUS-elrc_ec_europa-v1-deu-eng
- OPUS-elrc_emea-v1-deu-eng
- OPUS-elrc_eu_publications-v1-deu-eng
- OPUS-elrc_euipo_2017-v1-deu-eng
- OPUS-elrc_euipo_law-v1-deu-eng
- OPUS-elrc_euipo_list-v1-deu-eng
- OPUS-elrc_eur_lex-v1-deu-eng
- OPUS-elrc_europarl_covid-v1-deu-eng
- OPUS-elrc_federal_foreign-v1-deu-eng_GB
- OPUS-elrc_german_foreign-v1-deu-eng_GB
- OPUS-elrc_information_portal-v1-deu-eng
- OPUS-elrc_localidades_2007-v1-deu-eng
- OPUS-elrc_museus_2007-v1-deu-eng
- OPUS-elrc_parques_e-v1-deu-eng
- OPUS-elrc_patrimnio_aores-v1-deu-eng
- OPUS-elrc_praias_2007-v1-deu-eng
- OPUS-elrc_presscorner_covid-v1-deu-eng
- OPUS-elrc_swedish_labour-v1-deu-eng
- OPUS-elrc_termitur-v1-deu-eng
- OPUS-elrc_vaccination-v1-deu-eng
- OPUS-elrc_wikipedia_health-v1-deu-eng
- OPUS-emea-v3-deu-eng
- OPUS-eubookshop-v2-deu-eng
- OPUS-euconst-v1-deu-eng
- OPUS-europat-v3-deu-eng
- OPUS-globalvoices-v2018q4-deu-eng
- OPUS-gnome-v1-deu-eng
- OPUS-gnome-v1-deu_CH-eng
- OPUS-jrc_acquis-v3.0-deu-eng
- OPUS-kdedoc-v1-deu-eng_GB
- OPUS-mpc1-v1-deu-eng
- OPUS-multiccaligned-v1-deu-eng
- OPUS-multiparacrawl-v7.1-deu-eng
- OPUS-multiun-v1-deu-eng
- OPUS-nllb-v1-deu-eng
- OPUS-openoffice-v3-deu-eng_GB
- OPUS-opensubtitles-v2024-deu-eng
- OPUS-opus100_train-1-deu-eng
- OPUS-php-v1-deu-eng
- OPUS-qed-v2.0a-deu-eng
- OPUS-rf-v1-deu-eng
- OPUS-salome-v1-deu-eng
- OPUS-stanfordnlp_nmt-v1.0-eng-deu
- OPUS-tanzil-v1-deu-eng
- OPUS-tatoeba-v20230412-deu-eng
- OPUS-ted2013-v1.1-deu-eng
- OPUS-ted2020-v1-deu-eng
- OPUS-tildemodel-v2018-deu-eng
- OPUS-wikimedia-v20230407-deu-eng
- OPUS-wikipedia-v1.0-deu-eng
- OPUS-xlent-v1.2-deu-eng
- ParaCrawl-paracrawl-9-eng-deu
- Statmt-commoncrawl_wmt13-1-deu-eng
- Statmt-europarl-9-deu-eng
- Statmt-europarl_wmt13-7-deu-eng
- Statmt-news_commentary-18.1-deu-eng
- Statmt-news_commentary_wmt18-13-deu-eng
- Statmt-wiki_titles-2-deu-eng
- Statmt-wikititles-3-deu-eng
- Tilde-airbaltic-1-deu-eng
- Tilde-czechtourism-1-deu-eng
- Tilde-ecb-2017-deu-eng
- Tilde-eesc-2017-deu-eng
- Tilde-ema-2016-deu-eng
- Tilde-rapid-2019-deu-eng
- OPUS-kde4-v2-deu-eng
- OPUS-openoffice-v2-deu-eng
- OPUS-tldr_pages-v20251124-deu-eng
- OPUS-ubuntu-v14.10-deu-eng
mono_train:
- Statmt-news_crawl-2023-deu
- Statmt-europarl-10-deu
- Statmt-news_commentary-18.1-deu
- Statmt-commoncrawl-wmt22-deu
- Leipzig-web-2013_10k-deu_BE
- Leipzig-web-2002_1m-deu_CH
- Leipzig-comweb-2021_1m-deu
- Leipzig-web-2021_1m-deu_DE
- Leipzig-web_public-2019_1m-deu_DE
- Leipzig-euweb-2015_300k-deu
- Leipzig-euweb-2017_1m-deu
- Leipzig-mixed_typical-2011_1m-deu
- Leipzig-news-2022_1m-deu
- Leipzig-newscrawl-2020_1m-deu
- Leipzig-newscrawl_public-2019_1m-deu
- Leipzig-web-2002_1m-deu
- Leipzig-web-2011_1m-deu
- Leipzig-wikipedia-2021_1m-deu
########## ENG-IND (new) ##########
- id: wmt26-eng-ind
langs: eng-ind
train:
- ELRC-hrw_dataset_v1-1-eng-ind
- ELRC-wikipedia_health-1-eng-ind
- Facebook-wikimatrix-1-eng-ind
- Neulab-tedtalks_train-1-eng-ind
- OPUS-bible_uedin-v1-eng-ind
- OPUS-ccmatrix-v1-eng-ind
- OPUS-elrc_2922-v1-eng-ind
- OPUS-elrc_3049_wikipedia_health-v1-eng-ind
- OPUS-elrc_wikipedia_health-v1-eng-ind
- OPUS-globalvoices-v2018q4-eng-ind
- OPUS-gnome-v1-eng-ind
- OPUS-kde4-v2-eng-ind
- OPUS-multiccaligned-v1-eng-ind
- OPUS-nllb-v1-eng-ind
- OPUS-opensubtitles-v2024-eng-ind
- OPUS-opus100_train-1-eng-ind
- OPUS-paracrawl_bonus-v9-eng-ind
- OPUS-qed-v2.0a-eng-ind
- OPUS-tanzil-v1-eng-ind
- OPUS-tatoeba-v20230412-eng-ind
- OPUS-ted2020-v1-eng-ind
- OPUS-tico_19-v20201028-eng-ind
- OPUS-ubuntu-v14.10-eng-ind
- OPUS-wikimedia-v20230407-eng-ind
- OPUS-xlent-v1.2-eng-ind
- Statmt-ccaligned-1-eng-ind_ID
- Statmt-news_commentary-18.1-eng-ind
- OPUS-alt-v20191206-eng-ind
- OPUS-tldr_pages-v20251124-eng-ind
mono_train:
- Statmt-news_crawl-2023-ind
- Statmt-news_commentary-18.1-ind
- Leipzig-comweb-2018_1m-ind
- Leipzig-web-2015_1m-ind_IN
- Leipzig-mixed-2013_1m-ind
- Leipzig-mixed_tufs4-2012_1m-ind
- Leipzig-news-2022_1m-ind
- Leipzig-newscrawl-2016_1m-ind
- Leipzig-newscrawl_tufs6-2012_3m-ind
- Leipzig-web_tufs13-2012_3m-ind
- Leipzig-wikipedia-2010_300k-ind
- Leipzig-wikipedia-2021_1m-ind
########## ENG-KAZ (new) ##########
- id: wmt26-eng-kaz
langs: eng-kaz
train:
- ELRC-kazakh_legal_mt_test_set-1-eng-kaz
- Facebook-wikimatrix-1-eng-kaz
- Neulab-tedtalks_train-1-eng-kaz
- OPUS-elrc_5042_kazakh_legal_mt-v1-eng-kaz
- OPUS-gnome-v1-eng-kaz
- OPUS-hplt-v2-eng-kaz
- OPUS-kde4-v2-eng-kaz
- OPUS-multiccaligned-v1-eng-kaz
- OPUS-multihplt-v2-eng-kaz
- OPUS-nllb-v1-eng-kaz
- OPUS-opensubtitles-v2024-eng-kaz
- OPUS-opus100_train-1-eng-kaz
- OPUS-qed-v2.0a-eng-kaz
- OPUS-tatoeba-v20230412-eng-kaz
- OPUS-ted2020-v1-eng-kaz
- OPUS-ubuntu-v14.10-eng-kaz
- OPUS-wikimedia-v20230407-eng-kaz
- OPUS-xlent-v1.2-eng-kaz
- Statmt-ccaligned-1-eng-kaz_KZ
- Statmt-news_commentary-18.1-eng-kaz
- Statmt-wiki_titles-1-kaz-eng
- OPUS-translatewiki-v20250101-eng-kaz
mono_train:
- Statmt-news_crawl-2023-kaz
- Statmt-news_commentary-18.1-kaz
- Leipzig-news-2020_30k-kaz
- Leipzig-newscrawl-2016_1m-kaz
- Leipzig-wikipedia-2021_300k-kaz
########## ENG-LLD (new) ##########
- id: wmt26-eng-lld
langs: eng-lld
train:
- OPUS-qed-v2.0a-eng-lld
- OPUS-tatoeba-v20230412-eng-lld
- OPUS-ubuntu-v14.10-eng-lld
- OPUS-ubuntu-v14.10-eng_AU-lld
- OPUS-ubuntu-v14.10-eng_CA-lld
- OPUS-ubuntu-v14.10-eng_GB-lld
- OPUS-wikimedia-v20230407-eng-lld
- OPUS-translatewiki-v20250101-eng-lld
- OPUS-translatewiki-v20250101-eng_CA-lld
mono_train:
- Sfrontull-la_usc_valbadia_loresmt24-1-lld
- Sfrontull-south_tyrol_weather_lld-1-lld
########## ENG-LIJ_LATN (new) ##########
- id: wmt26-eng-lij_Latn
langs: eng-lij_Latn
train:
- AllenAi-nllb-1-eng-lij_Latn
- Conseggioligure-zenamt_eng_train-1-eng-lij_Latn
- Openlanguagedata-oldi_seed-1-eng-lij_Latn
mono_train:
- Conseggioligure-linc-1-lij_Latn
########## ENG-SME (new) ##########
- id: wmt26-eng-sme
langs: eng-sme
train:
- OPUS-kde4-v2-eng-sme
- OPUS-kde4-v2-eng_GB-sme
- OPUS-opensubtitles-v2024-eng-sme
- OPUS-opus100_train-1-eng-sme
- OPUS-tatoeba-v20230412-eng-sme
- OPUS-translatewiki-v20250101-eng_CA-sme
- OPUS-ubuntu-v14.10-eng-sme
- OPUS-ubuntu-v14.10-eng_AU-sme
- OPUS-ubuntu-v14.10-eng_CA-sme
- OPUS-ubuntu-v14.10-eng_GB-sme
- OPUS-ubuntu-v14.10-eng_NZ-sme
- OPUS-ubuntu-v14.10-eng_US-sme
- OPUS-wikimedia-v20230407-eng-sme
mono_train:
- Leipzig-news-2015_10k-sme_NO
- Leipzig-web-2013_10k-sme_NO
- Leipzig-wikipedia-2021_10k-sme
########## ENG-THA (new) ##########
- id: wmt26-eng-tha
langs: eng-tha
train:
- ELRC-hrw_dataset_v1-1-eng-tha
- ELRC-wikipedia_health-1-eng-tha
- Neulab-tedtalks_train-1-eng-tha
- OPUS-bible_uedin-v1-eng-tha
- OPUS-elrc_2922-v1-eng-tha
- OPUS-elrc_3048_wikipedia_health-v1-eng-tha
- OPUS-elrc_wikipedia_health-v1-eng-tha
- OPUS-gnome-v1-eng-tha
- OPUS-hplt-v2-eng-tha
- OPUS-kde4-v2-eng-tha
- OPUS-multiccaligned-v1-eng-tha
- OPUS-multihplt-v2-eng-tha
- OPUS-opensubtitles-v2024-eng-tha
- OPUS-opus100_train-1-eng-tha
- OPUS-paracrawl_bonus-v9-eng-tha
- OPUS-qed-v2.0a-eng-tha
- OPUS-scb_mt_en_th-v1.0-eng-tha
- OPUS-tanzil-v1-eng-tha
- OPUS-tatoeba-v20230412-eng-tha
- OPUS-ted2020-v1-eng-tha
- OPUS-ubuntu-v14.10-eng-tha
- OPUS-wikimedia-v20230407-eng-tha
- OPUS-xlent-v1.2-eng-tha
- Statmt-ccaligned-1-eng-tha_TH
- OPUS-alt-v20191206-eng-tha
- OPUS-tldr_pages-v20251124-eng-tha
mono_train:
- Leipzig-community-2021-tha
- Leipzig-news-2020_30k-tha
- Leipzig-newscrawl-2011_100k-tha
- Leipzig-web-2018_1m-tha_TH
- Leipzig-wikipedia-2021_10k-tha
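The dataset IDs listed above appear to follow a hyphen-separated pattern, `Group-name-version-lang1-lang2` for parallel data and `Group-name-version-lang` for monolingual data. The following is a minimal sketch, assuming that pattern holds for every entry, of how one might split an ID into its fields when filtering or auditing a recipe. The `DatasetId` tuple and the `parse_dataset_id` helper are hypothetical names introduced for illustration only. They are not part of mtdata, which has its own ID handling.

```python
from typing import NamedTuple, Optional


class DatasetId(NamedTuple):
    """Hypothetical container for the fields of one dataset ID."""
    group: str
    name: str
    version: str
    lang1: str
    lang2: Optional[str]  # None for monolingual entries


def parse_dataset_id(did: str) -> DatasetId:
    """Split 'Group-name-version-lang1[-lang2]' on hyphens.

    Assumption: no field contains a hyphen, which holds for the IDs
    shown in this listing (underscores are used inside fields).
    """
    parts = did.split("-")
    if len(parts) == 4:   # monolingual: one language field
        return DatasetId(*parts, None)
    if len(parts) == 5:   # parallel: two language fields
        return DatasetId(*parts)
    raise ValueError(f"unexpected dataset ID shape: {did!r}")


# IDs copied from the recipes above
examples = [
    "OPUS-ubuntu-v14.10-eng_GB-lld",
    "Statmt-wiki_titles-1-kaz-eng",
    "Leipzig-wikipedia-2021_10k-sme",
]
for did in examples:
    print(parse_dataset_id(did))
```

A helper like this can be used, for instance, to group the entries of a recipe by provider (`group`) or to spot entries whose language order differs from the recipe's `langs` field, such as `Statmt-wiki_titles-1-kaz-eng` in the `wmt26-eng-kaz` recipe.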
Issues / Bugs
Please report issues at https://github.com/thammegowda/mtdata/issues, and include the relevant recipe ID (for example, wmt26-eng-kaz) in the report.