This document provides a list of WMT24 General MT task datasets for constrained track and instructions for downloading them using mtdata
.
Parallel Training Data:
File | EN-CS | EN-DE | EN-ES | EN-HI | EN-IS | EN-JA | EN-RU | EN-ZH | EN-UK | CS-UK | JA-ZH | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|
✓ |
✓ |
✓ |
✓ |
|||||||||
Note that only the ticked language pairs are available for constrained participants, but the metadata (tmx files) may be used. JParacrawl is here |
||||||||||||
✓ |
✓ |
✓ |
Same as last year. The fr-de version is here |
|||||||||
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
||||||
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
|||||||
✓ |
✓ |
|||||||||||
✓ |
✓ |
|||||||||||
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
de-en and cs-en contain document information. |
||||||
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
We release the official version, with added language identification (from cld2). |
|||
✓ |
✓ |
✓ |
Back-translated news. The cs-en data is contained in CzEng. The zh-en and ru-en data was produced for the University of Edinburgh systems in 2017 and 2018. |
|||||||||
✓ |
✓ |
✓ |
✓ |
✓ |
||||||||
Note: English side is lowercased. |
||||||||||||
From IWSLT 2017 Evaluation Campaign. |
||||||||||||
✓ |
✓ |
✓ |
||||||||||
✓ |
✓ |
|||||||||||
✓ |
Register and download CzEng2.0. The new CzEng includes synthetic data, and includes all cs-en data supplied for the task. See the CzEng README for more details. |
|||||||||||
✓ |
✓ |
✓ |
✓ |
|||||||||
Others |
✓ |
✓ |
EN-HI: Anuuvaad, IITB, NLLB;; For EN-IS: ParIce |
Monolingual Training Data:
Corpus | CS | DE | EN | ES | HI | IS | JA | RU | ZH | UK | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
Large corpora of crawled news, collected since 2007. For de, cs, and en versions are available with document boundaries, and without sentence-splitting. |
||
✓ |
Corpora crawled from comment sections of online newspapers (no longer updated). |
||||||||||
✓ |
✓ |
✓ |
Monolingual version of European parliament crawl. Superset of the parallel version. |
||||||||
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
Updated Monolingual text from news-commentary crawl. Superset of parallel version. Use the latest version. |
||||
Deduplicated with development and evaluation sentences removed. English was updated 31 January 2016 to remove bad UTF-8. Downloads can be verified with SHA512 checksums. More English is available. |
|||||||||||
✓ |
✓ |
✓ |
✓ |
✓ |
Extended Common Crawl extracted from crawls up to April 2020. |
||||||
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
✓ |
Leipzig Corpora Collection: From 100 to 200 Languages PDF |
|
✓ |
Text crawled from Ukrainian periodicals |
||||||||||
✓ |
Legal Ukrainian: 69M token corpus in the legal sector; crawled from websites belonging to legislation, government, court, and parliament |
MTData
Setup
pip install -I mtdata==0.4.2
Recipes Config File
Config file for CONSTRAINED track (missing datasets behing registration: CzEng2.0, and CCMT):
wget https://www.statmt.org/wmt24/mtdata/mtdata.recipes.wmt24-constrained.yml
See for dataset IDs selected for constrained eval. By default, mtdata
loads mtdata.recipes*.yml
glob in the current directory (where mtdata
commands are invoked). To place recipe YAML files in a different directory, export MTDATA_RECIPES=/path/to/recipesdir
.
List All Recipes
$ mtdata list-recipe -id | grep wmt24
wmt24-eng-ces
wmt24-eng-zho
wmt24-eng-deu
wmt24-eng-spa
wmt24-eng-jpn
wmt24-eng-rus
wmt24-ces-ukr
wmt24-eng-ukr
wmt24-eng-hin
wmt24-eng-isl
wmt24-jpn-zho
wmt24*
ids are all loaded from mtdata.recipes.wmt24*.yml
file.
Download Recipes
# example:
mtdata get-recipe -ri wmt24-eng-ces -o wmt24-eng-ces # -h|--help for help
# Optional: download and cache all datasets in parallel
mtdata -no-pb cache -j 8 -ri "wmt24-*"
for id in wmt24-eng-{ces,deu,spa,hin,isl,jpn,rus,ukr,zho} wmt24-{ces-ukr,jpn-zho}; do
mtdata get-recipe -i $id -o $id --compress --no-merge -j8
done
mtdata stores its cache under $HOME/.mtdata by default. To change, export MTDATA=/path/to/cache
|
mtdata uses pigz (if available in PATH) for faster compression/decompression. This feature is enabled by default in v0.4.1; to disable, export USE_PIGZ=0
|
Constrained Task Datasets
The selected dataset IDs for constrained task are as follows:
# Setup: pip install mtdata==0.4.0
# To list/view all available datasets:
# mtdata list -id -l <lang1>-<lang2> # parallel
# mtdata list -id -l <lang> # monolingual
# To get a dataset
# mtdata echo <data_id>
- id: wmt24-eng-ces
langs: eng-ces
train:
- Statmt-europarl-10-ces-eng
- ParaCrawl-paracrawl-9-eng-ces
- Statmt-commoncrawl_wmt13-1-ces-eng
- Statmt-news_commentary-18.1-ces-eng
- Statmt-wikititles-3-ces-eng
- Facebook-wikimatrix-1-ces-eng
- Tilde-eesc-2017-ces-eng
- Tilde-ema-2016-ces-eng
- Tilde-ecb-2017-ces-eng
- Tilde-rapid-2019-ces-eng
# TODO: manually download bracktranslated news and CzEng2.0
mono_train:
- Statmt-news_crawl-2023-ces
- Statmt-europarl-10-ces
- Statmt-news_commentary-18.1-ces
- Statmt-commoncrawl-wmt22-ces
- Leipzig-news-2022_1m-ces
- Leipzig-newscrawl-2019_1m-ces
- Leipzig-wikipedia-2021_1m-ces
- Leipzig-web_public-2019_1m-ces_CZ
# TODO: extended common crawl (too big) https://data.statmt.org/wmt21/translation-task/cc-mono/
- id: wmt24-eng-zho
langs: eng-zho
train:
- ParaCrawl-paracrawl-1_bonus-eng-zho
- Statmt-news_commentary-18.1-eng-zho
- Statmt-wikititles-3-zho-eng
- OPUS-unpc-v1.0-eng-zho
- Facebook-wikimatrix-1-eng-zho
- Statmt-backtrans_enzh-wmt20-eng-zho # English translated to Chinese. Double check if you want this
mono_train: &mono_zho
- Statmt-news_crawl-2023-zho
- Statmt-news_commentary-18.1-zho
- Statmt-commoncrawl-wmt22-zho
- Leipzig-wikipedia-2018_1m-zho
- Leipzig-web-2016_1m-zho_MO
- Leipzig-tradnewscrawl-2011_1m-zho
- Leipzig-news-2020_300k-zho
# TODO: extended common crawl (too big) https://data.statmt.org/wmt21/translation-task/cc-mono/
- id: wmt24-eng-deu
langs: eng-deu
train:
- Statmt-europarl-10-deu-eng
- ParaCrawl-paracrawl-9-eng-deu
- Statmt-commoncrawl_wmt13-1-deu-eng
- Statmt-news_commentary-18.1-deu-eng
- Statmt-wikititles-3-deu-eng
- Facebook-wikimatrix-1-deu-eng
- Tilde-eesc-2017-deu-eng
- Tilde-ema-2016-deu-eng
- Tilde-airbaltic-1-deu-eng
- Tilde-czechtourism-1-deu-eng
- Tilde-ecb-2017-deu-eng
- Tilde-rapid-2016-deu-eng
- Tilde-rapid-2019-deu-eng
mono_train:
- Statmt-news_crawl-2023-deu
- Statmt-europarl-10-deu
- Statmt-news_commentary-18.1-deu
- Statmt-commoncrawl-wmt22-deu
- Leipzig-wikipedia-2021_1m-deu
- Leipzig-comweb-2021_1m-deu
- Leipzig-mixed_typical-2011_1m-deu
- Leipzig-news-2022_30k-deu
- Leipzig-newscrawl-2020_1m-deu
- Leipzig-web-2021_100k-deu_DE
# TODO: extended common crawl
- id: wmt24-eng-spa
langs: eng-spa
train:
- Statmt-commoncrawl_wmt13-1-spa-eng
- Statmt-europarl-7-spa-eng
- Statmt-news_commentary-18.1-eng-spa
- Statmt-ccaligned-1-eng-spa
- ParaCrawl-paracrawl-9-eng-spa
- Tilde-eesc-2017-eng-spa
- Tilde-ema-2016-eng-spa
- Tilde-czechtourism-1-eng-spa
- Tilde-ecb-2017-eng-spa
- Tilde-rapid-2016-eng-spa
- Tilde-worldbank-1-eng-spa
- Facebook-wikimatrix-1-eng-spa
- EU-ecdc-1-eng-spa
- EU-eac_forms-1-eng-spa
- EU-eac_reference-1-eng-spa
- EU-dcep-1-eng-spa
- LinguaTools-wikititles-2014-eng-spa
- Neulab-tedtalks_train-1-eng-spa
- OPUS-books-v1-eng-spa
- OPUS-dgt-v2019-eng-spa
- OPUS-dgt-v4-eng-spa
- OPUS-ecb-v1-eng-spa
- OPUS-elitr_eca-v1-eng-spa
- OPUS-elra_w0147-v1-eng-spa
- OPUS-elra_w0305-v1-eng-spa
- OPUS-elrc_1076_euipo_law-v1-eng-spa
- OPUS-elrc_1082_cnio-v1-eng-spa
- OPUS-elrc_1083_aecosan-v1-eng-spa
- OPUS-elrc_1084_agencia_tributaria-v1-eng-spa
- OPUS-elrc_1096_euipo_list-v1-eng-spa
- OPUS-elrc_1125_cordis_news-v1-eng-spa
- OPUS-elrc_1126_cordis_results_brief-v1-eng-spa
- OPUS-elrc_2015_euipo_2017-v1-eng-spa
- OPUS-elrc_2410_portal_oficial_turis-v1-eng-spa
- OPUS-elrc_2478_glossrio_pt_en-v1-eng-spa
- OPUS-elrc_2479_lei_orgnica_2-v1-eng-spa
- OPUS-elrc_2480_estatuto_dos_deputad-v1-eng-spa
- OPUS-elrc_2481_constituio_da_repbli-v1-eng-spa
- OPUS-elrc_2498_plan_nacional_e-v1-eng-spa
- OPUS-elrc_2502_termitur-v1-eng-spa
- OPUS-elrc_2503_descripciones_vulner-v1-eng-spa
- OPUS-elrc_2536_estatuto_da_vtima-v1-eng-spa
- OPUS-elrc_2538_lei_25_2009-v1-eng-spa
- OPUS-elrc_2543_inteliterm-v1-eng-spa
- OPUS-elrc_2558_government_websites_-v1-eng-spa
- OPUS-elrc_2612_artigos_visitportuga-v1-eng-spa
- OPUS-elrc_2614_localidades_2007-v1-eng-spa
- OPUS-elrc_2616_museus_2007-v1-eng-spa
- OPUS-elrc_2622_arquitectura_2007-v1-eng-spa
- OPUS-elrc_2623_patrimnio_aores_2006-v1-eng-spa
- OPUS-elrc_2638_monumentos_2007-v1-eng-spa
- OPUS-elrc_2639_parques_e_reservas-v1-eng-spa
- OPUS-elrc_2641_praias_2007-v1-eng-spa
- OPUS-elrc_2722_emea-v1-eng-spa
- OPUS-elrc_2738_vaccination-v1-eng-spa
- OPUS-elrc_2881_eu_publications_medi-v1-eng-spa
- OPUS-elrc_3077_wikipedia_health-v1-eng-spa
- OPUS-elrc_3210_antibiotic-v1-eng-spa
- OPUS-elrc_3299_europarl_covid-v1-eng-spa
- OPUS-elrc_3470_ec_europa_covid-v1-eng-spa
- OPUS-elrc_3571_eur_lex_covid-v1-eng-spa
- OPUS-elrc_3612_presscorner_covid-v1-eng-spa
- OPUS-elrc_3852_development_funds_re-v1-eng-spa
- OPUS-elrc_401_swedish_labour_part2-v1-eng-spa
- OPUS-elrc_406_swedish_labour_part1-v1-eng-spa
- OPUS-elrc_416_swedish_social_secur-v1-eng-spa
- OPUS-elrc_417_swedish_work_environ-v1-eng-spa
- OPUS-elrc_436_swedish_food-v1-eng-spa
- OPUS-elrc_4992_customer_support_mt-v1-eng-spa
- OPUS-elrc_5067_scipar-v1-eng-spa
- OPUS-elrc_5190_cyber_mt_test-v1-eng-spa
- OPUS-elrc_637_sip-v1-eng-spa
- OPUS-elrc_832_charter_values_citiz-v1-eng-spa
- OPUS-elrc_844_paesi__administratio-v1-eng-spa
- OPUS-elrc_863_government_websites_-v1-eng-spa
- OPUS-elrc_arquitectura_2007-v1-eng-spa
- OPUS-elrc_artigos_visitportuga-v1-eng-spa
- OPUS-elrc_cordis_news-v1-eng-spa
- OPUS-elrc_cordis_results-v1-eng-spa
- OPUS-elrc_ec_europa-v1-eng-spa
- OPUS-elrc_emea-v1-eng-spa
- OPUS-elrc_euipo_2017-v1-eng-spa
- OPUS-elrc_euipo_law-v1-eng-spa
- OPUS-elrc_euipo_list-v1-eng-spa
- OPUS-elrc_europarl_covid-v1-eng-spa
- OPUS-elrc_eur_lex-v1-eng-spa
- OPUS-elrc_eu_publications-v1-eng-spa
- OPUS-elrc_localidades_2007-v1-eng-spa
- OPUS-elrc_museus_2007-v1-eng-spa
- OPUS-elrc_parques_e-v1-eng-spa
- OPUS-elrc_patrimnio_aores-v1-eng-spa
- OPUS-elrc_praias_2007-v1-eng-spa
- OPUS-elrc_swedish_labour-v1-eng-spa
- OPUS-elrc_termitur-v1-eng-spa
- OPUS-elrc_antibiotic-v1-eng-spa
- OPUS-elrc_government_websites-v1-eng-spa
- OPUS-elrc_presscorner_covid-v1-eng-spa
- OPUS-elrc_vaccination-v1-eng-spa
- OPUS-elrc_wikipedia_health-v1-eng-spa
- OPUS-elrc_2682-v1-eng-spa
- OPUS-elrc_2922-v1-eng-spa
- OPUS-elrc_2923-v1-eng-spa
- OPUS-elrc_3382-v1-eng-spa
- OPUS-emea-v3-eng-spa
- OPUS-eubookshop-v2-eng-spa
- OPUS-euconst-v1-eng-spa
- OPUS-europat-v3-eng-spa
- OPUS-europarl-v8-eng-spa
- OPUS-globalvoices-v2018q4-eng-spa
- OPUS-multiccaligned-v1-eng-spa
- OPUS-multiun-v1-eng-spa
- OPUS-unpc-v1.0-eng-spa
- OPUS-wikimatrix-v1-eng-spa
- OPUS-wikipedia-v1.0-eng-spa
- OPUS-xlent-v1.1-eng-spa
- OPUS-infopankki-v1-eng-spa
- OPUS-tico_19-v20201028-eng-spa
- OPUS-wikimedia-v20210402-eng-spa
mono_train:
- Statmt-europarl-10-spa
- Statmt-news_commentary-18.1-spa
- Statmt-news_crawl-2023-spa
- Leipzig-news-2022_1m-spa
- Leipzig-newscrawl_public-2019_1m-spa
- Leipzig-web-2016_1m-spa
- Leipzig-web-2016_1m-spa_AR
- Leipzig-web-2016_1m-spa_PA
- Leipzig-web-2016_1m-spa_PE
- Leipzig-web-2016_1m-spa_PY
- Leipzig-web-2016_1m-spa_SV
- Leipzig-web-2016_1m-spa_UY
- Leipzig-web-2016_1m-spa_VE
- Leipzig-wikipedia-2021_1m-spa
- id: wmt24-eng-jpn
langs: eng-jpn
train:
- Statmt-news_commentary-18.1-eng-jpn
- KECL-paracrawl-3-eng-jpn
- Statmt-wikititles-3-jpn-eng
- Facebook-wikimatrix-1-eng-jpn
- Statmt-ted-wmt20-eng-jpn
- StanfordNLP-jesc_train-1-eng-jpn
- Phontron-kftt_train-1-eng-jpn
mono_train: &mono_jpn
- Statmt-news_crawl-2023-jpn
- Statmt-news_commentary-18.1-jpn
- Statmt-commoncrawl-wmt22-jpn
- Leipzig-web-2020_1m-jpn_JP
- Leipzig-comweb-2018_1m-jpn
- Leipzig-web_public-2019_1m-jpn_JP
- Leipzig-news-2020_100k-jpn
- Leipzig-newscrawl-2019_1m-jpn
- Leipzig-wikipedia-2021_1m-jpn
# TODO: Extended Common Crawl
- id: wmt24-eng-rus
langs: eng-rus
train:
- ParaCrawl-paracrawl-1_bonus-eng-rus
- Statmt-commoncrawl_wmt13-1-rus-eng
- Statmt-news_commentary-18.1-eng-rus
- Statmt-yandex-wmt22-eng-rus
- Statmt-wikititles-3-rus-eng
- OPUS-unpc-v1.0-eng-rus
- Facebook-wikimatrix-1-eng-rus
- Tilde-airbaltic-1-eng-rus
- Tilde-czechtourism-1-eng-rus
- Tilde-worldbank-1-eng-rus
- Statmt-backtrans_ruen-wmt20-rus-eng # russian translated to english
mono_train: &mono_rus
- Statmt-news_crawl-2023-rus
- Statmt-news_commentary-18.1-rus
- Statmt-commoncrawl-wmt22-rus
- Leipzig-news-2022_1m-rus
- Leipzig-newscrawl_public-2018_1m-rus
- Leipzig-web-2017_1m-rus_GE
- Leipzig-wikipedia-2021_1m-rus
- id: wmt24-ces-ukr
langs: ces-ukr
train:
- Facebook-wikimatrix-1-ces-ukr
- ELRC-acts_ukrainian-1-ces-ukr
- OPUS-ccmatrix-v1-ces-ukr
- OPUS-elrc_5179_acts_ukrainian-v1-ces-ukr
- OPUS-elrc_wikipedia_health-v1-ces-ukr
- OPUS-eubookshop-v2-ces-ukr
- OPUS-gnome-v1-ces-ukr
- OPUS-kde4-v2-ces-ukr
- OPUS-multiccaligned-v1.1-ces-ukr
- OPUS-multiparacrawl-v9b-ces-ukr
- OPUS-opensubtitles-v2018-ces-ukr
- OPUS-qed-v2.0a-ces-ukr
- OPUS-ted2020-v1-ces-ukr
- OPUS-tatoeba-v20220303-ces-ukr
- OPUS-ubuntu-v14.10-ces-ukr
- OPUS-xlent-v1.1-ces-ukr
- OPUS-bible_uedin-v1-ces-ukr
- OPUS-wikimedia-v20210402-ces-ukr
mono_train: &mono_ukr
- Statmt-news_crawl-2023-ukr
- LangUk-news-1-ukr
- LangUk-wiki_dump-1-ukr
- LangUk-fiction-1-ukr
- LangUk-ubercorpus-1-ukr
- LangUk-laws-1-ukr
- Leipzig-news-2022_1m-ukr
- Leipzig-newscrawl-2018_1m-ukr
- Leipzig-web-2019_1m-ukr_UA
- Leipzig-wikipedia-2021_1m-ukr
- id: wmt24-eng-ukr
langs: eng-ukr
train: ¶_eng_ukr
- ParaCrawl-paracrawl-1_bonus-eng-ukr
- Tilde-worldbank-1-eng-ukr
- Facebook-wikimatrix-1-eng-ukr
- ELRC-acts_ukrainian-1-eng-ukr
- Statmt-ccaligned-1-eng-ukr_UA
mono_train: *mono_ukr
- id: wmt24-eng-hin
langs: eng-hin
train:
- Statmt-news_commentary-18.1-eng-hin
- Statmt-pmindia-1-eng-hin
- Statmt-ccaligned-1-eng-hin_IN
- JoshuaDec-indian_training-1-hin-eng
- Facebook-wikimatrix-1-eng-hin
- IITB-hien_train-1.5-hin-eng
- Neulab-tedtalks_train-1-eng-hin
- ELRC-wikipedia_health-1-eng-hin
- AI4Bharath-samananthar-0.2-eng-hin
- Anuvaad-internal_judicial_2021-v1-eng-hin
- Anuvaad-legal_terms_2021-v1-eng-hin
- Anuvaad-pib_2017-2020-eng-hin
- Anuvaad-pibarchives_2009-2016-eng-hin
- Anuvaad-wikipedia-20210201-eng-hin
- Anuvaad-drivespark-20210303-eng-hin
- Anuvaad-nativeplanet-20210315-eng-hin
- Anuvaad-catchnews-20210320-eng-hin
- Anuvaad-dwnews_2008-2020-eng-hin
- Anuvaad-oneindia-20210320-eng-hin
- Anuvaad-mk-20210320-eng-hin
- Anuvaad-goodreturns-20210320-eng-hin
- Anuvaad-ie_sports-20210320-eng-hin
- Anuvaad-ie_tech-20210320-eng-hin
- Anuvaad-ie_news-20210320-eng-hin
- Anuvaad-ie_lifestyle-20210320-eng-hin
- Anuvaad-ie_general-20210320-eng-hin
- Anuvaad-ie_entertainment-20210320-eng-hin
- Anuvaad-ie_education-20210320-eng-hin
- Anuvaad-ie_business-20210320-eng-hin
- Anuvaad-fin_express-20210320-eng-hin
- Anuvaad-thewire-20210320-eng-hin
- Anuvaad-tribune-20210320-eng-hin
- Anuvaad-zeebiz-20210320-eng-hin
- Anuvaad-pa_govt-20210320-eng-hin
- Anuvaad-betterindia-20210320-eng-hin
- Anuvaad-jagran_news-20210320-eng-hin
- Anuvaad-jagran_tech-20210320-eng-hin
- Anuvaad-jagran_education-20210320-eng-hin
- Anuvaad-jagran_entertainment-20210320-eng-hin
- Anuvaad-jagran_business-20210320-eng-hin
- Anuvaad-jagran_sports-20210320-eng-hin
- Anuvaad-jagran_lifestyle-20210320-eng-hin
- Anuvaad-asianetnews-20210320-eng-hin
- Anuvaad-business_standard-20210320-eng-hin
- Anuvaad-pranabmukherjee-20210320-eng-hin
- Anuvaad-lokmat_entertainment-20210501-eng-hin
- Anuvaad-lokmat_news-20210501-eng-hin
- Anuvaad-lokmat_lifestyle-20210501-eng-hin
- Anuvaad-lokmat_sports-20210501-eng-hin
- Anuvaad-lokmat_tech-20210501-eng-hin
- Anuvaad-lokmat_financial-20210501-eng-hin
- Anuvaad-lokmat_healthcare-20210501-eng-hin
- AllenAi-nllb-1-eng-hin
- OPUS-elrc_wikipedia_health-v1-eng-hin
- OPUS-elrc_2922-v1-eng-hin
- OPUS-globalvoices-v2018q4-eng-hin
- OPUS-iitb-v2.0-eng-hin
- OPUS-multiccaligned-v1-eng-hin
- OPUS-opensubtitles-v2018-eng-hin
- OPUS-ted2020-v1-eng-hin
- OPUS-tanzil-v1-eng-hin
- OPUS-tatoeba-v20220303-eng-hin
- OPUS-ubuntu-v14.10-eng-hin
- OPUS-xlent-v1.1-eng-hin
- OPUS-tico_19-v20201028-eng-hin
- OPUS-wikimedia-v20210402-eng-hin
mono_train:
- Statmt-news_crawl-2023-hin
- Statmt-news_commentary-18.1-hin
- Leipzig-web-2015_1m-hin_IN
- Leipzig-mixed-2019_1m-hin
- Leipzig-news-2020_1m-hin
- Leipzig-newscrawl-2017_1m-hin
- Leipzig-wikipedia-2021_1m-hin
- id: wmt24-eng-isl
langs: eng-isl
train:
- Statmt-wikititles-3-isl-eng
- Statmt-ccaligned-1-eng-isl_IS
- ParaCrawl-paracrawl-9-eng-isl
- Tilde-eesc-2017-eng-isl
- Tilde-ema-2016-eng-isl
- Tilde-rapid-2016-eng-isl
- Facebook-wikimatrix-1-eng-isl
- ParIce-eea_train-20.05-eng-isl
- ParIce-ema_train-20.05-eng-isl
- EU-ecdc-1-eng-isl
- EU-eac_forms-1-eng-isl
- EU-eac_reference-1-eng-isl
- OPUS-ccmatrix-v1-eng-isl
- OPUS-elrc_2718_emea-v1-eng-isl
- OPUS-elrc_3206_antibiotic-v1-eng-isl
- OPUS-elrc_4295_www.malfong.is-v1-eng-isl
- OPUS-elrc_4324_government_offices_i-v1-eng-isl
- OPUS-elrc_4327_government_offices_i-v1-eng-isl
- OPUS-elrc_4334_rkiskaup_2020-v1-eng-isl
- OPUS-elrc_4338_university_iceland-v1-eng-isl
- OPUS-elrc_502_icelandic_financial_-v1-eng-isl
- OPUS-elrc_504_www.iceida.is-v1-eng-isl
- OPUS-elrc_505_www.pfs.is-v1-eng-isl
- OPUS-elrc_506_www.lanamal.is-v1-eng-isl
- OPUS-elrc_5067_scipar-v1-eng-isl
- OPUS-elrc_508_tilde_statistics_ice-v1-eng-isl
- OPUS-elrc_509_gallery_iceland-v1-eng-isl
- OPUS-elrc_510_harpa_reykjavik_conc-v1-eng-isl
- OPUS-elrc_511_bokmenntaborgin_is-v1-eng-isl
- OPUS-elrc_516_icelandic_medicines-v1-eng-isl
- OPUS-elrc_517_icelandic_directorat-v1-eng-isl
- OPUS-elrc_597_www.nordisketax.net-v1-eng-isl
- OPUS-elrc_718_statistics_iceland-v1-eng-isl
- OPUS-elrc_728_www.norden.org-v1-eng-isl
- OPUS-elrc_emea-v1-eng-isl
- OPUS-elrc_antibiotic-v1-eng-isl
- OPUS-elrc_www.norden.org-v1-eng-isl
- OPUS-elrc_www.nordisketax.net-v1-eng-isl
- OPUS-eubookshop-v2-eng-isl
- OPUS-multiccaligned-v1-eng-isl
- OPUS-multiparacrawl-v7.1-eng-isl
- OPUS-opensubtitles-v2018-eng-isl
- OPUS-ted2020-v1-eng-isl
- OPUS-tatoeba-v20220303-eng-isl
- OPUS-ubuntu-v14.10-eng-isl
- OPUS-wikimatrix-v1-eng-isl
- OPUS-wikititles-v3-eng-isl
- OPUS-xlent-v1.1-eng-isl
- OPUS-wikimedia-v20210402-eng-isl
mono_train:
- Statmt-news_crawl-2023-isl
- Leipzig-web-2020_1m-isl_IS
- Leipzig-web_public-2019_1m-isl_IS
- Leipzig-news-2020_30k-isl
- Leipzig-newscrawl-2019_300k-isl
- Leipzig-wikipedia-2021_100k-isl
- id: wmt24-jpn-zho
langs: jpn-zho
train:
- Statmt-news_commentary-18.1-jpn-zho
- KECL-paracrawl-2-zho-jpn
- KECL-paracrawl-2wmt24-zho-jpn
- Facebook-wikimatrix-1-jpn-zho
- Neulab-tedtalks_train-1-jpn-zho
- LinguaTools-wikititles-2014-jpn-zho
- OPUS-ccmatrix-v1-jpn-zho
- OPUS-gnome-v1-jpn-zho_CN
- OPUS-kde4-v2-jpn-zho_CN
- OPUS-multiccaligned-v1-jpn-zho_CN
- OPUS-openoffice-v3-jpn-zho_CN
- OPUS-opensubtitles-v2018-jpn-zho_CN
- OPUS-php-v1-jpn-zho
- OPUS-qed-v2.0a-jpn-zho
- OPUS-ted2020-v1-jpn-zho
- OPUS-tanzil-v1-jpn-zho
- OPUS-ubuntu-v14.10-jpn-zho
- OPUS-ubuntu-v14.10-jpn-zho_CN
- OPUS-xlent-v1.1-jpn-zho
- OPUS-bible_uedin-v1-jpn-zho
- OPUS-wikimedia-v20210402-jpn-zho
mono_train: *mono_zho
Issues/Bugs
Please report them using GitHub issues at github.com/thammegowda/mtdata .