EMNLP 2024

NINTH CONFERENCE ON
MACHINE TRANSLATION (WMT24)

November 15-16, 2024
Miami, Florida, USA
 
[HOME] [PROGRAM] [PAPERS] [AUTHORS]
TRANSLATION TASKS: [GENERAL MT (NEWS)] [LOW-RESOURCE LANGUAGES OF SPAIN] [INDIC MT] [CHAT TASK] [BIOMEDICAL] [MULTIINDIC22MT TASK] [ENGLISH-TO-LOWRES MULTIMODAL MT TASK] [NON-REPETITIVE] [PATENT] [LITERARY]
EVALUATION TASKS: [METRICS TASK] [MT TEST SUITES] [QUALITY ESTIMATION]
OTHER TASKS: [OPEN LANGUAGE DATA INITIATIVE]

This document provides a list of WMT24 General MT task datasets for constrained track and instructions for downloading them using mtdata.

Parallel Training Data:

File EN-CS EN-DE EN-ES EN-HI EN-IS EN-JA EN-RU EN-ZH EN-UK CS-UK JA-ZH Notes

Europarl v10

EN-ES is v7 from wmt13

ParaCrawl v9

Note that only the ticked language pairs are available for constrained participants, but the metadata (tmx files) may be used. JParacrawl is here

Common Crawl corpus

Same as last year. The fr-de version is here

News Commentary v18.1

Wiki Titles v3

Linguuatools Wiki Titles

UN Parallel Corpus V1.0

Register and download

Tilde MODEL corpus

de-en and cs-en contain document information.

WikiMatrix

We release the official version, with added language identification (from cld2).

Back-translated news

Back-translated news. The cs-en data is contained in CzEng. The zh-en and ru-en data was produced for the University of Edinburgh systems in 2017 and 2018.

OPUS

Japanese-English Subtitle Corpus

Note: English side is lowercased.

The Kyoto Free Translation Task Corpus

TED Talks

From IWSLT 2017 Evaluation Campaign.

Neulab TED Talks

ELRC - EU acts in Ukrainian

CzEng 2.0

Register and download CzEng2.0. The new CzEng includes synthetic data, and includes all cs-en data supplied for the task. See the CzEng README for more details.

Yandex Corpus

CC Aligned

Others

EN-HI: Anuuvaad, IITB, NLLB;; For EN-IS: ParIce

Monolingual Training Data:

Corpus CS DE EN ES HI IS JA RU ZH UK Notes

News crawl

Large corpora of crawled news, collected since 2007. For de, cs, and en versions are available with document boundaries, and without sentence-splitting.

News discussions

Corpora crawled from comment sections of online newspapers (no longer updated).

Europarl v10

Monolingual version of European parliament crawl. Superset of the parallel version.

News Commentary

Updated Monolingual text from news-commentary crawl. Superset of parallel version. Use the latest version.

Common Crawl

Deduplicated with development and evaluation sentences removed. English was updated 31 January 2016 to remove bad UTF-8. Downloads can be verified with SHA512 checksums. More English is available.

Extended Common Crawl

Extended Common Crawl extracted from crawls up to April 2020.

Leipzig Corpora

Leipzig Corpora Collection: From 100 to 200 Languages PDF

UberText Corpus

Text crawled from Ukrainian periodicals

Legal Ukrainian

Legal Ukrainian: 69M token corpus in the legal sector; crawled from websites belonging to legislation, government, court, and parliament

MTData

Setup

pip install -I mtdata==0.4.2

Recipes Config File

Config file for CONSTRAINED track (missing datasets behing registration: CzEng2.0, and CCMT):

wget https://www.statmt.org/wmt24/mtdata/mtdata.recipes.wmt24-constrained.yml

See for dataset IDs selected for constrained eval. By default, mtdata loads mtdata.recipes*.yml glob in the current directory (where mtdata commands are invoked). To place recipe YAML files in a different directory, export MTDATA_RECIPES=/path/to/recipesdir.

List All Recipes

$ mtdata list-recipe -id | grep wmt24
wmt24-eng-ces
wmt24-eng-zho
wmt24-eng-deu
wmt24-eng-spa
wmt24-eng-jpn
wmt24-eng-rus
wmt24-ces-ukr
wmt24-eng-ukr
wmt24-eng-hin
wmt24-eng-isl
wmt24-jpn-zho

wmt24* ids are all loaded from mtdata.recipes.wmt24*.yml file.

Download Recipes

Download a Recipe
# example:
mtdata get-recipe -ri wmt24-eng-ces -o wmt24-eng-ces # -h|--help for help
Download All Recipes
# Optional: download and cache all datasets in parallel
mtdata -no-pb cache -j 8 -ri "wmt24-*"
for id in wmt24-eng-{ces,deu,spa,hin,isl,jpn,rus,ukr,zho} wmt24-{ces-ukr,jpn-zho}; do
  mtdata get-recipe -i $id -o $id --compress --no-merge -j8
done
mtdata stores its cache under $HOME/.mtdata by default. To change, export MTDATA=/path/to/cache
mtdata uses pigz (if available in PATH) for faster compression/decompression. This feature is enabled by default in v0.4.1; to disable, export USE_PIGZ=0

Constrained Task Datasets

The selected dataset IDs for constrained task are as follows:

# Setup: pip install mtdata==0.4.0
# To list/view all available datasets:
#   mtdata list -id -l <lang1>-<lang2>   # parallel
#   mtdata list -id -l <lang>            # monolingual
# To get a dataset
#   mtdata echo <data_id>

- id: wmt24-eng-ces
  langs: eng-ces
  train:
    - Statmt-europarl-10-ces-eng
    - ParaCrawl-paracrawl-9-eng-ces
    - Statmt-commoncrawl_wmt13-1-ces-eng
    - Statmt-news_commentary-18.1-ces-eng
    - Statmt-wikititles-3-ces-eng
    - Facebook-wikimatrix-1-ces-eng
    - Tilde-eesc-2017-ces-eng
    - Tilde-ema-2016-ces-eng
    - Tilde-ecb-2017-ces-eng
    - Tilde-rapid-2019-ces-eng
    # TODO: manually download bracktranslated news and CzEng2.0
  mono_train:
    - Statmt-news_crawl-2023-ces
    - Statmt-europarl-10-ces
    - Statmt-news_commentary-18.1-ces
    - Statmt-commoncrawl-wmt22-ces
    - Leipzig-news-2022_1m-ces
    - Leipzig-newscrawl-2019_1m-ces
    - Leipzig-wikipedia-2021_1m-ces
    - Leipzig-web_public-2019_1m-ces_CZ
    # TODO: extended common crawl (too big) https://data.statmt.org/wmt21/translation-task/cc-mono/

- id: wmt24-eng-zho
  langs: eng-zho
  train:
    - ParaCrawl-paracrawl-1_bonus-eng-zho
    - Statmt-news_commentary-18.1-eng-zho
    - Statmt-wikititles-3-zho-eng
    - OPUS-unpc-v1.0-eng-zho
    - Facebook-wikimatrix-1-eng-zho
    - Statmt-backtrans_enzh-wmt20-eng-zho   # English translated to Chinese. Double check if you want this
  mono_train: &mono_zho
    - Statmt-news_crawl-2023-zho
    - Statmt-news_commentary-18.1-zho
    - Statmt-commoncrawl-wmt22-zho
    - Leipzig-wikipedia-2018_1m-zho
    - Leipzig-web-2016_1m-zho_MO
    - Leipzig-tradnewscrawl-2011_1m-zho
    - Leipzig-news-2020_300k-zho
    # TODO: extended common crawl (too big) https://data.statmt.org/wmt21/translation-task/cc-mono/

- id: wmt24-eng-deu
  langs: eng-deu
  train:
    - Statmt-europarl-10-deu-eng
    - ParaCrawl-paracrawl-9-eng-deu
    - Statmt-commoncrawl_wmt13-1-deu-eng
    - Statmt-news_commentary-18.1-deu-eng
    - Statmt-wikititles-3-deu-eng
    - Facebook-wikimatrix-1-deu-eng
    - Tilde-eesc-2017-deu-eng
    - Tilde-ema-2016-deu-eng
    - Tilde-airbaltic-1-deu-eng
    - Tilde-czechtourism-1-deu-eng
    - Tilde-ecb-2017-deu-eng
    - Tilde-rapid-2016-deu-eng
    - Tilde-rapid-2019-deu-eng
  mono_train:
    - Statmt-news_crawl-2023-deu
    - Statmt-europarl-10-deu
    - Statmt-news_commentary-18.1-deu
    - Statmt-commoncrawl-wmt22-deu
    - Leipzig-wikipedia-2021_1m-deu
    - Leipzig-comweb-2021_1m-deu
    - Leipzig-mixed_typical-2011_1m-deu
    - Leipzig-news-2022_30k-deu
    - Leipzig-newscrawl-2020_1m-deu
    - Leipzig-web-2021_100k-deu_DE
    # TODO: extended common crawl
- id: wmt24-eng-spa
  langs: eng-spa
  train:
    - Statmt-commoncrawl_wmt13-1-spa-eng
    - Statmt-europarl-7-spa-eng
    - Statmt-news_commentary-18.1-eng-spa
    - Statmt-ccaligned-1-eng-spa
    - ParaCrawl-paracrawl-9-eng-spa
    - Tilde-eesc-2017-eng-spa
    - Tilde-ema-2016-eng-spa
    - Tilde-czechtourism-1-eng-spa
    - Tilde-ecb-2017-eng-spa
    - Tilde-rapid-2016-eng-spa
    - Tilde-worldbank-1-eng-spa
    - Facebook-wikimatrix-1-eng-spa
    - EU-ecdc-1-eng-spa
    - EU-eac_forms-1-eng-spa
    - EU-eac_reference-1-eng-spa
    - EU-dcep-1-eng-spa
    - LinguaTools-wikititles-2014-eng-spa
    - Neulab-tedtalks_train-1-eng-spa
    - OPUS-books-v1-eng-spa
    - OPUS-dgt-v2019-eng-spa
    - OPUS-dgt-v4-eng-spa
    - OPUS-ecb-v1-eng-spa
    - OPUS-elitr_eca-v1-eng-spa
    - OPUS-elra_w0147-v1-eng-spa
    - OPUS-elra_w0305-v1-eng-spa
    - OPUS-elrc_1076_euipo_law-v1-eng-spa
    - OPUS-elrc_1082_cnio-v1-eng-spa
    - OPUS-elrc_1083_aecosan-v1-eng-spa
    - OPUS-elrc_1084_agencia_tributaria-v1-eng-spa
    - OPUS-elrc_1096_euipo_list-v1-eng-spa
    - OPUS-elrc_1125_cordis_news-v1-eng-spa
    - OPUS-elrc_1126_cordis_results_brief-v1-eng-spa
    - OPUS-elrc_2015_euipo_2017-v1-eng-spa
    - OPUS-elrc_2410_portal_oficial_turis-v1-eng-spa
    - OPUS-elrc_2478_glossrio_pt_en-v1-eng-spa
    - OPUS-elrc_2479_lei_orgnica_2-v1-eng-spa
    - OPUS-elrc_2480_estatuto_dos_deputad-v1-eng-spa
    - OPUS-elrc_2481_constituio_da_repbli-v1-eng-spa
    - OPUS-elrc_2498_plan_nacional_e-v1-eng-spa
    - OPUS-elrc_2502_termitur-v1-eng-spa
    - OPUS-elrc_2503_descripciones_vulner-v1-eng-spa
    - OPUS-elrc_2536_estatuto_da_vtima-v1-eng-spa
    - OPUS-elrc_2538_lei_25_2009-v1-eng-spa
    - OPUS-elrc_2543_inteliterm-v1-eng-spa
    - OPUS-elrc_2558_government_websites_-v1-eng-spa
    - OPUS-elrc_2612_artigos_visitportuga-v1-eng-spa
    - OPUS-elrc_2614_localidades_2007-v1-eng-spa
    - OPUS-elrc_2616_museus_2007-v1-eng-spa
    - OPUS-elrc_2622_arquitectura_2007-v1-eng-spa
    - OPUS-elrc_2623_patrimnio_aores_2006-v1-eng-spa
    - OPUS-elrc_2638_monumentos_2007-v1-eng-spa
    - OPUS-elrc_2639_parques_e_reservas-v1-eng-spa
    - OPUS-elrc_2641_praias_2007-v1-eng-spa
    - OPUS-elrc_2722_emea-v1-eng-spa
    - OPUS-elrc_2738_vaccination-v1-eng-spa
    - OPUS-elrc_2881_eu_publications_medi-v1-eng-spa
    - OPUS-elrc_3077_wikipedia_health-v1-eng-spa
    - OPUS-elrc_3210_antibiotic-v1-eng-spa
    - OPUS-elrc_3299_europarl_covid-v1-eng-spa
    - OPUS-elrc_3470_ec_europa_covid-v1-eng-spa
    - OPUS-elrc_3571_eur_lex_covid-v1-eng-spa
    - OPUS-elrc_3612_presscorner_covid-v1-eng-spa
    - OPUS-elrc_3852_development_funds_re-v1-eng-spa
    - OPUS-elrc_401_swedish_labour_part2-v1-eng-spa
    - OPUS-elrc_406_swedish_labour_part1-v1-eng-spa
    - OPUS-elrc_416_swedish_social_secur-v1-eng-spa
    - OPUS-elrc_417_swedish_work_environ-v1-eng-spa
    - OPUS-elrc_436_swedish_food-v1-eng-spa
    - OPUS-elrc_4992_customer_support_mt-v1-eng-spa
    - OPUS-elrc_5067_scipar-v1-eng-spa
    - OPUS-elrc_5190_cyber_mt_test-v1-eng-spa
    - OPUS-elrc_637_sip-v1-eng-spa
    - OPUS-elrc_832_charter_values_citiz-v1-eng-spa
    - OPUS-elrc_844_paesi__administratio-v1-eng-spa
    - OPUS-elrc_863_government_websites_-v1-eng-spa
    - OPUS-elrc_arquitectura_2007-v1-eng-spa
    - OPUS-elrc_artigos_visitportuga-v1-eng-spa
    - OPUS-elrc_cordis_news-v1-eng-spa
    - OPUS-elrc_cordis_results-v1-eng-spa
    - OPUS-elrc_ec_europa-v1-eng-spa
    - OPUS-elrc_emea-v1-eng-spa
    - OPUS-elrc_euipo_2017-v1-eng-spa
    - OPUS-elrc_euipo_law-v1-eng-spa
    - OPUS-elrc_euipo_list-v1-eng-spa
    - OPUS-elrc_europarl_covid-v1-eng-spa
    - OPUS-elrc_eur_lex-v1-eng-spa
    - OPUS-elrc_eu_publications-v1-eng-spa
    - OPUS-elrc_localidades_2007-v1-eng-spa
    - OPUS-elrc_museus_2007-v1-eng-spa
    - OPUS-elrc_parques_e-v1-eng-spa
    - OPUS-elrc_patrimnio_aores-v1-eng-spa
    - OPUS-elrc_praias_2007-v1-eng-spa
    - OPUS-elrc_swedish_labour-v1-eng-spa
    - OPUS-elrc_termitur-v1-eng-spa
    - OPUS-elrc_antibiotic-v1-eng-spa
    - OPUS-elrc_government_websites-v1-eng-spa
    - OPUS-elrc_presscorner_covid-v1-eng-spa
    - OPUS-elrc_vaccination-v1-eng-spa
    - OPUS-elrc_wikipedia_health-v1-eng-spa
    - OPUS-elrc_2682-v1-eng-spa
    - OPUS-elrc_2922-v1-eng-spa
    - OPUS-elrc_2923-v1-eng-spa
    - OPUS-elrc_3382-v1-eng-spa
    - OPUS-emea-v3-eng-spa
    - OPUS-eubookshop-v2-eng-spa
    - OPUS-euconst-v1-eng-spa
    - OPUS-europat-v3-eng-spa
    - OPUS-europarl-v8-eng-spa
    - OPUS-globalvoices-v2018q4-eng-spa
    - OPUS-multiccaligned-v1-eng-spa
    - OPUS-multiun-v1-eng-spa
    - OPUS-unpc-v1.0-eng-spa
    - OPUS-wikimatrix-v1-eng-spa
    - OPUS-wikipedia-v1.0-eng-spa
    - OPUS-xlent-v1.1-eng-spa
    - OPUS-infopankki-v1-eng-spa
    - OPUS-tico_19-v20201028-eng-spa
    - OPUS-wikimedia-v20210402-eng-spa
  mono_train:
    - Statmt-europarl-10-spa
    - Statmt-news_commentary-18.1-spa
    - Statmt-news_crawl-2023-spa
    - Leipzig-news-2022_1m-spa
    - Leipzig-newscrawl_public-2019_1m-spa
    - Leipzig-web-2016_1m-spa
    - Leipzig-web-2016_1m-spa_AR
    - Leipzig-web-2016_1m-spa_PA
    - Leipzig-web-2016_1m-spa_PE
    - Leipzig-web-2016_1m-spa_PY
    - Leipzig-web-2016_1m-spa_SV
    - Leipzig-web-2016_1m-spa_UY
    - Leipzig-web-2016_1m-spa_VE
    - Leipzig-wikipedia-2021_1m-spa


- id: wmt24-eng-jpn
  langs: eng-jpn
  train:
    - Statmt-news_commentary-18.1-eng-jpn
    - KECL-paracrawl-3-eng-jpn
    - Statmt-wikititles-3-jpn-eng
    - Facebook-wikimatrix-1-eng-jpn
    - Statmt-ted-wmt20-eng-jpn
    - StanfordNLP-jesc_train-1-eng-jpn
    - Phontron-kftt_train-1-eng-jpn
  mono_train: &mono_jpn
    - Statmt-news_crawl-2023-jpn
    - Statmt-news_commentary-18.1-jpn
    - Statmt-commoncrawl-wmt22-jpn
    - Leipzig-web-2020_1m-jpn_JP
    - Leipzig-comweb-2018_1m-jpn
    - Leipzig-web_public-2019_1m-jpn_JP
    - Leipzig-news-2020_100k-jpn
    - Leipzig-newscrawl-2019_1m-jpn
    - Leipzig-wikipedia-2021_1m-jpn
    # TODO: Extended Common Crawl

- id: wmt24-eng-rus
  langs: eng-rus
  train:
    - ParaCrawl-paracrawl-1_bonus-eng-rus
    - Statmt-commoncrawl_wmt13-1-rus-eng
    - Statmt-news_commentary-18.1-eng-rus
    - Statmt-yandex-wmt22-eng-rus
    - Statmt-wikititles-3-rus-eng
    - OPUS-unpc-v1.0-eng-rus
    - Facebook-wikimatrix-1-eng-rus
    - Tilde-airbaltic-1-eng-rus
    - Tilde-czechtourism-1-eng-rus
    - Tilde-worldbank-1-eng-rus
    - Statmt-backtrans_ruen-wmt20-rus-eng   # russian translated to english
  mono_train: &mono_rus
    - Statmt-news_crawl-2023-rus
    - Statmt-news_commentary-18.1-rus
    - Statmt-commoncrawl-wmt22-rus
    - Leipzig-news-2022_1m-rus
    - Leipzig-newscrawl_public-2018_1m-rus
    - Leipzig-web-2017_1m-rus_GE
    - Leipzig-wikipedia-2021_1m-rus

- id: wmt24-ces-ukr
  langs: ces-ukr
  train:
    - Facebook-wikimatrix-1-ces-ukr
    - ELRC-acts_ukrainian-1-ces-ukr
    - OPUS-ccmatrix-v1-ces-ukr
    - OPUS-elrc_5179_acts_ukrainian-v1-ces-ukr
    - OPUS-elrc_wikipedia_health-v1-ces-ukr
    - OPUS-eubookshop-v2-ces-ukr
    - OPUS-gnome-v1-ces-ukr
    - OPUS-kde4-v2-ces-ukr
    - OPUS-multiccaligned-v1.1-ces-ukr
    - OPUS-multiparacrawl-v9b-ces-ukr
    - OPUS-opensubtitles-v2018-ces-ukr
    - OPUS-qed-v2.0a-ces-ukr
    - OPUS-ted2020-v1-ces-ukr
    - OPUS-tatoeba-v20220303-ces-ukr
    - OPUS-ubuntu-v14.10-ces-ukr
    - OPUS-xlent-v1.1-ces-ukr
    - OPUS-bible_uedin-v1-ces-ukr
    - OPUS-wikimedia-v20210402-ces-ukr
  mono_train: &mono_ukr
    - Statmt-news_crawl-2023-ukr
    - LangUk-news-1-ukr
    - LangUk-wiki_dump-1-ukr
    - LangUk-fiction-1-ukr
    - LangUk-ubercorpus-1-ukr
    - LangUk-laws-1-ukr
    - Leipzig-news-2022_1m-ukr
    - Leipzig-newscrawl-2018_1m-ukr
    - Leipzig-web-2019_1m-ukr_UA
    - Leipzig-wikipedia-2021_1m-ukr

- id: wmt24-eng-ukr
  langs: eng-ukr
  train: &para_eng_ukr
      - ParaCrawl-paracrawl-1_bonus-eng-ukr
      - Tilde-worldbank-1-eng-ukr
      - Facebook-wikimatrix-1-eng-ukr
      - ELRC-acts_ukrainian-1-eng-ukr
      - Statmt-ccaligned-1-eng-ukr_UA

  mono_train: *mono_ukr

- id: wmt24-eng-hin
  langs: eng-hin
  train:
    - Statmt-news_commentary-18.1-eng-hin
    - Statmt-pmindia-1-eng-hin
    - Statmt-ccaligned-1-eng-hin_IN
    - JoshuaDec-indian_training-1-hin-eng
    - Facebook-wikimatrix-1-eng-hin
    - IITB-hien_train-1.5-hin-eng
    - Neulab-tedtalks_train-1-eng-hin
    - ELRC-wikipedia_health-1-eng-hin
    - AI4Bharath-samananthar-0.2-eng-hin
    - Anuvaad-internal_judicial_2021-v1-eng-hin
    - Anuvaad-legal_terms_2021-v1-eng-hin
    - Anuvaad-pib_2017-2020-eng-hin
    - Anuvaad-pibarchives_2009-2016-eng-hin
    - Anuvaad-wikipedia-20210201-eng-hin
    - Anuvaad-drivespark-20210303-eng-hin
    - Anuvaad-nativeplanet-20210315-eng-hin
    - Anuvaad-catchnews-20210320-eng-hin
    - Anuvaad-dwnews_2008-2020-eng-hin
    - Anuvaad-oneindia-20210320-eng-hin
    - Anuvaad-mk-20210320-eng-hin
    - Anuvaad-goodreturns-20210320-eng-hin
    - Anuvaad-ie_sports-20210320-eng-hin
    - Anuvaad-ie_tech-20210320-eng-hin
    - Anuvaad-ie_news-20210320-eng-hin
    - Anuvaad-ie_lifestyle-20210320-eng-hin
    - Anuvaad-ie_general-20210320-eng-hin
    - Anuvaad-ie_entertainment-20210320-eng-hin
    - Anuvaad-ie_education-20210320-eng-hin
    - Anuvaad-ie_business-20210320-eng-hin
    - Anuvaad-fin_express-20210320-eng-hin
    - Anuvaad-thewire-20210320-eng-hin
    - Anuvaad-tribune-20210320-eng-hin
    - Anuvaad-zeebiz-20210320-eng-hin
    - Anuvaad-pa_govt-20210320-eng-hin
    - Anuvaad-betterindia-20210320-eng-hin
    - Anuvaad-jagran_news-20210320-eng-hin
    - Anuvaad-jagran_tech-20210320-eng-hin
    - Anuvaad-jagran_education-20210320-eng-hin
    - Anuvaad-jagran_entertainment-20210320-eng-hin
    - Anuvaad-jagran_business-20210320-eng-hin
    - Anuvaad-jagran_sports-20210320-eng-hin
    - Anuvaad-jagran_lifestyle-20210320-eng-hin
    - Anuvaad-asianetnews-20210320-eng-hin
    - Anuvaad-business_standard-20210320-eng-hin
    - Anuvaad-pranabmukherjee-20210320-eng-hin
    - Anuvaad-lokmat_entertainment-20210501-eng-hin
    - Anuvaad-lokmat_news-20210501-eng-hin
    - Anuvaad-lokmat_lifestyle-20210501-eng-hin
    - Anuvaad-lokmat_sports-20210501-eng-hin
    - Anuvaad-lokmat_tech-20210501-eng-hin
    - Anuvaad-lokmat_financial-20210501-eng-hin
    - Anuvaad-lokmat_healthcare-20210501-eng-hin
    - AllenAi-nllb-1-eng-hin
    - OPUS-elrc_wikipedia_health-v1-eng-hin
    - OPUS-elrc_2922-v1-eng-hin
    - OPUS-globalvoices-v2018q4-eng-hin
    - OPUS-iitb-v2.0-eng-hin
    - OPUS-multiccaligned-v1-eng-hin
    - OPUS-opensubtitles-v2018-eng-hin
    - OPUS-ted2020-v1-eng-hin
    - OPUS-tanzil-v1-eng-hin
    - OPUS-tatoeba-v20220303-eng-hin
    - OPUS-ubuntu-v14.10-eng-hin
    - OPUS-xlent-v1.1-eng-hin
    - OPUS-tico_19-v20201028-eng-hin
    - OPUS-wikimedia-v20210402-eng-hin

  mono_train:
    - Statmt-news_crawl-2023-hin
    - Statmt-news_commentary-18.1-hin
    - Leipzig-web-2015_1m-hin_IN
    - Leipzig-mixed-2019_1m-hin
    - Leipzig-news-2020_1m-hin
    - Leipzig-newscrawl-2017_1m-hin
    - Leipzig-wikipedia-2021_1m-hin

- id: wmt24-eng-isl
  langs: eng-isl
  train:
    - Statmt-wikititles-3-isl-eng
    - Statmt-ccaligned-1-eng-isl_IS
    - ParaCrawl-paracrawl-9-eng-isl
    - Tilde-eesc-2017-eng-isl
    - Tilde-ema-2016-eng-isl
    - Tilde-rapid-2016-eng-isl
    - Facebook-wikimatrix-1-eng-isl
    - ParIce-eea_train-20.05-eng-isl
    - ParIce-ema_train-20.05-eng-isl
    - EU-ecdc-1-eng-isl
    - EU-eac_forms-1-eng-isl
    - EU-eac_reference-1-eng-isl
    - OPUS-ccmatrix-v1-eng-isl
    - OPUS-elrc_2718_emea-v1-eng-isl
    - OPUS-elrc_3206_antibiotic-v1-eng-isl
    - OPUS-elrc_4295_www.malfong.is-v1-eng-isl
    - OPUS-elrc_4324_government_offices_i-v1-eng-isl
    - OPUS-elrc_4327_government_offices_i-v1-eng-isl
    - OPUS-elrc_4334_rkiskaup_2020-v1-eng-isl
    - OPUS-elrc_4338_university_iceland-v1-eng-isl
    - OPUS-elrc_502_icelandic_financial_-v1-eng-isl
    - OPUS-elrc_504_www.iceida.is-v1-eng-isl
    - OPUS-elrc_505_www.pfs.is-v1-eng-isl
    - OPUS-elrc_506_www.lanamal.is-v1-eng-isl
    - OPUS-elrc_5067_scipar-v1-eng-isl
    - OPUS-elrc_508_tilde_statistics_ice-v1-eng-isl
    - OPUS-elrc_509_gallery_iceland-v1-eng-isl
    - OPUS-elrc_510_harpa_reykjavik_conc-v1-eng-isl
    - OPUS-elrc_511_bokmenntaborgin_is-v1-eng-isl
    - OPUS-elrc_516_icelandic_medicines-v1-eng-isl
    - OPUS-elrc_517_icelandic_directorat-v1-eng-isl
    - OPUS-elrc_597_www.nordisketax.net-v1-eng-isl
    - OPUS-elrc_718_statistics_iceland-v1-eng-isl
    - OPUS-elrc_728_www.norden.org-v1-eng-isl
    - OPUS-elrc_emea-v1-eng-isl
    - OPUS-elrc_antibiotic-v1-eng-isl
    - OPUS-elrc_www.norden.org-v1-eng-isl
    - OPUS-elrc_www.nordisketax.net-v1-eng-isl
    - OPUS-eubookshop-v2-eng-isl
    - OPUS-multiccaligned-v1-eng-isl
    - OPUS-multiparacrawl-v7.1-eng-isl
    - OPUS-opensubtitles-v2018-eng-isl
    - OPUS-ted2020-v1-eng-isl
    - OPUS-tatoeba-v20220303-eng-isl
    - OPUS-ubuntu-v14.10-eng-isl
    - OPUS-wikimatrix-v1-eng-isl
    - OPUS-wikititles-v3-eng-isl
    - OPUS-xlent-v1.1-eng-isl
    - OPUS-wikimedia-v20210402-eng-isl
  mono_train:
    - Statmt-news_crawl-2023-isl
    - Leipzig-web-2020_1m-isl_IS
    - Leipzig-web_public-2019_1m-isl_IS
    - Leipzig-news-2020_30k-isl
    - Leipzig-newscrawl-2019_300k-isl
    - Leipzig-wikipedia-2021_100k-isl

- id: wmt24-jpn-zho
  langs: jpn-zho
  train:
    - Statmt-news_commentary-18.1-jpn-zho
    - KECL-paracrawl-2-zho-jpn
    - KECL-paracrawl-2wmt24-zho-jpn
    - Facebook-wikimatrix-1-jpn-zho
    - Neulab-tedtalks_train-1-jpn-zho
    - LinguaTools-wikititles-2014-jpn-zho
    - OPUS-ccmatrix-v1-jpn-zho
    - OPUS-gnome-v1-jpn-zho_CN
    - OPUS-kde4-v2-jpn-zho_CN
    - OPUS-multiccaligned-v1-jpn-zho_CN
    - OPUS-openoffice-v3-jpn-zho_CN
    - OPUS-opensubtitles-v2018-jpn-zho_CN
    - OPUS-php-v1-jpn-zho
    - OPUS-qed-v2.0a-jpn-zho
    - OPUS-ted2020-v1-jpn-zho
    - OPUS-tanzil-v1-jpn-zho
    - OPUS-ubuntu-v14.10-jpn-zho
    - OPUS-ubuntu-v14.10-jpn-zho_CN
    - OPUS-xlent-v1.1-jpn-zho
    - OPUS-bible_uedin-v1-jpn-zho
    - OPUS-wikimedia-v20210402-jpn-zho

  mono_train: *mono_zho

Issues/Bugs

Please report them using GitHub issues at github.com/thammegowda/mtdata .