This document provides instructions for downloading the WMT23 General MT task datasets for the constrained track using mtdata.

Setup

pip install -I mtdata==0.4.0
# pip install -I https://github.com/thammegowda/mtdata/archive/develop.zip  # Install from develop branch

Recipes Config File

Config file for the CONSTRAINED track (missing the datasets behind registration: CzEng2.0 and CCMT):

wget https://www.statmt.org/wmt23/mtdata/mtdata.recipes.wmt23-constrained.yml

See this file for the dataset IDs selected for the constrained evaluation. By default, mtdata loads files matching the glob mtdata.recipes*.yml from the current directory (where mtdata commands are invoked). To place recipe YAML files in a different directory, export MTDATA_RECIPES=/path/to/recipesdir.
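
As a minimal sketch of the MTDATA_RECIPES option, the following moves any recipe files found in the current directory into a separate directory and points mtdata at it. The directory name "recipes" is an arbitrary choice for illustration, not something mtdata requires.

```shell
# Hypothetical directory for recipe files; any path works.
recipes_dir="$PWD/recipes"
mkdir -p "$recipes_dir"

# Move recipe files (if any) out of the current directory.
for f in mtdata.recipes*.yml; do
  if [ -e "$f" ]; then
    mv "$f" "$recipes_dir/"
  fi
done

# mtdata reads this variable to locate recipe YAML files.
export MTDATA_RECIPES="$recipes_dir"
```

The export only affects the current shell session; add it to a shell profile if it should persist.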

List All Recipes

$ mtdata list-recipe -id | grep wmt23
wmt23-encs
wmt23-zhen
wmt23-enzh
wmt23-deen
wmt23-ende
wmt23-heen
wmt23-enhe
wmt23-jaen
wmt23-enjp
wmt23-ruen
wmt23-enru
wmt23-csuk
wmt23-enuk
wmt23-uken

All wmt23* IDs are loaded from the mtdata.recipes.wmt23*.yml file.
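
To check which recipe IDs a YAML file defines without invoking mtdata, the top-level "- id:" entries can be pulled out with standard tools. This is a rough sketch that assumes each recipe starts with an unindented "- id: NAME" line, as in the constrained recipes file; the function name recipe_ids is made up for illustration.

```shell
# Print the recipe IDs defined in a recipes YAML file, one per line.
# Assumes each recipe begins with an unindented "- id: NAME" line.
recipe_ids() {
  sed -n 's/^- id:[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$1"
}
```

For example, recipe_ids mtdata.recipes.wmt23-constrained.yml should print the same IDs that mtdata list-recipe reports for that file.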

Download Recipes

Download a Recipe

# example: wmt23-encs
mtdata get-recipe -ri wmt23-encs -o wmt23-encs

Download All Recipes

for ri in wmt23-{enzh,zhen,ende,deen,enhe,heen,enjp,jaen,enru,ruen,encs,csuk,enuk,uken}; do
  mtdata get-recipe -ri $ri -o $ri
done

Limitations:
  1. Two datasets listed on the WMT23 page, CzEng2.0 and CCMT, require login and will not be downloaded by this tool.
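
When downloading many recipes in one go, it can help to record which ones failed instead of stopping at the first error. The sketch below assumes that mtdata exits with a non-zero status when a download fails; the -f flag is passed to request failing on errors. The function name download_recipes is hypothetical.

```shell
# Download each recipe given as an argument and collect the IDs that failed.
# Assumes mtdata returns a non-zero exit status on failure (requested via -f).
download_recipes() {
  failed=""
  for ri in "$@"; do
    if ! mtdata get-recipe -ri "$ri" -o "$ri" -f; then
      failed="$failed $ri"
    fi
  done
  if [ -n "$failed" ]; then
    echo "Failed recipes:$failed" >&2
    return 1
  fi
}
```

Usage: download_recipes wmt23-encs wmt23-deen. Recipes reported as failed can be retried by passing only their IDs.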

Usage: mtdata get-recipe

$ mtdata get-recipe -h
usage: mtdata get-recipe [-h] -ri RECIPE_ID [-f] [-j N_JOBS] [--merge | --no-merge] [--compress] [-dd] [-dt] -o OUT_DIR

optional arguments:
  -h, --help            show this help message and exit
  -ri RECIPE_ID, --recipe-id RECIPE_ID
                        Recipe ID (default: None)
  -f, --fail-on-error   Fail on error (default: False)
  -j N_JOBS, --n-jobs N_JOBS
                        Number of worker jobs (processes) (default: 1)
  --merge               Merge train into a single file (default: True)
  --no-merge            Do not Merge train into a single file (default: False)
  --compress            Keep the files compressed (default: False)
  -dd, --dedupe, --drop-dupes
                        Remove duplicate (src, tgt) pairs in training (if any); valid when --merge. Not recommended for large datasets. (default: False)
  -dt, --drop-tests     Remove dev/test sentences from training sets (if any); valid when --merge (default: False)
  -o OUT_DIR, --out OUT_DIR
                        Output directory name (default: None)
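
The options above can be combined in a single call. The following sketch wraps one such combination in a small helper: parallel worker jobs, merged training data with dev/test sentences dropped, and compressed output. The helper name get_recipe_clean and the default of 4 jobs are illustrative choices, not mtdata defaults.

```shell
# Fetch one recipe with several worker jobs, dropping dev/test sentences
# from the merged training data and keeping output files compressed.
# Arguments: recipe ID, number of worker jobs (defaults to 4 here).
get_recipe_clean() {
  mtdata get-recipe -ri "$1" -o "$1" -j "${2:-4}" --merge -dt --compress
}
```

Usage: get_recipe_clean wmt23-deen 8. According to the help text, -dt removes dev/test sentences "if any", so it presumably has no effect on recipes that define neither dev nor test sets.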

Constrained Task Datasets

The selected dataset IDs for constrained task are as follows:

# pip install mtdata==0.4.0

# To list/view all available datasets:
#   mtdata list -id -l <lang1>-<lang2>   # parallel
#   mtdata list -id -l <lang>            # monolingual

- id: wmt23-encs
  langs: eng-ces
  #dev:
  #test:
  train:
    - Statmt-europarl-10-ces-eng
    - ParaCrawl-paracrawl-9-eng-ces
    - Statmt-commoncrawl_wmt13-1-ces-eng
    - Statmt-news_commentary-16-ces-eng
    - Statmt-wikititles-3-ces-eng
    - Facebook-wikimatrix-1-ces-eng
    - Tilde-eesc-2017-ces-eng
    - Tilde-ema-2016-ces-eng
    - Tilde-ecb-2017-ces-eng
    - Tilde-rapid-2019-ces-eng
    # TODO: CzEng2.0 and backtranslated news require login
  mono_train:
    - &mono_ces
        - Statmt-news_crawl-2021-ces
        - Statmt-europarl-10-ces
        - Statmt-news_commentary-17-ces
        - Statmt-commoncrawl-wmt22-ces
        # TODO extended common crawl
        - Leipzig-news-2022_1m-ces
        - Leipzig-newscrawl-2019_1m-ces
        - Leipzig-wikipedia-2021_1m-ces
        - Leipzig-web_public-2019_1m-ces_CZ
    - &mono_eng
        - Statmt-news_crawl-2021-eng
        - Statmt-news_discussions-2019-eng
        - Statmt-news_commentary-17-eng
        - Statmt-commoncrawl-wmt22-eng
        - Statmt-europarl-10-eng
        - Leipzig-news-2020_1m-eng
        - Leipzig-ukweb_public-2018_1m-eng
        - Leipzig-simplewikipedia-2021_300k-eng

- id: wmt23-zhen
  langs: zho-eng
  #dev:
  #test:
  train:
    - &para_zho_eng
        - ParaCrawl-paracrawl-1_bonus-eng-zho
        - Statmt-news_commentary-16-eng-zho
        - Statmt-wikititles-3-zho-eng
        - OPUS-unpc-v1.0-eng-zho
        - Facebook-wikimatrix-1-eng-zho
        # TODO: CCMT
  mono_train:
    - &mono_zho
        - Statmt-news_crawl-2021-zho
        - Statmt-news_commentary-17-zho
        - Statmt-commoncrawl-wmt22-zho
        - Leipzig-wikipedia-2018_1m-zho
        - Leipzig-web-2016_1m-zho_MO
        - Leipzig-tradnewscrawl-2011_1m-zho
        - Leipzig-news-2020_300k-zho
    - *mono_eng

- id: wmt23-enzh
  langs: eng-zho
  train:
    - *para_zho_eng             # same bitexts as zho->eng
    - Statmt-backtrans_enzh-wmt20-eng-zho
  mono_train:
   - *mono_eng
   - *mono_zho

- id: wmt23-deen
  langs: deu-eng
  #dev:
  #test:
  train: &para_deu_eng
    - Statmt-europarl-10-deu-eng
    - ParaCrawl-paracrawl-9-eng-deu
    - Statmt-commoncrawl_wmt13-1-deu-eng
    - Statmt-news_commentary-16-deu-eng
    - Statmt-wikititles-3-deu-eng
    - Facebook-wikimatrix-1-deu-eng
    - Tilde-eesc-2017-deu-eng
    - Tilde-ema-2016-deu-eng
    - Tilde-airbaltic-1-deu-eng
    - Tilde-czechtourism-1-deu-eng
    - Tilde-ecb-2017-deu-eng
    - Tilde-rapid-2016-deu-eng
    - Tilde-rapid-2019-deu-eng
  mono_train:
    - &mono_deu
        - Statmt-commoncrawl-wmt22-deu
        - Statmt-europarl-10-deu
        - Statmt-news_commentary-17-deu
        - Statmt-news_crawl-2021-deu
        # TODO extended common crawl
        - Leipzig-wikipedia-2021_1m-deu
        - Leipzig-comweb-2021_1m-deu
        - Leipzig-mixed_typical-2011_1m-deu
        - Leipzig-news-2022_30k-deu
        - Leipzig-newscrawl-2020_1m-deu
        - Leipzig-web-2021_100k-deu_DE
    - *mono_eng
- id: wmt23-ende
  langs: eng-deu
  #dev:
  #test:
  train: *para_deu_eng
  mono_train:
    - *mono_eng
    - *mono_deu

- id: wmt23-heen
  langs: heb-eng
  #dev:
  #test:
  train: &para_heb_eng
    - Statmt-ccaligned-1-eng-heb_IL
    - Facebook-wikimatrix-1-eng-heb
    - Neulab-tedtalks_train-1-eng-heb
    - ELRC-wikipedia_health-1-eng-heb
    - OPUS-ccmatrix-v1-eng-heb
    - OPUS-elrc_3065_wikipedia_health-v1-eng-heb
    - OPUS-elrc_wikipedia_health-v1-eng-heb
    - OPUS-elrc_2922-v1-eng-heb
    - OPUS-gnome-v1-eng-heb
    - OPUS-globalvoices-v2018q4-eng-heb
    - OPUS-kde4-v2-eng-heb
    - OPUS-multiccaligned-v1-eng-heb
    - OPUS-opensubtitles-v2018-eng-heb
    - OPUS-php-v1-eng-heb
    - OPUS-qed-v2.0a-eng-heb
    - OPUS-tatoeba-v2-eng-heb
    - OPUS-tatoeba-v20220303-eng-heb
    - OPUS-ubuntu-v14.10-eng-heb
    - OPUS-wikipedia-v1.0-eng-heb
    - OPUS-xlent-v1.1-eng-heb
    - OPUS-bible_uedin-v1-eng-heb
    - OPUS-wikimedia-v20210402-eng-heb
    #- OPUS-ccaligned-v1-eng-heb     # from statmt
    #- OPUS-wikimatrix-v1-eng-heb    # from facebook
    #- OPUS-opus100_train-1-eng-heb  # combo of above
  mono_train:
    - *mono_eng
    - &mono_heb
        - Leipzig-news-2020_1m-heb
        - Leipzig-newscrawl-2011_1m-heb
        - Leipzig-wikipedia-2021_1m-heb

- id: wmt23-enhe
  langs: eng-heb
  #dev:
  #test:
  train: *para_heb_eng
  mono_train:
    - *mono_eng
    - *mono_heb

- id: wmt23-jaen
  langs: jpn-eng
  #dev:
  #test:
  train: &para_jpn_eng
    - Statmt-news_commentary-16-eng-jpn
    - KECL-paracrawl-3-eng-jpn
    - Statmt-wikititles-3-jpn-eng
    - Facebook-wikimatrix-1-eng-jpn
    - Statmt-ted-wmt20-eng-jpn
    - StanfordNLP-jesc_train-1-eng-jpn
    - Phontron-kftt_train-1-eng-jpn
  mono_train:
    - *mono_eng
    - &mono_jpn
        - Statmt-news_crawl-2021-jpn
        - Statmt-news_commentary-17-jpn
        - Statmt-commoncrawl-wmt22-jpn
        - Leipzig-web-2020_1m-jpn_JP
        - Leipzig-comweb-2018_1m-jpn
        - Leipzig-web_public-2019_1m-jpn_JP
        - Leipzig-news-2020_100k-jpn
        - Leipzig-newscrawl-2019_1m-jpn
        - Leipzig-wikipedia-2021_1m-jpn

- id: wmt23-enjp
  langs: eng-jpn
  #dev:
  #test:
  train: *para_jpn_eng
  mono_train:
    - *mono_jpn
    - *mono_eng

- id: wmt23-ruen
  langs: rus-eng
  #dev:
  #test:
  train:
    - &para_rus_eng
        - ParaCrawl-paracrawl-1_bonus-eng-rus
        - Statmt-commoncrawl_wmt13-1-rus-eng
        - Statmt-news_commentary-16-eng-rus
        - Statmt-yandex-wmt22-eng-rus
        - Statmt-wikititles-3-rus-eng
        - OPUS-unpc-v1.0-eng-rus
        - Facebook-wikimatrix-1-eng-rus
        - Tilde-airbaltic-1-eng-rus
        - Tilde-czechtourism-1-eng-rus
        - Tilde-worldbank-1-eng-rus
    - Statmt-backtrans_ruen-wmt20-rus-eng
  mono_train:
    - *mono_eng
    - &mono_rus
        - Statmt-news_crawl-2021-rus
        - Statmt-news_commentary-17-rus
        - Statmt-commoncrawl-wmt22-rus
        - Leipzig-news-2022_1m-rus
        - Leipzig-newscrawl_public-2018_1m-rus
        - Leipzig-web-2017_1m-rus_GE
        - Leipzig-wikipedia-2021_1m-rus

- id: wmt23-enru
  langs: eng-rus
  #dev:
  #test:
  train:
    - *para_rus_eng
    - Statmt-backtrans_enru-wmt20-eng-rus
  mono_train:
    - *mono_rus
    - *mono_eng

- id: wmt23-csuk
  langs: ces-ukr
  #dev:
  #test:
  train:
    - Facebook-wikimatrix-1-ces-ukr
    #- OPUS-wikimatrix-v1-ces-ukr
    - ELRC-acts_ukrainian-1-ces-ukr
    - OPUS-ccmatrix-v1-ces-ukr
    - OPUS-elrc_5179_acts_ukrainian-v1-ces-ukr
    - OPUS-elrc_wikipedia_health-v1-ces-ukr
    - OPUS-eubookshop-v2-ces-ukr
    - OPUS-gnome-v1-ces-ukr
    - OPUS-kde4-v2-ces-ukr
    - OPUS-multiccaligned-v1.1-ces-ukr
    - OPUS-multiparacrawl-v9b-ces-ukr
    - OPUS-opensubtitles-v2018-ces-ukr
    - OPUS-qed-v2.0a-ces-ukr
    - OPUS-ted2020-v1-ces-ukr
    - OPUS-tatoeba-v20220303-ces-ukr
    - OPUS-ubuntu-v14.10-ces-ukr
    - OPUS-xlent-v1.1-ces-ukr
    - OPUS-bible_uedin-v1-ces-ukr
    - OPUS-wikimedia-v20210402-ces-ukr
  mono_train:
    - *mono_ces
    - &mono_ukr
        - Statmt-news_crawl-2021-ukr
        - LangUk-news-1-ukr
        - LangUk-wiki_dump-1-ukr
        - LangUk-fiction-1-ukr
        - LangUk-ubercorpus-1-ukr
        - LangUk-laws-1-ukr
        - Leipzig-mixed-2012_1m-ukr
        - Leipzig-news-2022_1m-ukr
        - Leipzig-newscrawl-2018_1m-ukr
        - Leipzig-web-2019_1m-ukr_UA
        - Leipzig-wikipedia-2021_1m-ukr

- id: wmt23-enuk
  langs: eng-ukr
  #dev:
  #test:
  train: &para_eng_ukr
      - ParaCrawl-paracrawl-1_bonus-eng-ukr
      - Tilde-worldbank-1-eng-ukr
      - Facebook-wikimatrix-1-eng-ukr
      - ELRC-acts_ukrainian-1-eng-ukr
  mono_train:
    - *mono_eng
    - *mono_ukr

- id: wmt23-uken
  langs: ukr-eng
  #dev:
  #test:
  train: *para_eng_ukr
  mono_train:
    - *mono_eng
    - *mono_ukr

Add/Customize a Recipe

To customize recipes (e.g., for the unconstrained task), add or edit entries in a recipe YAML file as follows:

- id: wmt23-deen  # (1)
  langs: deu-eng
  dev:  # (2)
    - Statmt-newstest_deen-2020-deu-eng
    - Statmt-newstest_ende-2020-eng-deu
  test:  # (2)
    #- Statmt-newstest_deen-2021-deu-eng
    #- Statmt-newstest_ende-2021-eng-deu
  train:  # (3)
    - Statmt-europarl-10-deu-eng
    - ParaCrawl-paracrawl-9-eng-deu
    - Statmt-commoncrawl_wmt13-1-deu-eng
    - Statmt-news_commentary-16-deu-eng
    - Statmt-wikititles-3-deu-eng
    - Tilde-rapid-2019-deu-eng # - Tilde-rapid-2016-deu-eng
    - Facebook-wikimatrix-1-deu-eng
  mono_train:  # (4)
    - Statmt-commoncrawl-wmt22-deu
    - Statmt-europarl-10-deu
    - Statmt-news_commentary-17-deu
    - Statmt-news_crawl-2021-deu
    - Statmt-news_crawl-2021-eng
    - Statmt-news_discussions-2019-eng
    - Statmt-news_commentary-17-eng
    - Statmt-commoncrawl-wmt22-eng
  1. id has to be unique.

  2. dev and test are optional. Each can be a single dataset (i.e., a string) or a list of datasets (i.e., a list of strings).

  3. train is required. To list all available parallel dataset IDs for eng-deu: mtdata list -l eng-deu -id

  4. mono_train is optional. To list all available monolingual dataset IDs for eng: mtdata list -l eng -id
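
As a sketch of the rules above, the following writes a small custom recipe to its own file. The file name mtdata.recipes.custom.yml is chosen only so that it matches the mtdata.recipes*.yml glob, and the recipe ID my-deen-small is hypothetical; the dataset IDs are taken from the example above.

```shell
# Write a custom recipe to a file matching mtdata.recipes*.yml so that
# mtdata can pick it up from the current directory.
# The file name and the ID "my-deen-small" are hypothetical.
cat > mtdata.recipes.custom.yml <<'EOF'
- id: my-deen-small
  langs: deu-eng
  dev: Statmt-newstest_deen-2020-deu-eng
  train:
    - Statmt-europarl-10-deu-eng
    - Statmt-news_commentary-16-deu-eng
EOF
```

The new ID should then show up in the output of mtdata list-recipe -id and can be downloaded with mtdata get-recipe -ri my-deen-small -o my-deen-small.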

Issues/Bugs

Please report them via GitHub issues at github.com/thammegowda/mtdata.