Data Repository for "When Does Unsupervised Machine Translation Work?"

This is the data repository for "When Does Unsupervised Machine Translation Work?" (Marchisio, Duh, and Koehn, 2020).

Preprocessing is done with Monoses (https://github.com/artetxem/monoses).

Download data.

Description of files:

  • cc/*100M_tokens are used in CC experiments (see paper)
  • cc/*291M_tokens are used in Different Domain experiments
  • nc/* are used for News experiments
  • un/* contains data for Parallel and Disjoint experiments. Different Domain experiments also use un/*/train.{fr,ru}.a
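
The file descriptions above can be turned into a small lookup for scripts that need to select files by experiment. The following is a minimal sketch, not part of the repository: the names `EXPERIMENT_PATTERNS` and `files_for` are hypothetical, and the data root passed in is assumed to be wherever the downloaded data was unpacked. Only the glob patterns come from the list above.

```python
from pathlib import Path

# Glob patterns copied from the file descriptions. Python's glob has no
# brace expansion, so train.{fr,ru}.a is written out as two patterns.
# The dictionary name and keys are illustrative, not defined by the repository.
EXPERIMENT_PATTERNS = {
    "CC": ["cc/*100M_tokens"],
    "Different Domain": ["cc/*291M_tokens", "un/*/train.fr.a", "un/*/train.ru.a"],
    "News": ["nc/*"],
    "Parallel": ["un/*"],
    "Disjoint": ["un/*"],
}


def files_for(experiment, data_root):
    """Return sorted paths (files or directories) under data_root
    that match the patterns listed for the given experiment."""
    root = Path(data_root)
    matches = set()
    for pattern in EXPERIMENT_PATTERNS[experiment]:
        matches.update(root.glob(pattern))
    return sorted(matches)
```

For example, `files_for("News", "data")` would list everything under `data/nc/`, where `data` is an assumed location for the unpacked download. A pattern that matches nothing simply contributes no paths.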

If you use these splits of the data, please cite:

Kelly Marchisio, Kevin Duh, and Philipp Koehn. When Does Unsupervised Machine Translation Work? In Proceedings of the Fifth Conference on Machine Translation (WMT), 2020.

    @InProceedings{marchisio-duh-koehn:2020:WMT,
      author    = {Marchisio, Kelly and Duh, Kevin and Koehn, Philipp},
      title     = {When Does Unsupervised Machine Translation Work?},
      booktitle = {Proceedings of the Fifth Conference on Machine Translation},
      month     = {November},
      year      = {2020},
      publisher = {Association for Computational Linguistics},
    }