Neural Paracrawl

The parallel corpora published here were created with a neural version of the Paracrawl pipeline. The methods used to crawl and align this data are described in the following paper:
Neural Methods for Aligning Large-Scale Parallel Corpora from the Web for South and East Asian Languages, Philipp Koehn, Proceedings of the Ninth Conference on Machine Translation (WMT), 2024.

Please cite it when using the data:

@InProceedings{koehn:2024:WMT,
  author    = {Koehn, Philipp},
  title     = {Neural Methods for Aligning Large-Scale Parallel Corpora from the Web for South and East Asian Languages},
  booktitle = {Proceedings of the Ninth Conference on Machine Translation},
  month     = {November},
  year      = {2024},
  address   = {Miami, Florida, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {1454--1466},
  url       = {https://aclanthology.org/2024.wmt-1.132}
}

Data

Hindi-English: raw txt tmx
Indonesian-English: raw txt tmx
Khmer-English: raw txt tmx
Korean-English: raw txt tmx
Lao-English: raw txt tmx
Burmese-English: raw txt tmx
Nepali-English: raw txt tmx
Thai-English: raw txt tmx
Vietnamese-English: raw txt tmx
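
Reading the tmx files

For readers who want to load the tmx releases, the following is a minimal sketch in Python, not an official loader. It assumes the files follow the standard TMX layout (tu elements containing tuv elements with an xml:lang attribute and a seg child) and that the language codes are "en" and "hi" for Hindi-English. Neither assumption is confirmed on this page, so check a downloaded file first. The function name read_tmx_pairs and the inline sample are hypothetical, introduced only for illustration.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang="en", tgt_lang="hi"):
    """Yield (source, target) segment pairs from a TMX file path or binary file object.

    Assumes standard TMX structure; the language codes are assumptions.
    """
    # iterparse streams the file, so large corpora need not fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segs[lang.lower()] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # release the processed translation unit


# Tiny made-up sample standing in for a downloaded file.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello</seg></tuv>
      <tuv xml:lang="hi"><seg>नमस्ते</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>No translation here</seg></tuv>
    </tu>
  </body>
</tmx>
""".encode("utf-8")

pairs = list(read_tmx_pairs(io.BytesIO(SAMPLE_TMX), "en", "hi"))
for src, tgt in pairs:
    print(src, "\t", tgt)
```

To read a downloaded file, pass its path in place of the BytesIO object. If the download is compressed, open it with the matching standard-library module (for example gzip.open in binary mode) and pass the resulting file object.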