Neural Paracrawl
The parallel corpora published here were created with a neural version of the Paracrawl pipeline.
The methods used to crawl and align this data are described in the following paper:
Neural Methods for Aligning Large-Scale Parallel Corpora from the Web for South and East Asian Languages, Philipp Koehn, Proceedings of the Ninth Conference on Machine Translation (WMT), 2024.
Please cite it when using the data:
@InProceedings{koehn:2024:WMT,
author = {Koehn, Philipp},
title = {Neural Methods for Aligning Large-Scale Parallel Corpora from the Web for South and East Asian Languages},
booktitle = {Proceedings of the Ninth Conference on Machine Translation},
month = {November},
year = {2024},
address = {Miami, Florida, USA},
publisher = {Association for Computational Linguistics},
pages = {1454--1466},
url = {https://aclanthology.org/2024.wmt-1.132}
}
Data
Hindi-English:
raw
txt
tmx
Indonesian-English:
raw
txt
tmx
Khmer-English:
raw
txt
tmx
Korean-English:
raw
txt
tmx
Lao-English:
raw
txt
tmx
Burmese-English:
raw
txt
tmx
Nepali-English:
raw
txt
tmx
Thai-English:
raw
txt
tmx
Vietnamese-English:
raw
txt
tmx
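Reading the tmx files
The tmx downloads can be read with a standard XML parser. Below is a minimal sketch in Python, assuming the files follow the usual TMX layout (one tu element per sentence pair, containing tuv elements tagged with xml:lang, each wrapping a seg). The function name iter_tmx_pairs, the language codes "en" and "id", and the sample text are illustrative assumptions, not taken from the released files; check the actual language codes in a downloaded file before relying on them.

```python
import io
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_tmx_pairs(source, src_lang="en", tgt_lang="id"):
    """Yield (source_text, target_text) for each <tu> holding both languages.

    `source` may be a file path or a binary file object. The language codes
    are assumptions; adjust them to whatever the downloaded file uses.
    """
    # iterparse streams the file, so large corpora need not fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()  # drop the processed translation unit's children


# Tiny hand-written sample standing in for a real download (hypothetical content).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning</seg></tuv>
      <tuv xml:lang="id"><seg>Selamat pagi</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Untranslated line</seg></tuv>
    </tu>
  </body>
</tmx>""".encode("utf-8")

pairs = list(iter_tmx_pairs(io.BytesIO(SAMPLE)))
print(pairs)
```

To read a real file, pass its path instead of the in-memory sample, for example iter_tmx_pairs("corpus.tmx", "en", "id"), where corpus.tmx is a placeholder name. Translation units that lack either language are skipped rather than raising an error.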