Moses: statistical machine translation system


Parallel Corpora Available On-Line

This page is your 'shopping list' for parallel texts. Let us know if we're missing something.

We make no claims about copyright; make sure you don't break any restrictions.
We also make no claims about how well the collections are aligned. Some sources will need more work from you, some less.

And remember, we're interested in any tools you create to get clean data from not-so-clean collections.
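As an illustration of the kind of cleaning tool meant here, the following is a minimal sketch of a sentence-pair filter in Python. It is not part of Moses: the function name `clean_parallel` and the thresholds (`max_len=80`, `max_ratio=9.0`) are assumptions chosen for the example, and real collections usually need corpus-specific steps on top of this.

```python
def clean_parallel(src_lines, tgt_lines, min_len=1, max_len=80, max_ratio=9.0):
    """Yield (source, target) pairs that pass simple sanity filters.

    Hypothetical helper for illustration; the default thresholds are assumptions.
    """
    for src, tgt in zip(src_lines, tgt_lines):
        # Collapse runs of whitespace and trim both sides.
        src = " ".join(src.split())
        tgt = " ".join(tgt.split())
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop pairs where either side is empty.
        if n_src == 0 or n_tgt == 0:
            continue
        # Drop pairs where either side is too short or too long.
        if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
            continue
        # Drop pairs whose token-count ratio suggests a misalignment.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        yield src, tgt


if __name__ == "__main__":
    source = ["Hello   world .", "", "Yes"]
    target = ["Hallo Welt .", "Leer", "Ja " + "x " * 20]
    for s, t in clean_parallel(source, target):
        print(s, "|||", t)
```

The filter only looks at token counts, so it catches empty lines and gross misalignments but not subtler problems such as wrong-language text or encoding damage.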

Multi-language

Europarl, release 7 data available (most European languages)
News Commentary corpus, part of WMT 2013 shared task training data
OPUS (various languages, various sub-corpora)
Subtitles (various languages, various sites, e.g. OpenSubtitles, TED)
JRC-Acquis: multilingual legal text in 22 European languages
EU Official Journal: multilingual legal text in 22 European languages
The United Nations Parallel Corpus v1.0: an official parallel corpus released by the United Nations. Contains sentence-aligned data for all pairs of the 6 official UN languages, plus a fully aligned subcorpus across all 6 languages.
Multi-UN: a multilingual corpus from United Nations documents in 7 languages (an older, unofficial release that predates the corpus above)
Microtopia: a multilingual corpus extracted from Twitter and Sina Weibo in 11 languages
Asian Scientific Paper Excerpt Corpus: Japanese-English and Japanese-Chinese scientific paper abstracts (3 million sentence pairs JE, 600,000 sentence pairs JC)

Bi-language

CzEng (cs-en)
Hunglish (hu-en)
Kyoto (ja-en)

Other

Besides the collections mentioned above, the LDC has heaps of data available.


Page last modified on May 31, 2016, at 03:29 PM