Projects

Parallel Corpus Extraction From Common Crawl

Project leader: NN

Desirable skills for participants: Hadoop, Amazon Web Services

Note: this project currently has no project leader, so participants will need to self-organize.

Because the data set is too large to download, it has to be accessed through Amazon Web Services. More information is available here:

http://commoncrawl.org/data/accessing-the-data/

The first step should be to run language detection on all the data to identify web domains with content in more than one language. After that, the pipeline of document alignment, sentence alignment, and filtering at various stages must be applied. This can possibly be done with existing tools and offline (i.e., on non-Amazon machines, once small, useful subsets of the data have been identified).
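The domain identification step could look roughly like the following Python sketch. It assumes that a separate language detection step has already produced (URL, language code) pairs. The function names `domain_of` and `multilingual_domains` and the `min_pages` threshold are hypothetical and serve only as illustration; on the full crawl, the same grouping would more likely run as a Hadoop job that emits (domain, language) pairs in the map phase and collects them per domain in the reduce phase.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lowercased host name of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def multilingual_domains(pages, min_pages=1):
    """Map each domain to its languages, keeping domains with at least two.

    pages: iterable of (url, language_code) pairs; the language codes are
        assumed to come from an earlier language detection step.
    min_pages: a language counts for a domain only if at least this many
        pages in that domain were detected as being in that language.
    """
    # counts[domain][language] = number of pages detected in that language
    counts = defaultdict(lambda: defaultdict(int))
    for url, lang in pages:
        domain = domain_of(url)
        if domain:  # skip URLs without a recognizable host name
            counts[domain][lang] += 1

    result = {}
    for domain, per_lang in counts.items():
        langs = sorted(l for l, n in per_lang.items() if n >= min_pages)
        if len(langs) >= 2:
            result[domain] = langs
    return result
```

Raising `min_pages` is one simple way to ignore domains where a second language shows up only on a handful of pages, which may be detection noise. The resulting list of domains would then define the smaller subset of data to download for document and sentence alignment.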