Parallel Corpus Extraction From Common Crawl

Project leader: NN

Desirable skills for participants: Hadoop, Amazon Web Services

Note: this project currently has no project leader, so participants will need to self-organize.

Since the dataset is far too large to download in full, it must be accessed through Amazon Web Services. More information about this here:
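Common Crawl is published in a public S3 bucket, and individual WARC records can be fetched with HTTP range requests using the byte offsets listed in the Common Crawl URL index. The sketch below only builds such a request; the WARC path, offset, and length are placeholders, not real records, and the HTTPS endpoint shown is an assumption based on Common Crawl's public data mirror.

```python
# Sketch: build an HTTP range request for one compressed WARC record
# from Common Crawl's public data endpoint. Offset and length would
# normally come from the Common Crawl URL index; the values here are
# placeholders for illustration only.

def warc_range_request(warc_path, offset, length):
    """Return (url, headers) for fetching a single WARC record slice."""
    url = "https://data.commoncrawl.org/" + warc_path  # assumed public endpoint
    # HTTP byte ranges are inclusive, hence offset + length - 1.
    headers = {"Range": "bytes=%d-%d" % (offset, offset + length - 1)}
    return url, headers

url, headers = warc_range_request(
    "crawl-data/CC-MAIN-2023-50/segments/example.warc.gz",  # placeholder path
    offset=4096, length=1024)
print(headers["Range"])
```

Fetching only the records you need this way is much cheaper than downloading whole crawl segments.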

The first step should be to run language detection on all the data to identify web domains with multi-language content. Then a pipeline of document alignment, sentence alignment, and filtering at various stages must be applied, possibly using existing tools and running offline (i.e., on non-Amazon machines) once small, useful subsets of the data have been identified.
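In practice the language-detection stage would use an existing tool (e.g. CLD2, langid.py, or fastText's language identification). As a toy illustration of the idea, here is a deliberately simplified detector that scores text by function-word hits per language; it is a stand-in, not what the project would actually deploy:

```python
# Toy language detector: scores text by counting function-word hits
# per language. A simplified stand-in for real tools such as CLD2,
# langid.py, or fastText, shown only to illustrate the pipeline stage.

STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein"},
    "fr": {"le", "la", "et", "les", "des", "est", "une"},
}

def detect_language(text):
    """Return the best-scoring language code, or None if no hits."""
    tokens = text.lower().split()
    scores = {lang: sum(t in words for t in tokens)
              for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(detect_language("the cat is in the garden"))  # en
```

Aggregating these per-page labels by web domain then yields the list of candidate multilingual domains for the alignment stages.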
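One of the simplest filters applied after sentence alignment is a length-ratio check that discards candidate pairs whose source and target lengths differ too much. The sketch below assumes whitespace tokenization, and the thresholds are common heuristic choices, not values prescribed by the project:

```python
# Minimal length-ratio filter for candidate sentence pairs. The
# max_ratio and min_len thresholds are heuristic defaults chosen for
# illustration, not project-specified values.

def keep_pair(src, tgt, max_ratio=2.0, min_len=3):
    """Keep a pair only if both sides are long enough and of similar length."""
    s, t = len(src.split()), len(tgt.split())
    if s < min_len or t < min_len:
        return False
    return max(s, t) / min(s, t) <= max_ratio

pairs = [
    ("this is a test sentence", "dies ist ein Testsatz"),       # kept
    ("hello", "eine sehr lange Uebersetzung die nicht passt"),  # dropped: too short
]
kept = [p for p in pairs if keep_pair(*p)]
```

Real pipelines combine several such filters (length ratio, language re-check, deduplication) before the corpus is released.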