  • Chris Callison-Burch
  • James Dennis
  • Andreas Eisele
  • Johann Roturier
  • Maria Holmqvist
  • ByungGyu Ahn

WikiTrans is an open source effort to translate high-quality Wikipedia articles into other languages. We aim to provide translation tools in the form of statistical machine translation systems and other translation aids, and to gather bilingual parallel corpora for training statistical models.

We have several subproject ideas that people can work on during the workshop:

  1. WikiRank - we can't translate every Wikipedia page, so determine which pages we ought to translate by ranking the most popular Wikipedia pages for each language using the page view statistics provided at
  2. Extract text from wiki-markup and split it into sentences. Work out how to extract only text elements from Wikipedia using the mwlib Python library. Learn how to train NLTK's unsupervised sentence splitter model and train sentence splitters for different languages using Wikipedia data dumps.
  3. Integration with Amazon's Mechanical Turk - we'll bootstrap the initial set of data by soliciting translations from Turkers. Help us finalize the details of our existing implementation using the boto API. Also design tasks to allow monolingual Turkers to edit the output of MT systems.
  4. Write a classifier to determine whether something is a machine translation, a human translation, or not a translation at all. Use the classifier to flag suspicious translations from Mechanical Turk, so that they may be rejected.
  5. Set up Joshua as a server, so that it can handle requests from WikiTrans when a new page needs to be translated.
  6. Translation collaboration - design tools to allow collaboration between two monolingual users who speak different languages. For instance, tools that allow them to flag words that are poorly translated by an MT system, so that the other user can re-phrase and re-translate.
  7. Integration with Philipp Koehn's computer assisted translation (CAT) tools.

Check out our code at GitHub

Page last modified on January 26, 2010, at 11:09 AM