Google-beating Irish--English MT
The idea is to build, within the five days of the marathon, a free/open-source Irish--English MT system that can beat Google's offering. Possible components: free corpora, morphological analysers, dictionaries, Moses, MEMT, ...
Resources
Plan
- Use Eurotext.
- Use KDE.
- Extract a dictionary and generate a full-form bilingual dictionary list from it (using a generator to produce all the forms, and Apertium to do any necessary transfer, e.g. verb tenses, adjectives).
- Translate the Irish Wikipedia using Google Translate.
- Train 3 different SMT systems (Moses, Joshua, Marclator) and do system recombination using the CMU MEMT system.
- Compare the output of the MEMT system with Google Translate :)
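The full-form dictionary step in the plan can be illustrated with a small sketch. It assumes each lemma already has a table of surface forms keyed by a morphological tag; in the plan those forms would come from a morphological generator, which is replaced here by hand-written toy tables. The names `expand_entry`, `full_form_list`, `IRISH_FORMS`, `ENGLISH_FORMS` and `LEMMA_DICT` are hypothetical and are not part of Apertium or any other tool mentioned on this page.

```python
# Toy paradigm tables: lemma -> {tag: surface form}.
# Assumption: a real system would fill these from a morphological generator.
IRISH_FORMS = {
    "bád": {"sg": "bád", "pl": "báid"},
    "cat": {"sg": "cat", "pl": "cait"},
}
ENGLISH_FORMS = {
    "boat": {"sg": "boat", "pl": "boats"},
    "cat": {"sg": "cat", "pl": "cats"},
}

# Lemma-level bilingual dictionary (Irish, English).
LEMMA_DICT = [("bád", "boat"), ("cat", "cat")]


def expand_entry(src_lemma, tgt_lemma, src_forms, tgt_forms):
    """Pair up the surface forms of two lemmas that share the same tag."""
    pairs = []
    for tag, src_surface in src_forms.get(src_lemma, {}).items():
        tgt_surface = tgt_forms.get(tgt_lemma, {}).get(tag)
        if tgt_surface is not None:  # skip tags with no counterpart
            pairs.append((src_surface, tgt_surface))
    return pairs


def full_form_list(lemma_dict, src_forms, tgt_forms):
    """Expand every lemma pair into its full-form pairs."""
    result = []
    for src_lemma, tgt_lemma in lemma_dict:
        result.extend(expand_entry(src_lemma, tgt_lemma, src_forms, tgt_forms))
    return result


if __name__ == "__main__":
    for ga, en in full_form_list(LEMMA_DICT, IRISH_FORMS, ENGLISH_FORMS):
        print(f"{ga}\t{en}")
```

Matching on a shared tag is the simplest possible form of transfer; cases where the two languages do not line up tag for tag (e.g. verb tenses) are what the plan delegates to Apertium.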
Result
- Joshua decoder
- Data from: EU, KDE, vocabulary list, Irish Wikipedia (ga-en, translated with Google)
- 187,521 sentences (training)
- 2,000 sentences (dev)
- 1,667 sentences (devtest)
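A split into training, dev and devtest sets like the one listed above could be produced along the following lines. This is a minimal sketch: the page does not say how the split was actually made, so the shuffling, the fixed seed and the function name `split_corpus` are all assumptions.

```python
import random


def split_corpus(pairs, dev_size, devtest_size, seed=0):
    """Shuffle sentence pairs and cut them into train, dev and devtest.

    Assumption: a seeded random shuffle; the original split method is unknown.
    """
    if dev_size + devtest_size > len(pairs):
        raise ValueError("held-out sets are larger than the corpus")
    shuffled = list(pairs)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    dev = shuffled[:dev_size]
    devtest = shuffled[dev_size:dev_size + devtest_size]
    train = shuffled[dev_size + devtest_size:]
    return train, dev, devtest


if __name__ == "__main__":
    # Toy corpus of (Irish, English) placeholder pairs.
    toy = [(f"ga {i}", f"en {i}") for i in range(20)]
    train, dev, devtest = split_corpus(toy, dev_size=3, devtest_size=2)
    print(len(train), len(dev), len(devtest))
```

With the sizes reported here the call would be `split_corpus(pairs, 2000, 1667)` on a corpus of 191,188 sentence pairs, leaving 187,521 for training.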
Google: BLEU (1 Ref) = 0.5924
MTM2010: BLEU (1 Ref) = 0.5326
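For readers unfamiliar with the metric, the following is a minimal sketch of corpus-level BLEU against a single reference, as in the "1 Ref" scores above. It assumes whitespace tokenisation, n-grams up to length 4 and no smoothing; the page does not say which scoring tool or tokenisation produced the reported figures, so this will not necessarily reproduce them. The function names `ngrams` and `corpus_bleu` are chosen here for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis, no smoothing."""
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens = hyp.split()  # assumption: whitespace tokenisation
        ref_tokens = ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalise output shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)


if __name__ == "__main__":
    hyps = ["the cat sat on the mat today"]
    refs = ["the cat sat on the mat"]
    print(round(corpus_bleu(hyps, refs), 4))
```

On this scale the gap between the two systems is 0.5924 - 0.5326 = 0.0598, i.e. about 6 BLEU points in Google's favour.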
Links
http://elx.dlsi.ua.es/~fran/mtm2010.new.tar.gz
Page last modified on January 30, 2010, at 09:38 AM