Google-beating Irish--English MT

The idea is to build, within the five days of the marathon, a free/open-source Irish--English MT system that can beat Google's offering. Possible components: free corpora, morphological analysers, dictionaries, Moses, MEMT, ...

Resources

Plan

  1. Use Eurotext.
  2. Use KDE.
  3. Extract a dictionary and generate a full-form bilingual dictionary from it (using a generator to produce all the forms, and Apertium to do any necessary transfer, e.g. for verb tenses and adjectives).
  4. Translate the Irish Wikipedia using Google Translate.
  5. Train three different SMT systems (Moses, Joshua, Marclator) and do system combination using the CMU MEMT system.
  6. Compare the output of the MEMT system with Google Translate :)
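
Step 3 of the plan can be illustrated with a minimal sketch. It assumes that each lemma's inflected forms are already available as a table keyed by morphological tag; in the plan those forms would come from a morphological generator, with Apertium handling any transfer. The tables GA_FORMS and EN_FORMS, the tag names, and both function names are hypothetical and are not taken from the marathon's code.

```python
# Hypothetical toy paradigms: lemma -> {morphological tag: surface form}.
# In the plan, these forms would be produced by a morphological generator.
GA_FORMS = {
    "cat": {"sg": "cat", "pl": "cait"},
    "madra": {"sg": "madra", "pl": "madraí"},
}
EN_FORMS = {
    "cat": {"sg": "cat", "pl": "cats"},
    "dog": {"sg": "dog", "pl": "dogs"},
}


def expand_entry(ga_lemma, en_lemma, ga_forms, en_forms):
    """Pair the surface forms of two lemmas that share a morphological tag."""
    ga = ga_forms.get(ga_lemma, {})
    en = en_forms.get(en_lemma, {})
    # Tags present on only one side are skipped rather than guessed.
    return [(ga[tag], en[tag]) for tag in ga if tag in en]


def full_form_list(lemma_pairs, ga_forms, en_forms):
    """Expand a lemma-level bilingual dictionary into full-form pairs."""
    pairs = []
    for ga_lemma, en_lemma in lemma_pairs:
        pairs.extend(expand_entry(ga_lemma, en_lemma, ga_forms, en_forms))
    return pairs


if __name__ == "__main__":
    lemma_dictionary = [("cat", "cat"), ("madra", "dog")]
    for ga, en in full_form_list(lemma_dictionary, GA_FORMS, EN_FORMS):
        print(f"{ga}\t{en}")
```

The resulting pairs could be appended to the parallel training data as extra one-word "sentences", which is one common way to use a vocabulary list in SMT training; whether the marathon did exactly this is not stated here.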

Result

  • Joshua decoder
  • Data from: EU, KDE, Vocab. list, Wikipedia (ga-en with Google)
  • 187,521 sentences (training)
  • 2,000 sentences (dev)
  • 1,667 sentences (devtest)
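
The three set sizes above add up to 191,188 sentence pairs. How the split was actually made is not recorded on this page; the sketch below assumes a seeded random split of an in-memory list of sentence pairs, and the function name split_corpus is hypothetical.

```python
import random


def split_corpus(pairs, n_dev=2000, n_devtest=1667, seed=0):
    """Split sentence pairs into (train, dev, devtest) without overlap.

    Assumption: a seeded random shuffle; the original split method is unknown.
    """
    if len(pairs) < n_dev + n_devtest:
        raise ValueError("corpus is smaller than the requested held-out sets")
    shuffled = list(pairs)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    dev = shuffled[:n_dev]
    devtest = shuffled[n_dev:n_dev + n_devtest]
    train = shuffled[n_dev + n_devtest:]
    return train, dev, devtest


if __name__ == "__main__":
    # Placeholder pairs standing in for the real Irish--English corpus.
    corpus = [(f"ga {i}", f"en {i}") for i in range(191188)]
    train, dev, devtest = split_corpus(corpus)
    print(len(train), len(dev), len(devtest))
```

Holding out the dev and devtest sets before training keeps the tuning and evaluation sentences out of the training data.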

Google: BLEU (1 Ref) = 0.5924

MTM2010: BLEU (1 Ref) = 0.5326
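
For readers unfamiliar with the metric, the following is a minimal sketch of corpus-level BLEU against a single reference, using clipped 1- to 4-gram precision and a brevity penalty. It assumes whitespace tokenisation and no smoothing; the scoring script actually used for the figures above is not named on this page, so this sketch should not be expected to reproduce them exactly. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis (no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()  # assumption: whitespace tokenisation
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


if __name__ == "__main__":
    refs = ["the cat sat on the mat"]
    print(corpus_bleu(["the cat sat on the mat"], refs))
    print(corpus_bleu(["the cat sat on a mat"], refs))
```

Both systems would be scored on the same devtest sentences with the same reference translations, so that the two numbers are directly comparable.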

Links

http://elx.dlsi.ua.es/~fran/mtm2010.new.tar.gz

Page last modified on January 30, 2010, at 09:38 AM