Moses: Statistical Machine Translation System

Preparing Training Data

Training data has to be provided sentence-aligned (one sentence per line), in two files: one for the foreign sentences and one for the English sentences:

 > head -3 corpus/euro.*
 ==> corpus/euro.de <==
 wiederaufnahme der sitzungsperiode
 ich erklaere die am donnerstag , den 28. maerz 1996 unterbrochene 
 sitzungsperiode des europaeischen parlaments fuer wiederaufgenommen .
 begruessung

 ==> corpus/euro.en <==
 resumption of the session
 i declare resumed the session of the european parliament adjourned 
 on thursday , 28 march 1996 .
 welcome

A few other points have to be taken care of (a short sketch of these steps follows the list):

  • Unix commands require the environment variable LC_ALL=C
  • one sentence per line, no empty lines
  • sentences longer than 100 words (and their corresponding translations) have to be eliminated (note that a shorter sentence length limit will speed up training)
  • everything lowercased (use lowercase.perl)
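
As a rough illustration of the last three points, here is a minimal Python sketch that lowercases a sentence-aligned file pair and drops over-long sentences together with their translations. It is not part of Moses: in practice you would use lowercase.perl and the clean-corpus-n.perl script described below, whose behavior (e.g. character handling) may differ. The output names corpus/lc.* are hypothetical.

 MAX_LEN = 100  # sentence length limit from above; a lower limit speeds up training

 def lowercase_and_filter(src_in, tgt_in, src_out, tgt_out):
     with open(src_in) as fs, open(tgt_in) as ft, \
          open(src_out, "w") as gs, open(tgt_out, "w") as gt:
         for src, tgt in zip(fs, ft):
             src, tgt = src.strip().lower(), tgt.strip().lower()
             # skip empty lines; drop over-long sentences and their translations
             if not src or not tgt:
                 continue
             if len(src.split()) > MAX_LEN or len(tgt.split()) > MAX_LEN:
                 continue
             gs.write(src + "\n")
             gt.write(tgt + "\n")

 # hypothetical output names; input names are the example files from above
 lowercase_and_filter("corpus/euro.de", "corpus/euro.en", "corpus/lc.de", "corpus/lc.en")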

Training data for factored models

You will have to provide training data in the format

 word0factor0|word0factor1|word0factor2 word1factor0|word1factor1|word1factor2 ...

instead of the un-factored format

 word0 word1 word2
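
For a concrete picture, suppose each word carries three factors: surface form, part-of-speech tag, and lemma (an assumed layout; Moses itself does not prescribe what each factor means). Here is a small Python sketch producing the factored representation of a German sentence:

 # assumed factor layout: surface|POS|lemma; the tag values are illustrative
 surface = ["das", "haus", "ist", "klein"]
 pos     = ["ART", "NN", "VAFIN", "ADJD"]
 lemma   = ["das", "haus", "sein", "klein"]

 factored = " ".join("|".join(factors) for factors in zip(surface, pos, lemma))
 print(factored)
 # das|ART|das haus|NN|haus ist|VAFIN|sein klein|ADJD|klein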

Cleaning the corpus

The script clean-corpus-n.perl is a small script that cleans up a parallel corpus so that it works well with the training script.

It performs the following steps:

  • removes empty lines
  • removes redundant space characters
  • drops lines (and their corresponding lines) that are empty, too short, too long, or violate the 9-1 sentence ratio limit of GIZA++

The command syntax is:

 clean-corpus-n.perl CORPUS L1 L2 OUT MIN MAX

For example: clean-corpus-n.perl raw de en clean 1 50 takes the corpus files raw.de and raw.en, deletes lines longer than 50 words (and, with the minimum of 1, empty lines), and creates the output files clean.de and clean.en.
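
To make these steps concrete, the following Python sketch mirrors the behavior described above. It is a reimplementation for illustration only, not the actual Perl script; the whitespace normalization and edge cases of clean-corpus-n.perl may differ.

 import re

 def clean_corpus(prefix_in, l1, l2, prefix_out, min_len, max_len):
     with open(f"{prefix_in}.{l1}") as f1, open(f"{prefix_in}.{l2}") as f2, \
          open(f"{prefix_out}.{l1}", "w") as g1, open(f"{prefix_out}.{l2}", "w") as g2:
         for s1, s2 in zip(f1, f2):
             # remove redundant space characters
             s1 = re.sub(r"\s+", " ", s1).strip()
             s2 = re.sub(r"\s+", " ", s2).strip()
             n1, n2 = len(s1.split()), len(s2.split())
             # drop pairs that are empty, too short, or too long
             if not (min_len <= n1 <= max_len and min_len <= n2 <= max_len):
                 continue
             # drop pairs violating the 9-1 sentence ratio limit of GIZA++
             if n1 > 9 * n2 or n2 > 9 * n1:
                 continue
             g1.write(s1 + "\n")
             g2.write(s2 + "\n")

 clean_corpus("raw", "de", "en", "clean", 1, 50)  # the example call from above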
