Preparing Training Data

Training data has to be provided sentence-aligned (one sentence per line) in two files, one for the foreign sentences and one for the English sentences:

 >head -3 corpus/euro.*
 ==> corpus/euro.de <==
 wiederaufnahme der sitzungsperiode
 ich erklaere die am donnerstag , den 28. maerz 1996 unterbrochene sitzungsperiode des europaeischen parlaments fuer wiederaufgenommen .
 begruessung

 ==> corpus/euro.en <==
 resumption of the session
 i declare resumed the session of the european parliament adjourned on thursday , 28 march 1996 .
 welcome
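Since the two files must stay aligned line by line, it is worth verifying that they contain the same number of lines before training; if the counts differ, the corpus is misaligned somewhere. A quick check:

 >wc -l corpus/euro.de corpus/euro.en

The two reported line counts must be identical.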

A few other points have to be taken care of, as described in the following sections.

Training data for factored models

You will have to provide training data in the format

 word0factor0|word0factor1|word0factor2 word1factor0|word1factor1|word1factor2 ...

instead of the unfactored format

 word0 word1 word2
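For example, a factored version of the first English sentence above, with three factors per word (here, illustratively, surface form, part-of-speech tag, and lemma), could look like this:

 resumption|NN|resumption of|IN|of the|DT|the session|NN|session

The number and meaning of the factors is up to you, but every word in the corpus must carry the same number of factors.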

Cleaning the corpus

The script clean-corpus-n.perl is a small script that cleans up a parallel corpus so that it works well with the training script.

It performs the following steps:

 - removes empty lines
 - removes redundant space characters
 - drops lines (and their corresponding lines in the other file) that are too short or too long, or that violate the 9-1 sentence ratio limit of GIZA++

The command syntax is:

 clean-corpus-n.perl CORPUS L1 L2 OUT MIN MAX

For example:

 >clean-corpus-n.perl raw de en clean 1 50

takes the corpus files raw.de and raw.en, drops sentence pairs in which either line is empty or longer than 50 words, and creates the output files clean.de and clean.en.
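Applied to the corpus from the beginning of this section (the script path shown assumes the standard scripts/training/ directory of the Moses distribution), a complete invocation would be:

 >perl scripts/training/clean-corpus-n.perl corpus/euro de en corpus/euro.clean 1 50

This reads corpus/euro.de and corpus/euro.en and writes the cleaned pair corpus/euro.clean.de and corpus/euro.clean.en.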