Preparing Training Data

Training data has to be provided sentence-aligned (one sentence per line) in two files, one for the foreign sentences and one for the English sentences:

 >head -3 corpus/euro.*
 ==> corpus/euro.de <==
 wiederaufnahme der sitzungsperiode
 ich erklaere die am donnerstag , den 28. maerz 1996 unterbrochene sitzungsperiode des europaeischen parlaments fuer wiederaufgenommen .
 begruessung

 ==> corpus/euro.en <==
 resumption of the session
 i declare resumed the session of the european parliament adjourned on thursday , 28 march 1996 .
 welcome
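Since the two files must stay aligned line by line, it is worth verifying that they contain the same number of lines before training; if the counts differ, the corpus is misaligned somewhere. A quick check:

 >wc -l corpus/euro.de corpus/euro.en

The two reported line counts must be identical.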

A few other points have to be taken care of, as described in the following sections.

Training data for factored models

You will have to provide training data in the format

 word0factor0|word0factor1|word0factor2 word1factor0|word1factor1|word1factor2 ...

instead of the unfactored format

 word0 word1 word2
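For example, a factored version of the first English sentence above, with three factors per word (here, illustratively, surface form, part-of-speech tag, and lemma), could look like this:

 resumption|NN|resumption of|IN|of the|DT|the session|NN|session

The number and meaning of the factors is up to you, but every word in the corpus must carry the same number of factors.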

Cleaning the corpus

The script clean-corpus-n.perl is a small script that cleans up a parallel corpus so that it works well with the training script.

It performs the following steps:

 - removes empty lines
 - removes redundant space characters
 - drops lines (and their corresponding lines in the other file) that are too short or too long, or that violate the 9-1 sentence ratio limit of GIZA++

The command syntax is:

 clean-corpus-n.perl CORPUS L1 L2 OUT MIN MAX

For example:

 >clean-corpus-n.perl raw de en clean 1 50

takes the corpus files raw.de and raw.en, drops sentence pairs in which either line is empty or longer than 50 words, and creates the output files clean.de and clean.en.
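Applied to the corpus from the beginning of this section (the script path shown assumes the standard scripts/training/ directory of the Moses distribution), a complete invocation would be:

 >perl scripts/training/clean-corpus-n.perl corpus/euro de en corpus/euro.clean 1 50

This reads corpus/euro.de and corpus/euro.en and writes the cleaned pair corpus/euro.clean.de and corpus/euro.clean.en.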