Training data has to be provided sentence aligned (one sentence per line), in two files, one for the foreign sentences, one for the English sentences:
> head -3 corpus/euro.*
==> corpus/euro.de <==
wiederaufnahme der sitzungsperiode
ich erklaere die am donnerstag , den 28. maerz 1996 unterbrochene sitzungsperiode des europaeischen parlaments fuer wiederaufgenommen .
begruessung

==> corpus/euro.en <==
resumption of the session
i declare resumed the session of the european parliament adjourned on thursday , 28 march 1996 .
welcome
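Since the two files have to correspond line by line, a quick sanity check (not part of the training pipeline itself, just a suggestion) is to compare their line counts, which must be identical:

> wc -l corpus/euro.de corpus/euro.en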
A few other points have to be taken care of:

* Set the environment variable LC_ALL=C, since some of the tools used during training are sensitive to locale settings.
* The text has to be lowercased (use lowercase.perl), as shown in the example below.
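For example, assuming lowercase.perl from the Moses scripts is on your PATH (the output file names here are only illustrative), these two steps could be run as:

export LC_ALL=C
lowercase.perl < corpus/euro.de > corpus/euro.lowercased.de
lowercase.perl < corpus/euro.en > corpus/euro.lowercased.en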
For factored models, you will have to provide training data in the format
word0factor0|word0factor1|word0factor2 word1factor0|word1factor1|word1factor2 ...
instead of the un-factored
word0 word1 word2
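For instance, with three factors per word, here assumed to be surface form, lemma, and part-of-speech tag (the choice of factors is only illustrative), a factored German line could look like:

das|das|ART haus|haus|NN ist|sein|VAFIN klein|klein|ADJD .|.|$.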
The script clean-corpus-n.perl is a small script that cleans up a parallel corpus so that it works well with the training script.
It performs the following steps:

* removes empty lines
* removes redundant space characters
* drops lines (and their corresponding lines in the other language file) that are shorter than MIN or longer than MAX words
The command syntax is:
clean-corpus-n.perl CORPUS L1 L2 OUT MIN MAX
For example,

clean-corpus-n.perl raw de en clean 1 50

takes the corpus files raw.de and raw.en, deletes lines longer than 50 words, and creates the output files clean.de and clean.en.
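Applied to the corpus from the example above, a possible call (the MIN and MAX values are only illustrative) would be:

clean-corpus-n.perl corpus/euro de en corpus/euro.clean 1 50

This reads corpus/euro.de and corpus/euro.en and writes the cleaned sentence pairs to corpus/euro.clean.de and corpus/euro.clean.en.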