We will start with an overview of the training process. This should give a feel for what is going on and what files are produced. In the following, we will go into more detail on the options of the training process and on additional tools.
The training process takes place in nine steps, all of them executed by the script
train-model.perl
The nine steps are
The run times mentioned in the steps refer to a recent training run on the 751,000-sentence, 16-million-word German-English Europarl corpus, on a 3GHz Linux machine.
For a standard phrase model, you will run the training script like this:
It is usually better to carry out the word alignment (steps 2-3 of the training process) on more general word representations with rich statistics. Even successful word alignment with words stemmed to 4 characters has been reported. For factored models, this suggests that word alignment should be done only on either the surface form or the stem/lemma.
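To make the idea of a coarser word representation concrete, here is a minimal sketch that reduces every token to its first four characters. It assumes that "stemmed to 4 characters" means plain prefix truncation; the function name truncate_tokens and the sample sentence are illustrative and not part of the training scripts.

```python
def truncate_tokens(sentence, length=4):
    """Cut every whitespace-separated token down to its first `length` characters."""
    # Assumption: "stemming" here is simple prefix truncation, not linguistic stemming.
    return " ".join(token[:length] for token in sentence.split())


print(truncate_tokens("wiederaufnahme der sitzungsperiode"))  # wied der sitz
```

Many distinct word forms collapse onto the same truncated token, which is what gives the alignment model richer statistics per token.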
Which factors are used during word alignment is set with the --alignment-factors
switch. Let us formally define the parameter syntax:
FACTOR = [ 0 - 9 ]+
FACTORLIST = FACTOR [ , FACTOR ]*
FACTORMAP = FACTORLIST - FACTORLIST
The switch requires a FACTORMAP as argument, for instance 0-0
(using only factor 0 from source and target language) or 0,1,2-0,1
(using factors 0, 1, and 2 from the source language and 0 and 1 from the target language).
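The two examples above can be checked mechanically. The following is a minimal sketch of a validator, reading a FACTORMAP as two comma-separated lists of factor indices joined by a single hyphen. The function parse_factor_map is a hypothetical helper written for illustration; it is not taken from train-model.perl.

```python
import re

# Comma-separated digit groups on each side of a single hyphen.
FACTORMAP_RE = re.compile(r"(\d+(?:,\d+)*)-(\d+(?:,\d+)*)")


def parse_factor_map(spec):
    """Split a FACTORMAP string such as '0,1,2-0,1' into two lists of ints."""
    match = FACTORMAP_RE.fullmatch(spec)
    if match is None:
        raise ValueError(f"not a valid FACTORMAP: {spec!r}")
    source, target = match.groups()
    return ([int(f) for f in source.split(",")],
            [int(f) for f in target.split(",")])


print(parse_factor_map("0-0"))        # ([0], [0])
print(parse_factor_map("0,1,2-0,1"))  # ([0, 1, 2], [0, 1])
```

The first list holds the source-language factors and the second the target-language factors.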
The purpose of training a factored translation model is to create one or more translation tables between subsets of the factors. All translation tables are trained from the same word alignment and are specified with the switch --translation-factors.
To define the syntax, we have to extend our parameter syntax with
FACTORMAPSET = FACTORMAP [ + FACTORMAP ]*
since we want to specify multiple mappings.
One example is 0-0+1-1, which creates the two tables
phrase-table.0-0.gz
phrase-table.1-1.gz
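The relation between a set of mappings and the resulting table files can be sketched as follows. The naming pattern phrase-table.MAPPING.gz is extrapolated from the example above; that mappings with several factors per side follow the same pattern is an assumption, and phrase_table_names is a hypothetical helper, not part of the training script.

```python
import re

# A single mapping: comma-separated factor lists joined by a hyphen.
FACTORMAP = r"\d+(?:,\d+)*-\d+(?:,\d+)*"
# One or more mappings separated by '+'.
FACTORMAPSET_RE = re.compile(rf"{FACTORMAP}(?:\+{FACTORMAP})*")


def phrase_table_names(spec):
    """List one table file name per mapping in a '+'-separated mapping set."""
    if FACTORMAPSET_RE.fullmatch(spec) is None:
        raise ValueError(f"not a valid set of factor mappings: {spec!r}")
    # Assumed naming scheme, taken from the 0-0+1-1 example.
    return [f"phrase-table.{mapping}.gz" for mapping in spec.split("+")]


print(phrase_table_names("0-0+1-1"))
# ['phrase-table.0-0.gz', 'phrase-table.1-1.gz']
```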
Reordering tables can be trained with --reordering-factors, but this is currently not supported by any decoder. The syntax is the same as for translation factors.
Finally, we also want to create generation tables between target factors. Which tables to generate is specified with --generation-factors
, which takes a FACTORMAPSET as a parameter. Note that this time the mapping is between target factors, not between source and target factors.
One example is 0-1, which creates a generation table between factors 0 and 1.