Training: Overview

We will start with an overview of the training process. This should give a feel for what is going on and what files are produced. The following sections go into more detail on the options of the training process and on additional tools.

The training process takes place in nine steps, all of them executed by the script

 train-model.perl

The nine steps are

Prepare data (45 minutes)
Run GIZA++ (16 hours)
Align words (2:30 hours)
Get lexical translation table (30 minutes)
Extract phrases (10 minutes)
Score phrases (1:15 hours)
Build lexicalized reordering model (1 hour)
Build generation models
Create configuration file (1 second)

The run times mentioned in the steps refer to a recent training run on the 751'000 sentence, 16 million word German-English Europarl corpus, on a 3GHz Linux machine.
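
To put these run times in perspective, they can simply be added up. The sketch below is only arithmetic on the figures listed above; the variable names are illustrative, the step without a listed time (Build generation models) is left out, and the one-second configuration step is rounded down to zero minutes.

```python
# Run times in minutes, as listed above for the Europarl training run.
# "Build generation models" has no listed time and is omitted;
# "Create configuration file" (1 second) is rounded down to 0 minutes.
STEP_MINUTES = {
    "Prepare data": 45,
    "Run GIZA++": 16 * 60,
    "Align words": 2 * 60 + 30,
    "Get lexical translation table": 30,
    "Extract phrases": 10,
    "Score phrases": 60 + 15,
    "Build lexicalized reordering model": 60,
    "Create configuration file": 0,
}

total = sum(STEP_MINUTES.values())
hours, minutes = divmod(total, 60)
giza_share = STEP_MINUTES["Run GIZA++"] / total

print(f"total: {hours}h {minutes:02d}m")    # total: 22h 10m
print(f"GIZA++ share: {giza_share:.0%}")    # GIZA++ share: 72%
```

On these figures, running GIZA++ accounts for most of the total training time.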

Running the training script

For a standard phrase model, you will run the training script like this:

Alignment factors

It is usually better to carry out the word alignment (steps 2-3 of the training process) on more general word representations with rich statistics. Successful word alignment has even been reported with words stemmed to 4 characters. For factored models, this suggests that word alignment should be done only on either the surface form or the stem/lemma.

Which factors are used during word alignment is set with the --alignment-factors switch. Let us formally define the parameter syntax:

FACTOR = [ 0 - 9 ]+
FACTORLIST = FACTOR [ , FACTOR ]*
FACTORMAP = FACTORLIST - FACTORLIST

The switch requires a FACTORMAP as argument, for instance 0-0 (using only factor 0 from source and target language) or 0,1,2-0,1 (using factors 0, 1, and 2 from the source language and 0 and 1 from the target language).
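
The FACTORMAP grammar above can be checked mechanically. The following is a minimal sketch in Python, not part of the training script: the function name parse_factormap is hypothetical, and the regular expression is a direct transcription of the three grammar rules (the spaces in the rules are read as notation, not as literal characters).

```python
import re

# FACTOR     = [0-9]+
# FACTORLIST = FACTOR [ , FACTOR ]*
# FACTORMAP  = FACTORLIST - FACTORLIST
FACTORLIST = r"[0-9]+(?:,[0-9]+)*"
FACTORMAP_RE = re.compile(rf"({FACTORLIST})-({FACTORLIST})")


def parse_factormap(text):
    """Return (source_factors, target_factors) as two lists of ints.

    Hypothetical helper for illustration; raises ValueError if the
    string does not match the FACTORMAP grammar.
    """
    match = FACTORMAP_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid FACTORMAP: {text!r}")
    source, target = match.groups()
    return ([int(f) for f in source.split(",")],
            [int(f) for f in target.split(",")])


print(parse_factormap("0-0"))        # ([0], [0])
print(parse_factormap("0,1,2-0,1"))  # ([0, 1, 2], [0, 1])
```

The two calls correspond to the two examples given above: the first list holds the source-language factors, the second the target-language factors.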

Translation factors

The purpose of training factored translation models is to create one or more translation tables between subsets of the factors. All translation tables are trained from the same word alignment, and are specified with the switch --translation-factors.

To define the syntax, we have to extend our parameter syntax with

FACTORMAPSET = FACTORMAP[+FACTORMAP]*

since we want to specify multiple mappings.

One example is 0-0+1-1, which creates the two tables

 phrase-table.0-0.gz
 phrase-table.1-1.gz
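
The relation between a FACTORMAPSET and the resulting table files can be sketched as follows. This is an illustration only: the function names are hypothetical, and the naming pattern phrase-table.FACTORMAP.gz is inferred from the single example above rather than taken from the training script itself.

```python
import re

# Grammar from the text, transcribed as regular expressions.
FACTORLIST = r"[0-9]+(?:,[0-9]+)*"
FACTORMAP = rf"{FACTORLIST}-{FACTORLIST}"
FACTORMAPSET_RE = re.compile(rf"{FACTORMAP}(?:\+{FACTORMAP})*")


def split_factormapset(text):
    """Split a FACTORMAPSET into its FACTORMAP strings (hypothetical helper)."""
    if FACTORMAPSET_RE.fullmatch(text) is None:
        raise ValueError(f"not a valid FACTORMAPSET: {text!r}")
    return text.split("+")


def phrase_table_names(text):
    """List one table file name per mapping.

    Assumption: the name pattern is generalized from the 0-0+1-1 example.
    """
    return [f"phrase-table.{m}.gz" for m in split_factormapset(text)]


print(phrase_table_names("0-0+1-1"))
# ['phrase-table.0-0.gz', 'phrase-table.1-1.gz']
```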

Reordering factors

Reordering tables can be trained with --reordering-factors, but this is currently not supported by any decoder. The syntax is the same as for translation factors.

Generation factors

Finally, we also want to create generation tables between target factors. Which tables to generate is specified with --generation-factors, which takes a FACTORMAPSET as a parameter. Note that this time the mapping is between target factors, not between source and target factors.

One example is 0-1, which creates a generation table between factors 0 and 1.
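
To summarize the four factor switches, the sketch below validates a value for each switch against the grammar rule the text assigns to it and assembles the corresponding command-line arguments. It is an illustration under stated assumptions: the function name factor_arguments is hypothetical, only the factor switches named in this section are covered, and the other arguments that train-model.perl needs are not shown.

```python
import re

# Grammar from the text, transcribed as regular expressions.
FACTORLIST = r"[0-9]+(?:,[0-9]+)*"
FACTORMAP = rf"{FACTORLIST}-{FACTORLIST}"
FACTORMAPSET = rf"{FACTORMAP}(?:\+{FACTORMAP})*"

# Each switch and the grammar rule its argument has to match.
SWITCH_SYNTAX = {
    "--alignment-factors": FACTORMAP,       # source-target, single mapping
    "--translation-factors": FACTORMAPSET,  # source-target, one or more
    "--reordering-factors": FACTORMAPSET,   # same syntax as translation
    "--generation-factors": FACTORMAPSET,   # target-target, one or more
}


def factor_arguments(settings):
    """Return a flat argument list for the given switch settings.

    Hypothetical helper: raises KeyError for a switch not described in
    this section and ValueError for a value that violates the grammar.
    """
    args = []
    for switch, value in settings.items():
        pattern = SWITCH_SYNTAX[switch]
        if re.fullmatch(pattern, value) is None:
            raise ValueError(f"invalid value for {switch}: {value!r}")
        args += [switch, value]
    return args


args = factor_arguments({
    "--alignment-factors": "0-0",
    "--translation-factors": "0-0+1-1",
    "--generation-factors": "0-1",
})
print(" ".join(args))
# --alignment-factors 0-0 --translation-factors 0-0+1-1 --generation-factors 0-1
```

Note that the check is purely syntactic: it cannot tell that the generation mapping 0-1 refers to two target factors, while the alignment mapping 0-0 refers to one source and one target factor.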
