Reference: All Training Parameters

Basic Options

A number of parameters are required to point the training script to the correct training data. We will describe them in this section. Other options allow for partial training runs and alternative settings.

As mentioned before, you want to create a special directory for training. The path to that directory has to be specified with the parameter --root-dir.

The root directory has to contain a subdirectory (called corpus) that holds the training data. The training data is a parallel corpus, stored in two files, one for the English sentences, one for the foreign sentences. The corpus has to be sentence-aligned, meaning that the 1624th line in the English file is the translation of the 1624th line in the foreign file.

Typically, the data is lowercased. No empty lines are allowed, and multiple spaces between words may cause problems. Also, sentence length is limited to 100 words per sentence, and the sentence length ratio for a sentence pair can be at most 9 (i.e., a 10-word sentence aligned to a 1-word sentence is disallowed). These restrictions on sentence length are imposed by GIZA++ and may be changed (see below).
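
For illustration, here is what two corresponding lines of such a corpus might look like (the sentence contents below are made up for this example; only the format matters):

 > sed -n '1624p' corpus/euro.en
 i declare resumed the session of the european parliament
 > sed -n '1624p' corpus/euro.de
 ich erkläre die sitzungsperiode des europäischen parlaments für wiederaufgenommen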

The two corpus files have a common file stem (say, euro) and extensions indicating the language (say, en and de). The file stem (--corpus) and the language extensions (--e and --f) have to be specified to the training script.

In summary, the training script may be invoked as follows:

 train-model.perl --root-dir . --f de --e en --corpus corpus/euro >& LOG

After training, typically the following files can be found in the root directory (note the time stamps, which give you an idea of how much time was spent on each step for this data):

 > ls -lh *
 -rw-rw-r--    1 koehn    user         110K Jul 13 21:49 LOG

 corpus:
 total 399M
 -rw-rw-r--    1 koehn    user         104M Jul 12 19:58 de-en-int-train.snt
 -rw-rw-r--    1 koehn    user         4.2M Jul 12 19:56 de.vcb
 -rw-rw-r--    1 koehn    user         3.2M Jul 12 19:42 de.vcb.classes
 -rw-rw-r--    1 koehn    user         2.6M Jul 12 19:42 de.vcb.classes.cats
 -rw-rw-r--    1 koehn    user         104M Jul 12 19:59 en-de-int-train.snt
 -rw-rw-r--    1 koehn    user         1.1M Jul 12 19:56 en.vcb
 -rw-rw-r--    1 koehn    user         793K Jul 12 19:56 en.vcb.classes
 -rw-rw-r--    1 koehn    user         614K Jul 12 19:56 en.vcb.classes.cats
 -rw-rw-r--    1 koehn    user          94M Jul 12 18:08 euro.de
 -rw-rw-r--    1 koehn    user          84M Jul 12 18:08 euro.en

 giza.de-en:
 total 422M
 -rw-rw-r--    1 koehn    user         107M Jul 13 03:57 de-en.A3.final.gz
 -rw-rw-r--    1 koehn    user         314M Jul 12 20:11 de-en.cooc
 -rw-rw-r--    1 koehn    user         2.0K Jul 12 20:11 de-en.gizacfg

 giza.en-de:
 total 421M
 -rw-rw-r--    1 koehn    user         107M Jul 13 11:03 en-de.A3.final.gz
 -rw-rw-r--    1 koehn    user         313M Jul 13 04:07 en-de.cooc
 -rw-rw-r--    1 koehn    user         2.0K Jul 13 04:07 en-de.gizacfg

 model:
 total 2.1G
 -rw-rw-r--    1 koehn    user          94M Jul 13 19:59 aligned.de
 -rw-rw-r--    1 koehn    user          84M Jul 13 19:59 aligned.en
 -rw-rw-r--    1 koehn    user          90M Jul 13 19:59 aligned.grow-diag-final
 -rw-rw-r--    1 koehn    user         214M Jul 13 20:33 extract.gz
 -rw-rw-r--    1 koehn    user         212M Jul 13 20:35 extract.inv.gz
 -rw-rw-r--    1 koehn    user          78M Jul 13 20:23 lex.f2n
 -rw-rw-r--    1 koehn    user          78M Jul 13 20:23 lex.n2f
 -rw-rw-r--    1 koehn    user          862 Jul 13 21:49 pharaoh.ini
 -rw-rw-r--    1 koehn    user         1.2G Jul 13 21:49 phrase-table

Factored Translation Model Settings

More on factored translation models in the Overview.

Lexicalized Reordering Model

More on lexicalized reordering in the description of Training step 7: build reordering model.

Partial Training

You may have better ideas how to do word alignment, extract phrases, or score phrases. Since the training is modular, you can start training at any of the nine training steps (--first-step) and end it at any subsequent step (--last-step).

Again, the nine training steps are:

  1. Prepare data
  2. Run GIZA++
  3. Align words
  4. Get lexical translation table
  5. Extract phrases
  6. Score phrases
  7. Build reordering model
  8. Build generation models
  9. Create configuration file

For instance, if you have your own method to generate a word alignment, you may want to skip the first three training steps and start with the generation of the lexical translation table. You can specify this by

 train-model.perl [...] --first-step 4
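
The two switches can also be combined. For instance, to rerun only phrase extraction and scoring (steps 5 and 6), say after modifying the word alignment by hand, you may run:

 train-model.perl [...] --first-step 5 --last-step 6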

File Locations

A number of parameters allow you to break out of the rigid file name conventions of the training script. A typical use for this is when you want to try alternative training runs without repeating all the training steps.

For instance, you may want to try an alternative alignment heuristic. There is no need to rerun GIZA++. You could copy the necessary files from the corpus and the giza.* directories into a new root directory, but this takes up a lot of additional disk space and makes the file organization unnecessarily complicated.

Since you only need a new model directory, you can specify this with the parameter --model-dir, and stay within the previous root directory structure:

 train-model.perl [...] --first-step 3 --alignment union --model-dir model-union

The other parameters for file and directory names fulfill similar purposes.
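
For instance (a sketch, assuming the switches --giza-e2f and --giza-f2e are available in your version of the training script; the paths below are placeholders), a run that starts at word alignment could be pointed at the GIZA++ output of an earlier run:

 train-model.perl [...] --first-step 3 --giza-f2e /path/to/old-run/giza.de-en --giza-e2f /path/to/old-run/giza.en-de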

Alignment Heuristic

A number of different word alignment heuristics are implemented and can be specified with the parameter --alignment.

Different heuristics may show better performance for a specific language pair or corpus, so some experimentation may be useful.
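
For instance, the grow-diag-final heuristic (the one that produced the aligned.grow-diag-final file in the listing above) would be selected with:

 train-model.perl [...] --alignment grow-diag-final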

Maximum Phrase Length

By default, the maximum length of phrases is limited to 7 words. The maximum phrase length impacts the size of the phrase translation table, so shorter limits may be desirable if phrase table size is an issue. Previous experiments have shown that performance increases only slightly when including phrases of more than 3 words.
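
As a sketch, assuming your version of the training script supports the switch --max-phrase-length, a lower limit of 4 words could be set with:

 train-model.perl [...] --max-phrase-length 4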

GIZA++ Options

GIZA++ takes a lot of parameters to specify the behavior of the training process and limits on sentence length, etc. Please refer to the corresponding documentation for details on this.

Parameters can be passed on to GIZA++ with the switch --giza-option.

For instance, if you want to change the number of iterations for the different IBM models to 4 iterations of Model 1, 0 iterations of Model 2, 4 iterations of the HMM Model, 0 iterations of Model 3, and 3 iterations of Model 4, you can specify this by

 train-model.perl [...] --giza-option m1=4,m2=0,mh=4,m3=0,m4=3

Dealing with Large Training Corpora

Training on large training corpora may become a problem for the GIZA++ word alignment tool. Since it stores the word translation table in memory, the size of this table may become too large for the available RAM of the machine. For instance, the data sets for the NIST Arabic-English and Chinese-English competitions require more than 4 GB of RAM, which is a problem for current 32-bit machines.

This problem can be remedied to some degree by a more efficient data structure in GIZA++, which requires running snt2cooc in advance on parts of the corpus and merging the resulting output. All you need to know is that running the training script with the option --parts n (e.g., --parts 3) may allow you to train on a corpus that was too large for a regular run.
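
For example, to prepare the co-occurrence data in three parts:

 train-model.perl [...] --parts 3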

Somewhat related to the problems caused by large training corpora is the long run time of GIZA++. It is possible to run the two GIZA++ runs separately on two machines with the switch --direction. When one run is started on one machine with --direction 1 and the other on a different machine or CPU with --direction 2, the processing time for training step 2 can be cut roughly in half.
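
For example (the LOG file names are arbitrary), the two directions could be started as:

 train-model.perl [...] --direction 1 >& LOG.direction1
 train-model.perl [...] --direction 2 >& LOG.direction2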
