Reference: All Training Parameters

Basic Options

A number of parameters are required to point the training script to the correct training data. We will describe them in this section. Other options allow for partial training runs and alternative settings.

As mentioned before, you want to create a special directory for training. The path to that directory has to be specified with the parameter --root-dir.

The root directory has to contain a subdirectory (called corpus) that holds the training data. The training data is a parallel corpus, stored in two files, one for the English sentences, one for the foreign sentences. The corpus has to be sentence-aligned, meaning that the 1624th line in the English file is the translation of the 1624th line in the foreign file.

Typically, the data is lowercased. No empty lines are allowed, and multiple spaces between words may cause problems. Also, sentence length is limited to 100 words per sentence, and the sentence length ratio for a sentence pair can be at most 9 (i.e., a 10-word sentence aligned to a 1-word sentence is disallowed). These restrictions on sentence length are imposed by GIZA++ and may be changed (see below).
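
For illustration, here is what two corresponding lines of such a corpus might look like (the sentence contents below are made up for this example; only the format matters):

 > sed -n '1624p' corpus/euro.en
 i declare resumed the session of the european parliament
 > sed -n '1624p' corpus/euro.de
 ich erkläre die sitzungsperiode des europäischen parlaments für wiederaufgenommen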

The two corpus files have a common file stem (say, euro) and extensions indicating the language (say, en and de). The file stem (--corpus) and the language extensions (--e and --f) have to be specified to the training script.

In summary, the training script may be invoked as follows:

 train-model.perl --root-dir . --f de --e en --corpus corpus/euro >& LOG

After training, typically the following files can be found in the root directory (note the time stamps, which give you an idea of how much time was spent on each step for this data):

 > ls -lh *
 -rw-rw-r--    1 koehn    user         110K Jul 13 21:49 LOG

 corpus:
 total 399M
 -rw-rw-r--    1 koehn    user         104M Jul 12 19:58 de-en-int-train.snt
 -rw-rw-r--    1 koehn    user         4.2M Jul 12 19:56 de.vcb
 -rw-rw-r--    1 koehn    user         3.2M Jul 12 19:42 de.vcb.classes
 -rw-rw-r--    1 koehn    user         2.6M Jul 12 19:42 de.vcb.classes.cats
 -rw-rw-r--    1 koehn    user         104M Jul 12 19:59 en-de-int-train.snt
 -rw-rw-r--    1 koehn    user         1.1M Jul 12 19:56 en.vcb
 -rw-rw-r--    1 koehn    user         793K Jul 12 19:56 en.vcb.classes
 -rw-rw-r--    1 koehn    user         614K Jul 12 19:56 en.vcb.classes.cats
 -rw-rw-r--    1 koehn    user          94M Jul 12 18:08 euro.de
 -rw-rw-r--    1 koehn    user          84M Jul 12 18:08 euro.en

 giza.de-en:
 total 422M
 -rw-rw-r--    1 koehn    user         107M Jul 13 03:57 de-en.A3.final.gz
 -rw-rw-r--    1 koehn    user         314M Jul 12 20:11 de-en.cooc
 -rw-rw-r--    1 koehn    user         2.0K Jul 12 20:11 de-en.gizacfg

 giza.en-de:
 total 421M
 -rw-rw-r--    1 koehn    user         107M Jul 13 11:03 en-de.A3.final.gz
 -rw-rw-r--    1 koehn    user         313M Jul 13 04:07 en-de.cooc
 -rw-rw-r--    1 koehn    user         2.0K Jul 13 04:07 en-de.gizacfg

 model:
 total 2.1G
 -rw-rw-r--    1 koehn    user          94M Jul 13 19:59 aligned.de
 -rw-rw-r--    1 koehn    user          84M Jul 13 19:59 aligned.en
 -rw-rw-r--    1 koehn    user          90M Jul 13 19:59 aligned.grow-diag-final
 -rw-rw-r--    1 koehn    user         214M Jul 13 20:33 extract.gz
 -rw-rw-r--    1 koehn    user         212M Jul 13 20:35 extract.inv.gz
 -rw-rw-r--    1 koehn    user          78M Jul 13 20:23 lex.f2n
 -rw-rw-r--    1 koehn    user          78M Jul 13 20:23 lex.n2f
 -rw-rw-r--    1 koehn    user          862 Jul 13 21:49 pharaoh.ini
 -rw-rw-r--    1 koehn    user         1.2G Jul 13 21:49 phrase-table

Factored Translation Model Settings

More on factored translation models in the Overview.

Lexicalized Reordering Model

More on lexicalized reordering in the description of Training step 7: build reordering model.

Partial Training

You may have better ideas how to do word alignment, extract phrases, or score phrases. Since the training is modular, you can start training at any of the nine training steps (--first-step) and end it at any subsequent step (--last-step).

Again, the nine training steps are:

  1. Prepare data
  2. Run GIZA++
  3. Align words
  4. Get lexical translation table
  5. Extract phrases
  6. Score phrases
  7. Build reordering model
  8. Build generation models
  9. Create configuration file

For instance, if you have your own method to generate a word alignment, you may want to skip the first three training steps and start with the generation of the lexical translation table. You can specify this by

 train-model.perl [...] --first-step 4
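
The two switches can also be combined. For instance, to rerun only phrase extraction and scoring (steps 5 and 6), say after modifying the word alignment by hand, you may run:

 train-model.perl [...] --first-step 5 --last-step 6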

File Locations

A number of parameters allow you to break out of the rigid file name conventions of the training script. A typical use for this is when you want to try alternative training runs without repeating all the training steps.

For instance, you may want to try an alternative alignment heuristic. There is no need to rerun GIZA++. You could copy the necessary files from the corpus and the giza.* directories into a new root directory, but this takes up a lot of additional disk space and makes the file organization unnecessarily complicated.

Since you only need a new model directory, you can specify this with the parameter --model-dir, and stay within the previous root directory structure:

 train-model.perl [...] --first-step 3 --alignment union --model-dir model-union

The other parameters for file and directory names fulfill similar purposes.
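
For instance (a sketch, assuming the switches --giza-e2f and --giza-f2e are available in your version of the training script; the paths below are placeholders), a run that starts at word alignment could be pointed at the GIZA++ output of an earlier run:

 train-model.perl [...] --first-step 3 --giza-f2e /path/to/old-run/giza.de-en --giza-e2f /path/to/old-run/giza.en-de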

Alignment Heuristic

A number of different word alignment heuristics are implemented and can be specified with the parameter --alignment.

Different heuristics may show better performance for a specific language pair or corpus, so some experimentation may be useful.
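
For instance, the grow-diag-final heuristic (the one that produced the aligned.grow-diag-final file in the listing above) would be selected with:

 train-model.perl [...] --alignment grow-diag-final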

Maximum Phrase Length

By default, the maximum length of phrases is limited to 7 words. The maximum phrase length impacts the size of the phrase translation table, so shorter limits may be desirable if phrase table size is an issue. Previous experiments have shown that performance increases only slightly when including phrases of more than 3 words.
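
As a sketch, assuming your version of the training script supports the switch --max-phrase-length, a lower limit of 4 words could be set with:

 train-model.perl [...] --max-phrase-length 4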

GIZA++ Options

GIZA++ takes a lot of parameters to specify the behavior of the training process and limits on sentence length, etc. Please refer to the corresponding documentation for details on this.

Parameters can be passed on to GIZA++ with the switch --giza-option.

For instance, if you want to change the number of iterations for the different IBM models to 4 iterations of Model 1, 0 iterations of Model 2, 4 iterations of the HMM Model, 0 iterations of Model 3, and 3 iterations of Model 4, you can specify this by

 train-model.perl [...] --giza-option m1=4,m2=0,mh=4,m3=0,m4=3

Dealing with Large Training Corpora

Training on large training corpora may become a problem for the GIZA++ word alignment tool. Since it stores the word translation table in memory, the size of this table may become too large for the available RAM of the machine. For instance, the data sets for the NIST Arabic-English and Chinese-English competitions require more than 4 GB of RAM, which is a problem for current 32-bit machines.

This problem can be remedied to some degree by a more efficient data structure in GIZA++, which requires running snt2cooc in advance on parts of the corpus and merging the resulting output. All you need to know is that running the training script with the option --parts n (e.g., --parts 3) may allow you to train on a corpus that was too large for a regular run.
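
For example, to prepare the co-occurrence data in three parts:

 train-model.perl [...] --parts 3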

Somewhat related to the problems caused by large training corpora is the long run time of GIZA++. It is possible to run the two GIZA++ runs separately on two machines with the switch --direction. When one run is started on one machine with --direction 1 and the other on a different machine or CPU with --direction 2, the processing time for training step 2 can be cut roughly in half.
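
For example (the LOG file names are arbitrary), the two directions could be started as:

 train-model.perl [...] --direction 1 >& LOG.direction1
 train-model.perl [...] --direction 2 >& LOG.direction2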
