Training Step 2: Run GIZA++

GIZA++ is a freely available implementation of the IBM models. We need it as an initial step to establish word alignments. Our word alignments are taken from the intersection of bidirectional GIZA++ runs, plus some additional alignment points from the union of the two runs.
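
The symmetrization itself is applied in a later training step, and the heuristic is chosen when the training script is invoked; the intersection-plus-union-points behaviour described here corresponds to the grow-diag-final family of heuristics. The following is only a sketch: the script name train-model.perl and all corpus and language-model paths are assumptions, not something prescribed by this page.

 # sketch: select the symmetrization heuristic for the later word-alignment step
 # (train-model.perl, corpus names, and the LM path are placeholders for your setup)
 > train-model.perl --root-dir . --corpus corpus/euro --f de --e en \
     --alignment grow-diag-final --lm 0:3:lm/europarl.lm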

Running GIZA++ is the most time-consuming step in the training process. It also requires a lot of memory (1-2 GB of RAM is common for large parallel corpora).

GIZA++ learns the translation tables of IBM Model 4, but we are only interested in the word alignment file:

 > zcat giza.de-en/de-en.A3.final.gz | head -9
 # Sentence pair (1) source length 4 target length 3 alignment score : 0.00643931
 wiederaufnahme der sitzungsperiode 
 NULL ({ }) resumption ({ 1 }) of ({ }) the ({ 2 }) session ({ 3 }) 
 # Sentence pair (2) source length 17 target length 18 alignment score : 1.74092e-26
 ich erklaere die am donnerstag , den 28. maerz 1996 unterbrochene sitzungsperiode
   des europaeischen parlaments fuer wiederaufgenommen . 
 NULL ({ 7 }) i ({ 1 }) declare ({ 2 }) resumed ({ }) the ({ 3 }) session ({ 12 }) 
   of ({ 13 }) the ({ }) european ({ 14 }) parliament ({ 15 }) 
   adjourned ({ 11 16 17 }) on ({ }) thursday ({ 4 5 }) , ({ 6 }) 28 ({ 8 }) 
   march ({ 9 }) 1996 ({ 10 }) . ({ 18 }) 
 # Sentence pair (3) source length 1 target length 1 alignment score : 0.012128
 begruessung 
 NULL ({ }) welcome ({ 1 }) 

In this file, after some statistical information and the foreign sentence, the English sentence is listed word by word, with references to the aligned foreign words: the first word resumption ({ 1 }) is aligned to the first German word wiederaufnahme, the second word of ({ }) is unaligned, and so on.
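
Since each sentence pair occupies exactly three lines (the score header, the foreign sentence, and the aligned English sentence), the alignment lines alone can be pulled out with a quick one-liner. This is only an inspection sketch, not part of the training pipeline:

 # print the alignment line (every third line) of the first three sentence pairs
 > zcat giza.de-en/de-en.A3.final.gz | awk 'NR % 3 == 0' | head -3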

Note that each English word may be aligned to multiple foreign words, but each foreign word may be aligned to at most one English word. This one-to-many restriction is reversed in the inverse GIZA++ training run:

 > zcat giza.en-de/en-de.A3.final.gz | head -9
 # Sentence pair (1) source length 3 target length 4 alignment score : 0.000985823
 resumption of the session 
 NULL ({ }) wiederaufnahme ({ 1 2 }) der ({ 3 }) sitzungsperiode ({ 4 }) 
 # Sentence pair (2) source length 18 target length 17 alignment score : 6.04498e-19
 i declare resumed the session of the european parliament adjourned on thursday ,
   28 march 1996 . 
 NULL ({ }) ich ({ 1 }) erklaere ({ 2 10 }) die ({ 4 }) am ({ 11 }) 
   donnerstag ({ 12 }) , ({ 13 }) den ({ }) 28. ({ 14 }) maerz ({ 15 }) 
   1996 ({ 16 }) unterbrochene ({ 3 }) sitzungsperiode ({ 5 }) des ({ 6 7 }) 
   europaeischen ({ 8 }) parlaments ({ 9 }) fuer ({ }) wiederaufgenommen ({ })
   . ({ 17 }) 
 # Sentence pair (3) source length 1 target length 1 alignment score : 0.706027
 welcome 
 NULL ({ }) begruessung ({ 1 }) 

Training on really large corpora

GIZA++ is not only the slowest part of the training, it is also the most critical in terms of memory requirements. To better cope with these memory requirements, the preparation step, which involves an additional program called snt2cooc, can be run on parts of the data.

For practical purposes, all you need to know is that the switch --parts n may allow training on large corpora that would not be feasible otherwise (a typical value for n is 3).
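
For instance, assuming the standard train-model.perl training script (the corpus and language-model paths below are placeholders), the switch is simply appended to the usual training command:

 # prepare GIZA++'s co-occurrence data in 3 parts to reduce peak memory use
 # (train-model.perl and all paths are placeholders for your own setup)
 > train-model.perl --root-dir . --corpus corpus/euro --f de --e en \
     --lm 0:3:lm/europarl.lm --parts 3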

This is currently not a problem for Europarl training, but is necessary for large Arabic and Chinese training runs.

Training in parallel

Using the --parallel option will fork the script and run the two directions of GIZA++ as independent processes. This is the best choice on a multi-processor machine.
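
As a sketch (again assuming the standard train-model.perl training script and placeholder corpus and language-model paths), the usual invocation simply gains the switch:

 # run both GIZA++ directions as concurrent child processes of the training script
 # (script name and all paths are placeholders for your own setup)
 > train-model.perl --root-dir . --corpus corpus/euro --f de --e en \
     --lm 0:3:lm/europarl.lm --parallel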

If you have only single-processor machines and still wish to run the two GIZA++ processes in parallel, use the following (rather obsolete) trick. Support for this is not fully user-friendly; some manual involvement is essential.

  • First you start training the usual way, with the additional switches --last-step 2 --direction 1. This runs the data preparation and one direction of the GIZA++ training.
  • Once the GIZA++ step has started, start a second training run with the switches --first-step 2 --direction 2. This runs the second GIZA++ run in parallel and then continues with the rest of the model training. (Beware of race conditions: the second GIZA++ run might finish earlier than the first one, so training step 3 might start too early!) Both invocations are sketched below.
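
To make the trick concrete, the two invocations might look roughly as follows. This is only a sketch: the training script name train-model.perl, the corpus files, and the language-model path are assumptions, not something prescribed by this page.

 # run 1: data preparation plus one GIZA++ direction, then stop after step 2
 # (train-model.perl, corpus names, and the LM path are placeholders)
 > train-model.perl --root-dir . --corpus corpus/euro --f de --e en \
     --lm 0:3:lm/europarl.lm --last-step 2 --direction 1
 # run 2, started once run 1 is inside its GIZA++ step: the other direction,
 # followed by the remaining training steps
 > train-model.perl --root-dir . --corpus corpus/euro --f de --e en \
     --lm 0:3:lm/europarl.lm --first-step 2 --direction 2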