GIZA++ is a freely available implementation of the IBM models. We need it as an initial step to establish word alignments. Our word alignments are taken from the intersection of the two bidirectional GIZA++ runs, plus some additional alignment points from their union.
Running GIZA++ is the most time-consuming step in the training process. It also requires a lot of memory (1-2 GB of RAM is common for large parallel corpora).
GIZA++ learns the translation tables of IBM Model 4, but we are only interested in the word alignment file:
> zcat giza.de-en/de-en.A3.final.gz | head -9
# Sentence pair (1) source length 4 target length 3 alignment score : 0.00643931
wiederaufnahme der sitzungsperiode
NULL ({ }) resumption ({ 1 }) of ({ }) the ({ 2 }) session ({ 3 })
# Sentence pair (2) source length 17 target length 18 alignment score : 1.74092e-26
ich erklaere die am donnerstag , den 28. maerz 1996 unterbrochene sitzungsperiode des europaeischen parlaments fuer wiederaufgenommen .
NULL ({ 7 }) i ({ 1 }) declare ({ 2 }) resumed ({ }) the ({ 3 }) session ({ 12 }) of ({ 13 }) the ({ }) european ({ 14 }) parliament ({ 15 }) adjourned ({ 11 16 17 }) on ({ }) thursday ({ 4 5 }) , ({ 6 }) 28 ({ 8 }) march ({ 9 }) 1996 ({ 10 }) . ({ 18 })
# Sentence pair (3) source length 1 target length 1 alignment score : 0.012128
begruessung
NULL ({ }) welcome ({ 1 })
In this file, after some statistical information and the foreign sentence, the English sentence is listed word by word, with references to the aligned foreign words: the first word resumption ({ 1 }) is aligned to the first German word wiederaufnahme, the second word of ({ }) is unaligned, and so on.
Note that each English word may be aligned to multiple foreign words, but each foreign word may be aligned to at most one English word. This one-to-many restriction is reversed in the inverse GIZA++ training run:
> zcat giza.en-de/en-de.A3.final.gz | head -9
# Sentence pair (1) source length 3 target length 4 alignment score : 0.000985823
resumption of the session
NULL ({ }) wiederaufnahme ({ 1 2 }) der ({ 3 }) sitzungsperiode ({ 4 })
# Sentence pair (2) source length 18 target length 17 alignment score : 6.04498e-19
i declare resumed the session of the european parliament adjourned on thursday , 28 march 1996 .
NULL ({ }) ich ({ 1 }) erklaere ({ 2 10 }) die ({ 4 }) am ({ 11 }) donnerstag ({ 12 }) , ({ 13 }) den ({ }) 28. ({ 14 }) maerz ({ 15 }) 1996 ({ 16 }) unterbrochene ({ 3 }) sitzungsperiode ({ 5 }) des ({ 6 7 }) europaeischen ({ 8 }) parlaments ({ 9 }) fuer ({ }) wiederaufgenommen ({ }) . ({ 17 })
# Sentence pair (3) source length 1 target length 1 alignment score : 0.706027
welcome
NULL ({ }) begruessung ({ 1 })
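Since the A3 files are plain text once decompressed, they are easy to inspect with standard tools. As a quick (illustrative) sanity check, the following commands count the sentence pairs in each direction, which should agree, and print one particular pair together with its two following lines:

> zcat giza.de-en/de-en.A3.final.gz | grep -c '^# Sentence pair'
> zcat giza.en-de/en-de.A3.final.gz | grep -c '^# Sentence pair'
> zcat giza.de-en/de-en.A3.final.gz | grep -A 2 '^# Sentence pair (2) '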
GIZA++ is not only the slowest part of the training, it is also the most critical in terms of memory requirements. To better cope with the memory requirements, the data preparation step, which involves an additional program called snt2cooc, can be run on parts of the data.
For practical purposes, all you need to know is that the switch --parts n may allow training on large corpora that would not be feasible otherwise (a typical value for n is 3).
This is currently not a problem for Europarl training, but is necessary for large Arabic and Chinese training runs.
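For instance, assuming the training is driven by the standard Moses train-model.perl wrapper (the script itself is not shown above) and using illustrative corpus paths and language suffixes, the switch is simply added to the usual invocation:

# sketch: corpus path and language suffixes are placeholders; add your usual remaining training options
> train-model.perl --root-dir . --corpus corpus/euro --f de --e en --parts 3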
Using the --parallel option will fork the script and run the two directions of GIZA++ as independent processes. This is the best choice on a multi-processor machine.
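With the same illustrative invocation as above, this looks like:

> train-model.perl --root-dir . --corpus corpus/euro --f de --e en --parallel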
If you have only single-processor machines and still wish to run the two GIZA++ processes in parallel, use the following (rather obsolete) trick. Support for this is not fully user-friendly; some manual involvement is essential.
Run the training script first with the options --last-step 2 --direction 1, which runs the data preparation and one direction of GIZA++ training, and then a second time with --first-step 2 --direction 2. This runs the second GIZA++ run in parallel and then continues with the rest of the model training. (Beware of race conditions! The second GIZA++ run might finish earlier than the first one, so training step 3 might start too early!)
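Put together, and again assuming the illustrative train-model.perl invocation used above, the trick amounts to two separate invocations, e.g. on two machines or in two shells that share the same working directory:

# first machine/process: data preparation plus GIZA++ in one direction
> train-model.perl --root-dir . --corpus corpus/euro --f de --e en --last-step 2 --direction 1
# second machine/process: GIZA++ in the other direction, then the remaining steps;
# make sure it does not reach step 3 before the first GIZA++ run has finished
> train-model.perl --root-dir . --corpus corpus/euro --f de --e en --first-step 2 --direction 2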