The parallel corpus has to be converted into a format suitable for the GIZA++ toolkit: two vocabulary files are generated, and the parallel corpus is converted into a numberized format.
The vocabulary files contain words, integer word identifiers and word count information:
==> corpus/de.vcb <==
1 UNK 0
2 , 928579
3 . 723187
4 die 581109
5 der 491791
6 und 337166
7 in 230047
8 zu 176868
9 den 168228
10 ich 162745

==> corpus/en.vcb <==
1 UNK 0
2 the 1085527
3 . 714984
4 , 659491
5 of 488315
6 to 481484
7 and 352900
8 in 330156
9 is 278405
10 that 262619
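As the files above show, id 1 is reserved for UNK with count 0, and the remaining words are numbered in order of decreasing frequency. A minimal sketch of how such a vocabulary could be built (build_vcb is a hypothetical helper for illustration, not part of the toolkit):

```python
from collections import Counter

def build_vcb(tokenized_sentences):
    """Build a GIZA++-style vocabulary mapping word -> (id, count).

    Id 1 is reserved for UNK with count 0; real words receive
    ids from 2 upward in order of decreasing frequency.
    """
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vcb = {"UNK": (1, 0)}
    for word_id, (word, count) in enumerate(counts.most_common(), start=2):
        vcb[word] = (word_id, count)
    return vcb

# Tiny toy corpus standing in for the real training data.
sentences = [["the", "house", "is", "small"],
             ["the", "house", "is", "old"]]
for word, (word_id, count) in build_vcb(sentences).items():
    print(word_id, word, count)
```

Writing one `id word count` line per entry then reproduces the layout of the .vcb files.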
The sentence-aligned corpus now looks like this:
> head -9 corpus/en-de-int-train.snt
1
3469 5 2049
4107 5 2 1399
1
10 3214 4 116 2007 2 9 5254 1151 985 6447 2049 21 44 141 14 2580 3
14 2213 1866 2 1399 5 2 29 46 3256 18 1969 4 2363 1239 1111 3
1
7179
306
A sentence pair now consists of three lines: first the frequency of the sentence pair, which in our training process is always 1 (this number can be used to weight different parts of the training corpus differently); then the word ids of the foreign sentence; and finally the word ids of the English sentence. In the English sequence 4107 5 2 1399 we can recognize of (5) and the (2).
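The three-line record described above is straightforward to generate from the vocabularies. A sketch, assuming vocabularies are held as word-to-id dictionaries (numberize is a hypothetical helper, not a toolkit function):

```python
def numberize(pair, de_vcb, en_vcb, freq=1):
    """Turn one sentence pair into a three-line GIZA++-style .snt record:
    frequency, foreign (German) word ids, English word ids.
    Words missing from a vocabulary map to id 1 (UNK).
    """
    de_ids = " ".join(str(de_vcb.get(tok, 1)) for tok in pair[0])
    en_ids = " ".join(str(en_vcb.get(tok, 1)) for tok in pair[1])
    return f"{freq}\n{de_ids}\n{en_ids}"

# Toy vocabularies for illustration only; ids do not match the real files.
de_vcb = {"das": 2, "Haus": 3, "ist": 4, "klein": 5}
en_vcb = {"the": 2, "house": 3, "is": 4, "small": 5}
print(numberize((["das", "Haus", "ist", "klein"],
                 ["the", "house", "is", "small"]), de_vcb, en_vcb))
```

Concatenating such records for every sentence pair yields a file in the layout shown above.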
GIZA++ also requires words to be placed into word classes. This is done automatically by calling the mkcls program. Word classes are only used for the IBM reordering model in GIZA++. A peek into the foreign word class file:
> head corpus/de.vcb.classes
! 14
" 14
# 30
% 31
& 10
' 14
( 10
) 14
+ 31
, 11
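Each line of the class file pairs a word with the id of the class mkcls assigned it. Reading such a file back into a word-to-class mapping could look like this (read_classes is a hypothetical helper for illustration):

```python
def read_classes(lines):
    """Parse mkcls-style output: each line holds a word and a class id,
    separated by whitespace. Returns a dict mapping word -> class id."""
    classes = {}
    for line in lines:
        # Split on the last whitespace so the word token is kept intact.
        word, cls = line.rsplit(None, 1)
        classes[word] = int(cls)
    return classes

# Sample lines taken from the head output above.
sample = ['! 14', '" 14', '# 30', '% 31']
print(read_classes(sample))
```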