To establish word alignments based on the two GIZA++ alignments, a number of heuristics may be applied.
The default heuristic grow-diag-final starts with the intersection of the two alignments and then adds additional alignment points.
Other possible alignment methods:
intersection
grow (only add block-neighboring points)
grow-diag (without final step)
union
srctotgt (only consider word-to-word alignments from the source-target GIZA++ alignment file)
tgttosrc (only consider word-to-word alignments from the target-source GIZA++ alignment file)
Alternative alignment methods can be specified with the switch --alignment.
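The four non-growing methods can be read as plain set operations on alignment points. The following Python sketch is only an illustration, not the toolkit's code: the function name combine, the parameter names src_tgt and tgt_src, and the toy points are assumptions made for this example, and each alignment is assumed to be a set of position pairs written in the same order.

```python
def combine(src_tgt, tgt_src, method):
    """Combine two directional alignments (sets of position pairs).

    Hypothetical helper: src_tgt and tgt_src stand for the points read from
    the source-target and target-source GIZA++ alignment files.
    """
    if method == "intersection":
        return src_tgt & tgt_src      # points both directions agree on
    if method == "union":
        return src_tgt | tgt_src      # points found in either direction
    if method == "srctotgt":
        return set(src_tgt)           # one direction only
    if method == "tgttosrc":
        return set(tgt_src)           # the other direction only
    raise ValueError("not a simple combination method: " + method)


# Toy data, invented for illustration.
src_tgt = {(0, 0), (1, 1), (2, 1)}
tgt_src = {(0, 0), (1, 1), (1, 2)}

for name in ("intersection", "union", "srctotgt", "tgttosrc"):
    print(name, sorted(combine(src_tgt, tgt_src, name)))
```

The grow and grow-diag methods are not covered here because they need the iterative neighbor search of the default heuristic.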
Here is the pseudo code for the default heuristic:
GROW-DIAG-FINAL(e2f,f2e):
  neighboring = ((-1,0),(0,-1),(1,0),(0,1),(-1,-1),(-1,1),(1,-1),(1,1))
  alignment = intersect(e2f,f2e);
  GROW-DIAG(); FINAL(e2f); FINAL(f2e);

GROW-DIAG():
  iterate until no new points added
    for english word e = 0 ... en
      for foreign word f = 0 ... fn
        if ( e aligned with f )
          for each neighboring point ( e-new, f-new ):
            if ( ( e-new not aligned or f-new not aligned ) and
                 ( e-new, f-new ) in union( e2f, f2e ) )
              add alignment point ( e-new, f-new )

FINAL(a):
  for english word e-new = 0 ... en
    for foreign word f-new = 0 ... fn
      if ( ( e-new not aligned or f-new not aligned ) and
           ( e-new, f-new ) in alignment a )
        add alignment point ( e-new, f-new )
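The pseudo code can be turned into a short runnable Python sketch. This is a direct transcription for illustration, not the toolkit's actual implementation. It assumes that e2f and f2e are sets of (e, f) position pairs, that en and fn are the numbers of English and foreign words (so positions run from 0 to en-1 and 0 to fn-1), and the function name grow_diag_final and the toy inputs are made up for this example.

```python
NEIGHBORING = [(-1, 0), (0, -1), (1, 0), (0, 1),
               (-1, -1), (-1, 1), (1, -1), (1, 1)]


def grow_diag_final(e2f, f2e, en, fn):
    """Symmetrize two directional alignments given as sets of (e, f) pairs."""
    alignment = e2f & f2e                    # start from the intersection
    union = e2f | f2e
    aligned_e = {e for e, _ in alignment}    # English positions already aligned
    aligned_f = {f for _, f in alignment}    # foreign positions already aligned

    def try_add(e_new, f_new, candidates):
        # Add the point if at least one of its words is still unaligned
        # and the point is licensed by the candidate set.
        if ((e_new not in aligned_e or f_new not in aligned_f)
                and (e_new, f_new) in candidates):
            alignment.add((e_new, f_new))
            aligned_e.add(e_new)
            aligned_f.add(f_new)
            return True
        return False

    # GROW-DIAG: repeat until a full pass adds no new point.
    added = True
    while added:
        added = False
        for e in range(en):
            for f in range(fn):
                if (e, f) not in alignment:
                    continue
                for de, df in NEIGHBORING:
                    if try_add(e + de, f + df, union):
                        added = True

    # FINAL(e2f), then FINAL(f2e): non-neighboring points are allowed here.
    for directional in (e2f, f2e):
        for e_new in range(en):
            for f_new in range(fn):
                try_add(e_new, f_new, directional)

    return alignment


# Toy input, invented for illustration: (0, 2) is only in one direction
# and both of its words are already aligned, so it is never added.
e2f = {(0, 0), (1, 1), (2, 2), (0, 2)}
f2e = {(0, 0), (1, 1), (2, 2)}
print(sorted(grow_diag_final(e2f, f2e, en=3, fn=3)))
```

Out-of-range neighbors need no special handling in this sketch, because a point outside the sentence can never be in the union.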
To illustrate this heuristic, see the example in the figure below. It starts with the intersection of the two alignments for the second sentence in the corpus above and then adds some additional alignment points that lie in the union of the two alignments.
This alignment has a blatant error: the alignment of the two verbs is mixed up. resumed is aligned to unterbrochene, and adjourned is aligned to wiederaufgenommen, but it should be the other way around.
To conclude this section, a quick look into the files generated by the word alignment process:
==> model/aligned.de <==
wiederaufnahme der sitzungsperiode
ich erklaere die am donnerstag , den 28. maerz 1996 unterbrochene sitzungsperiode des europaeischen parlaments fuer wiederaufgenommen .
begruessung

==> model/aligned.en <==
resumption of the session
i declare resumed the session of the european parliament adjourned on thursday , 28 march 1996 .
welcome

==> model/aligned.grow-diag-final <==
0-0 0-1 1-2 2-3
0-0 1-1 2-3 3-10 3-11 4-11 5-12 7-13 8-14 9-15 10-2 11-4 12-5 12-6 13-7 14-8 15-9 16-9 17-16
0-0
The third file contains the alignment information, one line per sentence pair. Each alignment point is given as the zero-based position of the foreign word and the zero-based position of the English word, joined by a hyphen.
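To make the format concrete, the following Python sketch reads one line of alignment points and prints the word pairs it links, using the first sentence pair from the listing above. The helper name parse_alignment is an assumption made for this example and is not part of the toolkit.

```python
def parse_alignment(line):
    """Parse a line such as '0-0 0-1 1-2' into (foreign_pos, english_pos) pairs."""
    points = []
    for point in line.split():
        f_pos, e_pos = point.split("-")
        points.append((int(f_pos), int(e_pos)))
    return points


# First sentence pair and its alignment line, copied from the files above.
foreign = "wiederaufnahme der sitzungsperiode".split()
english = "resumption of the session".split()
points = parse_alignment("0-0 0-1 1-2 2-3")

# Look up the words behind each pair of positions.
pairs = [(foreign[f], english[e]) for f, e in points]
for f_word, e_word in pairs:
    print(f_word, "<->", e_word)
```

Applied to the second sentence pair, the same lookup shows the verb mix-up described above: point 10-2 links unterbrochene to resumed, and point 16-9 links wiederaufgenommen to adjourned.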