GIZA++: Training of statistical translation models.
GIZA++ is an extension of the program GIZA (part of the SMT
toolkit EGYPT), which was
developed by the Statistical Machine Translation team during
the summer workshop in 1999 at the Center for Language and Speech
Processing at Johns Hopkins University (CLSP/JHU). GIZA++ includes
many additional features. The extensions in GIZA++ were designed and written by Franz Josef Och.
About GIZA++
The program includes the following extensions to GIZA:
- Model 4;
- Model 5;
- Alignment models depending on word classes (software for
producing word classes is available as a separate download);
- Implements the HMM alignment model: Baum-Welch training,
Forward-Backward algorithm, empty word, dependency on word classes,
transfer to fertility models, ...;
- Variants of Model 3 and Model 4 that allow the
training of the parameter p_0;
- Various smoothing techniques for fertility, distortion/alignment
parameters;
- Significantly more efficient training of the fertility models;
- Correct implementation of pegging as described in (Brown et
al. 1993), with a series of heuristics to make pegging
sufficiently efficient;
- ...
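To illustrate the Forward-Backward algorithm mentioned in the list above, here is a minimal Python sketch for a first-order HMM alignment model. It is not taken from the GIZA++ sources: the function name forward_backward and the parameters init, trans and emit are hypothetical, and the sketch leaves out the empty word, the dependency on word classes, and the Baum-Welch re-estimation step. It only computes, for one sentence pair, the posterior probability that target position j is aligned to source position i.

```python
def forward_backward(init, trans, emit):
    """Posterior alignment probabilities for one sentence pair (illustrative only).

    init[i]     : probability that the first target word aligns to source position i
    trans[k][i] : probability of jumping from source position k to source position i
    emit[j][i]  : lexical probability of target word j given source word i
    Assumes at least one target word and a non-zero sentence likelihood.
    Returns (gamma, likelihood), where gamma[j][i] = P(a_j = i | sentence pair).
    """
    J, I = len(emit), len(init)

    # Forward pass: alpha[j][i] = P(f_1..f_j, a_j = i)
    alpha = [[0.0] * I for _ in range(J)]
    for i in range(I):
        alpha[0][i] = init[i] * emit[0][i]
    for j in range(1, J):
        for i in range(I):
            alpha[j][i] = emit[j][i] * sum(
                alpha[j - 1][k] * trans[k][i] for k in range(I)
            )

    # Backward pass: beta[j][i] = P(f_{j+1}..f_J | a_j = i)
    beta = [[1.0] * I for _ in range(J)]
    for j in range(J - 2, -1, -1):
        for i in range(I):
            beta[j][i] = sum(
                trans[i][k] * emit[j + 1][k] * beta[j + 1][k] for k in range(I)
            )

    # Sentence likelihood and normalized posteriors
    likelihood = sum(alpha[J - 1])
    gamma = [
        [alpha[j][i] * beta[j][i] / likelihood for i in range(I)]
        for j in range(J)
    ]
    return gamma, likelihood
```

In Baum-Welch training, posteriors of this kind are accumulated over all sentence pairs as expected counts and then renormalized to give new parameters. The sketch multiplies raw probabilities, so for long sentences a practical implementation would rescale each position or work in log space to avoid underflow.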
In order to compile GIZA++ you may need:
- a recent version of the GNU compiler (2.95 or higher)
- a recent assembler and linker that do not restrict
the length of symbol names
GIZA++ is known to compile on Linux, IRIX and SunOS systems. Many
older compiler versions do not fully support all the STL features that
GIZA++ uses. As a result,
compiler, assembler or linker problems occur frequently, mostly
because of the program's intensive use of the STL. If any
compilation problem occurs, please first
try the newest compiler version. Patches to the code are most
welcome. Feel free to send me mail asking for help, but please do not
necessarily expect me to have time to help.
GIZA++ is released under the GNU General Public
License (GPL).
Citation:
You are welcome to use the code under the terms of the license for
research or commercial purposes; however, please acknowledge its use
with a citation:
Franz Josef Och, Hermann Ney.
"A Systematic Comparison of Various Statistical Alignment Models",
Computational Linguistics, volume 29, number 1, pp. 19-51,
March 2003.
Here is a BibTeX entry:
@ARTICLE{och03:asc,
AUTHOR = {Franz Josef Och and Hermann Ney},
TITLE = {A Systematic Comparison of Various Statistical Alignment Models},
JOURNAL= {Computational Linguistics},
NUMBER = 1,
VOLUME = 29,
YEAR = 2003,
PAGES = {19--51}}
Versions:
newest version on code.google.com
GIZA++.2003-09-30.tar.gz
- various bug fixes/improved efficiency
- allows generation of n-best lists of alignments (-nbestalignments N)
- compiles with gcc version 2.95 - 3.3 / MacOSX
- more memory efficient (if compiled with -DBINARY_SEARCH_FOR_TTABLE)
GIZA++.2001-01-30.tar.gz
(old version)
Acknowledgements
This work was supported by the
National Science Foundation under Grant No. IIS-9820687 through the 1999
Workshop on Language Engineering, Center for
Language and Speech Processing, Johns Hopkins University.
Last updated: 30 January 2001,
och@isi.edu