NB: This page requires refactoring
Translation models for Moses are typically batch trained. That is, before training you have all the data you wish to use, you compute the alignments using GIZA, and from that produce a phrase table which you can use in the decoder. If some time later you wish to utilize some new training data, you must repeat the process from the start, and for large data sets, that can take quite some time.
Incremental training provides a way of avoiding having to retrain the model from scratch every time you wish to use some new training data. Instead of producing a phrase table with precalculated scores for all translations, the entire source and target corpora are stored in memory as a suffix array along with their alignments, and translation scores are calculated on the fly. When you have new data, you simply word-align it and append the new sentences, together with their alignments, to the corpora. Moses provides a means of doing this via XML-RPC, so you don't even need to restart the decoder to use the new data.
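As an illustration of that update mechanism, here is a minimal sketch of pushing one new word-aligned sentence pair to a running Moses XML-RPC server. It assumes the server is listening on port 8080 and exposes an updater method with source, target, and alignment fields; the method and field names, the port, and the sentences are assumptions, so check the server documentation for your Moses version before relying on them.

# hypothetical example -- method name and field names assumed, sentences made up
curl -s http://localhost:8080/RPC2 -H "Content-Type: text/xml" --data '
<methodCall>
  <methodName>updater</methodName>
  <params><param><value><struct>
    <member><name>source</name><value><string>un exemple simple</string></value></member>
    <member><name>target</name><value><string>a simple example</string></value></member>
    <member><name>alignment</name><value><string>0-0 1-2 2-1</string></value></member>
  </struct></value></param></params>
</methodCall>'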
Note that at the moment the incremental phrase table code is not thread safe.
This section describes how to initially train and use a model which supports incremental training.
Train the system with the EMS as usual, but make sure the final word alignment model is the HMM by adding the following line to the [TRAINING] section of your EMS configuration (the incremental update of the alignments described below relies on HMM alignment models):

training-options = "-final-alignment-model hmm"
(phrase-based decoding only!)
1. Compile Moses with the bjam switch --with-mm
2. You need
- sentence-aligned text files
- the word alignment between these files in symal output format
3. Build binary files
Let ${L1} be the extension of the language that you are translating from, ${L2} the extension of the language that you want to translate into, and ${CORPUS} the name of the word-aligned training corpus (a complete worked example follows this list):

% zcat ${CORPUS}.${L1}.gz | mtt-build -i -o /some/path/${CORPUS}.${L1}
% zcat ${CORPUS}.${L2}.gz | mtt-build -i -o /some/path/${CORPUS}.${L2}
% zcat ${CORPUS}.${L1}-${L2}.symal.gz | symal2mam /some/path/${CORPUS}.${L1}-${L2}.mam
% mmlex-build /some/path/${CORPUS} ${L1} ${L2} -o /some/path/${CORPUS}.${L1}-${L2}.lex
4. Add the following line to moses.ini (it all needs to be on one line; the continuation marks are for typesetting in the PDF of this manual only):
for static systems:
PhraseDictionaryBitextSampling name=PT0 output-factor=0 \
    path=/some/path/${CORPUS} L1=${L1} L2=${L2}
for post-editing, e.g.:
PhraseDictionaryBitextSampling name=PT0 output-factor=0 \
    path=/some/path/${CORPUS} L1=${L1} L2=${L2} smooth=0 prov=1
(Note: the best configuration of phrase table features is still under investigation.)
Phrase table features are explained below.
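For concreteness, here is a hypothetical end-to-end instance of steps 3 and 4 for a French-English system (language codes, corpus name, and paths are illustrative only). Note that the path= value in moses.ini is the same /some/path/${CORPUS} prefix used by the build commands, without the language extension:

L1=fr; L2=en; CORPUS=corpus
zcat ${CORPUS}.${L1}.gz | mtt-build -i -o /some/path/${CORPUS}.${L1}
zcat ${CORPUS}.${L2}.gz | mtt-build -i -o /some/path/${CORPUS}.${L2}
zcat ${CORPUS}.${L1}-${L2}.symal.gz | symal2mam /some/path/${CORPUS}.${L1}-${L2}.mam
mmlex-build /some/path/${CORPUS} ${L1} ${L2} -o /some/path/${CORPUS}.${L1}-${L2}.lex

# resulting moses.ini entry (one line):
PhraseDictionaryBitextSampling name=PT0 output-factor=0 path=/some/path/corpus L1=fr L2=en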
Add the following lines to your config file to use the sampling phrase table within experiment.perl:
### build memory mapped suffix array phrase table
#
mmsapt = "pfwd=g pbwd=g smooth=0.01 rare=0 prov=0 sample=1000 workers=1"
binarize-all = $moses-script-dir/training/binarize-model.perl
OR (for use with interactive post-editing)
### build memory mapped suffix array phrase table
#
mmsapt = "pfwd=g pbwd=g smooth=0 rare=1 prov=1 sample=1000 workers=1"
binarize-all = $moses-script-dir/training/binarize-model.perl
DEPRECATED:
* Modify the moses.ini file found in <experiment-dir>/evaluation/filtered.<evaluation-set>.<run-number> to have a ttable-file entry as follows:

PhraseDictionaryDynSuffixArray source=<path-to-source-corpus> target=<path-to-target-corpus> alignment=<path-to-alignments>

The source and target corpus paths should point to the tokenized, cleaned, and truecased versions found in <experiment-dir>/training/corpus.<run>.<lang>, and the alignment path should point to <experiment-dir>/model/aligned.<run>.grow-diag-final-and.
(Phrase-based decoding only.) See the section on Phrase Table Features for PhraseDictionaryBitextSampling below.
To preprocess the new data, tokenize, clean, and truecase it in the same way as the original training corpus; the <experiment-dir>/steps/<run>/CORPUS_{tokenize,clean,truecase}.<run> scripts show how this was done for the original run.

Preparing the new data then involves converting it to snt format and updating the cooccurrence files.

plain2snt
$ $INC_GIZA_PP/GIZA++-v2/plain2snt.out <new-source-sentences> <new-target-sentences> \
    -txt1-vocab <previous-source-vocab> -txt2-vocab <previous-target-vocab>
The previous vocabularies can be found in <experiment-dir>/training/prepared.<run>/{<source-lang>,<target-lang>}.vcb. Running this command with the files containing your new tokenized, cleaned, and truecased source and target as txt1 and txt2 will produce a new vocab file for each language and a couple of .snt files. Any further references to vocabs in commands or config files should reference the new vocabulary files just produced.
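Putting these pieces together, a hypothetical invocation for a French-to-English system whose original EMS run was number 1 might look as follows (file names, run number, and language codes are illustrative only):

$ $INC_GIZA_PP/GIZA++-v2/plain2snt.out new.fr new.en \
    -txt1-vocab <experiment-dir>/training/prepared.1/fr.vcb \
    -txt2-vocab <experiment-dir>/training/prepared.1/en.vcb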
If you get the error

plain2snt.cpp:28: int loadVocab(): Assertion `iid1.size()-1 == ID' failed.

then change line 15 in plain2snt.cpp to

vector<string> iid1(1),iid2(1);

and recompile.
snt2cooc
$ $INC_GIZA_PP/bin/snt2cooc.out <new-source-vcb> <new-target-vcb> <new-source_target.snt> \
    <previous-source-target.cooc > new.source-target.cooc
$ $INC_GIZA_PP/bin/snt2cooc.out <new-target-vcb> <new-source-vcb> <new-target_source.snt> \
    <previous-target-source.cooc > new.target-source.cooc
The previous cooccurrence files can be found in <experiment-dir>/training/giza.<run>/<target-lang>-<source-lang>.cooc and <experiment-dir>/training/giza-inverse.<run>/<source-lang>-<target-lang>.cooc.
GIZA++ is then run to update the alignments with the new data; create config files along the following lines (one for each translation direction):

S: <path-to-src-vocab>
T: <path-to-tgt-vocab>
C: <path-to-src-to-tgt-snt>
O: <prefix-of-output-files>
coocurrencefile: <path-to-src-tgt-cooc-file>
model1iterations: 1
model1dumpfrequency: 1
hmmiterations: 1
hmmdumpfrequency: 1
model2iterations: 0
model3iterations: 0
model4iterations: 0
model5iterations: 0
emAlignmentDependencies: 1
step_k: 1
oldTrPrbs: <path-to-original-thmm>
oldAlPrbs: <path-to-original-hhmm>
To run GIZA++ with these config files, just issue the command
GIZA++ <path-to-config-file>
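Because the symmetrization step below needs updated alignments in both directions, GIZA++ is run once per direction, each time with its own config file; for example (hypothetical file names):

GIZA++ src-tgt.update.gizacfg
GIZA++ tgt-src.update.gizacfg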
With the GIZA++ alignments updated, we can obtain the symmetrized word alignment for the new data by running the command:

giza2bal.pl -d <path-to-updated-tgt-to-src-ahmm> -i <path-to-updated-src-to-tgt-ahmm> \
    | symal -alignment="grow" -diagonal="yes" -final="yes" -both="yes" > new-alignment-file
This is still work in progress. Feature sets and names may change at any time without notice. It is best not to rely on defaults but to always specify for each feature explicitly whether or not it is to be used.
Some of the features below are described in the following publication: Ulrich Germann. 2014. "Dynamic Phrase Tables for Statistical Machine Translation in an Interactive Post-editing Scenario". AMTA 2014 Workshop on Interactive and Adaptive Machine Translation. Vancouver, BC, Canada.
The sampling phrase table offers a number of fixed and configurable phrase table features. For the descriptions below, it is necessary to distinguish different kinds of counts: raw counts in the corpus (r), the size of the sample drawn for a phrase (s), the number of samples actually used, i.e. 'good' samples (g), and joint phrase pair counts (j).
Phrase pair features are specified as follows:
pfwd=spec : log of the lower bound on the forward phrase-level conditional probability; details below
pbwd=spec : log of the lower bound on the backward phrase-level conditional probability; details below
logcnt=spec : logs of plain counts; details below
coh={1|0} : log of coherence (include / don't include)
rare=param : global rarity penalty param/(j + param), where param determines the steepness of the asymptotic penalty, which slowly decreases towards zero as the number of joint phrase occurrences j increases (see the worked example after this list)
prov=param : foreground/background-specific provenance reward j/(j + param) that asymptotically grows to 1 for the specific corpus as the number of joint phrase occurrences j increases
unal=spec : number of unaligned words in the phrase pair; detailed documentation pending
pcnt=spec : phrase penalty ??? (introduced by M. Denkowski)
wcnt=spec : word penalty ??? (introduced by M. Denkowski)
lenrat={1|0} : use / don't use the phrase length ratio feature described here
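As a worked example of the two asymptotic features above, with an arbitrarily chosen parameter value of 4: a phrase pair with j = 1 joint occurrences receives a rarity penalty of 4/(1+4) = 0.8 and a provenance reward of 1/(1+4) = 0.2, whereas at j = 100 the penalty has shrunk to 4/104 ≈ 0.04 and the reward has grown to 100/104 ≈ 0.96.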
The specification for pfwd and pbwd consists of one or more of the letters 'r', 's', and 'g', plus optionally the '+' sign. The letter (r/s/g) determines the denominator (see the types of counts above); the plus sign indicates that these features are to be computed separately for the (static) background corpus and the (dynamic) foreground corpus. For example, pfwd=g+ will compute the lower bound on the probability given j joint occurrences of the phrase pair in question in g samples, computed separately for the two corpora. The confidence level for the lower bound is specified by the parameter smooth=value, where value is a number between 0 and 1 indicating the risk of overestimating the true probability, given the evidence, that we are willing to take. smooth=0 causes the maximum likelihood estimate to be used.
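A purely illustrative example (the exact smoothing formula is not documented on this page): suppose a phrase pair was extracted j = 3 times from g = 10 good samples. With smooth=0 the feature is based on the maximum likelihood estimate 3/10 = 0.3; with a small positive setting such as smooth=0.01 it is instead based on a lower-bound estimate somewhat below 0.3, and the gap between the two shrinks as the number of samples grows.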
The specification for logcnt consists of one or more of the following items, again optionally followed by a '+':

r1 : include raw counts for the L1 phrase
r2 : include raw counts for the L2 phrase
s1 : include the sample size for the L1 phrase
g1 : include the number of samples used ('good samples')
j : include joint phrase counts

As with pfwd/pbwd, a '+' at the end indicates that the features are to be provided per corpus, not pooled. E.g., logcnt=g1jr2 provides the log of the number of samples actually used for phrase extraction, joint counts, and raw L2 phrase counts.
extra=path allows you to specify a set of files path.L1, path.L2, and path.symal to seed the dynamic foreground corpus with a word-aligned corpus in text format. path.L1 and path.L2 must contain one sentence per line, cased as required for translation; path.symal should contain the word alignment information in symal output format.
The program ptable-describe-features can be used to list the features in the order they are provided by the phrase table:
cat moses.ini | ptable-describe-features
The phrase-based model uses a suffix array implementation which comes with Moses.
If you want to use suffix arrays for hierarchical models, use Adam Lopez's implementation. The source code for this is currently available in cdec. You have to compile cdec, so please follow its build instructions.
You also need to install pycdec:

cd python
python setup.py install
Note: the suffix array code requires Python 2.7 or above. If your Linux installation is a few years old, check this first.
Adam Lopez's implementation writes the suffix array to binary files, given the parallel training data and word alignment. The Moses toolkit has a wrapper script which simplifies this process:
./scripts/training/wrappers/adam-suffix-array/suffix-array-create.sh \
    [path to cdec/python/pkg] \
    [source corpus] \
    [target corpus] \
    [word alignment] \
    [output suffix array directory] \
    [output glue rules]
WARNING - This requires a lot of memory (approximately 10GB for a parallel corpus of 15 million sentence pairs)
Once the suffix array has been created, run another Moses wrapper script to extract the translation rules required for a particular set of input sentences.
./scripts/training/wrappers/adam-suffix-array/suffix-array-extract.sh \
    [suffix array directory from previous command] \
    [input sentences] \
    [output rules directory] \
    [number of jobs]
This command creates one file for each input sentence, containing just the rules required to decode that sentence, e.g.
# ls filtered.5/
grammar.0.gz   grammar.3.gz  grammar.7.gz
grammar.1.gz   grammar.4.gz  grammar.8.gz
grammar.10.gz  grammar.5.gz  grammar.9.gz
....
Note - these files are gzipped, and the rules are formatted in the Hiero format, rather than the Moses format, e.g.
# zcat filtered.5/grammar.out.0.gz | head -1
[X] ||| monsieur [X,1] ||| mr [X,1] ||| 0.178069829941 2.04532289505 1.8692317009 0.268405526876 0.160579100251 0.0 0.0 ||| 0-0
To use these rules in the decoder, put this into the ini file:
PhraseDictionaryALSuffixArray name=TranslationModel0 table-limit=20 \
    num-features=7 path=[path-to-filtered-dir] input-factor=0 output-factor=0
PhraseDictionaryMemory name=TranslationModel1 num-features=1 \
    path=[path-to-glue-grammar] input-factor=0 output-factor=0
Adam Lopez's suffix array implementation is integrated into the EMS, where all of the above commands are executed for you. Add the following line to your EMS config file:
[TRAINING]
suffix-array = [pycdec package path]
# e.g.
# suffix-array = /home/github/cdec/python/pkg
and the EMS will use the suffix array instead of the usual Moses rule extraction algorithms.
You can also have multiple extractors running at once:

[GENERAL]
sa_extractors = 8
WARNING: currently pycdec simply forks itself N times, so this will require N times more memory. Be careful with the interaction between multiple parallel evaluations in the EMS and large suffix arrays.