
Incremental Training

NB: This page requires refactoring



Translation models for Moses are typically batch trained. That is, before training you have all the data you wish to use, you compute the alignments using GIZA, and from that produce a phrase table which you can use in the decoder. If some time later you wish to utilize some new training data, you must repeat the process from the start, and for large data sets, that can take quite some time.

Incremental training provides a way of avoiding having to retrain the model from scratch every time you wish to use some new training data. Instead of producing a phrase table with precalculated scores for all translations, the entire source and target corpora are stored in memory as a suffix array along with their alignments, and translation scores are calculated on the fly. Now, when you have new data, you simply update the word alignments, and append the new sentences to the corpora along with their alignments. Moses provides a means of doing this via XML RPC, so you don't even need to restart the decoder to use the new data.
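The on-the-fly scoring idea can be sketched in a few lines (a toy Python illustration with made-up data, not the actual Moses suffix-array code): keep the word-aligned bitext in memory, and when a source word is looked up, collect its aligned target words and estimate translation probabilities from the counts. Adding new data then amounts to appending to the in-memory corpus.

```python
from collections import Counter

# Toy in-memory "bitext": parallel sentences plus word alignments given as
# (src_idx, tgt_idx) pairs. Hypothetical data for illustration only; Moses
# uses a suffix array rather than this linear scan.
bitext = [
    ("das haus".split(), "the house".split(), [(0, 0), (1, 1)]),
    ("das buch".split(), "the book".split(), [(0, 0), (1, 1)]),
    ("ein haus".split(), "a house".split(), [(0, 0), (1, 1)]),
]

def translations(src_word):
    """Collect aligned target words for src_word; return MLE probabilities."""
    counts = Counter()
    for src, tgt, align in bitext:
        for i, j in align:
            if src[i] == src_word:
                counts[tgt[j]] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

print(translations("das"))   # {'the': 1.0}
print(translations("haus"))  # {'house': 1.0}
```

Updating the model with a new sentence pair is then just `bitext.append(...)`; no scores need to be recomputed ahead of time.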

Note that at the moment the incremental phrase table code is not thread safe.

Initial Training

This section describes how to initially train and use a model which supports incremental training.

  • Add the line
 training-options = "-final-alignment-model hmm"
to the TRAINING section of your experiment configuration file.
  • Train the system using the initial training data as normal.

Virtual Phrase Tables Based on Sampling Word-aligned Bitexts

phrase-based decoding only!


1. Compile Moses with the bjam switch --with-mm

2. You need

   - sentence-aligned text files
   - the word alignment between these files in symal output format

3. Build binary files

   Let ${L1} be the extension of the language that you are translating from,
   ${L2} the extension of the language that you want to translate into, and 
   ${CORPUS} the name of the word-aligned training corpus

   % zcat ${CORPUS}.${L1}.gz  | mtt-build -i -o /some/path/${CORPUS}.${L1}
   % zcat ${CORPUS}.${L2}.gz  | mtt-build -i -o /some/path/${CORPUS}.${L2}
   % zcat ${CORPUS}.${L1}-${L2}.symal.gz | symal2mam /some/path/${CORPUS}.${L1}-${L2}.mam
   % mmlex-build /some/path/${CORPUS} ${L1} ${L2} -o /some/path/${CORPUS}.${L1}-${L2}.lex 

4. Add the following line to moses.ini (everything must be on one line; the continuation marks are for typesetting in the PDF of this manual only):

for static systems:

   PhraseDictionaryBitextSampling name=PT0 output-factor=0 \
   path=/some/path/${CORPUS} L1=${L1} L2=${L2} 

for post-editing, e.g.:

   PhraseDictionaryBitextSampling name=PT0 output-factor=0 \
   path=/some/path/${CORPUS} L1=${L1} L2=${L2} smooth=0 prov=1

(Note: the best configuration of phrase table features is still under investigation.)

Phrase table features are explained below.

Use within EMS

Add the following lines to your config file to use the sampling phrase table within experiment.perl:

 ### build memory mapped suffix array phrase table
 mmsapt = "pfwd=g pbwd=g smooth=0.01 rare=0 prov=0 sample=1000 workers=1"
 binarize-all = $moses-script-dir/training/binarize-model.perl

OR (for use with interactive post-editing)

 ### build memory mapped suffix array phrase table
 mmsapt = "pfwd=g pbwd=g smooth=0 rare=1 prov=1 sample=1000 workers=1"
 binarize-all = $moses-script-dir/training/binarize-model.perl


* Modify the moses.ini file found in <experiment-dir>/evaluation/filtered.<evaluation-set>.<run-number> to have a ttable-file entry as follows:

PhraseDictionaryDynSuffixArray source=<path-to-source-corpus> target=<path-to-target-corpus> alignment=<path-to-alignments>

The source and target corpus paths should be to the tokenized, cleaned, and truecased versions found in <experiment-dir>/training/corpus.<run>.<lang>, and the alignment path should be to <experiment-dir>/model/aligned.<run>.grow-diag-final-and.

How to use memory-mapped dynamic suffix array phrase tables in the Moses decoder

(phrase-based decoding only) See the section "Phrase Table Features for PhraseDictionaryBitextSampling" below.


Preprocess New Data

First, tokenise, clean, and truecase both target and source sentences (in that order) in the same manner as for the original corpus. You can see how this was done by looking at the <experiment-dir>/steps/<run>/CORPUS_{tokenize,clean,truecase}.<run> scripts.

Prepare New Data

The preprocessed data now needs to be prepared for use by GIZA. This involves updating the vocab files for the corpus, converting the sentences into GIZA's snt format, and updating the cooccurrence file.


 $ $INC_GIZA_PP/GIZA++-v2/plain2snt.out <new-source-sentences> <new-target-sentences> \
 -txt1-vocab <previous-source-vocab> -txt2-vocab <previous-target-vocab>
The previous vocabulary files for the original corpus can be found in <experiment-dir>/training/prepared.<run>/{<source-lang>,<target-lang>}.vcb. Running this command with the files containing your new tokenized, cleaned, and truecased source and target as txt1 and txt2 will produce a new vocab file for each language and a couple of .snt files. Any further references to vocabs in commands or config files should reference the new vocabulary files just produced.
Note: if this command fails with the error message plain2snt.cpp:28: int loadVocab(): Assertion `iid1.size()-1 == ID' failed., then change line 15 in plain2snt.cpp to vector<string> iid1(1),iid2(1); and recompile.
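For intuition, the snt conversion can be sketched as follows (a simplified toy version in Python; the real plain2snt also tracks word frequencies in the .vcb files, and the exact ID conventions may differ):

```python
# Sketch of GIZA's snt conversion: each word gets a numeric vocabulary ID,
# and each sentence pair becomes three lines (a count, source IDs, target
# IDs). The start_id of 2 reflects GIZA reserving low IDs (e.g. 1 for UNK);
# treat the details here as illustrative, not authoritative.
def build_vocab(sentences, start_id=2):
    vocab = {}
    for sent in sentences:
        for w in sent.split():
            if w not in vocab:
                vocab[w] = start_id + len(vocab)
    return vocab

def to_snt(src_sents, tgt_sents, src_vocab, tgt_vocab):
    """Render sentence pairs in snt format: count line, src IDs, tgt IDs."""
    lines = []
    for s, t in zip(src_sents, tgt_sents):
        lines.append("1")
        lines.append(" ".join(str(src_vocab[w]) for w in s.split()))
        lines.append(" ".join(str(tgt_vocab[w]) for w in t.split()))
    return "\n".join(lines)

src, tgt = ["das haus"], ["the house"]
print(to_snt(src, tgt, build_vocab(src), build_vocab(tgt)))
# 1
# 2 3
# 2 3
```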


 $ $INC_GIZA_PP/bin/snt2cooc.out <new-source-vcb> <new-target-vcb> <new-source_target.snt> \
   <previous-source-target.cooc > new.source-target.cooc
 $ $INC_GIZA_PP/bin/snt2cooc.out <new-target-vcb> <new-source-vcb> <new-target_source.snt> \
   <previous-target-source.cooc > new.target-source.cooc
This command is run once in the source-target direction, and once in the target-source direction. The previous cooccurrence files can be found in <experiment-dir>/training/giza.<run>/<target-lang>-<source-lang>.cooc and <experiment-dir>/training/giza-inverse.<run>/<source-lang>-<target-lang>.cooc.

Update and Compute Alignments

GIZA++ can now be run to update and compute the alignments for the new data. This should be run in both the source-to-target and target-to-source directions. A sample GIZA++ config file is given below for the source-to-target direction; for the target-to-source direction, simply swap mentions of target and source.
 S: <path-to-src-vocab>
 T: <path-to-tgt-vocab>
 C: <path-to-src-to-tgt-snt>
 O: <prefix-of-output-files>
 coocurrencefile: <path-to-src-tgt-cooc-file>
 model1iterations: 1
 model1dumpfrequency: 1
 hmmiterations: 1
 hmmdumpfrequency: 1
 model2iterations: 0
 model3iterations: 0
 model4iterations: 0
 model5iterations: 0
 emAlignmentDependencies: 1
 step_k: 1
 oldTrPrbs: <path-to-original-thmm> 
 oldAlPrbs: <path-to-original-hhmm>

To run GIZA++ with these config files, just issue the command

 GIZA++ <path-to-config-file>

With the alignments updated, we can get the alignments for the new data by running the command:

 giza2bal.pl -d <path-to-updated-tgt-to-src-ahmm> -i <path-to-updated-src-to-tgt-ahmm> \
 | symal -alignment="grow" -diagonal="yes" -final="yes" -both="yes" > new-alignment-file
Update Model

Now that alignments have been computed for the new sentences, you can use them in the decoder. Updating a running Moses instance is done via XML-RPC; however, to make the changes permanent, you must append the tokenized, cleaned, and truecased source and target sentences to the original corpora, and the new alignments to the alignment file.
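A minimal client-side sketch of such an update might look as follows. The server URL, the updater method name, and the parameter names here are assumptions; check the mosesserver documentation for your Moses version before relying on them:

```python
import xmlrpc.client

# Sketch of pushing one new sentence pair to a running Moses server.
# The URL, the "updater" method name, and the parameter keys below are
# assumptions for illustration, not a documented API contract.
def make_update(source, target, alignment):
    """Bundle a new sentence pair and its word alignment for the server."""
    return {"source": source, "target": target, "alignment": alignment}

params = make_update("das neue haus", "the new house", "0-0 1-1 2-2")

proxy = xmlrpc.client.ServerProxy("http://localhost:8080/RPC2")
# proxy.updater(params)  # uncomment once a mosesserver is actually running
print(params)
```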

Phrase Table Features for PhraseDictionaryBitextSampling

This is still work in progress. Feature sets and names may change at any time without notice. It is best not to rely on defaults but to always specify for each feature explicitly whether or not it is to be used.

Some of the features below are described in the following publication: Ulrich Germann. 2014. "Dynamic Phrase Tables for Statistical Machine Translation in an Interactive Post-editing Scenario". AMTA 2014 Workshop on Interactive and Adaptive Machine Translation. Vancouver, BC, Canada.

Types of counts

The sampling phrase table offers a number of fixed and configurable phrase table features. For the descriptions below, it is necessary to distinguish different kinds of counts.

  • raw [r] counts: raw monolingual phrase occurrence counts
  • sample size [s]: number of samples considered
  • good [g]: number of samples with a coherent translation (i.e., at least one target phrase could be extracted)
  • joint [j]: joint phrase occurrences

List of phrase table entry features

Phrase pair features are specified as follows:

  • lexical forward and backward probabilities (currently always included)
  • pfwd=spec log of lower bound on forward phrase-level conditional probability; details below
  • pbwd=spec log of lower bound on backward phrase-level conditional probability; details below
  • logcnt=spec logs of plain counts
  • coh={1|0} log of coherence (include / don't include)
  • rare=param: global rarity penalty param/(j + param), where param determines the steepness of the asymptotic penalty, which slowly decreases towards zero as the number of joint phrase occurrences j increases.
  • prov=param: foreground/background-specific provenance reward j/(j + param) that asymptotically grows to 1 for the specific corpus as the number of joint phrase occurrences j increases.
  • unal=spec: number of unaligned words in the phrase pair; detailed documentation pending.
  • pcnt=spec: phrase penalty ??? (introduced by M. Denkowski)
  • wcnt=spec: word penalty ??? (introduced by M. Denkowski)
  • lenrat={1|0}: use / don't use the phrase length ratio feature described here
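The two asymptotic features above are easy to visualize with a quick calculation (param=10 and the joint counts j are arbitrary illustrative values):

```python
# Rarity penalty and provenance reward curves from the feature list above.
def rarity(j, param):
    return param / (j + param)   # decreases towards 0 as j grows

def provenance(j, param):
    return j / (j + param)       # grows towards 1 as j grows

for j in (1, 10, 100, 1000):
    print(j, round(rarity(j, 10), 3), round(provenance(j, 10), 3))
```

At j = param the two curves cross at 0.5; larger param values make both curves flatter, i.e. more joint occurrences are needed before the penalty or reward saturates.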

Specification of forward/backward phrase-level conditional probabilities

The specification for pfwd and pbwd consists of one or more of the letters 'r', 's', and 'g', plus optionally the '+' sign. The letter (r/s/g) determines the denominator (see types of counts above); the plus sign indicates that these features are to be computed separately for the (static) background corpus and the (dynamic) foreground corpus. For example, pfwd=g+ will compute the lower bound on the probability given j joint occurrences of the phrase pair in question in g samples, computed separately for the two corpora. The confidence level for the lower bound is specified by the parameter smooth=value, where value is a number between 0 and 1 indicating the risk of overestimating the true probability given the evidence that we are willing to take. smooth=0 causes the maximum likelihood estimate to be used.
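One plausible reading of this lower bound is a Clopper-Pearson-style binomial bound, sketched below. This is an illustration of the idea (smallest probability under which the observed evidence is still plausible at risk level smooth), not the exact formula used in the Moses code:

```python
from math import comb

def binom_tail(j, n, p):
    """P(X >= j) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(j, n + 1))

def lower_bound(j, g, smooth):
    """Smallest p for which seeing j joint occurrences in g good samples is
    still plausible at risk level smooth; smooth=0 falls back to MLE j/g."""
    if smooth == 0:
        return j / g
    lo, hi = 0.0, 1.0
    for _ in range(50):            # bisection: binom_tail is increasing in p
        mid = (lo + hi) / 2
        if binom_tail(j, g, mid) < smooth:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(lower_bound(5, 10, 0), 2))     # 0.5 (MLE)
print(round(lower_bound(5, 10, 0.01), 2))  # a more conservative estimate
```

A smaller smooth value means less tolerated risk of overestimation, hence a lower (more conservative) probability estimate.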

Specification of log count features

  • r1 include raw counts for L1 phrase
  • r2 include raw counts for L2 phrase
  • s1 include sample size for L1 phrase
  • g1 include number of samples used ('good samples')
  • j include joint phrase counts

As with pfwd/pbwd, a '+' at the end indicates that the features are to be provided per corpus, not pooled. E.g., logcnt=g1jr2 provides the log of the number of samples actually used for phrase extraction, joint counts, and raw L2 phrase counts.
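A toy parser makes the spec syntax concrete (the letter-to-count mapping follows the list above; this is an illustration, not code from Moses):

```python
import re

# Map spec tokens to the count types listed above.
NAMES = {"r1": "raw L1", "r2": "raw L2", "s1": "sample size L1",
         "g1": "good samples", "j": "joint"}

def parse_logcnt(spec):
    """Split a logcnt spec like 'g1jr2' into named count features."""
    per_corpus = spec.endswith("+")        # '+' = per corpus, not pooled
    tokens = re.findall(r"[rsg][12]|j", spec.rstrip("+"))
    return [NAMES[t] for t in tokens], per_corpus

print(parse_logcnt("g1jr2"))  # (['good samples', 'joint', 'raw L2'], False)
```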

Seeding the dynamic foreground corpus

extra=path allows you to specify a set of files path.L1, path.L2, and path.symal to seed the dynamic foreground corpus with a word-aligned corpus in text format. path.L1 and path.L2 must be one sentence per line, cased as required for translation. path.symal should contain the word alignment info in symal output format.

Checking the active feature set

The program ptable-describe-features can be used to list the features used, in the order they are provided by the phrase table:

 cat moses.ini | ptable-describe-features 

Suffix Arrays for Hierarchical Models

The phrase-based model uses a suffix array implementation which comes with Moses.

If you want to use suffix arrays for hierarchical models, use Adam Lopez's implementation. The source code for this is currently distributed as part of cdec; you have to compile cdec first, so please follow its build instructions.

You also need to install pycdec:

    cd python
    python setup.py install

Note: the suffix array code requires Python 2.7 or above. If your Linux installation is a few years old, check this first.

Adam Lopez's implementation writes the suffix array to binary files, given the parallel training data and word alignment. The Moses toolkit has a wrapper script which simplifies this process:

    ./scripts/training/wrappers/adam-suffix-array/suffix-array-create.sh \
           [path to cdec/python/pkg] \
           [source corpus] \
           [target corpus] \
           [word alignment] \
           [output suffix array directory] \
           [output glue rules]

WARNING - This requires a lot of memory (approximately 10GB for a parallel corpus of 15 million sentence pairs)

Once the suffix array has been created, run another Moses wrapper script to extract the translation rules required for a particular set of input sentences.

     ./scripts/training/wrappers/adam-suffix-array/suffix-array-extract.sh \
           [suffix array directory from previous command] \
           [input sentences] \   
           [output rules directory] \
           [number of jobs]

This command creates one file for each input sentence, containing just the rules required to decode that sentence, e.g.

    # ls filtered.5/
    grammar.0.gz	grammar.3.gz	grammar.7.gz
    grammar.1.gz	grammar.4.gz	grammar.8.gz
    grammar.10.gz	grammar.5.gz	grammar.9.gz ....

Note - these files are gzipped, and the rules are formatted in the Hiero format, rather than the Moses format, e.g.

    # zcat filtered.5/grammar.out.0.gz | head -1
    [X] ||| monsieur [X,1] ||| mr [X,1] ||| 0.178069829941 2.04532289505 1.8692317009 0.268405526876 0.160579100251 0.0 0.0 ||| 0-0
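The field layout of such a rule can be illustrated with a few lines of Python (fields are separated by `|||`: left-hand side, source, target, feature scores, word alignment):

```python
# Parse the Hiero-format rule shown above into its five fields.
line = ("[X] ||| monsieur [X,1] ||| mr [X,1] ||| "
        "0.178069829941 2.04532289505 1.8692317009 0.268405526876 "
        "0.160579100251 0.0 0.0 ||| 0-0")

lhs, src, tgt, feats, align = [f.strip() for f in line.split("|||")]
scores = [float(x) for x in feats.split()]
print(lhs, src, tgt, len(scores), align)  # [X] monsieur [X,1] mr [X,1] 7 0-0
```

Note that the rule carries 7 feature scores, which matches the num-features=7 setting in the PhraseDictionaryALSuffixArray line shown below.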

To use these rules in the decoder, put this into the ini file

    PhraseDictionaryALSuffixArray name=TranslationModel0 table-limit=20 \
       num-features=7 path=[path-to-filtered-dir] input-factor=0 output-factor=0
    PhraseDictionaryMemory name=TranslationModel1 num-features=1 \
       path=[path-to-glue-grammar] input-factor=0 output-factor=0

Using the EMS

Adam Lopez's suffix array implementation is integrated into the EMS, where all of the above commands are executed for you. Add the following line to your EMS config file:

   suffix-array = [pycdec package path]
   # e.g.
   # suffix-array = /home/github/cdec/python/pkg

and the EMS will use the suffix array instead of the usual Moses rule extraction algorithms.

You can also have multiple extractors running at once:

   sa_extractors = 8

WARNING: currently pycdec simply forks itself N times, so this will require N times more memory. Be careful with the interaction between multiple parallel evaluations in EMS and large suffix arrays.

Page last modified on October 26, 2015, at 10:20 PM