
Domain Adaptation


Translation Model Combination

You can combine several phrase tables by linear interpolation or instance weighting, using the script tmcombine.py in contrib/tmcombine/, or by fill-up or back-off, using the script combine-ptables.pl in contrib/combine-ptables/.

Linear Interpolation and Instance Weighting

Linear interpolation works with any models; for instance weighting, models need to be trained with the option -write-lexical-counts so that all sufficient statistics are available. You can set corpus weights by hand, and instance weighting with uniform weights corresponds to a concatenation of your training corpora (except for differences in word alignment).
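As an illustration of what linear interpolation does, here is a minimal Python sketch (not the tmcombine implementation) that interpolates the feature values of several phrase tables held in memory as dicts; a phrase pair absent from a table contributes probability 0:

```python
def interpolate_tables(tables, weights):
    """Linear interpolation of phrase tables.

    tables: list of dicts mapping (src, tgt) phrase pairs to a list of
    probability features; weights: one weight per table, summing to 1.
    """
    assert abs(sum(weights) - 1.0) < 1e-9
    num_features = len(next(iter(tables[0].values())))
    zero = [0.0] * num_features
    combined = {}
    # The combined table contains the union of all phrase pairs.
    for pair in set().union(*tables):
        combined[pair] = [
            sum(w * t.get(pair, zero)[i] for w, t in zip(weights, tables))
            for i in range(num_features)
        ]
    return combined
```

Instance weighting differs in that the weights are applied to the underlying counts rather than to the probabilities, which is why it needs the extra statistics written by -write-lexical-counts.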

You can also set weights automatically so that perplexity on a tuning set is minimized. To obtain a tuning set from a parallel tuning corpus, use the Moses training pipeline to automatically extract a list of phrase pairs. The file model/extract.sorted.gz is in the right format.
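The weight search can be sketched as follows: a hypothetical helper that grid-searches the interpolation weight of two models so that perplexity on the tuning phrase pairs is minimized (the real script uses proper numerical optimization rather than a grid):

```python
import math

def perplexity(lmbda, probs1, probs2):
    """Perplexity of a two-model linear interpolation on a tuning set.

    probs1, probs2: parallel lists with the probability each model
    assigns to every tuning phrase pair.
    """
    log_sum = sum(math.log(lmbda * a + (1.0 - lmbda) * b)
                  for a, b in zip(probs1, probs2))
    return math.exp(-log_sum / len(probs1))

def best_lambda(probs1, probs2, steps=100):
    # Coarse grid search over lambda in (0, 1).
    grid = [i / steps for i in range(1, steps)]
    return min(grid, key=lambda l: perplexity(l, probs1, probs2))
```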

An example call, which combines test/model1 and test/model2 with instance weighting (-m counts), uses test/extract as the development set for perplexity minimization, and writes the combined phrase table to test/phrase-table_test5:

    python tmcombine.py combine_given_tuning_set test/model1 test/model2 \
        -m counts -o test/phrase-table_test5 -r test/extract

More information is available in (Sennrich, 2012 EACL) and in contrib/tmcombine/.

Fill-up Combination

This combination technique is useful when the relevance of the models is known a priori: typically, when one is trained on in-domain data and the others on out-of-domain data.

Fill-up preserves all the entries and scores coming from the first model, and adds entries from the other models only if new. Moreover, a binary feature is added for each additional table to denote the provenance of an entry. These binary features work as scaling factors that can be tuned directly by MERT along with other models' weights.
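The mechanics can be sketched in a few lines of Python (an illustration, not the combine-ptables implementation; the value e for the provenance flag follows the convention that Moses log-transforms feature values, so the flag contributes 1 or 0 in the log-linear model):

```python
import math

def fillup(tables):
    """Fill-up combination: entries and scores of the first (in-domain)
    table are preserved; entries from later tables are added only if new.
    One binary provenance feature per additional table is appended.
    """
    n_extra = len(tables) - 1
    combined = {}
    for pair, feats in tables[0].items():
        combined[pair] = feats + [1.0] * n_extra      # all flags off
    for k, table in enumerate(tables[1:]):
        for pair, feats in table.items():
            if pair not in combined:
                flags = [1.0] * n_extra
                flags[k] = math.e                     # flag: from table k+1
                combined[pair] = feats + flags
    return combined
```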

Fill-up can be applied to both translation and reordering tables.

Example call, where ptable0 is the in-domain model:

    perl combine-ptables.pl --mode=fillup \
        ptable0 ptable1 ... ptableN > ptable-fillup

More information is available in (Bisazza et al., 2011 IWSLT) and in contrib/combine-ptables/.

Back-Off Combination

An additional combination technique, called back-off, is available; it is a simplified version of fill-up. The only difference is that the back-off technique does not generate the binary feature denoting the provenance of an entry. This is also the main advantage of back-off: the combined table (ptable-backoff) contains exactly the same number of scores as the tables being combined (ptable0, ptable1, ... ptableN).
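Back-off is then just the fill-up loop without the provenance flags; a sketch assuming tables are dicts from phrase pair to feature list, with the first table in-domain:

```python
def backoff(tables):
    """Back-off combination: like fill-up, but no binary provenance
    feature is appended, so the output has exactly as many scores per
    entry as the input tables."""
    combined = dict(tables[0])          # first (in-domain) table wins
    for table in tables[1:]:
        for pair, feats in table.items():
            combined.setdefault(pair, feats)   # add only if new
    return combined
```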

Example call, where ptable0 is the in-domain model:

    perl combine-ptables.pl --mode=backoff \
        ptable0 ptable1 ... ptableN > ptable-backoff

OSM Model Combination (Interpolated OSM)

An OSM model trained on the plain concatenation of in-domain data with large and diverse multi-domain data is sub-optimal. When the other domains are sufficiently larger and/or different from the in-domain data, the probability distribution can skew away from the target domain, resulting in poor performance. The LM-like nature of the model motivates applying methods such as perplexity optimization for model weighting. The idea is to train an OSM model on each domain separately and interpolate the models by minimizing perplexity on a held-out tuning set. For more details, see Durrani et al. (2015).
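A standard way to find such interpolation weights is the EM procedure used for language model mixtures, sketched here for illustration (this is not the SRILM/Moses implementation; it assumes each model's probabilities on the tuning events are precomputed):

```python
def em_weights(model_probs, iterations=50):
    """EM for mixture weights that minimize perplexity.

    model_probs[m][i] is the probability model m assigns to tuning
    event i (here: an operation sequence n-gram).
    """
    num_models = len(model_probs)
    num_events = len(model_probs[0])
    weights = [1.0 / num_models] * num_models
    for _ in range(iterations):
        posterior_sums = [0.0] * num_models
        for i in range(num_events):
            # E-step: posterior responsibility of each model for event i.
            mix = sum(weights[m] * model_probs[m][i] for m in range(num_models))
            for m in range(num_models):
                posterior_sums[m] += weights[m] * model_probs[m][i] / mix
        # M-step: new weights are the normalized responsibilities.
        weights = [s / num_events for s in posterior_sums]
    return weights
```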


Provide tuning files as additional parameter in the settings. For example:

 interpolated-operation-sequence-model = "yes"
 operation-sequence-model-order = 5
 operation-sequence-model-settings = "--factor 0-0 --tune /path-to-tune-folder/tune_file --srilm-dir /path-to-srilm/bin/i686-m64"

This method requires word alignments for the source and reference tuning files in order to generate operation sequences. These can be obtained by force-decoding the tuning set or by aligning the tuning sets along with the training data. The folder should contain files such as < , tune.en, tune.align>.

The interpolation script does not work with LMPLZ and requires an SRILM installation.

Online Translation Model Combination (Multimodel phrase table type)

In addition to the log-linear combination of translation models, Moses supports methods that combine multiple translation models into a single virtual model, which is then passed to the decoder. The combination is performed at decoding time.

In the config, add a feature PhraseDictionaryMultiModel, which refers to its components as follows:

 [mapping]
 0 T 2 [or whatever the zero-based index of PhraseDictionaryMultiModel is]

 [feature]
 PhraseDictionaryMemory tuneable=false num-features=4 input-factor=0 output-factor=0 path=/path/to/model1/phrase-table.gz table-limit=20
 PhraseDictionaryMemory tuneable=false num-features=4 input-factor=0 output-factor=0 path=/path/to/model2/phrase-table.gz table-limit=20
 PhraseDictionaryMultiModel num-features=4 input-factor=0 output-factor=0 table-limit=20 mode=interpolate lambda=0.2,0.8 components=PhraseDictionaryMemory0,PhraseDictionaryMemory1

 [weight]
 PhraseDictionaryMemory0= 0 0 1 0
 PhraseDictionaryMemory1= 0 0 1 0
 PhraseDictionaryMultiModel0= 0.2 0.2 0.2 0.2

As component models, PhraseDictionaryMemory, PhraseDictionaryBinary and PhraseDictionaryCompact are supported (you may mix them freely). Set the key tuneable=false for all component models; their weights are only used for table-limit pruning, so we recommend 0 0 1 0 (which means p(e|f) is used for pruning).

There are two additional valid options for PhraseDictionaryMultiModel, mode and lambda. The only mode supported so far is interpolate, which linearly interpolates all component models and passes the results to the decoder as if they were coming from a single model. Results are identical to offline interpolation with tmcombine.py and -mode interpolate, except for pruning and rounding differences. The weights for each component model can be configured through the key lambda. The number of weights must be one per model, or one per model per feature.

Weights can also be set for each sentence during decoding through mosesserver by passing the parameter lambda. See contrib/server/ for an example. Sentence-level weights override those defined in the config.

With a running Moses server instance, the weights can also be optimized on a tuning set of phrase pairs, using perplexity minimization. This is done with the XMLRPC method optimize and the parameter phrase_pairs, which is an array of phrase pairs, each phrase pair being an array of two strings. For an example, consult contrib/server/. Online optimization depends on the dlib library and requires Moses to be compiled with the flag --with-dlib=/path/to/dlib. Note that optimization returns a weight vector but does not affect the running system. To use the optimized weights, either update moses.ini and restart the server, or pass the optimized weights as a parameter for each sentence.
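A minimal client sketch, using Python's standard xmlrpc.client; the method name optimize and the phrase_pairs parameter are taken from the description above, but the exact request shape should be checked against the examples in contrib/server/:

```python
import xmlrpc.client

def build_optimize_params(phrase_pairs):
    """Shape the request for the server's 'optimize' method: an array
    of phrase pairs, each pair an array of two strings (source, target)."""
    for pair in phrase_pairs:
        assert len(pair) == 2 and all(isinstance(s, str) for s in pair)
    return {"phrase_pairs": phrase_pairs}

def optimize_on_server(url, phrase_pairs):
    # Requires a running mosesserver compiled with --with-dlib.
    # Returns a weight vector; the running system is not changed.
    proxy = xmlrpc.client.ServerProxy(url)
    return proxy.optimize(build_optimize_params(phrase_pairs))
```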

Online Computation of Translation Model Features Based on Sufficient Statistics

With default phrase tables, only linear interpolation can be performed online. Moses also supports computing translation probabilities and lexical weights online, based on a (weighted) combination of the sufficient statistics from multiple corpora, i.e. phrase and word (pair) frequencies.
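For instance, under instance weighting the phrase translation probability p(f|e) is computed from the weighted counts rather than from interpolated probabilities; a minimal sketch (an illustrative helper, not the Moses code):

```python
def weighted_phrase_prob(pair_counts, target_counts, lambdas):
    """p(f|e) from weighted sufficient statistics:
    sum_m lambda_m * c_m(f,e)  /  sum_m lambda_m * c_m(e),
    where c_m are the counts observed in corpus m."""
    numerator = sum(l * c for l, c in zip(lambdas, pair_counts))
    denominator = sum(l * c for l, c in zip(lambdas, target_counts))
    return numerator / denominator
```

Note how this differs from interpolating probabilities: a corpus with large counts dominates the estimate unless it is down-weighted.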

As preparation, the training option --write-lexical-counts must be used when training the translation model. Then, use the script scripts/training/create_count_tables.py to convert the phrase tables into phrase tables that store phrase (pair) frequencies as their feature values.

  scripts/training/create_count_tables.py /path/to/model/phrase-table.gz /path/to/model

The format for the translation tables in the moses.ini is similar to that of the Multimodel type, but uses the feature type PhraseDictionaryMultiModelCounts and additional parameters to specify the component models. Four parameters are required: components, target-table, lex-f2e and lex-e2f. The files required for the first two are created by create_count_tables.py, the last two during training of the model with --write-lexical-counts. Binarized/compacted tables are also supported (as for PhraseDictionaryMultiModel). Note that for the target count tables, phrase table filtering needs to be disabled (filterable=false).

 [mapping]
 0 T 4 [or whatever the zero-based index of PhraseDictionaryMultiModelCounts is]

 [feature]
 PhraseDictionaryMemory tuneable=false num-features=3 input-factor=0 output-factor=0 path=/path/to/model1/count-table.gz table-limit=20
 PhraseDictionaryMemory tuneable=false num-features=3 input-factor=0 output-factor=0 path=/path/to/model2/count-table.gz table-limit=20
 PhraseDictionaryMemory tuneable=false filterable=false num-features=1 input-factor=0 output-factor=0 path=/path/to/model1/count-table-target.gz
 PhraseDictionaryMemory tuneable=false filterable=false num-features=1 input-factor=0 output-factor=0 path=/path/to/model2/count-table-target.gz
 PhraseDictionaryMultiModelCounts num-features=4 input-factor=0 output-factor=0 table-limit=20 mode=instance_weighting lambda=1.0,10.0 components=PhraseDictionaryMemory0,PhraseDictionaryMemory1 target-table=PhraseDictionaryMemory2,PhraseDictionaryMemory3 lex-e2f=/path/to/model1/lex.counts.e2f,/path/to/model2/lex.counts.e2f lex-f2e=/path/to/model1/lex.counts.f2e,/path/to/model2/lex.counts.f2e

 [weight]
 PhraseDictionaryMemory0= 1 0 0
 PhraseDictionaryMemory1= 1 0 0
 PhraseDictionaryMemory2= 1
 PhraseDictionaryMemory3= 1
 PhraseDictionaryMultiModelCounts0= 0.00402447059454402 0.0685647475075862 0.294089113124688 0.0328320356515851

Setting and optimizing weights is done as for the Multimodel phrase table type, but the supported modes are different. The weights of the component models are only used for table-limit pruning; the weight 1 0 0, which prunes by phrase pair frequency, is recommended.

The following modes are implemented:

  • instance_weighting: weights are applied to the sufficient statistics (i.e. the phrase (pair) frequencies), not to model probabilities. Results are identical to offline optimization with tmcombine.py and -mode counts, except for pruning and rounding differences.
  • interpolate: both phrase and word translation probabilities (the latter being used to compute lexical weights) are linearly interpolated. This corresponds to tmcombine.py with -mode interpolate and -recompute-lexweights.

Alternate Weight Settings

Note: this functionality currently does not work with multi-threaded decoding.

You may want to translate some sentences with different weight settings than others, due to significant differences in genre, text type, or style, or even to have separate settings for headlines and questions.

Moses allows you to specify alternate weight settings in the configuration file, e.g.:

 [alternate-weight-setting]
 id=strong-lm
 Distortion0= 0.1
 LexicalReordering0= 0.1 0.1 0.1 0.1 0.1 0.1
 LM0= 1
 WordPenalty0= 0
 TranslationModel0= 0.1 0.1 0.1 0.1 0

This example specifies a weight setting with the identifying name strong-lm.

When translating a sentence, the default weight setting is used, unless the use of an alternate weight setting is specified with an XML tag:

 <seg weight-setting="strong-lm">This is a small house .</seg>

This functionality also allows for the selective use of feature functions and decoding graphs (unless decomposed factored models are used, a decoding graph corresponds to a translation table).

Feature functions can be turned off by adding the parameter ignore-ff to the identifier line (a comma-separated list of feature function names); decoding graphs can be ignored with the parameter ignore-decoding-path (a comma-separated list of decoding path numbers).

Note that these additional options provide all the capability of the previously (pre-2013) implemented "Translation Systems". You can even have one configuration file and one Moses process translate two different language pairs that share nothing but basic features.

See the example below for a complete configuration file with exactly this setup. In this case, the default weight setting is not useful since it mixes translation models and language models from both language pairs.


 # mapping steps
 [mapping]
 0 T 0
 1 T 1


 # feature functions
 [feature]
 PhraseDictionaryBinary name=TranslationModel0 num-features=5 \
    path=/path/to/french-english/phrase-table output-factor=0
 LexicalReordering num-features=6 name=LexicalReordering0 \
    type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0
 KENLM name=LM0 order=5 factor=0 path=/path/to/french-english/language-model lazyken=0
 PhraseDictionaryBinary name=TranslationModel1 num-features=5 \
    path=/path/to/german-english/phrase-table output-factor=0
 LexicalReordering num-features=6 name=LexicalReordering1 \
    type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0
 KENLM name=LM1 order=5 factor=0 path=/path/to/german-english/language-model lazyken=0

 [weight]
 # core weights - not used
 Distortion0= 0
 WordPenalty0= 0
 TranslationModel0= 0 0 0 0 0
 LexicalReordering0= 0 0 0 0 0 0
 LM0= 0
 TranslationModel1= 0 0 0 0 0
 LexicalReordering1= 0 0 0 0 0 0
 LM1= 0

 [alternate-weight-setting]
 id=fr ignore-ff=LM1,LexicalReordering1 ignore-decoding-path=1
 Distortion0= 0.155
 LexicalReordering0= 0.074 -0.008 0.002 0.050 0.033 0.042
 LM0= 0.152
 WordPenalty0= -0.097
 TranslationModel0= 0.098 0.065 -0.003 0.060 0.156
 id=de ignore-ff=LM0,LexicalReordering0 ignore-decoding-path=0
 LexicalReordering1= 0.013 -0.012 0.053 0.116 0.006 0.080
 Distortion0= 0.171
 LM1= 0.136
 WordPenalty0= 0.060
 TranslationModel1= 0.112 0.160 -0.001 0.067 0.006

With this model, you can translate:

 <seg weight-setting=de>Hier ist ein kleines Haus .</seg>
 <seg weight-setting=fr>C' est une petite maison . </seg>

Modified Moore-Lewis Filtering

When you have a lot of out-of-domain data and you do not want to use all of it, you can filter that data down to the parts that are most similar to the in-domain data. Moses implements a method called modified Moore-Lewis filtering. The method trains in-domain and out-of-domain language models and removes sentence pairs that receive relatively low scores from the in-domain models. For more details, please refer to the following paper:

Axelrod, Amittai, He, Xiaodong and Gao, Jianfeng: Domain Adaptation via Pseudo In-Domain Data Selection, Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP).
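To make the scoring concrete, here is a toy sketch of the selection criterion: the per-word cross-entropy difference between in-domain and out-of-domain language models, with unigram dictionaries standing in for real n-gram LMs (names and the smoothing floor are illustrative):

```python
import math

def mml_score(sentence, lm_in, lm_out, floor=1e-9):
    """Cross-entropy difference H_in(s) - H_out(s); lower means more
    in-domain-like.  lm_in / lm_out map words to probabilities."""
    n = len(sentence)
    h_in = -sum(math.log(lm_in.get(w, floor)) for w in sentence) / n
    h_out = -sum(math.log(lm_out.get(w, floor)) for w in sentence) / n
    return h_in - h_out

def select_proportion(corpus, lm_in, lm_out, proportion):
    """Keep the given proportion of sentences, ranked by mml_score."""
    ranked = sorted(corpus, key=lambda s: mml_score(s, lm_in, lm_out))
    return ranked[: int(len(ranked) * proportion)]
```

The "modified" variant of Moore-Lewis applies this criterion on both sides of the sentence pair, summing the source and target scores.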

The Moses implementation is integrated into EMS. You have to specify in-domain and out-of-domain corpora in separate CORPUS sections (you can have more than one of each), and then set in the configuration file which out-of-domain corpora need to be filtered:

 ### filtering some corpora with modified Moore-Lewis
 mml-filter-corpora = giga
 mml-before-wa = "-proportion 0.2"
 #mml-after-wa = "-proportion 0.2"

The filtering can be done at two different points: either before or after word alignment. There may be some benefit in having the out-of-domain data available to improve word alignment, but that may also be computationally too expensive. In the configuration file, you specify the proportion of the out-of-domain data that will be retained: in the example above, 20% will be kept and 80% will be thrown out.

Page last modified on October 09, 2015, at 11:33 AM