machine translation



Generating n-Best Lists

The generation of n-best lists (the top n translations found by the search according to the model) is pretty straight-forward. You simple have to specify the file where the n-best list will be stored and the size of the n-best list for each sentence.

Example: The command

 % moses -f moses.ini -n-best-list listfile 100 < in

stores the n-best list in the file listfile with up to 100 translations per input sentence.

Here an example n-best list:

 0 ||| we must discuss on greater vision .  ||| d: 0 -5.56438 0 0 -7.07376 0 0 \
   lm: -36.0974 -13.3428 tm: -39.6927 -47.8438 -15.4766 -20.5003 4.99948 w: -7 ||| -9.2298
 0 ||| we must also discuss on a vision .  ||| d: -10 -2.3455 0 -1.92155 -3.21888 0 -1.51918 \
   lm: -31.5841 -9.96547 tm: -42.3438 -48.4311 -18.913 -20.0086 5.99938 w: -8 ||| -9.26197
 0 ||| it is also discuss a vision .  ||| d: -10 -1.63574 -1.60944 -2.70802 -1.60944 -1.94589 -1.08417 \
   lm: -31.9699 -12.155 tm: -40.4555 -46.8605 -14.3549 -13.2247 4.99948 w: -7 ||| -9.31777

Each line of the n-best list file is made up of (separated by |||):

  • sentence number (in above example 0, the first sentence)
  • output sentence
  • individual component scores (unweighted)
  • weighted overall score

Note that it is possible (and very likely) that the n-best list contains many sentences that look the same on the surface, but have different scores. The most common reason for this is different phrase segmentation (two words may be mapped by a single phrase mapping, or by two individual phrase mappings for each word).

To produce an n-best list that only contains the first occurrence of an output sentence, add the word distinct after the file and size specification:

 % moses -f moses.ini -n-best-list listfile 100 distinct < in

This creates an n-best list file that contains up to 100 distinct output sentences for each input sentences. Note that potentially a large numbers of candidate translations have to be examined to find the top 100. To keep memory usage in check only 20 times the specified number of distinct entries are examined. This factor can be changed with the switch -n-best-factor.


  • -n-best-list FILE SIZE [distinct] --- output an n-best list of size SIZE to file FILE
  • -n-best-factor FACTOR --- exploring at most FACTOR*SIZE candidates for distinct
  • -print-alignment-info-in-n-best --- output of word-to-word alignments in the n-best list; it requires that w2w alignments are included in the phrase table. (See here for further details).

Minimum Bayes Risk Decoding

Minumum Bayes Risk (MBR) decoding was proposed by Kumar and Byrne (HLT/NAACL 2004). Roughly speaking, instead of outputting the translation with the highest probability, MBR decoding outputs the translation that is most similar to the most likely translations. This requires a similarity measure to establish similar. In Moses, this is a smoothed BLEU score.

Using MBR decoding is straight-forward, just use the switch -mbr when invoking the decoder.


 % moses -f moses.ini -mbr < in

MBR decoding uses by default the top 200 distinct candidate translations to find the translation with minimum Bayes risk. If you want to change this to some other number, use the switch -mbr-size:

 % moses -f moses.ini -decoder-type 1 -mbr-size 100 < in

MBR decoding requires that the translation scores are converted into probabilities that add up to one. The default is to take the log-scores at face value, but you may get better results with scaling the scores. This may be done with the switch -mbr-scale, so for instance:

 % moses -f moses.ini -decoder-type 1 -mbr-scale 0.5 < in


  • -mbr -- use MBR decoding
  • -mbr-size SIZE -- number of translation candidates to consider (default 200)
  • -mbr-scale SCALE -- scaling factor used to adjust the translation scores (default 1.0)

Note: MBR decoding and its variants is currently only implemented for the phrase-based decoder, not the chart decoder.

Lattice MBR and Consensus Decoding

These are extensions to MBR which may run faster or give better results. For more details see Tromble et al (2008), Kumar et al (2009) and De Nero et al (2009). The n-gram posteriors (required for Lattice MBR) and the ngram expectations (for Consensus decoding) are both calculated using an algorithm described in De Nero et al (2010). Currently both lattice MBR and consensus decoding are implemented as n-best list rerankers, in other words the hypothesis space is an n-best list (not a lattice).

Here's the list of options which affect both Lattice MBR and Consensus decoding.


  • -lmbr -- use Lattice MBR decoding
  • -con -- use Consensus decoding
  • -mbr-size SIZE -- as for MBR
  • -mbr-scale SCALE -- as for MBR
  • -lmbr-pruning-factor FACTOR -- mean words per node in pruned lattice, as described in Tromble et al (2008) (default 30)

Lattice MBR has several further parameters which are described in the Tromble et al 2008 paper.


  • -lmbr-p P -- The unigram precision (default 0.8)
  • -lmbr-r R -- The ngram precision ratio (default 0.6)
  • -lmbr-thetas THETAS Instead of specifying p and r, lattice MBR can be configured by specifying all the ngram weights and the length penalty (5 numbers). This is described fully in the references.
  • -lmbr-map-weight WEIGHT The weight given to the map hypothesis (default 0)

Since Lattice MBR has so many parameters, a utility to perform a grid search is provided. This is in moses-cmd/src and is called lmbrgrid. A typical usage would be

 % ./lmbrgrid -lmbr-p 0.4,0.6,0.8 -lmbr-r 0.4,0.6,0.8 -mbr-scale 0.1,0.2,0.5,1 -lmbr-pruning-factor   \
      30 -mbr-size 1000 -f moses.ini -i input.txt

In other words, the same Lattice MBR parameters as for Moses are used, but this time a comma separated list can be supplied. Each line in the output takes the following format:

 <sentence-id> ||| <p> <r> <pruning-factor> <scale> ||| <translation>

In the Moses Lattice MBR experiments that have been done to date, lattice MBR showed small overall improvements on a NIST Arabic data set (+0.4 over map, +0.1 over MBR), once the parameters were chosen carefully. Parameters were optimized by grid search on 500 sentences of held-out, and the following were found to be optimal

 -lmbr-p 0.8 -lmbr-r 0.8 -mbr-scale 5 -lmbr-pruning-factor 50

Output Search Graph

It may be useful for many downstream applications to have a dump of the search graph, for instance to compile a word lattice. One the one hand you can use the -verbose 3 option, which will give a trace of all generated hypotheses, but this creates logging of many hypotheses that get immediately discarded. If you do not want this, a better option is using the switch -output-search-graph FILE, which also provides some additional information.

The generated file contains lines that could be seen as both a dump of the states in the graph and the transitions in the graph. The state graph more closely reflects the hypotheses that are generated in the search. There are three types of hypotheses:

  • The initial empty hypothesis is the only one that is not generated by a phrase translation
 0 hyp=0 stack=0 [...]
  • Regular hypotheses
 0 hyp=17 stack=1 back=0 score=-1.33208 [...] covered=0-0 out=from now on
  • Recombined hypotheses
 0 hyp=5994 stack=2 back=108 score=-1.57388 [...] recombined=13061 [...] covered=2-2 out=be

The relevant information for viewing each line as a state in the search graph is the sentence number (initial 0), the hypothesis id (hyp), the stack where the hypothesis is placed (same as number of foreign words covered, stack), the back-pointer to the previous hypotheses (back), the score so far (score), the last output phrase (out) and that phrase's foreign coverage (covered). For recombined hypotheses, also the superior hypothesis id is given (recombined).

The search graph output includes additional information that is computed after the fact. While the back-pointer and score (back, score) point to the cheapest path and cost to the beginning of the graph, the generated output also included the pointer to the cheapest path and score (forward, fscore) to the end of the graph.

One way to view the output of this option is a reflection of the search and all (relevant) hypotheses that are generated along the way. But often, we want to generate a word lattice, where the states are less relevant, but the information is in the transitions from one state to the next, each transition emitting a phrase at a certain cost. The initial empty hypothesis is irrelevant here, so we need to consider only the other two hypothesis types:

  • Regular hypotheses
 0 hyp=17 [...] back=0 [...] transition=-1.33208 [...] covered=0-0 out=from now on 
  • Recombined hypotheses
 0 [...] back=108 [...] transition=-0.640114 recombined=13061 [...] covered=2-2 out=be

For the word lattice, the relevant information is the cost of the transition (transition), its output (out), maybe the foreign coverage (covered), and the start (back) and endpoint (hyp). Note that the states generated by recombined hypothesis are ignored, since the transition points to the superior hypothesis (recombined).

Here, for completeness sake, the full lines for the three examples we used above:

 0 hyp=0 stack=0 forward=9 fscore=-107.279
 0 hyp=17 stack=1 back=0 score=-1.33208 transition=-1.33208 \
   forward=517 fscore=-106.484 covered=0-0 out=from now on 
 0 hyp=5994 stack=2 back=108 score=-1.57388 transition=-0.640114 \
   recombined=13061 forward=22455 fscore=-106.807 covered=2-2 out=be

When using the switch -output-search-graph-extended (or short: -osgx), a detailed score breakdown is provided for each line. The format is the same as in the n-best list.

What is the difference between the search graph output file generated with this switch and the true search graph?

  • It contains the additional forward costs and forward paths.
  • It also only contains the hypotheses that are part of a fully connected path from the initial empty hypothesis to a final hypothesis that covers the full foreign input sentence.
  • The recombined hypotheses already point to the correct superior hypothesis, while the -verbose 3 log shows the recombinations as they happen (recall that momentarily superior hypotheses may be recombined to even better ones down the road).

Note again that you can get the full search graph with the -verbose 3 option. It is, however, much larger and mostly consists of discarded hypotheses.


  • -output-search-graph FILE -- output the search graph for each sentence in a file
  • -output-search-graph-extended FILE -- output the search graph for each sentence in a file, with detailed feature breakdown

Early Discarding of Hypotheses

During the beam search, many hypotheses are created that are too bad to be even entered on a stack. For many of them, it is even clear before the construction of the hypothesis that it would be not useful. Early discarding of such hypotheses hazards a guess about their viability. This is based on correct score except for the actual language model costs which are very expensive to compute. Hypotheses that, according to this estimate, are worse than the worst hypothesis of the target stack, even given an additional specified threshold as cushion, are not constructed at all. This often speeds up decoding significantly. Try threshold factors between 0.5 and 1.


  • -early-discarding-threshold THRESHOLD -- use early discarding of hypotheses with the specified threshold (default: 0 = not used)

Maintaining stack diversity

The beam search organizes and compares hypotheses based on the number of foreign words they have translated. Since they may have different foreign words translated, we use future score estimates about the remaining sentence translation score.

Instead of comparing such apples and oranges, we could also organize hypotheses by their exact foreign word coverage. The disadvantage of this is that it would require an exponential number of stacks, but with reordering limits the number of stacks is only exponential with regard to maximum reordering distance.

Such coverage stacks are implemented in the search, and their maximum size is specified with the switch -stack-diversity (or -sd), which sets the maximum number of hypotheses per coverage stack.

The actual implementation is a hybrid of coverage stacks and foreign word count stacks: the stack diversity is a constraint on which hypotheses are kept on the traditional stack. If the stack diversity limits leave room for additional hypotheses according to the stack size limit (specified by -s, default 200), then the stack is filled up with the best hypotheses, using score so far and the future score estimate.


  • -stack-diversity LIMIT -- keep a specified number of hypotheses for each foreign word coverage (default: 0 = not used)

Cube Pruning

Cube pruning, as described by Huang and Chiang (2007), has been implemented in the Moses decoder. This is in addition to the traditional search algorithm. The code offers developers the opportunity to implement different search algorithms using an extensible framework.

Cube pruning is faster than the traditional search at comparable levels of search errors. To get faster performance than the default Moses setting at roughly the same performance, use the parameter settings:

 -search-algorithm 1 -cube-pruning-pop-limit 2000 -s 2000

This uses cube pruning (-search-algorithm) that adds 2000 hypotheses to each stack (-cube-pruning-pop-limit 2000) and also increases the stack size to 2000 (-s 2000). Note that with cube pruning, the size of the stack has little impact on performance, so it should be set rather high. The speed/quality trade-off is mostly regulated by the cube pruning pop limit, i.e. the number of hypotheses added to each stack. To guarantee deterministic output for a minor (about 3%) slowdown, use the -cube-pruning-deterministic-search switch.

Stacks are organized by the number of foreign words covered, so they may differ by which words are covered. You may also require that a minimum number of hypotheses is added for each word coverage (they may be still pruned out, however). This is done using the switch -cube-pruning-diversity MINIMUM which sets the minimum. The default is 0.


  • -search-algorithm 1 -- turns on cube pruning
  • -cube-pruning-pop-limit LIMIT -- number of hypotheses added to each stack
  • -cube-pruning-diversity MINIMUM -- minimum number of hypotheses from each coverage pattern
  • -cube-pruning-deterministic-search -- use deterministic tie breaking during search
Edit - History - Print
Page last modified on November 06, 2015, at 07:15 PM