Smoothing Phrase-Tables

In Moses, no smoothing is applied to the relative frequencies stored in the phrase-table. Moreover, given the way phrase-pairs are collected, no distinction is made between long and short translations of a given phrase. The idea is to apply to the phrase-table entries the standard smoothing methods that have been successfully applied to n-gram language models. This project will follow the work by Foster et al. (2006). In addition, we plan to add a simple "translation" length model to the phrase probabilities to better discriminate between short, average and long phrase translations. This information should nicely complement the length-penalty features of Moses, which are instead computed at the global sentence level.

Involved programming:

  • C++: to modify the estimation of phrase-table scores
  • Perl: to prepare regression tests and case studies

Possible activities concern the implementation and evaluation of smoothing methods on the phrase-table and, possibly, on the hierarchical reordering model developed in the parallel project. In particular:

  1. Comparison of interpolation and back-off smoothing schemes. Interpolation adds a less specific (lower-order) distribution to the discounted relative frequency. Foster et al. (2006) suggest using the 1-gram frequency, that is, P(F|E) = discounted-freq(F|E) + total-discount(E) x P(F).
  2. Implementation of the Kneser-Ney, improved Kneser-Ney and Good-Turing discounting methods. To cope with the well-known problem of the GT method with higher-order counts, instead of taking the log-linear fit, there is also a variant that only discounts frequencies up to a given threshold (e.g. 7).
  3. Include a length model in the phrase-table probability: e.g. P(F,m|E,n)=P(m|E,n)x P(F|m,E) or P(F,m|E,n)=P(m|n)x P(F|m,E).
  4. Comparison of different lexicalized phrase probabilities: the one implemented in Moses, an exact IBM model, and the Zens-Ney model explained in Foster et al. (2006).
  5. Comparison of log-linear versus linear interpolation of the phrase-based probabilities and the lexicalized phrase probabilities. In the latter case, replace the unigram back-off distribution with the lexicalized probability. This also makes it possible to reduce the number of decoder features. The interpolation weight should follow from the discounting method employed.
  6. Study the interaction of smoothing with the phrase-table pruning procedure.
  7. Find a proper likelihood measure for the phrase-table that could be used to optimize new smoothing methods or to interpolate phrase-tables for the sake of adaptation. E.g. the probability of the set of phrase-pairs in the test set? (What about the TM entropy used in the Koehn et al. 2009 MT Summit paper?)
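
As an illustration of the interpolation scheme in item 1, here is a minimal Python sketch that uses absolute discounting with a fixed discount D. It is not the project's C++ implementation: the function name `make_interpolated_model`, the dictionary layout of the counts and the value D = 0.5 are assumptions made for this example only. The total discount for E is D times the number of distinct F seen with E, divided by the count of E, so that the probabilities over all F still sum to one.

```python
from collections import Counter


def make_interpolated_model(pair_counts, discount=0.5):
    """Build P(F|E) = max(c(F,E) - D, 0) / c(E) + alpha(E) * P(F).

    pair_counts maps (F, E) phrase pairs to their extraction counts.
    alpha(E) is the probability mass removed from E by discounting.
    """
    target_totals = Counter()      # c(E)
    source_totals = Counter()      # c(F), used for the 1-gram distribution P(F)
    types_per_target = Counter()   # number of distinct F observed with E
    for (f, e), count in pair_counts.items():
        target_totals[e] += count
        source_totals[f] += count
        types_per_target[e] += 1
    grand_total = sum(source_totals.values())

    def prob(f, e):
        c_e = target_totals[e]
        if c_e == 0:
            raise KeyError("unseen target phrase: %r" % (e,))
        discounted = max(pair_counts.get((f, e), 0) - discount, 0.0) / c_e
        alpha = discount * types_per_target[e] / c_e
        unigram = source_totals[f] / grand_total
        return discounted + alpha * unigram

    return prob


# Toy counts, invented for the example.
counts = {
    ("la maison", "the house"): 3,
    ("maison", "the house"): 1,
    ("maison", "house"): 4,
}
p = make_interpolated_model(counts, discount=0.5)
```

With these toy counts, a pair that was never extracted, such as ("la maison", "house"), receives a small non-zero probability from the 1-gram term, which is the intended effect of interpolation.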

Experimental Results

Impact of pruning on smoothing. Three experiments were run; the original phrase table has 5398666 lines.

  • Pruning:

Pruning is usually done on the phrase table; here we can apply it to the output of memscore. The pruning is the one described in the Moses manual under "Pruning the Translation Table" in "Advanced Features", with parameters -l a+e -n 30 (1-1-1 phrase pairs are discarded and only the first 30 candidates are kept for each phrase). Pruning the phrase table (basic scoring):

BLEU = 15.09 (resulting phrase table: 294125 lines, 5.45% of the original)

  • Filtering-extract:

Another option is to filter the "extract" before scoring (advantage: it reduces memory usage). Here we used an extremely naive algorithm: we discard the entries that appear only once in the extract, then we score:

BLEU = 15.01 (resulting phrase table: 230518 lines, 4.27% of the original)

  • Filtering-extract+smoothing+pruning

A last option: filter the extract, smooth it, then prune the resulting phrase table:

BLEU = 14.89 (resulting phrase table: 181592 lines, 3.36% of the original)

Conclusion: Pruning is an option to keep the phrase table small. Filtering is a good option when the extract is too big to fit in memory (big training corpus), because memscore has to load everything into memory.
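
The naive extract filter described above (discard the entries appearing only once) can be sketched in Python as follows. This is not the script used in the experiments. It assumes that each extract line has the form "source ||| target ||| alignment" and that two lines count as the same entry when their source and target fields match. The names `phrase_pair_key` and `filter_extract` are hypothetical.

```python
from collections import Counter


def phrase_pair_key(line):
    """Return the (source, target) fields of a '|||'-separated extract line."""
    fields = [field.strip() for field in line.split("|||")]
    return (fields[0], fields[1])


def filter_extract(lines, min_count=2):
    """Keep only the lines whose phrase pair occurs at least min_count times.

    lines must be a list, since it is traversed twice: once to count the
    phrase pairs and once to select the lines to keep.
    """
    counts = Counter(phrase_pair_key(line) for line in lines)
    return [line for line in lines if counts[phrase_pair_key(line)] >= min_count]
```

All occurrences of a repeated pair are kept, so the counts seen by the scorer for the surviving pairs are unchanged.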

Schedule of meetings


  • Marcello Federico (FBK-irst Italy) SF user: mfederico; email: federico [at]
  • Pascual Martinez-Gomez (UPV Spain); email: pmartinez [at] dsic [dot] upv [dot] es
  • Felipe Sánchez-Martínez (U. Alicante, Spain) SF user: sanmarf - added; e-mail: fsanchez [at] dlsi [dot] ua [dot] es
  • Tsuyoshi Okita (CNGL Ireland) sourceforge id: tokita2 - added, mail: tokita[at]
  • Joern Wuebker (RWTH Germany), sourceforge id: joewue - added, mail: joern.wuebker[at]
  • Bruno Pouliquen (WIPO), sourceforge id: poulique - added, mail: Bruno.Pouliquen[at]


TODO (put names!)

  1. Implement IBM1 and noisy-or lexical probabilities (done) -- Felipe Sánchez-Martínez
    • IBM1 does not need to be implemented, lexical weights produced by Moses will be used instead (based on the word alignments after symmetrisation).
  2. Implement Kneser-Ney and Good-Turing discounting methods -- Tsuyoshi Okita, Pascual Martínez-Gómez (Kneser-Ney and modified Kneser-Ney)
  3. Implement interpolation and back-off schemes with discounting, using as lower-order prob, either the 1-gram frequency (with corrected counts) or the lexical phrase probabilities. (partly done) -- Felipe Sánchez-Martínez
    • Implemented interpolation of absolute discounting with a lower-order model based on the "noisy-or" combination (Zens and Ney, 2004) for p(s_j|t) and a length model p(J|I): p(s|t) = f(s|t) + \alpha(t) * \prod_{j=1}^{J}( p(s_j|t) ) * p(J|I); where s and t are source and target phrases, s_j is a source word, J is the length of s, I is the length of t, f(s|t) is the relative frequency smoothed by absolute discounting, and \alpha(t) is the normalisation factor.
  4. Investigate how phrase-table pruning interacts with frequency smoothing -- Bruno Pouliquen
  5. Investigate how smoothing methods perform under different combinations
  6. Investigate how linear interpolation and log-linear interpolation perform when combining smoothed/unsmoothed frequencies and lexical phrase probabilities. (This means reducing the number of phrase-table features from four to two.)
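
The lower-order model of TODO item 3 can be illustrated with a short Python sketch. It is a reading of the formula p(s|t) = f(s|t) + \alpha(t) * \prod_{j=1}^{J}( p(s_j|t) ) * p(J|I) under stated assumptions, not the code written for the project. The noisy-or combination is taken to be p(s_j|t) = 1 - \prod_{i=1}^{I}( 1 - p(s_j|t_i) ). The word-level table `lex`, the length table `length_model` and all function names are hypothetical, and the numbers are invented.

```python
def noisy_or_word_prob(source_word, target_words, lex):
    """p(s_j|t) = 1 - prod_i (1 - p(s_j|t_i)); missing entries count as 0."""
    miss = 1.0
    for target_word in target_words:
        miss *= 1.0 - lex.get((source_word, target_word), 0.0)
    return 1.0 - miss


def lower_order_prob(source_words, target_words, lex, length_model):
    """prod_j p(s_j|t) * p(J|I), with J = len(source) and I = len(target)."""
    prob = length_model.get((len(source_words), len(target_words)), 0.0)
    for source_word in source_words:
        prob *= noisy_or_word_prob(source_word, target_words, lex)
    return prob


def interpolated_prob(discounted_freq, alpha, source_words, target_words,
                      lex, length_model):
    """p(s|t) = f(s|t) + alpha(t) * lower-order probability."""
    return discounted_freq + alpha * lower_order_prob(
        source_words, target_words, lex, length_model)


# Toy tables, invented for the example.
lex = {("la", "the"): 0.5, ("maison", "house"): 0.8, ("maison", "the"): 0.1}
length_model = {(2, 2): 0.6}   # p(J=2 | I=2)
p_st = interpolated_prob(0.625, 0.25, ["la", "maison"], ["the", "house"],
                         lex, length_model)
```

Here the discounted frequency 0.625 and the weight 0.25 stand in for f(s|t) and \alpha(t), which in practice come from the discounting step.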
Page last modified on January 29, 2010, at 07:33 PM