Coupling IRSTLM and RandLM

Recent MT evaluations have shown the importance of being able to handle gigantic language models, e.g. 6-gram LMs trained on several billion words. The project aims at better integrating two LM tools: IRSTLM, which provides effective ways to estimate and query large and accurate LMs, and RandLM, which stores and queries huge LMs very efficiently. Our recent experience has shown that better results can be obtained by interpolating a huge, general-purpose, less accurate LM with a smaller, domain-related, more accurate LM. This project aims at developing the necessary software (in C++ and Perl) to:

  1. Apply RandLM to load and query huge LMs in ARPA format created by IRSTLM. Check how the out-of-vocabulary word probability is managed. Compare the performance (speed and perplexity, PP) of the resulting LM when run with IRSTLM and when run with RandLM. Compare the memory requirements of the two conditions. As an alternative to PP, consider the discriminatory power of the LM: compare the score of the "right" word against the scores of a random set of "wrong" but frequent words.
  2. Train a linear interpolation of RandLM and IRSTLM models. It has to be checked whether the range of scores produced by RandLM is compatible with the LM probabilities computed by IRSTLM.
  3. Extend on-line LM interpolation in Moses to include RandLMs. We should be careful about the way the LM state is generated in the mixture case: the state should represent the longest history used by the involved LMs.
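To make items 2 and 3 more concrete, here is a minimal Python sketch of a two-model linear interpolation. It is only an illustration under assumptions: both models are assumed to return per-word log10 probabilities on the same held-out text (exactly the compatibility that item 2 says must be checked for RandLM), the mixture weight is estimated with a plain EM loop, and all function names are hypothetical. It is not the code used in IRSTLM, RandLM or Moses.

```python
import math


def interpolate_log10(logp_a, logp_b, lam):
    """Mix two log10 scores in the probability domain.

    Assumes both scores are true log10 probabilities (an assumption
    that still has to be verified for RandLM).
    """
    p = lam * 10.0 ** logp_a + (1.0 - lam) * 10.0 ** logp_b
    return math.log10(p)


def train_weight(logps_a, logps_b, iterations=20, lam=0.5):
    """Estimate the weight of model A with EM on held-out word scores."""
    for _ in range(iterations):
        posterior_sum = 0.0
        for la, lb in zip(logps_a, logps_b):
            pa = lam * 10.0 ** la
            pb = (1.0 - lam) * 10.0 ** lb
            # E-step: posterior that model A generated this word.
            posterior_sum += pa / (pa + pb)
        # M-step: the new weight is the average posterior.
        lam = posterior_sum / len(logps_a)
    return lam


def mixture_state_length(history_lengths):
    """The mixture state must cover the longest history used by any LM."""
    return max(history_lengths)
```

With such a helper, the weight would be trained once on held-out data and then reused at decoding time; the state length of the mixture is simply the maximum over the component models, as item 3 requires.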

Experimental results

We present here some simple performance tests we ran to compare the speed of IRSTLM and RandLM. These tests used the whole English Europarl corpus for training (1.4M sentences) and the corpus provided by Sara for the hierarchical reordering project as the test set (18K sentences). The different tokenisation of the two corpora might explain the high perplexity values, but these are not important for our evaluation.

First, we generated five fragments of Europarl (EP), ranging from 20% to 100% of the 1.4M sentences. The goal is to see whether the performance of model querying decreases linearly as the size of the models increases (linearly as well).
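The fragment generation can be sketched as follows. This is a hypothetical reconstruction: we assume here that the fragments are nested prefixes of the corpus (the first 20%, the first 40%, and so on), which the description does not state; the function name is also an assumption.

```python
def make_fragments(sentences, steps=5):
    """Return `steps` nested prefixes: 1/steps, 2/steps, ... of the input.

    Assumption: fragments are prefixes of the corpus, not random samples.
    """
    total = len(sentences)
    fragments = []
    for i in range(1, steps + 1):
        cutoff = (total * i) // steps
        fragments.append(sentences[:cutoff])
    return fragments


# Toy usage with ten placeholder "sentences" instead of Europarl.
toy_corpus = ["sentence %d" % i for i in range(10)]
sizes = [len(fragment) for fragment in make_fragments(toy_corpus)]
# sizes == [2, 4, 6, 8, 10]
```

Nested prefixes guarantee that each larger model has seen everything the smaller ones have, so differences between models can be attributed to the added data only.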

Build time of LM with IRSTLM and RandLM
Evaluation of speed of IRSTLM and RandLM to build a model of increasing size (up to 1.4M sentences).

The graphic shows the building time for LMs with both tools. Both curves are (as expected) linear with respect to the size of the training data. The construction of the randomised model is faster, probably because (a) it builds the model over the counts that were already calculated by IRSTLM and (b) it does not keep all the n-grams it finds (since it is not a lossless n-gram model). The latter might also explain why the time taken to build the model grows more slowly for RandLM than for IRSTLM.

Query time of LM with IRSTLM and RandLM
Evaluation of speed of IRSTLM and RandLM to query a model of increasing size (up to 1.4M sentences) given a fixed-size (18K sentences) test set.

The graphic above shows another interesting result: given a fixed-size test set, we queried models of increasing size to calculate the perplexity of the whole test set. Our main conclusion is that RandLM is roughly twice as slow as IRSTLM. However, since we used the ARPA format for IRSTLM, this difference could be even larger in reality. Indeed, as the size of the model increases, the query time for IRSTLM increases too. This could mean that most of the time is spent loading the model rather than querying it. More sophisticated tests would be needed in order to validate this hypothesis. As for RandLM, query time increases slowly, showing that the size of the model matters less for performance than it does for IRSTLM. The size of the models was not measured, but we estimate that RandLM generates models of more or less constant size, whereas IRSTLM uses all the available information and so builds models whose size grows linearly.
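One way to test the loading-versus-querying hypothesis would be to time the two phases separately. The sketch below is only an assumption about how such a test could look: `load_model` and `score_sentence` are hypothetical callables standing in for whatever wrapper is used around each toolkit, and they do not correspond to any real IRSTLM or RandLM API.

```python
import time


def time_load_and_query(load_model, score_sentence, model_path, sentences):
    """Return (load_seconds, query_seconds) measured separately.

    `load_model(path)` and `score_sentence(model, sentence)` are
    hypothetical wrappers supplied by the caller.
    """
    start = time.perf_counter()
    model = load_model(model_path)
    load_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for sentence in sentences:
        score_sentence(model, sentence)
    query_seconds = time.perf_counter() - start

    return load_seconds, query_seconds
```

If a toolkit loads its model lazily, part of the loading cost would show up in the query figure, so the two numbers should be read with that caveat in mind.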

Perplexity decrease with IRSTLM and RandLM
Evaluation of perplexity decrease using IRSTLM and RandLM given a fixed-size (18K sentences) test set.

The graphic above supports this estimate: while for IRSTLM the perplexity decreases roughly linearly as the model grows, for RandLM the perplexity decreases only slightly. Since perplexity was calculated differently in the two evaluation tools, we normalised it with respect to the perplexity obtained on the smallest EP fragment. This supports the theory that the quality of the RandLM model does not depend on the amount of available data, but on the quantisation and compression parameters (the default parameters were used for these experiments). However, there seems to be a limit for IRSTLM too, as its curve gets more and more horizontal.
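The normalisation described above can be illustrated with a short sketch. The perplexity formula shown is one common definition (10 raised to the negative mean log10 probability) and is not necessarily the one used by either tool; the numbers in the usage lines are made-up placeholders, not our measurements.

```python
def perplexity(log10_probs):
    """Perplexity from per-word log10 probabilities: 10 ** (-mean)."""
    return 10.0 ** (-sum(log10_probs) / len(log10_probs))


def normalise_to_first(perplexities):
    """Express each value relative to the first (smallest-fragment) one."""
    base = perplexities[0]
    return [pp / base for pp in perplexities]


# Placeholder values for three model sizes (NOT measured results).
relative = normalise_to_first([200.0, 150.0, 100.0])
# relative == [1.0, 0.75, 0.5]
```

Dividing by the value for the smallest fragment removes constant differences in how each tool counts words or sentence boundaries, so only the trend across model sizes is compared.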

In conclusion, RandLM does seem to be about twice as slow as IRSTLM, and the size of its model is not as easy to customise as it is with conventional models (SRI, IRST). More experiments should be carried out to find out whether this could be optimised and whether the loss in quality is critical for SMT. We would also need further experiments to know how the two tools behave in terms of memory consumption.

P.S.: All the performance tests described here were run twice to make sure that the results are not influenced by external factors. The hardware configuration is an Intel Core Duo (T6600) 2.2GHz with 3GB RAM.


  • Marcello Federico (FBK-irst Italy)
  • Christian Kohlschein (RWTH Germany)
  • Carlos Ramisch (U. Grenoble France)


There is an implementation of on-line LM interpolation in a Moses branch. It has been implemented for SRILM, IRSTLM and RandLM. It is known to work with IRSTLM, but probably does not work with the other LM implementations; this should be tested. Ask Christian Hardmeier for details. Note that this branch is no longer completely up to date with the trunk, so some changes need to be merged.

Page last modified on January 29, 2010, at 05:44 PM