Moses supports multi-threaded operation, enabling faster decoding on multi-core machines.
Multi-threaded Moses is now built by default. If you omit the -threads argument, then Moses will use a single worker thread, plus a thread to read the input stream. The argument -threads n specifies a pool of n threads, and -threads all will use all the cores on the machine.
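For example, a typical multi-threaded invocation might look like this (file names are illustrative):

bin/moses -f moses.ini -threads all < input.txt > output.txt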
The single most important thing you need to run Moses fast is MEMORY. Lots of MEMORY. (For example, the Edinburgh group has servers with 144GB of RAM.) The rest of this section is just details of how to make training and decoding run fast.
Calculate total file size of the binary phrase tables, binary language models and binary reordering models.
For example,
% ll -h phrase-table.0-0.1.1.binphr.*
-rw-r--r-- 1 s0565741 users 157K 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.idx
-rw-r--r-- 1 s0565741 users 5.4M 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.srctree
-rw-r--r-- 1 s0565741 users 282K 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.srcvoc
-rw-r--r-- 1 s0565741 users 1.1G 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.tgtdata
-rw-r--r-- 1 s0565741 users 1.7M 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.tgtvoc

% ll -h reordering-table.1.wbe-msd-bidirectional-fe.binlexr.*
-rw-r--r-- 1 s0565741 users 157K 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.idx
-rw-r--r-- 1 s0565741 users 1.1G 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.srctree
-rw-r--r-- 1 s0565741 users 1.1G 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.tgtdata
-rw-r--r-- 1 s0565741 users 282K 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.voc0
-rw-r--r-- 1 s0565741 users 1.7M 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.voc1

% ll -h interpolated-binlm.1
-rw-r--r-- 1 s0565741 users 28G 2012-06-15 11:07 interpolated-binlm.1
The total size of these files is approx. 31GB. Therefore, a translation system using these models requires 31GB (+ roughly 500MB) of memory to run fast.
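Rather than adding up the sizes by hand, du can report the total directly (file names as in the example above; du reports disk usage, which is close enough for this estimate):

du -ch phrase-table.0-0.1.1.binphr.* \
   reordering-table.1.wbe-msd-bidirectional-fe.binlexr.* \
   interpolated-binlm.1 | tail -1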
Run this:
cat phrase-table.0-0.1.1.binphr.* > /dev/null
cat reordering-table.1.wbe-msd-bidirectional-fe.binlexr.* > /dev/null
cat interpolated-binlm.1 > /dev/null
This forces the operating system to cache the binary models in memory, minimizing page faults while the decoder is running. Other memory-intensive processes should not be running on the machine, otherwise the file-system cache may be reduced.
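On Linux you can verify that the models are now resident in the file-system cache by comparing the buff/cache column of free before and after running the cat commands (output format varies across distributions):

free -h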
Moses does a lot of random lookups. If you're running Linux, check that transparent huge pages are enabled. If
cat /sys/kernel/mm/transparent_hugepage/enabled
responds with
[always] madvise never
then transparent huge pages are enabled.
On some RedHat/CentOS systems, the file is /sys/kernel/mm/redhat_transparent_hugepage/enabled and madvise will not appear. If neither file exists, upgrade the kernel to at least 2.6.38 and compile with CONFIG_SPARSEMEM_VMEMMAP. If the file exists, but the square brackets are not around "always", then run
echo always > /sys/kernel/mm/transparent_hugepage/enabled
as root (NB: to use sudo, quote the > character). This setting will not be preserved across reboots, so consider adding it to an init script.
See the manual section on the binarized and compact phrase tables for a description of how to compact your phrase tables. Everything said above for the standard binary phrase table also holds for the compact versions: the total size of the binary files determines your memory usage. However, since the combined size of the compact phrase table and the compact reordering model may be 10 to 12 times smaller than with the original binary implementations, you save correspondingly much memory. You can also use the --minphr-memory and --minlexr-memory options to load the tables into memory at Moses start-up instead of relying on the caching trick described above. This adds some warm-up time, but may save a lot of time in the long run. If you are concerned about performance, see Junczys-Dowmunt (2012) for a comparison: there is virtually no overhead from on-the-fly decompression on large-memory systems, and a considerable speed-up on systems with limited memory.
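For example, a decoder invocation that loads the compact tables into memory at start-up might look like this (assuming moses.ini already points at the compact phrase and reordering tables; file names are illustrative):

bin/moses -f moses.ini --minphr-memory --minlexr-memory < input.txt > output.txt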
The decoder can run on very little memory, about 200-300MB for phrase-based and 400-500MB for hierarchical decoding (according to Hieu). The decoder can run on an iPhone! And laptops.
However, it will be VERY slow, unless you have very small models or the models are on fast disks such as flash disks.
When word aligning, using MGIZA with multiple threads significantly speeds up word alignment.
To use MGIZA with multiple threads in the Moses training script, add these arguments:
.../train-model.perl -mgiza -mgiza-cpus 8 ....
To enable it in the EMS, add this to the [TRAINING] section
[TRAINING]
training-options = "-mgiza -mgiza-cpus 8"
When running GIZA++ or MGIZA, the first stage involves running a program called snt2cooc. This requires approximately 6GB+ of memory for typical Europarl-size corpora (1.8 million sentences). For users without this amount of memory on their computers, an alternative version is included in MGIZA:
snt2cooc.pl
To use this script, you must copy two files to the same place where snt2cooc is run:
snt2cooc.pl
snt2coocrmp
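For example, assuming the two files live in MGIZA's scripts directory and snt2cooc is run from the external-bin-dir passed to train-model.perl (both paths are illustrative), the copy might look like:

cp /path/to/mgiza/scripts/snt2cooc.pl /path/to/mgiza/scripts/snt2coocrmp /path/to/external-bin-dir/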
Add this argument when running the Moses training script:
.../train-model.perl -snt2cooc snt2cooc.pl
Once word alignment is completed, the phrase table is created from the aligned parallel corpus. There are two main ways to speed up this part of the training process.
Firstly, the training corpus and alignment can be split and phrase pairs from each part extracted simultaneously. This can be done simply by using the -cores argument, e.g.,
.../train-model.perl -cores 4
Secondly, the Unix sort command is often executed during training. It is essential to optimize this command to make use of the available disk and CPU. For example, recent versions of sort can take the following arguments:
sort -S 10G --batch-size 253 --compress-program gzip --parallel 5
The Moses training script exposes these settings through the following arguments:
.../train-model.perl -sort-buffer-size 10G -sort-batch-size 253 \
   -sort-compress gzip -sort-parallel 5
You should set these arguments. However, DO NOT just blindly copy the above settings; they must be tuned to the particular computer you are running on. The most important issues are:
* The --parallel, --compress-program, and --batch-size arguments have only recently been added to the sort command, so check that your version supports them (a quick check is shown below).
* -sort-buffer-size: take into account the other programs running on the computer. Also, two or three sort programs will run simultaneously (one to sort the extract file, one to sort extract.inv, and one to sort extract.o). If there is not enough memory because you have set sort-buffer-size too high, your entire computer will likely crash.
* The maximum value of --batch-size is OS-dependent. For example, it is 1024 on Linux, 253 on older Mac OS X, and 2557 on newer OS X.
* --compress-program can occasionally result in the following timeout error:
gsort: couldn't create process for gzip -d: Operation timed out
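Before enabling these options, you can quickly check whether your installed sort supports them:

sort --version
sort --help | grep -E 'parallel|batch-size|compress-program'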
In summary, to maximize speed on a large server with many cores and up-to-date software, add this to your training script:
.../train-model.perl -mgiza -mgiza-cpus 8 -cores 10 \
   -parallel -sort-buffer-size 10G -sort-batch-size 253 \
   -sort-compress gzip -sort-parallel 10
To run on a laptop with limited memory:
.../train-model.perl -mgiza -mgiza-cpus 2 -snt2cooc snt2cooc.pl \
   -parallel -sort-batch-size 253 -sort-compress gzip
In the EMS, for large servers, this can be done by adding:
[TRAINING]
script = $moses-script-dir/training/train-model.perl
training-options = "-mgiza -mgiza-cpus 8 -cores 10 \
   -parallel -sort-buffer-size 10G -sort-batch-size 253 \
   -sort-compress gzip -sort-parallel 10"
parallel = yes
For servers with older OSes, and therefore older sort commands:
[TRAINING]
script = $moses-script-dir/training/train-model.perl
training-options = "-mgiza -mgiza-cpus 8 -cores 10 -parallel"
parallel = yes
For laptops with limited memory:
[TRAINING]
script = $moses-script-dir/training/train-model.perl
training-options = "-mgiza -mgiza-cpus 2 -snt2cooc snt2cooc.pl \
   -parallel -sort-batch-size 253 -sort-compress gzip"
parallel = yes
Convert your language model to binary format. This reduces loading time and provides more control.
See the KenLM web site for the time-memory tradeoff presented by the KenLM data structures. Use bin/build_binary (found in the same directory as moses and moses_chart) to convert ARPA files to the binary format. You can preview memory consumption with:
bin/build_binary file.arpa
This preview includes only the language model's memory usage, which is in addition to the phrase table etc. For speed, use the default probing data structure.
bin/build_binary file.arpa file.binlm
To save memory, change to the trie data structure:
bin/build_binary trie file.arpa file.binlm
To further losslessly compress the trie ("chop" in the benchmarks), use -a 64, which will compress pointers to a depth of up to 64 bits:
bin/build_binary -a 64 trie file.arpa file.binlm
Note that you can also make this parameter smaller, which will be faster but use more memory. Quantization will make the trie smaller at the expense of accuracy. You can choose any number of bits from 2 to 25, for example 10:
bin/build_binary -a 64 -q 10 trie file.arpa file.binlm
Note that quantization can be used independently of -a.
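For example, to quantize to 10 bits without pointer compression:

bin/build_binary -q 10 trie file.arpa file.binlm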
By default, language models fully load into memory at the beginning. If you are short on memory, you can use on-demand language model loading. The language model must be converted to binary format in advance and should be placed on LOCAL DISK, preferably SSD. For KenLM, you should use the trie data structure, not the probing data structure.
If the LM was binarized using IRSTLM, append .mm to the file name and change the ini file to reflect this, e.g., change
[feature]
IRSTLM .... path=file.lm
to
[feature]
IRSTLM .... path=file.lm.mm
If the LM was binarized using KenLM, add the argument lazyken=true, e.g., change
[feature]
KENLM ....
to
[feature]
KENLM .... lazyken=true
Suffix arrays store the entire parallel corpus and word alignment information in memory, instead of the phrase table. The parallel corpus and alignment files are often much smaller than the phrase table. For example, for Europarl German-English (gzipped files):
de           =  94MB
en           =  84MB
alignment    =  57MB
phrase-based = 2.0GB
hierarchical = 16.0GB
Therefore, it is more memory efficient to store the corpus in memory, rather than the entire phrase-table. This is usually structured as a suffix array to enable fast extraction of translations.
Translations are extracted as needed, usually per input test set, or per input sentence.
Moses supports two different implementations of suffix arrays, one for phrase-based models and one for hierarchical models (AdvancedFeatures#ntoc43).
Cube pruning limits the number of hypotheses created for each stack (or chart cell in chart decoding). It is essential for chart decoding (otherwise decoding will take a VERY long time) and an option in phrase-based decoding.
In the phrase-based decoder, add:
[search-algorithm]
1

[cube-pruning-pop-limit]
500
There is a speed-quality tradeoff: a lower pop limit means less work for the decoder, so faster decoding but less accurate translation.
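The same settings can also be given on the decoder command line instead of in the ini file; the flag names mirror the ini keys (values here are just the ones from the example above, and the file names are illustrative):

bin/moses -f moses.ini -search-algorithm 1 -cube-pruning-pop-limit 500 < input.txt > output.txt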
TODO: MGIZA with reduced-memory snt2cooc
The biggest consumers of memory during decoding are typically the models. Here are some links on how to reduce the size of each.
Language model:
* use KenLM with the trie data structure Moses.Optimize#ntoc14
* use on-demand loading Moses.Optimize#ntoc15
Translation model:
* use phrase table pruning Advanced.RuleTables#ntoc5
* use a compact phrase table http://www.statmt.org/moses/?n=Advanced.RuleTables#ntoc3
* filter the translation model given the text you want to translate Moses.SupportTools#ntoc3
Reordering model:
* similar techniques to those for translation models are possible: pruning Advanced.RuleTables#ntoc3, compact tables Advanced.RuleTables#ntoc4, and filtering Moses.SupportTools#ntoc3 (an example filter command is shown below).
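For example, filtering the models to a specific input set is done with the filter script shipped with Moses (directory and file names are illustrative):

scripts/training/filter-model-given-input.pl filtered-dir moses.ini input.txt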
These options can be added to the bjam command line, trading generality for performance.
You should do a full rebuild with -a when changing the values of most of these options.
Don't use factors? Add
--max-factors=1
Tailor KenLM's maximum order to only what you need. If your highest-order language model has order 5, add
--kenlm-max-order=5
Turn debug symbols off for more speed and slightly lower memory use.
debug-symbols=off
But don't expect support from the mailing list until you rerun with debug symbols on!
Don't care about debug messages?
--notrace
Download tcmalloc and see BUILD-INSTRUCTIONS.txt in Moses for installation instructions. bjam will automatically detect tcmalloc's presence and link against it for multi-threaded builds.
Install Boost and zlib static libraries. Then link statically:
--static
This may mean you have to install Boost and zlib yourself.
Running single-threaded? Add threading=single.
Using hierarchical or string-to-tree models, but none with source syntax?
--unlabelled-source
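Putting several of these together, a tuned full rebuild for a single-factor system with a 5-gram language model might look like this (the flag values are illustrative and depend on your models; -j8 sets the number of build jobs):

./bjam -a --max-factors=1 --kenlm-max-order=5 --notrace --static -j8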
Moses has multiple phrase table implementations. The one that suits you best depends on the model you're using (phrase-based or hierarchical/syntax), and how much memory your server has.
Here is a complete list of the types:
* Memory - reads the phrase table into memory. For phrase-based models and chart decoding. Note that this is much faster than the Binary and OnDisk phrase table formats, but it uses a lot of RAM.
* Binary - the phrase table is converted into a 'database'; only the translations that are required are loaded into memory. Therefore it requires less memory, but is potentially slower to run. For phrase-based models.
* OnDisk - reimplementation of Binary for chart decoding.
* SuffixArray - stores the parallel training data and word alignment in memory, instead of the phrase table; extraction is done on the fly. Also has a feature where you can add parallel data while the decoder is running ('Dynamic Suffix Array'). For phrase-based models. See Levenberg et al. (2010).
* ALSuffixArray - suffix array for hierarchical models. See Lopez (2008).
* FuzzyMatch - implementation of Koehn and Senellart (2010).
* Hiero - like SCFG, but translation rules are in standard Hiero-style format.
* Compact - for phrase-based models. See Junczys-Dowmunt (2012).