Scripts are in the scripts subdirectory of the source release in the Git repository.
The following basic tools are described elsewhere:
train-model.perl
clean-corpus-n.perl
mert-moses.pl
Moses is a successor to the Pharaoh decoder, so models that work with Pharaoh can also be used with Moses. The following script makes the necessary changes to the configuration file:
exodus.perl < pharaoh.ini > moses.ini
Since decoding large amounts of text takes a long time, you may want to split the text into blocks of a few hundred sentences (or fewer) and distribute the task across a Sun GridEngine cluster. This is supported by the script moses-parallel.pl, which is run as follows (a worked example follows the parameter list below):
moses-parallel.pl -decoder decoder -config cfgfile -i input -jobs N [options]
Use absolute paths for your parameters (decoder, configuration file, models, etc.).
decoder is the file location of the Moses binary used for decoding
cfgfile is the configuration file of the decoder
input is the file to translate
N is the number of processors you require
options are used to overwrite parameters provided in cfgfile
-n-best-file is the output file for the n-best list
-n-best-size is the size of the n-best list
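For instance, a hypothetical invocation on 20 cluster nodes could look like the following (the paths and file names are placeholders, not taken from the original documentation):
moses-parallel.pl -decoder /home/user/moses/bin/moses \
  -config /home/user/experiment/model/moses.ini \
  -i /home/user/experiment/test.input -jobs 20 \
  -n-best-file test.nbest -n-best-size 100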
Phrase tables easily get too big, but for the translation of a specific set of text only a fraction of the table is needed. So, you may want to filter the translation table, and this is possible with the script:
filter-model-given-input.pl filter-dir config input-file
This creates a filtered translation table with a new configuration file in the directory filter-dir, from the model specified with the configuration file config (typically named moses.ini), given the (tokenized) input from the file input-file.
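As an illustration (the directory and file names below are invented for this example):
filter-model-given-input.pl filtered-newstest model/moses.ini newstest.tok.en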
In the advanced features section, you will find the additional option of binarizing the translation and reordering tables, which allows these models to be kept on disk and queried by the decoder. If you want to both filter and binarize these tables, you can use the script:
filter-model-given-input.pl filter-dir config input-file -Binarizer binarizer
The additional binarizer option points to the appropriate version of processPhraseTable.
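For example, assuming the Moses binaries live under $MOSES/bin (the paths and file names are placeholders):
filter-model-given-input.pl filtered-newstest model/moses.ini newstest.tok.en \
  -Binarizer $MOSES/bin/processPhraseTable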
Instead of two separate scripts for reducing and combining factors, this one does both at the same time, and is better suited for our directory structure and factor naming conventions:
reduce_combine.pl \
  czeng05.cs \
  0,2 pos lcstem4 \
  > czeng05_restricted_to_0,2_and_with_pos_and_lcstem4_added
A simple BLEU scoring tool is the script multi-bleu.perl:
multi-bleu.perl reference < mt-output
The reference file and system output have to be sentence-aligned (line X in the reference file corresponds to line X in the system output). If multiple reference translations exist, they have to be stored in separate files named reference0, reference1, reference2, etc. All the texts need to be tokenized.
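For example, if two tokenized references are stored as newstest.ref0 and newstest.ref1 (placeholder names), the common prefix is passed as the argument:
multi-bleu.perl newstest.ref < newstest.system.output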
A popular script to score translations with BLEU is the NIST mteval script. It requires the text to be wrapped in an SGML format. This format is used, for instance, by the NIST evaluations and the WMT Shared Task evaluations. See the latter for more details on using this script.
Missing n-grams are those that all reference translations wanted but the MT system did not produce. Extra n-grams are those that the MT system produced but none of the references approved. To list them, run:
missing_and_extra_ngrams.pl hypothesis reference1 reference2 ...
Assume you already have a moses.ini file and want to run an experiment with it. Some months from now, you might still want to know what exactly the model (incl. all the tables) looked like, but people tend to move files around or just delete them.
To solve this problem, create a blank directory, go in there and run:
clone_moses_model.pl ../path/to/moses.ini
clone_moses_model.pl will make a copy of the moses.ini file and local symlinks (and, if possible, also hardlinks, in case someone deletes the original files) to all the tables and language models needed.
It will now be safe to run Moses locally in the fresh directory.
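A hypothetical session (the directory and file names are placeholders) might look like this:
mkdir backup-of-experiment-5
cd backup-of-experiment-5
clone_moses_model.pl ../experiment-5/moses.ini
moses -f moses.ini < test.input > test.output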
Run:
absolutize_moses_model.pl ../path/to/moses.ini > moses.abs.ini
to build an ini file in which all paths to model parts are absolute. (The script also checks that the files exist.)
The script
analyse_moses_model.pl moses.ini
prints basic statistics about all components mentioned in moses.ini. This can be useful for setting the order of mapping steps to avoid an explosion of translation options, or just to check that the model components are as big/detailed as we expect.
The sample output below lists information about a model with 2 translation steps and 1 generation step. The three language models over the three factors used, and their n-gram counts (after discounting), are also listed.
Translation 0 -> 1 (/fullpathto/phrase-table.0-1.gz):
  743193 phrases total
  1.20 phrases per source phrase
Translation 1 -> 2 (/fullpathto/phrase-table.1-2.gz):
  558046 phrases total
  2.75 phrases per source phrase
Generation 1,2 -> 0 (/fullpathto/generation.1,2-0.gz):
  1.04 outputs per source token
Language model over 0 (/fullpathto/lm.1.lm):
  1       2       3
  49469   245583  27497
Language model over 1 (/fullpathto/lm.2.lm):
  1       2       3
  25459   199852  32605
Language model over 2 (/fullpathto/lm.3.lm):
  1       2       3       4       5       6       7
  709     20946   39885   45753   27964   12962   7524
Often, we train machine translation systems on lowercased data. If we want to present the output to a user, we need to re-case (or re-capitalize) the output. Moses provides a simple tool to recase data, which essentially runs Moses without reordering, using a word-to-word translation model and a cased language model.
The recaser requires a model (i.e., the word mapping model and language model mentioned above), which is trained with the command:
train-recaser.perl --dir MODEL --corpus CASED [--train-script TRAIN]
The script expects a cased (but tokenized) training corpus in the file CASED, and creates a recasing model in the directory MODEL. KenLM's lmplz is used to train language models by default; pass --lm to change the toolkit.
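For example (the corpus and directory names below are placeholders):
train-recaser.perl --dir recaser-model --corpus europarl.cased.tok.en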
To recase output from the Moses decoder, you run the command
recase.perl --in IN --model MODEL/moses.ini --moses MOSES [--lang LANGUAGE] [--headline SGML] > OUT
The input is in file IN, the output in file OUT. You also need to specify a recasing model MODEL. Since headlines are capitalized differently from regular text, you may want to provide an SGML file that contains information about headlines. This file uses the NIST format and may be identical to source test sets provided by NIST or other evaluation campaigns. A language LANGUAGE may also be specified, but only English (en) is currently supported.
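A hypothetical invocation, assuming the Moses binary is at $MOSES/bin/moses (the file names are placeholders):
recase.perl --in system.lowercased.en --model recaser-model/moses.ini \
  --moses $MOSES/bin/moses > system.recased.en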
By default, EMS trains a truecaser (see below). To use a recaser, you have to make the following changes:
Remove output-truecaser and detruecaser and add instead output-lowercaser and EVALUATION:recaser.
Add IGNORE to the [TRUECASING] section, and remove it from the [RECASING] section.
Specify in the [RECASING] section which training corpus should be used for the recaser. This is typically the target side of the parallel corpus or a large language model corpus. You can directly link to a corpus already specified in the config file, e.g.,
tokenized = [LM:europarl:tokenized-corpus]
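Putting this together, the relevant parts of the EMS config file might look roughly as follows. This is only a minimal sketch, assuming a language-model corpus named europarl is already defined elsewhere in the config; the placement of IGNORE simply follows the instruction above:
[TRUECASING]
IGNORE

[RECASING]
tokenized = [LM:europarl:tokenized-corpus]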
Instead of lowercasing all training and test data, we may also want to keep words in their natural case, and only change the words at the beginning of a sentence to their most frequent form. This is what we mean by truecasing. Again, this requires first the training of a truecasing model, which is a list of words and the frequency of their different forms.
train-truecaser.perl --model MODEL --corpus CASED
The model is trained from the cased (but tokenized) training corpus CASED and stored in the file MODEL.
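For example (the file names are placeholders):
train-truecaser.perl --model truecase-model.en --corpus europarl.cased.tok.en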
Input to the decoder has to be truecased with the command
truecase.perl --model MODEL < IN > OUT
Output from the decoder has to be restored into regular case. This simply uppercases words at the beginning of sentences:
detruecase.perl < in > out [--headline SGML]
An SGML file with headline information may be provided, as done with the recaser.
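A typical pipeline might therefore look like this (the model and file names are placeholders, and moses -f stands in for whatever decoder invocation you use):
truecase.perl --model truecase-model.src < test.tok.src > test.tc.src
moses -f moses.ini < test.tc.src > output.tc.trg
detruecase.perl < output.tc.trg > output.trg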
This small tool converts a Moses search graph (produced with the -output-search-graph FILE option) to dot format. The dot format can be rendered using the Graphviz tool dot.
moses ... --output-search-graph temp.graph -s 3
    # we suggest using a very limited stack size, -s 3
sg2dot.perl [--organize-to-stacks] < temp.graph > temp.dot
dot -Tps temp.dot > temp.ps
Using --organize-to-stacks makes nodes in the same stack appear in the same column (this slows down the rendering, and is off by default).
Caution: the input must contain the searchgraph of one sentence only.
The phrase table trained by Moses contains by default all phrase pairs encountered in the parallel training corpus. This often includes 100,000 different translations for the word "the" or the comma ",". These may clog up various processing steps down the road, so it is helpful to prune the phrase table down to the more reasonable choices.
Threshold pruning is currently implemented at two different stages: You may filter the entire phrase table file, or use threshold pruning as an additional filtering criterion when filtering the phrase table for a given test set. In either case, phrase pairs are thrown out when their phrase translation probability p(e|f) falls below a specified threshold. A safe number for this threshold may be 0.0001, in the sense that it hardly changes any phrase translation while ridding the table of a lot of junk.
The script scripts/training/threshold-filter.perl operates on any phrase table file:
cat PHRASE_TABLE | \
  threshold-filter.perl 0.0001 > PHRASE_TABLE.reduced
If the phrase table is zipped, then:
zcat PHRASE_TABLE.gz | \
  threshold-filter.perl 0.0001 | \
  gzip - > PHRASE_TABLE.reduced.gz
While this often does not remove much of the phrase table (which to a large part consists of singleton phrase pairs with p(e|f)=1), it may nevertheless be helpful to also reduce the reordering model. This can be done with a second script:
cat REORDERING_TABLE | \
  remove-orphan-phrase-pairs-from-reordering-table.perl PHRASE_TABLE \
  > REORDERING_TABLE.pruned
Again, this also works for zipped files:
zcat REORDERING_TABLE.gz | \
  remove-orphan-phrase-pairs-from-reordering-table.perl PHRASE_TABLE | \
  gzip - > REORDERING_TABLE.pruned.gz
In the typical experimental setup, the phrase table is filtered for a tuning or test set using the script filter-model-given-input.pl described above. During this process, we can also remove low-probability phrase pairs. This can be done simply by adding the switch -MinScore, which takes a specification of the following form:
filter-model-given-input.pl [...] \
  -MinScore FIELD1:THRESHOLD1[,FIELD2:THRESHOLD2[,FIELD3:THRESHOLD3]]
where FIELDn is the position of the score (typically 2 for the direct phrase probability p(e|f), or 0 for the indirect phrase probability p(f|e)) and THRESHOLDn is the minimum probability required to keep the phrase pair.
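For example, to keep only phrase pairs whose direct phrase probability p(e|f) is at least 0.0001 (the directory and file names are placeholders):
filter-model-given-input.pl filtered-newstest model/moses.ini newstest.tok.en \
  -MinScore 2:0.0001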