
Adding Feature Functions

April 13th, 2012: Checked and revised for latest version (Barry Haddow)

The log-linear model underlying statistical machine translation allows for the combination of several components that each weigh in on the quality of the translation. Each component is represented by one or more features; each feature value is raised to the power of its weight, and the results are multiplied together.

Formally, the probability of a translation e of an input sentence f is computed as

p(e|f) = Πi hi(e,f)^λi

where hi are the feature functions and λi the corresponding weights.

Note that the decoder internally uses logs, so in fact what is computed is

log p(e|f) = Σi λi log( hi(e,f) )

The tuning stage of the decoder is used to set the weights.

The following components are typically used:

  • phrase translation model (4 features, described here)
  • language model (1 feature)
  • distance-based reordering model (1 feature)
  • word penalty (1 feature)
  • lexicalized reordering model (6 features, described here)

One way to attempt to improve the performance of the system is to add additional feature functions. This section explains what needs to be done to add a feature. Unless otherwise specified the Moses code files are in the directory moses/. In the following we refer to the new feature as xxx.

Side note: Adding a new component may imply that several scores are added. In the following, as in the Moses source code, we refer to both components and scores as features. So, a single feature (component) may itself contribute multiple features (scores). Sorry about the confusion.


There is a 10 minute video demonstrating how to create your own feature function here.

Other resources

 Kenton Murray has a nice blog on how to add a new feature function to Moses

Feature Function

The feature computes one or more values, and we need to write a feature function for it.

One important question about the new feature is whether it depends only on the current phrase translation, or also on prior translation decisions. We call the first case stateless and the second stateful. If the new feature is stateless, it should inherit from the class StatelessFeatureFunction; otherwise it should inherit from StatefulFeatureFunction.

The second case causes additional complications for the dynamic programming strategy of recombining hypotheses. If two hypotheses differ in their past translation decisions in a way that matters for the new feature, then they cannot be recombined.

For instance, the word penalty depends only on the current phrase translation and is hence stateless. The distortion features also depend on previous phrase translations and are hence stateful. You can see the implementations of WordPenaltyProducer and DistortionScoreProducer in the directory moses/FF.

However, new features are usually more complicated: for instance, a feature may require reading in a file, representing it with a data structure, and performing more complex computations. See moses/LM/SRI.h and moses/LM/SRI.cpp for something more involved.

In the following, we assume such a more complex feature, which is implemented in its own source files XXX.h and XXX.cpp. The feature is implemented as a class which inherits from either StatefulFeatureFunction or StatelessFeatureFunction. So, you will write some code in XXX.h that starts with

  namespace Moses
  {
  class XXX : public StatefulFeatureFunction
  {

The class must contain the constructor:

   XXX::XXX(const std::string &line)
    : StatefulFeatureFunction(line)

The constructor must call the constructor of its parent class,

   StatelessFeatureFunction(...) or 
   StatefulFeatureFunction(...)

or something that eventually calls one of these constructors.

The constructor must also call the method

   ReadParameters();

This is inherited from class FeatureFunction and should NOT be overridden.

The line is the complete line from the ini file that instantiates this feature, e.g.

   KENLM factor=0 order=5 num-features=1 lazyken=1 path=path/file

The class must also contain the function:

    bool IsUseable(const FactorMask &mask) const;

This function returns true if the feature can be evaluated when given a target phrase containing only the factors in mask. If the feature doesn't need to look at words in the target phrase, or doesn't use factors, always return true.

Good examples of IsUseable() can be found in the existing feature functions in moses/FF.

Besides the constructor, this is the only method the class HAS to implement. All other methods are optional.

An important function to override is

  void Load(AllOptions::ptr const& opts)

Override this function if the feature needs to load files. For example, language model classes load their LM files here. The first thing this function should do is save a pointer to the current set of options that is passed as the parameter:

  void Load(AllOptions::ptr const& opts) {
    m_options = opts;
    // ... load model files etc. ...
  }

Many feature functions need parameters to be passed in from the ini file. For example,

   KENLM factor=0 order=5 num-features=1 lazyken=1 path=path/file

has the parameters factor, order, num-features, lazyken, path. To read in these parameters, override the method

   void FeatureFunction::SetParameter(const std::string& key, const std::string& value)

This method MUST call the same method in its parent class if the parameter is unknown, e.g.

  if (key == "input-factor") {
    m_factorTypeSource = Scan<FactorType>(value);
  } else {
    StatelessFeatureFunction::SetParameter(key, value);
  }

The feature function needs to be registered in moses/FF/Factory.cpp. Include the header at the top of the file, and add the class in FeatureRegistry() with the MOSES_FNAME macro:

  #include "XXX.h"
  MOSES_FNAME(XXX);

Stateless Feature Function

The above is all that is required to create a feature function. However, it doesn't do anything yet.

If the feature is stateless, it should override one of these methods from the class FeatureFunction:

  1. virtual void EvaluateInIsolation(const Phrase &source
                        , const TargetPhrase &targetPhrase
                        , ScoreComponentCollection &scoreBreakdown) const
  2. virtual void EvaluateWithSourceContext(const InputType &input
                        , const InputPath &inputPath
                        , const TargetPhrase &targetPhrase
                        , const StackVec *stackVec
                        , ScoreComponentCollection &scoreBreakdown
                        , ScoreComponentCollection *estimatedFutureScore) const

Or it can override one of these methods, specific to the StatelessFeatureFunction class.

  3. virtual void EvaluateWhenApplied(const Hypothesis& hypo,
                        ScoreComponentCollection* accumulator) const
  4. virtual void EvaluateWhenApplied(const ChartHypothesis &hypo,
                             ScoreComponentCollection* accumulator) const

Usually, method (1) should be overridden. See WordPenaltyProducer.cpp for a simple example using (1).

Note - Only scores evaluated in (1) are included in future cost estimation in the phrase-based model.

Some stateless feature functions need to know the entire input sentence to evaluate, for example a bag-of-words feature. In this case, use method (2).

Use method (3) or (4) if the feature function requires the segmentation of the source, or any other information available from the context. Note - these methods are identical to those used by stateful features, except that they don't return state.

Each stateless feature function can override one or more of the above methods. So far (June, 2013) all stateless features override only one method.

The methods are called at different stages in the decoding process.

  • (1) is called before the search process, when the translation rule is created. This could be when the phrase-table is loaded (in the case of a memory-based phrase-table), or just before the search begins (for binary phrase-tables).
  • (2) is called just before the search begins.
  • (3) and (4) are called during search when hypotheses are created.

Stateful Feature Function

Stateful feature functions should inherit from class StatefulFeatureFunction. There are 2 class methods that can be overridden by the feature functions to score hypotheses:

    5.  virtual FFState* EvaluateWhenApplied(
                 const Hypothesis& cur_hypo,
                 const FFState* prev_state,
                 ScoreComponentCollection* accumulator) const = 0;

    6. virtual FFState* EvaluateWhenApplied(
                 const ChartHypothesis& /* cur_hypo */,
                 int /* featureID - used to index the state in the previous hypotheses */,
                 ScoreComponentCollection* accumulator) const = 0;

As the names suggest, (5) is used to score a hypothesis from a phrase-based model; (6) is used to score one from the hierarchical/syntax model.

In addition, a stateful feature function can also override methods (1) and (2) from the base FeatureFunction class.

For example, language models are stateful. All language model implementations should override (5) and (6). However, they should also override (1) to score the translation rule in isolation. See the classes LanguageModelImplementation and LanguageModel for the implementation of language model scoring.

Stateful feature functions must also implement

  const FFState* EmptyHypothesisState() const

Place-holder features

Some features don't implement any Evaluate() functions. Their evaluation is interwoven with the creation of the translation rule, and the feature function is just used as a placeholder where the scores should be added.

Examples are the phrase-table (class PhraseDictionary), the generation model (class GenerationDictionary), the unknown word feature (class UnknownWordPenaltyProducer), and the input scores for confusion networks and lattices (class InputFeature).


All feature functions are specified in the [feature] section. It should be in the format:

   Feature-name key1=value1 key2=value2 ....

For example,

  KENLM factor=0 order=3 num-features=1 lazyken=0 path=file.lm.gz

Keys must be unique. There must be a key

   num-features=??

which specifies the number of dense scores for this feature.

The key

   name=??

is optional. If it is specified, the feature name must be unique. If it is not specified, then a name is automatically created. All other key/value pairs are up to the feature function implementation.
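For instance, a [feature] section entry with an explicit name might look like the following (the name LM0 and the file name are arbitrary illustrations, not required values):

```ini
[feature]
KENLM name=LM0 factor=0 order=5 num-features=1 path=file.lm.gz
```

Giving features explicit names is useful when the same feature type appears more than once, e.g. two language models, since weights in the [weight] section are keyed by feature name.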


The examples below are formatted for the current Moses in github; the old format (Moses v1 and before) is deprecated.

NB. moses.ini files in the old format can still be read by the new decoder, if they just contain the common, vanilla features (i.e. no sparse features, suffix arrays, or new features that have recently been added).

NB. 2 - Do NOT mix the old and new format in one ini file.


In-memory phrase-table (phrase-based):

   PhraseDictionaryMemory num-features=5 path=phrase-table.gz input-factor=0 output-factor=0 table-limit=20

Note - The old method is relaxed about whether you add '.gz' to the file name; it will try it with and without and see what exists. The new method is strict - you MUST specify '.gz' if the file ends with .gz, otherwise you must NOT specify '.gz'

Binary phrase-table (phrase-based):

   PhraseDictionaryBinary num-features=5 path=phrase-table.gz input-factor=0 output-factor=0 

Note - the binary phrase-table consists of 5 files with the following suffixes:

   .binphr.idx, .binphr.srcvoc, .binphr.tgtvoc

and (without word alignment):

   .binphr.srctree, .binphr.tgtdata

or (WITH word alignment):

   .binphr.srctree.wa, .binphr.tgtdata.wa

The path value must point to the PREFIX of the files. For example, if the files are called:

   folder/pt.binphr.idx, folder/pt.binphr.srcvoc, folder/pt.binphr.tgtvoc ....

then the ini file should contain path=folder/pt.



In-memory phrase-table (hierarchical/syntax):

   PhraseDictionaryMemory num-features=5 path=phrase-table.gz input-factor=0 output-factor=0 table-limit=20

See "In-memory phrase-table (phrase-based)" above for notes.

On-disk phrase-table (hierarchical/syntax):

   PhraseDictionaryOnDisk num-features=5 path=phrase-table.gz input-factor=0 output-factor=0 table-limit=20

Note - the on-disk phrase-table consists of 5 files:

   Misc.dat, Source.dat, TargetColl.dat, TargetInd.dat, Vocab.dat

The path value must point to the FOLDER in which these files are found.

Language models


SRILM:

   SRILM factor=0 order=5 path=lm.gz

IRSTLM:

   IRSTLM factor=0 order=5 path=lm.gz

KenLM:

   KENLM factor=0 order=5 path=lm.gz

Lazy KenLM:

   KENLM factor=0 order=5 path=lm.gz lazy=1

Reordering models

   LexicalReordering num-features=6 type=msd-bidirectional-fe input-factor=0 output-factor=0

Misc features

New Moses must have Distortion, WordPenalty, and UnknownWordPenalty explicitly in the list of feature functions. They require no arguments, i.e.

   Distortion
   WordPenalty
   UnknownWordPenalty

In the old Moses, they were implicitly added by the decoder.

Sparse features

Lots of ad-hoc sparse features are currently implemented. You must look at the code, and ask the developers, to see how to run them.

Page last modified on January 05, 2016, at 08:14 AM