Moses statistical machine translation system

Factors, Words, Phrases

Moses is an implementation of a factored translation model. This means that each word is represented by a vector of factors, typically the surface word, a part-of-speech tag, and so on. It also means that the implementation is somewhat more complicated than that of a non-factored translation model.

This section documents how factors, words, and phrases are implemented in Moses.

Factors

The class Factor implements the most basic unit of representing text in Moses. In essence it is a string.

Factors do not know their own type, that is, which component of the word vector they represent. Where the type is needed, it is referred to as the factor's FactorType. The factor type is implemented as a size_t, i.e. an integer. What a factor really represents (be it a surface form or a part-of-speech tag) does not concern the decoder at all. All the decoder knows is that there are a number of factors, each referred to by its factor type, i.e. an integer index.

Since we do not want to store the same strings over and over again, the class FactorCollection contains all known factors. The class has one global instance, and it provides the essential functions to check if a newly constructed factor already exists and to add a factor. This enables the comparison of factors by the cheaper comparison of the pointers to factors. Think of the FactorCollection as the global factor dictionary.
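
The idea of a global factor dictionary can be illustrated with a minimal sketch. This is not the Moses source: the names FactorSketch, FactorCollectionSketch, AddFactor, Exists, Size and SameFactor are hypothetical and chosen only for illustration. It assumes that each distinct string is stored exactly once, so that two factors are equal exactly when their pointers are equal.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for FactorType: an index into a word's factor vector.
using FactorTypeSketch = std::size_t;

// Hypothetical stand-in for Factor: in essence, a string.
class FactorSketch {
public:
    explicit FactorSketch(std::string text) : m_text(std::move(text)) {}
    const std::string& GetString() const { return m_text; }
private:
    std::string m_text;
};

// Hypothetical stand-in for FactorCollection: one global instance that
// stores every distinct factor string exactly once.
class FactorCollectionSketch {
public:
    static FactorCollectionSketch& Instance() {
        static FactorCollectionSketch instance;
        return instance;
    }

    // Returns the unique stored factor for `text`, adding it if it is new.
    // Pointers into an unordered_map stay valid when the table rehashes.
    const FactorSketch* AddFactor(const std::string& text) {
        auto result = m_factors.emplace(text, FactorSketch(text));
        return &result.first->second;
    }

    bool Exists(const std::string& text) const {
        return m_factors.count(text) > 0;
    }

    std::size_t Size() const { return m_factors.size(); }

private:
    FactorCollectionSketch() = default;
    std::unordered_map<std::string, FactorSketch> m_factors;
};

// Because every string is stored once, comparing factors reduces to
// comparing pointers instead of comparing strings.
inline bool SameFactor(const FactorSketch* a, const FactorSketch* b) {
    assert(a != nullptr && b != nullptr);
    return a == b;
}
```

In this sketch, calling AddFactor twice with the same string returns the same pointer both times, which is what makes the cheap pointer comparison valid.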

Words

A word is, as we said, a vector of factors. The class Word implements this. As a data structure, it is an array of pointers to factors. This requires the code to know the array size, which is set by the global MAX_NUM_FACTORS. The Word class implements a number of functions for comparing and copying words, and for addressing individual factors.

Again, a word does not know how many factors it really has. So, for instance, when you want to print out a word with all its factors, you also need to provide the factor types that are valid within the word. See the function Word::GetString for details.
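
As a rough illustration of this design, the following sketch models a word as a fixed-size array of factor pointers and prints only the factor types the caller asks for. It is an assumption-laden sketch, not the Moses code: WordSketch, kMaxNumFactorsSketch (with an arbitrary value of 4) and the signature of GetString here are hypothetical, factors are reduced to plain string pointers, and the "|" separator is an assumption of this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using FactorTypeSketch = std::size_t;

// Hypothetical stand-in for MAX_NUM_FACTORS; the value 4 is arbitrary.
constexpr std::size_t kMaxNumFactorsSketch = 4;

// Hypothetical stand-in for Word: a fixed-size array of pointers to factors.
// Unused slots are null, and the word does not record which slots are valid.
class WordSketch {
public:
    void SetFactor(FactorTypeSketch type, const std::string* factor) {
        assert(type < kMaxNumFactorsSketch);
        m_factors[type] = factor;
    }

    const std::string* GetFactor(FactorTypeSketch type) const {
        assert(type < kMaxNumFactorsSketch);
        return m_factors[type];
    }

    // The caller supplies the factor types to print, because the word
    // itself does not know how many factors it really has.
    std::string GetString(const std::vector<FactorTypeSketch>& types) const {
        std::string out;
        for (std::size_t i = 0; i < types.size(); ++i) {
            if (i > 0) out += '|';  // separator is an assumption of this sketch
            const std::string* factor = GetFactor(types[i]);
            if (factor != nullptr) out += *factor;  // a missing factor prints as empty
        }
        return out;
    }

private:
    std::array<const std::string*, kMaxNumFactorsSketch> m_factors{};
};
```

With a surface form in slot 0 and a part-of-speech tag in slot 1, asking for both types yields the two strings joined by the separator, while asking for type 0 alone yields only the surface form.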

Factor Types

This is a good place to note that referring to words gets a bit more complicated. If more than one factor is used, it does not mean that all the words in the models have all the factors. Take again the example of a two-factored representation of words as surface form and part-of-speech. We may still use a simple surface word language model, so for that language model, a word only has one factor.

We expect the input to the decoder to have all factors specified, and during decoding the output will have all factors of all words set. The process need not be a straightforward mapping of the input word to the output word; it may be decomposed into several mapping steps that either translate input factors into output factors or generate additional output factors from existing output factors.

At this point, keep in mind that a Factor has a FactorType and a Word has a vector<FactorType>, but these are not stored internally with the Factor and the Word.

Related to factor types is the class FactorMask, which is a bit array indicating which factors are valid for a particular word.
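
A bit array of this kind can be sketched with std::bitset. The names FactorMaskSketch, MakeMask and Covers below are hypothetical, as is the array size of 4; the sketch only illustrates how a mask can record which factor types are valid and how one might check that a word provides every factor a model needs.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

using FactorTypeSketch = std::size_t;

// Hypothetical stand-in for MAX_NUM_FACTORS; the value 4 is arbitrary.
constexpr std::size_t kMaxNumFactorsSketch = 4;

// Hypothetical stand-in for FactorMask: bit i is set if factor type i is valid.
using FactorMaskSketch = std::bitset<kMaxNumFactorsSketch>;

// Builds a mask from a list of valid factor types.
inline FactorMaskSketch MakeMask(const std::vector<FactorTypeSketch>& types) {
    FactorMaskSketch mask;
    for (FactorTypeSketch type : types) {
        assert(type < kMaxNumFactorsSketch);
        mask.set(type);
    }
    return mask;
}

// True if every factor type required by `needed` is present in `available`.
inline bool Covers(const FactorMaskSketch& available,
                   const FactorMaskSketch& needed) {
    return (available & needed) == needed;
}
```

For example, a word carrying surface form and part of speech (types 0 and 1) covers a surface-only language model that needs type 0, but a surface-only word does not cover a model that needs both.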

Phrases

Since decoding proceeds by translating input phrases into output phrases, many operations involve the class Phrase. Since the total number of input and output factors is known to the decoder (it has to be specified in the configuration file moses.ini), phrases are also a bit smarter about copying and comparing.

The Phrase class implements many useful functions, and two other classes are derived from it:

  • The simplest form of input, a sentence as string of words, is implemented in the class Sentence.
  • The class TargetPhrase may be somewhat misleadingly named, since it contains not only an output phrase, but also a phrase translation score, a future cost estimate, a pointer to the source phrase, and potentially word alignment information.
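
The relationship between these classes can be sketched roughly as follows. All names here (PhraseSketch, SentenceSketch, TargetPhraseSketch and their members) are hypothetical, a word is simplified to a vector of factor strings, and the real Moses classes carry considerably more state and behavior than this sketch shows.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// A word is reduced to a vector of factor strings to keep the sketch short.
using WordSketch = std::vector<std::string>;

// Hypothetical stand-in for Phrase: a sequence of words.
class PhraseSketch {
public:
    void AddWord(WordSketch word) { m_words.push_back(std::move(word)); }
    std::size_t GetSize() const { return m_words.size(); }
    const WordSketch& GetWord(std::size_t pos) const {
        assert(pos < m_words.size());
        return m_words[pos];
    }
private:
    std::vector<WordSketch> m_words;
};

// Hypothetical stand-in for Sentence: the simplest input, a string of words.
class SentenceSketch : public PhraseSketch {};

// Hypothetical stand-in for TargetPhrase: an output phrase plus the extra
// information listed above. Field names are assumptions of this sketch.
class TargetPhraseSketch : public PhraseSketch {
public:
    float translationScore = 0.0f;
    float futureCostEstimate = 0.0f;
    const PhraseSketch* sourcePhrase = nullptr;
    // Optional word alignment as (source position, target position) pairs.
    std::vector<std::pair<std::size_t, std::size_t>> wordAlignment;
};
```

The point of the sketch is only that a target phrase is a phrase with scoring and bookkeeping attached, which is why the text calls the name somewhat misleading.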
Page last modified on April 26, 2012, at 08:34 PM