Subsequently, a translation table is created from the stored phrase translation pairs. The two steps are separated because, for larger translation models, the phrase translation table does not fit into memory. Fortunately, we never have to store the phrase translation table in memory --- we can construct it on disk.
To estimate the phrase translation probability φ(e|f) we proceed as follows: First, the extract file is sorted. This ensures that all English phrase translations for a foreign phrase are next to each other in the file. Thus, we can process the file one foreign phrase at a time, collect counts, and compute φ(e|f) for that foreign phrase f. To estimate φ(f|e), the inverted file is sorted, and then φ(f|e) is estimated one English phrase at a time.
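For illustration, here is a minimal sketch of this streaming estimation. It assumes a simplified, hypothetical extract-file layout of one "foreign ||| english" pair per line, already sorted by foreign phrase (the real extract file carries additional fields), and a made-up path; it is not the actual Moses scoring code:

from collections import Counter
from itertools import groupby

def phi_e_given_f(extract_path):
    with open(extract_path, encoding="utf-8") as f:
        pairs = (line.rstrip("\n").split(" ||| ")[:2] for line in f)
        # since the file is sorted, all lines for one foreign phrase
        # are adjacent and can be processed as one group
        for foreign, group in groupby(pairs, key=lambda p: p[0]):
            counts = Counter(english for _, english in group)
            total = sum(counts.values())
            for english, count in counts.items():
                yield foreign, english, count / total

for f_phrase, e_phrase, phi in phi_e_given_f("model/extract.sorted"):
    print(f_phrase, "|||", e_phrase, "|||", round(phi, 6))

Only one foreign phrase's counts are ever held in memory at a time, which is what makes the on-disk construction feasible.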
In addition to the phrase translation probability distributions φ(f|e) and φ(e|f), additional phrase translation scoring functions can be computed, e.g. lexical weighting, word penalty, phrase penalty, etc. Currently, lexical weighting is added for both directions; the phrase penalty, formerly a fifth score, is now its own feature function (see below). For example, the entries for the English phrase in europe may look like this:
> grep '| in europe |' model/phrase-table | sort -nrk 7 -t\| | head
in europa ||| in europe ||| 0.829007 0.207955 0.801493 0.492402
europas ||| in europe ||| 0.0251019 0.066211 0.0342506 0.0079563
in der europaeischen union ||| in europe ||| 0.018451 0.00100126 0.0319584 0.0196869
in europa , ||| in europe ||| 0.011371 0.207955 0.207843 0.492402
europaeischen ||| in europe ||| 0.00686548 0.0754338 0.000863791 0.046128
im europaeischen ||| in europe ||| 0.00579275 0.00914601 0.0241287 0.0162482
fuer europa ||| in europe ||| 0.00493456 0.0132369 0.0372168 0.0511473
in europa zu ||| in europe ||| 0.00429092 0.207955 0.714286 0.492402
an europa ||| in europe ||| 0.00386183 0.0114416 0.352941 0.118441
der europaeischen ||| in europe ||| 0.00343274 0.00141532 0.00099583 0.000512159
Currently, four different phrase translation scores are computed:

1. inverse phrase translation probability φ(f|e)
2. inverse lexical weighting lex(f|e)
3. direct phrase translation probability φ(e|f)
4. direct lexical weighting lex(e|f)

Previously, there was another score:

5. phrase penalty (always exp(1) = 2.718)

This has now been superseded by its own feature function, PhrasePenalty.
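The lexical weighting scores (2 and 4 above) follow the standard definition from phrase-based SMT: for each output word, the word translation probabilities of its aligned input words are averaged, and the results are multiplied; unaligned words are paired with NULL. A minimal sketch with a toy, made-up lexical table (not the Moses implementation):

def lex_weight(src, tgt, alignment, w, floor=1e-7):
    # lex(e|f): product over target words of the averaged word
    # translation probabilities w(e_i|f_j) of their aligned source words
    score = 1.0
    for i, e in enumerate(tgt):
        aligned = [j for j, t in alignment if t == i]
        if aligned:
            score *= sum(w.get((e, src[j]), floor) for j in aligned) / len(aligned)
        else:
            score *= w.get((e, "NULL"), floor)  # unaligned target word
    return score

w = {("in", "in"): 0.8, ("europe", "europa"): 0.9}  # toy w(e|f) table
print(lex_weight(["in", "europa"], ["in", "europe"], {(0, 0), (1, 1)}, w))
# 0.8 * 0.9 = 0.72

Computing lex(f|e) works the same way with the roles of the two sides swapped.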
You may not want to use all the scores in your translation table. The following options allow you to remove some of the scores:
NoLex -- do not use lexical scores (removes scores 2 and 4)
OnlyDirect -- do not use the inverse scores (removes scores 1 and 2)
These settings have to be specified with the setting -score-options when calling the script train-model.perl, for instance:
train-model.perl [... other settings ...] -score-options '--NoLex'
NB - the consolidate program (which runs after score) also has a few arguments. For example, it has
PhraseCount -- add the old phrase count feature (score 5)
However, this can't be set by the train-model.perl script.
Singleton phrase pairs tend to have overestimated phrase translation probabilities. Consider the extreme case of a source phrase that occurs only once in the corpus and has only one translation. The corresponding phrase translation probability φ(e|f) would be 1.
To obtain better phrase translation probabilities, the observed counts may be reduced to expected counts, which take unobserved events into account. Borrowing a method from language model estimation, Good-Turing discounting can be used to adjust the actual counts (such as the 1 in the example above) to a more realistic number (maybe 0.3). The value of the adjusted count is determined by an analysis of the number of singleton, twice-occurring, thrice-occurring, etc. phrase pairs that were extracted.
To use Good-Turing discounting of the phrase translation probabilities, you have to specify --GoodTuring as one of the -score-options, as in the section above. The adjusted counts are reported to STDERR.
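For intuition, here is a minimal sketch of the count adjustment described above, using the classic Good-Turing formula r* = (r+1) * N(r+1) / N(r), where N(r) is the number of distinct phrase pairs seen exactly r times (the smoothing in Moses' score is more involved; this only illustrates the idea):

from collections import Counter

def good_turing_adjusted(pair_counts):
    count_of_counts = Counter(pair_counts.values())  # N_r
    adjusted = {}
    for pair, r in pair_counts.items():
        n_r, n_r1 = count_of_counts[r], count_of_counts.get(r + 1, 0)
        # fall back to the raw count where N_{r+1} is zero (sparse high counts)
        adjusted[pair] = (r + 1) * n_r1 / n_r if n_r1 else r
    return adjusted

counts = {("europas", "in europe"): 1, ("an europa", "in europe"): 1,
          ("der europaeischen", "in europe"): 1, ("fuer europa", "in europe"): 2}
print(good_turing_adjusted(counts))
# the three singletons are discounted to 2 * 1/3 = 0.67; with realistic
# count-of-count statistics the adjusted singleton count can drop to ~0.3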
An enhanced version of the scoring script outputs the word-to-word alignments between f and e as they are in the files (extract and extract.inv) generated in the previous training step "Extract Phrases".
The alignment information is reported in the fourth field. The format is identical to the alignment output obtained when the GIZA++ output has been symmetrized prior to phrase extraction.
> grep '| in europe |' model/phrase-table | sort -nrk 7 -t\| | head
in europa ||| in europe ||| 0.829007 0.207955 ||| 0-0 1-1 ||| ...
europas ||| in europe ||| ... ||| 0-0 0-1 ||| ...
in der europaeischen union ||| in europe ||| ... ||| 0-0 2-1 3-1 |||
in europa , ||| in europe ||| ... ||| 0-0 1-1 ||| ...
europaeischen ||| in europe ||| ... ||| 0-1 ||| ...
im europaeischen ||| in europe ||| ... ||| 0-0 1-1 |||
For instance:
in der europaeischen union ||| in europe ||| 0-0 2-1 3-1 ||| ...
means
German -> English
in -> in
der ->
europaeischen -> europe
union -> europe
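A tiny helper (illustrative only, not part of Moses) that expands the alignment field into exactly this listing:

def alignment_pairs(src_phrase, tgt_phrase, alignment_field):
    src, tgt = src_phrase.split(), tgt_phrase.split()
    links = [tuple(map(int, p.split("-"))) for p in alignment_field.split()]
    for j, f_word in enumerate(src):
        # each "j-i" link pairs source word j with target word i
        targets = [tgt[i] for s, i in links if s == j]
        for t in targets or [""]:
            print(f_word, "->", t)

alignment_pairs("in der europaeischen union", "in europe", "0-0 2-1 3-1")
# in -> in
# der ->
# europaeischen -> europe
# union -> europe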
The word-to-word alignments come from one word alignment (see training step "Align words").
The alignment information is also used in SCFG rules for the chart decoder to link non-terminals together on the source and target sides. In this instance, the alignment information is not an option, but a necessity. For example, the following Moses SCFG rule
[X][X] miss [X][X] [X] ||| [X][X] [X][X] manques [X] ||| ... ||| 0-1 2-0 ||| ...
is formatted as follows in the Hiero format:
[X] ||| [X,1] miss [X,2] ||| [X,2] [X,1] manques ||| ....
i.e. this rule reorders the 1st and 3rd symbols (the two non-terminals) in the source. Therefore, the same alignment field can be used for word alignment and non-terminal co-indexes. However, I'm (Hieu) not sure if anyone has implemented this in the chart decoder yet.
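To make the non-terminal co-indexing concrete, here is an illustrative sketch, assuming exactly the rule layout shown above (it is not an existing Moses utility), that reads the co-indexes off the alignment field and prints the Hiero-style form:

def moses_to_hiero(src, tgt, alignment_field):
    # drop the trailing left-hand-side non-terminal on each side
    src_syms, tgt_syms = src.split()[:-1], tgt.split()[:-1]
    links = dict(tuple(map(int, p.split("-"))) for p in alignment_field.split())
    coindex, src_out = {}, []
    for j, sym in enumerate(src_syms):
        if sym == "[X][X]":  # number source non-terminals left to right
            coindex[j] = len(coindex) + 1
            src_out.append("[X,%d]" % coindex[j])
        else:
            src_out.append(sym)
    tgt_out = []
    for i, sym in enumerate(tgt_syms):
        if sym == "[X][X]":  # target non-terminals inherit the co-index
            src_pos = next(s for s, t in links.items() if t == i)
            tgt_out.append("[X,%d]" % coindex[src_pos])
        else:
            tgt_out.append(sym)
    print("[X] |||", " ".join(src_out), "|||", " ".join(tgt_out))

moses_to_hiero("[X][X] miss [X][X] [X]",
               "[X][X] [X][X] manques [X]", "0-1 2-0")
# [X] ||| [X,1] miss [X,2] ||| [X,2] [X,1] manques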
There is a maximum of 7 columns in the phrase table:
1. Source phrase
2. Target phrase
3. Scores
4. Alignment
5. Counts
6. Sparse feature scores
7. Key-value properties
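As a quick illustration of how these columns line up, here is a hypothetical helper (not a Moses API); trailing columns may simply be absent in a given table:

COLUMNS = ["source phrase", "target phrase", "scores", "alignment",
           "counts", "sparse feature scores", "key-value properties"]

def parse_phrase_table_line(line):
    fields = [f.strip() for f in line.rstrip("\n").split("|||")]
    return dict(zip(COLUMNS, fields))

line = "in europa ||| in europe ||| 0.829007 0.207955 0.801493 0.492402 ||| 0-0 1-1"
print(parse_phrase_table_line(line))
# {'source phrase': 'in europa', 'target phrase': 'in europe',
#  'scores': '0.829007 0.207955 0.801493 0.492402', 'alignment': '0-0 1-1'}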