Project goals

We need implementations of the following data structures and methods:

A numberized representation of the source and target side texts (a data array).
A suffix array on the source side text.
An efficient representation of alignments.
A method to map between the Moses numberization and the local numberizations.
Methods to extract and score phrases.

These will need to be hidden underneath the PhraseDictionary interface.

We will need to write code to create these data structures and read/write them to disk as memory-mapped data structures.
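As a rough illustration of the read/write step, the sketch below writes an array of 32-bit integers to disk and maps it back read-only with POSIX mmap. The names WriteArray and MappedArray are hypothetical and are not part of Moses. The file layout (raw integers in host byte order, no header) is an assumption made for brevity, so a file written this way is not portable between machines of different endianness.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: dump the array as raw 32-bit integers (host byte order).
void WriteArray(const std::string& path, const std::vector<std::uint32_t>& data) {
  std::ofstream out(path, std::ios::binary | std::ios::trunc);
  if (!out) throw std::runtime_error("cannot open " + path);
  out.write(reinterpret_cast<const char*>(data.data()),
            static_cast<std::streamsize>(data.size() * sizeof(std::uint32_t)));
  if (!out) throw std::runtime_error("write failed: " + path);
}

// Hypothetical read-only view over a file produced by WriteArray.
class MappedArray {
 public:
  explicit MappedArray(const std::string& path) {
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) throw std::runtime_error("cannot open " + path);
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
      close(fd);
      throw std::runtime_error("cannot stat, or empty file: " + path);
    }
    bytes_ = static_cast<std::size_t>(st.st_size);
    void* p = mmap(nullptr, bytes_, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) throw std::runtime_error("mmap failed: " + path);
    base_ = p;
  }
  ~MappedArray() { munmap(base_, bytes_); }
  MappedArray(const MappedArray&) = delete;
  MappedArray& operator=(const MappedArray&) = delete;

  std::size_t size() const { return bytes_ / sizeof(std::uint32_t); }

  std::uint32_t operator[](std::size_t i) const {
    assert(i < size());
    return static_cast<const std::uint32_t*>(base_)[i];
  }

 private:
  void* base_ = nullptr;
  std::size_t bytes_ = 0;
};
```

With this layout the operating system pages the data in on demand, so opening a large array is cheap and several processes can share the same pages.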

There is an example of converting between numberization schemes in the implementation of the prefix tree adaptor. This looks a bit complicated, as each integer is first converted to a string and then to an integer in the other scheme. A better way to do this might be to construct an integer-to-integer mapping when the data structure is first read into memory, and use it for all subsequent access.
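A minimal sketch of such an integer-to-integer mapping follows. It assumes the local vocabulary can be listed as a vector of words indexed by local ID, and it uses a plain word-to-ID hash map as a stand-in for the other scheme. That map is only a placeholder and not the actual FactorCollection API. The names BuildIdMap, MapId and kUnknownId are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

using WordId = std::uint32_t;

// Marker for local words that have no ID in the other scheme.
constexpr WordId kUnknownId = std::numeric_limits<WordId>::max();

// Built once, when the data structure is loaded.
// localVocab[i] is the word whose local ID is i; otherIds maps a word to its
// ID in the other scheme (placeholder for the real vocabulary lookup).
std::vector<WordId> BuildIdMap(
    const std::vector<std::string>& localVocab,
    const std::unordered_map<std::string, WordId>& otherIds) {
  std::vector<WordId> table(localVocab.size(), kUnknownId);
  for (std::size_t i = 0; i < localVocab.size(); ++i) {
    auto it = otherIds.find(localVocab[i]);
    if (it != otherIds.end()) table[i] = it->second;
  }
  return table;
}

// Every later conversion is a single array index; no strings are involved.
inline WordId MapId(const std::vector<WordId>& table, WordId localId) {
  assert(localId < table.size());
  return table[localId];
}
```

The string lookups happen once per vocabulary entry at load time rather than once per token at query time, which is the point of the suggestion above.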

Things we need to understand in Moses code:

The PhraseDictionary interface.
Numberization (FactorCollection).
How score producers are integrated into the code.

Things we'll need to understand algorithmically:

Suffix arrays.
Prefix trees.
The phrase extraction algorithm.
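For orientation, here is a small sketch of a word-level suffix array over a numberized corpus, with a binary-search lookup for all occurrences of a phrase. It is a naive construction (sorting suffix start positions by direct comparison) chosen for readability, not the construction used in the Moses code. The names BuildSuffixArray, ComparePrefix and FindPhrase are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Corpus = std::vector<int>;  // the numberized text, one integer per word

// Naive construction: sort all suffix start positions lexicographically.
std::vector<std::size_t> BuildSuffixArray(const Corpus& text) {
  std::vector<std::size_t> sa(text.size());
  for (std::size_t i = 0; i < sa.size(); ++i) sa[i] = i;
  std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
    return std::lexicographical_compare(text.begin() + a, text.end(),
                                        text.begin() + b, text.end());
  });
  return sa;
}

// Compare the suffix starting at pos with the phrase, looking only at the
// first phrase.size() words. Returns <0, 0 or >0.
int ComparePrefix(const Corpus& text, std::size_t pos, const Corpus& phrase) {
  for (std::size_t k = 0; k < phrase.size(); ++k) {
    if (pos + k >= text.size()) return -1;  // shorter suffix sorts first
    if (text[pos + k] != phrase[k]) return text[pos + k] < phrase[k] ? -1 : 1;
  }
  return 0;
}

// Half-open range [first, last) of suffix array indices whose suffixes start
// with the phrase. The corpus positions are sa[first] .. sa[last - 1].
std::pair<std::size_t, std::size_t> FindPhrase(
    const Corpus& text, const std::vector<std::size_t>& sa,
    const Corpus& phrase) {
  assert(sa.size() == text.size());
  auto lo = std::partition_point(
      sa.begin(), sa.end(), [&](std::size_t pos) {
        return ComparePrefix(text, pos, phrase) < 0;
      });
  auto hi = std::partition_point(lo, sa.end(), [&](std::size_t pos) {
    return ComparePrefix(text, pos, phrase) == 0;
  });
  return std::make_pair(static_cast<std::size_t>(lo - sa.begin()),
                        static_cast<std::size_t>(hi - sa.begin()));
}
```

Because all suffixes that begin with the same phrase are adjacent in the sorted order, two binary searches are enough to find every occurrence of a source phrase in the corpus.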

Running the code

Once Moses is compiled, the suffix array generation tool is in the misc/ subdirectory. Run it with the -h flag to see the usage instructions:

 misc/generateSuffixArrays -h

 Arguments for generateSuffixArrays (defaults):

 -source	source corpus (f)
 -target	target corpus (e)
 -align	alignment data (alignment)

 -format	alignment format {pharaoh, rwth} (pharaoh)

 -debug	dumps suffix array and alignment in text format (false)
 -name	prefix for data & vocab file names (suffix.array)

 Juri Ganitkevitch
 RWTH Aachen, Lehrstuhl f. Informatik 6

An example run would be:

 misc/generateSuffixArrays -s test2007.de -t test2007.en -a de_en_test2007_alignments.txt -debug 1

This creates a number of files:

suffix.array.align.bin - a binary file containing the auxiliary data structure facilitating fast phrase extraction from the word alignments
suffix.array.align.debug.txt - the same data as above, in text format instead of binary
suffix.array.source.bin - a binary version of the source suffix array
suffix.array.source.debug.txt - a text version of the source suffix array
suffix.array.source.voc - the source vocab file
suffix.array.target.bin - a binary version of the target suffix array
suffix.array.target.debug.txt - a text version of the target suffix array
suffix.array.target.voc - the target vocab file

Source code

Classes and interfaces (.cpp/.h)

SuffixArray: implements suffix array, including construction
SuffixArrayVocabulary: implements the numberization scheme (integer representation of words), and handles the mapping between this representation and the Moses-internal integer representation
PhraseDictionarySuffixArray: subclasses Moses's PhraseDictionary and allows translation lookup for source phrases
SuffixArrayAlignment: provides the auxiliary data structures for the fast phrase extraction.
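To make the phrase extraction step concrete, the sketch below shows the usual consistency check on word alignments for one sentence pair: given a source span, find the smallest target span covering its alignment points and reject it if any target word in that span is aligned to a source word outside the source span. This is a plain linear scan for illustration only. It is not the auxiliary data structure in SuffixArrayAlignment, it does not extend spans over unaligned boundary words, and the names Span and TargetSpanFor are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One alignment point: (source word index, target word index), both 0-based.
using AlignmentPoint = std::pair<int, int>;

// Inclusive word span within a sentence.
struct Span {
  int begin;
  int end;
};

// Returns true and fills *out with the minimal consistent target span for
// src, or returns false if src has no aligned words or is inconsistent.
bool TargetSpanFor(const std::vector<AlignmentPoint>& alignment, Span src,
                   Span* out) {
  assert(src.begin <= src.end);
  bool found = false;
  int lo = 0;
  int hi = 0;
  // Pass 1: smallest target span covering all points inside the source span.
  for (std::size_t i = 0; i < alignment.size(); ++i) {
    const int s = alignment[i].first;
    const int t = alignment[i].second;
    if (s < src.begin || s > src.end) continue;
    lo = found ? std::min(lo, t) : t;
    hi = found ? std::max(hi, t) : t;
    found = true;
  }
  if (!found) return false;
  // Pass 2: no target word in [lo, hi] may align outside the source span.
  for (std::size_t i = 0; i < alignment.size(); ++i) {
    const int s = alignment[i].first;
    const int t = alignment[i].second;
    if (t >= lo && t <= hi && (s < src.begin || s > src.end)) return false;
  }
  out->begin = lo;
  out->end = hi;
  return true;
}
```

In a suffix-array phrase table this check would run once per sampled occurrence of the source phrase, which is why a faster auxiliary structure than this linear scan is useful.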

Page last modified on May 18, 2008, at 12:34 PM