Designers:

  • Chris Callison-Burch
  • Andreas Eisele
  • Juri Ganitkevitch
  • Adam Lopez

Here is the basic idea that we are trying to implement: http://aclweb.org/anthology-new/P/P05/P05-1032.pdf http://projectile.is.cs.cmu.edu/research/public/publications/eamt2005zhang.pdf

Project goals

We need implementations of the following data structures:

  • A numberized representation of the source and target side texts (a data array).
  • A suffix array on the source side text.
  • An efficient representation of alignments.
  • A method to map between the Moses numberization and the local numberizations.
  • Methods to extract and score phrases.

These will need to be hidden underneath the PhraseDictionary interface.
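To make the first two data structures concrete, here is a minimal sketch of a numberized data array and a suffix array built over it, with a binary-search lookup of a phrase. The names Vocab, numberize, buildSuffixArray and findPhrase are made up for illustration and are not the names used in the Moses code. The construction uses a plain comparison sort rather than an efficient suffix sorting algorithm, so it only shows the idea.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using WordId = std::int32_t;

// Hypothetical local vocabulary: assigns consecutive integer IDs to words.
struct Vocab {
    std::unordered_map<std::string, WordId> ids;
    std::vector<std::string> words;

    WordId idOf(const std::string& w) {
        auto it = ids.find(w);
        if (it != ids.end()) return it->second;
        WordId id = static_cast<WordId>(words.size());
        ids.emplace(w, id);
        words.push_back(w);
        return id;
    }
};

// Numberize a tokenized corpus into one flat data array of word IDs.
std::vector<WordId> numberize(const std::vector<std::string>& tokens, Vocab& vocab) {
    std::vector<WordId> data;
    data.reserve(tokens.size());
    for (const auto& t : tokens) data.push_back(vocab.idOf(t));
    return data;
}

// Suffix array: all corpus positions, sorted by the suffix starting there.
// Simple comparison sort; fine for a sketch, too slow for a large corpus.
std::vector<std::size_t> buildSuffixArray(const std::vector<WordId>& data) {
    std::vector<std::size_t> sa(data.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
        return std::lexicographical_compare(data.begin() + a, data.end(),
                                            data.begin() + b, data.end());
    });
    return sa;
}

// Returns the range [first, second) of suffix array entries whose suffixes
// start with `phrase`. All occurrences of a phrase are contiguous in the array.
std::pair<std::size_t, std::size_t> findPhrase(const std::vector<WordId>& data,
                                               const std::vector<std::size_t>& sa,
                                               const std::vector<WordId>& phrase) {
    assert(sa.size() == data.size());
    // Compare only the first phrase.size() words of each suffix.
    auto suffixLess = [&](std::size_t pos, const std::vector<WordId>& p) {
        auto end = data.begin() + std::min(data.size(), pos + p.size());
        return std::lexicographical_compare(data.begin() + pos, end, p.begin(), p.end());
    };
    auto phraseLess = [&](const std::vector<WordId>& p, std::size_t pos) {
        auto end = data.begin() + std::min(data.size(), pos + p.size());
        return std::lexicographical_compare(p.begin(), p.end(), data.begin() + pos, end);
    };
    auto lo = std::lower_bound(sa.begin(), sa.end(), phrase, suffixLess);
    auto hi = std::upper_bound(lo, sa.end(), phrase, phraseLess);
    return std::make_pair(static_cast<std::size_t>(lo - sa.begin()),
                          static_cast<std::size_t>(hi - sa.begin()));
}
```

The entries sa[first] to sa[second - 1] are the corpus positions where the phrase occurs, which is what phrase extraction and scoring would then work from.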

We will need to write code to create these data structures and read/write them to disk as memory-mapped data structures.
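As a rough sketch of what reading and writing a memory-mapped array could look like, the code below writes an integer array as raw binary and maps it back read-only with POSIX mmap. The names writeArray and MappedArray are hypothetical, and the on-disk layout (raw 32-bit integers in host byte order, no header) is an assumption of this sketch, not the format of the .bin files produced by the actual tool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Write the array as raw 32-bit integers (assumed layout: no header, host byte order).
void writeArray(const std::string& path, const std::vector<std::int32_t>& data) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) throw std::runtime_error("cannot open " + path);
    out.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(data.size() * sizeof(std::int32_t)));
    if (!out) throw std::runtime_error("write failed for " + path);
}

// Read-only view of such a file, backed by mmap instead of a copy in memory.
class MappedArray {
public:
    explicit MappedArray(const std::string& path) {
        int fd = ::open(path.c_str(), O_RDONLY);
        if (fd < 0) throw std::runtime_error("cannot open " + path);
        struct stat st;
        if (::fstat(fd, &st) != 0) {
            ::close(fd);
            throw std::runtime_error("cannot stat " + path);
        }
        bytes_ = static_cast<std::size_t>(st.st_size);
        if (bytes_ > 0) {  // mmap rejects zero-length mappings
            void* p = ::mmap(nullptr, bytes_, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED) {
                ::close(fd);
                throw std::runtime_error("mmap failed for " + path);
            }
            base_ = static_cast<const std::int32_t*>(p);
        }
        ::close(fd);  // the mapping stays valid after the descriptor is closed
    }

    ~MappedArray() {
        if (base_) ::munmap(const_cast<std::int32_t*>(base_), bytes_);
    }

    MappedArray(const MappedArray&) = delete;
    MappedArray& operator=(const MappedArray&) = delete;

    std::size_t size() const { return bytes_ / sizeof(std::int32_t); }

    std::int32_t operator[](std::size_t i) const {
        assert(i < size());
        return base_[i];
    }

private:
    const std::int32_t* base_ = nullptr;
    std::size_t bytes_ = 0;
};
```

With a mapping like this, the operating system pages the data in on demand, so a large corpus does not have to be read into memory in full before the first lookup.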

There is an example of converting between numberization schemes in the implementation of the prefix tree adaptor. This looks a bit complicated, as one integer is first converted to a string, then to an integer in the other scheme. A better way to do this might be to construct an integer-to-integer mapping when the data structure is first read into memory, and use this for subsequent access.
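The integer-to-integer mapping suggested above could look roughly like the following sketch. The second numberization scheme is stood in for by a plain string-to-integer map called ExternalVocab; in Moses that role is played by FactorCollection, whose real API is not modelled here. The names buildIdMap, toExternal and kUnknown are likewise hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using LocalId = std::int32_t;
using ExternalId = std::int32_t;

// Marker for local words that the other scheme does not know (an assumption).
const ExternalId kUnknown = -1;

// Stand-in for the other numberization scheme; not the Moses FactorCollection API.
using ExternalVocab = std::unordered_map<std::string, ExternalId>;

// Build the table once at load time: table[localId] == external ID.
// This costs one string lookup per word type, not one per token access.
std::vector<ExternalId> buildIdMap(const std::vector<std::string>& localWords,
                                   const ExternalVocab& external) {
    std::vector<ExternalId> table(localWords.size(), kUnknown);
    for (std::size_t i = 0; i < localWords.size(); ++i) {
        auto it = external.find(localWords[i]);
        if (it != external.end()) table[i] = it->second;
    }
    return table;
}

// Subsequent conversions are a single array access, with no strings involved.
inline ExternalId toExternal(const std::vector<ExternalId>& table, LocalId id) {
    assert(id >= 0 && static_cast<std::size_t>(id) < table.size());
    return table[static_cast<std::size_t>(id)];
}
```

Here localWords is the local vocabulary indexed by local ID, for example as read from a vocab file. A second table in the opposite direction can be built the same way if lookups from the other scheme into the local one are also needed.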

Things we need to understand in Moses code:

  • PhraseDictionary interface.
  • Numberization (FactorCollection).
  • How score producers are integrated into the code.

Things we'll need to understand algorithmically:

  • Suffix arrays.
  • Prefix trees.
  • The phrase extraction algorithm.
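For the phrase extraction item, the core of the standard approach in phrase-based translation is a consistency check against the word alignment: a source span may be paired with a target span only if no word inside the target span is aligned to a source word outside the source span. The sketch below implements just that check for one sentence pair. The names AlignPoint, Span and extractTargetSpan are hypothetical, the sketch does not extend target spans over unaligned boundary words, and it is not claimed to match what the code in this project does.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// One alignment point: (source position, target position), both 0-based.
using AlignPoint = std::pair<int, int>;

// Inclusive span of word positions within one sentence.
struct Span {
    int begin;
    int end;
};

// For the source span `src`, find the smallest target span covering all
// aligned target words and check that it is consistent with the alignment.
// Returns false if no word in `src` is aligned or if the check fails.
bool extractTargetSpan(const std::vector<AlignPoint>& alignment, Span src, Span& tgt) {
    assert(src.begin <= src.end);
    int lo = -1;
    int hi = -1;
    for (const auto& p : alignment) {
        if (p.first >= src.begin && p.first <= src.end) {
            lo = (lo < 0) ? p.second : std::min(lo, p.second);
            hi = std::max(hi, p.second);
        }
    }
    if (lo < 0) return false;  // nothing in the source span is aligned

    // Consistency: no target word in [lo, hi] may align outside the source span.
    for (const auto& p : alignment) {
        bool insideTarget = p.second >= lo && p.second <= hi;
        bool outsideSource = p.first < src.begin || p.first > src.end;
        if (insideTarget && outsideSource) return false;
    }
    tgt = Span{lo, hi};
    return true;
}
```

With a suffix array, the source span comes from each occurrence of the phrase found in the corpus, and a check of this kind is run against the alignment of the sentence containing that occurrence.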

Running the code

Once Moses is compiled, the suffix array generation tool (generateSuffixArrays) is in the misc/ subdirectory. Run it with the -h flag to print its usage instructions:

 misc/generateSuffixArrays -h

 Arguments for generateSuffixArrays (defaults):

 -source	source corpus (f)
 -target	target corpus (e)
 -align	alignment data (alignment)

 -format	alignment format {pharaoh, rwth} (pharaoh)

 -debug	dumps suffix array and alignment in text format (false)
 -name	prefix for data & vocab file names (suffix.array)

 Juri Ganitkevitch
 RWTH Aachen, Lehrstuhl f. Informatik 6

An example run would be:

 misc/generateSuffixArrays -s test2007.de -t test2007.en -a de_en_test2007_alignments.txt -debug 1

This creates a number of files:

  • suffix.array.align.bin - a binary file containing the auxiliary data structure facilitating fast phrase extraction from the word alignments
  • suffix.array.align.debug.txt - the same data as above, but in text format instead of binary
  • suffix.array.source.bin - a binary version of the source suffix array
  • suffix.array.source.debug.txt - a text version of the source suffix array
  • suffix.array.source.voc - the source vocab file
  • suffix.array.target.bin - a binary version of the target suffix array
  • suffix.array.target.debug.txt - a text version of the target suffix array
  • suffix.array.target.voc - the target vocab file
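For reference, the pharaoh alignment format is commonly written as one line per sentence pair, holding space-separated i-j points where i is a 0-based source position and j a 0-based target position (for example "0-0 1-2 2-1"). The sketch below parses one such line. The function name parsePharaohLine is hypothetical, and whether generateSuffixArrays expects exactly this layout is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Parse one line of alignment points written as "i-j", separated by
// whitespace, into (source position, target position) pairs.
std::vector<std::pair<int, int>> parsePharaohLine(const std::string& line) {
    std::vector<std::pair<int, int>> points;
    std::istringstream in(line);
    std::string tok;
    while (in >> tok) {
        std::size_t dash = tok.find('-');
        if (dash == std::string::npos || dash == 0 || dash + 1 == tok.size()) {
            throw std::invalid_argument("malformed alignment point: " + tok);
        }
        assert(dash < tok.size());
        // std::stoi throws std::invalid_argument if a side is not a number.
        int source = std::stoi(tok.substr(0, dash));
        int target = std::stoi(tok.substr(dash + 1));
        if (source < 0 || target < 0) {
            throw std::invalid_argument("negative position in: " + tok);
        }
        points.emplace_back(source, target);
    }
    return points;
}
```

An empty line yields an empty list of points, which corresponds to a sentence pair without any alignment links.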

Source code

Classes and interfaces (.cpp/.h)

  • SuffixArray: implements suffix array, including construction
  • SuffixArrayVocabulary: implements the numberization scheme (integer representation of words) and handles the mapping between this representation and the Moses-internal integer representation
  • PhraseDictionarySuffixArray: subclasses Moses's PhraseDictionary and allows translation lookup for source phrases
  • SuffixArrayAlignment: provides the auxiliary data structures for the fast phrase extraction.
Page last modified on May 18, 2008, at 12:34 PM