Designers:
Here is the basic idea that we are trying to implement: http://aclweb.org/anthology-new/P/P05/P05-1032.pdf http://projectile.is.cs.cmu.edu/research/public/publications/eamt2005zhang.pdf
We need implementations of the following data structures:
These will need to be hidden underneath the PhraseDictionary interface.
We will need to write code to create these data structures and read/write them to disk as memory-mapped data structures.
There is an example of converting between numberization schemes in the implementation of the prefix tree adaptor. This looks a bit complicated, as each integer is first converted to a string and then to an integer in the other scheme. A better way to do this might be to construct an integer-to-integer mapping when the data structure is first read into memory, and use it for all subsequent accesses.
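The integer-to-integer mapping suggested above could look roughly like the sketch below. It is only an illustration of the idea, not Moses code: `Vocab`, `BuildIdMap`, `Translate` and `kUnknownId` are hypothetical names, and the real vocabulary classes in Moses and in the suffix array code will have different interfaces. The assumption is that each scheme can enumerate its words by ID and look up an ID by word.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one numberization scheme (not a Moses class).
struct Vocab {
    std::vector<std::string> words;            // id -> word
    std::unordered_map<std::string, int> ids;  // word -> id

    // Returns the existing ID of w, or assigns the next free one.
    int Add(const std::string& w) {
        auto it = ids.find(w);
        if (it != ids.end()) return it->second;
        int id = static_cast<int>(words.size());
        words.push_back(w);
        ids.emplace(w, id);
        return id;
    }
};

// Marker for words that exist in the source scheme but not in the target one.
constexpr int kUnknownId = -1;

// Built once, when the data structure is read into memory:
// table[id in `from`] = id in `to`, or kUnknownId if `to` lacks the word.
// The string lookups happen here only, one per vocabulary entry.
std::vector<int> BuildIdMap(const Vocab& from, const Vocab& to) {
    std::vector<int> table(from.words.size(), kUnknownId);
    for (std::size_t i = 0; i < from.words.size(); ++i) {
        auto it = to.ids.find(from.words[i]);
        if (it != to.ids.end()) table[i] = it->second;
    }
    return table;
}

// Used for every subsequent access: a single array index, no strings involved.
inline int Translate(const std::vector<int>& table, int id) {
    assert(id >= 0 && static_cast<std::size_t>(id) < table.size());
    return table[id];
}
```

With this layout the cost of going through strings is paid once per vocabulary entry at load time, and each later conversion is a constant-time array lookup. Callers still need to decide how to treat `kUnknownId`, for example by mapping it to the target scheme's unknown-word ID.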
Things we need to understand in Moses code:
Things we'll need to understand algorithmically:
Once Moses is compiled, the suffix array generator is in the misc/ subdirectory. Running it with the -h flag prints the usage instructions:
misc/generateSuffixArrays -h
Arguments for generateSuffixArrays (defaults):
  -source  source corpus (f)
  -target  target corpus (e)
  -align   alignment data (alignment)
  -format  alignment format {pharaoh, rwth} (pharaoh)
  -debug   dumps suffix array and alignment in text format (false)
  -name    prefix for data & vocab file names (suffix.array)
Juri Ganitkevitch RWTH Aachen, Lehrstuhl f. Informatik 6
An example run would be:
misc/generateSuffixArrays -s test2007.de -t test2007.en -a de_en_test2007_alignments.txt -debug 1
This creates a number of files:
Classes and interfaces (.cpp/.h)