Dynamic Suffix Arrays for Moses

Suffix arrays are space-efficient data structures for fast searching over large text strings and have been utilized in SMT to extract arbitrarily large phrases or number of hierarchical syntax rules over terascale-level corpora during decoding by storing the bi-text corpora itself in memory. When a phrase or rule is needed by the decoder, the corpora is searched, counts accumulated, and the rule probability computed on the fly. Without using these suffix arrays the scale of the experiments conducted in the literature would have been impossible due to the intractable computational resources required to hold the full set of rules and associated parameters explicitly in memory over the whole dataset. This project aims to incorporate the suffix array data structure into the Moses decoder enabling large scale experimentation for online SMT models, domain adaptation, scaling effects, etc. To enable experimentation for adaptive TMs the project aims to implement the dynamic variant of the suffix array. Dynamizing the suffix array adds some complexity compared to the static variant since we must keep the array lexically ordered despite inserts and deletes. This requires some sophisticated algorithmic techniques to enable this to occur quickly and in small space.

Page last modified on January 25, 2010, at 10:41 AM