Projects

Bounded Memory Lm And Phrase Table Building

Project leader: Kenneth Heafield

Desirable skills for participants: C++

Modified Kneser-Ney estimation can be done in several stages:

1. Count n-grams

2. Sort n-grams in suffix order

3. Adjust counts and compute smoothing summary statistics

4. Estimate probabilities

5. Sort in a different order

6. Compute backoffs

7. For storage in a reverse trie, sort in suffix order again

This algorithm makes use of multiple streams, one for each n-gram length. That critical addition generalizes MapReduce and will make building faster than the algorithm given in Brants et al. For phrase table estimation, the problem is even simpler: sort by target to obtain p(s|t) and sort by source to obtain p(t|s), annotating each phrase. During the marathon, the goal is to write a framework to support multiple streams with optional sorting, flushing to disk as necessary. We will work on n-gram count or phrase table scoring as an intermediate task. Continuation work will use this framework to estimate language models in bounded user-specified memory.