Bounded Memory Lm And Phrase Table Building
Project leader: Kenneth Heafield
Desirable skills for participants: C++
Modified Kneser-Ney estimation can be done in several stages:
1. Count n-grams
2. Sort n-grams in suffix order
3. Adjust counts and compute smoothing summary statistics
4. Estimate probabilities
5. Sort in a different order
6. Compute backoffs
7. For storage in a reverse trie, sort in suffix order again
This algorithm makes use of multiple streams, one for each n-gram length. That critical addition generalizes MapReduce and will make building faster than the algorithm given in Brants et al. For phrase table estimation, the problem is even simpler: sort by target to obtain p(s|t) and sort by source to obtain p(t|s), annotating each phrase. During the marathon, the goal is to write a framework to support multiple streams with optional sorting, flushing to disk as necessary. We will work on n-gram count or phrase table scoring as an intermediate task. Continuation work will use this framework to estimate language models in bounded user-specified memory.