Left-to-right LM integration for SCFG decoders

People

  • Blunsom, leader in exile
  • Chris Dyer
  • Jonathan Weese
  • Juri Ganitkevitch
  • Adam Lopez

LM integration in SCFG-based decoders is typically accomplished using heuristic search strategies that process the model's best guess for what the top k items are at each node in the -LM translation forest (Huang 2007). This approach is unsatisfactory for at least two reasons: the k-best lists are often quite redundant, resulting in repeated requests to score the same n-grams. Second, since items are processed in a bottom-up order, their context is unavailable, and it is therefore necessary to split search states using both left and right contexts of (n-1) words for every span, resulting in a massive combinatorial explosion that must be dealt with with aggressive pruning. Recent work in synchronous parsing (Dyer, submitted) and in finite-state variations of SCFG decoding (Iglesias et al. 2009) suggest that both of these limitations may be avoided. Our project is to outfit an existing SCFG-based decoder (either Joshua, Moses, or cdec) with a novel LM integration algorithm based on generation from a sentence-specific SCFG that is weakly equivalent to what is commonly called the -LM translation forest. In our proposed algorithm, integration occurs in three phases: first the source is parsed exhaustively, without using a language model; then, in a second phase, the -LM forest is determinized, non-coaccessible states are removed, and the result is projected onto a CFG generating strings in the target language; in the final phase, we incorporate a target LM using a top-down parsing algorithm. We expect the performance of this algorithm will be quite good since we can leverage numerous speedups: 1) Only paths capable of leading to a goal derivation will be scored (ie., non-coaccessible states will have been removed by the time the LM is incorporated). 2) The intermediate SCFG will be determinized, reducing redundancy in the grammar (cf. Iglesias et al. 2009). 3) LM will be integrated from left to right, so parse states will be split only using the (n-1) right-most words, rather than the (n-1) words on both sides of a span. Additionally, this algorithm can be explained intuitively using basic concepts from parsing and formal language theory, making it accessible to new-comers to translation who are familiar with formal language theory.

Page last modified on April 22, 2010, at 01:09 PM