Large-Scale Discriminative Training

The current mix of generative models, ad hoc scoring functions, and discriminative tuning of only a handful of weights is theoretically unappealing, so there has been a long-standing effort to train all the millions of parameters of a statistical machine translation model discriminatively.

Large Scale Discriminative Training is the main subject of 44 publications. 18 are discussed here.


Large-scale discriminative training methods that optimize millions of features over the entire training corpus have emerged recently. Tillmann and Zhang (2005) add a binary feature for each phrase translation table entry and train feature weights using a stochastic gradient descent method. Kernel regression methods may be applied to the same task (Wang et al., 2007; Wang and Shawe-Taylor, 2008). Wellington et al. (2006) apply discriminative training to a tree translation model. Large-scale discriminative training may also use the perceptron algorithm (Liang et al., 2006) or variations thereof (Tillmann and Zhang, 2006) to directly optimize error metrics such as BLEU.
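The core of such perceptron-style training is a simple update: move the weights toward the features of the reference (or oracle) translation and away from those of the model's current best output. A minimal sketch, in the spirit of Liang et al. (2006); the `decode` and `features` functions are hypothetical stand-ins, not the API of any real MT toolkit:

```python
# Sketch of one structured-perceptron pass over a parallel corpus.
# `decode(source, weights)` and `features(source, hypothesis)` are
# hypothetical: a decoder returning the model 1-best and a sparse
# feature extractor returning {feature_name: value}.

def perceptron_epoch(corpus, weights, decode, features, lr=1.0):
    """One pass: reward the reference's features, penalize the 1-best's."""
    for source, reference in corpus:
        prediction = decode(source, weights)  # model 1-best translation
        if prediction != reference:
            # Promote features of the correct translation.
            for feat, value in features(source, reference).items():
                weights[feat] = weights.get(feat, 0.0) + lr * value
            # Demote features of the wrong prediction.
            for feat, value in features(source, prediction).items():
                weights[feat] = weights.get(feat, 0.0) - lr * value
    return weights
```

With millions of sparse features, only the features firing on the two translations are touched per sentence, which is what makes this scale to the full training corpus.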
Arun and Koehn (2007) compare MIRA and the perceptron algorithm and point out some of the problems on the road to large-scale discriminative training. This approach has also been applied to a variant of the hierarchical phrase model (Watanabe et al., 2007; Watanabe et al., 2007b). The MIRA algorithm may also be used for an extended form of parameter tuning (Chiang et al., 2008), allowing for the use of thousands of features (Chiang et al., 2009), covering properties such as source and target syntax (Chiang, 2010), on a larger tuning set.
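MIRA differs from the perceptron in that each update is the smallest weight change that scores the oracle above the hypothesis by a margin equal to the loss (e.g. the BLEU difference), with the step size capped. A single-constraint sketch of the update used in this line of work (cf. Chiang et al., 2008); the sparse-dictionary feature representation is an assumption here:

```python
def mira_update(weights, feats_oracle, feats_hyp, loss, C=1.0):
    """Single-constraint MIRA step: smallest change to `weights` that
    separates oracle from hypothesis by a margin of `loss`, with the
    step size capped at C."""
    # Feature difference between oracle and model hypothesis.
    delta = {f: feats_oracle.get(f, 0.0) - feats_hyp.get(f, 0.0)
             for f in set(feats_oracle) | set(feats_hyp)}
    margin = sum(weights.get(f, 0.0) * v for f, v in delta.items())
    norm_sq = sum(v * v for v in delta.values())
    if norm_sq == 0.0:
        return weights  # identical features: nothing to update
    # Hinge condition: update only if the margin falls short of the loss.
    alpha = min(C, max(0.0, loss - margin) / norm_sq)
    for f, v in delta.items():
        weights[f] = weights.get(f, 0.0) + alpha * v
    return weights
```

The cap C keeps any single sentence from moving the weights too far, which is one reason MIRA tolerates many more features than line-search tuning.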
Blunsom et al. (2008) argue for the importance of performing feature updates on all derivations of a translation, not just the most likely one, to address spurious ambiguity. A representative subset of translations may be acquired by sampling (Arun et al., 2009). This allows for a unified approach to minimum risk training and decoding (Arun et al., 2010). While Arun et al. (2009) use Gibbs sampling, simpler methods such as SampleRank (Haddow et al., 2011) may be used as well.
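For a log-linear model, sampling makes minimum risk training tractable because the gradient of the expected loss has the closed form dE[L]/dw_k = E[L·f_k] − E[L]·E[f_k], with all expectations estimated over translations sampled from the model. A sketch of that estimator, assuming samples drawn from the model distribution and a hypothetical `features` extractor:

```python
def risk_gradient(samples, features):
    """Estimate the minimum-risk gradient from model samples:
    dE[L]/dw_k = E[L * f_k] - E[L] * E[f_k].
    `samples` is a list of (hypothesis, loss) pairs drawn from the
    model; `features(hypothesis)` is a hypothetical sparse extractor."""
    n = len(samples)
    exp_loss = sum(loss for _, loss in samples) / n  # E[L]
    exp_loss_feat, exp_feat = {}, {}
    for hyp, loss in samples:
        for f, v in features(hyp).items():
            exp_feat[f] = exp_feat.get(f, 0.0) + v / n           # E[f_k]
            exp_loss_feat[f] = exp_loss_feat.get(f, 0.0) + loss * v / n  # E[L*f_k]
    return {f: exp_loss_feat.get(f, 0.0) - exp_loss * exp_feat[f]
            for f in exp_feat}
```

Stepping the weights against this gradient lowers the expected loss (e.g. 1 − BLEU) under the model, unifying training and minimum Bayes risk decoding as in Arun et al. (2010).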
Machine translation may be framed as a structured prediction problem, an active strand of machine learning research. Zhang et al. (2008) frame ITG decoding in such a way and propose a discriminative training method following the SEARN algorithm (Daumé III et al., 2006).



Related Topics

Discriminative training methods require a translated training corpus, which is also a requirement for generative training of word-based models, phrase-based models, and syntax-based models.

New Publications

  • Tamchyna et al. (2016)
  • Braune et al. (2016)
  • Wuebker et al. (2015)
  • Sokolov et al. (2015)
  • Eidelman et al. (2013)
  • Song et al. (2014)
  • Saluja and Zhang (2014)
  • Green et al. (2014)
  • Tan et al. (2013)
  • Zhao et al. (2014)
  • Auli et al. (2014)
  • Simianer and Riezler (2013)
  • Flanigan et al. (2013)
  • Cherry and Foster (2012)
  • Gimpel and Smith (2012)
  • Chiang (2012)
  • Green et al. (2013)
  • Arun et al. (2010)
  • Duan et al. (2012)
  • Simianer et al. (2012)
  • Wuebker et al. (2012)
  • Cao and Khudanpur (2012)
  • Hasler et al. (2012)
  • Li et al. (2011)
  • Xiao et al. (2011)