Neural machine translation models are typically trained on word predictions as given by sentence pairs from a parallel corpus with cross-entropy loss as an objective function.

Training is the main subject of 55 publications. 26 are discussed here.


A number of recently developed key techniques have entered the standard repertoire of neural machine translation research. Ranges for the random initialization of weights need to be carefully chosen (Glorot and Bengio, 2010). To avoid overconfidence of the model, label smoothing may be applied, i.e., optimization towards a target distribution that shifts probability mass away from the given correct target word towards other words (Chorowski and Jaitly, 2017). Distributing training over several GPUs creates the problem of synchronizing updates; Chen et al. (2016) compare various methods, including asynchronous updates. Training is made more robust by methods such as drop-out (Srivastava et al., 2014), where during training intervals a number of nodes are randomly masked. To avoid exploding or vanishing gradients during back-propagation over several layers, gradients are typically clipped (Pascanu et al., 2013). Chen et al. (2018) briefly present adaptive gradient clipping. Layer normalization (Lei Ba et al., 2016) has a similar motivation: it ensures that node values stay within reasonable bounds.
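Two of the techniques above, label smoothing and gradient clipping, can be illustrated with a minimal sketch in plain Python. The function names, the default values (`epsilon=0.1`, `max_norm=1.0`), and the choice to spread the smoothed mass uniformly over the other words are assumptions made for illustration; they are not taken from the cited papers.

```python
import math

def label_smoothed_cross_entropy(log_probs, target, epsilon=0.1):
    """Cross-entropy against a smoothed target distribution for one position.

    log_probs: log-probabilities over the vocabulary (at least 2 entries).
    target:    index of the correct target word.
    epsilon:   probability mass moved away from the correct word
               (assumed value; spread uniformly over the other words).
    """
    vocab_size = len(log_probs)
    off_target = epsilon / (vocab_size - 1)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        q = 1.0 - epsilon if i == target else off_target
        loss -= q * lp
    return loss

def clip_by_global_norm(gradients, max_norm=1.0):
    """Rescale a flat list of gradient values if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]

# Toy usage: a 3-word vocabulary and a gradient vector with norm 5.
log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]
smoothed_loss = label_smoothed_cross_entropy(log_probs, target=0)
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

With `epsilon=0` the first function reduces to the ordinary cross-entropy loss; larger values penalize a model that puts nearly all probability on a single word.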

Adjusting the Learning Rate:

Optimization methods that adjust the learning rate of gradient descent training are an active topic of research. Popular methods are Adagrad (Duchi et al., 2011), Adadelta (Zeiler, 2012), and, currently, Adam (Kingma and Ba, 2015).
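As a rough illustration of how such a method adapts the step size per parameter, here is a sketch of a single Adam update on a flat list of parameters. The function name `adam_step` and the list-based interface are hypothetical; the default hyperparameters are the commonly quoted ones and should be treated as assumptions here.

```python
import math

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count; returns (params, m, v)."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Exponential moving averages of the gradient and the squared gradient.
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # The effective step shrinks for parameters with large recent gradients.
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Toy usage: one parameter, one step.
params, m, v = adam_step([1.0], [0.5], [0.0], [0.0], t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's magnitude.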

Sequence-Level Optimization:

Shen et al. (2016) introduce minimum risk training, which allows for sentence-level optimization with metrics such as the BLEU score. A set of possible translations is sampled and their relative probability is used to compute the expected loss (probability-weighted BLEU scores of the sampled translations). They show large gains on a Chinese-English task. Neubig (2016) also showed gains when optimizing towards smoothed sentence-level BLEU, using a sample of 20 translations. Hashimoto and Tsuruoka (2019) optimize towards the GLEU score and speed up training by vocabulary reduction. Wiseman and Rush (2016) use a loss function that penalizes the gold standard falling off the beam during training. Ma et al. (2019) also consider the point where the gold standard falls off the beam, but record the loss for this initial sequence prediction and then reset the beam to the gold standard at that point. Edunov et al. (2018) compare various word-level and sentence-level optimization techniques but see only small gains by the best-performing sentence-level minimum risk method over alternatives. Xu et al. (2019) use a mix of gold-standard and predicted words in the prefix. They use an alignment component to keep the mixed prefix and the target training sentence in sync. Zhang et al. (2019) gradually shift from matching the ground truth towards so-called word-level oracles obtained with Gumbel noise and sentence-level oracles obtained by selecting the BLEU-best translation from the n-best list obtained by beam search.
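The expected loss used in minimum risk training can be sketched as follows: the model probabilities of the sampled translations are renormalized over the sample, and each translation's loss (here 1 minus its sentence-level BLEU) is weighted by that relative probability. The function name and the `alpha` sharpness parameter are assumptions for illustration, and the BLEU scores are taken as given inputs rather than computed.

```python
import math

def expected_risk(sample_log_probs, sample_bleu, alpha=1.0):
    """Probability-weighted loss over a set of sampled translations.

    sample_log_probs: model log-probability of each sampled translation.
    sample_bleu:      sentence-level BLEU in [0, 1] of each sample.
    alpha:            sharpness applied before renormalizing (assumed parameter).
    """
    scaled = [alpha * lp for lp in sample_log_probs]
    # Subtract the maximum before exponentiating for numerical stability.
    max_scaled = max(scaled)
    weights = [math.exp(s - max_scaled) for s in scaled]
    total = sum(weights)
    return sum((w / total) * (1.0 - b) for w, b in zip(weights, sample_bleu))

# Toy usage: two samples, the more probable one also has the higher BLEU.
risk = expected_risk([math.log(0.75), math.log(0.25)], [1.0, 0.0])
```

Minimizing this quantity shifts probability mass towards the samples with higher BLEU, which is what makes the objective sentence-level rather than word-level.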

Right-to-Left Training:

Several researchers report that translation quality for the right half of the sentence is lower than for the left half of the sentence and attribute this to exposure bias: during training a correct prefix is used to make word predictions (also called teacher forcing), while during decoding only the previously predicted words can be used. Wu et al. (2018) show that this imbalance is to a large degree due to linguistic reasons: it happens for right-branching languages like English and Chinese, but the opposite is the case for left-branching languages like Japanese.

Adversarial Training:

Wu et al. (2017) introduce adversarial training to neural machine translation, in which a discriminator is trained alongside a traditional machine translation model to distinguish between machine translation output and human reference translations. The ability to fool the discriminator is used as an additional training objective for the machine translation model. Yang et al. (2018) propose a similar setup, but add a BLEU-based training objective to neural translation model training. Cheng et al. (2018) employ adversarial training to address the problem of robustness, which they motivate with the finding that 70% of translations change when an input word is replaced with a synonym. They aim for more robust behavior by adding synthetic training data in which one of the input words is replaced with a synonym (a neighbor in embedding space) and by using a discriminator that predicts from the encoding of an input sentence whether it is an original or an altered source sentence.
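The synthetic-data step described for Cheng et al. (2018), replacing an input word with a neighbor in embedding space, might look roughly like the sketch below. The helper names, the use of cosine similarity, and the toy embedding table are all assumptions for illustration; the paper's actual selection procedure may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbor(word, embeddings):
    """Return the most similar other word in a {word: vector} table."""
    best_word, best_sim = None, -2.0
    for other, vec in embeddings.items():
        if other == word:
            continue
        sim = cosine(embeddings[word], vec)
        if sim > best_sim:
            best_word, best_sim = other, sim
    return best_word

def perturb(sentence, position, embeddings):
    """Copy the sentence with the word at `position` replaced by its neighbor."""
    words = list(sentence)
    words[position] = nearest_neighbor(words[position], embeddings)
    return words

# Toy embedding table (hypothetical values).
toy_embeddings = {"big": [1.0, 0.1], "large": [0.9, 0.2], "cat": [0.0, 1.0]}
altered = perturb(["a", "big", "cat"], 1, toy_embeddings)
```

The original and altered sentences would then both be fed to the encoder, with the discriminator trained to tell their encodings apart.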

Knowledge Distillation:

Several techniques change the loss function to reward not only word predictions that closely match the training data but also predictions that closely match those of a previous model, called the teacher model. Khayrallah et al. (2018) use a general-domain model as teacher to avoid overfitting to in-domain data during domain adaptation by fine-tuning. Wei et al. (2019) use the models that achieved the best results during training at previous checkpoints to guide training.
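One common way to write such a loss is to interpolate the usual cross-entropy against the training data with a cross-entropy against the teacher's predicted distribution. The sketch below assumes this interpolated form; the function name and the `teacher_weight` parameter are hypothetical, and the cited papers may combine the two terms differently.

```python
import math

def distillation_loss(student_log_probs, teacher_probs, target, teacher_weight=0.5):
    """Word-level loss mixing the gold label and the teacher distribution.

    student_log_probs: student log-probabilities over the vocabulary.
    teacher_probs:     teacher probabilities over the same vocabulary.
    target:            index of the correct word in the training data.
    teacher_weight:    interpolation weight for the teacher term (assumed).
    """
    # Standard cross-entropy against the training data.
    gold_loss = -student_log_probs[target]
    # Cross-entropy of the student against the teacher's distribution.
    teacher_loss = -sum(q * lp for q, lp in zip(teacher_probs, student_log_probs))
    return (1.0 - teacher_weight) * gold_loss + teacher_weight * teacher_loss

# Toy usage: the teacher is less peaked than the gold label.
student = [math.log(0.6), math.log(0.3), math.log(0.1)]
loss = distillation_loss(student, [0.5, 0.4, 0.1], target=0)
```

Setting `teacher_weight` to 0 recovers ordinary training; raising it keeps the student closer to the teacher, which is the mechanism that counters overfitting during fine-tuning.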

Faster Training:

Ott et al. (2018) improve training speed with 16-bit arithmetic and larger batches, which lead to less idle time because there is less variance in the time needed to process batches on different GPUs. They scale up training to 128 GPUs.




New Publications

  • Nishimura et al. (2018)

Adversarial Training

  • Cheng et al. (2019)
  • Sato et al. (2019)
  • Elliott (2018)


  • Kreutzer et al. (2018)
  • Kreutzer et al. (2018)
  • Kreutzer et al. (2017)

8-Bit / Speed

  • Quinn and Ballesteros (2018)
  • Bogoychev et al. (2018)

Training Objective

  • Shao et al. (2018) - sequence-level
  • Wieting et al. (2019) - sentence-level optimization
  • Petrushkov et al. (2018) - chunk-based feedback
  • Zheng et al. (2018) - multi-reference
  • Wu et al. (2018) - reinforcement learning


  • Zhou et al. (2019)


  • Chen et al. (2017)


  • Zhang et al. (2017)


  • Wang et al. (2017)


  • Qin et al. (2017)

Automatic Post-Editing

  • Vu and Haffari (2018)


  • Zhang et al. (2016)


  • Cheng et al. (2016)


  • Do et al. (2015)


  • Huang et al. (2015)

Noise Contrastive Estimation

  • Cherry (2016)


  • Freitag et al. (2017)
  • Chen et al. (2017)
  • Kim and Rush (2016)
  • Zhang et al. (2018)
  • Chen et al. (2018)