Neural Machine Translation
The currently dominant model in neural machine translation is the sequence-to-sequence model with attention.
The attention model is the main subject of 31 publications, 8 of which are discussed here.
The attention model has its roots in the sequence-to-sequence model. Cho et al. (2014) use recurrent neural networks for this approach. Sutskever et al. (2014) use an LSTM (long short-term memory) network and reverse the order of the source sentence before decoding. The seminal work by Bahdanau et al. (2015) adds an alignment model (the so-called "attention mechanism") that links each generated output word to source words, conditioned on the hidden state that produced the preceding target word. Each source word is represented by the two hidden states of recurrent neural networks that process the source sentence left-to-right and right-to-left. Luong et al. (2015) propose variants of the attention mechanism (which they call the "global" attention model) as well as a hard-constraint attention model (the "local" attention model) that is restricted to a Gaussian distribution around a specific input word. To explicitly model the trade-off between source context (the input words) and target context (the already produced target words), Tu et al. (2016) introduce an interpolation weight (called a "context gate") that scales the impact of (a) the source context state and (b) the previous hidden state and the last word when predicting the next hidden state in the decoder.
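As a rough illustration of the attention step described above, the following sketch scores each source annotation against the previous decoder state with a small feed-forward alignment model, normalizes the scores with a softmax, and forms a weighted context vector. It also shows a Gaussian reweighting around a chosen source position in the spirit of the "local" attention model. The parameter names (`W`, `U`, `v`), the dimensions, the random values, and the helper functions are assumptions made for this example, not taken from the cited papers' code.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(prev_state, annotations, W, U, v):
    """Return (weights, context) for one decoder step.

    prev_state:  decoder hidden state that produced the preceding target word
    annotations: one vector per source word (forward and backward encoder
                 states concatenated)
    W, U, v:     hypothetical parameters of the alignment model
    """
    projected_state = matvec(W, prev_state)
    scores = []
    for h in annotations:
        projected_h = matvec(U, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [sum(w * h[k] for w, h in zip(weights, annotations))
               for k in range(dim)]
    return weights, context

def local_reweight(weights, center, sigma):
    """Scale attention weights by a Gaussian centered on one source position."""
    return [w * math.exp(-((j - center) ** 2) / (2 * sigma ** 2))
            for j, w in enumerate(weights)]

def random_matrix(rows, cols, rng):
    """Small random matrix standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
state_dim, annot_dim, attn_dim, src_len = 4, 6, 5, 3
W = random_matrix(attn_dim, state_dim, rng)
U = random_matrix(attn_dim, annot_dim, rng)
v = [rng.uniform(-0.5, 0.5) for _ in range(attn_dim)]
prev_state = [rng.uniform(-1, 1) for _ in range(state_dim)]
annotations = random_matrix(src_len, annot_dim, rng)

weights, context = additive_attention(prev_state, annotations, W, U, v)
local = local_reweight(weights, center=1, sigma=1.0)
print("attention weights:", [round(w, 3) for w in weights])
print("locally reweighted:", [round(w, 3) for w in local])
```

The weights form a probability distribution over source positions, so the context vector is a convex combination of the annotations. The local variant here leaves the weight at the center position unchanged and damps the others; how the center is chosen is left out of this sketch.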
Deep Models: There are several ways to add layers to the encoder and the decoder of the neural translation model. Wu et al. (2016) first use the traditional bidirectional recurrent neural networks to compute input word representations and then refine them with several stacked recurrent layers. Zhou et al. (2016) alternate between forward and backward recurrent layers. Barone et al. (2017) show good results with 4 stacks and 2 deep transitions each for encoder and decoder, as well as alternating networks for the encoder. There are a large number of variations (including the use of skip connections, the choice of LSTM vs. GRU, and the number of layers of any type) that still need to be explored empirically under various data conditions.
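One way to picture an alternating deep encoder is the sketch below, which stacks recurrent layers and flips the reading direction at every second layer, with an optional skip connection where dimensions match. It is a simplification under stated assumptions: a plain tanh recurrent cell stands in for LSTM or GRU units, deep transitions are not modeled, and all names, sizes, and random weights are hypothetical rather than drawn from any of the cited systems.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def rnn_layer(inputs, W_in, W_rec, reverse=False):
    """Run a simple tanh RNN over a sequence; return one state per position."""
    positions = range(len(inputs))
    order = reversed(positions) if reverse else positions
    state = [0.0] * len(W_rec)
    outputs = [None] * len(inputs)
    for t in order:
        from_input = matvec(W_in, inputs[t])
        from_state = matvec(W_rec, state)
        state = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
        outputs[t] = state
    return outputs

def alternating_encoder(embeddings, layers, skip=True):
    """Stack recurrent layers, alternating left-to-right and right-to-left."""
    states = embeddings
    for depth, (W_in, W_rec) in enumerate(layers):
        out = rnn_layer(states, W_in, W_rec, reverse=(depth % 2 == 1))
        # Skip connection: add the layer input when the sizes agree.
        if skip and len(out[0]) == len(states[0]):
            out = [[o + s for o, s in zip(o_vec, s_vec)]
                   for o_vec, s_vec in zip(out, states)]
        states = out
    return states

def random_matrix(rows, cols, rng):
    """Small random matrix standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(1)
emb_dim, hidden_dim, num_layers, src_len = 3, 4, 4, 5
embeddings = random_matrix(src_len, emb_dim, rng)
layers = []
for depth in range(num_layers):
    in_dim = emb_dim if depth == 0 else hidden_dim
    layers.append((random_matrix(hidden_dim, in_dim, rng),
                   random_matrix(hidden_dim, hidden_dim, rng)))

states = alternating_encoder(embeddings, layers)
plain_states = alternating_encoder(embeddings, layers, skip=False)
print("positions:", len(states), "state size:", len(states[0]))
```

Because every second layer reads the sequence in the opposite direction, each position's top-layer state can depend on both left and right context after two layers, without running two separate networks per layer.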