Neural Machine Translation
While the attentional sequence-to-sequence model is currently the dominant architecture for neural machine translation, other architectures have been explored.
Alternative architectures are the main subject of 44 publications, 14 of which are discussed here.
Self Attention (Transformer)

Vaswani et al. (2017) replace the recurrent neural networks used in attentional sequence-to-sequence models with multiple self-attention layers (the so-called Transformer), in both the encoder and the decoder. Chen et al. (2018) compare different configurations of Transformer and recurrent neural networks in the encoder and decoder, report that many of the quality gains are due to a handful of training tricks, and show better results with a Transformer encoder and an RNN decoder. Emelin et al. (2019) identify a representation bottleneck in the self-attention layers: having to carry lexical features through every layer prevents the layers from focusing on more complex features. They add shortcut connections from the initial embedding layer to each of the self-attention layers, in both encoder and decoder. Dehghani et al. (2019) propose a variant, called Universal Transformers, that does not use a fixed number of processing layers but an arbitrarily long loop through a single processing layer.
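The core operation these models share is scaled dot-product self-attention: every position in the sequence attends to every other position. The following is a minimal single-head numpy sketch of that computation, not the authors' implementation (the projection matrices and dimensions here are illustrative):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X (n x d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # n x n attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # each position mixes all positions

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

A full Transformer layer stacks several such heads, adds a position-wise feed-forward sublayer, and wraps both in residual connections and layer normalization.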
Deeper Transformer Models

Naive attempts to deepen Transformer models by simply increasing the number of encoder and decoder blocks lead to worse, and sometimes catastrophic, results. Wu et al. (2019) first train a model with n Transformer blocks, then keep their parameters fixed and add m additional blocks. Bapna et al. (2018) argue that the output of earlier encoder layers may be lost and connect all encoder layers to the attention computation of the decoder. Wang et al. (2019) successfully train deep Transformer models with up to 30 layers by relocating the normalization step to the beginning of each block and by adding residual connections to all previous layers, not just the directly preceding one.
Document Context

Maruf et al. (2018) consider the entire source document as context when translating a sentence. Attention is computed over all input sentences, and the sentences are weighted accordingly. Miculicich et al. (2018) extend this work with hierarchical attention, which first computes attention over sentences and then over words; due to computational cost, this is limited to a window of surrounding sentences. Maruf et al. (2019) also use hierarchical attention but compute sentence-level attention over the entire document and select the most relevant sentences before extending attention to the word level. A gate distinguishes between words in the source sentence and words in the context sentences. Junczys-Dowmunt (2019) translates entire source documents (up to 1,000 words) at a time by concatenating all input sentences, showing significant improvements.
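The two-level attention these document-context models share can be sketched as follows: first weight whole sentences against the current query, then weight the words inside each sentence. This is a simplified numpy illustration with mean-pooled sentence summaries, not any of the cited papers' exact models:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def hierarchical_context(query, doc_sents):
    """Two-level attention: over sentences, then over words within them.

    query:     d-dimensional vector for the position being translated.
    doc_sents: list of (n_i x d) word-vector matrices, one per sentence.
    """
    # Level 1: attend over sentence summaries (mean-pooled word vectors).
    summaries = np.stack([s.mean(axis=0) for s in doc_sents])   # S x d
    sent_w = softmax(query @ summaries.T)                        # S weights
    # Level 2: attend over words within each sentence, then combine the
    # per-sentence word contexts using the sentence-level weights.
    ctx = np.zeros_like(query)
    for w, sent in zip(sent_w, doc_sents):
        word_w = softmax(query @ sent.T)                         # n_i weights
        ctx += w * (word_w @ sent)
    return ctx

rng = np.random.default_rng(2)
d = 8
doc = [rng.standard_normal((n, d)) for n in (3, 5, 4)]
q = rng.standard_normal(d)
ctx = hierarchical_context(q, doc)
print(ctx.shape)  # (8,)
```

Sentence filtering as in Maruf et al. (2019) would correspond to zeroing out all but the top-weighted sentences after the first level before computing the word-level pass.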