Neural Machine Translation
Training machine translation models on multiple language pairs leads to better generalization and helps low-resource language pairs. Moreover, the input to machine translation may be enriched with information from other modalities, such as images or speech. Finally, machine translation may be just one task of an integrated neural network that also performs other language processing tasks.
Multilingual Multimodal Multitask is the main subject of 71 publications. 42 are discussed here.
Multi-language training: Zoph et al. (2016) first train on a high-resource language pair and then adapt the resulting model towards a targeted low-resource language pair, showing gains over training only on the low-resource language. Nguyen and Chiang (2017) show better results when merging the vocabularies of the different input languages. Ha et al. (2016) prefix each input word with a language identifier (e.g., @en@dog, @de@Hund) and add monolingual data, both as source and target. Ha et al. (2017) observe that translation in multi-language systems with multiple target languages may switch to the wrong language. They limit word predictions to words existing in the desired target language, and also add source-side language-identifying word factors. Lakew et al. (2018) show that Transformer models perform better for multi-language pair training than previous models based on recurrent neural networks. Lakew et al. (2018) build one-to-many translation models for language varieties, i.e., closely related dialects such as Brazilian and European Portuguese or Croatian and Serbian. This requires language variety identification to separate out the training data. Lakew et al. (2018) start with a model trained on a high-resource language pair and then incrementally add low-resource language pairs, including new vocabulary items. They show much faster training convergence and slight quality gains over joint training. Neubig and Hu (2018) train a many-to-one model for 58 language pairs and fine-tune it towards each of them. Aharoni et al. (2019) scale up multi-language training to up to 103 languages, training on language pairs with English on either side and measuring average translation performance from and into English. They show that many-to-many systems improve over many-to-one systems when translating into English but not over one-to-many systems when translating from English. They also see degradation when combining more than 5 languages. Murthy et al. (2019) identify a problem when a targeted language pair in the multi-language setup is low-resource and has a different word order from the other language pairs. They propose to pre-order the input to match the word order of the dominant language.
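The language-identifier prefixing of Ha et al. (2016) can be sketched in a few lines; the function name and language codes below are illustrative, not from the paper:

```python
def prefix_with_language(tokens, lang):
    """Prefix every token with a language identifier, in the style of
    Ha et al. (2016). This lets one shared vocabulary hold words from
    several languages without collisions (e.g. English "die" vs. German "die").
    """
    return [f"@{lang}@{tok}" for tok in tokens]

# Mixing two source languages into one training stream:
en = prefix_with_language(["the", "dog"], "en")
de = prefix_with_language(["der", "Hund"], "de")
print(en)  # ['@en@the', '@en@dog']
print(de)  # ['@de@der', '@de@Hund']
```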
Zero-Shot: Johnson et al. (2017) explore how well a single canonical neural translation model is able to learn to translate between multiple languages, by simultaneously training on parallel corpora for several language pairs. They show small benefits for several input languages with the same output language, and mixed results for translating into multiple output languages (indicated by an additional input language token). The most interesting result is the ability of such a model to translate in language directions for which no parallel corpus is provided ("zero-shot"), thus demonstrating that some interlingual meaning representation is learned, although less well than with traditional pivot methods. Mattoni et al. (2017) explore zero-shot training for Indian languages with sparse training data, achieving limited success. Al-Shedivat and Parikh (2019) extend the training objective of zero-shot training in the scenario of English-X parallel corpora so that, given an English-French sentence pair, the translations French-Russian and English-Russian are consistent.
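The target-language token mechanism behind zero-shot translation can be sketched as follows; the token format and toy data are illustrative assumptions, not taken from Johnson et al. (2017):

```python
def make_example(src_tokens, tgt_lang):
    """Prepend a token naming the desired target language, so one model
    can serve many translation directions."""
    return [f"<2{tgt_lang}>"] + src_tokens

# Training covers en->fr and en->de. At test time the same token
# mechanism requests fr->de, a direction never seen in training
# ("zero-shot" translation).
train = [
    (make_example(["the", "dog"], "fr"), ["le", "chien"]),
    (make_example(["the", "dog"], "de"), ["der", "Hund"]),
]
zero_shot_input = make_example(["le", "chien"], "de")
print(zero_shot_input)  # ['<2de>', 'le', 'chien']
```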
Multi-Language Training with Language-Specific Components: There have been a few suggestions to alter the model for multi-language pair training. Dong et al. (2015) use a different decoder for each target language. Firat et al. (2016) support multi-language input and output by training language-specific encoders and decoders with a shared attention mechanism. Firat et al. (2016) evaluate how well this model works for zero-shot translation. Lu et al. (2018) add an additional interlingua layer between specialized encoders and decoders that is shared across all language pairs. Conversely, Blackwood et al. (2018) use shared encoders and decoders but language-pair-specific attention. Sachan and Neubig (2018) investigate which parameters in a Transformer model should be shared during one-to-many training and find that partial sharing of components outperforms no sharing or full sharing, although the best configuration depends on the languages involved. Wang et al. (2018) add language-dependent positional embeddings and split the decoder state into a general and a language-dependent part. Platanios et al. (2018) generate the language-pair-specific parameters for the encoder and decoder with a parameter generator that takes embeddings of the input and output language identifiers as input. Gu et al. (2018) frame the multi-language training setup as meta-learning, which they define as either learning a policy for updating model parameters or learning a good parameter initialization for fast adaptation. Their approach falls under the second definition and is similar to multi-language training with adaptation via fine-tuning, except that the first phase is optimized towards parameters that can be quickly adapted. Gu et al. (2018) focus on the problem of word representation in multi-lingual training. They map the tokens of every language into a universal embedding space, aided by monolingual data. Wang et al. (2019) have the same goal in mind and use language-specific and language-independent character-based word representations to map input words to a shared word embedding space in a 58-language-to-English translation model. Tan et al. (2019) change the training objective for multi-language training: in addition to matching the training data for the language pairs, an additional objective is to match the predictions of a "teacher" model trained on the corresponding single-language-pair data. Malaviya et al. (2017) use the embedding associated with the language indicator token in massively multi-language models to predict typological properties of a language. Ren et al. (2018) address the challenge of pivot translation (training an X-Z model using a third language Y with large corpora X-Y and Y-Z) in a neural model by setting up training objectives that match translation through the pivot path with direct translation, and also other paths in this language triangle.
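The partial-sharing question studied by Sachan and Neubig (2018) can be illustrated with a toy configuration builder; the component names and the choice of which ones to share are illustrative assumptions, since the paper finds the best configuration depends on the languages involved:

```python
def build_parameters(target_langs, shared=("encoder", "attention")):
    """Sketch of partial parameter sharing for one-to-many translation.

    Components named in `shared` point to a single object reused by all
    target languages; the rest get a private copy per language.
    """
    components = ("encoder", "attention", "decoder", "output_embeddings")
    shared_objs = {c: object() for c in components if c in shared}
    return {
        lang: {
            c: shared_objs[c] if c in shared else object()
            for c in components
        }
        for lang in target_langs
    }

params = build_parameters(["fr", "de"])
# Shared component: the very same object for both target languages.
assert params["fr"]["encoder"] is params["de"]["encoder"]
# Private component: a distinct object per target language.
assert params["fr"]["decoder"] is not params["de"]["decoder"]
```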
Multiple Inputs: Zoph and Knight (2016) augment a translation model to consume two meaning-equivalent sentences in different languages as input. Zhou et al. (2017) apply this idea to the task of system combination, i.e., obtaining a consensus translation from multiple machine translation outputs. Garmash and Monz (2016) train multiple single-language systems, feed each one the corresponding meaning-equivalent input sentence, and combine the predictions of the models in an ensemble approach during decoding. Nishimura et al. (2018) explore how a multi-source model works when the input for some languages is missing. In their experiments, the multi-encoder approach more often outperforms the ensemble approach. Nishimura et al. (2018) fill in the missing sentences in the training data with (multi-source) back-translation. Dabre et al. (2017) concatenate the input sentences, and also use training data in the same format (which requires intersecting overlapping parallel corpora).
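The ensemble combination used by Garmash and Monz (2016) amounts to merging next-word distributions from several models, one per input language. A minimal sketch, with toy distributions and uniform weights as illustrative assumptions:

```python
def ensemble_next_word(distributions, weights=None):
    """Pick the next word by weighted averaging of per-model
    probability distributions (dicts mapping word -> probability).
    Words missing from a model's distribution count as probability 0.
    """
    if weights is None:
        weights = [1.0 / len(distributions)] * len(distributions)
    combined = {}
    for dist, w in zip(distributions, weights):
        for word, p in dist.items():
            combined[word] = combined.get(word, 0.0) + w * p
    return max(combined, key=combined.get)

# A French-input model and a German-input model disagree on confidence;
# the ensemble picks the word with the highest averaged probability.
p_from_fr = {"dog": 0.6, "hound": 0.4}
p_from_de = {"dog": 0.5, "hound": 0.5}
print(ensemble_next_word([p_from_fr, p_from_de]))  # dog
```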
Pre-trained word embeddings: Di Gangi and Federico (2017) do not observe improvements when using monolingual word embeddings in a gated network that trains additional word embeddings purely on parallel data. Abdou et al. (2017) report worse performance on a WMT news translation task with pre-trained word embeddings. They argue, as Hill et al. (2014) and Hill et al. (2017) did previously, that neural machine translation requires word embeddings based on the semantic similarity of words (teacher and professor) rather than other kinds of relatedness (teacher and student), and demonstrate that word embeddings trained for translation score better on standard semantic similarity tasks. Artetxe et al. (2018) use monolingually trained word embeddings in a neural machine translation system without using any parallel corpus. Qi et al. (2018) do show gains with pre-trained word embeddings in low-resource conditions, but find that the benefits decrease with larger data sizes.
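Initializing a translation model's embedding table from pre-trained vectors typically looks like the following sketch; the fallback range and dictionary-based table are illustrative assumptions, not a specific system's implementation:

```python
import random

def init_embeddings(vocab, pretrained, dim, seed=0):
    """Build a word-embedding table from pre-trained monolingual vectors,
    falling back to small random vectors for words without one."""
    rng = random.Random(seed)
    table = {}
    for word in vocab:
        if word in pretrained:
            table[word] = list(pretrained[word])
        else:
            table[word] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table

pretrained = {"dog": [0.1, 0.2]}
table = init_embeddings(["dog", "xyzzy"], pretrained, dim=2)
print(table["dog"])  # [0.1, 0.2]  (copied from the pre-trained vectors)
```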
Multi-task training: Niehues and Cho (2017) tackle multiple tasks (translation, part-of-speech tagging, and named entity recognition) with shared components of a sequence-to-sequence model, showing that training on several tasks improves performance on each individual task. Zaremoodi and Haffari (2018) refine this approach with adversarial training that enforces task-independent representations in intermediate layers, and apply it to joint training with syntactic and semantic parsing. Li et al. (2019) add as auxiliary tasks the prediction of hierarchical word classes obtained by hierarchical Brown clustering: in the first layer of the decoder of a Transformer model, the coarsest word classes are predicted, and in later layers increasingly fine-grained word classes. The authors argue that this increases the generalization ability of intermediate representations and show improvements in translation quality.
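Multi-task training over shared components usually optimizes a weighted sum of per-task losses. A minimal sketch, with the task names and weights as illustrative assumptions:

```python
def multitask_loss(task_losses, weights):
    """Weighted sum of per-task losses computed over a shared encoder,
    the usual objective in multi-task sequence-to-sequence training."""
    return sum(weights[task] * loss for task, loss in task_losses.items())

# Translation is the main task; the tagging tasks act as auxiliaries
# with smaller weights so they regularize rather than dominate.
losses = {"translation": 2.0, "pos_tagging": 0.5, "ner": 0.8}
weights = {"translation": 1.0, "pos_tagging": 0.3, "ner": 0.3}
print(multitask_loss(losses, weights))  # ~2.39
```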