Neural Machine Translation
The large number of words in natural language vocabularies is a challenge for the vector space representations used in neural networks. Several strategies have been explored, from handling large vocabularies directly to resorting to sub-word representations of words.
Vocabulary is the main subject of 47 publications. 22 are discussed here.
Special Handling of Rare Words: A significant limitation of neural machine translation models is the computational burden of supporting very large vocabularies. To avoid this, the vocabulary may be reduced to a shortlist of, say, 20,000 words, and the remaining tokens replaced with the unknown word token "UNK". To translate such unknown words, Luong et al. (2015); Jean et al. (2015) resort to a separate dictionary. Arthur et al. (2016) argue that neural translation models perform worse on rare words and interpolate a traditional probabilistic bilingual dictionary with the predictions of the neural machine translation model. They use the attention mechanism to link each target word to a distribution over source words and weigh the word translations accordingly. Source words such as names and numbers may also be copied directly into the target. Gulcehre et al. (2016) use a so-called switching network to predict either a traditional translation operation or a copying operation aided by a softmax layer over the source sentence. They preprocess the training data to change some target words into the word positions of copied source words. Similarly, Gu et al. (2016) augment the word prediction step of the neural translation model to either translate a word or copy a source word. They observe that the attention mechanism is mostly driven by semantics and the language model in the case of word translation, but by location in the case of copying.
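The interpolation of a bilingual dictionary with the neural model's prediction, as in Arthur et al. (2016), can be illustrated with a minimal sketch. The toy vocabulary, probabilities, and linear interpolation weight below are illustrative assumptions, not the paper's actual setup (the authors also explore adding the lexical probabilities as a bias inside the softmax).

```python
import numpy as np

# Toy target vocabulary and the NMT model's predicted distribution
# over it at one decoding step (made-up numbers).
target_vocab = ["the", "house", "UNK"]
p_nmt = np.array([0.5, 0.2, 0.3])

# Attention weights over the source words ["das", "Haus"] at this step.
attention = np.array([0.1, 0.9])

# Dictionary translation probabilities:
# p_dict[i, j] = p(target_vocab[j] | source word i)
p_dict = np.array([
    [0.9, 0.05, 0.05],   # "das"  -> mostly "the"
    [0.05, 0.9, 0.05],   # "Haus" -> mostly "house"
])

# Weigh the dictionary rows by attention to get one distribution
# over target words for this step.
p_lex = attention @ p_dict

# Linear interpolation of model prediction and lexical distribution
# (lambda is a hypothetical value for illustration).
lam = 0.5
p_final = (1 - lam) * p_nmt + lam * p_lex

print(target_vocab[int(np.argmax(p_final))])
```

Even though the NMT model alone would have preferred "the", the attention-weighted dictionary shifts the final prediction to "house".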
Subwords: Sennrich et al. (2016) split up all words into sub-word units, using character n-gram models and a segmentation based on the byte pair encoding compression algorithm. Schuster and Nakajima (2012) developed a similar method, originally for speech recognition, called word piece (or sentence piece), that also starts by breaking up all words into character strings and then joins them together so as to lower the perplexity of a unigram language model trained on the data. Kudo and Richardson (2018) present a toolkit for the sentence piece method and describe it in more detail. Kudo (2018) proposes subword regularization, which samples different subword segmentations during training to provide richer data for learning smaller subword units. Morishita et al. (2018) use different granularities of subword segmentation (using 16,000, 1,000, and 300 merge operations) in the model and during decoding for the input words and the output word conditioning, by summing up the different representations (a single subword from the large vocabulary may decompose into multiple subwords from the smaller vocabularies). Ataman et al. (2017) propose a linguistically motivated vocabulary reduction method that models word formation as a sequence of stem and morphemes with a hidden Markov model, which can be optimized for a fixed target vocabulary size. Ataman and Federico (2018) show that this method outperforms byte pair encoding for several morphologically rich language pairs. Banerjee and Bhattacharyya (2018) also note that morphologically inspired segmentation, as provided by the Morfessor tool (Virpioja et al., 2013), sometimes gives better results than byte pair encoding, and that both methods combined may outperform either. Nikolov et al. (2018); Zhang and Komachi (2018) extend the idea of splitting up words to logographic languages such as Chinese, by allowing characters to be broken up based on their romanized version or their decomposition into strokes.
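The core of byte pair encoding as used by Sennrich et al. (2016) is simple to sketch: repeatedly merge the most frequent adjacent symbol pair in the training vocabulary. The toy word counts and number of merges below are illustrative; production implementations (e.g. the subword-nmt toolkit) add details such as merge application at test time.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merge operations from a dict of word -> frequency."""
    # Represent each word as a tuple of symbols, with an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere in the vocabulary.
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges, vocab

# Toy corpus statistics (the classic example from the BPE literature).
counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(counts, 10)
print(merges[:3])
```

On this toy corpus the first merges join frequent suffix fragments such as "e"+"s" and "es"+"t", so "newest" and "widest" come to share the subword "est".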
Character-Based Models: Generating word representations from their character sequences was originally proposed for machine translation by Costa-jussà et al. (2016). They use a convolutional neural network to encode input words, and Costa-jussà and Fonollosa (2016) also show success with character-based language models in reranking machine translation output. Chung et al. (2016) propose using a recurrent neural network to encode target words, and also propose a bi-scale decoder in which a fast layer outputs one character at a time, while a slow layer outputs one word at a time. Ataman et al. (2018); Ataman and Federico (2018) show good results with a recurrent neural network over character trigrams for input words, but not for output words.
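The idea of building input word representations from character trigrams, as in Ataman et al. (2018), can be sketched as follows. For simplicity this sketch averages trigram embeddings instead of running the recurrent network the authors use; the embedding size, boundary markers, and random initialization are illustrative assumptions.

```python
import numpy as np

DIM = 8                               # illustrative embedding size
rng = np.random.default_rng(0)
trigram_embeddings = {}               # lazily populated embedding table

def embed(trigram):
    """Look up (or randomly initialize) the embedding of one trigram."""
    if trigram not in trigram_embeddings:
        trigram_embeddings[trigram] = rng.normal(size=DIM)
    return trigram_embeddings[trigram]

def word_representation(word):
    """Compose a fixed-size word vector from its character trigrams."""
    # Pad with boundary markers so prefixes and suffixes get distinct trigrams.
    padded = "<" + word + ">"
    trigrams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    # Average the trigram embeddings (the cited work composes them
    # with a recurrent network instead).
    return np.mean([embed(t) for t in trigrams], axis=0)

v = word_representation("houses")
print(v.shape)
```

The appeal of this scheme is that morphologically related words such as "house" and "houses" share most of their trigrams, so their vectors are automatically similar, and unseen words still receive a representation.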