Neural Language Models
Various neural network architectures have been applied to the basic task of language modelling, such as n-gram feed-forward models, recurrent neural networks, and convolutional neural networks.
Neural Language Models is the main subject of 31 publications. 15 are discussed here.
The first vanguard of neural network research tackled language models. A prominent reference for neural language models is Bengio et al. (2003), who implement an n-gram language model as a feed-forward neural network with the history words as input and the predicted word as output. Schwenk et al. (2006) introduce such language models (also called "continuous space language models") to machine translation and use them in re-ranking, similar to the earlier work in speech recognition. Schwenk (2007) proposes a number of speed-ups. The implementation was made available as an open source toolkit (Schwenk, 2010), which also supports training on a graphics processing unit (GPU) (Schwenk et al., 2012).
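The feed-forward architecture described above can be illustrated with a minimal sketch. It is not the implementation of Bengio et al. (2003) or of the Schwenk toolkit: the class name, layer sizes, and random untrained weights are assumptions made for illustration, and bias terms are left out for brevity. It only shows how the history words are embedded, passed through one hidden layer, and turned into a distribution over the predicted word.

```python
import math
import random

def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class FeedForwardNgramLM:
    """Toy n-gram LM (hypothetical): embeddings, one tanh hidden layer, softmax."""

    def __init__(self, vocab_size, context_size=2, embed_dim=4, hidden_dim=8, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            # Small random weights; a real model would learn these from data.
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.context_size = context_size
        self.embeddings = matrix(vocab_size, embed_dim)  # one vector per word
        self.w_hidden = matrix(hidden_dim, context_size * embed_dim)
        self.w_output = matrix(vocab_size, hidden_dim)

    def predict(self, history):
        """Return P(w | history) for every word id w in the vocabulary."""
        assert len(history) == self.context_size
        # Input layer: concatenate the embeddings of the history words.
        x = [v for word_id in history for v in self.embeddings[word_id]]
        # Hidden layer with tanh non-linearity.
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.w_hidden]
        # Output layer: one score per vocabulary word, normalized by softmax.
        scores = [sum(w * hi for w, hi in zip(row, h)) for row in self.w_output]
        return softmax(scores)

model = FeedForwardNgramLM(vocab_size=5)
probs = model.predict([1, 3])  # distribution over the next word given two history words
```

Note that the output layer computes a score for every vocabulary word, which is the cost that the speed-up techniques discussed in this section try to reduce.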
By first clustering words into classes and encoding words as a pair of class and word-in-class bits, Baltescu et al. (2014) reduce the computational complexity sufficiently to allow integration of the neural network language model into the decoder. Another way to reduce computational complexity to enable decoder integration is the use of noise-contrastive estimation by Vaswani et al. (2013), which roughly self-normalizes the output scores of the model during training, hence removing the need to compute the values for all possible output words. Baltescu and Blunsom (2015) compare the two techniques (class-based word encoding with normalized scores vs. noise-contrastive estimation without normalized scores) and show that the latter gives better performance with much higher speed.
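To illustrate why class-based encoding saves computation, the sketch below uses one common form of class factorization, P(w | h) = P(class(w) | h) · P(w | class(w), h). This is an assumption for illustration and not necessarily the exact encoding used by Baltescu et al. (2014). The function name, the toy class assignment, and the scores (which stand in for the outputs of a network for one fixed history) are all hypothetical.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_factored_prob(word, word_to_class, class_members, class_scores, word_scores):
    """P(word | h) = P(class | h) * P(word | class, h) for one fixed history h."""
    cls = word_to_class[word]
    # Normalize over the classes only.
    p_class = softmax(class_scores)[cls]
    # Normalize over the words of this one class only, not the full vocabulary.
    members = class_members[cls]
    p_within = softmax([word_scores[w] for w in members])
    return p_class * p_within[members.index(word)]

# Toy setup: 6 words split into 2 classes (hypothetical values).
class_members = [[0, 1, 2], [3, 4, 5]]
word_to_class = {w: c for c, members in enumerate(class_members) for w in members}
class_scores = [0.5, -0.2]                      # one score per class
word_scores = [0.1, 0.3, -0.4, 1.0, 0.0, 0.2]   # one score per word

probs = [
    class_factored_prob(w, word_to_class, class_members, class_scores, word_scores)
    for w in range(6)
]
```

Scoring a single word this way requires normalizing over the number of classes plus the size of one class, rather than over the whole vocabulary, while the probabilities still sum to one over all words.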
As another way to allow straightforward decoder integration, Wang et al. (2013) convert a continuous space language model for a short list of 8192 words into a traditional n-gram language model in ARPA (SRILM) format. Wang et al. (2014) present a method to merge (or "grow") a continuous space language model with a traditional n-gram language model, to take advantage of both the better estimates for the words in the short list and the full coverage of the traditional model.
Finch et al. (2012) use a recurrent neural network language model to rescore n-best lists for a transliteration system. Sundermeyer et al. (2013) compare feed-forward with long short-term memory neural network language models, a variant of recurrent neural network language models, showing better performance for the latter in a speech recognition re-ranking task. Mikolov (2012) reports significant improvements from reranking n-best lists of machine translation systems with a recurrent neural network language model.
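The n-best list reranking used in these works can be sketched as follows. This is a minimal illustration and not the setup of any of the cited papers: the function names, the interpolation weight, and the toy scoring function (which stands in for a trained recurrent neural network language model) are assumptions.

```python
def rerank(nbest, lm_logprob, lm_weight=0.5):
    """Sort (hypothesis, base_score) pairs by base score plus weighted LM log-probability."""
    def combined(entry):
        hypothesis, base_score = entry
        return base_score + lm_weight * lm_logprob(hypothesis)
    return sorted(nbest, key=combined, reverse=True)

# Hypothetical stand-in for a neural language model: a flat cost per word,
# plus a penalty for word pairs it considers unlikely.
UNLIKELY_BIGRAMS = {("small", "is")}

def toy_lm_logprob(sentence):
    words = sentence.split()
    score = -1.0 * len(words)
    score -= 2.0 * sum((a, b) in UNLIKELY_BIGRAMS for a, b in zip(words, words[1:]))
    return score

# The base system slightly prefers the badly ordered hypothesis.
nbest = [("the house small is", -3.8), ("the house is small", -4.0)]
reranked = rerank(nbest, toy_lm_logprob)
```

In this toy case the language model score outweighs the small difference in base score, so the fluent hypothesis moves to the top of the list. In practice the weight would be tuned on held-out data.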
Neural language models are not deep learning in the sense of using many hidden layers. However, Luong et al. (2015) show that having 3-4 hidden layers improves over having just the typical 1 layer.
Language Models in Neural Machine Translation:
Traditional statistical machine translation models have a straightforward mechanism to integrate additional knowledge sources, such as a large out-of-domain language model. This is harder for end-to-end neural machine translation. Gülçehre et al. (2015) add a language model trained on additional monolingual data to the neural translation model, in the form of a recurrent neural network that runs in parallel. They compare the use of the language model in re-ranking (or re-scoring) against deeper integration where a gated unit regulates the relative contribution of the language model and the translation model when predicting a word.
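The idea of a gate that regulates the language model contribution can be sketched as below. This is a simplified illustration and not the architecture of Gülçehre et al. (2015): the function name, the choice to compute the gate from the language model state, the additive combination of scores, and all weights are assumptions of this sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def gated_word_distribution(tm_state, lm_state, w_tm, w_lm, gate_weights, gate_bias=0.0):
    """Combine translation model and language model scores through a scalar gate."""
    # Gate in (0, 1): how much the language model is allowed to contribute.
    gate = sigmoid(sum(g * s for g, s in zip(gate_weights, lm_state)) + gate_bias)
    tm_scores = matvec(w_tm, tm_state)  # one score per output word
    lm_scores = matvec(w_lm, lm_state)  # one score per output word
    scores = [t + gate * l for t, l in zip(tm_scores, lm_scores)]
    return gate, softmax(scores)

# Toy hidden states and weights for a 3-word vocabulary (hypothetical values).
tm_state = [0.2, -0.1]
lm_state = [0.5, 0.3]
w_tm = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_lm = [[0.2, 0.1], [-0.3, 0.4], [0.0, 0.6]]
gate, probs = gated_word_distribution(tm_state, lm_state, w_tm, w_lm, gate_weights=[1.0, -1.0])
```

When the gate is close to zero the prediction depends almost entirely on the translation model. When it is close to one the language model scores enter at full strength.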
- Herold et al. (2018)
- Stahlberg et al. (2018)
- Ter-Sarkisov et al. (2014)
- Verwimp et al. (2017)
- Pham et al. (2016)
- Miyamoto and Cho (2016)
- Neubig and Dyer (2016)
- Niehues et al. (2016)
- Chen et al. (2016)
- Chen et al. (2016)
- Devlin et al. (2015)
- Aransa et al. (2015)
- Auli and Gao (2014)
- Niehues et al. (2014)
- Niehues and Waibel (2012)
- Alkhouli et al. (2015)
- Wang et al. (2013)