Training
Neural machine translation models are typically trained to predict the words of sentence pairs from a parallel corpus, with cross-entropy loss as the objective function.
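As a minimal sketch of this objective, assuming a PyTorch-style implementation with toy shapes of our own choosing (a real system would produce the logits with an encoder-decoder network):

import torch
import torch.nn.functional as F

vocab_size, sentence_length = 1000, 7
logits = torch.randn(sentence_length, vocab_size, requires_grad=True)
reference = torch.randint(0, vocab_size, (sentence_length,))

# Cross-entropy: negative log-likelihood of the reference words under
# the model's softmax-normalized word predictions.
loss = F.cross_entropy(logits, reference)
loss.backward()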
Training is the main subject of 55 publications, 26 of which are discussed here.
Publications
A number of recently developed key techniques have entered the standard repertoire of neural machine translation research. Ranges for the random initialization of weights need to be carefully chosen
Xavier Glorot and Yoshua Bengio (2010):
Understanding the difficulty of training deep feedforward neural networks, Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS)
@inproceedings{weight-initialization,
author = {Xavier Glorot and Yoshua Bengio},
title = {Understanding the difficulty of training deep feedforward neural networks},
booktitle = {Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS)},
location = {Sardinia, Italy},
year = 2010
}
(Glorot and Bengio, 2010). To avoid overconfidence of the model, label smoothing may be applied, i.e., optimization towards a target distribution that shifts probability mass away from the given correct target word towards other words
Jan Chorowski and Navdeep Jaitly (2017):
Towards better decoding and language model integration in sequence to sequence models, Interspeech
@inproceedings{label-smoothing,
author = {Jan Chorowski and Navdeep Jaitly},
title = {Towards better decoding and language model integration in sequence to sequence models},
booktitle = {Interspeech},
pages = {523--527},
location = {Stockholm, Sweden},
year = 2017
}
(Chorowski and Jaitly, 2017); a sketch of this smoothed loss is given below.
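As a minimal illustration, assuming a PyTorch-style implementation with our own naming (spreading epsilon uniformly over the vocabulary is one common variant):

import torch
import torch.nn.functional as F

def label_smoothed_loss(logits, reference, epsilon=0.1):
    # Target distribution: weight 1 - epsilon on the reference word,
    # with epsilon spread uniformly over the vocabulary.
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, reference.unsqueeze(-1)).squeeze(-1)
    uniform = -log_probs.mean(dim=-1)
    return ((1.0 - epsilon) * nll + epsilon * uniform).mean()

logits = torch.randn(7, 1000, requires_grad=True)
reference = torch.randint(0, 1000, (7,))
label_smoothed_loss(logits, reference).backward()

Distributing training over several GPUs creates the problem of synchronizing updates.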
Jianmin Chen and Rajat Monga and Samy Bengio and Rafal Jozefowicz (2016):
Revisiting Distributed Synchronous SGD, International Conference on Learning Representations Workshop Track
@inproceedings{synchronous-sgd,
author = {Jianmin Chen and Rajat Monga and Samy Bengio and Rafal Jozefowicz},
title = {Revisiting Distributed Synchronous SGD},
url = {
https://arxiv.org/abs/1604.00981},
booktitle = {International Conference on Learning Representations Workshop Track},
year = 2016
}
Chen et al. (2016) compare various methods, including asynchronous updates. Training is made more robust by methods such as dropout
Nitish Srivastava and Geoffrey Hinton and Alex Krizhevsky and Ilya Sutskever and Ruslan Salakhutdinov (2014):
Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Journal of Machine Learning Research
@article{JMLR:v15:srivastava14a,
author = {Nitish Srivastava and Geoffrey Hinton and Alex Krizhevsky and Ilya Sutskever and Ruslan Salakhutdinov},
title = {Dropout: A Simple Way to Prevent Neural Networks from Overfitting},
journal = {Journal of Machine Learning Research},
volume = {15},
pages = {1929-1958},
url = {
http://jmlr.org/papers/v15/srivastava14a.html},
year = 2014
}
(Srivastava et al., 2014), in which a random subset of nodes is masked during training. To avoid exploding or vanishing gradients during back-propagation through several layers, gradients are typically clipped
Razvan Pascanu and Tomas Mikolov and Yoshua Bengio (2013):
On the difficulty of training recurrent neural networks, Proceedings of the 30th International Conference on Machine Learning, ICML
@inproceedings{DBLP:conf/icml/PascanuMB13,
author = {Razvan Pascanu and Tomas Mikolov and Yoshua Bengio},
title = {On the difficulty of training recurrent neural networks},
booktitle = {Proceedings of the 30th International Conference on Machine Learning, {ICML}},
location = {Atlanta, GA, USA},
month = {June},
pages = {1310--1318},
url = {
http://proceedings.mlr.press/v28/pascanu13.pdf},
year = 2013
}
(Pascanu et al., 2013).
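In PyTorch-style code, norm-based clipping amounts to one call before the parameter update (a sketch with a placeholder model; the threshold is illustrative):

import torch

model = torch.nn.Linear(10, 10)   # stand-in for a full NMT model
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()

# Rescale all gradients jointly so that their global L2 norm does not
# exceed the threshold before the parameter update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)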
Chen, Mia Xu and Firat, Orhan and Bapna, Ankur and Johnson, Melvin and Macherey, Wolfgang and Foster, George and Jones, Llion and Schuster, Mike and Shazeer, Noam and Parmar, Niki and Vaswani, Ashish and Uszkoreit, Jakob and Kaiser, Lukasz and Chen, Zhifeng and Wu, Yonghui and Hughes, Macduff (2018):
The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{P18-1008,
author = {Chen, Mia Xu and Firat, Orhan and Bapna, Ankur and Johnson, Melvin and Macherey, Wolfgang and Foster, George and Jones, Llion and Schuster, Mike and Shazeer, Noam and Parmar, Niki and Vaswani, Ashish and Uszkoreit, Jakob and Kaiser, Lukasz and Chen, Zhifeng and Wu, Yonghui and Hughes, Macduff},
title = {The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation},
booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
publisher = {Association for Computational Linguistics},
pages = {76--86},
location = {Melbourne, Australia},
url = {
http://aclweb.org/anthology/P18-1008},
year = 2018
}
Chen et al. (2018) briefly present adaptive gradient clipping. Layer normalization
Lei Ba, J. and Kiros, J. R. and Hinton, Geoffrey (2016):
Layer Normalization, ArXiv e-prints
@ARTICLE{2016arXiv160706450L,
author = {{Lei Ba}, J. and {Kiros}, J.~R. and {Hinton}, Geoffrey},
title = {Layer Normalization},
journal = {ArXiv e-prints},
archiveprefix = {arXiv},
eprint = {1607.06450},
primaryclass = {stat.ML},
keywords = {Statistics - Machine Learning, Computer Science - Learning},
month = {jul},
adsurl = {
http://adsabs.harvard.edu/abs/2016arXiv160706450L},
adsnote = {Provided by the SAO/NASA Astrophysics Data System},
year = 2016
}
(Lei Ba et al., 2016) has a similar motivation, ensuring that node values stay within reasonable bounds.
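A minimal sketch of layer normalization, with our own naming (the gain and bias would be learned parameters in practice):

import torch

def layer_norm(x, gain, bias, epsilon=1e-5):
    # Normalize each layer's node values to zero mean and unit
    # variance, then rescale with learned gain and bias parameters.
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True, unbiased=False)
    return gain * (x - mean) / (std + epsilon) + bias

hidden = torch.randn(4, 512)      # a batch of layer outputs
normalized = layer_norm(hidden, torch.ones(512), torch.zeros(512))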
Adjusting the Learning Rate:
An active topic of research is optimization methods that adjust the learning rate of gradient descent training. Popular methods are Adagrad
Duchi, John and Hazan, Elad and Singer, Yoram (2011):
Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research
@article{duchi2011adaptive,
author = {Duchi, John and Hazan, Elad and Singer, Yoram},
title = {Adaptive subgradient methods for online learning and stochastic optimization},
journal = {Journal of Machine Learning Research},
volume = {12},
number = {Jul},
pages = {2121--2159},
year = 2011
}
(Duchi et al., 2011), Adadelta
Matthew D. Zeiler (2012):
ADADELTA: An Adaptive Learning Rate Method, CoRR
@article{DBLP:journals/corr/abs-1212-5701,
author = {Matthew D. Zeiler},
title = {{ADADELTA:} An Adaptive Learning Rate Method},
journal = {CoRR},
volume = {abs/1212.5701},
url = {
http://arxiv.org/abs/1212.5701},
timestamp = {Wed, 07 Jun 2017 14:43:02 +0200},
biburl = {
http://dblp.uni-trier.de/rec/bib/journals/corr/abs-1212-5701},
bibsource = {dblp computer science bibliography,
http://dblp.org},
year = 2012
}
(Zeiler, 2012), and currently Adam
Diederik P. Kingma and Jimmy Ba (2015):
Adam: A Method for Stochastic Optimization, ICLR
@inproceedings{ICLR:2015:KingmaBa,
author = {Diederik P. Kingma and Jimmy Ba},
title = {Adam: {A} Method for Stochastic Optimization},
booktitle = {ICLR},
url = {
https://arxiv.org/pdf/1412.6980.pdf},
year = 2015
}
(Kingma and Ba, 2015).
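For illustration, a standard toolkit exposes Adam behind a one-line API (a sketch with a placeholder model; the learning rate is illustrative):

import torch

model = torch.nn.Linear(512, 512)   # stand-in for a full NMT model
# Adam maintains running estimates of the first and second moments of
# each parameter's gradient and scales its update accordingly.
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

loss = model(torch.randn(8, 512)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()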
Sequence-Level Optimization:
Shen, Shiqi and Cheng, Yong and He, Zhongjun and He, Wei and Wu, Hua and Sun, Maosong and Liu, Yang (2016):
Minimum Risk Training for Neural Machine Translation, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{shen-EtAl:2016:P16-1,
author = {Shen, Shiqi and Cheng, Yong and He, Zhongjun and He, Wei and Wu, Hua and Sun, Maosong and Liu, Yang},
title = {Minimum Risk Training for Neural Machine Translation},
booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {1683--1692},
url = {
http://www.aclweb.org/anthology/P16-1159},
year = 2016
}
Shen et al. (2016) introduce minimum risk training, which allows sentence-level optimization with metrics such as the BLEU score. A set of possible translations is sampled, and their relative probabilities are used to compute the expected loss (the probability-weighted BLEU scores of the sampled translations). They show large gains on a Chinese-English task.
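A sketch of the expected-loss computation under our own naming (the sharpness parameter alpha and the scores are illustrative):

import torch

def minimum_risk_loss(sample_log_probs, sample_bleu, alpha=0.005):
    # Renormalize the (sharpened) model probabilities of the sampled
    # translations and use them to weight the cost 1 - BLEU.
    weights = torch.softmax(alpha * sample_log_probs, dim=-1)
    return (weights * (1.0 - sample_bleu)).sum()

# Illustrative scores for four sampled translations of one sentence.
log_probs = torch.tensor([-12.3, -14.1, -13.0, -15.7], requires_grad=True)
bleu = torch.tensor([0.41, 0.28, 0.35, 0.22])
minimum_risk_loss(log_probs, bleu).backward()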
Neubig, Graham (2016):
Lexicons and Minimum Risk Training for Neural Machine Translation: NAIST-CMU at WAT2016, Proceedings of the 3rd Workshop on Asian Translation (WAT2016)
@InProceedings{neubig:2016:WAT2016,
author = {Neubig, Graham},
title = {Lexicons and Minimum Risk Training for Neural Machine Translation: NAIST-CMU at WAT2016},
booktitle = {Proceedings of the 3rd Workshop on Asian Translation (WAT2016)},
month = {December},
address = {Osaka, Japan},
publisher = {The COLING 2016 Organizing Committee},
pages = {119--125},
url = {
http://aclweb.org/anthology/W16-4610},
year = 2016
}
Neubig (2016) also shows gains when optimizing towards smoothed sentence-level BLEU, using a sample of 20 translations.
Hashimoto, Kazuma and Tsuruoka, Yoshimasa (2019):
Accelerated Reinforcement Learning for Sentence Generation by Vocabulary Prediction, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
@inproceedings{hashimoto-tsuruoka-2019-accelerated,
author = {Hashimoto, Kazuma and Tsuruoka, Yoshimasa},
title = {Accelerated Reinforcement Learning for Sentence Generation by Vocabulary Prediction},
booktitle = {Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},
month = {jun},
address = {Minneapolis, Minnesota},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/N19-1315},
pages = {3115--3125},
year = 2019
}
Hashimoto and Tsuruoka (2019) optimize towards the GLEU score and speed up training through vocabulary reduction.
Wiseman, Sam and Rush, Alexander M. (2016):
Sequence-to-Sequence Learning as Beam-Search Optimization, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
@InProceedings{wiseman-rush:2016:EMNLP2016,
author = {Wiseman, Sam and Rush, Alexander M.},
title = {Sequence-to-Sequence Learning as Beam-Search Optimization},
booktitle = {Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing},
month = {November},
address = {Austin, Texas},
publisher = {Association for Computational Linguistics},
pages = {1296--1306},
url = {
https://aclweb.org/anthology/D16-1137},
year = 2016
}
Wiseman and Rush (2016) use a loss function that penalizes the gold standard falling off the beam during training.
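One way to render this idea in code is a margin penalty whenever the gold prefix does not outscore the lowest-ranked hypothesis still on the beam (a toy formulation of our own, not the exact loss of the paper):

import torch

def beam_violation_loss(gold_score, last_beam_score, margin=1.0):
    # Non-zero loss whenever the gold prefix fails to outscore the
    # lowest-ranked beam hypothesis by the given margin.
    return torch.clamp(margin - gold_score + last_beam_score, min=0.0)

loss = beam_violation_loss(torch.tensor(2.3), torch.tensor(2.9))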
Ma, Mingbo and Zheng, Renjie and Huang, Liang (2019):
Learning to Stop in Structured Prediction for Neural Machine Translation, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
@inproceedings{ma-etal-2019-learning,
author = {Ma, Mingbo and Zheng, Renjie and Huang, Liang},
title = {Learning to Stop in Structured Prediction for Neural Machine Translation},
booktitle = {Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},
month = {jun},
address = {Minneapolis, Minnesota},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/N19-1187},
pages = {1884--1889},
year = 2019
}
Ma et al. (2019) also consider the point where the gold standard falls off the beam, but record the loss for this initial sequence prediction and then reset the beam to the gold standard at that point.
Edunov, Sergey and Ott, Myle and Auli, Michael and Grangier, David and Ranzato, Marc'Aurelio (2018):
Classical Structured Prediction Losses for Sequence to Sequence Learning, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)
@InProceedings{N18-1033,
author = {Edunov, Sergey and Ott, Myle and Auli, Michael and Grangier, David and Ranzato, Marc'Aurelio},
title = {Classical Structured Prediction Losses for Sequence to Sequence Learning},
booktitle = {Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)},
publisher = {Association for Computational Linguistics},
pages = {355--364},
location = {New Orleans, Louisiana},
url = {
http://aclweb.org/anthology/N18-1033},
year = 2018
}
Edunov et al. (2018) compare various word-level and sentence-level optimization techniques, but see only small gains for the best-performing sentence-level method, minimum risk training, over the alternatives.
Xu, Weijia and Niu, Xing and Carpuat, Marine (2019):
Differentiable Sampling with Flexible Reference Word Order for Neural Machine Translation, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
@inproceedings{xu-etal-2019-differentiable,
author = {Xu, Weijia and Niu, Xing and Carpuat, Marine},
title = {Differentiable Sampling with Flexible Reference Word Order for Neural Machine Translation},
booktitle = {Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},
month = {jun},
address = {Minneapolis, Minnesota},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/N19-1207},
pages = {2047--2053},
year = 2019
}
Xu et al. (2019) use a mix of gold-standard and predicted words in the prefix. They use an alignment component to keep the mixed prefix and the target training sentence in sync.
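A simplified sketch of such prefix mixing, with our own naming (it omits the alignment component the authors use to keep prefix and target in sync):

import torch

def mix_prefix(gold_prefix, predicted_prefix, p_gold=0.7):
    # Keep the gold word at each position with probability p_gold,
    # otherwise substitute the model's own prediction.
    keep_gold = torch.rand(gold_prefix.shape) < p_gold
    return torch.where(keep_gold, gold_prefix, predicted_prefix)

gold = torch.tensor([12, 5, 87, 33])
predicted = torch.tensor([12, 9, 87, 41])
mixed = mix_prefix(gold, predicted)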
Zhang, Wen and Feng, Yang and Meng, Fandong and You, Di and Liu, Qun (2019):
Bridging the Gap between Training and Inference for Neural Machine Translation, Proceedings of the 57th Conference of the Association for Computational Linguistics
@inproceedings{zhang-etal-2019-bridging,
author = {Zhang, Wen and Feng, Yang and Meng, Fandong and You, Di and Liu, Qun},
title = {Bridging the Gap between Training and Inference for Neural Machine Translation},
booktitle = {Proceedings of the 57th Conference of the Association for Computational Linguistics},
month = {jul},
address = {Florence, Italy},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/P19-1426},
pages = {4334--4343},
year = 2019
}
Zhang et al. (2019) gradually shift from matching the ground truth towards so-called word-level oracles obtained with Gumbel noise and sentence-level oracles obtained by selecting the BLEU-best translation from the n-best list produced by beam search.
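A sketch of the word-level oracle under our own naming (perturbing the model's scores with Gumbel noise and taking the argmax):

import torch

def word_level_oracle(logits, tau=1.0):
    # Perturb the model's word scores with Gumbel noise and pick the
    # argmax as the "oracle" previous word for the next prediction.
    u = torch.rand_like(logits).clamp(1e-9, 1.0 - 1e-9)
    gumbel = -torch.log(-torch.log(u))
    return torch.argmax((logits + gumbel) / tau, dim=-1)

logits = torch.randn(4, 1000)     # scores over the vocabulary
oracle = word_level_oracle(logits)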
Right-to-Left Training
Several researchers report that translation quality for the right half of the sentence is lower than for the left half and attribute this to exposure bias: during training, the correct prefix is used to make word predictions (so-called teacher forcing), while during decoding only the previously predicted words are available.
Wu, Lijun and Tan, Xu and He, Di and Tian, Fei and Qin, Tao and Lai, Jianhuang and Liu, Tie-Yan (2018):
Beyond Error Propagation in Neural Machine Translation: Characteristics of Language Also Matter, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
@inproceedings{D18-1396,
author = {Wu, Lijun and Tan, Xu and He, Di and Tian, Fei and Qin, Tao and Lai, Jianhuang and Liu, Tie-Yan},
title = {Beyond Error Propagation in Neural Machine Translation: Characteristics of Language Also Matter},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
address = {Brussels, Belgium},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/D18-1396},
pages = {3602--3611},
year = 2018
}
Wu et al. (2018) show that this imbalance is to a large degree due to linguistic reasons: it happens for right-branching languages like English and Chinese, but the opposite is the case for left-branching languages like Japanese.
Adversarial Training:
Lijun Wu and Yingce Xia and Li Zhao and Fei Tian and Tao Qin and Jianhuang Lai and Tie-Yan Liu (2017):
Adversarial Neural Machine Translation, CoRR
@article{DBLP:journals/corr/WuXZTQLL17,
author = {Lijun Wu and Yingce Xia and Li Zhao and Fei Tian and Tao Qin and Jianhuang Lai and Tie{-}Yan Liu},
title = {Adversarial Neural Machine Translation},
journal = {CoRR},
volume = {abs/1704.06933},
url = {
https://arxiv.org/pdf/1704.06933.pdf},
timestamp = {Wed, 07 Jun 2017 14:42:51 +0200},
biburl = {
http://dblp.uni-trier.de/rec/bib/journals/corr/WuXZTQLL17},
bibsource = {dblp computer science bibliography,
http://dblp.org},
year = 2017
}
Wu et al. (2017) introduce adversarial training to neural machine translation, in which a discriminator is trained alongside a traditional machine translation model to distinguish between machine translation output and human reference translations. The ability to fool the discriminator is used as an additional training objective for the machine translation model.
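A minimal sketch of the two objectives, with a toy discriminator and placeholder sentence representations of our own choosing:

import torch
import torch.nn.functional as F

# Toy discriminator scoring whether a pooled sentence representation
# comes from a human reference or from machine translation output.
discriminator = torch.nn.Linear(512, 1)

human_repr = torch.randn(8, 512)     # encoded reference translations
machine_repr = torch.randn(8, 512)   # encoded MT outputs

# Discriminator objective: tell the two apart.
d_loss = F.binary_cross_entropy_with_logits(
    discriminator(human_repr), torch.ones(8, 1)
) + F.binary_cross_entropy_with_logits(
    discriminator(machine_repr), torch.zeros(8, 1)
)

# Translation-model objective: fool the discriminator; this is added
# to the usual cross-entropy loss.
g_loss = F.binary_cross_entropy_with_logits(
    discriminator(machine_repr), torch.ones(8, 1)
)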
Yang, Zhen and Chen, Wei and Wang, Feng and Xu, Bo (2018):
Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)
@InProceedings{N18-1122,
author = {Yang, Zhen and Chen, Wei and Wang, Feng and Xu, Bo},
title = {Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets},
booktitle = {Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)},
publisher = {Association for Computational Linguistics},
pages = {1346--1355},
location = {New Orleans, Louisiana},
url = {
http://aclweb.org/anthology/N18-1122},
year = 2018
}
Yang et al. (2018) propose a similar setup, but add a BLEU-based training objective to neural translation model training.
Cheng, Yong and Tu, Zhaopeng and Meng, Fandong and Zhai, Junjie and Liu, Yang (2018):
Towards Robust Neural Machine Translation, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{P18-1163,
author = {Cheng, Yong and Tu, Zhaopeng and Meng, Fandong and Zhai, Junjie and Liu, Yang},
title = {Towards Robust Neural Machine Translation},
booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
publisher = {Association for Computational Linguistics},
pages = {1756--1766},
location = {Melbourne, Australia},
url = {
http://aclweb.org/anthology/P18-1163},
year = 2018
}
Cheng et al. (2018) employ adversarial training to address the problem of robustness, motivated by the observation that 70% of translations change when an input word is replaced with a synonym. They aim for more robust behavior by adding synthetic training data in which one of the input words is replaced with a synonym (a neighbor in embedding space) and by using a discriminator that predicts from the encoding of an input sentence whether it is an original or an altered source sentence.
Knowledge Distillation:
Several techniques change the loss function to reward not only word predictions that closely match the training data, but also predictions that closely match those of a previous model, called the teacher model.
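A common rendering of this idea interpolates the usual cross-entropy with a divergence from the teacher's word distribution (a sketch under our own naming; the mixing weight is illustrative):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, reference, mix=0.5):
    # Interpolate cross-entropy against the reference words with the
    # KL divergence from the teacher's predicted word distribution.
    ce = F.cross_entropy(student_logits, reference)
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="batchmean")
    return (1.0 - mix) * ce + mix * kd

student = torch.randn(5, 1000, requires_grad=True)
teacher = torch.randn(5, 1000)
reference = torch.randint(0, 1000, (5,))
distillation_loss(student, teacher, reference).backward()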
Khayrallah, Huda and Thompson, Brian and Duh, Kevin and Koehn, Philipp (2018):
Regularized Training Objective for Continued Training for Domain Adaptation in Neural Machine Translation, Proceedings of the 2nd Workshop on Neural Machine Translation and Generation
@InProceedings{W18-2705,
author = {Khayrallah, Huda and Thompson, Brian and Duh, Kevin and Koehn, Philipp},
title = {Regularized Training Objective for Continued Training for Domain Adaptation in Neural Machine Translation},
booktitle = {Proceedings of the 2nd Workshop on Neural Machine Translation and Generation},
publisher = {Association for Computational Linguistics},
pages = {36--44},
location = {Melbourne, Australia},
url = {
http://aclweb.org/anthology/W18-2705},
year = 2018
}
Khayrallah et al. (2018) use a general domain model as teacher to avoid overfitting to in-domain data during domain adaptation by fine-tuning.
Wei, Hao-Ran and Huang, Shujian and Wang, Ran and Dai, Xin-yu and Chen, Jiajun (2019):
Online Distilling from Checkpoints for Neural Machine Translation, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
@inproceedings{wei-etal-2019-online,
author = {Wei, Hao-Ran and Huang, Shujian and Wang, Ran and Dai, Xin-yu and Chen, Jiajun},
title = {Online Distilling from Checkpoints for Neural Machine Translation},
booktitle = {Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},
month = {jun},
address = {Minneapolis, Minnesota},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/N19-1192},
pages = {1932--1941},
year = 2019
}
Wei et al. (2019) use the best-performing models from earlier training checkpoints as teachers to guide further training.
Faster Training:
Ott, Myle and Edunov, Sergey and Grangier, David and Auli, Michael (2018):
Scaling Neural Machine Translation, Proceedings of the Third Conference on Machine Translation: Research Papers
@inproceedings{W18-6301,
author = {Ott, Myle and Edunov, Sergey and Grangier, David and Auli, Michael},
title = {Scaling Neural Machine Translation},
booktitle = {Proceedings of the Third Conference on Machine Translation: Research Papers},
month = {oct},
address = {Belgium, Brussels},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/W18-6301},
pages = {1--9},
year = 2018
}
Ott et al. (2018) improve training speed with 16-bit arithmetic and larger batches, which reduce idle time caused by variance in processing time across GPUs. They scale training up to 128 GPUs.
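For illustration, modern toolkits expose mixed-precision training behind a small API; a sketch using PyTorch's automatic mixed precision (our own placeholder model, not the implementation of the paper; requires a CUDA device):

import torch

model = torch.nn.Linear(512, 512).cuda()   # stand-in for a full NMT model
optimizer = torch.optim.Adam(model.parameters())
# The scaler rescales the loss so small fp16 gradients do not underflow.
scaler = torch.cuda.amp.GradScaler()

for step in range(3):
    batch = torch.randn(64, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # forward pass in fp16 where safe
        loss = model(batch).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()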
Benchmarks
Discussion
Related Topics
New Publications
Yuta Nishimura and Katsuhito Sudoh and Graham Neubig and Satoshi Nakamura (2018):
Multi-Source Neural Machine Translation with Data Augmentation, Proceedings of the International Workshop on Spoken Language Translation (IWSLT)
@inproceedings{iwslt18-Nishimura-Multi-Source,
author = {Yuta Nishimura and Katsuhito Sudoh and Graham Neubig and Satoshi Nakamura},
title = {Multi-Source Neural Machine Translation with Data Augmentation},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
url = {
https://arxiv.org/pdf/1810.06826.pdf},
year = 2018
}
Nishimura et al. (2018)
Adversarial Training
Cheng, Yong and Jiang, Lu and Macherey, Wolfgang (2019):
Robust Neural Machine Translation with Doubly Adversarial Inputs, Proceedings of the 57th Conference of the Association for Computational Linguistics
@inproceedings{cheng-etal-2019-robust,
author = {Cheng, Yong and Jiang, Lu and Macherey, Wolfgang},
title = {Robust Neural Machine Translation with Doubly Adversarial Inputs},
booktitle = {Proceedings of the 57th Conference of the Association for Computational Linguistics},
month = {jul},
address = {Florence, Italy},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/P19-1425},
pages = {4324--4333},
year = 2019
}
Cheng et al. (2019)
Sato, Motoki and Suzuki, Jun and Kiyono, Shun (2019):
Effective Adversarial Regularization for Neural Machine Translation, Proceedings of the 57th Conference of the Association for Computational Linguistics
@inproceedings{sato-etal-2019-effective,
author = {Sato, Motoki and Suzuki, Jun and Kiyono, Shun},
title = {Effective Adversarial Regularization for Neural Machine Translation},
booktitle = {Proceedings of the 57th Conference of the Association for Computational Linguistics},
month = {jul},
address = {Florence, Italy},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/P19-1020},
pages = {204--210},
year = 2019
}
Sato et al. (2019)
Elliott, Desmond (2018):
Adversarial Evaluation of Multimodal Machine Translation, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
@inproceedings{D18-1329,
author = {Elliott, Desmond},
title = {Adversarial Evaluation of Multimodal Machine Translation},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
address = {Brussels, Belgium},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/D18-1329},
pages = {2974--2978},
year = 2018
}
Elliott (2018)
Bandit
Kreutzer, Julia and Khadivi, Shahram and Matusov, Evgeny and Riezler, Stefan (2018):
Can Neural Machine Translation be Improved with User Feedback?, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)
@InProceedings{N18-3012,
author = {Kreutzer, Julia and Khadivi, Shahram and Matusov, Evgeny and Riezler, Stefan},
title = {Can Neural Machine Translation be Improved with User Feedback?},
booktitle = {Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)},
publisher = {Association for Computational Linguistics},
pages = {92--105},
location = {New Orleans - Louisiana},
url = {
http://aclweb.org/anthology/N18-3012},
year = 2018
}
Kreutzer et al. (2018)
Kreutzer, Julia and Uyheng, Joshua and Riezler, Stefan (2018):
Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{P18-1165,
author = {Kreutzer, Julia and Uyheng, Joshua and Riezler, Stefan},
title = {Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning},
booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
publisher = {Association for Computational Linguistics},
pages = {1777--1788},
location = {Melbourne, Australia},
url = {
http://aclweb.org/anthology/P18-1165},
year = 2018
}
Kreutzer et al. (2018)
Kreutzer, Julia and Sokolov, Artem and Riezler, Stefan (2017):
Bandit Structured Prediction for Neural Sequence-to-Sequence Learning, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{kreutzer-sokolov-riezler:2017:Long,
author = {Kreutzer, Julia and Sokolov, Artem and Riezler, Stefan},
title = {Bandit Structured Prediction for Neural Sequence-to-Sequence Learning},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {July},
address = {Vancouver, Canada},
publisher = {Association for Computational Linguistics},
pages = {1503--1513},
url = {
http://aclweb.org/anthology/P17-1138},
year = 2017
}
Kreutzer et al. (2017)
8-Bit / Speed
Quinn, Jerry and Ballesteros, Miguel (2018):
Pieces of Eight: 8-bit Neural Machine Translation, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)
@InProceedings{N18-3014,
author = {Quinn, Jerry and Ballesteros, Miguel},
title = {Pieces of Eight: 8-bit Neural Machine Translation},
booktitle = {Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)},
publisher = {Association for Computational Linguistics},
pages = {114--120},
location = {New Orleans - Louisiana},
url = {
http://aclweb.org/anthology/N18-3014},
year = 2018
}
Quinn and Ballesteros (2018)
Bogoychev, Nikolay and Heafield, Kenneth and Aji, Alham Fikri and Junczys-Dowmunt, Marcin (2018):
Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
@inproceedings{D18-1332,
author = {Bogoychev, Nikolay and Heafield, Kenneth and Aji, Alham Fikri and Junczys-Dowmunt, Marcin},
title = {Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
address = {Brussels, Belgium},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/D18-1332},
pages = {2991--2996},
year = 2018
}
Bogoychev et al. (2018)
Training Objective
Shao, Chenze and Chen, Xilin and Feng, Yang (2018):
Greedy Search with Probabilistic N-gram Matching for Neural Machine Translation, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
@inproceedings{D18-1510,
author = {Shao, Chenze and Chen, Xilin and Feng, Yang},
title = {Greedy Search with Probabilistic N-gram Matching for Neural Machine Translation},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
address = {Brussels, Belgium},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/D18-1510},
pages = {4778--4784},
year = 2018
}
Shao et al. (2018) - sequence-level
Wieting, John and Berg-Kirkpatrick, Taylor and Gimpel, Kevin and Neubig, Graham (2019):
Beyond BLEU: Training Neural Machine Translation with Semantic Similarity, Proceedings of the 57th Conference of the Association for Computational Linguistics
@inproceedings{wieting-etal-2019-beyond,
author = {Wieting, John and Berg-Kirkpatrick, Taylor and Gimpel, Kevin and Neubig, Graham},
title = {Beyond {BLEU}: Training Neural Machine Translation with Semantic Similarity},
booktitle = {Proceedings of the 57th Conference of the Association for Computational Linguistics},
month = {jul},
address = {Florence, Italy},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/P19-1427},
pages = {4344--4355},
year = 2019
}
Wieting et al. (2019) - sentence-level optimization
Petrushkov, Pavel and Khadivi, Shahram and Matusov, Evgeny (2018):
Learning from Chunk-based Feedback in Neural Machine Translation, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
@InProceedings{P18-2052,
author = {Petrushkov, Pavel and Khadivi, Shahram and Matusov, Evgeny},
title = {Learning from Chunk-based Feedback in Neural Machine Translation},
booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
publisher = {Association for Computational Linguistics},
pages = {326--331},
location = {Melbourne, Australia},
url = {
http://aclweb.org/anthology/P18-2052},
year = 2018
}
Petrushkov et al. (2018) - chunk-based feedback
Zheng, Renjie and Ma, Mingbo and Huang, Liang (2018):
Multi-Reference Training with Pseudo-References for Neural Translation and Text Generation, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
@inproceedings{D18-1357,
author = {Zheng, Renjie and Ma, Mingbo and Huang, Liang},
title = {Multi-Reference Training with Pseudo-References for Neural Translation and Text Generation},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
address = {Brussels, Belgium},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/D18-1357},
pages = {3188--3197},
year = 2018
}
Zheng et al. (2018) - multi-reference
Wu, Lijun and Tian, Fei and Qin, Tao and Lai, Jianhuang and Liu, Tie-Yan (2018):
A Study of Reinforcement Learning for Neural Machine Translation, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
@inproceedings{D18-1397,
author = {Wu, Lijun and Tian, Fei and Qin, Tao and Lai, Jianhuang and Liu, Tie-Yan},
title = {A Study of Reinforcement Learning for Neural Machine Translation},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
address = {Brussels, Belgium},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/D18-1397},
pages = {3612--3621},
year = 2018
}
Wu et al. (2018) - reinforcement learning
Bidirectional
Zhou, Long and Zhang, Jiajun and Zong, Chengqing (2019):
Synchronous Bidirectional Neural Machine Translation, Transactions of the Association for Computational Linguistics
@article{zhou-etal-2019-synchronous,
author = {Zhou, Long and Zhang, Jiajun and Zong, Chengqing},
title = {Synchronous Bidirectional Neural Machine Translation},
journal = {Transactions of the Association for Computational Linguistics},
volume = {7},
url = {
https://www.aclweb.org/anthology/Q19-1006},
doi = {10.1162/tacl_a_00256},
pages = {91--105},
year = 2019
}
Zhou et al. (2019)
Context
Chen, Kehai and Wang, Rui and Utiyama, Masao and Sumita, Eiichiro and Zhao, Tiejun (2017):
Context-Aware Smoothing for Neural Machine Translation, Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
@InProceedings{chen-EtAl:2017:I17-1,
author = {Chen, Kehai and Wang, Rui and Utiyama, Masao and Sumita, Eiichiro and Zhao, Tiejun},
title = {Context-Aware Smoothing for Neural Machine Translation},
booktitle = {Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
month = {November},
address = {Taipei, Taiwan},
publisher = {Asian Federation of Natural Language Processing},
pages = {11--20},
url = {
http://www.aclweb.org/anthology/I17-1002},
year = 2017
}
Chen et al. (2017)
Boosting
Zhang, Dakun and Kim, Jungi and Crego, Josep and Senellart, Jean (2017):
Boosting Neural Machine Translation, Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
@inproceedings{zhang-etal-2017-boosting,
author = {Zhang, Dakun and Kim, Jungi and Crego, Josep and Senellart, Jean},
title = {Boosting Neural Machine Translation},
booktitle = {Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},
month = {nov},
address = {Taipei, Taiwan},
publisher = {Asian Federation of Natural Language Processing},
url = {
https://www.aclweb.org/anthology/I17-2046},
pages = {271--276},
year = 2017
}
Zhang et al. (2017)
Dropout
Xiaolin Wang and Masao Utiyama and Eiichiro Sumita (2017):
Empirical Study of Dropout Scheme for Neural Machine Translation, Machine Translation Summit XVI
@inproceedings{mtsummit2017:Wang,
author = {Xiaolin Wang and Masao Utiyama and Eiichiro Sumita},
title = {Empirical Study of Dropout Scheme for Neural Machine Translation},
booktitle = {Machine Translation Summit XVI},
location = {Nagoya, Japan},
year = 2017
}
Wang et al. (2017)
Tuning
Hao Qin and Takahiro Shinozaki and Kevin Duh (2017):
Evolution Strategy based Automatic Tuning of Neural Machine Translation Systems, Proceedings of the International Workshop on Spoken Language Translation (IWSLT)
@inproceedings{IWSLT2017:Qin,
author = {Hao Qin and Takahiro Shinozaki and Kevin Duh},
title = {Evolution Strategy based Automatic Tuning of Neural Machine Translation Systems},
url = {
http://workshop2017.iwslt.org/downloads/O03-2-Paper.pdf},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
location = {Tokyo, Japan},
year = 2017
}
Qin et al. (2017)
Automatic Post-Editing
Vu, Thuy-Trang and Haffari, Gholamreza (2018):
Automatic Post-Editing of Machine Translation: A Neural Programmer-Interpreter Approach, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
@inproceedings{D18-1341,
author = {Vu, Thuy-Trang and Haffari, Gholamreza},
title = {Automatic Post-Editing of Machine Translation: A Neural Programmer-Interpreter Approach},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
address = {Brussels, Belgium},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/D18-1341},
pages = {3048--3053},
year = 2018
}
Vu and Haffari (2018)
Variational
Zhang, Biao and Xiong, Deyi and su, jinsong and Duan, Hong and Zhang, Min (2016):
Variational Neural Machine Translation, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
@InProceedings{zhang-EtAl:2016:EMNLP20162,
author = {Zhang, Biao and Xiong, Deyi and su, jinsong and Duan, Hong and Zhang, Min},
title = {Variational Neural Machine Translation},
booktitle = {Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing},
month = {November},
address = {Austin, Texas},
publisher = {Association for Computational Linguistics},
pages = {521--530},
url = {
https://aclweb.org/anthology/D16-1050},
year = 2016
}
Zhang et al. (2016)
Semi-Supervised
Cheng, Yong and Xu, Wei and He, Zhongjun and He, Wei and Wu, Hua and Sun, Maosong and Liu, Yang (2016):
Semi-Supervised Learning for Neural Machine Translation, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{cheng-EtAl:2016:P16-1,
author = {Cheng, Yong and Xu, Wei and He, Zhongjun and He, Wei and Wu, Hua and Sun, Maosong and Liu, Yang},
title = {Semi-Supervised Learning for Neural Machine Translation},
booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {1965--1974},
url = {
http://www.aclweb.org/anthology/P16-1185},
year = 2016
}
Cheng et al. (2016)
Discriminative
Do, Quoc-Khanh and Allauzen, Alexandre and Yvon, François (2015):
A Discriminative Training Procedure for Continuous Translation Models, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
@InProceedings{do-allauzen-yvon:2015:EMNLP,
author = {Do, Quoc-Khanh and Allauzen, Alexandre and Yvon, Fran\c{c}ois},
title = {A Discriminative Training Procedure for Continuous Translation Models},
booktitle = {Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
month = {September},
address = {Lisbon, Portugal},
publisher = {Association for Computational Linguistics},
pages = {1046--1052},
url = {
http://aclweb.org/anthology/D15-1121},
year = 2015
}
Do et al. (2015)
Non-Linear
Huang, Shujian and Chen, Huadong and Dai, Xin-Yu and Chen, Jiajun (2015):
Non-linear Learning for Statistical Machine Translation, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
@InProceedings{huang-EtAl:2015:ACL-IJCNLP,
author = {Huang, Shujian and Chen, Huadong and Dai, Xin-Yu and Chen, Jiajun},
title = {Non-linear Learning for Statistical Machine Translation},
booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
month = {July},
address = {Beijing, China},
publisher = {Association for Computational Linguistics},
pages = {825--835},
url = {
http://www.aclweb.org/anthology/P15-1080},
year = 2015
}
Huang et al. (2015)
Contrastive Noise Estimation
Cherry, Colin (2016):
An Empirical Evaluation of Noise Contrastive Estimation for the Neural Network Joint Model of Translation, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
@InProceedings{cherry:2016:N16-1,
author = {Cherry, Colin},
title = {An Empirical Evaluation of Noise Contrastive Estimation for the Neural Network Joint Model of Translation},
booktitle = {Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
address = {San Diego, California},
publisher = {Association for Computational Linguistics},
pages = {41--46},
url = {
http://www.aclweb.org/anthology/N16-1006},
year = 2016
}
Cherry (2016)
Distillation
Markus Freitag and Yaser Al-Onaizan and Baskaran Sankaran (2017):
Ensemble Distillation for Neural Machine Translation, CoRR
@article{DBLP:journals/corr/FreitagAS17,
author = {Markus Freitag and Yaser Al{-}Onaizan and Baskaran Sankaran},
title = {Ensemble Distillation for Neural Machine Translation},
journal = {CoRR},
volume = {abs/1702.01802},
url = {
http://arxiv.org/abs/1702.01802},
archiveprefix = {arXiv},
eprint = {1702.01802},
timestamp = {Mon, 13 Aug 2018 16:46:40 +0200},
biburl = {
https://dblp.org/rec/bib/journals/corr/FreitagAS17},
bibsource = {dblp computer science bibliography,
https://dblp.org},
year = 2017
}
Freitag et al. (2017)
Chen, Yun and Liu, Yang and Cheng, Yong and Li, Victor O.K. (2017):
A Teacher-Student Framework for Zero-Resource Neural Machine Translation, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{chen-EtAl:2017:Long5,
author = {Chen, Yun and Liu, Yang and Cheng, Yong and Li, Victor O.K.},
title = {A Teacher-Student Framework for Zero-Resource Neural Machine Translation},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {July},
address = {Vancouver, Canada},
publisher = {Association for Computational Linguistics},
pages = {1925--1935},
url = {
http://aclweb.org/anthology/P17-1176},
year = 2017
}
Chen et al. (2017)
Kim, Yoon and Rush, Alexander M. (2016):
Sequence-Level Knowledge Distillation, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
@inproceedings{kim-rush-2016-sequence,
author = {Kim, Yoon and Rush, Alexander M.},
title = {Sequence-Level Knowledge Distillation},
booktitle = {Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing},
month = {nov},
address = {Austin, Texas},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/D16-1139},
doi = {10.18653/v1/D16-1139},
pages = {1317--1327},
year = 2016
}
Kim and Rush (2016)
Dakun Zhang and Josep Crego and Jean Senellart (2018):
Analyzing Knowledge Distillation in Neural Machine Translation, Proceedings of the International Workshop on Spoken Language Translation (IWSLT)
@inproceedings{iwslt18-Distillation-Zhang,
author = {Dakun Zhang and Josep Crego and Jean Senellart},
title = {Analyzing Knowledge Distillation in Neural Machine Translation},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
year = 2018
}
Zhang et al. (2018)
Chen, Yun and Li, Victor O.K. and Cho, Kyunghyun and Bowman, Samuel (2018):
A Stable and Effective Learning Strategy for Trainable Greedy Decoding, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
@inproceedings{D18-1035,
author = {Chen, Yun and Li, Victor O.K. and Cho, Kyunghyun and Bowman, Samuel},
title = {A Stable and Effective Learning Strategy for Trainable Greedy Decoding},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
address = {Brussels, Belgium},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/D18-1035},
pages = {380--390},
year = 2018
}
Chen et al. (2018)