A common source of error in neural machine translation is dropping or double-translating words in the input sentence. Explicit models of word coverage address this problem.
Coverage is the main subject of 11 publications. 8 are discussed here.
Chen et al. (2016); Liu et al. (2016) add supervised word alignment information (obtained with traditional statistical word alignment methods) to training. They augment the objective function to also optimize matching of the attention mechanism to the given alignments.
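A minimal sketch of such an augmented objective, under stated assumptions: the source does not say how the mismatch between attention and alignment is measured, so cross-entropy is used here as one plausible choice. The function names and the interpolation `weight` are hypothetical, not taken from either paper.

```python
import math


def guided_alignment_loss(attention, alignment, eps=1e-9):
    """Average cross-entropy between reference alignments and attention.

    attention[i][j]: attention weight of output word i on input word j.
    alignment[i][j]: reference alignment weight from an external aligner
                     (a row sums to 1, or is all zeros if unaligned).
    Cross-entropy is an assumption; other distance measures are possible.
    """
    total = 0.0
    for att_row, ali_row in zip(attention, alignment):
        for a, r in zip(att_row, ali_row):
            if r > 0.0:
                total -= r * math.log(a + eps)
    return total / len(attention)


def augmented_objective(translation_nll, attention, alignment, weight=0.5):
    """Translation loss plus a weighted alignment-matching term.

    `weight` is a hypothetical hyperparameter for this illustration.
    """
    return translation_nll + weight * guided_alignment_loss(attention, alignment)
```

The extra term is zero when the attention reproduces the given alignment exactly and grows as attention mass moves away from the aligned input words.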
To better model coverage, Tu et al. (2016) add coverage states for each input word by either (a) summing up attention values, scaled by a fertility value predicted from the input word in context, or (b) learning a coverage update function as a feed-forward neural network layer. This coverage state is added as additional conditioning context for the prediction of the attention state.
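Variant (a) can be illustrated with a small sketch. It assumes that scaling means dividing each attention value by the fertility of its input word, and it uses fixed, made-up fertility values where the model would predict them. The function name `update_coverage` is hypothetical.

```python
def update_coverage(coverage, attention, fertility):
    """Accumulate attention per input word, scaled by its fertility.

    coverage[j]:  coverage accumulated so far for input word j.
    attention[j]: attention weight on input word j at the current step.
    fertility[j]: expected number of output words for input word j
                  (predicted by the model in practice; fixed here).
    """
    return [c + a / f for c, a, f in zip(coverage, attention, fertility)]


# Three output steps over a three-word input; the second input word is
# assumed to produce two output words (fertility 2.0).
fertility = [1.0, 2.0, 1.0]
attention_steps = [
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.8, 0.2],
]

coverage = [0.0, 0.0, 0.0]
for attention in attention_steps:
    coverage = update_coverage(coverage, attention, fertility)
```

A coverage value near 1 suggests that an input word has been translated as often as its fertility predicts. A value well below 1, as for the third word in this example, signals that the word may still need to be translated.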
Feng et al. (2016) additionally condition the prediction of the attention state on the previous context state and introduce a coverage state (initialized with the sum of input source embeddings) that aims to subtract covered words at each step. Similarly, Meng et al. (2016) separate hidden states that keep track of source coverage from hidden states that keep track of produced output.
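The idea of a coverage state that starts as the sum of the source embeddings and shrinks as words are covered can be sketched as follows. This is a simplification for illustration: it subtracts the attention-weighted embedding directly, which is an assumption of this sketch, and the function names are hypothetical.

```python
def init_coverage(source_embeddings):
    """Start the coverage state as the sum of all input word embeddings."""
    dim = len(source_embeddings[0])
    return [sum(e[d] for e in source_embeddings) for d in range(dim)]


def subtract_covered(coverage, attention, source_embeddings):
    """Remove the attention-weighted input embedding from the coverage state.

    Plain subtraction is an illustrative simplification of this sketch.
    """
    covered = [
        sum(a * e[d] for a, e in zip(attention, source_embeddings))
        for d in range(len(coverage))
    ]
    return [c - v for c, v in zip(coverage, covered)]


# Two input words with toy 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 2.0]]
state = init_coverage(embeddings)                        # both words uncovered
state = subtract_covered(state, [1.0, 0.0], embeddings)  # first word covered
state = subtract_covered(state, [0.0, 1.0], embeddings)  # second word covered
```

Under these assumptions the state reaches the zero vector once every input word has received a total attention of 1, so what remains in the state reflects what is still untranslated.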
Cohn et al. (2016) add a number of biases to model coverage, fertility, and alignment inspired by traditional statistical machine translation models. They condition the prediction of the attention state on absolute word positions, the attention state of the previous output word in a limited window, and coverage (summed attention state values) over a limited window. They also add a fertility model and add coverage to the training objective.
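The three kinds of conditioning information named above can be gathered as in the following sketch. The feature layout, the zero padding at sentence boundaries, and the names `window` and `attention_features` are assumptions made for illustration. How these features enter the attention computation is not specified here.

```python
def window(values, center, k):
    """Values at positions center-k..center+k, padded with 0.0 outside."""
    return [
        values[p] if 0 <= p < len(values) else 0.0
        for p in range(center - k, center + k + 1)
    ]


def attention_features(i, j, prev_attention, history, k=1):
    """Collect extra inputs for scoring input word j at output step i.

    prev_attention: attention weights of the previous output word.
    history:        attention weights of all previous output words.
    k:              half-width of the limited window (hypothetical default).
    """
    # Coverage of each input word: attention summed over all earlier steps.
    coverage = [
        sum(step[p] for step in history) for p in range(len(prev_attention))
    ]
    return {
        "positions": (i, j),
        "prev_attention": window(prev_attention, j, k),
        "coverage": window(coverage, j, k),
    }


history = [[0.7, 0.3, 0.0], [0.1, 0.6, 0.3]]
features = attention_features(i=2, j=0, prev_attention=history[-1], history=history)
```

Restricting both the previous attention and the coverage to a small window keeps the number of added inputs independent of sentence length.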
Alkhouli et al. (2016) propose to integrate an alignment model that is similar to word-based statistical machine translation into a basic sequence-to-sequence translation model. This model is trained externally with traditional word alignment methods. It informs predictions about which input word to translate next, and the lexical translation decision is based on that word. Alkhouli and Ney (2017) combine such an alignment model with the more traditional attention model, showing improvements.
- Yang et al. (2019)
- Li et al. (2018)