Analysis and Visualization

Neural machine translation models operate on high-dimensional representations at every stage of processing. Their abilities and failures are hard to determine from their millions of parameters. To better understand the behavior of neural machine translation models, researchers have compared their performance to phrase-based systems, explored the linguistic abilities of the models, and developed methods to visualize their processing.

Analysis And Visualization is the main subject of 85 publications. 62 are discussed here.


Quality of Machine Translation Output:

With the advent of neural machine translation and its better quality in terms of automatic metrics such as BLEU and human ranking of translation quality (Bojar et al., 2016), researchers and users of machine translation were initially interested in a more fine-grained assessment of the differences between these two technologies. Bentivogli et al. (2016); Bentivogli et al. (2018) considered different automatically assessed linguistic categories when comparing the performance of neural vs. statistical machine translation systems for English-German. Klubička et al. (2017) use multidimensional quality metrics (MQM) for a manual error analysis to compare two statistical and one neural system for English-Croatian. Burchardt et al. (2017) pose difficult linguistic challenges to assess several statistical, neural, and rule-based systems for German-English and English-German, showing better performance for the rule-based system for verb tense and valency, but better performance for the neural system for many other categories such as handling of composition, function words, multi-word expressions, and subordination. Harris et al. (2017) extend this analysis to English-Latvian and English-Czech. Popović (2017) uses similar manual annotation of different linguistic error categories to compare a neural and statistical system for these language pairs. Parida and Bojar (2018) compare a phrase-based statistical model, a recurrent neural translation model and a transformer model for the task of translation of short English-to-Hindi segments, with the transformer model coming out on top. Toral and Sánchez-Cartagena (2017) compared different broad aspects such as fluency and reordering for nine language directions. Castilho et al. (2017) use automatic scores when comparing neural and statistical machine translation for different domains (e-commerce, patents, educational content), showing better performance for the neural systems except for patent abstracts and e-commerce.
They followed this up (Castilho et al., 2017) with a more detailed human assessment of linguistic aspects for the educational content. They find better performance for the neural model across categories such as inflectional morphology, word order, omission, addition and mistranslation for 4 languages. Cohn-Gordon and Goodman (2019) examine how sentences that are ambiguous in one language due to underspecification are translated.
Addressing the use of machine translation, Martindale and Carpuat (2018) highlight that the typically fluent output of neural machine translation systems may lead to an unwarranted high level of trust. They show that exposure to bad translations reduces users' trust, but more so for disfluent than for misleading translations. Castilho and Guerberof (2018) carry out a task-based comparative evaluation between a neural and a statistical machine translation system for 3 language pairs. The human evaluators read the translation and answered questions about the content, allowing for measurement of reading speed and correctness of answers, as well as solicitation of feedback.
The claim of human parity for Chinese-English news translation (Hassan et al., 2018) has triggered a number of responses. Toral et al. (2018) call this claim into question by observing the impact of using test sets created in the reverse order (translated from the target side to the source side, opposite to the machine translation direction) and the skill of human evaluators. Läubli et al. (2018) present results that show annotators gave machine translation higher scores on adequacy than human translations, but only on the sentence level, not the document level, and also that human translations are ranked higher in terms of fluency.

Targeted Test Sets:

Isabelle et al. (2017) pose a challenge set of manually crafted French sentences for a number of linguistic categories that pose hard problems for translation, such as long distance agreement or preservation of polarity. Sennrich (2017) developed an automatic method to detect specific morphosyntactic errors. First a test set is created by taking sentence pairs and modifying the target sentence to exhibit specific types of error, such as wrong gender of determiners, wrong particles for verbs, or wrong transliteration. Then a neural translation model is evaluated by how often it scores the correct translation higher than the faulty translations. The paper compares byte-pair encoding against character-based models for rare and unknown words. Rios et al. (2017) use this method to create contrastive translation pairs to address the problem of translating ambiguous nouns. Burlot and Yvon (2017) use it to create a test set for selecting the correct morphological variant in a morphologically rich target language, Latvian. Müller et al. (2018) created a test set to evaluate the translation of pronouns, although Guillou and Hardmeier (2018) point out that automatic evaluation of pronoun translation is tricky and may not correlate well with human judgment. Shao et al. (2018) propose to evaluate the translation of idioms with a blacklist method: if words that are part of a literal translation of the idiomatic phrase occur in the output, it is flagged as incorrect.
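The contrastive evaluation and the blacklist check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of any cited paper: the function names, the `Scorer` type, and the toy scoring function are hypothetical, and a real setup would obtain scores (for instance log-probabilities) from a trained translation model.

```python
from typing import Callable, Iterable, Sequence, Tuple

# Hypothetical interface: maps (source, target) to a model score such as a
# log-probability; a higher score means the model prefers that translation.
Scorer = Callable[[str, str], float]


def contrastive_accuracy(
    examples: Iterable[Tuple[str, str, Sequence[str]]],
    score: Scorer,
) -> float:
    """Fraction of examples where the correct translation outscores every faulty variant."""
    total = 0
    correct = 0
    for source, reference, faulty_variants in examples:
        total += 1
        reference_score = score(source, reference)
        if all(reference_score > score(source, bad) for bad in faulty_variants):
            correct += 1
    return correct / total if total else 0.0


def violates_blacklist(output: str, blacklist: Iterable[str]) -> bool:
    """True if any word from the literal translation of an idiom occurs in the output."""
    tokens = set(output.lower().split())
    return any(word.lower() in tokens for word in blacklist)


def toy_score(source: str, target: str) -> float:
    """Toy stand-in for a model (assumption): it simply prefers targets starting with 'das'."""
    return 0.0 if target.startswith("das ") else -1.0


examples = [
    ("the house is small", "das Haus ist klein",
     ["der Haus ist klein", "die Haus ist klein"]),
    ("the woman sings", "die Frau singt", ["der Frau singt"]),
]
accuracy = contrastive_accuracy(examples, toy_score)
```

With the toy scorer, the first example counts as correct and the second does not, because the stand-in model cannot distinguish the reference from the faulty variant. Requiring a strictly higher score means ties are counted as failures, which is a design choice of this sketch.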


It is common to plot word embeddings (Maaten and Hinton, 2008) or attention weights (Koehn and Knowles, 2017; Vaswani et al., 2017) for inspection of parameters and model states. Marvin and Koehn (2018) plot embedding states for words marked with their senses. Ghader and Monz (2017) more closely examine attention states, in comparison to traditional word alignments. Tran et al. (2016) integrate an attention mechanism into a language model and show which previous words had the most influence on predictions of the next word. Stahlberg et al. (2018) add markup that flags translation decisions to the target side of the parallel corpus, and hence to the output of the translation model.
Lee et al. (2017) developed an interactive tool that allows exploration of the behavior of beam search. Strobelt et al. (2019) present the more comprehensive tool Seq2seq-Vis that also allows the plotting and comparison of encoder and decoder states to neighbor states seen during training.
Neubig et al. (2019) present the tool compare-mt that allows more fine-grained error analysis by comparing the output of two systems in terms of automatic scores, break-downs by word frequency, part-of-speech tags, and others, as well as identification of source words with strongly divergent translation quality.
Schwarzenberg et al. (2019) train a classifier using a convolutional neural network to distinguish between human and machine translations and use the contribution of word-based features to identify words that drive this decision.

Predicting Properties from Internal Representations: To probe intermediate representations, such as encoder and decoder states, a strategy is to use them as input to a classifier that predicts specific, mostly linguistic, properties.

Belinkov et al. (2017) predict the part of speech and morphological features of words linked to encoder and decoder states, showing better performance of character-based models, but not much difference for deeper layers. Belinkov et al. (2017) also consider semantic properties. Shi et al. (2016) find that basic syntactic properties are learned by translation models. Poliak et al. (2018) probe if sentence embeddings (the first and last state of the RNN encoder) have sufficient semantic information to serve as input to semantic entailment tasks.
Raganato and Tiedemann (2018) assess the encoder states of the transformer model. They develop 4 syntactic probing tasks (part-of-speech tagging, chunking, named entity recognition, and semantic dependency) and find that the earlier layers contain more syntactic information (e.g., part-of-speech tagging) while later layers contain more semantic information (e.g., semantic dependencies). Tang et al. (2018) examine the role of the attention mechanism when handling ambiguous nouns. Contrary to their intuition, the decoder pays more attention to the word itself instead of context words in the case of ambiguous nouns compared to nouns in general. This is the case both for RNN-based and transformer-based translation models. They suspect that word sense disambiguation already takes place in the encoder.
A number of studies of internal representations focus on just language modeling. Linzen et al. (2016) propose the task of subject-verb agreement, especially when interrupted by other nouns, as a challenge to sequence models that have to preserve agreement information. Gulordava et al. (2018) extend this idea into several other hierarchical language problems. Giulianelli et al. (2018) build classifiers to predict the verb agreement information from the internal states at different layers of an LSTM language model. They go a step further and demonstrate that changing the decoder states based on insight gained from the classifiers allows the model to make better decisions. Tran et al. (2018) compare fully attentional (transformer) models against recurrent neural networks when it comes to decisions depending on hierarchical structure. Their experiments show that recurrent neural networks perform better at tasks such as subject-verb agreement separated by recursive phrases. Zhang and Bowman (2018) show that states obtained from bidirectional language models are better at part-of-speech tagging and supertagging tasks than the encoder states of a neural translation model. Dhar and Bisazza (2018) explore whether multi-lingual language model training leads to more general syntactic knowledge, but find only a small improvement on agreement tasks when completely separating the vocabularies.
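The probing strategy discussed in this section (feed internal states to a classifier and check how well a linguistic property can be predicted) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the two-dimensional "states" are made up, the labels are toy part-of-speech tags, and a plain multi-class perceptron stands in for whatever classifier a given study actually used.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]
Weights = Dict[str, List[float]]


def predict(weights: Weights, state: Vector) -> str:
    """Return the label whose weight vector scores highest for this state."""
    features = list(state) + [1.0]  # append a constant bias feature

    def label_score(label: str) -> float:
        return sum(w * x for w, x in zip(weights[label], features))

    return max(sorted(weights), key=label_score)


def train_probe(states: List[Vector], labels: List[str], epochs: int = 20) -> Weights:
    """Train a multi-class perceptron that maps hidden states to labels."""
    dim = len(states[0])
    weights: Weights = {label: [0.0] * (dim + 1) for label in sorted(set(labels))}
    for _ in range(epochs):
        for state, gold in zip(states, labels):
            guess = predict(weights, state)
            if guess != gold:
                # Standard perceptron update: move toward gold, away from the guess.
                for i, value in enumerate(list(state) + [1.0]):
                    weights[gold][i] += value
                    weights[guess][i] -= value
    return weights


def probe_accuracy(weights: Weights, states: List[Vector], labels: List[str]) -> float:
    hits = sum(predict(weights, s) == y for s, y in zip(states, labels))
    return hits / len(labels)


# Made-up "encoder states" with toy part-of-speech labels (illustrative only).
train_states = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0], [0.1, 1.0], [0.2, 0.9], [0.0, 0.8]]
train_labels = ["NOUN", "NOUN", "NOUN", "VERB", "VERB", "VERB"]
probe = train_probe(train_states, train_labels)
```

In an actual probing study the accuracy would be measured on held-out states, and compared across layers or model types; high probe accuracy is usually read as evidence that the property is recoverable from the representation.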

Role of Individual Neurons:

Karpathy et al. (2016) inspect individual neurons in a character-based language model and find single neurons that appear to keep track of position in the line (expecting a line break character), and the opening of brackets. Shi et al. (2016) correlated activation values of specific nodes in the state of a simple LSTM encoder-decoder translation model (without attention) with the length of the output and discovered nodes that count the number of words to ensure proper output length.
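One simple way to look for such length-tracking units is to correlate each neuron's activation with the output length across many sentences. The sketch below assumes a hypothetical matrix `activations[i][j]` (activation of neuron j for sentence i, for example taken from a final encoder state) and uses Pearson correlation; the cited work may have used a different analysis.

```python
import math
from typing import List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def most_length_sensitive_neuron(
    activations: List[List[float]], lengths: Sequence[float]
) -> Tuple[int, float]:
    """Index and correlation of the neuron most strongly correlated with length."""
    num_neurons = len(activations[0])
    correlations = [
        pearson([row[j] for row in activations], lengths)
        for j in range(num_neurons)
    ]
    best = max(range(num_neurons), key=lambda j: abs(correlations[j]))
    return best, correlations[best]


# Made-up activations for 4 sentences and 3 neurons (illustrative only).
activations = [
    [0.3, 0.5, -1.0],
    [0.1, 1.0, -2.1],
    [0.4, 1.5, -2.9],
    [0.2, 2.0, -4.2],
]
output_lengths = [5, 10, 15, 20]
best_neuron, best_correlation = most_length_sensitive_neuron(activations, output_lengths)
```

In this toy data the second neuron (index 1) grows linearly with the output length and is therefore selected.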

Tracing Decisions Back to Prior States:

Ding et al. (2017) propose to use layer-wise relevance propagation to measure which of the input states or intermediate states had the biggest influence on prediction decisions. Tackling the same problem, Ding et al. (2019) propose to use saliency, a method that measures the impact of input states based on how much small changes in their values (as indicated by the gradients) impact prediction decisions. Ma et al. (2018) examine the relative role of source context and prior decoder states on output word predictions. Knowles and Koehn (2018) explore what drives decisions of the model to copy input words such as names. They show the impact of both the context and properties of the word (such as capitalization).
Wallace et al. (2018) change the way predictions are made in neural models. Instead of a softmax prediction layer, final decoder states are compared to states during training, providing examples that explain the decision of the network.
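The idea behind saliency can be sketched without a neural network toolkit: perturb each input value slightly and measure how much the prediction score changes. The sketch below approximates the gradient with central finite differences on a toy linear scoring function; both the scoring function and the aggregation (sum of absolute gradients per word) are assumptions chosen for illustration, and real implementations would obtain gradients by backpropagation through the translation model.

```python
from typing import Callable, List

Embeddings = List[List[float]]  # one embedding vector per input word


def saliency(
    score: Callable[[Embeddings], float],
    inputs: Embeddings,
    eps: float = 1e-5,
) -> List[float]:
    """Per-word saliency: sum over dimensions of |d score / d input value|."""
    work = [list(vector) for vector in inputs]  # copy so the caller's data is untouched
    result = []
    for vector in work:
        total = 0.0
        for dim in range(len(vector)):
            original = vector[dim]
            vector[dim] = original + eps
            up = score(work)
            vector[dim] = original - eps
            down = score(work)
            vector[dim] = original  # restore before moving on
            total += abs((up - down) / (2 * eps))
        result.append(total)
    return result


# Toy "model" (assumption): the score is a fixed linear function of the inputs.
TOY_WEIGHTS = [[2.0, -1.0], [0.0, 0.0], [0.5, 0.5]]


def toy_score(embeddings: Embeddings) -> float:
    return sum(
        w * x
        for w_row, x_row in zip(TOY_WEIGHTS, embeddings)
        for w, x in zip(w_row, x_row)
    )


word_saliency = saliency(toy_score, [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
```

For the toy function, the first word receives the highest saliency and the second word none, since its weights are zero and changing its embedding cannot affect the score.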


Koehn and Knowles (2017) identify six challenges for neural machine translation, such as domain mismatch, low resource, beam search, etc. Khayrallah and Koehn (2018) find that neural methods are more sensitive to noise than previous statistical methods, especially untranslated source sentences. The copy noise problem was also identified by Ott et al. (2018) who suggest several remedies. Belinkov and Bisk (2018) consider natural and synthetic noise in the spelling of words and propose character-based word embedding models. They develop their own character models to address it, while Heigold et al. (2018) show that character-based models are better than byte-pair-encoding-based models. Michel et al. (2019) develop metrics aimed at finding minimal changes to the input that result in maximal changes in the output, so-called adversarial examples. Michel and Neubig (2018) propose a test set of noisy text, derived from the web forum Reddit, consisting of acronyms, misspellings, hashtags, emoticons and exaggerated capitalization. Wees et al. (2018) propose a test set for 4 different domains (news, colloquial, editorial, and speech) for 4 different languages (Arabic, Chinese, Bulgarian, Persian) as a challenge for research in adaptation. Wei et al. (2018) examine if the output of neural machine translation models is syntactically well formed by parsing it with a linguistically precise HPSG grammar. While they find that 93% of output sentences conform to the grammar, they also identify a number of constructions that pose challenges.



Related Topics

New Publications

  • Freitag et al. (2019)
  • Hashimoto et al. (2019)
  • Zhang and Toral (2019)
  • Anastasopoulos (2019)
  • Clark et al. (2019)
  • Li et al. (2019)
  • Voita et al. (2019)
  • Kepler et al. (2019)
  • Chinea-Rios et al. (2018)
  • Grundkiewicz and Junczys-Dowmunt (2018)
  • Pham et al. (2018)
  • Unanue et al. (2018)
  • Domhan (2018)
  • Tang et al. (2018)


  • Alkhouli et al. (2018)

Linguistic Properties of Hidden Representations

  • Eger et al. (2016)

Translation Quality

  • Rabinovich et al. (2016)
  • Chinea-Rios et al. (2018)
  • Guta et al. (2015)

Evaluation Metrics

  • Shimanaka et al. (2018)
  • Apidianaki et al. (2018)