Multilingual Word Embeddings

A mapping between the word embedding spaces of different languages, or a common word embedding space for all languages, provides a shared semantic space that reveals word correspondences across languages.

Multilingual Word Embeddings is the main subject of 60 publications. 44 are discussed here.


Ruder et al. (2017) give a comprehensive overview of work on cross-lingual word embeddings. The observation that word representations obtained from their distributional properties (i.e., how they are used in text) are similar across languages has long been known, but Mikolov et al. (2013) were among the first to observe this for the word embeddings generated by neural models and to suggest that a simple linear transformation from word embeddings in one language to word embeddings in another language may be used to translate words.

Aligning Embedding Spaces

Mikolov et al. (2013) learn the linear mapping between pre-existing embedding spaces by minimizing the distance between a projected source word vector and a target word vector for each entry of a given seed lexicon. Xing et al. (2015) improve this method by requiring the mapping matrix to be orthogonal. Artetxe et al. (2016) refine this method further with mean centering. Faruqui and Dyer (2014) map monolingually generated word embeddings into a shared bilingual embedding space using canonical correlation analysis, maximizing the correlation of the two vectors for each word translation pair. Braune et al. (2018) point out that the accuracy of the obtained bilingual lexicons is much lower for rare words, a problem that can be partly addressed with additional features, such as representations built on letter n-grams, and by taking orthographic distance into account when mapping words. Heyman et al. (2019) learn a linear transform between embedding spaces based on an automatically generated seed lexicon and show improvements by incrementally adding languages and matching the spaces of newly added languages to all previous languages (multi-hub). Alqaisi and O'Keefe (2019) consider the problem of morphologically rich languages, using Arabic as an example, and demonstrate the importance of morphological analysis and word splitting.
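The difference between an unconstrained linear mapping and an orthogonal one can be illustrated with a minimal NumPy sketch. It is not the implementation of any of the cited papers: the function names, the toy data, and the use of a closed-form least-squares solution (rather than gradient-based training) are assumptions made for illustration. Rows of X and Y are the source and target vectors of the seed lexicon entries.

```python
import numpy as np

def center(M):
    """Mean centering: subtract the average vector from every row."""
    return M - M.mean(axis=0, keepdims=True)

def fit_linear_map(X, Y):
    """Unconstrained least squares: the W that minimizes ||X W - Y||_F."""
    W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def fit_orthogonal_map(X, Y):
    """Orthogonal Procrustes: the orthogonal W that minimizes ||X W - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy data (hypothetical): the target space is a rotated copy of the source space.
rng = np.random.default_rng(0)
X = center(rng.normal(size=(50, 5)))          # 50 seed words, 5 dimensions
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # a random orthogonal matrix
Y = X @ Q

W_linear = fit_linear_map(X, Y)
W_orthogonal = fit_orthogonal_map(X, Y)
```

On this noise-free toy data both solutions recover the rotation Q. With noisy real embeddings they generally differ, and only the orthogonal solution preserves vector lengths and angles.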

Seed Lexicon

Supervised and semi-supervised approaches to map embedding spaces require a seed lexicon of word translation pairs. These are most commonly generated with traditional statistical methods from parallel corpora (Faruqui and Dyer, 2014). Lubin et al. (2019) address the problem of noisy word pairs in such automatically generated lexicons, showing that they cause significant harm, and develop a method that learns the noise level and finds noisy pairs. Søgaard et al. (2018) use identically spelled words in both languages as seeds. Shi et al. (2019) use off-the-shelf bilingual dictionaries and detail how such human-targeted dictionary definitions need to be preprocessed. Artetxe et al. (2017) reduce the need for large seed dictionaries by starting with just 25 entries and iteratively extending the dictionary based on the obtained mappings. Making do with weaker supervision, Gouws et al. (2015) learn directly from sentence pairs by predicting words in a target sentence from words in a source sentence. Coulmance et al. (2015) explore a variant of this idea. Vulić and Moens (2015) use Wikipedia document pairs, aiming to predict words in mixed-language documents. Zhou et al. (2019) also use identically spelled words as seeds. Vulić and Korhonen (2016) compare different types and sizes of seed lexicons.
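The iterative idea of growing a dictionary from a small seed can be sketched as a loop that alternates between fitting a mapping and inducing a new dictionary from it. This is a simplified illustration, not the procedure of Artetxe et al. (2017): the orthogonal mapping, the dot-product nearest-neighbor retrieval, the fixed iteration count, and all names are assumptions.

```python
import numpy as np

def unit_rows(M):
    """Scale every row vector to length 1."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def fit_orthogonal_map(X, Y):
    """Orthogonal W minimizing ||X W - Y||_F for aligned rows of X and Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def induce_dictionary(X, Y, W):
    """Pair every source word with its most similar target word."""
    similarities = (X @ W) @ Y.T
    return np.arange(X.shape[0]), similarities.argmax(axis=1)

def self_learning(X, Y, seed_src, seed_tgt, iterations=3):
    """Alternate between learning a mapping and re-inducing the dictionary."""
    src, tgt = np.asarray(seed_src), np.asarray(seed_tgt)
    for _ in range(iterations):
        W = fit_orthogonal_map(X[src], Y[tgt])
        src, tgt = induce_dictionary(X, Y, W)
    return W, src, tgt

# Toy data (hypothetical): word i in the source language translates to word i
# in the target language, and the target space is a rotation of the source space.
rng = np.random.default_rng(0)
X = unit_rows(rng.normal(size=(200, 10)))
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))
Y = X @ Q

seed = np.arange(25)  # a seed lexicon of 25 entries
W, src, tgt = self_learning(X, Y, seed, seed)
```

On real embeddings the induced dictionary is noisy, so the quality of each round depends on how the retrieval step handles frequent and ambiguous words.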

Unsupervised Methods

Barone (2016) suggests the idea of using auto-encoders and adversarial training to learn a mapping between monolingual embedding spaces without any parallel data or any other bilingual signal but does not report any results. Zhang et al. (2017) demonstrate the effectiveness of this idea, exploring both unidirectional and bidirectional mappings. Conneau et al. (2018) add a fine-tuning step based on a synthetic dictionary of high-confidence word pairs, achieving vastly better results. Mohiuddin and Joty (2019) extend this approach into a symmetric setup that learns mappings in both directions, with a discriminator for each language (called a CycleGAN), and a reconstruction loss as a component of the training objective. Xu et al. (2018) propose a similar method, using Sinkhorn distance. Chen and Cardie (2018) extend the adversarial training approach to more than two languages.
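
One plausible way to build a synthetic dictionary of high-confidence word pairs is to keep only mutual nearest neighbors, i.e., pairs in which each word is the other's best match. The sketch below illustrates that selection rule on a hypothetical similarity matrix. Reading "high confidence" as mutual nearest neighbors, as well as the function name and the toy values, are assumptions here, not details given in the text.

```python
import numpy as np

def mutual_nearest_neighbors(similarities):
    """Return (source, target) index pairs that are each other's best match.

    similarities[i, j] is the similarity of mapped source word i and target word j.
    """
    best_tgt = similarities.argmax(axis=1)  # best target for every source word
    best_src = similarities.argmax(axis=0)  # best source for every target word
    src = np.arange(similarities.shape[0])
    keep = best_src[best_tgt] == src
    return [(int(i), int(best_tgt[i])) for i in src[keep]]

# Hypothetical similarities: every source word prefers target word 0,
# but target word 0 only prefers source word 0.
S = np.array([[0.9, 0.3, 0.2],
              [0.7, 0.6, 0.1],
              [0.6, 0.2, 0.5]])
pairs = mutual_nearest_neighbors(S)  # [(0, 0)]
```

The resulting pairs can then serve as a seed lexicon for a supervised refinement of the mapping.
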
Instead of using adversarial training, Zhang et al. (2017) measure the difference between the two embedding spaces with earth mover's distance, defined as the sum of distances of how far each word vector has to be moved towards the nearest vector in the other language's embedding space. Hoshen and Wolf (2018) follow the same intuition but first reduce the complexity of the word vectors with principal component analysis (PCA) and align the spaces along the resulting axes first. Their iterative algorithm moves the projections of word vectors to the closest target-side vector in the projected space. Alvarez-Melis and Jaakkola (2018) draw parallels between this approach and optimal transport. In their method, they minimize the distance between a projected vector and all target-side vectors, measured by the L2 norm. Alaux et al. (2019) extend this to more than two languages by mapping all languages into a common space and matching the word embedding distributions of any two languages at a time. Mukherjee et al. (2018) use squared-loss mutual information (SMI) as the optimization measure to match the monolingual distributions. Zhou et al. (2019) first learn a density distribution over each of the monolingual word embedding spaces using Gaussian mixture models, and then map these spaces so that word vectors in one language are mapped to vectors in the other language's space with similar density, measured with KL divergence.
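
The transport view of matching two embedding spaces can be made concrete with entropy-regularized optimal transport solved by Sinkhorn iterations. The sketch below is a generic textbook formulation, not the method of any paper cited above. It assumes that the source vectors have already been projected into the target space, that all words carry equal mass, and that the regularization strength and iteration count (both chosen arbitrarily) are adequate.

```python
import numpy as np

def squared_distances(A, B):
    """Matrix of squared Euclidean distances between rows of A and rows of B."""
    diff = A[:, None, :] - B[None, :, :]
    return (diff ** 2).sum(axis=-1)

def sinkhorn_plan(cost, reg=0.05, iterations=200):
    """Entropy-regularized transport plan between two uniform distributions."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # mass of each source word
    b = np.full(m, 1.0 / m)   # mass of each target word
    K = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(iterations):
        v = b / (K.T @ u)     # rescale so that columns match b
        u = a / (K @ v)       # rescale so that rows match a
    return u[:, None] * K * v[None, :]

# Toy data (hypothetical): the target space holds the same vectors in shuffled order.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))
X /= np.linalg.norm(X, axis=1, keepdims=True)
perm = rng.permutation(30)
Y = X[perm]

cost = squared_distances(X, Y)
plan = sinkhorn_plan(cost)
matches = plan.argmax(axis=1)               # most likely target for each source word
transport_cost = float((plan * cost).sum())  # total cost of moving the mass
```

The transport cost plays the role of the distance between the two spaces, and the plan itself provides soft word correspondences.
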
Instead of learning a mapping between embedding spaces, Marie and Fujita (2019) learn a joint embedding space for multiple languages using a skip-gram model trained on mixed-language text. Their method is bootstrapped with unsupervised statistical machine translation. Wada et al. (2019) train a multilingual bidirectional language model with language-specific embeddings but shared state progression parameters. The resulting word embeddings are in a common space, with words close to their translations.

Properties of the Mapping

Methods that operate on fixed monolingual embedding spaces often learn a linear mapping between them, hence assuming that the spaces have approximately the same structure (i.e., that they are approximately isomorphic). Nakashole and Flauger (2018) show that this assumption is less accurate when distant languages are involved. Søgaard et al. (2018) find the same when the languages are linguistically different, using a metric based on eigenvectors. They also note that the method works less well when the monolingual data is not drawn from the same domain or when different methods for monolingual word embedding training are used. Nakashole (2018) proposes to use linear mappings that are local to neighborhoods of words. Xing et al. (2015) argue that the mapping matrix should be orthogonal and show improvements when constraining it thus. Patra et al. (2019) relax orthogonality to a soft constraint in the training objective.
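A soft orthogonality constraint can be expressed as a penalty term that grows as the mapping matrix departs from orthogonality. The sketch below uses the squared Frobenius norm of W^T W - I as that penalty. This is a common generic choice and an assumption here, not necessarily the exact objective of Patra et al. (2019). The function names and the weight parameter are also hypothetical.

```python
import numpy as np

def orthogonality_penalty(W):
    """Squared Frobenius norm of W^T W - I; zero exactly when W is orthogonal."""
    d = W.shape[1]
    return float(np.sum((W.T @ W - np.eye(d)) ** 2))

def soft_constrained_loss(W, X, Y, weight=1.0):
    """Mapping error on the seed pairs plus a weighted orthogonality penalty."""
    fit = float(np.sum((X @ W - Y) ** 2))
    return fit + weight * orthogonality_penalty(W)

rotation = np.array([[0.0, -1.0],
                     [1.0, 0.0]])   # orthogonal: a 90-degree rotation
scaling = 2.0 * np.eye(2)           # not orthogonal: doubles all lengths

penalty_rotation = orthogonality_penalty(rotation)  # 0.0
penalty_scaling = orthogonality_penalty(scaling)    # 18.0
```

A hard constraint corresponds to the limit of an infinitely large weight, while a weight of zero gives the unconstrained linear mapping.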

Hubness Problem

A known problem in finding the most similar word in another language is the hubness problem. Some words are close to many other words and hence are more frequently identified as translations. Conneau et al. (2018) consider the average distance to neighboring words in the other language and scale the distance calculation accordingly. Joulin et al. (2018) use this adjustment during training. Smith et al. (2017) propose to normalize the distance matrix between input and output words. Given the distances of a source word to every target word, the distances are normalized to add up to 1 using the softmax, and vice versa. Huang et al. (2019) formalize the underlying intuition behind this idea as an optimization problem to enforce both normalizations jointly and propose a gradient descent method to solve it.
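Both kinds of correction can be illustrated on a small similarity matrix in which one target word acts as a hub. The sketch below shows a neighborhood-based rescaling in the spirit of Conneau et al. (2018) and a softmax normalization over source words in the spirit of Smith et al. (2017). The similarity values, the neighborhood size k, the temperature beta, and the function names are assumptions made for illustration.

```python
import numpy as np

def neighborhood_scaled_similarity(S, k=2):
    """Penalize words whose k nearest neighbors in the other language are all close.

    S[i, j] is the similarity of mapped source word i and target word j.
    """
    r_src = np.sort(S, axis=1)[:, -k:].mean(axis=1)  # per source word
    r_tgt = np.sort(S, axis=0)[-k:, :].mean(axis=0)  # per target word
    return 2 * S - r_src[:, None] - r_tgt[None, :]

def softmax_over_sources(S, beta=10.0):
    """For each target word, normalize similarities over all source words."""
    E = np.exp(beta * (S - S.max(axis=0, keepdims=True)))  # shifted for stability
    return E / E.sum(axis=0, keepdims=True)

# Hypothetical similarities: target word 0 is a hub, close to every source word.
S = np.array([[0.9, 0.3, 0.2],
              [0.7, 0.6, 0.1],
              [0.6, 0.2, 0.5]])

plain = S.argmax(axis=1)                                   # [0, 0, 0]
rescaled = neighborhood_scaled_similarity(S).argmax(axis=1)  # [0, 1, 2]
normalized = softmax_over_sources(S).argmax(axis=1)          # [0, 1, 2]
```

With plain nearest-neighbor retrieval every source word is translated as the hub, whereas both corrections assign each source word its own target word in this toy case.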

Multilingual Sentence Embeddings

Schwenk and Douze (2017) propose to obtain sentence embeddings from an LSTM-based neural machine translation model by adopting the final encoder state or max-pooling over all encoder states. Schwenk (2018) obtains better results by training a joint encoder for multiple languages and applies it to filter noisy parallel corpora. Similarly, España-Bonet et al. (2017) compute the sum of the encoder states to obtain sentence embeddings. Artetxe and Schwenk (2019) present an encoder-decoder model built specifically to generate sentence embeddings, trained on parallel sentence pairs but with a single sentence embedding vector as the interface between encoder and decoder. Artetxe and Schwenk (2018) implement this approach as a freely available toolkit called LASER. Schwenk et al. (2019) use it to extract large parallel corpora from Wikipedia. Ruiter et al. (2019) compute sentence embeddings as the sum of word embeddings or encoder states of a neural machine translation model. They use these sentence embeddings to find parallel sentence pairs in a comparable corpus and iterate this process to improve the translation model and then find more and better sentence pairs.
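The pooling options mentioned above (final state, max-pooling, sum) and the use of sentence embeddings to find candidate translation pairs can be sketched as follows. The encoder itself is not modeled: the state matrix is a hypothetical stand-in for the per-token outputs of a translation model's encoder, and cosine similarity is an assumed choice of comparison measure.

```python
import numpy as np

def sentence_embedding(encoder_states, mode="max"):
    """Collapse a (tokens x dimensions) matrix of encoder states into one vector."""
    if mode == "final":
        return encoder_states[-1]
    if mode == "max":
        return encoder_states.max(axis=0)
    if mode == "sum":
        return encoder_states.sum(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(query, candidates):
    """Index of the candidate embedding most similar to the query embedding."""
    return int(np.argmax([cosine(query, c) for c in candidates]))

# Hypothetical encoder states for a three-token sentence in a 2-dimensional model.
states = np.array([[1.0, -2.0],
                   [3.0, 0.0],
                   [-1.0, 4.0]])

final_vec = sentence_embedding(states, "final")  # [-1.0, 4.0]
max_vec = sentence_embedding(states, "max")      # [3.0, 4.0]
sum_vec = sentence_embedding(states, "sum")      # [3.0, 2.0]

candidates = [np.array([4.0, -3.0]), np.array([6.0, 8.0])]
match = best_match(max_vec, candidates)          # 1
```

In a mining setup, the candidates would be embeddings of sentences in the other language, and pairs above a similarity threshold would be kept as parallel data.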

Multilingual Document Embeddings

To address the task of aligning bilingual documents, Guo et al. (2019) present a model to obtain document embeddings, built from word and sentence embeddings.



New Publications

Unsupervised Methods

  • Aldarmaki et al. (2018)
  • Chi and Chen (2018)
  • Duong et al. (2016)


  • Ramesh and Sankaranarayanan (2018)
  • Hazem and Morin (2017)
  • Doval et al. (2018)
  • Ruder et al. (2018)
  • Dou et al. (2018)
  • Duong et al. (2017)
  • Cao et al. (2016)
  • Shi et al. (2015)
  • Hermann and Blunsom (2014)
  • Zou et al. (2013)
  • Chandar A P et al. (2014)
  • Huang et al. (2015)
  • Su et al. (2015)