Lexicon Induction for Less Commonly Used Languages

Introduction

Parallel corpora are an invaluable resource for inferring accurate translation correspondences. However, sufficiently large parallel texts exist for only a small fraction of language pairs and are laborious to produce in quantity. The expense of generating parallel training data is considered the principal hurdle preventing wider deployment of translation systems, so other resources are needed to get around it.

Alternative resources that are either available or cheap to collect include vast amounts of monolingual text published on the web and comparable multilingual corpora (e.g. Wikipedia page translations). They often provide cues that can be used to define similarity measures between words across languages, and thus to induce high-quality translation lexicons. Examples of these cues include:

Contextual similarity: words which are translations of one another are likely to occur in similar contexts across languages (e.g. [3]).

Temporal similarity: in time-stamped texts (e.g. news streams), word translations tend to have similar temporal histograms (e.g. [4]).

Topic / category similarity: words which are translations of each other are likely to appear in similar topics, e.g. news topics or Wikipedia categories.

Phonetic similarity: useful for pairing up transliterated named entities and cognates.
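As an illustration of the first cue, contextual similarity can be sketched as follows: build a context vector for a word in L1, map it into L2 through a seed dictionary, and compare it with the context vectors of L2 candidates. This is only a minimal sketch under stated assumptions; the function names, the window size, the toy sentences, and the choice of raw counts with cosine similarity are illustrative and are not taken from [3] or from the project code.

```python
import math
from collections import Counter

def context_vector(tokens, target, window=1):
    """Count the words occurring within `window` positions of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts

def project(vector, seed_dict):
    """Map an L1 context vector into L2; words missing from the seed dictionary are dropped."""
    projected = Counter()
    for word, count in vector.items():
        if word in seed_dict:
            projected[seed_dict[word]] += count
    return projected

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy monolingual "corpora" and a hypothetical seed dictionary (assumptions for illustration).
l1_tokens = "the dog barks at the cat and the dog sleeps".split()
l2_tokens = "el perro ladra al gato y el perro duerme".split()
seed = {"the": "el", "barks": "ladra", "sleeps": "duerme", "at": "al", "and": "y"}

source = project(context_vector(l1_tokens, "dog"), seed)
scores = {cand: cosine(source, context_vector(l2_tokens, cand)) for cand in ("perro", "gato")}
best = max(scores, key=scores.get)
```

On this toy data "perro" scores higher than "gato" as a translation of "dog", because it shares the projected neighbours "el", "ladra", and "duerme". Real corpora would call for larger windows and some form of association weighting rather than raw counts.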

Each of these cues provides an independent method for scoring candidate translations, and combining them has been shown to improve the quality of the induced lexicon (e.g. [4, 2]).
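One simple way to combine independent cues is to convert each cue's scores into ranks and average the reciprocal ranks. The sketch below shows that heuristic only; it is an assumption chosen for illustration and not the combination method of [4] or [2]. The cue names, scores, and candidate words are made up.

```python
from collections import defaultdict

def to_ranks(scores):
    """Map each candidate to its 1-based rank under one cue (highest score first)."""
    ordered = sorted(scores, key=lambda cand: (-scores[cand], cand))
    return {cand: position for position, cand in enumerate(ordered, start=1)}

def combine(cue_scores):
    """Average reciprocal ranks over all cues; a candidate unseen by a cue contributes 0."""
    totals = defaultdict(float)
    for scores in cue_scores.values():
        for cand, rank in to_ranks(scores).items():
            totals[cand] += 1.0 / rank
    n_cues = len(cue_scores)
    combined = [(cand, total / n_cues) for cand, total in totals.items()]
    return sorted(combined, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical per-cue scores for candidate translations of one L1 word.
cue_scores = {
    "contextual": {"perro": 0.9, "gato": 0.4, "casa": 0.1},
    "temporal": {"perro": 0.6, "gato": 0.7},
    "phonetic": {"perro": 0.2, "casa": 0.3},
}
ranking = combine(cue_scores)
```

Working on ranks rather than raw scores avoids having to calibrate scores that come from very different similarity measures, at the cost of discarding score magnitudes.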

Details

For this project, we propose to develop tools for collecting and aggregating monolingual cues for bilingual lexicon induction. In particular:

1. Implement a set of data structures for efficient collection of features / cues from large monolingual corpora. We can build on the existing Java code developed for [1].

2. Implement procedures for collecting temporal, contextual, and category information for words in L1 and L2; similarity measures defined over this information will then be used to rank translation candidates in L2 for words in L1. Each of these procedures can be implemented independently of the others.

3. Finally, we would like to implement and experiment with strategies for combining these cues. In addition to simple heuristics, we would like to try more sophisticated methods for combining ranked candidates (e.g. [4]).
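To make steps 1 and 2 concrete for one cue, the sketch below collects temporal histograms from time-stamped documents and ranks L2 candidates for an L1 word. It is a minimal illustration under stated assumptions: the in-memory dictionary of counters, the function names, and the toy English/German documents are hypothetical and do not reflect the existing Java code, and cosine similarity is just one possible measure.

```python
import math
from collections import Counter, defaultdict

def collect_histograms(documents):
    """Build word -> Counter(date -> count) from (date, tokens) pairs."""
    histograms = defaultdict(Counter)
    for date, tokens in documents:
        for tok in tokens:
            histograms[tok][date] += 1
    return histograms

def cosine(u, v):
    """Cosine similarity; scale-invariant, so corpus size differences do not matter."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def rank_candidates(source_word, l1_hists, l2_hists, top_k=5):
    """Rank L2 words by how closely their temporal histogram matches the L1 word's."""
    source = l1_hists.get(source_word, Counter())
    scored = [(cand, cosine(source, hist)) for cand, hist in l2_hists.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_k]

# Toy time-stamped news streams (assumptions for illustration).
l1_docs = [
    ("2010-01-01", ["election", "vote"]),
    ("2010-01-02", ["election"]),
    ("2010-01-03", ["storm"]),
]
l2_docs = [
    ("2010-01-01", ["wahl", "stimme"]),
    ("2010-01-02", ["wahl", "wahl"]),
    ("2010-01-03", ["sturm"]),
]
candidates = rank_candidates("election", collect_histograms(l1_docs), collect_histograms(l2_docs))
```

Here "wahl" ranks first for "election" because both words occur on the same two days. The contextual and category cues could follow the same pattern with a different key in place of the date.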

For the experiments in this project, we have collected seed bilingual dictionaries, plain-text Wikipedia dumps, and monolingual texts fetched from news sites on the web for a number of less commonly used languages.

References

[1] Alexandre Klementiev and Dan Roth. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), 2006.

[2] Philipp Koehn and Kevin Knight. Learning a translation lexicon from monolingual corpora. In ACL Workshop on Unsupervised Lexical Acquisition, 2002.

[3] Reinhard Rapp. Identifying word translations in non-parallel texts. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 320-322, 1995.

[4] Charles Schafer and David Yarowsky. Inducing translation lexicons via diverse similarity measures and bridge languages. In Proc. of the Annual Conference on Computational Natural Language Learning (CoNLL), pages 146-152, 2002.

Code Repository

Code is available at http://github.com/aklement/babel. It is being updated to include the work done at the workshop.

Page last modified on February 08, 2010, at 03:55 PM