Training Data for Transliteration

Since ready-made collections of transliteration examples typically do not exist, significant effort has gone into collecting such data.

Transliteration Training Data is the main subject of 43 publications. 18 are discussed here.


Training data may be collected from parallel corpora (Lee and Chang, 2003; Lee et al., 2004), or by mining comparable data such as news streams (Klementiev and Roth, 2006; Klementiev and Roth, 2006b). Training data for transliteration may also be obtained from monolingual text in which the spelling of a foreign name is followed by its native form in parentheses (Lin et al., 2004; Chen and Chen, 2006; Lin et al., 2008), which is common, for instance, for unusual English names in Chinese text. Such acquisition may be improved by bootstrapping, that is, by iteratively extracting high-confidence pairs and improving the matching model (Sherif and Kondrak, 2007). Sproat et al. (2006) fish for name transliterations in comparable corpora, also using phonetic correspondences. Tao et al. (2006) additionally exploit the temporal distributions of name mentions, and Yoon et al. (2007) use a Winnow algorithm and a classifier to bootstrap the acquisition process. Cao et al. (2007) use various features in a perceptron classifier, including how likely a Chinese character is, a priori, to be part of a transliteration. Large monolingual corpus resources such as the web are used for validation (Al-Onaizan and Knight, 2002; Al-Onaizan and Knight, 2002b; Qu and Grefenstette, 2004; Kuo et al., 2006; Yang et al., 2008). Of course, training data may also be created manually, possibly aided by an active learning component that suggests the most valuable new examples (Goldwasser and Roth, 2008).
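To make the parenthetical acquisition idea concrete, here is a minimal Python sketch of candidate extraction from Chinese text. It is not the method of any of the cited papers: the regular expression, the function name `extract_candidate_pairs`, the length limits, and the sample sentence are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern (an assumption, not taken from the cited work):
# a run of 2-8 CJK ideographs (plus the middle dot used inside names),
# followed by a Latin-script name in ASCII or full-width parentheses.
PAIR_PATTERN = re.compile(
    r"([\u4e00-\u9fff·]{2,8})\s*[\(（]\s*([A-Za-z][A-Za-z .'\-]*[A-Za-z])\s*[\)）]"
)


def extract_candidate_pairs(text):
    """Return (chinese, latin) candidate pairs found in `text`.

    The left boundary of the Chinese name is unknown, so the Chinese side
    may include preceding context words; the candidates are noisy.
    """
    return [(m.group(1), m.group(2)) for m in PAIR_PATTERN.finditer(text)]


# Illustrative sentence: "Yesterday, Clinton visited Beijing.
# The actor Schwarzenegger also attended the event."
sample = "昨天，克林顿(Clinton)访问了北京。演员施瓦辛格（Schwarzenegger）也出席了活动。"
pairs = extract_candidate_pairs(sample)

for chinese, latin in pairs:
    print(chinese, latin)
```

The second candidate also captures the preceding word 演员 ("actor"), because the pattern cannot tell where the name starts. This kind of noise is one reason a matching model, possibly refined by bootstrapping, is applied to such candidates.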




New Publications

  • You et al. (2013)
  • Kunchukuttan and Bhattacharyya (2015)
  • Richardson et al. (2013)
  • Chen et al. (2013)
  • El-Kahki et al. (2012)
  • Munro and Manning (2012)
  • Sajjad et al. (2012)
  • Aransa et al. (2012)
  • Chang et al. (2009)
  • Yang et al. (2009)
  • You et al. (2010)
  • Ji (2009)
  • Chen et al. (2010)
  • Udupa et al. (2009)
  • Kumaran et al. (2010)
  • Kumaran et al. (2010)
  • Li et al. (2010)
  • Li et al. (2010)
  • Sajjad et al. (2011)
  • Kahki et al. (2011)
  • Freeman et al. (2006)
  • Wu and Chang (2007)
  • Oh and Isahara (2008)
  • Jin et al. (2008)
  • Kuo et al. (2008)