Translating Tense, Case, and Markers
Syntactic properties such as tense or case that add information about content words or define relationships between them may be encoded with morphological inflection or function words. Languages differ in this respect, thus posing a special challenge for translation between languages with different encoding schemes.
Translating tense, case, and markers is the main subject of 32 publications, 14 of which are discussed here.
One analysis of tense translation across languages warns against a simplistic view of the problem. Murata et al. (2001) propose a machine learning method using support vector machines to predict target language tense. Ye et al. (2006) propose to use additional features in a conditional random field classifier to determine verb tenses when translating from Chinese to English. Ueffing and Ney (2003) use a pre-processing method to transform the English verb complex to match its Spanish translation more closely. A similar problem is the prediction of case markers in Japanese (Suzuki and Toutanova, 2006), which may be done using a maximum entropy model as part of a treelet translation system (Toutanova and Suzuki, 2007), or the prediction of aspect markers in Chinese, which may be framed as a classification problem and modeled with conditional random fields (Ye et al., 2007).
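To make the classification framing concrete, the following is a minimal sketch that predicts a target-language tense label for a source verb from its context. It is not the method of any cited paper: a toy multiclass perceptron stands in for the support vector machine, maximum entropy, and conditional random field models, and the feature set, the romanized Chinese examples, and the names `TIME_WORDS`, `extract_features`, `train`, and `predict` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical list of source-side time expressions (romanized Chinese).
TIME_WORDS = {"zuotian", "mingtian", "meitian"}  # yesterday, tomorrow, every day
LABELS = ["past", "present", "future"]


def extract_features(tokens, verb_index):
    """Assumed feature set: the verb, its neighbours, and time words in the sentence."""
    feats = {f"verb={tokens[verb_index]}"}
    if verb_index > 0:
        feats.add(f"prev={tokens[verb_index - 1]}")
    if verb_index + 1 < len(tokens):
        feats.add(f"next={tokens[verb_index + 1]}")
    feats.update(f"has={t}" for t in tokens if t in TIME_WORDS)
    return feats


def predict(weights, feats):
    """Return the label with the highest summed feature weight."""
    return max(LABELS, key=lambda y: sum(weights[(y, f)] for f in feats))


def train(examples, epochs=10):
    """Multiclass perceptron: on a mistake, reward the gold label and penalize the guess."""
    weights = defaultdict(float)  # (label, feature) -> weight
    for _ in range(epochs):
        for feats, gold in examples:
            guess = predict(weights, feats)
            if guess != gold:
                for f in feats:
                    weights[(gold, f)] += 1.0
                    weights[(guess, f)] -= 1.0
    return weights


# Toy training data: (source tokens, verb position, target tense).
TRAINING = [
    (["wo", "zuotian", "qu", "xuexiao"], 2, "past"),
    (["wo", "mingtian", "qu", "xuexiao"], 2, "future"),
    (["wo", "meitian", "qu", "xuexiao"], 2, "present"),
]
weights = train([(extract_features(t, i), y) for t, i, y in TRAINING])

# An unseen verb with a known time word.
print(predict(weights, extract_features(["ta", "zuotian", "mai", "shu"], 2)))
```

In this toy setting the time word carries the signal, so an unseen verb next to "zuotian" is still labeled as past. A real system would need far richer features and a sequence model to capture dependencies between neighbouring verbs.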
When noun compounds such as "finance minister" have to be translated into a language that explicitly marks the relationship between nouns, Paul et al. (2010) propose to first paraphrase the compound into constructions where this relationship is marked, here: "minister of finance", using a rich source side language model, before translating it into an under-resourced target language.
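The paraphrasing step can be sketched as candidate generation followed by language model scoring. The example below is an illustration under strong assumptions, not the system of Paul et al. (2010): the preposition list, the hand-written `BIGRAM_COUNTS` table (standing in for a rich source-side language model), and the function names are all hypothetical.

```python
import math

# Assumed set of prepositions used to make the noun-noun relation explicit.
PREPOSITIONS = ("of", "for", "in")

# Toy bigram counts standing in for a rich source-side language model.
BIGRAM_COUNTS = {
    ("minister", "of"): 50,
    ("of", "finance"): 40,
    ("minister", "for"): 5,
    ("for", "finance"): 2,
    ("minister", "in"): 1,
}


def candidates(modifier, head):
    """Generate 'head PREP modifier' paraphrases of a two-word compound."""
    return [f"{head} {prep} {modifier}" for prep in PREPOSITIONS]


def score(phrase):
    """Sum of add-one smoothed log bigram counts (a crude fluency score)."""
    words = phrase.split()
    return sum(math.log(BIGRAM_COUNTS.get(pair, 0) + 1)
               for pair in zip(words, words[1:]))


def paraphrase(compound):
    """Pick the highest-scoring explicit paraphrase of 'modifier head'."""
    modifier, head = compound.split()
    return max(candidates(modifier, head), key=score)


print(paraphrase("finance minister"))  # minister of finance
```

The paraphrased source would then be passed to the translation system in place of the original compound.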
Linguistic analysis of some languages suggests the existence of empty categories. The most commonly known is pro-drop, the omission of pronouns if they are implied by the wider document context, or the verb inflection. Chung and Gildea (2010) use a syntactic parse of Chinese as the source language to detect empty categories and add special tokens in the training and test sets. Xiang et al. (2013) extend this work and also apply it to Korean, by using sparse features in the decoder to further improve translation performance.
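The token-insertion step of such a pre-processing approach can be sketched as follows. The detection itself (by a parser or classifier) is assumed to have already produced the positions; the marker string `*pro*`, the function name, and the romanized example are assumptions for illustration rather than details taken from the cited papers.

```python
def insert_empty_categories(tokens, empty_positions, marker="*pro*"):
    """Return a copy of tokens with the marker inserted before each given index.

    An index equal to len(tokens) places the marker at the end of the sentence.
    """
    positions = set(empty_positions)
    out = []
    for i, tok in enumerate(tokens):
        if i in positions:
            out.append(marker)
        out.append(tok)
    if len(tokens) in positions:
        out.append(marker)
    return out


# "xihuan pingguo" ("like apples") with a dropped subject before the verb.
print(insert_empty_categories(["xihuan", "pingguo"], [0]))
# ['*pro*', 'xihuan', 'pingguo']
```

Applying the same function to both training and test data gives the word aligner and decoder an explicit source token that can align to the target-language pronoun.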
Another example of syntactic markers that are more common in Asian than European languages is numeral classifiers (as in "three sheets of paper"). Paul et al. (2002) present a corpus-based method to generate them for Japanese and Zhang et al. (2008) present a method for generating Chinese measure words.
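As a rough illustration of what corpus-based classifier generation involves, the sketch below picks, for each noun, the classifier it was most often seen with, and falls back to a generic one otherwise. This is a simple frequency baseline, not the method of Paul et al. (2002) or Zhang et al. (2008); the romanized Chinese observations, the default classifier "ge", and the function names are assumptions.

```python
from collections import Counter, defaultdict


def learn_classifiers(observations):
    """Count how often each classifier co-occurs with each noun."""
    counts = defaultdict(Counter)
    for noun, classifier in observations:
        counts[noun][classifier] += 1
    return counts


def choose_classifier(counts, noun, default="ge"):
    """Return the most frequent classifier for a noun, or a generic fallback."""
    if noun not in counts:
        return default
    return counts[noun].most_common(1)[0][0]


# Toy (noun, classifier) pairs as they might be extracted from a target corpus.
observations = [
    ("zhi", "zhang"),  # paper: "san zhang zhi" (three sheets of paper)
    ("zhi", "zhang"),
    ("zhi", "ge"),
    ("shu", "ben"),    # book
]
counts = learn_classifiers(observations)

print(choose_classifier(counts, "zhi"))  # zhang
print(choose_classifier(counts, "mao"))  # ge (unseen noun, generic fallback)
```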
Overviews exist of the linguistic differences between languages, called divergences (Gupta and Chatterjee, 2003).
- Weller et al. (2015)
- Xu et al. (2015)
- Steele (2015)
- Weller et al. (2014)
- Baran and Xue (2011)
- Cui et al. (2011)
- Ma et al. (2011)
- Gong et al. (2012)
- Chong et al. (2012)
- Gong et al. (2012)
- Shilon et al. (2012)
- Jayan et al. (2012)
- Chang et al. (2009)
- Setiawan et al. (2009)
- Ramanathan et al. (2009)
- Khemakhem et al. (2010)
- Meyer (2011)
- Naskar and Bandyopadhyay (2006)