While the words in English compounds such as machine translation remain separate, other languages merge them into a single new word, a highly productive process that leads to large vocabularies.
Compounds is the main subject of 22 publications. 18 are discussed here.
Translating from compounding languages like German requires compound splitting methods (Brown, 2002). A frequency-based method, supported by linguistic clues, is introduced by Koehn and Knight (2003). Stymne (2008) refines this method, for instance by addressing more of the morphological changes that occur due to compounding. Macherey et al. (2011) learn the required morphological changes. Compound splitting can also be provided by morphological analysers (Nießen and Ney, 2000; Holmqvist et al., 2007). Fritzinger and Fraser (2010) combine linguistic analysis with corpus-driven statistics. Weller et al. (2014) also consider the semantic similarity (using distributional models) between the compound and its potential parts to guide splitting decisions.
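The frequency-based idea can be sketched as follows. This is a minimal illustration, not the exact procedure of any cited paper: it scores each candidate segmentation by the geometric mean of the corpus frequencies of its parts, which is one common formulation. The frequency table, the linking elements in `FILLERS`, and the minimum part length are assumptions chosen for illustration.

```python
from math import prod

# Toy corpus frequencies; the counts are invented for illustration.
FREQ = {
    "aktionsplan": 2,
    "aktion": 50,
    "plan": 80,
    "akt": 10,
    "ion": 1,
}

FILLERS = ("", "s", "es")  # assumed linking elements dropped at a split point
MIN_PART_LEN = 3           # assumed minimum length of a compound part


def candidate_splits(word, freq):
    """Return all segmentations of `word` into parts found in `freq`."""
    results = []
    if word in freq:
        results.append([word])  # the unsplit word is itself a candidate
    for i in range(MIN_PART_LEN, len(word) - MIN_PART_LEN + 1):
        head, rest = word[:i], word[i:]
        for filler in FILLERS:
            if not head.endswith(filler):
                continue
            stem = head[: len(head) - len(filler)]
            if len(stem) < MIN_PART_LEN or stem not in freq:
                continue
            # Recursively segment the remainder of the word.
            for tail in candidate_splits(rest, freq):
                results.append([stem] + tail)
    return results


def score(parts, freq):
    """Geometric mean of the part frequencies."""
    return prod(freq[p] for p in parts) ** (1.0 / len(parts))


def best_split(word, freq):
    """Pick the highest-scoring segmentation; fall back to the word itself."""
    candidates = candidate_splits(word, freq)
    if not candidates:
        return [word]
    return max(candidates, key=lambda parts: score(parts, freq))


print(best_split("aktionsplan", FREQ))  # ['aktion', 'plan']
```

With these toy counts, the two-part split wins because both parts are frequent, while the three-part split is penalised by the rare part "ion".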
Since there are multiple ways to split potential compounds, Dyer (2009) provides multiple splits to the decoder in an input lattice. Wuebker and Ney (2012) also consider multiple splits during phrase model training.
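To make the lattice idea concrete, the sketch below packs alternative segmentations of each input word into a single graph. The edge-list representation and the function names are assumptions made for illustration; this is not the input format of any particular decoder, and real lattices would typically also carry scores on the edges.

```python
def build_lattice(alternatives):
    """Build a lattice from per-word alternative segmentations.

    `alternatives` holds, for each input word, a list of segmentations,
    each segmentation being a list of tokens.
    Returns (edges, final_node), where edges are (from_node, to_node, token)
    and node ids increase along every path.
    """
    edges = []
    start = 0
    next_node = 1
    for segmentations in alternatives:
        # Reserve ids for the inner nodes of multi-part segmentations,
        # so the shared end node gets the highest id for this word.
        inner = sum(len(parts) - 1 for parts in segmentations)
        end = next_node + inner
        for parts in segmentations:
            prev = start
            for token in parts[:-1]:
                edges.append((prev, next_node, token))
                prev = next_node
                next_node += 1
            edges.append((prev, end, parts[-1]))
        start = end
        next_node = end + 1
    return edges, start


def paths(edges, node, final):
    """Enumerate all token sequences from `node` to `final`."""
    if node == final:
        return [[]]
    found = []
    for src, dst, token in edges:
        if src == node:
            found += [[token] + rest for rest in paths(edges, dst, final)]
    return found


sentence = [
    [["der"]],                                # no alternative split
    [["aktionsplan"], ["aktion", "plan"]],    # unsplit and split readings
]
edges, final = build_lattice(sentence)
for path in paths(edges, 0, final):
    print(" ".join(path))
```

Each path through the lattice is one possible reading of the input, so the choice between the split and unsplit form can be left to the decoder.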
When translating into compounding languages, compounds have to be generated. Stymne et al. (2013) provide an extensive overview. Popovic et al. (2006) split compounds during training and merge them in post-processing. Stymne et al. (2008) also allow the creation of novel words by compounding. Stymne (2009) compares various methods to mark split points and considers the part of speech of split words. Stymne and Cancedda (2011) extend this approach with a Conditional Random Field (CRF) classifier that detects merge points. Fraser et al. (2012) integrate this work as a post-processing step into a machine translation system. Armed with both a corpus-based approach and a morphological analyzer to split words, Cap et al. (2014) build a CRF classifier for merge points that also includes features about the source language, such as whether the two words are part of the same base noun phrase.
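The simplest form of the split-then-merge pipeline can be sketched with a marker symbol on non-final compound parts. The marker `#` and the function names are hypothetical, and this symbol-based merge is only an illustration of the general idea; it is not the CRF-based merge-point detection described above, and it does not restore linking elements.

```python
MARK = "#"  # hypothetical marker appended to non-final compound parts


def mark_parts(parts):
    """Mark every part except the last, so merge points survive translation."""
    return [p + MARK for p in parts[:-1]] + [parts[-1]]


def merge_marked(tokens):
    """Post-process output tokens: join each marked part with what follows."""
    merged = []
    buffer = ""
    for token in tokens:
        if token.endswith(MARK):
            buffer += token[: -len(MARK)]
        else:
            merged.append(buffer + token)
            buffer = ""
    if buffer:
        # A marked part at the end of the sentence has nothing to attach to.
        merged.append(buffer)
    return merged


# Training-side preprocessing of a split compound.
print(mark_parts(["hand", "schuh"]))

# Post-processing of system output containing marked parts.
print(merge_marked(["der", "hand#", "schuh"]))
```

Because merging only depends on the markers, the system can also emit part combinations never seen as a whole word in training, which is how novel compounds can arise.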
Botha et al. (2012) develop a hierarchical Pitman-Yor language model to better handle compounds.
- Cap et al. (2015)
- Matthews et al. (2016)
- Junczys-Dowmunt and Pouliquen (2014)
- Pu et al. (2015)