Compounds
While the words in English compounds such as machine translation remain separate, other languages, such as German, merge them into a single new word, a highly productive process that leads to large vocabulary sizes.
Compounds is the main subject of 22 publications. 18 are discussed here.
Publications
Translating from compounding languages like German requires compound splitting methods
Ralf D. Brown (2002):
Corpus-Driven Splitting of Compound Words, Proceedings of the Ninth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI)
@InProceedings{Brown:2002,
author = {Ralf D. Brown},
title = {Corpus-Driven Splitting of Compound Words},
url = {http://www.eamt.org/events/tmi2002/conference/02\_brown.pdf},
googlescholar = {2171527846185286418},
booktitle = {Proceedings of the Ninth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI)},
year = 2002
}
(Brown, 2002). A frequency-based method, supported by linguistic clues, is introduced by
Philipp Koehn and Kevin Knight (2003):
Empirical Methods for Compound Splitting, Proceedings of Meeting of the European Chapter of the Association of Computational Linguistics (EACL)
@InProceedings{Koehn:2003c,
author = {Philipp Koehn and Kevin Knight},
title = {Empirical Methods for Compound Splitting},
booktitle = {Proceedings of Meeting of the European Chapter of the Association of Computational Linguistics (EACL)},
url = {http://acl.ldc.upenn.edu/E/E03/E03-1076.pdf},
year = 2003
}
Koehn and Knight (2003). This method is refined by
Stymne (2008), for instance by addressing more of the morphological changes that occur due to compounding.
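As a rough illustration of frequency-based splitting, the sketch below enumerates candidate segmentations of a word and keeps the one whose parts have the highest geometric mean of corpus frequencies, which is one common way to score such splits. The function names, the `min_len` threshold, and the toy counts are assumptions made for illustration; filler letters and other morphological changes are deliberately not handled.

```python
from math import prod


def candidate_splits(word, min_len=3):
    """Yield every segmentation of `word` into parts of at least `min_len` characters."""
    yield [word]
    for i in range(min_len, len(word) - min_len + 1):
        head = word[:i]
        for rest in candidate_splits(word[i:], min_len):
            yield [head] + rest


def geometric_mean(values):
    """Geometric mean of a non-empty list of non-negative numbers."""
    return prod(values) ** (1.0 / len(values))


def best_split(word, freq, min_len=3):
    """Return the segmentation with the highest geometric-mean part frequency.

    `freq` maps known words to corpus counts (hypothetical toy data here).
    A segmentation containing an unknown part scores 0 and is never preferred.
    """
    best, best_score = [word], freq.get(word, 0)
    for parts in candidate_splits(word, min_len):
        score = geometric_mean([freq.get(p, 0) for p in parts])
        if score > best_score:
            best, best_score = parts, score
    return best


# Toy counts, invented for this example.
toy_freq = {"haus": 100, "boot": 40, "hausboot": 2,
            "freitag": 500, "frei": 100, "tag": 200}
print(best_split("hausboot", toy_freq))  # ['haus', 'boot']
print(best_split("freitag", toy_freq))   # ['freitag']
```

With these toy counts, haus + boot scores about 63 and beats the unsplit word (count 2), while freitag is left intact because the word itself is more frequent than the geometric mean of frei and tag.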
Macherey, Klaus and Dai, Andrew and Talbot, David and Popat, Ashok and Och, Franz (2011):
Language-independent compound splitting with morphological operations, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
@InProceedings{macherey-EtAl:2011:ACL-HLT2011,
author = {Macherey, Klaus and Dai, Andrew and Talbot, David and Popat, Ashok and Och, Franz},
title = {Language-independent compound splitting with morphological operations},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {1395--1404},
url = {http://www.aclweb.org/anthology/P11-1140},
year = 2011
}
Macherey et al. (2011) learn the required morphological changes. Compound splitting can also be provided by morphological analysers
Nießen, Sonja and Ney, Hermann (2000):
Improving SMT quality with morpho-syntactic analysis, Proceedings of the 18th conference on Computational linguistics-Volume 2
@inproceedings{niessen2000improving,
author = {Nie{\ss}en, Sonja and Ney, Hermann},
title = {Improving {SMT} quality with morpho-syntactic analysis},
booktitle = {Proceedings of the 18th conference on Computational linguistics-Volume 2},
pages = {1081--1085},
url = {http://www.aclweb.org/anthology/C00-2162},
organization = {Association for Computational Linguistics},
year = 2000
}
(Nießen and Ney, 2000;
Holmqvist, Maria and Stymne, Sara and Ahrenberg, Lars (2007):
Getting to Know Moses: Initial Experiments on German-English Factored Translation, Proceedings of the Second Workshop on Statistical Machine Translation
@InProceedings{holmqvist-stymne-ahrenberg:2007:WMT,
author = {Holmqvist, Maria and Stymne, Sara and Ahrenberg, Lars},
title = {Getting to Know Moses: Initial Experiments on {German-English} Factored Translation},
booktitle = {Proceedings of the Second Workshop on Statistical Machine Translation},
month = {June},
address = {Prague, Czech Republic},
publisher = {Association for Computational Linguistics},
pages = {181--184},
url = {http://www.aclweb.org/anthology/W/W07/W07-0723},
year = 2007
}
Holmqvist et al., 2007).
Fritzinger, Fabienne and Fraser, Alexander (2010):
How to Avoid Burning Ducks: Combining Linguistic Analysis and Corpus Statistics for German Compound Processing, Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
@InProceedings{fritzinger-fraser:2010:WMT,
author = {Fritzinger, Fabienne and Fraser, Alexander},
title = {How to Avoid Burning Ducks: Combining Linguistic Analysis and Corpus Statistics for {German} Compound Processing},
booktitle = {Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR},
month = {July},
address = {Uppsala, Sweden},
publisher = {Association for Computational Linguistics},
pages = {230--240},
url = {http://www.aclweb.org/anthology/W10-1734},
year = 2010
}
Fritzinger and Fraser (2010) combine linguistic analysis with corpus-driven statistics.
Weller, Marion and Cap, Fabienne and Müller, Stefan and Schulte im Walde, Sabine and Fraser, Alexander (2014):
Distinguishing Degrees of Compositionality in Compound Splitting for Statistical Machine Translation, Proceedings of the First Workshop on Computational Approaches to Compound Analysis (ComAComA 2014)
@InProceedings{weller-EtAl:2014:ComAComA,
author = {Weller, Marion and Cap, Fabienne and M\"{u}ller, Stefan and Schulte im Walde, Sabine and Fraser, Alexander},
title = {Distinguishing Degrees of Compositionality in Compound Splitting for Statistical Machine Translation},
booktitle = {Proceedings of the First Workshop on Computational Approaches to Compound Analysis (ComAComA 2014)},
month = {August},
address = {Dublin, Ireland},
publisher = {Association for Computational Linguistics and Dublin City University},
pages = {81--90},
url = {http://www.aclweb.org/anthology/W14-5709},
year = 2014
}
Weller et al. (2014) also consider the semantic similarity (using distributional models) between the compound and its potential parts to guide splitting decisions.
Since there are multiple ways to split potential compounds,
Dyer, Chris (2009):
Using a maximum entropy model to build segmentation lattices for MT, Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
@InProceedings{dyer:2009:NAACLHLT09,
author = {Dyer, Chris},
title = {Using a maximum entropy model to build segmentation lattices for MT},
booktitle = {Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics},
month = {June},
address = {Boulder, Colorado},
publisher = {Association for Computational Linguistics},
pages = {406--414},
url = {http://www.aclweb.org/anthology/N/N09/N09-1046},
year = 2009
}
Dyer (2009) provides multiple splits to the decoder in an input lattice.
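To make the idea of an input lattice concrete, the following minimal sketch encodes alternative segmentations of each word as parallel paths between shared nodes, so a decoder could choose among them. The edge-list representation and the function name `build_lattice` are assumptions for illustration and do not reflect the format of any particular decoder or the model used to propose the splits.

```python
def build_lattice(sentence_alternatives):
    """Build a word lattice as a list of (from_node, to_node, token) edges.

    `sentence_alternatives` holds one entry per input word; each entry is a
    list of alternative segmentations, and each segmentation is a list of
    tokens. All alternatives for a word start and end at the same two nodes.
    """
    edges = []
    start = 0      # node at which the current word begins
    next_free = 1  # smallest node id not yet used
    for segmentations in sentence_alternatives:
        # Each path with k tokens needs k - 1 private intermediate nodes.
        n_inner = sum(len(seg) - 1 for seg in segmentations)
        end = next_free + n_inner
        inner = next_free
        for seg in segmentations:
            prev = start
            for i, token in enumerate(seg):
                if i == len(seg) - 1:
                    nxt = end          # last token closes the path
                else:
                    nxt = inner        # allocate a fresh intermediate node
                    inner += 1
                edges.append((prev, nxt, token))
                prev = nxt
        start = end
        next_free = end + 1
    return edges


# "das hausboot" with the compound offered both unsplit and split.
lattice = build_lattice([
    [["das"]],
    [["hausboot"], ["haus", "boot"]],
])
for edge in lattice:
    print(edge)
```

Here the compound contributes one direct edge for the unsplit word and a two-edge path for the split, both spanning the same pair of nodes.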
Wuebker, Joern and Ney, Hermann (2012):
Phrase Model Training for Statistical Machine Translation with Word Lattices of Preprocessing Alternatives, Proceedings of the Seventh Workshop on Statistical Machine Translation
@InProceedings{wuebker-ney:2012:WMT,
author = {Wuebker, Joern and Ney, Hermann},
title = {Phrase Model Training for Statistical Machine Translation with Word Lattices of Preprocessing Alternatives},
booktitle = {Proceedings of the Seventh Workshop on Statistical Machine Translation},
month = {June},
address = {Montreal, Canada},
publisher = {Association for Computational Linguistics},
pages = {447--456},
url = {http://www.aclweb.org/anthology/W12-3157},
year = 2012
}
Wuebker and Ney (2012) also consider multiple splits during phrase model training.
When translating into compounding languages, compounds have to be generated.
Sara Stymne and Nicola Cancedda and Lars Ahrenberg (2013):
Generation of Compound Words in Statistical Machine Translation into Compounding Languages, Computational Linguistics
@Article{CL:2013-4009,
author = {Sara Stymne and Nicola Cancedda and Lars Ahrenberg},
title = {Generation of Compound Words in Statistical Machine Translation into Compounding Languages},
journal = {Computational Linguistics},
volume = {39},
number = {4},
url = {http://aclweb.org/anthology-new/J/J13/J13-4009.pdf},
year = 2013
}
Stymne et al. (2013) provide an extensive overview.
Maja Popovic and Daniel Stein and Hermann Ney (2006):
Statistical Machine Translation of German Compound Words, Advances in Natural Language Processing, 5th International Conference on NLP, FinTAL 2006, Turku, Finland, August 23-25, 2006, Proceedings
@inproceedings{fintal:2006:PopovicSN06,
author = {Maja Popovic and Daniel Stein and Hermann Ney},
title = {Statistical Machine Translation of {German} Compound Words},
booktitle = {Advances in Natural Language Processing, 5th International Conference on NLP, FinTAL 2006, Turku, Finland, August 23-25, 2006, Proceedings},
pages = {616--624},
crossref = {DBLP:conf/fintal/2006},
url = {http://www-i6.informatik.rwth-aachen.de/publications/download/365/Popovic--2006.pdf},
doi = {10.1007/11816508\_61},
timestamp = {Mon, 21 Aug 2006 09:10:08 +0200},
biburl = {http://dblp.uni-trier.de/rec/bib/conf/fintal/PopovicSN06},
bibsource = {dblp computer science bibliography, http://dblp.org},
year = 2006
}
Popovic et al. (2006) split compounds during training and merge them in post-processing.
Stymne, Sara and Holmqvist, Maria and Ahrenberg, Lars (2008):
Effects of Morphological Analysis in Translation between German and English, Proceedings of the Third Workshop on Statistical Machine Translation
@InProceedings{stymne-holmqvist-ahrenberg:2008:WMT,
author = {Stymne, Sara and Holmqvist, Maria and Ahrenberg, Lars},
title = {Effects of Morphological Analysis in Translation between {German} and {English}},
booktitle = {Proceedings of the Third Workshop on Statistical Machine Translation},
month = {June},
address = {Columbus, Ohio},
publisher = {Association for Computational Linguistics},
pages = {135--138},
url = {http://www.aclweb.org/anthology/W/W08/W08-0317},
year = 2008
}
Stymne et al. (2008) also allow the creation of novel words by compounding.
Stymne, Sara (2009):
A Comparison of Merging Strategies for Translation of German Compounds, Proceedings of the Student Research Workshop at EACL 2009
@InProceedings{stymne:2009:EACL-SRWS,
author = {Stymne, Sara},
title = {A Comparison of Merging Strategies for Translation of {G}erman Compounds},
booktitle = {Proceedings of the Student Research Workshop at EACL 2009},
month = {April},
address = {Athens, Greece},
publisher = {Association for Computational Linguistics},
pages = {61--69},
url = {http://www.aclweb.org/anthology/E09-3008},
year = 2009
}
Stymne (2009) compares various methods to mark split points, and considers the part of speech of split words.
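The split-then-merge pipeline described above can be sketched as follows: compound parts are marked when the training data is split, and the marks are used to rejoin parts in the translation output. The `#` suffix marker and the function names are hypothetical choices for this illustration, not the marking scheme of any of the cited systems, and the sketch ignores casing and filler letters.

```python
MARK = "#"  # hypothetical marker appended to every non-final compound part


def mark_parts(parts):
    """Mark all but the last part of a split compound."""
    return [p + MARK for p in parts[:-1]] + [parts[-1]]


def merge_marked(tokens):
    """Rejoin marked parts with the token that follows them."""
    merged = []
    pending = ""  # concatenation of marked parts seen so far
    for tok in tokens:
        if tok.endswith(MARK):
            pending += tok[:-len(MARK)]
        else:
            merged.append(pending + tok)
            pending = ""
    if pending:
        # A marked part at the end of the sentence has nothing to attach to.
        merged.append(pending)
    return merged


print(mark_parts(["haus", "boot"]))
print(merge_marked(["das", "haus#", "boot", "ist", "neu"]))
```

Because merging only looks at the marks, it can join parts that never co-occurred in training, which is how such a scheme can produce novel compounds; it can also produce wrong ones, which is why merge decisions benefit from additional evidence such as part of speech.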
Stymne, Sara and Cancedda, Nicola (2011):
Productive Generation of Compound Words in Statistical Machine Translation, Proceedings of the Sixth Workshop on Statistical Machine Translation
@InProceedings{stymne-cancedda:2011:WMT,
author = {Stymne, Sara and Cancedda, Nicola},
title = {Productive Generation of Compound Words in Statistical Machine Translation},
booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
month = {July},
address = {Edinburgh, Scotland},
publisher = {Association for Computational Linguistics},
pages = {250--260},
url = {http://www.aclweb.org/anthology/W11-2129},
year = 2011
}
Stymne and Cancedda (2011) extend this approach further with a Conditional Random Field (CRF) classifier that detects merge points. This work was integrated by
Fraser, Alexander and Weller, Marion and Cahill, Aoife and Cap, Fabienne (2012):
Modeling Inflection and Word-Formation in SMT, Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
@InProceedings{fraser-EtAl:2012:EACL2012,
author = {Fraser, Alexander and Weller, Marion and Cahill, Aoife and Cap, Fabienne},
title = {Modeling Inflection and Word-Formation in SMT},
booktitle = {Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics},
month = {April},
address = {Avignon, France},
publisher = {Association for Computational Linguistics},
pages = {664--674},
url = {http://www.aclweb.org/anthology/E12-1068},
year = 2012
}
Fraser et al. (2012) as a post-processing step into a machine translation system. Armed with both a corpus-based approach and a morphological analyzer to split words,
Cap, Fabienne and Fraser, Alexander and Weller, Marion and Cahill, Aoife (2014):
How to Produce Unseen Teddy Bears: Improved Morphological Processing of Compounds in SMT, Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics
@InProceedings{cap-EtAl:2014:EACL,
author = {Cap, Fabienne and Fraser, Alexander and Weller, Marion and Cahill, Aoife},
title = {How to Produce Unseen Teddy Bears: Improved Morphological Processing of Compounds in SMT},
booktitle = {Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics},
month = {April},
address = {Gothenburg, Sweden},
publisher = {Association for Computational Linguistics},
pages = {579--587},
url = {http://www.aclweb.org/anthology/E14-1061},
year = 2014
}
Cap et al. (2014) build a CRF classifier for merge points that also includes features about the source language, such as that the two words are part of the same base noun phrase.
Botha, Jan A. and Dyer, Chris and Blunsom, Phil (2012):
Bayesian Language Modelling of German Compounds, Proceedings of COLING 2012
@InProceedings{botha-dyer-blunsom:2012:PAPERS,
author = {Botha, Jan A. and Dyer, Chris and Blunsom, Phil},
title = {{B}ayesian Language Modelling of {G}erman Compounds},
booktitle = {Proceedings of COLING 2012},
month = {December},
address = {Mumbai, India},
publisher = {The COLING 2012 Organizing Committee},
pages = {341--356},
url = {http://www.aclweb.org/anthology/C12-1022},
year = 2012
}
Botha et al. (2012) develop a hierarchical Pitman-Yor language model to better handle compounds.
Benchmarks
Discussion
Related Topics
New Publications
Cap, Fabienne and Nirmal, Manju and Weller, Marion and Schulte im Walde, Sabine (2015):
How to Account for Idiomatic German Support Verb Constructions in Statistical Machine Translation, Proceedings of the 11th Workshop on Multiword Expressions
@InProceedings{cap-EtAl:2015:MWE,
author = {Cap, Fabienne and Nirmal, Manju and Weller, Marion and Schulte im Walde, Sabine},
title = {How to Account for Idiomatic {German} Support Verb Constructions in Statistical Machine Translation},
booktitle = {Proceedings of the 11th Workshop on Multiword Expressions},
month = {June},
address = {Denver, Colorado},
publisher = {Association for Computational Linguistics},
pages = {19--28},
url = {http://www.aclweb.org/anthology/W15-0903},
year = 2015
}
Cap et al. (2015)
Matthews, Austin and Schlinger, Eva and Lavie, Alon and Dyer, Chris (2016):
Synthesizing Compound Words for Machine Translation, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{matthews-EtAl:2016:P16-1,
author = {Matthews, Austin and Schlinger, Eva and Lavie, Alon and Dyer, Chris},
title = {Synthesizing Compound Words for Machine Translation},
booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {1085--1094},
url = {http://www.aclweb.org/anthology/P16-1103},
year = 2016
}
Matthews et al. (2016)
Marcin Junczys-Dowmunt and Bruno Pouliquen (2014):
SMT of German patents at WIPO: decompounding and verb structure pre-reordering, Proceedings of 17th Annual conference of the European Association for Machine Translation
@inproceedings{eamt-2014-Junczyns-Dowmunt,
author = {Marcin Junczys-Dowmunt and Bruno Pouliquen},
title = {SMT of {German} patents at WIPO: decompounding and verb structure pre-reordering},
booktitle = {Proceedings of 17th Annual conference of the European Association for Machine Translation},
pages = {217--220},
url = {http://www.mt-archive.info/10/EAMT-2014-Junczyns-Dowmunt.pdf},
location = {Dubrovnik, Croatia},
year = 2014
}
Junczys-Dowmunt and Pouliquen (2014)
Pu, Xiao and Mascarell, Laura and Popescu-Belis, Andrei and Fishel, Mark and Luong, Ngoc-Quang and Volk, Martin (2015):
Leveraging Compounds to Improve Noun Phrase Translation from Chinese and German, Proceedings of the ACL-IJCNLP 2015 Student Research Workshop
@InProceedings{pu-EtAl:2015:SRW,
author = {Pu, Xiao and Mascarell, Laura and Popescu-Belis, Andrei and Fishel, Mark and Luong, Ngoc-Quang and Volk, Martin},
title = {Leveraging Compounds to Improve Noun Phrase Translation from {Chinese} and {German}},
booktitle = {Proceedings of the ACL-IJCNLP 2015 Student Research Workshop},
month = {July},
address = {Beijing, China},
publisher = {Association for Computational Linguistics},
pages = {8--15},
url = {http://www.aclweb.org/anthology/P15-3002},
year = 2015
}
Pu et al. (2015)