Suffix Array Translation Models
Large translation models take a long time to train and often exceed the available working memory of current machines. Storing the word aligned parallel corpus in a suffix array and retrieving translation options on demand offer an alternative.
Suffix Arrays is the main subject of 11 publications. 7 are discussed here.
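The core idea above, indexing the corpus once and retrieving phrase occurrences only when needed, can be illustrated with a minimal sketch. It assumes a single tokenized corpus without sentence boundaries and uses a naive construction that is only suitable for toy data; the function names build_suffix_array and find_occurrences are hypothetical and not taken from any of the cited systems.

```python
from typing import List


def build_suffix_array(tokens: List[str]) -> List[int]:
    """Return all suffix start positions, sorted by the token sequence they begin."""
    # Naive construction that compares whole suffixes; real systems use
    # faster algorithms, but the resulting array is the same.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens: List[str], sa: List[int], phrase: List[str], upper: bool) -> int:
    """Binary search for the first (or one past the last) suffix starting with phrase."""
    n = len(phrase)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + n]
        if prefix < phrase or (upper and prefix == phrase):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_occurrences(tokens: List[str], sa: List[int], phrase: List[str]) -> List[int]:
    """Return the corpus positions at which phrase occurs, in corpus order."""
    start = _bound(tokens, sa, phrase, upper=False)
    end = _bound(tokens, sa, phrase, upper=True)
    return sorted(sa[start:end])


if __name__ == "__main__":
    corpus = "the cat sat on the mat and the cat slept".split()
    sa = build_suffix_array(corpus)
    print(find_occurrences(corpus, sa, ["the", "cat"]))  # prints [0, 7]
```

Because all suffixes that begin with the same phrase are adjacent in the sorted array, two binary searches are enough to find every occurrence of a phrase of any length, and nothing beyond the corpus and the array of positions has to be kept in memory.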
Publications
The translation table may be represented in a suffix array as proposed for a searchable translation memory
Chris Callison-Burch and Colin Bannard and Josh Schroeder (2005):
A compact data structure for searchable translation memories, Proceedings of the 10th Conference of the European Association for Machine Translation (EAMT)
@InProceedings{Callison-Burch:2005:EAMT,
author = {Chris Callison-Burch and Colin Bannard and Josh Schroeder},
title = {A compact data structure for searchable translation memories},
booktitle = {Proceedings of the 10th Conference of the European Association for Machine Translation (EAMT)},
month = {May},
address = {Budapest},
year = 2005
}
(Callison-Burch et al., 2005) and integrated into the decoder
Ying Zhang and Stephan Vogel (2005):
An efficient phrase-to-phrase alignment model for arbitrarily long phrase and large corpora, Proceedings of the 10th Conference of the European Association for Machine Translation (EAMT)
@InProceedings{Zhang:2005:EAMT,
author = {Ying Zhang and Stephan Vogel},
title = {An efficient phrase-to-phrase alignment model for arbitrarily long phrase and large corpora},
url = {http://www.researchgate.net/publication/228945734\_An\_Efficient\_Phrase-to-Phrase\_Alignment\_Model\_for\_Arbitrarily\_Long\_Phrase\_and\_Large\_Corpora/file/79e4150b1c4b70bd43.pdf},
googlescholar = {14065219488034748814},
booktitle = {Proceedings of the 10th Conference of the European Association for Machine Translation (EAMT)},
month = {May},
address = {Budapest},
year = 2005
}
(Zhang and Vogel, 2005).
Callison-Burch, Chris and Bannard, Colin and Schroeder, Josh (2005):
Scaling Phrase-Based Statistical Machine Translation to Larger Corpora and Longer Phrases, Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05)
@InProceedings{callisonburch-bannard-schroeder:2005:ACL,
author = {Callison-Burch, Chris and Bannard, Colin and Schroeder, Josh},
title = {Scaling Phrase-Based Statistical Machine Translation to Larger Corpora and Longer Phrases},
booktitle = {Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05)},
month = {June},
address = {Ann Arbor, Michigan},
publisher = {Association for Computational Linguistics},
pages = {255--262},
url = {http://www.aclweb.org/anthology/P/P05/P05-1032},
year = 2005
}
Callison-Burch et al. (2005) propose a suffix-array data structure to keep corpora in memory and extract phrase translations on the fly.
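A rough sketch of what extracting phrase translations on the fly can look like is given below. It is not the authors' implementation: it assumes a toy word-aligned corpus held as (source, target, alignment) triples, finds occurrences by a linear scan where a real system would use the suffix array, and scores translations by plain relative frequency. The names target_span and translations_on_demand are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Set, Tuple

# (source tokens, target tokens, set of (source index, target index) links)
SentencePair = Tuple[List[str], List[str], Set[Tuple[int, int]]]


def target_span(alignment: Set[Tuple[int, int]], i: int, j: int) -> Optional[Tuple[int, int]]:
    """Return the target span aligned to source[i:j], or None if it is not consistent."""
    linked = [t for (s, t) in alignment if i <= s < j]
    if not linked:
        return None
    lo, hi = min(linked), max(linked) + 1
    # Reject the span if a target word inside it is linked to a source word outside [i, j).
    if any(lo <= t < hi and not (i <= s < j) for (s, t) in alignment):
        return None
    return lo, hi


def translations_on_demand(corpus: List[SentencePair], phrase: List[str]) -> Dict[str, float]:
    """Collect translations of phrase at query time and score them by relative frequency."""
    counts: Counter = Counter()
    n = len(phrase)
    for src, tgt, alignment in corpus:
        # Linear scan as a stand-in for a suffix array lookup (assumption of this sketch).
        for i in range(len(src) - n + 1):
            if src[i:i + n] == phrase:
                span = target_span(alignment, i, i + n)
                if span is not None:
                    counts[" ".join(tgt[span[0]:span[1]])] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}
```

The point of the design is that no phrase table is built in advance: translation options and their scores are computed only for the phrases that actually occur in the input being translated.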
Suffix arrays may also be used to quickly learn phrase alignments from a parallel corpus without the use of a word alignment
Paul McNamee and James Mayfield (2006):
Translation of Multiword Expressions Using Parallel Suffix Arrays, 5th Conference of the Association for Machine Translation in the Americas (AMTA)
@InProceedings{McNamee:2006:AMTA,
author = {Paul Mc{N}amee and James Mayfield},
title = {Translation of Multiword Expressions Using Parallel Suffix Arrays},
booktitle = {5th Conference of the Association for Machine Translation in the Americas (AMTA)},
month = {August},
address = {Boston, Massachusetts},
year = 2006
}
(McNamee and Mayfield, 2006). Related to this is the idea of prefix-tree data structures for the translation table, which allow quicker access and make it possible to store the model on disk for on-demand retrieval of applicable translation options
Zens, Richard and Ney, Hermann (2007):
Efficient Phrase-Table Representation for Machine Translation with Applications to Online MT and Speech Translation, Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference
@InProceedings{zens-ney:2007:main,
author = {Zens, Richard and Ney, Hermann},
title = {Efficient Phrase-Table Representation for Machine Translation with Applications to Online {MT} and Speech Translation},
booktitle = {Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference},
month = {April},
address = {Rochester, New York},
publisher = {Association for Computational Linguistics},
pages = {492--499},
url = {http://www.aclweb.org/anthology/N/N07/N07-1062},
year = 2007
}
(Zens and Ney, 2007).
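The prefix-based lookup can be sketched as follows. This is an in-memory illustration only, under the assumption that phrase-table entries are stored in a tree keyed by source words; the on-disk storage and on-demand loading described above are not reproduced here, and the class and method names are hypothetical.

```python
from typing import Dict, List, Tuple


class PrefixTreeNode:
    def __init__(self) -> None:
        self.children: Dict[str, "PrefixTreeNode"] = {}
        self.translations: List[Tuple[str, float]] = []  # (target phrase, score)


class PrefixTreePhraseTable:
    def __init__(self) -> None:
        self.root = PrefixTreeNode()

    def add(self, source: List[str], target: str, score: float) -> None:
        """Insert one phrase pair, creating nodes along the source phrase as needed."""
        node = self.root
        for word in source:
            node = node.children.setdefault(word, PrefixTreeNode())
        node.translations.append((target, score))

    def lookup_from(self, sentence: List[str], start: int) -> List[Tuple[int, List[Tuple[str, float]]]]:
        """Return (end, translations) for every stored phrase that begins at start."""
        results = []
        node = self.root
        for end in range(start, len(sentence)):
            node = node.children.get(sentence[end])
            if node is None:
                break  # no stored phrase continues with this word
            if node.translations:
                results.append((end + 1, node.translations))
        return results
```

A single walk down the tree yields all translation options for phrases starting at a given position, and the walk stops as soon as the input leaves the set of known source phrases, which is what makes access quick.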
Hierarchical phrase-based models may also be stored in such a way
Lopez, Adam (2007):
Hierarchical Phrase-Based Translation with Suffix Arrays, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)
@InProceedings{lopez:2007:EMNLP-CoNLL2007,
author = {Lopez, Adam},
title = {Hierarchical Phrase-Based Translation with Suffix Arrays},
booktitle = {Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)},
pages = {976--985},
url = {http://www.aclweb.org/anthology/D/D07/D07-1104},
year = 2007
}
(Lopez, 2007) and allow for much bigger models
Lopez, Adam (2008):
Tera-Scale Translation Models via Pattern Matching, Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)
@InProceedings{lopez:2008:PAPERS,
author = {Lopez, Adam},
title = {Tera-Scale Translation Models via Pattern Matching},
booktitle = {Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)},
month = {August},
address = {Manchester, UK},
publisher = {Coling 2008 Organizing Committee},
pages = {505--512},
url = {http://www.aclweb.org/anthology/C08-1064},
year = 2008
}
(Lopez, 2008).
New Publications
Ulrich Germann (2015):
Sampling Phrase Tables for the Moses Statistical Machine Translation System, The Prague Bulletin of Mathematical Linguistics
@article{pbml-104-Germann,
author = {Ulrich Germann},
title = {Sampling Phrase Tables for the Moses Statistical Machine Translation System},
pages = {39--50},
journal = {The Prague Bulletin of Mathematical Linguistics},
url = {http://ufal.mff.cuni.cz/pbml/104/art-germann.pdf},
volume = {104},
month = {October},
year = 2015
}
Germann (2015)
Michael Denkowski and Alon Lavie and Isabel Lacruz and Chris Dyer (2014):
Real time adaptive machine translation: cdec and TransCenter, Proceedings of the Third workshop on post-editing technology and practice (WPTP-3)
@inproceedings{AMTA-2014-W2-Denkowski,
author = {Michael Denkowski and Alon Lavie and Isabel Lacruz and Chris Dyer},
title = {Real time adaptive machine translation: cdec and TransCenter},
pages = {123},
url = {http://www.mt-archive.info/10/AMTA-2014-W2-Denkowski.pdf},
booktitle = {Proceedings of the Third workshop on post-editing technology and practice (WPTP-3)},
location = {Vancouver, BC, Canada},
year = 2014
}
Denkowski et al. (2014)
Ulrich Germann (2014):
Dynamic phrase tables for machine translation in an interactive post-editing scenario, Proceedings of the Workshop on interactive and adaptive machine translation
@inproceedings{AMTA-2014-W1-Germann,
author = {Ulrich Germann},
title = {Dynamic phrase tables for machine translation in an interactive post-editing scenario},
pages = {20--31},
url = {http://www.mt-archive.info/10/AMTA-2014-W1-Germann.pdf},
booktitle = {Proceedings of the Workshop on interactive and adaptive machine translation},
location = {Vancouver, BC, Canada},
year = 2014
}
Germann (2014)
Cromieres, Fabien and Kurohashi, Sadao (2011):
Efficient retrieval of tree translation examples for Syntax-Based Machine Translation, Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing
@InProceedings{cromieres-kurohashi:2011:EMNLP,
author = {Cromieres, Fabien and Kurohashi, Sadao},
title = {Efficient retrieval of tree translation examples for Syntax-Based Machine Translation},
booktitle = {Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing},
month = {July},
address = {Edinburgh, Scotland, UK},
publisher = {Association for Computational Linguistics},
pages = {508--518},
url = {http://www.aclweb.org/anthology/D11-1047},
year = 2011
}
Cromieres and Kurohashi (2011)