Sparse Data
Building machine translation systems for under-resourced languages, or under sparse data conditions that arise for other reasons, is a special challenge and may require special methods.
Sparse Data is the main subject of 15 publications. 11 are discussed here.
Publications
Several reports show how statistical machine translation allows for rapid development with limited resources
Yaser Al-Onaizan and Ulrich Germann and Ulf Hermjakob and Kevin Knight and Philipp Koehn and Daniel Marcu and Kenji Yamada (2000):
Translating with Scarce Resources, Proceedings of Annual Meeting of the American Association of Artificial Intelligence (AAAI)
@InProceedings{Tetun,
author = {Yaser Al-Onaizan and Ulrich Germann and Ulf Hermjakob and Kevin Knight and Philipp Koehn and Daniel Marcu and Kenji Yamada},
title = {Translating with Scarce Resources},
url = {
http://www.researchgate.net/publication/221603843\_Translating\_with\_Scarce\_Resources/file/79e4150bfcdf265a1b.pdf},
googlescholar = {11406909270432185580},
booktitle = {Proceedings of Annual Meeting of the American Association of Artificial Intelligence (AAAI)},
year = 2000
}
(Al-Onaizan et al., 2000;
Yaser Al-Onaizan and Ulrich Germann and Ulf Hermjakob and Kevin Knight and Philipp Koehn and Daniel Marcu and Kenji Yamada (2002):
Translation with Scarce Bilingual Resources, Machine Translation
@article{MTJ:2002:Al-Onaizan,
author = {Yaser Al-Onaizan and Ulrich Germann and Ulf Hermjakob and Kevin Knight and Philipp Koehn and Daniel Marcu and Kenji Yamada},
title = {Translation with Scarce Bilingual Resources},
url = {
http://www.isi.edu/~marcu/papers/mt-translation-with-scarce-resources.pdf},
googlescholar = {18187158254612114010},
pages = {1--17},
journal = {Machine Translation},
volume = {17},
number = {1},
month = {March},
year = 2002
}
Al-Onaizan et al., 2002;
George Foster and Simona Gandrabur and Philippe Langlais and Graham Russell and Michel Simard and Pierre Plamondon (2003):
Statistical Machine Translation: Rapid Development with Limited Resources, Proceedings of the MT Summit IX
@inproceedings{Foster:2003,
author = {George Foster and Simona Gandrabur and Philippe Langlais and Graham Russell and Michel Simard and Pierre Plamondon},
title = {Statistical Machine Translation: Rapid Development with Limited Resources},
url = {
http://mt-archive.info/MTS-2003-Foster.pdf},
booktitle = {Proceedings of the {MT} Summit IX},
year = 2003
}
Foster et al., 2003;
Doug Oard and Franz Josef Och (2003):
Rapid Response Machine Translation for Unexpected Languages, Proceedings of the MT Summit IX
@inproceedings{Oard:2003,
author = {Doug Oard and Franz Josef Och},
title = {Rapid Response Machine Translation for Unexpected Languages},
url = {
http://www.mt-archive.info/MTS-2003-Oard.pdf},
googlescholar = {18184912281612120739},
booktitle = {Proceedings of the {MT} Summit IX},
year = 2003
}
Oard and Och, 2003).
A practical example of this is the rapid development of a Haitian Creole to English machine translation system to assist first responders in the aftermath of the 2010 earthquake in Haiti
Lewis, William and Munro, Robert and Vogel, Stephan (2011):
Crisis MT: Developing A Cookbook for MT in Crisis Situations, Proceedings of the Sixth Workshop on Statistical Machine Translation
@InProceedings{lewis-munro-vogel:2011:WMT,
author = {Lewis, William and Munro, Robert and Vogel, Stephan},
title = {Crisis MT: Developing A Cookbook for {MT} in Crisis Situations},
booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
month = {July},
address = {Edinburgh, Scotland},
publisher = {Association for Computational Linguistics},
pages = {501--511},
url = {
http://www.aclweb.org/anthology/W11-2164},
year = 2011
}
(Lewis et al., 2011). The training data made available and extended during this effort was the topic of a shared task
Callison-Burch, Chris and Koehn, Philipp and Monz, Christof and Zaidan, Omar (2011):
Findings of the 2011 Workshop on Statistical Machine Translation, Proceedings of the Sixth Workshop on Statistical Machine Translation
mentioned in Sparse Data, Evaluation Campaigns and System Combination
@InProceedings{callisonburch-EtAl:2011:WMT,
author = {Callison-Burch, Chris and Koehn, Philipp and Monz, Christof and Zaidan, Omar},
title = {Findings of the 2011 Workshop on Statistical Machine Translation},
booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
month = {July},
address = {Edinburgh, Scotland},
publisher = {Association for Computational Linguistics},
pages = {22--64},
url = {
http://www.aclweb.org/anthology/W11-2103},
year = 2011
}
(Callison-Burch et al., 2011), in which several research teams participated
Eidelman, Vladimir and Hollingshead, Kristy and Resnik, Philip (2011):
Noisy SMS Machine Translation in Low-Density Languages, Proceedings of the Sixth Workshop on Statistical Machine Translation
@InProceedings{eidelman-hollingshead-resnik:2011:WMT,
author = {Eidelman, Vladimir and Hollingshead, Kristy and Resnik, Philip},
title = {Noisy SMS Machine Translation in Low-Density Languages},
booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
month = {July},
address = {Edinburgh, Scotland},
publisher = {Association for Computational Linguistics},
pages = {344--350},
url = {
http://www.aclweb.org/anthology/W11-2140},
year = 2011
}
(Eidelman et al., 2011;
Hewavitharana, Sanjika and Bach, Nguyen and Gao, Qin and Ambati, Vamshi and Vogel, Stephan (2011):
CMU Haitian Creole-English Translation System for WMT 2011, Proceedings of the Sixth Workshop on Statistical Machine Translation
@InProceedings{hewavitharana-EtAl:2011:WMT,
author = {Hewavitharana, Sanjika and Bach, Nguyen and Gao, Qin and Ambati, Vamshi and Vogel, Stephan},
title = {CMU Haitian Creole-English Translation System for WMT 2011},
booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
month = {July},
address = {Edinburgh, Scotland},
publisher = {Association for Computational Linguistics},
pages = {386--392},
url = {
http://www.aclweb.org/anthology/W11-2146},
year = 2011
}
Hewavitharana et al., 2011;
Hu, Chang and Resnik, Philip and Kronrod, Yakov and Eidelman, Vladimir and Buzek, Olivia and Bederson, Benjamin B. (2011):
The Value of Monolingual Crowdsourcing in a Real-World Translation Scenario: Simulation using Haitian Creole Emergency SMS Messages, Proceedings of the Sixth Workshop on Statistical Machine Translation
mentioned in Parallel Corpora and Sparse Data
@InProceedings{hu-EtAl:2011:WMT,
author = {Hu, Chang and Resnik, Philip and Kronrod, Yakov and Eidelman, Vladimir and Buzek, Olivia and Bederson, Benjamin B.},
title = {The Value of Monolingual Crowdsourcing in a Real-World Translation Scenario: Simulation using Haitian Creole Emergency SMS Messages},
booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
month = {July},
address = {Edinburgh, Scotland},
publisher = {Association for Computational Linguistics},
pages = {399--404},
url = {
http://www.aclweb.org/anthology/W11-2148},
year = 2011
}
Hu et al., 2011;
Stymne, Sara (2011):
Spell Checking Techniques for Replacement of Unknown Words and Data Cleaning for Haitian Creole SMS Translation, Proceedings of the Sixth Workshop on Statistical Machine Translation
mentioned in Spelling Correction and Sparse Data
@InProceedings{stymne:2011:WMT,
author = {Stymne, Sara},
title = {Spell Checking Techniques for Replacement of Unknown Words and Data Cleaning for Haitian Creole SMS Translation},
booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
month = {July},
address = {Edinburgh, Scotland},
publisher = {Association for Computational Linguistics},
pages = {470--477},
url = {
http://www.aclweb.org/anthology/W11-2159},
year = 2011
}
Stymne, 2011).
Another good example study is the development of a Yiddish-English system
Dmitriy Genzel and Klaus Macherey and Jakob Uszkoreit (2009):
Creating a High-Quality Machine Translation System for a Low-Resource Language: Yiddish, Proceedings of the Twelfth Machine Translation Summit (MT Summit XII)
@inproceedings{MTS09:Genzel,
author = {Dmitriy Genzel and Klaus Macherey and Jakob Uszkoreit},
title = {Creating a High-Quality Machine Translation System for a Low-Resource Language: {Y}iddish},
url = {
http://research.google.com/pubs/archive/35627.pdf},
googlescholar = {2616623327440954921},
booktitle = {Proceedings of the Twelfth Machine Translation Summit (MT Summit XII)},
publisher = {International Association for Machine Translation},
location = {Ottawa, Ontario, Canada},
year = 2009
}
Genzel et al. (2009), where a range of methods was explored, such as taking advantage of the close relation of Yiddish to German and the existence of Polish and Hebrew loan words.
Benchmarks
A shared task on Haitian Creole organized at the 2011 ACL Workshop on Statistical Machine Translation
(Callison-Burch et al., 2011) provides a data set that has been used by several research groups.
Discussion
Related Topics
Sparse data aggravates the problem of Unknown Words, which may be replaced by means of Paraphrasing. If training data into a bridge language is available, such Pivot Languages can be exploited. The need to make use of any available data resource, even Comparable Corpora, becomes more urgent.
In general, since many methods in statistical machine translation are geared towards making effective use of the training data, they are more likely to make a difference in a sparse data scenario.
New Publications
Jeff Ma and Spyros Matsoukas and Richard Schwartz (2011):
Improving Low-Resource Statistical Machine Translation with a Novel Semantic Word Clustering Algorithm, Proceedings of the 13th Machine Translation Summit (MT Summit XIII)
@inproceedings{MTS-2011-Ma-2,
author = {Jeff Ma and Spyros Matsoukas and Richard Schwartz},
title = {Improving Low-Resource Statistical Machine Translation with a Novel Semantic Word Clustering Algorithm},
url = {
http://www.mt-archive.info/MTS-2011-Ma-2.pdf},
pages = {352--359},
booktitle = {Proceedings of the 13th Machine Translation Summit (MT Summit XIII)},
publisher = {International Association for Machine Translation},
location = {Xiamen, China},
year = 2011
}
Ma et al. (2011)
Steve DeNeefe and Ulf Hermjakob and Kevin Knight (2008):
Overcoming Vocabulary Sparsity in MT Using Lattices, Proceedings of the Eighth Conference of the Association for Machine Translation in the Americas (AMTA)
@inproceedings{amta08:DeNeefe,
author = {Steve DeNeefe and Ulf Hermjakob and Kevin Knight},
title = {Overcoming Vocabulary Sparsity in {MT} Using Lattices},
url = {
http://www.isi.edu/natural-language/mt/amta2008su.pdf},
googlescholar = {4025401116724932831},
pages = {89--96},
booktitle = {Proceedings of the Eighth Conference of the Association for Machine Translation in the Americas (AMTA)},
location = {Waikiki, Hawaii},
year = 2008
}
DeNeefe et al. (2008)
Wang, Pidong and Nakov, Preslav and Ng, Hwee Tou (2012):
Source Language Adaptation for Resource-Poor Machine Translation, Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
@InProceedings{wang-nakov-ng:2012:EMNLP-CoNLL,
author = {Wang, Pidong and Nakov, Preslav and Ng, Hwee Tou},
title = {Source Language Adaptation for Resource-Poor Machine Translation},
booktitle = {Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning},
month = {July},
address = {Jeju Island, Korea},
publisher = {Association for Computational Linguistics},
pages = {286--296},
url = {
http://www.aclweb.org/anthology/D12-1027},
year = 2012
}
Wang et al. (2012)
William Lewis and Phong Yang (2012):
Building MT for a Severely Under-Resourced Language: White Hmong, Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas (AMTA)
@inproceedings{AMTA-2012-Lewis,
author = {William Lewis and Phong Yang},
title = {Building {MT} for a Severely Under-Resourced Language: White Hmong},
url = {
http://www.mt-archive.info/AMTA-2012-Lewis.pdf},
booktitle = {Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas (AMTA)},
location = {San Diego, California},
year = 2012
}
Lewis and Yang (2012)