Word Segmentation
Splitting sentences into word tokens is a particular problem for languages whose writing systems do not put spaces between words, as is the case for many Asian languages.
Word Segmentation is the main subject of 16 publications. 5 are discussed here.
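To make the problem concrete, here is a minimal sketch of a dictionary-based segmenter using greedy forward maximum matching. This is a common baseline chosen purely for illustration; it is not the method of any publication listed on this page. The function name `max_match`, the `max_len` parameter, and the toy lexicon are all assumptions.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    that appears in the lexicon; fall back to a single character otherwise.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicons, for illustration only.
coarse_lexicon = {"机器", "翻译", "机器翻译", "系统"}
fine_lexicon = {"机器", "翻译", "系统"}

print(max_match("机器翻译系统", coarse_lexicon))
print(max_match("机器翻译系统", fine_lexicon))
```

The two lexicons yield segmentations of different granularity for the same input ("machine translation" as one token versus two). A translation system may prefer one or the other, depending on how the target language splits the same concept.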
Publications
Ruiqiang Zhang and Eiichiro Sumita (2008):
Chinese Unknown Word Translation by Subword Re-segmentation, Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP)
@inproceedings{RuiqiangZhang:2008:IJCNLP,
author = {Ruiqiang Zhang and Eiichiro Sumita},
title = {Chinese Unknown Word Translation by Subword Re-segmentation},
url = {http://www.mt-archive.info/IJCNLP-2008-Zhang-1.pdf},
googlescholar = {13978847271318305873},
booktitle = {Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP)},
year = 2008
}
Zhang and Sumita (2008);
Zhang, Ruiqiang and Yasuda, Keiji and Sumita, Eiichiro (2008):
Improved Statistical Machine Translation by Multiple Chinese Word Segmentation, Proceedings of the Third Workshop on Statistical Machine Translation
@InProceedings{zhang-yasuda-sumita:2008:WMT,
author = {Zhang, Ruiqiang and Yasuda, Keiji and Sumita, Eiichiro},
title = {Improved Statistical Machine Translation by Multiple {Chinese} Word Segmentation},
booktitle = {Proceedings of the Third Workshop on Statistical Machine Translation},
month = {June},
address = {Columbus, Ohio},
publisher = {Association for Computational Linguistics},
pages = {216--223},
url = {http://www.aclweb.org/anthology/W/W08/W08-0335},
year = 2008
}
Zhang et al. (2008) discuss different granularities for Chinese words and suggest a back-off approach.
Ming-Hong Bai and Keh-Jiann Chen and Jason S. Chang (2008):
Improving Word Alignment by Adjusting Chinese Word Segmentation, Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP)
@inproceedings{Bai:2008:IJCNLP,
author = {Ming-Hong Bai and Keh-Jiann Chen and Jason S. Chang},
title = {Improving Word Alignment by Adjusting {C}hinese Word Segmentation},
url = {http://www.newdesign.aclweb.org/anthology/I/I08/I08-1033.pdf},
googlescholar = {10393727842108893138},
booktitle = {Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP)},
year = 2008
}
Bai et al. (2008) aim for Chinese word segmentation in the training data to match English words one-to-one, while
Chang, Pi-Chuan and Galley, Michel and Manning, Christopher D. (2008):
Optimizing Chinese Word Segmentation for Machine Translation Performance, Proceedings of the Third Workshop on Statistical Machine Translation
@InProceedings{chang-galley-manning:2008:WMT,
author = {Chang, Pi-Chuan and Galley, Michel and Manning, Christopher D.},
title = {Optimizing {Chinese} Word Segmentation for Machine Translation Performance},
booktitle = {Proceedings of the Third Workshop on Statistical Machine Translation},
month = {June},
address = {Columbus, Ohio},
publisher = {Association for Computational Linguistics},
pages = {224--232},
url = {http://www.aclweb.org/anthology/W/W08/W08-0336},
year = 2008
}
Chang et al. (2008) adjust the average word length to optimize translation performance.
Xu, Jia and Gao, Jianfeng and Toutanova, Kristina and Ney, Hermann (2008):
Bayesian Semi-Supervised Chinese Word Segmentation for Statistical Machine Translation, Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)
@InProceedings{xu-EtAl:2008:PAPERS,
author = {Xu, Jia and Gao, Jianfeng and Toutanova, Kristina and Ney, Hermann},
title = {Bayesian Semi-Supervised {Chinese} Word Segmentation for Statistical Machine Translation},
booktitle = {Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)},
month = {August},
address = {Manchester, UK},
publisher = {Coling 2008 Organizing Committee},
pages = {1017--1024},
url = {http://www.aclweb.org/anthology/C08-1128},
year = 2008
}
Xu et al. (2008) also use the correspondence to English words in their Bayesian approach.
Benchmarks
Discussion
Related Topics
New Publications
Wang, Xiaolin and Utiyama, Masao and Finch, Andrew and Sumita, Eiichiro (2014):
Refining Word Segmentation Using a Manually Aligned Corpus for Statistical Machine Translation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
@InProceedings{wang-EtAl:2014:EMNLP20146,
author = {Wang, Xiaolin and Utiyama, Masao and Finch, Andrew and Sumita, Eiichiro},
title = {Refining Word Segmentation Using a Manually Aligned Corpus for Statistical Machine Translation},
booktitle = {Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
month = {October},
address = {Doha, Qatar},
publisher = {Association for Computational Linguistics},
pages = {1654--1664},
url = {http://www.aclweb.org/anthology/D14-1173},
year = 2014
}
Wang et al. (2014)
Zeng, Xiaodong and Chao, Lidia S. and Wong, Derek F. and Trancoso, Isabel and Tian, Liang (2014):
Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{zeng-EtAl:2014:P14-1,
author = {Zeng, Xiaodong and Chao, Lidia S. and Wong, Derek F. and Trancoso, Isabel and Tian, Liang},
title = {Toward Better {Chinese} Word Segmentation for {SMT} via Bilingual Constraints},
booktitle = {Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {June},
address = {Baltimore, Maryland},
publisher = {Association for Computational Linguistics},
pages = {1360--1369},
url = {http://www.aclweb.org/anthology/P14-1128},
year = 2014
}
Zeng et al. (2014)
Al-Mannai, Kamla and Sajjad, Hassan and Khader, Alaa and Al Obaidli, Fahad and Nakov, Preslav and Vogel, Stephan (2014):
Unsupervised Word Segmentation Improves Dialectal Arabic to English Machine Translation, Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)
@InProceedings{almannai-EtAl:2014:ANLP2014,
author = {Al-Mannai, Kamla and Sajjad, Hassan and Khader, Alaa and Al Obaidli, Fahad and Nakov, Preslav and Vogel, Stephan},
title = {Unsupervised Word Segmentation Improves Dialectal Arabic to {English} Machine Translation},
booktitle = {Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)},
month = {October},
address = {Doha, Qatar},
publisher = {Association for Computational Linguistics},
pages = {207--216},
url = {http://www.aclweb.org/anthology/W14-3628},
year = 2014
}
Al-Mannai et al. (2014)
Neubig, Graham and Watanabe, Taro and Mori, Shinsuke and Kawahara, Tatsuya (2012):
Machine Translation without Words through Substring Alignment, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{neubig-EtAl:2012:ACL2012,
author = {Neubig, Graham and Watanabe, Taro and Mori, Shinsuke and Kawahara, Tatsuya},
title = {Machine Translation without Words through Substring Alignment},
booktitle = {Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {July},
address = {Jeju Island, Korea},
publisher = {Association for Computational Linguistics},
pages = {165--174},
url = {http://www.aclweb.org/anthology/P12-1018},
year = 2012
}
Neubig et al. (2012)
Graham Neubig and Taro Watanabe and Shinsuke Mori and Tatsuya Kawahara (2013):
Substring-based machine translation, Machine Translation
@article{mtj13-Neubig,
author = {Graham Neubig and Taro Watanabe and Shinsuke Mori and Tatsuya Kawahara},
title = {Substring-based machine translation},
pages = {139--166},
journal = {Machine Translation},
volume = {27},
number = {2},
month = {June},
year = 2013
}
Neubig et al. (2013)
Nguyen, ThuyLinh and Vogel, Stephan and Smith, Noah A. (2010):
Nonparametric Word Segmentation for Machine Translation, Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)
@InProceedings{nguyen-vogel-smith:2010:PAPERS,
author = {Nguyen, ThuyLinh and Vogel, Stephan and Smith, Noah A.},
title = {Nonparametric Word Segmentation for Machine Translation},
booktitle = {Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)},
month = {August},
address = {Beijing, China},
publisher = {Coling 2010 Organizing Committee},
pages = {815--823},
url = {http://www.aclweb.org/anthology/C10-1092},
year = 2010
}
Nguyen et al. (2010)
Paul, Michael and Finch, Andrew and Sumita, Eiichiro (2010):
Integration of Multiple Bilingually-Learned Segmentation Schemes into Statistical Machine Translation, Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
@InProceedings{paul-finch-sumita:2010:WMT,
author = {Paul, Michael and Finch, Andrew and Sumita, Eiichiro},
title = {Integration of Multiple Bilingually-Learned Segmentation Schemes into Statistical Machine Translation},
booktitle = {Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR},
month = {July},
address = {Uppsala, Sweden},
publisher = {Association for Computational Linguistics},
pages = {400--408},
url = {http://www.aclweb.org/anthology/W10-1760},
year = 2010
}
Paul et al. (2010)
Ma, Yanjun and Way, Andy (2009):
Bilingually Motivated Domain-Adapted Word Segmentation for Statistical Machine Translation, Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)
@InProceedings{ma-way:2009:EACL,
author = {Ma, Yanjun and Way, Andy},
title = {Bilingually Motivated Domain-Adapted Word Segmentation for Statistical Machine Translation},
booktitle = {Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)},
month = {March},
address = {Athens, Greece},
publisher = {Association for Computational Linguistics},
pages = {549--557},
url = {http://www.aclweb.org/anthology/E09-1063},
year = 2009
}
Ma and Way (2009)
Jia Xu and Evgeny Matusov and Richard Zens and Hermann Ney (2005):
Integrated Chinese Word Segmentation in Statistical Machine Translation, Proc. of the International Workshop on Spoken Language Translation
mentioned in Word Segmentation and Domain Adaptation
@InProceedings{xu:2005:iwslt,
author = {Jia Xu and Evgeny Matusov and Richard Zens and Hermann Ney},
title = {Integrated {Chinese} Word Segmentation in Statistical Machine Translation},
url = {http://20.210-193-52.unknown.qala.com.sg/archive/iwslt\_05/papers/slt5\_131.pdf},
googlescholar = {7489139888320891571},
booktitle = {Proc. of the International Workshop on Spoken Language Translation},
location = {Pittsburgh, PA, USA},
month = {October},
year = 2005
}
Xu et al. (2005)
Chung, Tagyoung and Gildea, Daniel (2009):
Unsupervised Tokenization for Machine Translation, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing
@InProceedings{chung-gildea:2009:EMNLP,
author = {Chung, Tagyoung and Gildea, Daniel},
title = {Unsupervised Tokenization for Machine Translation},
booktitle = {Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing},
month = {August},
address = {Singapore},
publisher = {Association for Computational Linguistics},
pages = {718--726},
url = {http://www.aclweb.org/anthology/D/D09/D09-1075},
year = 2009
}
Chung and Gildea (2009)
Xiao, Xinyan and Liu, Yang and Hwang, YoungSook and Liu, Qun and Lin, Shouxun (2010):
Joint Tokenization and Translation, Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)
@InProceedings{xiao-EtAl:2010:PAPERS,
author = {Xiao, Xinyan and Liu, Yang and Hwang, YoungSook and Liu, Qun and Lin, Shouxun},
title = {Joint Tokenization and Translation},
booktitle = {Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)},
month = {August},
address = {Beijing, China},
publisher = {Coling 2010 Organizing Committee},
pages = {1200--1208},
url = {http://www.aclweb.org/anthology/C10-1135},
year = 2010
}
Xiao et al. (2010)