Word Segmentation

Splitting sentences into word tokens is especially difficult for languages whose writing systems do not mark word boundaries with spaces, as is the case for many Asian languages.
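To make the task concrete, here is a minimal sketch of greedy forward maximum matching, a simple dictionary-based baseline for segmenting unspaced text. It is not the method of any publication listed on this page; the function name `max_match`, the `max_len` parameter, and the toy lexicon are assumptions made for illustration only.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching over an unsegmented string.

    Hypothetical helper for illustration: at each position, take the
    longest lexicon entry (up to max_len characters) that matches, and
    fall back to a single character if nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


# Toy lexicons, purely illustrative.
coarse_lexicon = {"机器", "翻译", "机器翻译", "系统"}
fine_lexicon = {"机器", "翻译", "系统"}

print(max_match("机器翻译系统", coarse_lexicon))  # ['机器翻译', '系统']
print(max_match("机器翻译系统", fine_lexicon))    # ['机器', '翻译', '系统']
```

The same string yields a different segmentation depending on the lexicon. This is the kind of granularity choice the publications below try to make in a way that suits translation.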

Word Segmentation is the main subject of 16 publications, 5 of which are discussed here.

Topics in Data

Publications

Ruiqiang Zhang and Eiichiro Sumita (2008): Chinese Unknown Word Translation by Subword Re-segmentation, Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP)

Zhang and Sumita (2008) and

Zhang, Ruiqiang and Yasuda, Keiji and Sumita, Eiichiro (2008): Improved Statistical Machine Translation by Multiple Chinese Word Segmentation, Proceedings of the Third Workshop on Statistical Machine Translation

Zhang et al. (2008) discuss different granularities for Chinese words and suggest a back-off approach.

Ming-Hong Bai and Keh-Jiann Chen and Jason S. Chang (2008): Improving Word Alignment by Adjusting Chinese Word Segmentation, Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP)

Bai et al. (2008) aim for Chinese word segmentation in the training data to match English words one-to-one, while

Chang, Pi-Chuan and Galley, Michel and Manning, Christopher D. (2008): Optimizing Chinese Word Segmentation for Machine Translation Performance, Proceedings of the Third Workshop on Statistical Machine Translation

Chang et al. (2008) adjust the average word length to optimize translation performance.
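As a small illustration of the quantity being tuned, the sketch below computes the average word length (characters per token) of a segmented corpus. It only shows how such a statistic could be measured and is not the implementation of Chang et al. (2008). The function name `average_word_length` and the toy sentences are assumptions, and tokens are assumed to be separated by whitespace.

```python
def average_word_length(segmented_sentences):
    """Mean number of characters per token over a segmented corpus.

    Hypothetical helper: each sentence is a string whose tokens are
    separated by whitespace, as produced by some segmenter.
    """
    total_chars = 0
    total_tokens = 0
    for sentence in segmented_sentences:
        tokens = sentence.split()
        total_tokens += len(tokens)
        total_chars += sum(len(tok) for tok in tokens)
    # Guard against an empty corpus.
    return total_chars / total_tokens if total_tokens else 0.0


# Two segmentations of the same toy sentence at different granularities.
coarse = ["机器翻译 系统"]
fine = ["机器 翻译 系统"]

print(average_word_length(coarse))  # 3.0
print(average_word_length(fine))    # 2.0
```

A coarser segmentation gives a higher average word length and a finer one gives a lower value, so this single number can serve as a rough handle on segmentation granularity.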

Xu, Jia and Gao, Jianfeng and Toutanova, Kristina and Ney, Hermann (2008): Bayesian Semi-Supervised Chinese Word Segmentation for Statistical Machine Translation, Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

Xu et al. (2008) also use the correspondence to English words in their Bayesian approach.

Benchmarks

Discussion

New Publications

Wang, Xiaolin and Utiyama, Masao and Finch, Andrew and Sumita, Eiichiro (2014): Refining Word Segmentation Using a Manually Aligned Corpus for Statistical Machine Translation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
@InProceedings{wang-EtAl:2014:EMNLP20146,
author = {Wang, Xiaolin and Utiyama, Masao and Finch, Andrew and Sumita, Eiichiro},
title = {Refining Word Segmentation Using a Manually Aligned Corpus for Statistical Machine Translation},
booktitle = {Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
month = {October},
address = {Doha, Qatar},
publisher = {Association for Computational Linguistics},
pages = {1654--1664},
url = {http://www.aclweb.org/anthology/D14-1173},
year = 2014
}
Wang et al. (2014)
Zeng, Xiaodong and Chao, Lidia S. and Wong, Derek F. and Trancoso, Isabel and Tian, Liang (2014): Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{zeng-EtAl:2014:P14-1,
author = {Zeng, Xiaodong and Chao, Lidia S. and Wong, Derek F. and Trancoso, Isabel and Tian, Liang},
title = {Toward Better {Chinese} Word Segmentation for {SMT} via Bilingual Constraints},
booktitle = {Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {June},
address = {Baltimore, Maryland},
publisher = {Association for Computational Linguistics},
pages = {1360--1369},
url = {http://www.aclweb.org/anthology/P14-1128},
year = 2014
}
Zeng et al. (2014)
Al-Mannai, Kamla and Sajjad, Hassan and Khader, Alaa and Al Obaidli, Fahad and Nakov, Preslav and Vogel, Stephan (2014): Unsupervised Word Segmentation Improves Dialectal Arabic to English Machine Translation, Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)
@InProceedings{almannai-EtAl:2014:ANLP2014,
author = {Al-Mannai, Kamla and Sajjad, Hassan and Khader, Alaa and Al Obaidli, Fahad and Nakov, Preslav and Vogel, Stephan},
title = {Unsupervised Word Segmentation Improves Dialectal {Arabic} to {English} Machine Translation},
booktitle = {Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)},
month = {October},
address = {Doha, Qatar},
publisher = {Association for Computational Linguistics},
pages = {207--216},
url = {http://www.aclweb.org/anthology/W14-3628},
year = 2014
}
Al-Mannai et al. (2014)
Neubig, Graham and Watanabe, Taro and Mori, Shinsuke and Kawahara, Tatsuya (2012): Machine Translation without Words through Substring Alignment, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{neubig-EtAl:2012:ACL2012,
author = {Neubig, Graham and Watanabe, Taro and Mori, Shinsuke and Kawahara, Tatsuya},
title = {Machine Translation without Words through Substring Alignment},
booktitle = {Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {July},
address = {Jeju Island, Korea},
publisher = {Association for Computational Linguistics},
pages = {165--174},
url = {http://www.aclweb.org/anthology/P12-1018},
year = 2012
}
Neubig et al. (2012)
Graham Neubig and Taro Watanabe and Shinsuke Mori and Tatsuya Kawahara (2013): Substring-based machine translation, Machine Translation
@article{mtj13-Neubig,
author = {Graham Neubig and Taro Watanabe and Shinsuke Mori and Tatsuya Kawahara},
title = {Substring-based machine translation},
pages = {139--166},
journal = {Machine Translation},
volume = {27},
number = {2},
month = {June},
year = 2013
}
Neubig et al. (2013)
Nguyen, ThuyLinh and Vogel, Stephan and Smith, Noah A. (2010): Nonparametric Word Segmentation for Machine Translation, Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)
@InProceedings{nguyen-vogel-smith:2010:PAPERS,
author = {Nguyen, ThuyLinh and Vogel, Stephan and Smith, Noah A.},
title = {Nonparametric Word Segmentation for Machine Translation},
booktitle = {Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)},
month = {August},
address = {Beijing, China},
publisher = {Coling 2010 Organizing Committee},
pages = {815--823},
url = {http://www.aclweb.org/anthology/C10-1092},
year = 2010
}
Nguyen et al. (2010)
Paul, Michael and Finch, Andrew and Sumita, Eiichiro (2010): Integration of Multiple Bilingually-Learned Segmentation Schemes into Statistical Machine Translation, Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
@InProceedings{paul-finch-sumita:2010:WMT,
author = {Paul, Michael and Finch, Andrew and Sumita, Eiichiro},
title = {Integration of Multiple Bilingually-Learned Segmentation Schemes into Statistical Machine Translation},
booktitle = {Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR},
month = {July},
address = {Uppsala, Sweden},
publisher = {Association for Computational Linguistics},
pages = {400--408},
url = {http://www.aclweb.org/anthology/W10-1760},
year = 2010
}
Paul et al. (2010)
Ma, Yanjun and Way, Andy (2009): Bilingually Motivated Domain-Adapted Word Segmentation for Statistical Machine Translation, Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)
@InProceedings{ma-way:2009:EACL,
author = {Ma, Yanjun and Way, Andy},
title = {Bilingually Motivated Domain-Adapted Word Segmentation for Statistical Machine Translation},
booktitle = {Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)},
month = {March},
address = {Athens, Greece},
publisher = {Association for Computational Linguistics},
pages = {549--557},
url = {http://www.aclweb.org/anthology/E09-1063},
year = 2009
}
Ma and Way (2009)
Jia Xu and Evgeny Matusov and Richard Zens and Hermann Ney (2005): Integrated Chinese Word Segmentation in Statistical Machine Translation, Proc. of the International Workshop on Spoken Language Translation (mentioned in Word Segmentation and Domain Adaptation)
@InProceedings{xu:2005:iwslt,
author = {Jia Xu and Evgeny Matusov and Richard Zens and Hermann Ney},
title = {Integrated {Chinese} Word Segmentation in Statistical Machine Translation},
url = {http://20.210-193-52.unknown.qala.com.sg/archive/iwslt\_05/papers/slt5\_131.pdf},
googlescholar = {7489139888320891571},
booktitle = {Proc. of the International Workshop on Spoken Language Translation},
address = {Pittsburgh, PA, USA},
month = {October},
year = 2005
}
Xu et al. (2005)
Chung, Tagyoung and Gildea, Daniel (2009): Unsupervised Tokenization for Machine Translation, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing
@InProceedings{chung-gildea:2009:EMNLP,
author = {Chung, Tagyoung and Gildea, Daniel},
title = {Unsupervised Tokenization for Machine Translation},
booktitle = {Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing},
month = {August},
address = {Singapore},
publisher = {Association for Computational Linguistics},
pages = {718--726},
url = {http://www.aclweb.org/anthology/D/D09/D09-1075},
year = 2009
}
Chung and Gildea (2009)
Xiao, Xinyan and Liu, Yang and Hwang, YoungSook and Liu, Qun and Lin, Shouxun (2010): Joint Tokenization and Translation, Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)
@InProceedings{xiao-EtAl:2010:PAPERS,
author = {Xiao, Xinyan and Liu, Yang and Hwang, YoungSook and Liu, Qun and Lin, Shouxun},
title = {Joint Tokenization and Translation},
booktitle = {Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)},
month = {August},
address = {Beijing, China},
publisher = {Coling 2010 Organizing Committee},
pages = {1200--1208},
url = {http://www.aclweb.org/anthology/C10-1135},
year = 2010
}
Xiao et al. (2010)

MT Research Survey Wiki

A Comprehensive Survey of Neural and Statistical Machine Translation Research Publications
