Sentence Alignment

Translated texts are often found in the form of translated documents or web pages. Since sentences are not always mapped one-to-one, sentence alignment methods are needed.

Sentence Alignment is the main subject of 46 publications. 28 are discussed here.

Topics in Data

Publications

Sentence alignment was a very active field of research in the early days of statistical machine translation. An influential early method is based on sentence length, measured in words (Brown et al., 1991; Gale and Church, 1991; Gale and Church, 1993) or characters (Church, 1993). Other methods use alignment chains (Melamed, 1996; Melamed, 1999), model omissions (Melamed, 1996), distinguish between large-scale segmentation of text and detailed sentence alignment (Simard and Plamondon, 1996), or apply line detection methods from image processing to detect large-scale alignment patterns (Chang and Chen, 1997; Melamed, 1997).
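As a rough illustration of the length-based idea, the following sketch aligns two sentence lists by dynamic programming over character lengths. It is not the cost function of any of the cited papers: the bead penalties in BEADS, the relative length mismatch in length_cost, and the function names are all assumptions chosen for readability.

```python
from typing import List, Tuple

# Bead types (number of source sentences, number of target sentences).
# The penalty values are illustrative assumptions, not published parameters.
BEADS = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}


def length_cost(src_len: int, tgt_len: int) -> float:
    """Relative length mismatch in [0, 1]; 0 means equal lengths."""
    total = src_len + tgt_len
    return abs(src_len - tgt_len) / total if total else 0.0


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Return beads as (source indices, target indices), in document order."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward pass: extend every reachable prefix pair by one bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Backward pass: follow back-pointers from the end of both texts.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]
```

Calling align(source_sentences, target_sentences) returns a list of index groups, so a source sentence translated as two target sentences appears as a single 1-2 bead. The sketch measures length in characters; switching to word counts only requires changing how s_len and t_len are computed.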

Kay and Röscheisen (1993) propose an iterative algorithm that uses spelling similarity and word co-occurrences to drive sentence alignment. Several researchers proposed including lexical information (Chen, 1993; Dagan et al., 1993; Utsuro et al., 1994; Wu, 1994; Haruno and Yamazaki, 1996; Chuang and Chang, 2002; Kueng and Su, 2002; Moore, 2002; Nightingale and Tanaka, 2003; Aswani and Gaizauskas, 2005), content words (Papageorgiou et al., 1994), or numbers and n-grams (Davis et al., 1995). Sentence alignment may also be improved by a third language in multilingual corpora (Simard, 1999). More effort is needed to align very noisy corpora (Zhao et al., 2003). Different sentence alignment methods are compared by Singh and Husain (2005). Xu et al. (2006) propose a method that iteratively performs binary splits of a document to obtain a sentence alignment. Enright and Kondrak (2007) use a simple and fast method for document alignment that relies on the overlap of rare but identically spelled words, which are mostly cognates, names, and numbers.
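The overlap idea in the last sentence can be sketched as follows. This is only a minimal illustration under stated assumptions: "rare" is taken to mean a word that occurs at most once within a document, tokens are split on whitespace, and the function names rare_words and match_documents are hypothetical rather than taken from the cited work.

```python
from collections import Counter
from typing import Dict, Set


def rare_words(text: str, max_count: int = 1) -> Set[str]:
    """Words occurring at most max_count times in the text (assumed notion of rare)."""
    counts = Counter(text.lower().split())
    return {word for word, count in counts.items() if count <= max_count}


def match_documents(src_docs: Dict[str, str], tgt_docs: Dict[str, str]) -> Dict[str, str]:
    """Pair each source document with the target document sharing the most rare words."""
    if not tgt_docs:
        return {}
    tgt_rare = {name: rare_words(text) for name, text in tgt_docs.items()}
    pairs = {}
    for name, text in src_docs.items():
        src_rare = rare_words(text)
        # Identically spelled rare words (names, numbers, cognates) drive the match.
        pairs[name] = max(tgt_rare, key=lambda t: len(src_rare & tgt_rare[t]))
    return pairs
```

Because only exact string matches count, the sketch needs no dictionary or translation model, which is what makes this kind of approach fast; it also means it can only work for language pairs that share a script.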

Benchmarks

Discussion

New Publications

Éva Mújdricza-Maydt and Huiqin Körkel-Qu and Stefan Riezler and Sebastian Padó (2013): High-Precision Sentence Alignment by Bootstrapping from Wood Standard Annotations, The Prague Bulletin of Mathematical Linguistics
@article{pbml-99-mujdricza-maydt-et-al,
author = {{\'E}va M{\'u}jdricza-Maydt and Huiqin K{\"o}rkel-Qu and Stefan Riezler and Sebastian Pad{\'o}},
title = {High-Precision Sentence Alignment by Bootstrapping from Wood Standard Annotations},
url = {http://ufal.mff.cuni.cz/pbml/99/art-mujdricza-maydt-et-al.pdf},
pages = {5--16},
journal = {The Prague Bulletin of Mathematical Linguistics},
volume = {99},
year = 2013
}
Mújdricza-Maydt et al. (2013)
Quan, Xiaojun and Kit, Chunyu and Song, Yan (2013): Non-Monotonic Sentence Alignment via Semisupervised Learning, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{quan-kit-song:2013:ACL2013,
author = {Quan, Xiaojun and Kit, Chunyu and Song, Yan},
title = {Non-Monotonic Sentence Alignment via Semisupervised Learning},
booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {622--630},
url = {http://www.aclweb.org/anthology/P13-1061},
year = 2013
}
Quan et al. (2013)
Kutuzov, Andrey (2013): Improving English-Russian sentence alignment through POS tagging and Damerau-Levenshtein distance, Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing
@InProceedings{kutuzov:2013:BSNLP,
author = {Kutuzov, Andrey},
title = {Improving English-Russian sentence alignment through POS tagging and Damerau-Levenshtein distance},
booktitle = {Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {63--68},
url = {http://www.aclweb.org/anthology/W13-2410},
year = 2013
}
Kutuzov (2013)
Krstovski, Kriste and Smith, David A. (2013): Online Polylingual Topic Models for Fast Document Translation Detection, Proceedings of the Eighth Workshop on Statistical Machine Translation
@InProceedings{krstovski-smith:2013:WMT,
author = {Krstovski, Kriste and Smith, David A.},
title = {Online Polylingual Topic Models for Fast Document Translation Detection},
booktitle = {Proceedings of the Eighth Workshop on Statistical Machine Translation},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {252--261},
url = {http://www.aclweb.org/anthology/W13-2232},
year = 2013
}
Krstovski and Smith (2013)
Zaidan, Omar and Chowdhary, Vishal (2013): Evaluating (and Improving) Sentence Alignment under Noisy Conditions, Proceedings of the Eighth Workshop on Statistical Machine Translation
@InProceedings{zaidan-chowdhary:2013:WMT,
author = {Zaidan, Omar and Chowdhary, Vishal},
title = {Evaluating (and Improving) Sentence Alignment under Noisy Conditions},
booktitle = {Proceedings of the Eighth Workshop on Statistical Machine Translation},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {484--493},
url = {http://www.aclweb.org/anthology/W13-2261},
year = 2013
}
Zaidan and Chowdhary (2013)
Plamada, Magdalena and Volk, Martin (2013): Mining for Domain-specific Parallel Text from Wikipedia, Proceedings of the Sixth Workshop on Building and Using Comparable Corpora
@InProceedings{plamada-volk:2013:BUCC,
author = {Plamada, Magdalena and Volk, Martin},
title = {Mining for Domain-specific Parallel Text from Wikipedia},
booktitle = {Proceedings of the Sixth Workshop on Building and Using Comparable Corpora},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {112--120},
url = {http://www.aclweb.org/anthology/W13-2514},
year = 2013
}
Plamada and Volk (2013)
Zhang, Chengzhi and Yao, Xuchen and Kit, Chunyu (2013): Finding More Bilingual Webpages with High Credibility via Link Analysis, Proceedings of the Sixth Workshop on Building and Using Comparable Corpora
@InProceedings{zhang-yao-kit:2013:BUCC,
author = {Zhang, Chengzhi and Yao, Xuchen and Kit, Chunyu},
title = {Finding More Bilingual Webpages with High Credibility via Link Analysis},
booktitle = {Proceedings of the Sixth Workshop on Building and Using Comparable Corpora},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {138--143},
url = {http://www.aclweb.org/anthology/W13-2517},
year = 2013
}
Zhang et al. (2013)
Fethi Lamraoui and Philippe Langlais (2013): Yet Another Fast and Robust and Open Source Sentence Aligner. Time to Reconsider Sentence Alignment, Machine Translation Summit XIV
@inproceedings{MTS2013-Lamraoui,
author = {Fethi Lamraoui and Philippe Langlais},
title = {Yet Another Fast and Robust and Open Source Sentence Aligner. {Time} to Reconsider Sentence Alignment},
url = {http://www.mt-archive.info/10/MTS-2013-Lamraoui.pdf},
pages = {77--84},
booktitle = {Machine Translation Summit XIV},
year = 2013
}
Lamraoui and Langlais (2013)
Stymne, Sara and Hardmeier, Christian and Tiedemann, Jörg and Nivre, Joakim (2013): Feature Weight Optimization for Discourse-Level SMT, Proceedings of the Workshop on Discourse in Machine Translation
@InProceedings{stymne-EtAl:2013:DiscoMT,
author = {Stymne, Sara and Hardmeier, Christian and Tiedemann, J\"{o}rg and Nivre, Joakim},
title = {Feature Weight Optimization for Discourse-Level SMT},
booktitle = {Proceedings of the Workshop on Discourse in Machine Translation},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {60--69},
url = {http://www.aclweb.org/anthology/W13-3308},
year = 2013
}
Stymne et al. (2013)
Rico Sennrich and Martin Volk (2010): MT-based Sentence Alignment for OCR-generated Parallel Texts, Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas
@inproceedings{AMTA-2010-Sennrich,
author = {Rico Sennrich and Martin Volk},
title = {{MT}-based Sentence Alignment for {OCR}-generated Parallel Texts},
url = {http://www.mt-archive.info/AMTA-2010-Sennrich.pdf},
booktitle = {Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas},
location = {Denver, Colorado},
year = 2010
}
Sennrich and Volk (2010)
Shi, Lei and Zhou, Ming (2008): Improved Sentence Alignment on Parallel Web Pages Using a Stochastic Tree Alignment Model, Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing
@InProceedings{shi-zhou:2008:EMNLP,
author = {Shi, Lei and Zhou, Ming},
title = {Improved Sentence Alignment on Parallel Web Pages Using a Stochastic Tree Alignment Model},
booktitle = {Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing},
month = {October},
address = {Honolulu, Hawaii},
publisher = {Association for Computational Linguistics},
pages = {505--513},
url = {http://www.aclweb.org/anthology/D08-1053},
year = 2008
}
Shi and Zhou (2008)
Mamitimin, Samat and Hou, Min (2009): Chinese-Uyghur Sentence Alignment: An Approach Based on Anchor Sentences, Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora
@InProceedings{mamitimin-hou:2009:BUCC,
author = {Mamitimin, Samat and Hou, Min},
title = {Chinese-Uyghur Sentence Alignment: An Approach Based on Anchor Sentences},
booktitle = {Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora},
month = {August},
address = {Singapore},
publisher = {Association for Computational Linguistics},
pages = {38--45},
url = {http://www.aclweb.org/anthology/W/W09/W09-3108},
year = 2009
}
Mamitimin and Hou (2009)
Braune, Fabienne and Fraser, Alexander (2010): Improved Unsupervised Sentence Alignment for Symmetrical and Asymmetrical Parallel Corpora, Coling 2010: Posters
@InProceedings{braune-fraser:2010:POSTERS,
author = {Braune, Fabienne and Fraser, Alexander},
title = {Improved Unsupervised Sentence Alignment for Symmetrical and Asymmetrical Parallel Corpora},
booktitle = {Coling 2010: Posters},
month = {August},
address = {Beijing, China},
publisher = {Coling 2010 Organizing Committee},
pages = {81--89},
url = {http://www.aclweb.org/anthology/C10-2010},
year = 2010
}
Braune and Fraser (2010)
Li, Peng and Sun, Maosong and Xue, Ping (2010): Fast-Champollion: A Fast and Robust Sentence Alignment Algorithm, Coling 2010: Posters
@InProceedings{li-sun-xue:2010:POSTERS,
author = {Li, Peng and Sun, Maosong and Xue, Ping},
title = {Fast-Champollion: A Fast and Robust Sentence Alignment Algorithm},
booktitle = {Coling 2010: Posters},
month = {August},
address = {Beijing, China},
publisher = {Coling 2010 Organizing Committee},
pages = {710--718},
url = {http://www.aclweb.org/anthology/C10-2081},
year = 2010
}
Li et al. (2010)
Slayden, Glenn and Hwang, Mei-Yuh and Schwartz, Lee (2010): Thai Sentence-Breaking for Large-Scale SMT, Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing
@InProceedings{slayden-hwang-schwartz:2010:SSANLP,
author = {Slayden, Glenn and Hwang, Mei-Yuh and Schwartz, Lee},
title = {Thai Sentence-Breaking for Large-Scale SMT},
booktitle = {Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing},
month = {August},
address = {Beijing, China},
publisher = {Coling 2010 Organizing Committee},
pages = {8--16},
url = {http://www.aclweb.org/anthology/W10-3602},
year = 2010
}
Slayden et al. (2010)
Vilar, Juan Miguel (2005): Experiments Using MAR for Aligning Corpora, Proceedings of the ACL Workshop on Building and Using Parallel Texts
@InProceedings{vilar:2005:WPT,
author = {Vilar, Juan Miguel},
title = {Experiments Using {MAR} for Aligning Corpora},
booktitle = {Proceedings of the ACL Workshop on Building and Using Parallel Texts},
month = {June},
address = {Ann Arbor, Michigan},
publisher = {Association for Computational Linguistics},
pages = {95--98},
url = {http://www.aclweb.org/anthology/W/W05/W05-0815},
year = 2005
}
Vilar (2005)
Thomas C. Chuang and Jiang-Cheng Wu and Tracy Lin and Wen-Chie Shei and Jason S. Chang (2004): Bilingual Sentence Alignment Based on Punctuation Statistics and Lexicon, Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP)
@inproceedings{Chuang:2004,
author = {Thomas C. Chuang and Jiang-Cheng Wu and Tracy Lin and Wen-Chie Shei and Jason S. Chang},
title = {Bilingual Sentence Alignment Based on Punctuation Statistics and Lexicon},
booktitle = {Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP)},
year = 2004
}
Chuang et al. (2004)
David D. Palmer and Marti A. Hearst (1997): Adaptive Multilingual Sentence Boundary Disambiguation, Computational Linguistics
@Article{Palmer:1997,
author = {David D. Palmer and Marti A. Hearst},
title = {Adaptive Multilingual Sentence Boundary Disambiguation},
url = {http://acl.ldc.upenn.edu/J/J97/J97-2002.pdf?origin=publication\_detail},
googlescholar = {10610553735381302170},
journal = {Computational Linguistics},
volume = {23},
number = {3},
year = 1997
}
Palmer and Hearst (1997)

MT Research Survey Wiki

A Comprehensive Survey of Neural and Statistical Machine Translation Research Publications
