Comparable Corpora

A comparable corpus is a pair of corpora in two different languages that come from the same domain.

Comparable Corpora is the main subject of 33 publications. 12 are discussed here.

Publications

Parallel sentences may also be mined from comparable corpora such as news stories written on the same topic in different languages. Munteanu and Marcu (2002) use suffix trees, and in later work log-likelihood ratios (Munteanu et al., 2004; Munteanu and Marcu, 2005), to detect parallel sentences.

Abdul-Rauf and Schwenk (2009); Rauf and Schwenk (2009); Rauf and Schwenk (2011) translate one side of the comparable corpus into the other language, use information retrieval methods to find matching sentences, and use the TER metric to measure their similarity. Ştefănescu et al. (2012) report improvements with a more complex sentence similarity measure.
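
The translate, retrieve, and score pipeline described above can be illustrated with a minimal Python sketch. It is not the implementation of any of the cited papers. It assumes the source side has already been machine-translated, uses plain token overlap as a stand-in for an information retrieval engine, and approximates TER by a word-level edit rate without the block shifts that real TER allows. The function names and the `max_rate` threshold are hypothetical.

```python
from collections import Counter


def word_edit_rate(hyp: str, ref: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: a simplified stand-in for TER (real TER also counts shifts).
    """
    h, r = hyp.split(), ref.split()
    if not r:
        return 0.0 if not h else float("inf")
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, start=1):
        cur = [i]
        for j, rw in enumerate(r, start=1):
            cost = 0 if hw == rw else 1
            cur.append(min(prev[j] + 1,          # delete a hypothesis word
                           cur[j - 1] + 1,       # insert a reference word
                           prev[j - 1] + cost))  # match or substitute
        prev = cur
    return prev[-1] / len(r)


def overlap(a: str, b: str) -> int:
    """Number of shared word tokens (crude stand-in for an IR score)."""
    return sum((Counter(a.split()) & Counter(b.split())).values())


def mine_pairs(translated_src, tgt, max_rate=0.5):
    """Retrieve the best-overlapping target sentence for each translated
    source sentence and keep the pair if its edit rate is low enough."""
    pairs = []
    if not tgt:
        return pairs
    for i, query in enumerate(translated_src):
        j = max(range(len(tgt)), key=lambda k: overlap(query, tgt[k]))
        rate = word_edit_rate(query, tgt[j])
        if rate <= max_rate:
            pairs.append((i, j, rate))
    return pairs


if __name__ == "__main__":
    translated = ["the cat sat on the mat", "stocks fell sharply today"]
    target = ["markets were calm", "the cat sat on a mat",
              "rain is expected tomorrow"]
    for src_idx, tgt_idx, rate in mine_pairs(translated, target):
        print(src_idx, tgt_idx, round(rate, 3))
```

In this toy run only the first translated sentence finds a close match. The second is retrieved against an unrelated sentence and is rejected by the edit-rate threshold, which is the filtering role the similarity measure plays in the approach described above.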

Instead of full sentences, parallel sentence fragments may be extracted from comparable corpora (Munteanu and Marcu, 2006). Methods have been proposed to extract matching phrases (Tanaka, 2002) or web pages (Smith, 2002) from such large collections. Quirk et al. (2007) propose a generative model for the same task.

Hewavitharana and Vogel (2011) extract phrase pairs from comparable corpora, using a classifier approach.

Benchmarks

Discussion

New Publications

Viktor Hangya and Fabienne Braune and Yuliya Kalasouskaya and Alexander Fraser (2018): Unsupervised Parallel Sentence Extraction from Comparable Corpora, Proceedings of the International Workshop on Spoken Language Translation (IWSLT)
@inproceedings{iwslt18-Unsupervised-Hangya,
author = {Viktor Hangya and Fabienne Braune and Yuliya Kalasouskaya and Alexander Fraser},
title = {Unsupervised Parallel Sentence Extraction from Comparable Corpora},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
year = 2018
}
Hangya et al. (2018)
Marie, Benjamin and Fujita, Atsushi (2017): Phrase Table Induction Using In-Domain Monolingual Data for Domain Adaptation in Statistical Machine Translation, Transactions of the Association for Computational Linguistics
@article{TACL1166,
author = {Marie, Benjamin and Fujita, Atsushi },
title = {Phrase Table Induction Using In-Domain Monolingual Data for Domain Adaptation in Statistical Machine Translation},
journal = {Transactions of the Association for Computational Linguistics},
volume = {5},
keywords = {{}},
issn = {2307-387X},
url = {https://transacl.org/ojs/index.php/tacl/article/view/1166},
pages = {487--500},
year = 2017
}
Marie and Fujita (2017)
Tufiş, Dan and Ion, Radu and Dumitrescu, Stefan and Stefanescu, Dan (2013): Wikipedia as an SMT Training Corpus, Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013
@InProceedings{tufics-EtAl:2013:RANLP-2013,
author = {Tufi\c{s}, Dan and Ion, Radu and Dumitrescu, Stefan and Stefanescu, Dan},
title = {Wikipedia as an {SMT} Training Corpus},
booktitle = {Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013},
month = {September},
address = {Hissar, Bulgaria},
publisher = {INCOMA Ltd. Shoumen, BULGARIA},
pages = {702--709},
url = {http://www.aclweb.org/anthology/R13-1091},
year = 2013
}
Tufiş et al. (2013)
Rios, Miguel and Sharoff, Serge (2015): Obtaining SMT dictionaries for related languages, Proceedings of the Eighth Workshop on Building and Using Comparable Corpora
@InProceedings{rios-sharoff:2015:BUCC,
author = {Rios, Miguel and Sharoff, Serge},
title = {Obtaining {SMT} dictionaries for related languages},
booktitle = {Proceedings of the Eighth Workshop on Building and Using Comparable Corpora},
month = {July},
address = {Beijing, China},
publisher = {Association for Computational Linguistics},
pages = {68--73},
url = {http://www.aclweb.org/anthology/W15-3410},
year = 2015
}
Rios and Sharoff (2015)
Seo, Hyeong-Won and Cheon, Minah and Kim, Jae-Hoon (2015): Extracting Bilingual Lexica from Comparable Corpora Using Self-Organizing Maps, Proceedings of the Eighth Workshop on Building and Using Comparable Corpora
@InProceedings{seo-cheon-kim:2015:BUCC,
author = {Seo, Hyeong-Won and Cheon, Minah and Kim, Jae-Hoon},
title = {Extracting Bilingual Lexica from Comparable Corpora Using Self-Organizing Maps},
booktitle = {Proceedings of the Eighth Workshop on Building and Using Comparable Corpora},
month = {July},
address = {Beijing, China},
publisher = {Association for Computational Linguistics},
pages = {62--67},
url = {http://www.aclweb.org/anthology/W15-3409},
year = 2015
}
Seo et al. (2015)
Rapp, Reinhard (2015): A Methodology for Bilingual Lexicon Extraction from Comparable Corpora, Proceedings of the Fourth Workshop on Hybrid Approaches to Translation (HyTra)
@InProceedings{rapp:2015:HyTra-4,
author = {Rapp, Reinhard},
title = {A Methodology for Bilingual Lexicon Extraction from Comparable Corpora},
booktitle = {Proceedings of the Fourth Workshop on Hybrid Approaches to Translation (HyTra)},
month = {July},
address = {Beijing},
publisher = {Association for Computational Linguistics},
pages = {46--55},
url = {http://www.aclweb.org/anthology/W15-4108},
year = 2015
}
Rapp (2015)
Krzysztof Wolk and Krzysztof Marasek (2015): Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents, Proceedings of the International Workshop on Spoken Language Translation (IWSLT)
@inproceedings{IWSLT-2015-Wolk-2,
author = {Krzysztof Wolk and Krzysztof Marasek},
title = {Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents},
pages = {118--125},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
location = {Da Nang, Vietnam},
url = {http://www.mt-archive.info/15/IWSLT-2015-wolk-2.pdf},
month = {December},
year = 2015
}
Wolk and Marasek (2015)
Krstovski, Kriste and Smith, David (2016): Bootstrapping Translation Detection and Sentence Extraction from Comparable Corpora, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
@InProceedings{krstovski-smith:2016:N16-1,
author = {Krstovski, Kriste and Smith, David},
title = {Bootstrapping Translation Detection and Sentence Extraction from Comparable Corpora},
booktitle = {Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
address = {San Diego, California},
publisher = {Association for Computational Linguistics},
pages = {1127--1132},
url = {http://www.aclweb.org/anthology/N16-1132},
year = 2016
}
Krstovski and Smith (2016)
Barrón-Cedeño, Alberto and España-Bonet, Cristina and Boldoba, Josu and Màrquez, Lluís (2015): A Factory of Comparable Corpora from Wikipedia, Proceedings of the Eighth Workshop on Building and Using Comparable Corpora
@InProceedings{Barronetal:2015,
author = {{Barr\'on-Cede{\~n}o}, Alberto and {Espa{\~n}a-Bonet}, Cristina and {Boldoba}, Josu and {M\`arquez}, Llu\'{i}s},
title = {A Factory of Comparable Corpora from Wikipedia},
booktitle = {Proceedings of the Eighth Workshop on Building and Using Comparable Corpora},
pages = {3--13},
month = {July},
date = {30},
address = {Beijing, China},
language = {english},
url = {http://www.aclweb.org/anthology/W15-3402},
year = 2015
}
Barrón-Cedeño et al. (2015)
Hazem, Amir and Morin, Emmanuel (2016): Efficient Data Selection for Bilingual Terminology Extraction from Comparable Corpora, Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
@InProceedings{hazem-morin:2016:COLING,
author = {Hazem, Amir and Morin, Emmanuel},
title = {Efficient Data Selection for Bilingual Terminology Extraction from Comparable Corpora},
booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers},
month = {December},
address = {Osaka, Japan},
publisher = {The COLING 2016 Organizing Committee},
pages = {3401--3411},
url = {http://aclweb.org/anthology/C16-1321},
year = 2016
}
Hazem and Morin (2016)
Zhang, Meng and Liu, Yang and Luan, Huanbo and Liu, Yiqun and Sun, Maosong (2016): Inducing Bilingual Lexica From Non-Parallel Data With Earth Mover's Distance Regularization, Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
@InProceedings{zhang-EtAl:2016:COLING6,
author = {Zhang, Meng and Liu, Yang and Luan, Huanbo and Liu, Yiqun and Sun, Maosong},
title = {Inducing Bilingual Lexica From Non-Parallel Data With Earth Mover's Distance Regularization},
booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers},
month = {December},
address = {Osaka, Japan},
publisher = {The COLING 2016 Organizing Committee},
pages = {3188--3198},
url = {http://aclweb.org/anthology/C16-1300},
year = 2016
}
Zhang et al. (2016)
Liu, Chunyang and Liu, Yang and Sun, Maosong and Luan, Huanbo and Yu, Heng (2016): Agreement-based Learning of Parallel Lexicons and Phrases from Non-Parallel Corpora, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{liu-EtAl:2016:P16-11,
author = {Liu, Chunyang and Liu, Yang and Sun, Maosong and Luan, Huanbo and Yu, Heng},
title = {Agreement-based Learning of Parallel Lexicons and Phrases from Non-Parallel Corpora},
booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {1024--1033},
url = {http://www.aclweb.org/anthology/P16-1097},
year = 2016
}
Liu et al. (2016)
Dou, Qing and Vaswani, Ashish and Knight, Kevin and Dyer, Chris (2015): Unifying Bayesian Inference and Vector Space Models for Improved Decipherment, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
@InProceedings{dou-EtAl:2015:ACL-IJCNLP,
author = {Dou, Qing and Vaswani, Ashish and Knight, Kevin and Dyer, Chris},
title = {Unifying Bayesian Inference and Vector Space Models for Improved Decipherment},
booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
month = {July},
address = {Beijing, China},
publisher = {Association for Computational Linguistics},
pages = {836--845},
url = {http://www.aclweb.org/anthology/P15-1081},
year = 2015
}
Dou et al. (2015)
Nuhn, Malte and Schamper, Julian and Ney, Hermann (2015): UNRAVEL---A Decipherment Toolkit, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
@InProceedings{nuhn-schamper-ney:2015:ACL-IJCNLP,
author = {Nuhn, Malte and Schamper, Julian and Ney, Hermann},
title = {UNRAVEL---A Decipherment Toolkit},
booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},
month = {July},
address = {Beijing, China},
publisher = {Association for Computational Linguistics},
pages = {549--553},
url = {http://www.aclweb.org/anthology/P15-2090},
year = 2015
}
Nuhn et al. (2015)
Meiping Dong and Yang Liu and Huanbo Luan and Maosong Sun and Tatsuya Izuha and Dakun Zhang (2015): Iterative Learning of Parallel Lexicons and Phrases from Non-Parallel Corpora, Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI)
@inproceedings{Dong:2015:ijcai,
author = {Meiping Dong and Yang Liu and Huanbo Luan and Maosong Sun and Tatsuya Izuha and Dakun Zhang},
title = {Iterative Learning of Parallel Lexicons and Phrases from Non-Parallel Corpora},
pages = {1250--1256},
booktitle = {Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI)},
url = {http://ijcai.org/papers15/Papers/IJCAI15-180.pdf},
location = {Buenos Aires, Argentina},
year = 2015
}
Dong et al. (2015)
Chu, Chenhui and Nakazawa, Toshiaki and Kurohashi, Sadao (2013): Accurate Parallel Fragment Extraction from Quasi--Comparable Corpora using Alignment Model and Translation Lexicon, Proceedings of the Sixth International Joint Conference on Natural Language Processing
@InProceedings{chu-nakazawa-kurohashi:2013:IJCNLP,
author = {Chu, Chenhui and Nakazawa, Toshiaki and Kurohashi, Sadao},
title = {Accurate Parallel Fragment Extraction from Quasi--Comparable Corpora using Alignment Model and Translation Lexicon},
booktitle = {Proceedings of the Sixth International Joint Conference on Natural Language Processing},
month = {October},
address = {Nagoya, Japan},
publisher = {Asian Federation of Natural Language Processing},
pages = {1144--1150},
url = {http://www.aclweb.org/anthology/I13-1163},
year = 2013
}
Chu et al. (2013)
Fu, Xiaoyin and Wei, Wei and Lu, Shixiang and Chen, Zhenbiao and Xu, Bo (2013): Phrase-based Parallel Fragments Extraction from Comparable Corpora, Proceedings of the Sixth International Joint Conference on Natural Language Processing
@InProceedings{fu-EtAl:2013:IJCNLP,
author = {Fu, Xiaoyin and Wei, Wei and Lu, Shixiang and Chen, Zhenbiao and Xu, Bo},
title = {Phrase-based Parallel Fragments Extraction from Comparable Corpora},
booktitle = {Proceedings of the Sixth International Joint Conference on Natural Language Processing},
month = {October},
address = {Nagoya, Japan},
publisher = {Asian Federation of Natural Language Processing},
pages = {972--976},
url = {http://www.aclweb.org/anthology/I13-1129},
year = 2013
}
Fu et al. (2013)
McCrae, John Philip and Cimiano, Philipp (2013): Mining translations from the web of open linked data, Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction
@InProceedings{mccrae-cimiano:2013:NLP-LOD-SWAIE,
author = {McCrae, John Philip and Cimiano, Philipp},
title = {Mining translations from the web of open linked data},
booktitle = {Proceedings of the Joint Workshop on NLP\&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction},
month = {September},
address = {Hissar, Bulgaria},
publisher = {INCOMA Ltd. Shoumen, BULGARIA},
pages = {8--11},
url = {http://www.aclweb.org/anthology/W13-5203},
year = 2013
}
McCrae and Cimiano (2013)
Lapshinova-Koltunski, Ekaterina (2013): VARTRA: A Comparable Corpus for Analysis of Translation Variation, Proceedings of the Sixth Workshop on Building and Using Comparable Corpora
@InProceedings{lapshinovakoltunski:2013:BUCC,
author = {Lapshinova-Koltunski, Ekaterina},
title = {VARTRA: A Comparable Corpus for Analysis of Translation Variation},
booktitle = {Proceedings of the Sixth Workshop on Building and Using Comparable Corpora},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {77--86},
url = {http://www.aclweb.org/anthology/W13-2510},
year = 2013
}
Lapshinova-Koltunski (2013)
Preiss, Judita (2012): Identifying Comparable Corpora Using LDA, Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
@InProceedings{preiss:2012:NAACL-HLT,
author = {Preiss, Judita},
title = {Identifying Comparable Corpora Using LDA},
booktitle = {Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
address = {Montr\'{e}al, Canada},
publisher = {Association for Computational Linguistics},
pages = {558--562},
url = {http://www.aclweb.org/anthology/N12-1065},
year = 2012
}
Preiss (2012)
Toni Badia and Gemma Boleda and Maite Melero and Antoni Oliver (2005): An n-gram Approach to Exploiting a Monolingual Corpus for Machine Translation, Proceedings of the Workshop on Example-based Machine Translation at MT Summit X
@InProceedings{Badia:2005:MTS,
author = {Toni Badia and Gemma Boleda and Maite Melero and Antoni Oliver},
title = {An n-gram Approach to Exploiting a Monolingual Corpus for Machine Translation},
url = {http://mt-archive.info/MTS-2005-Badia.pdf},
googlescholar = {11888473321532340496},
booktitle = {Proceedings of the Workshop on Example-based Machine Translation at {MT} Summit X},
month = {September},
address = {Phuket, Thailand},
year = 2005
}
Badia et al. (2005)
