Comparable Corpora
A comparable corpus is a pair of corpora in two different languages that come from the same domain.
Comparable Corpora is the main subject of 33 publications. 12 are discussed here.
Parallel sentences may also be mined from comparable corpora such as news stories written on the same topic in different languages.
Munteanu, Dragos Stefan and Marcu, Daniel (2002):
Processing Comparable Corpora With Bilingual Suffix Trees, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)
@inproceedings{Munteanu:2002,
author = {Munteanu, Dragos Stefan and Marcu, Daniel},
title = {Processing Comparable Corpora With Bilingual Suffix Trees},
booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
url = {},
month = {July},
address = {Philadelphia},
publisher = {Association for Computational Linguistics},
pages = {289--295},
year = 2002
}
Munteanu and Marcu (2002) use suffix trees, and in later work log-likelihood ratios
Dragos Stefan Munteanu and Alexander Fraser and Daniel Marcu (2004):
Improved Machine Translation Performance via Parallel Sentence Extraction from Comparable Corpora, Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL)

author = {Dragos Stefan Munteanu and Alexander Fraser and Daniel Marcu},
title = {Improved Machine Translation Performance via Parallel Sentence Extraction from Comparable Corpora},
url = {\_Paper.pdf},
googlescholar = {13931404674250458886},
booktitle = {Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL)},
year = 2004
(Munteanu et al., 2004;
Dragos Stefan Munteanu and Daniel Marcu (2005):
Improving Machine Translation Performance by Exploiting Non-Parallel Corpora, Computational Linguistics

author = {Dragos Stefan Munteanu and Daniel Marcu},
title = {Improving Machine Translation Performance by Exploiting Non-Parallel Corpora},
url = {\_detail},
googlescholar = {15197760803213593510},
journal = {Computational Linguistics},
volume = {31},
number = {4},
year = 2005
Munteanu and Marcu, 2005), to detect parallel sentences.
Abdul-Rauf, Sadaf and Schwenk, Holger (2009):
On the Use of Comparable Corpora to Improve SMT performance, Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

author = {Abdul-Rauf, Sadaf and Schwenk, Holger},
title = {On the Use of Comparable Corpora to Improve {SMT} performance},
booktitle = {Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)},
month = {March},
address = {Athens, Greece},
publisher = {Association for Computational Linguistics},
pages = {16--23},
url = {},
year = 2009
Abdul-Rauf and Schwenk (2009);
Abdul Rauf, Sadaf and Schwenk, Holger (2009):
Exploiting Comparable Corpora with TER and TERp, Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora

author = {Abdul Rauf, Sadaf and Schwenk, Holger},
title = {Exploiting Comparable Corpora with TER and TERp},
booktitle = {Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora},
month = {August},
address = {Singapore},
publisher = {Association for Computational Linguistics},
pages = {46--54},
url = {},
year = 2009
Rauf and Schwenk (2009);
Sadaf Abdul Rauf and Holger Schwenk (2011):
Parallel sentence generation from comparable corpora for improved SMT, Machine Translation

author = {Sadaf Abdul Rauf and Holger Schwenk},
title = {Parallel sentence generation from comparable corpora for improved {SMT}},
pages = {341-375},
journal = {Machine Translation},
volume = {25},
number = {4},
month = {December},
year = 2011
Rauf and Schwenk (2011) translate one side of the comparable corpus into the other language, use information retrieval methods to find matching sentences and use the TER metric to measure their similarity.
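The three steps described above (translate, retrieve, score) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the source side has already been machine-translated, replaces the information retrieval step with a toy word-overlap ranking, and approximates TER by word-level edit distance divided by reference length, without the block shifts that real TER allows. All function names and the threshold value are hypothetical.

```python
from typing import List, Tuple


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def ter_like(hyp: str, ref: str) -> float:
    """Simplified TER: edits / reference length. Block shifts are NOT modelled."""
    hyp_tokens, ref_tokens = hyp.lower().split(), ref.lower().split()
    if not ref_tokens:
        return 0.0 if not hyp_tokens else 1.0
    return word_edit_distance(hyp_tokens, ref_tokens) / len(ref_tokens)


def retrieve(query: str, candidates: List[str], top_k: int = 3) -> List[str]:
    """Toy stand-in for an IR engine: rank candidates by shared word types."""
    q = set(query.lower().split())
    ranked = sorted(
        candidates,
        key=lambda c: len(q & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def mine_pairs(
    translated_source: List[str],
    target: List[str],
    threshold: float = 0.5,  # assumed cut-off, chosen only for illustration
) -> List[Tuple[int, str, float]]:
    """Return (source index, matched target sentence, score) for accepted pairs."""
    pairs = []
    for idx, sentence in enumerate(translated_source):
        candidates = retrieve(sentence, target)
        if not candidates:
            continue
        best = min(candidates, key=lambda c: ter_like(sentence, c))
        score = ter_like(sentence, best)
        if score <= threshold:  # lower score means more similar
            pairs.append((idx, best, score))
    return pairs


# Machine-translated source side (assumed given) and target-side sentences.
translated = [
    "the president visited paris on monday",
    "stock markets fell sharply",
]
target_side = [
    "the president visited paris on tuesday",
    "the weather was sunny in rome",
    "football results from the weekend",
]
mined = mine_pairs(translated, target_side)
```

In this toy data only the first sentence finds a close match; the second has no counterpart on the target side and is rejected by the threshold. In practice the accepted pairs would be the original source sentences paired with the matched target sentences, not the machine translations.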
D. Ştefănescu and R. Ion and S. Hunsicker (2012):
Hybrid Parallel Sentence Mining from Comparable Corpora, Proceedings of the 16th International Conference of the European Association for Machine Translation (EAMT)

author = {D. \c{S}tef\u{a}nescu and R. Ion and S. Hunsicker},
title = {Hybrid Parallel Sentence Mining from Comparable Corpora},
url = {},
pages = {137-144},
booktitle = {Proceedings of the 16th International Conference of the European Association for Machine Translation (EAMT)},
location = {Trento, Italy},
editor = {Mauro Cettolo and Marcello Federico and Lucia Specia and Andy Way},
year = 2012
Ştefănescu et al. (2012) report improvements with a more complex sentence similarity measure.
Instead of full sentences, parallel sentence fragments may be extracted from comparable corpora
Munteanu, Dragos Stefan and Marcu, Daniel (2006):
Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora, Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

author = {Munteanu, Dragos Stefan and Marcu, Daniel},
title = {Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora},
booktitle = {Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics},
month = {July},
address = {Sydney, Australia},
publisher = {Association for Computational Linguistics},
pages = {81--88},
url = {},
year = 2006
(Munteanu and Marcu, 2006). Methods have been proposed to extract matching phrases
Takaaki Tanaka (2002):
Measuring the Similarity between Compound Nouns in Different Languages Using Non-Parallel Corpora, Proceedings of the International Conference on Computational Linguistics (COLING)

author = {Takaaki Tanaka},
title = {Measuring the Similarity between Compound Nouns in Different Languages Using Non-Parallel Corpora},
url = {},
googlescholar = {15794656224648594415},
booktitle = {Proceedings of the International Conference on Computational Linguistics (COLING)},
year = 2002
(Tanaka, 2002) or web pages
Smith, Noah A. (2002):
From Words to Corpora: Recognizing Translation, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)

author = {Smith, Noah A.},
title = {From Words to Corpora: Recognizing Translation},
url = {},
googlescholar = {18379354085355431073},
booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
month = {July},
address = {Philadelphia},
publisher = {Association for Computational Linguistics},
pages = {95--102},
year = 2002
(Smith, 2002) from such large collections.
Chris Quirk and Raghavendra Udupa and Arul Menezes (2007):
Generative Models of Noisy Translations with Applications to Parallel Fragment Extraction, Proceedings of the MT Summit XI

author = {Chris Quirk and Raghavendra Udupa and Arul Menezes},
title = {Generative Models of Noisy Translations with Applications to Parallel Fragment Extraction},
url = {\_compcorp.pdf},
googlescholar = {9551046180396418551},
booktitle = {Proceedings of the {MT} Summit XI},
year = 2007
Quirk et al. (2007) propose a generative model for the same task.
Hewavitharana, Sanjika and Vogel, Stephan (2011):
Extracting Parallel Phrases from Comparable Data, Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

author = {Hewavitharana, Sanjika and Vogel, Stephan},
title = {Extracting Parallel Phrases from Comparable Data},
booktitle = {Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web},
month = {June},
address = {Portland, Oregon},
publisher = {Association for Computational Linguistics},
pages = {61--68},
url = {},
year = 2011
Hewavitharana and Vogel (2011) extract phrase pairs from comparable corpora, using a classifier approach.
Related Topics
The transition from parallel corpora, through noisy corpora that require cleaning, all the way to comparable corpora is fluid. A special topic is the extraction of bilingual dictionaries from comparable corpora. A comparable corpus always consists of two monolingual corpora. The target-side monolingual corpus may be used for training language models, and the source-side monolingual corpus may be used for some domain adaptation methods.
New Publications
Viktor Hangya and Fabienne Braune and Yuliya Kalasouskaya and Alexander Fraser (2018):
Unsupervised Parallel Sentence Extraction from Comparable Corpora, Proceedings of the International Workshop on Spoken Language Translation (IWSLT)

author = {Viktor Hangya and Fabienne Braune and Yuliya Kalasouskaya and Alexander Fraser},
title = {Unsupervised Parallel Sentence Extraction from Comparable Corpora},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
year = 2018
Hangya et al. (2018)
Marie, Benjamin and Fujita, Atsushi (2017):
Phrase Table Induction Using In-Domain Monolingual Data for Domain Adaptation in Statistical Machine Translation, Transactions of the Association for Computational Linguistics

author = {Marie, Benjamin and Fujita, Atsushi },
title = {Phrase Table Induction Using In-Domain Monolingual Data for Domain Adaptation in Statistical Machine Translation},
journal = {Transactions of the Association for Computational Linguistics},
volume = {5},
keywords = {{}},
issn = {2307-387X},
url = {},
pages = {487--500},
year = 2017
Marie and Fujita (2017)
Tufiş, Dan and Ion, Radu and Dumitrescu, Stefan and Stefanescu, Dan (2013):
Wikipedia as an SMT Training Corpus, Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

author = {Tufi\c{s}, Dan and Ion, Radu and Dumitrescu, Stefan and Stefanescu, Dan},
title = {Wikipedia as an {SMT} Training Corpus},
booktitle = {Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013},
month = {September},
address = {Hissar, Bulgaria},
publisher = {INCOMA Ltd. Shoumen, BULGARIA},
pages = {702--709},
url = {},
year = 2013
Tufiş et al. (2013)
Rios, Miguel and Sharoff, Serge (2015):
Obtaining SMT dictionaries for related languages, Proceedings of the Eighth Workshop on Building and Using Comparable Corpora

author = {Rios, Miguel and Sharoff, Serge},
title = {Obtaining {SMT} dictionaries for related languages},
booktitle = {Proceedings of the Eighth Workshop on Building and Using Comparable Corpora},
month = {July},
address = {Beijing, China},
publisher = {Association for Computational Linguistics},
pages = {68--73},
url = {},
year = 2015
Rios and Sharoff (2015)
Seo, Hyeong-Won and Cheon, Minah and Kim, Jae-Hoon (2015):
Extracting Bilingual Lexica from Comparable Corpora Using Self-Organizing Maps, Proceedings of the Eighth Workshop on Building and Using Comparable Corpora

author = {Seo, Hyeong-Won and Cheon, Minah and Kim, Jae-Hoon},
title = {Extracting Bilingual Lexica from Comparable Corpora Using Self-Organizing Maps},
booktitle = {Proceedings of the Eighth Workshop on Building and Using Comparable Corpora},
month = {July},
address = {Beijing, China},
publisher = {Association for Computational Linguistics},
pages = {62--67},
url = {},
year = 2015
Seo et al. (2015)
Rapp, Reinhard (2015):
A Methodology for Bilingual Lexicon Extraction from Comparable Corpora, Proceedings of the Fourth Workshop on Hybrid Approaches to Translation (HyTra)

author = {Rapp, Reinhard},
title = {A Methodology for Bilingual Lexicon Extraction from Comparable Corpora},
booktitle = {Proceedings of the Fourth Workshop on Hybrid Approaches to Translation (HyTra)},
month = {July},
address = {Beijing},
publisher = {Association for Computational Linguistics},
pages = {46--55},
url = {},
year = 2015
Rapp (2015)
Krzysztof Wolk and Krzysztof Marasek (2015):
Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents, Proceedings of the International Workshop on Spoken Language Translation (IWSLT)

author = {Krzysztof Wolk and Krzysztof Marasek},
title = {Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents},
pages = {118-125},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
location = {Da Nang, Vietnam},
url = {},
month = {December},
year = 2015
Wolk and Marasek (2015)
Krstovski, Kriste and Smith, David (2016):
Bootstrapping Translation Detection and Sentence Extraction from Comparable Corpora, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

author = {Krstovski, Kriste and Smith, David},
title = {Bootstrapping Translation Detection and Sentence Extraction from Comparable Corpora},
booktitle = {Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
address = {San Diego, California},
publisher = {Association for Computational Linguistics},
pages = {1127--1132},
url = {},
year = 2016
Krstovski and Smith (2016)
Barrón-Cedeño, Alberto and España-Bonet, Cristina and Boldoba, Josu and Màrquez, Lluís (2015):
A Factory of Comparable Corpora from Wikipedia, Proceedings of the Eighth Workshop on Building and Using Comparable Corpora
@InProceedings{Barronetal:2015,
author = {{Barr\'on-Cede{\~n}o}, Alberto and {Espa{\~n}a-Bonet}, Cristina and {Boldoba}, Josu and {M\`arquez}, Llu\'{i}s},
title = {A Factory of Comparable Corpora from Wikipedia},
booktitle = {Proceedings of the Eighth Workshop on Building and Using Comparable Corpora},
pages = {3--13},
month = {July},
date = {30},
address = {Beijing, China},
language = {english},
url = {},
year = 2015
}
Barrón-Cedeño et al. (2015)
Hazem, Amir and Morin, Emmanuel (2016):
Efficient Data Selection for Bilingual Terminology Extraction from Comparable Corpora, Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

author = {Hazem, Amir and Morin, Emmanuel},
title = {Efficient Data Selection for Bilingual Terminology Extraction from Comparable Corpora},
booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers},
month = {December},
address = {Osaka, Japan},
publisher = {The COLING 2016 Organizing Committee},
pages = {3401--3411},
url = {},
year = 2016
Hazem and Morin (2016)
Zhang, Meng and Liu, Yang and Luan, Huanbo and Liu, Yiqun and Sun, Maosong (2016):
Inducing Bilingual Lexica From Non-Parallel Data With Earth Mover's Distance Regularization, Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

author = {Zhang, Meng and Liu, Yang and Luan, Huanbo and Liu, Yiqun and Sun, Maosong},
title = {Inducing Bilingual Lexica From Non-Parallel Data With Earth Mover's Distance Regularization},
booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers},
month = {December},
address = {Osaka, Japan},
publisher = {The COLING 2016 Organizing Committee},
pages = {3188--3198},
url = {},
year = 2016
Zhang et al. (2016)
Liu, Chunyang and Liu, Yang and Sun, Maosong and Luan, Huanbo and Yu, Heng (2016):
Agreement-based Learning of Parallel Lexicons and Phrases from Non-Parallel Corpora, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

author = {Liu, Chunyang and Liu, Yang and Sun, Maosong and Luan, Huanbo and Yu, Heng},
title = {Agreement-based Learning of Parallel Lexicons and Phrases from Non-Parallel Corpora},
booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {1024--1033},
url = {},
year = 2016
Liu et al. (2016)
Dou, Qing and Vaswani, Ashish and Knight, Kevin and Dyer, Chris (2015):
Unifying Bayesian Inference and Vector Space Models for Improved Decipherment, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

author = {Dou, Qing and Vaswani, Ashish and Knight, Kevin and Dyer, Chris},
title = {Unifying Bayesian Inference and Vector Space Models for Improved Decipherment},
booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
month = {July},
address = {Beijing, China},
publisher = {Association for Computational Linguistics},
pages = {836--845},
url = {},
year = 2015
Dou et al. (2015)
Nuhn, Malte and Schamper, Julian and Ney, Hermann (2015):
UNRAVEL---A Decipherment Toolkit, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

author = {Nuhn, Malte and Schamper, Julian and Ney, Hermann},
title = {UNRAVEL---A Decipherment Toolkit},
booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},
month = {July},
address = {Beijing, China},
publisher = {Association for Computational Linguistics},
pages = {549--553},
url = {},
year = 2015
Nuhn et al. (2015)
Meiping Dong and Yang Liu and Huanbo Luan and Maosong Sun and Tatsuya Izuha and Dakun Zhang (2015):
Iterative Learning of Parallel Lexicons and Phrases from Non-Parallel Corpora, Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI)

author = {Meiping Dong and Yang Liu and Huanbo Luan and Maosong Sun and Tatsuya Izuha and Dakun Zhang},
title = {Iterative Learning of Parallel Lexicons and Phrases from Non-Parallel Corpora},
pages = {1250--1256},
booktitle = {Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI)},
url = {},
location = {Buenos Aires, Argentina},
year = 2015
Dong et al. (2015)
Chu, Chenhui and Nakazawa, Toshiaki and Kurohashi, Sadao (2013):
Accurate Parallel Fragment Extraction from Quasi--Comparable Corpora using Alignment Model and Translation Lexicon, Proceedings of the Sixth International Joint Conference on Natural Language Processing

author = {Chu, Chenhui and Nakazawa, Toshiaki and Kurohashi, Sadao},
title = {Accurate Parallel Fragment Extraction from Quasi--Comparable Corpora using Alignment Model and Translation Lexicon},
booktitle = {Proceedings of the Sixth International Joint Conference on Natural Language Processing},
month = {October},
address = {Nagoya, Japan},
publisher = {Asian Federation of Natural Language Processing},
pages = {1144--1150},
url = {},
year = 2013
Chu et al. (2013)
Fu, Xiaoyin and Wei, Wei and Lu, Shixiang and Chen, Zhenbiao and Xu, Bo (2013):
Phrase-based Parallel Fragments Extraction from Comparable Corpora, Proceedings of the Sixth International Joint Conference on Natural Language Processing

author = {Fu, Xiaoyin and Wei, Wei and Lu, Shixiang and Chen, Zhenbiao and Xu, Bo},
title = {Phrase-based Parallel Fragments Extraction from Comparable Corpora},
booktitle = {Proceedings of the Sixth International Joint Conference on Natural Language Processing},
month = {October},
address = {Nagoya, Japan},
publisher = {Asian Federation of Natural Language Processing},
pages = {972--976},
url = {},
year = 2013
Fu et al. (2013)
McCrae, John Philip and Cimiano, Philipp (2013):
Mining translations from the web of open linked data, Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction

author = {McCrae, John Philip and Cimiano, Philipp},
title = {Mining translations from the web of open linked data},
booktitle = {Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction},
month = {September},
address = {Hissar, Bulgaria},
publisher = {INCOMA Ltd. Shoumen, BULGARIA},
pages = {8--11},
url = {},
year = 2013
McCrae and Cimiano (2013)
Lapshinova-Koltunski, Ekaterina (2013):
VARTRA: A Comparable Corpus for Analysis of Translation Variation, Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

author = {Lapshinova-Koltunski, Ekaterina},
title = {VARTRA: A Comparable Corpus for Analysis of Translation Variation},
booktitle = {Proceedings of the Sixth Workshop on Building and Using Comparable Corpora},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {77--86},
url = {},
year = 2013
Lapshinova-Koltunski (2013)
Preiss, Judita (2012):
Identifying Comparable Corpora Using LDA, Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

author = {Preiss, Judita},
title = {Identifying Comparable Corpora Using LDA},
booktitle = {Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
address = {Montr\'{e}al, Canada},
publisher = {Association for Computational Linguistics},
pages = {558--562},
url = {},
year = 2012
Preiss (2012)
Toni Badia and Gemma Boleda and Maite Melero and Antoni Oliver (2005):
An n-gram Approach to Exploiting a Monolingual Corpus for Machine Translation, Proceedings of the Workshop on Example-based Machine Translation at MT Summit X

author = {Toni Badia and Gemma Boleda and Maite Melero and Antoni Oliver},
title = {An n-gram Approach to Exploiting a Monolingual Corpus for Machine Translation},
url = {},
googlescholar = {11888473321532340496},
booktitle = {Proceedings of the Workshop on Example-based Machine Translation at {MT} Summit X},
month = {September},
address = {Phuket, Thailand},
year = 2005
Badia et al. (2005)