Collecting Parallel Corpora
The web is the main source of parallel corpora today, and mining it requires a number of processing steps. Other data resources have also been explored.
Parallel Corpora is the main subject of 84 publications. 30 are discussed here.
Publications
Philip Resnik (1999):
Mining the Web for Bilingual Text, Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL)
@Inproceedings{Resnik:1999,
author = {Philip Resnik},
title = {Mining the Web for Bilingual Text},
url = {
http://acl.ldc.upenn.edu/P/P99/P99-1068.pdf},
googlescholar = {4360226935188574245},
booktitle = {Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL)},
year = 1999
}
Resnik (1999) describes a method to automatically find parallel documents on the web.
Fukushima, Ken'ichi and Taura, Kenjiro and Chikayama, Takashi (2006):
A Fast and Accurate Method for Detecting English-Japanese Parallel Texts, Proceedings of the Workshop on Multilingual Language Resources and Interoperability
@InProceedings{fukushima-taura-chikayama:2006:MLRI,
author = {Fukushima, Ken'ichi and Taura, Kenjiro and Chikayama, Takashi},
title = {A Fast and Accurate Method for Detecting {English-Japanese} Parallel Texts},
booktitle = {Proceedings of the Workshop on Multilingual Language Resources and Interoperability},
month = {July},
address = {Sydney, Australia},
publisher = {Association for Computational Linguistics},
pages = {60--67},
url = {
http://www.aclweb.org/anthology/W/W06/W06-1008},
year = 2006
}
Fukushima et al. (2006) use a dictionary to detect parallel documents, while
Bo Li and Juan Liu (2008):
Mining Chinese-English Parallel Corpora from the Web, Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP)
@inproceedings{Li:2008:IJCNLP,
author = {Bo Li and Juan Liu},
title = {Mining {C}hinese-{E}nglish Parallel Corpora from the Web},
url = {
http://www.newdesign.aclweb.org/anthology-new/I/I08/I08-2120.pdf},
googlescholar = {8567864887980266695},
booktitle = {Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP)},
year = 2008
}
Li and Liu (2008) use a number of criteria such as similarity of the URL and page content. Acquiring parallel corpora, however, typically requires some manual involvement
Philipp Koehn (2002):
Europarl: A Multilingual Corpus for Evaluation of Machine Translation
@misc{Europarl,
author = {Philipp Koehn},
title = {Europarl: A Multilingual Corpus for Evaluation of Machine Translation},
howpublished = {Unpublished, {\tt
http://www.isi.edu/$\sim$koehn/europarl/}},
year = 2002
}
(Koehn, 2002;
Joel Martin and Howard Johnson and Benoit Farley and Anna Maclachlan (2003):
Aligning and Using an English-Inuktitut Parallel Corpus, HLT-NAACL 2003 Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond
@inproceedings{Martin:2003,
author = {Joel Martin and Howard Johnson and Benoit Farley and Anna Maclachlan },
title = {Aligning and Using an {English-Inuktitut} Parallel Corpus},
url = {
http://acl.ldc.upenn.edu/W/W03/W03-0320.pdf},
booktitle = {HLT-NAACL 2003 Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond},
editor = {Rada Mihalcea and Ted Pedersen},
month = {May 31},
address = {Edmonton, Alberta, Canada},
publisher = {Association for Computational Linguistics},
year = 2003
}
Martin et al., 2003;
Philipp Koehn (2005):
Europarl: A Parallel Corpus for Statistical Machine Translation, Proceedings of the Tenth Machine Translation Summit (MT Summit X)
@InProceedings{Koehn:2005:MTS,
author = {Philipp Koehn},
title = {Europarl: A Parallel Corpus for Statistical Machine Translation},
url = {
http://mt-archive.info/MTS-2005-Koehn.pdf},
googlescholar = {6985235632472432229},
booktitle = {Proceedings of the Tenth Machine Translation Summit (MT Summit X)},
month = {September},
address = {Phuket, Thailand},
year = 2005
}
Koehn, 2005), including the matching of documents
Utiyama, Masao and Isahara, Hitoshi (2003):
Reliable Measures for Aligning Japanese-English News Articles and Sentences, Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics
@inproceedings{Utiyama:2003,
author = {Utiyama, Masao and Isahara, Hitoshi},
title = {Reliable Measures for Aligning {Japanese-English} News Articles and Sentences},
booktitle = {Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics},
editor = {Erhard Hinrichs and Dan Roth},
url = {
http://www.aclweb.org/anthology/P03-1010.pdf},
pages = {72--79},
year = 2003
}
(Utiyama and Isahara, 2003). A large collection of corpora is maintained at the OPUS web site
Jörg Tiedemann (2012):
Parallel Data, Tools and Interfaces in OPUS, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012)
@inproceedings{TIEDEMANN12.463.L12-1246,
author = {J{\"o}rg Tiedemann},
title = {Parallel Data, Tools and Interfaces in OPUS},
url = {
http://www.lrec-conf.org/proceedings/lrec2012/pdf/463\_Paper.pdf},
note = {ACL Anthology Identifier: L12-1246},
booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012)},
month = {May},
address = {Istanbul, Turkey},
editor = {Nicoletta Calzolari and Khalid Choukri and Thierry Declerck and Mehmet U\u{g}ur Do\u{g}an and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-7-7},
language = {English},
pages = {2214--2218},
year = 2012
}
(Tiedemann, 2012).
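As a rough illustration of the URL-similarity criterion mentioned above for Li and Liu (2008), the following Python sketch pairs pages whose URLs differ only in a language marker. It is a minimal sketch under stated assumptions, not the procedure of any cited paper: the marker lists in `LANG_MARKERS` and the function names `url_key` and `candidate_pairs` are hypothetical, and a real system would combine this signal with page content.

```python
import re
from collections import defaultdict

# Hypothetical language markers; real sites use many other conventions.
LANG_MARKERS = {
    "en": ["en", "eng", "english"],
    "zh": ["zh", "cn", "chinese"],
}


def url_key(url, lang):
    """Replace a language marker for `lang` in the URL with '*'.

    Returns None if the URL contains no known marker for that language.
    """
    for marker in LANG_MARKERS[lang]:
        # Match the marker only as a whole token, not inside a longer word.
        pattern = r"(?<![A-Za-z0-9])" + re.escape(marker) + r"(?![A-Za-z0-9])"
        key, count = re.subn(pattern, "*", url, flags=re.IGNORECASE)
        if count:
            return key
    return None


def candidate_pairs(urls_a, urls_b, lang_a="en", lang_b="zh"):
    """Return (url_a, url_b) pairs whose language-neutral keys coincide."""
    index = defaultdict(list)
    for url in urls_a:
        key = url_key(url, lang_a)
        if key is not None:
            index[key].append(url)
    pairs = []
    for url in urls_b:
        key = url_key(url, lang_b)
        if key is not None:
            for match in index.get(key, []):
                pairs.append((match, url))
    return pairs
```

Pairs found this way are only candidates. Pages with matching URLs may still differ in content, which is why content-based checks are typically applied afterwards.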
Masao Uchiyama and Hitoshi Isahara (2007):
A Japanese-English Patent Parallel Corpus, Proceedings of the MT Summit XI
@inproceedings{Uchiyama:2007:MTSummit,
author = {Masao Uchiyama and Hitoshi Isahara},
title = {A {J}apanese-{E}nglish Patent Parallel Corpus},
booktitle = {Proceedings of the {MT} Summit XI},
year = 2007
}
Uchiyama and Isahara (2007) report on the efforts to build a Japanese-English patent corpus and
Lieve Macken and Julia Trushkina and Lidia Rura (2007):
Dutch Parallel Corpus: MT Corpus and translator's aid, Proceedings of the MT Summit XI
@inproceedings{Macken:2007:MTSummit,
author = {Lieve Macken and Julia Trushkina and Lidia Rura},
title = {{D}utch Parallel Corpus: {MT} Corpus and translator's aid},
url = {
http://www.mt-archive.info/MTS-2007-Macken.pdf},
googlescholar = {1625623404376163668},
booktitle = {Proceedings of the {MT} Summit XI},
year = 2007
}
Macken et al. (2007) on efforts to build a broad-based Dutch-English corpus.
Wolfgang Täger (2011):
The Sentence-Aligned European Patent Corpus, Proceedings of the 15th International Conference of the European Association for Machine Translation (EAMT)
@inproceedings{eamt11:Taeger,
author = {Wolfgang T{\"a}ger},
title = {The Sentence-Aligned European Patent Corpus},
url = {
http://mt-archive.info/EAMT-2011-Tager.pdf},
googlescholar = {8983346114011238566},
pages = {177--184},
booktitle = {Proceedings of the 15th International Conference of the European Association for Machine Translation (EAMT)},
location = {Leuven, Belgium},
editor = {Mikel L. Forcada and Heidi Depraetere and Vincent Vandeghinste},
year = 2011
}
Täger (2011) describes the creation of the European patent corpus.
M. Cettolo and C. Girardi and M. Federico (2012):
WIT3: Web Inventory of Transcribed and Translated Talks, Proceedings of the 16th International Conference of the European Association for Machine Translation (EAMT)
mentioned in Parallel Corpora and Evaluation Campaigns
@inproceedings{EAMT-2012-Cettolo,
author = {M. Cettolo and C. Girardi and M. Federico},
title = {WIT3: Web Inventory of Transcribed and Translated Talks},
url = {
http://www.mt-archive.info/EAMT-2012-Cettolo},
pages = {261--268},
booktitle = {Proceedings of the 16th International Conference of the European Association for Machine Translation (EAMT)},
location = {Trento, Italy},
editor = {Mauro Cettolo and Marcello Federico and Lucia Specia and Andy Way},
year = 2012
}
Cettolo et al. (2012) explain the creation of a multilingual parallel corpus of subtitles from the TED Talks website. A discussion of the pitfalls during the construction of parallel corpora is given by
Heiki-Jaan Kaalep and Kaarel Veskis (2007):
Comparing Parallel Corpora and Evaluating their Quality, Proceedings of the MT Summit XI
@inproceedings{Kaalep:2007:MTSummit,
author = {Heiki-Jaan Kaalep and Kaarel Veskis},
title = {Comparing Parallel Corpora and Evaluating their Quality},
url = {
http://www.cl.ut.ee/yllitised/summit2007.pdf},
googlescholar = {11072725916960369152},
booktitle = {Proceedings of the {MT} Summit XI},
year = 2007
}
Kaalep and Veskis (2007). A 200 million word Czech-English corpus from various sources was collected
Ondřej Bojar and Adam Liška and Zdeněk Žabokrtský (2010):
Evaluating Utility of Data Sources in a Large Parallel Czech-English Corpus CzEng 0.9, Proceedings of LREC2010
@inProceedings{czeng09:lrec2010,
author = {Ond{\v{r}}ej Bojar and Adam Li\v{s}ka and Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}}},
title = {Evaluating Utility of Data Sources in a Large Parallel {Czech-English} Corpus {CzEng} 0.9},
url = {
http://www.lrec-conf.org/proceedings/lrec2010/pdf/642\_Paper.pdf},
booktitle = {Proceedings of LREC2010},
year = 2010
}
(Bojar et al., 2010) and linguistically annotated
Ondřej Bojar and Zdeněk Žabokrtský and Ondřej Dušek and Petra Galuščáková and Martin Majliš and David Mareček and Jiří Maršík and Michal Novák and Martin Popel and Aleš Tamchyna (2012):
The Joy of Parallelism with CzEng 1.0, Proceedings of LREC2012
@inProceedings{czeng10:lrec2012,
author = {Ond{\v{r}}ej Bojar and Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}} and Ond{\v{r}}ej Du{\v{s}}ek and Petra Galu{\v{s}}{\v{c}}{\'{a}}kov{\'{a}} and Martin Majli{\v{s}} and David Mare{\v{c}}ek and Ji{\v{r}}{\'{\i}} Mar{\v{s}}{\'{\i}}k and Michal Nov{\'{a}}k and Martin Popel and Ale{\v{s}} Tamchyna},
title = {The Joy of Parallelism with CzEng 1.0},
booktitle = {Proceedings of LREC2012},
organization = {ELRA},
address = {Istanbul, Turkey},
month = {May},
url = {
http://www.mt-archive.info/LREC-2012-Bojar.pdf},
publisher = {European Language Resources Association},
year = 2012
}
(Bojar et al., 2012).
Uszkoreit, Jakob and Ponte, Jay and Popat, Ashok and Dubiner, Moshe (2010):
Large Scale Parallel Document Mining for Machine Translation, Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)
@InProceedings{uszkoreit-EtAl:2010:PAPERS,
author = {Uszkoreit, Jakob and Ponte, Jay and Popat, Ashok and Dubiner, Moshe},
title = {Large Scale Parallel Document Mining for Machine Translation},
booktitle = {Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)},
month = {August},
address = {Beijing, China},
publisher = {Coling 2010 Organizing Committee},
pages = {1101--1109},
url = {
http://www.aclweb.org/anthology/C10-1124},
year = 2010
}
Uszkoreit et al. (2010) address the problem of document alignment by translating all documents into English and then applying information retrieval methods.
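The retrieval step of such an approach can be sketched with a small TF-IDF and cosine similarity matcher. This is a minimal sketch, assuming the foreign documents have already been machine translated into English and tokenized. It is not the system of Uszkoreit et al. (2010), and the function names `tfidf_vectors`, `cosine`, and `best_matches` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse TF-IDF vector (a dict)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF so that every weight stays positive.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_matches(english_docs, translated_docs):
    """For each translated document, find the most similar English document.

    Returns a list of (english_index, translated_index, score) triples.
    """
    vectors = tfidf_vectors(english_docs + translated_docs)
    eng_vecs = vectors[: len(english_docs)]
    trans_vecs = vectors[len(english_docs):]
    matches = []
    for j, tv in enumerate(trans_vecs):
        scores = [cosine(ev, tv) for ev in eng_vecs]
        i = max(range(len(scores)), key=scores.__getitem__)
        matches.append((i, j, scores[i]))
    return matches
```

Comparing every pair of documents is quadratic in the collection size, so at web scale an inverted index or similar retrieval structure would be needed instead of the exhaustive loop shown here.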
With the increasing use of machine translation on the web, distinguishing between human and machine translated texts becomes a challenge.
Venugopal, Ashish and Uszkoreit, Jakob and Talbot, David and Och, Franz and Ganitkevitch, Juri (2011):
Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation., Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing
mentioned in Parallel Corpora and Corpus Cleaning
@InProceedings{venugopal-EtAl:2011:EMNLP,
author = {Venugopal, Ashish and Uszkoreit, Jakob and Talbot, David and Och, Franz and Ganitkevitch, Juri},
title = {Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.},
booktitle = {Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing},
month = {July},
address = {Edinburgh, Scotland, UK.},
publisher = {Association for Computational Linguistics},
pages = {1363--1372},
url = {
http://www.aclweb.org/anthology/D11-1126},
year = 2011
}
Venugopal et al. (2011) propose a method to watermark the output of machine translation systems to aid this distinction.
Antonova, Alexandra and Misyurev, Alexey (2011):
Building a Web-Based Parallel Corpus and Filtering Out Machine-Translated Text, Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
mentioned in Parallel Corpora and Corpus Cleaning
@InProceedings{antonova-misyurev:2011:BUCC,
author = {Antonova, Alexandra and Misyurev, Alexey},
title = {Building a Web-Based Parallel Corpus and Filtering Out Machine-Translated Text},
booktitle = {Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web},
month = {June},
address = {Portland, Oregon},
publisher = {Association for Computational Linguistics},
pages = {136--144},
url = {
http://www.aclweb.org/anthology/W11-1218},
year = 2011
}
Antonova and Misyurev (2011) report that rule-based machine translation output can be detected by certain word choices, and statistical machine translation output by its lack of reordering.
Spencer Rarrick and Chris Quirk and Will Lewis (2011):
MT Detection in Web-Scraped Parallel Corpora, Proceedings of the 13th Machine Translation Summit (MT Summit XIII)
mentioned in Parallel Corpora and Corpus Cleaning
@inproceedings{MTS-2011-Rarrick,
author = {Spencer Rarrick and Chris Quirk and Will Lewis},
title = {MT Detection in Web-Scraped Parallel Corpora},
url = {
http://www.mt-archive.info/MTS-2011-Rarrick.pdf},
pages = {422--430},
booktitle = {Proceedings of the 13th Machine Translation Summit (MT Summit XIII)},
publisher = {International Association for Machine Translation},
location = {Xiamen, China},
year = 2011
}
Rarrick et al. (2011) train a classifier to learn the distinction and show that removing such data leads to better translation quality.
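One signal mentioned above is a lack of reordering in machine translation output. As a minimal sketch of how such a signal might be turned into a feature for a classifier, the following Python function measures how monotone a word alignment is. It assumes that word alignments are available as (source index, target index) pairs from some external aligner. The function name `monotonicity` is hypothetical, and this is not the feature set of any cited paper.

```python
from itertools import combinations


def monotonicity(alignment):
    """Fraction of alignment link pairs whose source and target order agree.

    `alignment` is an iterable of (source_index, target_index) links.
    A value of 1.0 means the target keeps the source word order exactly.
    """
    links = sorted(set(alignment))
    concordant = 0
    total = 0
    for (s1, t1), (s2, t2) in combinations(links, 2):
        if s1 == s2 or t1 == t2:
            continue  # skip ties caused by one-to-many links
        total += 1
        if (s1 < s2) == (t1 < t2):
            concordant += 1
    # With fewer than two comparable links there is nothing to reorder.
    return concordant / total if total else 1.0
```

A sentence pair with a score close to 1.0 between languages that normally require reordering could be treated as suspicious, although on its own this would be a weak indicator for short sentences.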
Parallel corpora may also be built by dedicated manual translation efforts
Ulrich Germann (2001):
Building a Statistical Machine Translation System from Scratch: How Much Bang for the Buck Can We Expect?, Workshop on Data-Driven Machine Translation at 39th Annual Meeting of the Association for Computational Linguistics (ACL)
@InProceedings{Germann:2001b,
author = {Ulrich Germann},
title = {Building a Statistical Machine Translation System from Scratch: How Much Bang for the Buck Can We Expect?},
url = {
http://acl.ldc.upenn.edu/acl2001/DD-MT/Germann.pdf},
booktitle = {Workshop on Data-Driven Machine Translation at 39th Annual Meeting of the Association for Computational Linguistics (ACL)},
year = 2001
}
(Germann, 2001). It may be useful to focus on the most relevant new sentences, using methods such as active learning
Hemali Majithia and Philip Rennart and Evelyne Tzoukermann (2005):
Rapid Ramp-up for Statistical Machine Translation: Minimal Training for Maximal Coverage, Proceedings of the Tenth Machine Translation Summit (MT Summit X)
@InProceedings{Majithia:2005:MTS,
author = {Hemali Majithia and Philip Rennart and Evelyne Tzoukermann},
title = {Rapid Ramp-up for Statistical Machine Translation: Minimal Training for Maximal Coverage},
url = {
http://mt-archive.info/MTS-2005-Majithia.pdf},
googlescholar = {8162219579429246669},
booktitle = {Proceedings of the Tenth Machine Translation Summit (MT Summit X)},
month = {September},
address = {Phuket, Thailand},
year = 2005
}
(Majithia et al., 2005). Crowd-sourcing with inexperienced translators
Zaidan, Omar F. and Callison-Burch, Chris (2011):
Crowdsourcing Translation: Professional Quality from Non-Professionals, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
@InProceedings{zaidan-callisonburch:2011:ACL-HLT2011,
author = {Zaidan, Omar F. and Callison-Burch, Chris},
title = {Crowdsourcing Translation: Professional Quality from Non-Professionals},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {1220--1229},
url = {
http://www.aclweb.org/anthology/P11-1122},
year = 2011
}
(Zaidan and Callison-Burch, 2011) may be used to reduce cost.
Post, Matt and Callison-Burch, Chris and Osborne, Miles (2012):
Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing, Proceedings of the Seventh Workshop on Statistical Machine Translation
@InProceedings{post-callisonburch-osborne:2012:WMT,
author = {Post, Matt and Callison-Burch, Chris and Osborne, Miles},
title = {Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing},
booktitle = {Proceedings of the Seventh Workshop on Statistical Machine Translation},
month = {June},
address = {Montreal, Canada},
publisher = {Association for Computational Linguistics},
pages = {154--162},
url = {
http://oldsite.aclweb.org/anthology-new/W/W12/W12-3152.pdf},
year = 2012
}
Post et al. (2012) follow this approach to create parallel corpora for six Indian languages.
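The idea of concentrating manual translation effort on the most relevant new sentences, mentioned above, can be illustrated with a greedy selection that repeatedly picks the sentence adding the most unseen n-grams. This is a generic coverage heuristic written as a minimal sketch, not the specific procedure of Majithia et al. (2005). The function names and the default bigram order are assumptions.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def select_for_translation(pool, budget, n=2):
    """Greedily pick up to `budget` sentences that add the most unseen n-grams.

    `pool` is a list of tokenized sentences. Returns the chosen indices in
    the order they were selected.
    """
    covered = set()
    remaining = list(range(len(pool)))
    chosen = []
    while remaining and len(chosen) < budget:
        gains = {i: len(ngrams(pool[i], n) - covered) for i in remaining}
        best = max(remaining, key=lambda i: gains[i])
        if gains[best] == 0:
            break  # nothing left that adds new n-grams
        chosen.append(best)
        covered |= ngrams(pool[best], n)
        remaining.remove(best)
    return chosen
```

Selection stops early when no remaining sentence contributes new n-grams, so the translation budget is not spent on redundant material.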
Translation memories may also be a useful training resource
Philippe Langlais and Michel Simard (2002):
Merging Example-Based and Statistical Machine Translation: An Experiment, Machine Translation: From Research to Real Users, 5th Conference of the Association for Machine Translation in the Americas, AMTA 2002 Tiburon, CA, USA, October 6-12, 2002, Proceedings
@inproceedings{Langlais:2002,
author = {Philippe Langlais and Michel Simard},
title = {Merging Example-Based and Statistical Machine Translation: An Experiment},
url = {
http://transsearch.iro.umontreal.ca/rali/sites/default/files/publis/amta-2002.pdf},
editor = {Stephen D. Richardson},
booktitle = {Machine Translation: From Research to Real Users, 5th Conference of the Association for Machine Translation in the Americas, AMTA 2002 Tiburon, CA, USA, October 6-12, 2002, Proceedings},
publisher = {Springer},
series = {Lecture Notes in Computer Science},
volume = {2499},
isbn = {3-540-44282-0},
bibsource = {DBLP,
http://dblp.uni-trier.de},
year = 2002
}
(Langlais and Simard, 2002).
Other methods focus on fishing the web for the translation of particular terms
Nagata, Masaaki and Saito, Teruka and Suzuki, Kenji (2001):
Using the Web as a Bilingual Dictionary, Workshop on Data-Driven Machine Translation at 39th Annual Meeting of the Association for Computational Linguistics (ACL)
@InProceedings{Nagata:2001,
author = {Nagata, Masaaki and Saito, Teruka and Suzuki, Kenji},
title = {Using the Web as a Bilingual Dictionary},
url = {
http://acl.ldc.upenn.edu/W/W01/W01-1413.pdf},
googlescholar = {8038620055339305841},
booktitle = {Workshop on Data-Driven Machine Translation at 39th Annual Meeting of the Association for Computational Linguistics (ACL)},
year = 2001
}
(Nagata et al., 2001) or phrases
Yunbo Cao and Hang Li (2002):
Base Noun Phrase Translation Using Web Data and the EM Algorithm, Proceedings of the International Conference on Computational Linguistics (COLING)
@InProceedings{Cao:2002,
author = {Yunbo Cao and Hang Li},
title = {Base Noun Phrase Translation Using Web Data and the {EM} Algorithm},
url = {
http://acl.ldc.upenn.edu/coling2002/proceedings/data/area-12/co-043.pdf},
googlescholar = {10844118442717667840},
booktitle = {Proceedings of the International Conference on Computational Linguistics (COLING)},
year = 2002
}
(Cao and Li, 2002). Related is the targeted crawling for in-domain parallel corpora
Pavel Pecina and Antonio Toral and Andy Way and Vassilis Papavassiliou and Prokopis Prokopidis and Maria Giagkou (2011):
Towards Using Web-Crawled Data for Domain Adaptation in Statistical Machine Translation, Proceedings of the 15th International Conference of the European Association for Machine Translation (EAMT)
@inproceedings{eamt11:Pecina,
author = {Pavel Pecina and Antonio Toral and Andy Way and Vassilis Papavassiliou and Prokopis Prokopidis and Maria Giagkou},
title = {Towards Using Web-Crawled Data for Domain Adaptation in Statistical Machine Translation},
url = {
http://doras.dcu.ie/16468/1/Towards\_Using\_Web-Crawled\_Data\_for\_Domain\_Adaptation\_in\_Statistical\_Machine\_Translation.pdf},
googlescholar = {15625079164006063045},
pages = {297--304},
booktitle = {Proceedings of the 15th International Conference of the European Association for Machine Translation (EAMT)},
location = {Leuven, Belgium},
editor = {Mikel L. Forcada and Heidi Depraetere and Vincent Vandeghinste},
year = 2011
}
(Pecina et al., 2011).
It is not clear whether it matters in which translation direction the parallel corpus was constructed, or whether both sides were translated from a third language.
van Halteren, Hans (2008):
Source Language Markers in EUROPARL Translations, Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)
@InProceedings{vanhalteren:2008:PAPERS,
author = {van Halteren, Hans},
title = {Source Language Markers in {EUROPARL} Translations},
booktitle = {Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)},
month = {August},
address = {Manchester, UK},
publisher = {Coling 2008 Organizing Committee},
pages = {937--944},
url = {
http://www.aclweb.org/anthology/C08-1118},
year = 2008
}
van Halteren (2008) shows that it is possible to reliably detect the source language in English texts from the European Parliament proceedings, so the original source language does have some effect.
Benchmarks
Discussion
Related Topics
New Publications
Hai Long Trieu and Le Minh Nguyen (2017):
A Multilingual Parallel Corpus for Improving Machine Translation on Southeast Asian Languages, Machine Translation Summit XVI
@inproceedings{mtsummit2017:Trieu,
author = {Hai Long Trieu and Le Minh Nguyen},
title = {A Multilingual Parallel Corpus for Improving Machine Translation on Southeast Asian Languages},
booktitle = {Machine Translation Summit XVI},
location = {Nagoya, Japan},
year = 2017
}
Trieu and Nguyen (2017)
Abate, Solomon Teferra and Melese, Michael and Tachbelie, Martha Yifiru and Meshesha, Million and Atinafu, Solomon and Mulugeta, Wondwossen and Assabie, Yaregal and Abera, Hafte and Ephrem, Binyam and Abebe, Tewodros and Tsegaye, Wondimagegnhue and Lemma, Amanuel and Andargie, Tsegaye and Shifaw, Seifedin (2018):
Parallel Corpora for bi-lingual English-Ethiopian Languages Statistical Machine Translation, Proceedings of the 27th International Conference on Computational Linguistics
@inproceedings{C18-1262,
author = {Abate, Solomon Teferra and Melese, Michael and Tachbelie, Martha Yifiru and Meshesha, Million and Atinafu, Solomon and Mulugeta, Wondwossen and Assabie, Yaregal and Abera, Hafte and Ephrem, Binyam and Abebe, Tewodros and Tsegaye, Wondimagegnhue and Lemma, Amanuel and Andargie, Tsegaye and Shifaw, Seifedin},
title = {Parallel Corpora for bi-lingual English-Ethiopian Languages Statistical Machine Translation},
booktitle = {Proceedings of the 27th International Conference on Computational Linguistics},
month = {aug},
address = {Santa Fe, New Mexico, USA},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/C18-1262},
pages = {3102--3111},
year = 2018
}
Abate et al. (2018)
Teferra Abate, Solomon and Melese, Michael and Yifiru Tachbelie, Martha and Meshesha, Million and Atinafu, Solomon and Mulugeta, Wondwossen and Assabie, Yaregal and Abera, Hafte and Ephrem, Binyam and Abebe, Tewodros and Tsegaye, Wondimagegnhue and Lemma, Amanuel and Andargie, Tsegaye and Shifaw, Seifedin (2018):
Parallel Corpora for bi-Directional Statistical Machine Translation for Seven Ethiopian Language Pairs, Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing
@inproceedings{W18-3812,
author = {Teferra Abate, Solomon and Melese, Michael and Yifiru Tachbelie, Martha and Meshesha, Million and Atinafu, Solomon and Mulugeta, Wondwossen and Assabie, Yaregal and Abera, Hafte and Ephrem, Binyam and Abebe, Tewodros and Tsegaye, Wondimagegnhue and Lemma, Amanuel and Andargie, Tsegaye and Shifaw, Seifedin},
title = {Parallel Corpora for bi-Directional Statistical Machine Translation for Seven Ethiopian Language Pairs},
booktitle = {Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing},
month = {aug},
address = {Santa Fe, New Mexico, USA},
publisher = {Association for Computational Linguistics},
url = {
https://www.aclweb.org/anthology/W18-3812},
pages = {83--90},
year = 2018
}
Abate et al. (2018)
Deng, Dun and Xue, Nianwen (2014):
Building a Hierarchically Aligned Chinese-English Parallel Treebank, Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers
@InProceedings{deng-xue:2014:Coling,
author = {Deng, Dun and Xue, Nianwen},
title = {Building a Hierarchically Aligned Chinese-English Parallel Treebank},
booktitle = {Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers},
month = {August},
address = {Dublin, Ireland},
publisher = {Dublin City University and Association for Computational Linguistics},
pages = {1511--1520},
url = {
http://www.aclweb.org/anthology/C14-1143},
year = 2014
}
Deng and Xue (2014)
Hieber, Felix and Jehl, Laura and Riezler, Stefan (2013):
Task Alternation in Parallel Sentence Retrieval for Twitter Translation, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
@InProceedings{hieber-jehl-riezler:2013:Short,
author = {Hieber, Felix and Jehl, Laura and Riezler, Stefan},
title = {Task Alternation in Parallel Sentence Retrieval for Twitter Translation},
booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {323--327},
url = {
http://www.aclweb.org/anthology/P13-2058},
year = 2013
}
Hieber et al. (2013)
Wenjun Du and Wuying Liu and Junting Yu and Mianzhu Yi (2015):
Russian-Chinese Sentence-level Aligned News Corpus, Proceedings of the 18th Annual Conference of the European Association for Machine Translation
@InProceedings{W15-4931,
author = {Wenjun Du and Wuying Liu and Junting Yu and Mianzhu Yi},
title = {Russian-Chinese Sentence-level Aligned News Corpus},
booktitle = {Proceedings of the 18th Annual Conference of the European Association for Machine Translation},
month = {May},
address = {Antalya, Turkey},
url = {
http://aclweb.org/anthology/W15-4931},
editor = {{\.I}lknur Durgar El-Kahlout and Mehmed \"Ozkan and Felipe S\'anchez-Mart\'inez and Gema Ram\'irez-S\'anchez and Fred Hollowood and Andy Way},
pages = {213},
year = 2015
}
Du et al. (2015)
Resnik, Philip and Smith, Noah A (2003):
The web as a parallel corpus, Computational Linguistics
@article{resnik2003web,
author = {Resnik, Philip and Smith, Noah A},
title = {The web as a parallel corpus},
journal = {Computational Linguistics},
volume = {29},
number = {3},
pages = {349--380},
publisher = {MIT Press},
year = 2003
}
Resnik and Smith (2003)
Shi, Lei and Niu, Cheng and Zhou, Ming and Gao, Jianfeng (2006):
A dom tree alignment model for mining parallel data from the web, Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
@inproceedings{shi2006dom,
author = {Shi, Lei and Niu, Cheng and Zhou, Ming and Gao, Jianfeng},
title = {A dom tree alignment model for mining parallel data from the web},
booktitle = {Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics},
pages = {489--496},
organization = {Association for Computational Linguistics},
url = {
http://anthology.aclweb.org/P/P06/P06-1062.pdf},
year = 2006
}
Shi et al. (2006)
Font Llitjós, Ariadna (2006):
Can the Internet help improve Machine Translation?, Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Doctoral Consortium
@InProceedings{fontllitjos:2006:HLT-NAACL06-DocConsortium,
author = {Font Llitj\'{o}s, Ariadna},
title = {Can the Internet help improve Machine Translation?},
booktitle = {Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Doctoral Consortium},
month = {June},
address = {New York City, USA},
publisher = {Association for Computational Linguistics},
pages = {219--222},
url = {
http://www.aclweb.org/anthology/N/N06/N06-3003},
year = 2006
}
Font Llitjós (2006)
Germann, Ulrich (2016):
Bilingual Document Alignment with Latent Semantic Indexing, Proceedings of the First Conference on Machine Translation
@InProceedings{germann:2016:WMT,
author = {Germann, Ulrich},
title = {Bilingual Document Alignment with Latent Semantic Indexing},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {692--696},
url = {
http://www.aclweb.org/anthology/W/W16/W16-2368},
year = 2016
}
Germann (2016)
Gomes, Luís and Pereira Lopes, Gabriel (2016):
First Steps Towards Coverage-Based Document Alignment, Proceedings of the First Conference on Machine Translation
@InProceedings{gomes-pereiralopes:2016:WMT,
author = {Gomes, Lu\'{i}s and Pereira Lopes, Gabriel},
title = {First Steps Towards Coverage-Based Document Alignment},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {697--702},
url = {
http://www.aclweb.org/anthology/W/W16/W16-2369},
year = 2016
}
Gomes and Pereira Lopes (2016)
Jakubina, Laurent and Langlais, Philippe (2016):
BAD LUC@WMT 2016: a Bilingual Document Alignment Platform Based on Lucene, Proceedings of the First Conference on Machine Translation
@InProceedings{jakubina-langlais:2016:WMT,
author = {Jakubina, Laurent and Langlais, Philippe},
title = {BAD LUC$@$WMT 2016: a Bilingual Document Alignment Platform Based on Lucene},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {703--709},
url = {
http://www.aclweb.org/anthology/W/W16/W16-2370},
year = 2016
}
Jakubina and Langlais (2016)
Dara, Aswarth Abhilash and Lin, Yiu-Chang (2016):
YODA System for WMT16 Shared Task: Bilingual Document Alignment, Proceedings of the First Conference on Machine Translation
@InProceedings{dara-lin:2016:WMT,
author = {Dara, Aswarth Abhilash and Lin, Yiu-Chang},
title = {YODA System for WMT16 Shared Task: Bilingual Document Alignment},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {679--684},
url = {
http://www.aclweb.org/anthology/W/W16/W16-2366},
year = 2016
}
Dara and Lin (2016)
Esplà-Gomis, Miquel and Forcada, Mikel and Ortiz Rojas, Sergio and Ferrández-Tordera, Jorge (2016):
Bitextor's participation in WMT'16: shared task on document alignment, Proceedings of the First Conference on Machine Translation
@InProceedings{esplagomis-EtAl:2016:WMT,
author = {Espl\`{a}-Gomis, Miquel and Forcada, Mikel and Ortiz Rojas, Sergio and Ferr\'{a}ndez-Tordera, Jorge},
title = {Bitextor's participation in WMT'16: shared task on document alignment},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {685--691},
url = {
http://www.aclweb.org/anthology/W/W16/W16-2367},
year = 2016
}
Esplà-Gomis et al. (2016)
Le, Thanh and Vu, Hoa Trong and Oberländer, Jonathan and Bojar, Ondřej (2016):
Using Term Position Similarity and Language Modeling for Bilingual Document Alignment, Proceedings of the First Conference on Machine Translation
@InProceedings{le-EtAl:2016:WMT,
author = {Le, Thanh and Vu, Hoa Trong and Oberl\"{a}nder, Jonathan and Bojar, Ond\v{r}ej},
title = {Using Term Position Similarity and Language Modeling for Bilingual Document Alignment},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {710--716},
url = {
http://www.aclweb.org/anthology/W/W16/W16-2371},
year = 2016
}
Le et al. (2016)
Medveď, Marek and Jakubícek, Miloš and Kovár, Vojtech (2016):
English-French Document Alignment Based on Keywords and Statistical Translation, Proceedings of the First Conference on Machine Translation
@InProceedings{medve-jakubicek-kovar:2016:WMT,
author = {Medve\v{d}, Marek and Jakub\'{i}cek, Milo\v{s} and Kov\'{a}r, Vojtech},
title = {English-French Document Alignment Based on Keywords and Statistical Translation},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {728--732},
url = {
http://www.aclweb.org/anthology/W/W16/W16-2374},
year = 2016
}
Medveď et al. (2016)
Azpeitia, Andoni and Etchegoyhen, Thierry (2016):
DOCAL - Vicomtech's Participation in the WMT16 Shared Task on Bilingual Document Alignment, Proceedings of the First Conference on Machine Translation
@InProceedings{azpeitia-etchegoyhen:2016:WMT,
author = {Azpeitia, Andoni and Etchegoyhen, Thierry},
title = {DOCAL - Vicomtech's Participation in the WMT16 Shared Task on Bilingual Document Alignment},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {666--671},
url = {
http://www.aclweb.org/anthology/W/W16/W16-2364},
year = 2016
}
Azpeitia and Etchegoyhen (2016)
Papavassiliou, Vassilis and Prokopidis, Prokopis and Piperidis, Stelios (2016):
The ILSP/ARC submission to the WMT 2016 Bilingual Document Alignment Shared Task, Proceedings of the First Conference on Machine Translation
@InProceedings{papavassiliou-prokopidis-piperidis:2016:WMT,
author = {Papavassiliou, Vassilis and Prokopidis, Prokopis and Piperidis, Stelios},
title = {The ILSP/ARC submission to the WMT 2016 Bilingual Document Alignment Shared Task},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {733--739},
url = {
http://www.aclweb.org/anthology/W/W16/W16-2375},
year = 2016
}
Papavassiliou et al. (2016)
Lohar, Pintu and Afli, Haithem and Liu, Chao-Hong and Way, Andy (2016):
The ADAPT Bilingual Document Alignment system at WMT16, Proceedings of the First Conference on Machine Translation
@InProceedings{lohar-EtAl:2016:WMT,
author = {Lohar, Pintu and Afli, Haithem and Liu, Chao-Hong and Way, Andy},
title = {The ADAPT Bilingual Document Alignment system at WMT16},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {717--723},
url = {
http://www.aclweb.org/anthology/W/W16/W16-2372},
year = 2016
}
Lohar et al. (2016)
Mahata, Sainik and Das, Dipankar and Pal, Santanu (2016):
WMT2016: A Hybrid Approach to Bilingual Document Alignment, Proceedings of the First Conference on Machine Translation
@InProceedings{mahata-das-pal:2016:WMT,
author = {Mahata, Sainik and Das, Dipankar and Pal, Santanu},
title = {WMT2016: A Hybrid Approach to Bilingual Document Alignment},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {724--727},
url = {
http://www.aclweb.org/anthology/W/W16/W16-2373},
year = 2016
}
Mahata et al. (2016)
Shchukin, Vadim and Khristich, Dmitry and Galinskaya, Irina (2016):
Word Clustering Approach to Bilingual Document Alignment (WMT 2016 Shared Task), Proceedings of the First Conference on Machine Translation
@InProceedings{shchukin-khristich-galinskaya:2016:WMT,
author = {Shchukin, Vadim and Khristich, Dmitry and Galinskaya, Irina},
title = {Word Clustering Approach to Bilingual Document Alignment (WMT 2016 Shared Task)},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {740--744},
url = {
http://www.aclweb.org/anthology/W/W16/W16-2376},
year = 2016
}
Shchukin et al. (2016)
Buck, Christian and Koehn, Philipp (2016):
Findings of the WMT 2016 Bilingual Document Alignment Shared Task, Proceedings of the First Conference on Machine Translation
@InProceedings{buck-koehn:2016:WMT1,
author = {Buck, Christian and Koehn, Philipp},
title = {Findings of the WMT 2016 Bilingual Document Alignment Shared Task},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {554--563},
url = {
http://www.aclweb.org/anthology/W/W16/W16-2347},
year = 2016
}
Buck and Koehn (2016)
Buck, Christian and Koehn, Philipp (2016):
Quick and Reliable Document Alignment via TF/IDF-weighted Cosine Distance, Proceedings of the First Conference on Machine Translation
@InProceedings{buck-koehn:2016:WMT2,
author = {Buck, Christian and Koehn, Philipp},
title = {Quick and Reliable Document Alignment via TF/IDF-weighted Cosine Distance},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {672--678},
url = {
http://www.aclweb.org/anthology/W/W16/W16-2365},
year = 2016
}
Buck and Koehn (2016)
Wang Ling and Luís Marujo and Chris Dyer and Alan W. Black and Isabel Trancoso (2016):
Mining Parallel Corpora from Sina Weibo and Twitter, Computational Linguistics
@Article{J16-2005,
author = {Wang Ling and Lu\'{i}s Marujo and Chris Dyer and Alan W. Black and Isabel Trancoso},
title = {Mining Parallel Corpora from Sina Weibo and Twitter},
journal = {Computational Linguistics},
volume = {42},
number = {2},
month = {June},
year = 2016
}
Ling et al. (2016)
Barrón-Cedeño, Alberto and España-Bonet, Cristina and Boldoba, Josu and Màrquez, Lluís (2015):
A Factory of Comparable Corpora from Wikipedia, Proceedings of the Eighth Workshop on Building and Using Comparable Corpora
mentioned in Parallel Corpora and Comparable Corpora
@InProceedings{Barronetal:2015,
author = {{Barr\'on-Cede{\~n}o}, Alberto and {Espa{\~n}a-Bonet}, Cristina and {Boldoba}, Josu and {M\`arquez}, Llu\'{i}s},
title = {A Factory of Comparable Corpora from Wikipedia},
booktitle = {Proceedings of the Eighth Workshop on Building and Using Comparable Corpora},
pages = {3--13},
month = {July},
date = {30},
address = {Beijing, China},
language = {english},
url = {
http://www.aclweb.org/anthology/W15-3402},
year = 2015
}
Barrón-Cedeño et al. (2015)
Jalili Sabet, Masoud and Negri, Matteo and Turchi, Marco and Barbu, Eduard (2016):
An Unsupervised Method for Automatic Translation Memory Cleaning, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
@InProceedings{jalilisabet-EtAl:2016:P16-2,
author = {Jalili Sabet, Masoud and Negri, Matteo and Turchi, Marco and Barbu, Eduard},
title = {An Unsupervised Method for Automatic Translation Memory Cleaning},
booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {287--292},
url = {
http://anthology.aclweb.org/P16-2047},
year = 2016
}
Sabet et al. (2016)
Xiaoyi Ma and Mark Y. Liberman (1999):
BITS: A method for bilingual text search over the web, Proceedings of the Machine Translation Summit VII
@INPROCEEDINGS{Ma99bits:a,
author = {Xiaoyi Ma and Mark Y. Liberman},
title = {BITS: A method for bilingual text search over the web},
booktitle = {Proceedings of the Machine Translation Summit VII},
url = {
http://www.mt-archive.info/MTS-1999-Ma-2.pdf},
year = 1999
}
Ma and Liberman (1999)
Ieva Zariņa and Pēteris Ņikiforovs and Raivis Skadiņš (2015):
Word Alignment Based Parallel Corpora Evaluation and Cleaning Using Machine Learning Techniques, Proceedings of the 18th Annual Conference of the European Association for Machine Translation
@InProceedings{W15-4924,
author = {Ieva Zari\c{n}a and P\={e}teris \c{N}ikiforovs and Raivis Skadi\c{n}\v{s}},
title = {Word Alignment Based Parallel Corpora Evaluation and Cleaning Using Machine Learning Techniques},
booktitle = {Proceedings of the 18th Annual Conference of the European Association for Machine Translation},
month = {May},
address = {Antalya, Turkey},
url = {
http://aclweb.org/anthology/W15-4924},
editor = {\.{I}lknur Durgar El-Kahlout and Mehmed \"Ozkan and Felipe S\'anchez-Mart\'inez and Gema Ram\'irez-S\'anchez and Fred Hollowood and Andy Way},
pages = {185--192},
year = 2015
}
Zariņa et al. (2015)
Francisco Guzman and Hassan Sajjad and Stephan Vogel and Ahmed Abdelali (2013):
The AMARA corpus: building resources for translating the web's educational content, Proceedings of the International Workshop on Spoken Language Translation (IWSLT)
@inproceedings{Guzman:iwslt:2013,
author = {Francisco Guzman and Hassan Sajjad and Stephan Vogel and Ahmed Abdelali},
title = {The {AMARA} corpus: building resources for translating the web's educational content},
url = {
http://www.mt-archive.info/10/IWSLT-2013-Guzman.pdf},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
year = 2013
}
Guzman et al. (2013)
Antonio Toral and Raphael Rubino and Miquel Esplà-Gomis and Tommi Pirinen and Andy Way and Gema Ramírez-Sánchez (2014):
Extrinsic evaluation of web-crawlers in machine translation: a study on Croatian-English for the tourism domain, Proceedings of 17th Annual conference of the European Association for Machine Translation
@inproceedings{eamt-2014-Toral,
author = {Antonio Toral and Raphael Rubino and Miquel Espl\`{a}-Gomis and Tommi Pirinen and Andy Way and Gema Ram\'{i}rez-S\'{a}nchez},
title = {Extrinsic evaluation of web-crawlers in machine translation: a study on Croatian-English for the tourism domain},
booktitle = {Proceedings of 17th Annual conference of the European Association for Machine Translation},
pages = {221--224},
url = {
http://www.mt-archive.info/10/EAMT-2014-Toral.pdf},
location = {Dubrovnik, Croatia},
year = 2014
}
Toral et al. (2014)
Haddow, Barry and Hernandez, Adolfo and Neubarth, Friedrich and Trost, Harald (2013):
Corpus development for machine translation between standard and dialectal varieties, Proceedings of the Workshop on Adaptation of Language Resources and Tools for Closely Related Languages and Language Variants
@InProceedings{haddow-EtAl:2013:RANLPLingVar2013,
author = {Haddow, Barry and Hernandez, Adolfo and Neubarth, Friedrich and Trost, Harald},
title = {Corpus development for machine translation between standard and dialectal varieties},
booktitle = {Proceedings of the Workshop on Adaptation of Language Resources and Tools for Closely Related Languages and Language Variants},
month = {September},
address = {Hissar, Bulgaria},
publisher = {INCOMA Ltd. Shoumen, BULGARIA},
pages = {7--14},
url = {
http://www.aclweb.org/anthology/W13-5303},
year = 2013
}
Haddow et al. (2013)
Ling, Wang and Marujo, Luis and Dyer, Chris and Black, Alan W and Trancoso, Isabel (2014):
Crowdsourcing High-Quality Parallel Data Extraction from Twitter, Proceedings of the Ninth Workshop on Statistical Machine Translation
@InProceedings{ling-EtAl:2014:W14-33,
author = {Ling, Wang and Marujo, Luis and Dyer, Chris and Black, Alan W and Trancoso, Isabel},
title = {Crowdsourcing High-Quality Parallel Data Extraction from Twitter},
booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
month = {June},
address = {Baltimore, Maryland, USA},
publisher = {Association for Computational Linguistics},
pages = {426--436},
url = {
http://www.aclweb.org/anthology/W14-3356},
year = 2014
}
Ling et al. (2014)
Matthias Eck and Yury Zemlyanskiy and Joy Zhang and Alex Waibel (2014):
Extracting Translation Pairs from Social Network Content, Proceedings of the International Workshop on Spoken Language Translation (IWSLT)
@inproceedings{Eck:iwslt:2014,
author = {Matthias Eck and Yury Zemlyanskiy and Joy Zhang and Alex Waibel},
title = {Extracting Translation Pairs from Social Network Content},
pages = {200--205},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
year = 2014
}
Eck et al. (2014)
Ling, Wang and Xiang, Guang and Dyer, Chris and Black, Alan and Trancoso, Isabel (2013):
Microblogs as Parallel Corpora, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{ling-EtAl:2013:ACL2013,
author = {Ling, Wang and Xiang, Guang and Dyer, Chris and Black, Alan and Trancoso, Isabel},
title = {Microblogs as Parallel Corpora},
booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {176--186},
url = {
http://www.aclweb.org/anthology/P13-1018},
year = 2013
}
Ling et al. (2013)
Smith, Jason R. and Saint-Amand, Herve and Plamada, Magdalena and Koehn, Philipp and Callison-Burch, Chris and Lopez, Adam (2013):
Dirt Cheap Web-Scale Parallel Text from the Common Crawl, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{smith-EtAl:2013:ACL2013,
author = {Smith, Jason R. and Saint-Amand, Herve and Plamada, Magdalena and Koehn, Philipp and Callison-Burch, Chris and Lopez, Adam},
title = {Dirt Cheap Web-Scale Parallel Text from the Common Crawl},
booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {1374--1383},
url = {
http://www.aclweb.org/anthology/P13-1135},
year = 2013
}
Smith et al. (2013)
Francis Bond and Shan Wang (2014):
Issues in building English-Chinese parallel corpora with WordNets, Proceedings of the Seventh Global Wordnet Conference
@inproceedings{W14-0154,
author = {Francis Bond and Shan Wang},
title = {Issues in building English-Chinese parallel corpora with WordNets},
booktitle = {Proceedings of the Seventh Global Wordnet Conference},
url = {
http://www.aclweb.org/anthology/W14-0154},
pages = {391--399},
editor = {Heili Orav and Christiane Fellbaum and Piek Vossen},
address = {Tartu, Estonia},
year = 2014
}
Bond and Wang (2014)
Papavassiliou, Vassilis and Prokopidis, Prokopis and Thurmair, Gregor (2013):
A modular open-source focused crawler for mining monolingual and bilingual corpora from the web, Proceedings of the Sixth Workshop on Building and Using Comparable Corpora
@InProceedings{papavassiliou-prokopidis-thurmair:2013:BUCC,
author = {Papavassiliou, Vassilis and Prokopidis, Prokopis and Thurmair, Gregor},
title = {A modular open-source focused crawler for mining monolingual and bilingual corpora from the web},
booktitle = {Proceedings of the Sixth Workshop on Building and Using Comparable Corpora},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {43--51},
url = {
http://www.aclweb.org/anthology/W13-2506},
year = 2013
}
Papavassiliou et al. (2013)
Eisele, Andreas (2005):
First Steps towards Multi-Engine Machine Translation, Proceedings of the ACL Workshop on Building and Using Parallel Texts
mentioned in Parallel Corpora and System Combination
@InProceedings{eisele:2005:WPT,
author = {Eisele, Andreas},
title = {First Steps towards Multi-Engine Machine Translation},
booktitle = {Proceedings of the ACL Workshop on Building and Using Parallel Texts},
month = {June},
address = {Ann Arbor, Michigan},
publisher = {Association for Computational Linguistics},
pages = {155--158},
url = {
http://www.aclweb.org/anthology/W/W05/W05-0828},
year = 2005
}
Eisele (2005)
Victoria Arranz and Olivier Hamon and Karim Boudahmane and Martine Garnier-Rizet (2011):
Protocol and lessons learnt from the production of parallel corpora for the evaluation of speech translation systems, Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT)
@inproceedings{iwslt11:Arranz,
author = {Victoria Arranz and Olivier Hamon and Karim Boudahmane and Martine Garnier-Rizet},
title = {Protocol and lessons learnt from the production of parallel corpora for the evaluation of speech translation systems},
url = {
http://www.mt-archive.info/IWSLT-2011-Arranz.pdf},
pages = {129--135},
editor = {Marcello Federico and Mei-Yuh Hwang and Margit R{\"o}dder and Sebastian St{\"u}ker},
booktitle = {Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT)},
location = {San Francisco, USA},
year = 2011
}
Arranz et al. (2011)
Bin Lu and Ka Po Chow and Benjamin K. Tsou (2011):
The Cultivation of a Chinese-English-Japanese Trilingual Parallel Corpus from Comparable Patents, Proceedings of the 13th Machine Translation Summit (MT Summit XIII)
@inproceedings{MTS-2011-Lu,
author = {Bin Lu and Ka Po Chow and Benjamin K. Tsou},
title = {The Cultivation of a {Chinese-English-Japanese} Trilingual Parallel Corpus from Comparable Patents},
url = {
http://www.mt-archive.info/MTS-2011-Lu.pdf},
pages = {472--479},
booktitle = {Proceedings of the 13th Machine Translation Summit (MT Summit XIII)},
publisher = {International Association for Machine Translation},
location = {Xiamen, China},
year = 2011
}
Lu et al. (2011)
Gascó, Guillem and Rocha, Martha-Alicia and Sanchis-Trilles, Germán and Andrés-Ferrer, Jesús and Casacuberta, Francisco (2012):
Does more data always yield better translations?, Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
@InProceedings{gasco-EtAl:2012:EACL2012,
author = {Gasc\'{o}, Guillem and Rocha, Martha-Alicia and Sanchis-Trilles, Germ\'{a}n and Andr\'{e}s-Ferrer, Jes\'{u}s and Casacuberta, Francisco},
title = {Does more data always yield better translations?},
booktitle = {Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics},
month = {April},
address = {Avignon, France},
publisher = {Association for Computational Linguistics},
pages = {152--161},
url = {
http://www.aclweb.org/anthology/E12-1016},
year = 2012
}
Gascó et al. (2012)
Tatsuya Ishisaka and Masao Utiyama and Eiichiro Sumita and Kazuhide Yamamoto (2009):
Development of a Japanese-English Software Manual Parallel Corpus, Proceedings of the Twelfth Machine Translation Summit (MT Summit XII)
@inproceedings{MTS09:Ishisaka,
author = {Tatsuya Ishisaka and Masao Utiyama and Eiichiro Sumita and Kazuhide Yamamoto},
title = {Development of a {J}apanese-{E}nglish Software Manual Parallel Corpus},
url = {
http://www.researchgate.net/publication/237841324\_Development\_of\_a\_Japanese-English\_Software\_Manual\_Paralell\_Corpus/file/e0b4951bdc04b06036.pdf},
googlescholar = {16239386671948484399},
booktitle = {Proceedings of the Twelfth Machine Translation Summit (MT Summit XII)},
publisher = {International Association for Machine Translation},
location = {Ottawa, Ontario, Canada},
year = 2009
}
Ishisaka et al. (2009)
Alexandre Rafalovitch and Robert Dale (2009):
United Nations General Assembly Resolutions: A Six-Language Parallel Corpus, Proceedings of the Twelfth Machine Translation Summit (MT Summit XII)
@inproceedings{MTS09:Rafalovitch,
author = {Alexandre Rafalovitch and Robert Dale},
title = {United {N}ations {G}eneral {A}ssembly Resolutions: A Six-Language Parallel Corpus},
url = {
http://www.uncorpora.org/Rafalovitch\_Dale\_MT\_Summit\_2009.pdf},
googlescholar = {10722155333156234579},
booktitle = {Proceedings of the Twelfth Machine Translation Summit (MT Summit XII)},
publisher = {International Association for Machine Translation},
location = {Ottawa, Ontario, Canada},
year = 2009
}
Rafalovitch and Dale (2009)
Masao Utiyama and Daisuke Kawahara and Keiji Yasuda and Eiichiro Sumita (2009):
Mining Parallel Texts from Mixed-Language Web Pages, Proceedings of the Twelfth Machine Translation Summit (MT Summit XII)
@inproceedings{MTS09:Utiyama1,
author = {Masao Utiyama and Daisuke Kawahara and Keiji Yasuda and Eiichiro Sumita},
title = {Mining Parallel Texts from Mixed-Language Web Pages},
url = {
http://www.mt-archive.info/MTS-2009-Utiyama-1.pdf},
googlescholar = {162748147626911829},
booktitle = {Proceedings of the Twelfth Machine Translation Summit (MT Summit XII)},
publisher = {International Association for Machine Translation},
location = {Ottawa, Ontario, Canada},
year = 2009
}
Utiyama et al. (2009)
Qibo Zhu and Diana Inkpen and Ash Asudeh (2009):
Inducing translations from officially published materials in Canadian government websites, Proceedings of the Twelfth Machine Translation Summit (MT Summit XII)
@inproceedings{MTS09:Zhu,
author = {Qibo Zhu and Diana Inkpen and Ash Asudeh},
title = {Inducing translations from officially published materials in {C}anadian government websites},
booktitle = {Proceedings of the Twelfth Machine Translation Summit (MT Summit XII)},
publisher = {International Association for Machine Translation},
location = {Ottawa, Ontario, Canada},
year = 2009
}
Zhu et al. (2009)
Hong, Gumwon and Li, Chi-Ho and Zhou, Ming and Rim, Hae-Chang (2010):
An Empirical Study on Web Mining of Parallel Data, Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)
@InProceedings{hong-EtAl:2010:PAPERS,
author = {Hong, Gumwon and Li, Chi-Ho and Zhou, Ming and Rim, Hae-Chang},
title = {An Empirical Study on Web Mining of Parallel Data},
booktitle = {Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)},
month = {August},
address = {Beijing, China},
publisher = {Coling 2010 Organizing Committee},
pages = {474--482},
url = {
http://www.aclweb.org/anthology/C10-1054},
year = 2010
}
Hong et al. (2010)
Han, Xiwu and Li, Hanzhang and Zhao, Tiejun (2009):
Train the Machine with What It Can Learn---Corpus Selection for SMT, Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora
@InProceedings{han-li-zhao:2009:BUCC,
author = {Han, Xiwu and Li, Hanzhang and Zhao, Tiejun},
title = {Train the Machine with What It Can Learn---Corpus Selection for SMT},
booktitle = {Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora},
month = {August},
address = {Singapore},
publisher = {Association for Computational Linguistics},
pages = {27--33},
url = {
http://www.aclweb.org/anthology/W/W09/W09-3106},
year = 2009
}
Han et al. (2009)
Donghua Xu and Chew Lim Tan (1999):
Alignment and Matching of Bilingual English-Chinese News Texts, Machine Translation
@article{MTJ:1999:Xu,
author = {Donghua Xu and Chew Lim Tan},
title = {Alignment and Matching of Bilingual {E}nglish-{C}hinese News Texts},
pages = {1--33},
journal = {Machine Translation},
volume = {14},
number = {1},
month = {March},
year = 1999
}
Xu and Tan (1999)
Miquel Esplà-Gomis (2009):
Bitextor: a Free/Open-source Software to Harvest Translation Memories from Multilingual Websites, MT Summit Workshop on New Tools for Translators
@inproceedings{MTS09:Espla-Gomis,
author = {Miquel Espl\`{a}-Gomis},
title = {Bitextor: a Free/Open-source Software to Harvest Translation Memories from Multilingual Websites},
booktitle = {MT Summit Workshop on New Tools for Translators},
publisher = {International Association for Machine Translation},
location = {Ottawa, Ontario, Canada},
year = 2009
}
Esplà-Gomis (2009)
Ambati, Vamshi and Vogel, Stephan (2010):
Can Crowds Build Parallel Corpora for Machine Translation Systems?, Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk
@InProceedings{ambati-vogel:2010:MTURK,
author = {Ambati, Vamshi and Vogel, Stephan},
title = {Can Crowds Build Parallel Corpora for Machine Translation Systems?},
booktitle = {Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk},
month = {June},
address = {Los Angeles},
publisher = {Association for Computational Linguistics},
pages = {62--65},
url = {
http://www.aclweb.org/anthology/W10-0710},
year = 2010
}
Ambati and Vogel (2010)
Hu, Chang and Resnik, Philip and Kronrod, Yakov and Eidelman, Vladimir and Buzek, Olivia and Bederson, Benjamin B. (2011):
The Value of Monolingual Crowdsourcing in a Real-World Translation Scenario: Simulation using Haitian Creole Emergency SMS Messages, Proceedings of the Sixth Workshop on Statistical Machine Translation
mentioned in Parallel Corpora and Sparse Data
@InProceedings{hu-EtAl:2011:WMT,
author = {Hu, Chang and Resnik, Philip and Kronrod, Yakov and Eidelman, Vladimir and Buzek, Olivia and Bederson, Benjamin B.},
title = {The Value of Monolingual Crowdsourcing in a Real-World Translation Scenario: Simulation using Haitian Creole Emergency SMS Messages},
booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
month = {July},
address = {Edinburgh, Scotland},
publisher = {Association for Computational Linguistics},
pages = {399--404},
url = {
http://www.aclweb.org/anthology/W11-2148},
year = 2011
}
Hu et al. (2011)
Krstovski, Kriste and Smith, David A. (2011):
A Minimally Supervised Approach for Detecting and Ranking Document Translation Pairs, Proceedings of the Sixth Workshop on Statistical Machine Translation
@InProceedings{krstovski-smith:2011:WMT,
author = {Krstovski, Kriste and Smith, David A.},
title = {A Minimally Supervised Approach for Detecting and Ranking Document Translation Pairs},
booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
month = {July},
address = {Edinburgh, Scotland},
publisher = {Association for Computational Linguistics},
pages = {207--216},
url = {
http://www.aclweb.org/anthology/W11-2125},
year = 2011
}
Krstovski and Smith (2011)
Cartoni, Bruno and Zufferey, Sandrine and Meyer, Thomas and Popescu-Belis, Andrei (2011):
How Comparable are Parallel Corpora? Measuring the Distribution of General Vocabulary and Connectives, Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
@InProceedings{cartoni-EtAl:2011:BUCC,
author = {Cartoni, Bruno and Zufferey, Sandrine and Meyer, Thomas and Popescu-Belis, Andrei},
title = {How Comparable are Parallel Corpora? Measuring the Distribution of General Vocabulary and Connectives},
booktitle = {Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web},
month = {June},
address = {Portland, Oregon},
publisher = {Association for Computational Linguistics},
pages = {78--86},
url = {
http://www.aclweb.org/anthology/W11-1211},
year = 2011
}
Cartoni et al. (2011)
Gahbiche-Braham, Souhir and Bonneau-Maynard, Hélène and Yvon, François (2011):
Two Ways to Use a Noisy Parallel News Corpus for Improving Statistical Machine Translation, Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
@InProceedings{gahbichebraham-bonneaumaynard-yvon:2011:BUCC,
author = {Gahbiche-Braham, Souhir and Bonneau-Maynard, H\'{e}l\`{e}ne and Yvon, Fran\c{c}ois},
title = {Two Ways to Use a Noisy Parallel News Corpus for Improving Statistical Machine Translation},
booktitle = {Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web},
month = {June},
address = {Portland, Oregon},
publisher = {Association for Computational Linguistics},
pages = {44--51},
url = {
http://www.aclweb.org/anthology/W11-1207},
year = 2011
}
Gahbiche-Braham et al. (2011)
Patry, Alexandre and Langlais, Philippe (2011):
Identifying Parallel Documents from a Large Bilingual Collection of Texts: Application to Parallel Article Extraction in Wikipedia, Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
@InProceedings{patry-langlais:2011:BUCC,
author = {Patry, Alexandre and Langlais, Philippe},
title = {Identifying Parallel Documents from a Large Bilingual Collection of Texts: Application to Parallel Article Extraction in Wikipedia},
booktitle = {Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web},
month = {June},
address = {Portland, Oregon},
publisher = {Association for Computational Linguistics},
pages = {87--95},
url = {
http://www.aclweb.org/anthology/W11-1212},
year = 2011
}
Patry and Langlais (2011)
John Fry (2005):
Assembling a Parallel Corpus from RSS News Feeds, Proceedings of the Workshop on Example-based Machine Translation at MT Summit X
@InProceedings{Fry:2005:MTS,
author = {John Fry},
title = {Assembling a Parallel Corpus from {RSS} News Feeds},
url = {
http://mt-archive.info/MTS-2005-Fry.pdf},
googlescholar = {12617274806825000713},
booktitle = {Proceedings of the Workshop on Example-based Machine Translation at {MT} Summit X},
month = {September},
address = {Phuket, Thailand},
year = 2005
}
Fry (2005)