Corpus Cleaning
Parallel corpora may contain misaligned or otherwise noisy sentence pairs; removing them can improve translation quality.
Corpus Cleaning is the main subject of 37 publications. 21 are discussed here.
Publications
Statistical machine translation models are generally assumed to be fairly robust to noisy data, such as data that includes misalignments. This is less true for neural machine translation models
Khayrallah, Huda and Koehn, Philipp (2018):
On the Impact of Various Types of Noise on Neural Machine Translation, Proceedings of the 2nd Workshop on Neural Machine Translation and Generation
mentioned in Corpus Cleaning and Analysis And Visualization
@InProceedings{W18-2709,
author = {Khayrallah, Huda and Koehn, Philipp},
title = {On the Impact of Various Types of Noise on Neural Machine Translation},
booktitle = {Proceedings of the 2nd Workshop on Neural Machine Translation and Generation},
publisher = {Association for Computational Linguistics},
pages = {74--83},
location = {Melbourne, Australia},
url = {http://aclweb.org/anthology/W18-2709},
year = 2018
}
(Khayrallah and Koehn, 2018).
Data cleaning has been shown to help
Stephan Vogel (2003):
Using Noisy Bilingual Data for Statistical Machine Translation, Proceedings of Meeting of the European Chapter of the Association of Computational Linguistics (EACL)
@InProceedings{Vogel:2003b,
author = {Stephan Vogel},
title = {Using Noisy Bilingual Data for Statistical Machine Translation},
url = {http://acl.ldc.upenn.edu/E/E03/E03-1050.pdf},
googlescholar = {8288165789971790306},
booktitle = {Proceedings of Meeting of the European Chapter of the Association of Computational Linguistics (EACL)},
year = 2003
}
(Vogel, 2003). Often, for instance in the case of news reports that are rewritten for a different audience during translation, documents are not very parallel, so the task of sentence alignment becomes more of a task of sentence extraction
Fung, Pascale and Cheung, Percy (2004):
Multi-level Bootstrapping For Extracting Parallel Sentences From a Quasi-Comparable Corpus, Proceedings of Coling 2004
@inproceedings{Fung:2004,
author = {Fung, Pascale and Cheung, Percy},
title = {Multi-level Bootstrapping For Extracting Parallel Sentences From a Quasi-Comparable Corpus},
url = {http://acl.ldc.upenn.edu/coling2004/MAIN/pdf/151-882.pdf},
booktitle = {Proceedings of Coling 2004},
month = {Aug 23--Aug 27},
address = {Geneva, Switzerland},
publisher = {COLING},
pages = {1051--1057},
year = 2004
}
(Fung and Cheung, 2004;
Fung, Pascale and Cheung, Percy (2004):
Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and EM, Proceedings of EMNLP 2004
@inproceedings{Fung:2004b,
author = {Fung, Pascale and Cheung, Percy},
title = {Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and EM},
url = {http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Fung.pdf},
booktitle = {Proceedings of EMNLP 2004},
editor = {Dekang Lin and Dekai Wu},
month = {July},
address = {Barcelona, Spain},
publisher = {Association for Computational Linguistics},
pages = {57--63},
year = 2004
}
Fung and Cheung, 2004b). For good performance, it has proven crucial, especially when only small amounts of training data are available, to exploit all of the data, whether by augmenting phrase translation tables to include all words or by breaking up sentences that are too long
Coskun Mermer and Hamza Kaya and Mehmet Ugur Dogan (2007):
The TÜBITAK-UEKAE Statistical Machine Translation System for IWSLT 2007, Proceedings of the International Workshop on Spoken Language Translation (IWSLT)
mentioned in Research Groups and Corpus Cleaning
@inproceedings{Mermer:2007:IWSLT,
author = {Coskun Mermer and Hamza Kaya and Mehmet Ugur Dogan},
title = {The {T{\"U}BITAK-UEKAE} Statistical Machine Translation System for {IWSLT} 2007},
url = {http://20.210-193-52.unknown.qala.com.sg/archive/iwslt\_07/papers/slt7\_176.pdf},
googlescholar = {308880400410463324},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
year = 2007
}
(Mermer et al., 2007).
There is a robust body of work on filtering out noise in parallel data. For example:
Kaveh Taghipour and Shahram Khadivi and Jia Xu (2011):
Parallel Corpus Refinement as an Outlier Detection Algorithm, Proceedings of the 13th Machine Translation Summit (MT Summit XIII)
@inproceedings{MTS-2011-Taghipour,
author = {Kaveh Taghipour and Shahram Khadivi and Jia Xu},
title = {Parallel Corpus Refinement as an Outlier Detection Algorithm},
url = {http://www.mt-archive.info/MTS-2011-Taghipour.pdf},
pages = {414--421},
booktitle = {Proceedings of the 13th Machine Translation Summit (MT Summit XIII)},
publisher = {International Association for Machine Translation},
location = {Xiamen, China},
year = 2011
}
Taghipour et al. (2011) use an outlier detection algorithm to filter a parallel corpus;
Xu, Hainan and Koehn, Philipp (2017):
Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
@InProceedings{D17-1318,
author = {Xu, Hainan and Koehn, Philipp},
title = {Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora},
booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
publisher = {Association for Computational Linguistics},
pages = {2935--2940},
location = {Copenhagen, Denmark},
url = {http://aclweb.org/anthology/D17-1318},
year = 2017
}
Xu and Koehn (2017) generate synthetic noisy data (inadequate and non-fluent translations) and use this data to train a classifier to identify good sentence pairs from a noisy corpus; and
Cui, Lei and Zhang, Dongdong and Liu, Shujie and Li, Mu and Zhou, Ming (2013):
Bilingual Data Cleaning for SMT using Graph-based Random Walk, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
@InProceedings{cui-EtAl:2013:Short,
author = {Cui, Lei and Zhang, Dongdong and Liu, Shujie and Li, Mu and Zhou, Ming},
title = {Bilingual Data Cleaning for {SMT} using Graph-based Random Walk},
booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {340--345},
url = {http://www.aclweb.org/anthology/P13-2061},
year = 2013
}
Cui et al. (2013) use a graph-based random walk algorithm to score phrase pairs, and use these scores to weight the phrase translation probabilities towards more trustworthy pairs.
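The synthetic-noise idea attributed to Xu and Koehn (2017) above can be illustrated with a minimal sketch. The two corruption strategies used here (re-pairing sources with other targets to simulate inadequate translations, and shuffling target words to simulate non-fluent ones) are assumptions chosen for illustration, and the function name `make_synthetic_negatives` is hypothetical; this is not the Zipporah implementation.

```python
import random


def make_synthetic_negatives(pairs, seed=0):
    """Build corrupted copies of clean (source, target) sentence pairs.

    Returns two lists of pairs:
      inadequate -- each source paired with the target of another pair
      non_fluent -- original alignment, but target words shuffled
    Both corruption strategies are illustrative assumptions.
    """
    rng = random.Random(seed)
    sources = [s for s, _ in pairs]
    targets = [t for _, t in pairs]

    # Inadequate: rotate the target side so that no source keeps
    # the target at its own position (needs at least two pairs).
    shift = rng.randrange(1, len(targets)) if len(targets) > 1 else 0
    rotated = targets[shift:] + targets[:shift]
    inadequate = list(zip(sources, rotated))

    # Non-fluent: keep the alignment, scramble the target word order.
    non_fluent = []
    for source, target in pairs:
        words = target.split()
        rng.shuffle(words)
        non_fluent.append((source, " ".join(words)))

    return inadequate, non_fluent


def label_examples(clean, corrupted):
    """Attach label 1 to clean pairs and label 0 to corrupted pairs."""
    return [(p, 1) for p in clean] + [(p, 0) for p in corrupted]
```

The labelled examples could then be fed to any binary classifier over sentence-pair features; the choice of features and classifier is left open here.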
Most of this work was done in the context of statistical machine translation, but more recent work
Carpuat, Marine and Vyas, Yogarshi and Niu, Xing (2017):
Detecting Cross-Lingual Semantic Divergence for Neural Machine Translation, Proceedings of the First Workshop on Neural Machine Translation
mentioned in Neural Network Models and Corpus Cleaning
@InProceedings{carpuat-vyas-niu:2017:NMT,
author = {Carpuat, Marine and Vyas, Yogarshi and Niu, Xing},
title = {Detecting Cross-Lingual Semantic Divergence for Neural Machine Translation},
booktitle = {Proceedings of the First Workshop on Neural Machine Translation},
month = {August},
address = {Vancouver},
publisher = {Association for Computational Linguistics},
pages = {69--79},
url = {http://www.aclweb.org/anthology/W17-3209},
year = 2017
}
(Carpuat et al., 2017) targets neural models. That work focuses on identifying semantic differences in translation pairs using cross-lingual textual entailment and additional length-based features, and demonstrates that removing such sentences improves neural machine translation performance.
As
Spencer Rarrick and Chris Quirk and Will Lewis (2011):
MT Detection in Web-Scraped Parallel Corpora, Proceedings of the 13th Machine Translation Summit (MT Summit XIII)
mentioned in Parallel Corpora and Corpus Cleaning
@inproceedings{MTS-2011-Rarrick,
author = {Spencer Rarrick and Chris Quirk and Will Lewis},
title = {MT Detection in Web-Scraped Parallel Corpora},
url = {http://www.mt-archive.info/MTS-2011-Rarrick.pdf},
pages = {422--430},
booktitle = {Proceedings of the 13th Machine Translation Summit (MT Summit XIII)},
publisher = {International Association for Machine Translation},
location = {Xiamen, China},
year = 2011
}
Rarrick et al. (2011) point out, one problem of parallel corpora extracted from the web is translations that have been created by machine translation.
Venugopal, Ashish and Uszkoreit, Jakob and Talbot, David and Och, Franz and Ganitkevitch, Juri (2011):
Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation, Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing
mentioned in Parallel Corpora and Corpus Cleaning
@InProceedings{venugopal-EtAl:2011:EMNLP,
author = {Venugopal, Ashish and Uszkoreit, Jakob and Talbot, David and Och, Franz and Ganitkevitch, Juri},
title = {Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.},
booktitle = {Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing},
month = {July},
address = {Edinburgh, Scotland, UK.},
publisher = {Association for Computational Linguistics},
pages = {1363--1372},
url = {http://www.aclweb.org/anthology/D11-1126},
year = 2011
}
Venugopal et al. (2011) propose a method to watermark the output of machine translation systems to aid this distinction.
Antonova, Alexandra and Misyurev, Alexey (2011):
Building a Web-Based Parallel Corpus and Filtering Out Machine-Translated Text, Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
mentioned in Parallel Corpora and Corpus Cleaning
@InProceedings{antonova-misyurev:2011:BUCC,
author = {Antonova, Alexandra and Misyurev, Alexey},
title = {Building a Web-Based Parallel Corpus and Filtering Out Machine-Translated Text},
booktitle = {Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web},
month = {June},
address = {Portland, Oregon},
publisher = {Association for Computational Linguistics},
pages = {136--144},
url = {http://www.aclweb.org/anthology/W11-1218},
year = 2011
}
Antonova and Misyurev (2011) report that rule-based machine translation output can be detected due to certain word choices, and statistical machine translation output due to lack of reordering.
In 2016, a shared task on sentence pair filtering was organized
Barbu, Eduard and Parra Escartín, Carla and Bentivogli, Luisa and Negri, Matteo and Turchi, Marco and Orasan, Constantin and Federico, Marcello (2016):
The first Automatic Translation Memory Cleaning Shared Task, Machine Translation
@Article{Barbu2016,
author = {Barbu, Eduard and Parra Escart{\'i}n, Carla and Bentivogli, Luisa and Negri, Matteo and Turchi, Marco and Orasan, Constantin and Federico, Marcello},
title = {The first Automatic Translation Memory Cleaning Shared Task},
journal = {Machine Translation},
month = {Dec},
day = {01},
volume = {30},
number = {3},
pages = {145--166},
issn = {1573-0573},
doi = {10.1007/s10590-016-9183-x},
url = {https://doi.org/10.1007/s10590-016-9183-x},
year = 2016
}
(Barbu et al., 2016), albeit in the context of cleaning translation memories, which tend to be cleaner than web-crawled data. In 2018, a shared task explored filtering techniques for neural machine translation (Koehn et al., 2018).
Yonatan Belinkov and Yonatan Bisk (2018):
Synthetic and Natural Noise Both Break Neural Machine Translation, International Conference on Learning Representations
mentioned in Corpus Cleaning and Analysis And Visualization
@inproceedings{belinkov2018synthetic,
author = {Yonatan Belinkov and Yonatan Bisk},
title = {Synthetic and Natural Noise Both Break Neural Machine Translation},
booktitle = {International Conference on Learning Representations},
url = {https://openreview.net/forum?id=BJ8vJebC-},
year = 2018
}
Belinkov and Bisk (2018) investigate noise in neural machine translation, but they focus on creating systems that can translate the kinds of orthographic errors (typos, misspellings, etc.) that humans can comprehend. In contrast, we address noisy training data and focus on types of noise occurring in web-crawled corpora.
There is a rich literature on data selection which aims at sub-sampling parallel data relevant for a task-specific machine translation system
Axelrod, Amittai and He, Xiaodong and Gao, Jianfeng (2011):
Domain Adaptation via Pseudo In-Domain Data Selection, Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing
mentioned in Corpus Cleaning and Domain Adaptation
@InProceedings{axelrod-he-gao:2011:EMNLP,
author = {Axelrod, Amittai and He, Xiaodong and Gao, Jianfeng},
title = {Domain Adaptation via Pseudo In-Domain Data Selection},
booktitle = {Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing},
month = {July},
address = {Edinburgh, Scotland, UK.},
publisher = {Association for Computational Linguistics},
pages = {355--362},
url = {http://www.aclweb.org/anthology/D11-1033},
year = 2011
}
(Axelrod et al., 2011).
van der Wees, Marlies and Bisazza, Arianna and Monz, Christof (2017):
Dynamic Data Selection for Neural Machine Translation, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
mentioned in Corpus Cleaning and Adaptation
@InProceedings{D17-1148,
author = {van der Wees, Marlies and Bisazza, Arianna and Monz, Christof},
title = {Dynamic Data Selection for Neural Machine Translation},
booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
publisher = {Association for Computational Linguistics},
pages = {1411--1421},
location = {Copenhagen, Denmark},
url = {http://aclweb.org/anthology/D17-1147},
year = 2017
}
van der Wees et al. (2017) find that existing data selection methods developed for statistical machine translation are less effective for neural machine translation. Data selection differs from our goal of handling noise, since those methods tend to discard perfectly fine sentence pairs (say, about cooking recipes) that are simply not relevant to the targeted domain (say, software manuals). Our work focuses on noise that is harmful for all domains.
Since we begin with a clean parallel corpus and add potentially noisy data to it, this work can be seen as a type of data augmentation.
Sennrich, Rico and Haddow, Barry and Birch, Alexandra (2016):
Improving Neural Machine Translation Models with Monolingual Data, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
mentioned in Corpus Cleaning and Monolingual Data
@InProceedings{sennrich-haddow-birch:2016:P16-11,
author = {Sennrich, Rico and Haddow, Barry and Birch, Alexandra},
title = {Improving Neural Machine Translation Models with Monolingual Data},
booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {86--96},
url = {http://www.aclweb.org/anthology/P16-1009},
year = 2016
}
Sennrich et al. (2016) incorporate monolingual corpora into NMT by first translating them using an NMT system trained in the opposite direction. While such a corpus has the potential to be noisy, the method is very effective.
Currey, Anna and Miceli Barone, Antonio Valerio and Heafield, Kenneth (2017):
Copied Monolingual Data Improves Low-Resource Neural Machine Translation, Proceedings of the Second Conference on Machine Translation, Volume 1: Research Paper
mentioned in Corpus Cleaning and Monolingual Data
@InProceedings{currey-micelibarone-heafield:2017:WMT,
author = {Currey, Anna and Miceli Barone, Antonio Valerio and Heafield, Kenneth},
title = {Copied Monolingual Data Improves Low-Resource Neural Machine Translation},
booktitle = {Proceedings of the Second Conference on Machine Translation, Volume 1: Research Paper},
month = {September},
address = {Copenhagen, Denmark},
publisher = {Association for Computational Linguistics},
pages = {148--156},
url = {http://www.aclweb.org/anthology/W17-4715},
year = 2017
}
Currey et al. (2017) create additional parallel corpora by copying monolingual corpora in the target language into the source, and find it improves over back-translation for some language pairs.
Fadaee, Marzieh and Bisazza, Arianna and Monz, Christof (2017):
Data Augmentation for Low-Resource Neural Machine Translation, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
mentioned in Neural Network Models and Corpus Cleaning
@InProceedings{fadaee-bisazza-monz:2017:Short2,
author = {Fadaee, Marzieh and Bisazza, Arianna and Monz, Christof},
title = {Data Augmentation for Low-Resource Neural Machine Translation},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
month = {July},
address = {Vancouver, Canada},
publisher = {Association for Computational Linguistics},
pages = {567--573},
url = {http://aclweb.org/anthology/P17-2090},
year = 2017
}
Fadaee et al. (2017) improve NMT performance in low-resource settings by altering existing sentences to create training data that includes rare words in different contexts.
Copy Noise:
Other work has also considered copying in NMT.
Currey et al. (2017) add copied data and back-translated data to a clean parallel corpus. They report improvements on English-Romanian when adding as much back-translated and copied data as they have parallel data (1:1:1 ratio). For English-Turkish and English-German, they add twice as much back-translated and copied data as parallel data (1:2:2 ratio), and report improvements on English-Turkish but not on English-German. However, their English-German systems trained with the copied corpus did not perform worse than baseline systems.
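The mixing ratios described above (parallel : back-translated : copied) can be made concrete with a small sketch. It assumes the back-translated pairs are already available and that copied pairs are formed by using a target-language sentence on both sides; the function name `build_mixed_corpus` and its arguments are hypothetical and are not taken from Currey et al. (2017).

```python
def build_mixed_corpus(parallel, back_translated, target_mono, ratio=(1, 1, 1)):
    """Combine parallel, back-translated and copied data at a given ratio.

    parallel        -- list of (source, target) pairs from the clean corpus
    back_translated -- list of (synthetic_source, target) pairs
    target_mono     -- list of target-language sentences to be copied
    ratio           -- (parallel, back-translated, copied) proportions,
                       e.g. (1, 1, 1) or (1, 2, 2)
    """
    p, b, c = ratio
    if p <= 0:
        raise ValueError("the parallel share of the ratio must be positive")

    # Scale the synthetic portions relative to the parallel corpus size.
    n_back = len(parallel) * b // p
    n_copy = len(parallel) * c // p

    # Copied data: the same target sentence serves as source and target.
    copied = [(sentence, sentence) for sentence in target_mono[:n_copy]]

    return list(parallel) + list(back_translated[:n_back]) + copied
```

With two parallel pairs, a 1:1:1 ratio yields six training pairs and a 1:2:2 ratio yields ten, provided enough back-translated and monolingual sentences are supplied.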
Ott, Myle and Auli, Michael and Grangier, David and Ranzato, Marc'Aurelio (2018):
Analyzing Uncertainty in Neural Machine Translation, Proceedings of the 35th International Conference on Machine Learning
mentioned in Corpus Cleaning, Inference and Analysis And Visualization
@InProceedings{pmlr-v80-ott18a,
author = {Ott, Myle and Auli, Michael and Grangier, David and Ranzato, Marc'Aurelio},
title = {Analyzing Uncertainty in Neural Machine Translation},
booktitle = {Proceedings of the 35th International Conference on Machine Learning},
pages = {3956--3965},
editor = {Dy, Jennifer and Krause, Andreas},
volume = {80},
series = {Proceedings of Machine Learning Research},
address = {Stockholmsmässan, Stockholm Sweden},
month = {10--15 Jul},
publisher = {PMLR},
url = {http://proceedings.mlr.press/v80/ott18a/ott18a.pdf},
year = 2018
}
Ott et al. (2018) find that while copied training sentences represent less than 2.0% of their training data (WMT 14 English-German and English-French), copies are over-represented in the output of beam search. Using a subset of training data from WMT 17, they replace a subset of the true translations with a copy of the input. They analyze varying amounts of copied noise and a variety of beam sizes. Larger beams are more affected by this kind of noise; however, for all beam sizes, performance degrades completely with 50% copied sentences.
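The copy-noise setup described above, in which a share of the true translations is replaced by a copy of the input, could be reproduced along the following lines. This is a sketch under stated assumptions: the function name `inject_copy_noise`, the uniform random choice of which pairs to corrupt, and the rounding of the noisy count are all illustrative and not taken from Ott et al. (2018).

```python
import random


def inject_copy_noise(pairs, fraction, seed=0):
    """Replace the target of a random share of pairs with the source.

    pairs    -- list of (source, target) sentence pairs
    fraction -- share of pairs to corrupt, between 0.0 and 1.0
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0.0 and 1.0")

    rng = random.Random(seed)
    n_noisy = round(fraction * len(pairs))

    # Pick which positions receive a copied source as their target.
    noisy_positions = set(rng.sample(range(len(pairs)), n_noisy))

    return [
        (source, source) if i in noisy_positions else (source, target)
        for i, (source, target) in enumerate(pairs)
    ]
```

Calling the function with increasing values of `fraction` gives a series of training sets with growing amounts of copy noise, which could then be used to train and compare systems at different beam sizes.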
New Publications
Alberto Poncelas and Gideon Maillette de Buy Wenniger and Andy Way (2018):
Data Selection with Feature Decay Algorithms Using an Approximated Target Side, Proceedings of the International Workshop on Spoken Language Translation (IWSLT)
@inproceedings{iwslt18-Selection-Poncelas,
author = {Alberto Poncelas and Gideon Maillette de Buy Wenniger and Andy Way},
title = {Data Selection with Feature Decay Algorithms Using an Approximated Target Side},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
year = 2018
}
Poncelas et al. (2018)
Pinnis, Marcis (2018):
Tilde's Parallel Corpus Filtering Methods for WMT 2018, Proceedings of the Third Conference on Machine Translation: Shared Task Papers
@inproceedings{W18-6486,
author = {Pinnis, Marcis},
title = {Tilde{'}s Parallel Corpus Filtering Methods for WMT 2018},
booktitle = {Proceedings of the Third Conference on Machine Translation: Shared Task Papers},
month = {oct},
address = {Belgium, Brussels},
publisher = {Association for Computational Linguistics},
url = {https://www.aclweb.org/anthology/W18-6486},
pages = {939--945},
year = 2018
}
Pinnis (2018)
Barbu, Eduard (2017):
Ensembles of Classifiers for Cleaning Web Parallel Corpora and Translation Memories, Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017
@inproceedings{barbu-2017-ensembles,
author = {Barbu, Eduard},
title = {Ensembles of Classifiers for Cleaning Web Parallel Corpora and Translation Memories},
booktitle = {Proceedings of the International Conference Recent Advances in Natural Language Processing, {RANLP} 2017},
month = {sep},
address = {Varna, Bulgaria},
publisher = {INCOMA Ltd.},
url = {https://doi.org/10.26615/978-954-452-049-6_011},
doi = {10.26615/978-954-452-049-6_011},
pages = {71--77},
year = 2017
}
Barbu (2017)
Guo, Mandy and Shen, Qinlan and Yang, Yinfei and Ge, Heming and Cer, Daniel and Hernandez Abrego, Gustavo and Stevens, Keith and Constant, Noah and Sung, Yun-hsuan and Strope, Brian and Kurzweil, Ray (2018):
Effective Parallel Corpus Mining using Bilingual Sentence Embeddings, Proceedings of the Third Conference on Machine Translation: Research Papers
@inproceedings{W18-6317,
author = {Guo, Mandy and Shen, Qinlan and Yang, Yinfei and Ge, Heming and Cer, Daniel and Hernandez Abrego, Gustavo and Stevens, Keith and Constant, Noah and Sung, Yun-hsuan and Strope, Brian and Kurzweil, Ray},
title = {Effective Parallel Corpus Mining using Bilingual Sentence Embeddings},
booktitle = {Proceedings of the Third Conference on Machine Translation: Research Papers},
month = {oct},
address = {Belgium, Brussels},
publisher = {Association for Computational Linguistics},
url = {https://www.aclweb.org/anthology/W18-6317},
pages = {165--176},
year = 2018
}
Guo et al. (2018)
Schwenk, Holger (2018):
Filtering and Mining Parallel Data in a Joint Multilingual Space, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
mentioned in Corpus Cleaning and Multilingual Word Embeddings
@InProceedings{P18-2037,
author = {Schwenk, Holger},
title = {Filtering and Mining Parallel Data in a Joint Multilingual Space},
booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
publisher = {Association for Computational Linguistics},
pages = {228--234},
location = {Melbourne, Australia},
url = {http://aclweb.org/anthology/P18-2037},
year = 2018
}
Schwenk (2018)
Xu, Hainan and Koehn, Philipp (2017):
Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
@InProceedings{D17-1318,
author = {Xu, Hainan and Koehn, Philipp},
title = {Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora},
booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
publisher = {Association for Computational Linguistics},
pages = {2935--2940},
location = {Copenhagen, Denmark},
url = {http://aclweb.org/anthology/D17-1318},
year = 2017
}
Xu and Koehn (2017)
Seppo Enarvi and Mikko Kurimo (2013):
Studies on training text selection for conversational Finnish language modeling, Proceedings of the International Workshop on Spoken Language Translation (IWSLT)
@inproceedings{Enarvi:iwslt:2013,
author = {Seppo Enarvi and Mikko Kurimo},
title = {Studies on training text selection for conversational {Finnish} language modeling},
url = {http://www.mt-archive.info/10/IWSLT-2013-Enarvi.pdf},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
year = 2013
}
Enarvi and Kurimo (2013)
Barbu, Eduard (2015):
Spotting false translation segments in translation memories, Proceedings of the Workshop Natural Language Processing for Translation Memories
@InProceedings{barbu:2015:NLP4TM,
author = {Barbu, Eduard},
title = {Spotting false translation segments in translation memories},
booktitle = {Proceedings of the Workshop Natural Language Processing for Translation Memories},
month = {September},
address = {Hissar, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {9--16},
url = {http://www.aclweb.org/anthology/W15-5202},
year = 2015
}
Barbu (2015)
Jalili Sabet, Masoud and Negri, Matteo and Turchi, Marco and C. de Souza, José G. and Federico, Marcello (2016):
TMop: a Tool for Unsupervised Translation Memory Cleaning, Proceedings of ACL-2016 System Demonstrations
@InProceedings{jalilisabet-EtAl:2016:P16-4,
author = {Jalili Sabet, Masoud and Negri, Matteo and Turchi, Marco and C. de Souza, Jos\'{e} G. and Federico, Marcello},
title = {TMop: a Tool for Unsupervised Translation Memory Cleaning},
booktitle = {Proceedings of ACL-2016 System Demonstrations},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {49--54},
url = {http://anthology.aclweb.org/P/P16/P16-4009},
year = 2016
}
Jalili Sabet et al. (2016)
Axelrod, Amittai and Resnik, Philip and He, Xiaodong and Ostendorf, Mari (2015):
Data Selection With Fewer Words, Proceedings of the Tenth Workshop on Statistical Machine Translation
@InProceedings{axelrod-EtAl:2015:WMT,
author = {Axelrod, Amittai and Resnik, Philip and He, Xiaodong and Ostendorf, Mari},
title = {Data Selection With Fewer Words},
booktitle = {Proceedings of the Tenth Workshop on Statistical Machine Translation},
month = {September},
address = {Lisbon, Portugal},
publisher = {Association for Computational Linguistics},
pages = {58--65},
url = {http://aclweb.org/anthology/W15-3003},
year = 2015
}
Axelrod et al. (2015)
Cui, Lei and Zhang, Dongdong and Liu, Shujie and Li, Mu and Zhou, Ming (2013):
Bilingual Data Cleaning for SMT using Graph-based Random Walk, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
@InProceedings{cui-EtAl:2013:Short,
author = {Cui, Lei and Zhang, Dongdong and Liu, Shujie and Li, Mu and Zhou, Ming},
title = {Bilingual Data Cleaning for {SMT} using Graph-based Random Walk},
booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {340--345},
url = {http://www.aclweb.org/anthology/P13-2061},
year = 2013
}
Cui et al. (2013)
Kashif Shah and Lucia Specia (2014):
Quality estimation for translation selection, Proceedings of 17th Annual conference of the European Association for Machine Translation
@inproceedings{eamt-2014-Shah,
author = {Kashif Shah and Lucia Specia},
title = {Quality estimation for translation selection},
booktitle = {Proceedings of 17th Annual conference of the European Association for Machine Translation},
pages = {109--116},
url = {http://www.mt-archive.info/10/EAMT-2014-Shah.pdf},
location = {Dubrovnik, Croatia},
year = 2014
}
Shah and Specia (2014)
Anthony Rousseau (2013):
XenC: An Open-Source Tool for Data Selection in Natural Language Processing, The Prague Bulletin of Mathematical Linguistics
@article{pbml-100-rousseau,
author = {Anthony Rousseau},
title = {XenC: An Open-Source Tool for Data Selection in Natural Language Processing},
url = {http://ufal.mff.cuni.cz/pbml/100/art-rousseau.pdf},
pages = {73--82},
journal = {The Prague Bulletin of Mathematical Linguistics},
volume = {100},
year = 2013
}
Rousseau (2013)
Arase, Yuki and Zhou, Ming (2013):
Machine Translation Detection from Monolingual Web-Text, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{arase-zhou:2013:ACL2013,
author = {Arase, Yuki and Zhou, Ming},
title = {Machine Translation Detection from Monolingual Web-Text},
booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {1597--1607},
url = {http://www.aclweb.org/anthology/P13-1157},
year = 2013
}
Arase and Zhou (2013)
Aharoni, Roee and Koppel, Moshe and Goldberg, Yoav (2014):
Automatic Detection of Machine Translated Text and Translation Quality Estimation, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
@InProceedings{aharoni-koppel-goldberg:2014:P14-2,
author = {Aharoni, Roee and Koppel, Moshe and Goldberg, Yoav},
title = {Automatic Detection of Machine Translated Text and Translation Quality Estimation},
booktitle = {Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
month = {June},
address = {Baltimore, Maryland},
publisher = {Association for Computational Linguistics},
pages = {289--295},
url = {http://www.aclweb.org/anthology/P14-2048},
year = 2014
}
Aharoni et al. (2014)
Michel Simard (2014):
Clean data for training statistical MT: the case of MT contamination, Proceedings of the Eleventh Conference of the Association for Machine Translation in the Americas (AMTA)
@inproceedings{AMTA-2014-Simard,
author = {Michel Simard},
title = {Clean data for training statistical MT: the case of {MT} contamination},
pages = {69--82},
url = {http://www.mt-archive.info/10/AMTA-2014-Simard.pdf},
volume = {1},
booktitle = {Proceedings of the Eleventh Conference of the Association for Machine Translation in the Americas (AMTA)},
location = {Vancouver, BC, Canada},
year = 2014
}
Simard (2014)
Kaveh Taghipour and Shahram Khadivi and Jia Xu (2011):
Parallel Corpus Refinement as an Outlier Detection Algorithm, Proceedings of the 13th Machine Translation Summit (MT Summit XIII)
@inproceedings{MTS-2011-Taghipour,
author = {Kaveh Taghipour and Shahram Khadivi and Jia Xu},
title = {Parallel Corpus Refinement as an Outlier Detection Algorithm},
url = {http://www.mt-archive.info/MTS-2011-Taghipour.pdf},
pages = {414--421},
booktitle = {Proceedings of the 13th Machine Translation Summit (MT Summit XIII)},
publisher = {International Association for Machine Translation},
location = {Xiamen, China},
year = 2011
}
Taghipour et al. (2011)
Formiga, Lluís and Fonollosa, José A. R. (2012):
Dealing with Input Noise in Statistical Machine Translation, Proceedings of COLING 2012: Posters
@InProceedings{formiga-fonollosa:2012:POSTERS,
author = {Formiga, Llu{\'i}s and Fonollosa, Jos{\'e} A. R.},
title = {Dealing with Input Noise in Statistical Machine Translation},
booktitle = {Proceedings of COLING 2012: Posters},
month = {December},
address = {Mumbai, India},
publisher = {The COLING 2012 Organizing Committee},
pages = {319--328},
url = {http://www.aclweb.org/anthology/C12-2032},
year = 2012
}
Formiga and Fonollosa (2012)
Lui, Marco and Baldwin, Timothy (2012):
langid.py: An Off-the-shelf Language Identification Tool, Proceedings of the ACL 2012 System Demonstrations
@InProceedings{lui-baldwin:2012:Demo,
author = {Lui, Marco and Baldwin, Timothy},
title = {langid.py: An Off-the-shelf Language Identification Tool},
booktitle = {Proceedings of the ACL 2012 System Demonstrations},
month = {July},
address = {Jeju Island, Korea},
publisher = {Association for Computational Linguistics},
pages = {25--30},
url = {http://www.aclweb.org/anthology/P12-3005},
year = 2012
}
Lui and Baldwin (2012)
Jehl, Laura and Hieber, Felix and Riezler, Stefan (2012):
Twitter Translation using Translation-Based Cross-Lingual Retrieval, Proceedings of the Seventh Workshop on Statistical Machine Translation
@InProceedings{jehl-hieber-riezler:2012:WMT,
author = {Jehl, Laura and Hieber, Felix and Riezler, Stefan},
title = {Twitter Translation using Translation-Based Cross-Lingual Retrieval},
booktitle = {Proceedings of the Seventh Workshop on Statistical Machine Translation},
month = {June},
address = {Montreal, Canada},
publisher = {Association for Computational Linguistics},
pages = {163--174},
url = {http://www.aclweb.org/anthology/W12-3121},
year = 2012
}
Jehl et al. (2012)