Word Alignment Based on Co-Occurrence
While most current work on word alignment is model-based, more heuristic approaches are based on co-occurrence statistics.
Word Alignment Based on Co-Occurrence is the main subject of 20 publications, 16 of which are discussed here.
Publications
Early work in word alignment focused on co-occurrence statistics to find evidence for word associations
Hiroyuki Kaji and Toshiko Aizono (1996):
Extracting Word Correspondences from Bilingual Corpora Based on Word Co-occurrence Information, Proceedings of the 16th International Conference on Computational Linguistics (COLING)
@InProceedings{Kaji:1996,
author = {Hiroyuki Kaji and Toshiko Aizono},
title = {Extracting Word Correspondences from Bilingual Corpora Based on Word Co-occurrence Information},
url = {http://acl.ldc.upenn.edu/C/C96/C96-1006.pdf},
googlescholar = {14726816923602580517},
booktitle = {Proceedings of the 16th International Conference on Computational Linguistics (COLING)},
year = 1996
}
(Kaji and Aizono, 1996). These methods may find evidence for the alignment of a word to multiple translations, a problem called indirect association, which may be overcome by enforcing one-to-one alignments
I. Dan Melamed (1996):
Automatic construction of clean broad-coverage translation lexicons, Proceedings of the Conference of the Association for Machine Translation in the Americas
@Inproceedings{Melamed:1996c,
author = {I. Dan Melamed},
title = {Automatic construction of clean broad-coverage translation lexicons},
url = {http://www.mt-archive.info/AMTA-1996-Melamed.pdf},
googlescholar = {11463806347371034787},
booktitle = {Proceedings of the Conference of the Association for Machine Translation in the Americas},
year = 1996
}
(Melamed, 1996).
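As a rough illustration of how co-occurrence statistics yield association scores, the sketch below computes the Dice coefficient for every word pair that co-occurs in a sentence pair. The Dice coefficient is only one of several association measures and is chosen here as an assumption, not as the measure used by any of the cited papers; the function name `dice_scores` and the three-sentence toy corpus are invented for illustration.

```python
from collections import Counter
from itertools import product


def dice_scores(bitext):
    """Return Dice association scores for all co-occurring word pairs.

    bitext: list of (source_tokens, target_tokens) sentence pairs.
    Counts are per sentence pair (word types, not tokens).
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in bitext:
        src_types, tgt_types = set(src), set(tgt)
        src_count.update(src_types)
        tgt_count.update(tgt_types)
        # Every source word co-occurs with every target word of the pair.
        pair_count.update(product(src_types, tgt_types))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


# Toy corpus, invented for illustration only.
bitext = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a house".split(), "ein haus".split()),
]
scores = dice_scores(bitext)

# Print the candidate translations of "house", best first.
for (s, t), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    if s == "house":
        print(t, round(score, 3))
```

In this toy corpus "house" receives its highest score with "haus", but it also receives non-zero scores with "das" and "ein" simply because those words appear in the same sentence pairs. This is the kind of spurious evidence that a one-to-one constraint is meant to suppress.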
Akira Kumano and Hideki Hirakawa (1994):
BUILDING AN MT DICTIONARY FROM PARALLEL TEXTS BASED ON LINGUISTIC AND STATISTICAL INFORMATION, Proceedings of the 15th International Conference on Computational Linguistics (COLING)
@InProceedings{Kumano:1994,
author = {Akira Kumano and Hideki Hirakawa},
title = {BUILDING AN {MT} DICTIONARY FROM PARALLEL TEXTS BASED ON LINGUISTIC AND STATISTICAL INFORMATION},
url = {http://acl.ldc.upenn.edu/C/C94/C94-1009.pdf},
googlescholar = {3640366317674066250},
booktitle = {Proceedings of the 15th International Conference on Computational Linguistics (COLING)},
year = 1994
}
Kumano and Hirakawa (1994) augment this method with an existing bilingual dictionary.
Kengo Sato and Masakazu Nakanishi (1998):
Maximum Entropy Model Learning of the Translation Rules, Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics (ACL)
@Inproceedings{Sato:1998,
author = {Kengo Sato and Masakazu Nakanishi},
title = {Maximum Entropy Model Learning of the Translation Rules},
url = {http://acl.ldc.upenn.edu/C/C98/C98-2186.pdf},
googlescholar = {3944009878662260031},
booktitle = {Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics (ACL)},
year = 1998
}
Sato and Nakanishi (1998) use a maximum entropy model for word associations.
Sur-Jin Ker and Jason J. S. Chang (1996):
Aligning More Words with High Precision for Small Bilingual Corpora, Proceedings of the 16th International Conference on Computational Linguistics (COLING)
@InProceedings{Ker:1996,
author = {Sur-Jin Ker and Jason J. S. Chang},
title = {Aligning More Words with High Precision for Small Bilingual Corpora},
url = {http://acl.ldc.upenn.edu/C/C96/C96-1037.pdf},
googlescholar = {5759244603010696985},
booktitle = {Proceedings of the 16th International Conference on Computational Linguistics (COLING)},
year = 1996
}
Ker and Chang (1996) group words into sense classes from a thesaurus to improve word alignment accuracy.
Co-occurrence counts may also be used for phrase alignment, although this typically requires more efficient data structures for storing all phrases
Cromieres, Fabien (2006):
Sub-Sentential Alignment Using Substring Co-Occurrence Counts, Proceedings of the COLING/ACL 2006 Student Research Workshop
@InProceedings{cromieres:2006:SRW,
author = {Cromieres, Fabien},
title = {Sub-Sentential Alignment Using Substring Co-Occurrence Counts},
booktitle = {Proceedings of the COLING/ACL 2006 Student Research Workshop},
month = {July},
address = {Sydney, Australia},
publisher = {Association for Computational Linguistics},
pages = {13--18},
url = {http://www.aclweb.org/anthology/P/P06/P06-3003},
year = 2006
}
(Cromieres, 2006).
Chatterjee, Niladri and Agrawal, Saumya (2006):
Word Alignment in English-Hindi Parallel Corpus Using Recency-Vector Approach: Some Studies, Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics
@InProceedings{chatterjee-agrawal:2006:COLACL,
author = {Chatterjee, Niladri and Agrawal, Saumya},
title = {Word Alignment in {English-Hindi} Parallel Corpus Using Recency-Vector Approach: Some Studies},
booktitle = {Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics},
month = {July},
address = {Sydney, Australia},
publisher = {Association for Computational Linguistics},
pages = {649--656},
url = {http://www.aclweb.org/anthology/P/P06/P06-1082},
year = 2006
}
Chatterjee and Agrawal (2006) extend a recency vector approach
Pascale Fung and Kathleen R. McKeown (1994):
Aligning noisy parallel corpora across language groups: Word pair feature matching by dynamic time warping, 1st Conference of the Association for Machine Translation in the Americas (AMTA)
@Inproceedings{pascale94aligning,
author = {Pascale Fung and Kathleen R. McKeown},
title = {Aligning noisy parallel corpora across language groups: Word pair feature matching by dynamic time warping},
booktitle = {1st Conference of the Association for Machine Translation in the Americas (AMTA)},
year = 1994
}
(Fung and McKeown, 1994) with additional constraints.
Lardilleux, Adrien and Lepage, Yves (2008):
Multilingual Alignments by Monolingual String Differences, Coling 2008: Companion volume: Posters and Demonstrations
@InProceedings{lardilleux-lepage:2008:POSTERS,
author = {Lardilleux, Adrien and Lepage, Yves},
title = {Multilingual Alignments by Monolingual String Differences},
booktitle = {Coling 2008: Companion volume: Posters and Demonstrations},
month = {August},
address = {Manchester, UK},
publisher = {Coling 2008 Organizing Committee},
pages = {53--56},
url = {http://www.aclweb.org/anthology/C08-3014},
year = 2008
}
Lardilleux and Lepage (2008) iteratively match the longest common subsequences from sentence pairs and align the remainder.
Heuristic word alignment methods may be extended into iterative algorithms, for instance the competitive linking algorithm by
I. Dan Melamed (1995):
Automatic Evaluation and Uniform Filter Cascades for Inducing N-Best Translation Lexicons, Proceedings of the Third Workshop on Very Large Corpora (VLC)
@Inproceedings{Melamed:1995b,
author = {I. Dan Melamed},
title = {Automatic Evaluation and Uniform Filter Cascades for Inducing N-Best Translation Lexicons},
url = {http://acl.ldc.upenn.edu/W/W95/W95-0115.pdf},
googlescholar = {3405308963388389121},
booktitle = {Proceedings of the Third Workshop on Very Large Corpora (VLC)},
year = 1995
}
Melamed (1995);
I. Dan Melamed (1996):
A Geometric Approach to Mapping Bitext Correspondence, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)
mentioned in Sentence Alignment and Word Alignment Based on Co-Occurrence
@Inproceedings{Melamed:1996,
author = {I. Dan Melamed},
title = {A Geometric Approach to Mapping Bitext Correspondence},
url = {http://acl.ldc.upenn.edu/W/W96/W96-0201.pdf},
googlescholar = {598522478706255108},
booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year = 1996
}
Melamed (1996);
I. Dan Melamed (1997):
A Word-to-Word Model of Translational Equivalence, Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL)
@Inproceedings{Melamed:1997b,
author = {I. Dan Melamed},
title = {A Word-to-Word Model of Translational Equivalence},
url = {http://acl.ldc.upenn.edu/P/P97/P97-1063.pdf},
googlescholar = {12869019493018941353},
booktitle = {Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL)},
year = 1997
}
Melamed (1997);
I. Dan Melamed (2000):
Models of Translational Equivalence among Words, Computational Linguistics
@Article{Melamed:2000,
author = {I. Dan Melamed},
title = {Models of Translational Equivalence among Words},
url = {http://acl.ldc.upenn.edu/J/J00/J00-2004.pdf},
journal = {Computational Linguistics},
volume = {26},
number = {2},
year = 2000
}
Melamed (2000) or bilingual bracketing
Dekai Wu (1997):
Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora, Computational Linguistics
mentioned in Word Alignment Based on Co-Occurrence and Inversion Transduction Grammars
@Article{Wu:1997,
author = {Dekai Wu},
title = {Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora},
url = {http://acl.ldc.upenn.edu/J/J97/J97-3002.pdf},
googlescholar = {7926725626202301933},
journal = {Computational Linguistics},
volume = {23},
number = {3},
year = 1997
}
(Wu, 1997).
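The one-to-one idea behind competitive linking can be sketched as a greedy procedure: rank all candidate word pairs of a sentence pair by association score and accept a pair only if neither word has been linked yet. The code below is a simplified single-pass sketch under that assumption, not a reproduction of Melamed's full algorithm, which re-estimates scores over several iterations. The function name `competitive_link` and the score table are hypothetical.

```python
def competitive_link(src, tgt, score):
    """Greedily link words of one sentence pair, best score first.

    src, tgt: lists of tokens.
    score: dict mapping (source_word, target_word) to an association score.
    Each source and each target position is linked at most once.
    """
    candidates = sorted(
        (
            (score.get((s, t), 0.0), i, j)
            for i, s in enumerate(src)
            for j, t in enumerate(tgt)
        ),
        # Highest score first; ties broken by position for determinism.
        key=lambda c: (-c[0], c[1], c[2]),
    )
    used_src, used_tgt, links = set(), set(), []
    for sc, i, j in candidates:
        if sc <= 0.0:
            break  # remaining pairs have no supporting evidence
        if i not in used_src and j not in used_tgt:
            links.append((i, j))
            used_src.add(i)
            used_tgt.add(j)
    return sorted(links)


# Hypothetical association scores, e.g. from a co-occurrence measure.
score = {
    ("the", "das"): 1.0, ("the", "haus"): 0.5,
    ("house", "das"): 0.5, ("house", "haus"): 1.0,
}
links = competitive_link(["the", "house"], ["das", "haus"], score)
print(links)
```

Because the strongest pairs claim their words first, the weaker pairs ("the", "haus") and ("house", "das") are blocked even though their scores are non-zero.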
Dan Tufiş (2002):
A cheap and fast way to build useful translation lexicons, Proceedings of the International Conference on Computational Linguistics (COLING)
@InProceedings{Tufis:2002,
author = {Dan Tufi{\c{s}}},
title = {A cheap and fast way to build useful translation lexicons},
booktitle = {Proceedings of the International Conference on Computational Linguistics (COLING)},
year = 2002
}
Tufiş (2002) extends a simple co-occurrence method to align words.
Monolingual collocation may also be helpful for word alignment:
Liu, Zhanyi and Wang, Haifeng and Wu, Hua and Li, Sheng (2010):
Improving Statistical Machine Translation with Monolingual Collocation, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
mentioned in Word Alignment Based on Co-Occurrence and Context Features
@InProceedings{liu-EtAl:2010:ACL,
author = {Liu, Zhanyi and Wang, Haifeng and Wu, Hua and Li, Sheng},
title = {Improving Statistical Machine Translation with Monolingual Collocation},
booktitle = {Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics},
month = {July},
address = {Uppsala, Sweden},
publisher = {Association for Computational Linguistics},
pages = {825--833},
url = {http://www.aclweb.org/anthology/P10-1085},
year = 2010
}
Liu et al. (2010) use collocation statistics to help group words into cepts.
Benchmarks
Discussion
Related Topics
New Publications
Bai, Ming-Hong and You, Jia-Ming and Chen, Keh-Jiann and Chang, Jason S. (2009):
Acquiring Translation Equivalences of Multiword Expressions by Normalized Correlation Frequencies, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing
@InProceedings{bai-EtAl:2009:EMNLP,
author = {Bai, Ming-Hong and You, Jia-Ming and Chen, Keh-Jiann and Chang, Jason S.},
title = {Acquiring Translation Equivalences of Multiword Expressions by Normalized Correlation Frequencies},
booktitle = {Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing},
month = {August},
address = {Singapore},
publisher = {Association for Computational Linguistics},
pages = {478--486},
url = {http://www.aclweb.org/anthology/D/D09/D09-1050},
year = 2009
}
Bai et al. (2009)
Moore, Robert C. (2005):
Association-Based Bilingual Word Alignment, Proceedings of the ACL Workshop on Building and Using Parallel Texts
@InProceedings{moore:2005:WPT,
author = {Moore, Robert C.},
title = {Association-Based Bilingual Word Alignment},
booktitle = {Proceedings of the ACL Workshop on Building and Using Parallel Texts},
month = {June},
address = {Ann Arbor, Michigan},
publisher = {Association for Computational Linguistics},
pages = {1--8},
url = {http://www.aclweb.org/anthology/W/W05/W05-0801},
year = 2005
}
Moore (2005)
Tiedemann, Jörg (2009):
Evidence-Based Word Alignment, Proceedings of the Workshop on Natural Language Processing Methods and Corpora in Translation, Lexicography, and Language Learning
@InProceedings{tiedemann:2009:NLPMCTLLL,
author = {Tiedemann, J\"{o}rg},
title = {Evidence-Based Word Alignment},
booktitle = {Proceedings of the Workshop on Natural Language Processing Methods and Corpora in Translation, Lexicography, and Language Learning},
month = {September},
address = {Borovets, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {28--32},
url = {http://www.aclweb.org/anthology/W09-4205},
year = 2009
}
Tiedemann (2009)
I. Dan Melamed (1995):
Automatic Evaluation and Uniform Filter Cascades for Inducing N-best Translation Lexicons, Third Workshop on Very Large Corpora
@Inproceedings{Melamed:1995,
author = {I. Dan Melamed},
title = {Automatic Evaluation and Uniform Filter Cascades for Inducing N-best Translation Lexicons},
url = {http://acl.ldc.upenn.edu/W/W95/W95-0115.pdf},
booktitle = {Third Workshop on Very Large Corpora},
year = 1995
}
Melamed (1995)