Creating Source Labels To Improve The Translation Search Space
Project leader: Hieu Hoang
Desirable skills for participants: hierarchical MT, C++, machine learning
Syntax-based MT is usually understood as translation with a synchronous context-free grammar (SCFG) whose non-terminals carry labels from a linguistic parser. Typically the target side of the training data is parsed, and the non-terminals in the translation rules are then labelled with constituent labels from the parse trees. This approach has two shortcomings:
1. You need a parser for the target language, which is not always available.
2. The parse tree can be incorrect, non-existent, or unsuited to MT.

An alternative approach is:
1. Label the source sentence and the source side of the training data.
2. Create your own labelling tool, suited to the MT task and the language pair you are working on. It can use hand-written rules or, preferably, be learnt automatically using ML techniques.
3. Selectively use the labels to constrain rule application during decoding. Do not use the standard syntactic SCFG model.

The aim is to create a grammar which better explains the translation and which will, hopefully, lead to better translations.

This project would be of interest to students who have some knowledge of machine learning and want a long-term subject of research.

The antecedent of this project is the Mixed-Syntax Model described in Hieu Hoang's PhD thesis and in (Hoang 2010). However, these works used a traditional parser and a chunk parser.
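
As an illustration of step 3 of the alternative approach (selectively using source labels to constrain rule application), here is a minimal sketch in Python. It is not the Mixed-Syntax implementation or any decoder's API: the Rule class, the rule_applies function, the toy German sentence, and the span labels are all hypothetical, chosen only to show the idea. A rule may optionally require a label on the source span its non-terminal covers, and rules without a requirement stay applicable everywhere.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


@dataclass(frozen=True)
class Rule:
    """Hypothetical hierarchical rule with one source non-terminal X."""
    source: str
    target: str
    # Label required on the source span that X covers.
    # None means the rule is unlabelled and may apply to any span.
    required_label: Optional[str] = None


def rule_applies(rule: Rule, span: Span, span_labels: Dict[Span, str]) -> bool:
    """Return True if the rule's non-terminal may cover the given source span."""
    if rule.required_label is None:
        return True  # labels are used selectively: unlabelled rules always apply
    return span_labels.get(span) == rule.required_label


# Toy source sentence "das alte Haus ist klein", with span labels
# produced by an assumed (hypothetical) source labelling tool.
span_labels = {(0, 3): "NP", (3, 5): "VP"}

rules = [
    Rule("X ist klein", "X is small", required_label="NP"),
    Rule("X ist klein", "X is little"),            # unlabelled fallback
    Rule("das X", "the X", required_label="VP"),   # label does not match span (0, 3)
]

# Rules whose non-terminal is allowed to cover tokens 0..2 ("das alte Haus").
applicable = [r.target for r in rules if rule_applies(r, (0, 3), span_labels)]
print(applicable)
```

In this sketch the labels act as a hard filter on the search space. A real system could equally treat a label mismatch as a soft feature with a learnt weight; which works better is an open design choice that this project would have to explore.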