Projects
Participants are welcome to propose a new project to be added to this list.
- Statistical Example Based MT Proposed by Chris Dyer Improve translation models with features that look at source sentence context. Desirable skills: Python/Cython, C++, machine learning, cdec
- Word posterior probability from a word graph Proposed by Mercedes Garcia Calculate the word posterior probability from a word graph (extracted from Moses). Useful for calculating confidence measures. Desirable skills: Python/C/C++, machine learning, confidence measures.
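For illustration, the posterior computation could be sketched as a forward-backward pass over the word graph. This is a toy version under simplifying assumptions: the graph is a DAG with topologically ordered integer node ids, edges are plain `(from, to, word, prob)` tuples (not the Moses word-graph format), and posteriors are aggregated over all edges carrying a word, whereas real confidence measures usually also align edges by position.

```python
from collections import defaultdict

def word_posteriors(edges, start, end):
    """Toy word posteriors over a word-graph DAG.

    edges: list of (from_node, to_node, word, prob) tuples; node ids are
    topologically ordered integers. Returns {word: posterior mass}.
    """
    fwd = defaultdict(float); fwd[start] = 1.0
    bwd = defaultdict(float); bwd[end] = 1.0
    # forward pass: nodes are topologically ordered, so sorting edges
    # by source node processes them in a valid order
    for u, v, w, p in sorted(edges):
        fwd[v] += fwd[u] * p
    # backward pass: same edges in reverse order
    for u, v, w, p in sorted(edges, reverse=True):
        bwd[u] += bwd[v] * p
    total = fwd[end]  # total path mass through the graph
    post = defaultdict(float)
    for u, v, w, p in edges:
        post[w] += fwd[u] * p * bwd[v] / total
    return dict(post)

# two alternatives "a"/"b" followed by "c"
word_posteriors([(0, 1, 'a', 0.6), (0, 1, 'b', 0.4), (1, 2, 'c', 1.0)], 0, 2)
```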
- Improved extraction heuristics for hierarchical phrase-based models Proposed by Hieu Hoang This project aims to improve the performance of hierarchical phrase-based models through improved extraction heuristics. Desirable skills: hierarchical MT, C++ or Java
- Features to model word span of non-terminals Proposed by Hieu Hoang In this project we will extend hierarchical phrase-based MT by modelling the distribution of the number of words spanned by each non-terminal. Desirable skills: hierarchical MT, C++
- Creating source labels to improve the translation search space Proposed by Hieu Hoang This project will seek to improve syntactic MT by extending the Mixed-Syntax model in my PhD thesis. The idea is to define a customised labelling for the source sentence which helps with MT rule extraction, rather than relying on off-the-shelf parsers. Desirable skills: hierarchical MT, C++, machine learning
- Building Moses Training Pipelines with Arrows Proposed by Ian Johnson The question is not which language (Perl/Python/Haskell?) to use for building training pipelines, but which programming style. Towards a pluggable code structure, we propose to use Arrows to construct the whole Moses training pipeline. Desirable skills: Python
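A minimal illustration of the arrow style in Python (the `Arrow` class and the toy stages are hypothetical, just to show how pipeline steps could compose; a real Moses pipeline would wrap corpus-processing stages rather than string functions):

```python
class Arrow:
    """Minimal arrow-like combinator: wraps a function of one argument."""
    def __init__(self, f):
        self.f = f
    def __rshift__(self, other):
        # sequential composition: self, then other
        return Arrow(lambda x: other.f(self.f(x)))
    def __and__(self, other):
        # fanout: feed the same input to both arrows
        return Arrow(lambda x: (self.f(x), other.f(x)))
    def __call__(self, x):
        return self.f(x)

# hypothetical pipeline stages
tokenize = Arrow(lambda s: s.lower().split())
count    = Arrow(len)
first    = Arrow(lambda toks: toks[0])

pipeline = tokenize >> (count & first)
pipeline("Hello MT Marathon")   # (3, 'hello')
```

The point of the style is that each stage stays independently testable and the wiring of the whole pipeline is explicit in one expression.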
- Bounded-memory LM and Phrase Table building Proposed by Kenneth Heafield Language model estimation takes an annoyingly large amount of RAM. Phrase table p(s|t) and p(t|s) evaluation takes a lot of time due to on-disk sorting in plaintext. The ultimate goal of this project is a much faster way to do both tasks using a user-specified amount of RAM and making efficient use of disk only when necessary. The marathon project will lay the groundwork: a framework that presents multiple input and multiple output streams to streaming algorithms. The output streams will be optionally sorted by the framework. Where possible, these streams will be kept in memory. However, once a user-specified memory bound is reached, they will be dumped to disk in binary format. Desirable skills: C++
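The spill-to-disk idea could be sketched roughly as follows. This is a toy under stated assumptions: the `BoundedStream` class, pickled records, and the length-prefixed layout are all illustrative, and a real framework would write a compact binary record format and sort/merge the spilled runs rather than simply replaying them.

```python
import pickle, tempfile

class BoundedStream:
    """Buffer records in memory; spill to a binary temp file past a byte budget."""
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.buf, self.size = [], 0
        self.spill = None
    def write(self, record):
        blob = pickle.dumps(record)
        self.buf.append(blob)
        self.size += len(blob)
        if self.size > self.max_bytes:
            # memory bound exceeded: flush the buffer to disk in order,
            # each record prefixed with its length
            if self.spill is None:
                self.spill = tempfile.TemporaryFile()
            for b in self.buf:
                self.spill.write(len(b).to_bytes(4, 'little') + b)
            self.buf, self.size = [], 0
    def read(self):
        # replay spilled records first (they were written first), then
        # whatever is still buffered in memory
        if self.spill is not None:
            self.spill.seek(0)
            while True:
                hdr = self.spill.read(4)
                if not hdr:
                    break
                n = int.from_bytes(hdr, 'little')
                yield pickle.loads(self.spill.read(n))
        for b in self.buf:
            yield pickle.loads(b)
```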
- Target hypergraph serialization Proposed by Kenneth Heafield Often, we repeatedly decode the same sentences with different feature weights. Usually, this happens within MERT, MIRA, or PRO, our version of Godot. This project will save time by parsing only once instead of every iteration. To do so, we need to develop an efficient (preferably binary) serialization format, modify Moses to generate it without doing cube pruning, and modify Moses to read this format without redoing parsing. Desirable skills: C++, Moses internals
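For illustration, a binary edge record might look like the following (a hypothetical layout, not an existing Moses format): head node id, tail node ids, rule id, and model score, all as fixed-width little-endian fields via Python's `struct`.

```python
import struct, io

def write_edge(f, head, tails, rule_id, score):
    """Write one hypergraph edge with fixed-width little-endian fields."""
    f.write(struct.pack('<IH', head, len(tails)))   # head id, tail count
    if tails:
        f.write(struct.pack('<%dI' % len(tails), *tails))
    f.write(struct.pack('<If', rule_id, score))     # rule id, float32 score

def read_edge(f):
    head, n = struct.unpack('<IH', f.read(6))
    tails = list(struct.unpack('<%dI' % n, f.read(4 * n))) if n else []
    rule_id, score = struct.unpack('<If', f.read(8))
    return head, tails, rule_id, score

buf = io.BytesIO()
write_edge(buf, 7, [1, 2], 42, 0.5)
buf.seek(0)
read_edge(buf)   # (7, [1, 2], 42, 0.5)
```

Fixed-width records like this can be scanned sequentially without any parsing, which is the property that would save time across MERT/MIRA/PRO iterations.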
- Document-Level Translation in Moses Proposed by Liane Guillou Design and implement a cache-based document-level translation strategy using Moses. Incorporate the notion of confidence estimation to ensure that only those best hypotheses that are deemed to be of “high quality” will be retained. Desirable skills: Experience with Moses and/or Machine Learning
- Integrate CSLM into Moses phrase-based decoder Proposed by Lane Schwartz Integrate the Continuous Space Language Model (CSLM) into Moses phrase-based decoding, using the newly implemented batch LM requests. Info on CSLM is at http://www-lium.univ-lemans.fr/cslm/ Desirable skills: C++, CSLM
- Optimisation of Sparse Moses Proposed by Barry Haddow Moses has support for sparse features in the miramerge branch, but it is quite inefficient in time and space. This project aims at optimising this version of Moses so it can be merged back into trunk, and so that Moses can scale to millions of features. Desirable skills: C++
- Open Source Computer Aided Translation Proposed by Philipp Koehn Two recently started EU-funded projects, MateCat and CASMACAT, have begun developing open source workbenches for human translators that can take advantage of machine translation. At the MT Marathon we would like to extend this effort. Desirable skills: Javascript, PHP
- SMT Research Survey Wiki Proposed by Philipp Koehn The number of research papers in statistical machine translation is exploding, so an up-to-date survey of all published papers, in the form of a wiki, would be a useful resource. This project looks at improving such a tool, currently at http://www.statmt.org/survey/ in beta stage. Desirable skills: PHP
- Diagnostic evaluation of MT with DELiC4MT Proposed by Antonio Toral This project aims to improve and extend DELiC4MT, an open-source tool for diagnostic evaluation of MT. Desirable skills: Java, Perl, PHP, experience/interest in MT evaluation (beyond just BLEU!)
- Zones as Features for Phrase-based decoding Proposed by Colin Cherry Moses currently implements zones in the input text as hard re-ordering constraints: once the decoder enters a zone, it cannot leave until the zone has been completely translated. This project would use the same idea to create features. Annotated zones of the input would incur learned penalties when interrupted. Adapters to create annotated zones (and therefore, features) from the output of popular NLP tools such as PTB constituency parsers, dependency parsers, or named entity recognizers would be added to Moses as part of this project. Desirable skills: C++
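The interruption penalty could be counted along these lines (a toy sketch, not Moses internals: zones are plain `(start, end)` source spans and the decoder's behaviour is summarised as the order in which it covered source positions):

```python
def zone_interruptions(zones, coverage_order):
    """Count how often a decoding order interrupts annotated zones.

    zones: list of (start, end) source spans, end exclusive.
    coverage_order: source positions in the order the decoder covered them.
    A penalty fires each time a position outside a zone is covered while
    that zone has been entered but not yet fully covered.
    """
    penalties = 0
    covered = set()
    for pos in coverage_order:
        for start, end in zones:
            inside = set(range(start, end))
            entered = bool(covered & inside)     # zone already touched?
            complete = inside <= covered         # zone fully translated?
            if entered and not complete and pos not in inside:
                penalties += 1
        covered.add(pos)
    return penalties
```

As a soft feature, this count would be multiplied by a learned weight instead of acting as the hard constraint Moses currently enforces.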
- Extend PRO and k-best MIRA to other error metrics Proposed by Colin Cherry PRO and k-best MIRA are currently available in Moses for tuning high-dimensional feature vectors in a batch setting. However, they are both hard-coded to use (different) sentence-level approximations to BLEU. This project would augment both systems to use arbitrary sentence-level error metrics. Desirable skills: C++, Perl
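The needed abstraction might look like this (a hypothetical interface, not the existing Moses code; the word-level edit-distance metric is a toy stand-in for a real sentence-level TER):

```python
class SentenceMetric:
    """Pluggable sentence-level error metric for PRO / k-best MIRA tuning."""
    def score(self, hypothesis, references):
        raise NotImplementedError

class SentenceTER(SentenceMetric):
    """Toy stand-in for TER: word edit distance over reference length."""
    def score(self, hypothesis, references):
        def edits(a, b):
            # classic dynamic-programming edit distance over token lists
            d = list(range(len(b) + 1))
            for i, x in enumerate(a, 1):
                prev, d[0] = d[0], i
                for j, y in enumerate(b, 1):
                    prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                           prev + (x != y))
            return d[-1]
        hyp = hypothesis.split()
        return min(edits(hyp, r.split()) / max(len(r.split()), 1)
                   for r in references)
```

With such an interface, the tuning loop only ever calls `score()`, so swapping sentence-BLEU for any other metric becomes a one-line change.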
- Sparse feature implementation in Joshua Proposed by Matt Post Joshua currently has only a dense feature representation for each grammar (with optional sharing of dense features across input grammar files). We'd like to extend this by adding support for hundreds of thousands or millions of sparse features. Desirable skills: Java, familiarity with cdec and Moses sparse feature implementations
- Parallel corpus extraction from Common Crawl Proposed by Philipp Koehn Common Crawl offers a current crawl of the web, amounting to 60 TB of data. This poses the challenge of extracting parallel text useful for training machine translation systems. Desirable skills: Hadoop, Amazon Web Services
- C++11 compliance for cdec Proposed by Chris Dyer Make cdec C++11 compliant. Desirable skills: C++
- Integration of phrase table filtering and merging in Moses Proposed by Barry Haddow Moses includes contributed code for phrase table filtering and for phrase table merging. It would be nice to have this fully integrated into the training pipeline. Desirable skills: Perl and/or C++
- Multiple reference translations for European languages Proposed by Eva Hasler This project will look at automatically creating multiple reference translations for the news/news commentary data sets. Desirable skills: experience with Moses pipeline, C++/Perl
- New development functionality for the Asiya suite: parameter optimization with MERT Proposed by Meritxell Gonzàlez, Cristina España-Bonet The purpose of the project is to integrate MERT and Asiya in order to allow MERT to use any user-defined set of metrics. Desirable skills: C++, Perl
- Integrating (a few) rules in the MT pipeline Proposed by Christian Buck We want to grab some low-hanging fruit to improve Moses for industry applications by allowing some simple rules in the translation pipeline. Desirable skills: Perl/Python