Projects
Participants are welcome to propose a new project to be added to this list.
- Statistical Example Based MT Proposed by Chris Dyer Improve translation models with features that look at source sentence context. Desirable skills: Python/Cython, C++, machine learning, cdec
- Word posterior probability from a word graph Proposed by Mercedes Garcia Calculate the word posterior probability from a word graph (extracted from Moses). Useful for calculating confidence measures. Desirable skills: Python/C/C++, machine learning, confidence measures.
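For illustration, the posterior computation could be sketched as a forward-backward pass over the word graph. This is a toy version under simplifying assumptions: the graph is a DAG with topologically ordered integer node ids, edges are plain `(from, to, word, prob)` tuples (not the Moses word-graph format), and posteriors are aggregated over all edges carrying a word, whereas real confidence measures usually also align edges by position.

```python
from collections import defaultdict

def word_posteriors(edges, start, end):
    """Toy word posteriors over a word-graph DAG.

    edges: list of (from_node, to_node, word, prob) tuples; node ids are
    topologically ordered integers. Returns {word: posterior mass}.
    """
    fwd = defaultdict(float); fwd[start] = 1.0
    bwd = defaultdict(float); bwd[end] = 1.0
    # forward pass: nodes are topologically ordered, so sorting edges
    # by source node processes them in a valid order
    for u, v, w, p in sorted(edges):
        fwd[v] += fwd[u] * p
    # backward pass: same edges in reverse order
    for u, v, w, p in sorted(edges, reverse=True):
        bwd[u] += bwd[v] * p
    total = fwd[end]  # total path mass through the graph
    post = defaultdict(float)
    for u, v, w, p in edges:
        post[w] += fwd[u] * p * bwd[v] / total
    return dict(post)

# two alternatives "a"/"b" followed by "c"
word_posteriors([(0, 1, 'a', 0.6), (0, 1, 'b', 0.4), (1, 2, 'c', 1.0)], 0, 2)
```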
- Improved extraction heuristics for hierarchical phrase-based models Proposed by Hieu Hoang This project aims to improve the performance of hierarchical phrase-based models through improved extraction heuristics. Desirable skills: hierarchical MT, C++ or Java
- Features to model word span of non-terminals Proposed by Hieu Hoang In this project we will extend hierarchical phrase-based MT by modelling the distribution of the number of words spanned by each non-terminal. Desirable skills: hierarchical MT, C++
- Creating source labels to improve the translation search space Proposed by Hieu Hoang This project will seek to improve syntactic MT by extending the Mixed-Syntax model in my PhD thesis. The idea is to define a customised labelling for the source sentence which helps with MT rule extraction, rather than relying on off-the-shelf parsers. Desirable skills: hierarchical MT, C++, machine learning
- Building Moses Training Pipelines with Arrows Proposed by Ian Johnson The question is not which language (Perl/Python/Haskell?) to use for building training pipelines, but which programming style. Towards a pluggable code structure, we propose to use Arrows to construct the whole Moses training pipeline. Desirable skills: Python
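A minimal illustration of the arrow style in Python (the `Arrow` class and the toy stages are hypothetical, just to show how pipeline steps could compose; a real Moses pipeline would wrap corpus-processing stages rather than string functions):

```python
class Arrow:
    """Minimal arrow-like combinator: wraps a function of one argument."""
    def __init__(self, f):
        self.f = f
    def __rshift__(self, other):
        # sequential composition: self, then other
        return Arrow(lambda x: other.f(self.f(x)))
    def __and__(self, other):
        # fanout: feed the same input to both arrows
        return Arrow(lambda x: (self.f(x), other.f(x)))
    def __call__(self, x):
        return self.f(x)

# hypothetical pipeline stages
tokenize = Arrow(lambda s: s.lower().split())
count    = Arrow(len)
first    = Arrow(lambda toks: toks[0])

pipeline = tokenize >> (count & first)
pipeline("Hello MT Marathon")   # (3, 'hello')
```

The point of the style is that each stage stays independently testable and the wiring of the whole pipeline is explicit in one expression.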
- Bounded-memory LM and Phrase Table building Proposed by Kenneth Heafield Language model estimation takes an annoyingly large amount of RAM. Phrase table p(s|t) and p(t|s) evaluation takes a lot of time due to on-disk sorting in plaintext. The ultimate goal of this project is a much faster way to do both tasks using a user-specified amount of RAM and making efficient use of disk only when necessary. The marathon project will lay the groundwork: a framework that presents multiple input and multiple output streams to streaming algorithms. The output streams will be optionally sorted by the framework. Where possible, these streams will be kept in memory. However, once a user-specified memory bound is reached, they will be dumped to disk in binary format. Desirable skills: C++
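The spill-to-disk idea could be sketched roughly as follows. This is a toy under stated assumptions: the `BoundedStream` class, pickled records, and the length-prefixed layout are all illustrative, and a real framework would write a compact binary record format and sort/merge the spilled runs rather than simply replaying them.

```python
import pickle, tempfile

class BoundedStream:
    """Buffer records in memory; spill to a binary temp file past a byte budget."""
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.buf, self.size = [], 0
        self.spill = None
    def write(self, record):
        blob = pickle.dumps(record)
        self.buf.append(blob)
        self.size += len(blob)
        if self.size > self.max_bytes:
            # memory bound exceeded: flush the buffer to disk in order,
            # each record prefixed with its length
            if self.spill is None:
                self.spill = tempfile.TemporaryFile()
            for b in self.buf:
                self.spill.write(len(b).to_bytes(4, 'little') + b)
            self.buf, self.size = [], 0
    def read(self):
        # replay spilled records first (they were written first), then
        # whatever is still buffered in memory
        if self.spill is not None:
            self.spill.seek(0)
            while True:
                hdr = self.spill.read(4)
                if not hdr:
                    break
                n = int.from_bytes(hdr, 'little')
                yield pickle.loads(self.spill.read(n))
        for b in self.buf:
            yield pickle.loads(b)
```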
- Target hypergraph serialization Proposed by Kenneth Heafield Often, we repeatedly decode the same sentences with different feature weights. Usually, this happens within MERT, MIRA, or PRO, our version of Godot. This project will save time by parsing only once instead of every iteration. To do so, we need to develop an efficient (preferably binary) serialization format, modify Moses to generate it without doing cube pruning, and modify Moses to read this format without redoing parsing. Desirable skills: C++, Moses internals
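For illustration, a binary edge record might look like the following (a hypothetical layout, not an existing Moses format): head node id, tail node ids, rule id, and model score, all as fixed-width little-endian fields via Python's `struct`.

```python
import struct, io

def write_edge(f, head, tails, rule_id, score):
    """Write one hypergraph edge with fixed-width little-endian fields."""
    f.write(struct.pack('<IH', head, len(tails)))   # head id, tail count
    if tails:
        f.write(struct.pack('<%dI' % len(tails), *tails))
    f.write(struct.pack('<If', rule_id, score))     # rule id, float32 score

def read_edge(f):
    head, n = struct.unpack('<IH', f.read(6))
    tails = list(struct.unpack('<%dI' % n, f.read(4 * n))) if n else []
    rule_id, score = struct.unpack('<If', f.read(8))
    return head, tails, rule_id, score

buf = io.BytesIO()
write_edge(buf, 7, [1, 2], 42, 0.5)
buf.seek(0)
read_edge(buf)   # (7, [1, 2], 42, 0.5)
```

Fixed-width records like this can be scanned sequentially without any parsing, which is the property that would save time across MERT/MIRA/PRO iterations.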
- Document-Level Translation in Moses Proposed by Liane Guillou Design and implement a cache-based document-level translation strategy using Moses. Incorporate the notion of confidence estimation to ensure that only those best hypotheses that are deemed to be of “high quality” will be retained. Desirable skills: Experience with Moses and/or Machine Learning
- Integrate CSLM into Moses phrase-based decoder Proposed by Lane Schwartz Integrate the Continuous Space Language Model (CSLM) into Moses phrase-based decoding, using the newly implemented batch LM requests. Info on CSLM is at http://www-lium.univ-lemans.fr/cslm/ Desirable skills: C++, CSLM
- Optimisation of Sparse Moses Proposed by Barry Haddow Moses has support for sparse features in the miramerge branch, but it is quite inefficient in time and space. This project aims at optimising this version of Moses so it can be merged back into trunk, and so that Moses can scale to millions of features. Desirable skills: C++
- Open Source Computer Aided Translation Proposed by Philipp Koehn Two recently started EU-funded projects, MateCat and CASMACAT, have begun developing open source workbenches for human translators that can take advantage of machine translation. At the MT Marathon we would like to extend this effort. Desirable skills: Javascript, PHP
- SMT Research Survey Wiki Proposed by Philipp Koehn The number of research papers in statistical machine translation is exploding, so an up-to-date survey of all published papers, in the form of a wiki, would be a useful resource. This project looks at improving such a tool, currently at http://www.statmt.org/survey/ in beta stage. Desirable skills: PHP
- Diagnostic evaluation of MT with DELiC4MT Proposed by Antonio Toral This project aims to improve and extend DELiC4MT, an open-source tool for diagnostic evaluation of MT. Desirable skills: Java, Perl, PHP, experience/interest in MT evaluation (beyond just BLEU!)
- Zones as Features for Phrase-based decoding Proposed by Colin Cherry Moses currently implements zones in the input text as hard re-ordering constraints: once the decoder enters a zone, it cannot leave until the zone has been completely translated. This project would use the same idea to create features. Annotated zones of the input would incur learned penalties when interrupted. Adapters to create annotated zones (and therefore, features) from the output of popular NLP tools such as PTB constituency parsers, dependency parsers, or named entity recognizers would be added to Moses as part of this project. Desirable skills: C++
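The interruption penalty could be counted along these lines (a toy sketch, not Moses internals: zones are plain `(start, end)` source spans and the decoder's behaviour is summarised as the order in which it covered source positions):

```python
def zone_interruptions(zones, coverage_order):
    """Count how often a decoding order interrupts annotated zones.

    zones: list of (start, end) source spans, end exclusive.
    coverage_order: source positions in the order the decoder covered them.
    A penalty fires each time a position outside a zone is covered while
    that zone has been entered but not yet fully covered.
    """
    penalties = 0
    covered = set()
    for pos in coverage_order:
        for start, end in zones:
            inside = set(range(start, end))
            entered = bool(covered & inside)     # zone already touched?
            complete = inside <= covered         # zone fully translated?
            if entered and not complete and pos not in inside:
                penalties += 1
        covered.add(pos)
    return penalties
```

As a soft feature, this count would be multiplied by a learned weight instead of acting as the hard constraint Moses currently enforces.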
- Extend PRO and k-best MIRA to other error metrics Proposed by Colin Cherry PRO and k-best MIRA are currently available in Moses for tuning high-dimensional feature vectors in a batch setting. However, they are both hard-coded to use (different) sentence-level approximations to BLEU. This project would augment both systems to use arbitrary sentence-level error metrics. Desirable skills: C++, Perl
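The needed abstraction might look like this (a hypothetical interface, not the existing Moses code; the word-level edit-distance metric is a toy stand-in for a real sentence-level TER):

```python
class SentenceMetric:
    """Pluggable sentence-level error metric for PRO / k-best MIRA tuning."""
    def score(self, hypothesis, references):
        raise NotImplementedError

class SentenceTER(SentenceMetric):
    """Toy stand-in for TER: word edit distance over reference length."""
    def score(self, hypothesis, references):
        def edits(a, b):
            # classic dynamic-programming edit distance over token lists
            d = list(range(len(b) + 1))
            for i, x in enumerate(a, 1):
                prev, d[0] = d[0], i
                for j, y in enumerate(b, 1):
                    prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                           prev + (x != y))
            return d[-1]
        hyp = hypothesis.split()
        return min(edits(hyp, r.split()) / max(len(r.split()), 1)
                   for r in references)
```

With such an interface, the tuning loop only ever calls `score()`, so swapping sentence-BLEU for any other metric becomes a one-line change.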
- Sparse feature implementation in Joshua Proposed by Matt Post Joshua currently has only a dense feature representation for each grammar (with optional sharing of dense features across input grammar files). We'd like to extend this by adding support for hundreds of thousands or millions of sparse features. Desirable skills: Java, familiarity with cdec and Moses sparse feature implementations
- Parallel corpus extraction from Common Crawl Proposed by Philipp Koehn Common Crawl offers a current crawl of the web, amounting to 60 TB of data. This poses the challenge of extracting parallel text useful for training machine translation systems. Desirable skills: Hadoop, Amazon Web Services
- C++11 compliance for cdec Proposed by Chris Dyer Make cdec C++11 compliant. Desirable skills: C++
- Integration of phrase table filtering and merging in Moses Proposed by Barry Haddow Moses includes contributed code for phrase table filtering and for phrase table merging. It would be nice to have this fully integrated into the training pipeline. Desirable skills: Perl and/or C++
- Multiple reference translations for European languages Proposed by Eva Hasler This project will look at automatically creating multiple reference translations for the news/news commentary data sets. Desirable skills: experience with Moses pipeline, C++/Perl
- New development functionality for the Asiya suite: parameter optimization with MERT Proposed by Meritxell Gonzàlez, Cristina España-Bonet The purpose of the project is to integrate MERT and Asiya in order to allow MERT to use any user-defined set of metrics. Desirable skills: C++, Perl
- Integrating (a few) rules in the MT pipeline Proposed by Christian Buck We want to grab some low-hanging fruit to improve Moses for industry applications by allowing some simple rules in the translation pipeline. Desirable skills: Perl/Python