Projects

Note that the password for this page is: mtm12

Participants are welcome to propose a new project to be added to this list.

  • Statistical Example Based MT
    Proposed by Chris Dyer
     
    Improve translation models with features that look at source sentence context.
     
    Desirable skills: Python/Cython, C++, machine learning, cdec
  • Word posterior probability from a word graph
    Proposed by Mercedes Garcia
     
    Calculate the word posterior probability of each word in a word graph (extracted from Moses). This is useful for calculating confidence measures.
     
    Desirable skills: Python/C/C++, machine learning, confidence measures.
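The core computation can be sketched with a forward-backward pass over the graph. The edge format below, (from_state, to_state, word, prob) tuples in topological order, is a hypothetical stand-in for the actual Moses word-graph format:

```python
from collections import defaultdict

def word_posteriors(edges, start, final):
    """Posterior mass of each word in an acyclic word graph.

    edges: (from_state, to_state, word, prob) tuples, sorted so that
    from_state is non-decreasing (a topological order of states)."""
    fwd = defaultdict(float)
    fwd[start] = 1.0
    for s, t, w, p in edges:                       # forward pass
        fwd[t] += fwd[s] * p
    bwd = defaultdict(float)
    bwd[final] = 1.0
    for s, t, w, p in reversed(edges):             # backward pass
        bwd[s] += p * bwd[t]
    total = fwd[final]                             # total mass of all paths
    post = defaultdict(float)
    for s, t, w, p in edges:                       # mass of paths through each edge
        post[w] += fwd[s] * p * bwd[t] / total
    return dict(post)
```

For example, a graph with two competing arcs "a" (0.6) and "b" (0.4) followed by a mandatory "c" yields posteriors 0.6, 0.4, and 1.0 respectively; a real implementation would additionally handle log probabilities and position-dependent posteriors.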
  • Improved extraction heuristics for hierarchical phrase-based models
    Proposed by Hieu Hoang
     
    This project aims to improve the performance of hierarchical phrase-based models through improved extraction heuristics.
     
    Desirable skills: hierarchical MT, C++ or Java
  • Features to model word span of non-terminals
    Proposed by Hieu Hoang
     
    In this project we will extend hierarchical phrase-based MT by modelling the distribution of the number of words spanned by each non-terminal.
     
    Desirable skills: hierarchical MT, C++
  • Creating source labels to improve the translation search space
    Proposed by Hieu Hoang
     
    This project will seek to improve syntactic MT by extending the Mixed-Syntax model in my PhD thesis. The idea is to define a customised labelling for the source sentence which helps with MT rule extraction, rather than relying on off-the-shelf parsers.
     
    Desirable skills: hierarchical MT, C++, machine learning
  • Building Moses Training Pipelines with Arrows
    Proposed by Ian Johnson
     
    Building training pipelines is not about the choice of language (Perl? Python? Haskell?) but about the choice of programming style. To move towards a pluggable code structure, we propose using Arrows to construct the whole Moses training pipeline.
     
    Desirable skills: Python
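The style can be sketched in Python. The Arrow class, operators, and step names below are illustrative only, not Haskell's actual Arrow type class: the point is that pipeline steps become composable values rather than a fixed script:

```python
class Arrow:
    """Minimal arrow-style combinator: wraps a step so pipelines
    are built by composition instead of hard-coded control flow."""
    def __init__(self, f):
        self.f = f
    def __call__(self, x):
        return self.f(x)
    def __rshift__(self, other):      # sequential composition: a >> b
        return Arrow(lambda x: other.f(self.f(x)))
    def __and__(self, other):         # fanout: run both steps on one input
        return Arrow(lambda x: (self.f(x), other.f(x)))

# Hypothetical steps standing in for tokenisation, truecasing, etc.
tokenize = Arrow(lambda s: s.split())
lowercase = Arrow(lambda toks: [t.lower() for t in toks])
count = Arrow(lambda toks: len(toks))

pipeline = tokenize >> (lowercase & count)
```

Here `pipeline("Hello World")` returns `(["hello", "world"], 2)`; swapping a step means replacing one Arrow value, which is the pluggability the project is after.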
  • Bounded-memory LM and Phrase Table building
    Proposed by Kenneth Heafield
     
    Language model estimation takes an annoyingly large amount of RAM. Phrase table p(s|t) and p(t|s) evaluation takes a lot of time due to on-disk sorting in plaintext. The ultimate goal of this project is a much faster way to do both tasks using a user-specified amount of RAM and making efficient use of disk only when necessary. The marathon project will lay the groundwork: a framework that presents multiple input and multiple output streams to streaming algorithms. The output streams will be optionally sorted by the framework. Where possible, these streams will be kept in memory. However, once a user-specified memory bound is reached, they will be dumped to disk in binary format.
     
    Desirable skills: C++
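The central trick, keeping a stream in memory until a user-specified bound is hit and only then spilling sorted runs to disk, can be sketched as follows (a simplified Python sketch; the real framework would use binary records and C++):

```python
import heapq
import pickle
import tempfile

def _spill(buf):
    """Write one sorted run to a temp file; return a lazy reader over it."""
    f = tempfile.TemporaryFile()
    for rec in sorted(buf):
        pickle.dump(rec, f)
    f.seek(0)
    def read():
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return
    return read()

def sorted_stream(records, max_in_memory=1000):
    """Yield records in sorted order, touching disk only when the
    in-memory buffer exceeds max_in_memory records."""
    runs, buf = [], []
    for rec in records:
        buf.append(rec)
        if len(buf) >= max_in_memory:   # memory bound hit: spill a run
            runs.append(_spill(buf))
            buf = []
    runs.append(iter(sorted(buf)))      # last run (often the only one) stays in RAM
    return heapq.merge(*runs)           # lazily merge all sorted runs
```

If the input fits in the bound, no disk I/O happens at all; otherwise the merge reads each spilled run sequentially, which is the disk access pattern the project wants.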
  • Target hypergraph serialization
    Proposed by Kenneth Heafield
     
    Often, we repeatedly decode the same sentences with different feature weights. Usually, this happens within MERT, MIRA, or PRO, our version of Godot. This project will save time by parsing only once instead of every iteration. To do so, we need to develop an efficient (preferably binary) serialization format, modify Moses to generate it without doing cube pruning, and modify Moses to read this format without redoing parsing.
     
    Desirable skills: C++, Moses internals
  • Document-Level Translation in Moses
    Proposed by Liane Guillou
     
    Design and implement a cache-based document-level translation strategy using Moses. Incorporate the notion of confidence estimation to ensure that only those best hypotheses deemed to be of “high quality” are retained.
     
    Desirable skills: Experience with Moses and/or Machine Learning
  • Integrate CSLM into Moses phrase-based decoder
    Proposed by Lane Schwartz
     
    Integrate the Continuous Space Language Model (CSLM) into Moses phrase-based decoding, using the newly implemented batch LM requests. Info on CSLM is at http://www-lium.univ-lemans.fr/cslm/
     
    Desirable skills: C++, CSLM
  • Optimisation of Sparse Moses
    Proposed by Barry Haddow
     
    Moses has support for sparse features in the miramerge branch, but it is quite inefficient in time and space. This project aims at optimising this version of Moses so it can be merged back into trunk, and so that Moses can scale to millions of features.
     
    Desirable skills: C++
  • Open Source Computer Aided Translation
    Proposed by Philipp Koehn
     
    Two recently started EU-funded projects, MateCat and CASMACAT, have begun developing open source workbenches for human translators that can take advantage of machine translation. At the MT Marathon we would like to extend this effort.
     
    Desirable skills: Javascript, PHP
  • SMT Research Survey Wiki
    Proposed by Philipp Koehn
     
    The number of research papers in statistical machine translation is exploding, so a useful resource would be an up-to-date survey of all published papers, in the form of a wiki. This project looks at improving such a tool, currently in beta at http://www.statmt.org/survey/.
     
    Desirable skills: PHP
  • Diagnostic evaluation of MT with DELiC4MT
    Proposed by Antonio Toral
     
    This project aims to improve and extend DELiC4MT, an open-source tool for diagnostic evaluation of MT.
     
    Desirable skills: Java, Perl, PHP, experience/interest in MT evaluation (beyond just BLEU!)
  • Zones as Features for Phrase-based decoding
    Proposed by Colin Cherry
     
    Moses currently implements zones in the input text as hard re-ordering constraints: once the decoder enters a zone, it cannot leave until the zone has been completely translated. This project would use the same idea to create features. Annotated zones of the input would incur learned penalties when interrupted. Adapters to create annotated zones (and therefore, features) from the output of popular NLP tools such as PTB constituency parsers, dependency parsers, or named entity recognizers would be added to Moses as part of this project.
     
    Desirable skills: C++
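The feature value itself is easy to state outside the decoder. As an illustration (the list-based coverage order and inclusive zone spans below are hypothetical simplifications of the decoder's bit-vector state), an interruption is counted whenever the decoder has entered a zone and then covers a source word outside it before the zone is complete:

```python
def zone_interruptions(coverage_order, zones):
    """Count interruptions of annotated zones.

    coverage_order: source positions in the order the decoder covers them.
    zones: inclusive source spans (lo, hi) marking annotated zones."""
    penalties = 0
    for lo, hi in zones:
        inside = set(range(lo, hi + 1))
        entered, covered = False, 0
        for pos in coverage_order:
            if pos in inside:
                covered += 1
                entered = True
            elif entered and covered < len(inside):
                penalties += 1       # left the zone before finishing it
                entered = False      # one penalty per interruption
    return penalties
```

Covering positions in the order 0, 1, 3, 2 with a zone over (1, 2) incurs one penalty; the monotone order 0, 1, 2, 3 incurs none, matching the behaviour the hard constraint currently enforces.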
  • Extend PRO and k-best MIRA to other error metrics
    Proposed by Colin Cherry
     
    PRO and k-best MIRA are currently available in Moses for tuning high-dimensional feature vectors in a batch setting. However, they are both hard-coded to use (different) sentence-level approximations to BLEU. This project would augment both systems to use arbitrary sentence-level error metrics.
     
    Desirable skills: C++, Perl
  • Sparse feature implementation in Joshua
    Proposed by Matt Post
     
    Joshua currently has only a dense feature representation for each grammar (with optional sharing of dense features across input grammar files). We'd like to extend this by adding support for hundreds of thousands or millions of sparse features.
     
    Desirable skills: Java, familiarity with cdec and Moses sparse feature implementations
  • Parallel corpus extraction from Common Crawl
    Proposed by Philipp Koehn
     
    Common Crawl offers a current crawl of the web, amounting to 60 TB of data. This poses the challenge of extracting parallel text useful for training machine translation systems.
     
    Desirable skills: Hadoop, Amazon Web Services
  • C++11 compliance for cdec
    Proposed by Chris Dyer
     
    Make cdec C++11 compliant
     
    Desirable skills: C++
  • Integration of phrase table filtering and merging in Moses
    Proposed by Barry Haddow
     
    Moses includes contributed code for phrase table filtering and for phrase table merging. It would be nice to have this fully integrated into the training pipeline.
     
    Desirable skills: Perl and/or C++
  • Multiple reference translations for European languages
    Proposed by Eva Hasler
     
    This project will look at automatically creating multiple reference translations for the news/news commentary data sets.
     
    Desirable skills: experience with Moses pipeline, C++/Perl
  • New development functionality for the Asiya suite: parameter optimization with MERT
    Proposed by Meritxell Gonzàlez, Cristina España-Bonet
     
    The purpose of the project is to integrate MERT and Asiya in order to allow MERT to use any user-defined set of metrics.
     
    Desirable skills: C++, Perl
  • Integrating (a few) rules in the MT pipeline
    Proposed by Christian Buck
     
    We want to grab some low-hanging fruit to improve Moses for industry applications by allowing some simple rules in the translation pipeline.
     
    Desirable skills: Perl/Python