Today's lab focused on the evaluation of machine translation. We evaluated the output of the machine translation systems submitted to the Third Workshop on Statistical Machine Translation held at ACL 08. If you're interested in looking at the interface again, it's here: http://statmt.org/wmt08/judge/
If you'd like to read our overview paper about the ACL workshop, it's available in the workshop proceedings. It is called Further Meta-Evaluation of Machine Translation.
In today's lecture we discussed IBM Model 1. Implement the EM algorithm for IBM Model 1 in your favorite programming language. Here are some data sets to train on:
Your program should output two different things:
Pseudo-code of IBM Model 1 as presented in the lecture:
initialize t(e|f) uniformly
do until convergence
  set count(e|f) to 0 for all e,f
  set total(f) to 0 for all f
  for all sentence pairs (e_s,f_s)
    set total_s(e) = 0 for all e
    for all words e in e_s
      for all words f in f_s
        total_s(e) += t(e|f)
    for all words e in e_s
      for all words f in f_s
        count(e|f) += t(e|f) / total_s(e)
        total(f) += t(e|f) / total_s(e)
  for all f
    for all e
      t(e|f) = count(e|f) / total(f)
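For reference, here is a minimal Python sketch of this EM loop. It is not the required solution, just one way to write it: it assumes the training data is already tokenized into a list of (English words, foreign words) sentence pairs, and it runs a fixed number of iterations instead of testing for convergence.

from collections import defaultdict

def train_ibm_model1(sentence_pairs, iterations=10):
    # sentence_pairs: list of (e_words, f_words), each a list of tokens
    e_vocab = set(e for e_s, _ in sentence_pairs for e in e_s)
    t = defaultdict(lambda: 1.0 / len(e_vocab))   # t(e|f), uniform start

    for _ in range(iterations):
        count = defaultdict(float)   # count(e|f)
        total = defaultdict(float)   # total(f)
        for e_s, f_s in sentence_pairs:
            # normalization term total_s(e) for this sentence pair
            total_s = {e: sum(t[(e, f)] for f in f_s) for e in e_s}
            # collect fractional counts
            for e in e_s:
                for f in f_s:
                    c = t[(e, f)] / total_s[e]
                    count[(e, f)] += c
                    total[f] += c
        # re-estimate translation probabilities
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return dict(t)

# tiny usage example on a toy corpus
pairs = [("the house".split(), "das haus".split()),
         ("the book".split(), "das buch".split()),
         ("a book".split(), "ein buch".split())]
model = train_ibm_model1(pairs)
print(sorted(model.items(), key=lambda kv: -kv[1])[:5])

In practice you would stop when the t table (or the data log-likelihood) changes very little between iterations rather than after a fixed number of iterations.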
Today we will play with the Moses decoder. First you need an account at the University of Saarbruecken that you will get from Andreas Eisele. On these machines, we stored a number of resources in /home/EXT/mtm200/.
Login has to happen in two steps: first
ssh login.coli.uni-saarland.de
and then ssh again to one of the server machines, such as cluster-01 through cluster-16 or forbin.
Log into the account. We have compiled the Moses decoder, and you can access it (and other tools) at moses/moses in /home/EXT/mtm200/. You can find a tutorial on how to use the decoder on a small toy example here:
The toy model is stored at data/sample-models.
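If you prefer to script the toy decoding step rather than type the command from the tutorial by hand, here is a minimal Python sketch. The decoder path follows the location given above; the moses.ini path inside sample-models and the example sentence are assumptions based on the standard Moses tutorial, so adjust them to what you actually find on disk.

import subprocess

# Assumed locations -- verify them on the cluster before running.
MOSES = "/home/EXT/mtm200/moses/moses"                              # decoder binary (see above)
INI = "/home/EXT/mtm200/data/sample-models/phrase-model/moses.ini"  # assumed toy model config

# Send one source sentence to the decoder on stdin and print the translation.
result = subprocess.run(
    [MOSES, "-f", INI],
    input="das ist ein kleines haus\n",
    capture_output=True,
    text=True,
)
print(result.stdout.strip())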
We prepared a small training corpus and put all the scripts in place to train a translation model. First, you should get some idea of the stages by looking at the description of the training steps and the data requirements at http://www.statmt.org/moses/?n=FactoredTraining.HomePage.
If you need any disk space, go to /local on the cluster machine you are logged into.
You will need the following resources: moses-scripts, scripts, and data/wmt08.
A step-by-step guide on how to train the model is here:
Train a system following the guide. You do not need to do the installation part at the beginning of the step-by-step guide, since we already installed the software for you. Also, it is better to train a model only on part of the corpus, such as the first 10,000 sentence pairs.
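One simple way to cut out the first 10,000 sentence pairs is sketched below in Python. The file names are placeholders; substitute the actual tokenized training files under data/wmt08 that the step-by-step guide tells you to use.

# Write the first 10,000 sentence pairs of the parallel corpus to new files.
# The file names below are placeholders -- point them at the tokenized
# training files under data/wmt08.
N = 10000
for src, dst in [("corpus.de", "corpus.10k.de"),
                 ("corpus.en", "corpus.10k.en")]:
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for i, line in enumerate(fin):
            if i >= N:
                break
            fout.write(line)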
Advanced steps:
If you got this far, you have learned how to train and use machine translation models with Moses. You may want to try this at home. The installation instructions for Moses are at
You will first need to install SRILM, IRSTLM, and (for training) GIZA++. The page points you to the sources for these.
Installation on Unix-based systems should be straightforward. Windows requires some additional tools (Cygwin) and is generally a bit trickier. Search the support database for hints if you get stuck.
No lab session today. Go take a hike!
The paper mentioned, "Enriching Morphologically Poor Languages", is available here: (Avramidis and Koehn, ACL 2008)
Today's lab is about using factored translation models. We follow the factored decoding tutorial on the Moses web site. We have already installed the relevant data on the Saarbruecken machines; you can find it at /home/EXT/mtm200/data/factored-corpus.
It helps if you add the executables to your path and run the experiments in a local directory (/local/yourname). Be sure to clean up once you are done. I'd suggest copying over the small training corpus to avoid any trouble with permissions and accidental overwrites.
export PATH=$PATH:/home/EXT/mtm200/moses
export PATH=$PATH:/home/EXT/mtm200/moses-scripts/training
cd /local
mkdir myname
cd myname
cp -r /home/EXT/mtm200/data/factored-corpus .