# Summer school notes

## Day 1 - Monday 12 May

Slides: part a, part b

Today's lab focused on the evaluation of machine translation. We evaluated the output of the machine translation systems submitted to the Third Workshop on Statistical Machine Translation held at ACL 08. If you're interested in looking at the interface again, it's here: http://statmt.org/wmt08/judge/

If you'd like to read our overview paper about the ACL workshop, it is available in the workshop proceedings under the title "Further Meta-Evaluation of Machine Translation".

## Day 2 - Tuesday 13 May

In today's lecture we discussed IBM Model 1. Implement the EM algorithm for IBM Model 1 in your favorite programming language. Here are some data sets to train on:

Your program should output two different things:

• A table containing the word translation probabilities that were learned (note: think of an efficient data structure for such a sparse matrix)
• The most likely alignment (the Viterbi alignment) for each sentence pair in the training data
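One option for the sparse probability table (a sketch of mine, not from the lecture) is a nested dictionary that stores entries only for word pairs that actually co-occur in the training data:

```python
from collections import defaultdict

# Sparse table for t(e|f): t[f] maps each English word e that co-occurs
# with f to its probability; word pairs never seen together take no memory,
# unlike a dense |E| x |F| matrix.
t = defaultdict(dict)
t["maison"]["house"] = 0.7
t["maison"]["the"] = 0.3

# Look up a probability, falling back to 0.0 for unseen pairs.
p = t["maison"].get("house", 0.0)
```

An alternative is a single dictionary keyed by `(e, f)` tuples; the nested form has the advantage that normalizing over all `e` for a fixed `f` is a single inner-dictionary pass.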

Pseudo-code of IBM Model 1 as presented in the lecture:

```
initialize t(e|f) uniformly
do until convergence
  set count(e|f) to 0 for all e,f
  set total(f) to 0 for all f
  for all sentence pairs (e_s, f_s)
    set total_s(e) = 0 for all e
    for all words e in e_s
      for all words f in f_s
        total_s(e) += t(e|f)
    for all words e in e_s
      for all words f in f_s
        count(e|f) += t(e|f) / total_s(e)
        total(f)   += t(e|f) / total_s(e)
  for all f
    for all e
      t(e|f) = count(e|f) / total(f)
```
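The pseudo-code above can be turned into a short Python sketch. The corpus format (a list of tokenized sentence pairs) and the fixed iteration count standing in for a proper convergence test are my assumptions, not part of the lecture:

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=10):
    """corpus: list of (e_sentence, f_sentence) pairs, each a list of words.
    Returns a sparse dict t[(e, f)] = t(e|f)."""
    # Collect the English vocabulary to initialize t(e|f) uniformly.
    e_vocab = {e for e_s, _ in corpus for e in e_s}
    t = defaultdict(lambda: 1.0 / len(e_vocab))

    for _ in range(iterations):
        count = defaultdict(float)   # count(e|f)
        total = defaultdict(float)   # total(f)
        for e_s, f_s in corpus:
            # Normalization term total_s(e) for this sentence pair.
            total_s = {e: sum(t[(e, f)] for f in f_s) for e in e_s}
            # Collect fractional counts.
            for e in e_s:
                for f in f_s:
                    c = t[(e, f)] / total_s[e]
                    count[(e, f)] += c
                    total[f] += c
        # Re-estimate the probabilities.
        for (e, f) in count:
            t[(e, f)] = count[(e, f)] / total[f]
    return t

def viterbi_alignment(e_s, f_s, t):
    """For each word in e_s, the index of the f word with the highest t(e|f)."""
    return [max(range(len(f_s)), key=lambda j: t[(e, f_s[j])]) for e in e_s]
```

On a toy corpus such as `[(["the", "house"], ["la", "maison"]), (["the", "book"], ["le", "livre"]), (["a", "book"], ["un", "livre"])]`, the learned table concentrates probability on the co-occurring pairs, which covers both required outputs: the table itself and the Viterbi alignments.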

## Day 3 - Wednesday 14 May

Today we will play with the Moses decoder. First you need an account at the University of Saarbruecken, which you will get from Andreas Eisele. On these machines, we have stored a number of resources in `/home/EXT/mtm200/`.

Login has to happen in two steps: first ssh into the gateway host, and then ssh again to one of the server machines, such as `cluster-01` through `cluster-16`, or `forbin`.

#### Step 1: Using the decoder

Log into the account. We have compiled the Moses decoder, and you can access it (and other tools) at `moses/moses` in `/home/EXT/mtm200/`. You can find a tutorial on how to use the decoder on a small toy example here:

The toy model is stored at `data/sample-models`.

#### Step 2: Train a model

We prepared a small training corpus and put all the scripts in place to train a translation model. First, you should get some idea about the stages by looking at the description of the training steps and the data requirements at `http://www.statmt.org/moses/?n=FactoredTraining.HomePage`.

If you need any disk space go to `/local` on the cluster machine you are logged into.

You will need the following resources:

• scripts in `moses-scripts` and `scripts`
• training corpus in `data/wmt08`

A step-by-step guide on how to train the model is here:

Train a system following the guide. You do not need to do the installation part at the beginning of the step-by-step guide, since we have already installed the software for you. Also, it is better to train a model on only part of the corpus, such as the first 10,000 sentence pairs.
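To train on only the first 10,000 sentence pairs, truncate both sides of the parallel corpus in step; a hypothetical sketch (the file names are placeholders, not the actual names in `data/wmt08`):

```python
from itertools import islice

def take_pairs(src_in, tgt_in, src_out, tgt_out, n=10000):
    """Copy the first n lines of a parallel corpus, keeping the two
    sides line-aligned. File names here are placeholders."""
    with open(src_in) as fs, open(tgt_in) as ft, \
         open(src_out, "w") as gs, open(tgt_out, "w") as gt:
        for s, t in islice(zip(fs, ft), n):
            gs.write(s)
            gt.write(t)
```

Because the two files are read in lockstep, sentence pair `i` in the output still corresponds line-for-line across the source and target sides, which the training scripts require.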

If you made it this far, you have learned how to train and use machine translation models with Moses. You may want to try this at home. The installation instructions for Moses are at

You will first install SRILM, IRSTLM, and (for training) GIZA++. The page points you to the sources for these.

Installation on Unix-based systems should be straightforward. Windows requires some additional tools (Cygwin) and is generally a bit trickier. Search the support database for hints if you get stuck.

## Day 4 - Thursday 15 May

No lab session today. Go take a hike!

## Day 5 - Friday 16 May

The paper mentioned in the lecture, "Enriching Morphologically Poor Languages" (Avramidis and Koehn, ACL 2008), is available here:

Today's lab is about using factored translation models. We follow the factored decoding tutorial on the Moses web site. We have already installed the relevant data on the Saarbruecken machines; you can find it at `/home/EXT/mtm200/data/factored-corpus`.

It helps if you add the executables to your path and run the experiments in a local directory (`/local/yourname`). Be sure to clean up once you are done. I'd also suggest copying over the small training corpus to avoid any trouble with permissions and accidental overwrites.

```
export PATH=$PATH:/home/EXT/mtm200/moses
export PATH=$PATH:/home/EXT/mtm200/moses-scripts/training
cd /local
mkdir yourname
cd yourname
cp -r /home/EXT/mtm200/data/factored-corpus .
```