Today's lab focused on the evaluation of machine translation. We evaluated the output of the machine translation systems submitted to the Third Workshop on Statistical Machine Translation held at ACL 08. If you're interested in looking at the interface again, it's here: http://statmt.org/wmt08/judge/
If you'd like to read our overview paper about the ACL workshop, it's available in the workshop proceedings. It is called Further Meta-Evaluation of Machine Translation.
In today's lecture we discussed IBM Model 1. Implement the EM algorithm for IBM Model 1 in your favorite programming language. Here are some data sets to train on:
Your program should output two different things:
Pseudo-code of IBM Model 1 as presented in the lecture:
initialize t(e|f) uniformly
do until convergence
  set count(e|f) to 0 for all e,f
  set total(f) to 0 for all f
  for all sentence pairs (e_s,f_s)
    set total_s(e) = 0 for all e
    for all words e in e_s
      for all words f in f_s
        total_s(e) += t(e|f)
    for all words e in e_s
      for all words f in f_s
        count(e|f) += t(e|f) / total_s(e)
        total(f) += t(e|f) / total_s(e)
  for all f
    for all e
      t(e|f) = count(e|f) / total(f)
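For reference, here is a minimal Python sketch of this EM loop. It is not the required solution, just one way to write it: it assumes the training data is already tokenized into a list of (English words, foreign words) sentence pairs, and it runs a fixed number of iterations instead of testing for convergence.

from collections import defaultdict

def train_ibm_model1(sentence_pairs, iterations=10):
    # sentence_pairs: list of (e_words, f_words), each a list of tokens
    e_vocab = set(e for e_s, _ in sentence_pairs for e in e_s)
    t = defaultdict(lambda: 1.0 / len(e_vocab))   # t(e|f), uniform start

    for _ in range(iterations):
        count = defaultdict(float)   # count(e|f)
        total = defaultdict(float)   # total(f)
        for e_s, f_s in sentence_pairs:
            # normalization term total_s(e) for this sentence pair
            total_s = {e: sum(t[(e, f)] for f in f_s) for e in e_s}
            # collect fractional counts
            for e in e_s:
                for f in f_s:
                    c = t[(e, f)] / total_s[e]
                    count[(e, f)] += c
                    total[f] += c
        # re-estimate translation probabilities
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return dict(t)

# tiny usage example on a toy corpus
pairs = [("the house".split(), "das haus".split()),
         ("the book".split(), "das buch".split()),
         ("a book".split(), "ein buch".split())]
model = train_ibm_model1(pairs)
print(sorted(model.items(), key=lambda kv: -kv[1])[:5])

In practice you would stop when the t table (or the data log-likelihood) changes very little between iterations rather than after a fixed number of iterations.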
Today we will play with the Moses decoder. First you need an account at the University of Saarbruecken that you will get from Andreas Eisele. On these machines, we stored a number of resources in /home/EXT/mtm200/.
Login has to happen in two steps: first
ssh login.coli.uni-saarland.de
and then ssh again to one of the server machines, such as cluster-01 through cluster-16 or forbin.
Log into the account. We have compiled the Moses decoder, and you can access it (and other tools) at moses/moses in /home/EXT/mtm200/. You can find a tutorial on how to use the decoder on a small toy example here:
The toy model is stored at data/sample-models.
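If you prefer to script the toy decoding step rather than type the command from the tutorial by hand, here is a minimal Python sketch. The decoder path follows the location given above; the moses.ini path inside sample-models and the example sentence are assumptions based on the standard Moses tutorial, so adjust them to what you actually find on disk.

import subprocess

# Assumed locations -- verify them on the cluster before running.
MOSES = "/home/EXT/mtm200/moses/moses"                              # decoder binary (see above)
INI = "/home/EXT/mtm200/data/sample-models/phrase-model/moses.ini"  # assumed toy model config

# Send one source sentence to the decoder on stdin and print the translation.
result = subprocess.run(
    [MOSES, "-f", INI],
    input="das ist ein kleines haus\n",
    capture_output=True,
    text=True,
)
print(result.stdout.strip())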
We prepared a small training corpus and put all the scripts in place to train a translation model. First, you should get some idea of the stages by looking at the description of the training steps and the data requirements at http://www.statmt.org/moses/?n=FactoredTraining.HomePage.
If you need any disk space, go to /local on the cluster machine you are logged into.
You will need the following resources: moses-scripts, scripts, and data/wmt08.
A step-by-step guide on how to train the model is here:
Train a system following the guide. You do not need to do the installation part at the beginning of the step-by-step guide, since we already installed the software for you. Also, it is better to train a model only on part of the corpus, such as the first 10,000 sentence pairs.
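One simple way to cut out the first 10,000 sentence pairs is sketched below in Python. The file names are placeholders; substitute the actual tokenized training files under data/wmt08 that the step-by-step guide tells you to use.

# Write the first 10,000 sentence pairs of the parallel corpus to new files.
# The file names below are placeholders -- point them at the tokenized
# training files under data/wmt08.
N = 10000
for src, dst in [("corpus.de", "corpus.10k.de"),
                 ("corpus.en", "corpus.10k.en")]:
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for i, line in enumerate(fin):
            if i >= N:
                break
            fout.write(line)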
Advanced steps:
If you got this far, you have learned how to train and use machine translation models with Moses. You may want to try this at home. The installation instructions for Moses are at
You will first need to install SRILM, IRSTLM, and (for training) GIZA++. The page points you to the sources for these.
Installation on Unix-based systems should be straightforward. Windows requires some additional tools (Cygwin) and is generally a bit trickier. Search the support database for hints if you get stuck.
No lab session today. Go take a hike!
The paper mentioned, "Enriching Morphologically Poor Languages", is available here: (Avramidis and Koehn, ACL 2008)
Today's lab is about using factored translation models. We follow the factored decoding tutorial on the Moses web site. We have already installed the relevant data on the Saarbruecken machines; you can find it at /home/EXT/mtm200/data/factored-corpus.
It helps if you add the executables to your path and run the experiments in a local directory (/local/yourname). Be sure to clean up once you are done. I'd suggest copying over the small training corpus to avoid any trouble with permissions and accidental overwrites.
export PATH=$PATH:/home/EXT/mtm200/moses
export PATH=$PATH:/home/EXT/mtm200/moses-scripts/training
cd /local
mkdir myname
cd myname
cp -r /home/EXT/mtm200/data/factored-corpus .