Experiment Management System (EMS) (a.k.a. experiment.perl)

experiment.perl, or the Experiment Management System (EMS), for lack of a better name, is our experiment management and reporting system for machine translation experiments with Moses.

Experimental Notes?

Requirements

In order to run properly, EMS will require:

How To

Experiment.perl is extremely simple to use:

  • Get a copy of experiment.perl (see SVN information below).
  • Get a sample configuration file from someplace (see SVN information below).
  • Set up a working directory for your experiments for this task (mkdir does it).
  • Edit working-dir in the config file.
  • Run experiment.perl -config CONFIG from your experiment working directory.
  • Marvel at the graphical plan of action.
  • Run experiment.perl -config CONFIG -exec.
  • Check the results of your experiment (in evaluation/report.1).

If you survived this process, you may want to familiarize yourself with the parameters in the configuration file. If some of them are unclear, try to investigate what they do and write a more informative description of the parameter into the config.

Other options:

  • --no-graph suppresses the display of the graph
  • --continue RUN continues the experiment RUN, which crashed earlier. Make sure that the crashed step and its output are deleted (this should be done automatically at some point, see TODO below). Also, make sure to specify the right config file (i.e., the one in the current run directory) when using --continue.

Getting a copy

SVN Server Information

The latest experiment.perl is available from the moses subversion repository.

  • mosesdecoder/
    • trunk/
      • scripts/
        • ems/ the experiment management system
          • experiment.perl your entry point into the management system
          • experiment.meta template file defining how an experiment is run
          • config/ sample self-documented config files go here
          • support/ experiment.perl sub-scripts
          • web/ the web interface for EMS

Checking for changes

A useful command for seeing what has changed in experiment.perl since you last downloaded it is svn diff.

     > cd ems
     > svn diff --revision HEAD experiment.perl

E-mail Notification

Right now, Hieu, Philipp and Josh are set to receive e-mail notification whenever a commit is made to the repository. If you want to be added, let us know. If there is enough interest, we will set up a mailing list for notifications.

A Short Manual

Experiment.perl is an experiment management tool. You define an experiment in a configuration file, and experiment.perl figures out which steps need to be run and schedules them either as jobs on a cluster or runs them serially on a single machine.

Steps

An experimental run is broken up into several steps. Here is a typical example:

In this graph, each step is a small box. For each step, experiment.perl builds a script file that is either submitted to the cluster or run on the same machine. Note that some steps are quite involved, for instance tuning: on a cluster, the tuning script runs on the head node and submits jobs to the queue itself.

The main stages of running an experiment are:

  • CORPUS: preparing a parallel corpus
  • TRAINING: training a translation model
  • LM: training a language model
  • RECASING: training a recaser
  • TUNING: running minimum error rate training to set component weights
  • TESTING: translating a test set
  • EVALUATION: scoring the output
  • REPORTING: compiling all scores in one file

experiment.meta

The actual steps, their dependencies, and other salient information are defined in the file experiment.meta. Think of experiment.meta as a "template" file.

Here are parts of the step descriptions for CORPUS:get-corpus and CORPUS:tokenize:

 get-corpus
        in: get-corpus-script
        out: raw-stem
        [...]

 tokenize
        in: raw-stem
        out: tokenized-stem
        [...]

Each step takes some input (in) and provides some output (out). This also establishes the dependencies between the steps. The step tokenize requires the input raw-stem, which is provided by the step get-corpus.
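
The in/out mechanism can be sketched in a few lines of Python (an illustration of the idea, not EMS's actual code):

```python
# Sketch: "out" names map to the step that produces them, and a step's
# "in" names determine which other steps it depends on.
steps = {
    "get-corpus": {"in": ["get-corpus-script"], "out": "raw-stem"},
    "tokenize":   {"in": ["raw-stem"],          "out": "tokenized-stem"},
}

# Map each output name to the step that produces it.
producer = {spec["out"]: name for name, spec in steps.items()}

def dependencies(step):
    """Steps whose output this step consumes."""
    return [producer[i] for i in steps[step]["in"] if i in producer]

print(dependencies("tokenize"))  # → ['get-corpus']
```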

experiment.meta provides a generic template for steps and their interaction. For an actual experiment, a configuration file determines which steps need to be run. This configuration file is the one specified when invoking experiment.perl. It may contain, for instance, the following:

 [CORPUS:europarl]

 ### raw corpus files (untokenized, but sentence aligned)
 #
 raw-stem = $europarl-v3/training/europarl-v3.fr-en

Here, the parallel corpus to be used is named europarl, and it is provided in raw text format at the location $europarl-v3/training/europarl-v3.fr-en (the variable $europarl-v3 is defined elsewhere in the config file). The effect of this specification is that the step get-corpus does not need to be run, since its output is given as a file. More on the configuration file in the next section.
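
The pruning effect can be sketched as follows, using the names from the example above (the logic is an illustration, not EMS internals):

```python
# Sketch: a step is skipped when the config already supplies a file
# for its output, as with raw-stem in the [CORPUS:europarl] section.
config = {"CORPUS:europarl": {"raw-stem": "$europarl-v3/training/europarl-v3.fr-en"}}
step_output = {"get-corpus": "raw-stem", "tokenize": "tokenized-stem"}

def needs_run(step, module="CORPUS:europarl"):
    # Run the step only if its output is not given in the config.
    return step_output[step] not in config.get(module, {})

print(needs_run("get-corpus"))  # → False: raw-stem is provided as a file
print(needs_run("tokenize"))    # → True
```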

Several types of information are specified in experiment.meta:

  • in and out: Establish dependencies between steps; input may also be provided by files specified in the configuration.
  • default-name: Name of the file in which the output of the step will be stored.
  • template: Template for the command that is placed in the execution script for the step.
  • template-if: Potential command for the execution script; only used if the first parameter exists.
  • error: experiment.perl detects if a step failed by scanning STDERR for keywords such as killed, error, died, not found, and so on. Additional keywords and phrases are provided with this parameter.
  • not-error: Declares default error keywords as not indicating failures.
  • pass-unless: This step is only executed if the given parameter is defined; otherwise the step is passed (illustrated by a yellow box in the graph).
  • ignore-unless: If the given parameter is defined, this step is not executed. This overrides requirements of downstream steps.
  • rerun-on-change: If similar experiments are run, the output of steps may be re-used if input and parameter settings are the same. This specifies a number of parameters whose change disallows re-use in a different run.
  • parallelizable: When running on the cluster, this step may be parallelized (only if generic-parallelizer is set in the config file, typically to $edinburgh-script-dir/generic-parallelizer.perl).
  • qsub-script: If running on a cluster, this step is run on the head node, and not submitted to the queue (because it submits jobs itself).
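
The error / not-error mechanism can be sketched as follows; the keyword list mirrors the ones named above, but the matching logic is an assumption, not the exact EMS implementation:

```python
# Sketch: scan a step's STDERR for error keywords. "not-error" removes
# default keywords; "error" in experiment.meta adds step-specific ones.
DEFAULT_ERRORS = ["killed", "error", "died", "not found"]

def step_failed(stderr_text, extra_errors=(), not_errors=()):
    patterns = [p for p in DEFAULT_ERRORS if p not in not_errors]
    patterns += list(extra_errors)
    text = stderr_text.lower()
    return any(p in text for p in patterns)

print(step_failed("tokenizer died on line 42"))                 # → True
print(step_failed("0 errors reported", not_errors=("error",)))  # → False
```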

Here is the full definition of the step CORPUS:tokenize:

 tokenize
        in: raw-stem
        out: tokenized-stem
        default-name: corpus/tok
        pass-unless: input-tokenizer output-tokenizer
        template-if: input-tokenizer IN.$input-extension OUT.$input-extension
        template-if: output-tokenizer IN.$output-extension OUT.$output-extension
        parallelizable: yes

The step takes raw-stem and produces tokenized-stem. It is parallelizable with the generic parallelizer.

The output is stored in the file corpus/tok. Note that the actual file name also contains the corpus name and the run number. Also, in this case, the parallel corpus is stored in two files, so the file names may be something like corpus/europarl.tok.1.fr and corpus/europarl.tok.1.en.
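
A sketch of how such a file name is assembled from the default-name, the corpus name, the run number, and the language extensions (the helper is illustrative, not EMS's own code):

```python
def output_files(default_name, corpus, run, extensions):
    # corpus/tok + europarl + run 1 + fr/en → corpus/europarl.tok.1.fr etc.
    directory, base = default_name.rsplit("/", 1)
    return [f"{directory}/{corpus}.{base}.{run}.{ext}" for ext in extensions]

print(output_files("corpus/tok", "europarl", 1, ["fr", "en"]))
# → ['corpus/europarl.tok.1.fr', 'corpus/europarl.tok.1.en']
```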

The step is only executed if either input-tokenizer or output-tokenizer is specified. The templates indicate what the command lines in the execution script for the step look like.
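
The pass-unless / template-if behavior of this step can be sketched as follows, assuming the extensions fr and en from the earlier example (the command shape is illustrative):

```python
def tokenize_commands(params, in_ext="fr", out_ext="en"):
    # pass-unless: skip the step if neither tokenizer is configured.
    if not (params.get("input-tokenizer") or params.get("output-tokenizer")):
        return []
    cmds = []
    # template-if: each command is emitted only if its parameter exists.
    if "input-tokenizer" in params:
        cmds.append(f"{params['input-tokenizer']} IN.{in_ext} OUT.{in_ext}")
    if "output-tokenizer" in params:
        cmds.append(f"{params['output-tokenizer']} IN.{out_ext} OUT.{out_ext}")
    return cmds

print(tokenize_commands({}))  # → [] (step is passed)
print(tokenize_commands({"input-tokenizer": "tokenizer.perl"}))
# → ['tokenizer.perl IN.fr OUT.fr']
```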

Data & Dev File

You need a data file and a dev set. You can download them from the workshop shared task page.

Configuration File

Typically, when setting up an experiment, you will take an existing configuration file and modify it. The config files are self-documenting and describe each possible parameter. Obviously, explaining each parameter would require explaining the entire training, tuning, testing, and evaluation process of the Moses statistical machine translation system, which goes beyond this short manual.

Executing a Step

For each step, experiment.perl creates an execution script and several bookkeeping files, for example:

  • CORPUS_europarl_tokenize.1
  • CORPUS_europarl_tokenize.1.DONE
  • CORPUS_europarl_tokenize.1.INFO
  • CORPUS_europarl_tokenize.1.STDERR
  • CORPUS_europarl_tokenize.1.STDOUT
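
As a sketch, these files can be used to decide a step's status (a .DONE marker means the step finished); this is an illustration, not how experiment.perl itself is implemented:

```python
import os

def step_status(step_dir, step_name):
    base = os.path.join(step_dir, step_name)
    if os.path.exists(base + ".DONE"):
        return "done"        # marker written on successful completion
    if os.path.exists(base + ".STDERR"):
        return "started"     # ran (or is running); check STDERR for errors
    return "not-started"
```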

Web Interface

Along with EMS comes a web interface that allows you to follow running and completed experiments. Put or link the web directory provided with EMS on a web server (LAMPP on Linux or MAMP on Mac does the trick). Make sure the web server user has write permissions on the web interface directory.

To add your experiments to this interface, add a line to the file

  • /path/to/your/web/interface/directory/setup

To add a description to each run, edit the file

  • /path/to/your/web/interface/directory/comment

FAQ

My experiment crashed. How can I continue it?

  1. Find out which step crashed. This is shown either by the red node in the displayed graph or reported on the command line in the last lines before the crash, though this may not be obvious if parallel steps kept running after it. You also need to know the number of the experiment.
  2. If tuning has crashed, look at the next FAQ entry.
  3. Every step keeps a log of all errors in a file with the .STDERR extension (the exact format is NAMEOFPROCESS_step.numberofexperiment.STDERR, e.g. CORPUS_factorize.13.STDERR). Display this file, find the error, and try to correct it. If there is no error, look at the previous steps; the error may have occurred earlier but not been detected.
  4. Delete that .STDERR file and everything that has been produced by the crashed step. To find out what the crashed step produced, look up where its output is placed in experiment.meta.
  5. Make sure that you have corrected the cause of the crash, otherwise it will crash again; if you are using the cluster, it may take a long time to crash again, as your jobs queue up. Optionally, if it is a small, undemanding step, try to run it locally on your DICE machine by typing nice sh NAMEOFPROCESS_step.numberofexperiment (e.g. nice sh CORPUS_factorize.13). If it crashes again, go back to step 3.
  6. Check whether experiment.perl will continue the crashed experiment. E.g., if your experiment's number is 13, that would be nice experiment.perl -config steps/config.13 -continue 13. Let the graph be displayed and verify that it indeed continues from there (this is not always shown by the graph).
  7. Run experiment.perl by adding the -exec parameter. E.g. nice experiment.perl -config steps/config.13 -continue 13 -exec.

Tuning crashed in the middle. How can I continue it without starting over?

Tuning is treated by experiment.perl as one step. However, this step executes the mert-moses script, which performs multiple runs; each of these runs is split into pieces and parallelized (for more details read here).

  1. Try to find the error and correct it
    1. Display steps/TUNING_tune.num.STDERR and try to find at what point the error occurred. There may be an obvious error that you need to correct.
    2. If there is no obvious error in the .STDERR file and you have been running on a cluster, the error may exist in one of the split parallelized runs used for parallel decoding. Try to find the last bunch of lines that look like:
      Executing: qsub -l mem_free=0.5G -hard -b no -j yes -o folder/tuning/tmp.28/out.job15397-aa -e folder/tuning/tmp.28/err.job15397-aa -N mert15-aa folder/tuning/tmp.28/job15397-aa.bash >& folder/tuning/tmp.28/job15397-aa.log
    3. Each of these lines refers to the execution of one of, say, 25 parallel decoding jobs, named with a jobnumber prefix and an alphabetic suffix (.aa, .ab, .ac, ...). If one of these has failed to run, or has hit an error, mert-moses will detect that, kill all the jobs, and crash. The file that holds the Moses output is specified by the -o parameter, and the file that holds the submission errors (e.g. from the grid queuing system) is specified by the -e parameter.
  2. If no mert run has occurred (e.g. it crashed during binarization, or during the first run), you may treat tuning as described in the previous FAQ entry: delete the .STDERR file and resume the whole experiment with experiment.perl.
  3. If some mert runs have happened (that may mean a couple of days of work!), you have to let tuning resume the broken process. Open the file steps/TUNING_tune.num in a text editor. Find the line which starts with executing mert-moses.pl. There are two parameters you can add to resume the broken tuning:
    1. If the error has occurred during decoding, then just add the parameter --continue.
    2. If decoding has completed successfully, and mert crashed during the optimization that follows, then add the parameters --continue --skip decoder, which will save you some time. That is optional, though.
  4. Save the file and execute the tuning step separately (preferably on the cluster): nice sh steps/TUNING_tune.num
  5. Once tuning has completed successfully, delete steps/TUNING_tune.num.STDERR and let experiment.perl go on: experiment.perl -config steps/config.num -continue num
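
The manual edit described above (adding --continue to the mert-moses.pl invocation in the step file) can also be scripted; this sketch assumes the step file contains a line invoking mert-moses.pl and is an illustration only:

```python
def add_continue(script_text):
    # Append --continue to the mert-moses.pl line, if not already there.
    out = []
    for line in script_text.splitlines():
        if "mert-moses.pl" in line and "--continue" not in line:
            line += " --continue"
        out.append(line)
    return "\n".join(out)

print(add_continue("perl mert-moses.pl --working-dir tuning/tmp.1"))
# → perl mert-moses.pl --working-dir tuning/tmp.1 --continue
```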

To Do: Short Term

  • Option -clean to delete intermediate files and crashed steps
  • Option -delete to delete all files of one experiment, unless other non-deleted steps depend on them
  • Web interface should have editable comments for each experiment, including a link to a Wiki page that describes this experiment in more detail
  • Web interface should have a feature to compare the test set output of two experiments on a per-word level, including BLEU impact
  • experiment.perl should complain about not being launched from the experiment working directory
  • Adapt the sample experiment.meta to take into account the new EMS tree structure
  • Prepare an "out-of-the-box" package with toy corpora, etc.
  • Clean and generalize the different files (config, meta, ...)
  • Add more information to the web interface

Bug on experiments with factors: The experiment will re-use factorized (corpus, tuning and test) sets of an older experiment, even if the input-factors specification has changed.

To Do: Long term

Here are some random requests and other futuristic features:

  • Version tracking with SVN.
  • Re-use of other people's experimental intermediate files
  • A feature to check whether two experiment results differ significantly, by performing a bootstrap comparison. Faster if the comparison is made directly over the stored BLEU n-grams
  • Develop a GUI for managing the config files
  • Extend EMS to speech recognition (ASR) and speech-to-text translation (STT)
Page last modified on January 29, 2010, at 08:37 PM