The WMT 2015 evaluation campaign is a good excuse to build baseline systems that use all the best known methods.
The somewhat original contribution is the use of word classes for all components, i.e., the language model, the operation sequence model, sparse features, and lexicalized reordering.
This can be seen as a last desperate attempt to avoid the inevitable onslaught of neural networks by emulating one of their benefits: the pooling of evidence in more generalized representations.
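To make this concrete, here is a minimal sketch (not the actual pipeline) of how word-class information might be attached to a corpus as an additional factor so that the various components can back off to classes; the file names and the two-column class-file format are assumptions.

# Minimal sketch: attach a word-class factor to a tokenized corpus so that
# components (LM, OSM, sparse features, reordering) can pool evidence over
# classes instead of surface forms.
# Assumptions: classes.txt has whitespace-separated "word class" lines
# (mkcls-style output); corpus.tok has one tokenized sentence per line.

def load_classes(path):
    """Read a word -> class-id mapping from a two-column file."""
    classes = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                classes[parts[0]] = parts[1]
    return classes

def annotate(corpus_in, corpus_out, classes, unk="0"):
    """Write each token as word|class (the Moses factored-corpus convention)."""
    with open(corpus_in, encoding="utf-8") as fin, \
         open(corpus_out, "w", encoding="utf-8") as fout:
        for line in fin:
            factored = ["%s|%s" % (w, classes.get(w, unk)) for w in line.split()]
            fout.write(" ".join(factored) + "\n")

if __name__ == "__main__":
    annotate("corpus.tok", "corpus.factored", load_classes("classes.txt"))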
It is also a useful exercise in how to run such large-scale experiments on the cluster.
I am building machine translation systems for 8 language pairs (not doing Finnish). I started out storing them all on /export/b10,
which was fine at first, but I ran into serious trouble when running many (10-20) processes that all access this disk heavily.
Decoding in particular requires reading typically 50GB of model files (mostly the language model) from disk. While other processes are writing to the same disk at the same time (e.g., building translation tables), the disk becomes very, very slow, which is quite noticeable even on the command line (starting vi takes about a minute...).
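One common mitigation, which is not something described above, is to stage the big model files onto node-local storage before decoding, so that many decoder processes are not all reading from the same shared disk. A rough sketch with placeholder paths:

import os
import shutil

# Sketch of a possible mitigation (not what was done here): copy read-heavy
# model files from the contended shared disk to node-local scratch before
# starting the decoder, so repeated reads do not hit /export/b10.
# All paths and file names below are placeholders.

SHARED_MODELS = "/export/b10/wmt15/model"   # placeholder shared-disk layout
LOCAL_SCRATCH = "/tmp/wmt15-models"         # node-local disk

def stage(files):
    os.makedirs(LOCAL_SCRATCH, exist_ok=True)
    staged = []
    for name in files:
        dst = os.path.join(LOCAL_SCRATCH, name)
        if not os.path.exists(dst):         # copy at most once per node
            shutil.copy(os.path.join(SHARED_MODELS, name), dst)
        staged.append(dst)
    return staged

if __name__ == "__main__":
    print("decode against:", stage(["lm.binlm", "phrase-table.minphr"]))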
Use of a Grid Engine cluster allows decoder runs to be distributed onto multiple machines. This seemed to work at first, but then I ran into problems with starting up the Moses processes: at crunch time it took up to 5 hours to load the models.
There is no final verdict on this, since it is related to lesson 1. Maybe it is possible to run the decoder with 5 processes using 20 cores each, but there is a load time / decoding time tradeoff: more processes mean a longer load time for each process, but faster decoding.
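The tradeoff can be put into a rough back-of-the-envelope form: with more processes, the shared-disk load time is paid by more concurrent readers, but more total cores work on the test set. The numbers in the sketch below are made-up placeholders, not measurements from these experiments.

# Back-of-the-envelope model of the load-time / decoding-time tradeoff when
# splitting a decoding job across Grid Engine processes.
# Crude assumptions (placeholders only): model load time grows with the number
# of processes reading the shared disk at once; decoding throughput scales
# with the total number of cores, each process getting its own machine.

def wall_clock_hours(n_processes, cores_per_process=20,
                     load_per_reader_h=1.0, single_core_decode_h=400.0):
    load_h = load_per_reader_h * n_processes                        # disk contention
    decode_h = single_core_decode_h / (n_processes * cores_per_process)
    return load_h + decode_h

for n in (1, 2, 5, 10, 20):
    print(f"{n:2d} processes x 20 cores: ~{wall_clock_hours(n):4.1f} h wall clock")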
Some steps require a lot of memory or multiple CPUs. This needs to be properly communicated to Grid Engine (a sketch of the resulting qsub call is given after the settings below). Run experiment.perl with the -cluster switch and have the following settings in your config:
[GENERAL] qsub-settings = "-l 'arch=*64'"
[LM] train:qsub-settings = "-l 'arch=*64,mem_free=30G,ram_free=30G'"
[INTERPOLATED-LM] interpolate:qsub-settings = "-l 'arch=*64,mem_free=100G,ram_free=100G'"
[TRAINING] run-giza:qsub-settings = "-l 'arch=*64,mem_free=10G,ram_free=10G' -pe smp 9"
[TRAINING] run-giza-inverse:qsub-settings = "-l 'arch=*64,mem_free=10G,ram_free=10G' -pe smp 9"
In the [TUNING] section, set jobs to an appropriate number (maybe just 1):
[TUNING] tune:qsub-settings = "-l 'arch=*64,mem_free=50G,ram_free=50G' -pe smp 20"
In the [EVALUATION] section, set jobs to an appropriate number (maybe just 1):
[EVALUATION] decode:qsub-settings = "-l 'arch=*64,mem_free=50G,ram_free=50G' -pe smp 20"
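For reference, each qsub-settings string above is simply passed along as extra arguments to qsub. The sketch below shows roughly what the submission for the tuning step boils down to; the job script name and the exact qsub wrapper are assumptions, not what experiment.perl literally runs.

import shlex

# Sketch: turn the tuning qsub-settings string from the config into the rough
# shape of the Grid Engine submission command.  It requests a 64-bit node,
# 50G of memory, and a 20-slot smp parallel environment.
# "tune.job" is a placeholder script name.

qsub_settings = "-l 'arch=*64,mem_free=50G,ram_free=50G' -pe smp 20"

cmd = ["qsub", "-cwd", "-b", "y"] + shlex.split(qsub_settings) + ["bash", "tune.job"]
print(" ".join(cmd))
# e.g. submit for real with subprocess.run(cmd, check=True) on the head node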
Start with hill-climbing to the desired setup. Note: mkcls gets very, very slow for large numbers of classes. In the worst case, 2000 classes on the 1 billion word French-English parallel corpus take about a month (GIZA++ takes even longer, so what gives).
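For concreteness, the class-training calls for the granularities used in the results below (60/200/600/2000 classes) look roughly like this; the corpus path and the -n setting are assumptions rather than the actual configuration.

import subprocess

# Sketch: train word classes of increasing granularity with mkcls.
# Standard mkcls options: -cN = number of classes, -nI = optimization runs,
# -p = tokenized training corpus, -V = output word/class file, "opt" = optimize.
# The corpus path and -n2 are placeholders; runtime grows sharply with -c
# (2000 classes on ~1 billion words: about a month, as noted above).

CORPUS = "corpus.tok.fr"   # placeholder path

for n_classes in (60, 200, 600, 2000):
    cmd = ["mkcls", f"-c{n_classes}", "-n2",
           f"-p{CORPUS}", f"-Vclasses.{n_classes}.fr", "opt"]
    print("would run:", " ".join(cmd))
    # subprocess.run(cmd, check=True)   # uncomment to actually train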
BLEU results so far (+/- values are differences to the previous column; bracketed numbers are experiment run IDs):

Language Pair | Baseline             | brown60/200/600 OSM+LM    | +Sparse+Reorder           | +brown2000                |
de-en         | [4-1] 27.16 (1.011)  | [4-4] 27.46 (1.011)       |                           |                           |
en-de         | [5-1] 20.41 (1.003)  | [5-2] 20.82 (1.004) +.41  | [5-3] 20.87 (1.004) +.05  | [5-4] 20.89 (1.006) +.02  |
cs-en         | [6-1] 26.44 (1.026)  | [6-2] 26.69 (1.026) +.25  | [6-4] 26.92 (1.028)       |                           |
en-cs         | [7-1] 18.96 (0.996)  | [7-2] 19.51 (0.995) +.55  | [7-3] 19.86 (0.990) +.35  | [7-4] 19.76 (0.991) -.10  |
fr-en         | [8-1] 31.67 (1.030)  |                           |                           |                           |
en-fr         | [9-1] 31.22 (0.995)  |                           |                           |                           |
ru-en         | [10-1] 24.39 (1.026) | [10-2] 24.69 (1.023) +.30 | [10-3] 24.65 (1.024) -.04 | [10-4] 24.83 (1.020) +.18 |
en-ru         | [11-1] 19.37 (0.996) | [11-2] 20.11 (0.995) +.74 | [11-3] 20.25 (0.996) +.14 | [11-4] 20.28 (0.997) +.03 |