
WMT 2016

WMT is now a conference, but the shared task remains.

New languages are Turkish (hard!) and Romanian (easy). We also did not build Finnish systems last year, and threw our systems together with Edinburgh's into a joint submission.

Currently, some baseline systems [25]-[36] are being built. The web interface to experiments is accessible on CLSP machines under

 http://ndc06/gems/

If that does not work, you have to set up an SSH tunnel:

 ssh -fNL 11111:ndc06:80 [username]@login.clsp.jhu.edu

and then access the web site with

 http://localhost:11111/gems/

Cluster Advice

I spent a lot of time last year getting the training pipeline to run properly on the CLSP cluster.

The recommended usage is to call experiment.perl with the -cluster switch, so that jobs are submitted to the cluster via the scheduler.
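For example, a run might be launched like this (the config file name is just a placeholder for whatever experiment you are working on):

 experiment.perl -config config.wmt16 -exec -cluster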

You will see a number of specifications like the following in the config files:

 [INTERPOLATED-LM]
 interpolate:qsub-settings = "-l 'arch=*64,mem_free=100G,ram_free=100G'"

These ensure that enough memory (and cores) are reserved for the specific jobs.

The most challenging aspect is running the decoder on the massive models that are built for some of the language pairs. The recommended way to deal with that is to first copy the models to local disk and then start the decoder. This is done automatically with the cache-model specification in the config files.
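Conceptually, the caching step amounts to something like the following (a hand-written sketch, not the actual EMS code; the paths are made up):

 # copy the heavy model files to fast local disk once per machine
 rsync -a /export/experiments/wmt16/model/ /tmp/wmt16-model/
 # then point the decoder at the local copy
 moses -f /tmp/wmt16-model/moses.ini < dev.input > dev.output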

To avoid copying the model files onto too many machines, the set of machines used for decoding and tuning is restricted. For instance, run tuning only on the b0* machines:

 tune:qsub-settings = "-l 'hostname=b0*,arch=*64,mem_free=50G,ram_free=50G' -pe smp 16"

To Do

There are a bunch of things we could try:

  1. morphological handling of Finnish and Turkish (baseline would be Morfessor) (Huda)
  2. low resource things for Turkish
  3. part of the official data are massive amounts of monolingual CommonCrawl - we may be able to use them all, or do subsampling (Amittai)
  4. use of the neural LM / joint model that is already in Moses (Shuoyang, Gaurav)
  5. re-ranking with an attention model (Shuoyang, Gaurav, Kevin)
  6. some submissions used additional data for Finnish; we may want to look into that as well - also for Romanian
  7. syntax-based systems (Shuoyang)
  8. anything else

Baselines

The first baseline uses all official data. The second baseline adds Brown clustering (which helped by 0.5 BLEU on average last time around). This mostly matches the WMT 2015 system. A write-up of that system is here.

Language Pair | JHU in 2015 | Best in 2015 | Baseline | w/ Brown clusters | w/ CC LM | w/ both | ttl 100 | w/ nnjm | w/ all | Syntax
English-Turkish | - | - | [36-1] 7.84 (1.049) | [36-3] 8.18 (1.040) +.34 | [36-2] 9.40 (1.044) +1.56 | [36-4] 8.85 (1.040) +1.01 | [36-5] 8.86 (1.041) +1.02 | - | - | -
Turkish-English | - | - | [35-1] 14.03 (0.994) | [35-3] 14.30 (0.988) +.27 | [35-2] 13.91 (1.011) -.12 | [35-4] 14.12 (1.010) +.09 | [35-5] 14.19 (1.015) +.16 | - | - | [51-1] 15.47 (0.921) +1.44
English-Finnish | - | 15.5 (Abumatran) | [34-1] 11.88 (1.053) | [24-3] 12.59 (1.055) +.71 | [34-2] 12.15 (1.074) +.27 | [34-4] 12.85 (1.059) +.97 | [35-5] 12.82 (1.061) +.94 | - | - | -
Finnish-English | - | 19.7 (UEDIN) | [33-1] 16.55 (0.985) | [34-3] 16.90 (0.981) +.35 | [33-2] 16.41 (0.990) -.14 | [33-4] 16.93 (0.998) +.38 | [33-5] 16.82 (1.004) +.27 | - | - |
English-Romanian | - | - | [32-6] 23.36 (1.007) | [32-4] 24.60 (1.006) +1.24 | [32-3] 23.29 (1.039) -.07 | [32-5] 23.49 (0.967) +.13 | [32-7] 23.55 (0.970) +.19 | [46-6] 23.73 (1.010) +.37 | [32-8] 23.49 (0.962) +.13 |
Romanian-English | - | - | [31-2] 31.95 (1.014) | [31-4] 32.53 (1.020) +.58 | [31-3] 32.47 (1.018) +.52 | [31-5] 32.80 (1.015) +.85 | [31-5] 32.80 (1.016) +.85 | [50-2] 32.03 (1.015) +.08 | [31-7] 32.80 (1.019) +.85 | [50-1] 27.04 (0.934) -4.91
English-Russian | [11-6] 24.53 (1.034) | 24.3 (UEDIN) | [30-1] 23.89 (1.037) | [30-3] 24.96 (1.033) +1.07 | [30-2] 23.89 (1.055) +.00 | [30-4] 24.87 (1.050) +.98 | [30-6] 25.12 (1.055) +1.23 | [53-1] 24.37 (1.038) +.48 | [30-7] 25.16 (1.048) +1.27 | -
Russian-English | [10-6] 27.96 (0.973) | 27.9 (JHU) | [29-1] 27.54 (0.978) | [29-3] 28.25 (0.974) +.71 | [29-2] 28.08 (0.981) +.54 | [29-4] 28.22 (0.979) +.68 | [29-5] 28.28 (0.981) +.74 | [54-1] 27.81 (0.979) +.27 | [29-6] 28.65 (0.989) +1.11 |
English-Czech | [7-5] 18.11 (1.044) | 18.8 (CharlesU) | [28-1] 18.24 (1.044) | [28-3] 19.19 (1.044) +.95 | [28-2] 18.77 (1.048) +.53 | [28-4] 19.55 (1.046) +1.31 | [28-5] 19.53 (1.048) +1.29 | - | - | -
Czech-English | [6-5] 26.38 (0.985) | 26.2 (JHU) | [27-1] 27.04 (0.985) | [27-3] 27.68 (0.987) +.64 | [27-2] 27.68 (0.994) +.64 | [27-5] 28.08 (0.993) +1.04 | [27-6] 28.18 (0.994) +1.14 | - | - | -
English-German | [5-5] 22.70 (1.039) | 24.9 (Montreal) | [26-2] 22.67 (1.035) | [26-4] 22.99 (1.035) +.32 | [26-3] 22.51 (1.056) -.16 | [26-5] 22.73 (1.055) +.06 | [26-6] 22.70 (1.057) +.03 | [47-3] 22.62 (1.036) -.05 | [26-7] 22.88 (1.057) +.21 |
German-English | [4-5] 29.15 (0.985) | 29.3 (UEDIN) | [25-1] 29.03 (0.983) | [25-3] 29.64 (0.986) +.61 | [25-2] 29.63 (0.996) +.60 | [25-5] 29.90 (0.993) +.87 | [25-6] 29.96 (0.994) +.93 | [48-3] | [25-8] 30.01 (0.998) +.98 |
  • JHU in 2015: final systems, most finished after submission deadline
  • Best in 2015 according to the matrix.
  • found a bug affecting reordering tables for factors other than surface, fixed in [26-2] for English-German

Results on official test

Language Pair | Submitted | UEDIN Phrase | Best
English-Turkish
Turkish-English
English-Finnish
Finnish-English
English-Romanian
Romanian-English
English-Russian
Russian-English
English-Czech
Czech-English
English-German
German-English