machine translation



Training process

We start with an overview of the training process, which should give a feel for what is going on and which files are produced. Later sections go into more detail on the options of the training process and on additional tools.

The training process takes place in nine steps, all of them executed by the training script (train-model.perl in the command shown under "Running the training script").

The nine steps are:

  1. Prepare data (45 minutes)
  2. Run GIZA++ (16 hours)
  3. Align words (2:30 hours)
  4. Get lexical translation table (30 minutes)
  5. Extract phrases (10 minutes)
  6. Score phrases (1:15 hours)
  7. Build lexicalized reordering model (1 hour)
  8. Build generation models
  9. Create configuration file (1 second)

If you are running on a machine with multiple processors, some of these steps can be sped up considerably by a training-script option for parallel execution.


The run times mentioned in the steps refer to a recent training run on the 751,000-sentence, 16-million-word German-English Europarl corpus on a 3 GHz Linux machine.
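
To get a feel for the overall cost, the per-step times quoted in the list can simply be added up. The snippet below is only a back-of-the-envelope sketch: the helper name total_minutes is made up for illustration, step 8 is left out because no time is given for it, and the one second of step 9 is ignored.

```shell
#!/bin/sh
# Add up a list of durations given in minutes.
total_minutes() {
    total=0
    for m in "$@"; do
        total=$((total + m))
    done
    echo "$total"
}

# Steps 1-7 in minutes: 45 min, 16 h, 2:30 h, 30 min, 10 min, 1:15 h, 1 h
mins=$(total_minutes 45 960 150 30 10 75 60)

echo "total: $((mins / 60)) h $((mins % 60)) min"   # total: 22 h 10 min
echo "step 2 share: $((960 * 100 / mins))%"          # step 2 share: 72%
```

On the reference machine a full run therefore takes a little over 22 hours, with step 2 (GIZA++) accounting for roughly 72% of that.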

If you wish to experiment with translation in both directions, steps 1 and 2 can be reused; from step 3 onward, the contents of the model directory become direction-dependent. In other words, run steps 1 and 2, then make a copy of the whole experiment directory and continue the two trainings separately from step 3.
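
The copy step for the two directions can be sketched as a small shell helper. This is a minimal sketch under stated assumptions: the function name fork_experiment and the directory names in the usage note are hypothetical, and how the training script is told to resume at step 3 is not covered in this section, so it is left out here.

```shell
#!/bin/sh
# Copy an experiment directory in which steps 1 and 2 have already been run,
# so that each translation direction can continue from step 3 in its own copy.
fork_experiment() {
    src=$1
    dst=$2
    if [ ! -d "$src" ]; then
        echo "no such experiment directory: $src" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        # Never clobber an existing experiment.
        echo "refusing to overwrite: $dst" >&2
        return 1
    fi
    cp -R "$src" "$dst"
}
```

For example, fork_experiment exp-de-en exp-en-de would leave the original directory for one direction and the copy for the other. The helper refuses to overwrite an existing target so that a finished training is not lost by accident.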

Running the training script

For a standard phrase model, you will typically run the training script as follows:

 train-model.perl --root-dir . --corpus corpus/euro --f de --e en

There should be two files in the corpus/ directory, called euro.de and euro.en. These files should be the sentence-aligned halves of the parallel corpus: euro.de should contain the German sentences, and euro.en should contain the corresponding English sentences.
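
Before starting a long training run, it can be worth checking that the two corpus files at least have the same number of lines. The sketch below is not part of the training script. The function name check_parallel is made up, and it assumes the files are named after the --corpus stem with the --f and --e language codes as suffixes. Equal line counts are a necessary condition for sentence alignment, not a proof of it.

```shell
#!/bin/sh
# Check that both halves of a parallel corpus exist and have equal line counts.
# Arguments: corpus stem, source language suffix, target language suffix.
check_parallel() {
    stem=$1
    f=$2
    e=$3
    for lang in "$f" "$e"; do
        if [ ! -f "$stem.$lang" ]; then
            echo "missing file: $stem.$lang" >&2
            return 2
        fi
    done
    # wc -l counts newline characters; $(( )) strips any padding wc adds.
    n_f=$(( $(wc -l < "$stem.$f") ))
    n_e=$(( $(wc -l < "$stem.$e") ))
    if [ "$n_f" -ne "$n_e" ]; then
        echo "line count mismatch: $stem.$f has $n_f, $stem.$e has $n_e" >&2
        return 1
    fi
    echo "$stem: $n_f sentence pairs"
}
```

With the layout used in the command above, the call would be check_parallel corpus/euro de en. A non-zero exit status signals a missing file (2) or a length mismatch (1).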

The training parameters are described in more detail at the end of this manual. For corpus preparation, see the section on how to prepare training data.

Page last modified on May 04, 2010, at 10:05 PM