Moses: statistical machine translation system

Pipeline Creation Language (PCL)

Building pipelines can be tedious and error-prone. Pipelines built from Moses scripts are hampered by the fact that each script must be able to parse the output of the previous script. Moving a script to a different position in the pipeline is tricky and may require a code change. It would be better if scripts were re-usable without change, so that users could build up a library of computational pieces that can be used in any pipeline, in any position.

Since pipelines are widely used in machine translation, and given the problem outlined above, a more convenient and less error-prone way of building pipelines quickly from re-usable components would aid their construction.

A domain-specific language called Pipeline Creation Language (PCL) has been developed as part of the MosesCore project (European Commission Grant Number 288487 under the 7th Framework Programme). PCL enables users to gather components into libraries, or packages, and re-use them in pipelines. Each component defines inputs and outputs, which the PCL compiler checks to verify that components are compatible with each other.

PCL is a general-purpose language that can be used to construct non-recurrent software pipelines. To adapt your existing programs and scripts for use with PCL, a Python wrapper must be defined for each program. This builds up a library of components that are combined with others in PCL files. Each Python wrapper script must implement the following function interface:

  • get_name() - Returns an object representing the name of the component. The __str__() function should be implemented to return a meaningful name.
  • get_inputs() - Returns the inputs of the component. Components should only be defined with one input port. A list of input names must be returned.
  • get_outputs() - Returns the outputs of the component. Components should only be defined with one output port. A list of output names must be returned.
  • get_configuration() - Returns a list of names that represent the static data that shall be used to construct the component.
  • configure(args) - This function is the component designer's chance to preprocess configuration injected at runtime. The args parameter is a dictionary that contains all the configuration provided to the pipeline. This function is to filter out, and optionally preprocess, the configuration used by this component. This function shall return an object containing the configuration necessary to construct this component.
  • initialise(config) - This function is where the component designer defines the component's computation. The function receives the output object from the configure() function and must return a function that takes two parameters, an input object, and a state object. The input object is a dictionary that is received from the previous component in the pipeline, and the state object is the configuration for the component. The returned function should be used to define the component's computation.
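The interface above can be illustrated with a minimal sketch of a wrapper module. This is a toy component, not one of the wrappers shipped with Moses: the component name, the `sentence` and `tokenised_sentence` port names, and the `lowercase` configuration key are all hypothetical. The sketch also assumes that the function returned by initialise() produces a dictionary keyed by the output names, which the description above does not state explicitly.

```python
# Hypothetical PCL component wrapper: a toy whitespace tokeniser.
# Only the six function names come from the interface described above;
# every other name is illustrative.


class ComponentName(object):
    """Name object whose __str__() gives a meaningful component name."""

    def __str__(self):
        return "toy_tokeniser"


def get_name():
    return ComponentName()


def get_inputs():
    # A single input port, described by a list of input names.
    return ["sentence"]


def get_outputs():
    # A single output port, described by a list of output names.
    return ["tokenised_sentence"]


def get_configuration():
    # Names of the static values needed to construct the component.
    return ["lowercase"]


def configure(args):
    # args holds all configuration given to the pipeline;
    # keep only the entries this component uses.
    return {"lowercase": bool(args["lowercase"])}


def initialise(config):
    # config is the object returned by configure().
    def process(inputs, state):
        # inputs: dictionary from the previous component.
        # state: the configuration of this component.
        text = inputs["sentence"]
        if state["lowercase"]:
            text = text.lower()
        # Assumption: outputs are returned as a dictionary of output names.
        return {"tokenised_sentence": " ".join(text.split())}

    return process
```

Calling `initialise(configure({"lowercase": True}))` yields a function that can be applied to `{"sentence": ...}` together with the same configuration object as its state.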

Once your library of components has been written, the components can be combined using PCL. A PCL file defines one component, which uses other defined components. For example, the following file defines a component that performs tokenisation for source and target files.

 #
 # Component definition: 2 input ports, 2 output ports
 #
 #                 +---------+
 # src_filename -->+         +--> tokenised_src_filename
 #                 |         |
 # trg_filename -->+         +--> tokenised_trg_filename
 #                 +---------+
 #
 import wrappers.tokenizer.tokenizer as tokeniser

 component src_trg_tokeniser
  inputs (src_filename), (trg_filename)
  outputs (tokenised_src_filename), (tokenised_trg_filename)
  configuration tokeniser.src.language,
                tokeniser.src.tokenisation_dir,
                tokeniser.trg.language,
                tokeniser.trg.tokenisation_dir,
                tokeniser.moses.installation
  declare
    src_tokeniser := new tokeniser with
      tokeniser.src.language -> language,
      tokeniser.src.tokenisation_dir -> tokenisation_dir,
      tokeniser.moses.installation -> moses_installation_dir
    trg_tokeniser := new tokeniser with
      tokeniser.trg.language -> language,
      tokeniser.trg.tokenisation_dir -> tokenisation_dir,
      tokeniser.moses.installation -> moses_installation_dir
  as
    wire (src_filename -> filename),
         (trg_filename -> filename) >>>
    (src_tokeniser *** trg_tokeniser) >>>
    wire (tokenised_filename -> tokenised_src_filename),
         (tokenised_filename -> tokenised_trg_filename)

A PCL file is composed of the following parts:

  • Imports: Optional imports can be specified. Notice that all components must be given an alias; in this case the component wrappers.tokenizer.tokenizer is referenced in this file by the name tokeniser.
  • Component: This starts the component definition and provides the name. The component's name must be the same as the filename. E.g., a component in fred.pcl must be called fred.
  • Inputs: Defines the inputs of the component. The example above defines a component with a two port input. Specifying a comma-separated list of names defines a one port input.
  • Outputs: Defines the outputs of the component. The example above defines a component with a two port output. Specifying a comma-separated list of names defines a one port output.
  • Configuration: Optional configuration for the component. This is static data that shall be used to construct components used in this component.
  • Declarations: Optional declarations of components used in this component. Configuration is used to construct imported components.
  • Definition: The as portion of the component definition is an expression which defines how the constructed components are to be combined to create the computation required for the component.

The definition of a component can use the following pre-defined components:

  • first - This component takes one expression with a one port input and creates a two port input and output component. The provided component is applied only to the first port of the input.
  • second - This component takes one expression with a one port input and creates a two port input and output component. The provided component is applied only to the second port of the input.
  • split - Split is a component with one input port and two output ports. The value of each output is the input, i.e., the input is split into two copies.
  • merge - Merge values from the two port input to a one port output. A comma-separated list of top and bottom keywords subscripted with input names are used to map these values to a new name. E.g., merge top[a] -> top_a, bottom[b] -> bottom_b takes the a value of the top input and maps that value to a new name top_a, and the b value of the bottom input and maps that value to a new name bottom_b.
  • wire - Wires are used to adapt one component's output to another's input. For wires with one input port and one output port, the wire mapping is a comma-separated mapping, e.g., wire a -> next_a, b -> next_b adapts a one port output component whose outputs are a and b to a one port component whose inputs are next_a and next_b. For wires with two input and output ports, the mappings are given in comma-separated parentheses, e.g., wire (a -> next_a, b -> next_b), (c -> next_c, d -> next_d). This wire adapts the top input from a to next_a, and b to next_b, and the bottom input from c to next_c and d to next_d.
  • if - Conditional execution of a component can be achieved with the if component. This component takes three arguments: a conditional expression, a then component and an else component. If the condition is evaluated to a truthy value the then component is executed, otherwise the else component is executed. See the conditional example in the PCL Git repository for an example of usage.
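The behaviour of the pre-defined components can be modelled with plain Python functions. The sketch below is an illustration of the semantics described above, not PCL's implementation: it assumes that a one port value is a dictionary, that a two port value is a (top, bottom) tuple of dictionaries, and that the condition of if is a predicate on the input. The function names `wire2` and `if_then_else` are hypothetical.

```python
# Illustrative model of PCL's pre-defined components (not the PCL runtime).
# One port value: dict. Two port value: (top, bottom) tuple of dicts.


def first(component):
    # Apply component to the top port only.
    return lambda pair: (component(pair[0]), pair[1])


def second(component):
    # Apply component to the bottom port only.
    return lambda pair: (pair[0], component(pair[1]))


def split(value):
    # One input port, two output ports carrying the same value.
    return (value, value)


def wire(mapping):
    # One port wire: rename keys, e.g. {"a": "next_a"}.
    return lambda value: {new: value[old] for old, new in mapping.items()}


def wire2(top_mapping, bottom_mapping):
    # Two port wire: one mapping per port.
    return lambda pair: (wire(top_mapping)(pair[0]),
                         wire(bottom_mapping)(pair[1]))


def merge(top_mapping, bottom_mapping):
    # Two port input to one port output, renaming the selected values.
    def run(pair):
        merged = wire(top_mapping)(pair[0])
        merged.update(wire(bottom_mapping)(pair[1]))
        return merged
    return run


def if_then_else(condition, then_component, else_component):
    # Run then_component when the condition is truthy, else else_component.
    def run(value):
        chosen = then_component if condition(value) else else_component
        return chosen(value)
    return run


# merge top[a] -> top_a, bottom[b] -> bottom_b
merged = merge({"a": "top_a"}, {"b": "bottom_b"})(({"a": 1}, {"b": 2}))
print(merged == {"top_a": 1, "bottom_b": 2})  # True
```

In this model a wire is simply a renaming of dictionary keys, which is why components written against different names can be connected without changing their code.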

The following combinator operators are used to compose the pipeline:

  • >>> - Composition. This operator composes two components. E.g., a >>> b creates a component in which a is executed before b.
  • *** - Parallel execution. This operator creates a component in which the two components provided are executed in parallel. E.g., a *** b creates a component with two input and output ports.
  • &&& - Parallel execution. The operator creates a component in which two components are executed in parallel from a single input port. E.g., a &&& b creates a component with one input port and two output ports.
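The three operators can be modelled as higher-order functions. The sketch below only illustrates how the ports line up; it is not how PCL implements them. The names `compose`, `parallel` and `fanout` are hypothetical, two port values are assumed to be (top, bottom) tuples, and the two branches run one after the other here rather than in parallel.

```python
# Illustrative model of the PCL combinators (not the PCL runtime).


def compose(a, b):
    # a >>> b : run a, then feed its output to b.
    return lambda value: b(a(value))


def parallel(a, b):
    # a *** b : two input ports, two output ports.
    return lambda pair: (a(pair[0]), b(pair[1]))


def fanout(a, b):
    # a &&& b : one input port, two output ports.
    return lambda value: (a(value), b(value))


def add_one(d):
    return {"x": d["x"] + 1}


def double(d):
    return {"x": d["x"] * 2}


# (add_one &&& double) >>> (double *** add_one)
pipeline = compose(fanout(add_one, double), parallel(double, add_one))
print(pipeline({"x": 3}))  # ({'x': 8}, {'x': 7})
```

The difference between *** and &&& is visible in the argument each takes: `parallel` expects a pair and routes one element to each component, whereas `fanout` hands the same single value to both.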

Examples in the PCL Git repository show the usage of these operators and pre-defined components. In addition, an example Moses training pipeline is available in the contrib/arrow-pipelines directory of the mosesdecoder Git repository. Please see contrib/arrow-pipelines/README for details of how to compile and run this pipeline.

For more details of how to use PCL please see the latest manual at

 contrib/arrow-pipelines/python/pcl/documentation/pcl-manual.latest.pdf
Page last modified on February 13, 2015, at 04:52 PM