Building pipelines can be tedious and error-prone. Pipelines built from Moses scripts are hampered by the fact that each script must be able to parse the output of the previous script. Moving a script to a different position in the pipeline is tricky and may require a code change. It would be better if scripts were re-usable without change, so that users could build up a library of computational pieces that can be used in any pipeline, in any position.
Since pipelines are widely used in machine translation, and given the problem outlined above, a more convenient and less error-prone way of quickly building pipelines from re-usable components would aid construction.
A domain-specific language called Pipeline Creation Language (PCL) has been developed as part of the MosesCore project (European Commission Grant Number 288487 under the 7th Framework Programme). PCL enables users to gather components into libraries, or packages, and re-use them in pipelines. Each component defines inputs and outputs, which the PCL compiler checks to verify that components are compatible with each other.
PCL is a general-purpose language that can be used to construct non-recurrent software pipelines. To adapt your existing programs and scripts for use with PCL, a Python wrapper must be defined for each program. This builds up a library of components which are combined with others in PCL files. The Python wrapper scripts must implement the following function interface:
get_name()
- Returns an object representing the name of the component. The __str__() function should be implemented to return a meaningful name.
get_inputs()
- Returns the inputs of the component. Components should only be defined with one input port. A list of input names must be returned.
get_outputs()
- Returns the outputs of the component. Components should only be defined with one output port. A list of output names must be returned.
get_configuration()
- Returns a list of names that represent the static data that shall be used to construct the component.
configure(args)
- This function is the component designer's chance to preprocess configuration injected at runtime. The args parameter is a dictionary that contains all the configuration provided to the pipeline. This function is to filter out, and optionally preprocess, the configuration used by this component. This function shall return an object containing the configuration necessary to construct this component.
initialise(config)
- This function is where the component designer defines the component's computation. The function receives the output object from the configure() function and must return a function that takes two parameters, an input object, and a state object. The input object is a dictionary that is received from the previous component in the pipeline, and the state object is the configuration for the component. The returned function should be used to define the component's computation.
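The interface above can be illustrated with a minimal sketch of a wrapper module. Only the six function names and their documented return shapes come from the description above; the component itself (an upper-caser), the port names text and upper_text, and the configuration key suffix are hypothetical and chosen purely for illustration.

```python
# Hypothetical wrapper module for a component that upper-cases a string.
# The six function names follow the interface described above; the port
# names, the configuration key and the computation are illustrative only.


class ComponentName(object):
    """Name object whose __str__() yields a readable component name."""

    def __str__(self):
        return "uppercaser"


def get_name():
    return ComponentName()


def get_inputs():
    # Names of the values carried on the single input port.
    return ["text"]


def get_outputs():
    # Names of the values carried on the single output port.
    return ["upper_text"]


def get_configuration():
    # Names of the static configuration values this component needs.
    return ["suffix"]


def configure(args):
    # args holds all configuration given to the pipeline; keep only
    # the part this component uses.
    return {"suffix": args["suffix"]}


def initialise(config):
    # Return the function that performs the component's computation.
    def process(a, s):
        # a: input dictionary received from the previous component
        # s: state object (this component's configuration)
        return {"upper_text": a["text"].upper() + s["suffix"]}

    return process
```

Calling configure({"suffix": "!", "other": 1}) keeps only the suffix entry, and the function returned by initialise() maps {"text": "hello"} to {"upper_text": "HELLO!"} when given that configuration as its state. How the PCL runtime itself invokes these functions is not shown here.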
Once your library of components has been written, the components can be combined using PCL. A PCL file defines one component which uses other defined components. For example, the following file defines a component that performs tokenisation for source and target files.
#
# Component definition: 2 input ports, 2 output ports
#
#                 +---------+
# src_filename -->+         +--> tokenised_src_filename
#                 |         |
# trg_filename -->+         +--> tokenised_trg_filename
#                 +---------+
#
import wrappers.tokenizer.tokenizer as tokeniser

component src_trg_tokeniser
  inputs (src_filename), (trg_filename)
  outputs (tokenised_src_filename), (tokenised_trg_filename)
  configuration tokeniser.src.language,
                tokeniser.src.tokenisation_dir,
                tokeniser.trg.language,
                tokeniser.trg.tokenisation_dir,
                tokeniser.moses.installation
  declare
    src_tokeniser := new tokeniser with
      tokeniser.src.language -> language,
      tokeniser.src.tokenisation_dir -> tokenisation_dir,
      tokeniser.moses.installation -> moses_installation_dir
    trg_tokeniser := new tokeniser with
      tokeniser.trg.language -> language,
      tokeniser.trg.tokenisation_dir -> tokenisation_dir,
      tokeniser.moses.installation -> moses_installation_dir
  as
    wire (src_filename -> filename),
         (trg_filename -> filename) >>>
    (src_tokeniser *** trg_tokeniser) >>>
    wire (tokenised_filename -> tokenised_src_filename),
         (tokenised_filename -> tokenised_trg_filename)
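A PCL file is composed of the following bits:
import
- Each imported component is given an alias. In the example, the component wrappers.tokenizer.tokenizer shall be referenced in this file by the name tokeniser.
component
- The component's name must match the name of the file that defines it, e.g., the component defined in fred.pcl must be called fred.
as
- The as portion of the component definition is an expression which defines how the constructed components are to be combined to create the computation required for the component.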
The definition of a component can use the following pre-defined components:
first
- This component takes one expression with a one port input and creates a two port input and output component. The provided component is applied only to the first port of the input.
second
- This component takes one expression with a one port input and creates a two port input and output component. The provided component is applied only to the second port of the input.
split
- Split is a component with one input port and two output ports. The value of each output is the input, i.e., the input is split across both outputs.
merge
- Merge values from the two port input to a one port output. A comma-separated list of top and bottom keywords subscripted with input names is used to map these values to a new name. E.g., merge top[a] -> top_a, bottom[b] -> bottom_b takes the a value of the top input and maps that value to a new name top_a, and the b value of the bottom input and maps that value to a new name bottom_b.
wire
- Wires are used to adapt one component's output to another's input. For wires with one input and output port, the wire mapping is a comma-separated mapping, e.g., wire a -> next_a, b -> next_b adapts a one port output component whose outputs are a and b to a one port component whose inputs are next_a and next_b. For wires with two input and output ports, mappings are in comma-separated parentheses, e.g., wire (a -> next_a, b -> next_b), (c -> next_c, d -> next_d). This wire adapts the top input from a to next_a and b to next_b, and the bottom input from c to next_c and d to next_d.
if
- Conditional execution of a component can be achieved with the if component. This component takes three arguments: a conditional expression, a then component and an else component. If the condition is evaluated to a truthy value the then component is executed, otherwise the else component is executed. See the conditional example in the PCL Git repository for an example of usage.
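The data flow of these pre-defined components can be modelled in a few lines of Python. This is a sketch of their described behaviour only, not PCL's implementation: it assumes a one port value is a dictionary and a two port value is a (top, bottom) tuple of dictionaries, that a wire passes on only the names it maps, and that the condition of if is a callable on the input. The function names mirror the PCL keywords, with if_component standing in for if.

```python
# Illustrative model only. A one port value is a dict; a two port value
# is a (top, bottom) tuple of dicts. These are assumptions of the sketch.


def first(component):
    """Apply component to the top port; pass the bottom port through."""
    return lambda ports: (component(ports[0]), ports[1])


def second(component):
    """Apply component to the bottom port; pass the top port through."""
    return lambda ports: (ports[0], component(ports[1]))


def split(value):
    """Copy a one port input onto both ports of a two port output."""
    return (dict(value), dict(value))


def merge(top_map, bottom_map):
    """Build a one port output from named values of the two input ports."""
    def run(ports):
        top, bottom = ports
        merged = {new: top[old] for old, new in top_map.items()}
        merged.update({new: bottom[old] for old, new in bottom_map.items()})
        return merged
    return run


def wire(mapping):
    """Rename the values of a one port output (old name -> new name)."""
    return lambda value: {new: value[old] for old, new in mapping.items()}


def if_component(condition, then_component, else_component):
    """Run then_component when condition(value) is truthy, else the other."""
    return lambda value: (then_component(value) if condition(value)
                          else else_component(value))
```

For example, merge({"a": "top_a"}, {"b": "bottom_b"}) applied to split({"a": 1, "b": 2}) yields {"top_a": 1, "bottom_b": 2}, which corresponds to the merge top[a] -> top_a, bottom[b] -> bottom_b mapping described above.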
Combinator operators are used to compose the pipeline. They are:
>>>
- Composition. This operator composes two components. E.g., a >>> b creates a component in which a is executed before b.
***
- Parallel execution. This operator creates a component in which the two components provided are executed in parallel. E.g., a *** b creates a component with two input and output ports.
&&&
- Parallel execution. The operator creates a component in which two components are executed in parallel from a single input port. E.g., a &&& b creates a component with one input port and two output ports.
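The port shapes produced by the three operators can be sketched with plain Python functions. The names compose, parallel and fanout are hypothetical stand-ins for >>>, *** and &&&, and a two port value is assumed to be a (top, bottom) tuple. The sketch models the data flow only: it runs the two components one after the other and says nothing about how PCL schedules parallel execution.

```python
# Illustrative model of the combinators; not PCL's implementation.


def compose(a, b):
    """a >>> b: feed the output of a into b."""
    return lambda value: b(a(value))


def parallel(a, b):
    """a *** b: a handles the top port, b handles the bottom port."""
    return lambda ports: (a(ports[0]), b(ports[1]))


def fanout(a, b):
    """a &&& b: both components receive the same one port input."""
    return lambda value: (a(value), b(value))


# Two toy components operating on dictionaries.
def increment(d):
    return {"n": d["n"] + 1}


def double(d):
    return {"n": d["n"] * 2}


sequential = compose(increment, double)({"n": 3})          # {"n": 8}
side_by_side = parallel(increment, double)(({"n": 1}, {"n": 2}))
shared_input = fanout(increment, double)({"n": 3})
```

Here side_by_side is ({"n": 2}, {"n": 4}) and shared_input is ({"n": 4}, {"n": 6}), matching the two port outputs described for *** and &&&.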
Examples in the PCL Git repository show the usage of these operators and pre-defined components. In addition, an example Moses training pipeline is available in the contrib/arrow-pipelines directory of the mosesdecoder Git repository. Please see contrib/arrow-pipelines/README for details of how to compile and run this pipeline.
For more details of how to use PCL, please see the latest manual at contrib/arrow-pipelines/python/pcl/documentation/pcl-manual.latest.pdf