Fourth Machine Translation Marathon | Main / FailFinder

We propose FailFinder, a debugger for MT decoders and syntactic models of translation. It includes a graphical analysis tool for the Joshua (or perhaps cdec) decoder that tells the user why specific hypotheses were not chosen, given a reference. A reference can be 1) a token sequence, 2) a sequence of target phrases and their source spans, or 3) a target parse tree and its source spans. Further, the user can create custom desired hypotheses on the fly by selecting a partial hypothesis in a given cell and substituting it into the reference for the given span. This increases the chance that the new reference is still reachable, since it is composed of pieces already in the hypergraph. A failure occurs if the reference is not the top-best hypothesis according to vanilla decoding. We would like to classify these failures into one of the following categories:

Reachability
1. Not present in lexicon
2. Cannot be combined (i.e. error in search space/grammar)
3. Wrong derivation (a derivation exists, but not the right one)
Search
1. Pruned in hypergraph construction
2. Pruned in k-best extraction
Scoring
1. Feature definition: the model score of the force-decoded desired hypothesis is lower than that of the top-best hypothesis from model decoding
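The categories above can be read as a decision procedure. The following is a minimal sketch of that reading, not FailFinder's actual implementation: the `Evidence` record, its field names, and the `classify` function are hypothetical, and it assumes that a higher model score is better. Reachability problems end the analysis early, while search and scoring failures can both be reported for the same sentence.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Failure(Enum):
    NOT_IN_LEXICON = "reachability: not present in lexicon"
    CANNOT_COMBINE = "reachability: cannot be combined"
    WRONG_DERIVATION = "reachability: wrong derivation"
    PRUNED_IN_HYPERGRAPH = "search: pruned in hypergraph construction"
    PRUNED_IN_KBEST = "search: pruned in k-best extraction"
    SCORING = "scoring: reference scores below top-best"


@dataclass
class Evidence:
    """Hypothetical summary of one sentence's vanilla and forced decodes."""
    words_in_lexicon: bool         # every reference token is covered by the lexicon
    forced_derivation_found: bool  # forced decoding produced some derivation
    derivation_matches: bool       # that derivation matches the desired spans/tree
    in_vanilla_hypergraph: bool    # reference survived hypergraph construction
    in_vanilla_kbest: bool         # reference survived k-best extraction
    forced_score: float            # model score of the force-decoded reference
    topbest_score: float           # model score of the vanilla top-best hypothesis


def classify(ev: Evidence) -> List[Failure]:
    # A reachability failure makes the search and scoring checks meaningless.
    if not ev.words_in_lexicon:
        return [Failure.NOT_IN_LEXICON]
    if not ev.forced_derivation_found:
        return [Failure.CANNOT_COMBINE]
    if not ev.derivation_matches:
        return [Failure.WRONG_DERIVATION]

    failures = []
    # Search: report the earliest stage at which the reference was lost.
    if not ev.in_vanilla_hypergraph:
        failures.append(Failure.PRUNED_IN_HYPERGRAPH)
    elif not ev.in_vanilla_kbest:
        failures.append(Failure.PRUNED_IN_KBEST)
    # Scoring: assumes higher model score is better.
    if ev.forced_score < ev.topbest_score:
        failures.append(Failure.SCORING)
    return failures
```

Returning a list rather than a single label lets the caller see when fixing a search error would only expose a scoring error.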

Once a failure is classified, FailFinder can provide more information in each context. In the case of a reachability error, FailFinder could suggest which derivation rule(s) could be used to reach the reference (giving preference to simpler rules). In the case of a search error, FailFinder could specify the cell(s) at which a desirable partial hypothesis was pruned in the vanilla decode even though it was present in the forced decode, and show the difference between the feature vectors of the force-decoded and the vanilla-decoded partial hypotheses. In the case of a scoring error, it can display the difference in the feature vectors to highlight which features vary most.
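As an illustration of the search-error and scoring-error reports described above, here is a small sketch under stated assumptions. It assumes a chart can be represented as a mapping from a source span `(start, end)` to the set of target strings in that cell, and a feature vector as a mapping from feature name to value. The function names and feature names are hypothetical and do not reflect Joshua's or cdec's data structures.

```python
def feature_diff(forced, vanilla):
    """Return (feature, forced - vanilla) pairs, largest absolute difference first.

    Features missing from one vector are treated as 0.0; ties are broken by name.
    """
    names = set(forced) | set(vanilla)
    diffs = {n: forced.get(n, 0.0) - vanilla.get(n, 0.0) for n in names}
    return sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))


def pruned_cells(forced_chart, vanilla_chart):
    """List cells where a force-decoded partial hypothesis is absent from the vanilla chart."""
    pruned = []
    for span, forced_items in sorted(forced_chart.items()):
        missing = forced_items - vanilla_chart.get(span, set())
        if missing:
            pruned.append((span, sorted(missing)))
    return pruned


# Example with made-up feature values and chart contents.
forced_features = {"lm": -4.0, "tm": -2.0, "wordpenalty": -3.0}
vanilla_features = {"lm": -2.5, "tm": -2.2, "wordpenalty": -3.0}
ranking = feature_diff(forced_features, vanilla_features)

forced_chart = {(0, 2): {"the house"}, (2, 3): {"is"}}
vanilla_chart = {(0, 2): {"a house", "the home"}, (2, 3): {"is"}}
lost = pruned_cells(forced_chart, vanilla_chart)
```

Here `ranking` puts `lm` first because it differs most between the two hypotheses, and `lost` names the single cell `(0, 2)` where the desired partial hypothesis is missing from the vanilla chart.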

Notes:

Scoring errors and search errors are not mutually exclusive -- this tool can help you know if it's even worth fixing the search errors or if fixing them will eventually just result in another type of error

We could use training data if reachability is an issue
Forced decoding != oracle decoding -- pruning still happens during oracle decoding, making it problematic here

GUI should distinguish hypotheses in hypergraph from those in k-best
GUI should distinguish forced hypotheses, perturbed hypotheses, and actual hypotheses

As time permits, we could also compare two charts from two different models for a deeper analysis of not only which sentences got better or worse according to some metric, but also why they did, since it is not always clear whether a change is due to improved modeling or improved search.
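One possible shape for such a comparison is sketched below. It assumes each model provides a per-sentence metric score (higher is better) and a per-sentence set of failure labels such as "search" or "scoring". The function name, the label strings, and the input layout are all hypothetical.

```python
def compare_models(scores_a, scores_b, failures_a, failures_b):
    """For each sentence scored by both models, report the metric trend from
    model A to model B and which failure labels were fixed or introduced."""
    report = []
    for sent_id in sorted(scores_a.keys() & scores_b.keys()):
        delta = scores_b[sent_id] - scores_a[sent_id]
        trend = "better" if delta > 0 else "worse" if delta < 0 else "same"
        before = failures_a.get(sent_id, set())
        after = failures_b.get(sent_id, set())
        report.append((sent_id, trend, sorted(before - after), sorted(after - before)))
    return report


# Example with made-up scores and failure labels.
report = compare_models(
    scores_a={1: 0.30, 2: 0.50},
    scores_b={1: 0.45, 2: 0.40},
    failures_a={1: {"search"}, 2: set()},
    failures_b={1: set(), 2: {"scoring"}},
)
```

In this example, sentence 1 improves and its search failure disappears, which points toward improved search, while sentence 2 gets worse and gains a scoring failure, which points toward the model.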

Features already in FailFinder:

GUI tool for displaying search space, partial hypotheses, feature scores, and derivations
Dumping the details of the search from Joshua (including the hypergraph, partial k-best hypotheses, their feature values, and their future costs)

Proposed work:

Forced decoding in Joshua (or usage of cdec)
Client-server framework for Joshua (for interactive perturbations) so that decoding can run on large-memory server and GUI can run on laptop
Enhance GUI to support visual cues for differentiating the hypotheses produced by a) vanilla decoding, b) forced decoding of the reference, and c) forced decoding of the desired hypothesis

Subtasks for group members:

Work on the GUI to easily expose these features to the user (1 Java Swing coder)
Create the server-client architecture for Joshua (1 general Java coder)
Implement forced hypergraph decoding in Joshua (1 Java coder w/ decoding knowledge)
Implement forced k-best decoding in Joshua (1 Java coder w/ decoding knowledge)

Page last modified on January 25, 2010, at 12:13 PM