Graph Transformer Networks for Image Recognition

d width=10 nowrap>Yahoo! is not affiliated with the authors of this page or responsible for its content.
Graph Transformer Networks for Image Recognition Graph Transformer Networks for Image Recognition
L´eon Bottou and Yann Le Cun
NEC Labs of America
4 Independence Way, Princeton NJ08540, USA.
leon@bottou.org The Courant Institute, New York University
715 Broadway, New York NY10003, USA.
yann@cs.nyu.edu
This contribution takes the example of a check reading system to discuss the modeling
and estimation issues associated with large scale pattern recognition systems.
1. Problem Decomposition and Model Composition
Consider a system that takes the image of a check and returns the check amount. This
system locates the numerical amount, recognizes digits or other symbols, and parses the check
amount. Accuracy should remain high despite countless variations in check layout, writing style
or amount grammar.
From an engineering perspective, one must design components for locating the amount,
segmenting characters, recognizing digits, and parsing the amount text. Yet it is very dicult
to locate the amount without identifying that it is composed of characters that mostly resemble
digits and form a meaningful check amount (not a date or a routing number). Purely sequential
approaches do not work. Components must interact, form hypotheses and backtrack erroneous
decisions. The orchestration is dicult to design and costly to maintain.
From a statistical perspective, one seeks to estimate and compare the posterior probabilities
P
(Y |X) where variable X represents a check image and variable Y represents a check amount.
Let us dene a suitable parametric model p (y|x), gather data pairs (x
i
, y
i
), and maximize the
likelihood
i
log p (y
i
|x
i
). Such a direct approach leads to problems of unpractical sizes. It is
therefore common to manually annotate some pairs (x
i
, y
i
) with detailled information such as
isolated character images T , character codes C, or sequences S of character codes. One can
then model P (C|T ) and P (Y |S) and obtain components such as a character recognizer or an
amount parser.
The statistical perspective suggests a principled way to orchestrate the interaction of
these components: let the global model p (y|x) be expressed as a composition of submodels
such as p (c|t) and p (y|s). The submodels are rst t using the detailled data. The resulting
parameters are used as a bias when tting the global model p (y|x) using the initial data pairs
(y
i
|x
i
). This bias can be viewed as a capacity control tool for structural risk minimization
(Vapnik, 1982).
Model composition works nicely with generative models where one seeks to estimate the
conditional input density P (X|Y ) instead of the posterior P (Y |X). For instance, Hidden
Markov Models (HMM) for speech recognition (Rabiner, 1989) use the decomposition
(1)
p (x
(1)
. . . x
( )
|y) =
s
(1)...s( )
p (s
(1)
. . . s
( )
|y)
t
=1...
p (x
(t)
|s
(t)
) p (s
(t)
|s
(t1)
)
where x
(1)
. . . x
( )
represents the sound signal, and where the sum is taken over all the sequences
s
(1)
. . . s
( )
of states (e.g. phonemes) that can represent the target word y. Such decompositions
are derived by applying the Bayes rule and making suitable conditional independence assump-
tions. This idea can be extended to write the much more complicated models that our check
reading example demands.
Both theoretical arguments (Vapnik, 1982) and empirical evidence (LeCun et al., 1998a;
Laerty et al, 2001) indicate that higher performance can be achieved by discriminant models that directly estimate the posterior probability P (Y |X). For instance, the parser of a dis-
criminant check reader can recognize that a character sequence resembles a date and therefore
cannot be the check amount. In contrast, generative models cannot use such negative reason-
ning: the parser can only describe the syntax of potential check amounts. This restriction must
be compensated by a much more detailled model. In fact, generative models waste computing
resources and data to indirectly estimate the prior input density P (X) which is not useful for
our recognition task.
Early discriminant Markov models were built by rewriting (1) as:
(2)
p (y|x
(1)
. . . x
( )
) =
s
(1)...s( )
p (y|s
(1)
. . . s
( )
)
t
=1...
p (s
(t)
|x
(t)
, s
(t1)
)
Unfortunately, such discriminant models have a severe aw (Bottou, 1991) known as
the label-bias problem (Laerty, 2001). Consider again the check reader example: because
posterior likelihoods are normalized, the submodel p (c|t) cannot express that subimage t does
not represent any recognizable character c. Yet this negative information is very useful for
segmenting image elds into characters.
LeCun et al. (1998a) solves the label-bias problem by modeling measures
p
(·) instead of
probabilities p(·). Measures obey the same axioms as probabilities except for the normalization
condition. This change does not aect the estimation process because we still estimate the
parameters by maximizing likelihoods computed by normalizing the measures:
(3)
max i
log p (y
i
|x
i
)
y p (y|x
i
)
The change only aects how complex models are built by composing submodels. Although
we add or multiply submodel measures as we would add or multiply probabilities, we do not
enforce normalization constraints when composing models. Normalization only occurs at the
ultimate level when estimating parameters.
Laerty et al. (2001) elegantly reach the same solution by casting discriminant Markov
models as Markov Random Fields: the Hammersley-Cliord theorem relates their joint prob-
ability with Gibbs distributions; each submodel score corresponds to a term of the potential
function; the nal normalization factor is the partition function. This interpretation provides
an additional justication for our framework.
2. Graph transformers networks
While Hidden Markov model deal with variable length sequences, our check reader must
deal with more complicated variabilities (check layout, character segmentation, fractions, etc.).
Writing the global discriminant model as a big formula like (2) is dicult and not very useful:
it does not suggest tractable algorithms for computing the quantities of interest.
Graph Transformer Networks (Bottou et al., 1997) provide a more convenient way to
express such models. Graph transformer networks use weighted acyclic directed graphs to
track multiple hypothesis. Figure 1 shows how a graph can be used to represent segmentation
hypotheses for an image representing a sequence of digits. Each hypotesis is represented by a
path linking the start node to the end node. The score
p
(·) of an hypothesis is the product of
the scores of each path component. The most likely hypothesis is easily found using the Viterbi
algorithm. The sum of all hypotheses scores is easily computed using similar factorization
techniques.
Figure 2 shows how a check reader model
p
(y|x) is expressed as a sequence of graph
transformations A rst transformer produces a graph representing the various elds of the
check. A second transformer renes this graph by describing segmentation hypotheses for
each eld. The next transformer emits character hypotheses C for each segment T . The next
transformer composes this graph with a grammar transducer. This composition produces a INPUT
32x32
Convolutions
Subsampling
Convolutions
Subsampling
Full connection
Full connection
Gaussian connections
feature maps
6@28x28
f. maps
6@14x14
f. maps 16@10x10
f. maps 16@5x5
layer
120
OUTPUT
95
layer
84
Segmentation Graph
Interpretation Graph
Grammar
Recognition Graph
Field Graph
Check Graph
Best Amount Graph
2nd Nat. Bank
$ *** 3.45
three dollars and 45/xx
not to exceed $10,000.00
$ *** 3.45
$10,000.00
45/xx
$
*
3
**
45
"$" 0.2
"*" 0.4
"3" 0.1
"B" 23.6
.......
"$" 0.2
"*" 0.4
"3" 0.1
.......
Recognition
Segmentation
Field Location
Viterbi
Answer
Parsing
Figures 1 (top left), 2 (right), and 3 (bottom left)
graph representing the amount hypotheses independently of the detailled syntax of the character
strings. The nal step produces the most likely amount using a Viterbi algorithm.
This structure was suggested by Pereira et al. (1994) in the case of generative models.
Instead of storing graphs in memory, they advocate representing graphs procedurally by dening
accessor functions that can be called to navigate the graph structure. Furthermore, they show
how graph transformations can be expressed as the composition of the input graph with a
third graph named a transducer. The composition operation denes the accessor functions of
the output graph on the basis of the accessor functions of the input graph and the accessor
functions of the transducer graph. The composition function is generic. All the specics of a
particular graph transformer are neatly encapsulated in the accessor functions of the transducer
graph.
The composition operation denes how output graph scores are computed by combining
input graph scores and transducer graph scores. The parameters of a particular graph trans-
former are expressed through the transducer graph scores. For instance, scores on the recog-
nition graph are computed by a convolutional neural network (gure 3) whose 60000 weights
represent the majority of the tunable parameters in the global model. This network is invoked
by the accessor functions of the transducer graph representing the recognition transformer.
All the critical graph transformer algorithms are implemented in terms of abstract graph
accessor functions. This level of ab