Training OCR Systems Using Variants of Ideal Images
Ibrahim S. I. Abuhaiba
Department of Electrical and Computer Engineering, Islamic University of Gaza,
P. O. Box 1276, Gaza, Palestine,
isabuhaiba@yahoo.com
Abstract
This paper shows that the learning stage of high-
performance character classifiers can be achieved using
only ideal images and/or simple variants of them. We
are interested in 94 character classes of the ASCII
character set. We use a software tool to generate binary
ideal images of these characters. Ideal images are
supported with simple variants derived from them to
represent character classes. Learning in a nearest
neighbor classifier (type-A) is performed using ideal
images of characters and their variants. Pixel intensity
values are used as features. To judge the effectiveness
of this classifier, another novel nearest neighbor classifier
(type-B) is built and trained using real images.
These classifiers are tested using a real dataset. The
overall recognition, error, and rejection rates of a three-
variant type-A classifier are 98.5%, 1.5%, and 0.0%,
respectively. The recognition rate of the type-B classifier
exceeds that of the three-variant type-A classifier
by only 1.2%. Using other kinds of classifiers, or
multiple-classifier techniques, is expected to produce
much more impressive results for the type-A classifier. The
three-variant type-A classifier requires 57 ms
to recognize a character. The type-B classifier requires
less than half of the time required by the type-A
classifier.
Keywords: Ideal Images, Synthetic Datasets, Real
Datasets, Document Image Defect Models, Nearest
Neighbour Classifier.
1. Introduction
This paper aims at showing that high-performance
character classifiers can be trained using only ideal
images and/or simple variants of them, which has many
advantages:
- Training datasets are no longer needed, which saves
  cost, time, and effort.
- The size of the dataset on which the classifier is
  built becomes very small since, in a specific font,
  every character has only one ideal shape. Even if
  variants of the ideal shape are used, they are few.
- Software tools are currently available to easily and
  quickly generate ideal character shapes and save
  them in the desired resolution and image format.
- Ideal character shapes and their simple variants
  don't depend on printing, photocopying, scanning,
  and similar technologies. Once they are generated and
  saved, there is no need to reproduce them,
  which also saves cost, time, and effort.
- If the ideal shape of a character and a few other
  variants prove to be powerful in classification, then
  this reduces the size of testing datasets.
- Training on mislabeled data is asking for trouble.
  Training on correctly labeled data, but of low image
  quality, is just as dangerous. Training on correctly
  labeled data of ideal image quality is expected to
  reduce these troubles.
Currently, some authors mention that significant
improvement in accuracy on image pattern recognition
problems depends on the size and quality of training sets
[1]. Indeed, as printing, photocopying, scanning, faxing,
and similar technologies develop and change, the need for
larger and larger training datasets arises, and the
definition of a good training dataset changes to the extent
that a dataset that is good today may become
useless tomorrow. What remains constant are the ideal
shapes of characters. Therefore, we believe that OCR
researchers should instead concentrate on the choice of image
features and classification algorithms.
Researchers in the last decade have intensified their
study of explicit, quantitative, parameterized models of
image defects that occur during printing and scanning.
Several models have been proposed, some motivated by
the physics of image formation and others by the surface
statistics of image distributions. A wide range of
techniques for estimating parameters of these models
has been explored [1-9]. Applications of these models
have taken many forms. Perhaps their widest impact is in
the generation of synthetic data sets used in training
classifiers for document image recognition systems [2,
10-13]. In an early experiment, a Tibetan OCR system
was constructed using training data that was initialized
with real images but augmented by synthetic variations
[2]. Perhaps the first large-scale use of these models to
help construct an industrial-strength classifier was [10], in
which a full-ASCII 100-typeface classifier was built
GVIP 05 Conference, 19-21 December 2005, CICC, Cairo, Egypt
using synthetic data only (and tested, with good results,
on real images). In a series of similar trials, synthetic data
has been used to construct pre-classification decision trees
with bounded error (which also worked well in practice)
[14]. The question now is: can we build such high-
performance large-scale classifiers using only ideal
images and simple variants and achieve the several
advantages mentioned earlier? We think so, and the
results of this research strongly support this belief.
The increasing usage of synthetic data, along with or
instead of real data, in training and testing of recognition
systems has provoked a debate among engineers, which
was reflected in a panel discussion organized at the 5th
ICDAR [1, 15]. Six panel members spoke to the question
"under what circumstances is it advisable to use synthetic
data?" Points related to our work are summarized here
along with our comments:
- The classifier that is trained on the most data wins,
  but as the size of the data increases, the cost increases.
- It doesn't matter how much real data you train on:
  it's never enough. We will show, however, that it is enough
  for us to learn only ideal shapes and variants in order to
  recognize a great deal of other real data.
- Training only on real data feels safe (after all,
  almost everyone does it), but it isn't.
- Real data is corrupting: it is so expensive that we
  reuse it, repeatedly, with unprincipled abandon. As
  printing, photocopying, scanning, and similar technologies
  develop and change, old real data may have to be
  replaced at large new expense.
- Training on mislabeled data is asking for trouble.
  Training on correctly labeled data, but of low image
  quality, is just as dangerous. Training on correctly
  labeled data of ideal image quality is expected to
  reduce these troubles.
- We don't know how to separate helpful from unhelpful
  training data, whether real or synthetic. However, we
  intuitively believe that ideal data are always helpful
  until somebody proves the opposite.
- The document image degradation model research area
  will live forever, since we will never agree on the
  models. Even worse, if we agree on the models now,
  they may become obsolete in the future due to
  technological advances and changes in image
  generation and acquisition devices.
The rest of the paper is organized as follows. In Section 2,
the generation of ideal prototypes and variants is
described. Next, two classifiers (type-A and type-B) are
presented in Section 3. The type-A classifier is a nearest
neighbor classifier trained on ideal images and variants of
them. The type-B classifier is also a nearest neighbor classifier;
however, it is trained on real images. It is
built to help evaluate the type-A classifier. Experimental
results and performance are reported in Section 4. Finally,
the paper is concluded in Section 5.
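
To make the nearest neighbor idea concrete before the details of Section 3, the following is a minimal sketch, not the paper's implementation. It assumes binary images of equal size stored as lists of rows, uses the count of differing pixels (Hamming distance) as the dissimilarity measure, and adds an optional rejection threshold; the names `hamming_distance`, `classify`, and `reject_above`, as well as the choice of distance, are illustrative assumptions.

```python
def hamming_distance(a, b):
    """Count the pixels that differ between two equally sized binary images."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def classify(image, prototypes, reject_above=None):
    """Return the label of the nearest prototype, or None if rejected.

    prototypes   -- list of (label, binary_image) pairs; for a type-A style
                    classifier these would be ideal images and their variants.
    reject_above -- hypothetical threshold: reject if the best distance
                    exceeds it (None disables rejection).
    """
    best_label, best_dist = None, None
    for label, proto in prototypes:
        d = hamming_distance(image, proto)
        if best_dist is None or d < best_dist:
            best_label, best_dist = label, d
    if reject_above is not None and best_dist is not None and best_dist > reject_above:
        return None
    return best_label


# Toy 3x3 "glyphs" standing in for ideal character images (1 = ink).
PROTOTYPES = [
    ("I", [[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    ("L", [[1, 0, 0], [1, 0, 0], [1, 1, 1]]),
]

# A "real" sample: the ideal "I" with one extra ink pixel.
noisy_i = [[0, 1, 0], [0, 1, 1], [0, 1, 0]]
print(classify(noisy_i, PROTOTYPES))
```

Because pixel values are the features, adding variants of an ideal image simply adds more `(label, image)` pairs under the same label; the search itself is unchanged.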
2. Generation of Ideal Prototypes and Variants
There are many software tools that generate near-ideal
images of printed documents. The user can determine the
required quality in terms of resolution, color, and other
factors according to their needs. Documents can be
prepared using any word processor such as MS WORD.
Then, when printing the document, instead of directing
the document to a real printer, it can be directed to a
special software tool that accepts the document as if it
were a printer. After the tool accepts the document, it is
displayed as an image that can be manipulated using the
software tool and saved in the desired format.
In this research, we are interested in the 94 character
classes of the ASCII character set shown in Figure 1. We
used a software tool to generate binary ideal images of
these characters. Figure 2(a) shows a 20-times enlarged
ideal image of the letter "A" generated using this software.
The original letter was set in the Times New Roman font
at a size of 10 points. The resolution was set to 600
dpi in both directions. Ideal images of other characters
can be generated using the same software tool.
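
As a rough check on the raster size these settings imply, the sketch below converts a point size to pixels at a given resolution and enlarges a binary image by pixel replication, as in the 20-times enlargement of Figure 2(a). It assumes the common convention of 72 points per inch; the function names are illustrative, and the result is the nominal em size of the font, not the measured size of any particular glyph.

```python
POINTS_PER_INCH = 72  # assumed convention: one typographic point = 1/72 inch


def em_size_in_pixels(point_size, dpi):
    """Nominal em height in pixels for a font size (points) at a resolution (dpi)."""
    return point_size * dpi / POINTS_PER_INCH


def enlarge(image, factor):
    """Enlarge a binary image (list of rows) by replicating each pixel factor x factor times."""
    enlarged = []
    for row in image:
        wide_row = [pixel for pixel in row for _ in range(factor)]
        # Copy the widened row so the repeated rows are independent lists.
        enlarged.extend(list(wide_row) for _ in range(factor))
    return enlarged


# 10-point type at 600 dpi: roughly 83 pixels per em.
print(round(em_size_in_pixels(10, 600), 1))

# Pixel replication on a tiny example; a factor of 20 would be used for display.
print(enlarge([[1, 0]], 2))
```

Individual characters occupy only part of the em square, so their ideal images are smaller than this nominal figure.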
As mentioned earlier, it is difficult to precisely model
the random noise that corrupts ideal images. However, natural
processes such as printing, photocopying, and scanning
thicken the strokes that constitute individual
characters. Typically, real strokes are a few pixels wider
than the corresponding ideal strokes. This results in
discrepancies between real and ideal character images,
which vary from char