Feature combination strategies for saliency-based visual attention systems


Laurent Itti
Christof Koch
California Institute of Technology
Computation and Neural Systems Program
MSC 139-74, Pasadena, California 91125
E-mail: itti@klab.caltech.edu
Abstract.
Bottom-up or saliency-based visual attention allows primates to
detect nonspecific conspicuous targets in cluttered scenes. A
classical metaphor, derived from electrophysiological and
psychophysical studies, describes attention as a rapidly shiftable
spotlight. We use a model that reproduces the attentional scan paths
of this spotlight. Simple multi-scale feature maps detect local
spatial discontinuities in intensity, color, and orientation, and are
combined into a unique master or saliency map. The saliency map is
sequentially scanned, in order of decreasing saliency, by the focus of
attention. We here study the problem of combining feature maps, from
different visual modalities (such as color and orientation), into a
unique saliency map. Four combination strategies are compared using
three databases of natural color images: (1) simple normalized
summation, (2) linear combination with learned weights, (3) global
nonlinear normalization followed by summation, and (4) local nonlinear
competition between salient locations followed by summation.
Performance was measured as the number of false detections before the
most salient target was found. Strategy (1) always yielded poorest
performance and (2) best performance, with a threefold to eightfold
improvement in time to find a salient target. However, (2) yielded
specialized systems with poor generalization. Interestingly, strategy
(4) and its simplified, computationally efficient approximation (3)
yielded significantly better performance than (1), with up to fourfold
improvement, while preserving generality.
© 2001 SPIE and IS&T. [DOI: 10.1117/1.1333677]
1 Introduction
Primates use saliency-based attention to detect, in real time,
conspicuous objects in cluttered visual environments. Reproducing such
nonspecific target detection capability in artificial systems has
important applications, for example, in embedded navigational aids, in
robot navigation and in battlefield management. Based on
psychophysical studies in humans and electrophysiological studies in
monkeys, it is believed that bottom-up visual attention acts in some
way akin to a spotlight [1-3]. The spotlight can rapidly shift across
the entire visual field (with latencies on the order of 50 ms), and
selects a small area from the entire visual scene. The neuronal
representation of the visual world is enhanced within the restricted
area of the attentional spotlight, and only this enhanced
representation is allowed to progress through the cortical hierarchy
for high-level processing, such as pattern recognition. Further,
psychophysical studies suggest that only this spatially circumscribed
enhanced representation reaches visual awareness and consciousness [4].
Where in a scene the focus of attention is to be deployed is
controlled by two tightly interacting influences: First, image-derived
or bottom-up cues attract attention towards conspicuous, or salient,
image locations in a largely automatic and unconscious manner; second,
attention can be shifted under top-down voluntary control towards
locations of cognitive interest, even though these may not be
particularly salient [5]. In the present study, we largely make
abstraction of the top-down component and focus on the bottom-up,
scene-driven component of visual attention. Thus, our primary interest
is in understanding, in biologically plausible computational terms,
how attention is attracted towards salient image locations.
Understanding this mechanism is important because attention is likely
to be deployed, during the first few hundred milliseconds after a new
scene is freely viewed, mainly based on bottom-up cues. For a model
which integrates a simplified bottom-up mechanism with a task-oriented
top-down mechanism, we refer the reader to the article by Schill et
al. in this issue and to Refs. 6 and 7.
A common view of how attention is deployed onto a given scene under
bottom-up influences is as follows. Low-level feature extraction
mechanisms act in a massively parallel manner over the entire visual
scene to provide the bottom-up biasing cues towards salient image
locations. Attention then sequentially focuses on salient image
locations to be analyzed in more detail [2,1]. Visual attention hence
allows for seemingly real-time performance by breaking down the
complexity of scene understanding into a fast temporal sequence of
localized pattern recognition problems [8].
Paper HVEI-10 received Aug. 9, 1999; revised manuscript received Aug. 25, 2000;
accepted for publication Sep. 20, 2000.
1017-9909/2001/$15.00 © 2001 SPIE and IS&T.
Journal of Electronic Imaging 10(1), 161-169 (January 2001).
Several models have been proposed to functionally account for many
properties of visual attention in primates [6,8-13]. These models
typically share a similar general architecture. Multi-scale
topographic feature maps detect local spatial discontinuities in
intensity, color, orientation and optical flow. In biologically
plausible models, this is usually achieved by using a center-surround
mechanism akin to biological visual receptive fields, a process also
known as a cortex transform in the image processing literature.
Receptive field properties can be well approximated by
difference-of-Gaussians filters (for nonoriented features) or Gabor
filters (for oriented features) [10,13]. Feature maps from different
visual modalities are then combined into a unique master or saliency
map [1,3]. In the models, as presumably in primates, the saliency map
is sequentially scanned, in order of decreasing saliency, by the focus
of attention (Fig. 1).
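
The center-surround idea described above can be illustrated with a minimal sketch of a rectified difference-of-Gaussians filter for a nonoriented feature such as intensity. This is only an illustration under stated assumptions, not the implementation used in the study: the function names, the two standard deviations (1.0 and 3.0 pixels), the border handling and the toy image are all hypothetical choices, and only the Python standard library is used.

```python
import math

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, normalized to sum to 1 (radius of about 3 sigma)."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve each row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        out.append([
            sum(kernel[k + radius] * row[min(max(x + k, 0), width - 1)]
                for k in range(-radius, radius + 1))
            for x in range(width)
        ])
    return out

def transpose(image):
    """Swap rows and columns of a 2-D list."""
    return [list(col) for col in zip(*image)]

def gaussian_blur(image, sigma):
    """Separable 2-D Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))

def center_surround(image, sigma_center=1.0, sigma_surround=3.0):
    """Rectified difference of Gaussians: |center - surround| at each pixel.

    Both sigma values are assumed, illustrative settings.
    """
    center = gaussian_blur(image, sigma_center)
    surround = gaussian_blur(image, sigma_surround)
    return [[abs(c - s) for c, s in zip(c_row, s_row)]
            for c_row, s_row in zip(center, surround)]

# Toy input: one bright pixel on a uniform dark background.
image = [[0.0] * 15 for _ in range(15)]
image[7][7] = 1.0
feature_map = center_surround(image)
```

On a uniform image the narrow and wide blurs agree, so the response is essentially zero, whereas a local discontinuity such as the isolated bright pixel produces a strong response at its location.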
A central problem, both in biological and artificial systems, is that
of combining multi-scale feature maps, from different visual
modalities with unrelated dynamic ranges (such as color and motion),
into a unique saliency map. Models usually assume simple summation of
all feature maps, or linear combination using ad-hoc weights. The
object of the present study is to quantitatively compare four
combination strategies using three databases of natural color images:
(1) simple summation after scaling to a fixed dynamic range; (2)
linear combination with weights learned, for each image database, by
supervised additive training; (3) nonlinear combination which enhances
feature maps with a few isolated peaks of activity, while suppressing
feature maps with uniform activity; and (4) local nonlinear iterative
competition between salient locations within each feature map,
followed by summation. The four strategies studied all involve a
point-wise linear combination of feature maps into the scalar saliency
map; the main difference between the four variants lies in the weights
given to the various features. Indeed, there is mounting
psychophysical evidence that different types of features do contribute
additively to salience, and not, for example, through point-wise
multiplication [14]. In the first three strategies, the different
features are weighted in a nontopographic manner (one scalar weight
for each entire map); in the fourth strategy, however, we will see
that the weights are adjusted at every image location depending on its
contextual surround.
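
To make the role of the per-map scalar weights concrete, the following sketch contrasts strategy (1), plain summation after rescaling, with a weighting in the spirit of strategy (3). The specific weight used here, the squared difference between a map's global maximum and the mean of its other local maxima, is an assumed form chosen for illustration; the text above only states that maps with a few isolated peaks are enhanced and maps with uniform activity are suppressed. All function names and the toy maps are hypothetical, and strategies (2) and (4) are not sketched.

```python
def rescale(feature_map, top=1.0):
    """Linearly rescale a map to the fixed range [0, top]."""
    values = [v for row in feature_map for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:  # flat map carries no information: return zeros
        return [[0.0 for _ in row] for row in feature_map]
    return [[top * (v - lo) / (hi - lo) for v in row] for row in feature_map]

def local_maxima(feature_map):
    """Values of pixels strictly greater than all in-bounds 8-neighbors."""
    h, w = len(feature_map), len(feature_map[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = feature_map[y][x]
            neighbors = [feature_map[j][i]
                         for j in range(max(0, y - 1), min(h, y + 2))
                         for i in range(max(0, x - 1), min(w, x + 2))
                         if (j, i) != (y, x)]
            if all(v > n for n in neighbors):
                peaks.append(v)
    return peaks

def peak_promotion_weight(feature_map):
    """Assumed weight: (global max - mean of the other local maxima) ** 2,
    computed on the rescaled map."""
    peaks = sorted(local_maxima(rescale(feature_map)), reverse=True)
    if not peaks:
        return 0.0
    others = peaks[1:]
    mean_others = sum(others) / len(others) if others else 0.0
    return (peaks[0] - mean_others) ** 2

def combine(feature_maps, weights=None):
    """Point-wise weighted sum of rescaled maps (one scalar weight per map)."""
    if weights is None:
        weights = [1.0] * len(feature_maps)
    scaled = [rescale(m) for m in feature_maps]
    h, w = len(scaled[0]), len(scaled[0][0])
    return [[sum(wt * m[y][x] for wt, m in zip(weights, scaled))
             for x in range(w)] for y in range(h)]

# Toy maps: one map with a single isolated peak, two maps with many
# comparable peaks (for example, a strongly textured background).
isolated = [[0, 0, 0, 0, 0],
            [0, 9, 0, 0, 0],
            [0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0]]
cluttered = [[5, 0, 5, 0, 5],
             [0, 0, 0, 0, 0],
             [5, 0, 6, 0, 5],
             [0, 0, 0, 0, 0],
             [5, 0, 5, 0, 5]]
maps = [isolated, cluttered, cluttered]

naive = combine(maps)                            # strategy (1)
weights = [peak_promotion_weight(m) for m in maps]
promoted = combine(maps, weights)                # in the spirit of strategy (3)
```

In this toy case the plain sum peaks at a location that is merely the strongest of many similar peaks in the cluttered maps, while the weighted sum peaks at the isolated target, because the cluttered maps receive a weight close to zero.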
2 Model
The details of the model used in the present study have been presented
elsewhere [13] and are briefly schematized in Fig. 1. For the purpose
of this study, it is only important to remember that different types
of features, such as intensity, color or orientation, are first
extracted in separate multi-scale feature maps, and then need to be
combined into a unique saliency map, whose activity controls attention
(Fig. 2).
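
As a rough illustration of how a saliency map can control attention, the sketch below visits locations in order of decreasing saliency and counts the attended locations that precede a given target, which mirrors the performance measure stated in the abstract. The suppression of a square neighborhood around each visited location, the exact-match test for "finding" the target, and all names and values are assumptions made for this example rather than details taken from the model description.

```python
def scan_path(saliency_map, n_shifts, inhibit_radius=1):
    """Visit up to n_shifts locations in order of decreasing saliency.

    After each visit, a square neighborhood around the attended location
    is suppressed (an assumed mechanism) so attention moves elsewhere.
    """
    work = [list(row) for row in saliency_map]  # leave the input untouched
    h, w = len(work), len(work[0])
    path = []
    for _ in range(n_shifts):
        value, y, x = max((v, j, i)
                          for j, row in enumerate(work)
                          for i, v in enumerate(row))
        if value == float("-inf"):
            break  # every location has already been suppressed
        path.append((y, x))
        for j in range(max(0, y - inhibit_radius), min(h, y + inhibit_radius + 1)):
            for i in range(max(0, x - inhibit_radius), min(w, x + inhibit_radius + 1)):
                work[j][i] = float("-inf")
    return path

def false_detections(path, target):
    """Number of locations attended before the target; None if never reached."""
    return path.index(target) if target in path else None

# Toy saliency map with the (hypothetical) target at row 3, column 3.
saliency = [[0.1, 0.0, 0.0, 0.7],
            [0.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 0.0],
            [0.9, 0.0, 0.0, 0.4]]
path = scan_path(saliency, n_shifts=3)
misses = false_detections(path, target=(3, 3))
```

Here the two more salient distractors are attended first, so two false detections occur before the target is reached; a better combination strategy would raise the target's saliency and shorten this path.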
2.1 Fusion of Information
One difficulty in combining different feature maps into a single
scalar saliency map is that these features represent a priori not
comparable modalities, with different dynamic ranges and extraction
mechanisms. Also, because many feature maps are combined (6 for
intensity computed at different spatial scales, 12 for color and 24
for orientation in our implementation), salient objects appearing
strongly in only a few maps risk being masked by