Network Anomography

s versions at the Internet Archive. Yahoo! is not affiliated with the authors of this page or responsible for its content.
Network Anomography
Network Anomography
Yin Zhang , Zihui Ge , Albert Greenberg , and Matthew Roughan
§ Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712, USA AT&T Labs Research, Florham Park, NJ 07932, USA
§
School of Mathematical Science, University of Adelaide, SA 5005, Australia
Abstract
Anomaly detection is a rst and important step needed
to respond to unexpected problems and to assure high per-
formance and security in IP networks. We introduce a
framework and a powerful class of algorithms for net-
work anomography, the problem of inferring network-level
anomalies from widely available data aggregates.
The
framework contains novel algorithms, as well as a recently
published approach based on Principal Component Analy-
sis (PCA). Moreover, owing to its clear separation of infer-
ence and anomaly detection, the framework opens the door
to the creation of whole families of new algorithms. We
introduce several such algorithms here, based on ARIMA
modeling, the Fourier transform, Wavelets, and Princi-
pal Component Analysis. We introduce a new dynamic
anomography algorithm, which effectively tracks routing
and trafc change, so as to alert with high delity on intrin-
sic changes in network-level trafc, yet not on internal rout-
ing changes. An additional benet of dynamic anomogra-
phy is that it is robust to missing data, an important opera-
tional reality. To the best of our knowledge, this is the rst
anomography algorithm that can handle routing changes
and missing data. To evaluate these algorithms, we used
several months of trafc data collected from the Abilene
network and from a large Tier-1 ISP network. To compare
performance, we use the methodology put forward earlier
for the Abilene data set. The ndings are encouraging.
Among the new algorithms introduced here, we see: high
accuracy in detection (few false negatives and few false
positives), and high robustness (little performance degra-
dation in the presence of measurement noise, missing data
and routing changes).
1
Introduction
The rst step in xing a problem is knowing it exists. This
is no less true in networking than anywhere else we need
to know about a problem before we can repair it. Network-
ing vendors typically build alarms into network equipment
to facilitate fast, accurate detection and diagnosis of prob-
lems. However, in practice, there are many problems for
which explicit alarms are either absent (for new or uncom-
mon problems), or intrinsically hard to produce. In these
cases we must infer the problem from other data sources.
For instance, many types of network problems cause abnor-
mal patterns to appear in the network trafc. Such trafc
anomalies may be caused by problems ranging from secu-
rity threats such as Distributed Denial of Service (DDoS)
attacks and network worms, to unusual trafc events such
as ash crowds, to vendor implementation bugs, to network
miscongurations. We refer to the problem of inferring
anomalies from indirect measurement as network anomog-
raphy (combining anomalous with tomography, a gen-
eral approach to such inference problems).
Network tomography [31] bears some resemblance, in
that both involve the solution of a linear inverse prob-
lem. Examples include inference of individual link per-
formance characteristics from path performance character-
istics, and inference of trafc matrices from individual link
load measurements. For example, the trafc matrix estima-
tion problem arises because the obvious source of data for
direct measurement (ow-level data) can be hard to obtain
network-wide [5, 14, 22, 23, 28, 31, 34, 35]. On the other
hand, Simple Network Management Protocol (SNMP) data
on individual link loads is available almost ubiquitously.
Fortunately, the link loads and trafc matrices are simply
related by a linear equation
b
= Ax
(1)
The vector b contains the link measurements, and
A is the
routing matrix (dened formally below). We wish to in-
fer x, which contains the unknown trafc matrix elements
written as a vector. Tomographic inference techniques seek
to invert this relationship to nd x.
The anomography problem is different and somewhat
more complex. First, note that anomaly detection is per-
formed on a series of measurements over a period of time,
rather than from a single snapshot. In addition to changes in
the trafc, the solution must build in the ability to deal with
changes in routing. Second, note that the anomalies that
we wish to infer may have dramatically different properties
from a trafc matrix, and so different methods than those
used for network tomography may be called for. Indeed, we
nd that simple extensions to network tomography meth-
ods perform fair poorly here. Techniques that transform
the measurements prior to attempting to solve the inverse
problem are preferable.
As a simple example, imagine trying to detect an anoma-
lous trafc pattern caused by a ash crowd or DDoS attack
on a web site. This type of event will cause increases in
trafc ows headed towards a particular set of destinations.
It may be hard to rapidly identify which of the tens of thou-
sands of ingress links on a large network might be primar-
ily responsible, as large surges at a network egress link may
arise from small surges on several ingress links. We must infer the change in the pattern of trafc to the particular
site from the complete set of link data, considered together,
rather than as individual time series. This illustrates an im-
portant feature of anomography it extends anomaly de-
tection to network-level problems (automatically building
in correlation across the network) where link-level anomaly
detection might be inadequate or unreliable.
Many approaches to anomography are possible. In pio-
neering work, Lakhina et al. introduced a novel approach
based on Principal Component Analysis (PCA) [19]. Our
paper makes three major contributions to understanding
and solving anomography problems:
1. We present a simple and powerful framework that
encompasses a wide class of methods for network
anomography. We will see that the method of [19] is
a member of this class. The framework clearly decou-
ples the inference and anomaly detection steps, and so
immediately opens the door to the development of new
algorithms where one makes different choices for each
step. Accordingly, we introduce several such new al-
gorithms here, based on ARIMA modeling, the Fourier
transform, Wavelets, and Principal Component Analy-
sis. Moreover, the framework is not restricted to the
analysis of link trafc data, and in particular also applies
to the dual problem of inferring performance anomalies
from end-to-end performance measurements.
2. We introduce a new algorithm for dynamic anomogra-
phy, which identies network level trafc anomalies and
works in the presence of routing changes. That is, dy-
namic anomography tracks routing and trafc change
signaling trafc anomalies, but not internal network
routing changes (which may dramatically change inter-
nal trafc patterns but may leave the trafc matrix, de-
scribing how trafc enters and exits the network, stable).
In IP networks, routing changes occur as part of the nor-
mal self-healing behavior of the network, and so iso-
lating these from trafc anomalies is advantageous. An
additional benet of dynamic anomography is that it is
robust to missing link load measurements, an important
operational reality (see Section 4 for why missing data
may result in changes in the routing matrix). To the best
of our knowledge, this is the rst anomography algo-
rithm that can handle routing changes and missing data.
3. Using data sets collected from a large Tier-1 ISP and
from Internet2s Abilene network, we report on the re-
sults of an extensive and thorough evaluation of a set
of anomography methods. To understand the delity
of the methods and to compare different methods, we
apply the methodology introduced in [19]. Under this
methodology, we nd that in general the new temporal
anomography methods introduced here exhibit consis-
tently high delity. In particular, we nd that the most
successful method (of those examined) is a variation of
dynamic anomography, combining Box-Jenkins model-
ing (ARIMA) with 1
norm minimization. Further eval-
uation suggests that this algorithm can cope well with
measurement noise, and degrade gracefully in the pres-
ence of missing or corrupted data.
The paper is organized as follows. Section 2 summarizes
background and related work. In Section 3 we describe our
framework, and the anomography algorithms examined in
this paper, in the context of xed routing. In Section 4
we extend the Box-Jenkins anomography to the case where
routing may change over time. In Section 5 we describe our
evaluation methodology, and Section 6 presents the results.
Section 7 provides nal remarks.
2
Background
2.1
Network Tomography
Network tomography describes several problems: infer-
ring link performance of a network from end-to-end mea-
surements, or inferring Origin-Destin