High Availability
Computer Systems
Jim Gray
Digital Equipment Corporation
455 Market St., 7th Floor
San Francisco, CA 94105

Daniel P. Siewiorek
Department of Electrical Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
Abstract: The key concepts
and techniques used to build high availability computer systems are
(1) modularity, (2) fail-fast modules, (3) independent failure modes,
(4) redundancy, and (5) repair. These ideas apply to hardware,
to design, and to software. They also apply to tolerating operations
faults and environmental faults. This article explains these ideas
and assesses high-availability system trends.
Overview
It is paradoxical
that the larger a system is, the more critical its availability is,
and the more difficult it is to make it highly available. It is
possible to build small ultra-available modules, but building large systems involving thousands of modules
and millions of lines of code is still an art. These large systems
are a core technology of modern society, yet their availability is
still poorly understood.
This article sketches the techniques used to build
highly available computer systems. It points out that three decades
ago, hardware components were the major source of faults and outages.
Today, hardware faults are a minor source of system outages when compared
to operations, environment, and software faults. Techniques and
designs that tolerate this broader class of faults are in their infancy.
A Historical Perspective
Computers built in the late 1950's offered a twelve-hour mean time to
failure. A maintenance staff of a dozen full-time customer engineers
could repair the machine in about eight hours. This failure-repair
cycle provided 60% availability. The vacuum tube and relay components
of these computers were the major source of failures; they had lifetimes
of a few months. Therefore, the machines rarely operated for more than
a day without interruption [1].
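The 60% figure follows from the standard steady-state relation, availability = MTTF / (MTTF + MTTR). The sketch below reproduces the arithmetic; the function name `availability` is chosen here for illustration and does not come from the article.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up over one failure-repair cycle."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Late-1950's figures from the text: 12 hours to failure, 8 hours to repair.
print(f"{availability(12, 8):.0%}")  # prints 60%
```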
Many fault detection and fault masking techniques
used today were first used on these early computers. Diagnostics tested the machine. Self-checking computational techniques detected faults while
the computation progressed. The program occasionally saved (checkpointed)
its state on stable media. After a failure, the program
read the most recent checkpoint, and continued the computation from
that point. This checkpoint/restart technique allowed long-running computations to
be performed by machines that failed every few hours.
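The checkpoint/restart idea can be sketched in modern terms as follows. This is only an illustration under stated assumptions, not a description of how the early machines implemented it: the JSON file format, the function names (`load_checkpoint`, `save_checkpoint`, `run`), and the `fail_at` parameter used to simulate a crash are all hypothetical.

```python
import json
import os
import tempfile

def load_checkpoint(path):
    """Return the most recent saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_i": 0, "total": 0}

def save_checkpoint(path, state):
    """Write to a temporary file, then rename it over the old checkpoint,
    so a crash during the write never leaves a half-written checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)

def run(path, n, every=100, fail_at=None):
    """Sum 0..n-1, checkpointing every `every` steps.
    `fail_at` (hypothetical) simulates a machine failure at that step."""
    state = load_checkpoint(path)          # restart from the last checkpoint
    for i in range(state["next_i"], n):
        if i == fail_at:
            raise RuntimeError("simulated machine failure")
        state["total"] += i
        state["next_i"] = i + 1
        if state["next_i"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` once with `fail_at` set and then again without it completes the sum: the second call redoes only the steps taken after the last checkpoint, which is what lets a long computation finish on a machine that fails every few hours.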
Improvements in devices have raised computer system
availability. By 1980, typical well-run computer systems offered
99% availability [2]. This sounds good, but 99% availability
is 100 minutes of downtime per week. Such outages may be acceptable
for commercial back-office computer systems that process work in asynchronous
batches for later reporting. Mission-critical and online applications
cannot tolerate 100 minutes of downtime per week. They require high-availability systems: ones that deliver 99.999% availability.
This allows at most five minutes of service interruption per year.
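The downtime figures above are simple arithmetic on the unavailable fraction. A minimal sketch, assuming a 365-day year; the name `downtime_minutes` is illustrative only:

```python
MINUTES_PER_WEEK = 7 * 24 * 60      # 10,080
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 (non-leap year, an assumption)

def downtime_minutes(availability: float, period_minutes: float) -> float:
    """Expected downtime in a period: the unavailable fraction times its length."""
    return (1.0 - availability) * period_minutes

print(round(downtime_minutes(0.99, MINUTES_PER_WEEK)))         # 101
print(round(downtime_minutes(0.99999, MINUTES_PER_YEAR), 1))   # 5.3
```

The text rounds these to 100 minutes per week and five minutes per year.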
Process control, production control, and transaction
processing applications are the principal consumers of the new class
of high-availability systems. Telephone networks, airports, hospitals,
factories, and stock exchanges cannot afford to stop because of a computer
outage. In these applications, outages translate directly to reduced
productivity, damaged equipment, and sometimes lost lives.
Degrees of availability can be characterized by
orders of magnitude. Unmanaged computer systems on the Internet
typically fail every two weeks and average ten hours to recover.
These unmanaged computers give about 90% availability. Managed
conventional systems fail several times a year. Each failure
takes about two hours to repair. This translates to 99% availability [2].
Current fault-tolerant systems fail once every few years and
are repaired within a few hours [3]. This is 99.99% availability. High-availability
systems require fewer failures and faster repair. Their requirements
are one to three orders of magnitude more demanding than current fault-tolerant
technologies (see Table 1).
Table 1. Availability of typical system classes. Today's best
systems are in the high-availability range. The best of the general-purpose
systems are in the fault-tolerant range as of 1990.

| System Type            | Unavailability (min/year) | Availability | Availability Class |
|------------------------|---------------------------|--------------|--------------------|
| unmanaged              | 50,000                    | 90%          | 1                  |
| managed                | 5,000                     | 99%          | 2                  |
| well-managed           | 500                       | 99.9%        | 3                  |
| fault-tolerant         | 50                        | 99.99%       | 4                  |
| high-availability      | 5                         | 99.999%      | 5                  |
| very-high-availability | 0.5                       | 99.9999%     | 6                  |
| ultra-availability     | 0.05                      | 99.99999%    | 7                  |

As the nines begin to pile up in the availability measure, it is better
to think of availability in terms of denial-of-service measured in minutes
per year. For example, 99.999% availability is about 5 minutes of
service denial per year. Even this metric is a little cumbersome,
so the concept of availability class, or simply class, is defined by analogy to the hardness of diamonds or the class of
a cleanroom. Availability class is the number of leading nines
in the availability figure for a system or module. More formally,
if the system availability is A,
the system's availability class is $\lfloor \log_{10}(1/(1-A)) \rfloor$. The rightmost
column of Table 1 tabulates the availability classes of various system
types.
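The class definition can be checked against Table 1 with a few lines of code. This is a sketch, not part of the article: the function name `availability_class` is an assumption, and the availability is passed as a decimal string because binary floating point cannot represent values such as 0.99999 exactly, which can push the logarithm just below an integer and make the floor one too small.

```python
import math
from decimal import Decimal

def availability_class(availability: str) -> int:
    """Number of leading nines: floor(log10(1 / (1 - A)))."""
    a = Decimal(availability)
    if not (0 <= a < 1):
        raise ValueError("availability must satisfy 0 <= A < 1")
    # log10(1 / (1 - A)) equals -log10(1 - A); Decimal keeps powers of ten exact.
    return math.floor(-(1 - a).log10())

for a in ["0.9", "0.99", "0.999", "0.9999", "0.99999"]:
    print(a, availability_class(a))   # classes 1 through 5
```

An availability that is not a pure string of nines, such as 0.995, falls into the class of its leading nines (class 2).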
The telephone network is a good example of a high-availability system,
a class 5 system. Its design goal is at most two outage hours
in forty years. Unfortunately, over the last two years there have
been several major outages of the Uni