Byzantine Fault Isolation in the Farsite Distributed File System

ystem of interacting Byzantine-fault-tolerant
replicated-state-machine groups, as system scale increases, so
does the probability that a group will manifest a fault. If no
steps are taken to prevent faults from spreading among groups,
a single fault can result in total system failure. To address this
problem, we introduce Byzantine Fault Isolation (BFI), a
technique that enables a distributed system to operate with
application-defined partial correctness when some of its
constituent groups are faulty. We quantify BFIs benefit and
describe its use in Farsite, a peer-to-peer file system designed to
scale to 100,000 machines.
1

INTRODUCTION
Farsite [2] is a distributed peer-to-peer file system
that runs on a network of desktop workstations and
provides centralized file-system service. File storage in
Farsite is secure, even though the system runs on
unsecured machines. This security is established, in part,
through the use of Byzantine Fault Tolerance (BFT), a
well-known mechanism for building trusted services
from untrusted components [14, 6]. BFT enables a
service to continue functioning correctly as long as fewer
than a third of the machines it is running on are faulty.
Farsite is designed to be scalable. As more and more
workstations make use of the file-system service, the
resources of these workstations become available for use
by the system. However, the BFT technique cannot
exploit these additional resources to achieve the greater
throughput demanded by an increasing user base: Adding
machines to a BFT group decreases throughput, rather
than increasing it.
To achieve an increase in throughput with scale,
Farsite partitions its workload among multiple interacting
BFT groups. Unfortunately, as the count of BFT groups
increases, so too does the probability that some group
will contain enough faulty machines that the group will
be unable to suppress the fault. If the system design does
not account for the failure of one or more BFT groups, a
single group failure can cause the entire system to fail.
The alternative to total failure is degraded operation,
wherein individual group failures cause the system to
operate in a way that is still partially correct, rather than
completely unspecified. However, partial correctness is
not something that can be defined in an application-
independent fashion [11]. It is thus not possible to build a
generic substrate that enables an arbitrary service to
degrade gracefully in a meaningful and useful manner.
We have therefore developed a methodology for
designing a distributed system of BFT groups, wherein a
faulty group is prevented from corrupting the entire
system. We call our method Byzantine Fault Isolation
(BFI). BFI makes use of formal specification [15] to
constrain the semantic behavior of a faulty system. The
technique involves selectively weakening the system
semantics and concomitantly strengthening the system
design. Formal specification helps determine when these
two processes have satisfactorily converged.
The next section surveys previous approaches to
resisting Byzantine faults. Section 3 describes the Farsite
file system, the context for our work. Section 4 quantifies
the value of isolating Byzantine faults, and Section 5
describes our technique. Section 6 shows an example of
BFI in Farsite, and Section 7 summarizes.
2

PREVIOUS WORK
In 1980, Pease et al. [23] introduced the problem of
reaching agreement among a group of correctly
functioning processors in the presence of arbitrarily
faulty processors; they proved that a minimum of 3<i>t + 1
processors are needed to tolerate t faults. Two years later,
they christened this the Byzantine Generals Problem
[14], as it has been known ever since. A large body of
research has addressed the problem of Byzantine
agreement, the first decade of which is well surveyed by
Barborak and Malek [4].
In the mid-to-late 1990s, several researchers
combined Byzantine-fault-tolerant protocols with state-
machine replication [26] to produce toolkits such as
Rampart [24], SecureRing [12], and BFT [6]. These
toolkits provide a replication substrate for services
written as deterministic state machines. The substrate
guarantees that the service will operate correctly as long
as fewer than a third of its replicas are faulty. An
unfortunate property of these toolkits is that their
throughput scales negatively: As the group size grows,
the system throughput shrinks, which is the exact
opposite of the behavior desired for scalable systems.
Throughput scaling is non-positive because a non-
decreasing fraction of replicas redundantly perform each
computation. Throughput scaling is made negative by the
message load of each processor, which is linear in group
size. Some recent research has addressed the latter issue:
Lewis and Saia [16] have developed a protocol that
probabilistically reaches agreement if fewer than an
eighth of the replicas are faulty. The message workload of each processor is logarithmic in group size, so
throughput scaling is less dramatically negative than for
BFT. Abd-El-Malek et al. [1] have built a replicated state
machine based on the Query/Update (Q/U) protocol,
which requires 5<i>t + 1 processors to tolerate t Byzantine
faults. In theory, this protocol has zero throughput scaling
with system size; however, their implementation exhibits
negative throughput scaling, albeit at a lower rate than
BFT.
The above systems all exhibit two properties: (1) non-
positive throughput scaling and (2) all-or-nothing failure
semantics, meaning that failures beyond the tolerated
threshold can cause the entire system to fail.
In the absence of a Byzantine-fault-tolerant substrate
that provides positive throughput scaling, researchers
have built systems that partition their workload among
multiple machines. However, as the system size grows,
so does the expected number of faulty machines, which
in turn given all-or-nothing failure semantics leads to
an increasing likelihood of total system failure. We
observe that this problem could be assuaged if there were
some means to limit the spread of Byzantine faults.
Three avenues of research are related to this problem:
First, several researchers have isolated Byzantine faults
in distributed problems of academic interest, such as
dining philosophers [21], vertex coloring [21], and edge
coloring [25, 18]. Their solutions employ self-stabilizing
protocols to guarantee that correct results are eventually
obtained by all nodes that are beyond a specified distance
(the containment radius) from faulty nodes. The formal
notion of fault containment for self-stabilizing systems
was introduced by Ghosh et al. [10], who applied it only
to transient faults. Such transient-fault containment was
achieved by Demirbas et al. [8] for the problem of
tracking in sensor networks. None of this research offers
a broad approach to containing Byzantine faults.
Second, a number of researchers have investigated
ways to limit Byzantine corruption when performing
broadcast [3], multicast [22], or gossip [17, 20]. These
closely related problems have no computational aspect;
they merely propagate data. Furthermore, they have the
property that correct operation implicitly replicates all of
the data to all machines. The resulting redundancy
enables machines to vote on the datas correctness, as in
the original Byzantine agreement problem.
Third, some researchers have tackled specialized
subclasses of the general problem. Merideth [19]
proposes a proactive fault-containment system that relies
on fault detection, a well-known specialization of fault-
tolerance problems [7]. Krings and McQueen [13]
employ standard Byzantine-fault-tolerant protocols only
for carefully defined critical functionalities. The TTP/C
protocol [5] isolates only a constrained subset of
Byzantine faults, namely reception failures and consistent
transmission failures.
Thus, every known technique for building systems
that resist Byzantine faults has at least one of the
following weaknesses:
Its throughput does not increase with scale.
It addresses only a narrow academic problem.
It does not support computation.
It does not address general Byzantine faults.
3

CONTEXT FARSITE FILE SYSTEM
We developed BFI in the context of a scalable, peer-
to-peer file system called Farsite [2]. Farsite logically
functions as a centralized file server, but it is physically
distributed among the desktop workstations of a
university or corporation, which may have over 10
5

machines. In this environment, independent Byzantine
faults are significantly more likely than they would be in
a physically secure server cluster.
Farsite employs different techniques for managing
file content and metadata. File content is encrypted,
replicated, and distributed among the machines in the
system, and a secure hash of each files content is
maintained with its metadata. File metadata is managed
by BFT groups of workstations; we call each group a
server. Every machine in a Farsite system fills three
roles: a client for its local user, a file host storing
encrypted content of data files, and a member of a BFT
group that acts as a server for metadata.
File metadata is dynamically partitioned among
servers, as follows: A Farsite system is initialized with a
single server, called the root, which initially manages
metadata for all files. When the metadata load on the root