Experiences with building an intrusion-tolerant group communication system

tr> Yahoo! is not affiliated with the authors of this page or responsible for its content.
Experiences with building an intrusion-tolerant group communication system
Experiences with building an
intrusion-tolerant group
communication system
HariGovind V. Ramasamy
1,
, Prashant Pandey
2
, Michel Cukier
3
, and William
H. Sanders
4
1
IBM Zurich Research Laboratory, Switzerland. hvr@zurich.ibm.com
2
IBM Almaden Research Laboratory, USA. ppandey@us.ibm.com
3
University of Maryland, College Park, USA. mcukier@eng.umd.edu
4
University of Illinois, Urbana-Champaign, USA. whs@uiuc.edu
SUMMARY
There are many group communication systems (GCSs) that provide consistent group
membership and reliable, ordered multicast properties in the presence of crash faults.
However, relatively few GCS implementations are able to provide those properties in
the presence of malicious faults resulting from intrusions. We describe the systematic
transformation of a crash-tolerant GCS, namely C-Ensemble, into an intrusion-tolerant
GCS, the ITUA GCS. To do the transformation, we devised intrusion-tolerant versions of
key group communication protocols. We then inserted implementations of the protocols
into C-Ensemble and made signicant changes to the rest of the C-Ensemble protocol
stack to make the stack intrusion-tolerant. We quantify the cost of providing intrusion-
tolerant group communication in two ways. First, we quantify the implementation
eort by presenting a detailed analysis of the amount of change required to the
original C-Ensemble system. In doing so, we provide insight into the choice of building
an intrusion-tolerant GCS from scratch versus building one by leveraging a crash-
tolerant implementation. Second, we quantify the run-time performance cost of tolerating
intrusions by presenting results from an experimental evaluation of the main intrusion-
tolerant microprotocols. The results are analyzed to identify the parts that contribute
the most overhead while providing intrusion tolerance during both normal operation and
recovery from intrusions.
key words: Intrusion Tolerance, Fault Tolerance, Group Communication, Distributed Protocols,
Experimental Evaluation Correspondence to: IBM Zurich Research Laboratory, Rueschlikon, CH 8803, Switzerland. A preliminary version of this paper [
35
] appears in the Proceedings of the 2002 International Conference on
Dependable Systems and Networks (DSN-2002).
Contract/grant sponsor: DARPA; contract/grant number: F3060200C0172 2
H. V. RAMASAMY ET AL.
Introduction
The ever-growing economic and strategic importance of distributed computer systems has
made it important to ensure that these systems are intrusion-tolerant. An intrusion-tolerant
system is equipped with mechanisms that allow it to continue to operate correctly even when
signicant portions of it have been compromised and may be in the control of an intelligent
adversary.
Intrusion tolerance can be provided at various levels in a system: at the application level,
the middleware level, the OS level, or the hardware level. One promising approach is to
provide intrusion tolerance at the middleware level, providing intrusion-tolerant services (such
as remote method invocations) to distributed applications. Examples of intrusion-tolerant
middleware include MAFTIA [
51
], ITDOS [
46
], Enclaves [
20
], and ITUA [
26
,
33
], some of
which use replication to increase availability. A key concern in replicated systems is the need
to maintain the consistency of information.
Group communication systems (GCSs) are a well-known paradigm for providing state
consistency among the processes that constitute a group in the presence of faults. An early
crash-tolerant GCS was the ISIS system [
5
,
6
] from Cornell University. It led to numerous
other academic eorts for developing crash-tolerant GCSs, e.g., Horus [
8
] and Ensemble [
25
]
at Cornell University, Totem [
31
] at UCSB, Transis [
2
]
at the Hebrew University, Spread [
1
]
at Johns Hopkins University, and Newtop [
21
] at Newcastle University. Chockler et al. give an
excellent survey of crash-tolerant GCSs in [
14
].
The growing need to strengthen systems against malicious attacks motivates the building
of GCSs that can tolerate not just benign crash faults but also the malicious corruption of
some group members. However, compared to work on crash-tolerant GCSs, work in the area
of intrusion-tolerant GCSs has been gaining momentum only more recently, and relatively few
implementations exist. As a system builder entrusted with the task of building an intrusion-
tolerant GCS, one is naturally inclined to build it de novo, i.e., designing and implementing all
the group communication protocols from scratch. Most, if not all, existing implementations of
intrusion-tolerant GCSs were built using this approach. Examples of de novo implementations
include the Rampart toolkit [
41
], SecureRing [
28
], Correias GCS [
15
,
16
]), and the CoBFIT
GCS [
39
]. However, given the large number of existing crash-tolerant GCS implementations,
an alternative approach would be to reuse a crash-tolerant GCS and transform it into an
intrusion-tolerant GCS. In this paper, we attempt to answer questions relating to the feasibility,
incentives, limitations, and challenges of this alternative and less-used approach.
We describe the systematic transformation of a crash-tolerant GCS, namely C-Ensemble, the
C implementation of the well-known Ensemble GCS [
25
], into an intrusion-tolerant GCS, called
the ITUA GCS. We highlight the main issues involved in engineering such a transformation
and provide insights on how one can address them. The transformation was done at two
levels. At the design level, we devised key group communication protocols in the fault model
normally used for describing malicious attacks, namely the arbitrary or Byzantine [
29
] fault
model. The protocols provide reliable, ordered multicast and consistent group membership
properties. At the implementation level, we inserted implementations of the protocols into C-
Ensemble and made signicant changes to the rest of the C-Ensemble protocol stack to make
its implementation intrusion-tolerant. EXPERIENCES BUILDING INTRUSION-TOLERANT GROUP COMMUNICATION
3
Our work demonstrates the feasibility of building an intrusion-tolerant GCS by leveraging a
crash-tolerant implementation. In particular, our experience shows that a signicant portion of
the code base for a crash-tolerant GCS implementation can be reused for building an equivalent
intrusion-tolerant one. We present a detailed analysis of the amount of change required to the
original C-Ensemble system.
The transformation approach is not without limitations. The price to pay is the time and
eort spent in doing an exhaustive nd-and-patch campaign. In the presence of benign crash
faults, a group member can blindly trust what other members say. However, the group member
cannot aord to do so when some of the other members may be acting maliciously. The
nd-and-patch campaign involves identifying all parts of the crash-tolerant implementation
reecting mutual trust among group members and patching those parts to reect a fault
model in which group members may trust, but must always verify communication received
from others. The patching we did included adding sanity checks for all messages received
at a group member and adding cryptographic support to guarantee message integrity and
non-repudiation properties. The limitation of such a nd-and-patch campaign is that it can
only be best-eort; no guarantees can be given that all places have been patched. However,
that limitation is not very dierent from saying that even if an intrusion-tolerant GCS were
developed de novo (as the authors have done in their CoBFIT GCS work [
38
,
39
]), there would
be no way to guarantee that the implementation was free of vulnerabilities.
The performance of such a transformed GCS must be carefully analyzed if its applicability
to the building of intrusion-tolerant systems is to be understood. We describe the results of
a detailed experimental analysis of the cost, in terms of the reduced performance incurred
because of providing intrusion tolerance, both during normal operation and during recovery
from (single and multiple correlated) intrusions. To do the analysis, we instrumented the key
intrusion-tolerant protocol implementations to provide detailed information about the cost
incurred during fault-free operation and when tolerating both single and multiple correlated
intrusions. We also instrumented the equivalent crash-tolerant protocol implementations
to compare the performance overhead of tolerating intrusions with that of tolerating just
crashes. The results provide new insights into the cost of providing intrusion-tolerant group
communication, and suggest ways that this cost could be reduced in the future.
Related Work
We now describe related work, which includes de novo implementations of intrusion-tolerant
GCSs (e.g., Rampart [
41
], SecureRing [
28
], and WIT-GCS [
16
]), implementations of Byzantine-
fault-tolerant atomic multicast and state machine replication algorithms (e.g., the BFT
library [
11
] and SINTRA [
9
]), and the development of intrusion-tolerant systems from crash-
tolerant ones. The last category includes theoretical work such as [
4
,
27
,
30
] and pract