Pre-Orders for Fault Tolerance
"text/css">
Pre-Orders for Fault Tolerance
Padmanabhan Krishnan
1
Technical Report COSC 06 92
A reformatted version of the report is to appear as a paper at the Sixteenth Australian
Computer Science Conference
Citations should refer to the proceedings.
Abstract
We describe a process algebraic approach to the semantics of robust systems. We extend
a subset of CCS 14 with multi-set pre xes to model systems with replicated synchronous
majority voting. Based on an operational semantics, we de ne pre-orders which introduces a
hierarchy of faulty processes and fault-tolerant processes. We then show how a similar ordering
on modal-
formulae 18 can characterise the fault pre-orders.
1
Department of Computer Science
University of Canterbury, Private Bag 4800
Christchurch, New Zealand
Email: paddy@cosc.canterbury.ac.nz
1 Introduction
The main characteristic of fault-tolerant or robust systems is the ability to cope with errors
in software and hardware. Robust systems are usually able to operate correctly in non-ideal
environments. For example, the failure of a single processor will not cripple a robust system,
the task assigned to the failed processor will be completed. Robust systems are usually reactive
systems 17 i.e., respond to changes in the environment with the distinguishing feature that
the environment can deviate from its `expected' behaviour without a ecting the correctness of the
system.
The aim of this paper is to describe a framework in which faulty and fault-tolerant systems can
be studied. Cristian 5 describes the various issues in fault-tolerant computation. The rst issue is
the de nition of fault or failure classi cation and in communicating systems it includes: omission
fault or failure to send a message, addition fault or generation of an spurious message, value fault
or sending the wrong value, state-transition fault or responding incorrectly to the environment and
crash
failure or the inability to interact with its environment.
Associated with a system is a failure model, which is a speci cation indicating the corrective
action on the occurrence of a fault. The failure model chosen for a particular system depends on
its functionality. For example, in a student lab environment, shutting down the lab due to an
erroneous le-server would be acceptable while a heart-lung machine should not be shut down if a
sensor is faulty. Also associated with a fault model is containment, i.e., how to limit the e ect of
a fault. For example, if backups are available one may shut down a server and activate a backup.
If this is done transparently the system as a whole continues to work. Furthermore, these models
make assumptions about the ability to identify a fault also called fault detection.
As there are a large number of techniques to detect faults and to recover from them, it is
di cult to address all issues in one paper. In this paper we consider the failure classi cations of
omission, value and addition in replicated systems with synchronous majority voting 6 . As these
systems operate in parallel, we develop a theory for replicated systems in the context of theories
of concurrent systems.
Process calculi such as ACP 3 , CCS 14 and CSP 8 are important formalisms in the descrip-
tion of concurrent systems. Koutny et. al. 9 have developed a trace semantics with extractor
functions for replicated CSP processes. However they do not consider explicit fault modelling.
Krishnan and McKenzie 10 present a calculus for replicated systems and its bisimulation seman-
tics. They also introduced modelling fault occurrences using action re nements as de ned in 2 .
However the de nitions in 10 were too restrictive. The main reason was the presence of an explicit
replication operator, due to which the bisimulation relation was not a congruence.
In this paper we de ne a calculus based on CCS for replicated systems with a notion of fault
injection. Semantic characterisations of the failure classi cation using pre-orders is de ned. The
pre-order is relativised with respect to the correct behaviour and if P is less than Q in the pre-
order, Q is no more faulty with respect to the correctness criteria than P. We also develop a
logical characterisation of the pre-orders using the modal-
calculus 18 . The main achievement of
this paper is the development of pre-orders which are relevant to robust systems and their modal
characterisation.
2 Replicated Systems
As in CCS 14 we assume a set of atomic actions with typical elements represented by
1
,
2
etc. As we are considering replicated systems with synchronous majority voting, we use multi-sets
to indicate the votes for each action. This is justi ed as in synchronous voting all votes arrive
at the `same' logical time. That is, given a system which is composed of replicated sub-systems,
we can assume that the system cannot exhibit an action until each sub-system has voted for the
action of its choice.
De
nition:
1
Let
N
be the set of natural numbers. A multiset
over
is a function from to
N
.
We represent the multiplicity of elements by superscripts but when the multiplicity is equal
to 1 the superscript is not written and actions receiving no vote are not explicitly written. For
example, m =
f
3
1
;
2
2
;
3
g
represents a multiset m where m
1
= 3, m
2
= 2, m
3
= 1 and
for all other actions
m
= 0.
1
As we are interested in majority voting, we de ne an extraction function which identi es the
actions that have received maximum number of votes.
De
nition:
2
De ne a function
V
oted
Action
=
f
8
1
2
,
1
g
We are interested in good" environments, i.e., environments where all votes are identical. In
such an environment, all the voting sub-systems reach a consensus on the action to be exhibited.
De
nition:
3
A multiset
is said to be perfect i
9
1
2
such that
1
0 and
8
2
2
-
f
1
g
2
= 0.
The syntax for the set of processes is de ned as follows.
P ::=
nil
P P + P P
j
P
nil
represents the terminated process and can exhibit no action. If
is `non-trivial', i.e., at least
one action has a non-zero vote,
P represents the process which has accumulated votes indicated
by
and will exhibit a voted action and then behave as P. For completeness sake one needs to
consider the behaviour of empty multisets. While we do not expect a correct speci cation to involve
empty pre xes, an omission fault could erase the pre x. For example, the process
P with an
-omission fault will behave as P. As shall be seen later
P under an
-omission fault will be
transformed into
;
P. The behaviour of
;
P is identical to the behaviour of P. The process P
+ Q represents non-deterministic choice between P and Q, while the process P
j
Q represents
parallel combination of P and Q. In this paper we do not consider synchronisation, and as in CCS
we permit only one process to evolve in one step.
This syntax is simpler than the one described in the Krishnan and McKenzie 10 model where
an explicit replication combinator with the usual action pre x was used. Instead of having an
explicit replication combinator we permit a multiset pre x instead of the usual action pre x. This
simpli es the operational semantics considerably. However, the expressive power of the simpli ed
language is not reduced. It is possible to recode a process involving the replication combinator
into one using the multi-set pre x. For example,
1
P
q
2
Q
q
3
R can be transformed into
f
1
;
2
;
3
g
S, where S is the translation of P
q
Q
q
R.
In the translation, we do not `discard' a process if its vote was not part of the majority chosen
and thus is a relaxation of the model with explicit replication 10 . In 10 for example, the process
1
P
q
2
Q
q
1
R
would exhibit
1
and then evolve to
P
q
R
, i.e., the process
Q
is discarded. The above process translated to the current setting would exhibit
1
but Q will
continue to contribute to subsequent behaviour. This implies that in the new model, processes
can `withstand' more faults as the replication factor is not reduced. This is justi ed as we do not
consider Byzantine faults 11 and hence there is no need to eliminate a process from the operating
environment.
However, given a process in the original language, it is possible to construct a process with
multi-set pre xes whose behaviour is identical to the given process. The process
1
P
q
2
Q
q
1
R
would be coded as
f
2
1
;
2
g
S, where S is the appropriate coding of P
q
R.
An operational semantics based on labelled transition systems 16 for the above language is
de ned in gure 1.
Just as we de ned a perfect multiset, we de ne a perfect process as a process which receives a
perfect vote for all its actions.
De
nition:
4
A process P said to be perfect if in all sub-terms of the form
P
1
,
is perfect.
Based on the operational or one step derivation relation
,
!
, we use the following abbrevia-
tions.
De
nition:
5
a P
6
,
!
i
:9
P
0
, such that P
,
!
P
0
b P
,
!
i
9
P
0
such that P
,
!
P
0
P
6
,
!
indicates that P cannot make a
move, while P
,
!
indicates that P can exhibit
.
Given a replicated process we now describe our model of fault introduction. Fault introduction
is de ned as follows
De
nition:
6
De ne P
y
as follows:
nil
y
=
nil
,
P
y
=
P, P + Q
y
= P
y
+ Q
y
2
Pre x
1
2
V
oted
Action
P
,
!
P
Pre x
2
8
:
= 0
P
,
!
P
0
P
,
!
P
0
Non-Determinism
P
,
!
P
0
P + Q
,
!
P
0
Q + P
,
!
P
0
Parallel
P
,
!
P
0
P
j
Q
,
!
P
0
j
Q
Q
j
P
,
!
Q
j
P
0
Figure 1: Operational Semantics
Intuitively, if a process has terminated, no fault can a ect it, while if a process can perform an
action, an occurrence of a faul