Recoverability Preservation: A Measure of Last Resort

able border=0 cellpadding=7 cellspacing=0 width=100% bgcolor=ccccff> « back to results for ""
Below is a cache of http://www.csm.ornl.gov/~sheldon/public/prs-final.pdf. It's a snapshot of the page taken as our search engine crawled the Web.
The web site itself may have changed. You can check the current page or check for previous versions at the Internet Archive. Yahoo! is not affiliated with the authors of this page or responsible for its content.
Recoverability Preservation: A Measure of Last Resort
Recoverability Preservation: A Measure of Last Resort
Ali Mili, Frederick Sheldon, Fatma Mili, Jules Desharnais

College of Computing Science
New Jersey Inst. of Technology
Newark NJ 07102-1982, USA
mili@cis.njit.edu

U.S. DOE Oak Ridge National Lab
PO Box 2008, MS 6085, 1 Bethel Valley Rd
Oak Ridge TN 37831-6085
sheldonft@ornl.gov

School of Engineering and Computer Science
Oakland University
Rochester MI 48309-4401
mili@oakland.edu

D磂partement dInformatique et GL
Universit磂 Laval
Qu磂bec PQ G1K 7P4 Canada
Jules.Desharnais@ift.ulaval.ca
Abstract.
Traditionally, it is common to distinguish between three broad families of methods for dealing with the
presence and manifestation of faults in digital (hardware or software) systems: Fault Avoidance, Fault Removal and
Fault Tolerance. We focus on fault tolerance and submit that current techniques of fault tolerance would benet from
a better understanding of recoverability preservation, i.e. a systems ability to preserve recoverability even when/
if it does not preserve correctness. In this extended abstract, we briey introduce the concept of recoverability
preservation, discuss some preliminary characterizations of it, then explore possible applications thereof.
Keywords
Programming Calculi, Relational Mathematics, System Fault Tolerance, Fault, Error, Failure, Recoverability Preser-
vation, Recovery Routine.
1
Introducing Recoverability Preservation
For the purposes of this extended abstract, we will introduce recoverability preservation by means of a simple/ sim-
plistic example. We wish to draw the readers attention on the extent to which a function may deviate from its correct
behavior but still preserve recoverability. We consider the space S dened by an integer variable x, and we consider
the following simple program
P; L: F
where

is a label, and

(past) and

(future) are dened by their (expected) functions, as follows:

ォ! #"

Ι$%&!')(103254
If the computation starts with initial state
76
, then at label

we must have state (
768&9
). This is the only correct
state
at label

.
If the past function is incorrect, and instead of computing (

ォˊ$A&!
) it computes

BDˊ1A&E )(10GF
Proceedings Int'l Conf. Principles of Software Engineering, Buenos Aires, Argentina, Nov. 22-27, 2004. then the states that

B
produces are not correct, but they are still maskable, in the sense that application of the future
function (which takes the mod by 9) after

B
will cancel out the error produced by

B
(which mistakenly adds 18 to
the correct result).
If the past function is incorrect, and instead of computing (

ォˊ$A&!
) it computes



ˊ$A&
0G2
then the states that


computes are not even maskable, but they are recoverable, in the following sense: If we know
what is

%&

032ぃ
, we can derive


&9 ィ
. We say that


preserves recoverability with respect to the expected
past function

. It is possible to recover from errors caused by


by simply applying

&


Γ
to the current
(potentially erroneous) state.
If the past function is incorrect, and instead of computing (

ォˊ$A&!
) it computes


ě
&


then the states that


computes are not even recoverable, but they are partially recoverable, in the following sense:
If we know what is


━
, we may not know exactly what is





, but we know something about it.
For example, if


&│

0
, we know that


&


is either 1 or 4. We then say that


preserves partial
recoverability with respect to the expected past function


. We can envision a probabilistic recovery routine which
preserves the current state or adds 3 to it, and has a 0.5 probability of retrieving the correct state.
If the past function is incorrect, and instead of computing (

ォˊ$A&!
) it computes



&


then the states that


produces are not recoverable, in the following sense: knowing

ぃ
gives us no informa-
tion whatsoever on the value of

! Γ
.
A supercial, intuitive look at this example seems to indicate that a function


preserves recoverability with
respect to an ideal function

if the level sets of

!
(i.e. the equivalence classes of the domain of

!
modulo the
relation

"
$#



"
$#
%
) rene (in the sense of: dene a ner partition) the level sets of

; we will revisit this
characterization subsequently, in proposition 4. Note, interestingly, that this relation does not involve how

&
maps
inputs to outputs, as that is a correctness preservation consideration, not a recoverability preservation consideration.
What is important, from the standpoint of recoverability preservation, is not what values

&
assigns to each level set
(if that were wrong, the recovery routine can always correct it), but rather how


partitions its domain into level
sets (as that reects whether


maintains sufcient information to compute the correct nal result, which is the
essence of recoverability preservation). For all these cases except the last, it is possible to recover from errors, using
exclusively the current state, with perhaps less than 1.0 probability of successful recovery. Notice that in this sample
illustrative case, we have, for the sake of simplicity, neglected to factor in an important source of redundancy: the non-
determinacy of the specication that the program at hand must satisfy. In effect we have assumed that the specication
that the program must satisfy is the expected function of the program; had we chosen a less deterministic specication,
say
'
(such that


&(
Ι
renes
'
), we would have highlighted further possibilities for deviating from the expected past
function without jeopardizing recoverability preservation. This highlights our premise that recoverability preservation
is a weak property, in the sense that it is possible for a program to deviate signicantly from its intended function while
still preserving recoverability, hence without jeopardizing the possibility of recovery.
Figure 1 illustrates the hierarchy between the various properties that we have discussed: Figure (a) represents the partition of the domain of

by the equivalence relation

0)

. Figure (b) represents the partition of the domain of

by the equivalence relation

21


. This function preserves
recoverability, because if we know what partition we are in by

21


, we know what partition we are in by

)

. Figure (c) represents the partition of the domain of

by relation


1

3
. This function preserves partial recover-
ability, because if we know what partition we are in by

4
1

3
, we can infer a limited set of possible partitions by

5)

. Figure (d) represents the partition of the domain of

by relation


1


. This function does not preserve recover-
ability, because knowing what partition we are in by


1


gives us no indication on what partition by

we are
in; the two partitions are perfectly orthogonal. (d):


1


, where


does
not preserve recoverability
(c):


1


, where


preserves partial recoverability
(a):

)

, for original

(b):


1


, where


preserves recoverability








































Fig. 1.
Degrees of Recoverability 2
Motivating Recoverability Preservation
For the purposes of this summary discussion, we adopt the terminology and denitions of Laprie [16].
A system failure occurs when the delivered service deviates from fullling the system function, the latter
being what the system is intended for. An error is that part of the system state which is liable to lead to
subsequent failure; an error affecting the service is an indication that a failure occurs or has occurred. The
adjudged or hypothesized cause of an error is a fault.
Fault Tolerance refers to the set of measures and provisions that a system makes to avoid failure after faults have
caused errors. Steps to provide fault tolerance include:
Error Detection
, which consists in checking some correctness conditions along the execution of the system.
Damage Assessment
, which consists in assessing the extent of the damage sustained by the system state, so as to
determine an optimal / adequate recovery action.
Error Recovery
, which consists in retrieving a correct state either from the current state (forward error recovery)
or from a previously saved state (backward error recovery) and resuming the computation.
Fault Diagnosis and Removal
, which (unlike all three other steps) is carried out off-line,