Exception handling in workflow management systems - Software ...
s and process centered software engineering environments. A PSS
controls the flow of work between programs and users in networked environments based on a "metaprogram" (the process). The
resulting applications are characterized by a high degree of distribution and a high degree of heterogeneity, properties that make fault
tolerance both highly desirable and difficult to achieve. In this paper, we present a solution for implementing more reliable processes
by using exception handling, as it is used in programming languages, and atomicity, as it is known from the transaction concept in
database management systems. The paper describes the mechanism incorporating both transactions and exceptions and presents a
validation technique for assessing the correctness of process specifications.
Index Terms: Dependability, exception handling, workflow management, process support systems.
1 INTRODUCTION
For the purposes of this paper, distributed processes can be
characterized as sequences of program invocations and
data exchanges between distributed and heterogeneous
stand-alone systems. Business processes are perhaps the
best known example of such processes. The basic tool for
developing and executing business processes is a Workflow
Management System (WFMS) [7], [22], [31], [32], [33], [44],
[50]. A process support system (PSS) generalizes this idea to
any type of process [3], [4]. A PSS generally consists of
a buildtime and a runtime environment, where the buildtime
environment provides a modeling language and appropriate
design tools allowing us to specify processes. The runtime
environment offers the necessary services for process
automation and monitoring. In this sense, a process support
system can be seen as a tool for "programming in the large"
over heterogeneous and distributed environments [6].
Due to the characteristics of the environment where they
execute and their long duration (days, maybe weeks),
distributed processes are susceptible to a wide variety of
failures. For instance, communication problems, computer
outages, or program failures are some of the many technical
sources of errors during process execution. Erroneous
process specifications, unexpected changes in the system
configuration, or absent employees are examples of the
many possible user-originated failures. Thus, in order to build
realistic systems, it is crucial to deploy mechanisms that
allow the system to continue processing even if failures
occur.
In their general classification of system dependability
aspects, Laprie et al. [36] distinguish two ways of coping
with failures in dependable systems, fault prevention and
fault tolerance. While the former is concerned with "how to
prevent fault occurrence or introduction," the latter deals
with "how to provide a service complying with the
specification in spite of faults" [36]. Fault prevention is, to
a large degree, a design issue: it requires the existence of
suitable design methodologies and construction rules which
help to avoid introducing failures in a system. In process
support systems, validation facilities such as simulation
tools are used to support fault prevention [1], [39]. A
complete avoidance of failures, however, is not possible.
Hence, there is a clear need for inherently fault-tolerant
processes.
Supporting the design of fault-tolerant processes requires:
1. Enhancing the modeling language by adding con-
structs for error detection and error handling and
2. Modifying the runtime system in order to implement
the new semantics.
In this sense, a process support system is not different from
a programming environment and, consequently, concepts
for application fault tolerance, as they have been developed
in many fields, could be applied. Two techniques are
especially relevant for process support environments:
atomicity and exception handling.
The concept of atomicity, as it is used in databases,
provides a well-known abstraction for failure handling. It is
based on backward recovery: In the case of a failure, an
application or parts of it are "rolled back" to a previous
consistent state. From this state, the computation can
continue by retrying the previously failed instructions or
by following alternative execution paths. The advantage of
the atomicity abstraction is that the programmer (or process
designer) does not need to specify all necessary steps for
undoing work. Instead, this is left to the runtime system,
which performs recovery based on logged information.
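The backward recovery described above can be illustrated with a minimal sketch. It is not the mechanism of any particular system; the names UndoLog and run_atomically are hypothetical, and the "database" is assumed to be a plain in-memory dictionary. The point is only that the step being executed contains no undo code: the runtime logs before-images and restores them on failure.

```python
class UndoLog:
    """Hypothetical runtime helper that logs before-images of every write."""

    def __init__(self, state):
        self.state = state      # shared mutable state (a dict in this sketch)
        self.entries = []       # (key, existed, old_value), in write order

    def write(self, key, value):
        # Record the before-image first, then apply the update.
        self.entries.append((key, key in self.state, self.state.get(key)))
        self.state[key] = value

    def rollback(self):
        # Undo writes in reverse order to restore the previous consistent state.
        while self.entries:
            key, existed, old_value = self.entries.pop()
            if existed:
                self.state[key] = old_value
            else:
                self.state.pop(key, None)


def run_atomically(state, step):
    """Run step(log); if it fails, roll back its writes and re-raise."""
    log = UndoLog(state)
    try:
        step(log)
    except Exception:
        log.rollback()
        raise


def failing_step(log):
    # The step itself specifies no undo logic.
    log.write("seats", 9)
    log.write("booking", "pending")
    raise RuntimeError("simulated failure after two writes")


def succeeding_step(log):
    log.write("seats", 8)
```

After a failed step, the caller sees the state exactly as it was before the step started and is free to retry or to follow an alternative path.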
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 26, NO. 10, OCTOBER 2000
C. Hagen is with Credit Suisse, CIXT, CH-8070 Zürich, Switzerland.
E-mail: claus.hagen@credit-suisse.ch.
G. Alonso is with the Department of Computer Science, Swiss Federal
Institute of Technology (ETHZ), ETH Zentrum, CH-8092 Zürich,
Switzerland. E-mail: alonso@inf.ethz.ch.
Manuscript received 2 Mar. 1999; revised 10 Dec. 1999; accepted 28 Jan.
2000.
Recommended for acceptance by A. Romanovsky.
For information on obtaining reprints of this article, please send e-mail to:
tse@computer.org, and reference IEEECS Log Number 111487.
0098-5589/00/$10.00 © 2000 IEEE
Paradoxically, adding fault tolerance mechanisms to a
system can reduce its fault tolerance due to the resulting
system complexity [8]. To avoid this, programming lan-
guage designers have developed language elements that
allow us to separate the failure handling aspects from the
"normal" flow of control [25], [40]. The same ideas could be
used with processes since, when properly used, they can
provide forward recovery (execution of alternative and/or
compensating activities) and, thus, complement the back-
ward recovery provided by the atomicity concept. The
problem is, however, how to combine both concepts and
adapt them to the context of distributed processes.
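As a rough illustration of separating failure handling from the normal flow of control, the following sketch keeps the normal flow free of any recovery code and attaches handlers in a separate table. All names (ActivityFailed, run, HANDLERS, and the activities) are hypothetical. The handler performs forward recovery: nothing is rolled back, and an alternative activity is executed instead.

```python
class ActivityFailed(Exception):
    """Hypothetical exception signalled by a failed activity."""


def primary_activity():
    raise ActivityFailed("primary activity could not complete")


def alternative_activity(exc):
    # Forward recovery: continue with an alternative instead of undoing work.
    return "completed by alternative activity"


def normal_flow():
    # The normal flow contains no failure-handling code.
    return primary_activity()


# Failure handling is specified separately from the normal flow.
HANDLERS = {ActivityFailed: alternative_activity}


def run(flow, handlers):
    """Run the flow; on an exception, dispatch to the first matching handler."""
    try:
        return flow()
    except Exception as exc:
        for exc_type, handler in handlers.items():
            if isinstance(exc, exc_type):
                return handler(exc)
        raise  # no handler registered: propagate to the caller
```

Calling run(normal_flow, HANDLERS) returns the result of the alternative activity, while an exception without a registered handler propagates unchanged.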
This paper describes how this problem has been
addressed in the OPERA process support system [3]. In
particular, we deal with exception handling of the type
encountered in normal programming languages like C,
C++, or Java and how exception handling can be combined
with transactional atomicity. In this regard, the contribution
is twofold. In the first place, we put forward the idea that
process support systems are programming environments
and should be treated as such. This view is not common
among the process support community and, in fact, many
process modeling languages lack even essential abstractions
(such as modularization and nesting) necessary for com-
prehensively modeling large processes. In the second place,
we show how to incorporate exception handling and
transactional atomicity into process support systems. The
result is a flexible and powerful solution that preserves
consistency even if processes access multiple databases.
Furthermore, to help process designers to identify potential
consistency problems, an integral part of our solution is a
validation service which checks the well-formedness of
process specifications and gives hints to designers on how
to improve their design.
The paper is organized as follows: In Section 2, we
present an example and discuss the problem of integrating
fault tolerance into process support systems. Section 3 gives
a short overview of OPERA, the system in which these ideas
have been implemented. Section 4 presents the concept of
spheres of atomicity and Section 5 describes our approach for
incorporating exception handling functionality. Section 6
illustrates the solutions using an example. Section 8
discusses the problem of well-formed processes, develops
a correctness criterion, and presents an algorithm for
process validation. Finally, Section 9 concludes the paper.
2 MOTIVATION AND EXAMPLE
As a running example for the rest of the paper, consider a
process incorporating the reservation of various flights,
rental cars, and accommodations, as well as the final
sending of documents and invoices to the customer and
storage of the result in the travel agency's internal database
(Fig. 1). The programs and services incorporated in the
process are executed by different autonomous systems.
Here, we will assume the flight reservation is done through
a CORBA gateway to a booking system. We will also
assume that sending the documents and invoice, as well as
reserving a hotel, are manual tasks to be handled by the
travel agency's personnel. The record keeping in the local
database will take place via a transaction processing
monitor [10] and the reservation of a rental car will be done
through a legacy system. During execution, this process can
encounter a wide range of problems: Applications might be
down, machines may have crashed, programs may return
the wrong results, the process logic might be incorrect,
messages may disappear, the server where the process runs
may fail, and so forth. One way to deal with some of those
failures is to replicate the process using back-up techniques
so as to be able to resume execution even if the server where
the process runs fails [30]. In some other cases, researchers
have proposed dynamic modification of the process
structure so as to be able to circumvent failures [35]. In this
paper, we are interested in those failures that are normally
considered exceptions in traditional programming languages
like C or C++.
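To make the running example concrete, the travel process can be sketched as a sequence of tasks, each bound to the system that executes it. This is only an illustration under stated assumptions: the task order is assumed (Fig. 1 is not reproduced here), the names Task, TaskFailure, and execute are hypothetical, and the real invocations are replaced by stubs, one of which simulates an outage of the legacy system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    name: str
    system: str                  # executing system, as named in the text
    run: Callable[[], None]      # stub standing in for the real invocation


class TaskFailure(Exception):
    """Raised when a task fails; records which tasks had already completed."""

    def __init__(self, task: Task, completed: List[str]):
        super().__init__(f"{task.name} failed on {task.system}")
        self.task = task
        self.completed = completed


def execute(process: List[Task]) -> List[str]:
    """Run tasks in order; surface the first failure as a TaskFailure."""
    completed: List[str] = []
    for task in process:
        try:
            task.run()
        except Exception as exc:
            raise TaskFailure(task, list(completed)) from exc
        completed.append(task.name)
    return completed


def ok() -> None:
    pass


def legacy_down() -> None:
    raise ConnectionError("simulated outage of the legacy system")


# Assumed task order; the stubs do no real work.
trip = [
    Task("reserve flight", "CORBA gateway", ok),
    Task("reserve hotel", "manual", ok),
    Task("reserve car", "legacy system", legacy_down),
    Task("send documents and invoice", "manual", ok),
    Task("store result", "TP monitor", ok),
]
```

When the car reservation fails, the raised TaskFailure lists the flight and hotel reservations as already completed, that is, as work whose external effects still exist at the moment the exception is signalled.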
Exception handling, however, is not enough in the
context of distributed processes. Processes interact with
external applications and have side effects. Dealing with
failures during process execution implies being able to
account for those side effects. More concretely, aborting a
process in the case of a failure leaves it in an undefined state
with only parts of its goals met and external resources (such
as accessed databases) possibly inconsistent and with large
amounts of work already performed. This is particularly
undesirable for processes which are very long, consist of
expensive tasks, or access multiple external data reposi-
tories. To allow the continuation of a process in sp