Self-adaptive middleware: Supporting business process
priorities and service level agreements
Yves Caseau
Bouygues Telecom, 92640 Boulogne-Billancourt, France
Received 8 December 2004; revised 19 April 2005; accepted 3 May 2005
Abstract
This paper deals with quality of service, defined at the application level, with respect to business constraints expressed in terms of business
processes. We present a set of adaptive methods and rules for routing messages in an integration infrastructure which yields a form of
autonomic behavior, namely the ability to dynamically optimize the flow of messages in order to comply with SLAs according to business
priorities. EAI (Enterprise Application Integration) infrastructures may be seen as component systems that exchange asynchronous messages
over an application bus, under the supervision of a processflow engine that orchestrates the messages. The QoS (Quality of Service) of the
global IT system is defined and monitored with SLAs (Service Level Agreements) that apply to each business process. The goal of this paper
is to propose routing strategies for message handling that maximize the ability of the EAI system to meet these requirements in a self-
adaptive and self-healing manner, i.e., its ability to cope with sudden variations of the event flow or temporary failures of a component
system. These results are a first contribution towards deployment of autonomic computing concepts into BPM (Business Process
Management) architectures. This approach marks a departure from previous approaches in which QoS constraints are pushed to the lower
level (e.g., the network). Although the techniques, such as adaptive queuing, are similar, managing QoS at the business process level yields
more flexibility and robustness.
© 2005 Elsevier Ltd. All rights reserved.
Keywords: Service level agreement; Self-optimizing; Autonomic; Middleware; Message passing; EAI; Business processes; BPM; Adaptive queuing
1. Introduction
The topic of this paper is the management and
optimization of business processes within an integration
middleware. These issues arise when IT is organized and
measured according to the key business processes, using
components that participate in the processes through their
connection to a global infrastructure (EAI) and the
orchestration of a processflow engine. The optimization of
IT is thus measured through SLAs which define the
expected performance of each process: the intention is to
meet these requirements and, when this becomes impossible
for some external reason (burst, failure), to ensure an ordered
degradation in performance so that higher-priority processes
are less impacted than lower-priority ones.
1474-0346/$ - see front matter © 2005 Elsevier Ltd. All rights reserved.
doi:10.1016/j.aei.2005.05.013
E-mail address: [email protected].
Our goal is to study the impact of routing and control
strategies, and to show that it is possible to demonstrate self-
adaptive and self-healing behavior. The topic of self-
adaptive and self-healing software and middleware is
studied by a very active research community [1], but little
work has been done on the management of business
processes and their quality of service. On the other hand,
adaptive queuing is the classical approach to implement
QoS management in a network [2], but there is a large gap
between QoS at the network level and at the business level.
The approach that we propose is to monitor and control
QoS, from a business perspective, at the integration
middleware level, and specifically at the processflow engine
level. We use a finite-event simulator to represent an EAI
integration infrastructure. This simulator is used to test
various message handling strategies, both at the component
level (how to sort the message queues) and at the EAI level
(how to configure the processflow engine and how to
manage business processes’ QoS with control rules). We
have found that sorting strategies produce large differences
Advanced Engineering Informatics 19 (2005) 199–211
www.elsevier.com/locate/aei
from the standpoint of both quality and robustness. We also
show that simple control rules are effective in increasing the
ability to cope with system failures or exceptional
congestions. The paper is organized as follows. Section 2
presents the field of application integration and business
process management. Section 3 describes the motivations for
this work, namely the ability to optimize quality of service
according to business priorities. The acronym OAI
(Optimization of Application Integration) is introduced as
the dual of EAI: EAI is about building integration
infrastructures; OAI is about operating them. Section 4
explains how processes and SLAs are defined, and illustrates
the concept of business priorities using a real-life example.
Section 5 presents the finite-event model and simulator that
we used to study various message routing strategies. Section
6 shows how 8 different routing strategies perform when
applied to regular and exceptional scenarios. Section 7
introduces the concept of control rules, which are triggered
by certain QoS conditions defined for systems and
processes. We show that these rules may be used to increase
robustness against exceptional events. The last section
compares this OAI approach with related works, and
stresses the difference with network QoS concepts.
2. Application integration and business processes
This paper describes a piece of applied research that
stems directly from an operational crisis which occurred at
Bouygues Telecom a few years ago. This section describes
the issues that face IT managers when they try to support
their customers’ business processes. Since the term ‘Quality
of Service’ will be used throughout the paper in a sense that
is different from the ‘network QoS’ literature, from which
most of the relevant techniques will be extracted in the next
sections, it is important to state what those issues are.
Most companies have been going through IT re-
engineering programs in the last ten years. For instance,
Bouygues Telecom built its earlier IT system around an
integrated off-the-shelf package that delivered billing,
CRM, provisioning and customer management services.
Five years ago, this approach ran into performance,
scalability and cost-efficiency problems, mostly because
the packaged software had been heavily customized. The
motivations for the re-engineering approach which resulted
from this difficulty constitute a textbook case for the
industry:
† Take ownership of IT through a corporate business
object and process model. Instead of letting multiple
software vendors tell you what a customer or a service is,
build a central model that is the cornerstone of inter-system
exchanges.
† Deploy a component-based architecture to solve
performance and scalability issues.
† Increase flexibility through the displacement of
business logic from components towards a processflow
engine. The explicit implementation of business processes is
a step forward both from a ‘strategic alignment’ perspective
(making sure that IT processing follows the business
strategy closely) and from an ‘agility’ perspective (changing
the processes becomes easier).
† Ensure better Quality of Service through a redundant,
secure infrastructure and through SLA monitoring (cf.
Section 4).
The resulting target architecture is illustrated by the
following figure. Each major IT function is implemented by
separate components, which are assembled using an EAI
infrastructure. The components exchange information
through message passing, under the supervision of a
processflow engine. Collaboration between components is
precisely described using a business process language, as a
sequence of activities separated by transitions [3]. Business
objects are distributed among these components, and the
common model is used to provide the exchange semantics.
The distribution of business objects means that synchroni-
zation flows must be orchestrated, with some form of
interleaving with the control flow from the business
processes. Thus, the role of the EAI infrastructure is
threefold: transporting messages, managing the execution of
business processes and ensuring the synchronization of
distributed copies of business objects.
The figure above is obviously a simplified abstraction of
a true IT system. For instance, it is not desirable to rely on a
single EAI integration infrastructure. There are multiple
technology, deployment and performance constraints which
lead to the concept of ‘fractal architecture’ [4]. A fractal
architecture is a recursive construction of the pattern
represented in Fig. 1 (a software bus with components) at
multiple scales. Such patterns may collaborate through a
gateway (between two EAI buses) or one may be a sub-
pattern of another if it describes one of its components. In
this paper we shall focus on a single EAI instantiation (one
bus and one processflow), but this does not mean that the
entire IT system of one company could fit on a single bus.
The concepts of Enterprise Architecture and Business
Process Management have become very popular, and hype
may be replacing common sense in many cases [4]. Our
practical experience with enterprise architecture and
integration architecture suggests three general warnings:
† Agility is a matter not of technology but of design.
Agility, which is not simply the ability to develop new
features easily but also to integrate, test and deploy them,
depends on the modularity of the functional architecture.
† Synchronization of distributed heterogeneous data
sources is a difficult, longstanding problem, even with a
BPM or EAI infrastructure. As business processes interact
through the shared business objects, coherence of execution
requires classical mechanisms such as signalization,
exclusion or serialization.
† Business Process Operations is the hardest part of
deploying the target infrastructure shown in Fig. 1. The next
sections will show that monitoring is more difficult precisely
because the whole system is more robust. Instead of
producing clear-cut failures, asynchronous systems have
the ability to continue working when they are overloaded,
resulting in later shutdowns (when message queues are full)
of larger proportions. Incidents must be resolved on ‘active
systems’, which requires a new culture among the
‘operations’ staff.

Fig. 1. A Target EAI/BPM Architecture: IT systems (CRM, Billing, Provisioning, DW, C. Management) plugged onto an EAI infrastructure (transport, directories), under business process and business object management; each transition is defined through business object update distribution.
The work reported in this paper is a contribution to this
field of ‘operational issues of EAI and BPM integration
infrastructure’.
3. Optimization of application integration
The purpose of ‘strategic alignment’ is to manage and
optimize IT according to key business processes. Defining
the QoS of a process is not a difficult task: it is a matter of
throughput, latency and availability. Throughput measures
the number of individual processes that can be performed in
a fixed amount of time such as a day. Latency is the time
required for end-to-end execution of a single process, which
is most often what the customer sees and reflects the
perceived quality of service. Availability is the percentage
of the time when the system is able to start a new process. A
service contract states desired goals in terms of QoS,
yielding a service level agreement (SLA). Complying with
an SLA, and even monitoring it, is a difficult task because
the engineering (sizing and capacity analysis) is done on the
individual systems and the infrastructure. We do indeed
design and build individual systems with a fairly precise
characterization of their availability, their throughput and
their latency. However, reverse-engineering from the SLA
at the process level is difficult for a number of reasons:
(1) The availability of systems coupled in an asynchronous
manner is difficult to evaluate (it is definitely not the
product of availabilities, as with a synchronous
approach). It depends on many parameters, and it is
possible to build a reliable asynchronous system with
components that are far less reliable (using buffers and
replication).
(2) The sizing issue is not a stationary flow (which is itself
not so trivial because of the multi-commodity aspect)
but a stochastic problem concerning the resolution of
bursts. Real-life situations generate bursts of messages
that are difficult to process.
(3) The management of the flow is complex from a
functional point of view: it is a matter of protocol
rather than ‘simply’ a matter of the sizes of pipes and
engines. Implicit retries or acknowledge protocols have
an impact on performance.
(4) Processes are interdependent because they operate on
shared business objects. This dependency (a classical
problem with databases) may be handled through
serialization, which induces serious constraints on
QoS, or through interleaving of processes and
additional ‘re-synchronization’ steps to correct the
possible discrepancies.
The pure stochastic asynchronous problem is made
harder in real life owing to a combination of synchronous
and asynchronous communications, and a combination of
planned flows and stochastic flows which creates interesting
planning issues.
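The first difficulty listed above, that the availability of an asynchronous chain is not the product of component availabilities, can be illustrated with a toy calculation (the numbers below are made up for illustration, not taken from this paper):

```python
# Toy numbers (not from the paper): two systems, each available 95% of
# the time, chained synchronously vs. asynchronously with a buffer.

a1, a2 = 0.95, 0.95

# Synchronous call chain: both systems must be up at the same moment.
sync_availability = a1 * a2          # 0.9025

# Asynchronous chain: the first system posts messages to a buffer, so it
# can keep accepting work during downstream outages; availability to
# *start* a process is governed by the first system alone (assuming the
# buffer is large enough to absorb the outage).
async_availability = a1              # 0.95

print(round(sync_availability, 4), async_availability)
```

This is why buffering and replication let a reliable asynchronous system be built from far less reliable components.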
Fig. 2. OAI: the operational challenge of EAI. Nine infrastructure systems (CRM, Cust. Base, PFS, Prov., Network, Help, Accounts, Fraud, OrderMgt) are interconnected by a message bus, and goals are set at the process level.
The bottom line is that the operation of a ‘process-
oriented’ IT global system is a highly challenging problem.
We have coined a new name, OAI (Optimization of
Application Integration), for this challenge, which is
illustrated by Fig. 2. This figure, which is an abstraction
from a real IT infrastructure example, represents 9
component systems that are interconnected with a message
bus, and on which 5 processes are defined. These processes
have different business priorities and very different target
latencies. The challenge of OAI is to run the 5 processes so
that each SLA is met, even when the common resources are
stressed by unforeseen events.
This OAI challenge is not merely of scientific interest.
An instance of such a problem occurred at Bouygues
Telecom during the launch of a new commercial service.
The provisioning of this new service was a simple enough
operation, although distributed onto multiple systems. The
service guarantee given to the end-customer sounded very
conservative, since the actual processing time that was
needed was a small fraction of the guaranteed provisioning
delay. However, this service level agreement proved
difficult to meet with regularity and was shown to be very
sensitive to external perturbation. One of the most
worrisome problems was that low-priority processes could
upset the delivery of this new service provisioning, which
was considered a strategic initiative.
The OAI problem statement is different from a network
QoS issue. The constraints expressed in terms of business
processes could be rewritten as elementary constraints about
transport and service execution. Assuming that the
application servers meet their expected levels of service
and that the network and the associated middleware provide
the necessary QoS for transport, the business process would
be delivered accordingly. This approach is too restrictive
and fails for two reasons (loss of intention and harmful
strengthening of the QoS constraints):
† Pushing the constraints to the lowest levels makes the
problem harder; it may become impossible to meet these
QoS constraints, whereas the original business require-
ments were feasible.
† The explicit representation of business QoS objectives at
a global level makes it easier to capture priorities and
hierarchies of goals.
Although the rest of the paper will borrow adaptive
queuing techniques from computer networking approaches
[2], handling quality of service at the integration middle-
ware level represents a significant improvement. It is more
flexible: business QoS constraints may be used as
declarative parameters (or policies). It is more robust: a
higher-level perspective supports larger-scale reaction. The
relationship between OAI and adaptive network manage-
ment will be discussed in the last section.
4. Business process management and SLAs
The approach to application integration which we follow
is closely related to BPM (Business Process Management):
business processes are the central control mechanism for the
complete set of applications [3,5]. Since business processes
are ‘at the heart’ of the business, it is not surprising that
business objectives in terms of quality of service are
expressed through IT processes. At first glance, a process is
a directed graph of tasks that must be executed on a subset
of components. Tasks that belong to a chain must be
executed sequentially, while sub-chains may be executed in
parallel. A task that has many successors corresponds to a
‘fork’, while a task with multiple predecessors corresponds
to a ‘join’. More precisely, a process may be labeled with
additional constraints that specify concurrency or sequence
constraints, as well as error recovery strategies. It is also
common to introduce the recursive notion of a sub-process,
where a task itself is defined as a process. XML description
of processes is becoming a standard through the BPEL
initiative [6]. The BPEL4WS description of the five
processes presented in Section 5.2 is a straightforward
exercise of encoding a graph into XML. However, the
notion of processes’ QoS is not mature enough in the
middleware community to be standardized. The closest step
is the WSLA (Web Service Level Agreement) proposal [7],
which focuses on throughput and latency at the local
‘service call’ level. We argue for a standard that takes the
global view at the process level and that also takes failures
into account.
QoS is defined and measured with respect to a Service
Level Agreement. SLAs set the expected completion time,
with associated probabilities (of completion and completion
within the limit) for a given incoming distribution. SLAs are
common and similar to what is found in network routing or
order processing. The intuition behind an SLA is shown in
Fig. 3. The value produced for the company by the
completion of an order (the execution of a process) is a
decreasing function of the execution time. An SLA contract
may be seen as a simple ‘rectangle’ approximation of this
function. This value is compared to the cost of processing an
order, which is a function of the target completion time and
the required probability of reaching this target. Due to the
stochastic nature of the incoming load, the cost rises sharply
when the probability approaches 1. Setting up the SLA
requires solving a stochastic profit-maximization problem. Such
SLAs are commonly used because they are easy to define and
thus easy to monitor (Fig. 3).

Fig. 3. Service Level Agreements for Processes: the value ($) of a completed process as a decreasing function of latency, with the rectangle ‘SLA approximation’ around the target completion time.
Thus, an SLA sets the expected completion time, with
associated probabilities (of completion and completion
within the limit) for a given incoming distribution. Here is a
common example of an SLA, using a nested structure:
† The activation process is available 99.9% of the time.
† 80% of all activations are processed successfully within
2 h.
† 98% of all activations are processed successfully within
24 h.
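A nested SLA of this kind is straightforward to check against observed completion times. The sketch below is illustrative only (the function name and data are hypothetical, not from the paper):

```python
# Hypothetical sketch: checking a batch of completed activations against
# a nested SLA such as "80% within 2 h, 98% within 24 h".

def sla_compliance(completion_times_h, targets):
    """completion_times_h: completion time in hours per process instance.
    targets: list of (limit_hours, required_fraction) pairs."""
    n = len(completion_times_h)
    report = {}
    for limit, required in targets:
        within = sum(1 for t in completion_times_h if t <= limit)
        fraction = within / n
        report[limit] = (fraction, fraction >= required)
    return report

times = [0.5, 1.2, 1.8, 3.0, 26.0]            # five sample activations
print(sla_compliance(times, [(2, 0.80), (24, 0.98)]))
# {2: (0.6, False), 24: (0.8, False)} -> both targets missed in this batch
```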
In the context of order processing or call centers, SLAs of
the kind ‘X% of calls are handled within Y seconds’ are
commonly used because this approximation is acceptable
and because there is a wealth of knowledge on how to size a
queue network that is hit by a Poisson or an Erlang flow so
that this type of SLA is met (for instance, see [8]).
An OAI problem may be described with a queue network
([9,10]) and is very closely related to the field of System
Performance Analysis. As noted above, however, there are
many aspects that make this problem difficult and beyond
the reach of an analytic approach. The design of the proper
message handling algorithms is closer to the field of online
algorithms [11,12]. The proper routing of a set of different
process flows with different SLAs is very similar to the
problem of call routing according to SLAs in a call center
that we studied in a previous work [13]. It is also similar to
the network routing problems involving different flows with
different QoS levels [14]. The intractability of these
problems (for instance, see [15] on NP-completeness of
QoS routing) led us naturally to explore the use of
simulation, which we shall now present.
5. Finite-event simulation
5.1. Infrastructure
The infrastructure that we simulate corresponds to the
problem shown in Section 2. It consists of 9 technical
systems, which are plugged onto a message bus. The bus’
role is to transport asynchronous messages, which are
controlled by a processflow engine.
A technical system is able to perform tasks on business
objects, which are parts of the business process. Each
system consists of a number of parallel threads that may
perform the same tasks. It is plugged onto the bus using an
‘adapter’ that plays different roles in real life: XML parsing,
translation from one object model to another, selection of
relevant messages (filtering), and so on. In our model, the
adapter is part of the ‘technical system’ and is made of a
queue of incoming messages and a sorting algorithm that
will feed the system when one of its threads is available. The
processflow engine is responsible for the execution of
business processes.
Another technical system plugged onto the bus is the
monitoring system, which gathers process completion
statistics and issues QoS updates, at both the process and
the system levels (see Section 6). The statistics include the
average completion time for a process or a task, the fraction
of process executions that meet the SLA target, the number
of processes that failed, etc. This monitoring system also
runs ‘control rules’, which are triggered by these monitoring
events and may cause flow controlling actions that will be
explained in Section 6, by changing the ‘status’ of one or
many systems.
5.2. Events
We use a simple finite-event simulation engine. For a
general introduction to simulation and its use in investi-
gating distributed systems, see [16], which also contains an
insightful discussion about the order in which messages are
processed. A finite-event simulation is defined by the type of
events created for each object by the simulator and how they
are interpreted. Fig. 4 shows the diagram of the events
produced in our experiment, where an arrow designates an
event that originates from a source and is handled by its
destination.
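The core of such a finite-event simulator is a clock and a priority queue of timestamped events. The following is a minimal sketch under our own naming conventions (it is not the paper's implementation):

```python
import heapq

# Minimal finite-event engine sketch (class and event names are
# illustrative): events are kept in a heap ordered by timestamp;
# handlers may schedule further events, advancing the simulation.

class Simulator:
    def __init__(self):
        self.clock = 0.0
        self.queue = []          # heap of (time, seq, name, payload)
        self.seq = 0             # tie-breaker for equal timestamps
        self.handlers = {}       # event name -> handler(sim, payload)

    def schedule(self, delay, name, payload=None):
        heapq.heappush(self.queue, (self.clock + delay, self.seq, name, payload))
        self.seq += 1

    def run(self, horizon):
        while self.queue and self.queue[0][0] <= horizon:
            self.clock, _, name, payload = heapq.heappop(self.queue)
            self.handlers[name](self, payload)

sim = Simulator()
log = []
sim.handlers["StartProcess"] = lambda s, p: (log.append((s.clock, "start")),
                                             s.schedule(2.0, "EndProcess", p))
sim.handlers["EndProcess"] = lambda s, p: log.append((s.clock, "end"))
sim.schedule(1.0, "StartProcess")
sim.run(horizon=10.0)
print(log)   # [(1.0, 'start'), (3.0, 'end')]
```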
There are two kinds of events that have no source and are
produced as an input to the simulation engine: the
StartProcess and the SystemFailure events. StartProcess
events are generated for each process, according to
parameters that are given for the process or for the
Fig. 4. Event Schema: StartProcess and Failure events enter the simulation; StartTask, ReceivedTask, EndTask and TimeOutAlert events flow between the processflow engine and the systems; EndProcess events reach the monitor, which issues SetStatus events.
simulation scenario (see next section), such as the average
throughput, type of distribution (uniform or Poisson), etc.
We may now briefly describe the semantics associated
with each of the other event types:
– A StartTask event is produced by a processflow
engine each time a new step of a process is taken.
– An EndProcess event is produced when the
EndTask event received by the processflow
engine corresponds to the last step of a process.
– An EndTask event is produced by a system that has
completed a task.
– A ReceivedTask event is an ‘acknowledge’ event
produced by a system that has received a
StartTask event. It acknowledges the reception
of the message, so that the processflow engine
knows that the delivery was successful.
The engine uses a time-out mechanism to re-send the
StartTask message a given number of times until this
acknowledgment is received; otherwise, the process fails.
The time-out is implemented with a TimeOutAlert event that
the processflow engine sends to itself. The monitoring
system collects the EndProcess events to produce QoS
indicators. These may trigger rules that modify the behavior
of the technical system through a SetStatus event.
The OAI test scenario presented in Section 2 corresponds
to the following five processes (which are loosely inspired
by real business processes):
† P1 is a ‘subscription’ process, with high priority (1) and a
target latency which is not tight but strict.
† P2 is an automated barring process triggered by fraud
detection. Its path may be seen in Fig. 2 (blue process = priority 2), from Fraud to OrderMgt to Prov(isioning) to
Network.
† P3 is a lower-priority (3) barring that is triggered by the
Account Management system.
† P4 is a high-priority de-barring process once a customer
has paid a delinquent bill. P4 goes through the order
manager, then provisioning, network and customer
management.
† P5 is a medium-priority query process that is originated
by the helpdesk, with a short and tight completion time
(a customer representative is waiting).
We simulate three types of ‘errors’: when a system
randomly cannot complete a task, when it suffers from a
failure and when the unstable state produced by a previous
failure on a business object prevents the running of a new
process on this object. Error handling is managed through
the send/acknowledge protocol and through a resynchroni-
zation process which is run at regular intervals to repair all
business objects that have been involved in a previous
failure.
5.3. Monitoring and control
The goal of the simulation is to gather a number of
informative indicators, which are used both as a result of the
experiment (see the tables presented in the following
sections) and as an input of the control strategy used to
implement self-adaptive approaches. Monitoring is based
on a sampling approach (every 10 min) in which we produce
a status report containing:
– Throughput
– SLA performance (process and system levels)
– Utilization of systems
– Run-time (process and system levels)
– Failure rate (process and system levels)
The sampling interval is a trade-off between different
objectives: small enough to be reactive with respect to the
durations involved in the SLA (here from a few minutes to a
few hours), but large enough to yield statistically stable
results and to avoid overloading the systems. Our choice of
10 min is conservative and pragmatic; a more thorough
study remains to be done.
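A status report of this shape can be sketched as follows (field names and the sample data are assumed for illustration; the paper does not give the exact report format):

```python
from statistics import mean

# Illustrative 10-min sampling report: each sample aggregates the
# process completions observed since the last report.

def status_report(completions, sla_limit, interval_min=10):
    """completions: list of (latency_s, failed) tuples seen this interval."""
    ok = [lat for lat, failed in completions if not failed]
    return {
        "throughput": len(completions) / interval_min,   # per minute
        "sla_perf": sum(1 for lat in ok if lat <= sla_limit) / max(len(ok), 1),
        "avg_runtime": mean(ok) if ok else 0.0,
        "failure_rate": sum(1 for _, failed in completions if failed)
                        / max(len(completions), 1),
    }

r = status_report([(50, False), (130, False), (80, True)], sla_limit=120)
print(r)
```

In the simulator these reports are logged, so averages, standard deviations and min/max intervals can be produced over one or several runs.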
These status reports are logged, so that we can produce
average, standard deviation and min/max intervals, for one
or for multiple runs. The definition of ‘system SLA
performance’ is obtained as follows. For each process, we
build a theoretical model of ‘perfect execution’ that meets
exactly the SLA completion time, using expected task
execution and message delivery times. This is used both to
define an SLA indicator for each system, which is the ratio
of the theoretical task completion time (delivery + waiting
time + execution time) to the actual measured time, and as a
key for sorting messages. In the remainder of the paper, we
use the term SLA-routing for a simple adaptation of EDD
(Earliest Due-Date [17,18]) to the context of business
processes (see Section 8 for a comparison with network
routing). SLA-routing, therefore, means sorting messages
according to the ‘expected delivery time’, which is
interpolated using the SLA characteristic of the business
process to which the message contributes.
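The due-date interpolation behind SLA-routing can be sketched as follows. This is a minimal illustration under the assumption that the 'perfect execution' model assigns each task a fixed fraction of the SLA completion budget (function and parameter names are ours):

```python
# Sketch of SLA-routing's due-date computation (an EDD adaptation).
# Assumption: the 'perfect execution' model gives each task a share of
# the SLA completion time; the due date for task k is interpolated from
# the process start time.

def task_due_date(process_start, sla_completion_time, task_fractions, k):
    """task_fractions[i]: share of the SLA budget consumed by task i.
    Returns the time by which task k should start to stay on the SLA."""
    spent_before = sum(task_fractions[:k])
    return process_start + spent_before * sla_completion_time

# Process with a 2-hour (7200 s) SLA split over three tasks (20/50/30%):
due = task_due_date(process_start=0.0, sla_completion_time=7200,
                    task_fractions=[0.2, 0.5, 0.3], k=2)
print(due)   # task 2 should start once 70% of the SLA budget is spent
```

A system's mailbox is then sorted by these due dates, so the message whose process is closest to missing its SLA is served first.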
The monitoring system implements a simple rule
interpreter, which runs rules of the type ‘condition ⇒ action’, where:
† The condition is a conjunction of tests on the status and
the QoS indicators of the processes and systems (SLA
performance, size of queue, etc.).
† The conclusion sets the status of a system to a new value.
This framework is flexible and should enable us to
experiment with various flow control strategies. In this
paper, we have implemented only four possible states for
each system:
– CUT: the system is shut down.
– SLOW: the system is slowed, which means that
the EndTask events are produced with a delay
equal to the task duration. This is a crude way of
lowering the throughput of a system (we may also
shut down a few threads).
– REGULAR: this is the default mode.
– FAST: the message sorting strategy (see next
section) is changed to a priority + LCFS approach
that is more efficient for handling congestion.
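Such a 'condition ⇒ action' rule interpreter is simple to express. The sketch below uses hypothetical indicator names and thresholds (none of them come from the paper):

```python
# Hypothetical sketch of the rule interpreter: each rule tests QoS
# indicators and, when it fires, sets a system's status (which the
# monitor would propagate as a SetStatus event).

CUT, SLOW, REGULAR, FAST = "CUT", "SLOW", "REGULAR", "FAST"

rules = [
    # (condition on indicators, target system, new status)
    (lambda q: q["P1_sla"] < 0.9 and q["billing_queue"] > 100, "billing", FAST),
    (lambda q: q["billing_queue"] > 500, "crm", SLOW),
]

def apply_rules(indicators, statuses):
    for condition, system, status in rules:
        if condition(indicators):
            statuses[system] = status
    return statuses

statuses = {"billing": REGULAR, "crm": REGULAR}
indicators = {"P1_sla": 0.85, "billing_queue": 150}
print(apply_rules(indicators, statuses))
# {'billing': 'FAST', 'crm': 'REGULAR'} -> only the first rule fires
```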
6. Routing Strategies
6.1. Routing algorithms
The term ‘routing’ is used here to designate the message
passing strategy. Control of the ‘routing’ is handled by the
processflow engine and is not meant to be modified. On the
other hand, the order in which the messages are processed
by each system is a design parameter, which we designate as
a ‘routing strategy’ in this section, and which could be named
more precisely a ‘mailbox sorting strategy’. The goal of this
section is to study the effect of various sorting algorithms,
that is, the algorithm used to select the next task that will be
run by a system whenever one of its computational threads
becomes available. We have implemented and compared
the following algorithms:
1. FCFS (First Come, First Served) is the default strategy
used by most EAI platforms. Standard queuing theory
tells us that it is indeed the best overall strategy to
minimize waiting time for a simple queue [8]. It is also
fair, by definition, but does not take priorities into
account. One of the early motives for this work was to
see how priorities could be introduced.
2. LCFS (Last Come, First Served), a standard algorithm
that is not much used in EAI environments because it is
not fair. We tested it as a reference point since it is quite
robust, and produces interesting results in case of
congestions.
3. RSS (Random Selection for Service) is another classical
approach from queuing theory [8] which we thought
could be interesting for studying the opportunity for
randomized algorithms [19].
4. SLA (Service Level Agreement) uses an estimated start-
time based on a linear interpolation of the SLA, as
discussed in the previous section, using the process start
time as a reference. The task message selected is the one
with the earliest expected start time.
5. PRF (Priority and FCFS) is a combination of priority
sorting (main) and FCFS sorting. We tried various
combinations, but the one that works best is the simplest
(lexicographical ordering: priority first and then FCFS
for ties).
6. PRL (Priority and LCFS) uses a similar combination of
priority and LCFS.
7. PRSS (Priority and SLA at System level) uses a
combination of priority and SLA time that is applied at
the system level. That means that the linear interpolation
of the SLA time is used only to estimate the expected
waiting times in the queue, using the issuing of the
StartTask event as a reference.
8. PRSP (Priority and SLA at Process level) is similar, but
the linear extrapolation is used to compute, for each
StartTask event, the time that is expected between the
process start and the beginning of the given task (same as
the #4 algorithm).
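Several of these strategies reduce to a sort key over the message queue. The sketch below illustrates a few of them with assumed message fields (arrival time, business priority, interpolated SLA start time); it is a simplification of the behaviors described above:

```python
from dataclasses import dataclass

# Illustrative sort keys for some of the mailbox-sorting strategies.
# Message fields are assumptions for this sketch, not the paper's model.

@dataclass
class Msg:
    arrival: float        # time the message entered the queue
    priority: int         # 1 = highest business priority
    sla_start: float      # interpolated expected start time (SLA-routing)

strategies = {
    "FCFS": lambda m: m.arrival,
    "LCFS": lambda m: -m.arrival,
    "SLA":  lambda m: m.sla_start,                 # earliest due date first
    "PRF":  lambda m: (m.priority, m.arrival),     # priority, then FCFS ties
    "PRSP": lambda m: (m.priority, m.sla_start),   # priority, then SLA time
}

queue = [Msg(1.0, 2, 5.0), Msg(2.0, 1, 9.0), Msg(3.0, 1, 4.0)]
nxt = min(queue, key=strategies["PRF"])
print(nxt.arrival)   # 2.0: the oldest highest-priority message
```

When a thread becomes available, the system simply pops the minimum of its queue under the chosen key.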
The first step is to simulate a typical event flow, with
various throughput densities. We used 5 scenarios that may
be described as follows:
† S1: Our reference scenario. The throughput has been
adjusted so that all systems are reasonably busy, with
utilization ratios from 30% to 80%. The load is equally
balanced between the 5 process flows (P1 to P5). The
event distribution is uniform in time, with a 5-hour time
horizon.
† S11: The event rate is increased by 10% during the
first two hours.
† S12: The event rate is increased by 30% during the
first two hours. Many systems reach a 100% utilization rate.
† S13: The event rate is increased by 50% during the
first two hours. This is a ‘crisis scenario’ since this flow is too
large for the systems to handle. Since we suppose that
the sizing of the EAI system has been done properly,
such crisis situations are unlikely, except for bursts,
which will be studied in the next section.
Table 1
Algorithms × scenarios: regular flow

            S1 (%)  S11 (%)  S12 (%)  S13 (%)  S2 (%)
FCFS   P1     98      98       83       28       98
       P2     98      98       59       15       98
       P3     98      98       76       22       98
       P4     88      88       47       12       93
       P5     84      84       46       13       90
LCFS   P1     98      98       92       75       98
       P2     98      96       85       66       98
       P3     98      98       91       73       98
       P4     93      90       72       52       94
       P5     96      94       86       79       96
RSS    P1     98      98       90       67       98
       P2     98      97       82       56       98
       P3     98      98       88       64       98
       P4     94      90       67       41       94
       P5     95      93       83       72       95
SLA    P1     98      98       82       26       98
       P2     98      98       75       18       98
       P3     98      98       78       22       98
       P4     98      98       69       15       98
       P5     99      99       74       20       99
PRF    P1     98      98       98       98       98
       P2     97      96       81       48       98
       P3     97      94       44        2       98
       P4     98      98       95       86       98
       P5     93      89       69       45       93
PRL    P1     98      98       98       97       98
       P2     97      96       83       73       97
       P3     97      92       54       59       96
       P4     98      98       96       97       98
       P5     95      93       82       70       95
PRSS   P1     98      98       98       98       98
       P2     97      96       83       50       98
       P3     97      94       44        3       98
       P4     98      98       98       97       98
       P5     96      94       79       52       96
PRSP   P1     98      98       98       98       98
       P2     97      96       82       50       98
       P3     97      92       39        1       98
       P4     98      98       98       97       98
       P5     96      94       76       50       97
Y. Caseau / Advanced Engineering Informatics 19 (2005) 199–211
† S2: This scenario is similar to S1 except that the event
arrival times follow a Poisson distribution.
Table 1 contains the preliminary results of running these
5 scenarios against the 8 algorithms presented earlier.
Each experiment is summarized by the average SLA
satisfaction (% of processes that were completed within the
SLA maximum time). These numbers are given for the five
processes; the P1 and P4 lines are in boldface type to
indicate the higher priority. A real-life situation could be
more complex, with different SLA percentage objectives
(we could say that P1 completion time must be met 95% of
the time, and P2 only 90% of the time). Since this is
somehow redundant with process priorities, we assume here
that the goal of the routing strategy is to ensure that, for
each process, SLA satisfaction is as close to 100% as
possible, starting with higher-priority processes.
The first lesson that can be drawn from these experiments
is that priority routing works. The four algorithms that use
process priority as part of the sorting strategy are able to
maintain the SLA of high-priority processes much better
than the first four algorithms. These are very positive and
encouraging results, since the ability to deliver the key
processes when congestion occurs is a strong business
objective.
The second lesson is that FCFS is not a good default
algorithm. Both RSS and LCFS do better as soon as the
event flow becomes tight (we did not report results for
smaller event flows since all algorithms perform well,
although it may be noted that the average duration is indeed
lower for FCFS). On the other hand, SLA time is a valuable
technique: without taking priority into account, it is the
best overall approach when the event flow can be handled
by the EAI system.
When the event load becomes too great, PRSS and PRSP
do a good job of handling the high-priority processes, but
overall satisfaction is not as good as that obtained with PRL.
This is even more marked if throughput increases to many
times the usual rate, in which case the SLA satisfaction rate
drops to 0% with all algorithms except LCFS and PRL. This is
because Last Come, First Served has the implicit behavior
of serving a few customers well and the rest very poorly, as
opposed to other (fairer) approaches that treat everyone
poorly. This aspect will be discussed in the next two sub-
sections.
It is interesting to note that these results are not very
sensitive to the event distribution. They are also fairly
stable, as the results obtained by running a scenario 10 or
100 times are very similar. For this reason, we do not report
standard deviation numbers in these tables.
6.2. Burst distribution
We shall now measure the ability to deal with a burst,
which corresponds to the ‘self-adaptive’ characteristic that
we expect from the EAI infrastructure. We use the following
scenarios:
– S3: This scenario is a combination of the event
flow used in S1 (reference) and a 20-min burst of
P1 processes, doubling the input rate for P1.
– S31: This scenario is a variation of S3, where the
burst lasts for 40 min.
– S32: Same, but the burst lasts for one hour.
– S4: This scenario is similar to S3, but the burst
occurs with a lower-priority process, P3.
– S42: Similar to S32 (P3 burst).
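A burst scenario of this kind can be generated from exponential inter-arrival times, with the event rate multiplied inside the burst window. This is only a sketch of the idea; the function name, parameters, and rates are illustrative, not those of our simulator:

```python
import random

def arrivals(rate_per_min, horizon_min, burst=(0.0, 0.0, 1.0), seed=0):
    """Generate Poisson-style arrival times over [0, horizon_min).
    Inside [burst_start, burst_end) the rate is multiplied by a factor,
    as in the S3 scenario (a 20-min burst doubling the P1 input rate)."""
    start, end, factor = burst
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        # current rate depends on whether we are inside the burst window
        rate = rate_per_min * (factor if start <= t < end else 1.0)
        t += rng.expovariate(rate)   # exponential inter-arrival time
        if t >= horizon_min:
            return times
        times.append(t)

# Illustrative S3-like flow: a 5-hour horizon with a 20-min burst at t=60.
p1_events = arrivals(rate_per_min=1.0, horizon_min=300, burst=(60, 80, 2.0))
```

S1-style uniform arrivals would instead space events deterministically; the exponential inter-arrival times above correspond to the Poisson variant (S2).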
We compare only the last four algorithms, since the
previous section made it clear that taking priority into
account for the sorting algorithm is indeed a good idea.
Table 2 confirms our earlier results: the combination of
priority and SLA sorting is the best approach. The S4*
Table 2
Algorithms × scenarios: bursts

            S3 (%)  S31 (%)  S32 (%)  S4 (%)  S42 (%)
PRF    P1     98      96       90      98      98
       P2     80      66       54      97      97
       P3     70      47       30      84      47
       P4     88      79       70      98      98
       P5     81      70       60      92      90
PRL    P1     97      95       92      98      98
       P2     83      70       58      97      96
       P3     74      53       36      86      68
       P4     92      86       80      98      97
       P5     84      73       64      94      92
PRSS   P1     98      98       95      98      98
       P2     82      67       54      97      97
       P3     72      49       31      85      49
       P4     96      93       86      98      98
       P5     84      73       62      96      95
PRSP   P1     98      97       91      98      98
       P2     82      68       54      97      97
       P3     70      45       29      82      45
       P4     97      90       83      98      98
       P5     83      72       61      95      94

Table 3
Algorithms × scenarios: component failure

            S5 (%)  S51 (%)  S6 (%)  S61 (%)
PRF    P1     91      76      98      98
       P2     78      62      87      79
       P3     65      37      94      86
       P4     84      70      98      98
       P5     82      72      92      92
PRL    P1     90      78      98      98
       P2     79      63      87      79
       P3     71      50      94      88
       P4     86      75      98      98
       P5     85      76      95      95
PRSS   P1     90      77      98      98
       P2     79      62      88      79
       P3     66      38      95      87
       P4     88      73      98      98
       P5     85      76      96      95
PRSP   P1     91      75      98      98
       P2     78      61      88      79
       P3     64      33      94      85
       P4     86      70      98      98
       P5     85      75      96      95
experiments show that these four algorithms can handle a
burst of low-priority messages very well. We may see that
for small bursts there is a tiny advantage of PRSP over
PRSS, but the overall winner is PRSS.
Here also, we can push the system to the limit with a
bigger burst (3 times the regular P1 flow, the S33 scenario),
and we find that PRL does a better job at getting at least a
third of the P1 process calls completed within the target
times, while SLA satisfaction drops to 0% with the other
approaches.
The conclusion is that PRSS is the best self-adaptive
algorithm, but that PRL may be seen as a more robust
algorithm, depending on whether further delay of an order
that has not met its SLA completion time is considered a
good practice.
6.3. Component failure
Each system is characterized by its availability, which
is the probability that the system is available to perform
a task. Non-availability may have multiple causes, from
technical and hardware failure to software and functional
errors. We include in our model, and in all experiments
that have been reported so far, the random generation of
service failure which captures these small ‘glitches’, most
of which are caused by human error (typing errors when
customer parameters are input). In this section, we shall
study the impact of a complete system failure, which is
the shutdown of the system for a given duration (a few
minutes to a few hours). When a system is unavailable,
there are a number of effects: first, the messages in the
system queue wait to be executed; second, the
infrastructure does not receive the acknowledgment message,
so it stops sending new messages and stores its own
process states; and lastly, after a time-out delay,
processes are considered to have failed.
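A toy sketch of two of these effects, namely the messages waiting in the queue and the time-out that turns a late process into a failure, is given below; the time-out value and all names are illustrative, and the orchestrator's withholding of new messages is not modeled:

```python
# Illustrative time-out after which a waiting process is considered failed.
TIMEOUT = 30.0  # minutes

def drain_after_restart(queue, down_until, service_time):
    """queue: list of (send_time, process_id) messages that accumulated
    while the system was down; the system restarts at `down_until` and
    serves messages in FCFS order, one every `service_time` minutes.
    Returns (served, failed) lists of process ids."""
    served, failed, clock = [], [], down_until
    for send_time, pid in sorted(queue):      # messages waited in the queue
        clock = max(clock, send_time) + service_time
        if clock - send_time > TIMEOUT:       # time-out: process has failed
            failed.append(pid)
        else:
            served.append(pid)
    return served, failed
```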
Our goal here is to measure first the ability to deal with
the growth of the queues, and then the ability to handle the
backlog when the system restarts. We use the following
scenarios:
– S5: We simulate the failure of a key component
(Order Management) for 15 min, which means
that all processes are impacted.
– S51: We simulate a similar failure for a duration
of 30 min.
– S6: We simulate the failure of the Fraud system,
which impacts only the second process, for
15 min.
– S61: Same as S6, but for a duration of 30 min.
Table 3 presents the results obtained with these scenarios
and the four priority-based algorithms.
We see that a small failure causes behavior similar to
that observed for a burst: all priority-based algorithms do
a good job of preserving high-priority processes, and the
SLA-based algorithm does particularly well. However, if
the failure is longer, the LCFS behavior ensures that a
higher proportion of the processes are run within the
target completion time. It is interesting to see that this
behavior is similar to what happens in a crisis congestion
situation today: at some point of combined delay, the
operations staff decides to extract a backlog of ‘old
orders’ to be processed off-line during the night/week-
ends, so that the on-line system recovers a typical
throughput.
6.4. Summary
We may summarize our main findings as follows:
† The first lesson to be drawn from these experiments is
that priority routing works. The four algorithms that use
process priority as part of the sorting strategy are able to
maintain the SLA of high-priority processes much better
than the first four algorithms.
† The second lesson is that FCFS is not a good default
algorithm. Both RSS and LCFS do better as soon as the
event flow becomes tight.
† On the other hand, SLA time is a valuable technique:
without taking priority into account, it is the best
overall approach when the event flow can be handled by
the EAI system.
† The combination of priority and SLA sorting is the best
approach. The experiments show that these algorithms
can handle a burst of low-priority messages very well.
For small bursts there is a tiny advantage of PRSP over
PRSS, but the overall winner is PRSS.
7. Control strategies
7.1. Flow rules
Once the ‘routing strategy’ is set up, the other approach
to controlling the behavior of the whole EAI system is to
control the flow. Figuratively speaking, we may view each
adapter as a faucet that can be turned on and off, or even set
to a ‘reduced’ position, according to dynamic rules. The idea is to
implement rules that would say ‘there is no point in sending
more water to a portion of the pipes that is already
congested’. To experiment with this approach, we have
implemented, as explained in Section 4, a rule engine within
the monitoring system.

Table 4
Flow rule sets × congestion scenarios

                  S33                S51                S2
             FCFS (%)  PRSP (%)  FCFS (%)  PRSP (%)  FCFS (%)  PRSP (%)
No Rules  P1    38        70        56        75        98        98
          P2    31        44        48        61        98        98
          P3    35        22        50        33        98        98
          P4    29        66        44        70        93        98
          P5    44        67        57        75        90        97
RS1       P1    46        70        60        75        98        98
          P2    23        44        25        61        98        98
          P3    42        23        55        33        98        98
          P4    33        65        47        70        93        98
          P5    31        39        35        52        81        92
RS2       P1    52        70        62        75        98        98
          P2    25        43        29        61        98        98
          P3    46        23        55        33        98        98
          P4    25        65        46        69        93        98
          P5    33        66        35        66        90        97

The first step was to implement the
following sets of rules:
– RS1: When the QoS of a given system X falls
below 90% of its SLA level (cf. Section 3), we
reduce the flow of systems that are providers of X
(i.e., that produce EndTask messages that will
trigger new tasks to be executed by X) and whose
priority is lower than that of X. The priority of a
system is defined simply as the maximum priority
of all processes to which it contributes (this is
actually a ‘min’, since high priority is 1 and low
priority is 3). A dual rule restores the default
setting once the QoS of X climbs back to 90%.
– RS2: This is a similar rule, but the triggering
condition is based on processes. When the QoS of
a given process P falls below 90%, we reduce the
flow of all systems that have lower priority than P
and that are providers of a system that supports P.
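RS1 can be sketched as a simple function over the system topology. All data structures below are illustrative; only the 90% threshold and the priority convention (1 = high, 3 = low, so a lower priority means a larger number) come from the rule itself:

```python
# A sketch of the RS1 flow rule, assuming simple dictionaries for the
# integration topology (not the actual rule engine of the paper).
def rs1_actions(qos, providers, system_priority):
    """qos: {system: QoS as a % of its SLA level};
    providers: {system: [upstream systems feeding it tasks]};
    system_priority: {system: numeric priority, 1 = high, 3 = low
                      (the min over the priorities of its processes)}.
    Returns the set of upstream systems whose flow should be reduced."""
    reduce_flow = set()
    for x, level in qos.items():
        if level < 90:                       # QoS of X below 90% of its SLA
            for p in providers.get(x, []):
                # "lower priority than X" means a larger numeric value
                if system_priority[p] > system_priority[x]:
                    reduce_flow.add(p)
    return reduce_flow
```

The dual rule (restoring the default setting once the QoS climbs back to 90%) would simply be the complementary check.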
Table 4 reports the results of the associated experiments
(the % of process runs that meet the SLA). We focused on the
S33 and S51 scenarios, because they were the two that
placed the EAI system under stress, and included the S2
scenario to make sure that the rules would have no adverse
effect on a more regular situation. We compare the effect of
the rule sets on two routing strategies: FCFS (the default
solution for most EAI systems, including ours) and PRSP (to
see whether the rules add value to a smarter routing
strategy). The lines for P1 and P4 are in boldface type to
indicate their higher priority.
We see that the control flow rules bring an improvement
when the routing is straightforward, but are of no value with
priority-based routing. This is not really surprising, since the
goal of the improved routing is precisely to deliver the
important messages without being bothered by other flows
of lower-priority messages. Thus, there is no value in
reducing or cutting these lower-priority flows in case of
congestion.
It is difficult to summarize a long sequence of
experiments with one table: we tried various ways of
controlling the flow ‘at the faucet’, from a plain cut-off
(on/off) to more subtle schemes to reduce the flow. It turns
out that we were not able to produce any stable
improvement when the rules were added to the PRSP
algorithm. It must be repeated that it is important to run
these stochastic experiments many times, since for one
given run, there is usually a set of rules that shows a small
improvement.
7.2. Routing rules
Since the reduction of flow did not seem to provide much
improvement, we took a different approach and decided to
switch the routing strategy when severe congestion occurs.
We added a new status (FAST) to each ST that directs
the system to ‘run all incoming messages using the PRL
sorting strategy’. We then implemented the following sets of rules:
– RS3: When the QoS of a given system X drops
below 95%, the system is switched to FAST status.
The system resumes normal status once the QoS
climbs back above 95%.
– RS4: When the QoS of a given process P drops
below 95%, all systems that support this process are
switched to FAST status.
– RS5: A system is switched to FAST status
whenever its mailbox size exceeds 100.
Obviously, the triggering size is a constant that
depends on the volume processed by the EAI and
the number of connected systems.

Table 5
Routing rule sets × congestion scenarios

                  S33                S51                S2
             PRF (%)  PRSP (%)  PRF (%)  PRSP (%)  PRF (%)  PRSP (%)
No Rules  P1    69       70        76       75        98       98
          P2    42       44        62       61        98       98
          P3    23       22        37       33        98       98
          P4    63       66        70       70        98       98
          P5    65       67        72       75        93       97
RS3       P1    74       75        76       74        98       98
          P2    69       69        69       68        97       98
          P3    58       59        65       64        98       98
          P4    75       77        73       72        98       98
          P5    72       72        79       80        92       96
RS4       P1    71       76        76       74        98       98
          P2    64       68        66       64        98       98
          P3    52       57        59       59        98       98
          P4    69       74        72       69        98       98
          P5    67       70        78       78        93       97
RS5       P1    77       78        77       75        98       98
          P2    74       73        74       66        98       98
          P3    65       63        65       57        98       98
          P4    77       80        77       72        98       98
          P5    72       74        72       80        93       97
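RS5 is simple enough to sketch in a few lines. The threshold of 100 is the one quoted above; the class layout and the FCFS default used in normal mode are illustrative:

```python
# A sketch of RS5: a purely local rule that switches a system to FAST
# (PRL sorting) when its mailbox grows beyond a threshold.
class SystemAdapter:
    FAST_THRESHOLD = 100  # depends on EAI volume and number of systems

    def __init__(self):
        self.mailbox = []       # pending messages: {"priority", "arrival"}
        self.status = "NORMAL"

    def select_next(self):
        # RS5 trigger: local condition, local conclusion
        self.status = "FAST" if len(self.mailbox) > self.FAST_THRESHOLD else "NORMAL"
        if self.status == "FAST":
            # PRL: priority first (1 = high), then Last Come, First Served
            key = lambda m: (m["priority"], -m["arrival"])
        else:
            # default routing; FCFS is used here for illustration
            key = lambda m: m["arrival"]
        self.mailbox.sort(key=key)
        return self.mailbox.pop(0)
```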
This idea is very similar to the ‘active scheduling’
principle presented, for instance, in [20]. In this instance, the
active router is a programmable router that can execute a
mix of FCFS, WFQ (Weighted Fair Queuing) and JEDD
(Jitter Earliest Due-Date) routing strategies. The router is
controlled by an ‘intelligent agent’ that runs on top of the
CORBA middleware.
In our case, we found that the most efficient adaptive
scheduling method for dealing with congestion was the PRL
method. This new approach proved to be more successful
than using flow rules and provided an improvement for all
routing algorithms (except PRL itself, by construction).
Table 5 reports the results of these experiments, using the
same three scenarios as in the previous table. We compared
the three rule sets using PRF and PRSP, to see whether the
combination of ‘improved routing’ and ‘routing rule’ was
necessary or whether the simple PRF (priority & FCFS)
strategy was sufficient as the default strategy.
The first observation drawn from these results is that this
approach works: rules do not degrade quality of service for
regular scenarios and bring an improvement when dealing
with congestions. Second, the simpler RS5 rules actually
provide the greatest improvement. This is interesting since
RS5 rules are extremely simple to implement (purely local
conditions and conclusions). Lastly, ‘smart routing’ pays, in
the sense that the advantage provided by SLA time
prediction over FCFS is not lost by adding routing rules.
8. Comparison with related works
There is an active but small research community which
focuses on autonomic, self-adaptive and fault-tolerant
middleware (e.g., the Chameleon project [21]). The main
difference with our work is that such approaches aim to
build new, specialized middleware, whereas our goal is to
use standard commercial products that have already been
deployed.
Similarly, many of the main objectives stated in the first
sections of this paper are shared with the BPM community.
As mentioned earlier, the need for a QoS description
standard has produced many proposals (e.g., [22] and [7]).
However, the introduction of QoS constraint into XML
specifications in the BP*L family of languages [3] is still at
a very preliminary stage, and no implementation of a QoS
adaptive mechanism has yet been proposed in the process-
flow community.
On the contrary, the approach presented in this paper
draws heavily from previous work on computer networking
[2]. The various scheduling methods with which we have
experimented are directly inspired by techniques such as
Weighted Round Robin, Weighted Fair Queuing [23,24]
and Delay-Earliest Due-Date. WFQ is a tractable approxi-
mation of Generalized Processor Sharing, an idealized fluid
model, which relies on the computation of a virtual time for
each type of packet traffic. DEDD is similarly based on the
computation of expected due-dates based on service levels.
There are many other relevant techniques that could be
borrowed and used in the context of integration middleware.
For instance, reservation methods (such as RSVP [2,14])
could be implemented to guarantee a part of the bandwidth
for high-priority processes. We tried this approach when
studying SLA-routing for call centers [13], but found that
the additional robustness was gained at the expense of
overall efficiency, a trade-off that is also known in computer
networks (we also experimented with non-work-conserving
approaches, but we lack stochastic information about the
message distribution required to make ‘idle waiting’
worthwhile). Similarly, the RED technique (Random
Early Detection/Discard) could apply, although message
loss is not an option. The equivalent of dropping a packet in
our context would be a long-term postponement, which is
strangely close to the idea of using LCFS during congestion.
More generally, the DiffServ architecture for IP networks is
relevant to our work since it deals with the delivery of SLAs
for complex applications (using, for instance, RSVP
routing).
One must not conclude from the above arguments that
adaptive middleware and adaptive networks are competing
technologies. Rather, they complement each other and
should be used in combination. The point that we made
earlier is that explicit ‘business’ QoS (i.e., associated with
business processes) modeling yields benefits in addition to
those that can be obtained with a ‘smart’ network. Another
benefit of the work described in this paper is that it is
intended for deployment on an existing IT network, and, as
such, is easier to deploy than a radical change in network
and middleware technology.
The limits of a classical layered approach when facing the
needs of new kinds of applications are well known: lack of
flexibility and lack of efficiency. The active network
approach [20,25] has some interesting similarities with
our own work, since the authors study a CORBA
middleware on top of an adaptive network. The implemen-
tation of high-level QoS constraints is done through a layer
of CORBA agents that use adaptive routers to control the
flow of messages. Similar approaches are found in the
context of wireless networks [26]. In both cases, the major
driver for QoS-awareness is the presence of multimedia
applications (hence the importance of jitter). What sets our
work apart from most recent adaptive network proposals is
the nature of the ‘QoS constraints’ that we want to handle.
Business processes have a richer and more interesting
structure than most multimedia services, and delivery of the
expected QoS for business processes may take advantage of
this structure.
9. Conclusion
This work is a first step towards our goal of self-
adaptive, self-healing integration infrastructures. This
goal makes our approach a modest contribution to the
field of autonomic computing [27]. From a business
perspective, the leading characteristic of a self-adaptive
infrastructure is the ability to balance resources according
to business priorities. This objective is what motivated
our research in the first place. As far as ‘self-healing’ is
concerned, there is still a long way to go, even though we
find that some routing strategies are more tolerant of the
congestions caused by a system failure than others. On
the other hand, simulation is indeed a powerful approach
for studying and understanding complex systems, such as
application infrastructures [28]. The contribution of this
paper may be stated as follows: we demonstrate how
‘business QoS’ SLAs could be used as declarative
policies in an adaptive infrastructure. We close the gap
between a very high-level vision, namely the business
processes, and the detailed implementation of a message-
passing integration infrastructure. The main findings may
be summarized as follows:
(1) Priority handling works: it is possible and fairly
simple to take process priority into account for
routing messages, and the results show a real
improvement in the ability to focus on higher-priority
processes when congestion occurs.
(2) The routing (mailbox sorting) algorithm matters: the
more sophisticated SLA projection technique showed
a real improvement over an FCFS policy, even
without taking priorities into account.
(3) Control rules are interesting, but they are secondary
to the routing policy: it is more efficient to deal with
congestion problems with a distributed routing
strategy than with a comprehensive rule schema.
Rules should complement the routing scheme, and
since there is no value in doing the same work twice,
the priority handling is best done at the node level.
The next step is to pursue our research in the following
directions:
– A more thorough study with simulation scenarios
which are closer to ‘real life’: the test EAI
problem used in this paper is representative of a
true EAI infrastructure, but the principles identified
here need to be validated with improved EAI
scenarios that are accurate reproductions of real-
life situations and span longer time periods.
– A better understanding of system failures: the
failure scenarios would become much more
interesting if we included some of the behavior
of real systems, which means that we need to
enrich our model. This starts with gathering data
and analyzing real-life ‘failures’ of our EAI
infrastructure.
– A realistic model for representing synchronization
of distributed objects. This is necessary to
evaluate the impact of message ordering on data
consistency. The ability to use improved routing
algorithms depends on the synchronization strat-
egy used to manage distributed copies of business
objects.
References
[1] Garlan D. Self-healing systems: some resources. http://www.cs.cmu.edu/~garlan/17811/resources.html.
[2] Keshav S. An engineering approach to computer networking. Addison-Wesley Professional; 1997.
[3] Smith H, Fingar P. Business process management: the third wave. Meghan-Kiffer Press; 2002.
[4] Caseau Y. Urbanisation et BPM (Enterprise Architecture and Business Process Management). Paris: Dunod; 2005.
[5] Wietrzyk VI, Takizawa M. Distributed workflows: a framework for electronic commerce. J Inform Sci Eng 2003;19.
[6] Andrews T, et al. Specification: Business Process Execution Language for Web Services, version 1.1. http://www.ibm.com/developerworks/library/ws-bpel/; 2003.
[7] Ludwig H, Keller A, Dan A, King R, Franck R. Web Service Level Agreement (WSLA) language specification, version 1.0. http://www.research.ibm.com/wsla/.
[8] Gross D, Harris C. Fundamentals of queuing theory. New York: Wiley-Interscience; 1998.
[9] Lazowska E, Zahorjan J, Scott Graham G, Sevcik K. Quantitative system performance: computer system analysis using queuing network models. Prentice-Hall; 1984.
[10] Bolch G, Greiner S, de Meer H, Trivedi K. Queueing networks and Markov chains. New York: Wiley-Interscience; 1998.
[11] Fiat A, Woeginger G, editors. Online algorithms: the state of the art. Lecture notes in computer science, vol. 1442. Springer; 1998.
[12] Benoist T, Bourreau E, Caseau Y, Rottembourg B. Towards stochastic constraint programming: a study of on-line multi-choice knapsack with deadlines. In: Proceedings of CP'2001. Lecture notes in computer science, vol. 2239; 2001.
[13] Caseau Y. Declarative ACD routing with service level optimization. Technical memorandum, White Pajama; July 2001.
[14] Halabi S. Internet routing architectures. Cisco Press; 2001.
[15] Wang Z. Quality of service routing for supporting multimedia applications. IEEE JSAC 1996;14(7).
[16] Fujimoto R. Parallel and distributed simulation systems. Wiley-Interscience; 2000.
[17] Golestani SJ. A stop-and-go queuing framework for congestion management. In: Proceedings of ACM SIGCOMM '90, Philadelphia; 1990.
[18] Verma D, Zhang H, Ferrari D. Delay jitter control for real-time communication in a packet switching network. In: IEEE TriCom, Chapel Hill, NC; 1991.
[19] Motwani R, Raghavan P. Randomized algorithms. Cambridge University Press; 1995.
[20] Hussain SA, Marshall A. An active scheduling policy for programmable routers. In: 16th UK Teletraffic Symposium: Management of QoS, the New Challenge, Harlow; 2000.
[21] Bagchi S, et al. The Chameleon infrastructure for adaptive, software-implemented fault tolerance. In: 17th Symposium on Reliable Distributed Systems; 1998.
[22] Frolund S, Koistinen J. QML: a language for quality of service specification. Hewlett-Packard Software Technology Laboratory, report HPL-98-10; 1998.
[23] Demers A, Keshav S, Shenker S. Analysis and simulation of a fair queuing algorithm. Journal of Internetworking Research and Experience 1990. p. 3–26.
[24] Bennett JCR, Zhang H. WF2Q: worst-case fair weighted fair queuing. In: IEEE INFOCOM '96, San Francisco; 1996.
[25] Chieng D, Marshall A, Parr G. SLA-driven flexible bandwidth reservation negotiation schemes for QoS-aware IP networks. In: Management of multimedia networks and services: 7th IFIP/IEEE international conference, MMNS, San Diego. Lecture notes in computer science, vol. 3271. Berlin: Springer; 2004.
[26] Raatikainen K. Wireless Internet: challenges and solutions. University of Helsinki, Series of Publications B, report B-2004-3; 2004.
[27] Ganek A, Corbi T. The dawning of the autonomic computing era. IBM Syst J 2003;42(1).
[28] Singleton P. Performance modelling: what, why, when and how. BT Technol J 2003;20(3).