Self-adaptive middleware: Supporting business process
priorities and service level agreements
Yves Caseau
Bouygues Telecom, 92640 Boulogne-Billancourt, France
Received 8 December 2004; revised 19 April 2005; accepted 3 May 2005
Abstract
This paper deals with quality of service, defined at the application level, with respect to business constraints expressed in terms of business
processes. We present a set of adaptive methods and rules for routing messages in an integration infrastructure which yields a form of
autonomic behavior, namely the ability to dynamically optimize the flow of messages in order to comply with SLAs according to business
priorities. EAI (Enterprise Application Integration) infrastructures may be seen as component systems that exchange asynchronous messages
over an application bus, under the supervision of a processflow engine that orchestrates the messages. The QoS (Quality of Service) of the
global IT system is defined and monitored with SLAs (Service Level Agreements) that apply to each business process. The goal of this paper
is to propose routing strategies for message handling that maximize the ability of the EAI system to meet these requirements in a self-
adaptive and self-healing manner, i.e., its ability to cope with sudden variations of the event flow or temporary failures of a component
system. These results are a first contribution towards deployment of autonomic computing concepts into BPM (Business Process
Management) architectures. This approach marks a departure from previous approaches in which QoS constraints are pushed to the lower
level (e.g., the network). Although the techniques, such as adaptive queuing, are similar, managing QoS at the business process level yields
more flexibility and robustness.
© 2005 Elsevier Ltd. All rights reserved.
Keywords: Service level agreement; Self-optimizing; Autonomic; Middleware; Message passing; EAI; Business processes; BPM; Adaptive queuing
1. Introduction
The topic of this paper is the management and
optimization of business processes within an integration
middleware. These issues arise when IT is organized and
measured according to the key business processes, using
components that participate in the processes through their
connection to a global infrastructure (EAI) and the
orchestration of a processflow engine. The optimization of
IT is thus measured through SLAs which define the
expected performance of each process: the intention is to
meet these requirements and, when this becomes impossible
for some external reason (burst, failure), to ensure an ordered
degradation in performance so that higher-priority processes
are less impacted than lower-priority ones.
1474-0346/$ - see front matter © 2005 Elsevier Ltd. All rights reserved.
doi:10.1016/j.aei.2005.05.013
E-mail address: [email protected].
Our goal is to study the impact of routing and control
strategies, and to show that it is possible to demonstrate self-
adaptive and self-healing behavior. The topic of self-
adaptive and self-healing software and middleware is
studied by a very active research community [1], but little
work has been done on the management of business
processes and their quality of service. On the other hand,
adaptive queuing is the classical approach to implement
QoS management in a network [2], but there is a large gap
between QoS at the network level and at the business level.
The approach that we propose is to monitor and control
QoS, from a business perspective, at the integration
middleware level, and specifically at the processflow engine
level. We use a finite-event simulator to represent an EAI
integration infrastructure. This simulator is used to test
various message handling strategies, both at the component
level (how to sort the message queues) and at the EAI level
(how to configure the processflow engine and how to
manage business processes’ QoS with control rules). We
have found that sorting strategies produce large differences
Advanced Engineering Informatics 19 (2005) 199–211
www.elsevier.com/locate/aei
from the standpoint of both quality and robustness. We also
show that simple control rules are effective in increasing the
ability to cope with system failures or exceptional
congestions. The paper is organized as follows. Section 2
presents the field of application integration and business
process management. Section 3 describes the motivations for
this work, namely the ability to optimize quality of service
according to business priorities. The acronym OAI
(Optimization of Application Integration) is introduced as
the dual of EAI: EAI is about building integration
infrastructures; OAI is about operating them. Section 4
explains how processes and SLAs are defined, and illustrates
the concept of business priorities using a real-life example.
Section 5 presents the finite-event model and simulator that
we used to study various message routing strategies. Section
6 shows how 8 different routing strategies perform when
applied to regular and exceptional scenarios. Section 7
introduces the concept of control rules, which are triggered
by certain QoS conditions defined for systems and
processes. We show that these rules may be used to increase
robustness against exceptional events. The last section
compares this OAI approach with related works, and
stresses the difference with network QoS concepts.
2. Application integration and business processes
This paper describes a piece of applied research that
stems directly from an operational crisis which occurred at
Bouygues Telecom a few years ago. This section describes
the issues that face IT managers when they try to support
their customers’ business processes. Since the term ‘Quality
of Service’ will be used throughout the paper in a sense that
is different from the ‘network QoS’ literature, from which
most of the relevant techniques will be extracted in the next
sections, it is important to state what those issues are.
Most companies have been going through IT re-
engineering programs in the last ten years. For instance,
Bouygues Telecom built its earlier IT system around an
integrated off-the-shelf package that delivered billing,
CRM, provisioning and customer management services.
Five years ago, this approach ran into performance,
scalability and cost-efficiency problems, mostly because
the packaged software had been heavily customized. The
motivations for the re-engineering approach which resulted
from this difficulty constitute a textbook case for the
industry:
† Take ownership of IT through a corporate business
object and process model. Instead of letting multiple
software vendors tell you what a customer or a service is,
build a central model that is the cornerstone of inter-system
exchanges.
† Deploy a component-based architecture to solve
performance and scalability issues.
† Increase flexibility through the displacement of
business logic from components towards a processflow
engine. The explicit implementation of business processes is
a step forward both from a ‘strategic alignment’ perspective
(making sure that IT processing follows the business
strategy closely) and from an ‘agility’ perspective (changing
the processes becomes easier).
† Ensure better Quality of Service through a redundant,
secure infrastructure and through SLA monitoring (cf.
Section 4).
The resulting target architecture is illustrated by the
following figure. Each major IT function is implemented by
separate components, which are assembled using an EAI
infrastructure. The components exchange information
through message passing, under the supervision of a
processflow engine. Collaboration between components is
precisely described using a business process language, as a
sequence of activities separated by transitions [3]. Business
objects are distributed among these components, and the
common model is used to provide the exchange semantics.
The distribution of business objects means that synchroni-
zation flows must be orchestrated, with some form of
interleaving with the control flow from the business
processes. Thus, the role of the EAI infrastructure is
threefold: transporting messages, managing the execution of
business processes and ensuring the synchronization of
distributed copies of business objects.
The figure above is obviously a simplified abstraction of
a true IT system. For instance, it is not desirable to rely on a
single EAI integration infrastructure. There are multiple
technology, deployment and performance constraints which
lead to the concept of ‘fractal architecture’ [4]. A fractal
architecture is a recursive construction of the pattern
represented in Fig. 1 (a software bus with components) at
multiple scales. Such patterns may collaborate through a
gateway (between two EAI buses) or one may be a sub-
pattern of another if it describes one of its components. In
this paper we shall focus on a single EAI instantiation (one
bus and one processflow), but this does not mean that the
entire IT system of one company could fit on a single bus.
The concepts of Enterprise Architecture and Business
Process Management have become very popular, and hype
may be replacing common sense in many cases [4]. Our
practical experience with enterprise architecture and
integration architecture suggests three general warnings:
† Agility is a matter not of technology but of design.
Agility, which is not simply the ability to develop new
features easily but also to integrate, test and deploy them,
depends on the modularity of the functional architecture.
† Synchronization of distributed heterogeneous data
sources is a difficult, longstanding problem, even with a
BPM or EAI infrastructure. As business processes interact
through the shared business objects, coherence of execution
requires classical mechanisms such as signalization,
exclusion or serialization.
† Business Process Operations is the hardest part of
deploying the target infrastructure shown in Fig. 1. The next
sections will show that monitoring is more difficult precisely
because the whole system is more robust. Instead of
producing clear-cut failures, asynchronous systems have
the ability to continue working when they are overloaded,
resulting in later shutdowns (when message queues are full)
of larger proportions. Incidents must be resolved on ‘active
systems’, which requires a new culture among the
‘operations’ staff.

Fig. 1. A Target EAI/BPM Architecture: IT systems (CRM, Billing, Provisioning, DW, C. Management) plugged onto an EAI infrastructure (transport, directories), under business process and business object management; each transition is defined through business object update distribution.
The work reported in this paper is a contribution to this
field of ‘operational issues of EAI and BPM integration
infrastructure’.
3. Optimization of application integration
The purpose of ‘strategic alignment’ is to manage and
optimize IT according to key business processes. Defining
the QoS of a process is not a difficult task: it is a matter of
throughput, latency and availability. Throughput measures
the number of individual processes that can be performed in
a fixed amount of time such as a day. Latency is the time
required for end-to-end execution of a single process, which
is most often what the customer sees and reflects the
perceived quality of service. Availability is the percentage
of the time when the system is able to start a new process. A
service contract states desired goals in terms of QoS,
yielding a service level agreement (SLA). Complying with
an SLA, and even monitoring it, is a difficult task because
the engineering (sizing and capacity analysis) is done on the
individual systems and the infrastructure. We do indeed
design and build individual systems with a fairly precise
characterization of their availability, their throughput and
their latency. However, reverse-engineering from the SLA
at the process level is difficult for a number of reasons:
(1) The availability of systems coupled in an asynchronous
manner is difficult to evaluate (it is definitely not the
product of availabilities, as with a synchronous
approach). It depends on many parameters, and it is
possible to build a reliable asynchronous system with
components that are far less reliable (using buffers and
replication).
(2) The sizing issue is not a stationary flow (which is itself
not so trivial because of the multi-commodity aspect)
but a stochastic problem concerning the resolution of
bursts. Real-life situations generate bursts of messages
that are difficult to process.
(3) The management of the flow is complex from a
functional point of view: it is a matter of protocol
rather than ‘simply’ a matter of the sizes of pipes and
engines. Implicit retries or acknowledge protocols have
an impact on performance.
(4) Processes are interdependent because they operate on
shared business objects. This dependency (a classical
problem with databases) may be handled through
serialization, which induces serious constraints on
QoS, or through interleaving of processes and
additional ‘re-synchronization’ steps to correct the
possible discrepancies.
The pure stochastic asynchronous problem is made
harder in real life owing to a combination of synchronous
and asynchronous communications, and a combination of
planned flows and stochastic flows which creates interesting
planning issues.
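The first difficulty listed above, that the availability of an asynchronous chain is not the product of component availabilities, can be illustrated with a toy calculation (the numbers below are made up for illustration, not taken from this paper):

```python
# Toy numbers (not from the paper): two systems, each available 95% of
# the time, chained synchronously vs. asynchronously with a buffer.

a1, a2 = 0.95, 0.95

# Synchronous call chain: both systems must be up at the same moment.
sync_availability = a1 * a2          # 0.9025

# Asynchronous chain: the first system posts messages to a buffer, so it
# can keep accepting work during downstream outages; availability to
# *start* a process is governed by the first system alone (assuming the
# buffer is large enough to absorb the outage).
async_availability = a1              # 0.95

print(round(sync_availability, 4), async_availability)
```

This is why buffering and replication let a reliable asynchronous system be built from far less reliable components.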
Fig. 2. OAI: the operational challenge of EAI. Nine infrastructure systems (CRM, Cust. Base, PFS, Prov., Network, Help, Accounts, Fraud, OrderMgt) are interconnected by a message bus, and goals are set at the process level.
The bottom line is that the operation of a ‘process-
oriented’ IT global system is a highly challenging problem.
We have coined a new name, OAI (Optimization of
Application Integration), for this challenge, which is
illustrated by Fig. 2. This figure, which is an abstraction
from a real IT infrastructure example, represents 9
component systems that are interconnected with a message
bus, and on which 5 processes are defined. These processes
have different business priorities and very different target
latencies. The challenge of OAI is to run the 5 processes so
that each SLA is met, even when the common resources are
stressed by unforeseen events.
This OAI challenge is not merely of scientific interest.
An instance of such a problem occurred at Bouygues
Telecom during the launch of a new commercial service.
The provisioning of this new service was a simple enough
operation, although distributed onto multiple systems. The
service guarantee given to the end-customer sounded very
conservative, since the actual processing time that was
needed was a small fraction of the guaranteed provisioning
delay. However, this service level agreement proved
difficult to meet with regularity and was shown to be very
sensitive to external perturbation. One of the most
worrisome problems was that low-priority processes could
upset the delivery of this new service provisioning, which
was considered a strategic initiative.
The OAI problem statement is different from a network
QoS issue. The constraints expressed in terms of business
processes could be rewritten as elementary constraints about
transport and service execution. Assuming that the
application servers meet their expected levels of service
and that the network and the associated middleware provide
the necessary QoS for transport, the business process would
be delivered accordingly. This approach is too restrictive
and fails for two reasons (loss of intention and harmful
strengthening of the QoS constraints):
† Pushing the constraints to the lowest levels makes the
problem harder; it may become impossible to meet these
QoS constraints, whereas the original business require-
ments were feasible.
† The explicit representation of business QoS objectives at
a global level makes it easier to capture priorities and
hierarchies of goals.
Although the rest of the paper will borrow adaptive
queuing techniques from computer networking approaches
[2], handling quality of service at the integration middle-
ware level represents a significant improvement. It is more
flexible: business QoS constraints may be used as
declarative parameters (or policies). It is more robust: a
higher-level perspective supports larger-scale reaction. The
relationship between OAI and adaptive network manage-
ment will be discussed in the last section.
4. Business process management and SLAs
The approach to application integration which we follow
is closely related to BPM (Business Process Management):
business processes are the central control mechanism for the
complete set of applications [3,5]. Since business processes
are ‘at the heart’ of the business, it is not surprising that
business objectives in terms of quality of service are
expressed through IT processes. At first glance, a process is
a directed graph of tasks that must be executed on a subset
of components. Tasks that belong to a chain must be
executed sequentially, while sub-chains may be executed in
parallel. A task that has many successors corresponds to a
‘fork’, while a task with multiple predecessors corresponds
to a ‘join’. More precisely, a process may be labeled with
additional constraints that specify concurrency or sequence
constraints, as well as error recovery strategies. It is also
common to introduce the recursive notion of a sub-process,
where a task itself is defined as a process. XML description
of processes is becoming a standard through the BPEL
initiative [6]. The BPEL4WS description of the five
processes presented in Section 5.2 is a straightforward
exercise of encoding a graph into XML. However, the
notion of processes’ QoS is not mature enough in the
middleware community to be standardized. The closest step
is the WSLA (Web Service Level Agreement) proposal [7],
which focuses on throughput and latency at the local
‘service call’ level. We argue for a standard that takes the
global view at the process level and that also takes failures
into account.
QoS is defined and measured with respect to a Service
Level Agreement. SLAs set the expected completion time,
with associated probabilities (of completion and completion
within the limit) for a given incoming distribution. SLAs are
common and similar to what is found in network routing or
order processing. The intuition behind an SLA is shown in
Fig. 3. The value produced for the company by the
completion of an order (the execution of a process) is a
decreasing function of the execution time. An SLA contract
may be seen as a simple ‘rectangle’ approximation of this
function. This value is compared to the cost of processing an
order, which is a function of the target completion time and
the required probability of reaching this target. Due to the
stochastic nature of the incoming load, the cost rises sharply
when the probability approaches 1. Setting up the SLA
requires solving a stochastic profit-maximization problem. Such
SLAs are commonly used because they are easy to define and
thus easy to monitor (Fig. 3).

Fig. 3. Service Level Agreements for Processes: the value ($) of a completed process as a decreasing function of latency, with the rectangle ‘SLA approximation’ around the target completion time.
Thus, an SLA sets the expected completion time, with
associated probabilities (of completion and completion
within the limit) for a given incoming distribution. Here is a
common example of an SLA, using a nested structure:
† The activation process is available 99.9% of the time.
† 80% of all activations are processed successfully within
2 h.
† 98% of all activations are processed successfully within
24 h.
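A nested SLA of this kind is straightforward to check against observed completion times. The sketch below is illustrative only (the function name and data are hypothetical, not from the paper):

```python
# Hypothetical sketch: checking a batch of completed activations against
# a nested SLA such as "80% within 2 h, 98% within 24 h".

def sla_compliance(completion_times_h, targets):
    """completion_times_h: completion time in hours per process instance.
    targets: list of (limit_hours, required_fraction) pairs."""
    n = len(completion_times_h)
    report = {}
    for limit, required in targets:
        within = sum(1 for t in completion_times_h if t <= limit)
        fraction = within / n
        report[limit] = (fraction, fraction >= required)
    return report

times = [0.5, 1.2, 1.8, 3.0, 26.0]            # five sample activations
print(sla_compliance(times, [(2, 0.80), (24, 0.98)]))
# {2: (0.6, False), 24: (0.8, False)} -> both targets missed in this batch
```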
In the context of order processing or call centers, SLAs of
the kind ‘X% of calls are handled within Y seconds’ are
commonly used because this approximation is acceptable
and because there is a wealth of knowledge on how to size a
queue network that is hit by a Poisson or an Erlang flow so
that this type of SLA is met (for instance, see [8]).
An OAI problem may be described with a queue network
([9,10]) and is very closely related to the field of System
Performance Analysis. As noted above, however, there are
many aspects that make this problem difficult and beyond
the reach of an analytic approach. The design of the proper
message handling algorithms is closer to the field of online
algorithms [11,12]. The proper routing of a set of different
process flows with different SLAs is very similar to the
problem of call routing according to SLAs in a call center
that we studied in a previous work [13]. It is also similar to
the network routing problems involving different flows with
different QoS levels [14]. The intractability of these
problems (for instance, see [15] on NP-completeness of
QoS routing) led us naturally to explore the use of
simulation, which we shall now present.
5. Finite-event simulation
5.1. Infrastructure
The infrastructure that we simulate corresponds to the
problem shown in Section 2. It consists of 9 technical
systems, which are plugged onto a message bus. The bus’
role is to transport asynchronous messages, which are
controlled by a processflow engine.
A technical system is able to perform tasks on business
objects, which are parts of the business process. Each
system consists of a number of parallel threads that may
perform the same tasks. It is plugged onto the bus using an
‘adapter’ that plays different roles in real life: XML parsing,
translation from one object model to another, selection of
relevant messages (filtering), and so on. In our model, the
adapter is part of the ‘technical system’ and is made of a
queue of incoming messages and a sorting algorithm that
will feed the system when one of its threads is available. The
processflow engine is responsible for the execution of
business processes.
Another technical system plugged onto the bus is the
monitoring system, which gathers process completion
statistics and issues QoS updates, at both the process and
the system levels (see Section 6). The statistics include the
average completion time for a process or a task, the fraction
of process executions that meet the SLA target, the number
of processes that failed, etc. This monitoring system also
runs ‘control rules’, which are triggered by these monitoring
events and may cause flow controlling actions that will be
explained in Section 6, by changing the ‘status’ of one or
many systems.
5.2. Events
We use a simple finite-event simulation engine. For a
general introduction to simulation and its use in investi-
gating distributed systems, see [16], which also contains an
insightful discussion about the order in which messages are
processed. A finite-event simulation is defined by the type of
events created for each object by the simulator and how they
are interpreted. Fig. 4 shows the diagram of the events
produced in our experiment, where an arrow designates an
event that originates from a source and is handled by its
destination.
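The core of such a finite-event simulator is a clock and a priority queue of timestamped events. The following is a minimal sketch under our own naming conventions (it is not the paper's implementation):

```python
import heapq

# Minimal finite-event engine sketch (class and event names are
# illustrative): events are kept in a heap ordered by timestamp;
# handlers may schedule further events, advancing the simulation.

class Simulator:
    def __init__(self):
        self.clock = 0.0
        self.queue = []          # heap of (time, seq, name, payload)
        self.seq = 0             # tie-breaker for equal timestamps
        self.handlers = {}       # event name -> handler(sim, payload)

    def schedule(self, delay, name, payload=None):
        heapq.heappush(self.queue, (self.clock + delay, self.seq, name, payload))
        self.seq += 1

    def run(self, horizon):
        while self.queue and self.queue[0][0] <= horizon:
            self.clock, _, name, payload = heapq.heappop(self.queue)
            self.handlers[name](self, payload)

sim = Simulator()
log = []
sim.handlers["StartProcess"] = lambda s, p: (log.append((s.clock, "start")),
                                             s.schedule(2.0, "EndProcess", p))
sim.handlers["EndProcess"] = lambda s, p: log.append((s.clock, "end"))
sim.schedule(1.0, "StartProcess")
sim.run(horizon=10.0)
print(log)   # [(1.0, 'start'), (3.0, 'end')]
```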
There are two kinds of events that have no source and are
produced as an input to the simulation engine: the
StartProcess and the SystemFailure events. StartProcess
events are generated for each process, according to
parameters that are given for the process or for the
Fig. 4. Event Schema: StartProcess and Failure events enter the simulation; StartTask, ReceivedTask, EndTask and TimeOutAlert events flow between the processflow engine and the systems; EndProcess events reach the monitor, which issues SetStatus events.
simulation scenario (see next section), such as the average
throughput, type of distribution (uniform or Poisson), etc.
We may now briefly describe the semantics associated
with each of the other event types:
– A StartTask event is produced by a processflow
engine each time a new step of a process is taken.
– An EndProcess event is produced when the
EndTask event received by the processflow
engine corresponds to the last step of a process.
– An EndTask event is produced by a system that has
completed a task.
– A ReceivedTask event is an ‘acknowledge’ event
produced by a system that has received a
StartTask event. It acknowledges the reception
of the message, so that the processflow engine
knows that the delivery was successful.
The engine uses a time-out mechanism to re-send the
StartTask message a given number of times until this
acknowledgment is received; otherwise, the process fails.
The time-out is implemented with a TimeOutAlert event that
the processflow engine sends to itself. The monitoring
system collects the EndProcess events to produce QoS
indicators. These may trigger rules that modify the behavior
of the technical system through a SetStatus event.
The OAI test scenario presented in Section 2 corresponds
to the following five processes (which are loosely inspired
by real business processes):
† P1 is a ‘subscription’ process, with high priority (1) and a
target latency which is not tight but strict.
† P2 is an automated barring process triggered by fraud
detection. Its path may be seen in Fig. 2 (blue process = priority 2), from Fraud to OrderMgt to Prov(isioning) to
Network.
† P3 is a lower-priority (3) barring that is triggered by the
Account Management system.
† P4 is a high-priority de-barring process once a customer
has paid a delinquent bill. P4 goes through the order
manager, then provisioning, network and customer
management.
† P5 is a medium-priority query process that is originated
by the helpdesk, with a short and tight completion time
(a customer representative is waiting).
We simulate three types of ‘errors’: when a system
randomly cannot complete a task, when it suffers from a
failure and when the unstable state produced by a previous
failure on a business object prevents the running of a new
process on this object. Error handling is managed through
the send/acknowledge protocol and through a resynchroni-
zation process which is run at regular intervals to repair all
business objects that have been involved in a previous
failure.
5.3. Monitoring and control
The goal of the simulation is to gather a number of
informative indicators, which are used both as a result of the
experiment (see the tables presented in the following
sections) and as an input of the control strategy used to
implement self-adaptive approaches. Monitoring is based
on a sampling approach (every 10 min) in which we produce
a status report containing:
– Throughput
– SLA performance (process and system levels)
– Utilization of systems
– Run-time (process and system levels)
– Failure rate (process and system levels)
The sampling interval is a trade-off between different
objectives: small enough to be reactive with respect to the
durations involved in the SLA (here from a few minutes to a
few hours), but large enough to yield statistically stable
results and to avoid overloading the systems. Our choice of
10 min is conservative and pragmatic; a more thorough
study remains to be done.
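A status report of this shape can be sketched as follows (field names and the sample data are assumed for illustration; the paper does not give the exact report format):

```python
from statistics import mean

# Illustrative 10-min sampling report: each sample aggregates the
# process completions observed since the last report.

def status_report(completions, sla_limit, interval_min=10):
    """completions: list of (latency_s, failed) tuples seen this interval."""
    ok = [lat for lat, failed in completions if not failed]
    return {
        "throughput": len(completions) / interval_min,   # per minute
        "sla_perf": sum(1 for lat in ok if lat <= sla_limit) / max(len(ok), 1),
        "avg_runtime": mean(ok) if ok else 0.0,
        "failure_rate": sum(1 for _, failed in completions if failed)
                        / max(len(completions), 1),
    }

r = status_report([(50, False), (130, False), (80, True)], sla_limit=120)
print(r)
```

In the simulator these reports are logged, so averages, standard deviations and min/max intervals can be produced over one or several runs.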
These status reports are logged, so that we can produce
average, standard deviation and min/max intervals, for one
or for multiple runs. The definition of ‘system SLA
performance’ is obtained as follows. For each process, we
build a theoretical model of ‘perfect execution’ that meets
exactly the SLA completion time, using expected task
execution and message delivery times. This is used both to
define an SLA indicator for each system, which is the ratio
of the theoretical task completion time (delivery + waiting
time + execution time) to the actual measured time, and as a
key for sorting messages. In the remainder of the paper, we
use the term SLA-routing for a simple adaptation of EDD
(Earliest Due-Date [17,18]) to the context of business
processes (see Section 8 for a comparison with network
routing). SLA-routing, therefore, means sorting messages
according to the ‘expected delivery time’, which is
interpolated using the SLA characteristic of the business
process to which the message contributes.
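The due-date interpolation behind SLA-routing can be sketched as follows. This is a minimal illustration under the assumption that the 'perfect execution' model assigns each task a fixed fraction of the SLA completion budget (function and parameter names are ours):

```python
# Sketch of SLA-routing's due-date computation (an EDD adaptation).
# Assumption: the 'perfect execution' model gives each task a share of
# the SLA completion time; the due date for task k is interpolated from
# the process start time.

def task_due_date(process_start, sla_completion_time, task_fractions, k):
    """task_fractions[i]: share of the SLA budget consumed by task i.
    Returns the time by which task k should start to stay on the SLA."""
    spent_before = sum(task_fractions[:k])
    return process_start + spent_before * sla_completion_time

# Process with a 2-hour (7200 s) SLA split over three tasks (20/50/30%):
due = task_due_date(process_start=0.0, sla_completion_time=7200,
                    task_fractions=[0.2, 0.5, 0.3], k=2)
print(due)   # task 2 should start once 70% of the SLA budget is spent
```

A system's mailbox is then sorted by these due dates, so the message whose process is closest to missing its SLA is served first.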
The monitoring system implements a simple rule
interpreter, which runs rules of the type ‘condition ⇒ action’, where:
† The condition is a conjunction of tests on the status and
the QoS indicators of the processes and systems (SLA
performance, size of queue, etc.).
† The conclusion sets the status of a system to a new value.
This framework is flexible and should enable us to
experiment with various flow control strategies. In this
paper, we have implemented only four possible states for
each system:
– CUT: the system is shut down.
– SLOW: the system is slowed, which means that
the EndTask events are produced with a delay
equal to the task duration. This is a crude way of
lowering the throughput of a system (we may also
shut down a few threads).
– REGULAR: this is the default mode.
– FAST: the message sorting strategy (see next
section) is changed to a priority + LCFS approach
that is more efficient for handling congestion.
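Such a 'condition ⇒ action' rule interpreter is simple to express. The sketch below uses hypothetical indicator names and thresholds (none of them come from the paper):

```python
# Hypothetical sketch of the rule interpreter: each rule tests QoS
# indicators and, when it fires, sets a system's status (which the
# monitor would propagate as a SetStatus event).

CUT, SLOW, REGULAR, FAST = "CUT", "SLOW", "REGULAR", "FAST"

rules = [
    # (condition on indicators, target system, new status)
    (lambda q: q["P1_sla"] < 0.9 and q["billing_queue"] > 100, "billing", FAST),
    (lambda q: q["billing_queue"] > 500, "crm", SLOW),
]

def apply_rules(indicators, statuses):
    for condition, system, status in rules:
        if condition(indicators):
            statuses[system] = status
    return statuses

statuses = {"billing": REGULAR, "crm": REGULAR}
indicators = {"P1_sla": 0.85, "billing_queue": 150}
print(apply_rules(indicators, statuses))
# {'billing': 'FAST', 'crm': 'REGULAR'} -> only the first rule fires
```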
6. Routing Strategies
6.1. Routing algorithms
The term ‘routing’ is used here to designate the message
passing strategy. Control of the ‘routing’ is handled by the
processflow engine and is not meant to be modified. On the
other hand, the order in which the messages are processed
by each system is a design parameter, which we designate as
a ‘routing strategy’ in this section, and which could be named
more precisely a ‘mailbox sorting strategy’. The goal of this
section is to study the effect of various sorting algorithms,
that is, the algorithm used to select the next task that will be
run by a system whenever one of its computational threads
becomes available. We have implemented and compared
the following algorithms:
1. FCFS (First Come, First Served) is the default strategy
used by most EAI platforms. Standard queuing theory
tells us that it is indeed the best overall strategy to
minimize waiting time for a simple queue [8]. It is also
fair, by definition, but does not take priorities into
account. One of the early motives for this work was to
see how priorities could be introduced.
2. LCFS (Last Come, First Served), a standard algorithm
that is not much used in EAI environments because it is
not fair. We tested it as a reference point since it is quite
robust, and produces interesting results in case of
congestions.
3. RSS (Random Selection for Service) is another classical
approach from queuing theory [8] which we thought
could be interesting for studying the opportunity for
randomized algorithms [19].
4. SLA (Service Level Agreement) uses an estimated start-
time based on a linear interpolation of the SLA, as
discussed in the previous section, using the process start
time as a reference. The task message selected is the one
with the earliest expected start time.
5. PRF (Priority and FCFS) is a combination of priority
sorting (main) and FCFS sorting. We tried various
combinations, but the one that works best is the simplest
(lexicographical ordering: priority first and then FCFS
for ties).
6. PRL (Priority and LCFS) uses a similar combination of
priority and LCFS.
7. PRSS (Priority and SLA at System level) uses a
combination of priority and SLA time that is applied at
the system level. That means that the linear interpolation
of the SLA time is used only to estimate the expected
waiting times in the queue, using the issuing of the
StartTask event as a reference.
8. PRSP (Priority and SLA at Process level) is similar, but
the linear extrapolation is used to compute, for each
StartTask event, the time that is expected between the
process start and the beginning of the given task (same as
the #4 algorithm).
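Several of these strategies reduce to a sort key over the message queue. The sketch below illustrates a few of them with assumed message fields (arrival time, business priority, interpolated SLA start time); it is a simplification of the behaviors described above:

```python
from dataclasses import dataclass

# Illustrative sort keys for some of the mailbox-sorting strategies.
# Message fields are assumptions for this sketch, not the paper's model.

@dataclass
class Msg:
    arrival: float        # time the message entered the queue
    priority: int         # 1 = highest business priority
    sla_start: float      # interpolated expected start time (SLA-routing)

strategies = {
    "FCFS": lambda m: m.arrival,
    "LCFS": lambda m: -m.arrival,
    "SLA":  lambda m: m.sla_start,                 # earliest due date first
    "PRF":  lambda m: (m.priority, m.arrival),     # priority, then FCFS ties
    "PRSP": lambda m: (m.priority, m.sla_start),   # priority, then SLA time
}

queue = [Msg(1.0, 2, 5.0), Msg(2.0, 1, 9.0), Msg(3.0, 1, 4.0)]
nxt = min(queue, key=strategies["PRF"])
print(nxt.arrival)   # 2.0: the oldest highest-priority message
```

When a thread becomes available, the system simply pops the minimum of its queue under the chosen key.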
The first step is to simulate a typical event flow, with
various throughput densities. We used 5 scenarios that may
be described as follows:
† S1: Our reference scenario. The throughput has been
adjusted so that all systems are reasonably busy, with
utilization ratios from 30% to 80%. The load is equally
balanced between the 5 process flows (P1 to P5). The
event distribution is uniform in time, with a 5-hour time
horizon.
† S11: The event rate is increased by 10% during the
first two hours.
† S12: The event rate is increased by 30% during the
first two hours. Many systems reach a 100% utilization rate.
† S13: The event rate is increased by 50% during the
first two hours. This is a ‘crisis scenario’ since this flow is too
large for the systems to handle. Since we suppose that
the sizing of the EAI system has been done properly,
such crisis situations are unlikely, except for bursts,
which will be studied in the next section.
Table 1
Algorithms × scenarios: regular flow

            S1 (%)  S11 (%)  S12 (%)  S13 (%)  S2 (%)
FCFS   P1     98      98       83       28       98
       P2     98      98       59       15       98
       P3     98      98       76       22       98
       P4     88      88       47       12       93
       P5     84      84       46       13       90
LCFS   P1     98      98       92       75       98
       P2     98      96       85       66       98
       P3     98      98       91       73       98
       P4     93      90       72       52       94
       P5     96      94       86       79       96
RSS    P1     98      98       90       67       98
       P2     98      97       82       56       98
       P3     98      98       88       64       98
       P4     94      90       67       41       94
       P5     95      93       83       72       95
SLA    P1     98      98       82       26       98
       P2     98      98       75       18       98
       P3     98      98       78       22       98
       P4     98      98       69       15       98
       P5     99      99       74       20       99
PRF    P1     98      98       98       98       98
       P2     97      96       81       48       98
       P3     97      94       44        2       98
       P4     98      98       95       86       98
       P5     93      89       69       45       93
PRL    P1     98      98       98       97       98
       P2     97      96       83       73       97
       P3     97      92       54       59       96
       P4     98      98       96       97       98
       P5     95      93       82       70       95
PRSS   P1     98      98       98       98       98
       P2     97      96       83       50       98
       P3     97      94       44        3       98
       P4     98      98       98       97       98
       P5     96      94       79       52       96
PRSP   P1     98      98       98       98       98
       P2     97      96       82       50       98
       P3     97      92       39        1       98
       P4     98      98       98       97       98
       P5     96      94       76       50       97
Y. Caseau / Advanced Engineering Informatics 19 (2005) 199–211
† S2: This scenario is similar to S1 except that the event
arrival times follow a Poisson distribution.
Table 1 contains the preliminary results of running these
5 scenarios against the 8 algorithms presented earlier.
Each experiment is summarized by the average SLA
satisfaction (% of processes that were completed within the
SLA maximum time). These numbers are given for the five
processes; the P1 and P4 lines are in boldface type to
indicate the higher priority. A real-life situation could be
more complex, with different SLA percentage objectives
(we could say that P1 completion time must be met 95% of
the time, and P2 only 90% of the time). Since this is
somehow redundant with process priorities, we assume here
that the goal of the routing strategy is to ensure that, for
each process, SLA satisfaction is as close to 100% as
possible, starting with higher-priority processes.
The first lesson that can be drawn from these experiments
is that priority routing works. The four algorithms that use
process priority as part of the sorting strategy are able to
maintain the SLA of high-priority processes much better
than the first four algorithms. These are very positive and
encouraging results, since the ability to deliver the key
processes when congestion occurs is a strong business
objective.
The second lesson is that FCFS is not a good default
algorithm. Both RSS and LCFS do better as soon as the
event flow becomes tight (we did not report results for
smaller event flows since all algorithms perform well,
although it may be noted that the average duration is indeed
lower for FCFS). On the other hand, SLA time is a valuable
technique: without taking priority into account, it is the
best overall approach when the event flow can be handled
by the EAI system.
When the event load becomes too great, PRSS and PRSP
do a good job of handling the high-priority processes, but
overall satisfaction is not as good as that obtained with PRL.
This is even more marked if throughput increases to many
times the usual rate, in which case the SLA satisfaction rate
drops to 0% with all algorithms except LCFS and PRL. This is
because Last Come, First Served has the implicit behavior
of serving a few customers well and the rest very poorly, as
opposed to other (fairer) approaches that treat everyone
poorly. This aspect will be discussed in the next two sub-
sections.
It is interesting to note that these results are not very
sensitive to the event distribution. They are also fairly
stable, as the results obtained by running a scenario 10 or
100 times are very similar. For this reason, we do not report
standard deviation numbers in these tables.
6.2. Burst distribution
We shall now measure the ability to deal with a burst,
which corresponds to the ‘self-adaptive’ characteristic that
we expect from the EAI infrastructure. We use the following
scenarios:
– S3: This scenario is a combination of the event
flow used in S1 (reference) and a 20-min burst of
P1 processes, doubling the input rate for P1.
– S31: This scenario is a variation of S3, where the
burst lasts for 40 min.
– S32: Same, but the burst lasts for one hour.
– S4: This scenario is similar to S3, but the burst
occurs with a lower-priority process, P3.
– S42: Similar to S32 (P3 burst).
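A burst scenario of this kind can be generated from exponential inter-arrival times, with the event rate multiplied inside the burst window. This is only a sketch of the idea; the function name, parameters, and rates are illustrative, not those of our simulator:

```python
import random

def arrivals(rate_per_min, horizon_min, burst=(0.0, 0.0, 1.0), seed=0):
    """Generate Poisson-style arrival times over [0, horizon_min).
    Inside [burst_start, burst_end) the rate is multiplied by a factor,
    as in the S3 scenario (a 20-min burst doubling the P1 input rate)."""
    start, end, factor = burst
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        # current rate depends on whether we are inside the burst window
        rate = rate_per_min * (factor if start <= t < end else 1.0)
        t += rng.expovariate(rate)   # exponential inter-arrival time
        if t >= horizon_min:
            return times
        times.append(t)

# Illustrative S3-like flow: a 5-hour horizon with a 20-min burst at t=60.
p1_events = arrivals(rate_per_min=1.0, horizon_min=300, burst=(60, 80, 2.0))
```

S1-style uniform arrivals would instead space events deterministically; the exponential inter-arrival times above correspond to the Poisson variant (S2).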
We compare only the last four algorithms, since the
previous section made it clear that taking priority into
account for the sorting algorithm is indeed a good idea.
Table 2 confirms our earlier results: the combination of
priority and SLA sorting is the best approach. The S4*
Table 2
Algorithms × scenarios: bursts

            S3 (%)  S31 (%)  S32 (%)  S4 (%)  S42 (%)
PRF    P1     98      96       90      98      98
       P2     80      66       54      97      97
       P3     70      47       30      84      47
       P4     88      79       70      98      98
       P5     81      70       60      92      90
PRL    P1     97      95       92      98      98
       P2     83      70       58      97      96
       P3     74      53       36      86      68
       P4     92      86       80      98      97
       P5     84      73       64      94      92
PRSS   P1     98      98       95      98      98
       P2     82      67       54      97      97
       P3     72      49       31      85      49
       P4     96      93       86      98      98
       P5     84      73       62      96      95
PRSP   P1     98      97       91      98      98
       P2     82      68       54      97      97
       P3     70      45       29      82      45
       P4     97      90       83      98      98
       P5     83      72       61      95      94

Table 3
Algorithms × scenarios: component failure

            S5 (%)  S51 (%)  S6 (%)  S61 (%)
PRF    P1     91      76      98      98
       P2     78      62      87      79
       P3     65      37      94      86
       P4     84      70      98      98
       P5     82      72      92      92
PRL    P1     90      78      98      98
       P2     79      63      87      79
       P3     71      50      94      88
       P4     86      75      98      98
       P5     85      76      95      95
PRSS   P1     90      77      98      98
       P2     79      62      88      79
       P3     66      38      95      87
       P4     88      73      98      98
       P5     85      76      96      95
PRSP   P1     91      75      98      98
       P2     78      61      88      79
       P3     64      33      94      85
       P4     86      70      98      98
       P5     85      75      96      95
experiments show that these four algorithms can handle a
burst of low-priority messages very well. We may see that
for small bursts there is a tiny advantage of PRSP over
PRSS, but the overall winner is PRSS.
Here also, we can push the system to the limit with a
bigger burst (3 times the regular P1 flow, the S33 scenario),
and we find that PRL does a better job at getting at least a
third of the P1 process calls completed within the target
times, while SLA satisfaction drops to 0% with the other
approaches.
The conclusion is that PRSS is the best self-adaptive
algorithm, but that PRL may be seen as a more robust
algorithm, depending on whether further delay of an order
that has not met its SLA completion time is considered a
good practice.
6.3. Component failure
Each system is characterized by its availability, which
is the probability that the system is available to perform
a task. Non-availability may have multiple causes, from
technical and hardware failure to software and functional
errors. We include in our model, and in all experiments
that have been reported so far, the random generation of
service failure which captures these small ‘glitches’, most
of which are caused by human error (typing errors when
customer parameters are input). In this section, we shall
study the impact of a complete system failure, which is
the shutdown of the system for a given duration (a few
minutes to a few hours). When a system is unavailable,
there are a number of effects: first, the messages in the
system queue wait to be executed; second, the
infrastructure does not receive the acknowledgment message,
so it stops sending new messages and stores its own
process states; and lastly, after a time-out delay,
processes are considered to have failed.
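A toy sketch of two of these effects, namely the messages waiting in the queue and the time-out that turns a late process into a failure, is given below; the time-out value and all names are illustrative, and the orchestrator's withholding of new messages is not modeled:

```python
# Illustrative time-out after which a waiting process is considered failed.
TIMEOUT = 30.0  # minutes

def drain_after_restart(queue, down_until, service_time):
    """queue: list of (send_time, process_id) messages that accumulated
    while the system was down; the system restarts at `down_until` and
    serves messages in FCFS order, one every `service_time` minutes.
    Returns (served, failed) lists of process ids."""
    served, failed, clock = [], [], down_until
    for send_time, pid in sorted(queue):      # messages waited in the queue
        clock = max(clock, send_time) + service_time
        if clock - send_time > TIMEOUT:       # time-out: process has failed
            failed.append(pid)
        else:
            served.append(pid)
    return served, failed
```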
Our goal here is to measure first the ability to deal with
the growth of the queues, and then the ability to handle the
backlog when the system restarts. We use the following
scenarios:
– S5: We simulate the failure of a key component
(Order Management) for 15 min, which means
that all processes are impacted.
– S51: We simulate a similar failure for a duration
of 30 min.
– S6: We simulate the failure of the Fraud system,
which impacts only the second process, for
15 min.
– S61: Same as S6, but for a duration of 30 min.
Table 3 presents the results obtained with these scenarios
and the four priority-based algorithms.
We see that a small failure causes behavior similar to
that observed for a burst: all priority-based algorithms do
a good job of preserving high-priority processes, and the
SLA-based algorithm does particularly well. However, if
the failure is longer, the LCFS behavior ensures that a
higher proportion of the processes are run within the
target completion time. It is interesting to see that this
behavior is similar to what happens in a crisis congestion
situation today: at some point of combined delay, the
operations staff decides to extract a backlog of ‘old
orders’ to be processed off-line during the night/week-
ends, so that the on-line system recovers a typical
throughput.
6.4. Summary
We may summarize our main findings as follows:
† The first lesson to be drawn from these experiments is
that priority routing works. The four algorithms that use
process priority as part of the sorting strategy are able to
maintain the SLA of high-priority processes much better
than the first four algorithms.
† The second lesson is that FCFS is not a good default
algorithm. Both RSS and LCFS do better as soon as the
event flow becomes tight.
† On the other hand, SLA time is a valuable technique:
without taking priority into account, it is the best
overall approach when the event flow can be handled by
the EAI system.
† The combination of priority and SLA sorting is the best
approach. The experiments show that these algorithms
can handle a burst of low-priority messages very well.
For small bursts there is a tiny advantage of PRSP over
PRSS, but the overall winner is PRSS.
7. Control strategies
7.1. Flow rules
Once the ‘routing strategy’ is set up, the other approach
to controlling the behavior of the whole EAI system is to
control the flow. Figuratively speaking, we may view each
adapter as a faucet that can be turned on and off, or even set
to a ‘reduced’ position, according to dynamic rules. The idea is to
implement rules that would say ‘there is no point in sending
more water to a portion of the pipes that is already
congested’. To experiment with this approach, we have
implemented, as explained in Section 4, a rule engine within
the monitoring system.

Table 4
Flow rule sets × congestion scenarios

                  S33                S51                S2
             FCFS (%)  PRSP (%)  FCFS (%)  PRSP (%)  FCFS (%)  PRSP (%)
No Rules  P1    38        70        56        75        98        98
          P2    31        44        48        61        98        98
          P3    35        22        50        33        98        98
          P4    29        66        44        70        93        98
          P5    44        67        57        75        90        97
RS1       P1    46        70        60        75        98        98
          P2    23        44        25        61        98        98
          P3    42        23        55        33        98        98
          P4    33        65        47        70        93        98
          P5    31        39        35        52        81        92
RS2       P1    52        70        62        75        98        98
          P2    25        43        29        61        98        98
          P3    46        23        55        33        98        98
          P4    25        65        46        69        93        98
          P5    33        66        35        66        90        97

The first step was to implement the
following sets of rules:
– RS1: When the QoS of a given system X falls
below 90% of its SLA level (cf. Section 3), we
reduce the flow of systems that are providers of X
(i.e., that produce EndTask messages that will
trigger new tasks to be executed by X) and whose
priority is lower than that of X. The priority of a
system is defined simply as the maximum priority
of all processes to which it contributes (this is
actually a ‘min’, since high priority is 1 and low
priority is 3). A dual rule restores the default
setting once the QoS of X climbs back to 90%.
– RS2: This is a similar rule, but the triggering
condition is based on processes. When the QoS of
a given process P falls below 90%, we reduce the
flow of all systems that have lower priority than P
and that are providers of a system that supports P.
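RS1 can be sketched as a simple function over the system topology. All data structures below are illustrative; only the 90% threshold and the priority convention (1 = high, 3 = low, so a lower priority means a larger number) come from the rule itself:

```python
# A sketch of the RS1 flow rule, assuming simple dictionaries for the
# integration topology (not the actual rule engine of the paper).
def rs1_actions(qos, providers, system_priority):
    """qos: {system: QoS as a % of its SLA level};
    providers: {system: [upstream systems feeding it tasks]};
    system_priority: {system: numeric priority, 1 = high, 3 = low
                      (the min over the priorities of its processes)}.
    Returns the set of upstream systems whose flow should be reduced."""
    reduce_flow = set()
    for x, level in qos.items():
        if level < 90:                       # QoS of X below 90% of its SLA
            for p in providers.get(x, []):
                # "lower priority than X" means a larger numeric value
                if system_priority[p] > system_priority[x]:
                    reduce_flow.add(p)
    return reduce_flow
```

The dual rule (restoring the default setting once the QoS climbs back to 90%) would simply be the complementary check.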
Table 4 reports the results of the associated experiments
(the % of process runs that meet the SLA). We focused on the
S33 and S51 scenarios, because they were the two that
placed the EAI system under stress, and included the S2
scenario to make sure that the rules would have no adverse
effect on a more regular situation. We compare the effect of
the rule sets on two routing strategies: FCFS (the default
solution for most EAI systems, including ours) and PRSP (to
see whether the rules add value to a smarter routing
strategy). The lines for P1 and P4 are in boldface type to
indicate their higher priority.
We see that the control flow rules bring an improvement
when the routing is straightforward, but are of no value with
priority-based routing. This is not really surprising, since the
goal of the improved routing is precisely to deliver the
important messages without being bothered by other flows
of lower-priority messages. Thus, there is no value in
reducing or cutting these lower-priority flows in case of
congestion.
It is difficult to summarize a long sequence of
experiments with one table: we tried various ways of
controlling the flow ‘at the faucet’, from a plain cut-off
(on/off) to more subtle schemes to reduce the flow. It turns
out that we were not able to produce any stable
improvement when the rules were added to the PRSP
algorithm. It must be repeated that it is important to run
these stochastic experiments many times, since for one
given run, there is usually a set of rules that shows a small
improvement.
7.2. Routing rules
Since the reduction of flow did not seem to provide much
improvement, we took a different approach and decided to
switch the routing strategy when severe congestion occurs.
We added a new status (FAST) to each ST that directs
the system to ‘run all incoming messages using the PRL
sorting strategy’. We then implemented the following sets of rules:
– RS3: When the QoS of a given system X drops
below 95%, the system is switched to FAST status.
The system resumes normal status once the QoS
climbs back above 95%.
– RS4: When the QoS of a given process P drops
below 95%, all systems that support this process are
switched to FAST status.
– RS5: A system is switched to FAST status
whenever its mailbox size exceeds 100.
Obviously, the triggering size is a constant that
depends on the volume processed by the EAI and
the number of connected systems.

Table 5
Routing rule sets × congestion scenarios

                  S33                S51                S2
             PRF (%)  PRSP (%)  PRF (%)  PRSP (%)  PRF (%)  PRSP (%)
No Rules  P1    69       70        76       75        98       98
          P2    42       44        62       61        98       98
          P3    23       22        37       33        98       98
          P4    63       66        70       70        98       98
          P5    65       67        72       75        93       97
RS3       P1    74       75        76       74        98       98
          P2    69       69        69       68        97       98
          P3    58       59        65       64        98       98
          P4    75       77        73       72        98       98
          P5    72       72        79       80        92       96
RS4       P1    71       76        76       74        98       98
          P2    64       68        66       64        98       98
          P3    52       57        59       59        98       98
          P4    69       74        72       69        98       98
          P5    67       70        78       78        93       97
RS5       P1    77       78        77       75        98       98
          P2    74       73        74       66        98       98
          P3    65       63        65       57        98       98
          P4    77       80        77       72        98       98
          P5    72       74        72       80        93       97
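RS5 is simple enough to sketch in a few lines. The threshold of 100 is the one quoted above; the class layout and the FCFS default used in normal mode are illustrative:

```python
# A sketch of RS5: a purely local rule that switches a system to FAST
# (PRL sorting) when its mailbox grows beyond a threshold.
class SystemAdapter:
    FAST_THRESHOLD = 100  # depends on EAI volume and number of systems

    def __init__(self):
        self.mailbox = []       # pending messages: {"priority", "arrival"}
        self.status = "NORMAL"

    def select_next(self):
        # RS5 trigger: local condition, local conclusion
        self.status = "FAST" if len(self.mailbox) > self.FAST_THRESHOLD else "NORMAL"
        if self.status == "FAST":
            # PRL: priority first (1 = high), then Last Come, First Served
            key = lambda m: (m["priority"], -m["arrival"])
        else:
            # default routing; FCFS is used here for illustration
            key = lambda m: m["arrival"]
        self.mailbox.sort(key=key)
        return self.mailbox.pop(0)
```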
This idea is very similar to the ‘active scheduling’
principle presented, for instance, in [20]. In this instance, the
active router is a programmable router that can execute a
mix of FCFS, WFQ (Weighted Fair Queuing) and JEDD
(Jitter Earliest Due-Date) routing strategies. The router is
controlled by an ‘intelligent agent’ that runs on top of the
CORBA middleware.
In our case, we found that the most efficient adaptive
scheduling method for dealing with congestion was the PRL
method. This new approach proved to be more successful
than using flow rules and provided an improvement for all
routing algorithms (except PRL itself, by construction).
Table 5 reports the results of these experiments, using the
same three scenarios as in the previous table. We compared
the three rule sets using PRF and PRSP, to see whether the
combination of ‘improved routing’ and ‘routing rule’ was
necessary or whether the simple PRF (priority & FCFS)
strategy was sufficient as the default strategy.
The first observation drawn from these results is that this
approach works: rules do not degrade quality of service for
regular scenarios and bring an improvement when dealing
with congestions. Second, the simpler RS5 rules actually
provide the greatest improvement. This is interesting since
RS5 rules are extremely simple to implement (purely local
conditions and conclusions). Lastly, ‘smart routing’ pays, in
the sense that the advantage provided by SLA time
prediction over FCFS is not lost by adding routing rules.
8. Comparison with related works
There is an active but small research community which
focuses on autonomic, self-adaptive and fault-tolerant
middleware (e.g., the Chameleon project [21]). The main
difference with our work is that such approaches aim to
build new, specialized middleware, whereas our goal is to
use standard commercial products that have already been
deployed.
Similarly, many of the main objectives stated in the first
sections of this paper are shared with the BPM community.
As mentioned earlier, the need for a QoS description
standard has produced many proposals (e.g., [22] and [7]).
However, the introduction of QoS constraint into XML
specifications in the BP*L family of languages [3] is still at
a very preliminary stage, and no implementation of a QoS
adaptive mechanism has yet been proposed in the process-
flow community.
On the contrary, the approach presented in this paper
draws heavily from previous work on computer networking
[2]. The various scheduling methods with which we have
experimented are directly inspired by techniques such as
Weighted Round Robin, Weighted Fair Queuing [23,24]
and Delay-Earliest Due-Date. WFQ is a tractable approxi-
mation of Generalized Processor Sharing, an idealized fluid
model, which relies on the computation of a virtual time for
each type of packet traffic. DEDD is similarly based on the
computation of expected due-dates based on service levels.
There are many other relevant techniques that could be
borrowed and used in the context of integration middleware.
For instance, reservation methods (such as RSVP [2,14])
could be implemented to guarantee a part of the bandwidth
for high-priority processes. We tried this approach when
studying SLA-routing for call centers [13], but found that
the additional robustness was gained at the expense of
overall efficiency, a trade-off that is also known in computer
networks (we also experimented with non-work-conserving
approaches, but we lack stochastic information about the
message distribution required to make ‘idle waiting’
worthwhile). Similarly, the RED technique (Random
Early Detection/Discard) could apply, although message
loss is not an option. The equivalent of dropping a packet in
our context would be a long-term postponement, which is
strangely close to the idea of using LCFS during congestion.
More generally, the DiffServ architecture for IP networks is
relevant to our work since it deals with the delivery of SLAs
for complex applications (using, for instance, RSVP
routing).
One must not conclude from the above arguments that
adaptive middleware and adaptive networks are competing
technologies. Rather, they complement each other and
should be used in combination. The point that we made
earlier is that explicit ‘business’ QoS (i.e., associated with
business processes) modeling yields benefits in addition to
those that can be obtained with a ‘smart’ network. Another
benefit of the work described in this paper is that it is
intended for deployment on an existing IT network, and, as
such, is easier to deploy than a radical change in network
and middleware technology.
The limits of a classical layered approach when facing the
needs of new kinds of applications are well known: lack of
flexibility and lack of efficiency. The active network
approach [20,25] has some interesting similarities with
our own work, since the authors study a CORBA
middleware on top of an adaptive network. The implemen-
tation of high-level QoS constraints is done through a layer
of CORBA agents that use adaptive routers to control the
flow of messages. Similar approaches are found in the
context of wireless networks [26]. In both cases, the major
driver for QoS-awareness is the presence of multimedia
applications (hence the importance of jitter). What sets our
work apart from most recent adaptive network proposals is
the nature of the ‘QoS constraints’ that we want to handle.
Business processes have a richer and more interesting
structure than most multimedia services, and delivery of the
expected QoS for business processes may take advantage of
this structure.
9. Conclusion
This work is a first step towards our goal of self-
adaptive, self-healing integration infrastructures. This
goal makes our approach a modest contribution to the
field of autonomic computing [27]. From a business
perspective, the leading characteristic of a self-adaptive
infrastructure is the ability to balance resources according
to business priorities. This objective is what motivated
our research in the first place. As far as ‘self-healing’ is
concerned, there is still a long way to go, even though we
find that some routing strategies are more tolerant of the
congestions caused by a system failure than others. On
the other hand, simulation is indeed a powerful approach
for studying and understanding complex systems, such as
application infrastructures [28]. The contribution of this
paper may be stated as follows: we demonstrate how
‘business QoS’ SLAs could be used as declarative
policies in an adaptive infrastructure. We close the gap
between a very high-level vision, namely the business
processes, and the detailed implementation of a message-
passing integration infrastructure. The main findings may
be summarized as follows:
(1) Priority handling works: it is possible and fairly
simple to take process priority into account for
routing messages, and the results show a real
improvement in the ability to focus on higher-priority
processes when congestion occurs.
(2) The routing (mailbox sorting) algorithm matters: the
more sophisticated SLA projection technique showed
a real improvement over an FCFS policy, even
without taking priorities into account.
(3) Control rules are interesting, but they are secondary
to the routing policy: it is more efficient to deal with
congestion problems with a distributed routing
strategy than with a comprehensive rule schema.
Rules should complement the routing scheme, and
since there is no value in doing the same work twice,
the priority handling is best done at the node level.
The next step is to pursue our research in the following
directions:
– A more thorough study with simulation scenarios
which are closer to ‘real life’: the test EAI
problem used in this paper is representative of a
true EAI infrastructure, but the principles identified
here need to be validated with improved EAI
scenarios that are accurate reproductions of real-
life situations and span longer time periods.
– A better understanding of system failures: the
failure scenarios would become much more
interesting if we included some of the behavior
of real systems, which means that we need to
enrich our model. This starts with gathering data
and analyzing real-life ‘failures’ of our EAI
infrastructure.
– A realistic model for representing synchronization
of distributed objects. This is necessary to
evaluate the impact of message ordering on data
consistency. The ability to use improved routing
algorithms depends on the synchronization strat-
egy used to manage distributed copies of business
objects.
References
[1] Garlan D. Self-healing systems: some resources. http://www.cs.cmu.edu/~garlan/17811/resources.html.
[2] Keshav S. An engineering approach to computer networking. Addison-Wesley Professional; 1997.
[3] Smith H, Fingar P. Business process management: the third wave. Meghan-Kiffer Press; 2002.
[4] Caseau Y. Urbanisation et BPM (Enterprise Architecture and Business Process Management). Paris: Dunod; 2005.
[5] Wietrzyk VI, Takizawa M. Distributed workflows: a framework for electronic commerce. J Inform Sci Eng 2003;19.
[6] Andrews T, et al. Specification: Business Process Execution Language for Web Services, version 1.1. http://www.ibm.com/developerworks/library/ws-bpel/; 2003.
[7] Ludwig H, Keller A, Dan A, King R, Franck R. Web Service Level Agreement (WSLA) language specification, version 1.0. http://www.research.ibm.com/wsla/.
[8] Gross D, Harris C. Fundamentals of queuing theory. New York: Wiley-Interscience; 1998.
[9] Lazowska E, Zahorjan J, Scott Graham G, Sevcik K. Quantitative system performance: computer system analysis using queuing network models. Prentice-Hall; 1984.
[10] Bolch G, Greiner S, de Meer H, Trivedi K. Queueing networks and Markov chains. New York: Wiley-Interscience; 1998.
[11] Fiat A, Woeginger G, editors. Online algorithms: the state of the art. Lecture notes in computer science, vol. 1442. Springer; 1998.
[12] Benoist T, Bourreau E, Caseau Y, Rottembourg B. Towards stochastic constraint programming: a study of on-line multi-choice knapsack with deadlines. In: Proceedings of CP'2001. Lecture notes in computer science, vol. 2239; 2001.
[13] Caseau Y. Declarative ACD routing with service level optimization. Technical memorandum, White Pajama; July 2001.
[14] Halabi S. Internet routing architectures. Cisco Press; 2001.
[15] Wang Z. Quality of service routing for supporting multimedia applications. IEEE JSAC 1996;14(7).
[16] Fujimoto R. Parallel and distributed simulation systems. Wiley-Interscience; 2000.
[17] Golestani SJ. A stop-and-go queuing framework for congestion management. In: Proceedings of ACM SIGCOMM '90, Philadelphia; 1990.
[18] Verma D, Zhang H, Ferrari D. Delay jitter control for real-time communication in a packet switching network. In: IEEE TriCom, Chapel Hill, NC; 1991.
[19] Motwani R, Raghavan P. Randomized algorithms. Cambridge University Press; 1995.
[20] Hussain SA, Marshall A. An active scheduling policy for programmable routers. In: 16th UK Teletraffic Symposium: Management of QoS, the New Challenge, Harlow; 2000.
[21] Bagchi S, et al. The Chameleon infrastructure for adaptive, software-implemented fault tolerance. In: 17th Symposium on Reliable Distributed Systems; 1998.
[22] Frolund S, Koistinen J. QML: a language for quality of service specification. Hewlett-Packard Software Technology Laboratory, report HPL-98-10; 1998.
[23] Demers A, Keshav S, Shenker S. Analysis and simulation of a fair queuing algorithm. Journal of Internetworking Research and Experience 1990. p. 3–26.
[24] Bennett JCR, Zhang H. WF2Q: worst-case fair weighted fair queuing. In: IEEE INFOCOM '96, San Francisco; 1996.
[25] Chieng D, Marshall A, Parr G. SLA-driven flexible bandwidth reservation negotiation schemes for QoS-aware IP networks. In: Management of multimedia networks and services: 7th IFIP/IEEE international conference, MMNS, San Diego. Lecture notes in computer science, vol. 3271. Berlin: Springer; 2004.
[26] Raatikainen K. Wireless Internet: challenges and solutions. University of Helsinki, Series of Publications B, report B-2004-3; 2004.
[27] Ganek A, Corbi T. The dawning of the autonomic computing era. IBM Syst J 2003;42(1).
[28] Singleton P. Performance modelling: what, why, when and how. BT Technol J 2003;20(3).