Design and Evaluation of an Autonomic Workflow Engine
Thomas Heinis, Cesare Pautasso, Gustavo Alonso
Dept. of Computer Science
Swiss Federal Institute of Technology (ETHZ)
The 2nd IEEE International Conference on Autonomic Computing (ICAC-05)
March 15th, 2008
Seo, Dongmahn
2/47
Contents
- Introduction
- System Background
- System Architecture
- Autonomic Capabilities
- System evaluation
- Conclusion
3/47
Contents
- Introduction
- System Background
- System Architecture
- Autonomic Capabilities
- System evaluation
- Conclusion
4/47
Introduction
- Motivation
- Related Work
- Contribution
5/47
Motivation
- Workflow management systems
  - e-commerce
  - virtual laboratories
  - DNA sequencing
  - scientific computing
  - Grid computing
  - idea of process-based Web service composition
6/47
Motivation (cont.)
- Workflow engines
  - open environment
  - unknown workload
  - difficult to choose between
    - a centralized solution
    - a distributed implementation of the engine
- problem of configuring the system in an optimal way
  - having a system administrator in charge of manually monitoring and reconfiguring the system is NOT a feasible solution
  - considering the number of parameters involved and the variability of the workload
7/47
Related Work
- Decentralization of workflow process execution
  - important area of research
  - support business processes
  - lead to higher scalability
  - introduces several problems
    - lack of a global view over the process
    - scalability and reliability problems per se
- To address the problem
  - GOLIAT, autonomic computing techniques, self-optimizing computer systems
  - autonomic computing principles in the context of distributed workflow engines
8/47
Contribution
- Goal
  - self-tuning
  - self-configuration capabilities
  - self-healing capabilities
9/47
Contribution (cont.)
- System
  - extension to the JOpera engine
    - Java-based service composition tool
    - combines a workflow engine with an open architecture to provide support for Web service composition, Grid computing and specialized workflow engines
  - flexible architecture, components
    - Key system modules can be replicated to handle large workloads.
    - Other modules can be paired with a backup to achieve fault tolerance.
    - The autonomic controller can be configured by selecting different reconfiguration strategies.
10/47
Contribution (cont.)
- the key contributions of the paper
  - the novel system architecture
    - generic
    - can be adopted by many engines operating under different models and languages
  - the resulting scalability and fault tolerance
    - flexible enough to support the very large loads present in computational applications and large-scale Web service composition
  - the independence of the underlying workflow model
    - easily extensible to support many different kinds of services
11/47
Contents
- Introduction
- System Background
- System Architecture
- Autonomic Capabilities
- System evaluation
- Conclusion
12/47
System Background
- Requirements
- Workload Assumptions
- Deployment Environment
13/47
Requirements
- the workflow execution engine
  - to support autonomic behavior, it must feature self-configuration, self-tuning and self-healing capabilities
- Self-configuration
  - switching the system’s configuration on the fly
  - without manual intervention and without disrupting the system
  - requires the workflow execution engine to support changing the configuration dynamically and efficiently
14/47
Requirements (cont.)
- self-tuning
  - reconfigure the system to the optimal configuration given the current workload
  - the workflow engine must give access to its internal state
    - control algorithms can analyze current and past performance information to plan configuration changes in response to the current workload
  - assumption
    - the characteristics of the workload affect the system’s performance
    - the self-tuning algorithm can optimally adapt the system to the workload by monitoring key performance indicators
15/47
Requirements (cont.)
- self-healing
  - able to detect configuration changes due to external events
    - failures of nodes
  - recovery action requires
    - mechanisms for detecting failures and configuration changes of the cluster
    - a way to query the workflow execution state
16/47
Workload Assumptions
- the workload is assumed to be a collection of concurrent workflow processes
  - a worst-case scenario
- does not deal with workload prediction issues
  - future work
17/47
Deployment Environment
- [Assumption] JOpera
  - runs on a dedicated cluster of computers
  - can use these resources exclusively
- main goal of the autonomic features
  - to ensure the optimal configuration of the cluster
    - efficient resource utilization
    - good allocation of the available nodes to the different system components
- cluster configuration is NOT static
- the system could be extended to use shared nodes that are also used for other purposes.
18/47
Contents
- Introduction
- System Background
- System Architecture
- Autonomic Capabilities
- System evaluation
- Conclusion
19/47
System Architecture
- Workflow Execution
- Distributed Workflow Execution
- Scalable Workflow Execution
20/47
Workflow Execution
- Workflow processes model interactions between different tasks by defining the data flow and control flow between them
21/47
Distributed Workflow Execution
22/47
Scalable Workflow Execution
- scalability bottleneck
  - use several layers of caching between the tuple space and the threads producing and consuming tuples
23/47
Contents
- Introduction
- System Background
- System Architecture
- Autonomic Capabilities
- System evaluation
- Conclusion
24/47
Autonomic Capabilities
- Self-Tuning
  - Information Strategy
  - Optimization Strategy
  - Selection Strategy
- Self-Configuration
  - Reconfiguration Actions
- Self-Healing
25/47
Self-tuning
- Information Strategy
  - detect imbalances in the system’s configuration
  - to sample the current space size
- Optimization Strategy
  - to establish a configuration such that the number of navigator and dispatcher threads is balanced
- Selection Strategy
  - prioritizing nodes according to how well suited they are for a configuration change
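
The three strategies above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the slides or from JOpera: the `Sample` fields, the `threshold` value, the rule of shifting one thread toward the larger backlog, and the choice of the least busy node as the "best suited" one.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Information strategy: one sample of the current space sizes (hypothetical fields)."""
    process_space_size: int  # tuples waiting for navigator threads
    task_space_size: int     # tuples waiting for dispatcher threads


def plan_change(sample: Sample, threshold: int = 100) -> str:
    """Optimization strategy (assumed rule): move one thread toward the larger backlog."""
    gap = sample.process_space_size - sample.task_space_size
    if gap > threshold:
        return "dispatcher->navigator"
    if gap < -threshold:
        return "navigator->dispatcher"
    return "no-change"


def select_node(nodes: list, role: str):
    """Selection strategy (assumed rule): prefer the node of that role with the least active work."""
    candidates = [n for n in nodes if n["role"] == role]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n["active"])["name"]


nodes = [
    {"name": "n1", "role": "navigator", "active": 4},
    {"name": "n2", "role": "dispatcher", "active": 7},
    {"name": "n3", "role": "dispatcher", "active": 2},
]
decision = plan_change(Sample(process_space_size=500, task_space_size=10))
chosen = select_node(nodes, "dispatcher")
```

In this sketch a large process-space backlog leads to converting a dispatcher into a navigator, and the dispatcher with the fewest active tasks is picked so that the change disturbs as little running work as possible.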
26/47
Self-Configuration
- a closed feedback-loop controller
- Reconfiguration Actions
  - Starting Threads
    - the JOpera API
  - Stopping Navigator Threads
    - migrating the state of the processes the navigator thread is working on and redirecting associated events
    - by flushing the locally cached state into the global tuple space
27/47
Self-Configuration (cont.)
- Stopping Dispatcher Threads
  - more difficult
  - task may involve the invocation of a local application or the interaction with a remote service provider on the Web
  - metadata
  - kill method
    - immediately stops all active task executions
    - ensures all task invocations will be repeated on a different dispatcher thread
  - stop method
    - immediately ceases to take tuples from the task space
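
The difference between the stop and kill methods can be sketched as follows. This is a toy model, not the JOpera API: the `Dispatcher` class, its method names, and the use of a `deque` as the task space are hypothetical, and the sketch assumes that after `stop` the tasks already running are allowed to finish on the same dispatcher.

```python
from collections import deque


class Dispatcher:
    """Toy dispatcher thread; names and behavior are illustrative assumptions."""

    def __init__(self, task_space: deque):
        self.task_space = task_space  # shared queue standing in for the task space
        self.active = []              # tasks currently being executed
        self.accepting = True

    def take(self) -> None:
        """Take the next task tuple, unless this dispatcher has been stopped."""
        if self.accepting and self.task_space:
            self.active.append(self.task_space.popleft())

    def stop(self) -> None:
        """Graceful: cease taking tuples; active tasks stay here (assumed to finish)."""
        self.accepting = False

    def kill(self) -> None:
        """Abrupt: abort active tasks and return them so another dispatcher repeats them."""
        self.accepting = False
        # extendleft reverses its input, so reverse first to keep the original order
        self.task_space.extendleft(reversed(self.active))
        self.active.clear()


space = deque(["t1", "t2", "t3"])
d = Dispatcher(space)
d.take()
d.take()
d.stop()
d.take()                      # ignored: the dispatcher no longer takes tuples
after_stop = (list(d.active), list(space))
d.kill()                      # active tasks go back to the task space
after_kill = (list(d.active), list(space))
```

The kill path trades wasted work (tasks are repeated elsewhere) for an immediate release of the node, which is why it only suits tasks that can safely be invoked again.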
28/47
Self-Healing
- periodically monitors the nodes of the cluster
- Handling Dispatcher Thread Failures
  - the tasks that were managed by it are lost and have to be restarted
  - very similar to the case where the self-configuration component kills a dispatcher
- Handling Navigator Thread Failures
  - the state of the execution of the process is still available in the global process execution state space
  - simply removing the entries in the tuple routing table which point to the failed navigator
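
The two recovery paths above can be sketched with plain dictionaries. The data structures are assumptions for illustration only: `assignments` maps a task to the dispatcher running it, and `routing_table` maps a process to the navigator that tuples for it are routed to.

```python
def handle_dispatcher_failure(failed: str, assignments: dict, task_space: list) -> list:
    """Tasks managed by the failed dispatcher are lost, so put them back to be restarted."""
    lost = [task for task, disp in assignments.items() if disp == failed]
    for task in lost:
        del assignments[task]
        task_space.append(task)
    return lost


def handle_navigator_failure(failed: str, routing_table: dict) -> list:
    """Process state survives in the global state space; only stale routes are removed."""
    stale = [proc for proc, nav in routing_table.items() if nav == failed]
    for proc in stale:
        del routing_table[proc]
    return stale


assignments = {"t1": "d1", "t2": "d2", "t3": "d1"}
task_space = []
restarted = handle_dispatcher_failure("d1", assignments, task_space)

routing_table = {"p1": "nav1", "p2": "nav2", "p3": "nav1"}
unrouted = handle_navigator_failure("nav1", routing_table)
```

A navigator failure is cheaper to handle in this model because nothing has to be re-executed: once the stale routes are gone, any other navigator can pick the process up from the shared state.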
29/47
Contents
- Introduction
- System Background
- System Architecture
- Autonomic Capabilities
- System evaluation
- Conclusion
30/47
System evaluation
- Experimental Setup
- Base line
- Autonomic Behavior
  - Self-Configuration
  - Reconfiguration Overhead
- Self-Healing
- Discussion
31/47
Experimental Setup
- a cluster of up to 20 nodes
  - 1.0 GHz dual P-III, 1 GB of RAM, Linux (kernel version 2.4.22) and Sun’s Java Development Kit version 1.4.2
- one additional node
  - the global tuple space server
  - IBM’s T-Spaces v2.1.3
32/47
Base Line
- two different workloads
  - 1000 concurrent processes containing 10 parallel tasks with a duration of 0 seconds (workload 0)
  - 1000 processes containing 10 parallel tasks with a duration of 20 seconds (workload 20)
- total of 15 nodes
  - from 14 navigators and 1 dispatcher up to 14 dispatchers and 1 navigator
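
The size of this baseline sweep follows from simple arithmetic. The sketch below assumes that every intermediate split of the 15 nodes was measured, which the slide implies but does not state; the variable names are illustrative.

```python
TOTAL_NODES = 15

# Static configurations as (navigators, dispatchers), from 14+1 down to 1+14
configs = [(nav, TOTAL_NODES - nav) for nav in range(14, 0, -1)]

# Each workload: 1000 processes with 10 parallel tasks each
tasks_per_workload = 1000 * 10

first, last = configs[0], configs[-1]
```

Under that assumption there are 14 static configurations per workload, each executing 10,000 tasks.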
33/47
Base Line (cont.)
34/47
Base Line (cont.)
35/47
Autonomic Behavior Self-Configuration
36/47
Autonomic Behavior (cont.)
37/47
Autonomic Behavior (cont.)
38/47
Autonomic Behavior (cont.)
Reconfiguration Overhead
39/47
Self-Healing
- initially to use 15 nodes
- to replace 5 of the nodes assigned
- workload
  - consists of four peaks of 500 processes occurring every 100 seconds
  - each of the processes consists of 10 parallel tasks of 10 seconds duration
- change nodes
  - grow to 20 nodes at t=90
  - reduced by 5 nodes at t=140
  - again by 5 nodes at t=230
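
The node changes and the workload size of this experiment can be worked out with a short sketch. It assumes that t is measured in seconds and that each change takes effect instantly; the function and variable names are illustrative.

```python
# (time, change in number of nodes), as listed on the slide
events = [(90, +5), (140, -5), (230, -5)]


def cluster_size(t: int, initial: int = 15) -> int:
    """Number of nodes available at time t, applying every change up to t."""
    return initial + sum(delta for when, delta in events if when <= t)


# Four peaks of 500 processes, each process with 10 parallel tasks
total_tasks = 4 * 500 * 10

sizes = [cluster_size(t) for t in (0, 100, 150, 240)]
```

With these numbers the cluster goes from 15 to 20, back to 15, and finally down to 10 nodes, while the workload adds up to 20,000 tasks over the four peaks.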
40/47
Self-Healing (cont.)
41/47
Self-Healing (cont.)
42/47
Self-Healing (cont.)
43/47
Self-Healing (cont.)
44/47
Discussion
- to find an optimal static configuration for a given workload
  - very difficult
  - different characteristics lead to different optimal configurations
- autonomic controller was able to
  - adapt the configuration of the workflow engine according to the variable characteristics of the workload
- self-healing experiment
  - common situation in the lifetime of a cluster-based system
45/47
Contents
- Introduction
- System Background
- System Architecture
- Autonomic Capabilities
- System evaluation
- Conclusion
46/47
Conclusion
- the design of an autonomic workflow engine
- demonstrated its self-managing behavior and evaluated its performance
- show how to apply the autonomic computing paradigm to greatly simplify the deployment and the maintenance of such systems
- homogeneous workload
  - more complex characteristics as part of future work
47/47