
J Grid Computing, DOI 10.1007/s10723-013-9260-9

Managing and Optimizing Bioinformatics Workflows for Data Analysis in Clouds

Vincent C. Emeakaroha · Michael Maurer · Patrick Stern · Paweł P. Łabaj · Ivona Brandic · David P. Kreil

Received: 14 May 2012 / Accepted: 24 May 2013
© Springer Science+Business Media Dordrecht 2013

Abstract The rapid advancements in recent years of high-throughput technologies in the life sciences are facilitating the generation and storage of huge amounts of data in different databases. Despite significant developments in computing capacity and performance, an analysis of these large-scale data in a search for biomedically relevant patterns remains a challenging task. Scientific workflow applications are deemed to support data-mining in more complex scenarios that include many data sources and computational tools,

V. C. Emeakaroha (B) · M. Maurer · P. Stern · I. Brandic
Information Systems Institute, Vienna University of Technology, Vienna, Austria
e-mail: [email protected]

M. Maurer
e-mail: [email protected]

P. Stern
e-mail: [email protected]

I. Brandic
e-mail: [email protected]

P. P. Łabaj · D. P. Kreil
Chair of Bioinformatics, Boku University Vienna, Vienna, Austria

P. P. Łabaj
e-mail: [email protected]

D. P. Kreil
e-mail: [email protected]

as commonly found in bioinformatics. A scientific workflow application is a holistic unit that defines, executes, and manages scientific applications using different software tools. Existing workflow applications are process- or data- rather than resource-oriented. Thus, they lack efficient computational resource management capabilities, such as those provided by Cloud computing environments. Insufficient computational resources disrupt the execution of workflow applications, wasting time and money. To address this issue, advanced resource monitoring and management strategies are required to determine the resource consumption behaviours of workflow applications and to enable dynamic allocation and deallocation of resources. In this paper, we present a novel Cloud management infrastructure consisting of resource level and application level monitoring techniques, and a knowledge management strategy to manage computational resources in support of workflow application executions, in order to guarantee their performance goals and their successful completion. We present the design of these techniques, demonstrate how they can be applied to scientific workflow applications, and present detailed evaluation results as a proof of concept.

Keywords Workflow execution · Resource level monitoring · Application level monitoring · Workflow management · Knowledge database · Cloud computing


1 Introduction

Scientific workflow applications have become an empowering tool for large-scale data analysis in bioinformatics [37, 38], providing the necessary abstractions that enable effective usage of computational and data resources. The workflows strive to manage the operational complexities, freeing researchers to focus on guiding the data analysis, interpreting results, and taking decisions whenever human input is required [22].

Considering the fast development of high-throughput technologies, which generate huge amounts of data, scientific workflow applications can be instrumental in achieving automation and breaking down extended complexity in the life sciences [42]. The execution of workflow applications can be computationally intensive and requires reliable resource availability. Moreover, scientific workflow applications should be highly flexible to accommodate dynamic changes of input data and parameter modifications, even during execution, because preliminary results in scientific workflows can indicate a suboptimal parameter choice. In such situations, one may want to adapt these parameters immediately. Thus, in the field of bioinformatics, analyses are often semi-interactive. A typical example is the significance threshold for continuing into the second step of an analysis. This may have to be set liberally in case few results are expected. If, during execution, surprisingly many results meet the threshold, one may want to revise it for greater stringency. Typical scenarios that require a response to changed input data include the update of an external database or the upgrade of a tool used to preprocess input data. In such situations, when the workflow is not already close to completion, it may be a better use of resources to restart with the revised input data.

The efficient management of such flexible workflow applications, seeking to guarantee the availability of resources and the achievement of their performance goals, is a challenging task. Often, resource availability decides the successful execution of a complex and expensive workflow application. Therefore, there is a need for advanced techniques to monitor, at runtime, the resource consumption and application behaviours and to inform about resource shortages, so that the management system can take adequate resource allocation decisions to support the successful completion of each workflow application process.

Traditionally, a workflow application can be executed using local and distributed compute resources. Such resources are basically limited and, normally, cannot be dynamically reallocated. Considering that workflow applications are resource intensive and can take hours if not days to complete, provisioning them in an environment with fixed resources leads to poor performance and possible execution failures due to the lack of a flexible allocation of extra resources. The Cloud is proving to be a valuable complement to the compute resources traditionally used in bioinformatics research laboratories [32]. Cloud computing technology promises on-demand and flexible resource provisioning that can be realized autonomically [3, 20]. The execution of workflow applications in a Cloud environment allows for easier management and guaranteeing of their performance goals.

In this paper, we present a novel Cloud management infrastructure consisting of resource level and application level monitoring techniques, and a knowledge management strategy, which allocates and deallocates resources based on the monitored information in order to support the execution of workflow applications in Cloud environments. The monitoring techniques are responsible for monitoring the execution of workflow applications in the Cloud to determine their performance and their resource consumption behaviours, including supervising the status of available resources. The knowledge management uses the monitored information to make decisions on how to allocate resources to the computational nodes in order to ensure the successful completion of the workflow applications. The main contributions of this paper are: (i) the introduction of Cloud management techniques applicable to workflow applications; (ii) the improvement of scientific workflow application execution to support achieving their performance goals and to facilitate their successful completion; and (iii) the evaluation of our approach. The contributions are demonstrated using TopHat [43], a typical scientific workflow application from the field of bioinformatics that exemplifies the challenges in the complex analysis of large data sets from modern high-throughput technologies. The focus of this work is on supporting the execution of workflow applications and not on designing a workflow system.

This paper is organized as follows: Section 2 presents some general descriptions of workflow applications, including the current state of the art. There, we introduce in detail the challenges to be addressed in this paper. In Section 3, we describe the Cloud management infrastructure, which is composed of resource and application monitoring techniques, and a knowledge management strategy. Section 4 presents a concrete example of applying the Cloud techniques to support workflow application execution, and Section 5 discusses the evaluation of our approach to present a proof of concept. We consider related work in Section 6, while Section 7 concludes the paper and discusses our future work directions.

2 Workflow Applications

In most working environments, there are procedures that are repeated over and over again. These procedures usually consist of sub-tasks that represent encapsulated units of work, which form a dependency network. Thus, a workflow can be defined as a composite task that comprises coordinated human- and machine-executed sub-tasks [22]. In computer-aided scientific work, a scientific workflow application is a holistic unit that defines, executes, and manages sub-task processes with the help of software artifacts [16].

2.1 Background

Bioinformatics is the research discipline in which scientists, with the use of computational methods, seek to gain insights from data gathered in the life sciences [22]. A typical objective is the discovery of interesting patterns in data obtained from laboratory experiments and/or from earlier results stored in databases, which can be online or in storage sites spread around the world. Bioinformatics involves both applying established computational tools to new data and applying new tools to well-characterized data sets, seeking to improve on earlier methods. Thus, in bioinformatics, like in many modern research disciplines, scientific workflow applications empower advanced and more complex analyses. Management systems have been developed to facilitate the execution of workflow applications that use data and services from distributed sources [2, 6, 14, 17].

As a result of the proliferation of new high-throughput technologies in the life sciences that generate massive amounts of data, the retrieval, storage, and analysis of data face great technical challenges [35, 39]. In particular, bioinformatics tools, many of which are available only through web-based interfaces, are often not suited for the analysis of newly generated large-scale data sets due to their computational intensiveness [4, 34]. In general, existing workflow application management systems cannot handle the massive amounts of data and execute workflow applications on these efficiently either [26]. New analysis software, workflow applications, monitoring, and management approaches are required that can take advantage of more powerful infrastructure such as data centers or Cloud environments.

In this paper, we consider Next Generation Sequencing (NGS), a recently introduced high-throughput technology for the identification of nucleotide molecules like RNA or DNA in biomedical samples. The output of the sequencing process is a list of billions of character sequences called 'reads', each typically holding 35–200 letters that represent the individual DNA bases determined. Lately, this technology has also been used to identify and count the abundances of RNA molecules that reflect new gene activity. We use this approach, called RNA-Seq, as a typical example of a scientific workflow application in the field of bioinformatics.

At first, in the analysis of RNA-Seq data, the obtained sequences are aligned to the reference genome. The aligner presented here, TopHat [43], consists of many sub-tasks, some of which have to be executed sequentially, whereas others can run in parallel (Fig. 1). These sub-tasks can have different resource-demand characteristics: needing extensive computational power, demanding high I/O access, or requiring extensive memory size.

Fig. 1 Overview of the TopHat aligning approach: color online

In Fig. 1, the nodes marked with ∗ represent simplified sub-tasks of the workflow application, whereas the nodes marked with # represent the data transferred between the sub-tasks. The first sub-task aligns input reads to the given genome using the Bowtie program [24]. Unaligned reads are then divided into shorter sub-sequences that are further aligned to the reference genome in the next sub-task. If sub-sequences coming from the same read were aligned successfully to the genome, that may indicate that this read was straddling a 'gap' in the gene, falling on a so-called splice junction. After verification of candidate reads falling on splice junctions, these and the reads that were aligned in the first sub-task are combined to create an output with a comprehensive list of localized alignments.
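To make the dependency structure concrete, the following minimal Java sketch models such a pipeline as a small task graph executed on a thread pool. The sub-task names follow Fig. 1; the class and its structure are illustrative assumptions, not part of the TopHat code.

```java
import java.util.*;
import java.util.concurrent.*;

// Minimal sketch of a workflow as a dependency graph of sub-tasks,
// executed on a thread pool. Task names follow Fig. 1; everything
// else (class and method names) is illustrative, not TopHat code.
public class WorkflowSketch {
    public static void main(String[] args) throws Exception {
        // Sub-task -> sub-tasks it depends on, listed in topological order.
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("align-reads-bowtie", List.of());
        deps.put("split-unaligned-reads", List.of("align-reads-bowtie"));
        deps.put("align-subsequences", List.of("split-unaligned-reads"));
        deps.put("verify-splice-junctions", List.of("align-subsequences"));
        deps.put("combine-alignments",
                 List.of("align-reads-bowtie", "verify-splice-junctions"));

        ExecutorService pool = Executors.newFixedThreadPool(2);
        Map<String, Future<?>> running = new ConcurrentHashMap<>();
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            running.put(e.getKey(), pool.submit(() -> {
                for (String d : e.getValue()) {
                    running.get(d).get();             // wait for dependencies
                }
                System.out.println("running sub-task: " + e.getKey());
                return null;
            }));
        }
        running.get("combine-alignments").get();      // wait for the final sub-task
        pool.shutdown();
    }
}
```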

2.2 Problem Description

In the course of a comprehensive study of some of the largest available RNA-Seq data sets [23], it has been noticed that even with modern dedicated servers (having large memory sizes and high CPU capacities) the analysis of such large data sets can be challenging and requires prior knowledge about the resource-demand characteristics of the workflow application to execute.

In such studies, the analysis of one data set can easily take days. The analysis time is related to both the size of the input data set and the complexity of the studied organism. Like many such tools, TopHat creates large temporary files (in GBs per data set), and these accumulate during execution. With fixed resources, it can therefore easily occur that in the middle of an analysis run there is unexpectedly no more space on the disks, especially when several instances have been executed in parallel. Thus, without appropriate prior knowledge about resource demands across execution time, the instances of TopHat running in parallel may trigger a crash and lead to the loss of several days of calculation time.


For higher efficiency in such analyses, we envisage the necessity of appropriate strategies to monitor the resource consumption behaviours of workflow applications and to apply knowledge databases to manage and allocate resources appropriately. In the next section, we describe the management tools that can be used to this end.

3 Cloud Management Infrastructure

Cloud computing facilitates the implementation of scalable on-demand computing infrastructures, combining concepts from virtualization, Grid, and distributed systems [3, 33, 36]. Cloud resources are provisioned based on Service Level Agreements (SLAs), which are contracts specified and signed between Cloud providers and their customers detailing the terms of the agreement, including non-functional requirements, such as Quality of Service (QoS) and penalties in case of violations [5, 11, 21]. Clouds open the door for a fundamental change in how workflow applications are designed and deployed in bioinformatics research laboratories [15]. Their on-demand resource availability and seemingly infinite scalability can augment traditional in-house compute and storage devices. In this section, we present monitoring techniques and a knowledge management strategy that provide state-of-the-art techniques for resource and application management in Cloud environments.

3.1 Resource Level Monitoring

Monitoring techniques are the basis for efficient resource and SLA management in Clouds. Normally, SLA parameters such as availability, throughput, and response time represent the performance goals of applications. However, in Clouds, applications are provisioned with infrastructure resources, which are characterized by low-level resource metrics such as uptime, downtime, CPU, memory, and storage. Thus, there is a gap between the SLA parameters and the low-level resource metrics.

In order to bridge this gap and guarantee the SLAs of applications, a Low-level resource Metrics to High-level SLA (LoM2HiS) monitoring and mapping framework was designed [7]. The LoM2HiS framework monitors Cloud infrastructure resource metrics to determine their status while applications are being provisioned in a Cloud environment. Figure 2 depicts an overview of the LoM2HiS framework. This framework aims to address two issues:

– monitoring of low-level resource metrics
– mapping of the monitored metrics into SLA parameters so as to monitor the application SLA at runtime

Fig. 2 LoM2HiS framework overview: color online

It consists of two core components, namely the Host monitor and the Run-time monitor, as shown in Fig. 2. The Run-time monitor is designed to monitor services (applications) based on the negotiated and agreed SLA objectives.

As depicted in Fig. 2, after agreeing on SLA terms, the service provider creates mapping rules for the LoM2HiS mappings (step 1 in Fig. 2) using Domain Specific Languages (DSLs), which are small languages that can be tailored to a specific problem domain. Once the customer requests the provisioning of an agreed service (step 2), the Run-time monitor loads the service SLA from the agreed SLA repository (step 3). Service provisioning is based on infrastructure resources, which represent hosts and network resources in a computational environment. The Gmond monitoring agents from the Ganglia project [27] are used to measure the resource metrics, and the Host monitor accesses the measured metrics for processing (step 4). The Host monitor extracts the metric-value pairs from the raw metrics and transmits them periodically to both the Run-time monitor (step 5) and the knowledge component (step 6) using a novel communication mechanism.

Upon reception of the measured metrics, the Run-time monitor maps the low-level metrics based on predefined mapping rules to form the equivalents of the agreed SLA objectives. These mapping rules are predefined depending on the application types, raw metrics, and parameters involved. The definition of the mapping rules may include different complexity levels, e.g., 1:n or n:m. An example rule is presented in (1). It shows how to map the SLA parameter Availability from the low-level metrics uptime and downtime:

Availability = (1 − downtime / (uptime + downtime)) × 100.    (1)

The resulting mapping is stored in the mapped metric repository (step 7), which also contains the predefined mapping rules. The Run-time monitor uses the mapped values to monitor the status of the deployed service applications at runtime. In case future SLA violation threats are detected, it notifies the knowledge component for preventive actions (step 8), and the knowledge component's decision is executed on the infrastructure resources (step 9). The concept of detecting future SLA violation threats is realized by defining a threshold that is more restrictive than the SLA violation threshold, known as the threat threshold. Thus, calculated SLA values are compared against the predefined threat thresholds in order to inform the knowledge management to react before real SLA violations can occur.
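As a concrete illustration of such a mapping rule together with the threat threshold check, consider the following minimal Java sketch; the class name, threshold values, and metric values are illustrative assumptions, not the framework's actual implementation.

```java
// Illustrative sketch of a LoM2HiS-style mapping rule (Eq. 1) plus a
// threat-threshold check; names and values are assumptions, not the
// framework's actual code.
public class AvailabilityRule {
    // Map the low-level metrics (uptime/downtime) to the SLA parameter.
    static double availability(double uptime, double downtime) {
        return (1.0 - downtime / (uptime + downtime)) * 100.0;
    }

    public static void main(String[] args) {
        double slaThreshold = 99.0;     // agreed SLA objective (%)
        double threatThreshold = 99.5;  // stricter threshold to react early

        double value = availability(7_200.0, 60.0); // seconds up / down
        System.out.printf("Availability = %.2f %%%n", value);

        if (value < threatThreshold) {
            // Notify the knowledge component before a real violation occurs.
            System.out.println("threat threshold crossed -> notify knowledge mgmt");
        }
    }
}
```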

In designing LoM2HiS, the separation of the Host monitor and the Run-time monitor makes it possible to deploy these two components on different hosts. This decision is aimed at increasing the scalability of the framework and at facilitating its usage in distributed and parallel environments. To this end, a scalable communication mechanism is designed for exchanging messages between the two components. The communication mechanism exploits the capabilities of the Java Message Service (JMS) API [19], which is a Java message-oriented middleware for sending messages between two or more clients. In order to use JMS, there is a need for a JMS provider that is capable of managing a large number of sessions and queues. We use the well-established open source Apache ActiveMQ [1] for this purpose.
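The following minimal sketch shows the kind of JMS-based exchange described here, using the standard ActiveMQ client API; the broker URL, queue name, and message format are illustrative assumptions rather than the framework's actual configuration.

```java
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

// Minimal sketch of a JMS-based exchange as described above: a
// Host-monitor-side producer pushing a measured metric to a queue.
// Broker URL, queue name, and payload format are illustrative.
public class MetricsProducer {
    public static void main(String[] args) throws JMSException {
        ConnectionFactory factory =
            new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();

        Session session =
            connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Destination queue = session.createQueue("monitoring.metrics");
        MessageProducer producer = session.createProducer(queue);

        // A measured metric-value pair, e.g. from the Host monitor.
        TextMessage msg = session.createTextMessage("node-01;cpu.usage;83.5");
        producer.send(msg);

        session.close();
        connection.close();
    }
}
```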

We have successfully utilized the LoM2HiS framework for resource monitoring and detecting SLA violations in private Cloud environments and in related areas [8, 40]. We have also evaluated it using different application types [10]. However, the LoM2HiS framework is focused on monitoring at the resource level in Clouds. The Run-time component is only capable of monitoring the execution of single-threaded applications on Cloud infrastructure. Thus, it is not capable of monitoring multi-threaded applications that are being executed with different processes. To address this issue, especially for workflow applications, we introduce application level monitoring in the next section.


3.2 Application Level Monitoring

In this section, we complement the resource monitoring framework presented in the previous section with an application level monitoring architecture, which is capable of monitoring multi-threaded applications and the processes executing their sub-tasks. The goals of this architecture are to gain deeper insight into the workflow execution and to realize fine-grained management of workflow applications. Furthermore, we implement the ability to monitor the temporary storage created by the TopHat workflow application while executing.

In order to realize this goal of monitoring the processes executing the TopHat application and its sub-tasks, we use the inbuilt Linux programs top and du to determine the resource usage of the processes associated with the TopHat application. The du Linux program is capable of checking the disk usage of files and directories. Thus, we apply it to supervise the directory and sub-directories from where the TopHat workflow application is started, in order to obtain information about the size of the temporary storage utilized by the application at runtime. The memory usage of each process executing a task associated with TopHat is derived by analyzing the files located in the /proc directory, especially the statm file, with our implemented Java routine. Figure 3 presents an overview of the application monitoring architecture.
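As an illustration of this technique, the following minimal Java sketch derives a process's resident memory from /proc/&lt;pid&gt;/statm; the class name, the assumed page size of 4096 bytes, and the command-line handling are illustrative, and the du-based directory supervision can be scripted analogously.

```java
import java.io.IOException;
import java.nio.file.*;

// Illustrative sketch of deriving a process's memory usage from
// /proc/<pid>/statm, as described above. The first two fields are the
// total program size and the resident set size, in pages. The page
// size (4096 bytes) and pid handling are assumptions.
public class StatmReader {
    static long residentKb(long pid) throws IOException {
        String statm = Files.readString(Path.of("/proc/" + pid + "/statm"));
        String[] fields = statm.trim().split("\\s+");
        long residentPages = Long.parseLong(fields[1]); // resident set size
        return residentPages * 4096 / 1024;             // pages -> KB
    }

    public static void main(String[] args) throws IOException {
        long pid = Long.parseLong(args[0]); // e.g. pid of a Bowtie process
        System.out.println("resident memory: " + residentKb(pid) + " KB");
    }
}
```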

The monitoring architecture design includes scripts to automatically start the TopHat workflow application and call the inbuilt Linux programs to monitor the execution of the application. The monitoring takes place at a specified time interval, which is defined based on the application type and its resource consumption behaviours. The application monitoring architecture stops its operation automatically once the workflow application completes its execution or the execution fails. We implemented Java routines to analyze and extract the monitored information values, which are sent to the knowledge management for processing. Furthermore, our architecture is capable of plotting the monitored information with the help of the JFreeChart API [18] to visualize the results.

Fig. 3 Application monitoring overview: color online


3.3 Knowledge Management Strategy

In the previous sections, we discussed the resource and application monitoring techniques in Cloud environments. In this section, we present a knowledge management strategy that utilizes the monitored information to provide proactive actions for resource management in Clouds.

The term knowledge management (KM), in our context, means the intelligent usage of measured data, obtained by monitoring, for the decision-making process to satisfy the application performance goals defined in SLA agreements while optimizing the computational resource usage. Apart from the monitoring phase (as shown in Fig. 4), the knowledge management consists of three other phases, namely analysis, plan, and execute. The core of the KM is a knowledge database that interacts with these phases in the management process.

Figure 4 presents an overview of the knowledge management. The monitoring phase delivers the monitored information, which includes details about the actual resource allocation status, the utilization of the resources by the workflow applications, the performance status of the applications, and the applications' performance goals defined in the SLA. The analysis phase processes the monitored data. It provides an interface for receiving the monitored information from the monitoring phase. It analyzes the received information to determine the exact resource SLA threat thresholds that are violated, and then decides on the exact reactive action to carry out in order to prevent future SLA violations. The plan phase plans the execution of the recommended actions and sorts them to prevent oscillation effects (i.e., allocating and deallocating the same resource interchangeably) in the operations. This phase is divided into two parts: plan I and plan II. Plan I is responsible for mapping actions onto physical and virtual machines in the Cloud environments and managing those machines [31]. Plan II is in charge of planning the order and timing of the actions. The execute phase is the final one. It executes the recommended actions on the computational devices with the help of software actuators.

We have successfully applied different techniques, such as Case-Based Reasoning and rule-based reasoning, for knowledge management in Cloud environments [28–30]. In the next section, we show how the monitoring and rule-based techniques are applied to support workflow application execution.

Fig. 4 Knowledge management overview: color online


4 Optimizing Workflow Execution

In this section, we apply the Cloud management techniques described in Section 3 to optimize the RNA-Seq scientific data analysis process. The objective is to address the resource availability problems affecting workflow application executions while analyzing bioinformatics data, as outlined in Section 2.2.

4.1 Workflow Execution Monitoring

Scientific workflow applications are resource intensive and can take a considerable amount of time to complete. The successful completion of data analysis using workflow applications in the life sciences is paramount to the scientist. To facilitate this objective, we apply the resource and application monitoring techniques to monitor the workflow application executions, in order to supervise the computational resource status and the state of the processes executing the application sub-tasks.

Normally, workflow applications are composed of other applications (sub-tasks) linked together to achieve a common goal (as shown in Fig. 1). A workflow application can be executed in a distributed system using multiple computational nodes, in which case some parts of the application might execute on a different node. Thus, the successful completion of a workflow application depends on the completion of its component parts executed by different processes.

To demonstrate our approach, we apply the Cloud management infrastructure techniques to support the scientific data analysis processes as shown in Fig. 1. For simplicity and ease of understanding, we use a reduced version of Fig. 1 in this demonstration.

Fig. 5 Applying Cloud management infrastructure techniques to a workflow application: color online

As described in Section 3.1, the resource monitoring framework consists of three main components: (i) the Monitoring agents that monitor single computational nodes; (ii) the Host monitor that gathers and processes monitored information from the Monitoring agents; and (iii) the Run-time monitor that maps metrics and monitors application SLAs. We integrated the application monitoring architecture functions into the resource monitoring components to supplement their functions in monitoring at the application level, and thereby achieve an advanced monitoring technique. Figure 5 presents how we applied the advanced monitoring technique to efficiently manage the scientific data analysis executions.

The TopHat workflow application execution is composed of different applications (sub-tasks) running sequentially and in parallel (as shown in Fig. 1). Thus, it is necessary to monitor the computational node used to execute each of these sub-tasks, and the sub-tasks themselves, in order to dynamically allocate resources if needed and to determine at each point in time the performance of the sub-task execution. As depicted in Fig. 5, we integrate a monitoring agent into each of the computational nodes used for the execution of the workflow application. The monitoring agents monitor the low-level resource metrics' status (e.g., CPU, memory, disk space, etc.) of each node, including the resource consumption of each sub-task. The monitored information (arrow a in Fig. 5) is communicated back to the advanced monitoring technique for processing.

The monitored information consists of unique IDs for each of the computational nodes and their resource metrics, including the resource consumption of each of the processes executing the sub-tasks. Thus, the advanced monitoring technique is capable of determining the specific node and the exact resource metric that might be lacking in the near future. It passes this information to the knowledge management component (arrow b in Fig. 5) to take appropriate actions to ensure the availability of this resource (arrow c in Fig. 5) in the Cloud environment for further computations.

4.2 On-Demand Resource Allocation

This section describes how we apply the knowledge management strategy to allocate resources on-demand based on the monitored information. With this strategy, we intend to efficiently manage computational resources to support the execution of the workflow applications.

To achieve these aims, we introduce a speculative knowledge management technique utilizing a rule-based approach [30]. This approach ensures that at every point in time the computational nodes possess enough resources for the workflow application. Furthermore, it ensures that resources are not wasted, by reducing the amount of resources allocated to a node if necessary.

In the life sciences, it has been identified that CPU, storage, and memory are the crucial computational resources for workflow execution. To demonstrate our speculative approach, we define SLAs to represent the workflow application performance goals. The SLAs specify objective values for the required computational resources. Enforcing these objectives guarantees the application performance and its successful completion. An example of SLA specifications is presented in Table 2 in the evaluation section. The initial size of a computational node is determined based on the SLA of the application to be executed.

In a further step, we introduce three notions for resource management: allocated, utilized, and specified. Allocated means the amount of resources allocated to a computational node, utilized means the amount of resources used by the application executing on the computational node, and specified means the assumed amount of resources required for the successful completion of the workflow application. An SLA violation occurs if fewer resources are allocated than the application utilizes (or wants to utilize) with respect to the specified objective in the SLA. Consequently, we try to allocate less than specified, but more than utilized, in order to avoid SLA violations on the one hand and to mitigate resource wastage on the other.

We define allocating more or less than utilized to be called over-provisioning or under-provisioning, respectively. In order to know whether a resource r is in danger of under-provisioning or already is under-provisioned, or whether it is over-provisioned, we calculate the current utilization ut_r = use_r / pr_r × 100, where use_r and pr_r signify how much of a resource r was used and allocated, respectively, and divide the percentage range into three regions using Threat Thresholds (TT). In this case, we define two threat thresholds TT^r_low and TT^r_high representing the lower and the higher boundaries, as shown in Fig. 6:

– Region −1: Danger of under-provisioning, or under-provisioning (> TT^r_high)
– Region 0: Well provisioned (≤ TT^r_high and ≥ TT^r_low)
– Region +1: Over-provisioning (< TT^r_low)

The idea of this rule-based design is to maintain an ideal value, which we call the target value tv(r) for the utilization of a resource r, exactly in the centre of region 0. So, if the utilization value after some measurement leaves this region by using more (Region −1) or fewer resources (Region +1), then we reset the utilization to the target value, i.e., we increase or decrease the allocated resources so that the utilization is again in region 0, using (2):

tv(r) = (TT^r_low + TT^r_high) / 2 %.    (2)

As long as the utilization value stays in region 0, no action will be executed. E.g., for r = storage, TT^r_low = 60 % and TT^r_high = 80 %, the target value would be tv(r) = 70 %.

Fig. 6 Example behaviour of actions at time intervals t1–t6: color online

Figure 6 presents the regions and measurements (expressed as utilization of a certain resource) at time steps t1, t2, . . . , t6. At t1 the utilization of the resource is in Region −1, because it is in danger of a violation. Thus, the knowledge database recommends increasing the resource such that at the next iteration t2 the utilization is at the center of Region 0, that is, at the target value. At time steps t3 and t4 the utilization stays in the center region 0 and, consequently, no action is required. At t5, the resource is under-utilized, and so the knowledge database recommends a decrease of the resource to tv(r), which is attained at t6. In each case, the utilization is reset to the target value by dynamically adjusting the resources allocated to the computing node.
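The following minimal Java sketch illustrates this rule: it classifies a measured utilization into the three regions and, outside region 0, computes the allocation that returns utilization to the target value. The class and variable names are illustrative; the threshold values follow the storage example above.

```java
// Illustrative sketch of the rule-based reallocation described above:
// classify utilization into regions and, if outside region 0, compute
// the allocation that brings utilization back to the target value.
// Names and threshold values are illustrative (storage example).
public class RegionRule {
    static final double TT_LOW = 60.0, TT_HIGH = 80.0;      // percent
    static final double TARGET = (TT_LOW + TT_HIGH) / 2.0;  // tv(r) = 70 %

    static int region(double utilization) {
        if (utilization > TT_HIGH) return -1; // danger of under-provisioning
        if (utilization < TT_LOW)  return +1; // over-provisioning
        return 0;                             // well provisioned
    }

    public static void main(String[] args) {
        double allocatedGb = 17.1, usedGb = 15.2;
        double utilization = usedGb / allocatedGb * 100.0;   // ut_r

        if (region(utilization) != 0) {
            // Resize so that used / newAllocated equals the target value.
            double newAllocatedGb = usedGb / (TARGET / 100.0);
            System.out.printf("resize %.1f GB -> %.1f GB%n",
                              allocatedGb, newAllocatedGb);
        } else {
            System.out.println("utilization in region 0, no action");
        }
    }
}
```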

A large enough span between the thresholds TT^r_low and TT^r_high helps to prevent oscillations of repeatedly increasing and decreasing the same resource. However, to further reduce the risk of oscillations, we suggest calculating a prediction for the next value based on the latest measurements. Thus, an action is only invoked when the current AND the predicted measurement exceed the respective thresholds. So, especially when only one value exceeds the threshold, no action is executed.
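A minimal sketch of this guard is shown below; since the paper does not fix a prediction method, a simple linear extrapolation from the last two measurements is assumed here.

```java
// Sketch of the oscillation guard described above. A simple linear
// extrapolation from the last two measurements is assumed as the
// prediction method.
public class OscillationGuard {
    static double predictNext(double previous, double current) {
        return current + (current - previous); // linear extrapolation
    }

    static boolean shouldAct(double previous, double current, double ttHigh) {
        double predicted = predictNext(previous, current);
        // Act only if the current AND the predicted value exceed the threshold.
        return current > ttHigh && predicted > ttHigh;
    }

    public static void main(String[] args) {
        System.out.println(shouldAct(78.0, 82.0, 80.0)); // true: 82 and 86 > 80
        System.out.println(shouldAct(90.0, 82.0, 80.0)); // false: predicted 74 <= 80
    }
}
```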

Our on-demand resource allocation model dynamically increases or decreases the virtual machines' resources by allocating extra resources when necessary and duly releasing them to avoid wastage and unnecessary energy consumption.

Based on this technique, we can provision scientific data analysis processes with enough resources to achieve high performance and to ensure their successful completion. In the next section, we present some evaluations of our approach.

5 Evaluation

In this section, we present the evaluation of our approach. The goals of the evaluations are to show the applicability of our advanced monitoring techniques for monitoring the workflow application execution, and the usage of the knowledge management strategy to efficiently manage resources. We first discuss the environmental setup and the workflow application used, and later present the achieved results.


Table 1 Computational node capacity

OS             CPU                   Cores   Memory   Storage
Linux/Ubuntu   Intel Xeon(R) 3 GHz   2       9 GB     19 GB

5.1 Experimental Setup

The evaluations are carried out in a private virtualized Cloud environment. The workflow application is executed using virtual machines, and each of them can have its capacity increased up to the values presented in Table 1.

The virtual machines (VMs) represent computational nodes for the execution of applications. The VMs in our evaluation environment are created using VMware tools, which give us the opportunity to manage the created VMs using our developed monitoring and knowledge management techniques. We avoided using existing Cloud frameworks such as Eucalyptus in setting up our Cloud environment, since they have their own management tools. However, we intend to integrate our management tools into such frameworks in the future.

Figure 7 presents an abstracted view of the evaluation testbed, exemplifying a control entity node and a computing node. The testbed represents a setup for efficient management of application execution in a Cloud environment. This testbed is a private Cloud environment used for the evaluation of our implemented research prototypes. In this case, the purpose of the testbed is to present a proof of our concept and to demonstrate how workflow application executions can be efficiently managed in a Cloud environment.

On the testbed, a group of nodes act as the control entity. They host the advanced monitoring technique and the knowledge management component, and provide an interface for deploying workflow applications requested by the users. The testbed is designed with scalability in mind. The control nodes are de-centrally distributed in order to be capable of monitoring large-scale Cloud environments. The computing nodes are used for the execution of the workflow applications. In this group of nodes, we only embed light-weight agents for monitoring and for executing the resource allocation decisions. Thus, the computing nodes are focused on executing the applications, thereby being relieved from the complexities of analyzing the monitored information. This increases the efficiency and performance of the computing nodes. The control nodes and the computing nodes interact through our designed communication mechanism, which is based on the Java Message Service (JMS). The communication mechanism provides a means of exchanging information, such as resource status, monitored values, and synchronization data, among the components in the Cloud environment.

Generally, in our private Cloud, each control node is assigned a number of computing nodes whose monitored information it processes. Furthermore, the control nodes synchronize among themselves based on the available resources to reach a decision on where to deploy further workflow application requests.

The evaluation of our approach in this paper is based on the bioinformatics workflow application TopHat, which was described in Section 2.1. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie [24], and then analyzes the mapping results to identify splice junctions between exons. Furthermore, it uses the Sequence Alignment/Map (SAM) tools in its execution. SAM tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing, and generating alignments in a per-position format [25].

Fig. 7 Evaluation testbed: color online

We analyzed a set of RNA-Seq data using the TopHat workflow application, and the achieved results are presented in the next sections.

5.2 Resource Level Monitoring Results

As outlined in Section 4.1, we use the LoM2HiS framework to monitor the infrastructure resources while analyzing RNA-Seq data using the TopHat workflow application for a duration of about three hours. We monitored the status of the resource metrics CPU, memory, and storage at runtime with a measurement interval of one minute. The achieved results are presented in Figs. 8, 9, and 10. Note that the monitored results presented are from one of the computational nodes and not the entire Cloud environment. The resource consumption behaviours on the different computational nodes are similar; thus, we present the results from one node for simplicity and ease of understanding our approach.

The aim of the monitoring processes is to detect the unavailability of computational resources in a timely manner. To realize this, LoM2HiS utilizes monitoring agents to monitor the resource status and compares the measurements against the threat thresholds, which are defined values that signal the shortage of computational resources at runtime. The threat thresholds can be dynamically or statically defined. In this paper, the knowledge management dynamically updates the initially defined threat thresholds.

Figure 8 presents the monitored results for the CPU usage. From the results, it can be observed that the TopHat workflow application is very CPU intensive in some time intervals, for example from execution time 52 to 80. The time intervals where the CPU usage is 100 % are the critical ones that need to be managed. The LoM2HiS framework is configured in this case with a threat threshold value of about 80 % CPU utilization. This means that once the CPU utilization exceeds this threshold, it sends a notification message to the knowledge management to provide preventive actions to avoid reaching 100 % utilization, because at that point the performance of TopHat degrades and there might be a risk of failures.

The monitored results of the memory usage are depicted in Fig. 9. As shown in the figure, the memory consumption increases along the execution line of the TopHat workflow application. It is difficult to predict the total amount of memory the application might require in the next time interval or to successfully complete the data analysis. Thus, we define a threat threshold value that is about 2 GB less than the currently allocated memory. That is, once the memory utilization exceeds the threat threshold value, a notification message is sent to the knowledge database for memory resource allocation decisions.

Fig. 8 Monitored CPU utilization: color online

Fig. 9 Monitored memory utilization: color online

Figure 10 shows the utilization of the storage resource by the TopHat workflow application. According to the figure, the storage utilization increases along the execution line. In this case, one can notice some jumps in the utilization lines. These jumps can be large, depending on the size of the data set to be analysed. Therefore, the threat threshold value for managing the storage resource is set to about 4 GB less than the currently allocated storage, in order not to risk failure situations before the knowledge management can react to allocate more storage resources.

Fig. 10 Monitored storage utilization: color online

Generally, the threat thresholds are defined to accommodate reaction time for the knowledge management, so that the resource allocation procedures are carried out early enough before the system runs out of resources.
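A minimal sketch of such margin-based thresholds, assuming the memory and storage margins given above, could look as follows; the class name and constants are illustrative.

```java
// Sketch of the margin-based threat thresholds described above:
// memory reacts ~2 GB and storage ~4 GB before allocation is exhausted.
// Class name and margins-as-constants are illustrative assumptions.
public class ThreatThresholds {
    static final long GB = 1024L * 1024 * 1024;

    // Threat threshold = currently allocated capacity minus a safety margin.
    static long threshold(long allocatedBytes, long marginBytes) {
        return allocatedBytes - marginBytes;
    }

    public static void main(String[] args) {
        long memoryAllocated = 9L * GB, storageAllocated = 19L * GB;
        System.out.println("memory threat threshold (GB): "
                + threshold(memoryAllocated, 2 * GB) / (double) GB);
        System.out.println("storage threat threshold (GB): "
                + threshold(storageAllocated, 4 * GB) / (double) GB);
    }
}
```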

5.3 Application Level Monitoring Results

In this section, we discuss the application level monitoring results achieved at runtime while executing the TopHat workflow application. The results show the behaviour of each system process executing the sub-tasks of the workflow application in terms of memory resource utilization. Furthermore, we show the amount of temporary storage used by the TopHat workflow application while executing.

Figure 11 presents the results of the memory utilization of selected processes executing the TopHat application sub-tasks. To simplify the results of the application level monitoring, we concentrate on the processes executing the sub-tasks with the highest resource consumption. In this regard, the Bowtie and Fix-map-ordering processes stand out.

The Bowtie processes are the most expensive in terms of memory resource consumption, as depicted in Fig. 11. According to the results shown in Fig. 11, the two Bowtie processes use the highest amount of memory while actively executing, and their memory consumption falls to zero in their passive state. The Fix-map-ordering process is second in memory consumption level, as shown in Fig. 11. Once the process comes into an active state, its resource consumption jumps from zero to a certain level and gradually follows an increasing curve along the execution line, until the process becomes inactive and the consumption level falls back to zero.



Fig. 11 Process memory utilization: color online

The TopHat process has a lower memory resource consumption level compared to the other two processes, as depicted in Fig. 11. It is the main process controlling the execution of the other processes. It stays active and consumes resources while the processes executing the sub-tasks are running. It releases the resources and goes into a passive state when all the sub-processes are in a passive state, for example around time points 25 and 75 in Fig. 11. Furthermore, it is the process that terminates last at the end of the execution.

Note that we also monitored the memory resource consumption of the Java virtual machine running the whole TopHat workflow application, as shown in Fig. 11. This process consumed the same amount of memory resources as the Bowtie process from the beginning to the end of the execution. However, this process is not considered part of the TopHat processes.

Figure 12 presents the monitored results of the temporary storage TopHat uses while executing.

Fig. 12 Monitored temporary storage utilization: color online

This temporary storage usage is due to the temporary files written by the workflow application along the execution line, which are cleaned up at the end of the execution. The temporary storage usage increases as more data are being analyzed by the TopHat workflow application. However, the usage falls to zero at the end of the execution, once the workflow application cleans up. The essence of monitoring the temporary storage is to determine the total amount of storage resources required by the TopHat workflow application for seamless execution.

Without knowledge of the exact temporary storage requirement, one might not efficiently allocate enough storage resources to support the execution of the workflow application, because using only the monitored information from the normal storage utilization can cause poor resource allocation decisions in situations where TopHat is writing lots of temporary files. As seen in Fig. 10, there were jumps in the storage consumption along the execution line. These jumps are caused by the temporary storage usage, which the resource monitoring technique alone is unable to detect. This issue forces the definition of a large threat threshold value for the management of the storage resource. However, the monitoring of the temporary storage addresses this issue and provides the knowledge management with the missing information to efficiently manage the storage resource.

Generally, the application level monitoring complements the resource monitoring with fine-grained monitored information on the application's resource consumption behaviour. Furthermore, it supports the knowledge management in defining appropriate values to update the predefined threat thresholds. This is evident in the management of the storage resource, as can be observed from the monitored results.

In the next section, we discuss how the knowledge management deals with the notification messages generated by the monitoring techniques.

5.4 Ensuring Resource Availability

This section shows via simulations how the knowledge management approach reacts to the monitored data and enables seamless workflow application execution, as well as an economically efficient usage of resources.

To demonstrate this approach, we first define the SLAs shown in Table 2 for the workflow application, to specify the amount of available resources on the virtual machine required for the seamless execution of the workflow. For CPU power, we convert CPU utilization into Million Instructions Per Second (MIPS) based on the assumption that an Intel Xeon(R) 3 GHz processor delivers 10,000 MIPS at 100 % resource utilization of one core, and that the delivered MIPS degrade linearly with CPU utilization.
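The assumed conversion is straightforward; the following minimal Java sketch (with illustrative names) computes the MIPS delivered by a node.

```java
// Sketch of the MIPS conversion assumed above: one Intel Xeon(R) 3 GHz
// core delivers 10,000 MIPS at 100 % utilization, scaling linearly.
public class MipsConversion {
    static final double MIPS_PER_CORE = 10_000.0;

    static double deliveredMips(int cores, double cpuUtilizationPercent) {
        return cores * MIPS_PER_CORE * cpuUtilizationPercent / 100.0;
    }

    public static void main(String[] args) {
        // Two cores at full utilization meet the 20,000 MIPS SLO of Table 2.
        System.out.println(deliveredMips(2, 100.0) + " MIPS");
    }
}
```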

Additionally, we analyze the Resource Allocation Efficiency (RAE) of the various simulations, where we relate violations and utilization. The basic idea is that RAE should be equal to the utilization (100 % − w, where w stands for wastage, see below) if no violations occur (p = 0 %, where p stands for penalty, see below), equal to zero if the violation rate is at 100 %, and follow a linear decrease in between. Thus, we define

RAE = (100 − w)(100 − p) / 100.    (3)

Furthermore, we define a more general approach also taking into account the cost of actions, with a generic cost function that maps SLA violations, resource wastage, and the costs of executed actions into a monetary unit, which we describe as Cloud EUR. The essence of this monetary unit is to devise an independent means of comparing the various cost values realized in the different evaluation scenarios. There is no pricing model behind it.

Table 2 TopHat SLA

Service level objective (SLO) name   SLO value
CPU power                            ≥ 20000 MIPS
Memory                               ≥ 8192 MB
Storage                              ≥ 19456 MB

The cost values in our generic cost function are defined based on the specified SLA penalties for violating the agreed SLA terms, the consequences of resource wastage, and the effects of the corrective actions to avoid SLA violations on the Cloud system performance. Equation (4) presents the cost function:

c(p, w, a) = Σ_r [ p_r(p_r) + w_r(w_r) + a_r(a_r) ],    (4)

where, for a resource r, p_r(p_r) : [0, 100] → R+ defines the costs due to the penalties that have to be paid according to the relative number of SLA violations (as compared to all possible SLA violations) p_r; w_r(w_r) : [0, 100] → R+ defines the costs due to unutilized resources w_r; and a_r(a_r) : [0, 100] → R+ the costs due to the executed number of actions a_r (as compared to the number of all possible actions).
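The following minimal Java sketch evaluates Eqs. (3) and (4); the linear per-resource cost functions and the measured percentages are illustrative assumptions, since the paper only requires mappings from [0, 100] into nonnegative Cloud EUR costs.

```java
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Sketch of Eqs. (3) and (4). The linear cost functions per resource
// and the measured percentages are illustrative assumptions.
public class CostModel {
    // RAE = (100 - w)(100 - p) / 100, Eq. (3).
    static double rae(double wastagePercent, double penaltyPercent) {
        return (100.0 - wastagePercent) * (100.0 - penaltyPercent) / 100.0;
    }

    public static void main(String[] args) {
        System.out.println("RAE = " + rae(20.0, 5.0) + " %"); // 76.0 %

        // Eq. (4): sum per-resource penalty, wastage, and action costs.
        DoubleUnaryOperator penaltyCost = p -> 3.0 * p; // Cloud EUR per % violations
        DoubleUnaryOperator wastageCost = w -> 1.0 * w; // Cloud EUR per % wastage
        DoubleUnaryOperator actionCost  = a -> 0.5 * a; // Cloud EUR per % actions

        // resource -> {violations %, wastage %, actions %} (illustrative values)
        Map<String, double[]> measured = Map.of(
            "cpu",     new double[] {0.6, 25.0, 4.0},
            "memory",  new double[] {0.0, 15.0, 3.0},
            "storage", new double[] {0.0, 14.0, 2.0});

        double total = 0.0;
        for (double[] m : measured.values()) {
            total += penaltyCost.applyAsDouble(m[0])
                   + wastageCost.applyAsDouble(m[1])
                   + actionCost.applyAsDouble(m[2]);
        }
        System.out.println("total cost = " + total + " Cloud EUR");
    }
}
```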

The evaluation of our knowledge management techniques is done in three simulation categories, where we set up and configure our VM resources differently to manage the agreed SLA objectives and to prevent or correct SLA violation situations. The categories are described as follows:

1. In the first category (Scenario 1), we assume a static configuration with a fixed initial resource configuration of the VMs. Normally, when setting up such a testbed as described in Section 5.1, an initial guess of possible resource consumption is made based on early monitoring data. From this configuration, we assume quite generous resource limits. The first ten measurements of CPU, memory, and storage lie in the ranges of [140, 12,500] MIPS, [172, 1,154] MB, and [15.6, 15.7] GB, respectively. So we initially configured our VM with 15,000 MIPS, 4,096 MB, and 17.1 GB, respectively.

2. The second category subsumes several scenarios, where we apply our knowledge management approach to manage the initial VM configurations as described in the first category. The eight scenarios in this category depend on the chosen Threat Thresholds (TTs). According to Table 3, we define these scenarios as Scenario 2.1, 2.2, . . . , 2.8, respectively. In these scenarios, we investigate low, middle, and high values for TT^r_low and TT^r_high, where TT^r_low ∈ {30 %, 50 %, 70 %} and TT^r_high ∈ {60 %, 75 %, 90 %} for all resources stated above. We combine the TTs to form eight different scenarios, as depicted in Table 3.

Table 3 The 8 simulation scenarios for TT_low and TT_high

Scenario   1     2     3     4     5     6     7     8
TT_low     30 %  30 %  30 %  50 %  50 %  50 %  70 %  70 %
TT_high    60 %  75 %  90 %  60 %  75 %  90 %  75 %  90 %

3. In the third category (Scenario 3), we consider a best-case scenario, where we assume to have an oracle that predicts the maximal resource consumption values, which can be used to statically configure the virtual machines. The essence of this third scenario is to demonstrate the better resource utilization achievable by our knowledge management approach compared to peak provisioning.

Figure 13a, b and c present the violations, the utilization, and the number of reconfiguration actions, respectively, for every SLA parameter (together with an average value) in the different scenarios. Generally, the bars in the figures are naturally ordered, beginning with Scenario 1, over Scenarios 2.1, . . . , 2.8, and ending with Scenario 3. In Scenario 1, we have a static resource configuration, and the number of violations in this scenario reaches 41.7 % for CPU and memory, and 49.4 % for storage, which leads to an average of 44.3 %. Thus, we experience violations in almost half of the cases. This is especially crucial for the parameters memory and storage, where program execution could fail if it runs out of memory or storage, whereas a violation of the CPU parameter would "only" delay the successful termination of the workflow application. We excluded the results of Scenarios 1 and 3 from Fig. 13a for better visibility, in order to focus on the achieved results of the Scenarios 2.*, which apply the knowledge management approach to prevent or correct the SLA violations.

With the Scenarios 2.* in the second category, we can reduce the SLA violations to a minimum. We completely avoid violations for storage in all sub-scenarios, as well as for memory in all but one sub-scenario. Also, CPU violations can be reduced to 0.6 % for Sub-scenarios 2.1 and 2.4, and still achieve a maximum SLA violation rate of 2.8 % with Scenario 2.8. The average SLA violation rate can be lowered to 0.2 % in the best case. Scenario 3, of course, shows no violations. However, the maximum resource consumption is unlikely to be known before workflow execution.

Fig. 13 Violations (a), utilization (b), and reconfiguration actions (c) for ten knowledge management scenarios using the bioinformatics workflow: color online

The resource utilization is clearly higher when a lot of violations occur, due to the exhaustion of available resources. Thus, Scenario 1 naturally achieves a high resource utilization, because when a parameter is violated, the resource is already fully used up, but more of the resource would be needed to fulfill the requirements. In contrast, Scenario 3 naturally achieves a low utilization, as a lot of resources are over-provisioned. The Scenarios 2.* achieve a good utilization that is on average in between the two extremes and ranges from 70.6 % (Scenario 2.1) to 86.2 % (Scenario 2.8). Furthermore, we observe some exceptions to this "rule" when considering individual parameters. For memory, e.g., we achieve a utilization of 85.0 % with Scenario 2.8 or 80.0 % with Scenario 2.6, which is higher than the utilization in Scenario 1 (77.4 %). The same is true for CPU, with a utilization rate of 85.5 % in Scenario 2.8 as compared to 84.3 % in Scenario 1. Only in the case of storage did all but one of the Scenarios 2.* achieve a utilization of 85.9 %, which is smaller than that of Scenario 3 (90.1 %).

The clear advantage of Scenarios 2.* is that they do not run into any critical SLA violation (except for Scenario 2.3), yet they achieve a higher resource utilization than Scenario 3. Regarding reallocation actions, Scenarios 1 and 3 of course execute none; only Scenarios 2.* do, as shown in Fig. 13c. Moreover, with the knowledge management in Scenarios 2.*, the share of executed reallocation actions stays below 10 % for most scenarios. Only Scenario 2.7 executes actions in 19.8 % of the cases on average, and five out of eight scenarios stay below 5 %. These results show that our knowledge management approach is scalable and only mildly intrusive on the overall system: it does not introduce significant overhead that might degrade system performance.
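The reconfiguration behaviour follows directly from the TT pair: an action fires only when utilization leaves the band between TTlow and TThigh, so a wide band such as Scenario 2.3's 30/90 % triggers far fewer actions than a narrow band such as Scenario 2.7's 70/75 %. A hedged sketch of this rule, with an illustrative step size of our choosing:

# Sketch of the threshold rule as we understand it (function and
# step size are illustrative): act only when utilization leaves the
# [tt_low, tt_high] band, so wider bands mean fewer reconfigurations.
def reallocation_action(utilization, tt_low, tt_high, step=0.1):
    if utilization > tt_high:
        return +step            # provision more of the resource
    if utilization < tt_low:
        return -step            # release over-provisioned capacity
    return 0.0                  # inside the band: do nothing

# A narrow band (cf. Scenario 2.7, 70/75 %) triggers more actions
# than a wide one (cf. Scenario 2.3, 30/90 %) on the same trace.
trace = [0.65, 0.78, 0.72, 0.50, 0.92]
for low, high in [(0.70, 0.75), (0.30, 0.90)]:
    acts = sum(1 for u in trace if reallocation_action(u, low, high) != 0.0)
    print((low, high), "actions:", acts)  # 4 actions vs. 1 action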

When it comes to the overall costs of the scenarios (cf. Fig. 14a), all Scenarios 2.* approach the result achieved by the best-case Scenario 3. The cost of SLA violations in Scenario 1 sums up to 4493.6 Cloud EUR.

Fig. 14 Cost (a) and resource allocation efficiency (b) for the ten knowledge management scenarios using the bioinformatics workflow (color online)

This value is omitted from Fig. 14a because its large magnitude would distort the scale of the figure and obscure the other cost results. Furthermore, the lowest cost is achieved by Scenario 2.6, which is even lower than the cost of Scenario 3. This is possible because Scenario 2.6 achieves a very good resource utilization and a low SLA violation rate with very few reallocation actions. The resource allocation efficiency (RAE) of Scenarios 2.*, shown in Fig. 14b, is also remarkably better than that of Scenario 1 (RAE of 48.2 %). Moreover, all scenarios of the second category achieve a better RAE than Scenario 3 (69.3 %).
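Although the exact cost model is defined in the evaluation setup earlier in the paper, its structure can be illustrated as a sum of violation penalties, idle-capacity costs, and reallocation-action costs. The prices in the sketch below are placeholders of our own, not the paper's calibration.

# Hypothetical cost aggregation (placeholder prices in "Cloud EUR";
# the actual calibration is defined in the paper's evaluation setup):
# violations, provisioned-but-unused capacity, and reallocation
# actions all contribute to the total cost of a scenario.
def scenario_cost(n_violations, unused_capacity_hours, n_reallocations,
                  penalty=10.0, idle_price=0.05, action_price=0.5):
    return (n_violations * penalty
            + unused_capacity_hours * idle_price
            + n_reallocations * action_price)

print(scenario_cost(n_violations=2, unused_capacity_hours=40,
                    n_reallocations=12))  # 2*10 + 40*0.05 + 12*0.5 = 28.0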

Thus, we conclude that by using the suggested knowledge management strategy, we can avoid most costly SLA violations and thereby facilitate application execution while improving resource usage efficiency, as demonstrated with the workflow application. All of this can be achieved with a very low number of time- and energy-consuming VM reallocation actions, as shown in many of the knowledge management scenarios.

Based on these observations, we conclude that by using the suggested Cloud management infrastructure techniques, we can guarantee the performance and successful completion of workflow applications. Furthermore, we can manage resources efficiently to avoid considerable wastage, extra maintenance costs, and CO2 emissions due to the unnecessary energy consumption of unused resources.

6 Related Work

A number of workflow systems currently exist that are applicable in the area of the life sciences. While these workflow systems are valuable for the explorative analysis of data, they do not provide the resource management capabilities needed to ensure the performance and successful completion of many modern, more demanding workflow applications.

Hull et al. [17] discuss the Taverna system, a workbench designed to build workflows for bioinformatics. It is a Java application that can build complex workflows, access both remote and local processors of different kinds, launch the execution of workflow applications, and display different types of results. The strength of this system lies in its simple interface and its flexibility in supporting access to services via different types of interfaces. However, it does not provide efficient resource management capabilities.

Altintas et al. [2] propose Kepler, an open-source system based on the Ptolemy II tool from the University of California, Berkeley. Ptolemy can combine concurrent components that are independent and autonomous through a well-defined model of computation that governs all interactions and thus guarantees the correct operation of the system. The independent components can be used to achieve different goals for workflow applications. However, Kepler does not facilitate adequate resource management to support the execution of workflow applications.

Deelman et al. [6] present Pegasus, a workflow manager that supports the automatic conversion of high-level workflow descriptions into executable workflows and enacts them on a Condor-based Grid infrastructure. It consists of a Metadata Catalogue Service (MCS) and implementations of workflow reduction, resource selection, and task clustering to optimize execution performance. Pegasus implements both data and computational abstractions and can therefore perform some management functions, but it suffers from a lack of openness and standardization.

Other advanced workflow systems, including Galaxy [14] and Wildfire [41], offer an easy way to access local and remote resources and to perform automatic analyses to test hypotheses or process data. However, many well-known challenges in the field of data integration and workflow systems for bioinformatics [12, 13, 26] remain unsolved, such as transferring large amounts of data, which can incur high transfer and storage costs at remote locations.

Despite their usefulness, the above-mentioned workflow systems lack the capabilities for online resource monitoring of application execution as available in Cloud environments. Moreover, they are not capable of dynamic resource management (i.e., dynamically allocating and deallocating resources to applications) using knowledge databases as a basis for decision making.

7 Conclusion and Future Work

With the recent rapid development of high-throughput technologies in the life sciences, huge amounts of data are being generated and stored in different databases. The analysis of these large-scale data sets by scientists in search of biomedically interesting patterns presents serious technical challenges. Scientific workflow applications can be instrumental in addressing these challenges, provide a certain degree of automation, and thus empower more advanced and complex studies in the life sciences.

The analysis of scientific data using workflow applications is computationally intensive and requires adequate availability of resources to achieve satisfactorily high performance and successful execution completion. A lack of computational resources disrupts the analysis process, causing a waste of time and money. Traditional computational environments consist of static, limited resources and are therefore not well suited for the execution of large workflow applications. Cloud computing is proving to be a valuable complement to these traditional static compute facilities. Cloud technology promises scalable and on-demand resource provisioning. With this technology, it is now possible to dynamically allocate resources to workflow application executions.

In this paper, we have presented a novel Cloud management infrastructure consisting of resource-level and application-level monitoring techniques, and a knowledge management strategy. The monitoring techniques are responsible for observing the Cloud environments in which scientific workflow applications execute, in order to determine the resource consumption behaviours of the applications, the status of the computational resources, and the performance of the applications. The knowledge management uses the monitored information to dynamically allocate resources to the workflow applications at runtime. We discussed in detail the design of these Cloud management infrastructure techniques and their applicability to supporting workflow application execution.
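Abstractly, the described infrastructure forms a closed loop in which the monitors feed measurements to the knowledge component, which in turn decides on reallocation actions. The following toy sketch shows only this loop shape; all names and the scaling policy are illustrative assumptions of ours, not the paper's implementation.

# Our abstraction of the described loop (names illustrative):
# resource- and application-level monitors produce measurements;
# the knowledge component maps them to actions executed on the VM.
def management_loop(monitor, decide, actuate, intervals):
    for _ in range(intervals):
        measurement = monitor()          # resource/application metrics
        action = decide(measurement)     # knowledge management decision
        if action is not None:
            actuate(action)              # reconfigure the VM at runtime

# Toy wiring: grow allocated CPU whenever utilization exceeds 0.9.
state = {"alloc": 1.0, "use": 0.95}
management_loop(
    monitor=lambda: state["use"] / state["alloc"],
    decide=lambda u: "scale_up" if u > 0.9 else None,
    actuate=lambda a: state.update(alloc=state["alloc"] * 1.5),
    intervals=3,
)
print(state)  # alloc grew once, until utilization fell below 0.9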

We have evaluated our approach using the well-known TopHat workflow application to analyze RNA-Seq data over a duration of about three hours, through which we showed how resource bottlenecks are detected at runtime and how the knowledge management can dynamically allocate more resources in such cases. Based on our evaluations, we conclude that we successfully utilized the Cloud management infrastructure techniques to support the execution of a scientific workflow application so as to guarantee its performance goals and successful completion.
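For readers unfamiliar with the evaluated workflow, the TopHat pipeline [24, 25, 43] has roughly the following shape: build a Bowtie index of the reference, map the RNA-Seq reads with TopHat, and post-process the resulting alignments with SAMtools. The sketch below is our reconstruction with placeholder file names, not the exact evaluation configuration.

# Sketch of the pipeline's shape (our reconstruction; file names are
# placeholders, not the evaluation setup): index the reference with
# Bowtie, map reads with TopHat, then index the alignments.
import subprocess

def run(cmd):
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["bowtie-build", "reference.fa", "ref_index"])               # build index
run(["tophat", "-o", "tophat_out", "ref_index", "reads.fastq"])  # map reads
run(["samtools", "index", "tophat_out/accepted_hits.bam"])       # index BAM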

In the future, we intend to integrate these Cloud management infrastructure techniques into workflow systems to achieve self-manageable autonomic systems that are capable of managing computational resources for data analysis, aiming to fully realize the power of computational methods and the mining of experimental data in the life sciences by means of workflow applications.

Acknowledgements This work is supported by the Vienna Science and Technology Fund (WWTF) under the grant agreement ICT08-018, Foundation of Self-governing ICT Infrastructures (FoSII). PPL and DPK gratefully acknowledge support by the Vienna Science and Technology Fund (WWTF), Baxter AG, the Austrian Institute of Technology, and the Austrian Centre of Biopharmaceutical Technology. This paper is a substantially extended version of a WORKS11 paper [9].

References

1. ActiveMQ: Messaging and integration pattern provider. http://activemq.apache.org/. Accessed 4 April 2013

2. Altintas, I., Berkley, C., Jones, E.M., Ludascher, B., Mock, S.: Kepler: an extensible system for design and execution of scientific workflows. In: 16th International Conference on Scientific and Statistical Database Management, pp. 423-424 (2004)

3. Buyya, R., Yeo, C.S., Venugopal, S., Broberg, J., Brandic, I.: Cloud computing and emerging IT platforms: vision, hype, and reality for delivering computing as the 5th utility. Futur. Gener. Comput. Syst. 25(6), 599-616 (2009)

4. Cantacessi, C., Jex, A.R., Hall, R.S., Young, N.D., Campbell, B.E., Joachim, A., Nolan, M.J., Abubucker, S., Sternberg, P.W., Ranganathan, S., Mitreva, M., Gasser, R.B.: A practical, bioinformatic workflow system for large data sets generated by next-generation sequencing. Nucleic Acids Res. 38(17), e171 (2010)

5. Comuzzi, M., Kotsokalis, C., Spanoudakis, G., Yahyapour, R.: Establishing and monitoring SLAs in complex service based systems. In: Proceedings of the 7th International Conference on Web Services (ICWS'09) (2009)

6. Deelman, E., Singh, G., Su, M.-H., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Vahi, K., Berriman, G.B., Good, J., Laity, A., Jacob, J.C., Katz, D.S.: Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Sci. Program. 13, 219-237 (2005)

7. Emeakaroha, V.C., Brandic, I., Maurer, M., Dustdar, S.: Low level metrics to high level SLAs - LoM2HiS framework: bridging the gap between monitored metrics and SLA parameters in Cloud environments. In: 2010 International Conference on High Performance Computing and Simulation (HPCS), pp. 48-54 (2010)

8. Emeakaroha, V.C., Calheiros, R.N., Netto, M.A.S., Brandic, I., De Rose, C.A.F.: DeSVi: an architecture for detecting SLA violations in Cloud computing infrastructures. In: Proceedings of the 2nd International ICST Conference on Cloud Computing (CloudComp'10) (2010)

9. Emeakaroha, V.C., Łabaj, P.P., Maurer, M., Brandic, I., Kreil, D.P.: Optimizing bioinformatics workflows for data analysis using cloud management techniques. In: Proceedings of the 6th Workshop on Workflows in Support of Large-Scale Science, WORKS '11, pp. 37-46. ACM, New York, NY, USA (2011)

10. Emeakaroha, V.C., Netto, M.A.S., Calheiros, R.N., Brandic, I., Buyya, R., De Rose, C.A.F.: Towards autonomic detection of SLA violations in Cloud infrastructures. Futur. Gener. Comput. Syst. 28(7), 1017-1029 (2012)

11. Ferretti, S., Ghini, V., Panzieri, F., Pellegrini, M., Turrini, E.: QoS-aware clouds. In: 2010 IEEE 3rd International Conference on Cloud Computing (CLOUD), pp. 321-328 (2010)

12. Goderis, A., Li, P., Goble, C.: Workflow discovery: the problem, a case study from e-science and a graph-based solution. In: IEEE International Conference on Web Services, pp. 312-319 (2006)

13. Goderis, A., Sattler, U., Lord, P., Goble, C.: Seven bottlenecks to workflow reuse and repurposing. In: Gil, Y., Motta, E., Benjamins, V., Musen, M. (eds.) Semantic Web - ISWC 2005. Lecture Notes in Computer Science, vol. 3729, pp. 323-337. Springer, Berlin/Heidelberg (2005)

14. Goecks, J., Nekrutenko, A., Taylor, J., and The Galaxy Team: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11(8), R86 (2010)

15. Halligan, B.D., Geiger, J.F., Vallejos, A.K., Greene, A.S., Twigger, S.N.: Low cost, scalable proteomics data analysis using Amazon's cloud computing services and open source search algorithms. J. Proteome Res. 8(6), 3148-3153 (2009)

16. Hollingsworth, D.: The workflow reference model. Technical Report (WFMC-TC00-1003), Workflow Management Coalition (1995)

17. Hull, D., Wolstencroft, K., Stevens, R., Goble, C., Pocock, M.R., Li, P., Oinn, T.: Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 34(Suppl 2), W729-W732 (2006)

18. JFree.org: JFreeChart. http://www.jfree.org/jfreechart/. Accessed 4 April 2013

19. JMS: Java messaging service. http://java.sun.com/products/jms/. Accessed 4 April 2013

20. Kephart, J.O., Chess, D.M.: The vision of autonomic computing. IEEE Comput. 36(1), 41-50 (2003)

21. Koller, B., Schubert, L.: Towards autonomous SLA management using a proxy-like approach. Multiagent Grid Syst. 3(3), 313-325 (2007)

22. Kreil, D.P.: From general scientific workflows to specific sequence analysis applications: the study of compositionally biased proteins. Ph.D. thesis (2001)

23. Łabaj, P.P., Leparc, G.G., Linggi, B.E., Markillie, L.M., Wiley, H.S., Kreil, D.P.: Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling. Bioinformatics 27(13), i383-i391 (2011)

24. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10(3), R25 (2009)

25. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., 1000 Genome Project Data Processing Subgroup: The sequence alignment/map format and SAMtools. Bioinformatics 25(16), 2078-2079 (2009)

26. Linke, B., Giegerich, R., Goesmann, A.: Conveyor: a workflow engine for bioinformatic analyses. Bioinformatics 27(7), 903-911 (2011)

27. Massie, M.L., Chun, B.N., Culler, D.E.: The Ganglia distributed monitoring system: design, implementation and experience. Parallel Comput. 30(7), 817-840 (2004)

28. Maurer, M., Brandic, I., Emeakaroha, V.C., Dustdar, S.: Towards knowledge management in self-adaptable clouds. In: IEEE 2010 Fourth International Workshop of Software Engineering for Adaptive Service-Oriented Systems, Miami, USA (2010)

29. Maurer, M., Brandic, I., Sakellariou, R.: Simulating autonomic SLA enactment in clouds using case based reasoning. In: ServiceWave 2010: Proceedings of the 2010 ServiceWave Conference, Ghent, Belgium (2010)

30. Maurer, M., Brandic, I., Sakellariou, R.: Enacting SLAs in clouds using rules. In: Proceedings of Euro-Par 2011 (2011)

31. Maurer, M., Brandic, I., Sakellariou, R.: Adaptive resource configuration for Cloud infrastructure management. Futur. Gener. Comput. Syst. 29(2), 472-487 (2013)

32. Merchant, N., Hartman, J., Lowry, S., Lenards, A., Lowenthal, D., Skidmore, E.: Leveraging cloud infrastructure for life science research laboratories: a generalized view. In: International Workshop on Cloud Computing at OOPSLA09, Orlando, USA (2009)

33. Nurmi, D., Wolski, R., Grzegorczyk, C., Obertelli, G., Soman, S., Youseff, L., Zagorodnov, D.: The Eucalyptus open-source cloud-computing system. In: Proceedings of the 9th International Symposium on Cluster Computing and the Grid (CCGRID'09) (2009)

34. Pennisi, E.: Will computers crash genomics? Science 331(6018), 666-668 (2011)

35. Robinson, G.E., Banks, J.A., Padilla, D.K., Burggren, W.W., Cohen, C.S., Delwiche, C.F., Funk, V., Hoekstra, H.E., Jarvis, E.D., Johnson, L., Martindale, M.Q., Rio, C.M., Medina, M., Salt, D.E., Sinha, S., Specht, C., Strange, K., Strassmann, J.E., Swalla, B.J., Tomanek, L.: Empowering 21st century biology. BioScience 60(11), 923-930 (2010)

36. Rochwerger, B., Breitgand, D., Levy, E., Galis, A., Nagin, K., Llorente, L., Montero, R., Wolfsthal, Y., Elmroth, E., Caceres, J., Ben-Yehuda, M., Emmerich, W., Galan, F.: The RESERVOIR model and architecture for open federated cloud computing. IBM J. Res. Dev. 53(4), 535-545 (2009)

37. Romano, P.: Automation of in-silico data analysis processes through workflow management systems. Brief. Bioinform. 9(1), 57-68 (2007)

38. Smedley, D., Swertz, M.A., Wolstencroft, K., Proctor, G., Zouberakis, M., Bard, J., Hancock, J.M., Schofield, P.: Solutions for data integration in functional genomics: a critical assessment and case study. Brief. Bioinform. 9(6), 532-544 (2008)

39. Stein, L.D.: Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges. Nat. Rev. Genet. 9(9), 678-688 (2008)

40. Stoegerer, C., Brandic, I., Emeakaroha, V.C., Kastner, W., Novak, T.: Applying availability SLAs to traffic management systems. In: Proceedings of the IEEE Intelligent Transportation Systems Conference (ITSC 2011) (2011)

41. Tang, F., Chua, C.L., Ho, L.-Y., Lim, Y.P., Issac, P., Krishnan, A.: Wildfire: distributed, grid-enabled workflow construction and execution. BMC Bioinformatics 6(69) (2005). http://www.biomedcentral.com/1471-2105/6/69

42. Tiwari, A., Sekhar, A.K.: Workflow based framework for life science informatics. Comput. Biol. Chem. 31(5-6), 305-319 (2007)

43. Trapnell, C., Pachter, L., Salzberg, S.L.: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9), 1105-1111 (2009)