User Inspired Management of Scientific Jobs in Grids and Clouds
DESCRIPTION
This is my PhD defense presentation discussing my work on improving scientific job execution in grids and clouds. It describes how user patterns can be mined to learn user behavior and improve meta-scheduler decisions. The proposed and implemented resource abstraction layer helps scientists interact with a wide variety of compute resources.
TRANSCRIPT
User Inspired Management of Scientific Jobs in Grids and Clouds
Eran Chinthaka Withana
School of Informatics and Computing
Indiana University, Bloomington, Indiana, USA
Doctoral Committee
Professor Beth Plale, PhD
Dr. Dennis Gannon, PhD
Professor Geoffrey Fox, PhD
Professor David Leake, PhD
Outline
• Mid-Range Science
  – Challenges and Opportunities
  – Current Landscape
• Research
  – Research Questions
  – Contributions
• Mining Historical Information to Find Patterns and Experiences
  – Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
  – Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
• Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
• Applications
• Related Work
• Conclusion and Future Work
Thesis Defense - Eran Chinthaka Withana
Mid-Range Science
• Challenges
  – Resource requirements going beyond lab and university, but not suited for large-scale resources
  – Difficulties finding sufficient compute resources
    • E.g., short-term forecasts in LEAD for energy and agriculture
  – Lack of resources to have a strong CS support person on the team
  – Need for less expensive and more available resources
• Opportunities
  – Wide variety of computational resources
  – Science gateways
Current Landscape
• Grid Computing
  – Batch orientation, long queues even under moderate loads, no access transparency
  – Drawbacks in the quota system
  – Level of computer science expertise required
• Cloud Computing
  – High availability, pay-as-you-go model, on-demand "limitless"[1] resource allocation
  – Payment policy and research cost models
• Use of Workflow Systems
  – Hybrid workflows
    • Enable utilization of heterogeneous compute resources, e.g., the Vortex2 experiment
• Need for resource abstraction layers and optimal selection of resources
• Need for improved scientific job executions
  – Better scheduler decisions, selection of compute resources
  – Reliability issues in compute resources
  – Importance of learning user patterns and experiences
[1] M. Armbrust et al., "Above the Clouds: A Berkeley View of Cloud Computing," Tech. Rep. UCB/EECS-2009-28, EECS Department, University of California, Berkeley, 2009.
Research Questions
“Can user patterns and experiences be used to improve scientific job executions in large scale systems?”
“Can a simple, reliable and highly scalable uniform resource abstraction be achieved to interact with a variety of compute resource providers?”
“Can these be put to use to advance science?”
Contributions
• Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
• Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
• Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service to hide the complexities of interacting with multiple resource managers in grids and clouds.
• Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, and integration with four different application domains to prove its usability.
Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds
• Objective
  – Reducing the impact of startup overheads for time-critical applications
• Problem space
  – Workflows can have multiple paths
  – Workflow descriptions are not available
    • Need for predictions to identify the job execution sequence
  – Learning from user behavioral patterns to predict future jobs
• Research outline
  – Algorithm to predict future jobs by extracting user patterns from historical information
  – Use of knowledge-based techniques
    • Zero knowledge, or pre-populated job information consisting of connections between jobs
    • Similar cases retrieved are used to predict future jobs, reducing high startup overheads
  – Algorithm assessment
    • Two different workloads: individual scientific jobs executed at LANL, and a set of workflows executed by three users
Demonstration of User Patterns with Workflows
• A suite of workflows can differ from domain to domain
  – E.g., WRF (Weather Research and Forecasting) as the upstream node
• User patterns reveal the sequence of jobs, taking different users/domains into consideration
• Useful for a science gateway serving a wide range of mid-scale scientists
[Diagram: WRF feeding Weather Predictions, Crop Predictions, Wind Farm Location Evaluations, and Wild Fire Propagation Simulation]
Role of Successful Predictions to Reduce Startup Overheads
• Largest gain can be achieved when our prediction accuracy is high and setup time (s) is large with respect to execution time (t)
r = probability of successful prediction (prediction accuracy)
Let s_i and t_i be the startup time and execution time of job i (i = 0..N), so the total time without prediction is

  T = Σ_{i=0}^{N} (s_i + t_i)

With a successful prediction (probability r), a job's startup overhead is hidden by advance provisioning, so the expected total time is

  r · Σ_{i=0}^{N} t_i + (1 − r) · T

  Percentage time reduction = (T − expected time) / T = r · Σ_{i=0}^{N} s_i / Σ_{i=0}^{N} (s_i + t_i)

For simplicity, assuming equal job execution and startup times (t_i = t, s_i = s) for all N jobs:

  Percentage time reduction = (r · s · N) / ((t + s) · N) = r · s / (t + s)
Relationship of Predictions to Execution Time
Percentage time reduction = r · s / (t + s) = r / (t/s + 1)
• Observations
  – Percentage time reduction increases with the accuracy of predictions
  – Time reduction decreases exponentially with increased work-to-overhead ratio (t/s)
• Need to find the critical point for a given situation
  – Fixing the required percentage time reduction for a given t/s ratio and finding the required accuracy of predictions
• Cost of wrong predictions
  – Depends on the compute resource
  – Demonstrated that higher prediction accuracies (~90%) will reduce the impact of wrong predictions
  – Compromising cost to improve time
Accuracy of Predictions = total successful future job predictions / total predictions
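The reduction formula can be sketched numerically; a minimal illustration (the r, t and s values below are hypothetical, not from the evaluation):

```python
def percentage_time_reduction(r: float, t: float, s: float) -> float:
    """Expected fraction of total time saved when startup time s is
    hidden by a correct provisioning prediction made with accuracy r."""
    return r * s / (t + s)

# Reduction grows with prediction accuracy r ...
assert percentage_time_reduction(0.9, 10, 10) > percentage_time_reduction(0.5, 10, 10)
# ... and shrinks as the work-to-overhead ratio t/s grows.
assert percentage_time_reduction(0.9, 100, 10) < percentage_time_reduction(0.9, 10, 10)
```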
Prediction Engine: System Architecture
[Architecture diagram, including the Prediction Retriever component]
Use of Reasoning
• Store and retrieve cases
• Steps
  – Retrieval of similar cases
    • Similarity measurement
    • Use of thresholds
  – Reuse of old cases
  – Case adaptation
  – Storage
Case Similarity Calculation
• Each case is represented by a set of attributes
  – Selected by finding their effect on the goal variable (the next job)
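A weighted attribute-match retrieval of this kind could be sketched as follows; the attribute names, weights and threshold are hypothetical, not the thesis's actual feature set:

```python
# Minimal case-based-reasoning sketch: cases are attribute dictionaries,
# and similarity is the weighted fraction of matching attributes.
# WEIGHTS is illustrative, not the attribute selection from the thesis.
WEIGHTS = {"user": 0.5, "last_job": 0.3, "queue": 0.2}

def similarity(case_a: dict, case_b: dict) -> float:
    """Sum of weights of attributes on which the two cases agree."""
    return sum(w for attr, w in WEIGHTS.items()
               if case_a.get(attr) == case_b.get(attr))

def retrieve(query: dict, case_base: list, threshold: float = 0.5) -> list:
    """Return stored cases at least `threshold`-similar to the query,
    most similar first; their recorded next jobs become predictions."""
    scored = [(similarity(query, c), c) for c in case_base]
    return [c for s, c in sorted(scored, key=lambda x: -x[0]) if s >= threshold]

case_base = [
    {"user": "u1", "last_job": "WRF", "queue": "a", "next_job": "crop_prediction"},
    {"user": "u2", "last_job": "other", "queue": "b", "next_job": "other_job"},
]
matches = retrieve({"user": "u1", "last_job": "WRF", "queue": "c"}, case_base)
assert matches[0]["next_job"] == "crop_prediction"
```

The threshold keeps dissimilar cases out of the prediction set, which is the role of the "use of thresholds" step above.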
Evaluation
• Use cases
  – Individual job workload[1]
    • 40k jobs over two years from the 1024-node CM-5 at Los Alamos National Lab
  – Workflow use case
    • System doesn't see or assume a workflow specification

User workflows in the experiment:
User 1: Workflow 1, Workflow 2, Workflow 5
User 2: Workflow 2, Workflow 4
User 3: Workflow 2, Workflow 3, Workflow 4

• Experimental setup
  – 2.0GHz dual-core processor, 4GB memory, 64-bit Windows operating system

[1] Parallel Workload Archive, http://www.cs.huji.ac.il/labs/parallel/workload/
Evaluation: Average Accuracy of Predictions
• Individual jobs workload
  – ~75% accurate predictions with user patterns
  – ~32% accurate predictions with service names
• Workflow workload
  – ~95% accurate predictions with user patterns
  – ~53% accurate predictions with service names
Evaluation: Time Saved
• Amount of time that can be saved if resources are provisioned by the time a job is ready to run
• Startup time
  – Assumed to be 3 minutes (average for commercial providers)
[Charts: Individual Jobs Workload, Workflow Workload]
Evaluation: Prediction Accuracies for Use Cases
• User-pattern-based predictions perform 2x better than service-name-based predictions
User Perceived Reliability
• Failures are tolerated through fault tolerance, high availability, recoverability, etc. [Birman05]
• What matters from a user's point of view is whether these failures are visible to users or not
  – E.g., reliability of commodity hardware (in clouds) vs. user-perceived reliability
• Reliability is not of the resources themselves
  – Not derived from halting failures, fail-stop failures, network partitioning failures [Birman05], or machine downtimes
  – It is a more broadly encompassing system reliability that can only be seen at the user or workflow level
    • Can depend on the user's configuration and job types as well
  – We refer to this form of reliability as user-perceived reliability
• Importance of user-perceived reliability
  – Selecting a resource to schedule an experiment when the user has access to multiple compute resources
  – E.g., LEAD reliability: supercomputing resources vs. Windows Azure resources
Why User Perceived Reliability is Useful
• User-perceived failure probabilities
  – Cluster A: p(A) = 0.2; Cluster B: p(B) = 0.3
  – P(A fails and B succeeds) = p(A) · (1 − p(B)) = 0.2 · (1 − 0.3) = 0.14
  – P(B fails and A succeeds) = p(B) · (1 − p(A)) = 0.3 · (1 − 0.2) = 0.24
• Since 0.14 < 0.24, try cluster A first and then cluster B.
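The ordering rule for the two-cluster example can be sketched directly (the cluster names and probabilities are the slide's example values):

```python
def order_clusters(failure_prob: dict) -> list:
    """Try the cluster with the lowest user-perceived failure
    probability first, so the fallback is needed least often."""
    return sorted(failure_prob, key=failure_prob.get)

probs = {"A": 0.2, "B": 0.3}
# Chance that the first attempt fails but the fallback succeeds:
p_a_fails_b_ok = probs["A"] * (1 - probs["B"])  # ~0.14
p_b_fails_a_ok = probs["B"] * (1 - probs["A"])  # ~0.24
assert p_a_fails_b_ok < p_b_fails_a_ok
assert order_clusters(probs) == ["A", "B"]
```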
Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions
• Objective
  – Reduce the impact of low reliability of compute resources
• Deducing user-perceived reliabilities
  – Learning from user experiences and perceptions
• Research outline
  – Algorithm to predict user-perceived reliabilities, learning from user experiences by mining historical information
  – Use of machine learning techniques
    • Trained classifiers to represent compute resources and their reliabilities
    • Prediction of job failures
  – Algorithm assessment
    • Workloads from the Parallel Workload Archive representing jobs executed in two different supercomputing clusters
System Architecture
• A machine learning classifier is trained to learn the user-perceived reliabilities of each cluster.
• Classifier types
  – Static classifier: trained initially from historical information
  – Dynamic (updateable) classifier: starts from zero knowledge and builds while the system is in operation
System Architecture
• The classifier manager uses the Weka [Hall09] framework
• Classification methods
  – Naïve Bayes and KStar
  – Static and dynamic classifiers
• Dynamic pruning of features [Fadishei09] for increased efficiency
• Classifier manager
  – Creates and maintains classifiers for each compute resource
  – A new job is evaluated against these classifiers to deduce the predicted reliability of the job execution
• Policy implementers
  – Consider resource reliability predictions together with other quality-of-service information (time, cost) to select a resource
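The thesis uses Weka's Java classifier implementations; the idea of an updateable per-resource classifier can be sketched in Python with a tiny hand-rolled categorical Naive Bayes (the job attributes and training data below are hypothetical):

```python
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Categorical Naive Bayes over job attributes, updateable one job
    at a time (in the spirit of the dynamic/updateable classifier)."""
    def __init__(self):
        self.class_counts = Counter()
        self.attr_counts = defaultdict(Counter)  # (attr, value) -> label counts

    def update(self, job: dict, label: str):
        self.class_counts[label] += 1
        for attr, value in job.items():
            self.attr_counts[(attr, value)][label] += 1

    def predict(self, job: dict) -> str:
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            score = count / total  # class prior
            for attr, value in job.items():
                seen = self.attr_counts[(attr, value)][label]
                score *= (seen + 1) / (count + 2)  # Laplace smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label

clf = TinyNaiveBayes()  # one classifier per compute resource
clf.update({"user": "u1", "queue": "short"}, "success")
clf.update({"user": "u1", "queue": "short"}, "success")
clf.update({"user": "u2", "queue": "long"}, "failure")
assert clf.predict({"user": "u1", "queue": "short"}) == "success"
```

A static classifier would call `update` only during an initial training pass; the dynamic variant keeps calling it as jobs complete.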
Evaluation
• Workloads from the Parallel Workload Archive [Feitelson]
  – LANL: two years' worth of jobs (1994 to 1996) on the 1024-node CM-5 at Los Alamos National Lab
  – LPC: ten months' (Aug 2004 to May 2005) worth of job records on a 70 Xeon-node cluster at the Laboratoire de Physique Corpusculaire of Université Blaise Pascal, France
• Minor cleanups to remove intermediate job states
• 10,000 jobs were selected from each workload
  – LANL had 20% failed jobs
  – LPC had 30% failed jobs
Evaluation
• Workload classification and maintenance
  – Classifiers: Naïve Bayes [John95] and KStar [Cleary95] implementations in Weka [Hall09]
  – Classifier construction
    • Static classifier: the first 1000 jobs train the classifier
    • Dynamic classifier: all 10,000 jobs used for classifier construction and evaluation
• Evaluation metrics
  – Average reliability prediction accuracy: accuracy of predicting the success/failure of a job
  – Time saved: cumulative time saved, aggregating the execution time of each job that failed and whose failure our system predicted
    • Baseline measure: ideal cumulative time that can be saved over time
  – Time consumed for classification and updating the classifier
  – Effect of pruning attributes
    • Static subset of attributes (as proposed in Fadishei et al. [Fadishei09]) vs. dynamic subset of attributes (checking the effect on the goal variable)
Evaluation
• Evaluation metrics
  – Effect of job reliability predictions on selecting compute resources
    • An extended version of GridSim [Buyya02] models four compute resources; NWS [Wolski99] for bandwidth estimation and QBETS [Nurmi07] for queue wait time estimation
    • Total execution time = data movement time + queue wait time + job execution time (found in the workload)
    • Schedulers
      – Total Execution Time Priority Scheduler
      – Reliability Prediction Based Time Priority Scheduler
    • Metrics
      – Average accuracy of selecting reliable resources to execute jobs
      – Time wasted due to incorrect selection of compute resources to execute jobs
• All evaluations were run on a 3.0GHz dual-core processor with 4GB memory on the Windows 7 Professional operating system.
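The contrast between the two schedulers can be sketched as follows; the per-resource estimates are hypothetical, and the reliability-aware rule shown here (discounting a resource's time estimate by its predicted success probability) is one plausible reading of the scheduler, not its exact policy:

```python
# Hypothetical per-resource estimates: data movement, queue wait,
# execution time (seconds), and predicted failure probability.
resources = {
    "cluster1": {"move": 60, "queue": 300, "exec": 600, "p_fail": 0.30},
    "cluster2": {"move": 90, "queue": 400, "exec": 600, "p_fail": 0.05},
}

def total_time(r: dict) -> float:
    # Total execution time = data movement + queue wait + job execution
    return r["move"] + r["queue"] + r["exec"]

def time_priority(resources: dict) -> str:
    """Total Execution Time Priority Scheduler: fastest estimate wins."""
    return min(resources, key=lambda n: total_time(resources[n]))

def reliability_priority(resources: dict) -> str:
    """Reliability-aware selection: penalize each resource by its
    predicted chance of failing and forcing a resubmission."""
    return min(resources,
               key=lambda n: total_time(resources[n]) / (1 - resources[n]["p_fail"]))

assert time_priority(resources) == "cluster1"         # 960s < 1090s
assert reliability_priority(resources) == "cluster2"  # 960/0.70 > 1090/0.95
```

With these numbers the time-only scheduler picks the faster but failure-prone cluster, while the reliability-aware one avoids the time wasted on likely failures.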
Evaluation Metrics Summary
• Workload: LANL and LPC
• Classification method: Naïve Bayes and KStar
• Classifier maintenance: static and updateable
• Evaluation metrics: average reliability prediction accuracy; time saved; time consumed for classification and updating the classifier; effect of pruning attributes; effect of job reliability predictions on selecting compute resources
Results: Average Reliability Prediction Accuracy
• LANL accuracy saturation: ~82%
• LPC accuracy saturation: ~97%
• KStar performed slightly better than Naïve Bayes
[Charts: static vs. dynamic/updateable classifiers, LANL and LPC]
Results: Time Savings
• With the static classifier, KStar has saved 90-100%
• Updateable classifier
  – For LANL: both KStar and Naïve Bayes, ~50% saving
  – For LPC: ~90% saving
[Charts: static vs. dynamic/updateable classifiers, LANL and LPC]
Results: Time Consumed for Classification and Updating Classifier
• Both static and updateable Naïve Bayes classifiers take very little time (not included in graphs)
[Charts: static classifier, updateable classifier]
Results: Effect of Pruning Attributes
• The static subset of attributes [Fadishei09] performs poorly on this data set and classifier
• Dynamic pruning has improved the accuracy of predictions compared to the non-pruned case, but the improvement is marginal
• Conclusion: our classifiers handle noise features well without compromising classification accuracy
• Identifying attributes to prune is a dynamic and expensive task
  – The system can be used in practical cases even without pruning attributes
Results: Effect of Job Reliability Predictions on Selecting Compute Resources
• Poor performance of the execution time priority scheduler
• After 1000 jobs (training), the time wasted with our approach stays fairly constant
Evaluation Conclusion
• Even though the average prediction accuracy of the static KStar classifier decreased, it learned and predicted failures better than any other method.
• Even though the amount of time saved increased slightly with the Naïve Bayes updateable classifier, the amount of time saved using the static KStar classifier is higher than with both other methods.
• Even though its total prediction accuracy does not match other methods, the static KStar classifier is ideal for correctly predicting failure cases, with very low overhead.
• Taking the user-perceived reliability of compute resources into consideration can save a significant amount of time in scientific job executions.
Scientific Computing Resource Abstraction Layer
• Variety of scientific computing platforms and opportunities
• Requirements
  – Support existing job description languages and be extensible to support other languages
  – Provide a uniform and interoperable interface for external entities to interact with it
  – Support heterogeneous compute resource manager interfaces and operating platforms from grids, IaaS and PaaS clouds, and departmental clusters
  – Extensibility to support new and future resource managers with minimal changes
  – Provide monitoring and fault recovery, especially when working with utility computing resources
  – Provide a light-weight, robust and scalable infrastructure
  – Integration with a variety of workflow environments
Scientific Computing Resource Abstraction Layer
• Our contribution
  – Resource abstraction layer (Sigiri)
    • Implemented as a web service
    • Provides a uniform abstraction layer over heterogeneous compute resources including grids, clouds and local departmental clusters
  – Support for standard job specification languages including, but not limited to, the Job Submission Description Language (JSDL) [Anjomshoaa04] and the Globus Resource Specification Language (RSL)
    • Interacts directly with resource managers, so requires no grid or meta-scheduling middleware
  – Integration with current resource managers, including LoadLeveler, PBS, LSF and Windows HPC, and the Amazon EC2 and Microsoft Azure platforms
• Features
  – Does not need a high level of computer science knowledge to install and maintain
    • Use of Globus was a challenge for most non-computer scientists
  – Involvement of system administrators to install and maintain Sigiri is minimal
  – Memory footprint is minimal
  – Other tools require installation of most of the heavy Globus stack, but Sigiri does not require a complete stack installation to run (installing Globus on a small cluster is something scientists never wanted to do)
  – Better fault tolerance and failure recovery
Architecture
• Asynchronous messaging model of message publishers and consumers
• Daemons shadowing compute resources
• Distributed component deployment
  – Daemon, front-end Web service and job queue
Client Interaction Service
• Deployed as an Apache Axis2 Web service to enable interoperability
  – Accepts job requests and enables management and monitoring functions
• The job submission schema does not enforce a schema for the job description
  – Enables multiple job description languages
Client Interaction Service
[Example messages: job submission request and job submission response]
Daemons
• Each managed compute resource has a light-weight daemon that
  – periodically checks the job request queue
  – translates the job specification to a resource-manager-specific language
  – submits pending jobs and persists the correlation between the resource manager's job id and the internal id
• Extensible daemon API
  – Enables integration of a wide range of resource managers while keeping the complexities of these resource managers transparent to end users
• The queuing-based approach enables daemons to run on any compute platform, without any software or operating system requirements
• Current support
  – LSF, PBS, SLURM, LoadLeveler, Amazon EC2, Windows HPC, Windows Azure
Integration of Cloud Computing Resources
• Unique set of dynamically loaded and configured extensions to handle security, schedule jobs and perform required data movements
• Enables scientists to interact with multiple cloud providers within the same system
• Features
  – Extensions can be written as modules independent of other extensions, typically to carry out a single task
  – Enforced failure handling to prevent orphan VMs and resources
Security
• Client security
  – Between the client and the Web service layer
  – Support for both transport-level security (using SSL) and application-layer security (using WS-Security)
  – Client negotiation of security credentials with WS-SecurityPolicy support within Apache Axis2
• Compute resource security
  – The system can store different types of security credentials
    • Username/password combinations, X.509 credentials
Performance Evaluation
• Test scenarios
  – Case 1: Jobs arrive at our system as a burst of concurrent submissions from a controlled number of clients
    • Each client waits for all jobs to finish before submitting the next set of jobs
    • For example, during the test with 100 clients, each client sends 1 job to the server, so 100 jobs arrive at the server in parallel
  – Case 2: Each client submits 10 jobs with varying execution times in sequence, with no delay between submissions
    • The client does not block upon submission of a job
    • Failure rate and server performance, from the client's point of view, are measured, and the number of simultaneous clients is systematically increased
Performance Evaluation: Baseline Measurements
[Charts]
Performance Evaluation: Metrics
[Charts]
Performance Evaluation: Scalability Metrics
[Charts]
Performance Evaluation
• Experimental setup
  – Daemon hosted within the gatekeeper node (quad-core IBM PowerPC, 1.6GHz, with 8GB of physical memory) of the Big Red cluster
  – The system Web service and database co-hosted on a box with four 2.6GHz dual-core processors and 32GB of RAM
  – Neither node was dedicated to our experiment while we were running tests
• Client environment
  – Set up within the 128-node Odin cluster (each node a dual AMD 2.0GHz Opteron processor with 4GB physical memory)
  – All client nodes were used in dedicated mode, and each client ran in a separate Java virtual machine to eliminate any external overhead
• Data collection
  – Each test was run (number of clients × 10) times and the results were averaged
  – Each parameter was tested for 100 to 1000 concurrent clients
  – A total of 110,000 tests were run
• GRAM4 results produced in the GRAM4 evaluation paper [Marru08] were used for system performance comparison
Results
• Baseline measurements
  – All overheads scale proportionally to the number of clients
  – No failures
[Charts: Case 1, Case 2]
Results
• Metrics for test cases 1 and 2
  – Both response time and total overhead scale proportionally to the number of clients
  – No failures
Results
• Scalability metrics
• Failures
  – No failures with Sigiri
  – Failures starting from 300 clients for GRAM
[Charts: Case 1, Case 2]
Applications: LEAD
• Motivations
  – Grid middleware reliability and scalability study [Marru08] and workflow failure rates
  – Components of the LEAD infrastructure were considered for adaptation to other scientific environments
• Sigiri initially prototyped to support LoadLeveler, PBS and LSF
• Implications
  – Improved workflow success rates
  – Mitigated the need for Globus middleware
  – Ability to work with non-standard job managers
Applications: LEAD II
• Emergence of community-driven, production-quality workflow infrastructures
  – E.g., the Trident Scientific Workflow Workbench, built on Windows Workflow Foundation
• Possibility of using alternate supercomputing resources
  – E.g., the recent port of the WRF (Weather Research and Forecasting) model to the Windows platform and Azure
• Support for Windows-based scientific computing environments
Background: LEAD II and the Vortex2 Experiment
• May 1, 2010 to June 15, 2010
  – ~6 weeks, 7 days per week
• Workflow started on the hour, every hour, each morning
• Had to find and bind to the latest model data (i.e., RUC 13km and ADAS data) to set initial and boundary conditions
  – If model data was not available at NCEP and the University of Oklahoma, the workflow could not begin
• Execution of the complete WRF stack within 1 hour
Trident Vortex2 Workflow
• Bulk of the time (50 min) was spent in the LEAD Workflow Proxy Activity
• Sigiri integration
Applications: Enabling Geo-Science Applications on Windows Azure
• Geo-science applications
  – High resource requirements
    • Compute intensive, dedicated HPC hardware
    • E.g., the Weather Research and Forecasting (WRF) model
  – Emergence of ensemble applications
    • Large numbers of small jobs
    • E.g., examining each air layer over a long period of time
      – A single experiment = about 14,000 jobs, each taking a few minutes to complete
Geo-Science Applications: Opportunities
• Cloud computing resources
  – On-demand access to "unlimited" resources
  – Flexibility
    • Worker roles and VM roles
• Recent porting of geo-science applications
  – WRF and the WRF Preprocessing System (WPS) ported to Windows
• Increased use of ensemble applications (large numbers of small runs)
• Production-quality, open-source scientific workflow systems
  – Microsoft Trident
Research Vision
• Enabling geo-science experiments
  – Types of applications
    • Compute intensive, ensembles
  – Types of scientists
    • Meteorologists, atmospheric scientists, emergency management personnel, geologists
• Utilizing both cloud computing and grid computing resources
• Utilizing open-source, production-quality scientific workflow environments
• Improved data and metadata management
[Diagram: Geo-Science Applications → Scientific Workflows → Compute Resources]
Proposed Framework
[Architecture diagram: a Trident activity submits to Sigiri (Web service, job queue, job management daemons); the daemons use the Azure management API, the Azure blob store and custom VM images to drive VM instances on the Azure fabric, each running Windows 2008 R2 with MS-MPI, WRF and a Sigiri worker service under IIS]
Applications: PRAGMA Testbed Support
• Pacific Rim Applications and Grid Middleware Assembly (PRAGMA) [Zheng06]
  – An open international organization founded in 2002 to focus on practical issues of building international scientific collaborations
• In 2010, Indiana University (IU) joined PRAGMA and added a dedicated cluster to the testbed
• Sigiri was used within the IU PRAGMA testbed
  – The IU PRAGMA testbed required a light-weight system that could be installed and maintained with minimal effort
  – The IU PRAGMA team wanted to evaluate adding cloud resources to the testbed with little or no change to interfaces
• In 2011, the PRAGMA - Opal - Sigiri integration was demonstrated successfully
Related Work
• Scientific job management systems
  – Grid Resource Allocation and Management (GRAM) [Foster05], Condor-G [Frey02], Nimrod/G [Buyya00], GridWay [Huedo05], SAGA [Goodale06] and Falkon [Raicu07]
    • Provide uniform job management APIs, but are tightly integrated with complex middleware to address a broad range of problems
  – The Carmen [Watson81] project
    • Provided a cloud environment that has enabled collaboration between neuroscientists
    • Requires all programs to be packaged as WS-I [Ballinger04] compliant Web services
  – Condor [Frey02] pools can also be utilized to unify certain compute resource interactions
    • Uses the Globus toolkit [Foster05] (and GRAM underneath)
    • Poor failure recovery
    • Overlooks failure modes of a cloud platform
Related Work
• Scientific research and cloud computing
  – IaaS, PaaS and SaaS environment evaluations
  – Scientists have mainly evaluated the use of IaaS services for scientific job executions [Abadi09][Hoffa08][Keahey08][Yu05]
    • Ease of setting up custom environments and control
  – Growing interest in using PaaS services [Humphrey10][Lu10][Qiu09]
  – Optimization to balance cost and time of executions [Deelman08][Yu05]
  – Startup overheads [Chase03][Figueiredo03][Foster06][Sotomayor06][Keahey07]
• Job prediction algorithms
  – Prediction of execution times [Smith], job start times [Li04], queue wait times [Nurmi07] and resource requirements [Julian04]
  – AI-based and statistical-modeling-based approaches
    • AppLeS [Berman03] argues that a good scheduler must involve some prediction of application and system performance
• Reliability of compute resources
  – Birman [Birman05] and aspects of resources causing system reliability issues
  – Statistical modeling to predict failures [Kandaswamy08]
Conclusion
• User inspired management of scientific jobs
  – Concentrates on the identification of user patterns and perceptions
  – Harnesses historical information
  – Applies the knowledge gained to improve scientific job executions
  – Argues that patterns, if identified based on individual users, can reveal important information for making sophisticated estimations of resource requirements
  – Evaluations demonstrate the usability of predictions for a meta-scheduler, especially one integrated into a community gateway, to improve its scheduling decisions
• Resource abstraction service
  – Helps mid-scale scientists obtain access to resources that are cheap and available
  – Strives to do so with a tool that is easy to set up and administer
• The prototype implementations introduced and discussed are integrated and used in different domains and scientific applications
• The applications demonstrate how our research contributed to advancing science in the respective domains
Contributions
• Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
• Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
• Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service to hide the complexities of interacting with multiple resource managers in grids and clouds.
• Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, and integration with four different application domains to prove its usability.
Future Work
• Short-term research directions
  – Integration of future job predictions and user-perceived reliability predictions
  – Evolving the resource abstraction service to support more compute resources
  – Management of ensemble runs
  – Fault tolerance with proactive replication
• Long-term research directions
Thank You !!