User Inspired Management of Scientific Jobs in Grids and Clouds
DESCRIPTION
This is my PhD defense presentation discussing my work on improving scientific job execution in grids and clouds. It describes how user patterns can be mined to learn user behavior and improve meta-scheduler decisions. The proposed and implemented resource abstraction layer helps scientists interact with a wide variety of compute resources.
TRANSCRIPT
User Inspired Management of Scientific Jobs in Grids and Clouds
Eran Chinthaka Withana
School of Informatics and Computing
Indiana University, Bloomington, Indiana, USA
Doctoral Committee
Professor Beth Plale, PhD
Dr. Dennis Gannon, PhD
Professor Geoffrey Fox, PhD
Professor David Leake, PhD
Outline
• Mid-Range Science
  – Challenges and Opportunities
  – Current Landscape
• Research
  – Research Questions
  – Contributions
• Mining Historical Information to Find Patterns and Experiences
  – Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
  – Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
• Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
• Applications
• Related Work
• Conclusion and Future Work
Thesis Defense - Eran Chinthaka Withana
Mid-Range Science
• Challenges
  – Resource requirements going beyond lab and university, but not suited for large-scale resources
  – Difficulties finding sufficient compute resources
    • E.g., short-term forecasts in LEAD for energy and agriculture
  – Lack of resources to have a strong CS support person on the team
  – Need for less expensive and more available resources
• Opportunities
  – Wide variety of computational resources
  – Science gateways
Current Landscape
• Grid Computing
  – Batch orientation, long queues even under moderate loads, no access transparency
  – Drawbacks in the quota system
  – Level of computer science expertise required
• Cloud Computing
  – High availability, pay-as-you-go model, on-demand "limitless"[1] resource allocation
  – Payment policy and research cost models
• Use of Workflow Systems
  – Hybrid workflows
    • Enable utilization of heterogeneous compute resources, e.g., the Vortex2 experiment
• Need for resource abstraction layers and optimal selection of resources
• Need for improved scientific job executions
  – Better scheduler decisions, selection of compute resources
  – Reliability issues in compute resources
  – Importance of learning user patterns and experiences
[1] M. Armbrust et al., "Above the Clouds: A Berkeley View of Cloud Computing," Tech. Rep. UCB/EECS-2009-28, EECS Department, University of California, Berkeley, 2009.
Research Questions
“Can user patterns and experiences be used to improve scientific job executions in large scale systems?”
“Can a simple, reliable and highly scalable uniform resource abstraction be achieved to interact with a variety of compute resource providers?”
“Can these be put to use to advance science?”
Contributions
• Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
• Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
• Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service to hide the complexities of interacting with multiple resource managers in grids and clouds.
• Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, and integration with four different application domains to prove its usability.
Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds
• Objective
  – Reducing the impact of startup overheads for time-critical applications
• Problem space
  – Workflows can have multiple paths
  – Workflow descriptions are not available
    • Need for predictions to identify the job execution sequence
  – Learning from user behavioral patterns to predict future jobs
• Research outline
  – Algorithm to predict future jobs by extracting user patterns from historical information
  – Use of knowledge-based techniques
    • Zero knowledge, or pre-populated job information consisting of connections between jobs
    • Similar cases retrieved are used to predict future jobs, reducing high startup overheads
  – Algorithm assessment
    • Two different workloads: individual scientific jobs executed at LANL, and a set of workflows executed by three users
Demonstration of User Patterns with Workflows
• A suite of workflows can differ from domain to domain
  – E.g., WRF (Weather Research and Forecasting) as the upstream node
• User patterns reveal the sequence of jobs, taking different users/domains into consideration
• Useful for a science gateway serving a wide range of mid-scale scientists
[Diagram: WRF feeding Weather Predictions, Crop Predictions, Wind Farm Location Evaluations, and Wild Fire Propagation Simulation]
Role of Successful Predictions to Reduce Startup Overheads
• Largest gain can be achieved when our prediction accuracy is high and setup time (s) is large with respect to execution time (t)
r = probability of successful prediction (prediction accuracy)
Let s_i and t_i be the startup time and execution time of job i (i = 0..N), so the total time without prediction is

  T = Σ_{i=0}^{N} (s_i + t_i)

With a successful prediction (probability r), a job's startup overhead is hidden by advance provisioning, so the expected total time is

  r · Σ_{i=0}^{N} t_i + (1 − r) · T

  Percentage time reduction = (T − expected time) / T = r · Σ_{i=0}^{N} s_i / Σ_{i=0}^{N} (s_i + t_i)

For simplicity, assuming equal job execution and startup times (t_i = t, s_i = s) for all N jobs:

  Percentage time reduction = (r · s · N) / ((t + s) · N) = r · s / (t + s)
Relationship of Predictions to Execution Time
Percentage time reduction = r · s / (t + s) = r / (t/s + 1)
• Observations
  – Percentage time reduction increases with the accuracy of predictions
  – Time reduction decreases exponentially with increased work-to-overhead ratio (t/s)
• Need to find the critical point for a given situation
  – Fixing the required percentage time reduction for a given t/s ratio and finding the required accuracy of predictions
• Cost of wrong predictions
  – Depends on the compute resource
  – Demonstrated that higher prediction accuracies (~90%) will reduce the impact of wrong predictions
  – Compromising cost to improve time
Accuracy of Predictions = total successful future job predictions / total predictions
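The reduction formula can be sketched numerically; a minimal illustration (the r, t and s values below are hypothetical, not from the evaluation):

```python
def percentage_time_reduction(r: float, t: float, s: float) -> float:
    """Expected fraction of total time saved when startup time s is
    hidden by a correct provisioning prediction made with accuracy r."""
    return r * s / (t + s)

# Reduction grows with prediction accuracy r ...
assert percentage_time_reduction(0.9, 10, 10) > percentage_time_reduction(0.5, 10, 10)
# ... and shrinks as the work-to-overhead ratio t/s grows.
assert percentage_time_reduction(0.9, 100, 10) < percentage_time_reduction(0.9, 10, 10)
```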
Prediction Engine: System Architecture
[Architecture diagram, including the Prediction Retriever component]
Use of Reasoning
• Store and retrieve cases
• Steps
  – Retrieval of similar cases
    • Similarity measurement
    • Use of thresholds
  – Reuse of old cases
  – Case adaptation
  – Storage
Case Similarity Calculation
• Each case is represented by a set of attributes
  – Selected by finding their effect on the goal variable (the next job)
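A weighted attribute-match retrieval of this kind could be sketched as follows; the attribute names, weights and threshold are hypothetical, not the thesis's actual feature set:

```python
# Minimal case-based-reasoning sketch: cases are attribute dictionaries,
# and similarity is the weighted fraction of matching attributes.
# WEIGHTS is illustrative, not the attribute selection from the thesis.
WEIGHTS = {"user": 0.5, "last_job": 0.3, "queue": 0.2}

def similarity(case_a: dict, case_b: dict) -> float:
    """Sum of weights of attributes on which the two cases agree."""
    return sum(w for attr, w in WEIGHTS.items()
               if case_a.get(attr) == case_b.get(attr))

def retrieve(query: dict, case_base: list, threshold: float = 0.5) -> list:
    """Return stored cases at least `threshold`-similar to the query,
    most similar first; their recorded next jobs become predictions."""
    scored = [(similarity(query, c), c) for c in case_base]
    return [c for s, c in sorted(scored, key=lambda x: -x[0]) if s >= threshold]

case_base = [
    {"user": "u1", "last_job": "WRF", "queue": "a", "next_job": "crop_prediction"},
    {"user": "u2", "last_job": "other", "queue": "b", "next_job": "other_job"},
]
matches = retrieve({"user": "u1", "last_job": "WRF", "queue": "c"}, case_base)
assert matches[0]["next_job"] == "crop_prediction"
```

The threshold keeps dissimilar cases out of the prediction set, which is the role of the "use of thresholds" step above.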
Evaluation
• Use cases
  – Individual job workload[1]
    • 40k jobs over two years from the 1024-node CM-5 at Los Alamos National Lab
  – Workflow use case
    • System doesn't see or assume a workflow specification

User workflows in the experiment:
User 1: Workflow 1, Workflow 2, Workflow 5
User 2: Workflow 2, Workflow 4
User 3: Workflow 2, Workflow 3, Workflow 4

• Experimental setup
  – 2.0GHz dual-core processor, 4GB memory, 64-bit Windows operating system

[1] Parallel Workload Archive, http://www.cs.huji.ac.il/labs/parallel/workload/
Evaluation: Average Accuracy of Predictions
• Individual jobs workload
  – ~75% accurate predictions with user patterns
  – ~32% accurate predictions with service names
• Workflow workload
  – ~95% accurate predictions with user patterns
  – ~53% accurate predictions with service names
Evaluation: Time Saved
• Amount of time that can be saved if resources are provisioned by the time a job is ready to run
• Startup time
  – Assumed to be 3 minutes (average for commercial providers)
[Charts: Individual Jobs Workload, Workflow Workload]
Evaluation: Prediction Accuracies for Use Cases
• User-pattern-based predictions perform 2x better than service-name-based predictions
User Perceived Reliability
• Failures are tolerated through fault tolerance, high availability, recoverability, etc. [Birman05]
• What matters from a user's point of view is whether these failures are visible to users or not
  – E.g., reliability of commodity hardware (in clouds) vs. user-perceived reliability
• Reliability is not of the resources themselves
  – Not derived from halting failures, fail-stop failures, network partitioning failures [Birman05], or machine downtimes
  – It is a more broadly encompassing system reliability that can only be seen at the user or workflow level
    • Can depend on the user's configuration and job types as well
  – We refer to this form of reliability as user-perceived reliability
• Importance of user-perceived reliability
  – Selecting a resource to schedule an experiment when the user has access to multiple compute resources
  – E.g., LEAD reliability: supercomputing resources vs. Windows Azure resources
Why User Perceived Reliability is Useful
• User-perceived failure probabilities
  – Cluster A: p(A) = 0.2; Cluster B: p(B) = 0.3
  – P(A fails and B succeeds) = p(A) · (1 − p(B)) = 0.2 · (1 − 0.3) = 0.14
  – P(B fails and A succeeds) = p(B) · (1 − p(A)) = 0.3 · (1 − 0.2) = 0.24
• Since 0.14 < 0.24, try cluster A first and then cluster B.
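The ordering rule for the two-cluster example can be sketched directly (the cluster names and probabilities are the slide's example values):

```python
def order_clusters(failure_prob: dict) -> list:
    """Try the cluster with the lowest user-perceived failure
    probability first, so the fallback is needed least often."""
    return sorted(failure_prob, key=failure_prob.get)

probs = {"A": 0.2, "B": 0.3}
# Chance that the first attempt fails but the fallback succeeds:
p_a_fails_b_ok = probs["A"] * (1 - probs["B"])  # ~0.14
p_b_fails_a_ok = probs["B"] * (1 - probs["A"])  # ~0.24
assert p_a_fails_b_ok < p_b_fails_a_ok
assert order_clusters(probs) == ["A", "B"]
```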
Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions
• Objective
  – Reduce the impact of low reliability of compute resources
• Deducing user-perceived reliabilities
  – Learning from user experiences and perceptions
• Research outline
  – Algorithm to predict user-perceived reliabilities, learning from user experiences by mining historical information
  – Use of machine learning techniques
    • Trained classifiers to represent compute resources and their reliabilities
    • Prediction of job failures
  – Algorithm assessment
    • Workloads from the Parallel Workload Archive representing jobs executed in two different supercomputing clusters
System Architecture
• A machine learning classifier is trained to learn the user-perceived reliabilities of each cluster.
• Classifier types
  – Static classifier: trained initially from historical information
  – Dynamic (updateable) classifier: starts from zero knowledge and builds while the system is in operation
System Architecture
• The classifier manager uses the Weka [Hall09] framework
• Classification methods
  – Naïve Bayes and KStar
  – Static and dynamic classifiers
• Dynamic pruning of features [Fadishei09] for increased efficiency
• Classifier manager
  – Creates and maintains classifiers for each compute resource
  – A new job is evaluated against these classifiers to deduce the predicted reliability of the job execution
• Policy implementers
  – Consider resource reliability predictions together with other quality-of-service information (time, cost) to select a resource
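The thesis uses Weka's Java classifier implementations; the idea of an updateable per-resource classifier can be sketched in Python with a tiny hand-rolled categorical Naive Bayes (the job attributes and training data below are hypothetical):

```python
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Categorical Naive Bayes over job attributes, updateable one job
    at a time (in the spirit of the dynamic/updateable classifier)."""
    def __init__(self):
        self.class_counts = Counter()
        self.attr_counts = defaultdict(Counter)  # (attr, value) -> label counts

    def update(self, job: dict, label: str):
        self.class_counts[label] += 1
        for attr, value in job.items():
            self.attr_counts[(attr, value)][label] += 1

    def predict(self, job: dict) -> str:
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            score = count / total  # class prior
            for attr, value in job.items():
                seen = self.attr_counts[(attr, value)][label]
                score *= (seen + 1) / (count + 2)  # Laplace smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label

clf = TinyNaiveBayes()  # one classifier per compute resource
clf.update({"user": "u1", "queue": "short"}, "success")
clf.update({"user": "u1", "queue": "short"}, "success")
clf.update({"user": "u2", "queue": "long"}, "failure")
assert clf.predict({"user": "u1", "queue": "short"}) == "success"
```

A static classifier would call `update` only during an initial training pass; the dynamic variant keeps calling it as jobs complete.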
Evaluation
• Workloads from the Parallel Workload Archive [Feitelson]
  – LANL: two years' worth of jobs (1994 to 1996) on the 1024-node CM-5 at Los Alamos National Lab
  – LPC: ten months' (Aug 2004 to May 2005) worth of job records on a 70 Xeon-node cluster at the Laboratoire de Physique Corpusculaire of Université Blaise Pascal, France
• Minor cleanups to remove intermediate job states
• 10,000 jobs were selected from each workload
  – LANL had 20% failed jobs
  – LPC had 30% failed jobs
Evaluation
• Workload classification and maintenance
  – Classifiers: Naïve Bayes [John95] and KStar [Cleary95] implementations in Weka [Hall09]
  – Classifier construction
    • Static classifier: the first 1000 jobs train the classifier
    • Dynamic classifier: all 10,000 jobs used for classifier construction and evaluation
• Evaluation metrics
  – Average reliability prediction accuracy: accuracy of predicting the success/failure of a job
  – Time saved: cumulative time saved, aggregating the execution time of each job that failed and whose failure our system predicted
    • Baseline measure: ideal cumulative time that can be saved over time
  – Time consumed for classification and updating the classifier
  – Effect of pruning attributes
    • Static subset of attributes (as proposed in Fadishei et al. [Fadishei09]) vs. dynamic subset of attributes (checking the effect on the goal variable)
Evaluation
• Evaluation metrics
  – Effect of job reliability predictions on selecting compute resources
    • An extended version of GridSim [Buyya02] models four compute resources; NWS [Wolski99] for bandwidth estimation and QBETS [Nurmi07] for queue wait time estimation
    • Total execution time = data movement time + queue wait time + job execution time (found in the workload)
    • Schedulers
      – Total Execution Time Priority Scheduler
      – Reliability Prediction Based Time Priority Scheduler
    • Metrics
      – Average accuracy of selecting reliable resources to execute jobs
      – Time wasted due to incorrect selection of compute resources to execute jobs
• All evaluations were run on a 3.0GHz dual-core processor with 4GB memory on the Windows 7 Professional operating system.
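The contrast between the two schedulers can be sketched as follows; the per-resource estimates are hypothetical, and the reliability-aware rule shown here (discounting a resource's time estimate by its predicted success probability) is one plausible reading of the scheduler, not its exact policy:

```python
# Hypothetical per-resource estimates: data movement, queue wait,
# execution time (seconds), and predicted failure probability.
resources = {
    "cluster1": {"move": 60, "queue": 300, "exec": 600, "p_fail": 0.30},
    "cluster2": {"move": 90, "queue": 400, "exec": 600, "p_fail": 0.05},
}

def total_time(r: dict) -> float:
    # Total execution time = data movement + queue wait + job execution
    return r["move"] + r["queue"] + r["exec"]

def time_priority(resources: dict) -> str:
    """Total Execution Time Priority Scheduler: fastest estimate wins."""
    return min(resources, key=lambda n: total_time(resources[n]))

def reliability_priority(resources: dict) -> str:
    """Reliability-aware selection: penalize each resource by its
    predicted chance of failing and forcing a resubmission."""
    return min(resources,
               key=lambda n: total_time(resources[n]) / (1 - resources[n]["p_fail"]))

assert time_priority(resources) == "cluster1"         # 960s < 1090s
assert reliability_priority(resources) == "cluster2"  # 960/0.70 > 1090/0.95
```

With these numbers the time-only scheduler picks the faster but failure-prone cluster, while the reliability-aware one avoids the time wasted on likely failures.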
Evaluation Metrics Summary
• Workload: LANL and LPC
• Classification method: Naïve Bayes and KStar
• Classifier maintenance: static and updateable
• Evaluation metrics: average reliability prediction accuracy; time saved; time consumed for classification and updating the classifier; effect of pruning attributes; effect of job reliability predictions on selecting compute resources
Results: Average Reliability Prediction Accuracy
• LANL accuracy saturation: ~82%
• LPC accuracy saturation: ~97%
• KStar performed slightly better than Naïve Bayes
[Charts: static vs. dynamic/updateable classifiers, LANL and LPC]
Results: Time Savings
• With the static classifier, KStar has saved 90-100%
• Updateable classifier
  – For LANL: both KStar and Naïve Bayes, ~50% saving
  – For LPC: ~90% saving
[Charts: static vs. dynamic/updateable classifiers, LANL and LPC]
Results: Time Consumed for Classification and Updating Classifier
• Both static and updateable Naïve Bayes classifiers take very little time (not included in graphs)
[Charts: static classifier, updateable classifier]
Results: Effect of Pruning Attributes
• The static subset of attributes [Fadishei09] performs poorly on this data set and classifier
• Dynamic pruning has improved the accuracy of predictions compared to the non-pruned case, but the improvement is marginal
• Conclusion: our classifiers handle noise features well without compromising classification accuracy
• Identifying attributes to prune is a dynamic and expensive task
  – The system can be used in practical cases even without pruning attributes
Results: Effect of Job Reliability Predictions on Selecting Compute Resources
• Poor performance of the execution time priority scheduler
• After 1000 jobs (training), the time wasted with our approach stays fairly constant
Evaluation Conclusion
• Even though the average prediction accuracy of the static KStar classifier decreased, it learned and predicted failures better than any other method.
• Even though the amount of time saved increased slightly with the Naïve Bayes updateable classifier, the amount of time saved using the static KStar classifier is higher than with both other methods.
• Even though its total prediction accuracy does not match other methods, the static KStar classifier is ideal for correctly predicting failure cases, with very low overhead.
• Taking the user-perceived reliability of compute resources into consideration can save a significant amount of time in scientific job executions.
Scientific Computing Resource Abstraction Layer
• Variety of scientific computing platforms and opportunities
• Requirements
  – Support existing job description languages and be extensible to support other languages
  – Provide a uniform and interoperable interface for external entities to interact with it
  – Support heterogeneous compute resource manager interfaces and operating platforms from grids, IaaS and PaaS clouds, and departmental clusters
  – Extensibility to support new and future resource managers with minimal changes
  – Provide monitoring and fault recovery, especially when working with utility computing resources
  – Provide a light-weight, robust and scalable infrastructure
  – Integration with a variety of workflow environments
Scientific Computing Resource Abstraction Layer
• Our contribution
  – Resource abstraction layer (Sigiri)
    • Implemented as a web service
    • Provides a uniform abstraction layer over heterogeneous compute resources including grids, clouds and local departmental clusters
  – Support for standard job specification languages including, but not limited to, the Job Submission Description Language (JSDL) [Anjomshoaa04] and the Globus Resource Specification Language (RSL)
    • Interacts directly with resource managers, so requires no grid or meta-scheduling middleware
  – Integration with current resource managers, including LoadLeveler, PBS, LSF and Windows HPC, and the Amazon EC2 and Microsoft Azure platforms
• Features
  – Does not need a high level of computer science knowledge to install and maintain
    • Use of Globus was a challenge for most non-computer scientists
  – Involvement of system administrators to install and maintain Sigiri is minimal
  – Memory footprint is minimal
  – Other tools require installation of most of the heavy Globus stack, but Sigiri does not require a complete stack installation to run (installing Globus on a small cluster is something scientists never wanted to do)
  – Better fault tolerance and failure recovery
Architecture
• Asynchronous messaging model of message publishers and consumers
• Daemons shadowing compute resources
• Distributed component deployment
  – Daemon, front-end Web service and job queue
Client Interaction Service
• Deployed as an Apache Axis2 Web service to enable interoperability
  – Accepts job requests and enables management and monitoring functions
• The job submission schema does not enforce a schema for the job description
  – Enables multiple job description languages
Client Interaction Service
[Example messages: job submission request and job submission response]
Daemons
• Each managed compute resource has a light-weight daemon that
  – periodically checks the job request queue
  – translates the job specification to a resource-manager-specific language
  – submits pending jobs and persists the correlation between the resource manager's job id and the internal id
• Extensible daemon API
  – Enables integration of a wide range of resource managers while keeping the complexities of these resource managers transparent to end users
• The queuing-based approach enables daemons to run on any compute platform, without any software or operating system requirements
• Current support
  – LSF, PBS, SLURM, LoadLeveler, Amazon EC2, Windows HPC, Windows Azure
Integration of Cloud Computing Resources
• Unique set of dynamically loaded and configured extensions to handle security, schedule jobs and perform required data movements
• Enables scientists to interact with multiple cloud providers within the same system
• Features
  – Extensions can be written as modules independent of other extensions, typically to carry out a single task
  – Enforced failure handling to prevent orphan VMs and resources
Security
• Client security
  – Between the client and the Web service layer
  – Support for both transport-level security (using SSL) and application-layer security (using WS-Security)
  – Client negotiation of security credentials with WS-SecurityPolicy support within Apache Axis2
• Compute resource security
  – The system can store different types of security credentials
    • Username/password combinations, X.509 credentials
Performance Evaluation
• Test scenarios
  – Case 1: Jobs arrive at our system as a burst of concurrent submissions from a controlled number of clients
    • Each client waits for all jobs to finish before submitting the next set of jobs
    • For example, during the test with 100 clients, each client sends 1 job to the server, so 100 jobs arrive at the server in parallel
  – Case 2: Each client submits 10 jobs with varying execution times in sequence, with no delay between submissions
    • The client does not block upon submission of a job
    • Failure rate and server performance, from the client's point of view, are measured, and the number of simultaneous clients is systematically increased
Performance Evaluation: Baseline Measurements
[Charts]
Performance Evaluation: Metrics
[Charts]
Performance Evaluation: Scalability Metrics
[Charts]
Performance Evaluation
• Experimental setup
  – Daemon hosted within the gatekeeper node (quad-core IBM PowerPC, 1.6GHz, with 8GB of physical memory) of the Big Red cluster
  – The system Web service and database co-hosted on a box with four 2.6GHz dual-core processors and 32GB of RAM
  – Neither node was dedicated to our experiment while we were running tests
• Client environment
  – Set up within the 128-node Odin cluster (each node a dual AMD 2.0GHz Opteron processor with 4GB physical memory)
  – All client nodes were used in dedicated mode, and each client ran in a separate Java virtual machine to eliminate any external overhead
• Data collection
  – Each test was run (number of clients × 10) times and the results were averaged
  – Each parameter was tested for 100 to 1000 concurrent clients
  – A total of 110,000 tests were run
• GRAM4 results produced in the GRAM4 evaluation paper [Marru08] were used for system performance comparison
Results
• Baseline measurements
  – All overheads scale proportionally to the number of clients
  – No failures
[Charts: Case 1, Case 2]
Results
• Metrics for test cases 1 and 2
  – Both response time and total overhead scale proportionally to the number of clients
  – No failures
Results
• Scalability metrics
• Failures
  – No failures with Sigiri
  – Failures starting from 300 clients for GRAM
[Charts: Case 1, Case 2]
Applications: LEAD
• Motivations
  – Grid middleware reliability and scalability study [Marru08] and workflow failure rates
  – Components of the LEAD infrastructure were considered for adaptation to other scientific environments
• Sigiri initially prototyped to support LoadLeveler, PBS and LSF
• Implications
  – Improved workflow success rates
  – Mitigated the need for Globus middleware
  – Ability to work with non-standard job managers
Applications: LEAD II
• Emergence of community-driven, production-quality workflow infrastructures
  – E.g., the Trident Scientific Workflow Workbench, built on Windows Workflow Foundation
• Possibility of using alternate supercomputing resources
  – E.g., the recent port of the WRF (Weather Research and Forecasting) model to the Windows platform and Azure
• Support for Windows-based scientific computing environments
Background: LEAD II and the Vortex2 Experiment
• May 1, 2010 to June 15, 2010
  – ~6 weeks, 7 days per week
• Workflow started on the hour, every hour, each morning
• Had to find and bind to the latest model data (i.e., RUC 13km and ADAS data) to set initial and boundary conditions
  – If model data was not available at NCEP and the University of Oklahoma, the workflow could not begin
• Execution of the complete WRF stack within 1 hour
Trident Vortex2 Workflow
• Bulk of the time (50 min) was spent in the LEAD Workflow Proxy Activity
• Sigiri integration
Applications: Enabling Geo-Science Applications on Windows Azure
• Geo-science applications
  – High resource requirements
    • Compute intensive, dedicated HPC hardware
    • E.g., the Weather Research and Forecasting (WRF) model
  – Emergence of ensemble applications
    • Large numbers of small jobs
    • E.g., examining each air layer over a long period of time
      – A single experiment = about 14,000 jobs, each taking a few minutes to complete
Geo-Science Applications: Opportunities
• Cloud computing resources
  – On-demand access to "unlimited" resources
  – Flexibility
    • Worker roles and VM roles
• Recent porting of geo-science applications
  – WRF and the WRF Preprocessing System (WPS) ported to Windows
• Increased use of ensemble applications (large numbers of small runs)
• Production-quality, open-source scientific workflow systems
  – Microsoft Trident
Research Vision
• Enabling geo-science experiments
  – Types of applications
    • Compute intensive, ensembles
  – Types of scientists
    • Meteorologists, atmospheric scientists, emergency management personnel, geologists
• Utilizing both cloud computing and grid computing resources
• Utilizing open-source, production-quality scientific workflow environments
• Improved data and metadata management
[Diagram: Geo-Science Applications → Scientific Workflows → Compute Resources]
Proposed Framework
[Architecture diagram: a Trident activity submits to Sigiri (Web service, job queue, job management daemons); the daemons use the Azure management API, the Azure blob store and custom VM images to drive VM instances on the Azure fabric, each running Windows 2008 R2 with MS-MPI, WRF and a Sigiri worker service under IIS]
Applications: PRAGMA Testbed Support
• Pacific Rim Applications and Grid Middleware Assembly (PRAGMA) [Zheng06]
  – An open international organization founded in 2002 to focus on practical issues of building international scientific collaborations
• In 2010, Indiana University (IU) joined PRAGMA and added a dedicated cluster to the testbed
• Sigiri was used within the IU PRAGMA testbed
  – The IU PRAGMA testbed required a light-weight system that could be installed and maintained with minimal effort
  – The IU PRAGMA team wanted to evaluate adding cloud resources to the testbed with little or no change to interfaces
• In 2011, the PRAGMA - Opal - Sigiri integration was demonstrated successfully
Related Work
• Scientific job management systems
  – Grid Resource Allocation and Management (GRAM) [Foster05], Condor-G [Frey02], Nimrod/G [Buyya00], GridWay [Huedo05], SAGA [Goodale06] and Falkon [Raicu07]
    • Provide uniform job management APIs, but are tightly integrated with complex middleware to address a broad range of problems
  – The Carmen [Watson81] project
    • Provided a cloud environment that has enabled collaboration between neuroscientists
    • Requires all programs to be packaged as WS-I [Ballinger04] compliant Web services
  – Condor [Frey02] pools can also be utilized to unify certain compute resource interactions
    • Uses the Globus toolkit [Foster05] (and GRAM underneath)
    • Poor failure recovery
    • Overlooks failure modes of a cloud platform
Related Work
• Scientific research and cloud computing
  – IaaS, PaaS and SaaS environment evaluations
  – Scientists have mainly evaluated the use of IaaS services for scientific job executions [Abadi09][Hoffa08][Keahey08][Yu05]
    • Ease of setting up custom environments and control
  – Growing interest in using PaaS services [Humphrey10][Lu10][Qiu09]
  – Optimization to balance cost and time of executions [Deelman08][Yu05]
  – Startup overheads [Chase03][Figueiredo03][Foster06][Sotomayor06][Keahey07]
• Job prediction algorithms
  – Prediction of execution times [Smith], job start times [Li04], queue wait times [Nurmi07] and resource requirements [Julian04]
  – AI-based and statistical-modeling-based approaches
    • AppLeS [Berman03] argues that a good scheduler must involve some prediction of application and system performance
• Reliability of compute resources
  – Birman [Birman05] and aspects of resources causing system reliability issues
  – Statistical modeling to predict failures [Kandaswamy08]
Conclusion
• User inspired management of scientific jobs
  – Concentrates on the identification of user patterns and perceptions
  – Harnesses historical information
  – Applies the knowledge gained to improve scientific job executions
  – Argues that patterns, if identified based on individual users, can reveal important information for making sophisticated estimations of resource requirements
  – Evaluations demonstrate the usability of predictions for a meta-scheduler, especially one integrated into a community gateway, to improve its scheduling decisions
• Resource abstraction service
  – Helps mid-scale scientists obtain access to resources that are cheap and available
  – Strives to do so with a tool that is easy to set up and administer
• The prototype implementations introduced and discussed are integrated and used in different domains and scientific applications
• The applications demonstrate how our research contributed to advancing science in the respective domains
Contributions
• Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
• Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
• Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service to hide the complexities of interacting with multiple resource managers in grids and clouds.
• Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, and integration with four different application domains to prove its usability.
Future Work
• Short-term research directions
  – Integration of future job predictions and user-perceived reliability predictions
  – Evolving the resource abstraction service to support more compute resources
  – Management of ensemble runs
  – Fault tolerance with proactive replication
• Long-term research directions
Thank You !!