The Virtual Grid Application Development Software (VGrADS) Project

“VGrADS: Enabling e-Science Workflows on Grids and Clouds with Fault Tolerance”

http://vgrads.rice.edu/

Post on 27-Feb-2020

VGrADS Goal: Distributed Problem Solving

•  Where We Want To Be
   o  Transparent computing
      –  In an increasingly distributed space
      –  Applications to HPC
      –  Applications to cloud computing
•  Where We Were (circa 2003)
   o  Low-level hand programming
   o  Programmer had to manage:
      –  Heterogeneous resources
      –  Scheduling of computation and data movement
      –  Fault tolerance and performance adaptation
•  What Progress Have We Made?
   o  Separate application development from resource management
      –  VGrADS provides a uniform “virtual grid” abstraction atop widely differing resources
   o  Provide tools to bridge the gap
      –  Scheduling, resource management, distributed launch, simple programming models, fault tolerance, grid economies

Overview of SCʼxx Activities for VGrADS

•  Built on previous SC demonstrations
   o  Gradually built up system to handle LEAD workflow
   o  Previous years focused on improved performance estimates, scheduling methods, fault tolerance
   o  Use LEAD as an application driver
•  Current status
   o  VGrADS integrates HPC and cloud resources
      –  Using TeraGrid (HPC), Amazon EC2 (cloud), Eucalyptus (cloud) resources
      –  Using reservations, batch queues, and on-demand clouds
   o  Scheduling for balancing deadlines, reliability, and cost
      –  vgES supports search for best set of resources
      –  Application-specific trade-offs of reliability, time, cost
   o  Abstractions really do work!
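The deadline, reliability, and cost trade-off can be illustrated with a small Python sketch. It is not vgES code: the `Offer` record, its fields, the `pick_offer` helper, and all numbers are hypothetical, and the selection rule (cheapest offer that meets the deadline and a reliability floor) is only one of many possible application-specific policies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Offer:
    """One hypothetical way to obtain resources (all fields are illustrative)."""
    name: str
    kind: str            # "reservation", "batch_queue", or "cloud"
    finish_hours: float  # estimated completion time, including any queue wait
    success_prob: float  # estimated probability that the run completes
    cost: float          # monetary or allocation cost, arbitrary units


def pick_offer(offers: List[Offer], deadline_hours: float,
               min_success: float) -> Optional[Offer]:
    """Return the cheapest offer that meets the deadline and reliability floor."""
    feasible = [o for o in offers
                if o.finish_hours <= deadline_hours
                and o.success_prob >= min_success]
    if not feasible:
        return None
    # Break cost ties in favour of the more reliable, then the faster offer.
    return min(feasible, key=lambda o: (o.cost, -o.success_prob, o.finish_hours))


# Made-up offers covering the three resource types named on the slide.
offers = [
    Offer("hpc-queue", "batch_queue", finish_hours=2.5, success_prob=0.95, cost=0.0),
    Offer("hpc-reserved", "reservation", finish_hours=1.5, success_prob=0.97, cost=4.0),
    Offer("cloud-on-demand", "cloud", finish_hours=1.8, success_prob=0.90, cost=6.0),
]
best = pick_offer(offers, deadline_hours=2.0, min_success=0.92)
print(best.name if best else "no feasible offer")
```

A different application could weight the three criteria differently, for example by accepting a higher cost in exchange for reliability; only the key function would change.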

VGrADS Components

o  Virtual Grid Execution System (vgES)
   –  Uses Amazon EC2 tools to interact with cloud resources
   –  Uses QBETS and Globus to provision batch resources
   –  Uses Personal PBS to control execution on batch resources
   –  Provides a “resource Gantt chart” view of resources to aid the higher-level workflow orchestration tool
o  Eucalyptus – Developed for VGrADS (now Eucalyptus Inc.)
   –  Implements cloud computing on Xen-enabled clusters
   –  Open-source software infrastructure that is compatible with Amazon EC2 ➔ vgES “thinks” a Eucalyptus cloud is EC2
o  Fault Tolerance (FTR)
   –  Schedules a task so that its probability of successful execution rises to a desired level, constrained by resource availability and application deadlines
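One way to raise a task's success probability to a desired level is to run copies of it on several slots. The Python sketch below assumes that approach, and it assumes that slot failures are independent; the slide does not spell out the mechanism, and `replicas_needed` and the probabilities are hypothetical. The list of slots stands in for "resource availability": if the target cannot be reached with the slots on offer, the function reports the best achievable value.

```python
from typing import List, Tuple


def replicas_needed(success_probs: List[float], target: float) -> Tuple[int, float]:
    """Greedily add replicas, most reliable slot first, until the chance that
    at least one copy succeeds reaches `target`.

    Returns (number of replicas used, achieved probability). If every slot is
    used and the target is still not met, the best achievable value is returned.
    Assumes slot failures are independent of each other.
    """
    fail_all = 1.0  # probability that every replica placed so far fails
    used = 0
    for p in sorted(success_probs, reverse=True):
        if 1.0 - fail_all >= target:
            break
        fail_all *= (1.0 - p)
        used += 1
    return used, 1.0 - fail_all


# Three candidate slots with made-up success probabilities.
count, achieved = replicas_needed([0.8, 0.7, 0.9], target=0.95)
print(count, round(achieved, 4))
```

With these numbers a single copy on the best slot (0.9) falls short of 0.95, so a second copy is added on the 0.8 slot.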

Executing LEAD Workflow Sets

o  Demonstrate planning and execution of LEAD workflow sets atop virtualized cloud and Grid resources.
o  LEAD Workflow Orchestration schedules a set of independent workflows with two characteristics:
   –  a deadline D (e.g. 2 hours)
   –  a fraction F such that at least F of the workflows finish by the deadline (e.g. 3/8)
o  Virtual Grid Execution System (vgES) provides an abstraction over batch and cloud systems including Amazon EC2 and Eucalyptus cloud sites.

www.leadproject.org
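The (D, F) requirement above can be written as a small check. This is a Python sketch with hypothetical helper names and made-up finish times: a set of N workflows satisfies the requirement when at least ceil(F × N) of them finish by the deadline D.

```python
from fractions import Fraction
from math import ceil
from typing import List, Optional


def min_workflows_required(total: int, fraction: Fraction) -> int:
    """Smallest number of workflows that must finish on time to satisfy F."""
    return ceil(fraction * total)


def meets_constraint(finish_hours: List[Optional[float]],
                     deadline_hours: float, fraction: Fraction) -> bool:
    """True if at least F of the workflows finished by the deadline.

    A finish time of None marks a workflow that failed or never completed.
    """
    on_time = sum(1 for t in finish_hours
                  if t is not None and t <= deadline_hours)
    return on_time >= min_workflows_required(len(finish_hours), fraction)


# Eight workflows, D = 2 hours, F = 3/8 (the example values from the slide);
# the finish times themselves are invented for illustration.
finish = [1.5, 1.9, 2.4, None, 1.2, 3.0, None, 2.1]
print(min_workflows_required(8, Fraction(3, 8)))
print(meets_constraint(finish, deadline_hours=2.0, fraction=Fraction(3, 8)))
```

Using `Fraction` rather than a float keeps the ceiling exact for values such as 3/8 or 2/3.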

Workflow Orchestration

o  Workflow Orchestration
   –  Phase 1: Minimal Scheduling – Schedule the minimum fraction of workflows using a simple probabilistic DAG scheduler
   –  Phase 2: Fault Tolerance Tradeoff – Compare scheduling additional workflows against increasing the fault tolerance of one or more tasks of the already scheduled workflows
   –  Phase 3: Additional Scheduling – Use available slots for other scheduling
   –  Phase 4: EC2 Scheduling – For tasks below a certain threshold, schedule on EC2
o  Execution Manager
   –  A prototype for ordered execution of tasks on the slots, based on the schedule determined by the orchestration.
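The Phase 2 comparison can be sketched in Python under simplifying assumptions that are not taken from the slides: each workflow succeeds independently with a known probability, and the two options are scored by the probability that at least k workflows finish. The function names and all probabilities are hypothetical.

```python
from typing import List, Tuple


def prob_at_least(k: int, probs: List[float]) -> float:
    """Probability that at least k of the independent workflows succeed."""
    dist = [1.0]  # dist[j] = probability of exactly j successes so far
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            nxt[j] += q * (1.0 - p)   # this workflow fails
            nxt[j + 1] += q * p       # this workflow succeeds
        dist = nxt
    return sum(dist[k:])


def phase2_choice(k: int, scheduled: List[float], extra_p: float,
                  boost_index: int, boosted_p: float) -> Tuple[str, float]:
    """Compare adding one more workflow with making one scheduled workflow
    more fault tolerant, and return the better option with its score."""
    with_extra = prob_at_least(k, scheduled + [extra_p])
    boosted = list(scheduled)
    boosted[boost_index] = boosted_p
    with_boost = prob_at_least(k, boosted)
    if with_extra >= with_boost:
        return "schedule_additional_workflow", with_extra
    return "increase_fault_tolerance", with_boost


# Two workflows already scheduled; at least k = 2 must finish.
choice, score = phase2_choice(k=2, scheduled=[0.8, 0.8], extra_p=0.8,
                              boost_index=0, boosted_p=0.96)
print(choice, round(score, 3))
```

In this example the extra workflow wins because it adds a third chance to reach two successes; if the spare slots could only host an unreliable workflow, boosting an existing one would score higher.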

[Figure: Comparison of the LEAD-VGrADS collaboration system with cyberinfrastructure production deployments. Two panels: “Without VGrADS” and “With VGrADS”. Component labels: User, Portal, Application Service, Workflow Engine, Execution Manager, Workflow Planner, Virtual Grid Execution System, Globus and EC2 interfaces, Resources (HPC, Cloud). Arrow labels: resource planning, resource binding, query execution plan, standard execution, specific protocol based execution.]

Example Scheduling of Workflows

[Figure with three panels; axis labels: Start, Deadline D. (a) Example Workflow Set: tasks A through L; need F=2/3. (b) Uncoordinated Schedule (Without VGrADS): tasks placed on a Batch Queue and a Cloud, with WAIT periods. (c) Coordinated Schedule (With VGrADS): tasks placed on a Batch Queue and a Cloud, with WAIT periods. One schedule also shows primed copies E´ F´ G´ H´ I´ of tasks E through I.]

[Figure: Interaction of system components for resource procurement and planning. Components: Workflow Planner, DAG Scheduler, Fault tolerance, EC2, Virtual Grid Execution System, BQP, VARQ, GT4, Personal PBS, Globus and EC2 interfaces, Resources (HPC, Cloud). Arrow labels: schedule, resource find & bind, query batch queue times, virtual res., submit job, setup virtual machines. Phase labels: Phase 1,2,3; Phase 2,3; Phase 4.]

LEAD / VGrADS Architecture: Putting It All Together

[Figure: architecture diagram with two groups. LEAD System: Portal, App. Factory, Event Broker, Application Service, LEAD BPEL Workflow Engine. VGrADS System: Workflow Planner, Execution Manager, DAG Scheduler, Fault tolerance, Virtual Grid Execution System, BQP / VAR, GT4, Personal PBS, Globus. Underlying HPC and Cloud Resources: Batch Systems (HPC), Cloud. Arrow labels: schedule, resource find & bind, failures.]

Visualization Key

[Figure: legend for the execution visualizations. Labels: Task Status (Run, Done, Failed), Workflow Color, Time, Slot Width, Resource Name.]

Snapshot of Execution of 6 Workflows

6 Workflows on 7 Clusters

Infrastructure Timing Metrics

[Figure labels: Planning time line, Workflow execution time line, Binding time.]

Fault Tolerance Exploration

[Figure panels: Low Reliability, Medium Reliability, High Reliability.]

Conclusions

•  VGrADS’ virtual grid abstraction simplifies
   o  Programming grid and cloud systems for e-Science workflows
   o  Managing QoS (performance and reliability)
•  The VGrADS system unifies workflow execution over
   o  batch queue systems (with and without advanced reservations), and
   o  cloud computing sites (including Amazon EC2 and Eucalyptus)
•  The system provides an enabling technology for executing deadline-driven, fault-tolerant workflows
•  The integrated cyber-infrastructure from the LEAD and VGrADS system components provides a strong foundation for next-generation dynamic and adaptive environments for scientific workflows

Thank You

•  “VGrADS: Enabling e-Science Workflows on Grids and Clouds with Fault Tolerance”, L. Ramakrishnan, D. Nurmi, A. Mandal, C. Koelbel, D. Gannon, T. M. Huang, Y. S. Kee, G. Obertelli, K. Thyagaraja, R. Wolski, A. Yarkhan and D. Zagorodnov

•  Paper to be presented at 3:30 p.m., Thursday, Nov. 19 in Room E145 - 146