Shantenu Jha for the SAGA Team
http://saga.cct.lsu.edu
SAGA: An Overview
Overview
The SAGA Philosophy• A Fresh Perspective on Distributed Applications and CI
SAGA in a Nutshell• SAGA Landscape
• Individual APIs
• OGF standard
SAGA in action• Applications
• Tools, Frameworks, Gateways, Access Layers..
Uptake and Roadmap
Critical Perspectives
Ability to develop simple, novel or effective distributed applications lags behind other aspects of CI
• Distributed CI: Is the whole greater than the sum of the parts?
Infrastructure capabilities (tools, programming systems) and policy determine the type of applications, their development & execution:
• Proportion of applications that utilize multiple distributed sites sequentially, concurrently or asynchronously is low
• Not referring to tightly-coupled across multiple-sites
• Focus on extending legacy, static execution models
• Scale-Out of Simulations? Compute where the data is?
What novel applications & science has Distributed CI fostered?
• Distinguish challenges of provisioning Distributed CI versus support for application development
Critical Perspectives: Quick Analysis
Several factors are responsible for the perceived & actual lack of distributed applications (DA)
• Developing Distributed Applications is fundamentally hard!
• Coordination across multiple distinct resources
• Range of tools, prog. systems and environments large
• Interoperability and extensibility become difficult
• Commonly accepted abstractions not available
• E.g. Pilot-Job powerful, but no “unifying” tool on TG
• Deployment and execution challenges disjoint from the development process
• Generally good idea, but application development often influences where and how it can be deployed/executed
Distributed Applications Development Challenges
Developing Distributed Applications is fundamentally hard • Intrinsic:
• Design Points: Dynamical and Heterogeneous resources and Variable Control (or lack thereof)
• Coordination over Multiple & Distributed sites
• Scale-up and Scale-out
• Models of Distributed Applications:
• More than (peak) performance
• Primary role of Usage Modes
• Extrinsic:
• (Complex) Underlying infrastructure & its provisioning
• Programming systems, tools and environments
Distributed Applications: Development Challenges
Dist. Application and Programming Systems and Tools• Incompleteness and/or out-of-phase:
• Need X and Y, but only X or Y available,
• e.g., Master-Worker paradigm supported, but no FT.
• Customization:
• Works well with tool A but not B,
• e.g., Pegasus-DAGMAN-Condor
• Robustness and Scalability:
• Works well in small or controlled environment, but not at-Scale
• e.g., SAGA–based Montage
Distributed Applications and PGI: Lessons from Applications
There exist distributed applications that aim to utilize multiple resources:• Complex Coordination, Data Mgmt and FT issues
• Complex structures at different phases
Emerging infrastructures present operational challenges
• They don’t always provide what the application requires
Missing abstractions• Development, Deployment and Execution
• e.g. Coding against low-level middleware
Issues of policy and infrastructure design decisions• e.g. Co-allocation supported or not
• e.g. Narrow versus broad Grids
Distributed Application and PGI (2)
Lack of explicit support for addressing these challenges becomes self-fulfilling and self-perpetuating
• Most PGI not designed to support distributed applications, increasing effort to develop/deploy/execute
• Small number of distributed applications on TG leads to focus on single-site applications
(Ironically) Most applications have been developed to hide from heterogeneity and dynamism; not embrace them• Good heterogeneity vs Bad heterogeneity
• Dynamism: Performance Advantages
Applications have been brittle and not extensible:• Tied to specific tool or prog. system (& thus PGI)
Distributed Application and PGI (3)
Successful Evolution of Supported Capabilities and PGI• OSG support of HTC
• Condor from scavenging system to building block
• Condor Flocking
• TeraGrid and Gateways (User-level Abstraction)
• Several new capabilities for new communities
Not so Successful Evolution of Capabilities on PGI
• Co-scheduling on PGI
• Both technical and policy issues
• Scale-out of DAGs/Loosely-coupled ensembles
• Execution is logically or physically distributed
Distributed Application and PGI (4)
Supporting applications that have “come of age” is useful
• Coupling real-time simulations to live sensor data (LEAD)
• Currently neither OSG nor TG can support DDDAS
Often the underlying infrastructure and capabilities change more quickly than applications
• Infrastructure: Grids and Clouds
• Many challenges remain the same, e.g., requirements of distribution coordination of data and computation
• Role for abstractions
• Applications still around: SFExpress, Netsolve->GridSolve
Distributed Application and PGI (5): Miscellaneous Factors
Role of Funding Agencies:• Changed vision, funding landscape
• TG: Run anywhere with tools/prog systems to support primarily static HPC applications
• Focus on static as opposed to distributed scale-out robust workflows
• http://www.cct.lsu.edu/~sjha/presentations/panel_discussion/punch_counterpunch.pdf
Role of Standards:
• Lack of standards impedes interoperation
• No standards;
• Chicken-and-egg situation
• Simple standards have been effective, but with limited impact, e.g., troika of JSDL, HPC-BP, OGSA-BES
Distributed Applications How do they differ from traditional HPC applications?
PGI design needs to reflect some or all of these:
Performance Models:• Not just “peak utilization”; e.g., HPC & HTC (# of jobs)
Usage Modes:• The same application has multiple usage modes
• How applications are developed, deployed and executed is often determined by the infrastructure
Static vs Dynamic Execution:
• Static applications are not enough; varying resource conditions, application requirements
Skillful Decomposition vs Aggregation• Primacy of Coordination across distributed resources
Understanding Distributed Applications
IDEAS: First Principles Development Objectives
Interoperability: Ability to work across multiple distributed resources
Distributed Scale-Out: The ability to utilize multiple distributed resources concurrently
Extensibility: Support new patterns/abstractions, different programming systems, functionality & Infrastructure
Adaptivity: Response to fluctuations in dynamic resource and availability of dynamic data
Simplicity: Accommodate above distributed concerns at different levels easily…
Challenge: How to develop DA effectively and efficiently with the above as first-class objectives?
SAGA: In a nutshell
There exists a lack of programmatic approaches that:
• Provide general-purpose, basic & common grid functionality for applications and thus hide underlying complexity, varying semantics..
• Provide the building blocks upon which to construct “consistent” higher-levels of functionality and abstractions
• Hide “bad” heterogeneity and provide means to address “good” heterogeneity
• Meet the need for a Broad Spectrum of Applications:
• Simple scripts, Gateways, Smart Applications and Production Grade Tooling, Workflow…
Simple, integrated, stable, uniform and high-level interface• Simple and Stable: 80:20 restricted scope and Standard
• Integrated: Similar semantics & style across
• Uniform: Same interface for different distributed systems
SAGA: Provides Application* developers with units required to compose high-level functionality across (distinct) distributed systems
(*) One Person’s Application is another Person’s Tool
SAGA: The Standard Landscape
SAGA: Specification Landscape
Blue lines show which packages have input in the Experience document
SAGA: In a thousand words..
SAGA: Job Submission
Role of Adaptors (middleware binding)
SAGA: Role of Adaptors
SAGA Job API: Example
SAGA Job Package
SAGA File Package
File API: Example
SAGA Advert
SAGA Advert API: Example
SAGA: Other Packages
SAGA Task Model
All SAGA objects implement the task model
Every method has three “flavors”
• Synchronous version - the implementation
• Asynchronous version - synchronous version wrapped in a task (thread) and started
• Task version - synchronous version wrapped in a task but not started (task handle returned)
Adaptor can implement own async. version
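The three method flavors listed above can be pictured with a small Python sketch. This is an illustration only and not the SAGA API: the `Task`, `State` and `File` classes and the `get_size*` method names are hypothetical, and a plain thread stands in for whatever a real implementation or adaptor would use.

```python
import threading
from enum import Enum


class State(Enum):
    NEW = "New"
    RUNNING = "Running"
    DONE = "Done"


class Task:
    """Wraps a synchronous call so it can be started and waited on later."""

    def __init__(self, func, *args):
        self._func = func
        self._args = args
        self._thread = threading.Thread(target=self._run)
        self._result = None
        self.state = State.NEW

    def _run(self):
        self._result = self._func(*self._args)
        self.state = State.DONE

    def run(self):
        """Start the wrapped call in a background thread."""
        self.state = State.RUNNING
        self._thread.start()
        return self

    def result(self):
        """Block until the task finishes (the task must have been started)."""
        self._thread.join()
        return self._result


class File:
    """Hypothetical stand-in for a SAGA object; not the real SAGA API."""

    def __init__(self, data):
        self._data = data

    def get_size(self):
        # Synchronous flavor: the implementation itself.
        return len(self._data)

    def get_size_async(self):
        # Asynchronous flavor: synchronous version wrapped in a task and started.
        return Task(self.get_size).run()

    def get_size_task(self):
        # Task flavor: wrapped but not started; the caller gets the handle.
        return Task(self.get_size)
```

A caller would use `f.get_size()` for an immediate answer, `f.get_size_async().result()` to overlap the call with other work, and `f.get_size_task()` to collect handles that are started later.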
SAGA Implementation: Extensibility
Horizontal Extensibility – API Packages• Current packages:
• file management, job management, remote procedure calls, replica management, data streaming
• Steering, information services, checkpoint…
Vertical Extensibility – Middleware Bindings• Different adaptors for different middleware
• Set of ‘local’ adaptors
Extensibility for Optimization and Features• Bulk optimization, modular design
SAGA C++ quick tour
Open Source - released under the Boost Software License 1.0
Implemented as a set of libraries
• SAGA Core - A light-weight engine / runtime that dispatches calls from the API to the appropriate middleware adaptors
• SAGA functional packages - Groups of API calls for: jobs, files, service discovery, advert services, RPC, replicas, CPR, ... (extensible)
• SAGA language wrappers - Thin Python and C layers on top of the native C++ API
• SAGA middleware adaptors - Take care of the API call execution on the middleware
Can be configured / packaged to suit your individual needs!
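How an engine might dispatch an API call to a middleware adaptor can be sketched as follows. This is a minimal illustration under stated assumptions, not the SAGA C++ engine or its Capability Provider Interface: the `Engine` class, the two adaptor classes, the `submit_job` method and the choice to select an adaptor by URL scheme are all hypothetical.

```python
from urllib.parse import urlparse


class LocalJobAdaptor:
    """Hypothetical adaptor: pretends to run the job on localhost."""
    schemes = ("fork",)

    def submit(self, host, executable):
        return f"fork:{executable}@localhost"


class SshJobAdaptor:
    """Hypothetical adaptor: pretends to run the job on a remote host."""
    schemes = ("ssh",)

    def submit(self, host, executable):
        return f"ssh:{executable}@{host}"


class Engine:
    """Routes one uniform API call to whichever adaptor handles the URL scheme."""

    def __init__(self):
        self._adaptors = {}

    def register(self, adaptor):
        for scheme in adaptor.schemes:
            self._adaptors[scheme] = adaptor

    def submit_job(self, url, executable):
        parsed = urlparse(url)
        try:
            adaptor = self._adaptors[parsed.scheme]
        except KeyError:
            raise LookupError(f"no adaptor for scheme {parsed.scheme!r}") from None
        return adaptor.submit(parsed.hostname, executable)


engine = Engine()
engine.register(LocalJobAdaptor())
engine.register(SshJobAdaptor())
```

The application code calls `engine.submit_job(...)` the same way for every backend; only the set of registered adaptors changes, which is the point of the adaptor layer.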
SAGA: Available Adaptors
Job Adaptors
• Fork (localhost), SSH, Condor, Globus GRAM2, OMII GridSAM, Amazon EC2, Platform LSF
File Adaptors
• Local FS, Globus GridFTP, Hadoop Distributed Filesystem (HDFS), CloudStore KFS, OpenCloud Sector-Sphere
Replica Adaptors
• PostgreSQL/SQLite3, Globus RLS
Advert Adaptors
• PostgreSQL/SQLite3, Hadoop H-Base, Hypertable
SAGA: Available Adaptors (2)
Other Adaptors• Default RPC / Stream / SD
Planned Adaptors• CURL file adaptor, gLite job adaptor (Ole), …..
Open issues
• We’re in the process of consolidating the adaptor code base and adding rigorous tests in order to improve adaptor quality
• Capability Provider Interface (CPI - the ‘Adaptor API’) is not documented or standardized (yet), but looking at existing adaptor code should get you started if you want to develop your own adaptor.
SAGA API: Towards a Standard
Standards promote Interoperability
The need for standard programming interface• Trade-off “Go it alone” versus “Community” model
• Reinventing the wheel again, yet again, & then again
• MPI a useful analogy of community standard
• Vendors (Resource Provider), Software developers, users..
• social/historic parallels also important
• Time to adoption, after specification ....
OGF the natural choice (SAGA-RG, SAGA-WG)• Spin-off of the Applications Research Group
• Driven by UK, EU (German/Dutch), US
• Design derived from 23 Use Cases
• different projects, applications and functionality
• biological, coastal modelling, visualization
• Will discuss the advantage of SAGA as a standard specification
SAGA and Distributed Applications
Understanding Distributed Applications: Implicitly vs Explicitly Distributed?
Which approach (implicit vs. explicit) is used depends on:
• How the application is used?
• Need to control/marshal more than one resource?
• Why distributed resources are being used?
• How much can be kept out of the application?
• Can’t predict in advance?
• Not obvious what to do, application-specific metric
If possible, Applications should not be explicitly distributed• GATEWAYS approach:
• Implicit for the end-users
• Supporting Applications? Or Application Usage Modes?
Taxonomy of Distributed Application Development
Example of Distributed Execution Mode:• Implicitly Distributed
• HTC of HTC: 1000 job submissions of NAMD on the TG/LONI
• SAGA shell example (cf DESHL)
Example of Explicit Coordination and Distribution• Explicitly Distributed
• DAG-based Workflows (example of Higher-level API)
• EnKF-HM application
Example of SAGA-based Frameworks• Pilot-Jobs, Fault-tolerant Autonomic Framework
• MapReduce, All-Pairs
• Note: An application can belong to more than one type
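For the DAG-based workflow mode named above, the core scheduling idea is that a task becomes runnable once all of its dependencies have finished. The sketch below groups tasks into "waves" that could run concurrently, using Python's standard `graphlib` module (Python 3.9+). The task names and the `execution_waves` helper are hypothetical and do not come from Montage or from any SAGA-based workflow tool.

```python
from graphlib import TopologicalSorter


def execution_waves(dag):
    """Group tasks into waves; every task in a wave has all dependencies done.

    `dag` maps each task name to the set of task names it depends on.
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)                 # mark the wave as finished
    return waves


# Hypothetical four-task workflow: two independent projections, then a
# difference step, then a final addition step.
workflow = {
    "project_a": set(),
    "project_b": set(),
    "diff": {"project_a", "project_b"},
    "add": {"diff"},
}
waves = execution_waves(workflow)
```

Each wave is a set of independent tasks, so a distributed executor is free to place the members of one wave on different resources.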
DNA Energy Levels: HTC of HPC (Bishop; Tulane)
Montage: DAG-based Workflow Application Exemplar
Application Development Phase
Generation & Exec. Planning Phase
Execution Phase
DAG-based Workflow Applications: Extensibility and Higher-level API
SAGA-based DAG Execution: Preserving Performance
Ensemble Kalman Filters: Heterogeneous Sub-Tasks
Ensemble Kalman filters (EnKF) are recursive filters for handling large, noisy data; here the EnKF is used for history matching and reservoir characterization
EnKF is a particularly interesting case of irregular, hard-to-predict run time characteristics:
Results: Scale-Out Performance
Using more machines decreases the TTC and variation between experiments
Using BQP decreases the TTC & variation between experiments further
Lowest time to completion achieved when using BQP and all available resources
Khamra & Jha, GMAC, ICAC’09
Extreme Distribution: Frameworks?
• History match on a 1 million grid cell problem, with a thousand ensemble members
• The entire system will have a few billion degrees of freedom
• This will increase the need for scale-out, autonomy, fault tolerance, self healing etc...
SAGA-based Frameworks: Types
Frameworks: Logical structure for Capturing Application Requirements, Characteristics & Patterns• Runtime and/or Application Framework
Application Frameworks designed to either:• Pattern: Commonly recurring modes of computation
• Programming, Deployment, Execution, Data-access..
• MapReduce, Master-Worker, H-J Submission
• Abstraction: Mechanism to support patterns and application characteristics
Runtime Frameworks:• Load-Balancing – Compute and Data Distribution
SAGA-based Framework: Infrastructure-independent
SAGA-based Frameworks: Examples
SAGA-based Pilot-Job Framework (FAUST)• Extend to support Load-balancing for multi-components
SAGA MapReduce Framework: • Control the distribution of Tasks (workers)
• Master-Worker: File-Based &/or Stream-Based
• Data-locality optimization using SAGA’s replica API
SAGA NxM Framework:• Compute Matrix Elements, each is a Task
• All-to-All Sequence comparison
• Control the distribution of Tasks and Data
• Data-locality optimization via external (runtime) module
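The NxM (All-Pairs) pattern above, where every matrix element is its own task, can be sketched in a few lines. This is an illustration only: the `all_pairs` helper, the thread pool standing in for distributed workers, and the Hamming-distance comparison are assumptions, not the SAGA NxM framework.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def all_pairs(rows, cols, compare, workers=4):
    """Fill an N x M matrix; each element is computed as an independent task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            (i, j): pool.submit(compare, rows[i], cols[j])
            for i, j in product(range(len(rows)), range(len(cols)))
        }
        return [
            [futures[(i, j)].result() for j in range(len(cols))]
            for i in range(len(rows))
        ]


# Hypothetical toy input: 2 x 3 sequence comparison.
matrix = all_pairs(["ACGT", "AAGT"], ["ACGT", "TTTT", "ACGA"], hamming)
```

Because the elements are independent, the interesting decisions in a distributed setting are where each task runs and where its two inputs live, which is what the framework bullets above refer to as controlling the distribution of tasks and data.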
Abstractions for Dynamic Execution Container Task
Adaptive:
• Type A: Fix number of replicas; vary cores assigned to each replica.
• Type B: Fix the size of replica, vary number of replicas (Cool Walking)
-- Same temperature range (adaptive sampling)
-- Greater temperature range (enhanced dynamics)
Abstractions for Dynamic Execution: SAGA Pilot-Job (BigJob)
Coordinate Deployment & Scheduling of Multiple Pilot-Jobs
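The Pilot-Job idea can be pictured as a placeholder job that holds a fixed number of slots on a resource, with the application then scheduling its own sub-tasks into those slots. The sketch below is a toy model under stated assumptions, not BigJob: the `PilotJob` class and `run_on_pilots` helper are hypothetical, a local thread pool stands in for the cores held by a pilot, and round-robin is just one simple way to spread sub-tasks over several pilots.

```python
from concurrent.futures import ThreadPoolExecutor


class PilotJob:
    """Hypothetical pilot: one placeholder job that owns `cores` worker slots."""

    def __init__(self, resource, cores):
        self.resource = resource
        self._pool = ThreadPoolExecutor(max_workers=cores)

    def submit(self, func, *args):
        # Sub-tasks go straight into the pilot's slots, not into a batch queue.
        return self._pool.submit(func, *args)

    def shutdown(self):
        self._pool.shutdown(wait=True)


def run_on_pilots(pilots, tasks):
    """Round-robin sub-tasks over several pilots; return results in task order.

    `tasks` is a list of tuples of the form (callable, arg1, arg2, ...).
    """
    futures = [
        pilots[i % len(pilots)].submit(func, *args)
        for i, (func, *args) in enumerate(tasks)
    ]
    return [f.result() for f in futures]
```

A run over two pilots would create `PilotJob("Ranger", 2)` and `PilotJob("QB", 2)`, pass both to `run_on_pilots` with the list of sub-tasks, and call `shutdown()` on each pilot afterwards; the resource names here are only labels.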
Distributed Adaptive Replica Exchange (DARE): Application Usage Mode (GridChem)
GridChem -- Extensions (Joohyun Kim, LSU)
Multi-Physics Runtime Frameworks: Extensibility
Coupled Multi-Physics requires two distinct, but concurrent simulations
Can co-scheduling be avoided?
• Adaptive execution model: Yes
Load-balancing required.
• Pilot-Job facilitates LB!
• Across sites? (open Q)
Multi-platform Pilot-Job:
• MPI-based TG – Condor
Dynamic Execution Reduced Time to Solution
SAGA-MapReduce (Miceli, Jha et al CCGrid’09; Merzky, Jha et al GPC’09)
• Interoperability: Use multiple infrastructure concurrently
• Control the NW placement
• Simple staging of data
• SAGA-Sphere-Sector:
• Open Cloud Consortium
• Stream processing model
• Ongoing work
• Apply to all elements (files) in a data-set (stream)
Ts: Time-to-solution, including data-staging for SAGA-MapReduce (simple file-based mechanism)
Controlling Relative Compute-Data Placement
SAGA All-Pairs: Runtime Data Placement
Classical: Place task on 4 LONI machines (512px Dell Clusters)
• Simple data staging
“Intelligent”: Map a task to a resource based upon Cost
• Cost = Data Dependency + transfer times (latency)
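The cost-based mapping above can be sketched as choosing, for each task, the resource where moving its input data is cheapest. The exact cost function used in the SAGA All-Pairs work is not given here, so the model below is an assumption: transfer cost is taken as link latency plus size divided by bandwidth, data already on the target resource costs nothing, and the `transfer_cost` and `place` helpers, resource names and link numbers are all hypothetical.

```python
def transfer_cost(size_mb, src, dst, links):
    """Seconds to move `size_mb` from src to dst; zero if the data is local.

    `links` maps (src, dst) to a (latency_seconds, bandwidth_mb_per_s) pair.
    """
    if src == dst:
        return 0.0
    latency_s, bandwidth_mb_s = links[(src, dst)]
    return latency_s + size_mb / bandwidth_mb_s


def place(task_inputs, resources, links):
    """Return the resource with the lowest total transfer cost for one task.

    `task_inputs` is a list of (size_mb, current_location) pairs.
    """
    def cost(resource):
        return sum(
            transfer_cost(size, location, resource, links)
            for size, location in task_inputs
        )

    return min(resources, key=cost)


# Hypothetical setup: two sites with a symmetric link between them.
links = {("A", "B"): (0.1, 100.0), ("B", "A"): (0.1, 100.0)}
inputs = [(1000, "A"), (10, "B")]  # a large file on A, a small file on B
chosen = place(inputs, ["A", "B"], links)
```

With these numbers the task is placed on the site that already holds the large file, since moving the small file is much cheaper than moving the large one.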
“Ignoring Intelligent mapping is no longer an option”
• Quote (undergraduate) Miceli
[Figure: bar chart comparing Classical and Intelligent placement (axis ticks 0 to 600); series: Processing Time and "Intelligence" Time]
Distributed Data Intensive Applications: Research Challenges
Goal: Develop DDI scientific applications to utilize a broad range of distributed systems, without vendor lock-in, or disruption, yet with the flexibility and performance that scientific applications demand.
• Frameworks as possible solutions
Frameworks address some primary challenges in developing Distributed DI Applications• Coordination of distributed data & computing
• Runtime (Dynamic) scheduling, placement
• Fault-tolerance
Many Challenges in developing such Frameworks:
• What are the components? How are they coupled? Functionality expressed/exposed? Coordination?
• Layering, Ordering, Encapsulations of Components
“Just because you can’t use MPI (on distributed systems), doesn’t mean you can’t use other approaches”
Frameworks: Logical ordering
Understanding Distributed Applications Development Objectives Redux
Interoperability: Ability to work across multiple distributed resources• SAGA: Middleware Agnostic
Distributed Scale-Out: The ability to utilize multiple distributed resources concurrently• Support Multiple Pilot-Jobs: Ranger, Abe, QB
Extensibility: Support new patterns/abstractions, different programming systems, functionality & Infrastructure• Pilot-Job also Coupled CFD-MD, Integrated BQP
Adaptivity: Response to fluctuations in dynamic resource and availability of dynamic data
Simplicity: Accommodate above distributed concerns at different levels easily…
Does SAGA Provide A Fresh Perspective?
SAGA: Building the Abstractions to Bridge the Infrastructure-Applications Gap
Focus on Application Development and Characteristics, not infrastructure details
SAGA-based Tools and Projects: Advantage of Standards
JSAGA from IN2P3 (Lyon)• http://grid.in2p3.fr/jsaga/index.html
• Slides Ack: Sylvain Renaud
GANGA-DIANE-gLite (EGEE)• http://faust.cct.lsu.edu/trac/saga/wiki/Applications/GangaSAGA
• Slides Ack: Jakub Moscicki, Massimo L, O. Weidner
NAREGI/KEK (Active)• http://www.ogf.org/OGF27/materials/1767/OGF27_SAGA_KEK.pdf
DEISA
• DEISA-based Shell and Workflow library ( http://www.fz-juelich.de/nic-series/volume38/pringle.pdf )
• http://deisa-jra7.forge.nesc.ac.uk/ and http://www.ogf.org/OGF19/materials/501/SAGA-DEISA.ppt
XtreemOS• http://saga.cct.lsu.edu/index.php?option=com_content&task=view&id=95&Itemid=174
Service Discovery (SD) Specification • (with gLite bindings; extended to NAREGI)
JSAGA: Implementer and user of SAGA
JSAGA uses SAGA in a module, which hides heterogeneity of grid infrastructures
JSAGA implements SAGA to hide heterogeneity of middlewares
[Diagram labels: Applications, jobs collection, JSAGA, SAGA, core engine + plug-ins, Legacy APIs]
Projects using JSAGA
Elis@
• a web portal for submitting jobs to industrial and research grid infrastructures
SimExplorer
• a set of tools for managing simulation experiments
• includes a workflow engine that submits jobs to heterogeneous distributed computing resources
JJS
• a tool for efficiently running short-lived jobs on EGEE
JUX
• a multi-protocol file browser
Ganga Integration
DIANE Integration
Diane without SAGA Diane with SAGA
DIANE is an execution manager with support for pilot-jobs + worker agents (IDEAS Redux)
Master
Agents scheduling
Heterogeneous resources allocation (Ganga + Ganga/SAGA)
Lattice-QCD Applications on heterogeneous resources
Ganga/gLite
Ganga/SAGA (to TeraGrid)
Ganga/SAGA (to *)
Payload distribution
Application-aware (and resource-aware) scheduling
Federating resources! EGEE Conference (Sep’09) (Not in this demo:
cloud resources, additional Grid infrastructures…)
RENKEI Project Aims
[Diagram labels: Service & Applications (Svc, Apps); Python Binding; C++ Interface; SAGA-Engine (SAGA framework); SAGA adaptors (Adpt) for gLite, NAREGI, SRB, iRODS, Cloud, LRMS (LSF/PBS/SGE/…); RNS (“Yet Another FC” service based on OGF standard); Middleware-independent service & application]
This activity is funded by MEXT as a part of RENKEI project which develops seamless linkage of resources in the Grids and the local one for e-Science.
[Diagram labels: KEK, Osaka Univ., Tsukuba Univ., HEP Library, SAGA]
SAGA: Europe/EGEE – EGI
TAMAS: Team to Assist Porting Applications to e-Science infrastructures
• http://indico.cern.ch/getFile.py/access?contribId=4&sessionId=1&resId=1&materialId=slides&confId=72253
SAGA – gLite-UNICORE-ARC and SD (gLite)• Abstract at EGEE Users Forum (Uppsala, Apr’10)
University of Western England• Visualisation and workflows
Other projects..
Summary
Provides the basic abstractions:
• The building blocks upon which to construct “consistent” higher-levels of functionality and abstractions
• Core feature set
• Extensible and growing; many externally driven (eg SD)
• Meets the need for a broad Spectrum of Application
Used for Applications, Tools and Service Layers
• SAGA standard is interesting for Tools, Service-layer developers
• Provides a fresh perspective to developing distributed applications and CI
• IDEAS (Interoperability, Distributed Scale-Out, Extensibility, Adaptivity, Simplicity)
Acknowledgements
SAGA Team and DPA Team and the UK-EPSRC (UK EPSRC: DPA, OMII-UK, OMII-UK PAL), NSF (HPCOPS, Cybertools) and LA-BOR
People:
SAGA D&D: Hartmut Kaiser, Ole Weidner, Andre Merzky, Joohyun Kim, Lukasz Lacinski, João Abecasis, Chris Miceli, Bety Rodriguez-Milla
SAGA Users: Andre Luckow, Yaakoub el-Khamra, Kate Stamou, Cybertools (Abhinav Thota, Jeff, N. Kim), Owain Kenway
Google SoC: Michael Miceli, Saurabh Sehgal, Miklos Erdelyi
Collaborators and Contributors: Steve Fisher & Group, Sylvain Renaud (JSAGA), Go Iwai & Yoshiyuki Watase (KEK)
DPA: Dan Katz, Murray Cole, Manish Parashar, Omer Rana, Jon Weissman