
VO- and Application-Centric Approaches to Service Level Agreement

Marian Bubak, Jakub Moscicki, Marcin Radecki, and Tomasz Szepieniec

Cyfronet AGH, Krakow, PL
CERN / IT

Post on 19-Dec-2015










• VO-centric approach to SLA
– Motivation
– Basic requirements
– SLA metrics
– SLA execution
– Bazaar tool

• QoS from a user perspective
– User-level vs system-level techniques
– Tools: Ganga/DIANE
– Examples of QoS metrics
– Case-study: Lattice QCD 2008

• Summary


VO-centric Approach to SLA



• Large number of VOs/users and resources

• Dynamic management is a must

• Remote interactions

• Limitation of automation

– Policy managers want to decide about their resources

• Start with human-in-the-loop SLA process


Aim: SLA based Resource Allocation


What is needed

• Definition of meaningful and measurable SLA metrics

• Communication patterns
– (Re-)negotiation
– Configuration validation
– Tracking demands/policy changes

• Complexity management and process traceability

• SLA execution monitoring (including feedback from users)

• So, we should

• define the SLA process

• and build a collaboration tool


EGEE Grid and Bazaar

• Starting point

– No standard QoS metrics

– No procedures to express requirements

– Resources become available in the infrastructure even if not agreed with the VO

• Resource Allocation in Central Europe ROC (Bazaar Project)

– A procedure of tracking requests and responses to them

– Registration and monitoring of SLAs between VOs and Resources


– Collaboration tool for tracking the process


Central European Region in EGEE

• 8 countries
• 25 sites
• ~8000 cores
• ~850 TB storage
• ~30 VOs


SLA Metrics

• Common language for users and providers

– Users: "I need to use x CPUs"

– Providers prefer to speak about aggregated wall-clock time in a specific period, without a guarantee that resources will be available in (any) defined


• Expressive enough to satisfy users' important requests

– Aggregated time, parallel use, waiting time (queues), condition of


• Configurable

– providers need to have the technical possibility to configure the resources according to the SLA (the fabric layer needs to support those requirements)

• Measurable in execution time


Examples of SLA Metrics

• Computational Resources
– Guaranteed number of job slots in Local Batch System (CPUs or cores)
– Total wall-clock time to be used in specified time period (in hours; weekly, monthly)
– Access period (range of dates)
– Maximum wall-clock- and CPU-time of a single job (hours)
– Maximum waiting time from job submission until the job starts running (in minutes)
– Average power of a single core (benchmark results like SPECint)
– Capacity available for temporary use by a job (GB)
– Memory available per core/CPU (GB)
– Maximum latency between nodes in the cluster (ms)


Examples of SLA Metrics

• Grid Storage Resources
– Storage quota guaranteed (GB)
– Maximum latency in accessing files (optional, in ms)
– Minimum bandwidth in accessing files (optional, Gb/s)
– Storage quota for temporary use (optional, GB)
– Time limit for temporary use of storage (optional, hours)
– Period of using storage (dates from-to)

• General Resource QoS
– Minimum resource availability (optional, in %)
– Minimum resource reliability (optional, in %)
– Maximum time to acknowledge trouble ticket (days)
– Maximum time to resolve trouble ticket (days)
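One way to make such metrics "measurable in execution time" is to hold the agreed values in a small record and compare them with measured usage. The sketch below is only an illustration under assumed names: ComputeSLA, UsageReport and check_compute_sla are hypothetical and are not part of Bazaar, and only a few of the computational metrics listed above are covered.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ComputeSLA:
    """Hypothetical record of agreed computational-resource metrics."""
    guaranteed_job_slots: int
    wallclock_hours_per_month: float
    max_job_wallclock_hours: float
    max_wait_minutes: float
    memory_per_core_gb: float


@dataclass
class UsageReport:
    """Hypothetical measurements collected for one month."""
    wallclock_hours_used: float
    longest_job_hours: float
    longest_wait_minutes: float


def check_compute_sla(sla: ComputeSLA, report: UsageReport) -> List[str]:
    """Return human-readable violations; an empty list means compliant."""
    violations = []
    if report.longest_wait_minutes > sla.max_wait_minutes:
        violations.append("site: waiting time above agreed maximum")
    if report.wallclock_hours_used > sla.wallclock_hours_per_month:
        violations.append("VO: wall-clock time above agreed total")
    if report.longest_job_hours > sla.max_job_wallclock_hours:
        violations.append("VO: single job longer than agreed maximum")
    return violations
```

Each violation names the side that is out of bounds, since the agreement constrains both the provider (waiting time) and the VO (consumed time).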


SLA Execution Stages in Bazaar

The process is initialized by a VO by a call for

Next, resource providers define their proposal for



States Transition Details

Each state transition must be confirmed by both sides

Proper configuration is controlled by a separate set of states
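The two rules above (both sides confirm every transition, and configuration is tracked in its own set of states) can be sketched as a small state machine. The state names and the Contract class below are assumptions made for illustration; the actual Bazaar states are not listed in these slides.

```python
from enum import Enum


class SlaState(Enum):
    # Hypothetical main states; the real Bazaar states are not given here.
    CALL = "call"
    PROPOSAL = "proposal"
    AGREED = "agreed"
    COMPLETED = "completed"


class ConfigState(Enum):
    # Hypothetical separate track for resource configuration.
    NOT_CONFIGURED = "not configured"
    CONFIGURED = "configured"
    VALIDATED = "validated"


NEXT_STATE = {
    SlaState.CALL: SlaState.PROPOSAL,
    SlaState.PROPOSAL: SlaState.AGREED,
    SlaState.AGREED: SlaState.COMPLETED,
}
PARTIES = frozenset({"vo", "site"})


class Contract:
    def __init__(self) -> None:
        self.state = SlaState.CALL
        self.config = ConfigState.NOT_CONFIGURED
        self._confirmed = set()

    def confirm(self, party: str) -> SlaState:
        """Record one side's confirmation; advance only when both agree."""
        if party not in PARTIES:
            raise ValueError(f"unknown party: {party}")
        if self.state not in NEXT_STATE:
            raise RuntimeError("contract is already completed")
        self._confirmed.add(party)
        if self._confirmed == PARTIES:
            self.state = NEXT_STATE[self.state]
            self._confirmed.clear()
        return self.state

    def site_configured(self) -> None:
        """Site reports that resources are configured as agreed."""
        self.config = ConfigState.CONFIGURED

    def vo_validated(self) -> None:
        """VO validates the configuration reported by the site."""
        if self.config is not ConfigState.CONFIGURED:
            raise RuntimeError("nothing to validate yet")
        self.config = ConfigState.VALIDATED
```

Keeping the configuration track separate means a contract can be agreed while its configuration is still pending validation.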


Bazaar Functionality

– Call management - the user can create, edit and manage calls.

– SLA management including negotiation - site managers can create a contract as a response to a call. Both partners can negotiate contract conditions and track contract changes.

– Notification management - system notifies a user via e-mail and user interface about actions like resource reconfiguration etc.

– Feedback - VO managers can assess site's configuration and both partners can provide a general assessment of the collaboration when the contract has been completed.

– Accounting and statistics - users can generate reports with resource usage statistics. In the next prototype, a tool shall enable obtaining data from EGEE accounting tools.


Bazaar in operation

– Bazaar: a tool supporting resource allocation including SLA negotiation
    • Integrated with EGEE Operation Portal (CIC Portal)
    • No cost of entry: data obtained from GOCDB and CIC-Portal VO-cards
    • Introduced into operations in Central European Region

– Main features of Bazaar
    • Clear view on VOs' demands for resources
    • Management of calls and SLAs between VOs and RCs
    • SLA negotiation support
    • E-mail notifications
    • Tracking of SLA changes


SLA in PL-Grid

• PL-Grid Project

– Grid operations center in Poland

– 3 different infrastructures: EGI compliant (currently gLite-based), DEISA, cloud-like research grid

• SLA Management in PL-Grid

– We take ideas from Bazaar Project as a starting point

– Develop SLA-centric model including

• Impact on resources available at the technical level

• Notifications on missing resources

• Improvement on SLA monitoring and accounting

• Integration with computational grants system


PL-Grid Operation Tools Architecture



• Human in a negotiation loop seems to be unavoidable

• SLAs should support VO and resource managers

• Complexity management should be supported by Web 2.0 tools (collaboration tools with traceable processes)


QoS on the Grid with User-Level Scheduling


Some Grid applications

• Data Analysis
– extraction of (statistical) parameters from data using event loop
    • ATLAS experiment at LHC

• Monte Carlo simulation
– creation of statistical objects (e.g. histograms) or building images by generating large number of independent events
    • Geant4 simulations for radiotherapy in medical physics

• Parameter sweep
– running a large number of independent jobs in various configurations
    • Geant4 regression tests

• High-throughput activities
– autonomous computing over long periods of time
    • Avian Flu Drug Search (bio-informatics)
    • Lattice QCD (theoretical physics)

• High-performance, short-deadline activities
– short-deadline performance peak
    • ITU frequency analysis for RRC06


QoS for scientific applications

• In the Grid: the basic interaction of a user is sending jobs
– efficient job/workload management plays a central role
– efficient scheduling often requires application-specific knowledge, which may be difficult at the system level

• The system provides an appropriate QoS if it responds in an acceptable way to the user and is capable of automatically maintaining the processing goals defined by the user (measured by metrics)

• Some QoS metrics (measure of user-defined goals)
– turnaround time
    • typically minimize the total execution time of the job
– reliability / failure rate
– response latency: time to obtain initial results
– feedback from the execution
    • filling histograms with events -> significance of individual partial results decreases with time
– prioritization/scheduling of the tasks
– predictability/stability of the execution
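Three of the metrics above (turnaround time, failure rate, response latency) can be computed directly from per-task timestamps. The sketch below assumes a hypothetical TaskRecord layout with times in seconds; the definitions are one plausible reading of the slide, not a standard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskRecord:
    """Hypothetical per-task log entry (times in seconds since run start)."""
    submitted: float
    finished: Optional[float]  # None if the task never completed
    ok: bool


def turnaround(tasks: List[TaskRecord]) -> float:
    """Time from the first submission to the last completed task."""
    done = [t.finished for t in tasks if t.finished is not None]
    return max(done) - min(t.submitted for t in tasks)


def failure_rate(tasks: List[TaskRecord]) -> float:
    """Fraction of tasks that did not finish successfully."""
    return sum(1 for t in tasks if not t.ok) / len(tasks)


def response_latency(tasks: List[TaskRecord]) -> float:
    """Time from the first submission to the first successful result."""
    good = [t.finished for t in tasks if t.ok and t.finished is not None]
    return min(good) - min(t.submitted for t in tasks)
```

For example, four tasks of which one is lost give a failure rate of 0.25, while the response latency depends only on the earliest successful task.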


Mechanisms for better QoS

• In general, QoS is NOT implemented on the Grid

• Techniques for performance-related metrics
– dedication of resources (wasteful)
– advanced reservations
    • difficult for some users who do not plan ahead interactive work
– better scheduling: fast/slow queues (site configuration)
– preemption: suspend lower priority job
– migration: suspend and migrate elsewhere
– better brokering: forecasting using monitoring systems (e.g. NWS)

• Techniques for failure-related metrics
– metascheduling (JDL retry count, Condor)

• Techniques for application-specific metrics
– metascheduling (not generally implemented, e.g. out of scope of



QoS Implementation Choices

• QoS implementation
– site service modifications
    • faster queues, scheduler modifications, e.g. virtualization schemes with MAUI
– middleware modification
    • checkpointing/migration, special services (e.g. GARA), Virtual
– system level modifications (unix kernel modules, special I/O)
– user-level overlay schedulers (pilot jobs, agents,...)

• Boundary conditions in a large Grid (e.g. EGEE)
– acceptance/deployment of middleware changes: very slow due to organizational constraints
– resource providers' constraints (site changes)
    • many sites cannot freely change their software (serving also non-grid users)
    • sysadmins do not like sudo-like programs
– interfacing legacy applications


User-level overlay

• Overlays are the only option if we talk about using existing Grid infrastructure at the large scale

• LCG and EGEE Grid
– the largest Grid infrastructure to date
– over 250 sites
– over 80K WNs
– over 15 PB of



User-level tools

• DIANE: helps smaller scientific communities use distributed (Grid) resources more efficiently
– reduce the application execution time
– reduce the manual work overhead by providing fully automatic execution and failure management
– efficiently integrate local and Grid resources
– part of EGEE Respect suite
– http://cern.ch/diane

• Ganga: Job Management Interface
– Submission gateway to many distributed systems
– Easy job management and application configuration
– http://cern.ch/ganga


• User-level overlay
– each user uses a (temporary) overlay which is created for the duration of the computations

User-level Overlay

(drawing courtesy of ThIS collaboration)


Master/Worker backbone

• Master/Worker processing of tasks
– RunMaster executes on a local host
– WorkerAgents execute as Grid jobs

• TaskScheduler is a software component (python module) which may be arbitrarily customized or replaced

• application plugins:
– ApplicationWorker
– ApplicationManager
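The master/worker pattern with a replaceable scheduler can be illustrated with a few lines of plain Python. This is a generic sketch using threads in one process, not DIANE's API: run_master, worker_agent and fifo_scheduler are hypothetical names, and real WorkerAgents run as Grid jobs rather than local threads.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List, Tuple


def fifo_scheduler(tasks: Iterable[Any]) -> List[Any]:
    """Default policy: hand out tasks in the order they were given."""
    return list(tasks)


def run_master(tasks: Iterable[Any],
               work: Callable[[Any], Any],
               n_workers: int = 4,
               scheduler: Callable[[Iterable[Any]], List[Any]] = fifo_scheduler
               ) -> List[Tuple[Any, Any]]:
    """Run `work` on every task using a pool of pulling workers."""
    pending = queue.Queue()
    for task in scheduler(tasks):   # the scheduler decides the order
        pending.put(task)

    results = []
    lock = threading.Lock()

    def worker_agent() -> None:
        # Each worker pulls the next task only when it is free,
        # so faster workers naturally process more tasks.
        while True:
            try:
                task = pending.get_nowait()
            except queue.Empty:
                return
            out = work(task)
            with lock:
                results.append((task, out))

    threads = [threading.Thread(target=worker_agent) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Passing a different `scheduler` function, for instance one that sorts tasks longest-first, changes the dispatch order without touching the workers, which is the point of keeping the scheduler a separate component.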


• 3 functional parts
– Submitter: selection and acquisition of the resources
– M/W: scheduling and execution control
– Directory Service: late binding of resources

• System is easily customized by plugins

Flexible architecture


Examples of QoS Metrics

Selected examples of QoS metrics for different applications


QoS Metric: predictability of execution

Comparison of G4 Production on LCG: DIANE and direct submission
• 6 sites / 173 CPUs / 100 VO-shared, 70 VO-dedicated
• 207 tasks, direct: 1 task = 1 job, DIANE workers


QoS metric: reliability

• Summary of ITU RRC06 runs
– 200K jobs in less than 6 hours
– worst case reliability: 0.0003 jobs lost

run  #jobs  #task  turnaround  CPUh  #WN  comment
1    243K   26K    6.40h       425h  190  lost <10 tasks (3e-04)
2    237K   23K    6.30h       332h  125  lost 1 task (4e-05)
3    224K   40K    3.05h       192h  210  OK
4    218K   39K    1.05h       151h  320  OK

– ITU RRC-06 (15 May–16 June 2006)
    • 120 countries (~1200 delegates) negotiated the new digital frequency plan
    • a part of a new international agreement
    • introduction of digital broadcasting: UHF (470-862 MHz), VHF (174-230 MHz)
    • preceded by RRC-04 and other international meetings


QoS Metric: low latency on the Grid

RRC06 ITU job: 116 LCG workers, 3470 tasks, ~130 CPU h

large span of task lengths, not known a priori!




QoS Metric: stability of execution


• Case study: high-throughput Lattice QCD simulation
– application-aware scheduler: prioritize tasks based on the simulation parameters
– active resource selection via Submitter (WorkerFactory)
    • dynamically select resources based on their fitness for the application
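Both ideas in this case study, ordering tasks by an application-defined priority and preferring resources that have proven fit, can be sketched briefly. The function names, the "beta" parameter and the fitness definition (fraction of workers that completed) are assumptions for illustration; the slides do not specify how the LQCD scheduler or the WorkerFactory compute them.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def prioritize(tasks: List[dict],
               priority: Callable[[dict], float]) -> List[dict]:
    """Order tasks by an application-defined priority (lowest value first)."""
    # The index breaks ties so that task dicts are never compared directly.
    heap = [(priority(t), i, t) for i, t in enumerate(tasks)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


def select_sites(history: Dict[str, Tuple[int, int]], n: int) -> List[str]:
    """Pick the n sites with the best observed fitness.

    `history` maps a site name to (completed, failed) worker counts;
    fitness is assumed here to be the completed fraction.
    """
    def fitness(site: str) -> float:
        done, failed = history[site]
        total = done + failed
        return done / total if total else 0.0

    return sorted(history, key=fitness, reverse=True)[:n]
```

A user could then submit new workers only to `select_sites(history, n)` and feed them tasks in the order returned by `prioritize`, re-evaluating both as the run progresses.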


Lattice QCD 2008 @ Grid

• Study the behaviour of the critical point of quark-gluon plasma
– The scientific results obtained by the LQCD project were published in a paper: P. de Forcrand et al., "The chiral critical point of Nf = 3 QCD at finite density to the order (μ/T)^4", available at http://arxiv.org/pdf/0808.1096

• Monte-Carlo simulation of discrete space-time lattice
– need a lot of CPU
– relatively small data (~Gbs)


LQCD execution history

• ongoing since May 2008
– several phases (application and system upgrades, power-cuts, etc.)
– routine production since September 2008

• runs unattended for months
– operated by a single user who is not a Grid expert

• large-scale
– ~1000 running jobs at any time
– ~700 CPU-years since May 2008
– ~18 TB of data


Routine LQCD production

• 700 CPU years since May 2008
• ~18 TB of data transferred
• ~800 simultaneous workers



• User-level overlay is a technique enhancing the QoS parameters for scientific applications in the EGEE Grid

• Pros & cons:
– Existing infrastructure may be used "as is"
– Application-specific optimizations (impossible at the system level)
– Hard QoS not possible (infrastructure unreliable)
– Fair-share implemented by the underlying infrastructure and respected by the overlay (if used appropriately)
– Used successfully for diverse applications

• Overlays are a complementary approach to SLAs

• More on tools:

