slalom webinar final technical outcomes explained "using the slalom technical model to improve...
Post on 10-Jan-2017
Using the SLALOM model to improve Cloud SLAs
Efstathios Karanastasis, ICCS/NTUA
2
SLALOM Technical Track
3
Problem snapshot: SLA Technological Landscape
• A lot of ambiguities exist in SLAs of Cloud providers
• The measurement/auditing process of an SLA cannot be done non-repudiably
– i.e., the involved parties may be able to challenge the auditing of the SLOs
• Standard models are rare and are not widely used
• Differences between Cloud providers cannot be easily assessed
– Absolute percentages cannot be compared among providers
4
Problem snapshot: Ambiguities in SLAs
• The definition of availability given by each provider may encapsulate a different formula for its calculation
• The definition and calculation of availability may include different ways of identifying a failure, e.g.:
– Response time checked against a limit
– Returned response within a string enumeration (i.e. a predefined range of string values)
• Preconditions apply
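The two ways of identifying a failure listed above can lead to different verdicts on the same observation. The following is a minimal sketch of that effect; the `Sample` type, the function names, the one-second limit and the set of error strings are all illustrative assumptions, not part of any provider's SLA.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One observation of the service (hypothetical structure)."""
    response_time_s: float
    response_value: str


def is_failure_by_latency(sample, limit_s):
    # Failure rule 1: the response took longer than the agreed limit.
    return sample.response_time_s > limit_s


def is_failure_by_enumeration(sample, error_values):
    # Failure rule 2: the returned value is one of a predefined set of strings.
    return sample.response_value in error_values


# The same sample is a failure under one rule and a success under the other.
slow_but_ok = Sample(response_time_s=2.5, response_value="OK")
by_latency = is_failure_by_latency(slow_but_ok, limit_s=1.0)
by_enumeration = is_failure_by_enumeration(slow_but_ok, {"INTERNAL_ERROR", "TIMEOUT"})
```

Under these assumed rules, a slow but successful call lowers availability for one provider and not for the other, which is why the raw percentages are not directly comparable.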
5
Problem snapshot: Real world example of Ambiguity
• Ambiguity in the measurement process of AWS EC2 SLA
• “Unavailable” and “Unavailability” mean:
– When all of your running instances have no external connectivity
• Determination of external connectivity. How?
– Internet Layer: Pinging (ICMP)?
  • Security threat
– Application layer: Endpoint checking?
  • Includes application downtime
  • Not exclusively the responsibility of AWS EC2
6
Problem snapshot: Examples of preconditions
• For any SLA to apply, a number of preconditions typically exist per provider
• Examples:
– Deployment: A specified number of Availability Zones must be used
– Deployment: Replication options must be used
– Usage/Measurement: Unavailable resources must first be restarted
– Usage/Measurement: The number of requests must be throttled
7
Problem snapshot: SLALOM Technical objectives
• To have a standard model for defining SLAs that eliminates ambiguities
• To facilitate the measurement, monitoring and enforcement of SLAs to achieve non-repudiability
• To abstract the SLA definition process (SLA → SLO → metric → sub-metric) so as to enable the application of metrics that allow for direct comparability
8
Interaction with Standards
9
SLALOM@ISO: Interaction with ISO
• Mapped SLALOM 3-layer initial approach to ISO baseline model
– ISO approach powerful at describing more complex metrics (e.g. MS Azure SLA)
• Demonstrated and suggested the ISO model extensibility for fully defining the way an SLO can be audited (ACCEPTED)
– Suggested the inclusion of an Extension class in the ISO model
– Instantiate the ISO Extension class as the base Sample class of SLALOM
– Introduce the SLALOM Sample layer for concretely defining the sampling process
– In the latest revision of the draft ISO model all classes are extendable
• Applied on different types of Objectives of Commercial SLAs
– GAE Datastore (PaaS)
– AWS EC2 (IaaS)
– Microsoft Azure (Storage)
• Showed applicability of the proposed approach for directly creating machine understandable descriptions of the SLOs
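As an illustration of what a machine understandable SLO description could look like, the sketch below serialises the parameters that this deck uses for the GAE Datastore mapping (sampling condition, type of operation, boundary period, error condition, availability threshold). The class and field names are hypothetical and are not the class names of the ISO 19086-2 metric model or of the SLALOM specification.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SloDescription:
    """Hypothetical flat record of one availability SLO."""
    service: str
    sampling_condition: str            # "sc" in the mapping tables
    operation: str                     # type of operation being sampled
    boundary_period_s: int             # "bp" in the mapping tables
    error_condition_pct: float         # "ec" in the mapping tables
    availability_threshold_pct: float  # SLA is violated below this value


gae_datastore = SloDescription(
    service="Google AppEngine Datastore",
    sampling_condition="INTERNAL_ERROR",
    operation="API calls",
    boundary_period_s=300,
    error_condition_pct=10.0,
    availability_threshold_pct=99.95,
)

# A plain JSON document that a monitoring tool could parse without reading SLA prose.
document = json.dumps(asdict(gae_datastore), indent=2)
```

Once every parameter is an explicit field, two SLAs can be compared or audited by a program rather than by reading the fine print.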
10
SLALOM@ISO: ISO 19086-2 Metric model
• SLALOM two-fold contribution:
– ISO model classes parameters: machine understandable
– ISO model extension: definition of sampling process
[Diagram annotations: "SLALOM - proposed extension"; "Model from the latest revision of the 19086-2 draft standard, to be made available in the forthcoming weeks"; "All classes extendible"]
11
SLALOM@ISO: SLALOM vs. ISO compliance
ISO-compliant SLA
• Usage of the ISO fields (classes, parameters)
• SLA not necessarily fully defined
SLALOM-compliant SLA
• ISO compliant
• Clear and well-defined
• Non-repudiable
• SLAs still not comparable among providers
12
Mapping of Commercial SLAs
13
Commercial SLAs @SLALOM: Amazon WS EC2
Amazon EC2
Level / definition | Expression | Notes
Sample definition | sc: UNDEFINED (assumed ‘ping’ -> ICMP) | The sampling condition is not defined in the Amazon EC2 SLA. The concrete wording is “when all of your running instances have no external connectivity”. Nonetheless, the way to specify / measure “external connectivity” is not defined. For example, a customer could use a ping operation or a custom monitoring mechanism.
Sample definition | Type of operation: ping | It is not defined how the condition of connectivity can actually be measured (e.g. the ping operation mentioned previously).
Boundary period and error definitions | bp > 60 sec | The exact wording is “the percentage of minutes”, thus the period is 60 seconds.
Boundary period and error definitions | ec = 100% | Error condition reflecting that the resource must be continuously “unavailable” for the entire bp.
Abstract metric definition | availability < 99.95 % | Availability metric definition given the boundary period and error condition.
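The EC2 mapping above can be turned into a small calculation. The sketch below assumes the undefined sampling condition is filled in with connectivity probes (for example pings) grouped per minute, and that availability is the share of minutes that were not fully unavailable. Both are assumptions made for illustration, as are the function names and the sample data.

```python
def unavailable_minutes(minutes):
    """Count minutes in which every connectivity probe failed (ec = 100%)."""
    return sum(1 for probes in minutes if probes and not any(probes))


def availability_pct(minutes):
    """Percentage of one-minute boundary periods that were not fully unavailable."""
    return 100.0 * (len(minutes) - unavailable_minutes(minutes)) / len(minutes)


# Hypothetical billing period of 10,000 minutes with two probes per minute:
# 4 minutes had a partial failure, 6 minutes had no connectivity at all.
minutes = [[True, True]] * 9990 + [[True, False]] * 4 + [[False, False]] * 6

availability = availability_pct(minutes)
violated = availability < 99.95
```

Only the six fully unreachable minutes count against the provider here; the four partially failing minutes do not, which is exactly the kind of detail the ec = 100% parameter makes explicit.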
14
Commercial SLAs @SLALOM: Google AE Datastore
Google AppEngine Datastore
Level / definition | Expression | Notes
Sample definition | sc: INTERNAL_ERROR | Several sampling conditions are defined per type of operation. For example it is specified (exact wording) “INTERNAL_ERROR, TIMEOUT, …” for API calls.
Sample definition | Type of operation: API calls | Several types of operations are defined. An example is provided here.
Boundary period and error definitions | bp > 300 sec | The exact wording is “five consecutive minutes”.
Boundary period and error definitions | ec > 10% | Error condition reflecting the error ratio; the exact wording is “ten percent Error Rate”.
Abstract metric definition | availability < 99.95 % | Availability metric definition given the boundary period and error condition.
15
Commercial SLAs @SLALOM: Microsoft Azure
Microsoft Azure Storage
Level / definition | Expression | Notes
Sample definition | sc = 60 sec | Several sampling conditions are defined per type of operation. For example it is specified (exact wording) “Sixty (60) seconds” for PutBlockList and GetBlockList.
Sample definition | Type of operation: PutBlockList and GetBlockList | Several types of operations are defined. An example is provided here.
Boundary period and error definitions | bp > 3600 sec | The exact wording is “given one-hour interval”.
Boundary period and error definitions | ec > 0% | Error condition reflecting that all periods should be taken into account for the availability metric evaluation; the exact wording is “is the sum of Error Rates for each hour”.
Abstract metric definition | availability < 99.9 % | Availability metric definition given the boundary period and error condition.
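For the Azure Storage mapping, every one-hour interval contributes its own error rate (ec > 0%). The sketch below assumes that availability is 100% minus the average of the hourly error rates, i.e. the quoted "sum of Error Rates for each hour" divided by the number of hours. That division, the function names and the sample data are assumptions for illustration and should be checked against the actual SLA text.

```python
def hourly_error_rate(failed, total):
    """Error rate of one one-hour interval, in percent (0 when there were no requests)."""
    return 0.0 if total == 0 else 100.0 * failed / total


def availability_pct(hours):
    """hours: list of (failed_requests, total_requests), one entry per hour."""
    rates = [hourly_error_rate(failed, total) for failed, total in hours]
    return 100.0 - sum(rates) / len(rates)


# Hypothetical 720-hour month: one hour at 50% errors, one hour fully failing.
hours = [(0, 1000)] * 718 + [(500, 1000), (1000, 1000)]

availability = availability_pct(hours)
violated = availability < 99.9
```

Unlike the EC2 mapping, a partially failing hour counts here in proportion to its error rate, so the same incident yields a different availability figure under the two SLAs.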
16
SLA Comparability
17
SLA comparability: Overview
• Despite the fact that through the SLALOM / ISO model SLA descriptions may be aligned, this does not mean that SLAs (or their parameters) will be directly comparable
• Need for more abstract metrics that result in direct comparisons
– SLA success ratio (Published* by Cloud WG of SPEC**)
– SLA strictness (Published* by Cloud WG of SPEC**)
– Standardised datasets
• SLALOM model enables the application of comparable metrics
– All SLA parameters are clearly and well defined
– The SLAs are machine readable
– Greatly simplifies the process and its automation
* Ready for Rain? A View from SPEC Research on the Future of Cloud Metrics
** SPEC: Standard Performance Evaluation Corporation
18
SLA comparability: Comparative metrics
• SLA success ratio
– Based on experience of usage of a service or provider
– In the course of time keep track of successful or violated SLAs and total SLAs
– Calculate the ratio: (Successful SLAs / Total SLAs)
• SLA strictness
– Extract static SLA parameters of importance for a given domain or application
– Assign weights to parameters and normalise
– Map these parameters to an arbitrary function
– Results in a comparative ranking of different SLAs
• Standardised datasets
– Define a set of failure scenarios
– Benchmark each provider SLA definition against the predefined scenario
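The SLA success ratio described above is a simple running tally. A minimal sketch, assuming each past SLA period is recorded as a boolean (True when no violation occurred); the provider names and histories are made up for illustration.

```python
def sla_success_ratio(outcomes):
    """Successful SLAs / total SLAs for one provider."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no SLA periods recorded")
    return sum(outcomes) / len(outcomes)


# Hypothetical history: one entry per monthly SLA period.
history = {
    "provider_a": [True] * 11 + [False],
    "provider_b": [True] * 9 + [False] * 3,
}

ratios = {name: sla_success_ratio(outcomes) for name, outcomes in history.items()}
ranking = sorted(ratios, key=ratios.get, reverse=True)
```

Because the ratio depends only on whether each SLA was met, it can be compared across providers even when their underlying availability formulas differ.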
19
SLA-related Lessons Learnt for Cloud Uptake
20
Lessons Learnt: Do
1) Target metrics that are directly comparable among providers
2) Consider directly machine understandable descriptions via standardised templates
3) Look into the ISO 19086 series of standards and adopt if applicable
4) Think outside the narrow Cloud box. With the advent of *aaS and the emergence of IoT, SLAs may refer to services external to the data center or to specific metrics needed by Cloud Services based on the individual Use Case
5) Consider composite services that may create chains of SLAs and their interdependencies. For guaranteeing response time to service-support services consider downstream (reseller) and upstream (e.g. provider’s subcontractors) actors’ requirements and the need to ‘float’ SLA clauses down the chain
6) Consider resource management as a key part of SLA upkeep and analysis process
7) Consider mechanisms that would allow providers, resellers and users to easily monitor the SLA in a common and understandable way, even if not experts.
21
Lessons Learnt: Don’t
1) Consider that offered terms are equivalent, even if they originally seem to refer to the same SLO. Always check the fine print for differences in how metrics are actually calculated
2) Consider that SLAs are monitored by providers.
3) Leave end users out of the loop. Comprehensiveness and clarity of an SLA (or its relevant metric) for non-experts should be a key target. Translate your metrics into plain English if necessary.
4) Limit yourself to popular metrics (e.g. availability) in SLAs. Users are also interested in more generic Quality of Experience (QoE) indexes such as stability
5) Expect the market to bend for you: fit in to current practice to the maximum extent and if not possible, hone your value proposition
22
SLALOM Contribution and Expected Impact
23
SLALOM contribution: Tender Evaluation
• Usable by various actors
– Adopters to specify their needs
– Providers to describe their value proposition
– Third parties (resellers/brokers) to combine and offer services and suggest options
• Added value
– Application of comparative metrics
– Automation of the process
• Benefits
– Improve transparency
– Enhance efficiency
– Establish fairness
24
SLALOM contribution: Contract monitoring
• Benefits
– Achieve SLA non-repudiation
– Establish trust and transparency for service execution compliant to the terms and proper violation management
– Enable automation of contract and performance management and monitoring
– Aid the involvement of actors like trusted third parties offering relevant services
25
SLALOM contribution: Your feedback needed
• SLALOM proposed specification / reference model already takes into account:
– Standardisation approaches and working groups outcomes
– Current SLAs and metrics offered by commercial Cloud providers
– Views expressed by Cloud providers and adopters
– Research outcomes
• Further feedback regarding applicability and practical usage of our model is more than welcome
• Please take the survey on IoT/Cloud metrics here: https://docs.google.com/forms/d/1JmwDXyO_1hT9iR-lm1c3LCQu_zF64nf-uFnxBeGMv3g/viewform
26
Contact us
• SLALOM Technical WP Leader: ekaranas@mail.ntua.gr, vandro@mail.ntua.gr, gkousiou@mail.ntua.gr
• SLALOM Project Coordinator: daniel.field@atos.net
27
SLALOM Project
SLALOM is a CSA financed by European Commission under Grant agreement 644270
For more information on the initiative contact us:
@CloudSLAlom, www.SLALOM-Project.eu
SLALOM Project Coordinator (daniel.field@atos.net)
28
Backup slides: SLA Strictness example
29
Backup slides: SLA strictness example
Provider/Service | t | q (s1 * q) | q’ (s2 * q) | p (s3 * p) | x | S | S’
Google Compute | 0 | 5 (1.00) | 5 (0.10) | 99.95 (0.50) | 0 | 0.50 | 1.60
Amazon EC2 | 0 | 1 (0.20) | 1 (0.02) | 99.95 (0.50) | 0 | 1.30 | 1.48
MS Azure Compute | 1 | 1 (0.20) | 1 (0.02) | 99.95 (0.50) | 0 | 2.30 | 2.48
• Extract static SLA parameters of importance for a given domain/application
– All these parameters (e.g. boundary period, error rates) are described in the SLALOM model
• Map these parameters to an arbitrary function, e.g.:
S = t + (1 - s1 * q) + s3 * p + x (S’ uses s2 in place of s1), where:
– q: size of the boundary period
– p: percentage of availability
– t: running time vs. overall monthly time (boolean), t ϵ {0,1}
– x: existence of performance metrics (boolean), x ϵ {0,1}
– si: normalisation factor for the continuous variables so that: (s1*q) ϵ [0,1], (s2*q) ϵ [0,0.1] and (s3*p) ϵ [0,0.5]
• Resulting value may be compared between providers
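The strictness function can be evaluated mechanically once the parameters are extracted. The sketch below reads the formula as S = t + (1 - s1 * q) + s3 * p + x and uses normalisation factors inferred from the bracketed values in the table (s1 = 0.2, s2 = 0.02, s3 = 0.005). Those factors, the reading of the formula and the function name are assumptions; q is taken in minutes.

```python
def strictness(t, q, p, x, s_q, s_p):
    """Strictness score: t + (1 - s_q * q) + s_p * p + x."""
    return t + (1 - s_q * q) + s_p * p + x


# Normalisation factors inferred from the table (assumed, not stated in the slides).
S1, S2, S3 = 0.2, 0.02, 0.005

providers = {
    "Google Compute": dict(t=0, q=5, p=99.95, x=0),
    "Amazon EC2": dict(t=0, q=1, p=99.95, x=0),
    "MS Azure Compute": dict(t=1, q=1, p=99.95, x=0),
}

# S uses s1 for the boundary period term; S' would use s2 instead.
scores = {
    name: round(strictness(s_q=S1, s_p=S3, **params), 2)
    for name, params in providers.items()
}
amazon_s_prime = round(strictness(s_q=S2, s_p=S3, **providers["Amazon EC2"]), 2)
```

With these assumed factors, a shorter boundary period and a running-time-based calculation both raise the score, so a higher value indicates a stricter SLA.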
30
Backup slides: Mapping of AWS EC2 SLA
31
AWS EC2 SLA @SLALOM (1/9)
Amazon EC2
Level / definition | Expression | Notes
Sample definition | sc: UNDEFINED (assumed ‘ping’ -> ICMP) | The sampling condition is not defined in the Amazon EC2 SLA. The concrete wording is “when all of your running instances have no external connectivity”. Nonetheless, the way to specify / measure “external connectivity” is not defined. For example, a customer could use a ping operation or a custom monitoring mechanism.
Sample definition | Type of operation: ping | It is not defined how the condition of connectivity can actually be measured (e.g. the ping operation mentioned previously).
Boundary period and error definitions | bp > 60 sec | The exact wording is “the percentage of minutes”, thus the period is 60 seconds.
Boundary period and error definitions | ec = 100% | Error condition reflecting that the resource must be continuously “unavailable” for the entire bp.
Abstract metric definition | availability < 99.95 % | Availability metric definition given the boundary period and error condition.
32
AWS EC2 SLA @SLALOM (2/9)
Abstract metric definition | availability < 99.95 % | Availability metric definition given the boundary period and error condition.
• Condition of SLA violation specification
• Availability threshold specification
• Availability definition and calculation
• Billing period specification
• Unavailability definition and calculation
• Unavailability interval definition and calculation
• Boundary period specification
• Unreachable sample specification
• Sample definition and retrieval
[Diagram: model blocks PARAM_001, PARAM_002, SAMPLE_001, QDT_001, UAP_001, BP_001, CFA_002, PARAM_003 and CONDITION]
33
AWS EC2 SLA @SLALOM (3/9)
• Examples of preconditions:
– Deployment: Number of Availability Zones used
– Deployment: Replication options used
– Usage/Measurement: Restarting of resources when unavailable
– Usage/Measurement: Applied throttling of requests
• Practical suggestions:
– Define the Rules class strictly, so that it covers the preconditions that must apply
– Use the Note field as a placeholder for the actual SLA text that refers to a given block
34
AWS EC2 SLA @SLALOM (4/9)
[Diagram: block SAMPLE_001]
Sample definition | sc: UNDEFINED (assumed ‘ping’ -> ICMP) | The sampling condition is not defined in the Amazon EC2 SLA. The concrete wording is “when all of your running instances have no external connectivity”. Nonetheless, the way to specify / measure “external connectivity” is not defined. For example, a customer could use a ping operation or a custom monitoring mechanism.
Sample definition | Type of operation: ping | It is not defined how the condition of connectivity can actually be measured (e.g. the ping operation mentioned previously).
35
AWS EC2 SLA @SLALOM (5/9)
[Diagram: blocks PARAM_001, PARAM_002 and SAMPLE_001]
Boundary period and error definitions | bp > 60 sec | The exact wording is “the percentage of minutes”, thus the period is 60 seconds.
Boundary period and error definitions | ec = 100% | Error condition reflecting that the resource must be continuously “unavailable” for the entire bp.
36
AWS EC2 SLA @SLALOM (6/9)
[Diagram: blocks PARAM_001, PARAM_002, SAMPLE_001 and QDT_001]
• Calculation of Cloud Service Unavailability Interval
• Based on:
– The current sample
– The defined boundary period
– The definition of unreachable sample
37
AWS EC2 SLA @SLALOM (7/9)
[Diagram: blocks PARAM_001, PARAM_002, SAMPLE_001, QDT_001 and UAP_001]
• Calculation of Cloud Service Unavailability
• Based on:
– The Cloud Service Unavailability Interval
38
AWS EC2 SLA @SLALOM (8/9)
[Diagram: blocks PARAM_001, PARAM_002, SAMPLE_001, QDT_001, UAP_001, BP_001 and CFA_002]
• Calculation of Cloud Service Availability
• Based on:
– Billing period
– The Cloud Service Unavailability
39
AWS EC2 SLA @SLALOM (9/9)
[Diagram: blocks PARAM_001, PARAM_002, SAMPLE_001, QDT_001, UAP_001, BP_001, CFA_002, PARAM_003 and ASV_001]
• SLA Violation Condition
– i.e.: Availability < 99.95%
40
Backup slides: Mapping of GAE Datastore SLA
41
GAE Datastore SLA @SLALOM (1/11)
Google AppEngine Datastore
Level / definition | Expression | Notes
Sample definition | sc: INTERNAL_ERROR | Several sampling conditions are defined per type of operation. For example it is specified (exact wording) “INTERNAL_ERROR, TIMEOUT, …” for API calls.
Sample definition | Type of operation: API calls | Several types of operations are defined. An example is provided here.
Boundary period and error definitions | bp > 300 sec | The exact wording is “five consecutive minutes”.
Boundary period and error definitions | ec > 10% | Error condition reflecting the error ratio; the exact wording is “ten percent Error Rate”.
Abstract metric definition | availability < 99.95 % | Availability metric definition given the boundary period and error condition.
42
GAE Datastore SLA @SLALOM (2/11)
[Diagram: model blocks SAMPLE_001, PARAM_001, PARAM_002, PARAM_003, ER_001, DUR_001, QDT_001, UAP_001, BP_001, CFA_002, PARAM_004 and ASV_001]
• Condition of SLA Violation specification
• Availability threshold specification
• Availability definition and calculation
• Billing Period specification
• Unavailability definition and calculation
• Unavailability Interval definition and calculation
• Sampling Period duration definition and calculation
• Error Rate definition and calculation
• Boundary Period specification
• Error Rate threshold specification
• Unreachable sample values specification
• Sample definition and retrieval
Abstract metric definition | availability < 99.95 % | Availability metric definition given the boundary period and error condition.
43
GAE Datastore SLA @SLALOM (3/11)
• Examples of preconditions:
– Deployment: Number of Availability Zones used
– Deployment: Replication options used
– Usage/Measurement: Restarting of resources when unavailable
– Usage/Measurement: Applied throttling of requests
• Practical suggestions:
– Define the Rules class strictly, so that it covers the preconditions that must apply
– Use the Note field as a placeholder for the actual SLA text that refers to a given block
44
GAE Datastore SLA @SLALOM (4/11)
[Diagram: block SAMPLE_001]
Sample definition | sc: INTERNAL_ERROR | Several sampling conditions are defined per type of operation. For example it is specified (exact wording) “INTERNAL_ERROR, TIMEOUT, …” for API calls.
Sample definition | Type of operation: API calls | Several types of operations are defined. An example is provided here.
45
GAE Datastore SLA @SLALOM (5/11)
[Diagram: blocks SAMPLE_001 and PARAM_003]
Sample definition | sc: INTERNAL_ERROR | Several sampling conditions are defined per type of operation. For example it is specified (exact wording) “INTERNAL_ERROR, TIMEOUT, …” for API calls.
Sample definition | Type of operation: API calls | Several types of operations are defined. An example is provided here.
46
GAE Datastore SLA @SLALOM (6/11)
[Diagram: blocks SAMPLE_001, PARAM_003, PARAM_002 and PARAM_001]
Boundary period and error definitions | bp > 300 sec | The exact wording is “five consecutive minutes”.
Boundary period and error definitions | ec > 10% | Error condition reflecting the error ratio; the exact wording is “ten percent Error Rate”.
47
GAE Datastore SLA @SLALOM (7/11)
[Diagram: blocks SAMPLE_001, PARAM_001, PARAM_002, PARAM_003, ER_001 and DUR_001]
• Calculation of duration of sampling period:
– The period during which a number of samples was received
– Period duration calculation based on samples timestamp
• Calculation of actual Error Rate for sampling period:
– Number of violation samples / number of total samples
– Violation samples: samples containing values from a specific values pool
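The two calculations on this slide (sampling period duration and error rate) can be sketched as follows. The `Sample` structure, the function names and the timestamps are hypothetical; the values pool uses the two error strings quoted from the SLA wording earlier in this deck.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One API call observation (hypothetical structure)."""
    timestamp_s: float
    value: str


# Values pool that marks a sample as a violation.
VIOLATION_VALUES = {"INTERNAL_ERROR", "TIMEOUT"}


def sampling_period_duration(samples):
    """Duration covered by the samples, derived from their timestamps."""
    stamps = [s.timestamp_s for s in samples]
    return max(stamps) - min(stamps)


def error_rate(samples, violation_values=VIOLATION_VALUES):
    """Number of violation samples / number of total samples."""
    violations = sum(1 for s in samples if s.value in violation_values)
    return violations / len(samples)


samples = [
    Sample(0, "OK"),
    Sample(120, "TIMEOUT"),
    Sample(240, "INTERNAL_ERROR"),
    Sample(360, "OK"),
]
duration = sampling_period_duration(samples)
rate = error_rate(samples)
```

Here four samples spread over 360 seconds yield a duration above the 300-second boundary period and an error rate of 50%.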
48
GAE Datastore SLA @SLALOM (8/11)
[Diagram: blocks SAMPLE_001, PARAM_001, PARAM_002, PARAM_003, ER_001, DUR_001 and QDT_001]
• Calculation of Unavailability Interval
– IF [Sampling Period duration > Boundary Period]
– AND IF [Error Rate > Threshold (10%)]
– THEN [Unavailability Interval = Sampling Period duration]
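The IF / AND IF / THEN rule above translates directly into a small function. This is a minimal sketch: the function name and the choice to return zero when the rule does not fire are assumptions, while the default boundary period (300 seconds) and threshold (10%) come from the GAE Datastore mapping.

```python
def unavailability_interval(duration_s, error_rate, boundary_period_s=300, threshold=0.10):
    """Return the sampling period duration if it counts as unavailable, else 0."""
    if duration_s > boundary_period_s and error_rate > threshold:
        return duration_s
    return 0.0


counted = unavailability_interval(duration_s=360, error_rate=0.5)
too_short = unavailability_interval(duration_s=200, error_rate=0.5)
too_few_errors = unavailability_interval(duration_s=360, error_rate=0.05)
```

Both conditions must hold: a long period with few errors, or a short burst with many errors, contributes nothing to the unavailability total.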
49
GAE Datastore SLA @SLALOM (9/11)
[Diagram: blocks SAMPLE_001, PARAM_001, PARAM_002, PARAM_003, ER_001, DUR_001, QDT_001 and UAP_001]
• Calculation of Unavailability period
– It equals the SUM of Unavailability Intervals
50
GAE Datastore SLA @SLALOM (10/11)
[Diagram: blocks SAMPLE_001, PARAM_001, PARAM_002, PARAM_003, ER_001, DUR_001, QDT_001, UAP_001, BP_001 and CFA_002]
• Calculation of Cloud Service Availability
• Based on:
– Billing period
– The Cloud Service Unavailability
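The last steps of the chain can be sketched as follows: sum the unavailability intervals, relate the total to the billing period, and compare against the availability threshold. The slides only state that availability is "based on" the billing period and the unavailability, so the formula used here (share of the billing period that was not unavailable) is an assumption, as are the function names, the 30-day billing period and the interval values.

```python
def unavailability_period(intervals_s):
    """Unavailability period: the sum of all unavailability intervals."""
    return sum(intervals_s)


def availability_pct(billing_period_s, intervals_s):
    """Assumed formula: share of the billing period that was not unavailable."""
    down = unavailability_period(intervals_s)
    return 100.0 * (billing_period_s - down) / billing_period_s


def violates_sla(availability, threshold_pct=99.95):
    """SLA violation condition: availability below the threshold."""
    return availability < threshold_pct


BILLING_PERIOD_S = 30 * 24 * 3600   # hypothetical 30-day month
intervals = [360, 600, 900]         # hypothetical unavailability intervals, in seconds

availability = availability_pct(BILLING_PERIOD_S, intervals)
violated = violates_sla(availability)
```

In this example 1,860 seconds of unavailability exceed the roughly 1,296 seconds that a 99.95% threshold allows in a 30-day period, so the SLA counts as violated.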
51
GAE Datastore SLA @SLALOM (11/11)
[Diagram: blocks SAMPLE_001, PARAM_001, PARAM_002, PARAM_003, ER_001, DUR_001, QDT_001, UAP_001, BP_001, CFA_002, PARAM_004 and ASV_001]
• SLA Violation Condition
– i.e.: Availability < 99.95%