providing performance guarantees for cloud applications

41
Providing Performance Guarantees for Cloud Applications Anshul Gandhi IBM T. J. Watson Research Center Stony Brook University 1 Parijat Dube, Alexei Karve, Andrew Kochut, Li Zhang IBM T. J. Watson Research Center

Upload: rafael-vance

Post on 03-Jan-2016

14 views

Category:

Documents


0 download

DESCRIPTION

Providing Performance Guarantees for Cloud Applications. Anshul Gandhi IBM T. J. Watson Research Center Stony Brook University. Parijat Dube , Alexei Karve , Andrew Kochut , Li Zhang IBM T. J. Watson Research Center. Motivation. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Providing Performance Guarantees for Cloud Applications

Providing Performance Guarantees for Cloud Applications

Anshul GandhiIBM T. J. Watson Research Center

Stony Brook University

1

Parijat Dube, Alexei Karve, Andrew Kochut, Li ZhangIBM T. J. Watson Research Center

Page 2: Providing Performance Guarantees for Cloud Applications

Motivation

• Businesses have started moving to the cloud for their IT needs─ reduces capital cost of buying servers─ allows for elastic resizing of applications that have dynamic workload

demand

• Cloud Service Providers (CSPs) offer monitoring and rule-based triggers to enable dynamic scaling of applications

Amazon auto scalingMicrosoft Azure Watch

2

Time

Dem

and

?

Page 3: Providing Performance Guarantees for Cloud Applications

Motivation

• The values have to be determined by the user─ requires expert knowledge of application (CPU, memory, n/w thresholds)─ requires performance modeling expertise (when and how to scale)

Amazon auto scalingMicrosoft Azure Watch

How to set these values ??

3

Page 4: Providing Performance Guarantees for Cloud Applications

Motivation

• The values have to be determined by the user─ requires expert knowledge of application (CPU, memory, n/w thresholds)─ requires performance modeling expertise (when and how to scale)

4

arrival rate (req/s)

95%

Res

p. ti

me

(ms)

400 ms

60 req/s

Offline benchmarking

Trial-and-error

Expert application knowledge

The values are critical for application performance

and require expertise

Page 5: Providing Performance Guarantees for Cloud Applications

Goal

• Take the burden away from the user─ user only specifies the performance requirement─ CSP should be fully responsible for scaling

5

Offline benchmarking

Trial-and-error

Expert application knowledge

Not possible for CSPs !

arrival rate (req/s)

95%

Res

p. ti

me

(ms)

400 ms

60 req/s

Page 6: Providing Performance Guarantees for Cloud Applications

6

View from user’s perspective

Page 7: Providing Performance Guarantees for Cloud Applications

7

View from CSP’s perspective

Page 8: Providing Performance Guarantees for Cloud Applications

8

Problem statement

How to scale an unobservable cloud application to provide performance guarantees ?

Page 9: Providing Performance Guarantees for Cloud Applications

9

Outline

1. Existing CSP solutions (survey)2. Our solution: Dependable Compute Cloud (DC2)

a) High-level idea behind DC2b) DC2 system architecturec) Modeling + Optimization (Queueing) (Kalman filtering)

3. Evaluation• RUBiS (eBay.com) multi-tier application• Various traces• Various workload mixes

4. Limitations and future work

Page 10: Providing Performance Guarantees for Cloud Applications

Existing CSP solutions

• Resource usage triggers─ Amazon Auto Scaling, Microsoft Azure Watch, VMware AppInsight, CiRBA

• Request rate for specific software (ex: apache)─ RightScale

• Latency/VM─ Amazon Elastic Load balancing

• Web site response time─ Scalr

10

User has to set values

Page 11: Providing Performance Guarantees for Cloud Applications

11

Outline

1. Existing CSP solutions (survey)2. Our solution: Dependable Compute Cloud (DC2)

a) High-level idea behind DC2b) DC2 system architecturec) Modeling + Optimization (Queueing) (Kalman filtering)

3. Evaluation• RUBiS (eBay.com) multi-tier application• Various traces• Various workload mixes

4. Limitations and future work

Page 12: Providing Performance Guarantees for Cloud Applications

12

DC2: High-level idea

VM utilization

Request rate

End-to-end response time

Service requirements of requests at each tier

Network delay

Background utilization (overhead)

Page 13: Providing Performance Guarantees for Cloud Applications

13

DC2: High-level idea

Service requirements of requests at each tier

Network delay

Background utilization (overhead)

VM utilization

Request rate

End-to-end response time

Kalman filtering

Page 14: Providing Performance Guarantees for Cloud Applications

14

DC2: High-level idea

Service requirements of requests at each tier

Network delay

Background utilization (overhead)

VM utilization

Request rate

End-to-end response time

Kalman filtering

Page 15: Providing Performance Guarantees for Cloud Applications

15

Outline

1. Existing CSP solutions (survey)2. Our solution: Dependable Compute Cloud (DC2)

a) High-level idea behind DC2b) DC2 system architecturec) Modeling + Optimization (Queueing) (Kalman filtering)

3. Evaluation• RUBiS (eBay.com) multi-tier application• Various traces• Various workload mixes

4. Limitations and future work

Page 16: Providing Performance Guarantees for Cloud Applications

16

DC2: System architecture

DC2 Userinitial topology + perf SLA

resource monitoring

application monitoring

scaling directives

modeling+optimization

Page 17: Providing Performance Guarantees for Cloud Applications

17

Outline

1. Existing CSP solutions (survey)2. Our solution: Dependable Compute Cloud (DC2)

a) High-level idea behind DC2b) DC2 system architecturec) Modeling + Optimization (Queueing) (Kalman filtering)

3. Evaluation• RUBiS (eBay.com) multi-tier application• Various traces• Various workload mixes

4. Limitations and future work

Page 18: Providing Performance Guarantees for Cloud Applications

18

DC2: Modeling + Optimization

Given initial topology, how to dynamically scale application in a cost-effective manner to ensure user-specified SLA compliance?

TSLA

Page 19: Providing Performance Guarantees for Cloud Applications

19

DC2: Modeling

multi-tier queueing network model

home

browse

buy

Page 20: Providing Performance Guarantees for Cloud Applications

20

DC2: Modeling

λ1

λ2

λ3

T1

T2

T3

S11 S21 S31 S13 S23 S33

S12 S22 S32

U0jdi

Parameters:• λi – Request rate for class i

• Ti – Response time for class i

• Sij – Service requirement for class i at tier j

• di – Network latency for class i

• U0j – Background utilization on tier j

• Uj – Utilization of tier j

24 parameters9 known + 15 unknown

iijijj

j j

ijii

SU0U

U

SdT

1

6 equations

Page 21: Providing Performance Guarantees for Cloud Applications

21

DC2: ModelingParameters:• λi – Request rate for class i

• Ti – Response time for class i

• Sij – Service requirement for class i at tier j

• di – Network latency for class i

• U0j – Background utilization on tier j

• Uj – Utilization of tier j

24 parameters9 known + 15 unknown

iijijj

j j

ijii

SU0U

U

SdT

1

6 equations

• Underdetermined system• Need to “infer” unknowns• Can leverage monitored values • Kalman filtering

• Observed states: {λi, Ti, Uj}

• Hidden states: {Sij, di, U0j}

“Guess” unknowns

Evaluate functions using guesses

Compare with monitored values

Improve guess

Page 22: Providing Performance Guarantees for Cloud Applications

22

Kalman filtering

“Guess” unknowns

Evaluate functions using guesses

Compare with monitored values

Improve guess

• KF is a reactive, feedback-based estimation approach that has only recently been employed for computer systems

• KF automatically learns the (possibly changing) system parameters, for any system, including combination of workloads

• We extend KF to a 3-tier 3-workload-class system• Based on KF estimation, DC2 automatically, and proactively, detects which tier

is the bottleneck, and how to resolve the bottleneck (scale VMs)─ do not require any knowledge of application, except topology

Page 23: Providing Performance Guarantees for Cloud Applications

23

Kalman filtering + Queueing

“Guess” unknowns

Evaluate functions using guesses

Compare with monitored values

Improve guess

• KF can be integrated with system models (ex, queueing models) to improve accuracy and convergence

• Model need not be accurate─ KF leverages (true) monitored values to account for model inaccuracies─ Well suited for approximate system models such as queueing-theoretic models─ Can use other models as well, ex: machine-learning based models

Page 24: Providing Performance Guarantees for Cloud Applications

24

Kalman filtering + Queueing: Evaluation

Time to converge~1 min (6 intervals)

Good accuracy

Change in workload triggered

Time to converge~3 min (18 intervals)

Good accuracy

Page 25: Providing Performance Guarantees for Cloud Applications

25

Outline

1. Existing CSP solutions (survey)2. Our solution: Dependable Compute Cloud (DC2)

a) High-level idea behind DC2b) DC2 system architecturec) Modeling + Optimization (Queueing) (Kalman filtering)

3. Evaluation• RUBiS (eBay.com) multi-tier application• Various traces• Various workload mixes

4. Limitations and future work

Page 26: Providing Performance Guarantees for Cloud Applications

26

Experimental setup

SoftLayer machines

OpenStack

8-core, 8 GB

RUBiS

Page 27: Providing Performance Guarantees for Cloud Applications

RUBiS

27

Page 28: Providing Performance Guarantees for Cloud Applications

28

RUBiS

• RUBiS is an open source benchmark inspired by ebay.com• We focus on scaling Tomcat app tier• Different workload classes (home, browse, buy)

4 vCPU 4 vCPU

2 vCPUSLA: Tbrowse < 40ms for every 10 s monitoring

interval

Page 29: Providing Performance Guarantees for Cloud Applications

29

Demo

Page 30: Providing Performance Guarantees for Cloud Applications

30

Evaluation

Bursty trace [WITS] Hill trace [ITA] Rampdown trace [WITS]

Base

MoreDB

MoreApp

MoreWeb

Page 31: Providing Performance Guarantees for Cloud Applications

31

DC2: All traces

Bursty trace [WITS] Hill trace [ITA] Rampdown trace [WITS]

Page 32: Providing Performance Guarantees for Cloud Applications

32

THRES(x,y): Bursty trace

Page 33: Providing Performance Guarantees for Cloud Applications

33

Bursty trace: All policies

Bursty trace [WITS] Hill trace [ITA] Rampdown trace [WITS]

STATIC-OPT

DC2

THRES(30,60)

THRES(30,50)

THRES(40,60)

V=0% K=3.00 V=0% K=4.00 V=0% K=6.00

V=0% K=2.50 V=0% K=2.44 V=0% K=4.76

V=0% K=2.50 V=6.66% K=2.56 V=0% K=6.00

V=0% K=2.79 V=0% K=2.72 V=0% K=6.00

V=2.02% K=2.19 V=15.87% K=2.13 V=0% K=4.62

Page 34: Providing Performance Guarantees for Cloud Applications

34

All traces: All policies

Bursty trace [WITS] Hill trace [ITA] Rampdown trace [WITS]

STATIC-OPT

DC2

THRES(30,60)

THRES(30,50)

THRES(40,60)

V=0% K=3.00 V=0% K=4.00 V=0% K=6.00

V=0% K=2.50 V=0% K=2.44 V=0% K=4.76

V=0% K=2.50 V=6.66% K=2.56 V=0% K=6.00

V=0% K=2.79 V=0% K=2.72 V=0% K=6.00

V=2.02% K=2.19 V=15.87% K=2.13 V=0% K=4.62

Page 35: Providing Performance Guarantees for Cloud Applications

35

All workloads: All policies (Bursty trace)

STATIC-OPT

DC2

THRES(30,60)

V=0% K=3.00 V=0% K=4.00 V=0% K=3.00 V=0% K=3.00

V=0% K=2.50 V=0% K=3.66 V=0% K=2.94 V=0% K=2.87

V=0% K=2.50 V=3.06% K=3.40 V=2.04% K=2.98 V=0% K=3.00

Base MoreDB MoreApp MoreWeb

• Rule-based policies like THRES require tuning and are not robust

• Other auto-scaling policies require control of application

• DC2 is superior to THRES and does not require application control

Page 36: Providing Performance Guarantees for Cloud Applications

36

Other applications

• Bottleneck analysis─ HeavyDB use case

• What-if analysis─ Optimal VM configuration

• Can be combined with forecasting models

• Can be combined with other system models ─ ML-based instead of queueing models

Page 37: Providing Performance Guarantees for Cloud Applications

37

Outline

1. Existing CSP solutions (survey)2. Our solution: Dependable Compute Cloud (DC2)

a) High-level idea behind DC2b) DC2 system architecturec) Modeling + Optimization (Queueing) (Kalman filtering)

3. Evaluation• RUBiS (eBay.com) multi-tier application• Various traces• Various workload mixes

4. Limitations and future work

Page 38: Providing Performance Guarantees for Cloud Applications

38

Limitations and future work

• Evaluation limited to dynamic web applications─ Currently investigating Hadoop-type applications

• Only applies to stateless tiers─ DB scaling would be challenging

• Scaling algorithm can be further improved─ Add delayedoff

• Non-zero convergence time

• Tunable parameters─ Response time threshold (35ms)─ Monitoring interval (10s)

Page 39: Providing Performance Guarantees for Cloud Applications

39

Conclusions

• Need for adaptive scaling services for (opaque) cloud applications─ Application agnostic─ Robust to arrival pattern and workload mix

• Existing commercial offerings do not suffice: rule-based• Existing auto-scaling research solutions do not apply due to lack

of visibility and control of opaque cloud applications

• Our solution: Dependable Compute Cloud (DC2)─ Does not require offline benchmarking or expert knowledge─ Can adapt to dynamic changes in workload

• Well suited for cloud users who lack expertise in system modeling and application knowledge

Page 40: Providing Performance Guarantees for Cloud Applications

40

Thank You !

Page 41: Providing Performance Guarantees for Cloud Applications

41

Conclusions

• Need for adaptive scaling services for (opaque) cloud applications─ Application agnostic─ Robust to arrival pattern and workload mix

• Existing commercial offerings do not suffice: rule-based• Existing auto-scaling research solutions do not apply due to lack

of visibility and control of opaque cloud applications

• Our solution: Dependable Compute Cloud (DC2)─ Does not require offline benchmarking or expert knowledge─ Can adapt to dynamic changes in workload

• Well suited for cloud users who lack expertise in system modeling and application knowledge