cicada - usenix · 1 vm 2 vm 3 vm 4 vm 5vm 6 vm 7 vm 8 vm 9 vm 2 vm 3 vm 4 vm 5 vm 6 vm 7 vm 8 vm 9...

CICADAPREDICTIVE GUARANTEES FOR CLOUD NETWORKS

Katrina LaCurts Jeff Mogul

Hari Balakrishnan Yoshio Turner

MIT CSAIL, Google Inc., HP Labs

lacurts@mit|2

BANDWIDTH GUARANTEES

lacurts@mit|2

BANDWIDTH GUARANTEES10 machines

lacurts@mit|2

BANDWIDTH GUARANTEES

bandwidth guarantee: a promise from the provider to the customer that its VMs will be able to

communicate with each other at a particular rate (informal definition)

10 machines

lacurts@mit|3

WHY GUARANTEES?

lacurts@mit|3

WHY GUARANTEES?

is there really that much data?

lacurts@mit|3

WHY GUARANTEES?applications can send

terabytes or petabytes of data internally [1]

is there really that much data?

[1] Zaharia, et al. “Improving MapReduce Performance in Heterogeneous Environments”. OSDI 2008

lacurts@mit|3

terabytes or petabytes of data internally [1]

aren’t cloud networks homogeneous and well-provisioned?

lacurts@mit|3

terabytes or petabytes of data internally [1]application traffic is

affected by other users

aren’t cloud networks homogeneous and well-provisioned?

lacurts@mit|3

what customers care about network performance?

lacurts@mit|3

enterprise customers need to satisfy SLAs with their own customers

what customers care about network performance?

lacurts@mit|4

POSSIBLE ARCHITECTURE

lacurts@mit|4

customer requests machines and guarantee

10 machines, 10Gb/s between each pair

lacurts@mit|4

provider places customer’s VMs,

enforces guarantee

lacurts@mit|4

enforces guarantee

what if some pairs of VMs use fewer than 10Gb/s?

lacurts@mit|4

enforces guarantee

over-provisioned network increased cost to customer,

wasted bandwidth within network

lacurts@mit|4

enforces guarantee

what if some pairs of VMs need more than 10Gb/s?

lacurts@mit|4

enforces guarantee

under-provisioned network poor performance for

customer’s application

lacurts@mit|4

enforces guarantee

under-provisioned network poor performance for

customer’s application

problem: how do customers know what guarantee they need?

lacurts@mit|5

CICADA’S ARCHITECTUREcicada makes predictions about an application’s

traffic to automatically generate a guarantee

lacurts@mit|5

CICADA’S ARCHITECTURE

customer makes initial request for machines

cicada makes predictions about an application’s traffic to automatically generate a guarantee

lacurts@mit|5

customer makes initial request for machines provider places

tenant VMs

lacurts@mit|5

hypervisors send measurements to cicada controller

tenant VMs

lacurts@mit|5

tenant VMs

cicada predicts bandwidth for each tenant

lacurts@mit|5

provider offers guarantee based on cicada’s prediction

provider places tenant VMs

lacurts@mit|5

provider updates placement based on cicada’s prediction

lacurts@mit|6

problem: is application traffic actually predictable? if so, is it captured by existing models?

lacurts@mit|7

APPLICATION TRAFFICto design cicada’s prediction algorithm, we need to understand how

cloud applications send traffic

lacurts@mit|7

APPLICATION TRAFFICvm1 vm2 vm3 vm4 vm5vm6 vm7 vm8 vm9

to design cicada’s prediction algorithm, we need to understand how cloud applications send traffic

lacurts@mit|7

rigid application (similar to VOC [1])

more data

less data

[1] Ballani, et al. “Towards Predictable Datacenter Networks”. SIGCOMM 2011.

lacurts@mit|7

more data

less datano data

lacurts@mit|7

vm1 vm2 vm3 vm4 vm5vm6 vm7 vm8 vm9

variable application

more data

less datano data

lacurts@mit|7

spatial variability: different pairs of VMs transfer different amounts of data

more data

less datano data

lacurts@mit|7

more data

less datano data

lacurts@mit|7

more data

less datano data

lacurts@mit|7

spatial variability: different pairs of VMs transfer different amounts of datatemporal variability: pairs of VMs transfer different amounts of data at different times

more data

less datano data

lacurts@mit|

DATASET

... ... ......

HP Cloud Serviceshttp://hpcloud.com

lacurts@mit|

DATASET

... ... ......

observed sflow data (could also use tcpdump, etc.)

lacurts@mit|

DATASET

... ... ......

(in this talk) collected one traffic matrix per hour, for

each application

lacurts@mit|

DATASET

... ... ......

M1 M2 Mn

observed traffic

each entry represents the number of bytes

transferred between i and j

each application

lacurts@mit|

DATASET

... ... ......

M1 M2 Mn

observed traffic

each entry represents the number of bytes

transferred between i and j

each application

goal: use this dataset to quantify spatial and temporal variability

lacurts@mit|

high spatial variabilityvm1 vm2 vm3 vm4 vm5

vm1 vm2 vm3 vm4 vm5

no spatial variability

SPATIAL VARIABILITYhow do we quantify spatial variability?

vm1 vm2 vm3 vm4 vm5

some spatial variability

more data

less datano data

lacurts@mit|

vm1 vm2 vm3 vm4 vm5

1. let Fij = fraction of tenant traffic sent from VMi to VMj

vm1 vm2 vm3 vm4 vm5

more data

less datano data

lacurts@mit|

vm1 vm2 vm3 vm4 vm5

Fij = 1/20

vm1 vm2 vm3 vm4 vm5

more data

less datano data

lacurts@mit|

vm1 vm2 vm3 vm4 vm5

all Fij values are equal

Fij = 1/20

vm1 vm2 vm3 vm4 vm5

more data

less datano data

lacurts@mit|

vm1 vm2 vm3 vm4 vm5

Fij = 1/20

vm1 vm2 vm3 vm4 vm5

Fij = 0

more data

less datano data

lacurts@mit|

vm1 vm2 vm3 vm4 vm5

Fij = 1/20

vm1 vm2 vm3 vm4 vm5

Fij = 0 Fij = 1/12

more data

less datano data

lacurts@mit|

vm1 vm2 vm3 vm4 vm5

Fij = 1/20

vm1 vm2 vm3 vm4 vm5

Fij = 0 Fij = 1/12

two distinct Fij values

more data

less datano data

lacurts@mit|

vm1 vm2 vm3 vm4 vm5

Fij = 1/20 Fij < 1/20

vm1 vm2 vm3 vm4 vm5

Fij = 0 Fij = 1/12

more data

less datano data

lacurts@mit|

vm1 vm2 vm3 vm4 vm5

Fij = 1/20 Fij < 1/20 Fij > 1/20

vm1 vm2 vm3 vm4 vm5

Fij = 0 Fij = 1/12

more data

less datano data

lacurts@mit|

vm1 vm2 vm3 vm4 vm5

all Fij values are equal many distinct Fij values

Fij = 1/20 Fij < 1/20 Fij > 1/20

vm1 vm2 vm3 vm4 vm5

Fij = 0 Fij = 1/12

more data

less datano data

lacurts@mit|

vm1 vm2 vm3 vm4 vm5

Fij = 1/20 Fij < 1/20 Fij > 1/20

2. calculate the coefficient of variation (σ/µ) of the Fij values

vm1 vm2 vm3 vm4 vm5

Fij = 0 Fij = 1/12

more data

less datano data

lacurts@mit|

vm1 vm2 vm3 vm4 vm5

Fij = 1/20 Fij < 1/20 Fij > 1/20

⇒ σ(Fij)/µ(Fij) = 0

vm1 vm2 vm3 vm4 vm5

Fij = 0 Fij = 1/12

more data

less datano data

lacurts@mit|

vm1 vm2 vm3 vm4 vm5

Fij = 1/20 Fij < 1/20 Fij > 1/20

⇒ σ(Fij)/µ(Fij) = 0

vm1 vm2 vm3 vm4 vm5

Fij = 0 Fij = 1/12

⇒ σ(Fij)/µ(Fij) = 1*

*see paper for this result

more data

less datano data

lacurts@mit|

vm1 vm2 vm3 vm4 vm5

Fij = 1/20 Fij < 1/20 Fij > 1/20

⇒ σ(Fij)/µ(Fij) = 0 ⇒ σ(Fij)/µ(Fij) > 1

vm1 vm2 vm3 vm4 vm5

Fij = 0 Fij = 1/12

more data

less datano data

lacurts@mit|

vm1 vm2 vm3 vm4 vm5

Fij = 1/20 Fij < 1/20 Fij > 1/20

⇒ σ(Fij)/µ(Fij) = 0 ⇒ σ(Fij)/µ(Fij) > 1

vm1 vm2 vm3 vm4 vm5

Fij = 0 Fij = 1/12

the higher the cov, the higher the spatial variation

more data

less datano data

lacurts@mit|10

SPATIAL VARIABILITY

0.01 0.1 1 10 100

Spatial Variation(cov of Fij values)

lacurts@mit|10

SPATIAL VARIABILITY

0.01 0.1 1 10 100

applications exhibit high spatial variability

lacurts@mit|10

SPATIAL VARIABILITY

0.01 0.1 1 10 100

cov values as large as 100

lacurts@mit|10

SPATIAL VARIABILITY

0.01 0.1 1 10 100

(the same result holds for temporal variability, see paper)

lacurts@mit|10

SPATIAL VARIABILITY

0.01 0.1 1 10 100

(the same result holds for temporal variability, see paper)

customers cannot model such variability themselves, and existing models do not capture it

lacurts@mit|11

CICADA’S ALGORITHMM1 M2 Mn

...observed traffic

M̂n+1

prediction for next hour

Cicada’s prediction algorithm draws inspiration from the following works: [1] Wang, et al. “COPE: Traffic Engineering in Dynamic Networks”. SIGCOMM 2006. [2] Herbster and Warmuth. “Tracking the Best Expert”. Machine Learning Vol 32, 1998.

lacurts@mit|11

CICADA’S ALGORITHM

M̂n+1

M1 M2 Mn

...observed traffic

M̂n+1

= f(M1,M2, . . . ,Mn)

lacurts@mit|11

= w1 ·M1 + w2 ·M2 + . . .+ wn ·MnM̂n+1

M1 M2 Mn

...observed traffic

M̂n+1

lacurts@mit|11

= w1 ·M1 + w2 ·M2 + . . .+ wn ·MnM̂n+1

M1 M2 Mn

...observed traffic

M̂n+1

instead of setting the weights beforehand, cicada learns the weights online; they are updated with every

prediction based on errors made in previous predictions

lacurts@mit|11

= w1(n) ·M1 + w2(n) ·M2 + . . .+ wn(n) ·MnM̂n+1

M1 M2 Mn

...observed traffic

M̂n+1

lacurts@mit|11

= w1(n) ·M1 + w2(n) ·M2 + . . .+ wn(n) ·MnM̂n+1

M1 M2 Mn

...observed traffic

M̂n+1

wi(n+ 1) = wi(n)

lacurts@mit|11

= w1(n) ·M1 + w2(n) ·M2 + . . .+ wn(n) ·MnM̂n+1

M1 M2 Mn

...observed traffic

M̂n+1

· e�L(i,n)loss function

wi(n+ 1) = wi(n)

lacurts@mit|11

= w1(n) ·M1 + w2(n) ·M2 + . . .+ wn(n) ·MnM̂n+1

M1 M2 Mn

...observed traffic

M̂n+1

L(i, n) =k ~Mi � ~Mnk

k ~Mnk· e�L(i,n)

loss function

wi(n+ 1) = wi(n)

lacurts@mit|11

= w1(n) ·M1 + w2(n) ·M2 + . . .+ wn(n) ·MnM̂n+1

M1 M2 Mn

...observed traffic

M̂n+1

L(i, n) =k ~Mi � ~Mnk

k ~Mnk· e�L(i,n)

loss function

normalization

wi(n+ 1) = wi(n)

lacurts@mit|12

EVALUATIONto evaluate cicada, we compared its predictions to ones made by

a VOC-style system [1]vm1 vm2 vm3 vm4 vm5vm6 vm7 vm8 vm9

vm1 more data

less datano data

lacurts@mit|12

vm1 more data

less datano data

VOC requires customers to input parameters

(group sizes, amount of bandwidth)

lacurts@mit|12

vm1 more data

less datano data

we used an oracle method to pick the best possible

parameters, and also allowed for temporal variability

lacurts@mit|12

vm1 more data

less datano data

L2-norm errorE =

kM̂ �MkkM̂k

relative error

i=1 M̂ [i]�M [i]Pi=n2

i=1 M [i]

lacurts@mit|12

vm1 more data

less datano data

L2-norm errorE =

kM̂ �MkkM̂k

relative error

i=1 M̂ [i]�M [i]Pi=n2

i=1 M [i]

differentiates between over- and under-prediction

lacurts@mit|13

RESULTS PREDICTING AVERAGE DEMAND

does not differentiate between over- and under-prediction

0 1 2 3 4 5

Relative L2-norm Error (fraction)

CicadaVOC (Oracle)

-1 0 1 2 3 4 5

Relative Error (fraction)

CicadaVOC (Oracle)

lacurts@mit|13

RESULTS PREDICTING AVERAGE DEMAND

cicada’s predictions outperform VOC-style predictions (median error decreases by 90%) and

require no customer input(71% for L2 error)

0 1 2 3 4 5

CicadaVOC (Oracle)

-1 0 1 2 3 4 5

CicadaVOC (Oracle)

lacurts@mit|13

RESULTS

cicada occasionally under-predicts; VOC does not because we selected its parameters to

avoid under-prediction

PREDICTING AVERAGE DEMAND

0 1 2 3 4 5

CicadaVOC (Oracle)

-1 0 1 2 3 4 5

CicadaVOC (Oracle)

lacurts@mit|13

RESULTS

cicada occasionally under-predicts; VOC does not because we selected its parameters to

avoid under-prediction

PREDICTING AVERAGE DEMAND

cicada’s predictions typically require only 1-2 hours of

application history

0 1 2 3 4 5

CicadaVOC (Oracle)

-1 0 1 2 3 4 5

CicadaVOC (Oracle)

lacurts@mit|14

PREDICTIONS→GUARANTEEScicada turns its predictions into pipe-model guarantees; a separate guarantee for each (source, destination) pair

lacurts@mit|14

network resources must be available

lacurts@mit|14

network resources must be available customers can add a buffer to the offered guarantees

accept cicada’s guarantee, but add 5% more bandwidth

lacurts@mit|14

the provider may choose to augment predictions that cicada is not confident about

typically not confident for “small” applications

customers can add a buffer to the offered guarantees

lacurts@mit|14

the provider may choose to augment predictions that cicada is not confident about

typically not confident for “small” applications

customers can add a buffer to the offered guarantees

cicada can detect when a guarantee is too low, and offer a new one

observe packet drops

lacurts@mit|15

NETWORK UTILIZATIONdo cicada’s bandwidth guarantees improve network utilization?

lacurts@mit|15

wasted bandwidth: bandwidth that is guaranteed for an application, but not used by the application

lacurts@mit|15

provider’s goal: minimize wasted bandwidth

lacurts@mit|15

intra-rack bandwidth is usually cheap

(sometimes free)

lacurts@mit|15

(sometimes free)

inter-rack bandwidth is more expensive

lacurts@mit|15

provider’s goal: minimize wasted inter-rack bandwidth

(sometimes free)

inter-rack bandwidth is more expensive

lacurts@mit|16

NETWORK UTILIZATION

100000

200000

300000

400000

500000

600000

1 1.4142 1.5 1.666 1.75 2 4 8 16

Oversubscription Factor

cicada’s greedy placement heuristic: place the pairs of VMs that need the highest guarantees on the fastest paths (see paper for details)

lacurts@mit|16

NETWORK UTILIZATION

100000

200000

300000

400000

500000

600000

1 1.4142 1.5 1.666 1.75 2 4 8 16

Availab

bw factor 1x = VOC’s placement [1]

lacurts@mit|16

NETWORK UTILIZATION

= VOC’s placement [1]= cicada’s placement

100000

200000

300000

400000

500000

600000

1 1.4142 1.5 1.666 1.75 2 4 8 16

bw factor 1x

lacurts@mit|16

NETWORK UTILIZATION

100000

200000

300000

400000

500000

600000

1 1.4142 1.5 1.666 1.75 2 4 8 16

bw factor 1xbw factor 25x

lacurts@mit|16

NETWORK UTILIZATION

100000

200000

300000

400000

500000

600000

1 1.4142 1.5 1.666 1.75 2 4 8 16

Availab

bw factor 250x

lacurts@mit|16

NETWORK UTILIZATION

100000

200000

300000

400000

500000

600000

1 1.4142 1.5 1.666 1.75 2 4 8 16

Availab

bw factor 250x

as applications use more bandwidth, VOC’s placement wastes more inter-rack bandwidth

lacurts@mit|16

NETWORK UTILIZATION

100000

200000

300000

400000

500000

600000

1 1.4142 1.5 1.666 1.75 2 4 8 16

Availab

bw factor 250x

in some scenarios, cicada’s placement more than doubles the amount

of available inter-rack bandwidth

as applications use more bandwidth, VOC’s placement wastes more inter-rack bandwidth

lacurts@mit|17

SUMMARY

accurate

cloud applications exhibit variability that existing models don’t capture

cicada captures this variability, and provides guarantees that are

calculated quickly*

require little history

increase network utilization

lacurts@mit|17

SUMMARY

accurate

calculated quickly*

improve relative error by 90%

lacurts@mit|17

SUMMARY

accurate

calculated quickly*

< 10milliseconds per application

lacurts@mit|17

SUMMARY

accurate

calculated quickly*

1--2 hours for most applications

lacurts@mit|17

SUMMARY

accurate

calculated quickly*

1--2 hours for most applications

in some cases, inter-rack utilization is doubled

cicada - usenix · 1 vm 2 vm 3 vm 4 vm 5vm 6 vm 7 vm 8 vm 9 vm 2 vm 3 vm 4 vm 5 vm 6 vm 7 vm 8 vm 9...

Documents

mechanical valve series vm - rs components...

vm chronicles no.7 a seventh series of extraordinary...

cisco unified computing system · virtualization step2 10...

mixers 3d nutating shakermodel vm-03u color vm-03ru vm-03gu...

version 7 release 1 z/vm · the dfsms command authorization...

capacity improvement bulk drug...

version 7 release 1 z/vm

7 essentials for better vm protection

pbtpj.inpbtpj.in/seniorityuploads/65.pdftpj tpj tpj vm tpj...

version 7 release 1 z/vm · creating a hard link.....99

version 7 release 1 z/vm - ibm · 9/12/2018 · z/vm...

2020 corvette pricing - chevrolet · 1lt (continued) 5df...

pe-cons 7/17 av/nc/vm en - europa

oracle vm 3 - 2015.hroug.hr · building the case for...

architektura+systemu+ opencontrail+ - semihalf ·...

transforming server virtualization with cisco vn-link...#3...

version 7 release 1 z/vm · 06/12/2019 · figures. xvii

vm-7 monitoring system...turbine supervisory instrument...

vm‑4hdt, vm-3hdt, vm-2hdt...although this user manual...

version 7 release 1 z/vm - ibm · 2018-12-11 · summary of...