The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers
![Page 1: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/1.jpg)
1
The Only Constant is Change: Incorporating Time-Varying Bandwidth
Reservations in Data Centers
Di Xie, Ning Ding, Y. Charlie Hu, Ramana Kompella
![Page 2: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/2.jpg)
2
Cloud Computing is Hot
Private Cluster
![Page 3: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/3.jpg)
3
Key Factors for Cloud Viability
• Cost
• Performance
![Page 4: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/4.jpg)
4
Performance Variability in Cloud
• BW variation in cloud due to contention [Schad’10 VLDB]
• Causing unpredictable performance
[Figure: bandwidth (Mbps), axis 0 to 1000, measured on a local cluster vs. Amazon EC2]
![Page 5: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/5.jpg)
Network performance variability
• Data analytics on an isolated cluster
– Completion time: 4 hours
[Diagram: tenant submits a MapReduce job to the enterprise cluster and receives results]
![Page 6: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/6.jpg)
Network performance variability
• Data analytics in a multi-tenant datacenter
– Completion time: 10-16 hours
[Diagram: tenant submits a MapReduce job to the shared datacenter and receives results]
![Page 7: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/7.jpg)
Network performance variability
• Variable tenant costs
– Expected cost (based on 4-hour completion time) = $100
– Actual cost = $250-400
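The cost figures follow from the completion times if the tenant pays a flat hourly rate. The sketch below works through that arithmetic; the linear pricing assumption and the name `job_cost` are illustrative and are not stated on the slide.

```python
# Assumption (not from the slide): cost scales linearly with run time.
EXPECTED_COST = 100.0    # dollars, from the slide
EXPECTED_HOURS = 4.0     # hours, from the slide

# Implied flat rate: $100 / 4 h = $25 per hour.
HOURLY_RATE = EXPECTED_COST / EXPECTED_HOURS


def job_cost(hours: float) -> float:
    """Cost of a job that runs for `hours` at the implied hourly rate."""
    return HOURLY_RATE * hours


# Completion time in the multi-tenant datacenter: 10 to 16 hours.
low, high = job_cost(10), job_cost(16)
print(f"${low:.0f}-{high:.0f}")
```

Under this assumption the 10-16 hour range maps to the $250-400 range quoted above.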
![Page 8: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/8.jpg)
Network performance variability
• Unpredictability of application performance and tenant costs is a key hindrance to cloud adoption
• Key contributor: network performance variation
![Page 9: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/9.jpg)
9
Reserving BW in Data Centers
• SecondNet [Guo’10]
– Per VM-pair, per VM access bandwidth reservation
• Oktopus [Ballani’11]
– Virtual Cluster (VC)
– Virtual Oversubscribed Cluster (VOC)
![Page 10: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/10.jpg)
10
How BW Reservation Works
1. Determine the model
2. Allocate and enforce the model
[Diagram: Virtual Cluster model, N VMs connected to one virtual switch; request <N, B>]
[Figure: bandwidth vs. time, constant reservation B from 0 to T]
Only fixed-BW reservation
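A fixed-bandwidth request <N, B> can be written down as a small record. The sketch below is illustrative only: `VCRequest` and `link_reservation` are hypothetical names, and the per-link rule min(m, N - m) * B is the hose-model rule described for Oktopus virtual clusters, not something stated on this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VCRequest:
    """Fixed-bandwidth virtual cluster request <N, B> (hypothetical type)."""
    n_vms: int        # N: number of VMs
    bw_mbps: float    # B: bandwidth reserved per VM, in Mbps


def link_reservation(req: VCRequest, vms_below: int) -> float:
    """Bandwidth to reserve on a link that separates `vms_below` of the
    job's VMs from the remaining N - vms_below VMs.

    Traffic across the link is limited by the smaller side, so the
    reservation is min(m, N - m) * B.
    """
    if not 0 <= vms_below <= req.n_vms:
        raise ValueError("vms_below must be between 0 and N")
    return min(vms_below, req.n_vms - vms_below) * req.bw_mbps


req = VCRequest(n_vms=4, bw_mbps=500)
print(link_reservation(req, 2))   # two of four VMs behind the link
```

With N = 4 and B = 500 Mbps, placing two VMs behind one link reserves 1000 Mbps on it for the whole job duration, whether or not the job is sending at that moment.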
![Page 11: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/11.jpg)
11
Network Usage for MapReduce Jobs
Hadoop Sort, 4GB per VM
![Page 12: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/12.jpg)
12
Network Usage for MapReduce Jobs
Hadoop Word Count, 2GB per VM
![Page 13: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/13.jpg)
13
Network Usage for MapReduce Jobs
Hive Join, 6GB per VM
![Page 14: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/14.jpg)
14
Network Usage for MapReduce Jobs
Hive Aggregation, 2GB per VM
![Page 15: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/15.jpg)
15
Network Usage for MapReduce Jobs
Time-varying network usage
![Page 16: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/16.jpg)
16
Motivating Example
• 4 machines, 2 VMs/machine, non-oversubscribed network
• Hadoop Sort
– N: 4 VMs
– B: 500Mbps/VM
[Diagram: 1Gbps links, 500Mbps reserved per VM; annotation: not enough BW]
![Page 17: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/17.jpg)
17
Motivating Example
[Diagram: same setup, 1Gbps links with 500Mbps reserved per VM]
![Page 18: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/18.jpg)
18
Under Fixed-BW Reservation Model
[Figure: Virtual Cluster model; labels: 1Gbps, 500Mbps; bandwidth (500) vs. time (0 to 30) for Job1, Job2, Job3]
![Page 19: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/19.jpg)
19
Under Time-Varying Reservation Model
[Figure: TIVC model for Hadoop Sort; labels: 1Gbps, 500Mbps; bandwidth (500) vs. time (0 to 30), with Job1 through Job5 (J1 to J5) interleaved]
Doubles VM utilization, network utilization, and job throughput
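The gain from interleaving comes from checking link capacity per time slot instead of against each job's peak. The sketch below contrasts the two admission checks; the profiles, the slot granularity, and the function names are hypothetical and chosen only to make the difference visible.

```python
from typing import List, Sequence


def fits_fixed(capacity: float, profiles: Sequence[List[float]]) -> bool:
    """Fixed-BW check: every job reserves its peak for its whole duration."""
    return sum(max(p) for p in profiles) <= capacity


def fits_time_varying(capacity: float, profiles: Sequence[List[float]]) -> bool:
    """Time-varying check: sum the reservations slot by slot."""
    horizon = max(len(p) for p in profiles)
    for t in range(horizon):
        load = sum(p[t] for p in profiles if t < len(p))
        if load > capacity:
            return False
    return True


# Hypothetical per-slot link reservations in Mbps for two jobs whose
# network-heavy phases do not overlap.
job1 = [1000] * 5 + [0] * 5
job2 = [0] * 5 + [1000] * 5

print(fits_fixed(1000, [job1, job2]))          # peaks add up to 2000 Mbps
print(fits_time_varying(1000, [job1, job2]))   # never more than 1000 Mbps per slot
```

On a 1Gbps link the fixed check rejects the second job, while the per-slot check admits both because their bursts are offset in time.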
![Page 20: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/20.jpg)
20
Temporally-Interleaved Virtual Cluster (TIVC)
• Key idea: Time-Varying BW Reservations
• Compared to fixed-BW reservation
– Improves utilization of data center
  • Better network utilization
  • Better VM utilization
– Increases cloud provider’s revenue
– Reduces cloud user’s cost
– Without sacrificing job performance
![Page 21: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/21.jpg)
21
Challenges in Realizing TIVC
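[Diagram: Virtual Cluster model, N VMs connected to one virtual switch]
[Figure: bandwidth vs. time, constant reservation B from 0 to T; request <N, B>]
[Figure: bandwidth vs. time, time-varying reservation from 0 to T; request <N, B(t)>]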
Q1: What are the right model functions?
Q2: How to automatically derive the models?
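One candidate shape for B(t), consistent with the pulse functions named in the conclusion, is a base rate plus rectangular pulses at a cap rate. The sketch below is a hypothetical representation of a request <N, B(t)>; the type name, field names, and numbers are assumptions and are not taken from the slides.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TIVCRequest:
    """Time-varying request <N, B(t)> with a pulse-shaped B(t) (hypothetical)."""
    n_vms: int                                 # N
    base_mbps: float                           # rate outside pulses
    cap_mbps: float                            # rate inside pulses
    pulses: Tuple[Tuple[float, float], ...]    # half-open [start, end) in seconds

    def bw_at(self, t: float) -> float:
        """Reserved per-VM bandwidth at time t."""
        for start, end in self.pulses:
            if start <= t < end:
                return self.cap_mbps
        return self.base_mbps


# Illustrative values only.
req = TIVCRequest(n_vms=4, base_mbps=50, cap_mbps=500, pulses=((10, 20), (40, 45)))
print([req.bw_at(t) for t in (0, 10, 20, 42)])
```

Keeping B(t) piecewise constant means an allocator only has to look at the pulse boundaries rather than at every instant.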
![Page 22: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/22.jpg)
22
Challenges in Realizing TIVC
Q3: How to efficiently allocate TIVC?
Q4: How to enforce TIVC?
![Page 23: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/23.jpg)
23
Challenges in Realizing TIVC
• What are the right model functions?
• How to automatically derive the models?
• How to efficiently allocate TIVC?
• How to enforce TIVC?
![Page 24: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/24.jpg)
24
How to Model Time-Varying BW?
Hadoop Hive Join
![Page 25: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/25.jpg)
25
TIVC Models
[Figure: TIVC model types shown next to the Virtual Cluster model; pulse timing labels such as T11 and T32]
![Page 26: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/26.jpg)
26
Hadoop Sort
![Page 27: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/27.jpg)
27
Hadoop Word Count
![Page 28: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/28.jpg)
28
Hadoop Hive Join
![Page 29: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/29.jpg)
29
Hadoop Hive Aggregation
![Page 30: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/30.jpg)
30
Challenges in Realizing TIVC
What are the right model functions?
• How to automatically derive the models?
• How to efficiently allocate TIVC?
• How to enforce TIVC?
![Page 31: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/31.jpg)
31
Possible Approach
• “White-box” approach
– Given source code and data of cloud application, analyze quantitative networking requirement
– Very difficult in practice
• Observation: Many jobs are repeated many times
– E.g., 40% of jobs are recurring in Bing’s production data center [Agarwal’12]
– Of course, data itself may change across runs, but size remains about the same
![Page 32: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/32.jpg)
32
Our Approach
• Solution: “Black-box” profiling-based approach
1. Collect traffic trace from profiling run
2. Derive TIVC model from traffic trace
• Profiling: Same configuration as production runs
– Same number of VMs
– Same input data size per VM
– Same job/VM configuration
How much BW should we reserve for the application?
![Page 33: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/33.jpg)
33
Impact of BW Capping
No-elongation BW threshold
![Page 34: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/34.jpg)
34
Choosing BW Cap
• Tradeoff between performance and cost
– Cap > threshold: same performance, costs more
– Cap < threshold: lower performance, may cost less
• Our approach: Expose tradeoff to user
1. Profile under different BW caps (only caps below the threshold)
2. Expose run times and cost to user
3. User picks the appropriate BW cap
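The three steps amount to building a small table of cap, run time, and cost for the user to choose from. The sketch below shows one way to assemble it; the profiling numbers and the pricing model (a per-VM hourly charge plus a per-Mbps hourly charge) are invented for illustration and do not come from the talk.

```python
from typing import List, Tuple


def tradeoff_table(
    profiles: List[Tuple[float, float]],   # (BW cap in Mbps, run time in hours)
    n_vms: int,
    vm_price_per_hour: float,              # hypothetical price
    bw_price_per_mbps_hour: float,         # hypothetical price
) -> List[Tuple[float, float, float]]:
    """Return (cap, hours, cost) rows sorted by cap."""
    rows = []
    for cap, hours in sorted(profiles):
        hourly = n_vms * (vm_price_per_hour + bw_price_per_mbps_hour * cap)
        rows.append((cap, hours, round(hourly * hours, 2)))
    return rows


# Hypothetical profiling runs under three caps.
profiles = [(100, 2.0), (200, 1.25), (300, 1.0)]
rows = tradeoff_table(profiles, n_vms=10,
                      vm_price_per_hour=0.10, bw_price_per_mbps_hour=0.001)
for cap, hours, cost in rows:
    print(f"cap={cap} Mbps  time={hours} h  cost=${cost}")
```

With these made-up numbers the middle cap is cheapest, which is the kind of non-obvious outcome that motivates showing the user the whole table rather than picking a cap for them.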
![Page 35: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/35.jpg)
35
From Profiling to Model Generation
• Collect traffic trace from each VM
– Instantaneous throughput of 10ms bin
• Generate models for individual VMs
• Combine to obtain overall job’s TIVC model
– Simplify allocation by working with one model
– Does not lose efficiency since per-VM models are roughly similar for MapReduce-like applications
![Page 36: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/36.jpg)
36
Generate Model for Individual VM
1. Choose Bb
2. Periods where B > Bb, set to Bcap
[Figure: BW vs. time, with levels Bb and Bcap marked]
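The two steps can be sketched as a pass over the binned trace: bins above Bb are grouped into pulses, and the reservation is Bcap inside a pulse and Bb elsewhere. This is a minimal sketch under stated assumptions; the function names and the sample trace are hypothetical, and how Bb is chosen is left open, as on the slide.

```python
from typing import List, Tuple


def derive_pulses(trace: List[float], b_base: float,
                  bin_ms: int = 10) -> List[Tuple[int, int]]:
    """Group consecutive bins with throughput > b_base into pulses.

    Returns half-open (start_ms, end_ms) intervals.
    """
    pulses = []
    start = None
    for i, throughput in enumerate(trace):
        if throughput > b_base:
            if start is None:
                start = i                       # a pulse begins
        elif start is not None:
            pulses.append((start * bin_ms, i * bin_ms))
            start = None                        # the pulse ended
    if start is not None:                       # trace ended inside a pulse
        pulses.append((start * bin_ms, len(trace) * bin_ms))
    return pulses


def reserved_bw(t_ms: float, pulses: List[Tuple[int, int]],
                b_base: float, b_cap: float) -> float:
    """Reservation at time t: b_cap inside a pulse, b_base otherwise."""
    return b_cap if any(s <= t_ms < e for s, e in pulses) else b_base


# Hypothetical per-bin throughput in Mbps, one sample per 10ms bin.
trace = [10, 20, 300, 450, 30, 10, 500, 480, 20]
pulses = derive_pulses(trace, b_base=50)
print(pulses)
print(reserved_bw(25, pulses, b_base=50, b_cap=500))
```

A real trace at 10ms resolution would likely need smoothing or a minimum pulse length so that short spikes do not fragment the model; that step is omitted here.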
![Page 37: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/37.jpg)
37
Challenges in Realizing TIVC
What are the right model functions?
How to automatically derive the models?
• How to efficiently allocate TIVC?
• How to enforce TIVC?
![Page 38: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/38.jpg)
38
TIVC Allocation Algorithm
• Spatio-temporal allocation algorithm
– Extends VC allocation algorithm to time dimension
– Employs dynamic programming
– Chooses lowest level subtree
• Properties
– Locality aware
– Efficient and scalable
  • 99th percentile 28ms on a 64,000-VM data center in scheduling 5,000 jobs
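The "lowest level subtree" rule is what makes the allocation locality aware: a job goes into the smallest part of the tree that can hold it. The sketch below illustrates only that rule on free VM slots. It deliberately leaves out the bandwidth checks, the time dimension, and the dynamic program, and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A machine (leaf) or a switch (inner node) in the topology tree."""
    name: str
    free_slots: int = 0                        # used only for leaves
    children: List["Node"] = field(default_factory=list)

    def total_free(self) -> int:
        if not self.children:
            return self.free_slots
        return sum(c.total_free() for c in self.children)

    def height(self) -> int:
        if not self.children:
            return 0
        return 1 + max(c.height() for c in self.children)


def lowest_subtree(root: Node, n_vms: int) -> Optional[Node]:
    """Return a lowest subtree with at least n_vms free slots, or None."""
    best = None
    stack = [root]
    while stack:
        node = stack.pop()
        if node.total_free() < n_vms:
            continue                           # no descendant can fit either
        if best is None or node.height() < best.height():
            best = node
        stack.extend(node.children)
    return best


rack1 = Node("rack1", children=[Node("m1", 1), Node("m2", 1)])
rack2 = Node("rack2", children=[Node("m3", 2), Node("m4", 0)])
root = Node("root", children=[rack1, rack2])
print(lowest_subtree(root, 2).name)   # fits on a single machine
print(lowest_subtree(root, 3).name)   # must span both racks
```

Preferring the lowest subtree keeps traffic off upper-level links, which is consistent with the later observation that most small jobs fit into a rack.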
![Page 39: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/39.jpg)
39
Challenges in Realizing TIVC
What are the right model functions?
How to automatically derive the models?
How to efficiently allocate TIVC?
• How to enforce TIVC?
![Page 40: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/40.jpg)
40
Enforcing TIVC Reservation
• Possible to enforce completely in hypervisor
– Does not have control over upper level links
– Requires online rate monitoring and feedback
– Increases hypervisor overhead and complexity
• Enforcing BW reservation in switches
– Most small jobs will fit into a rack
– Only a few large jobs cross the core
– Avoid complexity in hypervisors
![Page 41: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/41.jpg)
41
Challenges in Realizing TIVC
What are the right model functions?
How to automatically derive the models?
How to efficiently allocate TIVC?
How to enforce TIVC?
![Page 42: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/42.jpg)
42
Proteus: Implementing TIVC Models
1. Determine the model
2. Allocate and enforce the model
![Page 43: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/43.jpg)
43
Evaluation
• Large-scale simulation
– Performance
– Cost
– Allocation algorithm
• Prototype implementation
– Small-scale testbed
![Page 44: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/44.jpg)
44
Simulation Setup
• 3-level tree topology
– 16,000 hosts x 4 VMs
– 4:1 oversubscription
• Workload
– N: exponential distribution around mean 49
– B(t): derived from real Hadoop apps
[Diagram: 3-level tree; labels: 20 Aggr switches, 20 ToR switches, 40 hosts; link speeds 50Gbps, 10Gbps, 1Gbps from top to bottom]
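The topology numbers can be cross-checked with a few lines of arithmetic. The sketch below assumes the diagram labels mean 40 hosts per ToR switch, 20 ToR switches per aggregation switch, and 20 aggregation switches, with 1Gbps host links, 10Gbps ToR uplinks, and 50Gbps aggregation uplinks; that reading is inferred from the labels rather than stated outright.

```python
# Assumed reading of the diagram labels (see lead-in).
HOSTS_PER_TOR = 40
TORS_PER_AGGR = 20
AGGR_SWITCHES = 20
VMS_PER_HOST = 4

HOST_LINK_GBPS = 1
TOR_UPLINK_GBPS = 10
AGGR_UPLINK_GBPS = 50

hosts = HOSTS_PER_TOR * TORS_PER_AGGR * AGGR_SWITCHES
vms = hosts * VMS_PER_HOST

# Oversubscription = aggregate downstream capacity / uplink capacity.
tor_oversub = HOSTS_PER_TOR * HOST_LINK_GBPS / TOR_UPLINK_GBPS
aggr_oversub = TORS_PER_AGGR * TOR_UPLINK_GBPS / AGGR_UPLINK_GBPS

print(hosts, vms, tor_oversub, aggr_oversub)
```

Under this reading the host count, the 64,000-VM figure quoted for the allocation algorithm, and the 4:1 oversubscription at both switch levels all agree.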
![Page 45: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/45.jpg)
45
Batched Jobs
• Scenario: 5,000 time-insensitive jobs
[Figure: completion time reduction of 42%, 21%, 23%, and 35% across workloads; the mixed workload has 1/3 of each type]
All remaining results are for the mixed workload
![Page 46: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/46.jpg)
46
Varying Oversubscription and Job Size
25.8% reduction for non-oversubscribed network
![Page 47: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/47.jpg)
47
Dynamically Arriving Jobs
• Scenario: Accommodate users’ requests in shared data center
– 5,000 jobs arrive dynamically with varying loads
Rejected: VC: 9.5%, TIVC: 3.4%
![Page 48: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/48.jpg)
48
Analysis: Higher Concurrency
• Under 80% load
– 7% higher job concurrency
– 28% higher VM utilization (rejected jobs are large)
– 28% higher revenue (when charging for VMs)
![Page 49: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/49.jpg)
49
Testbed Experiment
• Setup
– 18 machines
• 30 real MapReduce jobs
– 10 Sort
– 10 Hive Join
– 10 Hive Aggregation
![Page 50: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/50.jpg)
50
Testbed Result
• TIVC finishes jobs faster than VC; Baseline finishes the fastest
• Baseline suffers from variability in completion time; TIVC achieves performance similar to VC
![Page 51: The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers](https://reader035.vdocument.in/reader035/viewer/2022070405/56813d3a550346895da6fa9d/html5/thumbnails/51.jpg)
51
Conclusion
• Network reservations in cloud are important
– Previous work proposed fixed-BW reservations
– However, cloud apps exhibit time-varying BW usage
• They propose TIVC abstraction
– Provides time-varying network reservations
– Uses simple pulse functions
– Automatically generates model
– Efficiently allocates and enforces reservations
• Proteus shows TIVC benefits both cloud provider and users significantly