How to Tame your VM: an Automated Control System for Virtualized Services
Akkarit Sangpetch Andrew Turner Hyong Kim asangpet@andrew.cmu.edu andrewtu@andrew.cmu.edu kim@ece.cmu.edu
Department of Electrical and Computer Engineering
Carnegie Mellon University
LISA 2010 - November 11, 2010
Virtualized Services
• Virtualized Infrastructure (vSphere,Hyper-V,ESX,KVM)
– Easy to deploy
– Easy to migrate
– Easy to re-use virtual machines
• Multi-tier services are configured and deployed as virtual machines
[Diagram: three KVM hosts, each hypervisor running a mix of web / app / DB VMs (Web1–3, App1–3, DB1–3) from different multi-tier services]
Problem: What about performance?
• VMs share physical resource (CPU time / Disk / Network / Memory bandwidth)
• Need to maintain service quality
– Response time / Latency
• Existing infrastructure provides a mismatched interface between the service's performance goal (latency) and the configurable parameters (resource shares)
– Admins need to fully understand the services before they can tune the system
Existing Solutions – Managing Service Performance
• Resource Provisioning
– Admins define resource shares / reservations / limits
– The “correct” values depend on hardware / applications / workloads
• Load Balancing
– Migrate VMs to balance host utilization
– Free up capacity for each host (low utilization ~ better performance)
[Diagram: load balancing migrates VMs (vm1–vm5) between Host 1 and Host 2; inset table of per-VM CPU shares, e.g. vm1: 50, vm2: 10, vm3: 30, vm4: 10]
Automated Resource Control
• Dynamically adjust resource shares during run-time
• Input = required service response time
– e.g. “I want to serve pages from vm1 in less than 4 seconds”
• The system calculates the number of shares required to maintain the service level
– Adapts to changes in workload
– Admins do not have to guess the number of shares
[Diagram: the control system takes each VM's response-time target and measured response (vm1–vm4) and reallocates CPU shares, e.g. from a uniform 25/25/25/25 split to 50/10/10/30]
Control System
[Diagram: a central Controller and Modeler coordinate a Sensor and an Actuator unit on each KVM host running VMs]
• Sensor – records service response time
• Actuator – adjusts host CPU shares
• Modeler – calculates model parameters
• Controller – calculates the # of shares required
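One iteration of this feedback loop can be sketched in Python; the object interfaces below (read_response_time, update, allocate, apply) are hypothetical stand-ins for the four units, not the authors' actual API:

```python
def control_step(sensors, modeler, controller, actuators, targets):
    """One period of the loop: sense, model, decide, actuate.
    All interfaces here are hypothetical sketches of the four units."""
    # Sensor units: record each service's measured response time
    readings = {vm: s.read_response_time() for vm, s in sensors.items()}
    # Modeler: refit the service parameters <a0, b0, a1, b1> from readings
    model = modeler.update(readings)
    # Controller: compute the # of shares needed to meet each target
    shares = controller.allocate(model, targets)
    # Actuator units: push the new CPU shares down to each host
    for actuator in actuators:
        actuator.apply(shares)
    return shares
```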
Sensor unit
[Diagram: request packets destined for the Server VM are mirrored on the host bridge via xtables TEE to a Sensor VM running tshark, which forwards records to the Modeler's analyzer, e.g. {service: blog, tier: web, vm: www1, response time: 250 ms}, {service: blog, tier: db, vm: db1, response time: 2.5 ms}]
• Records service response time
– Starts when the server sees the “request” packet
– Ends when the server sends the “response” packet
• Based on libpcap / tshark – needs to recognize the application's header
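The request/response timestamp pairing the sensor performs can be sketched as follows; the event-tuple format is a hypothetical stand-in for what a tshark/libpcap dissector would emit, not the paper's actual record layout:

```python
def response_times(events):
    """Compute per-request service times from a packet-event stream.
    Each event is (timestamp_s, direction, stream_id), where direction is
    "req" (server sees a request packet) or "resp" (server sends the response).
    Hypothetical format standing in for parsed tshark/libpcap output."""
    pending = {}
    latencies = []
    for ts, direction, stream in events:
        if direction == "req":
            pending.setdefault(stream, ts)   # keep the earliest request packet
        elif direction == "resp" and stream in pending:
            latencies.append(ts - pending.pop(stream))
    return latencies
```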
Actuator Unit
[Diagram: the controller sends a shares allocation, e.g. {server1: 80, server2: 20}; the Actuator Agent on the hypervisor updates each server VM's CPU cgroup, changing the shares (e.g. 50/50 → 80/20)]
• Adjusts the number of shares based on the controller's allocation
• We use Linux/KVM as our hypervisor
– cgroup cpu.shares controls CFS scheduling shares
• Similar mechanisms exist on other platforms (Xen weight / ESX shares)
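A minimal sketch of such an actuator on Linux, assuming cgroup v1 with the cpu controller mounted at /sys/fs/cgroup/cpu and one cgroup named per VM (a hypothetical layout; real hypervisor stacks such as libvirt arrange their own hierarchy):

```python
import os

CGROUP_ROOT = "/sys/fs/cgroup/cpu"   # cgroup-v1 CPU controller (assumed mount point)

def shares_writes(allocation, root=CGROUP_ROOT):
    """Map a controller allocation, e.g. {"server1": 80, "server2": 20},
    to a list of (file, value) writes against the assumed cgroup layout."""
    return [(os.path.join(root, vm, "cpu.shares"), str(shares))
            for vm, shares in sorted(allocation.items())]

def apply_shares(allocation):
    """Perform the writes; requires root and the cgroups to already exist."""
    for path, value in shares_writes(allocation):
        with open(path, "w") as f:
            f.write(value)
```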
Modeler
[Diagram: Client → Web server → Database; Tweb is the web tier's response time, TDB the database's, Scpu the DB VM's CPU shares]
• Tweb = f(TDB), TDB = f(Scpu)
• The modeler calculates the expected service response time based on the shares allocated to each VM
• We use an intuitive model for our 2-tier service in the experiment
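Fitting the two model forms used later (TDB = a0·Scpu^b0 and Tweb = a1·TDB + b1) can be sketched with least squares, the power law via linear regression in log-log space; this is an assumed fitting method, not necessarily the authors' exact procedure:

```python
import numpy as np

def fit_db_model(shares, t_db):
    """Fit T_DB = a0 * S_cpu**b0 by linear regression in log-log space."""
    b0, log_a0 = np.polyfit(np.log(shares), np.log(t_db), 1)
    return np.exp(log_a0), b0

def fit_web_model(t_db, t_web):
    """Fit T_web = a1 * T_DB + b1 by ordinary least squares."""
    a1, b1 = np.polyfit(t_db, t_web, 1)
    return a1, b1
```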
Resource Shares Effect
[Testbed: two hosts running a web server VM (WordPress), an SQL server VM (MySQL), and a cruncher VM (100% CPU load), with shared NAS storage and external test clients]
SQL Server CPU Allocation effect
[Plot: web response time (ms) and # of DB CPU shares (out of 1000) vs. time step (5 s); series: thttp-msec, cpu]
[Plot: database response time (ms) and # of DB CPU shares (out of 1000) vs. time step (5 s); series: tdb-msec, cpu]
Database Response vs CPU
[Scatter plot: database response time (ms) vs. # of CPU shares allocated to the SQL VM (out of 1000)]
TDB = a0 (Scpu)^b0
Database Response vs CPU (controlled)
[Scatter plot: database response time (ms) vs. # of CPU shares allocated to the SQL server VM (out of 1000), under control]
HTTP vs Database response
[Scatter plot: web response time (ms) vs. database response time (ms)]
Tweb = a1 (TDB) + b1
Controllers
• The model uses readings to estimate the service parameters <a0, b0, a1, b1>
– TDB = a0 (Scpu)^b0
– Tweb = a1 (TDB) + b1
• The controller finds the minimal Scpu such that Tweb < the specified response time
– Long-term control: uses a moving average to find model parameters & shares
– Short-term control: uses the last-period reading to find model parameters, avoiding excessive response-time violations while the long-term model settles
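Inverting the fitted model gives the controller's share computation: solve the web model for the allowable DB latency, then the DB model for the shares. A sketch, assuming b0 < 0 (more shares → lower DB latency) and a hypothetical 1000-share budget:

```python
import math

def min_cpu_shares(t_target, a0, b0, a1, b1, max_shares=1000):
    """Smallest S_cpu with T_web = a1 * (a0 * S_cpu**b0) + b1 <= t_target.
    Assumes b0 < 0; the 1000-share budget is a hypothetical default."""
    t_db_max = (t_target - b1) / a1      # DB latency budget from the web model
    if t_db_max <= 0:
        return max_shares                # target infeasible; pin to the maximum
    s = (t_db_max / a0) ** (1.0 / b0)    # invert T_DB = a0 * S_cpu**b0
    return max(1, min(max_shares, math.ceil(s - 1e-9)))  # round up, clamp
```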
System Evaluations
[Testbed: three KVM hosts; Web server 1 + DB server 1 VMs and Web server 2 + DB server 2 VMs spread across Hosts 1–3, each host running a sensor and one running the actuator, with shared NAS storage and external test clients]
Static Workload
[Two plots: service response time (ms) vs. time step for web1, web1-target, web2, web2-target]
Dynamic Workload
[Two plots: service response time (ms) vs. time step for web1, web1-target, web2, web2-target, with 50-period moving averages of web1 and web2; the Web1 load varies over the run]
Deviation from Target

                        Without Control              With Control
                  Mean Dev.    Violation        Mean Dev.    Violation
                  (ms)         Period           (ms)         Period
Static Load
  Instance 1       540           0%              109 (-80%)    9%
  Instance 2      1043         100%              282 (-73%)   26%
Dynamic Load
  Instance 1       276           7%              182 (-34%)    9%
  Instance 2       354          27%              336 (-5%)    12%
Our control system helps track the target service response time
Further Improvements
• Enhance the service model
– Service dependencies (cache / proxy / load-balancer effects)
– Non-CPU resources (disk / network / memory)
• More robust control system
– Still susceptible to temporal system effects (garbage collection / cron jobs)
– Need to determine the feasibility of the target performance
Conclusions
• Automated resource control system
– Maintains the target service's response time
– Allows admins to express allocation in terms of a performance target
• Managing service performance can be automated with sufficient domain knowledge
– Need a basic framework to describe the performance model
– Leave the details to the machines (parameter fitting / optimization / resource monitoring)
Sample Response & Share values
[Plot: web response time tWeb with p95 and target lines vs. time step]
[Plot: allocated CPU shares vs. time step]