01/22/09 ICDCS2006 1
Load Unbalancing to Improve Performance under Autocorrelated Traffic
Ningfang Mi, College of William and Mary
Joint work with:
- Qi Zhang, College of William and Mary
- Alma Riska, Seagate Research
- Evgenia Smirni, College of William and Mary
Outline
- Motivation
- Our solution
- Conclusion and future work
Clustered Servers
[Diagram: a front-end dispatcher performs load balancing across back-end nodes]
Workload: autocorrelated interarrival times, heavy-tailed service times
Load balancing policies:
- Round Robin (RR)
- Random
- Join Shortest Queue (JSQ)
- Join Shortest Weighted Queue (JSWQ)
- AdaptLoad
Performance?
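As a rough illustration of how the simpler policies on this slide choose a back-end node, here is a minimal Python sketch. The function names are hypothetical, and JSWQ is assumed here to mean "join the node with the least total queued work", which the slide itself does not spell out.

```python
import random
from itertools import cycle


def round_robin(num_nodes):
    """Return a chooser that cycles through nodes 0..num_nodes-1 in order."""
    order = cycle(range(num_nodes))
    return lambda: next(order)


def random_node(num_nodes, rng=random):
    """Pick a node uniformly at random."""
    return rng.randrange(num_nodes)


def join_shortest_queue(queues):
    """Pick the node with the fewest queued jobs (ties go to the lowest index)."""
    return min(range(len(queues)), key=lambda i: len(queues[i]))


def join_shortest_weighted_queue(queues):
    """Pick the node with the least total queued work.

    Assumption: each queue is a list of job sizes and the weight of a
    queue is the sum of those sizes.
    """
    return min(range(len(queues)), key=lambda i: sum(queues[i]))
```

Note that JSQ and JSWQ can disagree: a node holding one very large job has a short queue but a lot of outstanding work.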
Why Consider Dependence?
BAD performance effect
[Figure: autocorrelation function (ACF) vs. lag k, 0 to 500, for NOACF, SRD, and LRD traffic]
[Figure: response time vs. utilization, 0 to 1, for NOACF, SRD, and LRD traffic]
Higher ACF, higher dependence, worse performance
Effect of ACF on Load Balancing
[Figure: response time (log scale, 1 to 10000) under NOACF, SRD, and LRD traffic for AdaptLoad, JSWQ, JSQ, and RR]
Size-based policies do NOT win! WHY?
Possible Reason ... What is ACF in Each Node?
[Figure (SRD): ACF vs. lag k, 0 to 500, for the original stream and for servers 1 to 4]
Load + ACF
Outline
- Motivation
- Our solution
- Conclusion and future work
Solution
EqAL (Equally distribute work guided by Autocorrelation and Load)
- Balancing load: AdaptLoad
- Balancing ACF: move jobs from strongly correlated nodes to weakly correlated nodes
Review: AdaptLoad
- Each node only serves requests whose size falls in a certain range: $[s_0 \equiv 0, s_1), [s_1, s_2), \ldots, [s_{N-1}, s_N \equiv \infty)$
- The size ranges self-adjust by predicting the incoming workload from the histogram of previous requests
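The size-range rule above can be sketched as a binary search over the interior boundaries. This is a minimal illustration, not the authors' implementation; `node_for_size` and the example boundary values are hypothetical.

```python
from bisect import bisect_right


def node_for_size(size, interior_boundaries):
    """Map a request size to the index of the node that serves it.

    interior_boundaries holds s_1 .. s_{N-1} in increasing order; s_0 = 0
    and s_N = infinity are implicit. Node i (0-based) serves the half-open
    range [s_i, s_{i+1}), so a size equal to a boundary goes to the node
    on its right.
    """
    return bisect_right(interior_boundaries, size)


# Hypothetical boundaries for a 4-node cluster.
boundaries = [30, 80, 200]
assignments = [node_for_size(s, boundaries) for s in (25, 100, 57, 34, 22, 9, 210)]
```

With these assumed boundaries, small requests (below 30) all land on node 0 and only requests of size 200 or more reach node 3.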
Review: AdaptLoad (continued)
Step 1: Build the histogram on-line
e.g., request sizes in arrival order: 25 100 57 34 22 9 210 …
[Figure: histogram of total size per request size bin, growing as more requests are observed]
Step 2: At the end of each monitoring window, find the boundaries s0, s1, …, s4 that partition the total work (the area under the histogram) equally among Servers 1 to 4
[Figure: the histogram split into four equal-area regions, one per server, at boundaries s0 to s4]
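The two steps can be sketched as follows. This is a simplified illustration under stated assumptions: fixed-width bins (`bin_width` is a hypothetical parameter), boundaries placed only at bin edges, and no prediction across monitoring windows.

```python
def build_histogram(sizes, bin_width):
    """Step 1: accumulate the total requested size (work) per size bin."""
    hist = {}
    for s in sizes:
        b = int(s // bin_width)
        hist[b] = hist.get(b, 0) + s
    return hist


def equal_work_boundaries(hist, bin_width, num_servers):
    """Step 2: return interior boundaries s_1 .. s_{N-1}.

    Walk the bins in increasing size order and place a boundary at the
    right edge of the bin where the cumulative work first reaches
    1/N, 2/N, ... of the total. A single very heavy bin can yield
    repeated boundaries, which is a limitation of this bin-edge sketch.
    """
    total = sum(hist.values())
    target = total / num_servers
    boundaries = []
    cumulative = 0.0
    for b in sorted(hist):
        cumulative += hist[b]
        while (len(boundaries) < num_servers - 1
               and cumulative >= target * (len(boundaries) + 1)):
            boundaries.append((b + 1) * bin_width)
    return boundaries


hist = build_histogram([25, 100, 57, 34, 22, 9, 210], bin_width=10)
boundaries = equal_work_boundaries(hist, bin_width=10, num_servers=4)
```

For the seven example sizes from the slide (total work 457), a bin width of 10 gives interior boundaries 60, 110, and 220.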
S_EQAL
Server i increases its share of the work by $p_i$
Corrective factor $p_i$:
- $\sum_i p_i = 0$
- negative (reducing work) vs. positive (increasing work)
- $p_1 = -R$, where R is a pre-determined corrective constant
- the remaining $p_i$ are decided using a semi-geometric method
[Figure: the request-size histogram partitioned among Servers 1 to 4, shown with the equal-work boundaries and with the boundaries shifted by the corrective factors]
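The slide names a "semi-geometric method" without giving its details. The sketch below is one assumed scheme that satisfies the stated constraints ($p_1 = -R$ and $\sum_i p_i = 0$): each server in turn gives up a halving amount of work, split equally among the servers after it. It is an illustration only, not necessarily the authors' exact procedure.

```python
def corrective_factors(num_servers, R):
    """Assumed halving scheme for the corrective factors p_1 .. p_N.

    Server i gives up `adjust` of its work, which is split equally among
    servers i+1 .. N; `adjust` starts at R and halves at each step.
    R is a fraction (0.2 means 20%).
    """
    p = [0.0] * num_servers
    adjust = R
    for i in range(num_servers - 1):
        p[i] -= adjust
        remaining = num_servers - 1 - i
        for j in range(i + 1, num_servers):
            p[j] += adjust / remaining
        adjust /= 2
    return p


def work_shares(p):
    """Fraction of total work per server: (1 + p_i) / N instead of 1 / N."""
    n = len(p)
    return [(1 + pi) / n for pi in p]


p = corrective_factors(4, 0.2)   # roughly [-0.2, -0.033, 0.067, 0.167]
shares = work_shares(p)          # still sums to 1
```

Because the factors sum to zero, the shares still sum to one; the size boundaries would then be recomputed so that server i receives its share of the histogram area rather than an equal quarter.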
Performance of S_EQAL
- Service time: WorldCup 1998 trace
- Inter-arrival time: MMPP(2)
  - Same statistical moments as WorldCup
  - With short-range dependence (SRD)
- 4 servers in the cluster
- Average utilization per server: 62%
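For readers unfamiliar with MMPP(2), a two-state Markov-modulated Poisson process can be simulated as below. This is a generic sketch: the rates in the usage line are arbitrary placeholders, not the parameters fitted to the WorldCup trace, and `mmpp2_interarrivals` is a hypothetical name.

```python
import random


def mmpp2_interarrivals(n, rates, switch_rates, seed=0):
    """Generate n inter-arrival times from a two-state MMPP.

    rates[s] is the Poisson arrival rate while in state s, and
    switch_rates[s] is the rate of leaving state s for the other state.
    In each state the next event is the earlier of two competing
    exponential clocks: an arrival or a state switch.
    """
    rng = random.Random(seed)
    state = 0
    gaps = []
    elapsed = 0.0  # time since the previous arrival
    while len(gaps) < n:
        total = rates[state] + switch_rates[state]
        elapsed += rng.expovariate(total)
        if rng.random() < rates[state] / total:
            gaps.append(elapsed)   # the event was an arrival
            elapsed = 0.0
        else:
            state = 1 - state      # the event was a state switch
    return gaps


# Placeholder parameters: a fast state and a slow state with rare switching.
gaps = mmpp2_interarrivals(1000, rates=(10.0, 0.5), switch_rates=(0.1, 0.1))
```

Long stays in one state produce runs of similar inter-arrival times, which is where the autocorrelation comes from.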
Average Slowdown by R
[Figure: average slowdown (0 to 7000) vs. R, 0% to 90%, with the best slowdown marked]
Average Response Time by R
[Figure: average response time (0 to 12000) vs. R, 0% to 90%, with the best response time marked]
How to get the optimal R?
Dynamic Policy: D_EQAL
Self-adjust R
- R is initialized to 0
- Adjust R by a small value Adj at the end of each monitoring window
- The adjustment should improve both slowdown and response time
  - If not, the adjustment went in the wrong direction
- Recalculate $p_i$
- Set size boundaries
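The adjustment loop can be sketched as a simple hill climb. The reversal rule, the clamping range, and the name `adjust_R` are assumptions made for illustration; the slide only states that a step which fails to improve both metrics went in the wrong direction.

```python
def adjust_R(R, direction, adj, prev, curr, r_min=0.0, r_max=0.9):
    """One end-of-window update of R.

    prev and curr are (slowdown, response_time) pairs measured in the
    previous and the current monitoring window. If the last step did not
    improve both metrics, the direction is reversed (assumed rule).
    R is a fraction, clamped to [r_min, r_max] (assumed bounds).
    Returns the new R and the direction used.
    """
    improved = curr[0] < prev[0] and curr[1] < prev[1]
    if not improved:
        direction = -direction
    new_R = min(r_max, max(r_min, R + direction * adj))
    return new_R, direction


# Both metrics improved: keep increasing R.
R, d = adjust_R(0.0, +1, 0.05, prev=(100.0, 200.0), curr=(90.0, 180.0))
# Slowdown got worse: reverse and step back.
R2, d2 = adjust_R(R, d, 0.05, prev=(90.0, 180.0), curr=(95.0, 170.0))
```

After each update, the corrective factors and size boundaries would be recomputed from the new R, as the last two bullets describe.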
Performance of D_EQAL
[Figure: slowdown (0 to 7000) for static R values from 0% to 90%, compared with D_EQAL]
[Figure: response time (0 to 12000) for static R values from 0% to 90%, compared with D_EQAL]
Effectiveness of D_EQAL
[Figure: R (%), 0 to 70, over monitoring windows 0 to 1000]
Outline
- Motivation
- Our solution
- Conclusion and future work
Conclusion and Future Work
- Load balancing policies should also consider the dependence structure in traffic.
- D_EQAL balances both load and correlation
  - Self-adaptive
  - Effective
- Future work
  - More adaptive: detect changes in the dependence structure
  - Multiple classes: consider different priorities
Thank you!
Questions?
Why Consider Dependence?
BAD performance effect
Metric: autocorrelation function (ACF)
- Inter-arrival time of the i-th request: $X_i$
- The correlation between inter-arrival times at lag $k$:
$$\mathrm{corr}[X_t, X_{t+k}] = \frac{E\big[(X_t - E[X])(X_{t+k} - E[X])\big]}{\mathrm{Var}[X]}$$
[Diagram: a sequence x0, x1, …, x8 with the pairs at lag 1 and lag 2 marked]
Higher ACF, higher dependence, worse performance
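A sample version of this metric can be computed directly from a trace of inter-arrival times. The sketch below uses the common biased estimator (dividing by the full series length), which is an assumption; the slides do not say which estimator was used.

```python
def acf(x, k):
    """Sample autocorrelation of the series x at lag k (0 <= k < len(x)).

    Uses the overall sample mean and variance, and divides the lag-k
    covariance sum by len(x) (the biased estimator).
    """
    n = len(x)
    if not 0 <= k < n:
        raise ValueError("lag k must satisfy 0 <= k < len(x)")
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0:
        raise ValueError("ACF is undefined for a constant series")
    cov = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / n
    return cov / var


# A strictly alternating series is strongly negatively correlated at lag 1
# and strongly positively correlated at lag 2.
series = [1.0, -1.0] * 50
lag1, lag2 = acf(series, 1), acf(series, 2)
```

Applying `acf` for lags 1 to 500 to an inter-arrival trace would give curves of the kind plotted on the ACF slides.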
Examples of ACF
[Figure: ACF vs. lag k, 0 to 500, for NOACF, SRD, and LRD]
Effect of ACF on a Single Server
[Figure: response time (0 to 3000) vs. utilization, 0 to 1, for NOACF, SRD, and LRD]
[Figure: queue length (0 to 40000) vs. utilization, 0 to 1, for NOACF, SRD, and LRD]
Inside Each Server
[Figure: utilization (%) of Servers 1 to 4 for R from 0% to 90%]
Inside Each Server
[Figure: response time (log scale, 1 to 1000000) of Servers 1 to 4 for R from 0% to 90%]
[Figure: utilization (%) of Servers 1 to 4 for R from 0% to 90%]
Too biased
How to get the optimal R?