ascar: automating contention management for high

61
Automating Contention Management for High-Performance Storage Systems (MSST 15, June 5th, 2015) Yan Li, Xiaoyuan Lu, Ethan Miller, Darrell Long Storage Systems Research Center (SSRC) University of California, Santa Cruz ASCAR

Upload: others

Post on 11-Nov-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ASCAR: Automating Contention Management for High

Automating Contention Management for High-Performance Storage Systems

(MSST ’15, June 5th, 2015)

Yan Li, Xiaoyuan Lu, Ethan Miller, Darrell Long

Storage Systems Research Center (SSRC) University of California, Santa Cruz

ASCAR

Page 2: ASCAR: Automating Contention Management for High

ASCAR

Contention in modern shared-storage system

2

Client A

Client B

Client C

Client D

Client E Storage devices

Network switch

Network switch

Storage server

Storage server

Storage server

Storage server

ch

S

tch S

SS

Page 3: ASCAR: Automating Contention Management for High

ASCAR

Contention in modern shared-storage system

3

Client A

Client B

Client C

Client D

Client E Storage devices

Network switch

Network switch

Storage server

Storage server

Storage server

Storage server

ch S

ch

S

Page 4: ASCAR: Automating Contention Management for High

ASCAR

Contention in modern shared-storage system

4

Client A

Client B

Client C

Client D

Client E Storage devices

Storage server

Storage server

Storage server

Storage server

Network switch

Network switch

Sch

Page 5: ASCAR: Automating Contention Management for High

ASCAR

Contention in modern shared-storage system

5

Client A

Client B

Client C

Client D

Client E Storage devices

Storage server

Storage server

Storage server

Storage server

Network switch

Network switch

Sh ch

Page 6: ASCAR: Automating Contention Management for High

ASCAR

Contention harms efficiency

6

100 clients

150 clients

200 clients

tota

l thr

ough

put

Page 7: ASCAR: Automating Contention Management for High

ASCAR

0 5 10 15 20 25Time (second)

0

10

20

30

40

50

60

70

80

Thr

ough

put(

MB

/s)

Client Nodes Throughput

Node 1Node 2Node 3Node 4Node 5

Contention causes fluctuation

7

 Client throughput of a random write workload

 5 nodes accessing 5 servers

 High temporal and spatial I/O speed fluctuation

Page 8: ASCAR: Automating Contention Management for High

ASCAR

This is what’s happening at each server node

8

Picture credit: https://www.checkfelix.com/reiseblog/10-argumente-urlaub-schottland/

Page 9: ASCAR: Automating Contention Management for High

ASCAR

Limiting client I/O rates can improve performance

9

100 clients

150 clients

200 clients

… if done properly to

tal t

hrou

ghpu

t

200 clients with

contention control

Page 10: ASCAR: Automating Contention Management for High

ASCAR 10

200 clients

200 clients with

contention control

Challenges of distributed I/O rate control: 1. capability discovery

nonoptimal rate limit

Page 11: ASCAR: Automating Contention Management for High

ASCAR 11

200 clients

200 clients with

contention control

Challenges of distributed I/O rate control: 1. capability discovery

nonoptimal rate limit

Page 12: ASCAR: Automating Contention Management for High

ASCAR 12

200 clients

200 clients with

contention control

Challenges of distributed I/O rate control: 1. capability discovery

nonoptimal rate limit

Page 13: ASCAR: Automating Contention Management for High

ASCAR 13

200 clients

200 clients with

contention control

Challenges of distributed I/O rate control: 1. capability discovery

optimal rate limit

Page 14: ASCAR: Automating Contention Management for High

ASCAR 14

200 clients

200 clients with

contention control

Challenges of distributed I/O rate control: 1. capability discovery

 Capability discovery usually involves communication:

•  between clients

•  with a central controller

Page 15: ASCAR: Automating Contention Management for High

ASCAR 15

Challenges of distributed I/O rate control: 2. scalability

 Capability discovery usually involves communication:

•  between clients

•  with a central controller

 Intra-node communication grows at O(n2)

 Adds overhead to congested network

 Low responsiveness for highly dynamic workload

Page 16: ASCAR: Automating Contention Management for High

ASCAR

Our goal

 Reduce congestion and improve overall system efficiency

 General purpose

 Scalability

 Reduce human involvement

 Reduce management cost

16

Page 17: ASCAR: Automating Contention Management for High

ASCAR

Target environments

 High Performance Computing  Homogeneous I/O among clients; bandwidth sensitive; requires

fair bandwidth allocation

 Virtual Machines accessing a shared image  Performance isolation

 Enterprise computing  Interactive workloads, backup, indexing, data sharing

 Future data center-wise storage system that serves multiple customers

 Performance isolation

17

Page 18: ASCAR: Automating Contention Management for High

ASCAR

Core ideas

 Client-side rule-based I/O rate control  1. no need for central scheduling or coordination, nimble and

highly responsive

 2. no need to change server software or hardware

 3. no scale-up bottleneck

 Use machine learning and heuristics for rule generation and optimization

 no prior knowledge of the system or workload is required

18

Page 19: ASCAR: Automating Contention Management for High

ASCAR

Components of the ASCAR prototype

19

Legends: Traffic rules

ASCAR traffic controller

Client node

Application

File system client

Storage servers

ASCAR Rule Manager Workload

Character & Rule Set Database

Workload Classifier

Rule Generator & Optimizer (SHARP) traffic

rules

Workload character markers

Page 20: ASCAR: Automating Contention Management for High

ASCAR

Legends: Traffic rules

ASCAR traffic controller

Client node

Application

File system client

Storage servers

ASCAR Rule Manager Workload

Character & Rule Set Database

Workload Classifier

Rule Generator & Optimizer (SHARP) traffic

rules

Workload character markers

Components of the ASCAR prototype

20

Page 21: ASCAR: Automating Contention Management for High

ASCAR

Legends: Traffic rules

ASCAR traffic controller

Client node

Application

File system client

Storage servers

ASCAR Rule Manager Workload

Character & Rule Set Database

Workload Classifier

Rule Generator & Optimizer (SHARP) traffic

rules

Workload character markers

Components of the ASCAR prototype

21

rau

ASCAR Rule Manager Workload

Character & Rule Set Database

Workload Classifier

Rule Generator & Optimizer (SHARP) tr

ru

Workload character markers

Page 22: ASCAR: Automating Contention Management for High

ASCAR

Rule-based Contention Control

 Rules tell the controller how to react to congestions  shrink the congestion window if congestion is building up  enlarge the congestion window if speed is increasing

 Each rule maps a congestion state to an action   (Congestion State (CS) statistics) → <action>

 Each client tracks three congestion state statistics   (ack_ewma, send_ewma, pt_ratio)

 An action describes how to change the congestion window and rate limit

 <m, b, τ>  new_cong_window = m × cong_window + b

22

Page 23: ASCAR: Automating Contention Management for High

ASCAR

Rule set

 A rule set contains many rules

 A rule set covers the whole congestion state space

23

Page 24: ASCAR: Automating Contention Management for High

ASCAR

Rule set

24

Action <m, b, τ>

ack_ewma

pt_r

aio

0,0 INT_MAX

INT_

MAX

Page 25: ASCAR: Automating Contention Management for High

ASCAR

Generating a rule set for a certain application

 Extract a short signature workload that covers the important features of the application’s I/O workload

 a signature workload is usually 20 ~ 30 seconds long

 Generate candidate rules and benchmark them with the signature workload on the real system

 test possible combinations of action variables

 Split the hottest rule in the set to generate more rules

25

Page 26: ASCAR: Automating Contention Management for High

ASCAR

Begin with one rule: the whole state space maps to one action

26

Action <m, b, τ>

ack_ewma

pt_r

aio

0,0 INT_MAX

INT_

MAX

Page 27: ASCAR: Automating Contention Management for High

ASCAR

Try different values of <m, b, τ> with the workload

27

Try the Cartesian product of:

{m ± 0.01, m ± 0.02, m ± 0.04, … } ×

{b ± 0.01, b ± 0.02, b ± 0.04, … } ×

{ τ ± 1, τ ± 2, τ ± 4, … }

ack_ewma

pt_r

aio

0,0 INT_MAX

INT_

MAX

Page 28: ASCAR: Automating Contention Management for High

ASCAR

Find the rule that yields highest performance

28

Action <0.9, 1, 121>

ack_ewma

pt_r

aio

0,0 INT_MAX

INT_

MAX

Page 29: ASCAR: Automating Contention Management for High

ASCAR

Split the state space at the most observed state values

29

Action <0.9, 1, 121>

ack_ewma

pt_r

aio

0,0 INT_MAX

INT_

MAX

Action <0.9, 1, 121>

0.98312s

1.8

Action <0.9, 1, 121>

Action <0.9, 1, 121>

Page 30: ASCAR: Automating Contention Management for High

ASCAR

Run the workload, find out the rules that was triggered most often

30

Action <0.9, 1, 121>

used 32k times

ack_ewma

pt_r

aio

0,0 INT_MAX

INT_

MAX

Action <0.9, 1, 121>

used 0.3k times

0.98312s

1.8

Action <0.9, 1, 121> used 2k times

Action <0.9, 1, 121>

used 120 times

Page 31: ASCAR: Automating Contention Management for High

ASCAR

Improve the most used rules by sweeping all possible values of <m, b, τ>

31

Try the Cartesian product of:

{m ± 0.01, m ± 0.02, m ± 0.04, … } ×

{b ± 0.01, b ± 0.02, b ± 0.04, … } ×

{ τ ± 1, τ ± 2, τ ± 4, … }

ack_ewma

pt_r

aio

0,0 INT_MAX

INT_

MAX

Action <0.9, 1, 121>

0.98312s

1.8

Action <0.9, 1, 121>

Action <0.9, 1, 121>

Page 32: ASCAR: Automating Contention Management for High

ASCAR

Find the action that works best

32

Action <1.1, 1.5, 80>

ack_ewma

pt_r

aio

0,0 INT_MAX

INT_

MAX

Action <0.9, 1, 121>

0.98312s

1.8

Action <0.9, 1, 121>

Action <0.9, 1, 121>

Page 33: ASCAR: Automating Contention Management for High

ASCAR

Split the most used rule’s state space at the most observed state values

33

Action <1.1, 1.5, 80>

ack_ewma

pt_r

aio

0,0 INT_MAX

INT_

MAX

Action <0.9, 1, 121>

0.98312s

1.8

Action <0.9, 1, 121>

Action <0.9, 1, 121>

Action <1.1, 1.5, 80>

Action <1.1, 1.5, 80>

Action <1.1, 1.5, 80>

Page 34: ASCAR: Automating Contention Management for High

ASCAR

Run workload, find the most used rule, and improve it

34

Action <1.1, 1.5, 80>

ack_ewma

pt_r

aio

0,0 INT_MAX

INT_

MAX

Action <0.5, -1, 30>

0.98312s

1.8

Action <0.9, 1, 121>

Action <0.9, 1, 121>

Action <1.1, 1.5, 80>

Action <1.1, 1.5, 80>

Action <1.1, 1.5, 80>

Page 35: ASCAR: Automating Contention Management for High

ASCAR

After find the best action, split the most used rule

35

Action <1.1, 1.5, 80>

ack_ewma

pt_r

aio

0,0 INT_MAX

INT_

MAX

Action <0.5, -1,

30>

0.98312s

1.8

Action <0.9, 1, 121>

Action <0.9, 1, 121>

Action <1.1, 1.5, 80>

Action <1.1, 1.5, 80>

Action <1.1, 1.5, 80>

Action <0.5, -1, 30>

Action <0.5, -1,

30>

Action <0.5, -1, 30>

Page 36: ASCAR: Automating Contention Management for High

ASCAR

Repeat this process

36

Action <1.1, 1.5, 80>

ack_ewma

pt_r

aio

0,0 INT_MAX

INT_

MAX

Action <1, 2, 10>

0.98312s

1.8

Action <0.9, 1, 121>

Action <0.9, 1, 121>

Action <1.1, 1.5, 80>

Action <1.1, 1.5, 80>

Action <1.1, 1.5, 80>

Action <0.5, -1, 30>

Action <0.5, -1,

30>

Action <0.5, -1, 30>

Action <2, 1, 10>

Action <1.1, 1.5, 80>

Action <1.1, 1.5,

80>

Page 37: ASCAR: Automating Contention Management for High

ASCAR

Until either we are satisfied with the results or failed to find a good rule

37

Action <1.1, 1.5, 80>

ack_ewma

pt_r

aio

0,0 INT_MAX

INT_

MAX

Action <1, 2, 10>

0.98312s

1.8

Action <0.9, 1, 121>

Action <0.9, 1, 121>

Action <1.1, 1.5, 80>

Action <1.1, 1.5, 80>

Action <1.1, 1.5, 80>

Action <0.5, -1, 30>

Action <0.5, -1,

30>

Action <0.5, -1,

30>

Action <2, 1, 10>

Action <1.1, 1.5, 80>

Action <1.1, 1.5,

80>

Action <0.5, -1,

30>

Action <0.5, -1, 30>

Action <0.5, -1, 30>

Page 38: ASCAR: Automating Contention Management for High

ASCAR

Prototype and Evaluation

 An ASCAR prototype works with Lustre  Lustre is a popular high-performance file system used in HPC

 Client controller works within the Lustre client  Highly responsive and low overhead

 Hardware: 5 servers, 5 clients  Intel Xeon CPU E3-1230 V2 @ 3.30GHz, 16 GB RAM,

Intel 330 SSD for the OS, dedicated 7200 RPM HGST Travelstar Z7K500 hard drive for Lustre, Gigabit Ethernet

38

Page 39: ASCAR: Automating Contention Management for High

ASCAR

0 5 10 15 20 25Time (second)

0

10

20

30

40

50

60

70

80

Thr

ough

put(

MB

/s)

Client Nodes Throughput

Node 1Node 2Node 3Node 4Node 5

0 5 10 15 20 25Time (second)

0

10

20

30

40

50

60

70

80

Thr

ough

put(

MB

/s)

Client Nodes Throughput

Node 1Node 2Node 3Node 4Node 5

ASCAR is good at increasing throughout and decreasing speed variance

39

Without ASCAR

Node 1

With ASCAR

Page 40: ASCAR: Automating Contention Management for High

ASCAR 40

Seq. read

Seq. write

Random read

Random write

Random r/w

BTIO (B4)

BTIO (C4)

Random write BTIO (B4) BTIO (C4)

-60.00%

-50.00%

-40.00%

-30.00%

-20.00%

-10.00%

0.00%

10.00%

20.00%

30.00%

0.00% 5.00% 10.00% 15.00% 20.00% 25.00% 30.00% 35.00% 40.00%

Spee

d va

rianc

e ch

ange

Throughput improvement

Workload Throughput Improvements

TP only

TP and Var

Page 41: ASCAR: Automating Contention Management for High

ASCAR

Seq. read

Seq. write

Random read

Random write

Random r/w

BTIO (B4)

BTIO (C4)

Random write BTIO (B4) BTIO (C4)

-60.00%

-50.00%

-40.00%

-30.00%

-20.00%

-10.00%

0.00%

10.00%

20.00%

30.00%

0.00% 5.00% 10.00% 15.00% 20.00% 25.00% 30.00% 35.00% 40.00%

Spee

d va

rianc

e ch

ange

Throughput improvement

Workload Throughput Improvements

TP only

TP and Var

41

% 15.00% 20.00% 25.00% 30.00% 35.00% 00% 40.0

Higher throughput

Page 42: ASCAR: Automating Contention Management for High

ASCAR

Seq. read

Seq. write

Random read

Random write

Random r/w

BTIO (B4)

BTIO (C4)

Random write BTIO (B4) BTIO (C4)

-60.00%

-50.00%

-40.00%

-30.00%

-20.00%

-10.00%

0.00%

10.00%

20.00%

30.00%

0.00% 5.00% 10.00% 15.00% 20.00% 25.00% 30.00% 35.00% 40.00%

Spee

d va

rianc

e ch

ange

Throughput improvement

Workload Throughput Improvements

TP only

TP and Var

42

m read

Th

Lower speed variance

Page 43: ASCAR: Automating Contention Management for High

ASCAR

Seq. read

Seq. write

Random read

Random write

Random r/w

BTIO (B4)

BTIO (C4)

Random write BTIO (B4) BTIO (C4)

-60.00%

-50.00%

-40.00%

-30.00%

-20.00%

-10.00%

0.00%

10.00%

20.00%

30.00%

0.00% 5.00% 10.00% 15.00% 20.00% 25.00% 30.00% 35.00% 40.00%

Spee

d va

rianc

e ch

ange

Throughput improvement

Workload Throughput Improvements

TP only

TP and Var

43

Random writeRandom write

BTIO (B4) BTIO (B4)

BTIO (C4)

Page 44: ASCAR: Automating Contention Management for High

ASCAR

Seq. read

Seq. write

Random read

Random write

Random r/w

BTIO (B4)

BTIO (C4)

Random write BTIO (B4) BTIO (C4)

-60.00%

-50.00%

-40.00%

-30.00%

-20.00%

-10.00%

0.00%

10.00%

20.00%

30.00%

0.00% 5.00% 10.00% 15.00% 20.00% 25.00% 30.00% 35.00% 40.00%

Spee

d va

rianc

e ch

ange

Throughput improvement

Workload Throughput Improvements

TP only

TP and Var

44

writewrite

BTIO (B4) BTIO (B4)

BTIO (C4)

35.....00000000% 33333333333000000000.0

Random wRandom w

20.00% 222222222225.0

Page 45: ASCAR: Automating Contention Management for High

ASCAR

Our prototype shows that …

 ASCAR increases the performance of all workloads  2% to 36%

 ASCAR works best to boost write-heavy workloads  25% to 36%

 and it can lower speed variance at the same time  by more than 50%

45

Page 46: ASCAR: Automating Contention Management for High

ASCAR

Handling a new workload without offline training

 Measure the workload’s features

 Compare features with known workloads

 Find the most similar known workload and use its contention control rules

46

Page 47: ASCAR: Automating Contention Management for High

ASCAR

Legends: Traffic rules

ASCAR traffic controller

Client node

Application

File system client

Storage servers

ASCAR Rule Manager Workload

Character & Rule Set Database

Workload Classifier

Rule Generator & Optimizer (SHARP) traffic

rules

Workload character markers

Components of the ASCAR prototype

47

Workload Character

& Rule Set Database

Workload Classifier

Page 48: ASCAR: Automating Contention Management for High

ASCAR

Measuring workloads’ pressure on the system

 Workloads can have radically different pressure on the underlying system

 And they require different contention management rules

 The combined workload from many applications is a mix of read/write and random/sequential

48

Page 49: ASCAR: Automating Contention Management for High

ASCAR

A 75% sequential + 25% random workload can be very different from another

Request p

Request p + 120

Request p + 100

Request p + 101

Request p + 121

Request p + 122

Request p + 200

Request p + 201

Request p

Request p + 3

Request p + 1

Request p + 2

Request p + 4

Request p + 100

Request p + 200

Request p + 351

Random requests

Page 50: ASCAR: Automating Contention Management for High

ASCAR

And there are many different 60% read + 40% write workloads out there

Read Write

Read Write Read

Read Read Write Write Write Read Read

Page 51: ASCAR: Automating Contention Management for High

ASCAR

The feature set we are using now

 Op type: read/write/metadata

 Ratios between ops (read to write, read/write to metadata, etc.)

 For each type of op, we measure the following features: 1. average size of sequential ops 2. location between seq. ops: random or sequential 3. average temporal gaps between seq. ops

Page 52: ASCAR: Automating Contention Management for High

ASCAR

Sample: Different 60% read + 40% write workloads

Read Write

Read Read Write Write Write Read Read

Read to write: 60/40 Avg. size of sequential read: 60 MB Avg. size of sequential write: 40 MB

Read to write: 60/40 Avg. size of sequential read: 15 MB Avg. size of sequential write: 13 MB

Page 53: ASCAR: Automating Contention Management for High

ASCAR

Calculating similarity of workloads

 Goal: to determine if they can use the same contention control rules

 How: start from a known workload and tweak its features, measuring the efficiency of existing rules on the changed workload

 Result: we used the results of hundreds of benchmarks to generate a decision tree

53

Page 54: ASCAR: Automating Contention Management for High

ASCAR

Calculating similarity of workloads

54

rw_ratio

read_size Linear model 1

Linear model 2

Linear model 3

<=7 >7

<=75 >75

Page 55: ASCAR: Automating Contention Management for High

ASCAR

rw_ratio

read_size Linear model 1

Linear model 2

Linear model 3

<=7 >7

<=75 >75

Calculating similarity of workloads

55

< 7 7

rw_ratio

read_size

Page 56: ASCAR: Automating Contention Management for High

ASCAR

This decision tree can precisely predict the performance of using a specific rule set with a new workload

 Correlation coefficient: 0.8676

 Mean absolute error: 0.038

 Root mean squared error: 0.0462

56

Page 57: ASCAR: Automating Contention Management for High

ASCAR

Conclusion: ASCAR …

 Does not require knowledge of the system or workloads

 fully unsupervised

 Only needs client-side controllers  no need to change hardware/application/network/server software

 Can improve highly changeable workloads  like burst I/O

 Work-conserving  no wasted bandwidth

57

Page 58: ASCAR: Automating Contention Management for High

ASCAR

Conclusion: ASCAR …

 we distilled feature set and method for calculating the similarity between workload in terms of applying contention control rules

 source code of our prototype is published as an open source project

 http://www.ssrc.ucsc.edu/ascar.html

58

Page 59: ASCAR: Automating Contention Management for High

ASCAR

Future work: online rule optimization

 Current ASCAR prototype requires a lengthy offline learning process

 Use cloud computing services for doing evaluation on a larger scale

 Online tweaking of rules using random-restart hill climbing

 Also need to evaluate the ASCAR algorithm on other workloads: database, web services

59

Page 60: ASCAR: Automating Contention Management for High

ASCAR

Acknowledgments

 This research was supported in part by the National Science Foundation under awards IIP-1266400, CCF-1219163, CNS- 1018928, the Department of Energy under award DE-FC02- 10ER26017/DESC0005417, Symantec Graduate Fellowship, and industrial members of the Center for Research in Storage Systems. We would like to thank the sponsors of the Storage Systems Research Center (SSRC), including Avago Technologies, Center for Information Technology Research in the Interest of Society (CITRIS of UC Santa Cruz), Department of Energy/Office of Science, EMC, Hewlett Packard Laboratories, Intel Corporation, National Science Foundation, NetApp, Sandisk, Seagate Technology, Symantec, and Toshiba for their generous support.

60

Page 61: ASCAR: Automating Contention Management for High

ASCAR

Thanks!

 ASCAR project: http://www.ssrc.ucsc.edu/ascar.html

 Contact: Yan Li <[email protected]>

61