Improving I/O Resource Sharing of Linux Cgroup for NVMe SSDs on Multi-core Systems USENIX HotStorage 2016 Sungyong Ahn*, Kwanghyun La*, Jihong Kim** *Memory Business, Samsung Electronics Co., Ltd. **Seoul National University


Page 1

Improving I/O Resource Sharing of Linux Cgroup for NVMe SSDs on Multi-core Systems

USENIX HotStorage 2016

Sungyong Ahn*, Kwanghyun La*, Jihong Kim**

*Memory Business, Samsung Electronics Co., Ltd.

**Seoul National University

Page 2

Outline

Introduction

Motivation

Contributions

Weight-based dynamic throttling (WDT) scheme

Experimental Results

Conclusion



Page 5

OS-level Virtualization

Multiple isolated instances (containers) running on a single host.

• Hardware resources should be isolated and allocated to containers

Page 8

Proportional I/O scheme in Linux Cgroups

I/O bandwidth is shared according to I/O weights

[Figure: Ideal proportional I/O sharing. Normalized I/O bandwidth of containers C(10), C(5), C(2.5), and C(1): 10, 5, 2.5, and 1.]
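The ideal sharing on this slide can be worked through as simple arithmetic. The sketch below is only an illustration: the function names and the total device bandwidth of 1850 MB/s are assumptions chosen for the example, not values from the slides.

```python
def ideal_shares(weights, total_bandwidth):
    """Split total_bandwidth among containers in proportion to their I/O weights."""
    total_weight = sum(weights.values())
    return {name: total_bandwidth * w / total_weight for name, w in weights.items()}


def normalize(bandwidths, reference):
    """Express every bandwidth relative to the reference container's bandwidth."""
    base = bandwidths[reference]
    return {name: bw / base for name, bw in bandwidths.items()}


# I/O weights taken from the slide's legend: C(10), C(5), C(2.5), C(1).
weights = {"C(10)": 10, "C(5)": 5, "C(2.5)": 2.5, "C(1)": 1}

# Hypothetical total device bandwidth in MB/s (an assumption for illustration).
shares = ideal_shares(weights, total_bandwidth=1850.0)

# Normalizing to the lowest-weight container recovers the weights themselves.
normalized = normalize(shares, reference="C(1)")
```

Under ideal sharing the normalized bandwidth of each container equals its weight, which is the reference bar chart that the later measurements are compared against.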


Page 12

Proportional I/O with NVMe SSDs

Existing Cgroups cannot support proportional I/O for NVMe SSDs

[Figure: Normalized I/O bandwidth of C(10), C(5), C(2.5), and C(1). Ideal: 10, 5, 2.5, 1.0. BASELINE: 1.5, 1.1, 0.5, 1.0.]

Page 13

Because...

NVMe SSDs have a different I/O stack from SATA storage

[Figure: SATA I/O stack compared with NVMe I/O stack.]

The existing proportional I/O scheme is implemented in the single-queue block layer

Page 14

First Attempt: Using the Existing Static Throttling

Upper limit of I/O bandwidth

• Limit the maximum number of bytes or I/O requests for a particular time interval (throttling window)

[Figure: Timeline for Container A with upper limit = 10. Requests 1 to 10 are issued within the throttling window, after which the container is throttled.]
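The static throttling idea described here (a fixed upper limit per throttling window) can be sketched as follows. This is a minimal illustration, not the kernel's implementation: the class name `StaticThrottle`, the request-count limit, and the one-second window are all assumptions made for the example.

```python
class StaticThrottle:
    """Fixed upper limit on requests per throttling window (hypothetical sketch)."""

    def __init__(self, limit, window):
        self.limit = limit        # maximum requests allowed per window
        self.window = window      # window length in seconds
        self.window_start = 0.0   # start time of the current window
        self.used = 0             # requests dispatched in the current window

    def allow(self, now):
        """Return True if a request arriving at time `now` may be dispatched."""
        if now - self.window_start >= self.window:
            # Advance to the window that contains `now` and reset the counter.
            elapsed_windows = int((now - self.window_start) // self.window)
            self.window_start += elapsed_windows * self.window
            self.used = 0
        if self.used < self.limit:
            self.used += 1
            return True
        return False  # throttled until the next window begins


# Container A with upper limit = 10, as in the slide's timeline.
throttle = StaticThrottle(limit=10, window=1.0)
first_window = [throttle.allow(now=0.05 * i) for i in range(12)]  # 12 requests in window 1
next_window = throttle.allow(now=1.0)                             # first request of window 2
```

The limit never changes in this scheme, regardless of what the other containers are doing.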

Page 15

First Attempt: Using the Existing Static Throttling

Static throttling is not enough to support proportional I/O

[Figure: Normalized I/O bandwidth of C(10), C(5), C(2.5), and C(1) under Static Throttling: 9.9, 2.2, 1.2, 1.0.]

Page 16

Because...

I/O workloads fluctuate with time



Page 19

Contributions

We achieved proportional I/O sharing for NVMe SSDs.

We achieved scalable performance for Linux Cgroups.


Page 21

Overview of WDT Scheme

To update TotalCredit, future I/O demand is predicted

Distributing the credits to containers according to I/O weights

[Figure: WDT architecture. Three containers with weights w1, w2, and w3 sit above the block layer. WDT contains the Future I/O Demand Predictor, the TotalCredit Updater, and the Budget Distributor. The Budget Distributor allocates credits $\mathcal{C}_1$, $\mathcal{C}_2$, $\mathcal{C}_3$ out of TotalCredit. Each container $i$ has a monitor ($CPM_i$) and a check labeled $\mathcal{B}_i^j + \mathcal{R}_i^j > \mathcal{U}_i^j$ for $i = 1, 2, 3$. Other labels: Residual Credits Carryover. Legend: credit allocation, CPM monitoring, data flow.]


Page 24

Budget Distributor

All containers are allocated credits in proportion to their I/O weight.

Credits are replenished periodically.

If a container has no available credit, it is throttled.

[Figure: Timeline over two throttling windows with TotalCredit = 15. In each window, Container A (I/O weight = 10) uses credits 1 to 10 and Container B (I/O weight = 5) uses credits 1 to 5. Each container is throttled once its credits are used up, and credits are replenished at the start of the next window.]
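The three rules above (proportional allocation, periodic replenishment, throttling on empty credit) can be sketched in a few lines. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, one credit is charged per request, and the residual-credit carryover shown in the WDT overview is not modeled.

```python
class BudgetDistributor:
    """Hypothetical sketch of weight-proportional credit allocation."""

    def __init__(self, weights, total_credit):
        self.weights = dict(weights)
        self.total_credit = total_credit
        self.credits = {}
        self.replenish()

    def replenish(self):
        """Refill every container at the start of a throttling window."""
        total_weight = sum(self.weights.values())
        # Integer division drops any remainder; a real scheme would need
        # a policy for leftover credits (an assumption of this sketch).
        self.credits = {
            name: self.total_credit * w // total_weight
            for name, w in self.weights.items()
        }

    def try_dispatch(self, name, cost=1):
        """Consume credit for one request, or report that the container is throttled."""
        if self.credits[name] >= cost:
            self.credits[name] -= cost
            return True
        return False


# The slide's example: weights 10 and 5, TotalCredit = 15.
dist = BudgetDistributor({"A": 10, "B": 5}, total_credit=15)
served_a = sum(dist.try_dispatch("A") for _ in range(12))  # A asks for 12 requests
served_b = sum(dist.try_dispatch("B") for _ in range(7))   # B asks for 7 requests
dist.replenish()                                           # next throttling window
a_after_refill = dist.try_dispatch("A")
```

With these inputs Container A is served 10 requests and Container B 5 requests per window, matching the timeline in the figure.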


Page 26

TotalCredit Updater

To remove storage idle time, TotalCredit is adjusted.

Page 29

Future I/O Demand Predictor

Monitoring I/O demand of each container for every interval

• Prediction of the future I/O demand from the cumulative distribution function

– 80th percentile of the cumulative distribution of I/O demand (assuming a normal distribution)

[Figure: WDT architecture diagram, repeated from the Overview of WDT Scheme slide.]
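The percentile-based prediction can be illustrated with the standard library's `statistics.NormalDist`. This is a sketch under assumptions, not the authors' code: the function name, the sample history, and the use of the sample standard deviation are choices made for the example; the slides only state that the 80th percentile of a normally distributed demand is used.

```python
from statistics import NormalDist, mean, stdev


def predict_demand(samples, percentile=0.80):
    """Estimate future I/O demand as a percentile of a normal fit to past samples."""
    if len(samples) < 2:
        # Not enough history to estimate a spread; fall back to the last sample.
        return float(samples[-1]) if samples else 0.0
    mu = mean(samples)
    sigma = stdev(samples)  # sample standard deviation (an assumption)
    if sigma == 0:
        return float(mu)    # constant demand: the percentile equals the mean
    return NormalDist(mu, sigma).inv_cdf(percentile)


# Hypothetical per-interval I/O demand observed for one container.
history = [90, 110, 100, 95, 105]
predicted = predict_demand(history)
median = predict_demand(history, percentile=0.50)
```

For a normal distribution the 80th percentile lies about 0.84 standard deviations above the mean, so the prediction deliberately overestimates average demand somewhat.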

Page 31

Scalability of the Existing Cgroups on NUMA

Scalability problem of the existing Cgroups throttling layer

[Figure: I/O throughput (KIOPS) of containers C1 to C4 versus the number of NUMA nodes (1 to 4 nodes).]

Page 32

Because...

All containers share a single request_queue lock across NUMA nodes

Lock contention

Remote memory accesses to the lock state

Cacheline invalidations caused by cache coherence protocol

[Figure: CPU cache miss ratio (%) versus the number of NUMA nodes: 0.6 (1 node), 21.8 (2 nodes), 31.7 (3 nodes), 38.2 (4 nodes).]

[Figure: Linux I/O stack above the hardware. Containers A to D pass through Cgroup I/O throttling, which is guarded by a single LOCK. Other labels: Single-queue Block Layer, Proportional I/O (CFQ), Multi-queue Block Layer.]

Page 33

Per-container Locks

We adopt fine-grained per-container locks

The cache miss ratio decreases to 12.8% from 38.2%

[Figure: Two versions of the Linux I/O stack for Containers A to D. Before: Cgroup I/O throttling is guarded by one shared LOCK. After: Cgroup I/O throttling has one LOCK per container (four locks). Other labels: Single-queue Block Layer, Proportional I/O (CFQ), Multi-queue Block Layer, Hardware.]
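The difference between one shared lock and per-container locks can be shown structurally. The sketch below is only an analogy in user-space Python: the class names are hypothetical, and Python threads cannot reproduce the NUMA cache effects measured in the slides. It shows only which state each lock protects.

```python
import threading


class SingleLockThrottle:
    """All containers update their accounting under one shared lock."""

    def __init__(self, containers):
        self.lock = threading.Lock()
        self.dispatched = {c: 0 for c in containers}

    def account(self, container):
        with self.lock:  # every container contends on the same lock
            self.dispatched[container] += 1


class PerContainerLockThrottle:
    """Each container has its own lock, so containers never contend with each other."""

    def __init__(self, containers):
        self.locks = {c: threading.Lock() for c in containers}
        self.dispatched = {c: 0 for c in containers}

    def account(self, container):
        with self.locks[container]:  # only threads of the same container contend
            self.dispatched[container] += 1


def run(throttle, containers, requests_per_thread=2000, threads_per_container=2):
    """Drive the throttle from several threads per container and return the counters."""
    def worker(container):
        for _ in range(requests_per_thread):
            throttle.account(container)

    threads = [
        threading.Thread(target=worker, args=(c,))
        for c in containers
        for _ in range(threads_per_container)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return throttle.dispatched


containers = ["A", "B", "C", "D"]
single = run(SingleLockThrottle(containers), containers)
per_container = run(PerContainerLockThrottle(containers), containers)
```

Both variants keep the per-container counters correct. The per-container version simply narrows each lock's scope to the state of one container, which is the property the slides credit for the lower cache miss ratio.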


Page 36

Result 1: Proportional I/O Support

WDT scheme satisfies the proportional sharing requirements

Normalized I/O bandwidth:

Container   BASELINE   Static Throttling   WDT
C(10)       1.5        9.9                 10.0
C(5)        1.1        2.2                 5.0
C(2.5)      0.5        1.2                 2.5
C(1)        1.0        1.0                 1.0
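One way to quantify how close each scheme in Result 1 comes to the ideal is to compare the reported normalized bandwidths with the I/O weights. The metric below (mean relative error against the weights) is a choice made for this illustration and does not appear in the slides; the bandwidth values are the ones reported for BASELINE, Static Throttling, and WDT.

```python
# Ideal normalized bandwidth equals the I/O weight of each container.
ideal = {"C(10)": 10.0, "C(5)": 5.0, "C(2.5)": 2.5, "C(1)": 1.0}

# Normalized I/O bandwidth reported in the slides for each scheme.
measured = {
    "BASELINE": {"C(10)": 1.5, "C(5)": 1.1, "C(2.5)": 0.5, "C(1)": 1.0},
    "Static Throttling": {"C(10)": 9.9, "C(5)": 2.2, "C(2.5)": 1.2, "C(1)": 1.0},
    "WDT": {"C(10)": 10.0, "C(5)": 5.0, "C(2.5)": 2.5, "C(1)": 1.0},
}


def mean_relative_error(observed, target):
    """Average of |observed - target| / target over all containers."""
    errors = [abs(observed[c] - target[c]) / target[c] for c in target]
    return sum(errors) / len(errors)


errors = {scheme: mean_relative_error(values, ideal) for scheme, values in measured.items()}
```

By this measure static throttling fixes the highest-weight container but still leaves the middle containers far from their targets, while the WDT values match the weights exactly.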

Page 37

Result 2: Performance Scalability

WDT-: using a single spin lock

WDT: using per-container locks

I/O bandwidth (MB/s):

Container   WDT-   WDT
C1          1333   1762
C2          670    881
C3          334    440
C4          133    176
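The gain from per-container locks in Result 2 can be worked out directly from the reported bandwidths. The snippet below is plain arithmetic on the chart's data labels; the variable names are illustrative.

```python
# I/O bandwidth in MB/s from the slide: (WDT- with a single spin lock, WDT with per-container locks).
bandwidth_mb_s = {
    "C1": (1333, 1762),
    "C2": (670, 881),
    "C3": (334, 440),
    "C4": (133, 176),
}

# Per-container speedup of WDT over WDT-.
speedups = {c: wdt / wdt_minus for c, (wdt_minus, wdt) in bandwidth_mb_s.items()}

# Aggregate bandwidth of all four containers under each variant.
total_wdt_minus = sum(wdt_minus for wdt_minus, _ in bandwidth_mb_s.values())
total_wdt = sum(wdt for _, wdt in bandwidth_mb_s.values())
total_speedup = total_wdt / total_wdt_minus
```

Aggregate bandwidth rises from 2470 MB/s to 3259 MB/s, roughly a 1.32x improvement, and every container gains by about the same factor, so the proportional shares are preserved.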


Page 39

Conclusion

Proposed the weight-based dynamic throttling scheme to support proportional I/O sharing for NVMe SSDs.

Proposed the per-container locks for scalable performance.
