HPCS Lab. High Throughput, Low latency and Reliable Remote File Access Hiroki Ohtsuji and Osamu Tatebe University of Tsukuba, Japan / JST CREST


Page 1:

High Throughput, Low latency and Reliable Remote File Access

Hiroki Ohtsuji and Osamu Tatebe
University of Tsukuba, Japan / JST CREST

Page 2:

Motivation and Background

• Data-intensive computing is one of the most important issues in many areas
• Storage systems for the exabyte (10^18 bytes) era

→ We need a fast and reliable remote file access system

Page 3:

Motivation and Background (cont'd)

• Data sharing
  – Distributed file system
  – Clients access the data via the network
• Bottlenecks
  – Wide-area network
    • Long latency
  – Storage cluster
    • Network overhead
• Fault tolerance
  – Proposal: congestion avoidance

[Figure: clients access storage nodes through a metadata server]

Page 4:

Remote file access with RDMA

• The latency of Ethernet is at least 50 microseconds
  – Software overhead
  – Protocol processing
  – Memory copies
• Flash-memory-based storage devices
  – ~25 μs latency (e.g. Fusion-io ioDrive), vs. ~5 ms for HDDs
  → The network becomes the bottleneck of the system

[Figure: transfer time [μs] vs. transfer size [bytes], comparing 10G Ethernet, RDMA, IPoIB, and SDP]

Page 5:

Usage of InfiniBand

• IP over IB (IPoIB)
  – Uses the IP protocol stack of the operating system
  – Pros: can be used like an ordinary network adapter
  – Cons: inefficient
• SDP (Sockets Direct Protocol)
  – Pros: easy to use (just set LD_PRELOAD)
  – Cons: performance
• RDMA (Verbs API)
  – Pros: low latency
  – Cons: no compatibility with the socket APIs

[Figure: the three options trade off performance against implementation cost; higher performance comes with higher implementation cost]

Page 6:

Structure of OFED

OFED: drivers and libraries for InfiniBand (Verbs API, SDP, IPoIB)

[Figure: OFED software stack, from a Mellanox document]

Page 7:

Remote file access with RDMA

• Architecture

[Figure: the client application accesses files on the server's storage over InfiniBand FDR (54.3 Gbps) using RDMA with an RC connection; storage is a Fusion-io ioDrive]

• Low-overhead remote file access with the Verbs API
• CPU bypass

Page 8:

Preliminary Evaluation: Throughput

• A client accesses a file on the file server via InfiniBand with the Verbs API
  – Sequential access

[Figure: sequential read throughput [MB/s] vs. RDMA size [KB], comparing RDMA and local fio]

Page 9:

Preliminary Evaluation of IOPS

• Strided access with read sizes from 2 KB to 64 KB (stride = 1 MB), via the Verbs API

[Figure: IOPS vs. read size [bytes] (2 KB–64 KB), comparing RDMA and local fio; stride = 1 MB]

Page 10:

Congestion avoidance by using redundant data

• Concentration of accesses
  – Hotspot files exist on the storage nodes
• Redundant data
  – Provides fault tolerance
  – Can also be used to avoid congestion

Page 11:

Redundant data

• Basic structure

File1 is striped across storage nodes 0–2 as two data blocks (d0, d1) and one parity block (p0):

  node:   0    1    2
  File1:  d0   d1   p0

Any block can be rebuilt from the other two, e.g. d1 xor p0 = d0.
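The parity relation above can be sketched in a few lines (a minimal sketch; the block contents are made up for illustration):

```python
# Minimal sketch of XOR parity as used above: p0 = d0 xor d1,
# so any missing block can be rebuilt from the other two.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d0 = b"\x01\x02\x03\x04"   # data block 0 (illustrative contents)
d1 = b"\x10\x20\x30\x40"   # data block 1
p0 = xor_blocks(d0, d1)    # parity block, stored on a third node

# If the node holding d0 is unavailable or congested,
# rebuild d0 from the other two blocks: d1 xor p0 = d0.
rebuilt = xor_blocks(d1, p0)
assert rebuilt == d0
```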

Page 12:

• RAID: disks connected with SCSI / SATA
• → Here, the same structure is built from storage nodes connected over the network

Page 13:

Performance deterioration

Clients access four storage nodes (0–3); each file is striped as d0, d1, p0. Two placements:

  Aligned placement        Rotated placement
  File1: d0 d1 p0          File1: d0 d1 p0
  File2: d0 d1 p0          File2: p0 d0 d1

With the aligned placement, accesses to corresponding blocks of both files concentrate on the same nodes.
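The two placements can be sketched as follows (a minimal sketch; the per-file left-rotation rule is a RAID-5-style assumption used for illustration):

```python
# Sketch: stripe the blocks [d0, d1, p0] across nodes, either aligned
# (every file starts at node 0) or rotated per file (RAID-5 style),
# which spreads same-index blocks across different nodes.
def placement(file_index: int, blocks, rotated: bool):
    shift = file_index if rotated else 0
    # node i receives blocks[(i - shift) % len(blocks)]
    return {node: blocks[(node - shift) % len(blocks)]
            for node in range(len(blocks))}

blocks = ["d0", "d1", "p0"]
print(placement(1, blocks, rotated=False))  # {0: 'd0', 1: 'd1', 2: 'p0'}
print(placement(1, blocks, rotated=True))   # {0: 'p0', 1: 'd0', 2: 'd1'}
```

With `rotated=False`, File1.d0 and File2.d0 both land on node 0, which is exactly the hotspot shown in the aligned placement above.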

Page 14:

Performance deterioration (cont'd)

Clients; RAM disk, 16 GB, sequential access:

  Throughput [MB/s]   node0  node1  node2  node3
  No congestion       3473   3381   3352   3323
  With congestion     2645   2616   2606   2613

Page 15:

Congestion avoidance

[Figure: clients avoid the congested storage node by reading the corresponding data and parity blocks of the stripe from the other nodes and reconstructing the requested block]
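The avoidance scheme can be sketched as follows (a minimal sketch of a 3-block stripe; the node layout, the congestion signal, and the helper names are illustrative assumptions, not the authors' implementation):

```python
# Sketch: read a data block normally, but fall back to XOR
# reconstruction from the remaining data + parity blocks when
# the primary node is congested.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def read_block(nodes, layout, want, congested):
    """nodes: node -> block contents; layout: block name -> node."""
    primary = layout[want]
    if primary not in congested:
        return nodes[primary]                      # normal read path
    # Congested: fetch the other two blocks of the stripe and decode
    # (for a 3-block stripe, any block is the XOR of the other two).
    others = [blk for blk in layout if blk != want]
    a, b = (nodes[layout[blk]] for blk in others)
    return xor_blocks(a, b)                        # e.g. d1 xor p0 = d0

d0, d1 = b"\xaa\xbb", b"\x11\x22"
p0 = xor_blocks(d0, d1)
nodes = {0: d0, 1: d1, 2: p0}
layout = {"d0": 0, "d1": 1, "p0": 2}
assert read_block(nodes, layout, "d0", congested={0}) == d0
```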

Page 16:

Performance evaluation

• Compare the congested case with and without rebuilding the striped data from the redundant blocks (w/ decode)

  Throughput [MB/s]   node0  node1  node2  node3
  No congestion       3473   3381   3352   3323
  With congestion     2645   2616   2606   2613
  With decode         3382   3398   2969   3011

• Congestion avoided: +28% / +15% improvement over the congested case

Page 17:

Related work

• Stephen C. Simms et al.: Wide Area Filesystem Performance Using Lustre on the TeraGrid, 2007
• J. Wu, P. Wyckoff and D. Panda: PVFS over InfiniBand: Design and Performance Evaluation
• Erasure Coding in Windows Azure Storage, USENIX ATC '12
  – Shortens latency by using redundant data
• HDFS RAID

Page 18:

Conclusion and Future Work

• Remote file access with InfiniBand RDMA
• Congestion avoidance using redundant data
• Future work
  – How to detect congestion
  – Writing of data (in progress)
    • Without performance degradation
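One simple way to approach the open "how to detect congestion" question (an illustrative assumption, not the authors' method) is to flag a node when the mean of its recent request latencies exceeds a threshold:

```python
# Illustrative congestion detector: keep a sliding window of recent
# read latencies per node and mark a node congested when the window
# mean exceeds a threshold. Window size and threshold are assumptions.
from collections import deque

class CongestionDetector:
    def __init__(self, window=8, threshold_us=100.0):
        self.window = window
        self.threshold_us = threshold_us
        self.samples = {}            # node -> deque of recent latencies

    def record(self, node, latency_us):
        q = self.samples.setdefault(node, deque(maxlen=self.window))
        q.append(latency_us)

    def is_congested(self, node):
        q = self.samples.get(node)
        if not q:
            return False
        return sum(q) / len(q) > self.threshold_us

det = CongestionDetector(window=4, threshold_us=100.0)
for lat in (40.0, 55.0, 60.0):
    det.record(0, lat)
print(det.is_congested(0))   # False: mean ~51.7 us is under the threshold
det.record(0, 400.0)
print(det.is_congested(0))   # True: mean jumps to ~138.8 us
```

A client could consult such a detector before each read and switch to the decode path (reading the other blocks of the stripe) for nodes that report as congested.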
