HPCS Lab.
High Throughput, Low latency and Reliable Remote File Access
Hiroki Ohtsuji and Osamu Tatebe
University of Tsukuba, Japan / JST CREST
Motivation and Background
• Data-intensive computing is one of the most important issues in many areas
• Storage systems for the exabyte (10^18 bytes) era
→ Need a fast and reliable remote file access system
Motivation and Background (cont’d)
• Data sharing
  – Distributed file system
  – Clients access the data via the network
• Bottlenecks
  – Wide-area network
    • Long latency
  – Storage cluster
    • Overhead of the network
• Fault tolerance
  – Suggestion: congestion avoidance
[Figure: a distributed file system with a metadata server, storage nodes, and clients]
Remote file access with RDMA
• Latency of Ethernet is at least 50 microseconds
  – Software overhead
  – Protocol processing
  – Memory copies
• Flash-memory-based storage devices
  – 25 μs latency (e.g. Fusion-io ioDrive); HDD ≈ 5 ms
• The network becomes the bottleneck of the system
[Figure: time [μs] vs. transfer size [bytes] for 10G Ethernet, RDMA, IPoIB, and SDP]
Usage of InfiniBand
• IP over IB
  – Uses the IP protocol stack of the operating system
  – Pros: can be used like an ordinary network adapter
  – Cons: inefficient
• SDP (Socket Direct Protocol)
  – Pros: easy to use (just set LD_PRELOAD)
  – Cons: performance
• RDMA (Verbs API)
  – Pros: low latency
  – Cons: no compatibility with socket APIs
[Figure: the three options placed on implementation-cost vs. performance axes; arrows mark the "better" direction on each axis]
Structure of OFED
[Figure: the OFED software stack, including the Verbs API, SDP, and IPoIB paths (from a Mellanox document)]
OFED: drivers and libraries for InfiniBand
Remote file access with RDMA
• Architecture
  – The client application accesses server memory directly over InfiniBand RDMA (RC transport): CPU bypass
  – InfiniBand FDR (54.3 Gbps); storage: Fusion-io ioDrive
[Figure: client memory and server memory connected via InfiniBand RDMA, with the storage device behind the server]
→ Low-overhead remote file access with the Verbs API
Preliminary Evaluation: Throughput
• A client accesses a file on the file server via InfiniBand with the Verbs API
  – Sequential access
[Figure: "Sequential RDMA (Fio)": throughput [MB/s] vs. RDMA size [KB] (1024 to 4194304), comparing RDMA with local Fio]
Preliminary Evaluation of IOPS
• Strided access, read sizes from 2 KB to 64 KB, stride 1 MB (Verbs API)
[Figure: IOPS vs. read size [bytes] (2048 to 65536) for RDMA and local Fio, stride = 1 MB]
Congestion avoidance by using redundant data
• Concentration of accesses
  – There are hotspots (files) on the storage nodes
• Redundant data
  – Provides fault tolerance
  – Can be used to avoid congestion
Redundant data
• Basic structure
  – File1 is striped across nodes 0, 1, and 2 as two data blocks and one parity block: (d0, d1, p0)
  – A lost or unreachable block can be rebuilt from the other two: d1 XOR p0 = d0
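To make the recovery rule concrete, here is a minimal Python sketch of XOR parity for a (d0, d1, p0) stripe. It is an illustration only, not the presented implementation; the function names are made up.

    def xor_blocks(a: bytes, b: bytes) -> bytes:
        """Byte-wise XOR of two equal-length blocks."""
        return bytes(x ^ y for x, y in zip(a, b))

    def encode_parity(d0: bytes, d1: bytes) -> bytes:
        """Parity block for the stripe: p0 = d0 XOR d1."""
        return xor_blocks(d0, d1)

    def recover_d0(d1: bytes, p0: bytes) -> bytes:
        """Rebuild d0 without touching the node that stores it: d0 = d1 XOR p0."""
        return xor_blocks(d1, p0)

    # Example: d0 can be served even if node 0 is down or congested.
    d0, d1 = b"AAAA", b"BBBB"
    p0 = encode_parity(d0, d1)
    assert recover_d0(d1, p0) == d0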
• RAID: devices connected with SCSI / SATA
• → Here, the redundant blocks are connected via the network
[Figure: a RAID array on a SCSI / SATA bus vs. the same redundancy scheme across network-connected storage nodes]
Performance deterioration
[Figure: File1 and File2 each striped as (d0, d1, p0) across storage nodes 0-2, accessed by clients 0-3; in the first layout both files share the same block placement, in the second File2's stripe is rotated to (p0, d0, d1)]
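The rotated layout in the second diagram can be captured by a simple modular placement rule. The sketch below is a hypothetical illustration of RAID-5-style rotation; the talk does not specify the exact mapping.

    def placement(file_id: int, n_nodes: int = 3) -> dict:
        """Hypothetical rule: shift each file's (d0, d1, p0) stripe by its id,
        so identical block positions of different files land on different nodes."""
        blocks = ["d0", "d1", "p0"]
        return {blocks[i]: (i + file_id) % n_nodes for i in range(len(blocks))}

    print(placement(0))  # File1: {'d0': 0, 'd1': 1, 'p0': 2}
    print(placement(1))  # File2: {'d0': 1, 'd1': 2, 'p0': 0} -- the rotated layout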
Performance deterioration (cont’d)
• Setup: clients, RAM disk, 16 GB, sequential access

Throughput [MB/s]   node0   node1   node2   node3
No congestion        3473    3381    3352    3323
With congestion      2645    2616    2606    2613
Congestion avoidance
[Figure: File1 and File2 striped as (d0, d1, p0) across storage nodes 0-2; instead of all clients reading from the same node, a client can rebuild the hot block from the remaining data and parity blocks (d1 XOR p0 = d0)]
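A minimal sketch of how a client-side read path could use this parity detour; the load metric, the threshold, and the in-memory stand-in for the storage nodes are all assumptions for illustration, not part of the presented system.

    def xor_blocks(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def read_d0(nodes, load, threshold=0.8):
        """Read d0 directly, or rebuild it as d1 XOR p0 if node 0 is a hotspot."""
        if load[0] < threshold:
            return nodes[0]["d0"]      # fast path: direct read
        d1 = nodes[1]["d1"]            # detour: sibling data block ...
        p0 = nodes[2]["p0"]            # ... plus the parity block
        return xor_blocks(d1, p0)      # reconstruct d0, bypassing node 0

    # Toy stripe of File1 spread over three nodes.
    d0, d1 = b"AAAA", b"BBBB"
    nodes = [{"d0": d0}, {"d1": d1}, {"p0": xor_blocks(d0, d1)}]
    assert read_d0(nodes, load=[0.95, 0.1, 0.1]) == d0  # node 0 congested
    assert read_d0(nodes, load=[0.10, 0.1, 0.1]) == d0  # node 0 idle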
Performance evaluation
• Compare the cases

Throughput [MB/s]                node0   node1   node2   node3
No congestion                     3473    3381    3352    3323
With congestion                   2645    2616    2606    2613
Congestion avoided (w/ decode)    3382    3398    2969    3011

• In the congestion-avoided case, clients rebuild the striped data by decoding parity; throughput improves by +28% and +15% over the congested case
Related work
• Stephen C. Simms et al., "Wide Area Filesystem Performance using Lustre on the TeraGrid," 2007
• J. Wu, P. Wyckoff, and D. Panda, "PVFS over InfiniBand: Design and Performance Evaluation"
• "Erasure Coding in Windows Azure Storage," USENIX ATC ’12
  – Shortens latency by using redundant data
• HDFS RAID
Conclusion and Future work
• Remote file access with InfiniBand RDMA
• Congestion avoidance
• Future work
  – How to detect congestion
  – Writing of data (in progress)
    • without performance degradation
[Figure: a client writing to storage nodes 0-3]