TRANSCRIPT
Accelerating Big Data Processing with RDMA-Enhanced Apache Hadoop
Dhabaleswar K. (DK) Panda The Ohio State University
E-mail: [email protected] http://www.cse.ohio-state.edu/~panda
Keynote Talk at BPOE-4, in conjunction with ASPLOS ‘14
(March 2014)
Introduction to Big Data Applications and Analytics
• Big Data has become one of the most important elements of business analytics
• Provides groundbreaking opportunities for enterprise information management and decision making
• The amount of data is exploding; companies are capturing and digitizing more information than ever
• The rate of information growth appears to be exceeding Moore’s Law
2 BPOE-4, ASPLOS '14
• Commonly accepted 3V's of Big Data: Volume, Velocity, Variety
  – Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf
• 5V's of Big Data – 3V + Value, Veracity
• Scientific Data Management, Analysis, and Visualization
• Applications examples – Climate modeling
– Combustion
– Fusion
– Astrophysics
– Bioinformatics
• Data-Intensive Tasks – Run large-scale simulations on supercomputers
– Dump data on parallel storage systems
– Collect experimental / observational data
– Move experimental / observational data to analysis sites
– Visual analytics – help understand data visually
BPOE-4, ASPLOS '14 3
Not Only in Internet Services - Big Data in Scientific Domains
• Hadoop: http://hadoop.apache.org – The most popular framework for Big Data Analytics
– HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc.
• Spark: http://spark-project.org – Provides primitives for in-memory cluster computing; Jobs can load data into memory and
query it repeatedly
• Shark: http://shark.cs.berkeley.edu – A fully Apache Hive-compatible data warehousing system
• Storm: http://storm-project.net – A distributed real-time computation system for real-time analytics, online machine learning,
continuous computation, etc.
• S4: http://incubator.apache.org/s4 – A distributed system for processing continuous unbounded streams of data
• Web 2.0: RDBMS + Memcached (http://memcached.org) – Memcached: A high-performance, distributed memory object caching systems
BPOE-4, ASPLOS '14 4
Typical Solutions or Architectures for Big Data Applications and Analytics
• Focuses on large data and data analysis
• Hadoop (e.g. HDFS, MapReduce, HBase, RPC) environment is gaining a lot of momentum
• http://wiki.apache.org/hadoop/PoweredBy
BPOE-4, ASPLOS '14 5
Who Are Using Hadoop?
• Scale: 42,000 nodes, 365 PB of HDFS storage, 10M daily slot hours
  – http://developer.yahoo.com/blogs/hadoop/apache-hbase-yahoo-multi-tenancy-helm-again-171710422.html
• A new Jim Gray Sort record: 1.42 TB/min over 2,100 nodes
  – Hadoop 0.23.7, an early branch of Hadoop 2.x
  – http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-sets-gray-sort-record-yellow-elephant-092704088.html
BPOE-4, ASPLOS '14 6
Hadoop at Yahoo!
http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-more-ever-54421.html
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – HBase
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
7 BPOE-4, ASPLOS '14
Big Data Processing with Hadoop
• The open-source implementation of the MapReduce programming model for Big Data Analytics
• Major components – HDFS – MapReduce
• Underlying Hadoop Distributed File System (HDFS) used by both MapReduce and HBase
• The model scales, but the heavy communication during intermediate phases can be further optimized
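To make the MapReduce model above concrete, here is a minimal word-count sketch against the Apache Hadoop 1.x org.apache.hadoop.mapreduce API; the class name and input/output paths are illustrative only, not part of the talk:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit <word, 1> for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts shuffled to this key.
  // The shuffle between map and reduce is the communication this talk targets.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");   // Hadoop 1.x style job setup
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```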
(Figure: Hadoop framework stack – User Applications over MapReduce and HBase, which run on HDFS and Hadoop Common (RPC))
8 BPOE-4, ASPLOS '14
Hadoop MapReduce Architecture & Operations
• Map and Reduce Tasks carry out the total job execution
• Communication in the shuffle phase uses HTTP over Java Sockets
(Figure: MapReduce job flow, including disk operations during the map, shuffle and reduce phases)
9 BPOE-4, ASPLOS '14
Hadoop Distributed File System (HDFS)
• Primary storage of Hadoop; highly reliable and fault-tolerant
• Adopted by many reputed organizations – e.g., Facebook, Yahoo!
• NameNode: stores the file system namespace
• DataNode: stores data blocks
• Developed in Java for platform-independence and portability
• Uses sockets for communication!
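As a concrete illustration of the socket-backed write and read paths mentioned above, the sketch below writes and reads back a file through the standard Hadoop FileSystem API; the file path is made up for the example, and the NameNode address is taken from the usual core-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name (the NameNode address) from core-site.xml.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hdfs-write-demo.txt");  // illustrative path

    // Write path: the client asks the NameNode for target DataNodes, then
    // streams the block to the first DataNode, which replicates it onward.
    FSDataOutputStream out = fs.create(file, true);
    out.write("hello HDFS".getBytes("UTF-8"));
    out.close();

    // Read path: the client fetches block locations and reads from a DataNode.
    FSDataInputStream in = fs.open(file);
    byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
    in.readFully(buf);
    in.close();
    System.out.println(new String(buf, "UTF-8"));
  }
}
```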
(Figure: HDFS Architecture – Client, NameNode and DataNodes connected via RPC)
10 BPOE-4, ASPLOS '14
System-Level Interaction Between Clients and Data Nodes in HDFS (figure)
BPOE-4, ASPLOS '14 11
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
12 BPOE-4, ASPLOS '14
BPOE-4, ASPLOS '14 13
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)
(Chart: Number of Clusters and Percentage of Clusters in the Top500 list over the timeline)
HPC and Advanced Interconnects
• High-Performance Computing (HPC) has adopted advanced interconnects and protocols – InfiniBand
– 10 Gigabit Ethernet/iWARP
– RDMA over Converged Enhanced Ethernet (RoCE)
• Very Good Performance – Low latency (a few microseconds)
– High Bandwidth (100 Gb/s with dual FDR InfiniBand)
– Low CPU overhead (5-10%)
• The OpenFabrics software stack with IB, iWARP and RoCE interfaces is driving HPC systems
• Many such systems in Top500 list
14 BPOE-4, ASPLOS '14
15
Trends of Networking Technologies in TOP500 Systems
(Chart: Interconnect Family – Systems Share; the percentage share of InfiniBand is steadily increasing)
BPOE-4, ASPLOS '14
Open Standard InfiniBand Networking Technology
• Introduced in Oct 2000
• High Performance Data Transfer
  – Interprocessor communication and I/O
  – Low latency (<1.0 microsec), high bandwidth (up to 12.5 GigaBytes/sec -> 100 Gbps), and low CPU utilization (5-10%)
• Flexibility for LAN and WAN communication
• Multiple Transport Services
  – Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram
  – Provides flexibility to develop upper layers
• Multiple Operations
  – Send/Recv
  – RDMA Read/Write
  – Atomic Operations (very unique)
    • Enable high-performance and scalable implementations of distributed locks, semaphores, and collective communication operations
• Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, ...
16 BPOE-4, ASPLOS '14
• Message Passing Interface (MPI) – Most commonly used programming model for HPC applications
• Parallel File Systems
• Storage Systems
• Almost 13 years of Research and Development since InfiniBand was introduced in October 2000
• Other Programming Models are emerging to take advantage of High-Performance Networks – UPC
– OpenSHMEM
– Hybrid MPI+PGAS (OpenSHMEM and UPC)
17
Use of High-Performance Networks for Scientific Computing
BPOE-4, ASPLOS '14
• High Performance open-source MPI Library for InfiniBand, 10Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002
– MVAPICH2-X (MPI + PGAS), Available since 2012
– Support for GPGPUs and MIC
– Used by more than 2,100 organizations (HPC Centers, Industry and Universities) in 71 countries
– More than 203,000 downloads from OSU site directly
– Empowering many TOP500 clusters • 7th ranked 462,462-core cluster (Stampede) at TACC
• 11th ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
• 16th ranked 96,192-core cluster (Pleiades) at NASA and many others
– Available with software stacks of many IB, HSE, and server vendors including Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede System
MVAPICH2/MVAPICH2-X Software
18 BPOE-4, ASPLOS '14
BPOE-4, ASPLOS '14 19
One-way Latency: MPI over IB with MVAPICH2
(Charts: Small Message Latency and Large Message Latency – latency (us) vs. message size (bytes); small-message latencies range from 0.99 to 1.82 us across the configurations shown)
Configurations: Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, Ivy-ConnectIB-DualFDR
DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
BPOE-4, ASPLOS '14 20
Bandwidth: MPI over IB with MVAPICH2
(Charts: Unidirectional Bandwidth and Bidirectional Bandwidth – bandwidth (MBytes/sec) vs. message size (bytes); peak unidirectional bandwidth reaches about 12,810 MBytes/sec and peak bidirectional bandwidth about 24,727 MBytes/sec for the dual-port ConnectIB FDR configurations)
Configurations: Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, Ivy-ConnectIB-DualFDR
DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
21 BPOE-4, ASPLOS '14
Can High-Performance Interconnects Benefit Hadoop?
• Most of the current Hadoop environments use Ethernet Infrastructure with Sockets
• Concerns for performance and scalability
• Usage of High-Performance Networks is beginning to draw interest from many companies
• What are the challenges?
• Where do the bottlenecks lie?
• Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?
• Can HPC Clusters with High-Performance networks be used for Big Data applications using Hadoop?
22 BPOE-4, ASPLOS '14
Designing Communication and I/O Libraries for Big Data Systems: Challenges
BPOE-4, ASPLOS '14 23
(Figure: challenges in the Big Data software stack – Applications; Big Data Middleware (HDFS, MapReduce, HBase and Memcached); Programming Models (Sockets), with the question of Other Protocols; a Communication and I/O Library covering Point-to-Point Communication, Threaded Models and Synchronization, QoS, Fault-Tolerance, I/O and File Systems, Virtualization, and Benchmarks; all running over Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs), Commodity Computing System Architectures (multi- and many-core architectures and accelerators), and Storage Technologies (HDD and SSD))
BPOE-4, ASPLOS '14 24
Common Protocols using Open Fabrics
(Figure: protocol options between the Application/Middleware interface and the network)
• Sockets interface
  – 1/10/40 GigE: TCP/IP in kernel space (Ethernet driver), Ethernet adapter and switch
  – 10/40 GigE-TOE: TCP/IP with hardware offload, Ethernet adapter and switch
  – IPoIB: TCP/IP over InfiniBand adapter and switch
  – RSockets: user-space RDMA over InfiniBand adapter and switch
  – SDP: user-space RDMA over InfiniBand adapter and switch
• Verbs interface
  – iWARP: user-space RDMA over iWARP adapter and Ethernet switch
  – RoCE: user-space RDMA over RoCE adapter and Ethernet switch
  – IB Verbs: user-space RDMA over InfiniBand adapter and switch
Can Hadoop be Accelerated with High-Performance Networks and Protocols?
• Current Design: Application -> Sockets -> 1/10 GigE Network
• Enhanced Designs: Application -> Accelerated Sockets -> Verbs / Hardware Offload -> 10 GigE or InfiniBand
• Our Approach: Application -> OSU Design -> Verbs Interface -> 10 GigE or InfiniBand
• Sockets not designed for high performance
  – Stream semantics often mismatch for upper layers (Hadoop and HBase)
  – Zero-copy not available for non-blocking sockets
25
BPOE-4, ASPLOS '14
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
26 BPOE-4, ASPLOS '14
• High-Performance Design of Hadoop over RDMA-enabled Interconnects
– High performance design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and RPC components
– Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.8
– Based on Apache Hadoop 1.2.1
– Compliant with Apache Hadoop 1.2.1 APIs and applications
– Tested with • Mellanox InfiniBand adapters (DDR, QDR and FDR)
• RoCE
• Various multi-core platforms
• Different file systems with disks and SSDs
– http://hadoop-rdma.cse.ohio-state.edu
RDMA for Apache Hadoop Project
27 BPOE-4, ASPLOS '14
Design Overview of HDFS with RDMA
• JNI Layer bridges the Java-based HDFS with a communication library written in native code (a sketch of such a bridge follows this slide)
• Only the communication part of HDFS Write is modified; no change in the HDFS architecture
• Enables high-performance RDMA communication while supporting the traditional socket interface
• Design Features
  – RDMA-based HDFS write
  – RDMA-based HDFS replication
  – InfiniBand/RoCE support
(Figure: HDFS design – Applications -> HDFS; the Write path goes through the OSU Design and the Java Native Interface (JNI) to Verbs over RDMA-capable networks (IB, 10GigE/iWARP, RoCE), while other operations use the Java Socket Interface over 1/10 GigE or IPoIB networks)
28 BPOE-4, ASPLOS '14
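The JNI bridge mentioned above can be pictured roughly as follows. This is a hand-written sketch with invented class, library, and method names (not the actual OSU code); it only shows how a Java-side HDFS write path can hand buffers to a native verbs library:

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical JNI bridge: the Java side keeps HDFS logic unchanged and only
 * swaps the transport used by the write path. All names below are invented.
 */
public class RdmaBridge {
  static {
    // Assumed native library implementing connection setup and RDMA transfers
    // on top of InfiniBand/RoCE verbs (libhdfsrdma.so is an invented name).
    System.loadLibrary("hdfsrdma");
  }

  // Establish (or reuse) a verbs connection to a DataNode.
  public native long connect(String dataNodeHost, int port);

  // Hand a packet of block data to the native layer; direct ByteBuffers let
  // the native code register and transmit the memory without a JVM-side copy.
  public native int writePacket(long connection, ByteBuffer packet, int length);

  // Tear down the verbs connection.
  public native void close(long connection);

  // Sketch of how a sender-side write loop could use the bridge.
  public void sendBlock(String host, int port, ByteBuffer[] packets) {
    long conn = connect(host, port);
    try {
      for (ByteBuffer p : packets) {
        writePacket(conn, p, p.remaining());
      }
    } finally {
      close(conn);
    }
  }
}
```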
Communication Time in HDFS
• Cluster with HDD DataNodes
– 30% improvement in communication time over IPoIB (QDR)
– 56% improvement in communication time over 10GigE
• Similar improvements are obtained for SSD DataNodes
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda , High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012
(Chart: HDFS communication time (s) vs. file size (2GB–10GB), comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
29 BPOE-4, ASPLOS '14
Evaluations using TestDFSIO
• Cluster with 8 HDD DataNodes, single disk per node
– 24% improvement over IPoIB (QDR) for 20GB file size
• Cluster with 4 SSD DataNodes, single SSD per node
– 61% improvement over IPoIB (QDR) for 20GB file size
Cluster with 8 HDD Nodes, TestDFSIO with 8 maps; Cluster with 4 SSD Nodes, TestDFSIO with 4 maps
(Charts: Total Throughput (MBps) vs. File Size (GB: 5–20) for both clusters, comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
30 BPOE-4, ASPLOS '14
Evaluations using TestDFSIO
• Cluster with 4 DataNodes, 1 HDD per node
– 24% improvement over IPoIB (QDR) for 20GB file size
• Cluster with 4 DataNodes, 2 HDD per node
– 31% improvement over IPoIB (QDR) for 20GB file size
• 2 HDD vs 1 HDD
– 76% improvement for OSU-IB (QDR)
– 66% improvement for IPoIB (QDR)
(Chart: Total Throughput (MBps) vs. File Size (GB: 5–20), comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR), each with 1 HDD and 2 HDDs per node)
31
BPOE-4, ASPLOS '14
32
Evaluations using Enhanced DFSIO of Intel HiBench on SDSC-Gordon
• Cluster with 32 DataNodes, single SSD per node – 47% improvement in throughput over IPoIB (QDR) for 128GB file size
– 16% improvement in latency over IPoIB (QDR) for 128GB file size
BPOE-4, ASPLOS '14
(Charts: Aggregated Throughput (MBps) and Execution Time (s) vs. File Size (GB: 32, 64, 128), comparing IPoIB (QDR) and OSU-IB (QDR))
BPOE-4, ASPLOS '14 33
Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede
• Cluster with 64 DataNodes, single HDD per node – 64% improvement in throughput over IPoIB (FDR) for 256GB file size
– 37% improvement in latency over IPoIB (FDR) for 256GB file size
(Charts: Aggregated Throughput (MBps) and Execution Time (s) vs. File Size (GB: 64, 128, 256), comparing IPoIB (FDR) and OSU-IB (FDR))
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
34 BPOE-4, ASPLOS '14
Design Overview of MapReduce with RDMA
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Java based MapReduce with communication library written in native code
• InfiniBand/RoCE support
(Figure: MapReduce design – Applications -> MapReduce (Job Tracker, Task Tracker, Map, Reduce); the OSU Design goes through the Java Native Interface (JNI) to Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE), while the default path uses the Java Socket Interface over 1/10 GigE or IPoIB networks)
• Design features for RDMA (see the sketch after this slide)
  – Prefetch/Caching of MOF (Map Output Files)
  – In-Memory Merge
  – Overlap of Merge & Reduce
35 BPOE-4, ASPLOS '14
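The last two design features above (prefetching map output and overlapping merge with reduce) can be pictured as a simple producer/consumer pattern. Everything below is illustrative Java with invented names, not the actual OSU MapReduce implementation:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Illustrative only: a fetcher thread prefetches map output files (MOFs) into
 * memory while the reducer merges and consumes them, so network transfer,
 * merge, and reduce can overlap instead of running strictly one after another.
 */
public class OverlappedShuffleSketch {
  private static final ByteBuffer END = ByteBuffer.allocate(0); // end-of-stream marker

  private final BlockingQueue<ByteBuffer> prefetched = new ArrayBlockingQueue<>(64);

  // Producer: fetch map outputs ahead of time (over RDMA in the real design).
  public Thread startFetcher(final Iterable<String> mapOutputIds) {
    Thread fetcher = new Thread(() -> {
      try {
        for (String mofId : mapOutputIds) {
          prefetched.put(fetchMapOutput(mofId));  // blocks when the cache is full
        }
        prefetched.put(END);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    fetcher.start();
    return fetcher;
  }

  // Consumer: merge/reduce each segment as soon as it is available in memory.
  public void mergeAndReduce() throws InterruptedException {
    ByteBuffer segment;
    while ((segment = prefetched.take()) != END) {
      mergeIntoCurrentRun(segment);   // in-memory merge
      reduceReadyKeys();              // reduce overlapped with remaining fetches
    }
  }

  // Placeholders standing in for the real I/O and merge logic.
  private ByteBuffer fetchMapOutput(String mofId) { return ByteBuffer.allocate(4 * 1024); }
  private void mergeIntoCurrentRun(ByteBuffer segment) { /* ... */ }
  private void reduceReadyKeys() { /* ... */ }
}
```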
Evaluations using Sort
8 HDD DataNodes
• With 8 HDD DataNodes for 40GB sort
– 41% improvement over IPoIB (QDR)
– 39% improvement over UDA-IB (QDR)
36
M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC Workshop, held in conjunction with IPDPS, May 2013.
BPOE-4, ASPLOS '14
8 SSD DataNodes
• With 8 SSD DataNodes for 40GB sort
– 52% improvement over IPoIB (QDR)
– 44% improvement over UDA-IB (QDR)
(Charts: Job Execution Time (sec) vs. Data Size (GB: 25, 30, 35, 40) on 8 HDD DataNodes and on 8 SSD DataNodes, comparing 10GigE, IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR))
Evaluations using TeraSort
• 100 GB TeraSort with 8 DataNodes with 2 HDD per node
– 51% improvement over IPoIB (QDR)
– 41% improvement over UDA-IB (QDR)
37 BPOE-4, ASPLOS '14
(Chart: Job Execution Time (sec) vs. Data Size (GB: 60, 80, 100), comparing 10GigE, IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR))
• For 240GB Sort in 64 nodes – 40% improvement over IPoIB (QDR)
with HDD used for HDFS
38
Performance Evaluation on Larger Clusters
BPOE-4, ASPLOS '14
(Charts: Job Execution Time (sec) – Sort in OSU Cluster: data sizes 60/120/240 GB on cluster sizes 16/32/64, comparing IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR); TeraSort in TACC Stampede: data sizes 80/160/320 GB on cluster sizes 16/32/64, comparing IPoIB (FDR), UDA-IB (FDR) and OSU-IB (FDR))
• For 320GB TeraSort in 64 nodes – 38% improvement over IPoIB
(FDR) with HDD used for HDFS
• 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size
• 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size
Evaluations using PUMA Workload
39 BPOE-4, ASPLOS '14
(Chart: Normalized Execution Time for the PUMA benchmarks AdjList (30GB), SelfJoin (80GB), SeqCount (30GB), WordCount (30GB) and InvertIndex (30GB), comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
• 50 small MapReduce jobs executed in a cluster size of 4
• Maximum performance benefit 24% over IPoIB (QDR)
• Average performance benefit 12.89% over IPoIB (QDR)
40
Evaluations using SWIM
(Chart: Execution Time (sec) per job for the 50 SWIM jobs, comparing IPoIB (QDR) and OSU-IB (QDR))
BPOE-4, ASPLOS '14
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
41 BPOE-4, ASPLOS '14
Design Overview of Hadoop RPC with RDMA
(Figure: Hadoop RPC design – Applications -> Hadoop RPC; the OSU design goes through the Java Native Interface (JNI) to Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE), while the default path uses the Java Socket Interface over 1/10 GigE or IPoIB networks)
• Enables high-performance RDMA communication while supporting the traditional socket interface
• Design Features (see the sketch after this slide)
  – JVM-bypassed buffer management
  – On-demand connection setup
  – RDMA or send/recv based adaptive communication
  – InfiniBand/RoCE support
42
BPOE-4, ASPLOS '14
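Two of the design features above (on-demand connection setup and adaptive RDMA vs. send/recv communication) can be pictured with the following sketch; the class, threshold, and method names are invented for illustration and are not the actual OSU RPC code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Illustrative sketch of on-demand connections and size-based protocol choice. */
public class AdaptiveRpcTransportSketch {
  // Hypothetical cutoff: small payloads go over send/recv, large ones over RDMA.
  private static final int RDMA_THRESHOLD_BYTES = 32 * 1024;

  private final ConcurrentMap<String, Connection> connections = new ConcurrentHashMap<>();

  /** Connections are created lazily the first time a peer is contacted. */
  private Connection getConnection(String peer) {
    return connections.computeIfAbsent(peer, Connection::new);
  }

  /** Pick the transport per message instead of per connection. */
  public void send(String peer, byte[] payload) {
    Connection c = getConnection(peer);
    if (payload.length >= RDMA_THRESHOLD_BYTES) {
      c.rdmaWrite(payload);   // zero-copy transfer of large buffers
    } else {
      c.sendRecv(payload);    // eager path for small messages
    }
  }

  /** Stand-in for a verbs-backed connection (details omitted). */
  static final class Connection {
    Connection(String peer) { /* on-demand queue-pair setup would go here */ }
    void rdmaWrite(byte[] payload) { /* ... */ }
    void sendRecv(byte[] payload) { /* ... */ }
  }
}
```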
Hadoop RPC over IB - Gain in Latency and Throughput
• Hadoop RPC over IB PingPong Latency
– 1 byte: 39 us; 4 KB: 52 us
– 42%-49% and 46%-50% improvements compared with the performance on 10 GigE and IPoIB (32Gbps), respectively
• Hadoop RPC over IB Throughput
– 512 bytes & 48 clients: 135.22 Kops/sec
– 82% and 64% improvements compared with the peak performance on 10 GigE and IPoIB (32Gbps), respectively
(Charts: Hadoop RPC PingPong latency (us) vs. payload size (1 byte–4 KB) and throughput (Kops/sec) vs. number of clients (8–64), comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
43 BPOE-4, ASPLOS '14
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
44 BPOE-4, ASPLOS '14
• For 5GB,
– 53.4% improvement over IPoIB, 126.9% improvement over 10GigE with HDD used for HDFS
– 57.8% improvement over IPoIB, 138.1% improvement over 10GigE with SSD used for HDFS
• For 20GB,
– 10.7% improvement over IPoIB, 46.8% improvement over 10GigE with HDD used for HDFS
– 73.3% improvement over IPoIB, 141.3% improvement over 10GigE with SSD used for HDFS
45
Performance Benefits - TestDFSIO
Cluster with 4 HDD Nodes, TestDFSIO with 4 maps total; Cluster with 4 SSD Nodes, TestDFSIO with 4 maps total
(Charts: Total Throughput (MBps) vs. Size (GB: 5–20) for both clusters, comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
BPOE-4, ASPLOS '14
• For 80GB Sort in a cluster size of 8, – 40.7% improvement over IPoIB, 32.2% improvement over 10GigE with HDD
used for HDFS
– 43.6% improvement over IPoIB, 42.9% improvement over 10GigE with SSD used for HDFS
46
Performance Benefits - Sort
Cluster with 8 HDD Nodes, Sort with <8 maps, 8 reduces>/host; Cluster with 8 SSD Nodes, Sort with <8 maps, 8 reduces>/host
(Charts: Job Execution Time (sec) vs. Sort Size (GB: 48, 64, 80) for both clusters, comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
BPOE-4, ASPLOS '14
• For 100GB TeraSort in a cluster size of 8, – 45% improvement over IPoIB, 39% improvement over 10GigE with HDD used
for HDFS
– 46% improvement over IPoIB, 40% improvement over 10GigE with SSD used for HDFS
47
Performance Benefits - TeraSort
Cluster with 8 HDD Nodes, TeraSort with <8 maps, 8 reduces>/host; Cluster with 8 SSD Nodes, TeraSort with <8 maps, 8 reduces>/host
(Charts: Job Execution Time (sec) vs. Size (GB: 80, 90, 100) for both clusters, comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
BPOE-4, ASPLOS '14
Performance Benefits – RandomWriter & Sort in SDSC-Gordon
• RandomWriter in a cluster of single SSD per node
  – 16% improvement over IPoIB (QDR) for 50GB in a cluster of 16
  – 20% improvement over IPoIB (QDR) for 300GB in a cluster of 64
• Sort in a cluster of single SSD per node
  – 20% improvement over IPoIB (QDR) for 50GB in a cluster of 16
  – 36% improvement over IPoIB (QDR) for 300GB in a cluster of 64
(Charts: Job Execution Time (sec) for RandomWriter and Sort – data sizes 50/100/200/300 GB on cluster sizes 16/32/48/64, comparing IPoIB (QDR) and OSU-IB (QDR))
48 BPOE-4, ASPLOS '14
Performance Evaluation with BigDataBench
(Charts: execution time for WordCount and Sort, comparing 10GigE, IPoIB (QDR) and RDMA (QDR))
• Better benefit for communication-intensive benchmarks, like Sort
• Smaller benefit for CPU-intensive benchmarks, like WordCount
49 BPOE-4, ASPLOS '14
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
50 BPOE-4, ASPLOS '14
• The current benchmarks provide some insight into overall performance behavior
• However, do not provide any information to the designer/developer on: – What is happening at the lower-layer?
– Where the benefits are coming from?
– Which design is leading to benefits or bottlenecks?
– Which component in the design needs to be changed and what will be its impact?
– Can performance gain/loss at the lower-layer be correlated to the performance gain/loss observed at the upper layer?
BPOE-4, ASPLOS '14 51
Are the Current Benchmarks Sufficient for Big Data?
OSU MPI Micro-Benchmarks (OMB) Suite
• A comprehensive suite of benchmarks to
  – Compare performance of different MPI libraries on various networks and systems
  – Validate low-level functionalities
  – Provide insights into the underlying MPI-level designs
• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth
• Extended later to
  – MPI-2 one-sided
  – Collectives
  – GPU-aware data movement
  – OpenSHMEM (point-to-point and collectives)
  – UPC
• Has become an industry standard
• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries, and even in procurement of large-scale systems
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with the MVAPICH2 stack
52 BPOE-4, ASPLOS '14
Challenges in Benchmarking of RDMA-based Designs
(Figure: the same Big Data communication and I/O stack as before, now including RDMA Protocols; current benchmarks exist only at the applications level, there are no benchmarks at the middleware and communication-library levels, and the open question is how the two correlate)
53
BPOE-4, ASPLOS '14
Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications
(Figure: the same stack, highlighting an iterative loop between Applications-Level Benchmarks at the top and Micro-Benchmarks at the communication and I/O library level)
BPOE-4, ASPLOS '14 54
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
55 BPOE-4, ASPLOS '14
Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges
(Figure: the same stack, now with an RDMA Protocol in place in the communication and I/O library; the remaining question is what upper-level changes are needed)
BPOE-4, ASPLOS '14 56
• Multi-threading and Synchronization
  – Multi-threaded model exploration
  – Fine-grained synchronization and lock-free design
  – Unified helper threads for different components
  – Multi-endpoint design to support multi-threaded communication
• QoS and Virtualization
  – Network virtualization and locality-aware communication for Big Data middleware
  – Hardware-level virtualization support for end-to-end QoS
  – I/O scheduling and storage virtualization
  – Live migration
BPOE-4, ASPLOS '14 57
Open Challenges
• Support of Accelerators
  – Efficient designs for Big Data middleware to take advantage of NVIDIA GPGPUs and Intel MICs
  – Offload computation-intensive workloads to accelerators
  – Explore maximum overlap between communication and offloaded computation
• Fault Tolerance Enhancements
  – Novel data replication schemes for high-efficiency fault tolerance
  – Exploration of light-weight fault tolerance mechanisms for Big Data
• Support of Parallel File Systems
  – Optimize Big Data middleware over parallel file systems (e.g., Lustre) on modern HPC clusters
BPOE-4, ASPLOS '14 58
Open Challenges (Cont’d)
Case Study 1 - HDFS Replication: Pipelined vs Parallel
• Basic mechanism of HDFS fault tolerance is Data Replication – Replicates each block to multiple DataNodes
Can Parallel Replication benefit HDFS and its applications when HDFS runs on different kinds of High Performance
Networks?
BPOE-4, ASPLOS '14 59
(Figure: Pipelined Replication vs. Parallel Replication)
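The difference between the two replication schemes can be sketched as follows. This is illustrative Java with hypothetical helper names, not HDFS code: in the pipelined scheme the client sends the block once and each DataNode forwards it to the next, while in the parallel scheme the client pushes the block to all replicas concurrently.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

/** Illustrative contrast between pipelined and parallel block replication. */
public class ReplicationSketch {

  /** Pipelined: client -> DN1 -> DN2 -> DN3 (each node forwards downstream). */
  static void replicatePipelined(byte[] block, List<String> dataNodes) {
    // The client only talks to the first DataNode; the replication
    // "pipeline" is encoded in the downstream list handed to it.
    sendToDataNode(dataNodes.get(0), block, dataNodes.subList(1, dataNodes.size()));
  }

  /** Parallel: the client pushes the block to every replica at the same time. */
  static void replicateParallel(byte[] block, List<String> dataNodes)
      throws InterruptedException {
    CountDownLatch done = new CountDownLatch(dataNodes.size());
    for (String dn : dataNodes) {
      new Thread(() -> {
        sendToDataNode(dn, block, Collections.emptyList());   // no downstream targets
        done.countDown();
      }).start();
    }
    done.await();   // the block is considered written once all replicas have it
  }

  // Stand-in for the actual transfer (sockets in stock HDFS, verbs in OSU-IB).
  static void sendToDataNode(String dataNode, byte[] block, List<String> downstream) {
    /* ... */
  }
}
```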
(Charts: Total Throughput (MBps) vs. File Size (20–30 GB) and Execution Time (s) vs. File Size (40–80 GB), comparing pipelined and parallel replication over 10GigE, IPoIB (32Gbps) and OSU-IB (32Gbps))
Parallel Replication Evaluations on Different Networks using TestDFSIO
• Cluster with 8 HDD DataNodes – 11% improvement over 10GigE
– 10% improvement over IPoIB (32Gbps)
– 12% improvement over OSU-IB (32Gbps)
Cluster with 8 HDD DataNodes Cluster with 32 HDD DataNodes
BPOE-4, ASPLOS '14 60
N. Islam, X. Lu, M. W. Rahman, and D. K. Panda, Can Parallel Replication Benefit HDFS for High-Performance Interconnects? Int'l Symposium on High-Performance Interconnects (HotI '13), August 2013.
• Cluster with 32 HDD DataNodes – 8% improvement over IPoIB (32Gbps)
– 9% improvement over OSU-IB (32Gbps)
• For 500GB Sort in 64 nodes – 44% improvement over IPoIB (FDR)
61
Case Study 2 - Performance Improvement Potential of MapReduce over Lustre on HPC Clusters (Initial Study)
BPOE-4, ASPLOS '14
Sort in TACC Stampede Sort in SDSC Gordon
• For 100GB Sort in 16 nodes – 33% improvement over IPoIB (QDR)
(Charts: Job Execution Time (sec) vs. Data Size – 300/400/500 GB in TACC Stampede comparing IPoIB (FDR) and OSU-IB (FDR), and 60/80/100 GB in SDSC Gordon comparing IPoIB (QDR) and OSU-IB (QDR))
• Can more optimizations be achieved by leveraging more features of Lustre?
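As a rough illustration of how a MapReduce job can be pointed at a parallel file system in the first place, the snippet below configures Hadoop 1.x to use a POSIX-mounted file system instead of HDFS; the Lustre mount path is assumed purely for illustration, and the actual OSU design goes well beyond this plain configuration:

```java
import org.apache.hadoop.conf.Configuration;

public class LustreJobConfSketch {
  public static Configuration configure() {
    Configuration conf = new Configuration();

    // Use the local/POSIX file system interface over a shared parallel
    // file system mount instead of HDFS (the paths below are invented).
    conf.set("fs.default.name", "file:///");                     // Hadoop 1.x property name
    conf.set("mapred.local.dir", "/lustre/scratch/job-local");   // intermediate (shuffle) data

    return conf;
  }
}
```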
• Upcoming RDMA for Apache Hadoop Releases will support – HDFS Parallel Replication
– Advanced MapReduce
– HBase
– Hadoop 2.0.x
• Memcached-RDMA
• Exploration of other components (Threading models, QoS, Virtualization, Accelerators, etc.)
• Advanced designs with upper-level changes and optimizations
• Comprehensive set of Benchmarks
• More details at http://hadoop-rdma.cse.ohio-state.edu
BPOE-4, ASPLOS '14
Future Plans of OSU Big Data Project
62
Tutorial today afternoon (1:30-5:00pm), Canyon C
Accelerating Big Data Processing with Hadoop and Memcached on Datacenters with Modern Networking and
Storage Architecture
63
More Details on Acceleration of Hadoop ..
BPOE-4, ASPLOS '14
• Presented an overview of Big Data, Hadoop and Memcached
• Provided an overview of Networking Technologies
• Discussed challenges in accelerating Hadoop
• Presented initial designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC and HBase
• Results are promising
• Many other open issues need to be solved
• Will enable Big Data community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner
BPOE-4, ASPLOS '14
Concluding Remarks
64
BPOE-4, ASPLOS '14
Personnel Acknowledgments Current Students
– S. Chakraborthy (Ph.D.)
– N. Islam (Ph.D.)
– J. Jose (Ph.D.)
– M. Li (Ph.D.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekhar (Ph.D.)
Past Students – P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
65
Past Research Scientist – S. Sur
Current Post-Docs – X. Lu
– K. Hamidouche
Current Programmers – M. Arnold
– J. Perkins
Past Post-Docs – H. Wang
– X. Besseron
– H.-W. Jin
– M. Luo
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– M. Rahman (Ph.D.)
– D. Shankar (Ph.D.)
– R. Shir (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)
– E. Mancini
– S. Marcarelli
– J. Vienne
Current Senior Research Associate – H. Subramoni
Past Programmers – D. Bureddy
BPOE-4, ASPLOS '14
Pointers
http://nowlab.cse.ohio-state.edu
[email protected] http://www.cse.ohio-state.edu/~panda
66
http://mvapich.cse.ohio-state.edu
http://hadoop-rdma.cse.ohio-state.edu