TRANSCRIPT
Accelerating Big Data Processing with RDMA-Enhanced Apache Hadoop
Dhabaleswar K. (DK) Panda The Ohio State University
E-mail: [email protected] http://www.cse.ohio-state.edu/~panda
Keynote Talk at BPOE-4, in conjunction with ASPLOS ‘14
(March 2014)
Introduction to Big Data Applications and Analytics
• Big Data has become one of the most important elements of business analytics
• Provides groundbreaking opportunities for enterprise information management and decision making
• The amount of data is exploding; companies are capturing and digitizing more information than ever
• The rate of information growth appears to be exceeding Moore’s Law
2 BPOE-4, ASPLOS '14
• Commonly accepted 3V's of Big Data: Volume, Velocity, Variety
  – Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf
• 5V's of Big Data – 3V + Value, Veracity
• Scientific Data Management, Analysis, and Visualization
• Applications examples – Climate modeling
– Combustion
– Fusion
– Astrophysics
– Bioinformatics
• Data-Intensive Tasks – Run large-scale simulations on supercomputers
– Dump data on parallel storage systems
– Collect experimental / observational data
– Move experimental / observational data to analysis sites
– Visual analytics – help understand data visually
BPOE-4, ASPLOS '14 3
Not Only in Internet Services - Big Data in Scientific Domains
• Hadoop: http://hadoop.apache.org – The most popular framework for Big Data Analytics
– HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc.
• Spark: http://spark-project.org – Provides primitives for in-memory cluster computing; Jobs can load data into memory and
query it repeatedly
• Shark: http://shark.cs.berkeley.edu – A fully Apache Hive-compatible data warehousing system
• Storm: http://storm-project.net – A distributed real-time computation system for real-time analytics, online machine learning,
continuous computation, etc.
• S4: http://incubator.apache.org/s4 – A distributed system for processing continuous unbounded streams of data
• Web 2.0: RDBMS + Memcached (http://memcached.org) – Memcached: A high-performance, distributed memory object caching systems
BPOE-4, ASPLOS '14 4
Typical Solutions or Architectures for Big Data Applications and Analytics
• Focuses on large data and data analysis
• Hadoop (e.g. HDFS, MapReduce, HBase, RPC) environment is gaining a lot of momentum
• http://wiki.apache.org/hadoop/PoweredBy
BPOE-4, ASPLOS '14 5
Who Are Using Hadoop?
• Scale: 42,000 nodes, 365 PB of HDFS storage, 10M daily slot hours
  – http://developer.yahoo.com/blogs/hadoop/apache-hbase-yahoo-multi-tenancy-helm-again-171710422.html
• A new Jim Gray Sort record: 1.42 TB/min over 2,100 nodes
  – Hadoop 0.23.7, an early branch of Hadoop 2.x
  – http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-sets-gray-sort-record-yellow-elephant-092704088.html
BPOE-4, ASPLOS '14 6
Hadoop at Yahoo!
http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-more-ever-54421.html
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – HBase
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
7 BPOE-4, ASPLOS '14
Big Data Processing with Hadoop
• The open-source implementation of the MapReduce programming model for Big Data Analytics
• Major components – HDFS – MapReduce
• Underlying Hadoop Distributed File System (HDFS) used by both MapReduce and HBase
• The model scales, but the heavy communication during intermediate phases can be further optimized
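To make the MapReduce model above concrete, here is a minimal word-count sketch against the Apache Hadoop 1.x org.apache.hadoop.mapreduce API; the class name and input/output paths are illustrative only, not part of the talk:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit <word, 1> for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts shuffled to this key.
  // The shuffle between map and reduce is the communication this talk targets.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");   // Hadoop 1.x style job setup
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```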
(Figure: Hadoop framework stack – User Applications over MapReduce and HBase, which run on HDFS and Hadoop Common (RPC))
8 BPOE-4, ASPLOS '14
Hadoop MapReduce Architecture & Operations
• Map and Reduce Tasks carry out the total job execution
• Communication in the shuffle phase uses HTTP over Java Sockets
(Figure: MapReduce job flow, including disk operations during the map, shuffle and reduce phases)
9 BPOE-4, ASPLOS '14
Hadoop Distributed File System (HDFS)
• Primary storage of Hadoop; highly reliable and fault-tolerant
• Adopted by many reputed organizations – e.g., Facebook, Yahoo!
• NameNode: stores the file system namespace
• DataNode: stores data blocks
• Developed in Java for platform-independence and portability
• Uses sockets for communication!
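As a concrete illustration of the socket-backed write and read paths mentioned above, the sketch below writes and reads back a file through the standard Hadoop FileSystem API; the file path is made up for the example, and the NameNode address is taken from the usual core-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name (the NameNode address) from core-site.xml.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hdfs-write-demo.txt");  // illustrative path

    // Write path: the client asks the NameNode for target DataNodes, then
    // streams the block to the first DataNode, which replicates it onward.
    FSDataOutputStream out = fs.create(file, true);
    out.write("hello HDFS".getBytes("UTF-8"));
    out.close();

    // Read path: the client fetches block locations and reads from a DataNode.
    FSDataInputStream in = fs.open(file);
    byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
    in.readFully(buf);
    in.close();
    System.out.println(new String(buf, "UTF-8"));
  }
}
```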
(Figure: HDFS Architecture – Client, NameNode and DataNodes connected via RPC)
10 BPOE-4, ASPLOS '14
System-Level Interaction Between Clients and Data Nodes in HDFS (figure)
BPOE-4, ASPLOS '14 11
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
12 BPOE-4, ASPLOS '14
BPOE-4, ASPLOS '14 13
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)
(Chart: Number of Clusters and Percentage of Clusters in the Top500 list over the timeline)
HPC and Advanced Interconnects
• High-Performance Computing (HPC) has adopted advanced interconnects and protocols – InfiniBand
– 10 Gigabit Ethernet/iWARP
– RDMA over Converged Enhanced Ethernet (RoCE)
• Very Good Performance – Low latency (a few microseconds)
– High Bandwidth (100 Gb/s with dual FDR InfiniBand)
– Low CPU overhead (5-10%)
• The OpenFabrics software stack with IB, iWARP and RoCE interfaces is driving HPC systems
• Many such systems in Top500 list
14 BPOE-4, ASPLOS '14
15
Trends of Networking Technologies in TOP500 Systems
(Chart: Interconnect Family – Systems Share; the percentage share of InfiniBand is steadily increasing)
BPOE-4, ASPLOS '14
Open Standard InfiniBand Networking Technology
• Introduced in Oct 2000
• High Performance Data Transfer
  – Interprocessor communication and I/O
  – Low latency (<1.0 microsec), high bandwidth (up to 12.5 GigaBytes/sec -> 100 Gbps), and low CPU utilization (5-10%)
• Flexibility for LAN and WAN communication
• Multiple Transport Services
  – Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram
  – Provides flexibility to develop upper layers
• Multiple Operations
  – Send/Recv
  – RDMA Read/Write
  – Atomic Operations (very unique)
    • Enable high-performance and scalable implementations of distributed locks, semaphores, and collective communication operations
• Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, ...
16 BPOE-4, ASPLOS '14
• Message Passing Interface (MPI) – Most commonly used programming model for HPC applications
• Parallel File Systems
• Storage Systems
• Almost 13 years of Research and Development since InfiniBand was introduced in October 2000
• Other Programming Models are emerging to take advantage of High-Performance Networks – UPC
– OpenSHMEM
– Hybrid MPI+PGAS (OpenSHMEM and UPC)
17
Use of High-Performance Networks for Scientific Computing
BPOE-4, ASPLOS '14
• High Performance open-source MPI Library for InfiniBand, 10Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002
– MVAPICH2-X (MPI + PGAS), Available since 2012
– Support for GPGPUs and MIC
– Used by more than 2,100 organizations (HPC Centers, Industry and Universities) in 71 countries
– More than 203,000 downloads from OSU site directly
– Empowering many TOP500 clusters • 7th ranked 462,462-core cluster (Stampede) at TACC
• 11th ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
• 16th ranked 96,192-core cluster (Pleiades) at NASA and many others
– Available with software stacks of many IB, HSE, and server vendors including Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede System
MVAPICH2/MVAPICH2-X Software
18 BPOE-4, ASPLOS '14
BPOE-4, ASPLOS '14 19
One-way Latency: MPI over IB with MVAPICH2
(Charts: Small Message Latency and Large Message Latency – latency (us) vs. message size (bytes); small-message latencies range from 0.99 to 1.82 us across the configurations shown)
Configurations: Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, Ivy-ConnectIB-DualFDR
DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
BPOE-4, ASPLOS '14 20
Bandwidth: MPI over IB with MVAPICH2
(Charts: Unidirectional Bandwidth and Bidirectional Bandwidth – bandwidth (MBytes/sec) vs. message size (bytes); peak unidirectional bandwidth reaches about 12,810 MBytes/sec and peak bidirectional bandwidth about 24,727 MBytes/sec for the dual-port ConnectIB FDR configurations)
Configurations: Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, Ivy-ConnectIB-DualFDR
DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
21 BPOE-4, ASPLOS '14
Can High-Performance Interconnects Benefit Hadoop?
• Most of the current Hadoop environments use Ethernet Infrastructure with Sockets
• Concerns for performance and scalability
• Usage of High-Performance Networks is beginning to draw interest from many companies
• What are the challenges?
• Where do the bottlenecks lie?
• Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?
• Can HPC Clusters with High-Performance networks be used for Big Data applications using Hadoop?
22 BPOE-4, ASPLOS '14
Designing Communication and I/O Libraries for Big Data Systems: Challenges
BPOE-4, ASPLOS '14 23
(Figure: challenges in the Big Data software stack – Applications; Big Data Middleware (HDFS, MapReduce, HBase and Memcached); Programming Models (Sockets), with the question of Other Protocols; a Communication and I/O Library covering Point-to-Point Communication, Threaded Models and Synchronization, QoS, Fault-Tolerance, I/O and File Systems, Virtualization, and Benchmarks; all running over Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs), Commodity Computing System Architectures (multi- and many-core architectures and accelerators), and Storage Technologies (HDD and SSD))
BPOE-4, ASPLOS '14 24
Common Protocols using Open Fabrics
(Figure: protocol options between the Application/Middleware interface and the network)
• Sockets interface
  – 1/10/40 GigE: TCP/IP in kernel space (Ethernet driver), Ethernet adapter and switch
  – 10/40 GigE-TOE: TCP/IP with hardware offload, Ethernet adapter and switch
  – IPoIB: TCP/IP over InfiniBand adapter and switch
  – RSockets: user-space RDMA over InfiniBand adapter and switch
  – SDP: user-space RDMA over InfiniBand adapter and switch
• Verbs interface
  – iWARP: user-space RDMA over iWARP adapter and Ethernet switch
  – RoCE: user-space RDMA over RoCE adapter and Ethernet switch
  – IB Verbs: user-space RDMA over InfiniBand adapter and switch
Can Hadoop be Accelerated with High-Performance Networks and Protocols?
• Current Design: Application -> Sockets -> 1/10 GigE Network
• Enhanced Designs: Application -> Accelerated Sockets -> Verbs / Hardware Offload -> 10 GigE or InfiniBand
• Our Approach: Application -> OSU Design -> Verbs Interface -> 10 GigE or InfiniBand
• Sockets not designed for high performance
  – Stream semantics often mismatch for upper layers (Hadoop and HBase)
  – Zero-copy not available for non-blocking sockets
25
BPOE-4, ASPLOS '14
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
26 BPOE-4, ASPLOS '14
• High-Performance Design of Hadoop over RDMA-enabled Interconnects
– High performance design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and RPC components
– Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.8
– Based on Apache Hadoop 1.2.1
– Compliant with Apache Hadoop 1.2.1 APIs and applications
– Tested with • Mellanox InfiniBand adapters (DDR, QDR and FDR)
• RoCE
• Various multi-core platforms
• Different file systems with disks and SSDs
– http://hadoop-rdma.cse.ohio-state.edu
RDMA for Apache Hadoop Project
27 BPOE-4, ASPLOS '14
Design Overview of HDFS with RDMA
• JNI Layer bridges the Java-based HDFS with a communication library written in native code (a sketch of such a bridge follows this slide)
• Only the communication part of HDFS Write is modified; no change in the HDFS architecture
• Enables high-performance RDMA communication while supporting the traditional socket interface
• Design Features
  – RDMA-based HDFS write
  – RDMA-based HDFS replication
  – InfiniBand/RoCE support
(Figure: HDFS design – Applications -> HDFS; the Write path goes through the OSU Design and the Java Native Interface (JNI) to Verbs over RDMA-capable networks (IB, 10GigE/iWARP, RoCE), while other operations use the Java Socket Interface over 1/10 GigE or IPoIB networks)
28 BPOE-4, ASPLOS '14
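The JNI bridge mentioned above can be pictured roughly as follows. This is a hand-written sketch with invented class, library, and method names (not the actual OSU code); it only shows how a Java-side HDFS write path can hand buffers to a native verbs library:

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical JNI bridge: the Java side keeps HDFS logic unchanged and only
 * swaps the transport used by the write path. All names below are invented.
 */
public class RdmaBridge {
  static {
    // Assumed native library implementing connection setup and RDMA transfers
    // on top of InfiniBand/RoCE verbs (libhdfsrdma.so is an invented name).
    System.loadLibrary("hdfsrdma");
  }

  // Establish (or reuse) a verbs connection to a DataNode.
  public native long connect(String dataNodeHost, int port);

  // Hand a packet of block data to the native layer; direct ByteBuffers let
  // the native code register and transmit the memory without a JVM-side copy.
  public native int writePacket(long connection, ByteBuffer packet, int length);

  // Tear down the verbs connection.
  public native void close(long connection);

  // Sketch of how a sender-side write loop could use the bridge.
  public void sendBlock(String host, int port, ByteBuffer[] packets) {
    long conn = connect(host, port);
    try {
      for (ByteBuffer p : packets) {
        writePacket(conn, p, p.remaining());
      }
    } finally {
      close(conn);
    }
  }
}
```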
Communication Time in HDFS
• Cluster with HDD DataNodes
– 30% improvement in communication time over IPoIB (QDR)
– 56% improvement in communication time over 10GigE
• Similar improvements are obtained for SSD DataNodes
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda , High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012
(Chart: HDFS communication time (s) vs. file size (2GB–10GB), comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
29 BPOE-4, ASPLOS '14
Evaluations using TestDFSIO
• Cluster with 8 HDD DataNodes, single disk per node
– 24% improvement over IPoIB (QDR) for 20GB file size
• Cluster with 4 SSD DataNodes, single SSD per node
– 61% improvement over IPoIB (QDR) for 20GB file size
Cluster with 8 HDD Nodes, TestDFSIO with 8 maps; Cluster with 4 SSD Nodes, TestDFSIO with 4 maps
(Charts: Total Throughput (MBps) vs. File Size (GB: 5–20) for both clusters, comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
30 BPOE-4, ASPLOS '14
Evaluations using TestDFSIO
• Cluster with 4 DataNodes, 1 HDD per node
– 24% improvement over IPoIB (QDR) for 20GB file size
• Cluster with 4 DataNodes, 2 HDD per node
– 31% improvement over IPoIB (QDR) for 20GB file size
• 2 HDD vs 1 HDD
– 76% improvement for OSU-IB (QDR)
– 66% improvement for IPoIB (QDR)
(Chart: Total Throughput (MBps) vs. File Size (GB: 5–20), comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR), each with 1 HDD and 2 HDDs per node)
31
BPOE-4, ASPLOS '14
32
Evaluations using Enhanced DFSIO of Intel HiBench on SDSC-Gordon
• Cluster with 32 DataNodes, single SSD per node – 47% improvement in throughput over IPoIB (QDR) for 128GB file size
– 16% improvement in latency over IPoIB (QDR) for 128GB file size
BPOE-4, ASPLOS '14
(Charts: Aggregated Throughput (MBps) and Execution Time (s) vs. File Size (GB: 32, 64, 128), comparing IPoIB (QDR) and OSU-IB (QDR))
BPOE-4, ASPLOS '14 33
Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede
• Cluster with 64 DataNodes, single HDD per node – 64% improvement in throughput over IPoIB (FDR) for 256GB file size
– 37% improvement in latency over IPoIB (FDR) for 256GB file size
(Charts: Aggregated Throughput (MBps) and Execution Time (s) vs. File Size (GB: 64, 128, 256), comparing IPoIB (FDR) and OSU-IB (FDR))
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
34 BPOE-4, ASPLOS '14
Design Overview of MapReduce with RDMA
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Java based MapReduce with communication library written in native code
• InfiniBand/RoCE support
(Figure: MapReduce design – Applications -> MapReduce (Job Tracker, Task Tracker, Map, Reduce); the OSU Design goes through the Java Native Interface (JNI) to Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE), while the default path uses the Java Socket Interface over 1/10 GigE or IPoIB networks)
• Design features for RDMA (see the sketch after this slide)
  – Prefetch/Caching of MOF (Map Output Files)
  – In-Memory Merge
  – Overlap of Merge & Reduce
35 BPOE-4, ASPLOS '14
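The last two design features above (prefetching map output and overlapping merge with reduce) can be pictured as a simple producer/consumer pattern. Everything below is illustrative Java with invented names, not the actual OSU MapReduce implementation:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Illustrative only: a fetcher thread prefetches map output files (MOFs) into
 * memory while the reducer merges and consumes them, so network transfer,
 * merge, and reduce can overlap instead of running strictly one after another.
 */
public class OverlappedShuffleSketch {
  private static final ByteBuffer END = ByteBuffer.allocate(0); // end-of-stream marker

  private final BlockingQueue<ByteBuffer> prefetched = new ArrayBlockingQueue<>(64);

  // Producer: fetch map outputs ahead of time (over RDMA in the real design).
  public Thread startFetcher(final Iterable<String> mapOutputIds) {
    Thread fetcher = new Thread(() -> {
      try {
        for (String mofId : mapOutputIds) {
          prefetched.put(fetchMapOutput(mofId));  // blocks when the cache is full
        }
        prefetched.put(END);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    fetcher.start();
    return fetcher;
  }

  // Consumer: merge/reduce each segment as soon as it is available in memory.
  public void mergeAndReduce() throws InterruptedException {
    ByteBuffer segment;
    while ((segment = prefetched.take()) != END) {
      mergeIntoCurrentRun(segment);   // in-memory merge
      reduceReadyKeys();              // reduce overlapped with remaining fetches
    }
  }

  // Placeholders standing in for the real I/O and merge logic.
  private ByteBuffer fetchMapOutput(String mofId) { return ByteBuffer.allocate(4 * 1024); }
  private void mergeIntoCurrentRun(ByteBuffer segment) { /* ... */ }
  private void reduceReadyKeys() { /* ... */ }
}
```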
Evaluations using Sort
8 HDD DataNodes
• With 8 HDD DataNodes for 40GB sort
– 41% improvement over IPoIB (QDR)
– 39% improvement over UDA-IB (QDR)
36
M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC Workshop, held in conjunction with IPDPS, May 2013.
BPOE-4, ASPLOS '14
8 SSD DataNodes
• With 8 SSD DataNodes for 40GB sort
– 52% improvement over IPoIB (QDR)
– 44% improvement over UDA-IB (QDR)
(Charts: Job Execution Time (sec) vs. Data Size (GB: 25, 30, 35, 40) on 8 HDD DataNodes and on 8 SSD DataNodes, comparing 10GigE, IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR))
Evaluations using TeraSort
• 100 GB TeraSort with 8 DataNodes with 2 HDD per node
– 51% improvement over IPoIB (QDR)
– 41% improvement over UDA-IB (QDR)
37 BPOE-4, ASPLOS '14
(Chart: Job Execution Time (sec) vs. Data Size (GB: 60, 80, 100), comparing 10GigE, IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR))
• For 240GB Sort in 64 nodes – 40% improvement over IPoIB (QDR)
with HDD used for HDFS
38
Performance Evaluation on Larger Clusters
BPOE-4, ASPLOS '14
(Charts: Job Execution Time (sec) – Sort in OSU Cluster: data sizes 60/120/240 GB on cluster sizes 16/32/64, comparing IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR); TeraSort in TACC Stampede: data sizes 80/160/320 GB on cluster sizes 16/32/64, comparing IPoIB (FDR), UDA-IB (FDR) and OSU-IB (FDR))
• For 320GB TeraSort in 64 nodes – 38% improvement over IPoIB
(FDR) with HDD used for HDFS
• 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size
• 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size
Evaluations using PUMA Workload
39 BPOE-4, ASPLOS '14
(Chart: Normalized Execution Time for the PUMA benchmarks AdjList (30GB), SelfJoin (80GB), SeqCount (30GB), WordCount (30GB) and InvertIndex (30GB), comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
• 50 small MapReduce jobs executed in a cluster size of 4
• Maximum performance benefit 24% over IPoIB (QDR)
• Average performance benefit 12.89% over IPoIB (QDR)
40
Evaluations using SWIM
(Chart: Execution Time (sec) per job for the 50 SWIM jobs, comparing IPoIB (QDR) and OSU-IB (QDR))
BPOE-4, ASPLOS '14
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
41 BPOE-4, ASPLOS '14
Design Overview of Hadoop RPC with RDMA
(Figure: Hadoop RPC design – Applications -> Hadoop RPC; the OSU design goes through the Java Native Interface (JNI) to Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE), while the default path uses the Java Socket Interface over 1/10 GigE or IPoIB networks)
• Enables high-performance RDMA communication while supporting the traditional socket interface
• Design Features (see the sketch after this slide)
  – JVM-bypassed buffer management
  – On-demand connection setup
  – RDMA or send/recv based adaptive communication
  – InfiniBand/RoCE support
42
BPOE-4, ASPLOS '14
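Two of the design features above (on-demand connection setup and adaptive RDMA vs. send/recv communication) can be pictured with the following sketch; the class, threshold, and method names are invented for illustration and are not the actual OSU RPC code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Illustrative sketch of on-demand connections and size-based protocol choice. */
public class AdaptiveRpcTransportSketch {
  // Hypothetical cutoff: small payloads go over send/recv, large ones over RDMA.
  private static final int RDMA_THRESHOLD_BYTES = 32 * 1024;

  private final ConcurrentMap<String, Connection> connections = new ConcurrentHashMap<>();

  /** Connections are created lazily the first time a peer is contacted. */
  private Connection getConnection(String peer) {
    return connections.computeIfAbsent(peer, Connection::new);
  }

  /** Pick the transport per message instead of per connection. */
  public void send(String peer, byte[] payload) {
    Connection c = getConnection(peer);
    if (payload.length >= RDMA_THRESHOLD_BYTES) {
      c.rdmaWrite(payload);   // zero-copy transfer of large buffers
    } else {
      c.sendRecv(payload);    // eager path for small messages
    }
  }

  /** Stand-in for a verbs-backed connection (details omitted). */
  static final class Connection {
    Connection(String peer) { /* on-demand queue-pair setup would go here */ }
    void rdmaWrite(byte[] payload) { /* ... */ }
    void sendRecv(byte[] payload) { /* ... */ }
  }
}
```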
Hadoop RPC over IB - Gain in Latency and Throughput
• Hadoop RPC over IB PingPong Latency
– 1 byte: 39 us; 4 KB: 52 us
– 42%-49% and 46%-50% improvements compared with the performance on 10 GigE and IPoIB (32Gbps), respectively
• Hadoop RPC over IB Throughput
– 512 bytes & 48 clients: 135.22 Kops/sec
– 82% and 64% improvements compared with the peak performance on 10 GigE and IPoIB (32Gbps), respectively
(Charts: Hadoop RPC PingPong latency (us) vs. payload size (1 byte–4 KB) and throughput (Kops/sec) vs. number of clients (8–64), comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
43 BPOE-4, ASPLOS '14
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
44 BPOE-4, ASPLOS '14
• For 5GB,
– 53.4% improvement over IPoIB, 126.9% improvement over 10GigE with HDD used for HDFS
– 57.8% improvement over IPoIB, 138.1% improvement over 10GigE with SSD used for HDFS
• For 20GB,
– 10.7% improvement over IPoIB, 46.8% improvement over 10GigE with HDD used for HDFS
– 73.3% improvement over IPoIB, 141.3% improvement over 10GigE with SSD used for HDFS
45
Performance Benefits - TestDFSIO
Cluster with 4 HDD Nodes, TestDFSIO with 4 maps total; Cluster with 4 SSD Nodes, TestDFSIO with 4 maps total
(Charts: Total Throughput (MBps) vs. Size (GB: 5–20) for both clusters, comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
BPOE-4, ASPLOS '14
• For 80GB Sort in a cluster size of 8, – 40.7% improvement over IPoIB, 32.2% improvement over 10GigE with HDD
used for HDFS
– 43.6% improvement over IPoIB, 42.9% improvement over 10GigE with SSD used for HDFS
46
Performance Benefits - Sort
Cluster with 8 HDD Nodes, Sort with <8 maps, 8 reduces>/host; Cluster with 8 SSD Nodes, Sort with <8 maps, 8 reduces>/host
(Charts: Job Execution Time (sec) vs. Sort Size (GB: 48, 64, 80) for both clusters, comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
BPOE-4, ASPLOS '14
• For 100GB TeraSort in a cluster size of 8, – 45% improvement over IPoIB, 39% improvement over 10GigE with HDD used
for HDFS
– 46% improvement over IPoIB, 40% improvement over 10GigE with SSD used for HDFS
47
Performance Benefits - TeraSort
Cluster with 8 HDD Nodes, TeraSort with <8 maps, 8 reduces>/host; Cluster with 8 SSD Nodes, TeraSort with <8 maps, 8 reduces>/host
(Charts: Job Execution Time (sec) vs. Size (GB: 80, 90, 100) for both clusters, comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
BPOE-4, ASPLOS '14
Performance Benefits – RandomWriter & Sort in SDSC-Gordon
• RandomWriter in a cluster of single SSD per node
  – 16% improvement over IPoIB (QDR) for 50GB in a cluster of 16
  – 20% improvement over IPoIB (QDR) for 300GB in a cluster of 64
• Sort in a cluster of single SSD per node
  – 20% improvement over IPoIB (QDR) for 50GB in a cluster of 16
  – 36% improvement over IPoIB (QDR) for 300GB in a cluster of 64
(Charts: Job Execution Time (sec) for RandomWriter and Sort – data sizes 50/100/200/300 GB on cluster sizes 16/32/48/64, comparing IPoIB (QDR) and OSU-IB (QDR))
48 BPOE-4, ASPLOS '14
Performance Evaluation with BigDataBench
(Charts: execution time for WordCount and Sort, comparing 10GigE, IPoIB (QDR) and RDMA (QDR))
• Better benefit for communication-intensive benchmarks, like Sort
• Smaller benefit for CPU-intensive benchmarks, like WordCount
49 BPOE-4, ASPLOS '14
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
50 BPOE-4, ASPLOS '14
• The current benchmarks provide some insight into overall performance behavior
• However, do not provide any information to the designer/developer on: – What is happening at the lower-layer?
– Where the benefits are coming from?
– Which design is leading to benefits or bottlenecks?
– Which component in the design needs to be changed and what will be its impact?
– Can performance gain/loss at the lower-layer be correlated to the performance gain/loss observed at the upper layer?
BPOE-4, ASPLOS '14 51
Are the Current Benchmarks Sufficient for Big Data?
OSU MPI Micro-Benchmarks (OMB) Suite
• A comprehensive suite of benchmarks to
  – Compare performance of different MPI libraries on various networks and systems
  – Validate low-level functionalities
  – Provide insights into the underlying MPI-level designs
• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth
• Extended later to
  – MPI-2 one-sided
  – Collectives
  – GPU-aware data movement
  – OpenSHMEM (point-to-point and collectives)
  – UPC
• Has become an industry standard
• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries, and even in procurement of large-scale systems
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with the MVAPICH2 stack
52 BPOE-4, ASPLOS '14
Challenges in Benchmarking of RDMA-based Designs
(Figure: the same Big Data communication and I/O stack as before, now including RDMA Protocols; current benchmarks exist only at the applications level, there are no benchmarks at the middleware and communication-library levels, and the open question is how the two correlate)
53
BPOE-4, ASPLOS '14
Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications
(Figure: the same stack, highlighting an iterative loop between Applications-Level Benchmarks at the top and Micro-Benchmarks at the communication and I/O library level)
BPOE-4, ASPLOS '14 54
Presentation Outline
• Overview of Hadoop
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop
• Designs and Case Studies
  – HDFS
  – MapReduce
  – RPC
  – Combination of components
• Challenges in Big Data Benchmarking
• Other Open Challenges in Acceleration
• Conclusion and Q&A
55 BPOE-4, ASPLOS '14
Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges
(Figure: the same stack, now with an RDMA Protocol in place in the communication and I/O library; the remaining question is what upper-level changes are needed)
BPOE-4, ASPLOS '14 56
• Multi-threading and Synchronization
  – Multi-threaded model exploration
  – Fine-grained synchronization and lock-free design
  – Unified helper threads for different components
  – Multi-endpoint design to support multi-threaded communication
• QoS and Virtualization
  – Network virtualization and locality-aware communication for Big Data middleware
  – Hardware-level virtualization support for end-to-end QoS
  – I/O scheduling and storage virtualization
  – Live migration
BPOE-4, ASPLOS '14 57
Open Challenges
• Support of Accelerators
  – Efficient designs for Big Data middleware to take advantage of NVIDIA GPGPUs and Intel MICs
  – Offload computation-intensive workloads to accelerators
  – Explore maximum overlap between communication and offloaded computation
• Fault Tolerance Enhancements
  – Novel data replication schemes for high-efficiency fault tolerance
  – Exploration of light-weight fault tolerance mechanisms for Big Data
• Support of Parallel File Systems
  – Optimize Big Data middleware over parallel file systems (e.g., Lustre) on modern HPC clusters
BPOE-4, ASPLOS '14 58
Open Challenges (Cont’d)
Case Study 1 - HDFS Replication: Pipelined vs Parallel
• Basic mechanism of HDFS fault tolerance is Data Replication – Replicates each block to multiple DataNodes
Can Parallel Replication benefit HDFS and its applications when HDFS runs on different kinds of High Performance
Networks?
BPOE-4, ASPLOS '14 59
(Figure: Pipelined Replication vs. Parallel Replication)
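The difference between the two replication schemes can be sketched as follows. This is illustrative Java with hypothetical helper names, not HDFS code: in the pipelined scheme the client sends the block once and each DataNode forwards it to the next, while in the parallel scheme the client pushes the block to all replicas concurrently.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

/** Illustrative contrast between pipelined and parallel block replication. */
public class ReplicationSketch {

  /** Pipelined: client -> DN1 -> DN2 -> DN3 (each node forwards downstream). */
  static void replicatePipelined(byte[] block, List<String> dataNodes) {
    // The client only talks to the first DataNode; the replication
    // "pipeline" is encoded in the downstream list handed to it.
    sendToDataNode(dataNodes.get(0), block, dataNodes.subList(1, dataNodes.size()));
  }

  /** Parallel: the client pushes the block to every replica at the same time. */
  static void replicateParallel(byte[] block, List<String> dataNodes)
      throws InterruptedException {
    CountDownLatch done = new CountDownLatch(dataNodes.size());
    for (String dn : dataNodes) {
      new Thread(() -> {
        sendToDataNode(dn, block, Collections.emptyList());   // no downstream targets
        done.countDown();
      }).start();
    }
    done.await();   // the block is considered written once all replicas have it
  }

  // Stand-in for the actual transfer (sockets in stock HDFS, verbs in OSU-IB).
  static void sendToDataNode(String dataNode, byte[] block, List<String> downstream) {
    /* ... */
  }
}
```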
(Charts: Total Throughput (MBps) vs. File Size (20–30 GB) and Execution Time (s) vs. File Size (40–80 GB), comparing pipelined and parallel replication over 10GigE, IPoIB (32Gbps) and OSU-IB (32Gbps))
Parallel Replication Evaluations on Different Networks using TestDFSIO
• Cluster with 8 HDD DataNodes – 11% improvement over 10GigE
– 10% improvement over IPoIB (32Gbps)
– 12% improvement over OSU-IB (32Gbps)
Cluster with 8 HDD DataNodes Cluster with 32 HDD DataNodes
BPOE-4, ASPLOS '14 60
N. Islam, X. Lu, M. W. Rahman, and D. K. Panda, Can Parallel Replication Benefit HDFS for High-Performance Interconnects? Int'l Symposium on High-Performance Interconnects (HotI '13), August 2013.
• Cluster with 32 HDD DataNodes – 8% improvement over IPoIB (32Gbps)
– 9% improvement over OSU-IB (32Gbps)
• For 500GB Sort in 64 nodes – 44% improvement over IPoIB (FDR)
61
Case Study 2 - Performance Improvement Potential of MapReduce over Lustre on HPC Clusters (Initial Study)
BPOE-4, ASPLOS '14
Sort in TACC Stampede Sort in SDSC Gordon
• For 100GB Sort in 16 nodes – 33% improvement over IPoIB (QDR)
(Charts: Job Execution Time (sec) vs. Data Size – 300/400/500 GB in TACC Stampede comparing IPoIB (FDR) and OSU-IB (FDR), and 60/80/100 GB in SDSC Gordon comparing IPoIB (QDR) and OSU-IB (QDR))
• Can more optimizations be achieved by leveraging more features of Lustre?
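As a rough illustration of how a MapReduce job can be pointed at a parallel file system in the first place, the snippet below configures Hadoop 1.x to use a POSIX-mounted file system instead of HDFS; the Lustre mount path is assumed purely for illustration, and the actual OSU design goes well beyond this plain configuration:

```java
import org.apache.hadoop.conf.Configuration;

public class LustreJobConfSketch {
  public static Configuration configure() {
    Configuration conf = new Configuration();

    // Use the local/POSIX file system interface over a shared parallel
    // file system mount instead of HDFS (the paths below are invented).
    conf.set("fs.default.name", "file:///");                     // Hadoop 1.x property name
    conf.set("mapred.local.dir", "/lustre/scratch/job-local");   // intermediate (shuffle) data

    return conf;
  }
}
```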
• Upcoming RDMA for Apache Hadoop Releases will support – HDFS Parallel Replication
– Advanced MapReduce
– HBase
– Hadoop 2.0.x
• Memcached-RDMA
• Exploration of other components (Threading models, QoS, Virtualization, Accelerators, etc.)
• Advanced designs with upper-level changes and optimizations
• Comprehensive set of Benchmarks
• More details at http://hadoop-rdma.cse.ohio-state.edu
BPOE-4, ASPLOS '14
Future Plans of OSU Big Data Project
62
Tutorial today afternoon (1:30-5:00pm), Canyon C
Accelerating Big Data Processing with Hadoop and Memcached on Datacenters with Modern Networking and
Storage Architecture
63
More Details on Acceleration of Hadoop ..
BPOE-4, ASPLOS '14
• Presented an overview of Big Data, Hadoop and Memcached
• Provided an overview of Networking Technologies
• Discussed challenges in accelerating Hadoop
• Presented initial designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC and HBase
• Results are promising
• Many other open issues need to be solved
• Will enable Big Data community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner
BPOE-4, ASPLOS '14
Concluding Remarks
64
BPOE-4, ASPLOS '14
Personnel Acknowledgments Current Students
– S. Chakraborthy (Ph.D.)
– N. Islam (Ph.D.)
– J. Jose (Ph.D.)
– M. Li (Ph.D.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekhar (Ph.D.)
Past Students – P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
65
Past Research Scientist – S. Sur
Current Post-Docs – X. Lu
– K. Hamidouche
Current Programmers – M. Arnold
– J. Perkins
Past Post-Docs – H. Wang
– X. Besseron
– H.-W. Jin
– M. Luo
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– M. Rahman (Ph.D.)
– D. Shankar (Ph.D.)
– R. Shir (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)
– E. Mancini
– S. Marcarelli
– J. Vienne
Current Senior Research Associate – H. Subramoni
Past Programmers – D. Bureddy
BPOE-4, ASPLOS '14
Pointers
http://nowlab.cse.ohio-state.edu
[email protected] http://www.cse.ohio-state.edu/~panda
66
http://mvapich.cse.ohio-state.edu
http://hadoop-rdma.cse.ohio-state.edu