
Page 1:

High Performance Big Data (HiBD): Accelerating Hadoop, Spark and Memcached on Modern Clusters

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Talk at Mellanox Theatre (SC ’15)

by

Page 2:

Data Processing and Management on Modern Clusters

• Substantial impact on designing and utilizing modern data management and processing systems in multiple tiers

– Front-end data accessing and serving (Online)

• Memcached + DB (e.g. MySQL), HBase

– Back-end data analytics (Offline)

• HDFS, MapReduce, Spark

Mellanox Booth (SC '15)

Page 3:

[Figure: Spark standalone architecture - a Driver running the SparkContext coordinates a Master (with Zookeeper) and multiple Worker nodes on top of HDFS.]

Spark - An Example of Back-end Data Processing Middleware

• An in-memory data-processing framework

– Iterative machine learning jobs

– Interactive data analytics

– Scala-based implementation

– Standalone, YARN, Mesos

• Scalable, communication and I/O intensive

– Wide dependencies between Resilient Distributed Datasets (RDDs)

– MapReduce-like shuffle operations to repartition RDDs

– Sockets-based communication
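The MapReduce-like shuffle mentioned above repartitions an RDD by key so that every reducer sees all values for its keys. A minimal Python sketch of hash-based repartitioning (an illustration of the idea only, not Spark's actual streaming implementation):

```python
# Minimal sketch of a hash-partitioned shuffle, as used to repartition
# key-value records across reducers. Illustrative only; Spark performs
# this over the network between workers.

def shuffle(records, num_partitions):
    """Group (key, value) records into num_partitions buckets by key hash."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # Records with the same key always land in the same partition,
        # so one reducer sees all values for its keys.
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = shuffle(records, 2)
```

Because every record must cross this partitioning step, shuffle traffic is what makes wide dependencies communication-intensive.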

http://spark.apache.org

Page 4:


Memcached - An Example of Front-end Data Management Middleware

• Three-layer architecture of Web 2.0

– Web Servers, Memcached Servers, Database Servers

• Distributed Caching Layer

– Aggregates spare memory from multiple nodes

– General purpose

• Typically used to cache database queries, results of API calls

• Scalable model, but typical usage very network intensive
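The caching layer described above is commonly used in a cache-aside pattern: check Memcached first, query the database on a miss, then populate the cache. A hedged Python sketch, with a plain dict standing in for the Memcached cluster and a function standing in for the database query (both are illustrative stand-ins, not real client APIs):

```python
# Cache-aside lookup: check the distributed cache first, fall back to the
# database on a miss, then populate the cache. A dict stands in for the
# Memcached servers and a function for the database; both are stand-ins.

cache = {}          # stand-in for the Memcached servers
db_queries = 0      # counts how often the "database" is hit

def query_db(key):
    global db_queries
    db_queries += 1
    return f"row-for-{key}"  # stand-in for an expensive SQL query

def get(key):
    value = cache.get(key)   # memcached get
    if value is None:        # cache miss: go to the database
        value = query_db(key)
        cache[key] = value   # memcached set; the next get is served from RAM
    return value

get("user:42")   # miss: hits the database once
get("user:42")   # hit: served from the cache
```

Every cache hit is a small network round-trip to a Memcached server, which is why this usage pattern is so network-intensive at scale.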

[Figure: Web frontend servers (Memcached clients) connect over the Internet to Memcached servers, which connect over high-performance networks to database servers; each server node comprises CPUs, main memory, SSD, and HDD.]

Page 5:

Drivers for Modern HPC Clusters

• High-End Computing (HEC) is growing dramatically

– High Performance Computing

– Big Data Computing

• Technology Advancement

– Multi-core/many-core technologies

– Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)

– Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD

– Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

[Systems pictured: Tianhe-2, Titan, Stampede, Tianhe-1A]

Page 6:

Interconnects and Protocols in the Open Fabrics Stack

[Figure: Applications and middleware access the network through either the Sockets or the Verbs interface. Protocol/hardware combinations shown: kernel-space TCP/IP over 1/10/40 GigE (Ethernet adapter and switch); hardware-offloaded TCP/IP over 10/40 GigE-TOE; IPoIB over InfiniBand adapter and switch; user-space RSockets and SDP over InfiniBand; user-space TCP/IP over iWARP adapter and Ethernet switch; user-space RDMA over RoCE adapter and Ethernet switch; and native IB verbs (RDMA, user space) over InfiniBand adapter and switch.]

Page 7:

Presentation Outline

• Challenges for Accelerating Big Data Processing

• Accelerating Big Data Processing on RDMA-enabled High-Performance Interconnects

– RDMA-enhanced Designs for HDFS, MapReduce, Spark, and Memcached

• Accelerating Big Data Processing on High-Performance Storage

– Enhanced Designs for HDFS and MapReduce over Lustre

– SSD-assisted Hybrid Memory for RDMA-based Memcached

• The High-Performance Big Data (HiBD) Project

• Conclusion and Q&A

Page 8:

How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?

Bring HPC and Big Data processing into a “convergent trajectory”!

• What are the major bottlenecks in current Big Data processing middleware (e.g. Hadoop, Spark, and Memcached)?

• Can the bottlenecks be alleviated with new designs that take advantage of HPC technologies?

• Can RDMA-enabled high-performance interconnects benefit Big Data processing?

• Can HPC clusters with high-performance storage systems (e.g. SSD, parallel file systems) benefit Big Data applications?

• How much performance benefit can be achieved through enhanced designs?

• How do we design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?

Page 9:

Challenges in Designing Communication and I/O Libraries for Big Data Systems

[Figure: A layered view. Applications run over Big Data middleware (HDFS, MapReduce, HBase, Spark, and Memcached), which today uses Sockets as its programming model - can other protocols, such as RDMA, and upper-level changes help? A communication and I/O library in between must address point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance, and benchmarks. Underneath sit commodity computing system architectures (multi- and many-core architectures and accelerators), networking technologies (InfiniBand, 1/10/40 GigE, and intelligent NICs), and storage technologies (HDD and SSD).]

Page 10:

Presentation Outline

• Challenges for Accelerating Big Data Processing

• Accelerating Big Data Processing on RDMA-enabled High-Performance Interconnects

– RDMA-enhanced Designs for HDFS, MapReduce, Spark, and Memcached

• Accelerating Big Data Processing on High-Performance Storage

– Enhanced Designs for HDFS and MapReduce over Lustre

– SSD-assisted Hybrid Memory for RDMA-based Memcached

• The High-Performance Big Data (HiBD) Project

• Conclusion and Q&A

Page 11:

Design Overview of HDFS with RDMA

[Figure: Applications call into HDFS; the write path goes through the OSU design via the Java Native Interface (JNI) to Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ...), while other operations use the Java socket interface over 1/10 GigE or IPoIB networks.]

• Design Features

– RDMA-based HDFS write

– RDMA-based HDFS replication

– Parallel replication support

– On-demand connection setup

– InfiniBand/RoCE support

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012

N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014
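On-demand connection setup, listed among the design features above, means a connection to a peer is established lazily on first use and then cached for reuse, avoiding up-front all-to-all setup cost. A hedged Python sketch (the connection objects are illustrative stand-ins for RDMA queue pairs, not the actual design):

```python
# Sketch of on-demand connection setup: a connection to a peer is created
# only on first use and cached for reuse. Connection strings here are
# illustrative stand-ins for established RDMA queue pairs.

connections = {}     # peer -> connection, built lazily
setups = 0           # counts how many connections were actually created

def connect(peer):
    global setups
    setups += 1
    return f"qp-to-{peer}"   # stand-in for an established RDMA queue pair

def get_connection(peer):
    # The first request to a peer pays the setup cost; later requests reuse it.
    if peer not in connections:
        connections[peer] = connect(peer)
    return connections[peer]

get_connection("datanode-1")
get_connection("datanode-1")   # reused, no new setup
get_connection("datanode-2")
```

The payoff is that setup cost scales with the number of peers actually contacted rather than with cluster size.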

Page 12:

Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede

• Cluster with 64 DataNodes, single HDD per node

– 64% improvement in throughput over IPoIB (FDR) for 256 GB file size

– 37% improvement in latency over IPoIB (FDR) for 256 GB file size

[Charts: aggregated throughput (MBps) and execution time (s) vs. file size (64/128/256 GB), IPoIB (FDR) vs. OSU-IB (FDR); throughput increased by 64%, latency reduced by 37%.]

Page 13:

Design Overview of MapReduce with RDMA

[Figure: Applications drive MapReduce (JobTracker, TaskTracker, Map, Reduce); the OSU design connects through the Java Native Interface (JNI) to Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ...), alongside the Java socket interface over 1/10 GigE or IPoIB networks.]

• Design Features

– RDMA-based shuffle

– Prefetching and caching of map output

– Efficient shuffle algorithms

– In-memory merge

– On-demand shuffle adjustment

– Hybrid overlapping

• map, shuffle, and merge

• shuffle, merge, and reduce

– On-demand connection setup

– InfiniBand/RoCE support

M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014.

Page 14:

Evaluations using PUMA Workload

• 50% improvement in SelfJoin over IPoIB (QDR) for 80 GB data size

• 49% improvement in SeqCount over IPoIB (QDR) for 30 GB data size

[Chart: normalized execution time for AdjList (30 GB), SelfJoin (80 GB), SeqCount (30 GB), WordCount (30 GB), and InvertIndex (30 GB) with 10GigE, IPoIB (QDR), and OSU-IB (QDR).]

Page 15:

Design Overview of Spark with RDMA

• Design Features

– RDMA-based shuffle

– SEDA-based plugins

– Dynamic connection management and sharing

– Non-blocking and out-of-order data transfer

– Off-JVM-heap buffer management

– InfiniBand/RoCE support


• Enables high-performance RDMA communication while supporting the traditional socket interface

• JNI layer bridges Scala-based Spark with a communication library written in native code

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014

Page 16:

Performance Evaluation on TACC Stampede - SortByTest

• Intel SandyBridge + FDR, 16 worker nodes, 256 cores, (256M 256R)

• RDMA-based design for Spark 1.4.0

• RDMA vs. IPoIB with 256 concurrent tasks, single disk per node and RamDisk. For the SortByKey test:

– Shuffle time reduced by up to 77% over IPoIB (56 Gbps)

– Total time reduced by up to 58% over IPoIB (56 Gbps)

[Charts: 16 worker nodes, SortByTest shuffle time and total time (sec) vs. data size (16/32/64 GB), IPoIB vs. RDMA; shuffle time reduced by 77%, total time by 58%.]

Page 17:

Memcached-RDMA Design

• Server and client perform a negotiation protocol

– Master thread assigns clients to appropriate worker thread

• Once a client is assigned a verbs worker thread, it can communicate directly and is “bound” to that thread

• All other Memcached data structures are shared among RDMA and Sockets worker threads

• Native IB-verbs-level Design and evaluation with

– Server : Memcached (http://memcached.org)

– Client : libmemcached (http://libmemcached.org)

– Different networks and protocols: 10GigE, IPoIB, native IB (RC, UD)
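The client-to-worker binding described above (master thread assigns each client to a matching worker, after which the client communicates directly with that worker) can be sketched as a simple dispatcher. A hedged Python sketch; round-robin assignment and the worker names are illustrative choices, not the actual Memcached-RDMA policy:

```python
# Sketch of the master-thread dispatch: each arriving client is assigned to
# a worker thread of the matching transport (sockets or verbs) and stays
# bound to that worker for the rest of its session.

import itertools

workers = {
    "sockets": ["sockets-worker-0", "sockets-worker-1"],
    "verbs": ["verbs-worker-0", "verbs-worker-1"],
}
rr = {kind: itertools.cycle(ws) for kind, ws in workers.items()}
bindings = {}   # client -> worker, fixed once assigned

def assign(client, transport):
    # Master thread: bind a new client to the next worker of its transport;
    # an already-bound client keeps talking to the same worker directly.
    if client not in bindings:
        bindings[client] = next(rr[transport])
    return bindings[client]

w1 = assign("client-A", "verbs")
w2 = assign("client-A", "verbs")   # already bound: same worker
w3 = assign("client-B", "sockets")
```

Note that only the binding is per-transport; as the slide states, the underlying Memcached data structures remain shared across all worker threads.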

[Figure: The master thread accepts sockets and RDMA clients and assigns each to a sockets or verbs worker thread; worker threads share the Memcached data structures (memory slabs and items).]

Page 18:

Memcached Performance (TACC Stampede)

Experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR)

• Memcached Get latency

– 4 bytes: OSU-IB 2.84 us; IPoIB 75.53 us

– 2K bytes: OSU-IB 4.49 us; IPoIB 123.42 us

– Latency reduced by more than 20X

• Memcached throughput (4 bytes)

– 4080 clients: OSU-IB 556 Kops/sec; IPoIB 233 Kops/sec

– Nearly 2X improvement in throughput

[Charts: Memcached GET latency (us) vs. message size (1 B to 4 KB) and throughput (thousands of transactions per second) vs. number of clients (16 to 4080), OSU-IB (FDR) vs. IPoIB (FDR).]

Page 19:

Presentation Outline

• Challenges for Accelerating Big Data Processing

• Accelerating Big Data Processing on RDMA-enabled High-Performance Interconnects

– RDMA-enhanced Designs for HDFS, MapReduce, Spark, and Memcached

• Accelerating Big Data Processing on High-Performance Storage

– Enhanced Designs for HDFS and MapReduce over Lustre

– SSD-assisted Hybrid Memory for RDMA-based Memcached

• The High-Performance Big Data (HiBD) Project

• Conclusion and Q&A

Page 20:

Optimize I/O Performance for Big Data Applications with High-Performance Storage on HPC Clusters

[Figure: Lustre setup - compute nodes (App Master, Map/Reduce tasks, RAM, local disk, Lustre client) connect to metadata servers and object storage servers.]

• HPC cluster deployment

– Hybrid topological solution of a Beowulf architecture with separate I/O nodes

– Lean compute nodes with a light OS; more memory space; small local storage

– Sub-cluster of dedicated I/O nodes with parallel file systems, such as Lustre

• HDFS over heterogeneous storage

– RAMDisk, SSD, HDD, and Lustre as data directories

• MapReduce over Lustre

– Local disk is used as the intermediate data directory

– Lustre is used as the intermediate data directory

• Hybrid Memcached with RAM + SSD

Page 21:

Triple-H: Enhanced HDFS with In-memory and Heterogeneous Storage

[Figure: Applications sit over data placement policies (hybrid replication, eviction/promotion) that span RAM disk, SSD, HDD, and Lustre.]

• Design Features

– Three modes

• Default (HHH)

• In-Memory (HHH-M)

• Lustre-Integrated (HHH-L)

– Policies to efficiently utilize the heterogeneous storage devices

• RAM, SSD, HDD, Lustre

– Eviction/promotion based on data usage pattern

– Hybrid replication

– Lustre-Integrated mode:

• Lustre-based fault-tolerance

N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015
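A placement policy over heterogeneous tiers like the ones above can be sketched as choosing the fastest tier that still has capacity, falling back down the hierarchy. A hedged Python sketch; the tier ordering, capacities, and first-fit rule are illustrative assumptions, not Triple-H's actual policy:

```python
# Sketch of a tiered data placement policy: a new block goes to the fastest
# storage tier with remaining capacity. Capacities (in blocks) and the
# first-fit rule are illustrative, not the Triple-H design.

TIERS = ["ramdisk", "ssd", "hdd", "lustre"]   # fastest to slowest
capacity = {"ramdisk": 2, "ssd": 4, "hdd": 8, "lustre": 10**9}
used = {t: 0 for t in TIERS}

def place(block_size):
    """Return the tier chosen for a block, updating usage accounting."""
    for tier in TIERS:
        if used[tier] + block_size <= capacity[tier]:
            used[tier] += block_size
            return tier
    raise RuntimeError("no tier has capacity")

# Eight unit-size blocks: the first two fill the RAM disk, the next four
# fill the SSD, and the remainder spill to HDD.
placements = [place(1) for _ in range(8)]
```

An eviction/promotion policy, as in the slide, would additionally move blocks between tiers as their access pattern changes.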

Page 22:

Performance Improvement on TACC Stampede (HHH)

• For 160 GB TestDFSIO on 32 nodes

– Write throughput: 7x improvement over IPoIB (FDR)

– Read throughput: 2x improvement over IPoIB (FDR)

• For 120 GB RandomWriter on 32 nodes

– 3x improvement over IPoIB (QDR)

[Charts: TestDFSIO total throughput (MBps) for write/read, IPoIB (FDR) vs. OSU-IB (FDR); RandomWriter execution time (s) vs. cluster size:data size (8:30, 16:60, 32:120).]

Page 23:

Performance Improvement on SDSC Gordon (HHH vs. Tachyon)

• RandomWriter: 200 GB on 32 SSD nodes (QDR)

– HHH reduces the execution time by 47% over Tachyon-WTC (IPoIB), 56% over HDFS (IPoIB)

• Sort: 200 GB on 32 SSD nodes (QDR)

– HHH reduces the execution time by 19% over Tachyon-WTC (IPoIB), 31% over HDFS (IPoIB)

[Charts: RandomWriter and Sort execution time (s) vs. cluster size:data size (8:50, 16:100, 32:200 GB) for HDFS, Tachyon-WTC, Tachyon-C, and HHH.]

N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData '15, October 2015

Page 24:

Design Overview of Shuffle Strategies for MapReduce over Lustre

[Figure: Map tasks write to the intermediate data directory on Lustre; reduce tasks fetch map output via Lustre read or RDMA, then perform in-memory merge/sort and reduce.]

• Design Features

– Two shuffle approaches

• Lustre read-based shuffle

• RDMA-based shuffle

– Hybrid shuffle algorithm to benefit from both shuffle approaches

– Dynamically adapts to the better shuffle approach for each shuffle request, based on profiling values for each Lustre read operation

– In-memory merge and overlapping of different phases are kept similar to the RDMA-enhanced MapReduce design

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015.

Page 25:

Performance Improvement of MapReduce over Lustre on TACC-Stampede

• Local disk is used as the intermediate data directory

• For 500 GB Sort on 64 nodes: 44% improvement over IPoIB (FDR)

• For 640 GB Sort on 128 nodes: 48% improvement over IPoIB (FDR)

[Charts: job execution time (sec) vs. data size (300/400/500 GB), and job execution time for 20/40/80/160/320/640 GB on clusters of 4/8/16/32/64/128 nodes, IPoIB (FDR) vs. OSU-IB (FDR).]

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014.

Page 26:

Case Study - Performance Improvement of MapReduce over Lustre on SDSC-Gordon

• Lustre is used as the intermediate data directory

• For 80 GB Sort on 8 nodes: 34% improvement over IPoIB (QDR)

• For 120 GB TeraSort on 16 nodes: 25% improvement over IPoIB (QDR)

[Charts: job execution time (sec) vs. data size (40/60/80 GB Sort; 40/80/120 GB TeraSort) for IPoIB (QDR), OSU-Lustre-Read (QDR), OSU-RDMA-IB (QDR), and OSU-Hybrid-IB (QDR).]

Page 27:

Overview of SSD-Assisted Hybrid RDMA-Memcached Design

• Design Features

– Hybrid slab allocation and management for higher data retention

– Log-structured sequence of blocks flushed to SSD

– SSD fast random reads to achieve low-latency object access

– Uses LRU to evict data to SSD
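The LRU-to-SSD eviction above can be sketched with an ordered map tracking recency in RAM: when the RAM slabs are full, the least-recently-used item is appended to a log-structured SSD area. A hedged Python sketch; the capacity, the list standing in for the SSD log, and the linear SSD lookup are illustrative simplifications:

```python
# Sketch of LRU eviction from RAM to SSD: an OrderedDict tracks recency in
# RAM; on overflow, the least-recently-used item is appended to an
# append-only log (a list stands in for the log-structured SSD area).

from collections import OrderedDict

RAM_CAPACITY = 2
ram = OrderedDict()   # key -> value, most recently used last
ssd_log = []          # append-only log of (key, value) blocks flushed to SSD

def set_item(key, value):
    if key in ram:
        ram.move_to_end(key)
    ram[key] = value
    if len(ram) > RAM_CAPACITY:
        # Evict the least recently used item to the SSD log.
        ssd_log.append(ram.popitem(last=False))

def get_item(key):
    if key in ram:
        ram.move_to_end(key)          # hit in RAM: refresh recency
        return ram[key]
    for k, v in reversed(ssd_log):    # miss: fast random read from SSD
        if k == key:
            return v
    return None

set_item("a", 1)
set_item("b", 2)
get_item("a")        # touch "a" so "b" becomes the LRU item
set_item("c", 3)     # RAM full: evicts "b" to the SSD log
```

This is what "higher data retention" buys: an item evicted from RAM is still served (at SSD read latency) instead of being dropped.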

Page 28:

Performance Evaluation on SDSC Comet (IB FDR + SATA/NVMe SSDs)

• Memcached latency test with Zipf distribution; server with 1 GB memory; 32 KB key-value pair size; total size of data accessed is 1 GB (when data fits in memory) and 1.5 GB (when data does not fit in memory)

• When data fits in memory: RDMA-Mem/Hybrid gives a 5x improvement over IPoIB-Mem

• When data does not fit in memory: RDMA-Hybrid gives 2x-2.5x over IPoIB/RDMA-Mem

[Chart: Set/Get latency (us) for IPoIB-Mem, RDMA-Mem, RDMA-Hybrid-SATA, and RDMA-Hybrid-NVMe, with data fitting and not fitting in memory; latency broken down into slab allocation (SSD write), cache check+load (SSD read), cache update, server response, client wait, and miss penalty.]

Page 29:

Presentation Outline

• Challenges for Accelerating Big Data Processing

• Accelerating Big Data Processing on RDMA-enabled High-Performance Interconnects

– RDMA-enhanced Designs for HDFS, MapReduce, Spark, and Memcached

• Accelerating Big Data Processing on High-Performance Storage

– Enhanced Designs for HDFS and MapReduce over Lustre

– SSD-assisted Hybrid Memory for RDMA-based Memcached

• The High-Performance Big Data (HiBD) Project

• Conclusion and Q&A

Page 30:

• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)

– Plugins for Apache and HDP Hadoop distributions

• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)

• RDMA for Memcached (RDMA-Memcached)

• OSU HiBD-Benchmarks (OHB)

– HDFS and Memcached Micro-benchmarks

• http://hibd.cse.ohio-state.edu

• User base: 135 organizations from 20 countries

• More than 13,300 downloads from the project site

The High-Performance Big Data (HiBD) Project


Page 31:

Different Modes of RDMA for Apache Hadoop 2.x

• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation for better fault-tolerance as well as performance. This mode is enabled by default in the package.

• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.

• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.

• MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.

• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in the different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).

Page 32:

Future Plans of OSU High Performance Big Data Project

• Upcoming releases of RDMA-enhanced packages will support

– CDH Plugin

– Spark

– HBase

• Upcoming releases of OSU HiBD Micro-Benchmarks (OHB) will support

– MapReduce

– RPC

• Exploration of other components (threading models, QoS, virtualization, accelerators, etc.)

• Advanced designs with upper-level changes and optimizations

Page 33:

Concluding Remarks

• Discussed communication and I/O challenges in accelerating Big Data middleware

• Presented initial designs to take advantage of InfiniBand/RDMA and high-performance storage architectures for Hadoop, Spark, and Memcached

• Presented challenges in designing benchmarks

• Results are promising

• Many other open issues need to be solved

• Will enable the Big Data processing community to take advantage of modern HPC technologies to carry out analytics in a fast and scalable manner

Page 34:

An Additional Talk

• Tomorrow, Thursday (11:00-11:30am)

• Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR

Page 35:

[email protected]

Mellanox Booth (SC '15)

Thank You!

The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/

35

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/