accelerating big data processing with hadoop, spark … · accelerating big data processing with...

68
Accelerating Big Data Processing with Hadoop, Spark and Memcached Dhabaleswar K. (DK) Panda The Ohio State University E-mail: [email protected] http://www.cse.ohio-state.edu/~panda Talk at HPC Advisory Council Switzerland Conference (Mar ‘15) by

Upload: volien

Post on 05-Jun-2018

244 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Accelerating Big Data Processing with Hadoop, Spark and Memcached

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Talk at HPC Advisory Council Switzerland Conference (Mar ‘15)

by

Page 2: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Introduction to Big Data Applications and Analytics

• Big Data has become the one of the most

important elements of business analytics

• Provides groundbreaking opportunities

for enterprise information management

and decision making

• The amount of data is exploding;

companies are capturing and digitizing

more information than ever

• The rate of information growth appears

to be exceeding Moore’s Law

2 HPCAC Switzerland Conference (Mar '15)

• Commonly accepted 3V’s of Big Data • Volume, Velocity, Variety Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf

• 5V’s of Big Data – 3V + Value, Veracity

Page 3: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• Substantial impact on designing and utilizing modern data management and

processing systems in multiple tiers

– Front-end data accessing and serving (Online)

• Memcached + DB (e.g. MySQL), HBase

– Back-end data analytics (Offline)

• HDFS, MapReduce, Spark

HPCAC Switzerland Conference (Mar '15)

Data Management and Processing on Modern Clusters

3

Page 4: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• Open-source implementation of Google MapReduce, GFS, and BigTable

for Big Data Analytics

• Hadoop Common Utilities (RPC, etc.), HDFS, MapReduce, YARN

• http://hadoop.apache.org

4

Overview of Apache Hadoop Architecture

HPCAC Switzerland Conference (Mar '15)

Hadoop Distributed File System (HDFS)

MapReduce (Cluster Resource Management & Data Processing)

Hadoop Common/Core (RPC, ..)

Hadoop Distributed File System (HDFS)

YARN (Cluster Resource Management & Job Scheduling)

Hadoop Common/Core (RPC, ..)

MapReduce (Data Processing)

Other Models (Data Processing)

Hadoop 1.x Hadoop 2.x

Page 5: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Spark Architecture Overview

• An in-memory data-processing framework

– Iterative machine learning jobs

– Interactive data analytics

– Scala based Implementation

– Standalone, YARN, Mesos

• Scalable and communication intensive

– Wide dependencies between Resilient Distributed Datasets (RDDs)

– MapReduce-like shuffle operations to repartition RDDs

– Sockets based communication

HPCAC Switzerland Conference (Mar '15)

http://spark.apache.org

Worker

Worker

Worker

Worker

Master

HDFS

Driver

Zookeeper

Worker

SparkContext

5

Page 6: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Memcached Architecture

• Three-layer architecture of Web 2.0

– Web Servers, Memcached Servers, Database Servers

• Distributed Caching Layer

– Allows to aggregate spare memory from multiple nodes

– General purpose

• Typically used to cache database queries, results of API calls

• Scalable model, but typical usage very network intensive

6

Main

memoryCPUs

SSD HDD

High Performance Networks

... ...

...

Main

memoryCPUs

SSD HDD

Main

memoryCPUs

SSD HDD

Main

memoryCPUs

SSD HDD

Main

memoryCPUs

SSD HDD!"#$%"$#&

Web Frontend Servers (Memcached Clients)

(Memcached Servers)

(Database Servers)

High

Performance

Networks

High

Performance

NetworksInternet

HPCAC Switzerland Conference (Mar '15)

Page 7: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• Challenges for Accelerating Big Data Processing

• The High-Performance Big Data (HiBD) Project

• RDMA-based designs for Apache Hadoop and Spark

– Case studies with HDFS, MapReduce, and Spark

– Sample Performance Numbers for RDMA-Hadoop 2.x 0.9.6 Release

• RDMA-based designs for Memcached and HBase

– RDMA-based Memcached with SSD-assisted Hybrid Memory

– RDMA-based HBase

• Challenges in Designing Benchmarks for Big Data Processing

– OSU HiBD Benchmarks

• Conclusion and Q&A

Presentation Outline

HPCAC Switzerland Conference (Mar '15) 7

Page 8: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• High End Computing (HEC) is growing dramatically

– High Performance Computing

– Big Data Computing

• Technology Advancement

– Multi-core/many-core technologies and accelerators

– Remote Direct Memory Access (RDMA)-enabled

networking (InfiniBand and RoCE)

– Solid State Drives (SSDs) and Non-Volatile Random-

Access Memory (NVRAM)

– Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

Drivers for Modern HPC Clusters

Tianhe – 2 Titan Stampede Tianhe – 1A

HPCAC Switzerland Conference (Mar '15) 8

Page 9: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

HPCAC Switzerland Conference (Mar '15)

Kernel

Space

Interconnects and Protocols in the Open Fabrics Stack

Application / Middleware

Verbs

Ethernet

Adapter

Ethernet Switch

Ethernet Driver

TCP/IP

1/10/40 GigE

InfiniBand

Adapter

InfiniBand

Switch

IPoIB

IPoIB

Ethernet

Adapter

Ethernet Switch

Hardware Offload

TCP/IP

10/40 GigE-TOE

InfiniBand

Adapter

InfiniBand

Switch

User Space

RSockets

RSockets

iWARP Adapter

Ethernet Switch

TCP/IP

User Space

iWARP

RoCE Adapter

Ethernet Switch

RDMA

User Space

RoCE

InfiniBand

Switch

InfiniBand

Adapter

RDMA

User Space

IB Native

Sockets

Application / Middleware Interface

Protocol

Adapter

Switch

InfiniBand

Adapter

InfiniBand

Switch

RDMA

SDP

SDP

9

Page 10: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Wide Adoption of RDMA Technology

• Message Passing Interface (MPI) for HPC

• Parallel File Systems

– Lustre

– GPFS

• Delivering excellent performance:

– < 1.0 microsec latency

– 100 Gbps bandwidth

– 5-10% CPU utilization

• Delivering excellent scalability

HPCAC Switzerland Conference (Mar '15) 10

Page 11: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached)

Networking Technologies

(InfiniBand, 1/10/40GigE and Intelligent NICs)

Storage Technologies

(HDD and SSD)

Programming Models (Sockets)

Applications

Commodity Computing System Architectures

(Multi- and Many-core architectures and accelerators)

Other Protocols?

Communication and I/O Library

Point-to-Point Communication

QoS

Threaded Models and Synchronization

Fault-Tolerance I/O and File Systems

Virtualization

Benchmarks

RDMA Protocol

Challenges in Designing Communication and I/O Libraries for Big Data Systems

HPCAC Switzerland Conference (Mar '15)

Upper level

Changes?

11

Page 12: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?

Current Design

Application

Sockets

1/10 GigE Network

• Sockets not designed for high-performance

– Stream semantics often mismatch for upper layers

– Zero-copy not available for non-blocking sockets

Our Approach

Application

OSU Design

10 GigE or InfiniBand

Verbs Interface

HPCAC Switzerland Conference (Mar '15) 12

Page 13: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• Challenges for Accelerating Big Data Processing

• The High-Performance Big Data (HiBD) Project

• RDMA-based designs for Apache Hadoop and Spark

– Case studies with HDFS, MapReduce, and Spark

– Sample Performance Numbers for RDMA-Hadoop 2.x 0.9.6 Release

• RDMA-based designs for Memcached and HBase

– RDMA-based Memcached with SSD-assisted Hybrid Memory

– RDMA-based HBase

• Challenges in Designing Benchmarks for Big Data Processing

– OSU HiBD Benchmarks

• Conclusion and Q&A

Presentation Outline

HPCAC Switzerland Conference (Mar '15) 13

Page 14: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)

• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)

• RDMA for Memcached (RDMA-Memcached)

• OSU HiBD-Benchmarks (OHB)

• http://hibd.cse.ohio-state.edu

• Users Base: 95 organizations from 18 countries

• More than 2,900 downloads

• RDMA for Apache HBase and Spark will be available in near future

Overview of the HiBD Project and Releases

HPCAC Switzerland Conference (Mar '15) 14

Page 15: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• High-Performance Design of Hadoop over RDMA-enabled Interconnects

– High performance design with native InfiniBand and RoCE support at the verbs-level for

HDFS, MapReduce, and RPC components

– Enhanced HDFS with in-memory and heterogeneous storage

– High performance design of MapReduce over Lustre

– Easily configurable for different running modes (HHH, HHH-M, HHH-L, and MapReduce

over Lustre) and different protocols (native InfiniBand, RoCE, and IPoIB)

• Current release: 0.9.6

– Based on Apache Hadoop 2.6.0

– Compliant with Apache Hadoop 2.6.0 APIs and applications

– Tested with

• Mellanox InfiniBand adapters (DDR, QDR and FDR)

• RoCE support with Mellanox adapters

• Various multi-core platforms

• Different file systems with disks and SSDs and Lustre

– http://hibd.cse.ohio-state.edu

RDMA for Apache Hadoop 2.x Distribution

HPCAC Switzerland Conference (Mar '15) 15

Page 16: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• High-Performance Design of Memcached over RDMA-enabled Interconnects

– High performance design with native InfiniBand and RoCE support at the verbs-

level for Memcached and libMemcached components

– Easily configurable for native InfiniBand, RoCE and the traditional sockets-based

support (Ethernet and InfiniBand with IPoIB)

– High performance design of SSD-Assisted Hybrid Memory

• Current release: 0.9.3

– Based on Memcached 1.4.22 and libMemcached 1.0.18

– Compliant with libMemcached APIs and applications

– Tested with

• Mellanox InfiniBand adapters (DDR, QDR and FDR)

• RoCE support with Mellanox adapters

• Various multi-core platforms

• SSD

– http://hibd.cse.ohio-state.edu

RDMA for Memcached Distribution

HPCAC Switzerland Conference (Mar '15) 16

Page 17: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• Released in OHB 0.7.1 (ohb_memlat)

• Evaluates the performance of stand-alone Memcached

• Three different micro-benchmarks

– SET Micro-benchmark: Micro-benchmark for memcached set

operations

– GET Micro-benchmark: Micro-benchmark for memcached get

operations

– MIX Micro-benchmark: Micro-benchmark for a mix of memcached

set/get operations (Read:Write ratio is 90:10)

• Calculates average latency of Memcached operations

• Can measure throughput in Transactions Per Second

HPCAC Switzerland Conference (Mar '15)

OSU HiBD Micro-Benchmark (OHB) Suite - Memcached

17

Page 18: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• Challenges for Accelerating Big Data Processing

• The High-Performance Big Data (HiBD) Project

• RDMA-based designs for Apache Hadoop and Spark

– Case studies with HDFS, MapReduce, and Spark

– Sample Performance Numbers for Hadoop 2.x 0.9.6 Release

• RDMA-based designs for Memcached and HBase

– RDMA-based Memcached with SSD-assisted Hybrid Memory

– RDMA-based HBase

• Challenges in Designing Benchmarks for Big Data Processing

– OSU HiBD Benchmarks

• Conclusion and Q&A

Presentation Outline

HPCAC Switzerland Conference (Mar '15) 18

Page 19: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• RDMA-based Designs and Performance Evaluation – HDFS

– MapReduce

– Spark

Acceleration Case Studies and In-Depth Performance Evaluation

HPCAC Switzerland Conference (Mar '15) 19

Page 20: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Design Overview of HDFS with RDMA

• Enables high performance RDMA communication, while supporting traditional socket interface

• JNI Layer bridges Java based HDFS with communication library written in native code

HDFS

Verbs

RDMA Capable Networks (IB, 10GE/ iWARP, RoCE ..)

Applications

1/10 GigE, IPoIB Network

Java Socket Interface

Java Native Interface (JNI)

Write Others

OSU Design

• Design Features

– RDMA-based HDFS write

– RDMA-based HDFS replication

– Parallel replication support

– On-demand connection setup

– InfiniBand/RoCE support

HPCAC Switzerland Conference (Mar '15) 20

Page 21: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Communication Times in HDFS

• Cluster with HDD DataNodes

– 30% improvement in communication time over IPoIB (QDR)

– 56% improvement in communication time over 10GigE

• Similar improvements are obtained for SSD DataNodes

Reduced by 30%

HPCAC Switzerland Conference (Mar '15)

0

5

10

15

20

25

2GB 4GB 6GB 8GB 10GB

Co

mm

un

icat

ion

Tim

e (s

)

File Size (GB)

10GigE IPoIB (QDR) OSU-IB (QDR)

21

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda ,

High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012

N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in

RDMA-Enhanced HDFS, HPDC '14, June 2014

Page 22: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Triple-H

Heterogeneous Storage

• Design Features

– Three modes

• Default (HHH)

• In-Memory (HHH-M)

• Lustre-Integrated (HHH-L)

– Policies to efficiently utilize the

heterogeneous storage devices

• RAM, SSD, HDD, Lustre

– Eviction/Promotion based on

data usage pattern

– Hybrid Replication

– Lustre-Integrated mode:

• Lustre-based fault-tolerance

22

Enhanced HDFS with In-memory and Heterogeneous Storage

Hybrid

Replication

Data Placement Policies

Eviction/

Promotion

RAM Disk SSD HDD

Lustre

N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS

on HPC Clusters with Heterogeneous Storage Architecture, CCGrid ’15, May 2015

Applications

HPCAC Switzerland Conference (Mar '15)

Page 23: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• HHH (default): Heterogeneous storage devices with hybrid

replication schemes

– I/O operations over RAM disk, SSD, and HDD

– Hybrid replication (in-memory and persistent storage)

– Better fault-tolerance as well as performance

• HHH-M: High-performance in-memory I/O operations

– Memory replication (in-memory only with lazy persistence)

– As much performance benefit as possible

• HHH-L: Lustre integrated

– Take advantage of the Lustre available in HPC clusters

– Lustre-based fault-tolerance (No HDFS replication)

– Reduced local storage space usage

HPCAC Switzerland Conference (Mar '15) 23

Enhanced HDFS with In-memory and Heterogeneous Storage – Three Modes

Page 24: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• For 160GB TestDFSIO in 32 nodes

– Write Throughput: 7x improvement

over IPoIB (FDR)

– Read Throughput: 2x improvement

over IPoIB (FDR)

Performance Improvement on TACC Stampede (HHH)

• For 120GB RandomWriter in 32

nodes

– 3x improvement over IPoIB (QDR)

24

0

1000

2000

3000

4000

5000

6000

write read

Tota

l Th

rou

ghp

ut

(MB

ps)

IPoIB (FDR) OSU-IB (FDR)

TestDFSIO

0

50

100

150

200

250

300

8:30 16:60 32:120

Exe

cuti

on

Tim

e (

s)

Cluster Size:Data Size

IPoIB (FDR) OSU-IB (FDR)

RandomWriter

HPCAC Switzerland Conference (Mar '15)

Increased by 7x

Increased by 2x Reduced by 3x

Page 25: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• For 60GB Sort in 8 nodes

– 24% improvement over default HDFS

– 54% improvement over Lustre

– 33% storage space saving compared to default HDFS

Performance Improvement on SDSC Gordon (HHH-L)

25

0

50

100

150

200

250

300

350

400

450

500

20 40 60

Exe

cuti

on

Tim

e (

s)

Data Size (GB)

HDFS-IPoIB (QDR)

Lustre-IPoIB (QDR)

OSU-IB (QDR)

Storage space for 60GB Sort Sort

Storage Used (GB)

HDFS-IPoIB (QDR)

360

Lustre-IPoIB (QDR)

120

OSU-IB (QDR) 240

HPCAC Switzerland Conference (Mar '15)

Reduced by 54%

Page 26: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

26

Evaluation with PUMA and CloudBurst (HHH-L/HHH)

0

500

1000

1500

2000

2500

SequenceCount Grep

Exe

cuti

on

Tim

e (

s)

HDFS-IPoIB (QDR)

Lustre-IPoIB (QDR)

OSU-IB (QDR) HDFS-IPoIB (FDR)

OSU-IB (FDR)

60.24 s 48.3 s

CloudBurst PUMA

• PUMA on OSU RI

– SequenceCount with HHH-L: 17%

benefit over Lustre, 8% over HDFS

– Grep with HHH: 29.5% benefit over

Lustre, 13.2% over HDFS

• CloudBurst on TACC Stampede

– With HHH: 19% improvement

over HDFS

HPCAC Switzerland Conference (Mar '15)

Reduced by 17%

Reduced by 29.5%

Page 27: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• RDMA-based Designs and Performance Evaluation – HDFS

– MapReduce

– Spark

Acceleration Case Studies and In-Depth Performance Evaluation

HPCAC Switzerland Conference (Mar '15) 27

Page 28: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Design Overview of MapReduce with RDMA

MapReduce

Verbs

RDMA Capable Networks

(IB, 10GE/ iWARP, RoCE ..)

OSU Design

Applications

1/10 GigE, IPoIB Network

Java Socket Interface

Java Native Interface (JNI)

Job

Tracker

Task

Tracker

Map

Reduce

HPCAC Switzerland Conference (Mar '15)

• Enables high performance RDMA communication, while supporting traditional socket interface

• JNI Layer bridges Java based MapReduce with communication library written in native code

• Design Features

– RDMA-based shuffle

– Prefetching and caching map output

– Efficient Shuffle Algorithms

– In-memory merge

– On-demand Shuffle Adjustment

– Advanced overlapping

• map, shuffle, and merge

• shuffle, merge, and reduce

– On-demand connection setup

– InfiniBand/RoCE support

28

Page 29: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Advanced Overlapping among different phases

• A hybrid approach to achieve

maximum possible overlapping

in MapReduce across all phases

compared to other approaches – Efficient Shuffle Algorithms

– Dynamic and Efficient Switching

– On-demand Shuffle Adjustment

Default Architecture

Enhanced Overlapping

Advanced Overlapping

M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda,

HOMR: A Hybrid Approach to Exploit Maximum

Overlapping in MapReduce over High Performance

Interconnects, ICS, June 2014.

29 HPCAC Switzerland Conference (Mar '15)

Page 30: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• For 240GB Sort in 64 nodes (512

cores)

– 40% improvement over IPoIB (QDR)

with HDD used for HDFS

Performance Evaluation of Sort and TeraSort

HPCAC Switzerland Conference (Mar '15)

0

200

400

600

800

1000

1200

Data Size: 60 GB Data Size: 120 GB Data Size: 240 GB

Cluster Size: 16 Cluster Size: 32 Cluster Size: 64

Job

Exe

cuti

on

Tim

e (s

ec)

IPoIB (QDR) UDA-IB (QDR) OSU-IB (QDR)

Sort in OSU Cluster

0

100

200

300

400

500

600

700

Data Size: 80 GB Data Size: 160 GBData Size: 320 GB

Cluster Size: 16 Cluster Size: 32 Cluster Size: 64

Job

Exe

cuti

on

Tim

e (s

ec)

IPoIB (FDR) UDA-IB (FDR) OSU-IB (FDR)

TeraSort in TACC Stampede

• For 320GB TeraSort in 64 nodes

(1K cores)

– 38% improvement over IPoIB

(FDR) with HDD used for HDFS

30

Reduced by 40%

Reduced by 38%

Page 31: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size

• 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size

Evaluations using PUMA Workload

HPCAC Switzerland Conference (Mar '15)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AdjList (30GB) SelfJoin (80GB) SeqCount (30GB) WordCount (30GB) InvertIndex (30GB)

No

rmal

ized

Exe

cuti

on

Tim

e

Benchmarks

10GigE IPoIB (QDR) OSU-IB (QDR)

31

Reduced by 50% Reduced by 49%

Page 32: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Optimize Hadoop YARN MapReduce over Parallel File Systems

MetaData Servers

Object Storage Servers

Compute Nodes

App Master

Map Reduce

Lustre Client

Lustre Setup

• HPC Cluster Deployment – Hybrid topological solution of Beowulf

architecture with separate I/O nodes – Lean compute nodes with light OS; more

memory space; small local storage – Sub-cluster of dedicated I/O nodes with

parallel file systems, such as Lustre

• MapReduce over Lustre – Local disk is used as the intermediate data

directory – Lustre is used as the intermediate data

directory

32 HPCAC Switzerland Conference (Mar '15)

Page 33: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Intermediate Data Directory

Design Overview of Shuffle Strategies for MapReduce over Lustre

HPCAC Switzerland Conference (Mar '15)

• Design Features

– Two shuffle approaches

• Lustre read based shuffle

• RDMA based shuffle

– Hybrid shuffle algorithm to take benefit from both shuffle approaches

– Dynamically adapts to the better shuffle approach for each shuffle request based on profiling values for each Lustre read operation

– In-memory merge and overlapping of different phases are kept similar to RDMA-enhanced MapReduce design

33

Map 1 Map 2 Map 3

Lustre

Reduce 1 Reduce 2

Lustre Read / RDMA

In-memory

merge/sort reduce

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN

MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015.

In-memory

merge/sort reduce

Page 34: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• For 500GB Sort in 64 nodes

– 44% improvement over IPoIB (FDR)

Performance Improvement of MapReduce over Lustre on TACC-Stampede

HPCAC Switzerland Conference (Mar '15)

• For 640GB Sort in 128 nodes

– 48% improvement over IPoIB (FDR)

0

200

400

600

800

1000

1200

300 400 500

Job

Exe

cuti

on

Tim

e (s

ec)

Data Size (GB)

IPoIB (FDR)

OSU-IB (FDR)

0

50

100

150

200

250

300

350

400

450

500

20 GB 40 GB 80 GB 160 GB 320 GB 640 GB

Cluster: 4 Cluster: 8 Cluster: 16 Cluster: 32 Cluster: 64 Cluster: 128

Job

Exe

cuti

on

Tim

e (s

ec)

IPoIB (FDR) OSU-IB (FDR)

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-

based Approach Benefit?, Euro-Par, August 2014.

• Local disk is used as the intermediate data directory

34

Reduced by 48% Reduced by 44%

Page 35: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• For 80GB Sort in 8 nodes

– 34% improvement over IPoIB (QDR)

Case Study - Performance Improvement of MapReduce over Lustre on SDSC-Gordon

HPCAC Switzerland Conference (Mar '15)

• For 120GB TeraSort in 16 nodes

– 25% improvement over IPoIB (QDR)

• Lustre is used as the intermediate data directory

35

0

100

200

300

400

500

600

700

800

900

40 60 80

Job

Exe

cuti

on

Tim

e (s

ec)

Data Size (GB)

IPoIB (QDR)

OSU-Lustre-Read (QDR)

OSU-RDMA-IB (QDR)

OSU-Hybrid-IB (QDR)

0

100

200

300

400

500

600

700

800

900

40 80 120

Job

Exe

cuti

on

Tim

e (s

ec)

Data Size (GB)

IPoIB (QDR)

OSU-Lustre-Read (QDR)

OSU-RDMA-IB (QDR)

OSU-Hybrid-IB (QDR)

Reduced by 25% Reduced by 34%

Page 36: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• RDMA-based Designs and Performance Evaluation – HDFS

– MapReduce

– Spark

Acceleration Case Studies and In-Depth Performance Evaluation

HPCAC Switzerland Conference (Mar '15) 36

Page 37: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Design Overview of Spark with RDMA

• Design Features

– RDMA based shuffle

– SEDA-based plugins

– Dynamic connection management and sharing

– Non-blocking and out-of-order data transfer

– Off-JVM-heap buffer management

– InfiniBand/RoCE support

HPCAC Switzerland Conference (Mar '15)

• Enables high performance RDMA communication, while supporting traditional socket interface

• JNI Layer bridges Scala based Spark with communication library written in native code

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data

Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014

37

Page 38: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

HPCAC Switzerland Conference (Mar '15)

Preliminary Results of Spark-RDMA Design - GroupBy

0

1

2

3

4

5

6

7

8

9

4 6 8 10

Gro

up

By

Tim

e (s

ec)

Data Size (GB)

0

1

2

3

4

5

6

7

8

9

8 12 16 20

Gro

up

By

Tim

e (s

ec)

Data Size (GB)

10GigE

IPoIB

RDMA

Cluster with 4 HDD Nodes, GroupBy with 32 cores Cluster with 8 HDD Nodes, GroupBy with 64 cores

• Cluster with 4 HDD Nodes, single disk per node, 32 concurrent tasks

– 18% improvement over IPoIB (QDR) for 10GB data size

• Cluster with 8 HDD Nodes, single disk per node, 64 concurrent tasks

– 20% improvement over IPoIB (QDR) for 20GB data size

38

Reduced by 20% Reduced by 18%

Page 39: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• Challenges for Accelerating Big Data Processing

• The High-Performance Big Data (HiBD) Project

• RDMA-based designs for Apache Hadoop and Spark

– Case studies with HDFS, MapReduce, and Spark

– Sample Performance Numbers for Hadoop 2.x 0.9.6 Release

• RDMA-based designs for Memcached and HBase

– RDMA-based Memcached with SSD-assisted Hybrid Memory

– RDMA-based HBase

• Challenges in Designing Benchmarks for Big Data Processing

– OSU HiBD Benchmarks

• Conclusion and Q&A

Presentation Outline

HPCAC Switzerland Conference (Mar '15) 39

Page 40: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

0

1000

2000

3000

4000

5000

6000

7000

8000

80 100 120

Tota

l Th

rou

ghp

ut

(MB

ps)

Data Size (GB)

IPoIB (FDR)

OSU-IB (FDR)

0

50

100

150

200

250

300

80 100 120

Exec

uti

on

Tim

e (s

)

Data Size (GB)

IPoIB (FDR)

OSU-IB (FDR)

• For Throughput,

– 6-7.8x improvement over IPoIB

for 80-120 GB file size

40

Performance Benefits – TestDFSIO in TACC Stampede (HHH)

Cluster with 32 Nodes with HDD, TestDFSIO with a total 128 maps

• For Latency,

– 2.5-3x improvement over

IPoIB for 80-120 GB file size

Increased by 7.8x Reduced by 2.5x

HPCAC Switzerland Conference (Mar '15)

Page 41: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

0

50

100

150

200

250

80 100 120

Exec

uti

on

Tim

e (s

)

Data Size (GB)

IPoIB (FDR)

OSU-IB (FDR)

0

50

100

150

200

250

80 100 120

Exec

uti

on

Tim

e (s

)

Data Size (GB)

IPoIB (FDR)

OSU-IB (FDR)

41

Performance Benefits – RandomWriter & TeraGen in TACC-Stampede (HHH)

Cluster with 32 Nodes with a total of 128 maps

• RandomWriter

– 3-4x improvement over IPoIB

for 80-120 GB file size

• TeraGen

– 4-5x improvement over IPoIB

for 80-120 GB file size

RandomWriter TeraGen

Reduced by 3x Reduced by 4x

HPCAC Switzerland Conference (Mar '15)

Page 42: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

0

100

200

300

400

500

600

700

800

900

80 100 120

Exe

cuti

on

Tim

e (

s)

Data Size (GB)

IPoIB (FDR) OSU-IB (FDR)

0

100

200

300

400

500

600

80 100 120

Exec

uti

on

Tim

e (s

)

Data Size (GB)

IPoIB (FDR) OSU-IB (FDR)

42

Performance Benefits – Sort & TeraSort in TACC-Stampede (HHH)

Cluster with 32 Nodes with a total of 128 maps and 64 reduces

• Sort with single HDD per node

– 40-52% improvement over IPoIB

for 80-120 GB data

• TeraSort with single HDD per node

– 42-44% improvement over IPoIB

for 80-120 GB data

Reduced by 52% Reduced by 44%

Cluster with 32 Nodes with a total of 128 maps and 57 reduces

HPCAC Switzerland Conference (Mar '15)

Page 43: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

0

5000

10000

15000

20000

25000

30000

60 80 100

Tota

l Th

rou

ghp

ut

(MB

ps)

Data Size (GB)

IPoIB (FDR) OSU-IB (FDR)

• TestDFSIO Write on TACC

Stampede

– 28x improvement over IPoIB

for 60-100 GB file size

43

Performance Benefits – TestDFSIO on TACC Stampede and SDSC Gordon (HHH-M)

TACC Stampede Cluster with 32 Nodes with HDD, TestDFSIO with a total 128 maps

• TestDFSIO Write on SDSC

Gordon

– 6x improvement over IPoIB

for 60-100 GB file size

Increased by 28x

HPCAC Switzerland Conference (Mar '15)

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

60 80 100

Tota

l Th

rou

ghp

ut

(MB

ps)

Data Size (GB)

IPoIB (QDR) OSU-IB (QDR) Increased by 6x

SDSC Gordon Cluster with 16 Nodes with SSD, TestDFSIO with a total 64 maps

Page 44: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

44

Performance Benefits – TestDFSIO and Sort in SDSC-Gordon (HHH-L)

0

50

100

150

200

250

300

350

400

450

500

40 60 80

Exec

uti

on

Tim

e (s

)

Data Size (GB)

HDFS Lustre

OSU-IB

• Sort

– up to 28% improvement over

HDFS

– up to 50% improvement over

Lustre

Sort TestDFSIO

• TestDFSIO for 80GB data size

– Write: 9x improvement over

HDFS

– Read: 29% improvement over

Lustre

Cluster with 16 Nodes

Reduced by 50% Increased by 29%

0

2000

4000

6000

8000

10000

12000

14000

16000

Write Read

Tota

l Th

rou

ghp

ut

(MB

ps)

HDFS Lustre OSU-IB

Increased by 9x

HPCAC Switzerland Conference (Mar '15)

Page 45: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• Challenges for Accelerating Big Data Processing

• The High-Performance Big Data (HiBD) Project

• RDMA-based designs for Apache Hadoop and Spark

– Case studies with HDFS, MapReduce, and Spark

– Sample Performance Numbers for Hadoop 2.x 0.9.6 Release

• RDMA-based designs for Memcached and HBase

– RDMA-based Memcached with SSD-assisted Hybrid Memory

– RDMA-based HBase

• Challenges in Designing Benchmarks for Big Data Processing

– OSU HiBD Benchmarks

• Conclusion and Q&A

Presentation Outline

HPCAC Switzerland Conference (Mar '15) 45

Page 46: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Memcached-RDMA Design

• Server and client perform a negotiation protocol

– Master thread assigns clients to appropriate worker thread

• Once a client is assigned a verbs worker thread, it can communicate directly and is “bound” to that thread

• All other Memcached data structures are shared among RDMA and Sockets worker threads

• Native IB-verbs-level Design and evaluation with

– Server : Memcached (http://memcached.org)

– Client : libmemcached (http://libmemcached.org)

– Different networks and protocols: 10GigE, IPoIB, native IB (RC, UD)

Sockets Client

RDMA Client

Master Thread

Sockets Worker Thread

Verbs Worker Thread

Sockets Worker Thread

Verbs Worker Thread

Shared

Data

Memory Slabs Items

1

1

2

2

HPCAC Switzerland Conference (Mar '15) 46

Page 47: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

1

10

100

10001 2 4 8

16

32

64

12

8

25

6

51

2

1K

2K

4K

Tim

e (

us)

Message Size

OSU-IB (FDR)

IPoIB (FDR)

0

100

200

300

400

500

600

700

16 32 64 128 256 512 1024 2048 4080

Tho

usa

nd

s o

f Tr

ansa

ctio

ns

pe

r Se

con

d (

TPS)

No. of Clients

• Memcached Get latency

– 4 bytes OSU-IB: 2.84 us; IPoIB: 75.53 us

– 2K bytes OSU-IB: 4.49 us; IPoIB: 123.42 us

• Memcached Throughput (4bytes)

– 4080 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/s

– Nearly 2X improvement in throughput

Memcached GET Latency Memcached Throughput

Memcached Performance (FDR Interconnect)

Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR)

Latency Reduced by nearly 20X

2X

HPCAC Switzerland Conference (Mar '15) 47

Page 48: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• Illustration with Read-Cache-Read access pattern using modified

mysqlslap load testing tool

• Memcached-RDMA can

- improve query latency by up to 66% over IPoIB (32Gbps)

- throughput by up to 69% over IPoIB (32Gbps)

HPCAC Switzerland Conference (Mar '15) 48

Micro-benchmark Evaluation for OLDP workloads

0

1

2

3

4

5

6

7

8

64 96 128 160 320 400

Late

ncy

(se

c)

No. of Clients

Memcached-IPoIB (32Gbps)

Memcached-RDMA (32Gbps)

0

500

1000

1500

2000

2500

3000

3500

64 96 128 160 320 400

Thro

ugh

pu

t (

Kq

/s)

No. of Clients

Memcached-IPoIB (32Gbps)

Memcached-RDMA (32Gbps)

D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data

Processing Workloads with Memcached and MySQL, ISPASS’15

Reduced by 66%

Page 49: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• ohb_memlat & ohb_memthr latency & throughput micro-benchmarks

• Memcached-RDMA can

- improve query latency by up to 70% over IPoIB (32Gbps)

- improve throughput by up to 2X over IPoIB (32Gbps)

- No overhead in using hybrid mode when all data can fit in memory

HPCAC Switzerland Conference (Mar '15) 49

Performance Benefits on SDSC-Gordon – OHB Latency & Throughput Micro-Benchmarks

0123456789

10

64 128 256 512 1024

Thro

ugh

pu

t (m

illio

n t

ran

s/se

c)

No. of Clients

IPoIB (32Gbps)RDMA-Mem (32Gbps)RDMA-Hybrid (32Gbps)

0

100

200

300

400

500

Ave

rage

late

ncy

(u

s)

Message Size (Bytes)

IPoIB (32Gbps)RDMA-Mem (32Gbps)RDMA-Hyb (32Gbps) 2X

Page 50: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

ohb_memhybrid – Uniform Access Pattern, single client and single server with 64MB

• Success Rate of In-Memory Vs. Hybrid SSD-Memory for different spill factors

– 100% success rate for Hybrid design while that of pure In-memory degrades

• Average Latency with penalty for In-Memory Vs. Hybrid SSD-Assisted mode for spill

factor 1.5.

– up to 53% improvement over In-memory with server miss penalty as low as 1.5

ms 50

Performance Benefits on OSU-RI-SSD – OHB Micro-benchmark for Hybrid Memcached

0

200

400

600

800

1000

512 2K 8K 32K 128K

Late

ncy

wit

h p

enal

ty (

us)

Message Size (Bytes)

RDMA-Mem (32Gbps)

RDMA-Hybrid (32Gbps)

55

65

75

85

95

105

115

125

1 1.15 1.3 1.45

Succ

ess

Rat

e

Spill Factor

RDMA-Mem-Uniform

RDMA-Hybrid

HPCAC Switzerland Conference (Mar '15)

53%

Page 51: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• Challenges for Accelerating Big Data Processing

• The High-Performance Big Data (HiBD) Project

• RDMA-based designs for Apache Hadoop and Spark

– Case studies with HDFS, MapReduce, and Spark

– Sample Performance Numbers for Hadoop 2.x 0.9.6 Release

• RDMA-based designs for Memcached and HBase

– RDMA-based Memcached with SSD-assisted Hybrid Memory

– RDMA-based HBase

• Challenges in Designing Benchmarks for Big Data Processing

– OSU HiBD Benchmarks

• Conclusion and Q&A

Presentation Outline

HPCAC Switzerland Conference (Mar '15) 51

Page 52: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

HPCAC Switzerland Conference (Mar '15)

HBase-RDMA Design Overview

• JNI Layer bridges Java based HBase with communication library written in native code

• Enables high performance RDMA communication, while supporting traditional socket interface

HBase

IB Verbs

RDMA Capable Networks

(IB, 10GE/ iWARP, RoCE ..)

OSU-IB Design

Applications

1/10 GigE, IPoIB Networks

Java Socket Interface

Java Native Interface (JNI)

52

Page 53: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

HPCAC Switzerland Conference (Mar '15)

HBase – YCSB Read-Write Workload

• HBase Get latency (Yahoo! Cloud Service Benchmark)

– 64 clients: 2.0 ms; 128 Clients: 3.5 ms

– 42% improvement over IPoIB for 128 clients

• HBase Put latency

– 64 clients: 1.9 ms; 128 Clients: 3.5 ms

– 40% improvement over IPoIB for 128 clients

0

1000

2000

3000

4000

5000

6000

7000

8 16 32 64 96 128

Tim

e (

us)

No. of Clients

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

8 16 32 64 96 128

Tim

e (

us)

No. of Clients

10GigE

Read Latency Write Latency

OSU-IB (QDR) IPoIB (QDR)

53

J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS’12

Page 54: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• Challenges for Accelerating Big Data Processing

• The High-Performance Big Data (HiBD) Project

• RDMA-based designs for Apache Hadoop and Spark

– Case studies with HDFS, MapReduce, and Spark

– Sample Performance Numbers for Hadoop 2.x 0.9.6 Release

• RDMA-based designs for Memcached and HBase

– RDMA-based Memcached with SSD-assisted Hybrid Memory

– RDMA-based HBase

• Challenges in Designing Benchmarks for Big Data Processing

– OSU HiBD Benchmarks

• Conclusion and Q&A

Presentation Outline

HPCAC Switzerland Conference (Mar '15) 54

Page 55: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached)

Networking Technologies

(InfiniBand, 1/10/40GigE and Intelligent NICs)

Storage Technologies

(HDD and SSD)

Programming Models (Sockets)

Applications

Commodity Computing System Architectures

(Multi- and Many-core architectures and accelerators)

Other Protocols?

Communication and I/O Library

Point-to-Point Communication

QoS

Threaded Models and Synchronization

Fault-Tolerance I/O and File Systems

Virtualization

Benchmarks

RDMA Protocol

Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges

HPCAC Switzerland Conference (Mar '15)

Upper level

Changes?

55

Page 56: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• The current benchmarks provide some performance

behavior

• However, do not provide any information to the

designer/developer on:

– What is happening at the lower-layer?

– Where the benefits are coming from?

– Which design is leading to benefits or bottlenecks?

– Which component in the design needs to be changed and what will

be its impact?

– Can performance gain/loss at the lower-layer be correlated to the

performance gain/loss observed at the upper layer?

HPCAC Switzerland Conference (Mar '15)

Are the Current Benchmarks Sufficient for Big Data Management and Processing?

56

Page 57: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

OSU MPI Micro-Benchmarks (OMB) Suite

• A comprehensive suite of benchmarks to – Compare performance of different MPI libraries on various networks and systems

– Validate low-level functionalities

– Provide insights to the underlying MPI-level designs

• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth

• Extended later to – MPI-2 one-sided

– Collectives

– GPU-aware data movement

– OpenSHMEM (point-to-point and collectives)

– UPC

• Has become an industry standard

• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries and even in procurement of large-scale systems

• Available from http://mvapich.cse.ohio-state.edu/benchmarks

• Available in an integrated manner with MVAPICH2 stack

HPCAC Switzerland Conference (Mar '15) 57

Page 58: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached)

Networking Technologies

(InfiniBand, 1/10/40GigE and Intelligent NICs)

Storage Technologies

(HDD and SSD)

Programming Models (Sockets)

Applications

Commodity Computing System Architectures

(Multi- and Many-core architectures and accelerators)

Other Protocols?

Communication and I/O Library

Point-to-Point Communication

QoS

Threaded Models and Synchronization

Fault-Tolerance I/O and File Systems

Virtualization

Benchmarks

RDMA Protocols

Challenges in Benchmarking of RDMA-based Designs

Current

Benchmarks

No Benchmarks

Correlation?

HPCAC Switzerland Conference (Mar '15) 58

Page 59: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached)

Networking Technologies

(InfiniBand, 1/10/40GigE and Intelligent NICs)

Storage Technologies

(HDD and SSD)

Programming Models (Sockets)

Applications

Commodity Computing System Architectures

(Multi- and Many-core architectures and accelerators)

Other Protocols?

Communication and I/O Library

Point-to-Point Communication

QoS

Threaded Models and Synchronization

Fault-Tolerance I/O and File Systems

Virtualization

Benchmarks

RDMA Protocols

Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications

HPCAC Switzerland Conference (Mar '15)

Applications-

Level

Benchmarks

Micro-

Benchmarks

59

Page 60: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• Evaluate the performance of standalone HDFS

• Five different benchmarks

– Sequential Write Latency (SWL)

– Sequential or Random Read Latency (SRL or RRL)

– Sequential Write Throughput (SWT)

– Sequential Read Throughput (SRT)

– Sequential Read-Write Throughput (SRWT)

HPCAC Switzerland Conference (Mar '15)

OSU HiBD Micro-Benchmark (OHB) Suite - HDFS

Benchmark File Name

File Size

HDFS Parameter

Readers Writers Random/ Sequential

Read

Seek Interval

SWL √ √ √

SRL/RRL √ √ √ √ √ (RRL)

SWT √ √ √

SRT √ √ √

SRWT √ √ √ √

N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D.

K. Panda, A Micro-benchmark Suite for

Evaluating HDFS Operations on Modern

Clusters, Int'l Workshop on Big Data

Benchmarking (WBDB '12), December 2012.

60

Page 61: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

OSU HiBD Micro-Benchmark (OHB) Suite - RPC • Two different micro-benchmarks to evaluate the performance of standalone

Hadoop RPC

– Latency: Single Server, Single Client

– Throughput: Single Server, Multiple Clients

• A simple script framework for job launching and resource monitoring

• Calculates statistics like Min, Max, Average

• Network configuration, Tunable parameters, DataType, CPU Utilization

Component Network Address

Port Data Type Min Msg Size

Max Msg Size

No. of Iterations

Handlers Verbose

lat_client √ √ √ √ √ √ √

lat_server √ √ √ √

Component Network Address

Port Data Type Min Msg Size

Max Msg Size

No. of Iterations

No. of Clients Handlers Verbose

thr_client √ √ √ √ √ √ √

thr_server √ √ √ √ √ √

X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop RPC on High-

Performance Networks, Int'l Workshop on Big Data Benchmarking (WBDB '13), July 2013.

HPCAC Switzerland Conference (Mar '15) 61

Page 62: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• Evaluate the performance of stand-alone MapReduce

• Does not require or involve HDFS or any other distributed file system

• Considers various factors that influence the data shuffling phase

– underlying network configuration, number of map and reduce tasks, intermediate

shuffle data pattern, shuffle data size etc.

• Three different micro-benchmarks based on intermediate shuffle data

patterns

– MR-AVG micro-benchmark: intermediate data is evenly distributed among

reduce tasks.

– MR-RAND micro-benchmark: intermediate data is pseudo-randomly

distributed among reduce tasks.

– MR-SKEW micro-benchmark: intermediate data is unevenly distributed

among reduce tasks.

HPCAC Switzerland Conference (Mar '15)

OSU HiBD Micro-Benchmark (OHB) Suite - MapReduce

D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating

Hadoop MapReduce on High-Performance Networks, BPOE-5 (2014).

62

Page 63: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• Upcoming Releases of RDMA-enhanced Packages will support

– Spark

– HBase

– Plugin-based designs

• Upcoming Releases of OSU HiBD Micro-Benchmarks (OHB) will support

– HDFS

– MapReduce

– RPC

• Exploration of other components (Threading models, QoS, Virtualization,

Accelerators, etc.)

• Advanced designs with upper-level changes and optimizations

HPCAC Switzerland Conference (Mar '15)

Future Plans of OSU High Performance Big Data Project

63

Page 64: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• Presented an overview of Big Data processing middleware

• Discussed challenges in accelerating Big Data middleware

• Presented initial designs to take advantage of InfiniBand/RDMA

for Hadoop, Spark, Memcached, and HBase

• Presented challenges in designing benchmarks

• Results are promising

• Many other open issues need to be solved

• Will enable Big processing community to take advantage of

modern HPC technologies to carry out their analytics in a fast and

scalable manner

HPCAC Switzerland Conference (Mar '15)

Concluding Remarks

64

Page 65: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

Personnel Acknowledgments

65

Current Students

– A. Awan (Ph.D.)

– A. Bhat (M.S.)

– S. Chakraborthy (Ph.D.)

– C.-H. Chu (Ph.D.)

– N. Islam (Ph.D.)

Past Students

– P. Balaji (Ph.D.)

– D. Buntinas (Ph.D.)

– S. Bhagvat (M.S.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

Past Research Scientist – S. Sur

Current Post-Doc

– J. Lin

Current Programmer

– J. Perkins

Past Post-Docs – H. Wang

– X. Besseron

– H.-W. Jin

– M. Luo

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– J. Jose (Ph.D.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– S. Potluri (Ph.D.)

– R. Rajachandrasekar (Ph.D.)

– M. Li (Ph.D.)

– M. Rahman (Ph.D.)

– D. Shankar (Ph.D.)

– A. Venkatesh (Ph.D.)

– J. Zhang (Ph.D.)

– E. Mancini

– S. Marcarelli

– J. Vienne

Current Senior Research Associates

– K. Hamidouche

– X. Lu

Past Programmers

– D. Bureddy

– H. Subramoni

Current Research Specialist

– M. Arnold

HPCAC Switzerland Conference (Mar '15)

Page 66: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

[email protected]

HPCAC Switzerland Conference (Mar '15)

Thank You!

The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/

66

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The MVAPICH2/MVAPICH2-X Project http://mvapich.cse.ohio-state.edu/

Page 67: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

HPCAC Switzerland Conference (Mar '15) 67

Call For Participation

International Workshop on High-Performance Big

Data Computing (HPBDC 2015)

In conjunction with International Conference on Distributed

Computing Systems (ICDCS 2015)

In Hilton Downtown, Columbus, Ohio, USA, Monday, June 29th, 2015

http://web.cse.ohio-state.edu/~luxi/hpbdc2015

Page 68: Accelerating Big Data Processing with Hadoop, Spark … · Accelerating Big Data Processing with Hadoop, Spark ... Spark Architecture Overview ... • Challenges for Accelerating

• Looking for Bright and Enthusiastic Personnel to join as

– Post-Doctoral Researchers

– PhD Students

– Hadoop/Big Data Programmer/Software Engineer

– MPI Programmer/Software Engineer

• If interested, please contact me at this conference and/or send

an e-mail to [email protected]

68

Multiple Positions Available in My Group

HPCAC China, Nov. '14