big data meets hpc exploiting hpc technologies for ...€¦ · for accelerating big data processing...

Big Data Meets HPC – Exploiting HPC Technologies for Accelerating Big Data Processing

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Keynote Talk at HPCAC-Switzerland (Mar 2016)

by

http://www.cse.ohio-state.edu/~panda

HPCAC-Switzerland (Mar ‘16) 2Network Based Computing Laboratory

• Big Data has become the one of the most

important elements of business analytics

• Provides groundbreaking opportunities for

enterprise information management and

decision making

• The amount of data is exploding; companies

are capturing and digitizing more information

than ever

• The rate of information growth appears to be

exceeding Moore’s Law

Introduction to Big Data Applications and Analytics


• Commonly accepted 3V’s of Big Data• Volume, Velocity, VarietyMichael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf

• 4/5V’s of Big Data – 3V + *Veracity, *Value

4V Characteristics of Big Data

Courtesy: http://api.ning.com/files/tRHkwQN7s-Xz5cxylXG004GLGJdjoPd6bVfVBwvgu*F5MwDDUCiHHdmBW-JTEz0cfJjGurJucBMTkIUNdL3jcZT8IPfNWfN9/dv1.jpg

http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf


• Webpages (content, graph)

• Clicks (ad, page, social)

• Users (OpenID, FB Connect, etc.)

• e-mails (Hotmail, Y!Mail, Gmail, etc.)

• Photos, Movies (Flickr, YouTube, Video, etc.)

• Cookies / tracking info (see Ghostery)

• Installed apps (Android market, App Store, etc.)

• Location (Latitude, Loopt, Foursquared, Google Now, etc.)

• User generated content (Wikipedia & co, etc.)

• Ads (display, text, DoubleClick, Yahoo, etc.)

• Comments (Discuss, Facebook, etc.)

• Reviews (Yelp, Y!Local, etc.)

• Social connections (LinkedIn, Facebook, etc.)

• Purchase decisions (Netflix, Amazon, etc.)

• Instant Messages (YIM, Skype, Gtalk, etc.)

• Search terms (Google, Bing, etc.)

• News articles (BBC, NYTimes, Y!News, etc.)

• Blog posts (Tumblr, Wordpress, etc.)

• Microblogs (Twitter, Jaiku, Meme, etc.)

• Link sharing (Facebook, Delicious, Buzz, etc.)

Data Generation in Internet Services and Applications

Number of Apps in the Apple App Store, Android Market, Blackberry, and Windows Phone (2013)• Android Market: <1200K• Apple App Store: ~1000KCourtesy: http://dazeinfo.com/2014/07/10/apple-inc-aapl-ios-google-inc-goog-android-growth-mobile-ecosystem-2014/


Velocity of Big Data – How Much Data Is Generated Every Minute on the Internet?

The global Internet population grew 18.5% from 2013 to 2015 and now represents

3.2 Billion People.Courtesy: https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/


• Scientific Data Management, Analysis, and Visualization

• Applications examples

– Climate modeling

– Combustion

– Fusion

– Astrophysics

– Bioinformatics

• Data Intensive Tasks

– Runs large-scale simulations on supercomputers

– Dump data on parallel storage systems

– Collect experimental / observational data

– Move experimental / observational data to analysis sites

– Visual analytics – help understand data visually

Not Only in Internet Services - Big Data in Scientific Domains


• Hadoop: http://hadoop.apache.org

– The most popular framework for Big Data Analytics

– HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc.

• Spark: http://spark-project.org

– Provides primitives for in-memory cluster computing; Jobs can load data into memory and query it repeatedly

• Storm: http://storm-project.net

– A distributed real-time computation system for real-time analytics, online machine learning, continuous

computation, etc.

• S4: http://incubator.apache.org/s4

– A distributed system for processing continuous unbounded streams of data

• GraphLab: http://graphlab.org

– Consists of a core C++ GraphLab API and a collection of high-performance machine learning and data mining

toolkits built on top of the GraphLab API.

• Web 2.0: RDBMS + Memcached (http://memcached.org)

– Memcached: A high-performance, distributed memory object caching systems

Typical Solutions or Architectures for Big Data Analytics

http://hadoop.apache.org

http://spark-project.org

http://storm-project.net

http://incubator.apache.org/s4

http://graphlab.org/

http://memcached.org


Big Data Processing with Hadoop Components

• Major components included in this tutorial:

– MapReduce (Batch)

– HBase (Query)

– HDFS (Storage)

– RPC (Inter-process communication)

• Underlying Hadoop Distributed File System (HDFS) used by both MapReduce and HBase

• Model scales but high amount of communication during intermediate phases can be further optimized

HDFS

MapReduce

Hadoop Framework

User Applications

HBase

Hadoop Common (RPC)


Worker

Worker

Worker

Worker

Master

HDFS

Driver

Zookeeper

Worker

SparkContext

• An in-memory data-processing framework

– Iterative machine learning jobs

– Interactive data analytics

– Scala based Implementation

– Standalone, YARN, Mesos

• Scalable and communication intensive

– Wide dependencies between Resilient Distributed Datasets (RDDs)

– MapReduce-like shuffle operations to repartition RDDs

– Sockets based communication

Spark Architecture Overview

http://spark.apache.org

http://spark.apache.org


Memcached Architecture

• Distributed Caching Layer

– Allows to aggregate spare memory from multiple nodes

– General purpose

• Typically used to cache database queries, results of API calls

• Scalable model, but typical usage very network intensive

Main

memoryCPUs

SSD HDD

High Performance Networks

... ...

...

Main

memoryCPUs

SSD HDD

Main

memoryCPUs

SSD HDD

Main

memoryCPUs

SSD HDD

Main

memoryCPUs

SSD HDD


• Substantial impact on designing and utilizing data management and processing systems in multiple tiers

– Front-end data accessing and serving (Online)

• Memcached + DB (e.g. MySQL), HBase

– Back-end data analytics (Offline)

• HDFS, MapReduce, Spark

Data Management and Processing on Modern Clusters


• Introduced in Oct 2000• High Performance Data Transfer

– Interprocessor communication and I/O– Low latency (<1.0 microsec), High bandwidth (up to 12.5 GigaBytes/sec -> 100Gbps), and

low CPU utilization (5-10%)

• Multiple Operations– Send/Recv– RDMA Read/Write– Atomic Operations (very unique)

• high performance and scalable implementations of distributed locks, semaphores, collective communication operations

• Leading to big changes in designing – HPC clusters– File systems– Cloud computing systems– Grid computing systems

Open Standard InfiniBand Networking Technology


Interconnects and Protocols in OpenFabrics Stack

Kernel

Space

Application / Middleware

Verbs

Ethernet

Adapter

EthernetSwitch

Ethernet Driver

TCP/IP

1/10/40/100 GigE

InfiniBand

Adapter

InfiniBand

Switch

IPoIB

IPoIB

Ethernet

Adapter

Ethernet Switch

Hardware Offload

TCP/IP

10/40 GigE-TOE

InfiniBand

Adapter

InfiniBand

Switch

User Space

RSockets

RSockets

iWARP Adapter

Ethernet Switch

TCP/IP

User Space

iWARP

RoCEAdapter

Ethernet Switch

RDMA

User Space

RoCE

InfiniBand

Switch

InfiniBand

Adapter

RDMA

User Space

IB Native

Sockets

Application / Middleware Interface

Protocol

Adapter

Switch

InfiniBand

Adapter

InfiniBand

Switch

RDMA

SDP

SDP


How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?

Bring HPC and Big Data processing into a “convergent trajectory”!

What are the major

bottlenecks in current Big

Data processing

middleware (e.g. Hadoop,

Spark, and Memcached)?

Can the bottlenecks be alleviated with new

designs by taking advantage of HPC

technologies?

Can RDMA-enabled

high-performance

interconnects

benefit Big Data

processing?

Can HPC Clusters with

high-performance

storage systems (e.g.

SSD, parallel file

systems) benefit Big

Data applications?

How much

performance benefits

can be achieved

through enhanced

designs?

How to design

benchmarks for

evaluating the

performance of Big

Data middleware on

HPC clusters?


Designing Communication and I/O Libraries for Big Data Systems: Challenges

Big Data Middleware(HDFS, MapReduce, HBase, Spark and Memcached)

Networking Technologies

(InfiniBand, 1/10/40/100 GigEand Intelligent NICs)

Storage Technologies(HDD, SSD, and NVMe-SSD)

Programming Models(Sockets)

Applications

Commodity Computing System Architectures

(Multi- and Many-core architectures and accelerators)

Other Protocols?

Communication and I/O Library

Point-to-PointCommunication

QoS

Threaded Modelsand Synchronization

Fault-ToleranceI/O and File Systems

Virtualization

Benchmarks

Upper level

Changes?


• Sockets not designed for high-performance

– Stream semantics often mismatch for upper layers

– Zero-copy not available for non-blocking sockets

Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?

Current Design

Application

Sockets

1/10/40/100 GigENetwork

Our Approach

Application

OSU Design

10/40/100 GigE or InfiniBand

Verbs Interface


• RDMA for Apache Spark

• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)

– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions

• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)

• RDMA for Memcached (RDMA-Memcached)

• OSU HiBD-Benchmarks (OHB)

– HDFS and Memcached Micro-benchmarks

• http://hibd.cse.ohio-state.edu

• Users Base: 155 organizations from 20 countries

• More than 15,500 downloads from the project site

• RDMA for Apache HBase and Impala (upcoming)

The High-Performance Big Data (HiBD) Project

Available for InfiniBand and RoCE

http://hibd.cse.ohio-state.edu

mailto:[email protected]


• High-Performance Design of Hadoop over RDMA-enabled Interconnects

– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and

RPC components

– Enhanced HDFS with in-memory and heterogeneous storage

– High performance design of MapReduce over Lustre

– Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH and HDP

– Easily configurable for different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre) and different protocols (native

InfiniBand, RoCE, and IPoIB)

• Current release: 0.9.9

– Based on Apache Hadoop 2.7.1

– Compliant with Apache Hadoop 2.7.1, HDP 2.3.0.0 and CDH 5.6.0 APIs and applications

– Tested with

• Mellanox InfiniBand adapters (DDR, QDR and FDR)

• RoCE support with Mellanox adapters

• Various multi-core platforms

• Different file systems with disks and SSDs and Lustre

– http://hibd.cse.ohio-state.edu

RDMA for Apache Hadoop 2.x Distribution

http://hadoop-rdma.cse.ohio-state.edu/


• High-Performance Design of Spark over RDMA-enabled Interconnects

– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level

for Spark

– RDMA-based data shuffle and SEDA-based shuffle architecture

– Non-blocking and chunk-based data transfer

– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)


– Based on Apache Spark 1.5.1

– Tested with




• RAM disks, SSDs, and HDD


RDMA for Apache Spark Distribution



• High-Performance Design of Memcached over RDMA-enabled Interconnects

– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for

Memcached and libMemcached components

– High performance design of SSD-Assisted Hybrid Memory

– Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and

InfiniBand with IPoIB)


– Based on Memcached 1.4.24 and libMemcached 1.0.18

– Compliant with libMemcached APIs and applications

– Tested with




• SSD


RDMA for Memcached Distribution



• Micro-benchmarks for Hadoop Distributed File System (HDFS)

– Sequential Write Latency (SWL) Benchmark, Sequential Read Latency (SRL)

Benchmark, Random Read Latency (RRL) Benchmark, Sequential Write Throughput

(SWT) Benchmark, Sequential Read Throughput (SRT) Benchmark

– Support benchmarking of

• Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, Cloudera

Distribution of Hadoop (CDH) HDFS

• Micro-benchmarks for Memcached

– Get Benchmark, Set Benchmark, and Mixed Get/Set Benchmark

• Current release: 0.8

• http://hibd.cse.ohio-state.edu

OSU HiBD Micro-Benchmark (OHB) Suite – HDFS & Memcached



• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to have better fault-tolerance as well

as performance. This mode is enabled by default in the package.

• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-

memory and obtain as much performance benefit as possible.

• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.

• MapReduce over Lustre, with/without local disks: Besides, HDFS based solutions, this package also provides support to run MapReduce jobs on top

of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.

• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-

L, and MapReduce over Lustre).

Different Modes of RDMA for Apache Hadoop 2.x


• RDMA-based Designs and Performance Evaluation

– HDFS

– MapReduce

– RPC

– HBase

– Spark

– Memcached (Basic, Hybrid and Non-blocking APIs)

– HDFS + Memcached-based Burst Buffer

– OSU HiBD Benchmarks (OHB)

Acceleration Case Studies and Performance Evaluation


• Enables high performance RDMA communication, while supporting traditional socket interface

• JNI Layer bridges Java based HDFS with communication library written in native code

Design Overview of HDFS with RDMA

HDFS

Verbs

RDMA Capable Networks(IB, iWARP, RoCE ..)

Applications

1/10/40/100 GigE, IPoIB Network

Java Socket Interface Java Native Interface (JNI)

WriteOthers

OSU Design

• Design Features

– RDMA-based HDFS write

– RDMA-based HDFS replication

– Parallel replication support

– On-demand connection setup

– InfiniBand/RoCE support

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda , High Performance RDMA-Based Design of HDFS

over InfiniBand , Supercomputing (SC), Nov 2012

N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014


Triple-H

Heterogeneous Storage

• Design Features

– Three modes

• Default (HHH)

• In-Memory (HHH-M)

• Lustre-Integrated (HHH-L)

– Policies to efficiently utilize the heterogeneous

storage devices

• RAM, SSD, HDD, Lustre

– Eviction/Promotion based on data usage

pattern

– Hybrid Replication

– Lustre-Integrated mode:

• Lustre-based fault-tolerance

Enhanced HDFS with In-Memory and Heterogeneous Storage

Hybrid Replication

Data Placement Policies

Eviction/Promotion

RAM Disk SSD HDD

Lustre

N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters

with Heterogeneous Storage Architecture, CCGrid ’15, May 2015

Applications


Design Overview of MapReduce with RDMA

MapReduce

Verbs

RDMA Capable Networks

(IB, iWARP, RoCE ..)

OSU Design

Applications



Job

Tracker

Task

Tracker

Map

Reduce


• JNI Layer bridges Java based MapReduce with communication library written in native code

• Design Features

– RDMA-based shuffle

– Prefetching and caching map output

– Efficient Shuffle Algorithms

– In-memory merge

– On-demand Shuffle Adjustment

– Advanced overlapping

• map, shuffle, and merge

• shuffle, merge, and reduce



M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in

MapReduce over High Performance Interconnects, ICS, June 2014


• A hybrid approach to achieve maximum

possible overlapping in MapReduce across

all phases compared to other approaches– Efficient Shuffle Algorithms

– Dynamic and Efficient Switching

– On-demand Shuffle Adjustment

Advanced Overlapping among Different Phases

Default Architecture

Enhanced Overlapping with In-Memory Merge

Advanced Hybrid Overlapping

M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A

Hybrid Approach to Exploit Maximum Overlapping in MapReduce

over High Performance Interconnects, ICS, June 2014


• Design Features

– JVM-bypassed buffer management

– RDMA or send/recv based adaptive communication

– Intelligent buffer allocation and adjustment for serialization



Design Overview of Hadoop RPC with RDMA

Hadoop RPC

Verbs

RDMA Capable Networks(IB, iWARP, RoCE ..)

Applications



Our DesignDefault

OSU Design


• JNI Layer bridges Java based RPC with communication library written in native code

X. Lu, N. Islam, M. W. Rahman, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance Design of Hadoop RPC with

RDMA over InfiniBand, Int'l Conference on Parallel Processing (ICPP '13), October 2013.


0

50

100

150

200

250

80 100 120

Exec

uti

on

Tim

e (s

)

Data Size (GB)

IPoIB (FDR)

0

50

100

150

200

250

80 100 120

Exec

uti

on

Tim

e (s

)

Data Size (GB)

IPoIB (FDR)

Performance Benefits – RandomWriter & TeraGen in TACC-Stampede

Cluster with 32 Nodes with a total of 128 maps

• RandomWriter

– 3-4x improvement over IPoIB

for 80-120 GB file size

• TeraGen

– 4-5x improvement over IPoIB

for 80-120 GB file size

RandomWriter TeraGen

Reduced by 3x Reduced by 4x


0

100

200

300

400

500

600

700

800

900

80 100 120

Exec

uti

on

Tim

e (s

)

Data Size (GB)

IPoIB (FDR) OSU-IB (FDR)

0

100

200

300

400

500

600

80 100 120

Exec

uti

on

Tim

e (s

)

Data Size (GB)


Performance Benefits – Sort & TeraSort in TACC-Stampede

Cluster with 32 Nodes with a total of 128 maps and 64 reduces

• Sort with single HDD per node

– 40-52% improvement over IPoIB

for 80-120 GB data

• TeraSort with single HDD per node

– 42-44% improvement over IPoIB

for 80-120 GB data

Reduced by 52% Reduced by 44%

Cluster with 32 Nodes with a total of 128 maps and 57 reduces


Evaluation of HHH and HHH-L with Applications

HDFS (FDR) HHH (FDR)

60.24 s 48.3 s

CloudBurstMR-MSPolyGraph

0

200

400

600

800

1000

4 6 8

Exec

uti

on

Tim

e (s

)

Concurrent maps per host

HDFS Lustre HHH-L Reduced by 79%

• MR-MSPolygraph on OSU RI with

1,000 maps

– HHH-L reduces the execution time

by 79% over Lustre, 30% over HDFS

• CloudBurst on TACC Stampede

– With HHH: 19% improvement over

HDFS


Evaluation with Spark on SDSC Gordon (HHH vs. Tachyon/Alluxio)

• For 200GB TeraGen on 32 nodes

– Spark-TeraGen: HHH has 2.4x improvement over Tachyon; 2.3x over HDFS-IPoIB (QDR)

– Spark-TeraSort: HHH has 25.2% improvement over Tachyon; 17% over HDFS-IPoIB (QDR)

0

20

40

60

80

100

120

140

160

180

8:50 16:100 32:200

Exe

cuti

on

Tim

e (

s)

Cluster Size : Data Size (GB)

IPoIB (QDR) Tachyon OSU-IB (QDR)

0

100

200

300

400

500

600

700

8:50 16:100 32:200

Exe

cuti

on

Tim

e (

s)

Cluster Size : Data Size (GB)

Reduced by 2.4x

Reduced by 25.2%

TeraGen TeraSort

N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File

Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData ’15, October 2015


Intermediate Data Directory

Design Overview of Shuffle Strategies for MapReduce over Lustre

• Design Features

– Two shuffle approaches

• Lustre read based shuffle

• RDMA based shuffle

– Hybrid shuffle algorithm to take benefit from both shuffle approaches

– Dynamically adapts to the better shuffle approach for each shuffle request based on profiling values for each Lustre read operation

– In-memory merge and overlapping of different phases are kept similar to RDMA-enhanced MapReduce design

Map 1 Map 2 Map 3

Lustre

Reduce 1 Reduce 2

Lustre Read / RDMA

In-memory

merge/sortreduce

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN

MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015

In-memory

merge/sortreduce


• For 500GB Sort in 64 nodes

– 44% improvement over IPoIB (FDR)

Performance Improvement of MapReduce over Lustre on TACC-Stampede


– 48% improvement over IPoIB (FDR)

0

200

400

600

800

1000

1200

300 400 500

Job

Exe

cuti

on

Tim

e (s

ec)

Data Size (GB)

IPoIB (FDR)OSU-IB (FDR)

0

50

100

150

200

250

300

350

400

450

500

20 GB 40 GB 80 GB 160 GB 320 GB 640 GB

Cluster: 4 Cluster: 8 Cluster: 16 Cluster: 32 Cluster: 64 Cluster: 128

Job

Exe

cuti

on

Tim

e (s

ec)


M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-

based Approach Benefit?, Euro-Par, August 2014.

• Local disk is used as the intermediate data directoryReduced by 48%Reduced by 44%



– 34% improvement over IPoIB (QDR)

Case Study - Performance Improvement of MapReduce over Lustre on SDSC-Gordon

• For 120GB TeraSort in 16 nodes

– 25% improvement over IPoIB (QDR)

• Lustre is used as the intermediate data directory

0

100

200

300

400

500

600

700

800

900

40 60 80

Job

Exe

cuti

on

Tim

e (

sec)

Data Size (GB)

IPoIB (QDR)

OSU-Lustre-Read (QDR)

OSU-RDMA-IB (QDR)

OSU-Hybrid-IB (QDR)

0

100

200

300

400

500

600

700

800

900

40 80 120

Job

Exe

cuti

on

Tim

e (s

ec)

Data Size (GB)

Reduced by 25%Reduced by 34%



– HDFS

– MapReduce

– RPC

– HBase

– Spark






HBase-RDMA Design Overview

• JNI Layer bridges Java based HBase with communication library written in native code


HBase

IB Verbs

RDMA Capable Networks

(IB, iWARP, RoCE ..)

OSU-IB Design

Applications

1/10/40/100 GigE, IPoIB Networks



HBase – YCSB Read-Write Workload

• HBase Get latency (QDR, 10GigE)

– 64 clients: 2.0 ms; 128 Clients: 3.5 ms

– 42% improvement over IPoIB for 128 clients

• HBase Put latency (QDR, 10GigE)

– 64 clients: 1.9 ms; 128 Clients: 3.5 ms

– 40% improvement over IPoIB for 128 clients

0

1000

2000

3000

4000

5000

6000

7000

8 16 32 64 96 128

Tim

e (

us)

No. of Clients

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

8 16 32 64 96 128

Tim

e (

us)

No. of Clients

10GigE

Read Latency Write Latency

OSU-IB (QDR)IPoIB (QDR)

J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS’12


HBase – YCSB Get Latency and Throughput on SDSC-Comet

• HBase Get average latency (FDR)

– 4 client threads: 38 us

– 59% improvement over IPoIB for 4 client threads

• HBase Get total throughput

– 4 client threads: 102 Kops/sec

– 2.4x improvement over IPoIB for 4 client threads

Get Latency Get Throughput

0

0.02

0.04

0.06

0.08

0.1

0.12

1 2 3 4Ave

rage

Lat

ency

(m

s)

Number of Client Threads


0

20

40

60

80

100

120

1 2 3 4

Tota

l Th

rou

ghp

ut

(Ko

ps/

sec)

Number of Client Threads

IPoIB (FDR) OSU-IB (FDR)59%

2.4x



– HDFS

– MapReduce

– RPC

– HBase

– Spark






• Design Features

– RDMA based shuffle

– SEDA-based plugins

– Dynamic connection management and sharing

– Non-blocking data transfer

– Off-JVM-heap buffer management


Design Overview of Spark with RDMA


• JNI Layer bridges Scala based Spark with communication library written in native code

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early

Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014


• InfiniBand FDR, SSD, 64 Worker Nodes, 1536 Cores, (1536M 1536R)

• RDMA-based design for Spark 1.5.1

• RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node.

– SortBy: Total time reduced by up to 80% over IPoIB (56Gbps)

– GroupBy: Total time reduced by up to 57% over IPoIB (56Gbps)

Performance Evaluation on SDSC Comet – SortBy/GroupBy

64 Worker Nodes, 1536 cores, SortByTest Total Time 64 Worker Nodes, 1536 cores, GroupByTest Total Time

0

50

100

150

200

250

300

64 128 256

Tim

e (s

ec)

Data Size (GB)

IPoIB

RDMA

0

50

100

150

200

250

64 128 256

Tim

e (s

ec)

Data Size (GB)

IPoIB

RDMA

57%80%


• InfiniBand FDR, SSD, 64 Worker Nodes, 1536 Cores, (1536M 1536R)


• RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node.

– Sort: Total time reduced by 38% over IPoIB (56Gbps)

– TeraSort: Total time reduced by 15% over IPoIB (56Gbps)

Performance Evaluation on SDSC Comet – HiBench Sort/TeraSort

64 Worker Nodes, 1536 cores, Sort Total Time 64 Worker Nodes, 1536 cores, TeraSort Total Time

0

50

100

150

200

250

300

350

400

450

64 128 256

Tim

e (s

ec)

Data Size (GB)

IPoIB

RDMA

0

100

200

300

400

500

600

64 128 256

Tim

e (s

ec)

Data Size (GB)

IPoIB

RDMA

15%38%


• InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R)


• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node.

– 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps)

– 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps)

Performance Evaluation on SDSC Comet – HiBench PageRank

32 Worker Nodes, 768 cores, PageRank Total Time 64 Worker Nodes, 1536 cores, PageRank Total Time

0

50

100

150

200

250

300

350

400

450

Huge BigData Gigantic

Tim

e (s

ec)

Data Size (GB)

IPoIB

RDMA

0

100

200

300

400

500

600

700

800

Huge BigData Gigantic

Tim

e (s

ec)

Data Size (GB)

IPoIB

RDMA

43%37%



– HDFS

– MapReduce

– RPC

– HBase

– Spark






• Server and client perform a negotiation protocol

– Master thread assigns clients to appropriate worker thread

• Once a client is assigned a verbs worker thread, it can communicate directly and is “bound” to that thread

• All other Memcached data structures are shared among RDMA and Sockets worker threads

• Memcached Server can serve both socket and verbs clients simultaneously

• Memcached applications need not be modified; uses verbs interface if available

Memcached-RDMA Design

SocketsClient

RDMAClient

MasterThread

SocketsWorker Thread

VerbsWorker Thread

SocketsWorker Thread

VerbsWorker Thread

SharedData

MemorySlabsItems

…

1

1

2

2


1

10

100

1000

1 2 4 8

16

32

64

12

8

25

6

51

2

1K

2K

4K

Tim

e (

us)

Message Size

OSU-IB (FDR)IPoIB (FDR)

0

100

200

300

400

500

600

700

16 32 64 128 256 512 1024 2048 4080Tho

usa

nd

s o

f Tr

ansa

ctio

ns

pe

r Se

con

d (

TPS)

No. of Clients

• Memcached Get latency

– 4 bytes OSU-IB: 2.84 us; IPoIB: 75.53 us

– 2K bytes OSU-IB: 4.49 us; IPoIB: 123.42 us

• Memcached Throughput (4bytes)

– 4080 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/s

– Nearly 2X improvement in throughput

Memcached GET Latency Memcached Throughput

RDMA-Memcached Performance (FDR Interconnect)

Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR)

Latency Reduced by nearly 20X

2X


• Illustration with Read-Cache-Read access pattern using modified mysqlslap load testing

tool

• Memcached-RDMA can

- improve query latency by up to 66% over IPoIB (32Gbps)

- throughput by up to 69% over IPoIB (32Gbps)

Micro-benchmark Evaluation for OLDP workloads

012345678

64 96 128 160 320 400

Late

ncy

(se

c)

No. of Clients

Memcached-IPoIB (32Gbps)

Memcached-RDMA (32Gbps)

0

1000

2000

3000

4000

64 96 128 160 320 400

Thro

ugh

pu

t (

Kq

/s)

No. of Clients

Memcached-IPoIB (32Gbps)

Memcached-RDMA (32Gbps)

D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads

with Memcached and MySQL, ISPASS’15

Reduced by 66%


• ohb_memlat & ohb_memthr latency & throughput micro-benchmarks

• Memcached-RDMA can

- improve query latency by up to 70% over IPoIB (32Gbps)

- improve throughput by up to 2X over IPoIB (32Gbps)

- No overhead in using hybrid mode when all data can fit in memory

Performance Benefits of Hybrid Memcached (Memory + SSD) on SDSC-Gordon

0

2

4

6

8

10

64 128 256 512 1024

Thro

ugh

pu

t (m

illio

n t

ran

s/se

c)

No. of Clients

IPoIB (32Gbps)

RDMA-Mem (32Gbps)

RDMA-Hybrid (32Gbps)

0

100

200

300

400

500

Ave

rage

late

ncy

(u

s)

Message Size (Bytes)

2X


– Memcached latency test with Zipf distribution, server with 1 GB memory, 32 KB key-value pair size, total

size of data accessed is 1 GB (when data fits in memory) and 1.5 GB (when data does not fit in memory)

– When data fits in memory: RDMA-Mem/Hybrid gives 5x improvement over IPoIB-Mem

– When data does not fit in memory: RDMA-Hybrid gives 2x-2.5x over IPoIB/RDMA-Mem

Performance Evaluation on IB FDR + SATA/NVMe SSDs

0

500

1000

1500

2000

2500

Set Get Set Get Set Get Set Get Set Get Set Get Set Get Set Get

IPoIB-Mem RDMA-Mem RDMA-Hybrid-SATA RDMA-Hybrid-NVMe

IPoIB-Mem RDMA-Mem RDMA-Hybrid-SATA RDMA-Hybrid-NVMe

Data Fits In Memory Data Does Not Fit In Memory

Late

ncy

(u

s)

slab allocation (SSD write) cache check+load (SSD read) cache update server response client wait miss-penalty


– RDMA-Accelerated Communication for Memcached Get/Set

– Hybrid ‘RAM+SSD’ slab management for higher data retention

– Non-blocking API extensions • memcached_(iset/iget/bset/bget/test/wait)

• Achieve near in-memory speeds while hiding bottlenecks of network and SSD I/O

• Ability to exploit communication/computation overlap

• Optional buffer re-use guarantees

– Adaptive slab manager with different I/O schemes for higher throughput.

Accelerating Hybrid Memcached with RDMA, Non-blocking Extensions and SSDs

D. Shankar, X. Lu, N. S. Islam, M. W. Rahman, and D. K. Panda, High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits, IPDPS, May 2016

BLOCKING API

NON-BLOCKING API REQ.

NON-BLOCKING API REPLY

CLI

ENT

SER

VER

HYBRID SLAB MANAGER (RAM+SSD)

RDMA-ENHANCED COMMUNICATION LIBRARY

RDMA-ENHANCED COMMUNICATION LIBRARY

LIBMEMCACHED LIBRARY

Blocking API Flow Non-Blocking API Flow


– Data does not fit in memory: Non-blocking Memcached Set/Get API Extensions can achieve• >16x latency improvement vs. blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty• >2.5x throughput improvement vs. blocking API over default/optimized RDMA-Hybrid

– Data fits in memory: Non-blocking Extensions perform similar to RDMA-Mem/RDMA-Hybrid and >3.6x improvement over IPoIB-Mem

Performance Evaluation with Non-Blocking Memcached API

0

500

1000

1500

2000

2500

Set Get Set Get Set Get Set Get Set Get Set Get

IPoIB-Mem RDMA-Mem H-RDMA-Def H-RDMA-Opt-Block H-RDMA-Opt-NonB-i

H-RDMA-Opt-NonB-b

Ave

rage

Lat

en

cy (

us)

Miss Penalty (Backend DB Access Overhead)

Client Wait

Server Response

Cache Update

Cache check+Load (Memory and/or SSD read)

Slab Allocation (w/ SSD write on Out-of-Mem)

H = Hybrid Memcached over SATA SSD Opt = Adaptive slab manager Block = Default Blocking API NonB-i = Non-blocking iset/iget API NonB-b = Non-blocking bset/bget API w/ buffer re-use guarantee



– HDFS

– MapReduce

– RPC

– HBase

– Spark






• Design Features

– Memcached-based burst-buffer

system

• Hides latency of parallel file

system access

• Read from local storage and

Memcached

– Data locality achieved by writing data

to local storage

– Different approaches of integration

with parallel file system to guarantee

fault-tolerance

Accelerating I/O Performance of Big Data Analytics through RDMA-based Key-Value Store

Application

I/O Forwarding Module

Map/Reduce Task DataNode

Local Disk

Data LocalityFault-tolerance

Lustre

Memcached-based Burst Buffer System


Evaluation with PUMA Workloads

Gains on OSU RI with our approach (Mem-bb) on 24 nodes

• SequenceCount: 34.5% over Lustre, 40% over HDFS

• RankedInvertedIndex: 27.3% over Lustre, 48.3% over HDFS

• HistogramRating: 17% over Lustre, 7% over HDFS

0

500

1000

1500

2000

2500

3000

3500

4000

4500

SeqCount RankedInvIndex HistoRating

Exe

cuti

on

Tim

e (

s)

Workloads

HDFS (32Gbps)

Lustre (32Gbps)

Mem-bb (32Gbps)

48.3%

40%

17%

N. S. Islam, D. Shankar, X. Lu, M.

W. Rahman, and D. K. Panda,

Accelerating I/O Performance of

Big Data Analytics with RDMA-

based Key-Value Store, ICPP

’15, September 2015



– HDFS

– MapReduce

– RPC

– HBase

– Spark






• The current benchmarks provide some performance behavior

• However, do not provide any information to the designer/developer on:

– What is happening at the lower-layer?

– Where the benefits are coming from?

– Which design is leading to benefits or bottlenecks?

– Which component in the design needs to be changed and what will be its impact?

– Can performance gain/loss at the lower-layer be correlated to the performance

gain/loss observed at the upper layer?

Are the Current Benchmarks Sufficient for Big Data?







Applications



Other Protocols?



QoS



Virtualization

Benchmarks

RDMA Protocols

Challenges in Benchmarking of RDMA-based Designs

Current

Benchmarks

No Benchmarks

Correlation?


• A comprehensive suite of benchmarks to

– Compare performance of different MPI libraries on various networks and systems

– Validate low-level functionalities

– Provide insights to the underlying MPI-level designs

• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth

• Extended later to

– MPI-2 one-sided

– Collectives

– GPU-aware data movement

– OpenSHMEM (point-to-point and collectives)

– UPC

• Has become an industry standard

• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries and even in procurement of large-scale systems

• Available from http://mvapich.cse.ohio-state.edu/benchmarks

• Available in an integrated manner with MVAPICH2 stack

OSU MPI Micro-Benchmarks (OMB) Suite

http://mvapich.cse.ohio-state.edu/benchmarks







Applications



Other Protocols?



QoS



Virtualization

Benchmarks

RDMA Protocols

Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications

Applications-Level

Benchmarks

Micro-

Benchmarks


• HDFS Benchmarks

– Sequential Write Latency (SWL) Benchmark

– Sequential Read Latency (SRL) Benchmark

– Random Read Latency (RRL) Benchmark

– Sequential Write Throughput (SWT) Benchmark

– Sequential Read Throughput (SRT) Benchmark

• Memcached Benchmarks

– Get Benchmark

– Set Benchmark

– Mixed Get/Set Benchmark

• Available as a part of OHB 0.8

OSU HiBD Benchmarks (OHB)N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D.

K. Panda, A Micro-benchmark Suite for

Evaluating HDFS Operations on Modern

Clusters, Int'l Workshop on Big Data

Benchmarking (WBDB '12), December 2012

D. Shankar, X. Lu, M. W. Rahman, N. Islam, and

D. K. Panda, A Micro-Benchmark Suite for

Evaluating Hadoop MapReduce on High-

Performance Networks, BPOE-5 (2014)

X. Lu, M. W. Rahman, N. Islam, and D. K. Panda,

A Micro-Benchmark Suite for Evaluating Hadoop

RPC on High-Performance Networks, Int'l

Workshop on Big Data Benchmarking (WBDB

'13), July 2013

To be Released

mailto:[email protected]


• Upcoming Releases of RDMA-enhanced Packages will support

– HBase

– Impala

• Upcoming Releases of OSU HiBD Micro-Benchmarks (OHB) will support

– MapReduce

– RPC

• Advanced designs with upper-level changes and optimizations

– Memcached with Non-blocking API


On-going and Future Plans of OSU High Performance Big Data (HiBD) Project


• Discussed challenges in accelerating Hadoop, Spark and Memcached

• Presented initial designs to take advantage of InfiniBand/RDMA for HDFS,

MapReduce, RPC, Spark, and Memcached

• Results are promising

• Many other open issues need to be solved

• Will enable Big Data community to take advantage of modern HPC

technologies to carry out their analytics in a fast and scalable manner

Concluding Remarks


Funding Acknowledgments

Funding Support by

Equipment Support by


Personnel AcknowledgmentsCurrent Students

– A. Augustine (M.S.)

– A. Awan (Ph.D.)

– S. Chakraborthy (Ph.D.)

– C.-H. Chu (Ph.D.)

– N. Islam (Ph.D.)

– M. Li (Ph.D.)

Past Students

– P. Balaji (Ph.D.)

– S. Bhagvat (M.S.)

– A. Bhat (M.S.)

– D. Buntinas (Ph.D.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

Past Research Scientist

– S. Sur

Current Post-Doc

– J. Lin

– D. Banerjee

Current Programmer

– J. Perkins

Past Post-Docs

– H. Wang

– X. Besseron

– H.-W. Jin

– M. Luo

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– J. Jose (Ph.D.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– A. Moody (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– S. Potluri (Ph.D.)

– R. Rajachandrasekar (Ph.D.)

– K. Kulkarni (M.S.)

– M. Rahman (Ph.D.)

– D. Shankar (Ph.D.)

– A. Venkatesh (Ph.D.)

– J. Zhang (Ph.D.)

– E. Mancini

– S. Marcarelli

– J. Vienne

Current Research Scientists Current Senior Research Associate

– H. Subramoni

– X. Lu

Past Programmers

– D. Bureddy

- K. Hamidouche

Current Research Specialist

– M. Arnold


Second International Workshop on High-Performance Big Data Computing (HPBDC)

HPBDC 2016 will be held with IEEE International Parallel and Distributed Processing

Symposium (IPDPS 2016), Chicago, Illinois USA, May 27th, 2016Keynote Talk: Dr. Chaitanya Baru,

Senior Advisor for Data Science, National Science Foundation (NSF); Distinguished Scientist, San Diego Supercomputer Center (SDSC)

Panel Moderator: Jianfeng Zhan (ICT/CAS)Panel Topic: Merge or Split: Mutual Influence between Big Data and HPC Techniques

Six Regular Research Papers and Two Short Research Papers

http://web.cse.ohio-state.edu/~luxi/hpbdc2016

HPBDC 2015 was held in conjunction with ICDCS’15





[email protected]

Thank You!

The High-Performance Big Data Projecthttp://hibd.cse.ohio-state.edu/

Network-Based Computing Laboratoryhttp://nowlab.cse.ohio-state.edu/

The MVAPICH2 Projecthttp://mvapich.cse.ohio-state.edu/

http://nowlab.cse.ohio-state.edu/

big data meets hpc exploiting hpc technologies for ...€¦ · for accelerating big data processing...

Documents