
Page 1: Sunrise or Sunset: Exploring the Design Space of Big Data Software Stacks

Panel Presentation at HPBDC '17

by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Page 2: Q1: Are Big Data Software Stacks Mature or Not?

• Big Data software stacks like Hadoop, Spark, and Memcached have been around for multiple years
  – Hadoop – 11 years (Apache Hadoop 0.1.0 released in April 2006)
  – Spark – 5 years (Apache Spark 0.5.1 released in June 2012)
  – Memcached – 14 years (initial release of Memcached on May 22, 2003)
• Increasingly being used in production environments
• Optimized for commodity clusters with Ethernet and the TCP/IP interface
• Not yet able to take full advantage of modern cluster and/or HPC technologies

Page 3: Data Management and Processing on Modern Clusters

• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
  – Front-end data accessing and serving (online)
    • Memcached + DB (e.g., MySQL), HBase (a cache-aside sketch follows below)
  – Back-end data analytics (offline)
    • HDFS, MapReduce, Spark

[Figure: multi-tier data management and processing. Internet-facing web servers in the front-end tier use Memcached + DB (MySQL) and NoSQL DB (HBase) for data accessing and serving; the back-end tier runs data analytics apps/jobs on HDFS, MapReduce, and Spark.]
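The front-end tier pairs Memcached with a database in a cache-aside pattern. Below is a minimal Python sketch of that pattern using the pymemcache client; the Memcached address and the query_mysql helper are illustrative stand-ins, not part of the original slides.

```python
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))   # assumed local Memcached server


def query_mysql(key):
    # Hypothetical stand-in for the real MySQL/HBase lookup in the back end.
    return ("row-for-" + key).encode()


def get_with_cache(key):
    value = cache.get(key)             # fast path: serve from Memcached
    if value is None:                  # cache miss: fall back to the database
        value = query_mysql(key)
        cache.set(key, value, expire=300)  # populate the cache for later reads
    return value


print(get_with_cache("user:42"))
```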

Page 4: Who Are Using Hadoop?

• Focuses on large data and data analysis
• The Hadoop environment (e.g., HDFS, MapReduce, RPC, HBase) is gaining a lot of momentum
• http://wiki.apache.org/hadoop/PoweredBy

Page 5: Spark Ecosystem

• Generalizes MapReduce to support new apps in the same engine
• Two key observations
  – General task support with DAGs
  – Multi-stage and interactive apps require faster data sharing across parallel jobs (see the sketch below)

[Figure: the Spark ecosystem. Spark Core underlies Spark Streaming (real-time), GraphX (graph), Spark SQL, MLlib (machine learning), BlinkDB, and deep learning frameworks (Caffe, TensorFlow, BigDL, etc.), running on Standalone, Apache Mesos, or YARN.]
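To make the two observations concrete, here is a minimal PySpark sketch (assuming a local Spark installation and a text file named input.txt, both illustrative): one dataset is cached in memory and then reused by two different jobs in the same engine, so the second job does not re-read the data.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "shared-dataset-demo")

# Build the dataset once and keep it in memory for reuse across jobs.
words = sc.textFile("input.txt").flatMap(lambda line: line.split()).cache()

# Job 1: word-count style aggregation (one DAG of general tasks).
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# Job 2: a different computation over the same cached data, with no re-read from disk.
longest = words.map(len).max()

print(counts.take(5), longest)
sc.stop()
```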

Page 6: Who Are Using Spark?

• Focuses on large data and data analysis with in-memory techniques
• Apache Spark is gaining a lot of momentum
• http://spark.apache.org/powered-by.html

Page 7: Q2: What Are the Main Driving Forces for New-generation Big Data Software Stacks?

Page 8: Increasing Usage of HPC, Big Data, and Deep Learning

• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
• HPC (MPI, RDMA, Lustre, etc.)

Convergence of HPC, Big Data, and Deep Learning!

Page 9: How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data and Deep Learning Applications?

Bring HPC, Big Data processing, and Deep Learning into a "convergent trajectory"!

• What are the major bottlenecks in current Big Data processing and Deep Learning middleware (e.g., Hadoop, Spark)?
• Can the bottlenecks be alleviated with new designs that take advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing and Deep Learning?
• Can HPC clusters with high-performance storage systems (e.g., SSDs, parallel file systems) benefit Big Data and Deep Learning applications?
• How much performance benefit can be achieved through enhanced designs?
• How should benchmarks be designed for evaluating the performance of Big Data and Deep Learning middleware on HPC clusters?

Pages 10-13: Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

Page 14: Q3: What Opportunities Are Provided for the Academic Community in Exploring the Design Space of Big Data Software Stacks?

Page 15: Designing Communication and I/O Libraries for Big Data Systems: Challenges

[Figure: the design stack. Applications sit on Big Data middleware (HDFS, MapReduce, HBase, Spark, gRPC/TensorFlow, and Memcached), which uses programming models (sockets) and RDMA protocols through a communication and I/O library providing point-to-point communication, threaded models and synchronization, virtualization (SR-IOV), I/O and file systems, QoS and fault tolerance, performance tuning, and benchmarks. Underneath are commodity computing system architectures (multi- and many-core architectures and accelerators), networking technologies (InfiniBand, 1/10/40/100 GigE, and intelligent NICs), and storage technologies (HDD, SSD, NVM, and NVMe-SSD).]

Page 16: The High-Performance Big Data (HiBD) Project

• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
  – HDFS, Memcached, HBase, and Spark micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 230 organizations from 30 countries
• More than 21,800 downloads from the project site

Available for InfiniBand and RoCE; also runs on Ethernet.

Page 17: RDMA for Apache Hadoop 2.x Distribution

• High-performance design of Hadoop over RDMA-enabled interconnects
  – High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
  – Enhanced HDFS with in-memory and heterogeneous storage
  – High-performance design of MapReduce over Lustre
  – Memcached-based burst buffer for MapReduce over Lustre-integrated HDFS (HHH-L-BB mode)
  – Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH, and HDP
  – Easily configurable for different running modes (HHH, HHH-M, HHH-L, HHH-L-BB, and MapReduce over Lustre) and different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 1.1.0
  – Based on Apache Hadoop 2.7.3
  – Compliant with Apache Hadoop 2.7.1, HDP 2.5.0.3, and CDH 5.8.2 APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • Different file systems with disks and SSDs, and Lustre

http://hibd.cse.ohio-state.edu

Page 18: Different Modes of RDMA for Apache Hadoop 2.x

• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation for better fault tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory-based setup has been introduced in this package that can be utilized to perform all I/O operations in memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
• MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in the different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).

Page 19: RDMA for Apache Spark Distribution

• High-performance design of Spark over RDMA-enabled interconnects
  – High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for Spark
  – RDMA-based data shuffle and SEDA-based shuffle architecture
  – Support for pre-connection, on-demand connection, and connection sharing
  – Non-blocking and chunk-based data transfer
  – Off-JVM-heap buffer management
  – Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.4
  – Based on Apache Spark 2.1.0
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • RAM disks, SSDs, and HDDs
  – http://hibd.cse.ohio-state.edu

Page 20: HiBD Packages on SDSC Comet and Chameleon Cloud

• RDMA for Apache Hadoop 2.x and RDMA for Apache Spark are installed and available on SDSC Comet.
  – Examples for various modes of usage are available in:
    • RDMA for Apache Hadoop 2.x: /share/apps/examples/HADOOP
    • RDMA for Apache Spark: /share/apps/examples/SPARK/
  – Please email [email protected] (reference Comet as the machine and SDSC as the site) if you have any further questions about usage and configuration.
• RDMA for Apache Hadoop is also available on Chameleon Cloud as an appliance
  – https://www.chameleoncloud.org/appliances/17/

M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE '16, July 2016.

Page 21: Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen on OSU-RI2 (EDR)

Cluster with 8 nodes and a total of 64 maps:
• RandomWriter – 3x improvement over IPoIB for 80–160 GB file sizes
• TeraGen – 4x improvement over IPoIB for 80–240 GB file sizes

[Charts: execution time (s) vs. data size (GB) for RandomWriter (80–160 GB) and TeraGen (80–240 GB), comparing IPoIB (EDR) with OSU-IB (EDR); execution time is reduced by 3x (RandomWriter) and 4x (TeraGen).]

Page 22: Performance Numbers of RDMA for Apache Hadoop 2.x – Sort & TeraSort on OSU-RI2 (EDR)

• Sort – 61% improvement over IPoIB for 80–160 GB data (cluster with 8 nodes, 64 maps, and 32 reduces)
• TeraSort – 18% improvement over IPoIB for 80–240 GB data (cluster with 8 nodes, 64 maps, and 14 reduces)

[Charts: execution time (s) vs. data size (GB) for Sort (80–160 GB) and TeraSort (80–240 GB), comparing IPoIB (EDR) with OSU-IB (EDR); execution time is reduced by 61% (Sort) and 18% (TeraSort).]

Page 23: Design Overview of Spark with RDMA

• Design features
  – RDMA-based shuffle plugin
  – SEDA-based architecture
  – Dynamic connection management and sharing
  – Non-blocking data transfer
  – Off-JVM-heap buffer management
  – InfiniBand/RoCE support
• Enables high-performance RDMA communication while supporting the traditional socket interface
• A JNI layer bridges Scala-based Spark with the communication library written in native code (see the ctypes sketch below for the general pattern)

[Figure: Apache Spark benchmarks/applications/libraries/frameworks run on Spark Core, whose shuffle manager (Sort, Hash, Tungsten-Sort) uses a block transfer service (Netty, NIO, RDMA plugin). The Netty and NIO servers/clients use the Java socket interface over 1/10/40/100 GigE or IPoIB networks; the RDMA server/client uses the Java Native Interface (JNI) to reach a native RDMA-based communication engine over RDMA-capable networks (IB, iWARP, RoCE, etc.).]

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014.

X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData '16, Dec. 2016.
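The RDMA plugin reaches the native communication engine from the JVM through JNI. Python's ctypes offers the same "managed runtime calls a native library" pattern; the sketch below only hands a buffer to libc's strlen as a placeholder for posting it to a native transfer engine, and illustrates the bridging idea rather than the HiBD plugin itself.

```python
import ctypes
import ctypes.util

# Load a native library (libc here), analogous to the JNI layer loading the
# native RDMA-based communication engine.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# A block of bytes standing in for a shuffle block kept off the managed heap.
block = ctypes.create_string_buffer(b"shuffle block payload")

# In the real design the native engine would post this buffer for RDMA;
# here we simply pass it to a native routine as a placeholder call.
print(libc.strlen(block))
```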

Page 24: Performance Evaluation on SDSC Comet – SortBy/GroupBy

• InfiniBand FDR, SSD, 64 worker nodes, 1536 cores (1536 maps, 1536 reduces)
• RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node (a PySpark analogue of the workloads is sketched below)
  – SortBy: total time reduced by up to 80% over IPoIB (56 Gbps)
  – GroupBy: total time reduced by up to 74% over IPoIB (56 Gbps)

[Charts: total time (sec) vs. data size (64, 128, 256 GB) for SortByTest and GroupByTest on 64 worker nodes/1536 cores, comparing IPoIB with RDMA; reductions of up to 80% (SortBy) and 74% (GroupBy).]
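A rough PySpark analogue of the SortByTest/GroupByTest pattern evaluated above is sketched here (tiny local data instead of the 64–256 GB runs on Comet; names and sizes are illustrative only). Both jobs are shuffle-heavy, which is exactly where an RDMA-based shuffle engine helps.

```python
import random
from pyspark import SparkContext

sc = SparkContext("local[*]", "shuffle-microbench")

# Key-value pairs with random keys so that both jobs trigger a full shuffle.
pairs = sc.parallelize(range(1000000)).map(lambda i: (random.randint(0, 9999), i))

pairs.sortByKey().count()                  # SortBy-style job
pairs.groupByKey().mapValues(len).count()  # GroupBy-style job

sc.stop()
```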

Page 25: Performance Evaluation on SDSC Comet – HiBench PageRank

• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536 maps, 768/1536 reduces)
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
  – 32 nodes/768 cores: total time reduced by 37% over IPoIB (56 Gbps)
  – 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56 Gbps)

[Charts: total PageRank time (sec) for the HiBench Huge, BigData, and Gigantic data sizes on 32 worker nodes/768 cores and 64 worker nodes/1536 cores, comparing IPoIB with RDMA; reductions of 37% and 43%, respectively.]

Page 26: Evaluation with BigDL on RDMA-Spark

• VGG training model on the CIFAR-10 dataset
• Evaluated on the SDSC Comet supercomputer
• Initial results: RDMA-based Spark outperforms default Spark over IPoIB by a factor of 4.58x

[Chart: one-epoch time (sec) vs. number of cores (24–384) for IPoIB and RDMA; up to 4.58x speedup.]

Page 27: Design Overview of NVM- and RDMA-aware HDFS (NVFS)

• Design features
  – RDMA over NVM
  – HDFS I/O with NVM
    • Block access
    • Memory access
  – Hybrid design
    • NVM with SSD as hybrid storage for HDFS I/O
  – Co-design with Spark and HBase
    • Cost-effectiveness
    • Use case

[Figure: applications and benchmarks (Hadoop MapReduce, Spark, HBase) are co-designed (for cost-effectiveness and use case) with the NVM- and RDMA-aware HDFS (NVFS) DataNode, which contains RDMA sender/receiver and DFSClient RDMA replicator components and a writer/reader with NVFS-BlkIO and NVFS-MemIO paths over NVM, backed by SSDs.]

N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, High Performance Design for HDFS with Byte-Addressability of NVM and RDMA, International Conference on Supercomputing (ICS), June 2016.

Page 28: Evaluation with Hadoop MapReduce (TestDFSIO)

• TestDFSIO on SDSC Comet (32 nodes)
  – Write: NVFS-MemIO gains 4x over HDFS
  – Read: NVFS-MemIO gains 1.2x over HDFS
• TestDFSIO on OSU Nowlab (4 nodes)
  – Write: NVFS-MemIO gains 4x over HDFS
  – Read: NVFS-MemIO gains 2x over HDFS

[Charts: average TestDFSIO throughput (MBps) for write and read on SDSC Comet (32 nodes) and OSU Nowlab (4 nodes), comparing HDFS (56 Gbps), NVFS-BlkIO (56 Gbps), and NVFS-MemIO (56 Gbps).]

Page 29: Overview of RDMA-Hadoop-Virt Architecture

• Virtualization-aware modules in all four main Hadoop components:
  – HDFS: virtualization-aware block management to improve fault tolerance
  – YARN: extensions to the container allocation policy to reduce network traffic
  – MapReduce: extensions to the map task scheduling policy to reduce network traffic
  – Hadoop Common: topology detection module for automatic topology detection
• Communications in HDFS, MapReduce, and RPC go through RDMA-based designs over SR-IOV-enabled InfiniBand

[Figure: Big Data applications (CloudBurst, MR-MS Polygraph, and others) run on HDFS, YARN, MapReduce, HBase, and other components over virtual machines, bare-metal nodes, and containers. Hadoop Common hosts the topology detection module, YARN the container allocation policy extension, MapReduce the map task scheduling policy extension, and HDFS the virtualization-aware block management.]

S. Gugnani, X. Lu, and D. K. Panda, Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds, CloudCom, 2016.

Page 30: Evaluation with Applications

• 14% and 24% improvement with Default Mode for CloudBurst and Self-Join
• 30% and 55% improvement with Distributed Mode for CloudBurst and Self-Join

[Charts: execution time of CloudBurst and Self-Join in Default and Distributed modes, comparing RDMA-Hadoop with RDMA-Hadoop-Virt; 30% and 55% reductions in Distributed Mode.]

Page 31: Deep Learning: New Challenges for MPI Runtimes

• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (on the order of megabytes)
  – Most communication based on GPU buffers
• How to address these newer requirements?
  – GPU-specific communication libraries (NCCL)
    • NVIDIA's NCCL library provides inter-GPU communication
  – CUDA-aware MPI (MVAPICH2-GDR)
    • Provides support for GPU-based communication
• Can we exploit CUDA-aware MPI and NCCL to support Deep Learning applications? (A hierarchical broadcast in this spirit is sketched below.)

[Figure: hierarchical communication (Knomial + NCCL ring) – inter-node communication over a Knomial tree among node leaders, and intra-node communication over an NCCL ring across GPUs attached through PLX switches to the CPU.]
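The hierarchical scheme above combines an inter-node step among node leaders with an intra-node step across the GPUs of each node. The mpi4py sketch below mimics that shape with plain host buffers; the actual design uses CUDA-aware MPI_Bcast plus an NCCL ring, and the communicator names and 64 MB message size here are illustrative only.

```python
import numpy as np
from mpi4py import MPI

world = MPI.COMM_WORLD
node = world.Split_type(MPI.COMM_TYPE_SHARED)        # ranks on the same node
leaders = world.Split(0 if node.rank == 0 else 1, world.rank)

buf = np.zeros(64 * 1024 * 1024 // 4, dtype=np.float32)  # ~64 MB message
if world.rank == 0:
    buf[:] = 1.0

if node.rank == 0:          # step 1: broadcast across node leaders
    leaders.Bcast(buf, root=0)
node.Bcast(buf, root=0)     # step 2: broadcast within each node

assert buf[0] == 1.0        # every rank now holds the broadcast data
```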

Page 32: Efficient Broadcast: MVAPICH2-GDR and NCCL

• NCCL has some limitations
  – Only works within a single node; thus, no scale-out across multiple nodes
  – Degradation across the IOH (socket) for scale-up within a node
• We propose an optimized MPI_Bcast
  – Communication of very large GPU buffers (on the order of megabytes)
  – Scale-out on a large number of dense multi-GPU nodes
• Hierarchical communication that efficiently exploits:
  – CUDA-aware MPI_Bcast in MV2-GDR
  – the NCCL broadcast primitive

Performance benefits: up to 2.2x in OSU micro-benchmarks, and 25% average improvement for the Microsoft CNTK DL framework.

[Chart: time (seconds) vs. number of GPUs (2–64) for MV2-GDR and MV2-GDR-Opt.]

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning, A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda, The 23rd European MPI Users' Group Meeting (EuroMPI 16), Sep 2016 [Best Paper Runner-Up].

Page 33: Large Message Optimized Collectives for Deep Learning

• MV2-GDR provides optimized collectives for large message sizes
• Optimized Reduce, Allreduce, and Bcast
• Good scaling with a large number of GPUs
• Available with MVAPICH2-GDR 2.2 GA

[Charts: latency (ms) of Reduce, Allreduce, and Bcast as a function of message size (2–128 MB) at 64–192 GPUs, and as a function of GPU count (16–192) at 64–128 MB message sizes.]

Page 34: OSU-Caffe: Scalable Deep Learning

• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training (a gradient-averaging sketch follows below)
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Evaluated by training the GoogLeNet network on the ImageNet dataset

[Chart: training time (seconds) vs. number of GPUs (8–128) for GoogLeNet (ImageNet) on up to 128 GPUs, comparing Caffe (marked "invalid use case" at larger GPU counts) with OSU-Caffe (1024) and OSU-Caffe (2048).]

OSU-Caffe is publicly available from: http://hidl.cse.ohio-state.edu

A. A. Awan, K. Hamidouche, J. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, PPoPP, 2017.
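MPI-based parallel training of the kind OSU-Caffe performs builds on data parallelism: each rank computes gradients on its shard and the runtime averages them with a collective. A minimal mpi4py sketch of that gradient-averaging step follows; random numbers stand in for real Caffe gradients, so this illustrates the pattern rather than OSU-Caffe's implementation.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Each rank's locally computed gradients (random stand-ins for a real model).
local_grad = np.random.rand(1000000).astype(np.float32)
global_grad = np.empty_like(local_grad)

# Sum gradients across all ranks, then divide to obtain the averaged update.
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= comm.size

if comm.rank == 0:
    print("averaged gradient norm:", np.linalg.norm(global_grad))
```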

Page 35: Open Challenges in Designing Communication and I/O Middleware for High-Performance Big Data Processing

• High-performance designs for Big Data middleware
  – NVM-aware communication and I/O schemes for Big Data
  – SATA-/PCIe-/NVMe-SSD support
  – High-bandwidth memory support
  – Threaded models and synchronization
  – Locality-aware designs
• Fault tolerance/resiliency
  – Migration support with virtual machines
  – Data replication
• Efficient data access and placement policies
• Efficient task scheduling
• Fast deployment and automatic configuration on Clouds
• Optimization for Deep Learning applications

Page 36: Sunrise or Sunset of Big Data Software?

Assuming 6:00 am as sunrise and 6:00 pm as sunset, we are at 8:00 am.

Page 37: Thank You!

[email protected]
http://www.cse.ohio-state.edu/~panda

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/