Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing


Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

Keynote Talk at HPCAC-Switzerland (Mar 2016)

HPCAC-Switzerland (Mar '16), Network Based Computing Laboratory

Introduction to Big Data Applications and Analytics

•  Big Data has become one of the most important elements of business analytics
•  Provides groundbreaking opportunities for enterprise information management and decision making
•  The amount of data is exploding; companies are capturing and digitizing more information than ever
•  The rate of information growth appears to be exceeding Moore's Law

4V Characteristics of Big Data

•  Commonly accepted 3 V's of Big Data: Volume, Velocity, Variety
   Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf
•  4/5 V's of Big Data – 3 V + *Veracity, *Value

Courtesy: http://api.ning.com/files/tRHkwQN7s-Xz5cxylXG004GLGJdjoPd6bVfVBwvgu*F5MwDDUCiHHdmBW-JTEz0cfJjGurJucBMTkIUNdL3jcZT8IPfNWfN9/dv1.jpg

Data Generation in Internet Services and Applications

•  Web pages (content, graph)
•  Clicks (ad, page, social)
•  Users (OpenID, FB Connect, etc.)
•  e-mails (Hotmail, Y! Mail, Gmail, etc.)
•  Photos, Movies (Flickr, YouTube, Video, etc.)
•  Cookies / tracking info (see Ghostery)
•  Installed apps (Android market, App Store, etc.)
•  Location (Latitude, Loopt, Foursquare, Google Now, etc.)
•  User generated content (Wikipedia & co, etc.)
•  Ads (display, text, DoubleClick, Yahoo, etc.)
•  Comments (Discuss, Facebook, etc.)
•  Reviews (Yelp, Y! Local, etc.)
•  Social connections (LinkedIn, Facebook, etc.)
•  Purchase decisions (Netflix, Amazon, etc.)
•  Instant Messages (YIM, Skype, Gtalk, etc.)
•  Search terms (Google, Bing, etc.)
•  News articles (BBC, NY Times, Y! News, etc.)
•  Blog posts (Tumblr, Wordpress, etc.)
•  Microblogs (Twitter, Jaiku, Meme, etc.)
•  Link sharing (Facebook, Delicious, Buzz, etc.)

Number of apps in the Apple App Store, Android Market, Blackberry, and Windows Phone (2013)
•  Android Market: <1200K
•  Apple App Store: ~1000K
Courtesy: http://dazeinfo.com/2014/07/10/apple-inc-aapl-ios-google-inc-goog-android-growth-mobile-ecosystem-2014/

Velocity of Big Data – How Much Data Is Generated Every Minute on the Internet?

The global Internet population grew 18.5% from 2013 to 2015 and now represents 3.2 billion people.

Courtesy: https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/

Not Only in Internet Services – Big Data in Scientific Domains

•  Scientific Data Management, Analysis, and Visualization
•  Application examples
   –  Climate modeling
   –  Combustion
   –  Fusion
   –  Astrophysics
   –  Bioinformatics
•  Data Intensive Tasks
   –  Run large-scale simulations on supercomputers
   –  Dump data on parallel storage systems
   –  Collect experimental/observational data
   –  Move experimental/observational data to analysis sites
   –  Visual analytics: help understand data visually

Typical Solutions or Architectures for Big Data Analytics

•  Hadoop: http://hadoop.apache.org
   –  The most popular framework for Big Data Analytics
   –  HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc.
•  Spark: http://spark-project.org
   –  Provides primitives for in-memory cluster computing; jobs can load data into memory and query it repeatedly
•  Storm: http://storm-project.net
   –  A distributed real-time computation system for real-time analytics, online machine learning, continuous computation, etc.
•  S4: http://incubator.apache.org/s4
   –  A distributed system for processing continuous unbounded streams of data
•  GraphLab: http://graphlab.org
   –  Consists of a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API
•  Web 2.0: RDBMS + Memcached (http://memcached.org)
   –  Memcached: a high-performance, distributed memory object caching system

Big Data Processing with Hadoop Components

•  Major components included in this tutorial:
   –  MapReduce (Batch)
   –  HBase (Query)
   –  HDFS (Storage)
   –  RPC (Inter-process communication)
•  Underlying Hadoop Distributed File System (HDFS) used by both MapReduce and HBase
•  Model scales, but the high amount of communication during intermediate phases can be further optimized

[Figure: Hadoop framework stack – User Applications over MapReduce and HBase, both running on Hadoop Common (RPC) and HDFS]
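The batch model above can be illustrated with a minimal single-process sketch of the MapReduce programming pattern (plain Python, not the Hadoop API; a word count over toy documents):

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input record
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group intermediate values by key; in a real cluster this
    # is the communication-heavy intermediate phase the slide refers to
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values per key
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data meets hpc", "big data processing"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
```

In a distributed deployment the shuffle step moves bulk data across the network, which is why the rest of the talk focuses on optimizing that communication.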

Spark Architecture Overview

•  An in-memory data-processing framework
   –  Iterative machine learning jobs
   –  Interactive data analytics
   –  Scala-based implementation
   –  Standalone, YARN, Mesos
•  Scalable and communication intensive
   –  Wide dependencies between Resilient Distributed Datasets (RDDs)
   –  MapReduce-like shuffle operations to repartition RDDs
   –  Sockets-based communication

[Figure: Master node running SparkContext, with Zookeeper and an HDFS driver, coordinating multiple Worker nodes]

http://spark.apache.org
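A wide dependency means each output partition may need records from every input partition; a minimal sketch of such a hash repartition (plain Python, not Spark's API) shows why this forces an all-to-all shuffle:

```python
def repartition(partitions, num_output):
    # Every input partition can contribute to every output partition:
    # this all-to-all exchange is the shuffle the slide refers to.
    output = [[] for _ in range(num_output)]
    for part in partitions:
        for key, value in part:
            output[hash(key) % num_output].append((key, value))
    return output

parts = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
out = repartition(parts, 2)
# After repartitioning, all records with the same key live in one partition
```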

Memcached Architecture

•  Distributed Caching Layer
   –  Allows aggregating spare memory from multiple nodes
   –  General purpose
•  Typically used to cache database queries, results of API calls
•  Scalable model, but typical usage is very network intensive

[Figure: multiple nodes, each with main memory, CPUs, SSD and HDD, connected by high performance networks]
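The usual deployment pattern is cache-aside: check the cache before the database and populate it on a miss. A minimal sketch, with an in-process dict standing in for the Memcached pool and a hypothetical `query_db` standing in for the expensive backend call:

```python
cache = {}                    # stands in for the distributed Memcached pool
db = {"user:42": "Alice"}     # stands in for the backing database
db_hits = 0                   # counts expensive backend accesses

def query_db(key):
    global db_hits
    db_hits += 1
    return db.get(key)

def cached_get(key):
    if key in cache:          # cache hit: no round trip to the database
        return cache[key]
    value = query_db(key)     # cache miss: fall through to the database
    cache[key] = value        # populate the cache for subsequent readers
    return value

cached_get("user:42")   # miss -> hits the DB once
cached_get("user:42")   # hit  -> served from the cache
```

Each `cached_get` is a network operation in a real deployment, which is why the cache path itself is so network intensive.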

Data Management and Processing on Modern Clusters

•  Substantial impact on designing and utilizing data management and processing systems in multiple tiers
   –  Front-end data accessing and serving (Online)
      •  Memcached + DB (e.g. MySQL), HBase
   –  Back-end data analytics (Offline)
      •  HDFS, MapReduce, Spark

[Figure: Internet-facing front-end tier (web servers, Memcached + DB (MySQL), NoSQL DB (HBase)) for data accessing and serving, feeding a back-end tier of data analytics apps/jobs over HDFS, MapReduce and Spark]

Open Standard InfiniBand Networking Technology

•  Introduced in Oct 2000
•  High Performance Data Transfer
   –  Interprocessor communication and I/O
   –  Low latency (<1.0 microsec), high bandwidth (up to 12.5 GigaBytes/sec -> 100 Gbps), and low CPU utilization (5-10%)
•  Multiple Operations
   –  Send/Recv
   –  RDMA Read/Write
   –  Atomic Operations (very unique)
      •  high performance and scalable implementations of distributed locks, semaphores, collective communication operations
•  Leading to big changes in designing
   –  HPC clusters
   –  File systems
   –  Cloud computing systems
   –  Grid computing systems
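A quick back-of-the-envelope check of what those numbers mean for a single transfer, using the figures quoted above in a simple latency-plus-bandwidth cost model (the model itself is the standard first-order approximation, not something specific to InfiniBand):

```python
# Characteristics quoted on the slide
latency_s = 1.0e-6        # < 1.0 microsecond one-way latency
bandwidth_Bps = 12.5e9    # 12.5 GigaBytes/sec, roughly 100 Gbps

def transfer_time(message_bytes):
    # First-order model: fixed latency plus size divided by bandwidth
    return latency_s + message_bytes / bandwidth_Bps

small = transfer_time(64)         # tiny message: latency-dominated
large = transfer_time(1 << 30)    # 1 GiB message: bandwidth-dominated
```

The crossover matters for middleware design: small RPC-style messages benefit mostly from the sub-microsecond latency, while bulk shuffle and HDFS transfers benefit mostly from the bandwidth.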


Interconnects and Protocols in OpenFabrics Stack

[Figure: Application/Middleware interface (Sockets or Verbs) over the available protocol / adapter / switch combinations:
   –  Sockets / TCP/IP (kernel space, Ethernet driver) – Ethernet adapter and switch – 1/10/40/100 GigE
   –  Sockets / TCP/IP with hardware offload – Ethernet adapter and switch – 10/40 GigE-TOE
   –  Sockets / IPoIB (kernel space) – InfiniBand adapter and switch – IPoIB
   –  Sockets / RSockets (user space) – InfiniBand adapter and switch – RSockets
   –  Sockets / SDP (user space) – InfiniBand adapter and switch – SDP
   –  Sockets / TCP/IP (user space) – iWARP adapter, Ethernet switch – iWARP
   –  Sockets / RDMA (user space) – RoCE adapter, Ethernet switch – RoCE
   –  Verbs / RDMA (user space) – InfiniBand adapter and switch – IB Native]

How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?

•  What are the major bottlenecks in current Big Data processing middleware (e.g. Hadoop, Spark, and Memcached)?
•  Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies?
•  Can RDMA-enabled high-performance interconnects benefit Big Data processing?
•  Can HPC clusters with high-performance storage systems (e.g. SSD, parallel file systems) benefit Big Data applications?
•  How much performance benefit can be achieved through enhanced designs?
•  How to design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?

Bring HPC and Big Data processing into a "convergent trajectory"!

Designing Communication and I/O Libraries for Big Data Systems: Challenges

[Figure: Applications and Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached) over programming models (Sockets; other protocols?) and a communication and I/O library (point-to-point communication, threaded models and synchronization, QoS, fault-tolerance, I/O and file systems, virtualization, benchmarks; upper-level changes?), running on commodity computing system architectures (multi- and many-core architectures and accelerators), networking technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs), and storage technologies (HDD, SSD, and NVMe-SSD)]

Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?

•  Sockets not designed for high performance
   –  Stream semantics often mismatch for upper layers
   –  Zero-copy not available for non-blocking sockets

Current design: Application -> Sockets -> 1/10/40/100 GigE network
Our approach: Application -> OSU Design -> Verbs Interface -> 10/40/100 GigE or InfiniBand

The High-Performance Big Data (HiBD) Project

•  RDMA for Apache Spark
•  RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
   –  Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
•  RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
•  RDMA for Memcached (RDMA-Memcached)
•  OSU HiBD-Benchmarks (OHB)
   –  HDFS and Memcached Micro-benchmarks
•  http://hibd.cse.ohio-state.edu
•  Users Base: 155 organizations from 20 countries
•  More than 15,500 downloads from the project site
•  RDMA for Apache HBase and Impala (upcoming)

Available for InfiniBand and RoCE

RDMA for Apache Hadoop 2.x Distribution

•  High-Performance Design of Hadoop over RDMA-enabled Interconnects
   –  High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and RPC components
   –  Enhanced HDFS with in-memory and heterogeneous storage
   –  High performance design of MapReduce over Lustre
   –  Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH and HDP
   –  Easily configurable for different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre) and different protocols (native InfiniBand, RoCE, and IPoIB)
•  Current release: 0.9.9
   –  Based on Apache Hadoop 2.7.1
   –  Compliant with Apache Hadoop 2.7.1, HDP 2.3.0.0 and CDH 5.6.0 APIs and applications
   –  Tested with
      •  Mellanox InfiniBand adapters (DDR, QDR and FDR)
      •  RoCE support with Mellanox adapters
      •  Various multi-core platforms
      •  Different file systems with disks and SSDs and Lustre
   –  http://hibd.cse.ohio-state.edu

RDMA for Apache Spark Distribution

•  High-Performance Design of Spark over RDMA-enabled Interconnects
   –  High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Spark
   –  RDMA-based data shuffle and SEDA-based shuffle architecture
   –  Non-blocking and chunk-based data transfer
   –  Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
•  Current release: 0.9.1
   –  Based on Apache Spark 1.5.1
   –  Tested with
      •  Mellanox InfiniBand adapters (DDR, QDR and FDR)
      •  RoCE support with Mellanox adapters
      •  Various multi-core platforms
      •  RAM disks, SSDs, and HDD
   –  http://hibd.cse.ohio-state.edu

RDMA for Memcached Distribution

•  High-Performance Design of Memcached over RDMA-enabled Interconnects
   –  High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Memcached and libMemcached components
   –  High performance design of SSD-Assisted Hybrid Memory
   –  Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
•  Current release: 0.9.4
   –  Based on Memcached 1.4.24 and libMemcached 1.0.18
   –  Compliant with libMemcached APIs and applications
   –  Tested with
      •  Mellanox InfiniBand adapters (DDR, QDR and FDR)
      •  RoCE support with Mellanox adapters
      •  Various multi-core platforms
      •  SSD
   –  http://hibd.cse.ohio-state.edu

OSU HiBD Micro-Benchmark (OHB) Suite – HDFS & Memcached

•  Micro-benchmarks for Hadoop Distributed File System (HDFS)
   –  Sequential Write Latency (SWL) Benchmark, Sequential Read Latency (SRL) Benchmark, Random Read Latency (RRL) Benchmark, Sequential Write Throughput (SWT) Benchmark, Sequential Read Throughput (SRT) Benchmark
   –  Support benchmarking of
      •  Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, Cloudera Distribution of Hadoop (CDH) HDFS
•  Micro-benchmarks for Memcached
   –  Get Benchmark, Set Benchmark, and Mixed Get/Set Benchmark
•  Current release: 0.8
•  http://hibd.cse.ohio-state.edu

Different Modes of RDMA for Apache Hadoop 2.x

•  HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to have better fault-tolerance as well as performance. This mode is enabled by default in the package.
•  HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.
•  HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
•  MapReduce over Lustre, with/without local disks: Besides HDFS based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
•  Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).

Acceleration Case Studies and Performance Evaluation

•  RDMA-based Designs and Performance Evaluation
   –  HDFS
   –  MapReduce
   –  RPC
   –  HBase
   –  Spark
   –  Memcached (Basic, Hybrid and Non-blocking APIs)
   –  HDFS + Memcached-based Burst Buffer
   –  OSU HiBD Benchmarks (OHB)

Design Overview of HDFS with RDMA

•  Enables high performance RDMA communication, while supporting traditional socket interface
•  JNI Layer bridges Java-based HDFS with communication library written in native code
•  Design Features
   –  RDMA-based HDFS write
   –  RDMA-based HDFS replication
   –  Parallel replication support
   –  On-demand connection setup
   –  InfiniBand/RoCE support

[Figure: Applications over HDFS; Java Socket Interface over the 1/10/40/100 GigE, IPoIB network; Java Native Interface (JNI) into the OSU design over Verbs and RDMA-capable networks (IB, iWARP, RoCE ..)]

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012

N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014

Enhanced HDFS with In-Memory and Heterogeneous Storage (Triple-H)

•  Design Features
   –  Three modes
      •  Default (HHH)
      •  In-Memory (HHH-M)
      •  Lustre-Integrated (HHH-L)
   –  Policies to efficiently utilize the heterogeneous storage devices
      •  RAM, SSD, HDD, Lustre
   –  Eviction/Promotion based on data usage pattern
   –  Hybrid Replication
   –  Lustre-Integrated mode:
      •  Lustre-based fault-tolerance

[Figure: Applications over heterogeneous storage (RAM Disk, SSD, HDD, Lustre) managed through data placement policies, hybrid replication, and eviction/promotion]

N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015
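The eviction/promotion idea can be sketched as a toy two-tier placement policy driven by access counts. This is plain Python with illustrative thresholds and tier names, not Triple-H's actual policy:

```python
class TieredStore:
    """Toy RAM/SSD placement: frequently read blocks move to the fast tier."""

    def __init__(self, fast_capacity, promote_after=2):
        self.fast, self.slow = {}, {}       # fast tier (RAM), slow tier (SSD/HDD)
        self.fast_capacity = fast_capacity
        self.promote_after = promote_after  # reads before a block counts as hot
        self.hits = {}

    def put(self, block, data):
        self.slow[block] = data             # new data starts in the slow tier
        self.hits[block] = 0

    def get(self, block):
        self.hits[block] = self.hits.get(block, 0) + 1
        if block in self.fast:
            return self.fast[block]
        data = self.slow[block]
        if self.hits[block] >= self.promote_after:
            if len(self.fast) >= self.fast_capacity:
                # Evict the coldest fast-tier block back to the slow tier
                coldest = min(self.fast, key=lambda b: self.hits[b])
                self.slow[coldest] = self.fast.pop(coldest)
            self.fast[block] = self.slow.pop(block)   # promote the hot block
        return data

store = TieredStore(fast_capacity=1)
store.put("b1", b"aa")
store.put("b2", b"bb")
store.get("b1")
store.get("b1")   # second read makes b1 hot, so it is promoted
```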


Design Overview of MapReduce with RDMA

•  Enables high performance RDMA communication, while supporting traditional socket interface
•  JNI Layer bridges Java-based MapReduce with communication library written in native code
•  Design Features
   –  RDMA-based shuffle
   –  Prefetching and caching map output
   –  Efficient Shuffle Algorithms
   –  In-memory merge
   –  On-demand Shuffle Adjustment
   –  Advanced overlapping
      •  map, shuffle, and merge
      •  shuffle, merge, and reduce
   –  On-demand connection setup
   –  InfiniBand/RoCE support

[Figure: Applications over MapReduce (JobTracker, TaskTracker, Map, Reduce); Java Socket Interface over the 1/10/40/100 GigE, IPoIB network; Java Native Interface (JNI) into the OSU design over Verbs and RDMA-capable networks (IB, iWARP, RoCE ..)]

M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014

Advanced Overlapping among Different Phases

•  A hybrid approach to achieve maximum possible overlapping in MapReduce across all phases compared to other approaches
   –  Efficient Shuffle Algorithms
   –  Dynamic and Efficient Switching
   –  On-demand Shuffle Adjustment

[Figure: default architecture vs. enhanced overlapping with in-memory merge vs. advanced hybrid overlapping]

M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014

Design Overview of Hadoop RPC with RDMA

•  Enables high performance RDMA communication, while supporting traditional socket interface
•  JNI Layer bridges Java-based RPC with communication library written in native code
•  Design Features
   –  JVM-bypassed buffer management
   –  RDMA or send/recv based adaptive communication
   –  Intelligent buffer allocation and adjustment for serialization
   –  On-demand connection setup
   –  InfiniBand/RoCE support

[Figure: Applications over Hadoop RPC; default design via Java Socket Interface over the 1/10/40/100 GigE, IPoIB network; our design via Java Native Interface (JNI) into the OSU design over Verbs and RDMA-capable networks (IB, iWARP, RoCE ..)]

X. Lu, N. Islam, M. W. Rahman, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance Design of Hadoop RPC with RDMA over InfiniBand, Int'l Conference on Parallel Processing (ICPP '13), October 2013.

Performance Benefits – RandomWriter & TeraGen in TACC-Stampede

•  Cluster with 32 nodes with a total of 128 maps
•  RandomWriter
   –  3-4x improvement over IPoIB for 80-120 GB file size (reduced by 3x)
•  TeraGen
   –  4-5x improvement over IPoIB for 80-120 GB file size (reduced by 4x)

[Figure: execution time (s) vs. data size (80-120 GB) for RandomWriter and TeraGen, comparing IPoIB (FDR) with the OSU RDMA design]

Performance Benefits – Sort & TeraSort in TACC-Stampede

•  Sort: cluster with 32 nodes with a total of 128 maps and 64 reduces; TeraSort: cluster with 32 nodes with a total of 128 maps and 57 reduces
•  Sort with single HDD per node
   –  40-52% improvement over IPoIB for 80-120 GB data (reduced by 52%)
•  TeraSort with single HDD per node
   –  42-44% improvement over IPoIB for 80-120 GB data (reduced by 44%)

[Figure: execution time (s) vs. data size (80-120 GB) for Sort and TeraSort, IPoIB (FDR) vs. OSU-IB (FDR)]

Evaluation of HHH and HHH-L with Applications

•  MR-MSPolygraph on OSU RI with 1,000 maps
   –  HHH-L reduces the execution time by 79% over Lustre, 30% over HDFS
•  CloudBurst on TACC Stampede
   –  With HHH: 19% improvement over HDFS (HDFS (FDR): 60.24 s; HHH (FDR): 48.3 s)

[Figure: execution time (s) vs. concurrent maps per host (4-8) for MR-MSPolygraph with HDFS, Lustre, and HHH-L; reduced by 79%]

Evaluation with Spark on SDSC Gordon (HHH vs. Tachyon/Alluxio)

•  For 200 GB TeraGen on 32 nodes
   –  Spark-TeraGen: HHH has 2.4x improvement over Tachyon; 2.3x over HDFS-IPoIB (QDR)
   –  Spark-TeraSort: HHH has 25.2% improvement over Tachyon; 17% over HDFS-IPoIB (QDR)

[Figure: execution time (s) vs. cluster size : data size (GB) (8:50, 16:100, 32:200) for TeraGen and TeraSort, comparing IPoIB (QDR), Tachyon, and OSU-IB (QDR); reduced by 2.4x (TeraGen) and 25.2% (TeraSort)]

N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData '15, October 2015

Design Overview of Shuffle Strategies for MapReduce over Lustre

•  Design Features
   –  Two shuffle approaches
      •  Lustre read based shuffle
      •  RDMA based shuffle
   –  Hybrid shuffle algorithm to take benefit from both shuffle approaches
   –  Dynamically adapts to the better shuffle approach for each shuffle request based on profiling values for each Lustre read operation
   –  In-memory merge and overlapping of different phases are kept similar to the RDMA-enhanced MapReduce design

[Figure: maps write to the intermediate data directory on Lustre; reduces obtain the data via Lustre read or RDMA, followed by in-memory merge/sort and reduce]

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015
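The dynamic adaptation can be sketched as a tiny policy that picks, per shuffle request, whichever path the profiled costs currently favor. Plain Python; the path names and timings are illustrative stand-ins, not the actual design:

```python
class HybridShuffle:
    """Pick 'lustre_read' or 'rdma' per request from observed costs."""

    def __init__(self):
        self.avg_cost = {"lustre_read": None, "rdma": None}  # running averages (s)
        self.samples = {"lustre_read": 0, "rdma": 0}

    def choose(self):
        # Profile each path at least once, then prefer the cheaper one
        for path, cost in self.avg_cost.items():
            if cost is None:
                return path
        return min(self.avg_cost, key=self.avg_cost.get)

    def record(self, path, cost):
        # Fold a new profiled cost into the running average for that path
        n = self.samples[path]
        prev = self.avg_cost[path] or 0.0
        self.avg_cost[path] = (prev * n + cost) / (n + 1)
        self.samples[path] = n + 1

sched = HybridShuffle()
sched.record("lustre_read", 0.020)   # profiled Lustre read: 20 ms
sched.record("rdma", 0.005)          # profiled RDMA fetch: 5 ms
```

With fresher profiling values fed back via `record`, the scheduler keeps switching to whichever approach is currently faster.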

Performance Improvement of MapReduce over Lustre on TACC-Stampede

•  Local disk is used as the intermediate data directory
•  For 500 GB Sort in 64 nodes
   –  44% improvement over IPoIB (FDR)
•  For 640 GB Sort in 128 nodes
   –  48% improvement over IPoIB (FDR)

[Figure: job execution time (sec) vs. data size (300-500 GB), and weak scaling from 20 GB on 4 nodes to 640 GB on 128 nodes, IPoIB (FDR) vs. OSU-IB (FDR); reduced by 44% and 48%]

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014.

Case Study – Performance Improvement of MapReduce over Lustre on SDSC-Gordon

•  Lustre is used as the intermediate data directory
•  For 80 GB Sort in 8 nodes
   –  34% improvement over IPoIB (QDR)
•  For 120 GB TeraSort in 16 nodes
   –  25% improvement over IPoIB (QDR)

[Figure: job execution time (sec) vs. data size (40-120 GB) for Sort and TeraSort, comparing IPoIB (QDR), OSU-Lustre-Read (QDR), OSU-RDMA-IB (QDR), and OSU-Hybrid-IB (QDR); reduced by 34% (Sort) and 25% (TeraSort)]

Acceleration Case Studies and Performance Evaluation

•  RDMA-based Designs and Performance Evaluation
   –  HDFS
   –  MapReduce
   –  RPC
   –  HBase
   –  Spark
   –  Memcached (Basic, Hybrid and Non-blocking APIs)
   –  HDFS + Memcached-based Burst Buffer
   –  OSU HiBD Benchmarks (OHB)

HBase-RDMA Design Overview

•  Enables high performance RDMA communication, while supporting traditional socket interface
•  JNI Layer bridges Java-based HBase with communication library written in native code

[Figure: Applications over HBase; Java Socket Interface over the 1/10/40/100 GigE, IPoIB networks; Java Native Interface (JNI) into the OSU-IB design over IB Verbs and RDMA-capable networks (IB, iWARP, RoCE ..)]

HBase – YCSB Read-Write Workload

•  HBase Get latency (QDR, 10GigE)
   –  64 clients: 2.0 ms; 128 clients: 3.5 ms
   –  42% improvement over IPoIB for 128 clients
•  HBase Put latency (QDR, 10GigE)
   –  64 clients: 1.9 ms; 128 clients: 3.5 ms
   –  40% improvement over IPoIB for 128 clients

[Figure: read and write latency (us) vs. number of clients (8-128), comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR)]

J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12

HBase – YCSB Get Latency and Throughput on SDSC-Comet

•  HBase Get average latency (FDR)
   –  4 client threads: 38 us
   –  59% improvement over IPoIB for 4 client threads
•  HBase Get total throughput
   –  4 client threads: 102 Kops/sec
   –  2.4x improvement over IPoIB for 4 client threads

[Figure: average Get latency (ms) and total throughput (Kops/sec) vs. number of client threads (1-4), IPoIB (FDR) vs. OSU-IB (FDR); 59% latency reduction, 2.4x throughput]

Acceleration Case Studies and Performance Evaluation

•  RDMA-based Designs and Performance Evaluation
   –  HDFS
   –  MapReduce
   –  RPC
   –  HBase
   –  Spark
   –  Memcached (Basic, Hybrid and Non-blocking APIs)
   –  HDFS + Memcached-based Burst Buffer
   –  OSU HiBD Benchmarks (OHB)

Design Overview of Spark with RDMA

•  Enables high performance RDMA communication, while supporting traditional socket interface
•  JNI Layer bridges Scala-based Spark with communication library written in native code
•  Design Features
   –  RDMA based shuffle
   –  SEDA-based plugins
   –  Dynamic connection management and sharing
   –  Non-blocking data transfer
   –  Off-JVM-heap buffer management
   –  InfiniBand/RoCE support

[Figure: Spark applications (Scala/Java/Python) and tasks over the BlockManager / BlockFetcherIterator; shuffle server and shuffle fetcher implemented via Java NIO (default), Netty (optional), or an RDMA plug-in; Java sockets run over the 1/10 Gig Ethernet / IPoIB (QDR/FDR) network, the RDMA-based shuffle engine (Java/JNI) over native InfiniBand (QDR/FDR)]

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014

Performance Evaluation on SDSC Comet – SortBy/GroupBy

•  InfiniBand FDR, SSD, 64 Worker Nodes, 1536 Cores, (1536M 1536R)
•  RDMA-based design for Spark 1.5.1
•  RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node
   –  SortBy: Total time reduced by up to 80% over IPoIB (56 Gbps)
   –  GroupBy: Total time reduced by up to 57% over IPoIB (56 Gbps)

[Figure: total time (sec) vs. data size (64-256 GB) for SortBy and GroupBy on 64 worker nodes / 1536 cores, IPoIB vs. RDMA; 80% and 57% reductions]

Performance Evaluation on SDSC Comet – HiBench Sort/TeraSort

•  InfiniBand FDR, SSD, 64 Worker Nodes, 1536 Cores, (1536M 1536R)
•  RDMA-based design for Spark 1.5.1
•  RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node
   –  Sort: Total time reduced by 38% over IPoIB (56 Gbps)
   –  TeraSort: Total time reduced by 15% over IPoIB (56 Gbps)

[Figure: total time (sec) vs. data size (64-256 GB) for Sort and TeraSort on 64 worker nodes / 1536 cores, IPoIB vs. RDMA; 38% and 15% reductions]

Performance Evaluation on SDSC Comet – HiBench PageRank

•  InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R)
•  RDMA-based design for Spark 1.5.1
•  RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
   –  32 nodes / 768 cores: Total time reduced by 37% over IPoIB (56 Gbps)
   –  64 nodes / 1536 cores: Total time reduced by 43% over IPoIB (56 Gbps)

[Figure: total time (sec) vs. data size (Huge, BigData, Gigantic) for PageRank on 32 and 64 worker nodes, IPoIB vs. RDMA; 37% and 43% reductions]

Acceleration Case Studies and Performance Evaluation

•  RDMA-based Designs and Performance Evaluation
   –  HDFS
   –  MapReduce
   –  RPC
   –  HBase
   –  Spark
   –  Memcached (Basic, Hybrid and Non-blocking APIs)
   –  HDFS + Memcached-based Burst Buffer
   –  OSU HiBD Benchmarks (OHB)

Memcached-RDMA Design

•  Server and client perform a negotiation protocol
   –  Master thread assigns clients to the appropriate worker thread
•  Once a client is assigned a verbs worker thread, it can communicate directly and is "bound" to that thread
•  All other Memcached data structures are shared among RDMA and Sockets worker threads
•  Memcached server can serve both socket and verbs clients simultaneously
•  Memcached applications need not be modified; the verbs interface is used if available

[Figure: sockets and RDMA clients connect to a master thread, which binds them to sockets or verbs worker threads; all workers share the Memcached data structures (memory slabs, items, ...)]
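The master/worker hand-off can be sketched with standard Python threading. This is illustrative only: the real server is native code, and "verbs" vs. "sockets" here is just a label on the queue a client gets bound to, while the shared dict stands in for the shared slabs/items:

```python
import queue
import threading

shared_data = {}            # stands in for the shared slabs/items
lock = threading.Lock()     # shared structures need synchronization

def worker(requests):
    # Each worker serves only the clients bound to its queue, but all
    # workers operate on the same shared store.
    while True:
        item = requests.get()
        if item is None:    # sentinel: shut this worker down
            break
        key, value = item
        with lock:
            shared_data[key] = value

workers = {"sockets": queue.Queue(), "verbs": queue.Queue()}
threads = [threading.Thread(target=worker, args=(q,)) for q in workers.values()]
for t in threads:
    t.start()

def master_assign(client_kind):
    # Master thread: bind the client to the matching worker's queue;
    # afterwards the client talks to that worker directly.
    return workers[client_kind]

master_assign("verbs").put(("k1", "v1"))     # a verbs client's request
master_assign("sockets").put(("k2", "v2"))   # a sockets client's request

for q in workers.values():
    q.put(None)
for t in threads:
    t.join()
```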

RDMA-Memcached Performance (FDR Interconnect)

•  Memcached Get latency
   –  4 bytes: OSU-IB 2.84 us; IPoIB 75.53 us
   –  2K bytes: OSU-IB 4.49 us; IPoIB 123.42 us
   –  Latency reduced by nearly 20x
•  Memcached Throughput (4 bytes)
   –  4080 clients: OSU-IB 556 Kops/sec, IPoIB 233 Kops/sec
   –  Nearly 2x improvement in throughput
•  Experiments on TACC Stampede (Intel Sandy Bridge Cluster, IB: FDR)

[Figure: Get latency (us) vs. message size (1 byte - 4K) and throughput (thousands of transactions per second) vs. number of clients (16-4080), OSU-IB (FDR) vs. IPoIB (FDR)]

Micro-benchmark Evaluation for OLDP Workloads

•  Illustration with Read-Cache-Read access pattern using a modified mysqlslap load testing tool
•  Memcached-RDMA can
   -  improve query latency by up to 66% over IPoIB (32 Gbps)
   -  improve throughput by up to 69% over IPoIB (32 Gbps)

[Figure: latency (sec) and throughput (Kq/s) vs. number of clients (64-400), Memcached-IPoIB (32 Gbps) vs. Memcached-RDMA (32 Gbps); latency reduced by 66%]

D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL, ISPASS '15

Performance Benefits of Hybrid Memcached (Memory + SSD) on SDSC-Gordon

•  ohb_memlat & ohb_memthr latency & throughput micro-benchmarks
•  Memcached-RDMA can
   -  improve query latency by up to 70% over IPoIB (32 Gbps)
   -  improve throughput by up to 2x over IPoIB (32 Gbps)
   -  incur no overhead in hybrid mode when all data can fit in memory

[Figure: average latency (us) vs. message size and throughput (million trans/sec) vs. number of clients (64-1024), IPoIB (32 Gbps) vs. RDMA-Mem (32 Gbps) vs. RDMA-Hybrid (32 Gbps); 2x throughput]

Performance Evaluation on IB FDR + SATA/NVMe SSDs

•  Memcached latency test with Zipf distribution; server with 1 GB memory, 32 KB key-value pair size; total size of data accessed is 1 GB (when data fits in memory) and 1.5 GB (when data does not fit in memory)
•  When data fits in memory: RDMA-Mem/Hybrid gives 5x improvement over IPoIB-Mem
•  When data does not fit in memory: RDMA-Hybrid gives 2x-2.5x over IPoIB/RDMA-Mem

[Figure: Set/Get latency (us) for IPoIB-Mem, RDMA-Mem, RDMA-Hybrid-SATA and RDMA-Hybrid-NVMe, with and without the data fitting in memory; latency broken down into slab allocation (SSD write), cache check + load (SSD read), cache update, server response, client wait, and miss penalty]

Accelerating Hybrid Memcached with RDMA, Non-blocking Extensions and SSDs

•  Design Features
   –  RDMA-Accelerated Communication for Memcached Get/Set
   –  Hybrid 'RAM + SSD' slab management for higher data retention
   –  Non-blocking API extensions
      •  memcached_(iset/iget/bset/bget/test/wait)
      •  Achieve near in-memory speeds while hiding bottlenecks of network and SSD I/O
      •  Ability to exploit communication/computation overlap
      •  Optional buffer re-use guarantees
   –  Adaptive slab manager with different I/O schemes for higher throughput

[Figure: client with the libMemcached library and blocking vs. non-blocking API flows, talking through RDMA-enhanced communication libraries to a server with a hybrid slab manager (RAM + SSD)]

D. Shankar, X. Lu, N. S. Islam, M. W. Rahman, and D. K. Panda, High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits, IPDPS, May 2016
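The non-blocking iset/iget idea (issue the operation, keep computing, then test or wait for completion) can be sketched with Python's `concurrent.futures`. The names `iset`/`wait` mirror the extensions listed above, but this is an illustrative analogue, not the actual libMemcached API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

store = {}
pool = ThreadPoolExecutor(max_workers=4)

def _slow_set(key, value):
    time.sleep(0.01)            # stands in for network + SSD I/O latency
    store[key] = value
    return True

def iset(key, value):
    # Non-blocking set: returns a completion handle immediately
    return pool.submit(_slow_set, key, value)

def wait(handle):
    # Block until the earlier operation completes
    return handle.result()

handle = iset("k", "v")         # issue the set ...
overlap = sum(range(1000))      # ... and overlap useful computation with it
wait(handle)                    # then wait before depending on the result
pool.shutdown()
```

The overlap between the in-flight set and the local computation is exactly the communication/computation overlap the design aims to exploit.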

HPCAC-Switzerland(Mar‘16) 52NetworkBasedCompu4ngLaboratory

Performance Evaluation with Non-Blocking Memcached API

– Data does not fit in memory: non-blocking Memcached Set/Get API extensions can achieve
  • >16x latency improvement vs. the blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty
  • >2.5x throughput improvement vs. the blocking API over default/optimized RDMA-Hybrid
– Data fits in memory: non-blocking extensions perform similar to RDMA-Mem/RDMA-Hybrid and give a >3.6x improvement over IPoIB-Mem

[Figure: Average latency (us, 0-2500) of Set/Get for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b, broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check + load (memory and/or SSD read), and slab allocation (w/ SSD write on out-of-mem).]

H = Hybrid Memcached over SATA SSD; Opt = adaptive slab manager; Block = default blocking API; NonB-i = non-blocking iset/iget API; NonB-b = non-blocking bset/bget API w/ buffer re-use guarantee


Acceleration Case Studies and Performance Evaluation

• RDMA-based Designs and Performance Evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – Memcached (Basic, Hybrid and Non-blocking APIs)
  – HDFS + Memcached-based Burst Buffer
  – OSU HiBD Benchmarks (OHB)


Accelerating I/O Performance of Big Data Analytics through RDMA-based Key-Value Store

• Design features
  – Memcached-based burst-buffer system
    • Hides the latency of parallel file system access
    • Reads from local storage and Memcached
  – Data locality achieved by writing data to local storage
  – Different approaches of integration with the parallel file system to guarantee fault-tolerance

[Figure: Architecture of the Memcached-based burst buffer system: the application's Map/Reduce tasks go through an I/O forwarding module to the DataNode and local disk (providing data locality) and to Lustre (providing fault-tolerance).]
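The write/read path of such a burst buffer can be sketched as follows. This is a minimal conceptual sketch, not the actual implementation: a dict stands in for RDMA-Memcached, temporary directories stand in for the local disk and Lustre, and all names are illustrative.

```python
import os
import tempfile

# Sketch of the burst-buffer data path: writes land on local storage (data
# locality) and in a key-value cache, with a parallel file system copy for
# fault-tolerance; reads prefer the cache, then local disk, then the PFS.
class BurstBuffer:
    def __init__(self, local_dir, pfs_dir):
        self.local_dir, self.pfs_dir = local_dir, pfs_dir
        self.kv_cache = {}                         # stands in for Memcached

    def write(self, name, data):
        with open(os.path.join(self.local_dir, name), "wb") as f:
            f.write(data)                          # local write: data locality
        self.kv_cache[name] = data                 # cached for fast reads
        with open(os.path.join(self.pfs_dir, name), "wb") as f:
            f.write(data)                          # PFS copy: fault-tolerance

    def read(self, name):
        if name in self.kv_cache:                  # fast path: key-value store
            return self.kv_cache[name]
        local_path = os.path.join(self.local_dir, name)
        if os.path.exists(local_path):             # next: local storage
            with open(local_path, "rb") as f:
                return f.read()
        with open(os.path.join(self.pfs_dir, name), "rb") as f:
            return f.read()                        # fallback: parallel FS

bb = BurstBuffer(tempfile.mkdtemp(), tempfile.mkdtemp())
bb.write("block-0", b"intermediate-data")
data = bb.read("block-0")
```

Even if the cache is lost, the data survives on local disk and in the parallel file system, which is the fault-tolerance property the design relies on.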


Evaluation with PUMA Workloads

Gains on OSU RI with our approach (Mem-bb) on 24 nodes:
• SequenceCount: 34.5% over Lustre, 40% over HDFS
• RankedInvertedIndex: 27.3% over Lustre, 48.3% over HDFS
• HistogramRating: 17% over Lustre, 7% over HDFS

[Figure: Execution time (s, 0-4500) of the SeqCount, RankedInvIndex, and HistoRating workloads for HDFS (32 Gbps), Lustre (32 Gbps), and Mem-bb (32 Gbps), with annotated gains of 40%, 48.3%, and 17%.]

N. S. Islam, D. Shankar, X. Lu, M. W. Rahman, and D. K. Panda, Accelerating I/O Performance of Big Data Analytics with RDMA-based Key-Value Store, ICPP '15, September 2015


Acceleration Case Studies and Performance Evaluation

• RDMA-based Designs and Performance Evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – Memcached (Basic, Hybrid and Non-blocking APIs)
  – HDFS + Memcached-based Burst Buffer
  – OSU HiBD Benchmarks (OHB)


Are the Current Benchmarks Sufficient for Big Data?

• The current benchmarks provide some performance behavior
• However, they do not provide any information to the designer/developer on:
  – What is happening at the lower layer?
  – Where are the benefits coming from?
  – Which design is leading to benefits or bottlenecks?
  – Which component in the design needs to be changed, and what will be its impact?
  – Can performance gain/loss at the lower layer be correlated to the performance gain/loss observed at the upper layer?


Challenges in Benchmarking of RDMA-based Designs

[Figure: The software stack considered for benchmarking: Applications on top of Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached), over programming models (Sockets; other protocols?), over a communication and I/O library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance, RDMA protocols), running on commodity computing system architectures (multi- and many-core architectures and accelerators), networking technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs), and storage technologies (HDD, SSD, and NVMe-SSD). Current benchmarks cover only the upper layers; the lower layers have no benchmarks, and the correlation between layers is an open question.]


OSU MPI Micro-Benchmarks (OMB) Suite

• A comprehensive suite of benchmarks to
  – Compare performance of different MPI libraries on various networks and systems
  – Validate low-level functionalities
  – Provide insights into the underlying MPI-level designs
• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth
• Extended later to
  – MPI-2 one-sided
  – Collectives
  – GPU-aware data movement
  – OpenSHMEM (point-to-point and collectives)
  – UPC
• Has become an industry standard
• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries, and even in procurement of large-scale systems
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with the MVAPICH2 stack
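The basic latency micro-benchmark pattern behind suites like OMB times many round-trip exchanges, discards warm-up iterations, and reports half the round-trip as one-way latency. A minimal sketch using a local socket pair in place of the interconnect (message size and iteration counts are illustrative, and this is not the osu_latency implementation itself):

```python
import socket
import threading
import time

# Ping-pong latency micro-benchmark sketch: warm up, then time ITER
# round trips and report one-way latency as half the round-trip time.
SIZE, WARMUP, ITER = 64, 100, 1000

def echo(sock):
    # Receiver side: bounce every message back. With strict ping-pong and
    # small messages, each recv sees exactly one whole message in flight.
    for _ in range(WARMUP + ITER):
        sock.sendall(sock.recv(SIZE))

client, server = socket.socketpair()
threading.Thread(target=echo, args=(server,), daemon=True).start()

msg = b"x" * SIZE
for _ in range(WARMUP):                      # warm-up rounds are not timed
    client.sendall(msg)
    client.recv(SIZE)
start = time.perf_counter()
for _ in range(ITER):
    client.sendall(msg)
    client.recv(SIZE)
elapsed = time.perf_counter() - start
latency_us = elapsed / ITER / 2 * 1e6        # one-way latency in microseconds
```

The same skeleton, with the socket pair replaced by MPI send/recv or a Memcached set/get, is what the MPI- and Big-Data-level latency benchmarks discussed here measure.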


Iterative Process: Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications

[Figure: The same software stack as in the previous slide, now annotated with applications-level benchmarks at the middleware layer and micro-benchmarks at the communication and I/O library layer, to be refined in an iterative process.]


OSU HiBD Benchmarks (OHB)

• HDFS Benchmarks
  – Sequential Write Latency (SWL) Benchmark
  – Sequential Read Latency (SRL) Benchmark
  – Random Read Latency (RRL) Benchmark
  – Sequential Write Throughput (SWT) Benchmark
  – Sequential Read Throughput (SRT) Benchmark
• Memcached Benchmarks
  – Get Benchmark
  – Set Benchmark
  – Mixed Get/Set Benchmark
• Available as a part of OHB 0.8
• To be released: MapReduce and RPC micro-benchmarks

N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012

D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5, 2014

X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks, Int'l Workshop on Big Data Benchmarking (WBDB '13), July 2013
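The structure of a mixed Get/Set benchmark like OHB's can be sketched as follows. A dict stands in for the Memcached client, and the get:set ratio, key count, and value size are illustrative, not OHB's actual parameters.

```python
import random
import time

# Mixed Get/Set micro-benchmark sketch: issue operations at a chosen
# get:set ratio and report the mean latency per operation type.
def mixed_benchmark(store, num_ops=10000, get_ratio=0.9, value=b"v" * 32):
    latencies = {"get": [], "set": []}
    keys = [f"key-{i}" for i in range(100)]
    for _ in range(num_ops):
        key = random.choice(keys)
        op = "get" if random.random() < get_ratio else "set"
        start = time.perf_counter()
        if op == "get":
            store.get(key)                   # a miss simply returns None here
        else:
            store[key] = value
        latencies[op].append(time.perf_counter() - start)
    # mean latency per operation type, in microseconds
    return {op: sum(v) / len(v) * 1e6 for op, v in latencies.items() if v}

avg_us = mixed_benchmark({})
```

Varying the get:set ratio is what lets such a benchmark expose read-heavy versus write-heavy behavior of the designs compared in the preceding slides.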


On-going and Future Plans of the OSU High Performance Big Data (HiBD) Project

• Upcoming releases of RDMA-enhanced packages will support
  – HBase
  – Impala
• Upcoming releases of OSU HiBD Micro-Benchmarks (OHB) will support
  – MapReduce
  – RPC
• Advanced designs with upper-level changes and optimizations
  – Memcached with non-blocking API
  – HDFS + Memcached-based burst buffer


Concluding Remarks

• Discussed challenges in accelerating Hadoop, Spark and Memcached
• Presented initial designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC, Spark, and Memcached
• Results are promising
• Many other open issues need to be solved
• Will enable the Big Data community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner


Funding Acknowledgments

Funding support by: [sponsor logos]
Equipment support by: [vendor logos]


Personnel Acknowledgments

Current Students: A. Augustine (M.S.), A. Awan (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)

Past Students: P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), K. Kulkarni (M.S.), R. Kumar (M.S.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)

Past Research Scientist: S. Sur
Current Research Scientists / Current Senior Research Associate: H. Subramoni, X. Lu
Current Post-Docs: J. Lin, D. Banerjee
Past Post-Docs: H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne
Current Programmer: J. Perkins
Past Programmers: D. Bureddy, K. Hamidouche
Current Research Specialist: M. Arnold


Second International Workshop on High-Performance Big Data Computing (HPBDC)

HPBDC 2016 will be held with the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2016), Chicago, Illinois, USA, May 27th, 2016

Keynote Talk: Dr. Chaitanya Baru, Senior Advisor for Data Science, National Science Foundation (NSF); Distinguished Scientist, San Diego Supercomputer Center (SDSC)

Panel Moderator: Jianfeng Zhan (ICT/CAS)

Panel Topic: Merge or Split: Mutual Influence between Big Data and HPC Techniques

Six regular research papers and two short research papers

http://web.cse.ohio-state.edu/~luxi/hpbdc2016

HPBDC 2015 was held in conjunction with ICDCS '15: http://web.cse.ohio-state.edu/~luxi/hpbdc2015


Thank You!

panda@cse.ohio-state.edu

The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
