twister2: a high-performance big data programming...

Work with Shantenu Jha, Kannan Govindarajan, Pulasthi Wickramasinghe, Gurhan Gunduz, Ahmet Uyar

8/14/18 1

HPBDC 2018: The 4th IEEE International Workshop on High-Performance Big Data, Deep Learning, and Cloud Computing

Geoffrey Fox, May 21, 2018

Judy Qiu, Supun Kamburugamuve
Department of Intelligent Systems Engineering

[email protected], http://www.dsc.soic.indiana.edu/, http://spidal.org/

Twister2: A High-Performance Big Data Programming Environment

Abstract
• We analyse the components that are needed in programming environments for Big Data Analysis Systems with scalable HPC performance and the functionality of ABDS, the Apache Big Data Software Stack.
• One highlight is Harp-DAAL, which is a machine learning library exploiting the Intel node library DAAL and HPC communication collectives within the Hadoop ecosystem.
• Another highlight is Twister2, which consists of a set of middleware components to support batch or streaming data capabilities familiar from Apache Hadoop, Spark, Heron and Flink but with high performance.
• Twister2 covers bulk synchronous and dataflow communication; task management as in Mesos, Yarn and Kubernetes; dataflow graph execution models; launching of the Harp-DAAL library; streaming and repository data access interfaces; in-memory databases and fault tolerance at dataflow nodes.
• Similar capabilities are available in current Apache systems but as integrated packages which do not allow needed customization for different application scenarios.

Requirements
• On general principles parallel and distributed computing have different requirements even if sometimes similar functionalities
  • Apache stack ABDS typically uses distributed computing concepts
  • For example, Reduce operation is different in MPI (Harp) and Spark
• Large scale simulation requirements are well understood
• Big Data requirements are not agreed but there are a few key use types
  1) Pleasingly parallel processing (including local machine learning LML) as of different tweets from different users with perhaps MapReduce style of statistics and visualizations; possibly Streaming
  2) Database model with queries again supported by MapReduce for horizontal scaling
  3) Global Machine Learning GML with single job using multiple nodes as classic parallel computing
  4) Deep Learning certainly needs HPC, possibly only multiple small systems
• Current workloads stress 1) and 2) and are suited to current clouds and to Apache Big Data Software (with no HPC)
  • This explains why Spark with poor GML performance can be so successful
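One way to picture the remark that Reduce differs between MPI (Harp) and Spark is the plain-Python sketch below. It uses neither the MPI nor the Spark API; the function names `allreduce` and `driver_reduce` are hypothetical and only model where the reduced value ends up.

```python
from functools import reduce
from operator import add

def allreduce(worker_values, op=add):
    """MPI-style sketch: every worker ends up holding the combined value."""
    total = reduce(op, worker_values)
    return [total for _ in worker_values]

def driver_reduce(partitions, op=add):
    """Spark-style sketch: per-partition partial results are combined
    into a single value that lives at the driver."""
    partials = [reduce(op, part) for part in partitions]
    return reduce(op, partials)

# Four workers each hold one value; after allreduce all four hold the sum.
mpi_result = allreduce([1, 2, 3, 4])
# Two partitions are reduced to one value at the driver; using it in the
# next iteration would need a separate broadcast step.
spark_result = driver_reduce([[1, 2], [3, 4]])
```

In the first form the result is already in place on every worker for the next iteration, which is the property that matters for iterative Global Machine Learning.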

Spectrum of Applications and Algorithms
[Figure: application classes arranged from Loosely Coupled to Tightly Coupled. Axis labels: Difficulty in Parallelism; Size of Synchronization constraints; Size of Disk I/O. There is also distribution seen in grid/edge computing.]
Application classes shown:
• Pleasingly Parallel: often independent events
• MapReduce as in scalable databases
• Parameter sweep simulations
• Global Machine Learning, e.g. parallel clustering
• Deep Learning
• LDA
• Graph Analytics, e.g. subgraph mining
• Linear Algebra at core (typically not sparse)
• Large scale simulations
Other figure labels: Classic Cloud Workload; Current major Big Data category; Commodity Clouds; HPC Clouds, High Performance Interconnect; HPC Clouds/Supercomputers, Memory access also critical; Exascale Supercomputers; Structured Adaptive Sparsity, Huge Jobs; Unstructured Adaptive Sparsity, Medium size Jobs
Notes:
• Need a toolkit covering all applications with same API but different implementations
• Need a toolkit covering 5 main paradigms with same API but different implementations
• These 3 are focus of Twister2 but we need to preserve capability on first 2 paradigms
• Note Problem and System Architecture, as efficient execution says they must match

Comparing Spark, Flink and MPI
• On Global Machine Learning GML.

Machine Learning with MPI, Spark and Flink
• Three algorithms implemented in three runtimes
  • Multidimensional Scaling (MDS)
  • Terasort
  • K-Means (drop as no time and looked at later)
• Implementation in Java
  • MDS is the most complex algorithm: three nested parallel loops
  • K-Means: one parallel loop
  • Terasort: no iterations
• With care, Java performance ~ C performance
• Without care, Java performance << C performance (details omitted)

Multidimensional Scaling: 3 Nested Parallel Sections
[Figure: MDS execution time on 16 nodes with 20 processes in each node with varying number of points. Series: Flink, Spark, MPI.]
[Figure: MDS execution time with 32000 points on varying number of nodes. Each node runs 20 parallel tasks.]
• Spark, Flink: No Speedup
• MPI Factor of 20-200 Faster than Spark/Flink
• K-means also bad; see later

Terasort
Sorting 1TB of data records
• Partition the data using a sample and regroup
[Figure: Terasort execution time in 64 and 32 nodes. Only MPI shows the sorting time and communication time, as the other two frameworks do not provide a clear method to accurately measure them. Sorting time includes data save time. MPI-IB: MPI with InfiniBand.]
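The "partition the data using a sample and regroup" step can be illustrated with a small single-process sketch. This is not the benchmark code from the talk; `sample_sort` and its parameters are hypothetical names, and each list in `partitions` stands in for the records regrouped onto one worker.

```python
import bisect
import random

def sample_sort(records, num_partitions, sample_size=100, seed=0):
    """Range-partition non-empty `records` using splitters drawn from a sample,
    then sort each partition locally."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    # Pick num_partitions - 1 splitters evenly spaced through the sorted sample.
    step = len(sample) / num_partitions
    splitters = [sample[int(step * i)] for i in range(1, num_partitions)]
    # "Regroup": send every record to the partition that owns its key range.
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[bisect.bisect_right(splitters, record)].append(record)
    # Local sort; concatenating partitions in order gives a global sort.
    return [sorted(p) for p in partitions]

data = [random.Random(42).randrange(10_000) for _ in range(1)]  # placeholder seed use
data = random.Random(42).sample(range(10_000), 1_000)
parts = sample_sort(data, num_partitions=4)
flattened = [r for p in parts for r in p]
```

Because the splitters come from a sample, partition sizes are only approximately balanced; a larger sample usually gives a more even split.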

Software: HPC-ABDS, HPC-FaaS

NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science

Ogres Application Analysis

HPC-ABDS and HPC-FaaS Software Harp and Twister2 Building Blocks

SPIDAL Data Analytics Library


Software: MIDAS HPC-ABDS
HPC-ABDS Integrated wide range of HPC and Big Data technologies. I gave up updating list in January 2016!

Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies Cross-Cutting

Functions 1) Message and Data Protocols: Avro, Thrift, Protobuf 2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups 3) Security & Privacy: InCommon, Eduroam OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML OAuth 4) Monitoring: Ambari, Ganglia, Nagios, Inca

17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML 16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA MAGMA, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK 15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere 15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird 14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine 14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem 13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 
BEAST PULSAR, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs 12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, H-Store 12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC 12) Extraction Tools: UIMA, Tika 11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL 11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame Public Cloud: Azure Table, Amazon Dynamo, Google DataStore 11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet 10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST 9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs 8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage 7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis 6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, 
Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api 5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds Networking: Google Cloud DNS, Amazon Route 53

21 layers, over 350 Software Packages, January 29 2016
Different choices in software systems in Clouds and HPC. HPC-ABDS takes cloud software augmented by HPC when needed to improve performance.
16 of 21 layers plus languages

Harp Plugin for Hadoop: Important part of Twister2
Work of Judy Qiu
• Map Collective Run time merges MapReduce and HPC
• Collectives: allreduce, reduce, rotate, push & pull, allgather, regroup, broadcast
[Figure: Runtime software for Harp]
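As a rough guide to what some of the collective names mean, the sketch below models a few of them on a list with one entry per worker. It is plain Python, not the Harp API; the function names and the integer-key assumption in `regroup` are illustrative only.

```python
def broadcast(values, root=0):
    """Every worker receives the root worker's value."""
    return [values[root] for _ in values]

def allgather(values):
    """Every worker receives the values of all workers."""
    return [list(values) for _ in values]

def rotate(values, shift=1):
    """Ring shift: worker i receives the value held by worker i - shift."""
    n = len(values)
    return [values[(i - shift) % n] for i in range(n)]

def regroup(worker_pairs, num_workers):
    """Redistribute (key, value) pairs so each key lands on one worker.
    Assumes integer keys, partitioned by key modulo num_workers."""
    out = [[] for _ in range(num_workers)]
    for pairs in worker_pairs:
        for key, value in pairs:
            out[key % num_workers].append((key, value))
    return out

rotated = rotate(["a", "b", "c"])
gathered = allgather([1, 2])
regrouped = regroup([[(0, "x"), (1, "y")], [(2, "z"), (1, "w")]], 2)
```

Rotate is the operation used for model rotation in algorithms such as LDA and matrix factorization, where each worker processes one slice of the model and then passes it to its ring neighbour.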

Dynamic Rotation Control for Latent Dirichlet Allocation and Matrix Factorization SGD (stochastic gradient descent)
[Figure labels: Other Model Parameters From Caching; Model Parameters From Rotation; Model Related Data; Multi-Thread Execution]
• Computes until the time arrives, then starts model rotation to address load imbalance

Harp v. Spark
• Datasets: 5 million points, 10 thousand centroids, 10 feature dimensions
• 10 to 20 nodes of Intel KNL7250 processors
• Harp-DAAL has 15x speedups over Spark MLlib
Harp v. Torch
• Datasets: 500K or 1 million data points of feature dimension 300
• Running on single KNL 7250 (Harp-DAAL) vs. single K80 GPU (PyTorch)
• Harp-DAAL achieves 3x to 6x speedups
Harp v. MPI
• Datasets: Twitter with 44 million vertices, 2 billion edges, subgraph templates of 10 to 12 vertices
• 25 nodes of Intel Xeon E5 2670
• Harp-DAAL has 2x to 5x speedups over state-of-the-art MPI-Fascia solution

Mahout and SPIDAL
• Mahout was Hadoop machine learning library but largely abandoned as Spark outperformed Hadoop
• SPIDAL outperforms Spark MLlib and Flink due to better communication and better dataflow or BSP communication.
• Has Harp-(DAAL) optimized machine learning interface
• SPIDAL also has community algorithms
  • Biomolecular Simulation
  • Graphs for Network Science
  • Image processing for pathology and polar science

Qiu Core SPIDAL Parallel HPC Library with Collective Used
• DA-MDS: Rotate, AllReduce, Broadcast
• Directed Force Dimension Reduction: AllGather, AllReduce
• Irregular DAVS Clustering: Partial Rotate, AllReduce, Broadcast
• DA Semimetric Clustering (Deterministic Annealing): Rotate, AllReduce, Broadcast
• K-means: AllReduce, Broadcast, AllGather (DAAL)
• SVM: AllReduce, AllGather
• SubGraph Mining: AllGather, AllReduce
• Latent Dirichlet Allocation: Rotate, AllReduce
• Matrix Factorization (SGD): Rotate (DAAL)
• Recommender System (ALS): Rotate (DAAL)
• Singular Value Decomposition (SVD): AllGather (DAAL)
• QR Decomposition (QR): Reduce, Broadcast (DAAL)
• Neural Network: AllReduce (DAAL)
• Covariance: AllReduce (DAAL)
• Low Order Moments: Reduce (DAAL)
• Naive Bayes: Reduce (DAAL)
• Linear Regression: Reduce (DAAL)
• Ridge Regression: Reduce (DAAL)
• Multi-class Logistic Regression: Regroup, Rotate, AllGather
• Random Forest: AllReduce
• Principal Component Analysis (PCA): AllReduce (DAAL)
DAAL implies integrated on node with Intel DAAL Optimized Data Analytics Library

Implementing Twister2 in detail I
This breaks rule from 2012-2017 of not “competing” with but rather “enhancing” Apache
http://www.iterativemapreduce.org/

Twister2: “Next Generation Grid - Edge - HPC Cloud” Programming Environment
• Analyze the runtime of existing systems
  • Hadoop, Spark, Flink, Pregel: Big Data Processing
  • OpenWhisk and commercial FaaS
  • Storm, Heron, Apex: Streaming Dataflow
  • Kepler, Pegasus, NiFi: workflow systems
  • Harp Map-Collective, MPI and HPC AMT runtime like DARMA
  • And approaches such as GridFTP and CORBA/HLA (!) for wide area data links
• A lot of confusion coming from different communities (database, distributed, parallel computing, machine learning, computational/data science) investigating similar ideas with little knowledge exchange and mixed up (unclear) requirements


Integrating HPC and Apache Programming Environments
• Harp-DAAL with a kernel Machine Learning library exploiting the Intel node library DAAL and HPC communication collectives within the Hadoop ecosystem. The broad applicability of Harp-DAAL is supporting all 5 classes of data-intensive computation, from pleasingly parallel to machine learning and simulations.
• Twister2 is a toolkit of components that can be packaged in different ways
  • Integrated batch or streaming data capabilities familiar from Apache Hadoop, Spark, Heron and Flink but with high performance.
  • Separate bulk synchronous and dataflow communication
  • Task management as in Mesos, Yarn and Kubernetes
  • Dataflow graph execution models
  • Launching of the Harp-DAAL library
  • Streaming and repository data access interfaces
  • In-memory databases and fault tolerance at dataflow nodes (use RDD to do classic checkpoint-restart)

Approach
• Clearly define and develop functional layers (using existing technology when possible)
• Develop layers as independent components
• Use interoperable common abstractions but multiple polymorphic implementations.
• Allow users to pick and choose according to requirements such as
  • Communication + Data Management
  • Communication + Static graph
• Use HPC features when possible

Twister2 Components I

Area | Component | Implementation | Comments: User API
Architecture Specification | Coordination Points | State and Configuration Management; Program, Data and Message Level | Change execution mode; save and reset state
Architecture Specification | Execution Semantics | Mapping of Resources to Bolts/Maps in Containers, Processes, Threads | Different systems make different choices - why?
Architecture Specification | Parallel Computing | Spark Flink Hadoop Pregel MPI modes | Owner Computes Rule
Job Submission | (Dynamic/Static) Resource Allocation | Plugins for Slurm, Yarn, Mesos, Marathon, Aurora | Client API (e.g. Python) for Job Management
Task System | Task migration | Monitoring of tasks and migrating tasks for better resource utilization | Task-based programming with Dynamic or Static Graph API; FaaS API; Support accelerators (CUDA, KNL)
Task System | Elasticity | OpenWhisk |
Task System | Streaming and FaaS Events | Heron, OpenWhisk, Kafka/RabbitMQ |
Task System | Task Execution | Process, Threads, Queues |
Task System | Task Scheduling | Dynamic Scheduling, Static Scheduling, Pluggable Scheduling Algorithms |
Task System | Task Graph | Static Graph, Dynamic Graph Generation |

Twister2 Components II

Area | Component | Implementation | Comments
Communication API | Messages | Heron | This is user level and could map to multiple communication systems
Communication API | Dataflow Communication | Fine-Grain Twister2 Dataflow communications: MPI, TCP and RMA; Coarse grain Dataflow from NiFi, Kepler? | Streaming, ETL data pipelines; Define new Dataflow communication API and library
Communication API | BSP Communication Map-Collective | Conventional MPI, Harp | MPI Point to Point and Collective API
Data Access | Static (Batch) Data | File Systems, NoSQL, SQL | Data API
Data Access | Streaming Data | Message Brokers, Spouts | Data API
Data Management | Distributed Data Set | Relaxed Distributed Shared Memory (immutable data), Mutable Distributed Data | Data Transformation API; Spark RDD, Heron Streamlet
Fault Tolerance | Check Pointing | Upstream (streaming) backup; Lightweight; Coordination Points; Spark/Flink, MPI and Heron models | Streaming and batch cases distinct; Crosses all components
Security | Storage, Messaging, execution | Research needed | Crosses all Components

Different applications at different layers
[Figure: layered view of the components; layer labels: Spark, Flink; Hadoop, Heron, Storm; None]

Implementing Twister2 in detail II
Look at Communication in detail


Communication Models
• MPI Characteristics: Tightly synchronized applications
  • Efficient communications (µs latency) with use of advanced hardware
  • In place communications and computations (Process scope for state)
• Basic dataflow: Model a computation as a graph
  • Nodes do computations with Task as computations and edges are asynchronous communications
  • A computation is activated when its input data dependencies are satisfied
• Streaming dataflow: Pub-Sub with data partitioned into streams
  • Streams are unbounded, ordered data tuples
  • Order of events important and group data into time windows
• Machine Learning dataflow: Iterative computations and keep track of state
  • There is both Model and Data, but typically only communicate the model
  • Collective communication operations such as AllReduce, AllGather (no differential operators in Big Data problems)
  • Can use in-place MPI style communication
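The basic dataflow rule, that a computation is activated once its input data dependencies are satisfied, can be sketched as a tiny single-threaded executor. This is an illustration only, not Twister2 code; `run_dataflow` and the task names are hypothetical, and real systems run ready tasks concurrently with asynchronous edges.

```python
from collections import deque

def run_dataflow(tasks, edges):
    """tasks: name -> function taking a list of upstream results.
    edges: list of (source, destination) pairs.
    A task runs only after all of its upstream tasks have produced output."""
    preds = {t: [] for t in tasks}
    succs = {t: [] for t in tasks}
    for src, dst in edges:
        preds[dst].append(src)
        succs[src].append(dst)
    pending = {t: len(preds[t]) for t in tasks}   # unsatisfied inputs per task
    ready = deque(t for t in tasks if pending[t] == 0)
    results, order = {}, []
    while ready:
        task = ready.popleft()
        results[task] = tasks[task]([results[p] for p in preds[task]])
        order.append(task)
        for dst in succs[task]:
            pending[dst] -= 1
            if pending[dst] == 0:                 # all inputs present: activate
                ready.append(dst)
    return results, order

tasks = {
    "load": lambda inputs: [1, 2, 3, 4],
    "double": lambda inputs: [2 * x for x in inputs[0]],
    "square": lambda inputs: [x * x for x in inputs[0]],
    "gather": lambda inputs: sum(inputs[0]) + sum(inputs[1]),
}
edges = [("load", "double"), ("load", "square"),
         ("double", "gather"), ("square", "gather")]
results, order = run_dataflow(tasks, edges)
```

Here `gather` cannot start until both `double` and `square` have finished, which is the dependency-driven activation the slide describes.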

[Figure: Dataflow graph with nodes labelled S, W and G]

Twister2 Dataflow Communications
• Twister:Net offers two communication models
  • BSP (Bulk Synchronous Processing) communication using TCP or MPI separated from its task management plus extra Harp collectives
  • plus a new Dataflow library DFW built using MPI software but at data movement not message level
• Non-blocking
• Dynamic data sizes
• Streaming model
  • Batch case is modeled as a finite stream
• The communications are between a set of tasks in an arbitrary task graph
• Key based communications
• Communications spilling to disks
• Target tasks can be different from source tasks

Twister:Net
Architecture
• Communication operators are stateful
  • Buffer data
  • Handle imbalanced dynamically sized communications
  • Act as a combiner
• Thread safe
• Initialization
  • MPI
  • TCP/ZooKeeper
• Buffer management
  • The messages are serialized by the library
• Back-pressure
  • Uses flow control by the underlying channel
Optimized operation vs Basic (Flink, Heron)
• Reduce, Gather, Partition, Broadcast
• AllReduce, AllGather, Keyed-Partition
• Keyed-Reduce, KeyedGather
Batch and Streaming versions of above currently available
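To make the keyed operations and the "act as a combiner" remark concrete, here is a minimal plain-Python sketch of a keyed partition followed by a keyed reduce. It does not use the Twister:Net API; `combine`, `keyed_reduce`, integer keys and the modulo partitioner are all assumptions made for illustration.

```python
from operator import add

def combine(pairs, op):
    """Merge values that share a key; this is the combiner step."""
    acc = {}
    for key, value in pairs:
        acc[key] = op(acc[key], value) if key in acc else value
    return acc

def keyed_reduce(source_outputs, num_targets, op):
    """Each source combines locally, partitions by key, and each target
    reduces what it receives. Targets need not match sources in number."""
    targets = [[] for _ in range(num_targets)]
    for pairs in source_outputs:
        for key, value in combine(pairs, op).items():     # combine at the source
            targets[key % num_targets].append((key, value))  # keyed partition
    return [combine(received, op) for received in targets]   # keyed reduce

sources = [[(0, 1), (1, 2), (0, 3)], [(1, 4), (2, 5)]]
reduced = keyed_reduce(sources, num_targets=2, op=add)
```

Combining at the source before partitioning reduces the number of messages that have to cross the network, which is one reason a stateful operator can outperform a basic per-message implementation.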

Bandwidth & Latency Kernel
Latency and bandwidth between two tasks running in two nodes
[Figure: Latency of MPI and Twister:Net with different message sizes on a two-node setup]
[Figure: Bandwidth utilization of Flink, Twister2 and OpenMPI over 1Gbps, 10Gbps and IB with Flink on IPoIB]

Flink, BSP and DFW Performance
[Figure: Latency for Reduce and Gather operations in 32 nodes with 256-way parallelism. The time is for 1 million messages in each parallel unit, with the given message size. For the BSP-Object case we do two MPI calls with MPIAllReduce / MPIAllGather: first to get the lengths of the messages, then the actual call. InfiniBand network is used.]
[Figure: Total time for Flink and Twister:Net for Reduce and Partition operations in 32 nodes with 640-way parallelism. The time is for 1 million messages in each parallel unit, with the given message size.]

K-Means algorithm performance: AllReduce Communication
[Figure: Left: K-means job execution time on 16 nodes with varying centers, 2 million points with 320-way parallelism. Right: K-Means with 4, 8 and 16 nodes where each node has 20 tasks. 2 million points with 16000 centers used.]

Sorting Records
• Partition the data using a sample and regroup
[Figure: Left: Terasort time on a 16 node cluster with 384 parallelism. BSP and DFW show the communication time. Right: Terasort on 32 nodes with .5 TB and 1TB datasets. Parallelism of 320. Right 16 node cluster (Victor), Left 32 node cluster (Juliet) with InfiniBand.]
• For DFW case, a single node can get congested if many processes send messages simultaneously.
• BSP algorithm waits for others to send messages in a ring topology and can be inefficient compared to DFW case where processes do not wait.

Twister:Net and Apache Heron for Streaming
[Figure: Latency of Apache Heron and Twister:Net DFW (Dataflow) for Reduce, Broadcast and Partition operations in 16 nodes with 256-way parallelism]

Robot Algorithms
• SLAM: Simultaneous Localization and Mapping
• N-Body Collision Avoidance: Robots need to avoid collisions when they move
[Figure labels: Robot with a Laser Range Finder; Map Built from Robot data]

Streaming SLAM Algorithm Apache Storm
• Rao-Blackwellized particle filter based SLAM
• Hosted in FutureSystems OpenStack cloud which is accessible through IU network
• End to end delay without any processing is less than 10ms
[Figure: Streaming workflow. Labels: Gateway; Message Brokers RabbitMQ, Kafka; Sending to pub-sub; Sending to Persisting to storage; A stream application with some tasks running in parallel; Multiple streaming workflows]

Performance of SLAM Storm v. Twister2
[Figure: Storm Implementation Speedup, panels for 180 Laser readings and 640 Laser readings]
[Figure: Twister2 Implementation speedup, panels for 180 Laser readings and 640 Laser readings]

Implementing Twister2 in detail III
State


Resource Allocation
• Job Submission & Management
  • twister2 submit
• Resource Managers
  • Slurm
  • Nomad
  • Kubernetes
  • Mesos

Kubernetes and Mesos Worker Initialization Times
• It takes around 5 seconds to initialize a worker in Kubernetes.
• It takes around 3 seconds to initialize a worker in Mesos.
• When 3 workers are deployed in one executor or pod, initialization times are faster in both systems.
[Figure: two panels, Kubernetes and Mesos. x-axis: Total Number of Workers (3, 9, 18, 36, 54); y-axis: Worker Start Times (sec), 0 to 20. Series: 3 workers per pod and 1 worker per pod (Kubernetes); 3 workers per executor and 1 worker per executor (Mesos).]

Task System
• Generate computation graph dynamically
  • Dynamic scheduling of tasks
  • Allow fine grained control of the graph
• Generate computation graph statically
  • Dynamic or static scheduling
  • Suitable for streaming and data query applications
  • Hard to express complex computations, especially with loops
• Hybrid approach
  • Combine both static and dynamic graphs

Task Graph Execution
[Figure labels: User defined operator; Communication; User Graph; Scheduler; Scheduler Plan; Worker Plan; Network; Execution Planner; Execution Plan; Executor]
• Task Scheduler is pluggable
• Executor is pluggable
• Scheduler running on all the workers
Scheduling Algorithms
• Streaming
  • Round robin
  • First fit
• Batch
  • Data locality aware
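The two streaming scheduling policies named above can be sketched in a few lines. This is a generic illustration of round robin and first fit, not the Twister2 scheduler; the function names, the single scalar resource demand and the uniform worker capacity are assumptions.

```python
def round_robin(tasks, num_workers):
    """Assign tasks to workers in turn, ignoring resource demands."""
    plan = {w: [] for w in range(num_workers)}
    for i, task in enumerate(tasks):
        plan[i % num_workers].append(task)
    return plan

def first_fit(task_demands, worker_capacity, num_workers):
    """Place each (name, demand) on the first worker with enough free capacity."""
    free = [worker_capacity] * num_workers
    plan = {w: [] for w in range(num_workers)}
    for name, demand in task_demands:
        for w in range(num_workers):
            if free[w] >= demand:
                free[w] -= demand
                plan[w].append(name)
                break
        else:
            raise ValueError(f"no worker can host task {name}")
    return plan

rr_plan = round_robin(["a", "b", "c", "d", "e"], num_workers=2)
ff_plan = first_fit([("a", 2), ("b", 3), ("c", 2), ("d", 1)],
                    worker_capacity=4, num_workers=2)
```

A pluggable scheduler would expose both policies behind the same interface, so the choice can be made per job.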

Dataflow at Different Grain sizes
• Coarse Grain Dataflows links jobs in such a pipeline: Data preparation, Clustering, Dimension Reduction, Visualization
• But internally to each job you can also elegantly express algorithm as dataflow but with more stringent performance constraints
[Figure labels: Internal Execution Dataflow Nodes; HPC Communication; Maps; Reduce; Iterate]
Corresponding to classic Spark K-means Dataflow:
  P = loadPoints()
  C = loadInitCenters()
  for (int i = 0; i < 10; i++) {
    T = P.map().withBroadcast(C)
    C = T.reduce()
  }
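The K-means pseudocode above can be turned into a runnable single-process sketch. The version below is plain Python rather than Spark; `closest` and `kmeans` are hypothetical names, the map step uses the current centers as the "broadcast" value, and the reduce step averages the points assigned to each center.

```python
def closest(point, centers):
    """Index of the center nearest to `point` (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))

def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # T = P.map().withBroadcast(C): tag each point with its nearest center.
        assigned = [(closest(p, centers), p) for p in points]
        # C = T.reduce(): sum points per center, then average.
        sums = {}
        for j, p in assigned:
            total, count = sums.get(j, ([0.0] * len(p), 0))
            sums[j] = ([a + b for a, b in zip(total, p)], count + 1)
        centers = [
            [x / sums[j][1] for x in sums[j][0]] if j in sums else list(centers[j])
            for j in range(len(centers))
        ]
    return centers

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
final_centers = kmeans(points, centers=[(0, 0), (10, 0)])
```

Only the centers (the model) move between iterations, while the points stay where they are, which matches the earlier remark that machine learning dataflow typically communicates only the model.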

Workflow vs Dataflow: Different grain sizes and different performance trade-offs
• Workflow: Controlled by Workflow Engine or a Script
• Dataflow: application running as a single job
• The dataflow can expand from Edge to Cloud

NiFi Workflow

Flink MDS Dataflow Graph

Systems State
Spark Kmeans Dataflow:
  P = loadPoints()
  C = loadInitCenters()
  for (int i = 0; i < 10; i++) {
    T = P.map().withBroadcast(C)
    C = T.reduce()
  }
Save State at Coordination Point; Store C in RDD
• State is handled differently in systems
• CORBA, AMT, MPI and Storm/Heron have long running tasks that preserve state
• Spark and Flink preserve datasets across dataflow node using in-memory databases
• All systems agree on coarse grain dataflow; only keep state by exchanging data

Fault Tolerance and State
• Similar form of check-pointing mechanism is used already in HPC and Big Data
  • although HPC informal as doesn’t typically specify as a dataflow graph
  • Flink and Spark do better than MPI due to use of database technologies; MPI is a bit harder due to richer state but there is an obvious integrated model using RDD type snapshots of MPI style jobs
• Checkpoint after each stage of the dataflow graph (at location of intelligent dataflow nodes)
  • Natural synchronization point
  • Let’s allow user to choose when to checkpoint (not every stage)
  • Save state as user specifies; Spark just saves Model state which is insufficient for complex algorithms
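A minimal sketch of user-controlled checkpointing at coordination points follows, assuming the state is a picklable Python object and a local file stands in for reliable storage. The function name `run_with_checkpoints` and the `every` parameter are hypothetical; they only illustrate saving state after chosen stages and resuming from the last saved one.

```python
import os
import pickle
import tempfile

def run_with_checkpoints(step, state, iterations, path, every=1):
    """Run `step` repeatedly, saving (iteration, state) every `every` stages.
    If a checkpoint already exists at `path`, resume from it."""
    start = 0
    if os.path.exists(path):
        with open(path, "rb") as f:
            start, state = pickle.load(f)
    for i in range(start, iterations):
        state = step(state)
        if (i + 1) % every == 0:          # user-chosen coordination point
            with open(path, "wb") as f:
                pickle.dump((i + 1, state), f)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "state.pkl")
# First run stops after 3 iterations (standing in for a failure).
partial = run_with_checkpoints(lambda s: s + 1, 0, 3, ckpt)
# Restart: picks up at iteration 3 instead of starting again from 0.
final = run_with_checkpoints(lambda s: s + 1, 0, 10, ckpt)
```

A production version would write the checkpoint atomically and to replicated storage; the sketch leaves that out to keep the control flow visible.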

Implementing Twister2 Futures


Twister2 Timeline: End of August 2018
• Twister:Net Dataflow Communication API
  • Dataflow communications with MPI or TCP
• Harp for Machine Learning (Custom BSP Communications)
  • Rich collectives
  • Around 30 ML algorithms
• HDFS Integration
• Task Graph
  • Streaming: Storm model
  • Batch analytics: Hadoop
• Deployments on Docker, Kubernetes, Mesos (Aurora), Nomad, Slurm

Twister2 Timeline: End of December 2018
• Native MPI integration to Mesos, Yarn
• Naiad model based Task system for Machine Learning
• Link to Pilot Jobs
• Fault tolerance
  • Streaming
  • Batch
• Hierarchical dataflows with Streaming, Machine Learning and Batch integrated seamlessly
• Data abstractions for streaming and batch (Streamlets, RDD)
• Workflow graphs (Kepler, Spark) with linkage defined by Data Abstractions (RDD)
• End to end applications

Twister2 Timeline: After December 2018
• Dynamic task migrations
• RDMA and other communication enhancements
• Integrate parts of Twister2 components as big data systems enhancements (i.e. run current Big Data software invoking Twister2 components)
  • Heron (easiest), Spark, Flink, Hadoop (like Harp today)
• Support different APIs (i.e. run Twister2 looking like current Big Data Software)
  • Hadoop
  • Spark (Flink)
  • Storm
• Refinements like Marathon with Mesos etc.
• Function as a Service and Serverless
• Support higher level abstractions
  • Twister:SQL

Summary of Twister2: Next Generation HPC Cloud + Edge + Grid
• We have built a high performance data analysis library SPIDAL
• We have integrated HPC into many Apache systems with HPC-ABDS with rich set of collectives
• We have done a preliminary analysis of the different runtimes of Hadoop, Spark, Flink, Storm, Heron, Naiad, DARMA (HPC Asynchronous Many Task) and identified key components
• There are different technologies for different circumstances but can be unified by high level abstractions such as communication/data/task APIs
• Apache systems use dataflow communication which is natural for distributed systems but slower for classic parallel computing
• No standard dataflow library (why?). Add Dataflow primitives in MPI-4?
• HPC could adopt some of tools of Big Data as in Coordination Points (dataflow nodes), State management (fault tolerance) with RDD (datasets)
• Could integrate dataflow and workflow in a cleaner fashion
• Not clear so many big data and resource management approaches needed
