
Page 1: Twister2: A High-Performance Big Data Programming Environment (web.cse.ohio-state.edu/~lu.932/hpbdc2018/slides/fox-slides.pdf)

Twister2: A High-Performance Big Data Programming Environment

HPBDC 2018: The 4th IEEE International Workshop on High-Performance Big Data, Deep Learning, and Cloud Computing

Geoffrey Fox, May 21, 2018
Judy Qiu, Supun Kamburugamuve
Department of Intelligent Systems Engineering
[email protected], http://www.dsc.soic.indiana.edu/, http://spidal.org/

Work with Shantenu Jha, Kannan Govindarajan, Pulasthi Wickramasinghe, Gurhan Gunduz, Ahmet Uyar

8/14/18

Page 2

Abstract
• We analyse the components that are needed in programming environments for Big Data Analysis Systems with scalable HPC performance and the functionality of ABDS – the Apache Big Data Software Stack.
• One highlight is Harp-DAAL, which is a machine learning library exploiting the Intel node library DAAL and HPC communication collectives within the Hadoop ecosystem.
• Another highlight is Twister2, which consists of a set of middleware components to support batch or streaming data capabilities familiar from Apache Hadoop, Spark, Heron and Flink but with high performance.
• Twister2 covers bulk synchronous and dataflow communication; task management as in Mesos, Yarn and Kubernetes; dataflow graph execution models; launching of the Harp-DAAL library; streaming and repository data access interfaces, in-memory databases and fault tolerance at dataflow nodes.
• Similar capabilities are available in current Apache systems but as integrated packages which do not allow needed customization for different application scenarios.

Page 3

Requirements
• On general principles parallel and distributed computing have different requirements even if sometimes similar functionalities
  • Apache stack ABDS typically uses distributed computing concepts
  • For example, Reduce operation is different in MPI (Harp) and Spark
• Large scale simulation requirements are well understood
• Big Data requirements are not agreed but there are a few key use types
  1) Pleasingly parallel processing (including local machine learning LML) as of different tweets from different users with perhaps MapReduce style of statistics and visualizations; possibly Streaming
  2) Database model with queries again supported by MapReduce for horizontal scaling
  3) Global Machine Learning GML with single job using multiple nodes as classic parallel computing
  4) Deep Learning certainly needs HPC – possibly only multiple small systems
• Current workloads stress 1) and 2) and are suited to current clouds and to Apache Big Data Software (with no HPC)
  • This explains why Spark with poor GML performance can be so successful
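One way to picture the remark that Reduce differs between MPI (Harp) and Spark is the toy simulation below. It is only a sketch under an assumption about what the slide means: in a Spark-style reduce the combined value ends up at a single caller (the driver), while in an MPI-style allreduce every worker ends up holding the combined value in place. The function names `driver_reduce` and `allreduce` are illustrative and belong to neither system.

```python
from functools import reduce
from operator import add


def driver_reduce(partitions):
    """Spark-like reduce: combine each partition locally, then combine
    the partial results at one place (the caller, playing the driver)."""
    partials = [reduce(add, part) for part in partitions]
    return reduce(add, partials)


def allreduce(worker_values):
    """MPI-like allreduce: every worker ends up with the same combined
    value, so the next iteration can start without a separate broadcast."""
    total = reduce(add, worker_values)
    return [total for _ in worker_values]


partitions = [[1, 2], [3, 4], [5]]
driver_total = driver_reduce(partitions)               # one value, one place
per_worker = allreduce([sum(p) for p in partitions])   # one value per worker
```

In an iterative algorithm the first form needs an extra broadcast step before the next iteration, which is one reason the two models behave differently for Global Machine Learning.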

Page 4

Spectrum of Applications and Algorithms
There is also distribution seen in grid/edge computing

[Figure: application classes arranged by difficulty in parallelism (size of synchronization constraints), from Loosely Coupled to Tightly Coupled, and by size of disk I/O. Labels in the figure:]
• Pleasingly Parallel: often independent events
• MapReduce as in scalable databases
• Current major Big Data category
• Parameter sweep simulations
• Global Machine Learning, e.g. parallel clustering
• Deep Learning
• Graph Analytics, e.g. subgraph mining
• LDA
• Linear Algebra at core (typically not sparse)
• Unstructured Adaptive Sparsity: medium size jobs
• Structured Adaptive Sparsity: huge jobs
• Large scale simulations
• Platforms: Commodity Clouds; HPC Clouds with High Performance Interconnect; HPC Clouds/Supercomputers (memory access also critical); Exascale Supercomputers

Need a toolkit covering all applications with same API but different implementations

Page 5

• Need a toolkit covering 5 main paradigms with same API but different implementations
• These 3 are focus of Twister2 but we need to preserve capability on first 2 paradigms
• Note Problem and System Architecture as efficient execution says they must match
[Figure labels: Classic Cloud Workload; Global Machine Learning]

Page 6

Comparing Spark, Flink and MPI
• On Global Machine Learning GML.

Page 7

Machine Learning with MPI, Spark and Flink
• Three algorithms implemented in three runtimes
  • Multidimensional Scaling (MDS)
  • Terasort
  • K-Means (drop as no time and looked at later)
• Implementation in Java
  • MDS is the most complex algorithm - three nested parallel loops
  • K-Means - one parallel loop
  • Terasort - no iterations
• With care, Java performance ~ C performance
• Without care, Java performance << C performance (details omitted)

Page 8

Multidimensional Scaling: 3 Nested Parallel Sections
[Figure: MDS execution time on 16 nodes with 20 processes in each node with varying number of points. Series: Flink, Spark, MPI]
[Figure: MDS execution time with 32000 points on varying number of nodes. Each node runs 20 parallel tasks]
• Spark, Flink: No Speedup
• MPI: Factor of 20-200 Faster than Spark/Flink
• K-means also bad – see later

Page 9

Terasort
Sorting 1TB of data records
[Figure: Terasort execution time in 64 and 32 nodes. Only MPI shows the sorting time and communication time, as the other two frameworks don't provide a clear method to accurately measure them. Sorting time includes data save time. MPI-IB - MPI with Infiniband]
Partition the data using a sample and regroup
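The "partition the data using a sample and regroup" step can be sketched on a single machine as follows. This is a minimal illustration, not the benchmark code: the function name `sample_partition_sort`, the sample size and the in-memory lists standing in for nodes are all assumptions, and the input is assumed to be non-empty.

```python
import bisect
import random


def sample_partition_sort(records, num_partitions, sample_size=64, seed=0):
    """Sort by sampling splitters, regrouping records into ranges, then
    sorting each range locally (the lists stand in for worker nodes)."""
    rng = random.Random(seed)
    # 1. Take a sample and sort it to estimate the key distribution.
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    # 2. Pick num_partitions - 1 evenly spaced splitters from the sample.
    step = len(sample) / num_partitions
    splitters = [sample[int(step * i)] for i in range(1, num_partitions)]
    # 3. Regroup: route every record to the partition owning its key range.
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[bisect.bisect_right(splitters, record)].append(record)
    # 4. Each partition sorts locally; concatenation is globally sorted.
    return [sorted(part) for part in partitions]


data = random.Random(1).sample(range(10_000), 500)
parts = sample_partition_sort(data, num_partitions=4)
flat = [x for part in parts for x in part]
```

In a distributed run, step 3 is the communication-heavy regroup (shuffle), which is why the slide separates sorting time from communication time.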

Page 10

Software
HPC-ABDS
HPC-FaaS

Page 11

Software: MIDAS HPC-ABDS
NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science
• Ogres Application Analysis
• HPC-ABDS and HPC-FaaS Software
• Harp and Twister2 Building Blocks
• SPIDAL Data Analytics Library

Page 12

HPC-ABDS
Integrated wide range of HPC and Big Data technologies. I gave up updating list in January 2016!

Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies
21 layers, over 350 Software Packages, January 29 2016

Cross-Cutting Functions
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, Eduroam OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML OAuth
4) Monitoring: Ambari, Ganglia, Nagios, Inca

17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML
16) Application and Analytics: Mahout, MLlib, MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA MAGMA, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK
15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere
15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, H-Store
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs
8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds Networking: Google Cloud DNS, Amazon Route 53

Page 13

Different choices in software systems in Clouds and HPC. HPC-ABDS takes cloud software augmented by HPC when needed to improve performance
16 of 21 layers plus languages

Page 14

Harp Plugin for Hadoop: Important part of Twister2
Work of Judy Qiu

Page 15

Runtime software for Harp
• Map Collective Run time merges MapReduce and HPC
• Collectives: allreduce, reduce, rotate, push & pull, allgather, regroup, broadcast
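As a rough illustration of one of the collectives named above, the sketch below simulates a rotate on a ring of workers. It assumes that rotate means each worker passes its model partition to its neighbour so that, after as many steps as there are workers, every worker has visited every partition; the function `rotate` here is a plain-Python stand-in, not Harp's API.

```python
def rotate(worker_parts):
    """Ring shift: worker i receives the partition held by worker i - 1."""
    n = len(worker_parts)
    return [worker_parts[(i - 1) % n] for i in range(n)]


# Three workers, each starting with one slice of the model.
parts = ["m0", "m1", "m2"]
shifted_once = rotate(parts)

# Track which model slices each worker has seen over a full rotation.
seen = [set() for _ in parts]
current = list(parts)
for _ in range(len(parts)):
    for worker, part in enumerate(current):
        seen[worker].add(part)  # the worker would update this slice locally
    current = rotate(current)
```

Only one slice per worker is in flight at a time, which is why this pattern suits large models that should not be fully replicated.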

Page 16

Dynamic Rotation Control for Latent Dirichlet Allocation and Matrix Factorization SGD (stochastic gradient descent)
• Computes until the time arrives, then starts model rotation to address load imbalance
[Figure labels: Model Related Data; Model Parameters From Rotation; Other Model Parameters From Caching; Multi-Thread Execution]

Page 17

Harp v. Spark
• Datasets: 5 million points, 10 thousand centroids, 10 feature dimensions
• 10 to 20 nodes of Intel KNL 7250 processors
• Harp-DAAL has 15x speedups over Spark MLlib

Harp v. Torch
• Datasets: 500K or 1 million data points of feature dimension 300
• Running on single KNL 7250 (Harp-DAAL) vs. single K80 GPU (PyTorch)
• Harp-DAAL achieves 3x to 6x speedups

Harp v. MPI
• Datasets: Twitter with 44 million vertices, 2 billion edges, subgraph templates of 10 to 12 vertices
• 25 nodes of Intel Xeon E5 2670
• Harp-DAAL has 2x to 5x speedups over state-of-the-art MPI-Fascia solution

Page 18

Mahout and SPIDAL
• Mahout was Hadoop machine learning library but largely abandoned as Spark outperformed Hadoop
• SPIDAL outperforms Spark MLlib and Flink due to better communication and better dataflow or BSP communication.
• Has Harp-(DAAL) optimized machine learning interface
• SPIDAL also has community algorithms
  • Biomolecular Simulation
  • Graphs for Network Science
  • Image processing for pathology and polar science

Page 19

Qiu Core SPIDAL Parallel HPC Library with Collective Used
• DA-MDS: Rotate, AllReduce, Broadcast
• Directed Force Dimension Reduction: AllGather, Allreduce
• Irregular DAVS Clustering: Partial Rotate, AllReduce, Broadcast
• DA Semimetric Clustering (Deterministic Annealing): Rotate, AllReduce, Broadcast
• K-means: AllReduce, Broadcast, AllGather (DAAL)
• SVM: AllReduce, AllGather
• SubGraph Mining: AllGather, AllReduce
• Latent Dirichlet Allocation: Rotate, AllReduce
• Matrix Factorization (SGD): Rotate (DAAL)
• Recommender System (ALS): Rotate (DAAL)
• Singular Value Decomposition (SVD): AllGather (DAAL)
• QR Decomposition (QR): Reduce, Broadcast (DAAL)
• Neural Network: AllReduce (DAAL)
• Covariance: AllReduce (DAAL)
• Low Order Moments: Reduce (DAAL)
• Naive Bayes: Reduce (DAAL)
• Linear Regression: Reduce (DAAL)
• Ridge Regression: Reduce (DAAL)
• Multi-class Logistic Regression: Regroup, Rotate, AllGather
• Random Forest: AllReduce
• Principal Component Analysis (PCA): AllReduce (DAAL)

DAAL implies integrated on node with Intel DAAL Optimized Data Analytics Library

Page 20

Implementing Twister2 in detail I
This breaks rule from 2012-2017 of not “competing” with but rather “enhancing” Apache
http://www.iterativemapreduce.org/

Page 21

Twister2: “Next Generation Grid-Edge – HPC Cloud” Programming Environment
• Analyze the runtime of existing systems
  • Hadoop, Spark, Flink, Pregel Big Data Processing
  • OpenWhisk and commercial FaaS
  • Storm, Heron, Apex Streaming Dataflow
  • Kepler, Pegasus, NiFi workflow systems
  • Harp Map-Collective, MPI and HPC AMT runtime like DARMA
  • And approaches such as GridFTP and CORBA/HLA (!) for wide area data links
• A lot of confusion coming from different communities (database, distributed, parallel computing, machine learning, computational/data science) investigating similar ideas with little knowledge exchange and mixed up (unclear) requirements

Page 22

Integrating HPC and Apache Programming Environments
• Harp-DAAL with a kernel Machine Learning library exploiting the Intel node library DAAL and HPC communication collectives within the Hadoop ecosystem. The broad applicability of Harp-DAAL is supporting all 5 classes of data-intensive computation, from pleasingly parallel to machine learning and simulations.
• Twister2 is a toolkit of components that can be packaged in different ways
  • Integrated batch or streaming data capabilities familiar from Apache Hadoop, Spark, Heron and Flink but with high performance.
  • Separate bulk synchronous and dataflow communication;
  • Task management as in Mesos, Yarn and Kubernetes
  • Dataflow graph execution models
  • Launching of the Harp-DAAL library
  • Streaming and repository data access interfaces,
  • In-memory databases and fault tolerance at dataflow nodes. (use RDD to do classic checkpoint-restart)

Page 23

Approach
• Clearly define and develop functional layers (using existing technology when possible)
• Develop layers as independent components
• Use interoperable common abstractions but multiple polymorphic implementations.
• Allow users to pick and choose according to requirements such as
  • Communication + Data Management
  • Communication + Static graph
• Use HPC features when possible
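The idea of "common abstractions but multiple polymorphic implementations" can be illustrated with a small sketch. Everything named here (`Communication`, `TreeAllReduce`, `ReduceThenBroadcast`, `run_job`) is hypothetical and exists only for this example; it shows one interface with two interchangeable strategies, not any actual Twister2 class. Inputs are assumed to be non-empty.

```python
from abc import ABC, abstractmethod


class Communication(ABC):
    """Common abstraction: sum one value per worker, give all workers the sum."""

    @abstractmethod
    def allreduce_sum(self, worker_values):
        ...


class TreeAllReduce(Communication):
    """BSP-flavoured strategy: combine neighbours pairwise, level by level."""

    def allreduce_sum(self, worker_values):
        level = list(worker_values)
        while len(level) > 1:
            level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
        return [level[0]] * len(worker_values)


class ReduceThenBroadcast(Communication):
    """Dataflow-flavoured strategy: one reduce edge, then one broadcast edge."""

    def allreduce_sum(self, worker_values):
        total = 0
        for value in worker_values:  # reduce edge into a single task
            total += value
        return [total for _ in worker_values]  # broadcast edge back out


def run_job(comm, values):
    """User code depends only on the abstraction, not on the strategy."""
    return comm.allreduce_sum(values)


tree_result = run_job(TreeAllReduce(), [1, 2, 3, 4, 5])
flow_result = run_job(ReduceThenBroadcast(), [1, 2, 3, 4, 5])
```

Swapping the implementation changes performance characteristics but not the calling code, which is the property the pick-and-choose approach relies on.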

Page 24

Twister2 Components I
9/25/2017

Area | Component | Implementation | Comments: User API
Architecture Specification | Coordination Points | State and Configuration Management; Program, Data and Message Level | Change execution mode; save and reset state
Architecture Specification | Execution Semantics | Mapping of Resources to Bolts/Maps in Containers, Processes, Threads | Different systems make different choices - why?
Architecture Specification | Parallel Computing | Spark Flink Hadoop Pregel MPI modes | Owner Computes Rule
Job Submission | (Dynamic/Static) Resource Allocation | Plugins for Slurm, Yarn, Mesos, Marathon, Aurora | Client API (e.g. Python) for Job Management
Task System | Task migration | Monitoring of tasks and migrating tasks for better resource utilization | Task-based programming with Dynamic or Static Graph API; FaaS API; Support accelerators (CUDA, KNL)
Task System | Elasticity | OpenWhisk |
Task System | Streaming and FaaS Events | Heron, OpenWhisk, Kafka/RabbitMQ |
Task System | Task Execution | Process, Threads, Queues |
Task System | Task Scheduling | Dynamic Scheduling, Static Scheduling, Pluggable Scheduling Algorithms |
Task System | Task Graph | Static Graph, Dynamic Graph Generation |

Page 25

Twister2 Components II

Area | Component | Implementation | Comments
Communication API | Messages | Heron | This is user level and could map to multiple communication systems
Communication API | Dataflow Communication | Fine-Grain Twister2 Dataflow communications: MPI, TCP and RMA; Coarse grain Dataflow from NiFi, Kepler? | Streaming, ETL data pipelines; Define new Dataflow communication API and library
Communication API | BSP Communication Map-Collective | Conventional MPI, Harp | MPI Point to Point and Collective API
Data Access | Static (Batch) Data | File Systems, NoSQL, SQL | Data API
Data Access | Streaming Data | Message Brokers, Spouts |
Data Management | Distributed Data Set | Relaxed Distributed Shared Memory (immutable data), Mutable Distributed Data | Data Transformation API; Spark RDD, Heron Streamlet
Fault Tolerance | Check Pointing | Upstream (streaming) backup; Lightweight; Coordination Points; Spark/Flink, MPI and Heron models | Streaming and batch cases distinct; Crosses all components
Security | Storage, Messaging, execution | Research needed | Crosses all Components

Page 26

Different applications at different layers
[Figure labels: Spark, Flink; Hadoop, Heron, Storm; None]

Page 27

Implementing Twister2 in detail II
Look at Communication in detail

Page 28

Communication Models
• MPI Characteristics: Tightly synchronized applications
  • Efficient communications (µs latency) with use of advanced hardware
  • In place communications and computations (Process scope for state)
• Basic dataflow: Model a computation as a graph
  • Nodes do computations with Task as computations and edges are asynchronous communications
  • A computation is activated when its input data dependencies are satisfied
• Streaming dataflow: Pub-Sub with data partitioned into streams
  • Streams are unbounded, ordered data tuples
  • Order of events important and group data into time windows
• Machine Learning dataflow: Iterative computations and keep track of state
  • There is both Model and Data, but typically only communicate the model
  • Collective communication operations such as AllReduce AllGather (no differential operators in Big Data problems)
  • Can use in-place MPI style communication
[Figure: dataflow graph with nodes labelled S, W and G]
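The basic dataflow rule, that a computation is activated once its input data dependencies are satisfied, can be sketched with a tiny single-threaded executor. The function `run_dataflow` and the task names S, W1, W2 and G are illustrative assumptions (loosely echoing the S, W and G labels in the slide figure); a real runtime would run ready tasks in parallel and move data over the network.

```python
from collections import deque


def run_dataflow(tasks, edges):
    """tasks: name -> function taking {predecessor name: value}.
    edges: list of (source, destination) pairs, without duplicates."""
    preds = {name: [] for name in tasks}
    succs = {name: [] for name in tasks}
    for src, dst in edges:
        preds[dst].append(src)
        succs[src].append(dst)

    inbox = {name: {} for name in tasks}              # inputs received so far
    ready = deque(n for n in tasks if not preds[n])   # sources start ready
    results, order = {}, []
    while ready:
        name = ready.popleft()
        results[name] = tasks[name](inbox[name])
        order.append(name)
        for nxt in succs[name]:
            inbox[nxt][name] = results[name]          # edge delivers the data
            if len(inbox[nxt]) == len(preds[nxt]):    # all dependencies met
                ready.append(nxt)
    return results, order


tasks = {
    "S": lambda _: [1, 2, 3, 4],                              # source
    "W1": lambda i: sum(x for x in i["S"] if x % 2 == 0),     # worker
    "W2": lambda i: sum(x for x in i["S"] if x % 2 == 1),     # worker
    "G": lambda i: i["W1"] + i["W2"],                         # gather
}
edges = [("S", "W1"), ("S", "W2"), ("W1", "G"), ("W2", "G")]
results, order = run_dataflow(tasks, edges)
```

Note that G is never scheduled explicitly: it runs only because both of its inputs have arrived.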

Page 29

Twister2 Dataflow Communications
• Twister:Net offers two communication models
  • BSP (Bulk Synchronous Processing) communication using TCP or MPI separated from its task management plus extra Harp collectives
  • plus a new Dataflow library DFW built using MPI software but at data movement not message level
    • Non-blocking
    • Dynamic data sizes
    • Streaming model
      • Batch case is modeled as a finite stream
    • The communications are between a set of tasks in an arbitrary task graph
    • Key based communications
    • Communications spilling to disks
    • Target tasks can be different from source tasks
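Two of the points above, key based communication and the batch case modeled as a finite stream, can be pictured with the following sketch. It is a single-process simulation under stated assumptions: `keyed_reduce` is a hypothetical name, the hash-based routing is one possible choice, and the dictionaries stand in for target tasks that are distinct from the source.

```python
def keyed_reduce(stream, combine, num_targets):
    """Route each (key, value) to a target task chosen by the key's hash,
    combining values that share a key as they arrive."""
    targets = [{} for _ in range(num_targets)]
    for key, value in stream:
        slot = targets[hash(key) % num_targets]
        slot[key] = combine(slot[key], value) if key in slot else value
    return targets


def finite_stream():
    """A batch input expressed as a stream that simply ends."""
    for word in ["spark", "flink", "mpi", "spark", "mpi", "spark"]:
        yield word, 1


targets = keyed_reduce(finite_stream(), lambda a, b: a + b, num_targets=3)
merged = {key: value for target in targets for key, value in target.items()}
```

Because the operator consumes an iterator, the same code would work on an unbounded stream if results were emitted per window instead of at the end.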

Page 30

Twister:Net
• Communication operators are stateful
  • Buffer data
  • handle imbalanced dynamically sized communications,
  • act as a combiner
• Thread safe
• Initialization
  • MPI
  • TCP / ZooKeeper
• Buffer management
  • The messages are serialized by the library
• Back-pressure
  • Uses flow control by the underlying channel

[Figure: Architecture]

Optimized operation vs Basic (Flink, Heron)
• Reduce, Gather, Partition, Broadcast
• AllReduce, AllGather, Keyed-Partition
• Keyed-Reduce, KeyedGather
Batch and Streaming versions of above currently available
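Back-pressure through a bounded channel can be illustrated as below. This is a deterministic, single-threaded sketch and not how Twister:Net is implemented: `BoundedChannel` and `send_all` are hypothetical names, and the "receiver" is just a callback that the sender invokes when the channel is full, standing in for the sender being slowed to the receiver's pace.

```python
import queue


class BoundedChannel:
    """A channel with fixed capacity; a full channel refuses new messages."""

    def __init__(self, capacity):
        self._q = queue.Queue(maxsize=capacity)

    def try_send(self, msg):
        try:
            self._q.put_nowait(msg)
            return True
        except queue.Full:
            return False  # back-pressure signal to the sender

    def receive(self):
        return self._q.get_nowait()

    def empty(self):
        return self._q.empty()


def send_all(channel, messages, drain):
    """Send every message, stalling whenever the channel is full until the
    receiver (drain) has consumed one message."""
    delivered, stalls = [], 0
    for msg in messages:
        while not channel.try_send(msg):
            stalls += 1
            delivered.append(drain(channel))
    while not channel.empty():
        delivered.append(drain(channel))
    return delivered, stalls


channel = BoundedChannel(capacity=2)
delivered, stalls = send_all(channel, range(5), lambda ch: ch.receive())
```

The sender never holds more than the channel capacity in flight, so a slow receiver bounds memory use instead of causing unbounded buffering.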

Page 31

Bandwidth & Latency Kernel
Latency and bandwidth between two tasks running in two nodes
[Figure: Latency of MPI and Twister:Net with different message sizes on a two-node setup]
[Figure: Bandwidth utilization of Flink, Twister2 and OpenMPI over 1Gbps, 10Gbps and IB with Flink on IPoIB]

Page 32

Flink, BSP and DFW Performance
[Figure: Latency for Reduce and Gather operations in 32 nodes with 256-way parallelism. The time is for 1 million messages in each parallel unit, with the given message size. For the BSP-Object case we do two MPI calls with MPI AllReduce / MPI AllGather: first to get the lengths of the messages, and then the actual call. InfiniBand network is used.]
[Figure: Total time for Flink and Twister:Net for Reduce and Partition operations in 32 nodes with 640-way parallelism. The time is for 1 million messages in each parallel unit, with the given message size]
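The two-call pattern described for the BSP-Object case (first exchange message lengths, then exchange the data itself) can be sketched without MPI as follows. The simulated `allgather`, the use of `pickle` for serialization and the function name `allgather_objects` are all assumptions made for illustration; the benchmark's actual serialization is not given in the slides.

```python
import pickle


def allgather(values):
    """Simulated collective: every rank receives the full list of values."""
    return [list(values) for _ in values]


def allgather_objects(objects_per_rank):
    """Gather variable-sized objects using two collective calls."""
    payloads = [pickle.dumps(obj) for obj in objects_per_rank]

    # Call 1: exchange lengths so each rank can size its receive buffer.
    lengths = allgather([len(p) for p in payloads])[0]

    # Call 2: exchange the bytes into one contiguous buffer of known size.
    recv = bytearray(sum(lengths))
    offset = 0
    for payload, n in zip(payloads, lengths):
        recv[offset:offset + n] = payload
        offset += n

    # Each rank splits the buffer by the known lengths and deserializes.
    decoded, offset = [], 0
    for n in lengths:
        decoded.append(pickle.loads(bytes(recv[offset:offset + n])))
        offset += n
    return [list(decoded) for _ in objects_per_rank]


objs = [{"a": 1}, [1, 2, 3], "x" * 10]
gathered = allgather_objects(objs)
```

The extra length exchange is the cost of sending objects rather than fixed-size primitive arrays, which is consistent with the slide treating BSP-Object as a separate case.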

Page 33

K-Means algorithm performance
AllReduce Communication
[Figure: Left: K-means job execution time on 16 nodes with varying centers, 2 million points with 320-way parallelism. Right: K-Means with 4, 8 and 16 nodes where each node has 20 tasks. 2 million points with 16000 centers used.]

Page 34

Sorting Records
Partition the data using a sample and regroup
[Figure: Left: Terasort time on a 16 node cluster with 384 parallelism. BSP and DFW show the communication time. Right: Terasort on 32 nodes with .5 TB and 1TB datasets. Parallelism of 320. Right 16 node cluster (Victor), Left 32 node cluster (Juliet) with InfiniBand.]
• For DFW case, a single node can get congested if many processes send messages simultaneously.
• BSP algorithm waits for others to send messages in a ring topology and can be inefficient compared to DFW case where processes do not wait.

Page 35

Twister:Net and Apache Heron for Streaming
[Figure: Latency of Apache Heron and Twister:Net DFW (Dataflow) for Reduce, Broadcast and Partition operations in 16 nodes with 256-way parallelism]

Page 36

Robot Algorithms
• Simultaneous Localization and Mapping
  • Robot with a Laser Range Finder
  • Map Built from Robot data
• N-Body Collision Avoidance
  • Robots need to avoid collisions when they move

Page 37

SLAM Simultaneous Localization and Mapping
[Figure labels: Gateway; Sending to pub-sub; Message Brokers RabbitMQ, Kafka; Persisting to storage; Streaming workflow; A stream application with some tasks running in parallel; Multiple streaming workflows; Streaming SLAM Algorithm; Apache Storm]
• Hosted in FutureSystems OpenStack cloud which is accessible through IU network
• End to end delay without any processing is less than 10 ms
• Rao-Blackwellized particle filter based SLAM

Page 38

Performance of SLAM Storm v. Twister2
[Figure: Storm implementation speedup; panels for 180 laser readings and 640 laser readings]
[Figure: Twister2 implementation speedup; panels for 180 laser readings and 640 laser readings]

Page 39

Implementing Twister2 in detail III
State

Page 40

Resource Allocation
• Job Submission & Management
  • twister2 submit
• Resource Managers
  • Slurm
  • Nomad
  • Kubernetes
  • Mesos

Page 41

Kubernetes and Mesos Worker Initialization Times
• It takes around 5 seconds to initialize a worker in Kubernetes.
• It takes around 3 seconds to initialize a worker in Mesos.
• When 3 workers are deployed in one executor or pod, initialization times are faster in both systems.
[Figure: Kubernetes. Worker start times (sec), axis 0.0 to 20.0, against total number of workers (3, 9, 18, 36, 54). Series: 3 workers per pod; 1 worker per pod]
[Figure: Mesos. Worker start times (sec), axis 0.0 to 20.0, against total number of workers (3, 9, 18, 36, 54). Series: 3 workers per executor; 1 worker per executor]

Page 42

Task System
• Generate computation graph dynamically
  • Dynamic scheduling of tasks
  • Allow fine grained control of the graph
• Generate computation graph statically
  • Dynamic or static scheduling
  • Suitable for streaming and data query applications
  • Hard to express complex computations, especially with loops
• Hybrid approach
  • Combine both static and dynamic graphs
[Figure labels: User defined operator; Communication]

Page 43

Task Graph Execution
[Figure labels: User Graph; Scheduler; Scheduler Plan; Worker Plan; Network; Execution Planner; Execution Plan; Executor]
• Task Scheduler is pluggable
• Executor is pluggable
• Scheduler running on all the workers

Scheduling Algorithms
• Streaming
  • Round robin
  • First fit
• Batch
  • Data locality aware
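The two streaming scheduling algorithms named on this slide, round robin and first fit, can be sketched generically as follows. This is the textbook form of each algorithm under simple assumptions (a single scalar resource per worker, identical worker capacities); the function names and the task and resource model are hypothetical and not taken from Twister2.

```python
def round_robin(tasks, num_workers):
    """Assign tasks to workers in turn, ignoring resource needs."""
    plan = {worker: [] for worker in range(num_workers)}
    for index, task in enumerate(tasks):
        plan[index % num_workers].append(task)
    return plan


def first_fit(task_demands, worker_capacity, num_workers):
    """Place each (task, demand) on the first worker with enough room left."""
    free = [worker_capacity] * num_workers
    plan = {worker: [] for worker in range(num_workers)}
    for name, need in task_demands:
        for worker in range(num_workers):
            if free[worker] >= need:
                free[worker] -= need
                plan[worker].append(name)
                break
        else:
            raise ValueError(f"no worker can host task {name!r}")
    return plan


rr_plan = round_robin(["a", "b", "c", "d", "e"], num_workers=2)
ff_plan = first_fit([("a", 3), ("b", 2), ("c", 2), ("d", 1)],
                    worker_capacity=4, num_workers=2)
```

Because both functions share the shape "tasks in, worker plan out", either could sit behind a pluggable scheduler interface.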

Page 44

Dataflow at Different Grain sizes
• Coarse Grain Dataflow links jobs in such a pipeline
  [Figure: pipeline stages: Data preparation; Clustering; Dimension Reduction; Visualization]
• But internally to each job you can also elegantly express algorithm as dataflow but with more stringent performance constraints
  [Figure: Internal Execution Dataflow Nodes: Maps; Reduce; Iterate; HPC Communication]

Corresponding to classic Spark K-means Dataflow
    P = loadPoints()
    C = loadInitCenters()
    for (int i = 0; i < 10; i++) {      // Iterate
      T = P.map().withBroadcast(C)
      C = T.reduce()
    }
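The K-means dataflow above can be made concrete with a small runnable sketch. It mirrors the pseudocode's structure (a map over points that sees the broadcast centers, then a reduce producing new centers, repeated for a fixed number of iterations), but it is a plain single-process illustration: `dist2`, `kmeans` and the sample data are assumptions, and the points and initial centers are given inline instead of being loaded.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # Map with broadcast: every point sees all current centers and
        # emits (index of nearest center, point).
        assigned = [
            (min(range(len(centers)), key=lambda k: dist2(p, centers[k])), p)
            for p in points
        ]
        # Reduce: accumulate per-center coordinate sums and counts.
        sums = {}
        for k, p in assigned:
            total, count = sums.get(k, ((0.0,) * len(p), 0))
            sums[k] = (tuple(a + b for a, b in zip(total, p)), count + 1)
        # New centers become the broadcast value of the next iteration;
        # a center with no points keeps its previous position.
        centers = [
            tuple(x / sums[k][1] for x in sums[k][0]) if k in sums else centers[k]
            for k in range(len(centers))
        ]
    return centers


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
final_centers = kmeans(points, centers=[(0, 0), (10, 10)])
```

Only the centers (the model) cross iteration boundaries; the points stay where they are, which matches the observation that machine learning dataflow typically communicates only the model.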

Page 45

Workflow vs Dataflow: Different grain sizes and different performance trade-offs
[Figure: Workflow Controlled by Workflow Engine or a Script; Dataflow application running as a single job]
The dataflow can expand from Edge to Cloud

Page 46: Twister2: A High-Performance Big Data Programming Environmentweb.cse.ohio-state.edu/~lu.932/hpbdc2018/slides/fox-slides.pdf · Abstract • We analyse the components that are needed

NiFi Workflow


Flink MDS Dataflow Graph

8/30/2017


Systems State

Spark K-means Dataflow
• P = loadPoints()
• C = loadInitCenters()
• for (int i = 0; i < 10; i++) {
•   T = P.map().withBroadcast(C)
•   C = T.reduce() }

Save State at Coordination Point; Store C in RDD


• State is handled differently in systems
• CORBA, AMT, MPI and Storm/Heron have long-running tasks that preserve state
• Spark and Flink preserve datasets across dataflow nodes using in-memory databases
• All systems agree on coarse-grain dataflow; only keep state by exchanging data
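The two ways of holding state described in these bullets can be contrasted in a small sketch. Both forms compute the same running sum; the class and function names are hypothetical and not taken from any of the systems named.

```python
from typing import List


class RunningSum:
    """Long-running task style (as in MPI or Storm/Heron tasks):
    the state lives inside the task between messages."""

    def __init__(self) -> None:
        self.total = 0

    def on_message(self, value: int) -> int:
        self.total += value
        return self.total


def sum_stage(state: int, batch: List[int]) -> int:
    """Dataflow style (as in Spark or Flink): the node itself is stateless;
    state arrives as input data and leaves as output data."""
    return state + sum(batch)


# Long-running task: state is implicit and survives across calls.
task = RunningSum()
for v in [1, 2, 3]:
    task.on_message(v)

# Dataflow: state is threaded explicitly from one node to the next.
state = 0
for batch in [[1, 2], [3]]:
    state = sum_stage(state, batch)
```

In the second form the state is an ordinary dataset at every node boundary, so it can be stored or snapshotted there without the task's cooperation.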

Iterate


Fault Tolerance and State
• A similar form of check-pointing mechanism is already used in HPC and Big Data
  • although HPC is informal, as it doesn't typically specify the job as a dataflow graph
  • Flink and Spark do better than MPI due to use of database technologies; MPI is a bit harder due to richer state, but there is an obvious integrated model using RDD-type snapshots of MPI-style jobs
• Checkpoint after each stage of the dataflow graph (at location of intelligent dataflow nodes)
  • Natural synchronization point
  • Let the user choose when to checkpoint (not every stage)
  • Save state as user specifies; Spark just saves Model state, which is insufficient for complex algorithms
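Checkpointing at user-chosen stage boundaries can be sketched as follows. This is an illustration under stated assumptions, not Twister2's fault-tolerance code: the function name, the `stage-<i>.pkl` file naming, and the use of Python's `pickle` for snapshots are all choices made for the example.

```python
import os
import pickle
from typing import Any, Callable, List, Set


def run_with_checkpoints(stages: List[Callable[[Any], Any]],
                         state: Any,
                         checkpoint_after: Set[int],
                         directory: str) -> Any:
    """Run dataflow stages in order, saving state only after the stages the
    user selected. On restart, resume after the latest saved stage."""
    start = 0
    # Recovery: look for the most recent checkpoint, newest stage first.
    for i in reversed(range(len(stages))):
        path = os.path.join(directory, f"stage-{i}.pkl")
        if os.path.exists(path):
            with open(path, "rb") as f:
                state = pickle.load(f)
            start = i + 1
            break

    for i in range(start, len(stages)):
        state = stages[i](state)
        # The end of a stage is a natural synchronization point, so the
        # whole user-visible state can be snapshotted here.
        if i in checkpoint_after:
            with open(os.path.join(directory, f"stage-{i}.pkl"), "wb") as f:
                pickle.dump(state, f)
    return state
```

Passing `checkpoint_after={1}` for a three-stage job, for example, snapshots only after the second stage; a rerun against the same directory skips the first two stages and replays just the last one.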


Implementing Twister2 Futures


http://www.iterativemapreduce.org/


Twister2 Timeline: End of August 2018
• Twister:Net Dataflow Communication API
  • Dataflow communications with MPI or TCP
• Harp for Machine Learning (Custom BSP Communications)
  • Rich collectives
  • Around 30 ML algorithms
• HDFS Integration
• Task Graph
  • Streaming - Storm model
  • Batch analytics - Hadoop
• Deployments on Docker, Kubernetes, Mesos (Aurora), Nomad, Slurm


Twister2 Timeline: End of December 2018
• Native MPI integration to Mesos, Yarn
• Naiad model based Task system for Machine Learning
• Link to Pilot Jobs
• Fault tolerance
  • Streaming
  • Batch
• Hierarchical dataflows with Streaming, Machine Learning and Batch integrated seamlessly
  • Data abstractions for streaming and batch (Streamlets, RDD)
  • Workflow graphs (Kepler, Spark) with linkage defined by Data Abstractions (RDD)
• End to end applications


Twister2 Timeline: After December 2018
• Dynamic task migrations
• RDMA and other communication enhancements
• Integrate parts of Twister2 components as big data systems enhancements (i.e. run current Big Data software invoking Twister2 components)
  • Heron (easiest), Spark, Flink, Hadoop (like Harp today)
• Support different APIs (i.e. run Twister2 looking like current Big Data Software)
  • Hadoop
  • Spark (Flink)
  • Storm
• Refinements like Marathon with Mesos etc.
• Function as a Service and Serverless
• Support higher level abstractions
  • Twister:SQL


Summary of Twister2: Next Generation HPC Cloud + Edge + Grid
• We have built a high-performance data analysis library, SPIDAL
• We have integrated HPC into many Apache systems with HPC-ABDS, with a rich set of collectives
• We have done a preliminary analysis of the different runtimes of Hadoop, Spark, Flink, Storm, Heron, Naiad, DARMA (HPC Asynchronous Many Task) and identified key components
• There are different technologies for different circumstances, but they can be unified by high-level abstractions such as communication/data/task APIs
• Apache systems use dataflow communication, which is natural for distributed systems but slower for classic parallel computing
• No standard dataflow library (why?). Add Dataflow primitives in MPI-4?
• HPC could adopt some of the tools of Big Data, as in Coordination Points (dataflow nodes) and State management (fault tolerance) with RDD (datasets)
• Could integrate dataflow and workflow in a cleaner fashion
• Not clear that so many big data and resource management approaches are needed
