TRANSCRIPT

Digital Science Center
Big Data Benchmarking: Applications and Systems
Geoffrey Fox, December 10, 2018
2018 International Symposium on Benchmarking, Measuring and Optimizing (Bench'18) at IEEE Big Data 2018
Dec 10 - Dec 11, 2018 @ Seattle, WA, USA
Digital Science Center, Indiana University
[email protected], http://www.dsc.soic.indiana.edu/, http://spidal.org/
Big Data and Extreme-scale Computing: http://www.exascale.org/bdec/
• BDEC Pathways to Convergence Report
• New series BDEC2, "Common Digital Continuum Platform for Big Data and Extreme Scale Computing", with first meeting November 28-30, 2018, Bloomington, Indiana, USA (focus on applications).
• Working groups on platform (technology), applications, community building
• BigDataBench presented a white paper
• Next meetings: February 19-21, Kobe, Japan (focus on platform), followed by two in Europe, one in the USA and one in China.
http://www.exascale.org/bdec/sites/www.exascale.org.bdec/files/whitepapers/bdec2017pathways.pdf
Benchmarks should mimic use cases? Need to collect use cases?
We can classify use cases and benchmarks along several different dimensions.
Software: MIDAS HPC-ABDS
NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science
Ogres Application Analysis
HPC-ABDS and HPC-FaaS Software; Harp and Twister2 Building Blocks
SPIDAL Data Analytics Library
My view of System GAIMSC
Systems Challenges for GAIMSC
• Microsoft noted we are collectively building the Global AI Supercomputer.
• Generalize by adding modeling.
• The architecture of the Global AI and Modeling Supercomputer (GAIMSC) must reflect:
• Global captures the need to mash up services from many different sources;
• AI captures the incredible progress in machine learning (ML);
• Modeling captures both traditional large-scale simulations and the models and digital twins needed for data interpretation;
• Supercomputer captures that everything is huge and needs to be done quickly, and often in real time for streaming applications.
• The GAIMSC includes an intelligent HPC cloud linked via an intelligent HPC Fog to an intelligent HPC edge. We consider this distributed environment as a set of computational and data-intensive nuggets swimming in an intelligent aether.
• We will use a dataflow graph to define a mesh in the aether.
Global AI and Modeling Supercomputer GAIMSC
• There is only a cloud at the logical center, but it is physically distributed and owned by a few major players.
• There is a very distributed set of devices surrounded by local Fog computing; this forms the logically and physically distributed edge.
• The edge is structured and largely data.
• These are two differences from the Grid of the past.
• e.g. a self-driving car will have its own fog and will not share fog with the truck that it is about to collide with.
• The cloud and edge will both be very heterogeneous, with varying accelerators, memory size and disk structure.
• GAIMSC requires parallel computing to achieve high performance on large ML and simulation nuggets, and distributed-system technology to build the aether and support the distributed but connected nuggets.
• In the latter respect, the intelligent aether mimics a grid, but it is a data grid where there are computations, typically those associated with data (often from edge devices).
• So unlike the distributed simulation supercomputer that was often studied in previous grids, GAIMSC is a supercomputer aimed at very different data-intensive, AI-enriched problems.
GAIMSC Global AI & Modeling Supercomputer Questions
• What do we gain from the concept? e.g. the ability to work with the Big Data community.
• What do we lose from the concept? e.g. everything runs as slow as Spark.
• Is GAIMSC useful for the BDEC2 initiative? For NSF? For DoE? For universities? For industry? For users?
• Does adding modeling to the concept add value?
• What are the research issues for GAIMSC? e.g. how to program it?
• What can we do with GAIMSC that we couldn't do with classic Big Data technologies?
• What can we do with GAIMSC that we couldn't do with classic HPC technologies?
• Are there deep or important issues associated with the "Global" in GAIMSC?
• Is the concept of an auto-tuned Global AI and Modeling Supercomputer scary?
Integration of Data and Model functions with ML wrappers in GAIMSC
• There is a rapid increase in the integration of ML and simulations.
• ML can analyze results, guide the execution and set up initial configurations (auto-tuning). This is equally true for AI itself: the GAIMSC will use itself to optimize its execution for both analytics and simulations.
• In principle, every transfer of control (a job or function invocation, a link from a device to the fog/cloud) should pass through an AI wrapper that learns from each call and can decide both whether the call needs to be executed (maybe we have learned the answer already and need not compute it) and how to optimize the call if it really needs to be executed.
• The digital continuum proposed by BDEC2 is an intelligent aether learning from and informing the interconnected computational actions that are embedded in the aether.
• Implementing the intelligent aether embracing and extending the edge, fog, and cloud is a major research challenge where bold new ideas are needed!
• We need to understand how to make it easy to automatically wrap every nugget with ML.
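The AI-wrapper idea above can be made concrete with a toy sketch. Here simple memoization stands in for the learned "do we already know the answer?" decision, and recorded timings stand in for the information a real system could learn from to optimize future calls; `ai_wrapper` and `simulate` are hypothetical names, not part of any real framework.

```python
import functools
import time

def ai_wrapper(func):
    """Toy stand-in for the ML wrapper: memoizes results (a repeated
    call 'need not be computed') and records timings that a real
    system could learn from to optimize execution."""
    cache = {}
    history = []

    @functools.wraps(func)
    def wrapped(*args):
        if args in cache:                 # learned the answer already
            return cache[args]
        start = time.perf_counter()
        result = func(*args)              # the call really needs to run
        history.append((args, time.perf_counter() - start))
        cache[args] = result
        return result

    wrapped.history = history
    return wrapped

@ai_wrapper
def simulate(x):
    return x * x                          # placeholder "nugget"

simulate(3)
simulate(3)                               # second call served from cache
print(len(simulate.history))              # 1: only one real execution
```

A production wrapper would of course replace the exact-match cache with a learned surrogate model, but the control-flow interception point is the same.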
Implementing the GAIMSC
• The new MIDAS middleware designed in SPIDAL has been engineered to support high-performance technologies and yet preserve the key features of the Apache Big Data software.
• MIDAS seems well suited to build the prototype intelligent high-performance aether.
• Note this will mix many relatively small nuggets with AI wrappers, generating parallelism from the number of nuggets and not internally to the nugget and its wrapper.
• However, there will also be large global jobs requiring internal parallelism for individual large-scale machine learning or simulation tasks.
• Thus parallel computing and distributed systems (grids) must be linked in a deep fashion, although the key parallel computing ideas needed for ML are closely related to those already developed for simulations.
Underlying HPC Big Data Convergence Issues
Data and Model in Big Data and Simulations I
• We need to discuss Data and Model, as problems have both intermingled, but we can get insight by separating them, which allows a better understanding of Big Data - Big Simulation "convergence" (or differences!).
• The Model is a user construction: it has a "concept" and parameters and gives results determined by the computation. We use the term "model" in a general fashion to cover all of these.
• Big Data problems can be broken up into Data and Model.
• For clustering, the model parameters are the cluster centers, while the data is the set of points to be clustered.
• For queries, the model is the structure of the database and the results of the query, while the data is the whole database queried plus the SQL query.
• For deep learning with ImageNet, the model is the chosen network with model parameters as the network link weights. The data is the set of images used for training or classification.
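The clustering example above can be sketched in a few lines. This is a minimal, deterministic k-means step (not any particular library's implementation) chosen to make the Data/Model split visible: the data (points) never changes across iterations, while the model (cluster centers) is what gets updated.

```python
# Data: the points to be clustered (fixed).
# Model: the cluster centers (updated each iteration).
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]   # Data
model = [(0.0, 0.0), (10.0, 10.0)]                            # Model

def nearest(p, centers):
    """Index of the center closest to point p (squared distance)."""
    return min(range(len(centers)),
               key=lambda i: (p[0] - centers[i][0]) ** 2
                           + (p[1] - centers[i][1]) ** 2)

for _ in range(5):                     # iterations update the model only
    groups = [[] for _ in model]
    for p in data:
        groups[nearest(p, model)].append(p)
    model = [(sum(x for x, _ in g) / len(g),
              sum(y for _, y in g) / len(g)) for g in groups]

print(model)   # [(0.0, 0.5), (10.0, 10.5)]: centers moved, data unchanged
```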
Data and Model in Big Data and Simulations II
• Simulations can also be considered as Data plus Model.
• The Model can be a formulation with particle dynamics or partial differential equations, defined by parameters such as particle positions and discretized velocity, pressure and density values.
• Data could be small when it is just boundary conditions.
• Data is large with data assimilation (weather forecasting) or when data visualizations are produced by the simulation.
• Big Data implies the Data is large, but the Model varies in size.
• e.g. LDA (Latent Dirichlet Allocation) with many topics, or deep learning, has a large model.
• Clustering or dimension reduction can be quite small in model size.
• Data is often static between iterations (unless streaming); Model parameters vary between iterations.
• Data and Model parameters are often confused in papers, as the term "data" is used to describe the parameters of models.
• Models in Big Data and Simulations have many similarities and allow convergence.
Local and Global Machine Learning
• Many applications use LML or Local Machine Learning, where machine learning (often from R, Python or Matlab) is run separately on every data item, such as on every image.
• But others are GML, Global Machine Learning, where machine learning is a basic algorithm run over all data items (over all nodes in the computer).
• e.g. maximum likelihood or χ² with a sum over the N data items: documents, sequences, items to be sold, images, etc., and often links (point-pairs).
• GML includes graph analytics, clustering/community detection, mixture models, topic determination, multidimensional scaling, and (deep) learning networks.
• Note Facebook may need lots of small graphs (one per person, ~LML) rather than one giant graph of connected people (GML).
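The LML/GML distinction above can be shown with a tiny contrast. This is an illustrative sketch, not a real workload: LML applies an independent model to each item (no communication needed), while GML minimizes a single objective summed over all N items, so its gradient couples every item and parallel workers would need a global reduction.

```python
items = [1.0, 2.0, 3.0, 4.0]

# LML: e.g. classify each item independently -- pleasingly parallel,
# each result depends on one item only.
lml_results = [x > 2.5 for x in items]

# GML: fit one global parameter mu by minimizing sum_i (x_i - mu)^2.
# The gradient is a sum over ALL items, so distributed workers would
# have to all-reduce it every iteration.
mu = 0.0
for _ in range(100):
    grad = sum(2 * (mu - x) for x in items)   # global sum over N items
    mu -= 0.1 * grad                          # gradient descent step

print(round(mu, 6))   # 2.5, the global optimum (the mean of the items)
```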
Convergence/Divergence Points for HPC-Cloud-Edge-Big Data-Simulation
• Applications: divide use cases into Data and Model and compare characteristics separately in these two components with 64 Convergence Diamonds (features).
• Identify the importance of streaming data, pleasingly parallel, and global/local machine learning.
• Software: a single model of the High Performance Computing (HPC) Enhanced Big Data Stack, HPC-ABDS: 21 layers adding a high-performance runtime to Apache systems; HPC-FaaS programming model.
• Serverless Infrastructure as a Service (IaaS).
• Hardware: system designed for functionality and performance of the application type, e.g. disks, interconnect, memory, CPU acceleration different for machine learning, pleasingly parallel, data management, streaming, simulations.
• Use DevOps to automate deployment of event-driven software-defined systems on hardware: HPCCloud 2.0.
• Total System Solutions (wisdom) as a Service: HPCCloud 3.0.
Application Structure
http://www.iterativemapreduce.org/
Structure of Applications
• Real-time (streaming) data is increasingly common in scientific and engineering research, and it is ubiquitous in commercial Big Data (e.g., social network analysis, recommender systems and consumer behavior classification).
• So far there is little use of commercial and Apache technology in the analysis of scientific streaming data.
• Pleasingly parallel applications are important in science (long tail) and data communities.
• Commercial-science application differences: search and recommender engines have a different structure to deep learning, clustering, topic models, and graph analyses such as subgraph mining.
• The latter are very sensitive to communication and can be hard to parallelize.
• Search is typically not as important in science as in commercial use, as search volume scales with the number of users.
• We should discuss data and model separately.
• The term "data" is often used rather sloppily and often refers to the model.
Distinctive Features of Applications
• Ratio of data to model sizes: vertical axis on next slide.
• Importance of synchronization, i.e. ratio of inter-node communication to node computing: horizontal axis on next slide.
• Sparsity of data or model; impacts the value of GPUs or vector computing.
• Irregularity of data or model.
• Geographic distribution of data as in edge computing; use of streaming (dynamic data) versus batch paradigms.
• Dynamic model structure as in some iterative algorithms.
[Figure: Big Data and Simulation Difficulty in Parallelism. Horizontal axis: size of synchronization constraints, from pleasingly parallel (often independent events) through MapReduce (as in scalable databases) and loosely coupled to tightly coupled; vertical axis: size of disk I/O. Labeled regions: parameter-sweep simulations and the current major Big Data category on commodity clouds; structured adaptive sparse problems and the largest-scale simulations on exascale supercomputers; global machine learning (e.g. parallel clustering) and deep learning on HPC clouds with accelerators and high-performance interconnect, where linear algebra is at the core (often not sparse); unstructured adaptive sparse problems such as graph analytics (e.g. subgraph mining) and LDA on HPC clouds/supercomputers, where memory access is also critical. These are just two problem characteristics; there is also the data/compute distribution seen in grid/edge computing.]
Application Nexus of HPC, Big Data, Simulation
Convergence Use Cases: Data and Model; NIST Collection, Big Data Ogres, Convergence Diamonds. https://bigdatawg.nist.gov/
Original Use Case Template
• 26 fields completed for 51 areas
• Government Operation: 4
• Commercial: 8
• Defense: 3
• Healthcare and Life Sciences: 10
• Deep Learning and Social Media: 6
• The Ecosystem for Research: 4
• Astronomy and Physics: 5
• Earth, Environmental and Polar Science: 10
• Energy: 1
• Security & Privacy enhanced version 2
• BDEC HPC-enhanced version
51 Detailed Use Cases: Contributed July-September 2013
Covers goals, data features such as the 3 V's, software, hardware; 26 features for each use case; biased to science.
• Government Operation (4): National Archives and Records Administration, Census Bureau
• Commercial (8): Finance in Cloud, Cloud Backup, Mendeley (citations), Netflix, Web Search, Digital Materials, Cargo Shipping (as in UPS)
• Defense (3): Sensors, Image Surveillance, Situation Assessment
• Healthcare and Life Sciences (10): Medical Records, Graph and Probabilistic Analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity Models, Biodiversity
• Deep Learning and Social Media (6): Driving Car, Geolocate Images/Cameras, Twitter, Crowd Sourcing, Network Science, NIST Benchmark Datasets
• The Ecosystem for Research (4): Metadata, Collaboration, Language Translation, Light Source Experiments
• Astronomy and Physics (5): Sky Surveys including comparison to simulation, Large Hadron Collider at CERN, Belle II Accelerator in Japan
• Earth, Environmental and Polar Science (10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice Sheet Radar Scattering, Earth Radar Mapping, Climate Simulation Datasets, Atmospheric Turbulence Identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET Gas Sensors
• Energy (1): Smart Grid
• Published by NIST as version 2 (https://bigdatawg.nist.gov/_uploadfiles/NIST.SP.1500-3r1.pdf) with a common set of 26 features recorded for each use case.
Part of Property Summary Table
51 Use Cases: What is Parallelism Over?
• People: either the users (but see below) or subjects of the application, and often both
• Decision makers like researchers or doctors (users of the application)
• Items such as images, EMR, sequences below; observations or contents of an online store
• Images or "electronic information nuggets"
• EMR: Electronic Medical Records (often similar to people parallelism)
• Protein or gene sequences
• Material properties, manufactured object specifications, etc., in custom datasets
• Modelled entities like vehicles and people
• Sensors: Internet of Things
• Events such as detected anomalies in telescope, credit card or atmospheric data
• (Complex) nodes in an RDF graph
• Simple nodes as in a learning network
• Tweets, blogs, documents, web pages, etc.
• And the characters/words in them
• Files or data to be backed up, moved or assigned metadata
• Particles/cells/mesh points as in parallel simulations
Sample Features of 51 Use Cases I
• PP (26) "All" Pleasingly Parallel or Map Only
• MR (18) Classic MapReduce (add MRStat below for full count)
• MRStat (7) Simple version of MR where the key computations are simple reductions, as found in statistical averages such as histograms and averages
• MRIter (23) Iterative MapReduce or MPI (Flink, Spark, Twister)
• Graph (9) Complex graph data structure needed in analysis
• Fusion (11) Integrate diverse data to aid discovery/decision making; could involve sophisticated algorithms or could just be a portal
• Streaming (41) Some data comes in incrementally and is processed this way
• Classify (30) Classification: divide data into categories
• S/Q (12) Index, Search and Query
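The MRStat pattern above, where the "model" is just simple reductions (counts, sums) over the data, can be sketched in map-reduce style. This is a hypothetical in-process mini-example, not a real framework API; the point is that the per-key merge is commutative and associative, which is why MRStat parallelizes so easily.

```python
from functools import reduce

data = [2.1, 3.7, 2.9, 8.4, 3.3, 8.9, 2.2]

# Map: each record emits a (bin, (count, sum)) pair.
mapped = [(int(x), (1, x)) for x in data]

# Reduce: merge pairs per key -- a simple reduction in MRStat style.
def merge(acc, kv):
    k, (c, s) = kv
    c0, s0 = acc.get(k, (0, 0.0))
    acc[k] = (c0 + c, s0 + s)
    return acc

stats = reduce(merge, mapped, {})
histogram = {k: c for k, (c, _) in stats.items()}
averages = {k: s / c for k, (c, s) in stats.items()}
print(histogram)   # {2: 3, 3: 2, 8: 2}
```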
Sample Features of 51 Use Cases II
• CF (4) Collaborative Filtering for recommender engines
• LML (36) Local Machine Learning (independent for each parallel entity); an application could have GML as well
• GML (23) Global Machine Learning: Deep Learning, Clustering, LDA, PLSI, MDS
• Large-scale optimizations as in Variational Bayes, MCMC, Lifted Belief Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg-Marquardt. Can call this EGO or Exascale Global Optimization, with scalable parallel algorithms
• Workflow (51) Universal
• GIS (16) Geotagged data, often displayed in ESRI, Microsoft Virtual Earth, Google Earth, GeoServer, etc.
• HPC (5) Classic large-scale simulation of cosmos, materials, etc., generating (visualization) data
• Agent (2) Simulations of models of data-defined macroscopic entities represented as agents
BDEC2 and NIST Use Cases
53 NIST Use Cases for Research Space I
• GOVERNMENT OPERATION
• 1: Census 2010 and 2000 (Title 13 Big Data)
• 2: NARA Accession, Search, Retrieve, Preservation
• 3: Statistical Survey Response Improvement
• 4: Non-Traditional Data in Statistical Survey Response Improvement (Adaptive Design)
1-4 are related to social-science survey problems and are "Classic Data+ML" with interesting algorithms (recommender engines) plus databases and important privacy issues, which are present in research cases.
• COMMERCIAL
• 5: Cloud Eco-System for Financial Industries NO
• 6: Mendeley: An International Network of Research
• 7: Netflix Movie Service NO
• 8: Web Search NO
• 9: Big Data Business Continuity and Disaster Recovery Within a Cloud Eco-System NO
• 10: Cargo Shipping Edge Computing NO
• 11: Materials Data for Manufacturing
• 12: Simulation-Driven Materials Genomics
6 is "Classic Data+ML" with text analysis (citation identification, topic models, etc.). 10 is DHL/FedEx/UPS and has no direct scientific analog; however, it is a good example of an edge computing system of a similar nature to the scientific research case. 11 and 12 are materials science, covered in the BDEC2 meeting.
53 NIST Use Cases for Research Space II
• DEFENSE
• 13: Cloud Large-Scale Geospatial Analysis and Visualization
• 14: Object Identification and Tracking from Wide-Area Large Format Imagery or Full Motion Video (Persistent Surveillance)
• 15: Intelligence Data Processing and Analysis
13-15 are very similar to disaster response problems. They involve extensive "Classic Data+ML" for sensor collections and/or image processing. GIS and spatial analysis are important, as in the BDEC2 Pathology and Spatial Imagery talk. The geospatial aspect of these applications means they are similar to the earth science examples.
• Color coding of use cases:
• NO means not similar to a research application
• Red means not relevant to BDEC2
• Orange means related to BDEC2 Bloomington presentations
• Black use cases are of relevance to BDEC2 but were not presented at Bloomington
• Purple text contains comments
Use Cases III: HEALTHCARE AND LIFE SCIENCES
• 16: Electronic Medical Record
• 17: Pathology Imaging/Digital Pathology
• 18: Computational Bioimaging
• 19: Genomic Measurements
• 20: Comparative Analysis for Metagenomes and Genomes
• 21: Individualized Diabetes Management
• 22: Statistical Relational Artificial Intelligence for Health Care
• 23: World Population-Scale Epidemiological Study
• 24: Social Contagion Modeling for Planning, Public Health, and Disaster Management
• 25: Biodiversity and LifeWatch
Comments on Use Cases III
• 16 and 22 are "classic data + ML + database" use cases using an important technique, understood by the community but not presented at BDEC2.
• 17 came originally from Saltz's group and was updated in his BDEC2 talk.
• 18 describes biology image processing from many instruments: microscopes, MRI and light sources. The latter was directly discussed at BDEC2 and the other instruments were implicit.
• 19 and 20 are well recognized as distributed Big Data problems with significant computing. They were represented by Chandrasekaran's presentation at BDEC2, which inevitably covered only part (gene assembly) of the problem.
• 21 relies on "classic data + graph analytics", which was not discussed at the BDEC2 meeting but is certainly actively pursued.
• 23 and 24 originally came from Marathe and were updated in his BDEC2 presentation on massive bio-social systems.
• 25 generalizes the BDEC2 talks by Taufer and Rahnemoonfar on ocean and land monitoring and sensor array analysis.
Use Cases IV: DEEP LEARNING AND SOCIAL MEDIA
• 26: Large-Scale Deep Learning
• 27: Organizing Large-Scale, Unstructured Collections of Consumer Photos NO
• 28: Truthy: Information Diffusion Research from Twitter Data
• 29: Crowd Sourcing in the Humanities as a Source for Big and Dynamic Data
• 30: CINET: Cyberinfrastructure for Network (Graph) Science and Analytics
• 31: NIST Information Access Division: Analytic Technology Performance Measurements, Evaluations, and Standards
• 26 on deep learning was covered in great depth at the BDEC2 meeting.
• 27 describes an interesting image processing challenge of geolocating multiple photographs, which so far is not directly related to scientific data analysis, although related image processing algorithms are certainly important.
• 28-30 are "classic data + ML" use cases with a focus on graph and text mining algorithms not covered in BDEC2 but certainly relevant to the process.
• 31 on benchmarking and standard datasets is related to the BigDataBench talk at the end of the BDEC2 meeting and Foster's talk on a model database.
53 NIST Use Cases for Research Space VI
• THE ECOSYSTEM FOR RESEARCH
• 32: DataNet Federation Consortium
• 33: The Discinnet Process NO
• 34: Semantic Graph Search on Scientific Chemical and Text-Based Data
• 35: Light Source Beamlines
32 covers data management with iRODS, which is well regarded by the community but not discussed in BDEC2. 33 is a teamwork approach that doesn't seem relevant to BDEC2. 34 is a "classic data+ML" use case with similar comments to 28-30. 35 was covered with more advanced deep learning algorithms in Yager's and Foster's BDEC2 talks.
• ASTRONOMY AND PHYSICS
• 36: Catalina Real-Time Transient Survey: A Digital, Panoramic, Synoptic Sky Survey
• 37: DOE Extreme Data from Cosmological Sky Survey and Simulations
• 38: Large Survey Data for Cosmology
• 39: Particle Physics: Analysis of Large Hadron Collider Data: Discovery of Higgs Particle
• 40: Belle II High Energy Physics Experiment
36 to 38 are "Classic Data+ML" astronomy use cases related to the BDEC2 SKA presentation, covering both archival and event detection cases. Use case 37 covers the integration of simulation data and observational data for astronomy, a topic covered in other cases at BDEC2. 39 and 40 are "Classic Data+ML" use cases for accelerator data analysis. This was not covered in BDEC2 but is currently the largest-volume scientific data analysis problem, whose importance and relevance are well understood.
Use Cases VII: EARTH, ENVIRONMENTAL, AND POLAR SCIENCE
• 41: European Incoherent Scatter Scientific Association 3D Incoherent Scatter Radar System (big radar instrument monitoring the atmosphere)
• 42: Common Operations of Environmental Research Infrastructure
• 43: Radar Data Analysis for the Center for Remote Sensing of Ice Sheets
• 44: Unmanned Air Vehicle Synthetic Aperture Radar (UAVSAR) Data Processing, Data Product Delivery, and Data Services
• 45: NASA Langley Research Center/Goddard Space Flight Center iRODS Federation Test Bed
• 46: MERRA Analytic Services (MERRA/AS) instrument
• 47: Atmospheric Turbulence: Event Discovery and Predictive Analytics imaging
• 48: Climate Studies Using the Community Earth System Model at the U.S. Department of Energy (DOE) NERSC Center
• 49: DOE Biological and Environmental Research (BER) Subsurface Biogeochemistry Scientific Focus Area
• 50: DOE BER AmeriFlux and FLUXNET Networks (sensor networks)
• 2-1: NASA Earth Observing System Data and Information System (EOSDIS) instrument
• 2-2: Web-Enabled Landsat Data (WELD) Processing instrument
Comments on Use Cases VII
• 41, 43 and 44 are "Classic Data+ML" use cases involving radar data from different instruments (specialized ground, vehicle/plane, satellite), not directly covered in BDEC2.
• 2-1 and 2-2 are use cases similar to 41, 43 and 44 but applied to EOSDIS and Landsat earth observations from satellites in multiple modalities.
• 42, 49 and 50 are "Classic Data+ML" environmental sensor arrays that extend the scope of the talks by Taufer and Rahnemoonfar at BDEC2. See also use case 25 above.
• 45 to 47 describe datasets from instruments and computations relevant to climate and weather. They relate to the BDEC2 talks by Denvil and Miyoshi. 47 discusses the correlation of aircraft turbulence reports with simulation datasets.
• 48 is data analytics and management associated with climate studies, as covered in the BDEC2 talk by Denvil.
Use Cases VIII: ENERGY
• 51: Consumption Forecasting in Smart Grids
51 is a different subproblem but in the same area as Pothen and Azad's talk on the electric power grid at BDEC2. This is a challenging edge computing problem, as there are a large number of distributed but correlated sensors.
• SC-18 BOF Application/Industry Perspective by David Keyes, King Abdullah University of Science and Technology (KAUST)
• https://www.exascale.org/bdec/sites/www.exascale.org.bdec/files/SC18_BDEC2_BoF-Keyes.pdf
This is a presentation by David Keyes on seismic imaging for oil discovery and exploitation. It is "Classic Data+ML" for an array of sonic sensors.
BDEC2 Use Cases I: Classic Observational Data plus ML
• BDEC2-1: M. Deegan, Big Data and Extreme Scale Computing, 2nd Series (BDEC2): Statement of Interest from the Square Kilometre Array Organisation (SKAO)
• Environmental Science
• BDEC2-2: M. Rahnemoonfar, Semantic Segmentation of Underwater Sonar Imagery based on Deep Learning
• BDEC2-3: M. Taufer, Cyberinfrastructure Tools for Precision Agriculture in the 21st Century
• Healthcare and Life Sciences
• BDEC2-4: J. Saltz, Multiscale Spatial Data and Deep Learning
• BDEC2-5: R. Stevens, Exascale Deep Learning for Cancer
• BDEC2-6: S. Chandrasekaran, Development of a parallel algorithm for whole genome alignment for rapid delivery of personalized genomics
• BDEC2-7: M. Marathe, Pervasive, Personalized and Precision (P3) analytics for massive bio-social systems
Instruments include satellites, UAVs, sensors (see edge examples), light sources (X-ray, MRI, microscope, etc.), telescopes, accelerators, tokamaks (fusion), and computers (as in Control, Simulation, Data, ML Integration).
BDEC2 Use Cases II: Control, Simulation, Data, ML Integration
• BDEC2-8: W. Tang, New Models for Integrated Inquiry: Fusion Energy Exemplar
• BDEC2-9: O. Beckstein, Convergence of data generation and analysis in the biomolecular simulation community
• BDEC2-10: S. Denvil, From the production to the analysis phase: new approaches needed in climate modeling
• BDEC2-11: T. Miyoshi, Prediction Science: The 5th Paradigm Fusing the Computational Science and Data Science (weather forecasting)
See also the Marathe and Stevens talks, and the instruments under Classic Observational Data plus ML.
• Materials Science
• BDEC2-12: K. Yager, Autonomous Experimentation as a Paradigm for Materials Discovery
• BDEC2-13: L. Ward, Deep Learning, HPC, and Data for Materials Design
• BDEC2-14: J. Ahrens, A vision for a validated distributed knowledge base of material behavior at extreme conditions using the Advanced Cyberinfrastructure Platform
• BDEC2-15: T. Deutsch, Digital transition of Material Nano-Characterization
Comments on Control, Simulation, Data, ML Integration
• Simulations often involve outside data but always involve inside data (from the simulation itself). Fields covered include materials (nano), climate, weather, biomolecular, and virtual tissues (no use case written up).
• We can see ML wrapping simulations to achieve many goals. ML replaces functions and/or ML guides functions:
• Initial conditions
• Boundary conditions
• Data assimilation
• Configuration: blocking, use of cache, etc.
• Steering and control
• Support for multi-scale
• ML learns from previous simulations and so can predict function calls
• Digital twins are a commercial link between simulation and systems.
• There are fundamental simulations covered by the laws of physics and, increasingly, complex-system simulations with bio (tissue) or social entities.
BDEC2 Use Cases III: Edge Computing
• Smart City and Related Edge Applications
• BDEC2-16: P. Beckman, Edge to HPC Cloud
• BDEC2-17: G. Ricart, Smart Community CyberInfrastructure at the Speed of Life
• BDEC2-18: T. El-Ghazawi, Convergence of AI, Big Data, Computing and IoT (ABCI): Smart City as an Application Driver and Virtual Intelligence Management (VIM)
• BDEC2-19: M. Kondo, The Challenges and Opportunities of BDEC Systems for Smart Cities
• Other Edge Applications
• BDEC2-20: A. Pothen, High-End Data Science and HPC for the Electrical Power Grid
• BDEC2-21: J. Qiu, Real-Time Anomaly Detection from Edge to HPC-Cloud
There are correlated edge devices, such as the power grid and nearby vehicles (racing, road), and also largely independent edge devices interacting via databases, such as surveillance cameras.
BDEC Use Cases IV
• BDEC Ecosystem
• BDEC2-22: I. Foster, Learning Systems for Deep Science
• BDEC2-23: W. Gao, BigDataBench: A Scalable and Unified Big Data and AI Benchmark Suite
• Image-based Applications
• One cross-cutting theme is understanding generalized (light, sound, other sensors such as temperature, chemistry, moisture) images with 2D or 3D spatial and time dependence.
• Modalities include radar, MRI, microscopes, surveillance and other cameras, X-ray scattering, UAV-hosted sensors, and related non-optical sensor networks as in agriculture, wildfires, disaster monitoring and oil exploration. GIS and geospatial properties are often relevant.
NIST Generic Data Processing Use Cases
10 Generic Data Processing Use Cases
1) Multiple users performing interactive queries and updates on a database with basic availability and eventual consistency (BASE = Basically Available, Soft state, Eventual consistency, as opposed to ACID = Atomicity, Consistency, Isolation, Durability)
2) Perform real-time analytics on data source streams and notify users when specified events occur
3) Move data from external data sources into a highly horizontally scalable data store, transform it using highly horizontally scalable processing (e.g. MapReduce), and return it to the horizontally scalable data store (ELT, Extract Load Transform)
4) Perform batch analytics on the data in a highly horizontally scalable data store using highly horizontally scalable processing (e.g. MapReduce) with a user-friendly interface (e.g. SQL-like)
5) Perform interactive analytics on data in an analytics-optimized database, with 5A) Science variant
6) Visualize data extracted from a horizontally scalable Big Data store
7) Move data from a highly horizontally scalable data store into a traditional Enterprise Data Warehouse (EDW)
8) Extract, process, and move data from data stores to archives
9) Combine data from cloud databases and on-premise data stores for analytics, data mining, and/or machine learning
10) Orchestrate multiple sequential and parallel data transformations and/or analytic processing using a workflow manager
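Generic use case 2 above (real-time stream analytics with notification) can be sketched minimally. The stream, threshold, window length and `notify()` below are all hypothetical placeholders; a real deployment would use a streaming engine, but the event-detection logic has the same shape.

```python
def notify(msg):
    """Placeholder notification hook (print instead of alerting users)."""
    print("ALERT:", msg)

def detect_events(stream, threshold=100.0, window=5):
    """Keep a sliding window over the stream; notify whenever the
    window average exceeds the threshold. Returns the alert count."""
    buf = []
    alerts = 0
    for value in stream:
        buf.append(value)
        if len(buf) > window:
            buf.pop(0)                       # slide the window
        if len(buf) == window and sum(buf) / window > threshold:
            notify(f"window average {sum(buf)/window:.1f} > {threshold}")
            alerts += 1
    return alerts

readings = [90, 95, 98, 99, 97, 150, 160, 155, 140, 120]
print(detect_events(readings))   # 5 windows exceed the threshold
```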
Access Pattern 5A: Perform interactive analytics on observational scientific data
[Figure: scientific data is recorded "in the field" with local accumulation and initial computing, then moved by direct transfer or batch transport to the primary analysis data system: Grid or Many-Task software, Hadoop, Spark, Giraph, Pig, ...; data storage in HDFS, HBase or file collections; science analysis code, Mahout, R; streaming Twitter data for social networking. Examples include the LHC, remote sensing, astronomy and bioinformatics.]
PolarGrid
Lightweight cyberinfrastructure to support mobile data-gathering expeditions plus classic central resources (as a cloud).
BATCH MODE
Tracking the Heavens
"The Universe is now being explored systematically, in a panchromatic way, over a range of spatial and temporal scales that lead to a more complete, and less biased understanding of its constituents, their evolution, their origins, and the physical processes governing them."
(Towards a National Virtual Observatory)
Hubble Telescope, Palomar Telescope, Sloan Telescope
DISTRIBUTED LARGE INSTRUMENT MODE
Other Use-case Collections
7 Computational Giants of the NRC Massive Data Analysis Report
1) G1: Basic statistics, e.g. MRStat
2) G2: Generalized N-body problems
3) G3: Graph-theoretic computations
4) G4: Linear algebraic computations
5) G5: Optimizations, e.g. linear programming
6) G6: Integration, e.g. LDA and other GML
7) G7: Alignment problems, e.g. BLAST
http://www.nap.edu/catalog.php?record_id=18374
Big Data Models?
HPC (Simulation) Benchmark Classics
• Linpack or HPL: parallel LU factorization for the solution of linear equations; HPCG
• NPB version 1: mainly classic HPC solver kernels
• MG: Multigrid
• CG: Conjugate Gradient
• FT: Fast Fourier Transform
• IS: Integer Sort
• EP: Embarrassingly Parallel
• BT: Block Tridiagonal
• SP: Scalar Pentadiagonal
• LU: Lower-Upper symmetric Gauss-Seidel
Simulation Models
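The kernel Linpack/HPL measures is LU factorization used to solve Ax = b. Here is a pure-Python toy of Gaussian elimination with partial pivoting; real HPL is blocked, distributed and highly tuned, so this sketch only illustrates the numerical kernel, not the benchmark.

```python
def lu_solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting
    (the LU kernel that HPL benchmarks, in naive form)."""
    n = len(A)
    A = [row[:] for row in A]     # work on copies
    b = b[:]
    for k in range(n):
        # partial pivoting: bring the largest remaining pivot up
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        # eliminate column k below the pivot
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    # back substitution
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

print(lu_solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))  # [0.8, 1.4]
```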
13 Berkeley Dwarfs
1) Dense Linear Algebra
2) Sparse Linear Algebra
3) Spectral Methods
4) N-Body Methods
5) Structured Grids
6) Unstructured Grids
7) MapReduce
8) Combinational Logic
9) Graph Traversal
10) Dynamic Programming
11) Backtrack and Branch-and-Bound
12) Graphical Models
13) Finite State Machines
The first 6 of these correspond to Colella's original list (classic simulations); Monte Carlo was dropped, and N-body methods are a subset of "Particle" in Colella.
Note this is a little inconsistent, in that MapReduce is a programming model while spectral methods are a numerical method. We need multiple facets to classify use cases!
Largely Models for Data or Simulation
Classifying Use Cases
Classifying Use Cases
• The Big Data Ogres built on a collection of 51 big data use cases gathered by the NIST Public Working Group, where 26 properties were gathered for each application.
• This information was combined with other studies including the Berkeley dwarfs, the NAS parallel benchmarks and the Computational Giants of the NRC Massive Data Analysis Report.
• The Ogre analysis led to a set of 50 features divided into four views that could be used to categorize and distinguish between applications.
• The four views are Problem Architecture (Macropattern); Execution Features (Micropatterns); Data Source and Style; and finally the Processing View or runtime features.
• We generalized this approach to integrate Big Data and Simulation applications into a single classification, looking separately at Data and Model, with the total facets growing to 64 in number, called convergence diamonds, and split between the same 4 views.
• A mapping of facets into work of the SPIDAL project has been given.
[Figure: “4 Ogre Views and 50 Facets” diagram. Problem Architecture View: Pleasingly Parallel, Classic MapReduce, Map-Collective, Map Point-to-Point, Map Streaming, Shared Memory, Single Program Multiple Data, Bulk Synchronous Parallel, Fusion, Dataflow, Agents, Workflow. Execution View: Performance Metrics; Flops per Byte; Memory I/O; Execution Environment; Core libraries; Volume; Velocity; Variety; Veracity; Communication Structure; Data Abstraction; Metric = M / Non-Metric = N; O(N^2) = NN / O(N) = N; Regular = R / Irregular = I; Dynamic = D / Static = S; Iterative/Simple. Data Source and Style View: Geospatial Information System, HPC Simulations, Internet of Things, Metadata/Provenance, Shared/Dedicated/Transient/Permanent, Archived/Batched/Streaming, HDFS/Lustre/GPFS, Files/Objects, Enterprise Data Model, SQL/NoSQL/NewSQL. Processing View: Micro-benchmarks, Local Analytics, Global Analytics, Base Statistics, Recommendations, Search/Query/Index, Classification, Learning, Optimization Methodology, Streaming, Alignment, Linear Algebra Kernels, Graph Algorithms, Visualization.]
The original 50 Ogres in 4 views
• Processing
• Data Source and Style
• Problem Architecture (metapatterns)
• Execution (micropatterns)
[Figure: “Convergence Diamonds Views and Facets” — the Ogre diagram extended to separate Data (D) and Model (M) facets; facets are labeled as Data, Model, or Both. Problem Architecture View: Pleasingly Parallel, Classic MapReduce, Map-Collective, Map Point-to-Point, Map Streaming, Shared Memory, Single Program Multiple Data, Bulk Synchronous Parallel, Fusion, Dataflow, Agents, Workflow. Execution View: Performance Metrics; Flops per Byte / Memory IO / Flops per watt; Execution Environment and Core libraries; Data Volume (D4) and Model Size (M4); Data Velocity (D5); Data Variety (D6) and Model Variety (M6); Veracity; Communication Structure (M8); Dynamic = D / Static = S (D9, M9); Regular = R / Irregular = I (D10, M10); Iterative/Simple (M11); Data Abstraction (D12) and Model Abstraction (M12); Data Metric = M / Non-Metric = N (D13, M13); O(N^2) = NN / O(N) = N (M14). Data Source and Style View: Geospatial Information System, HPC Simulations, Internet of Things, Metadata/Provenance, Shared/Dedicated/Transient/Permanent, Archived/Batched/Streaming (S1–S5), HDFS/Lustre/GPFS, Files/Objects, Enterprise Data Model, SQL/NoSQL/NewSQL. Processing View, Big Data Processing Diamonds: Micro-benchmarks, Local and Global (Analytics/Informatics/Simulations), Recommender Engine, Base Data Statistics, Data Search/Query/Index, Data Classification, Learning, Optimization Methodology, Streaming Data Algorithms, Data Alignment, Linear Algebra Kernels (many subclasses), Graph Algorithms, Core Libraries, Visualization. Processing View, Simulation (Exascale) Processing Diamonds: Multiscale Method, Iterative PDE Solvers, Nature of mesh if used, Evolution of Discrete Systems, Particles and Fields, N-body Methods, Spectral Methods. Categories span Simulations, Analytics (Model for Big Data), and Both.]
64 Features in 4 views for Unified Classification of Big Data and Simulation Applications
Convergence Diamonds and their 4 Views I
• One view is the overall problem architecture or macropatterns, which is naturally related to the machine architecture needed to support the application.
  • Unchanged from Ogres and describes properties of the problem such as “Pleasingly Parallel” or “Uses Collective Communication”.
• The execution (computational) features or micropatterns view describes issues such as I/O versus compute rates, iterative nature and regularity of computation, and the classic V’s of Big Data: defining problem size, rate of change, etc.
  • Significant changes from Ogres to separate Data and Model and add characteristics of Simulation models, e.g. both model and data have “V’s”: Data Volume, Model Size.
  • e.g. O(N²) algorithm relevant to big data or big simulation model
Convergence Diamonds and their 4 Views II
• The data source & style view includes facets specifying how the data is collected, stored and accessed. Has classic database characteristics.
  • Simulations can have facets here to describe input or output data.
  • Examples: Streaming, files versus objects, HDFS v. Lustre
• The processing view has model (not data) facets which describe types of processing steps, including the nature of algorithms and kernels by model, e.g. Linear Programming, Learning, Maximum Likelihood, Spectral methods, Mesh type.
  • A mix of the Big Data Processing View and Big Simulation Processing View; includes some facets like “uses linear algebra” needed in both. Has specifics of key simulation kernels, and in particular includes facets seen in the NAS Parallel Benchmarks and Berkeley Dwarfs.
• Instances of Diamonds are particular problems, and a set of Diamond instances that cover enough of the facets could form a comprehensive benchmark/mini-app set.
• Diamonds and their instances can be atomic or composite.
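One concrete way to picture “a set of Diamond instances that cover enough of the facets” is to record each instance as facet values across the four views and take the union of what a candidate benchmark set covers. The record below is a hypothetical illustration (field names and facet strings are mine, not a defined schema):

```python
# Hypothetical facet record for one convergence-diamond instance.
kmeans_instance = {
    "name": "K-means clustering",
    "problem_architecture": ["Map-Collective", "Bulk Synchronous Parallel"],
    "execution": {"model_size": "10K centroids", "iterative": True,
                  "regular": True, "flops_per_byte": "high"},
    "data_source_style": ["Batched", "HDFS"],
    "processing": ["Global Analytics", "Optimization Methodology"],
}

def facet_coverage(instances, view):
    """Union of facets a set of instances covers within one view."""
    covered = set()
    for inst in instances:
        v = inst[view]
        covered.update(v if isinstance(v, list) else v.keys())
    return covered

covered = facet_coverage([kmeans_instance], "processing")
# covered = {"Global Analytics", "Optimization Methodology"}
```

A benchmark suite would then be judged by how much of each view its instances collectively cover.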
Programming Environment for Global AI and Modeling Supercomputer GAIMSC
http://www.iterativemapreduce.org/
Ways of adding High Performance to Global AI (and Modeling) Supercomputer
• Fix performance issues in Spark, Heron, Hadoop, Flink etc.
  • Messy, as some features of these big data systems are intrinsically slow in some (not all) cases
  • All these systems are “monolithic”, and it is difficult to deal with individual components
• Execute HPBDC from a classic big data system with a custom communication environment – approach of Harp for the relatively simple Hadoop environment
• Provide a native Mesos/Yarn/Kubernetes/HDFS high performance execution environment with all capabilities of Spark, Hadoop and Heron – goal of Twister2
• Execute with MPI in a classic (Slurm, Lustre) HPC environment
• Add modules to existing frameworks like Scikit-Learn or TensorFlow, either as new capability or as a higher performance version of an existing module.
GAIMSC Programming Environment Components I
Area | Component | Implementation | Comments: User API
Architecture Specification | Coordination Points | State and Configuration Management; Program, Data and Message Level | Change execution mode; save and reset state
Architecture Specification | Execution Semantics | Mapping of Resources to Bolts/Maps in Containers, Processes, Threads | Different systems make different choices - why?
Architecture Specification | Parallel Computing | Spark, Flink, Hadoop, Pregel, MPI modes | Owner Computes Rule
Job Submission | (Dynamic/Static) Resource Allocation | Plugins for Slurm, Yarn, Mesos, Marathon, Aurora | Client API (e.g. Python) for Job Management
Task System | Task migration | Monitoring of tasks and migrating tasks for better resource utilization | Task-based programming with Dynamic or Static Graph API; FaaS API; Support accelerators (CUDA, FPGA, KNL)
Task System | Elasticity | OpenWhisk |
Task System | Streaming and FaaS Events | Heron, OpenWhisk, Kafka/RabbitMQ |
Task System | Task Execution | Process, Threads, Queues |
Task System | Task Scheduling | Dynamic Scheduling, Static Scheduling, Pluggable Scheduling Algorithms |
Task System | Task Graph | Static Graph, Dynamic Graph Generation |
GAIMSC Programming Environment Components II
Area | Component | Implementation | Comments
Communication API | Messages | Heron | This is user level and could map to multiple communication systems
Communication API | Dataflow Communication | Fine-Grain Twister2 Dataflow communications: MPI, TCP and RMA; Coarse grain Dataflow from NiFi, Kepler? | Streaming, ETL data pipelines; Define new Dataflow communication API and library
Communication API | BSP Communication Map-Collective | Conventional MPI, Harp | MPI Point to Point and Collective API
Data Access | Static (Batch) Data | File Systems, NoSQL, SQL | Data API
Data Access | Streaming Data | Message Brokers, Spouts | Data API
Data Management | Distributed Data Set | Relaxed Distributed Shared Memory (immutable data), Mutable Distributed Data | Data Transformation API; Spark RDD, Heron Streamlet
Fault Tolerance | Check Pointing | Upstream (streaming) backup; Lightweight; Coordination Points; Spark/Flink, MPI and Heron models | Streaming and batch cases distinct; Crosses all components
Security | Storage, Messaging, execution | Research needed | Crosses all Components
Integrating HPC and Apache Programming Environments
• Harp-DAAL with a kernel Machine Learning library exploiting the Intel node library DAAL and HPC communication collectives within the Hadoop ecosystem.
  • Harp-DAAL supports all 5 classes of data-intensive AI-first computation, from pleasingly parallel to machine learning and simulations.
• Twister2 is a toolkit of components that can be packaged in different ways
  • Integrated batch or streaming data capabilities familiar from Apache Hadoop, Spark, Heron and Flink but with high performance.
  • Separate bulk synchronous and dataflow communication;
  • Task management as in Mesos, Yarn and Kubernetes
  • Dataflow graph execution models
  • Launching of the Harp-DAAL library with native Mesos/Kubernetes/HDFS environment
  • Streaming and repository data access interfaces,
  • In-memory databases and fault tolerance at dataflow nodes (use RDD to do classic checkpoint-restart)
Runtime software for Harp
[Figure: the Map-Collective runtime merges MapReduce and HPC, providing collectives: broadcast, reduce, allreduce, allgather, regroup, rotate, push & pull.]
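The semantics of the allreduce collective shown above (every worker ends holding the same reduction of all workers' inputs) can be simulated in a few lines of plain Python. This is only a sketch of the operation's contract, not Harp's Java API:

```python
def allreduce(partitions, op):
    """Reduce element-wise across workers, then broadcast:
    every worker ends with the same combined vector."""
    total = [op(values) for values in zip(*partitions)]
    return [list(total) for _ in partitions]   # each worker gets a copy

# Three simulated workers, each holding a local length-2 vector
workers = [[1, 2], [3, 4], [5, 6]]
result = allreduce(workers, sum)
# every worker now holds [9, 12]
```

Real implementations avoid the central gather by using ring or tree algorithms, which is exactly where the HPC collectives outperform naive MapReduce shuffles.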
Harp v. Spark, Harp v. Torch, Harp v. MPI
• Harp v. Spark: Datasets of 5 million points, 10 thousand centroids, 10 feature dimensions; 10 to 20 nodes of Intel KNL7250 processors; Harp-DAAL has 15x speedups over Spark MLlib.
• Harp v. Torch: Datasets of 500K or 1 million data points of feature dimension 300; running on a single KNL7250 (Harp-DAAL) vs. a single K80 GPU (PyTorch); Harp-DAAL achieves 3x to 6x speedups.
• Harp v. MPI: Twitter dataset with 44 million vertices, 2 billion edges, subgraph templates of 10 to 12 vertices; 25 nodes of Intel Xeon E5 2670; Harp-DAAL has 2x to 5x speedups over state-of-the-art MPI-Fascia solution.
Twister2 Dataflow Communications
• Twister:Net offers two communication models
  • BSP (Bulk Synchronous Processing) message-level communication using TCP or MPI, separated from its task management, plus extra Harp collectives
  • DFW, a new Dataflow library built using MPI software but at data movement, not message, level
    • Non-blocking
    • Dynamic data sizes
    • Streaming model
• Batch case is modeled as a finite stream
• The communications are between a set of tasks in an arbitrary task graph
  • Key based communications
  • Data-level communications spilling to disks
  • Target tasks can be different from source tasks
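The "batch case is modeled as a finite stream" idea can be sketched with a generator: the same windowed operator serves both modes, and a batch job is simply the case where the stream runs dry. `windowed_sum` is a hypothetical helper for illustration, not a Twister2 API:

```python
import itertools

def windowed_sum(stream, window):
    """Consume a (possibly unbounded) stream in fixed-size windows,
    emitting one aggregate per window."""
    while True:
        chunk = list(itertools.islice(stream, window))
        if not chunk:      # finite stream exhausted: the batch job completes
            return
        yield sum(chunk)

# Batch case: a finite stream of 10 records, window of 4
batch = iter(range(10))
sums = list(windowed_sum(batch, 4))
# sums = [6, 22, 17]
```

With an unbounded source the same generator never returns and behaves as a streaming operator, which is why unifying the two cases simplifies the runtime.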
Latency of Apache Heron and Twister:Net DFW (Dataflow) for Reduce, Broadcast and Partition operations in 16 nodes with 256-way parallelism
Twister:Net and Apache Heron and Spark. Left: K-means job execution time on 16 nodes with varying centers, 2 million points with 320-way parallelism. Right: K-means with 4, 8 and 16 nodes where each node has 20 tasks; 2 million points with 16000 centers used.
Intelligent Dataflow Graph
• The dataflow graph specifies the distribution and interconnection of job components
  • Hierarchical and Iterative
  • Allow ML wrapping of component at each dataflow node
• Checkpoint after each node of the dataflow graph
  • Natural synchronization point
  • Lets the user choose when to checkpoint (not every stage)
  • Saves state as user specifies; Spark just saves Model state, which is insufficient for complex algorithms
• Intelligent nodes support customization of checkpointing, ML, communication
• Nodes can be coarse (large jobs) or fine grain, requiring different actions
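The user-selected checkpointing described above can be sketched for a linear dataflow graph as follows; all names (`run_dataflow`, `checkpoint_at`) are hypothetical illustrations, not the Twister2 API:

```python
import copy

def run_dataflow(stages, data, checkpoint_at=()):
    """Run a linear dataflow graph, snapshotting full state after the
    nodes the user selects. Saving complete state (not just a model)
    is what allows restart of complex algorithms mid-graph."""
    checkpoints = {}
    for name, fn in stages:
        data = fn(data)
        if name in checkpoint_at:   # user-chosen synchronization point
            checkpoints[name] = copy.deepcopy(data)
    return data, checkpoints

stages = [("clean", lambda xs: [x for x in xs if x is not None]),
          ("scale", lambda xs: [2 * x for x in xs]),
          ("model", lambda xs: sum(xs))]
out, cps = run_dataflow(stages, [1, None, 3], checkpoint_at={"scale"})
# out = 8, cps = {"scale": [2, 6]}
```

Checkpointing only at `"scale"` mirrors the slide's point: every node is a natural synchronization point, but the user decides which ones are worth the cost of persisting.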
Dataflow at Different Grain sizes
Coarse Grain Dataflow links jobs in a pipeline: Data preparation → Clustering → Dimension Reduction → Visualization.
But internally to each job you can also elegantly express the algorithm as dataflow (internal execution dataflow nodes with HPC communication and iteration), with more stringent performance constraints:
  P = loadPoints()
  C = loadInitCenters()
  for (int i = 0; i < 10; i++) {
    T = P.map().withBroadcast(C)
    C = T.reduce()
  }
Corresponding to classic Spark K-means Dataflow
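The Spark-style K-means loop above (map over points with broadcast centers, reduce to new centers) can be mirrored in plain Python to make the dataflow structure explicit. This is a single-process sketch with hypothetical helper names, not Spark or Twister2 code, using 1-D points for brevity:

```python
def assign(point, centers):
    """The "map" step: each point reads the broadcast centers and
    tags itself with the index of the nearest one."""
    return min(range(len(centers)), key=lambda k: abs(point - centers[k]))

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # Map with broadcast: (center index, point) pairs
        tagged = [(assign(p, centers), p) for p in points]
        # Reduce: each new center is the mean of its assigned points
        centers = [sum(p for k, p in tagged if k == c) /
                   max(1, sum(1 for k, _ in tagged if k == c))
                   for c in range(len(centers))]
    return centers

points = [1.0, 1.1, 0.9, 9.0, 9.1, 8.9]
centers = kmeans(points, [0.0, 5.0])
# converges to centers near [1.0, 9.0]
```

In the distributed version the `map` runs partitioned across workers and the `reduce` is a collective; the iteration over a broadcast variable is exactly the pattern the internal fine-grain dataflow has to execute efficiently.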
NiFi Coarse-grain Workflow
Futures: Implementing Twister2 for Global AI and Modeling Supercomputer
http://www.iterativemapreduce.org/
Twister2 Timeline: Current Release (End of September 2018)
• Twister:Net Dataflow Communication API
  • Dataflow communications with MPI or TCP
• Data access
  • Local File Systems
  • HDFS Integration
• Task Graph
• Streaming and Batch analytics – Iterative jobs
• Data pipelines
• Deployments on Docker, Kubernetes, Mesos (Aurora), Slurm
• Harp for Machine Learning (Custom BSP Communications)
  • Rich collectives
  • Around 30 ML algorithms
Twister2 Timeline: January 2019
• DataSet API similar to Spark batch and Heron streaming with Tset realization
  • Can use Tsets for writing RDD/Streamlet style data sets
• Fault tolerance as in Heron and Spark
• Storm API for Streaming
• Hierarchical Dynamic Heterogeneous Task Graph
  • Coarse grain and fine grain dataflow
  • Cyclic task graph execution
• Dynamic scaling of resources and heterogeneous resources (at the resource layer) for streaming and heterogeneous workflow
• Link to Pilot Jobs
Twister2 Timeline: July 1, 2019
• Naiad model based Task system for Machine Learning
• Native MPI integration to Mesos, Yarn
• Dynamic task migrations
• RDMA and other communication enhancements
• Integrate parts of Twister2 components as big data systems enhancements (i.e. run current Big Data software invoking Twister2 components)
  • Heron (easiest), Spark, Flink, Hadoop (like Harp today)
  • Tsets become compatible with RDD (Spark) and Streamlet (Heron)
• Support different APIs (i.e. run Twister2 looking like current Big Data Software): Hadoop, Spark (Flink), Storm
• Refinements like Marathon with Mesos etc.
• Function as a Service and Serverless
• Support higher level abstractions
  • Twister:SQL (major Spark use case)
  • Graph API
Conclusions
• Can make use case collections to motivate benchmarks
  • NIST and BDEC have templates
  • Could helpfully fill in templates for benchmarks
• Research applications have some similarities but many differences from commercial use cases
• Increasing importance of integration of simulation and Machine Learning
• Increasing importance of distributed Edge applications
• Should benchmark dataflow and BSP style communication
• Twister2 will combine Heron and Spark with built-in HPC performance