Digital Science Center
Big Data Benchmarking: Applications and Systems
Geoffrey Fox, December 10, 2018
2018 International Symposium on Benchmarking, Measuring and Optimizing (Bench'18) at IEEE Big Data 2018, Dec 10 - Dec 11, 2018, Seattle, WA, USA
Digital Science Center, Indiana University
[email protected], http://www.dsc.soic.indiana.edu/, http://spidal.org/
12/29/18



Page 2

Big Data and Extreme-scale Computing http://www.exascale.org/bdec/

• BDEC Pathways to Convergence Report

• New series BDEC2 "Common Digital Continuum Platform for Big Data and Extreme Scale Computing" with first meeting November 28-30, 2018, Bloomington, Indiana, USA (focus on applications).

• Working groups on platform (technology), applications, community building
• BigDataBench presented a white paper

• Next meetings: February 19-21, Kobe, Japan (focus on platform), followed by two in Europe, one in the USA and one in China.

http://www.exascale.org/bdec/sites/www.exascale.org.bdec/files/whitepapers/bdec2017pathways.pdf

Page 3

Benchmarks should mimic use cases? Need to collect use cases?

Can classify use cases and benchmarks along several different dimensions.

Page 4

Software: MIDAS HPC-ABDS

NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science

Ogres Application Analysis

HPC-ABDS and HPC-FaaS Software; Harp and Twister2 Building Blocks

SPIDAL Data Analytics Library

Page 5

My View of System GAIMSC

Page 6

Systems Challenges for GAIMSC

• Microsoft noted we are collectively building the Global AI Supercomputer.
• Generalize by adding modeling.
• The architecture of the Global AI and Modeling Supercomputer (GAIMSC) must reflect:

• Global captures the need to mash up services from many different sources;
• AI captures the incredible progress in machine learning (ML);
• Modeling captures both traditional large-scale simulations and the models and digital twins needed for data interpretation;

• Supercomputer captures that everything is huge and needs to be done quickly, and often in real time for streaming applications.

• The GAIMSC includes an intelligent HPC cloud linked via an intelligent HPC fog to an intelligent HPC edge. We consider this distributed environment as a set of computational and data-intensive nuggets swimming in an intelligent aether.

• We will use a dataflow graph to define a mesh in the aether.

Page 7

Global AI and Modeling Supercomputer GAIMSC

• There is only a cloud at the logical center, but it is physically distributed and owned by a few major players.

• There is a very distributed set of devices surrounded by local fog computing; this forms the logically and physically distributed edge.

• The edge is structured and largely data.
• These are two differences from the Grid of the past.
• E.g., a self-driving car will have its own fog and will not share a fog with the truck it is about to collide with.

• The cloud and edge will both be very heterogeneous, with varying accelerators, memory sizes and disk structures.

• GAIMSC requires parallel computing to achieve high performance on large ML and simulation nuggets, and distributed-systems technology to build the aether and support the distributed but connected nuggets.

• In the latter respect, the intelligent aether mimics a grid, but it is a data grid where the computations are typically those associated with data (often from edge devices).

• So unlike the distributed simulation supercomputer that was often studied in previous grids, GAIMSC is a supercomputer aimed at very different data-intensive, AI-enriched problems.

Page 8

GAIMSC Global AI & Modeling Supercomputer Questions

• What do we gain from the concept? E.g., the ability to work with the Big Data community.
• What do we lose from the concept? E.g., everything runs as slow as Spark.
• Is GAIMSC useful for the BDEC2 initiative? For NSF? For DoE? For universities? For industry? For users?
• Does adding modeling to the concept add value?
• What are the research issues for GAIMSC? E.g., how to program it?
• What can we do with GAIMSC that we couldn't do with classic Big Data technologies?
• What can we do with GAIMSC that we couldn't do with classic HPC technologies?
• Are there deep or important issues associated with the "Global" in GAIMSC?
• Is the concept of an auto-tuned Global AI and Modeling Supercomputer scary?

Page 9

Integration of Data and Model Functions with ML Wrappers in GAIMSC

• There is a rapid increase in the integration of ML and simulations.
• ML can analyze results, guide the execution, and set up initial configurations (auto-tuning). This is equally true for AI itself: the GAIMSC will use itself to optimize its execution for both analytics and simulations.

• In principle, every transfer of control (a job or function invocation, a link from device to the fog/cloud) should pass through an AI wrapper that learns from each call and can decide both whether the call needs to be executed (maybe we have learned the answer already and need not compute it) and how to optimize the call if it really does need to be executed.

• The digital continuum proposed by BDEC2 is an intelligent aether learning from and informing the interconnected computational actions that are embedded in the aether.

• Implementing the intelligent aether embracing and extending the edge, fog, and cloud is a major research challenge where bold new ideas are needed!

• We need to understand how to make it easy to automatically wrap every nugget with ML.
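The AI-wrapper idea above can be sketched in a few lines. This is a toy illustration, not the GAIMSC design: an exact call cache stands in for the learned model that would decide whether a call must be executed, and the names `AIWrapper` and `expensive_nugget` are invented for the example.

```python
import functools

class AIWrapper:
    """Toy stand-in for the AI wrapper idea: remember past calls and skip
    execution when the answer has effectively been learned already. A real
    wrapper would use an ML surrogate model, not an exact cache, and could
    also auto-tune how the call is executed."""

    def __init__(self, func):
        self.func = func
        self.history = {}          # (args) -> result: our "learned" answers
        functools.update_wrapper(self, func)

    def __call__(self, *args):
        if args in self.history:   # decide the call need not be executed
            return self.history[args]
        result = self.func(*args)  # otherwise run it (auto-tuning would go here)
        self.history[args] = result
        return result

@AIWrapper
def expensive_nugget(x):
    return x * x  # placeholder for a costly simulation or analytics nugget

expensive_nugget(3)  # executed
expensive_nugget(3)  # answered from learned history, not recomputed
```

Any transfer of control decorated this way passes through the wrapper, matching the slide's "maybe we have learned the answer already" case.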


Page 10

Implementing the GAIMSC

• The new MIDAS middleware designed in SPIDAL has been engineered to support high-performance technologies and yet preserve the key features of the Apache Big Data software.

• MIDAS seems well suited to building the prototype intelligent high-performance aether.

• Note this will mix many relatively small nuggets with AI wrappers, generating parallelism from the number of nuggets and not internally to the nugget and its wrapper.

• However, there will also be large global jobs requiring internal parallelism for individual large-scale machine learning or simulation tasks.

• Thus parallel computing and distributed systems (grids) must be linked in a deep fashion, although the key parallel computing ideas needed for ML are closely related to those already developed for simulations.

Page 11

Underlying HPC-Big Data Convergence Issues

Page 12

• Need to discuss Data and Model, as problems have both intermingled; but we can get insight by separating them, which allows a better understanding of Big Data-Big Simulation "convergence" (or differences!).

• The Model is a user construction: it has a "concept" and parameters, and gives results determined by the computation. We use the term "model" in a general fashion to cover all of these.

• Big Data problems can be broken up into Data and Model.
• For clustering, the model parameters are the cluster centers, while the data is the set of points to be clustered.

• For queries, the model is the structure of the database and the results of the query, while the data is the whole database queried and the SQL query.

• For deep learning with ImageNet, the model is the chosen network, with the model parameters being the network link weights. The data is the set of images used for training or classification.
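The clustering example makes the Data/Model split concrete. Below is a minimal 1-D k-means sketch (illustrative only, not the SPIDAL implementation; the deterministic initialization is a simplification): the data (the points) is read but never changed, while the model (the cluster centers) is what each iteration updates.

```python
def kmeans(data, k, iters=10):
    """Toy 1-D k-means separating Data (the points, static across
    iterations) from Model (the centers, updated each iteration)."""
    model = data[:k]  # initial model parameters: first k points as centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:  # the data is only read, never modified
            j = min(range(k), key=lambda i: (x - model[i]) ** 2)
            clusters[j].append(x)
        # update the model: each center moves to its cluster's mean
        model = [sum(c) / len(c) if c else model[j]
                 for j, c in enumerate(clusters)]
    return model

centers = kmeans([1.0, 1.1, 0.9, 10.0, 10.1, 9.9], k=2)
# centers converge near 1.0 and 10.0
```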

Data and Model in Big Data and Simulations I


Page 13

• Simulations can also be considered as Data plus Model.
• The Model can be a formulation with particle dynamics or partial differential equations, defined by parameters such as particle positions and discretized velocity, pressure, and density values.

• Data could be small, when it is just boundary conditions.
• Data is large with data assimilation (weather forecasting) or when data visualizations are produced by the simulation.

• Big Data implies the Data is large, but the Model varies in size.
• E.g., LDA (Latent Dirichlet Allocation) with many topics, or deep learning, has a large model.
• Clustering or dimension reduction can be quite small in model size.

• Data is often static between iterations (unless streaming); Model parameters vary between iterations.

• Data and Model parameters are often confused in papers, as the term data is used to describe the parameters of models.

• Models in Big Data and Simulations have many similarities and allow convergence.

Data and Model in Big Data and Simulations II

Page 14

• Many applications use LML, or Local Machine Learning, where machine learning (often from R, Python, or Matlab) is run separately on every data item, such as on every image.

• But others are GML, Global Machine Learning, where machine learning is a basic algorithm run over all data items (over all nodes in the computer).

• Maximum likelihood or χ² with a sum over the N data items: documents, sequences, items to be sold, images, etc., and often links (point-pairs).

• GML includes graph analytics, clustering/community detection, mixture models, topic determination, multidimensional scaling, and (deep) learning networks.

• Note Facebook may need lots of small graphs (one per person, ~LML) rather than one giant graph of connected people (GML).
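The LML/GML distinction can be shown in a few lines of Python (an illustration, not any particular framework): the local step touches each item independently, while the global step is a reduction (here a χ²-style sum) over all N items, which is what forces communication in a parallel setting.

```python
data = [1.0, 2.0, 3.0, 4.0]

# LML: the same model is applied independently to every item
# (pleasingly parallel; no coupling between items).
lml_results = [x * 2 for x in data]  # stand-in for a per-item model

# GML: one model parameter mu is fitted against ALL items at once by
# minimizing the chi-squared-style objective sum_i (x_i - mu)^2.
# The minimizer here is just the mean, but computing it requires a
# global reduction over the whole data set.
mu = sum(data) / len(data)
chi2 = sum((x - mu) ** 2 for x in data)
print(mu, chi2)  # 2.5 5.0
```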

Local and Global Machine Learning


Page 15

• Applications: divide use cases into Data and Model and compare characteristics separately in these two components with 64 Convergence Diamonds (features).

• Identify the importance of streaming data, pleasingly parallel applications, and global/local machine learning.
• Software: a single model of the High Performance Computing (HPC) Enhanced Big Data Stack, HPC-ABDS: 21 layers adding a high-performance runtime to Apache systems; HPC-FaaS programming model.

• Serverless Infrastructure as a Service (IaaS).
• Hardware systems designed for the functionality and performance of the application type; e.g., disks, interconnect, memory, and CPU acceleration differ for machine learning, pleasingly parallel, data management, streaming, and simulations.

• Use DevOps to automate deployment of event-driven software-defined systems on hardware: HPC Cloud 2.0.

• Total System Solutions (wisdom) as a Service: HPC Cloud 3.0.

Convergence/Divergence Points for HPC-Cloud-Edge-Big Data-Simulation

Page 16

Application Structure

http://www.iterativemapreduce.org/

Page 17

• Real-time (streaming) data is increasingly common in scientific and engineering research, and it is ubiquitous in commercial Big Data (e.g., social network analysis, recommender systems, and consumer behavior classification).

• So far there has been little use of commercial and Apache technology in the analysis of scientific streaming data.

• Pleasingly parallel applications are important in science (the long tail) and data communities.
• Commercial-science application differences: search and recommender engines have a different structure from deep learning, clustering, topic models, and graph analyses such as subgraph mining.

• The latter are very sensitive to communication and can be hard to parallelize.
• Search is typically not as important in science as in commercial use, as search volume scales with the number of users.

• Should discuss data and model separately.
• The term data is often used rather sloppily and often refers to the model.

Structure of Applications

Page 18

Distinctive Features of Applications

• Ratio of data to model sizes: vertical axis on the next slide.
• Importance of synchronization (ratio of inter-node communication to node computing): horizontal axis on the next slide.

• Sparsity of data or model; impacts the value of GPUs or vector computing.

• Irregularity of data or model.
• Geographic distribution of data, as in edge computing; use of streaming (dynamic data) versus batch paradigms.

• Dynamic model structure, as in some iterative algorithms.

Page 19

Big Data and Simulation Difficulty in Parallelism

[Figure: a two-axis chart of problem classes. Horizontal axis: size of synchronization constraints, running from Pleasingly Parallel (often independent events) through MapReduce (as in scalable databases) and Loosely Coupled to Tightly Coupled. Vertical axis: size of disk I/O. Regions marked include: the current major Big Data category on commodity clouds; HPC clouds with accelerators and high-performance interconnect; exascale supercomputers running the largest-scale simulations and parameter sweep simulations; Global Machine Learning (e.g., parallel clustering), Deep Learning, LDA, and graph analytics (e.g., subgraph mining) on HPC clouds/supercomputers, where memory access is also critical; structured and unstructured adaptive sparse problems; and linear algebra at the core (often not sparse). These are just two problem characteristics; there is also data/compute distribution, as seen in grid/edge computing.]

Page 20

Application Nexus of HPC, Big Data, Simulation

Convergence use cases: Data and Model; NIST collection; Big Data Ogres; Convergence Diamonds. https://bigdatawg.nist.gov/

Page 21

• 26 fields completed for 51 areas
• Government Operation: 4
• Commercial: 8
• Defense: 3
• Healthcare and Life Sciences: 10
• Deep Learning and Social Media: 6
• The Ecosystem for Research: 4
• Astronomy and Physics: 5
• Earth, Environmental and Polar Science: 10
• Energy: 1

• Security & Privacy enhanced version 2
• BDEC HPC enhanced version

Original Use Case Template

Page 22

• Government Operation (4): National Archives and Records Administration, Census Bureau

• Commercial (8): Finance in Cloud, Cloud Backup, Mendeley (citations), Netflix, Web Search, Digital Materials, Cargo Shipping (as in UPS)

• Defense (3): Sensors, Image Surveillance, Situation Assessment

• Healthcare and Life Sciences (10): Medical Records, Graph and Probabilistic Analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity Models, Biodiversity

• Deep Learning and Social Media (6): Driving Car, Geolocating Images/Cameras, Twitter, Crowd Sourcing, Network Science, NIST Benchmark Datasets

• The Ecosystem for Research (4): Metadata, Collaboration, Language Translation, Light Source Experiments

• Astronomy and Physics (5): Sky Surveys (including comparison to simulation), Large Hadron Collider at CERN, Belle II Accelerator in Japan

• Earth, Environmental and Polar Science (10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice Sheet Radar Scattering, Earth Radar Mapping, Climate Simulation Datasets, Atmospheric Turbulence Identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET Gas Sensors

• Energy (1): Smart Grid

• Published by NIST as version 2, https://bigdatawg.nist.gov/_uploadfiles/NIST.SP.1500-3r1.pdf, with a common set of 26 features recorded for each use case.

51 Detailed Use Cases, contributed July-September 2013: cover goals and data features such as the 3 V's, software, and hardware. 26 features for each use case. Biased to science.

Page 23

Part of Property Summary Table

Page 24

• People: either the users (but see below) or the subjects of the application, and often both

• Decision makers like researchers or doctors (users of the application)

• Items such as images, EMRs, sequences (below); observations or contents of an online store
• Images or "electronic information nuggets"
• EMRs: Electronic Medical Records (often similar to people parallelism)
• Protein or gene sequences
• Material properties, manufactured-object specifications, etc., in custom datasets
• Modelled entities like vehicles and people

• Sensors: Internet of Things

• Events such as detected anomalies in telescope, credit card, or atmospheric data

• (Complex) nodes in an RDF graph
• Simple nodes as in a learning network

• Tweets, blogs, documents, web pages, etc.
• And the characters/words in them

• Files or data to be backed up, moved, or assigned metadata

• Particles/cells/mesh points as in parallel simulations

51 Use Cases: What Is Parallelism Over?

Page 25

• PP (26): "All" Pleasingly Parallel or Map Only
• MR (18): Classic MapReduce (add MRStat below for the full count)
• MRStat (7): Simple version of MR where the key computations are simple reductions, as found in statistical averages such as histograms and means

• MRIter (23): Iterative MapReduce or MPI (Flink, Spark, Twister)
• Graph (9): Complex graph data structure needed in the analysis
• Fusion (11): Integrate diverse data to aid discovery/decision making; could involve sophisticated algorithms or could just be a portal

• Streaming (41): Some data comes in incrementally and is processed this way
• Classify (30): Classification: divide data into categories
• S/Q (12): Index, Search and Query
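The MRStat pattern above is the simplest of these categories and is easy to sketch (a toy illustration in plain Python, not any specific MapReduce framework): each data chunk maps to a partial histogram, and the reduce step is simple addition.

```python
from collections import Counter
from functools import reduce

def mrstat_histogram(chunks):
    """MRStat-style job: map each data chunk to a partial histogram,
    then reduce by addition; the key computation is a simple reduction."""
    partials = [Counter(chunk) for chunk in chunks]          # map phase
    return reduce(lambda a, b: a + b, partials, Counter())   # reduce phase

chunks = [["a", "b", "a"], ["b", "b", "c"]]
hist = mrstat_histogram(chunks)  # counts: a=2, b=3, c=1
```

Because the reduction is associative and commutative, the reduce phase can run as a tree across nodes, which is what makes this category cheap to scale.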

Sample Features of 51 Use Cases I


Page 26

• CF (4): Collaborative Filtering for recommender engines
• LML (36): Local Machine Learning (independent for each parallel entity); an application could have GML as well

• GML (23): Global Machine Learning: Deep Learning, Clustering, LDA, PLSI, MDS
• Large-scale optimizations as in Variational Bayes, MCMC, Lifted Belief Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg-Marquardt. Can call this EGO, or Exascale Global Optimization, with scalable parallel algorithms

• Workflow (51): Universal
• GIS (16): Geotagged data, often displayed in ESRI, Microsoft Virtual Earth, Google Earth, GeoServer, etc.

• HPC (5): Classic large-scale simulation of cosmos, materials, etc., generating (visualization) data

• Agent (2): Simulations of models of data-defined macroscopic entities represented as agents

Sample Features of 51 Use Cases II

Page 27

BDEC2 and NIST Use Cases

Page 28

53 NIST Use Cases for Research Space I

• GOVERNMENT OPERATION
• 1: Census 2010 and 2000 - Title 13 Big Data
• 2: NARA Accession, Search, Retrieve, Preservation
• 3: Statistical Survey Response Improvement
• 4: Non-Traditional Data in Statistical Survey Response Improvement (Adaptive Design)

1-4 are related to social science survey problems and are "Classic Data+ML" with interesting algorithms (recommender engines), plus databases and important privacy issues which are also present in research cases.

• COMMERCIAL
• 5: Cloud Eco-System for Financial Industries NO
• 6: Mendeley, an International Network of Research
• 7: Netflix Movie Service NO
• 8: Web Search NO
• 9: Big Data Business Continuity and Disaster Recovery Within a Cloud Eco-System NO
• 10: Cargo Shipping Edge Computing NO
• 11: Materials Data for Manufacturing
• 12: Simulation-Driven Materials Genomics

6 is "Classic Data+ML" with text analysis (citation identification, topic models, etc.). 10 is DHL/FedEx/UPS and has no direct scientific analog; however, it is a good example of an edge computing system of a similar nature to the scientific research case. 11 and 12 are materials science, covered in the BDEC2 meeting.

Page 29

53 NIST Use Cases for Research Space II

• DEFENSE
• 13: Cloud Large-Scale Geospatial Analysis and Visualization
• 14: Object Identification and Tracking from Wide-Area Large Format Imagery or Full Motion Video - Persistent Surveillance
• 15: Intelligence Data Processing and Analysis

13-15 are very similar to disaster response problems. They involve extensive "Classic Data+ML" for sensor collections and/or image processing. GIS and spatial analysis are important, as in the BDEC2 Pathology and Spatial Imagery talk. The geospatial aspect of these applications means they are similar to the earth science examples.

• Color coding of use cases:
• NO means not similar to a research application
• Red means not relevant to BDEC2
• Orange means related to BDEC2 Bloomington presentations
• Black marks unique use cases of relevance to BDEC2 but not presented at Bloomington
• Purple marks comments

Page 30

Use Cases III: HEALTHCARE AND LIFE SCIENCES

• 16: Electronic Medical Record
• 17: Pathology Imaging / Digital Pathology
• 18: Computational Bioimaging
• 19: Genomic Measurements
• 20: Comparative Analysis for Metagenomes and Genomes
• 21: Individualized Diabetes Management
• 22: Statistical Relational Artificial Intelligence for Health Care
• 23: World Population-Scale Epidemiological Study
• 24: Social Contagion Modeling for Planning, Public Health, and Disaster Management
• 25: Biodiversity and LifeWatch

Page 31

Comments on Use Cases III

• 16 and 22 are "classic data + ML + database" based use cases using an important technique, understood by the community but not presented at BDEC2.
• 17 came originally from Saltz's group and was updated in his BDEC2 talk.
• 18 describes biology image processing from many instruments: microscopes, MRI, and light sources. The latter was directly discussed at BDEC2 and the other instruments were implicit.
• 19 and 20 are well recognized as distributed Big Data problems with significant computing. They were represented by Chandrasekaran's presentation at BDEC2, which inevitably only covered part (gene assembly) of the problem.
• 21 relies on "classic data + graph analytics", which was not discussed at the BDEC2 meeting but is certainly actively pursued.
• 23 and 24 originally came from Marathe and were updated in his BDEC2 presentation on massive bio-social systems.
• 25 generalizes the BDEC2 talks by Taufer and Rahnemoonfar on ocean and land monitoring and sensor array analysis.

Page 32

Use Cases IV: DEEP LEARNING AND SOCIAL MEDIA

• 26: Large-Scale Deep Learning
• 27: Organizing Large-Scale, Unstructured Collections of Consumer Photos NO
• 28: Truthy - Information Diffusion Research from Twitter Data
• 29: Crowd Sourcing in the Humanities as a Source for Big and Dynamic Data
• 30: CINET - Cyberinfrastructure for Network (Graph) Science and Analytics
• 31: NIST Information Access Division - Analytic Technology Performance Measurements, Evaluations, and Standards

• 26 on deep learning was covered in great depth at the BDEC2 meeting.
• 27 describes an interesting image processing challenge of geolocating multiple photographs, which is so far not directly related to scientific data analysis, although related image processing algorithms are certainly important.
• 28-30 are "classic data + ML" use cases with a focus on graph and text mining algorithms, not covered at BDEC2 but certainly relevant to the process.
• 31 on benchmarking and standard datasets is related to the BigDataBench talk at the end of the BDEC2 meeting and Foster's talk on a model database.

Page 33

53 NIST Use Cases for Research Space VI

• THE ECOSYSTEM FOR RESEARCH
• 32: DataNet Federation Consortium
• 33: The Discinnet Process NO
• 34: Semantic Graph Search on Scientific Chemical and Text-Based Data
• 35: Light Source Beamlines

32 covers data management with iRODS, which is well regarded by the community but not discussed at BDEC2. 33 is a teamwork approach that does not seem relevant to BDEC2. 34 is a "classic data+ML" use case with comments similar to 28-30. 35 was covered with more advanced deep learning algorithms in Yager's and Foster's BDEC2 talks.

• ASTRONOMY AND PHYSICS
• 36: Catalina Real-Time Transient Survey: A Digital, Panoramic, Synoptic Sky Survey
• 37: DOE Extreme Data from Cosmological Sky Survey and Simulations
• 38: Large Survey Data for Cosmology
• 39: Particle Physics - Analysis of Large Hadron Collider Data: Discovery of Higgs Particle
• 40: Belle II High Energy Physics Experiment

36 to 38 are "Classic Data+ML" astronomy use cases related to the BDEC2 SKA presentation, covering both archival and event detection cases. Use case 37 covers the integration of simulation data and observational data for astronomy, a topic covered in other cases at BDEC2. 39 and 40 are "Classic Data+ML" use cases for accelerator data analysis. This was not covered at BDEC2 but is currently the largest-volume scientific data analysis problem, and its importance and relevance are well understood.

Page 34

Use Cases VII: EARTH, ENVIRONMENTAL, AND POLAR SCIENCE

• 41: European Incoherent Scatter Scientific Association 3D Incoherent Scatter Radar System: big radar instrument monitoring the atmosphere
• 42: Common Operations of Environmental Research Infrastructure
• 43: Radar Data Analysis for the Center for Remote Sensing of Ice Sheets
• 44: Unmanned Air Vehicle Synthetic Aperture Radar (UAVSAR) Data Processing, Data Product Delivery, and Data Services
• 45: NASA Langley Research Center / Goddard Space Flight Center iRODS Federation Test Bed
• 46: MERRA Analytic Services (MERRA/AS) Instrument
• 47: Atmospheric Turbulence - Event Discovery and Predictive Analytics Imaging
• 48: Climate Studies Using the Community Earth System Model at the U.S. Department of Energy (DOE) NERSC Center
• 49: DOE Biological and Environmental Research (BER) Subsurface Biogeochemistry Scientific Focus Area
• 50: DOE BER AmeriFlux and FLUXNET Sensor Networks
• 2-1: NASA Earth Observing System Data and Information System (EOSDIS) Instrument
• 2-2: Web-Enabled Landsat Data (WELD) Processing Instrument

Page 35

Comments on Use Cases VII

• 41, 43, and 44 are "Classic Data+ML" use cases involving radar data from different instruments (specialized ground, vehicle/plane, satellite), not directly covered at BDEC2.
• 2-1 and 2-2 are use cases similar to 41, 43, and 44 but applied to EOSDIS and Landsat earth observations from satellites in multiple modalities.
• 42, 49, and 50 are "Classic Data+ML" environmental sensor arrays that extend the scope of the talks by Taufer and Rahnemoonfar at BDEC2. See also use case 25 above.
• 45 to 47 describe datasets from instruments and computations relevant to climate and weather. They relate to the BDEC2 talks by Denvil and Miyoshi. 47 discusses the correlation of aircraft turbulence reports with simulation datasets.
• 48 is data analytics and management associated with climate studies, as covered in the BDEC2 talk by Denvil.

Page 36

Use Cases VIII: ENERGY

• 51: Consumption Forecasting in Smart Grids

51 is a different subproblem but in the same area as Pothen and Azad's talk on the electric power grid at BDEC2. This is a challenging edge computing problem, as it involves a large number of distributed but correlated sensors.

• SC-18 BOF Application/Industry Perspective by David Keyes, King Abdullah University of Science and Technology (KAUST)
• https://www.exascale.org/bdec/sites/www.exascale.org.bdec/files/SC18_BDEC2_BoF-Keyes.pdf

This is a presentation by David Keyes on seismic imaging for oil discovery and exploitation. It is "Classic Data+ML" for an array of sonic sensors.

Page 37

BDEC2 Use Cases I: Classic Observational Data plus ML

• BDEC2-1: M. Deegan, Big Data and Extreme Scale Computing, 2nd Series (BDEC2) - Statement of Interest from the Square Kilometre Array Organisation (SKAO)

• Environmental Science
• BDEC2-2: M. Rahnemoonfar, Semantic Segmentation of Underwater Sonar Imagery based on Deep Learning
• BDEC2-3: M. Taufer, Cyberinfrastructure Tools for Precision Agriculture in the 21st Century

• Healthcare and Life Sciences
• BDEC2-4: J. Saltz, Multiscale Spatial Data and Deep Learning
• BDEC2-5: R. Stevens, Exascale Deep Learning for Cancer
• BDEC2-6: S. Chandrasekaran, Development of a parallel algorithm for whole genome alignment for rapid delivery of personalized genomics
• BDEC2-7: M. Marathe, Pervasive, Personalized and Precision (P3) analytics for massive bio-social systems

Instruments include satellites, UAVs, sensors (see edge examples), light sources (X-ray, MRI, microscope, etc.), telescopes, accelerators, tokamaks (fusion), and computers (as in Control, Simulation, Data, ML Integration)


BDEC2 Use Cases II: Control, Simulation, Data, ML Integration

• BDEC2-8: W. Tang, New Models for Integrated Inquiry: Fusion Energy Exemplar
• BDEC2-9: O. Beckstein, Convergence of data generation and analysis in the biomolecular simulation community
• BDEC2-10: S. Denvil, From the production to the analysis phase: new approaches needed in climate modeling
• BDEC2-11: T. Miyoshi, Prediction Science: The 5th Paradigm Fusing the Computational Science and Data Science (weather forecasting)

See also the Marathe and Stevens talks, and the instruments under Classic Observational Data plus ML.

• Material Science
• BDEC2-12: K. Yager, Autonomous Experimentation as a Paradigm for Materials Discovery
• BDEC2-13: L. Ward, Deep Learning, HPC, and Data for Materials Design
• BDEC2-14: J. Ahrens, A vision for a validated distributed knowledge base of material behavior at extreme conditions using the Advanced Cyberinfrastructure Platform
• BDEC2-15: T. Deutsch, Digital transition of Material Nano-Characterization


Comments on Control, Simulation, Data, ML Integration

• Simulations often involve outside data but always involve inside data (from the simulation itself). Fields covered include Materials (nano), Climate, Weather, Biomolecular, and Virtual Tissues (no use case written up)
• We can see ML wrapping simulations to achieve many goals: ML replaces functions and/or ML guides functions
  • Initial conditions
  • Boundary conditions
  • Data assimilation
  • Configuration: blocking, use of cache, etc.
  • Steering and control
  • Support for multi-scale
  • ML learns from previous simulations and so can predict function calls
• Digital Twins are a commercial link between simulations and systems
• There are fundamental simulations governed by the laws of physics and, increasingly, Complex System simulations with biological (tissue) or social entities.
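The "ML learns from previous simulations and so can predict function calls" idea can be illustrated with a toy surrogate model. The sketch below (not from the talk; the quadratic response surface and all names are hypothetical) fits a least-squares model to (input, output) pairs from an "expensive" simulation function, then answers new calls cheaply from the fitted model.

```python
def expensive_simulation(x):
    """Stand-in for a costly simulation: a smooth response surface (assumed)."""
    return 2.0 * x * x + 3.0 * x + 1.0

# "Previous simulations": sampled (input, output) pairs.
xs = [i / 10.0 for i in range(-20, 21)]
ys = [expensive_simulation(x) for x in xs]

def polyfit2(xs, ys):
    """Fit y ~= a*x^2 + b*x + c by solving the 3x3 normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]               # sums of x^0..x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    A = [[s[4], s[3], s[2], t[2]],
         [s[3], s[2], s[1], t[1]],
         [s[2], s[1], s[0], t[0]]]
    # Gauss-Jordan elimination with partial pivoting on the augmented system.
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        for r in range(3):
            if r != i:
                f = A[r][i] / A[i][i]
                A[r] = [u - f * v for u, v in zip(A[r], A[i])]
    return [A[i][3] / A[i][i] for i in range(3)]                  # [a, b, c]

a, b, c = polyfit2(xs, ys)

def surrogate(x):
    """Cheap ML replacement for the simulation call."""
    return a * x * x + b * x + c

# The surrogate reproduces the simulation closely on unseen inputs.
err = max(abs(surrogate(x) - expensive_simulation(x)) for x in [0.55, 1.23, -1.7])
```

In a real workflow the surrogate would be a neural network or Gaussian process trained on archived simulation runs, but the wrapping pattern is the same.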


BDEC2 Use Cases III: Edge Computing

• Smart City and Related Edge Applications
• BDEC2-16: P. Beckman, Edge to HPC Cloud
• BDEC2-17: G. Ricart, Smart Community CyberInfrastructure at the Speed of Life
• BDEC2-18: T. El-Ghazawi, Convergence of AI, Big Data, Computing and IOT (ABCI) - Smart City as an Application Driver and Virtual Intelligence Management (VIM)
• BDEC2-19: M. Kondo, The Challenges and opportunities of BDEC systems for Smart Cities

• Other Edge Applications
• BDEC2-20: A. Pothen, High-End Data Science and HPC for the Electrical Power Grid
• BDEC2-21: J. Qiu, Real-Time Anomaly Detection from Edge to HPC-Cloud

There are correlated edge devices, such as the power grid and nearby vehicles (racing, road), and largely independent edge devices that interact via databases, such as surveillance cameras.


BDEC Use Cases IV

• BDEC Ecosystem
• BDEC2-22: I. Foster, Learning Systems for Deep Science
• BDEC2-23: W. Gao, BigDataBench: A Scalable and Unified Big Data and AI Benchmark Suite

• Image-based Applications
• One cross-cutting theme is understanding Generalized (light, sound, other sensors such as temperature, chemistry, moisture) Images with 2D, 3D spatial and time dependence
• Modalities include Radar, MRI, Microscopes, Surveillance and other cameras, X-ray scattering, UAV hosted, and related non-optical sensor networks as in agriculture, wildfires, disaster monitoring, and oil exploration. GIS and geospatial properties are often relevant


NIST Generic Data Processing Use Cases


10 Generic Data Processing Use Cases

1) Multiple users performing interactive queries and updates on a database with basic availability and eventual consistency (BASE = Basically Available, Soft state, Eventual consistency, as opposed to ACID = Atomicity, Consistency, Isolation, Durability)
2) Perform real-time analytics on data source streams and notify users when specified events occur
3) Move data from external data sources into a highly horizontally scalable data store, transform it using highly horizontally scalable processing (e.g. MapReduce), and return it to the horizontally scalable data store (ELT: Extract, Load, Transform)
4) Perform batch analytics on the data in a highly horizontally scalable data store using highly horizontally scalable processing (e.g. MapReduce) with a user-friendly interface (e.g. SQL-like)
5) Perform interactive analytics on data in an analytics-optimized database, with variant 5A) Science
6) Visualize data extracted from a horizontally scalable Big Data store
7) Move data from a highly horizontally scalable data store into a traditional Enterprise Data Warehouse (EDW)
8) Extract, process, and move data from data stores to archives
9) Combine data from Cloud databases and on-premise data stores for analytics, data mining, and/or machine learning
10) Orchestrate multiple sequential and parallel data transformations and/or analytic processing using a workflow manager

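Patterns 3 and 4 above (scalable transform-then-analyze with MapReduce) can be illustrated with a tiny single-process map/shuffle/reduce word count. This is only a sketch of the dataflow; a real deployment would run the same three phases on Hadoop or Spark across a cluster.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (key, 1) pairs, here one per word."""
    for line in records:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values."""
    return {key: sum(values) for key, values in groups.items()}

# "Extract" some records, "transform" them with map/shuffle/reduce; the result
# could then be "loaded" back into the horizontally scalable data store.
records = ["big data benchmarking", "big data systems"]
counts = reduce_phase(shuffle(map_phase(records)))
# counts == {"big": 2, "data": 2, "benchmarking": 1, "systems": 1}
```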

Access Pattern 5A: Perform interactive analytics on observational scientific data


[Figure: pipeline for interactive analytics on observational scientific data. Record scientific data in the "field" → local accumulation and initial computing → direct transfer, or transport of data batches, to the primary analysis data system. Data storage: HDFS, HBase, file collections. Processing: Grid or Many-Task software, Hadoop, Spark, Giraph, Pig, with science analysis code, Mahout, R. Streaming Twitter data for social networking is a parallel example. Examples include the LHC, remote sensing, astronomy, and bioinformatics.]


PolarGrid

Lightweight cyberinfrastructure to support mobile data-gathering expeditions plus classic central resources (as a cloud)

BATCH MODE


Tracking the Heavens

"The Universe is now being explored systematically, in a panchromatic way, over a range of spatial and temporal scales that lead to a more complete, and less biased understanding of its constituents, their evolution, their origins, and the physical processes governing them."
- Towards a National Virtual Observatory

[Figure: Hubble Telescope, Palomar Telescope, Sloan Telescope]

DISTRIBUTED LARGE INSTRUMENT MODE


Other Use-Case Collections


7 Computational Giants of the NRC Massive Data Analysis Report

1) G1: Basic Statistics, e.g. MRStat
2) G2: Generalized N-Body Problems
3) G3: Graph-Theoretic Computations
4) G4: Linear Algebraic Computations
5) G5: Optimizations, e.g. Linear Programming
6) G6: Integration, e.g. LDA and other GML
7) G7: Alignment Problems, e.g. BLAST

http://www.nap.edu/catalog.php?record_id=18374   Big Data Models?

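G1 (Basic Statistics) is the kind of computation MRStat targets: each data partition produces small partial sums that a single reduction combines. A minimal pure-Python sketch of the pattern (illustrative only, not MRStat itself):

```python
from functools import reduce

def partial_stats(chunk):
    """Map: per-partition (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))

def merge(a, b):
    """Reduce: combine two partial statistics; associative, so it parallelizes."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_variance(chunks):
    """Global mean and population variance from distributed partials."""
    n, s, ss = reduce(merge, (partial_stats(c) for c in chunks))
    mean = s / n
    return mean, ss / n - mean * mean

chunks = [[1.0, 2.0, 3.0], [4.0, 5.0]]   # data split across two "nodes"
mean, var = mean_variance(chunks)
# mean == 3.0, var == 2.0
```

Because `merge` is associative and commutative, the reduction can run as a tree across any number of partitions, which is why this giant maps so cleanly onto MapReduce.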

HPC (Simulation) Benchmark Classics

• Linpack or HPL: parallel LU factorization for the solution of linear equations; HPCG
• NPB version 1: mainly classic HPC solver kernels
  • MG: Multigrid
  • CG: Conjugate Gradient
  • FT: Fast Fourier Transform
  • IS: Integer Sort
  • EP: Embarrassingly Parallel
  • BT: Block Tridiagonal
  • SP: Scalar Pentadiagonal
  • LU: Lower-Upper symmetric Gauss-Seidel

Simulation Models

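The CG kernel above solves a sparse symmetric positive-definite system with the conjugate gradient method. The following dense, dependency-free sketch shows the same iteration in miniature (it is not the NPB code, which uses a large random sparse matrix and measures Mflop/s):

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive-definite A (dense, pure Python)."""
    n = len(b)
    matvec = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    x = [0.0] * n
    r = b[:]                 # residual r = b - A x (with x = 0)
    p = r[:]                 # search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:     # converged: squared residual norm is tiny
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]   # small SPD test matrix
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
# x is close to [1/11, 7/11]
```

The dominant cost per iteration is the matrix-vector product, which is why CG stresses memory bandwidth and (in parallel) communication rather than raw flops.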

13 Berkeley Dwarfs

1) Dense Linear Algebra
2) Sparse Linear Algebra
3) Spectral Methods
4) N-Body Methods
5) Structured Grids
6) Unstructured Grids
7) MapReduce
8) Combinational Logic
9) Graph Traversal
10) Dynamic Programming
11) Backtrack and Branch-and-Bound
12) Graphical Models
13) Finite State Machines

The first 6 of these correspond to Colella's original list (classic simulations); Monte Carlo was dropped, and N-body methods are a subset of Colella's Particle category.

Note the list is a little inconsistent in that MapReduce is a programming model while spectral methods are a numerical method. We need multiple facets to classify use cases!

Largely Models for Data or Simulation

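Dwarf 5 (Structured Grids) is typified by stencil updates such as Jacobi relaxation, where each grid point is repeatedly replaced by a combination of its neighbors. A 1D toy version of the pattern (illustrative only):

```python
def jacobi_1d(u, sweeps):
    """Jacobi relaxation on a 1D grid with fixed boundary values."""
    u = u[:]
    for _ in range(sweeps):
        # Each interior point becomes the average of its two neighbors;
        # the two boundary values stay fixed.
        u = [u[0]] + [(u[i - 1] + u[i + 1]) / 2.0
                      for i in range(1, len(u) - 1)] + [u[-1]]
    return u

# Boundary conditions 0 and 1: the converged solution of the 1D Laplace
# equation is a straight line between them.
u = jacobi_1d([0.0, 0.0, 0.0, 0.0, 1.0], sweeps=200)
# u approaches [0.0, 0.25, 0.5, 0.75, 1.0]
```

The same nearest-neighbor structure in 2D/3D gives the halo-exchange communication pattern that structured-grid benchmarks (e.g. NPB MG) measure.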

Classifying Use Cases


Classifying Use Cases

• The Big Data Ogres were built on a collection of 51 big data use cases gathered by the NIST Public Working Group, where 26 properties were recorded for each application.
• This information was combined with other studies, including the Berkeley Dwarfs, the NAS Parallel Benchmarks, and the Computational Giants of the NRC Massive Data Analysis Report.
• The Ogre analysis led to a set of 50 features, divided into four views, that could be used to categorize and distinguish between applications.
• The four views are Problem Architecture (macropatterns); Execution Features (micropatterns); Data Source and Style; and finally the Processing view, or runtime features.
• We generalized this approach to integrate Big Data and Simulation applications into a single classification, looking separately at Data and Model, with the total number of facets growing to 64. These are called convergence diamonds and are split between the same 4 views.
• A mapping of facets onto the work of the SPIDAL project has been given.


4 Ogre Views and 50 Facets: the original 50 Ogres in 4 views

[Figure: the 50 facets arranged around the four views.]

• Problem Architecture view (macropatterns): Pleasingly Parallel; Classic MapReduce; Map-Collective; Map Point-to-Point; Map Streaming; Shared Memory; Single Program Multiple Data; Bulk Synchronous Parallel; Fusion; Dataflow; Agents; Workflow
• Execution view (micropatterns): Performance Metrics; Flops per Byte (Memory, I/O); Execution Environment and Core Libraries; Volume; Velocity; Variety; Veracity; Communication Structure; Data Abstraction; Metric = M / Non-Metric = N; O(N^2) = NN / O(N) = N; Regular = R / Irregular = I; Dynamic = D / Static = S; Iterative/Simple
• Data Source and Style view: Geospatial Information System; HPC Simulations; Internet of Things; Metadata/Provenance; Shared/Dedicated/Transient/Permanent; Archived/Batched/Streaming; HDFS/Lustre/GPFS; Files/Objects; Enterprise Data Model; SQL/NoSQL/NewSQL
• Processing view: Micro-benchmarks; Local Analytics; Global Analytics; Base Statistics; Recommendations; Search/Query/Index; Classification; Learning; Optimization Methodology; Streaming; Alignment; Linear Algebra Kernels; Graph Algorithms; Visualization


Convergence Diamonds Views and Facets: 64 Features in 4 Views for Unified Classification of Big Data and Simulation Applications

[Figure: the convergence diamonds, with facets split between Data (D) and Model (M) across the same 4 views. Facets range from nearly all Data, through mixes of Data and Model, to all Model; Simulations and Analytics share the classification.]

• Problem Architecture view: Pleasingly Parallel; Classic MapReduce; Map-Collective; Map Point-to-Point; Map Streaming; Shared Memory; Single Program Multiple Data; Bulk Synchronous Parallel; Fusion; Dataflow; Agents; Workflow
• Execution view: Performance Metrics (1); Flops per Byte, Memory I/O, Flops per watt (2); Execution Environment and Core Libraries (3); Data Volume (D4) and Model Size (M4); Data Velocity (D5); Data Variety (D6) and Model Variety (M6); Veracity (7); Communication Structure (M8); Dynamic = D / Static = S for Data (D9) and Model (M9); Regular = R / Irregular = I for Data (D10) and Model (M10); Iterative/Simple (M11); Data Abstraction (D12) and Model Abstraction (M12); Metric = M / Non-Metric = N for Data (D13) and Model (M13); O(N^2) = NN / O(N) = N (M14)
• Data Source and Style view: Geospatial Information System; HPC Simulations; Internet of Things; Metadata/Provenance; Shared/Dedicated/Transient/Permanent; Archived/Batched/Streaming (S1-S5); HDFS/Lustre/GPFS; Files/Objects; Enterprise Data Model; SQL/NoSQL/NewSQL
• Processing view, Big Data Processing Diamonds: Micro-benchmarks (1M); Local Analytics/Informatics/Simulations (2M); Global Analytics/Informatics/Simulations (3M); Recommender Engine (4M); Base Data Statistics (5M); Data Search/Query/Index (6M); Data Classification (7M); Learning (8M); Optimization Methodology (9M); Streaming Data Algorithms (10M); Data Alignment (11M); Linear Algebra Kernels with many subclasses (12M); Graph Algorithms (13M); Visualization (14M); Core Libraries (15M)
• Processing view, Simulation (Exascale) Processing Diamonds: Iterative PDE Solvers (16M); Multiscale Method (17M); Spectral Methods (18M); N-body Methods (19M); Particles and Fields (20M); Evolution of Discrete Systems (21M); Nature of mesh if used (22M)


Convergence Diamonds and their 4 Views I

• One view is the overall problem architecture, or macropatterns, which is naturally related to the machine architecture needed to support the application.
  • Unchanged from the Ogres; describes properties of problems such as "Pleasingly Parallel" or "Uses Collective Communication"
• The execution (computational) features or micropatterns view describes issues such as I/O versus compute rates, the iterative nature and regularity of the computation, and the classic V's of Big Data: defining problem size, rate of change, etc.
  • Significant changes from the Ogres to separate Data and Model and to add characteristics of Simulation models, e.g. both model and data have "V's": Data Volume, Model Size
  • e.g. an O(N^2) algorithm is relevant to a big data or big simulation model


Convergence Diamonds and their 4 Views II

• The data source & style view includes facets specifying how the data is collected, stored, and accessed. It has classic database characteristics
  • Simulations can have facets here to describe input or output data
  • Examples: streaming, files versus objects, HDFS vs. Lustre
• The processing view has model (not data) facets which describe types of processing steps, including the nature of algorithms and kernels in the model, e.g. Linear Programming, Learning, Maximum Likelihood, spectral methods, mesh type
  • It is a mix of the Big Data Processing view and the Big Simulation Processing view and includes some facets, like "uses linear algebra", needed in both; it has specifics of key simulation kernels and in particular includes facets seen in the NAS Parallel Benchmarks and Berkeley Dwarfs
• Instances of Diamonds are particular problems, and a set of Diamond instances that covers enough of the facets could form a comprehensive benchmark/mini-app set
• Diamonds and their instances can be atomic or composite


Programming Environment for the Global AI and Modeling Supercomputer (GAIMSC)

http://www.iterativemapreduce.org/


Ways of Adding High Performance to the Global AI (and Modeling) Supercomputer

• Fix performance issues in Spark, Heron, Hadoop, Flink, etc.
  • Messy, as some features of these big data systems are intrinsically slow in some (not all) cases
  • All these systems are "monolithic", and it is difficult to deal with their individual components
• Execute HPBDC from a classic big data system with a custom communication environment: the approach of Harp for the relatively simple Hadoop environment
• Provide a native Mesos/Yarn/Kubernetes/HDFS high-performance execution environment with all the capabilities of Spark, Hadoop, and Heron: the goal of Twister2
• Execute with MPI in a classic (Slurm, Lustre) HPC environment
• Add modules to existing frameworks like Scikit-Learn or TensorFlow, either as new capabilities or as higher-performance versions of existing modules.


GAIMSC Programming Environment Components I (Area / Component / Implementation / Comments and User API)

Area: Architecture Specification
• Coordination Points: state and configuration management at the program, data, and message level. User API: change execution mode; save and reset state
• Execution Semantics: mapping of resources to Bolts/Maps in containers, processes, threads. Different systems make different choices - why?
• Parallel Computing: Spark, Flink, Hadoop, Pregel, MPI modes; Owner Computes Rule
• Job Submission: (dynamic/static) resource allocation with plugins for Slurm, Yarn, Mesos, Marathon, Aurora. Client API (e.g. Python) for job management

Area: Task System. User API: task-based programming with a dynamic or static graph API; FaaS API; support for accelerators (CUDA, FPGA, KNL)
• Task Migration: monitoring of tasks and migrating them for better resource utilization
• Elasticity: OpenWhisk
• Streaming and FaaS Events: Heron, OpenWhisk, Kafka/RabbitMQ
• Task Execution: processes, threads, queues
• Task Scheduling: dynamic scheduling, static scheduling, pluggable scheduling algorithms
• Task Graph: static graph, dynamic graph generation


GAIMSC Programming Environment Components II (Area / Component / Implementation / Comments)

Area: Communication API
• Messages: Heron. This is user level and could map to multiple communication systems
• Dataflow Communication: fine-grain Twister2 dataflow communications over MPI, TCP, and RMA; coarse-grain dataflow from NiFi, Kepler?; streaming and ETL data pipelines. Defines a new dataflow communication API and library
• BSP Communication, Map-Collective: conventional MPI, Harp; MPI point-to-point and collective API

Area: Data Access (Data API)
• Static (batch) data: file systems, NoSQL, SQL
• Streaming data: message brokers, spouts

Area: Data Management
• Distributed Data Set: relaxed distributed shared memory (immutable data), mutable distributed data. Data transformation API; Spark RDD, Heron Streamlet

Area: Fault Tolerance
• Checkpointing: upstream (streaming) backup; lightweight; coordination points; Spark/Flink, MPI, and Heron models. Streaming and batch cases are distinct; crosses all components

Area: Security
• Storage, messaging, execution: research needed; crosses all components


Integrating HPC and Apache Programming Environments

• Harp-DAAL, with a kernel Machine Learning library, exploits the Intel node library DAAL and HPC communication collectives within the Hadoop ecosystem.
  • Harp-DAAL supports all 5 classes of data-intensive AI-first computation, from pleasingly parallel to machine learning and simulations.
• Twister2 is a toolkit of components that can be packaged in different ways
  • Integrated batch or streaming data capabilities familiar from Apache Hadoop, Spark, Heron, and Flink, but with high performance
  • Separate bulk synchronous and dataflow communication
  • Task management as in Mesos, Yarn, and Kubernetes
  • Dataflow graph execution models
  • Launching of the Harp-DAAL library within a native Mesos/Kubernetes/HDFS environment
  • Streaming and repository data access interfaces
  • In-memory databases and fault tolerance at dataflow nodes (use RDDs to do classic checkpoint-restart)


Runtime Software for Harp

The Map-Collective runtime merges MapReduce and HPC, adding collective operations to the map phase: allreduce, reduce, rotate, push & pull, allgather, regroup, broadcast.

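What each of these collectives computes can be mimicked in a few lines. The toy in-process version below (not Harp's actual API; the function names are hypothetical) shows the semantics of a collective across per-worker partial results, where a real runtime would move data between nodes:

```python
def reduce_op(partials, op):
    """reduce: combine all workers' partials into one value (at a root)."""
    result = partials[0]
    for p in partials[1:]:
        result = op(result, p)
    return result

def allreduce(partials, op):
    """allreduce: every worker receives the combined value."""
    combined = reduce_op(partials, op)
    return [combined] * len(partials)

def allgather(partials):
    """allgather: every worker receives the full list of partials."""
    return [list(partials) for _ in partials]

def broadcast(value, n_workers):
    """broadcast: the root's value is copied to every worker."""
    return [value] * n_workers

# Four "workers" each hold a partial gradient sum; after allreduce all four
# see the global sum, which is the core step of synchronous ML training.
partials = [1, 2, 3, 4]
summed = allreduce(partials, lambda a, b: a + b)
# summed == [10, 10, 10, 10]
```

In Harp (as in MPI) these run as tree or ring algorithms over the network, which is where the HPC-style performance comes from.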

Harp v. Spark, Harp v. Torch, Harp v. MPI

• Harp v. Spark: datasets of 5 million points, 10 thousand centroids, 10 feature dimensions; 10 to 20 nodes of Intel KNL 7250 processors; Harp-DAAL has 15x speedups over Spark MLlib
• Harp v. Torch: datasets of 500K or 1 million data points of feature dimension 300; running on a single KNL 7250 (Harp-DAAL) vs. a single K80 GPU (PyTorch); Harp-DAAL achieves 3x to 6x speedups
• Harp v. MPI: Twitter dataset with 44 million vertices, 2 billion edges, and subgraph templates of 10 to 12 vertices; 25 nodes of Intel Xeon E5-2670; Harp-DAAL has 2x to 5x speedups over the state-of-the-art MPI-Fascia solution


Twister2 Dataflow Communications

• Twister:Net offers two communication models
  • BSP (Bulk Synchronous Processing): message-level communication using TCP or MPI, separated from its task management, plus extra Harp collectives
  • DFW: a new Dataflow library built using MPI software but operating at the data-movement rather than message level
    • Non-blocking
    • Dynamic data sizes
    • Streaming model
• The batch case is modeled as a finite stream
• The communications are between a set of tasks in an arbitrary task graph
  • Key-based communications
  • Data-level communications spilling to disk
  • Target tasks can be different from source tasks


Latency of Apache Heron and Twister:Net DFW (Dataflow) for Reduce, Broadcast, and Partition operations on 16 nodes with 256-way parallelism.

Twister:Net versus Apache Heron and Spark. Left: K-means job execution time on 16 nodes with varying numbers of centers, 2 million points with 320-way parallelism. Right: K-means with 4, 8, and 16 nodes, each node having 20 tasks; 2 million points with 16,000 centers.


Intelligent Dataflow Graph

• The dataflow graph specifies the distribution and interconnection of job components
  • Hierarchical and iterative
  • Allows ML wrapping of the component at each dataflow node
  • Checkpoint after each node of the dataflow graph
    • A natural synchronization point
    • Let's allow the user to choose when to checkpoint (not every stage)
    • Save state as the user specifies; Spark just saves model state, which is insufficient for complex algorithms
• Intelligent nodes support customization of checkpointing, ML, and communication
• Nodes can be coarse grain (large jobs) or fine grain, requiring different actions

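The per-node checkpointing idea can be sketched as a tiny dataflow runner where each stage optionally saves its full output at the natural synchronization point. All names here are hypothetical; this is not the Twister2 API, just the control pattern:

```python
def run_dataflow(stages, data, checkpoints):
    """Run a linear dataflow graph; checkpoint after user-selected nodes only."""
    saved = {}
    for name, fn in stages:
        data = fn(data)
        if name in checkpoints:      # the user chooses when to checkpoint,
            saved[name] = data       # and the full state is saved, not just a model
    return data, saved

# A toy pipeline: prepare -> cluster -> visualize.
stages = [
    ("prepare", lambda xs: [x * 2 for x in xs]),
    ("cluster", lambda xs: sum(xs)),
    ("visualize", lambda total: f"total={total}"),
]
result, saved = run_dataflow(stages, [1, 2, 3], checkpoints={"cluster"})
# result == "total=12"; saved == {"cluster": 12}
```

On failure, execution could restart from the last entry in `saved` instead of from the beginning, which is exactly what checkpointing at a dataflow node buys.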

Dataflow at Different Grain Sizes

[Figure: internal execution dataflow nodes (Maps, Reduce, Iterate) linked by HPC communication.]

Coarse-grain dataflow links jobs in a pipeline: Data Preparation → Clustering → Dimension Reduction → Visualization.

But internally to each job you can also elegantly express the algorithm as dataflow, though with more stringent performance constraints. Corresponding to the classic Spark K-means dataflow (iterate):

P = loadPoints()
C = loadInitCenters()
for (int i = 0; i < 10; i++) {
    T = P.map().withBroadcast(C)
    C = T.reduce()
}

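The Spark-style pseudocode above (a map with the centers broadcast to it, then a reduce that recomputes them) can be made concrete in plain Python. This is an illustrative single-process version of the same dataflow on 1D points, not Twister2 or Spark code:

```python
def kmeans(points, centers, iterations=10):
    """Iterative map (assign, with centers 'broadcast') + reduce (recompute)."""
    for _ in range(iterations):
        # Map phase: each point is tagged with the index of its nearest center.
        tagged = [(min(range(len(centers)),
                       key=lambda c: abs(p - centers[c])), p)
                  for p in points]
        # Reduce phase: each new center is the mean of the points assigned to it.
        for c in range(len(centers)):
            assigned = [p for idx, p in tagged if idx == c]
            if assigned:
                centers[c] = sum(assigned) / len(assigned)
    return centers

# Two well-separated 1D clusters; the centers converge to roughly 1.0 and 10.0.
centers = kmeans([1.0, 1.2, 0.8, 10.0, 10.2, 9.8], centers=[0.0, 5.0])
```

In the distributed version the map runs over partitions of P with C broadcast to every worker, and the reduce is a collective over the per-partition partial sums; the per-iteration structure is identical.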

NiFi Coarse-Grain Workflow


Futures: Implementing Twister2 for the Global AI and Modeling Supercomputer

http://www.iterativemapreduce.org/


Twister2 Timeline: Current Release (End of September 2018)

• Twister:Net dataflow communication API
  • Dataflow communications with MPI or TCP
• Data access
  • Local file systems
  • HDFS integration
• Task graph
• Streaming and batch analytics: iterative jobs
• Data pipelines
• Deployments on Docker, Kubernetes, Mesos (Aurora), Slurm
• Harp for Machine Learning (custom BSP communications)
  • Rich collectives
  • Around 30 ML algorithms


Twister2 Timeline: January 2019

• DataSet API similar to Spark batch and Heron streaming, with the TSet realization
  • Can use TSets for writing RDD/Streamlet-style datasets
• Fault tolerance as in Heron and Spark
• Storm API for streaming
• Hierarchical dynamic heterogeneous task graph
  • Coarse-grain and fine-grain dataflow
  • Cyclic task graph execution
• Dynamic scaling of resources and heterogeneous resources (at the resource layer) for streaming and heterogeneous workflow
• Link to Pilot Jobs


Twister2 Timeline: July 1, 2019

• Naiad-model-based task system for Machine Learning
• Native MPI integration with Mesos, Yarn
• Dynamic task migrations
• RDMA and other communication enhancements
• Integrate parts of the Twister2 components as big data system enhancements (i.e. run current Big Data software invoking Twister2 components)
  • Heron (easiest), Spark, Flink, Hadoop (like Harp today)
  • TSets become compatible with RDDs (Spark) and Streamlets (Heron)
• Support different APIs (i.e. run Twister2 looking like current Big Data software): Hadoop, Spark (Flink), Storm
• Refinements like Marathon with Mesos, etc.
• Function as a Service and serverless
• Support higher-level abstractions
  • Twister:SQL (major Spark use case)
  • Graph API


Conclusions

• Can make use case collections to motivate benchmarks
  • NIST and BDEC have templates
  • Could helpfully fill in templates for benchmarks
• Research applications have some similarities to, but many differences from, commercial use cases
• Increasing importance of the integration of simulation and Machine Learning
• Increasing importance of distributed Edge applications
• Should benchmark both dataflow and BSP style communication
• Twister2 will combine Heron and Spark with built-in HPC performance
