Integrating Data-Parallel Analytics into Stream-Processing Using an In-Memory Data Grid DR. WILLIAM L. BAIN SCALEOUT SOFTWARE, INC.

Post on 03-Aug-2020


  • Integrating Data-Parallel Analytics into Stream-Processing Using an In-Memory Data Grid

    Dr. William L. Bain

    ScaleOut Software, Inc.

  • About the Speaker

    • Dr. William Bain, Founder & CEO of ScaleOut Software:
      • Email: [email protected]
      • Ph.D. in Electrical Engineering (Rice University, 1978)
      • Career focused on parallel computing: Bell Labs, Intel, Microsoft
      • 3 prior start-ups, the last acquired by Microsoft; its product now ships as Network Load Balancing in Windows Server

    • ScaleOut Software develops and markets In-Memory Data Grids, software for:
      • Scaling application performance with in-memory data storage
      • Analyzing live data in real time with in-memory computing

    • Thirteen+ years in the market; 450+ customers, 12,000+ servers

    In-Memory Computing Summit Europe 2018

  • Agenda

    How In-Memory Computing Creates the Next Generation in Stream-Processing
    • Goals and challenges for stream-processing
    • Adding context: stateful stream-processing
    • Overview of in-memory data grids (IMDGs)
    • Digital twin model for stream-processing
    • Why use an IMDG: integrated event processing and data-parallel analysis
    • Example use cases
    • Detailed code sample: runners with smart watches
    • Performance benefits

  • Goals for Stream-Processing

    • Goals:
      • Process incoming data streams from many (1000s of) sources.
      • Analyze events for patterns of interest.
      • Provide timely (real-time) feedback and alerts.
      • Provide data-parallel analytics for aggregate statistics and feedback.

    • Many applications:
      • Internet of Things (IoT)
      • Medical monitoring
      • Logistics
      • Financial trading systems
      • Ecommerce recommendations

    [Figure: event sources]

  • Example: Ecommerce Recommendations

    1000s of online shoppers:
    • Each shopper generates a clickstream of products searched.
    • Stream-processing system must:
      • Correlate clicks for each shopper.
      • Maintain a history of clicks during a shopping session.
      • Analyze clicks to create new recommendations within 100 msec.
    • Analysis must:
      • Take into account the shopper’s preferences and demographics.
      • Use aggregate feedback on collaborative shopping behavior.

  • Providing Recommendations in Real Time

    • Requires scalable stream-processing to analyze each click and respond in

  • Providing Aggregate Metrics

    • Must aggregate statistics for all shoppers:
      • Track real-time shopping behavior.
      • Chart key purchasing trends.
      • Enable merchandizer to create promotions dynamically.
    • Aggregate statistics can be shared with shoppers:
      • Allows shoppers to obtain collaborative feedback.
      • Examples include most viewed and best selling products.

  • Challenges for Stream-Processing Architectures

    • Basic stream-processing architecture:

    • Challenges:
      • How to efficiently correlate events from each data source?
      • How to combine events with relevant state information to create the necessary context for analysis?
      • How to embed application-specific analysis algorithms in the pipeline?
      • How to generate feedback/alerts with low latency?
      • How to perform data-parallel analytics to determine aggregate trends?

  • Adding Context to Stream-Processing

    • Stateful stream-processing platforms add “unmanaged” data storage to the pipeline:
      • Pipeline stages perform transformations in a sequence of stages from data sources to sinks.
      • Data storage (distributed cache, database) is accessed from the pipeline by application code in an unspecified manner.
    • Examples: Apama (CEP), Apache Flink, Storm
    • Problems:
      • There is no software architecture for managing state information.
      • This adds complexity to the application.
      • Creates a network bottleneck.
      • Does not address the need for data-parallel analytics.

  • Lambda Architecture: Batch Parallel Analytics

    • Lambda architecture separates stream-processing (“speed layer”) from data-parallel analytics (“batch layer”).
    • Creates queryable state, but:
      • Does not enhance context for stateful stream processing.
      • Does not perform data-parallel analytics online for immediate feedback.
      • Does not lead to a “Hybrid Transactional and Analytics Processing” (HTAP) architecture.

    https://commons.wikimedia.org/w/index.php?curid=34963987

    How can stream-processing be combined with state to simplify design, maximize performance, and enable fast data-parallel analytics?

  • In-Memory Data Grid (IMDG)

    IMDG provides a powerful platform for stateful stream-processing. What is an IMDG?
    • IMDG stores live, object-oriented data:
      • Uses a key/value storage model for large object collections.
      • Maps objects to a cluster of commodity servers with location transparency.
      • Has predictably fast (

  • How an IMDG Can Integrate Computation

    • Each grid host runs a worker process which executes application-defined methods on stored objects.
    • The set of worker processes is called an invocation grid (IG).
      • IG usually runs language-specific runtimes (JVM, .NET).
      • IMDG can ship code to the IG workers.
    • Key advantages for IGs:
      • Follows object-oriented model.
      • Avoids network bottlenecks by moving computing to the data.
      • Leverages IMDG’s cores & servers.

    [Figure labels: invocation grid; in-memory data grid]

  • IMDG Runs Event Handlers for Stream-Processing

    Event handlers run independently for each incoming event:
    • IMDG directs event to a specific object using ReactiveX for low latency.
    • IMDG executes multiple event handlers in parallel for high throughput.

  • IMDG Executes Data-Parallel Computations

    Method execution implements a batch job on an object collection:
    • Client runs a single method on all objects in a collection.
    • Execution runs in parallel across the grid.
    • Results are merged and returned to the client.

  • A Basic Data-Parallel Execution Model

    A fundamental model from parallel supercomputing:
    • Run one method (“eval”) in parallel across many data objects.
    • Optionally merge the results.
    • Binary combining is a special case, but…
      • It runs in log N time to enable scalable speedup.
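The eval-then-merge model can be illustrated with a small plain-Java sketch. It applies an eval function to every object and then combines results pairwise in rounds, so N partial results need about log2 N rounds. The names `EvalMergeSketch`, `evalAndMerge`, and `roundsFor` are hypothetical and are not part of any IMDG API; a real grid would spread the evals and each round's merges across servers, whereas this sketch runs them in one thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

class EvalMergeSketch {
    /** Runs eval on every object, then merges the results pairwise, round by round. */
    static <T, R> R evalAndMerge(List<T> objects, Function<T, R> eval, BinaryOperator<R> merge) {
        if (objects.isEmpty()) {
            throw new IllegalArgumentException("need at least one object");
        }
        // Eval phase: one independent result per object.
        List<R> results = new ArrayList<>();
        for (T obj : objects) {
            results.add(eval.apply(obj));
        }
        // Merge phase: combine neighbours; each round roughly halves the list.
        while (results.size() > 1) {
            List<R> next = new ArrayList<>();
            for (int i = 0; i < results.size(); i += 2) {
                if (i + 1 < results.size()) {
                    next.add(merge.apply(results.get(i), results.get(i + 1)));
                } else {
                    next.add(results.get(i)); // odd element carries over unchanged
                }
            }
            results = next;
        }
        return results.get(0);
    }

    /** Number of pairwise merge rounds needed for n partial results. */
    static int roundsFor(int n) {
        int rounds = 0;
        while (n > 1) {
            n = (n + 1) / 2;
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        List<Integer> heartRates = List.of(60, 72, 80, 95, 110, 64, 77, 88);
        Integer max = evalAndMerge(heartRates, hr -> hr, Integer::max);
        System.out.println("max=" + max + ", merge rounds=" + roundsFor(heartRates.size()));
    }
}
```

The merge function has to be associative for the pairwise grouping to give the same answer as a sequential fold, which is why maximum and sum work here.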

  • MapReduce Builds on This Model

    • Implements “group-by” computations.
    • Example: “Determine average RPM for all windmills by region (NE, NW, SE, SW).”
    • Runs in two data-parallel phases (map, reduce):
      • Map phase repartitions and optionally combines source data.
      • Reduce phase analyzes each data partition in parallel.
      • Returns results for each partition.

    [Figure label: partitions]
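The windmill example can be written as a group-by in standard Java streams. This is only a single-process analogy for the two phases: grouping by region stands in for the map phase's repartitioning, and the per-group average stands in for the reduce phase. The `Windmill` record and its fields are assumptions made for the sketch, and nothing here uses an IMDG or Hadoop API.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class WindmillGroupBy {
    /** Hypothetical windmill reading; field names are assumptions for this sketch. */
    static final class Windmill {
        final String region; // "NE", "NW", "SE", or "SW"
        final double rpm;

        Windmill(String region, double rpm) {
            this.region = region;
            this.rpm = rpm;
        }
    }

    /** Groups windmills by region, then averages the RPM within each group. */
    static Map<String, Double> averageRpmByRegion(List<Windmill> windmills) {
        return windmills.parallelStream()
                .collect(Collectors.groupingBy(
                        w -> w.region,                            // partition key
                        TreeMap::new,                             // sorted output for readability
                        Collectors.averagingDouble(w -> w.rpm))); // per-partition analysis
    }

    public static void main(String[] args) {
        List<Windmill> fleet = List.of(
                new Windmill("NE", 12.0), new Windmill("NE", 16.0),
                new Windmill("SW", 20.0), new Windmill("NW", 9.0));
        System.out.println(averageRpmByRegion(fleet));
    }
}
```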

  • DistributedForEach: Another Data-Parallel Model

  • Reduced GC Time with DistributedForEach

    [Chart labels: PMI; DistributedForEach]

  • Stream-Processing with the Digital Twin Model

    • Created by Michael Grieves; popularized by Gartner
    • Represents each data source with an IMDG object that holds:
      • An event collection
      • State information about the data source
      • Logic for analyzing events, updating state, and generating alerts
    • Benefits:
      • Offers a structured approach to stateful stream-processing.
      • Automatically correlates incoming events by data source.
      • Integrates all relevant context (events & state).
      • Enables easy deployment of application-specific logic (e.g., ML, rules engine, etc.) for analysis and alerting.
      • Provides domain for aggregate analysis and feedback.

    [Figure label: data-parallel analysis]
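A minimal sketch of the three parts of a digital twin object, using a made-up temperature sensor as the data source. The class name, fields, and the "three consecutive readings" rule are all assumptions chosen for illustration; they do not come from the talk or from any product API. The point is only that the event collection, the state, and the analysis logic live in one object per data source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class ThermostatTwin {
    private static final int WINDOW = 3; // consecutive readings needed before alerting

    // Event collection: readings received from this one data source.
    private final List<Double> readings = new ArrayList<>();

    // State information about the data source.
    private final String deviceId;
    private final double maxSafeTemp;

    ThermostatTwin(String deviceId, double maxSafeTemp) {
        this.deviceId = deviceId;
        this.maxSafeTemp = maxSafeTemp;
    }

    /** Logic: store the event, then alert if the last WINDOW readings all exceed the limit. */
    Optional<String> handleEvent(double temperature) {
        readings.add(temperature);
        int n = readings.size();
        if (n < WINDOW) {
            return Optional.empty();
        }
        for (int i = n - WINDOW; i < n; i++) {
            if (readings.get(i) <= maxSafeTemp) {
                return Optional.empty();
            }
        }
        return Optional.of(deviceId + ": " + WINDOW + " consecutive readings above " + maxSafeTemp);
    }

    public static void main(String[] args) {
        ThermostatTwin twin = new ThermostatTwin("t-1", 30.0);
        for (double reading : new double[] {31.0, 32.0, 33.0, 29.0}) {
            System.out.println(reading + " -> " + twin.handleEvent(reading).orElse("no alert"));
        }
    }
}
```

Because the decision depends on earlier readings, a stateless pipeline stage could not make it without fetching history from somewhere else; here the history is already in the object.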

  • Some Applications for Digital Twins

    A digital twin correlates incoming events with context using domain-specific algorithms to generate alerts:

    | Application | Context | Events | Logic | Alerts |
    |---|---|---|---|---|
    | IoT devices | Device status & history | Device telemetry | Analyze to predict maintenance. | Maintenance requests |
    | Medical monitoring | Patient history & medications | Heart-rate, blood-pressure, etc. | Evaluate measurements over time windows with rules engine. | Alerts to patient & physician |
    | Cable TV | Viewer preferences & history, set-top box status | Channel change events, telemetry | Cleanse & map channel events for reco. engine; predict box failure. | Viewer recommendations, repair alerts |
    | Ecommerce | Shopper preferences & buying history | Clickstream events from web site | Use ML to make product recommendations. | Product list for web site |
    | Fraud detection | Customer status & history | Transactions | Analyze patterns to identify probable fraud. | Alerts to customer & bank |

  • Why Use an IMDG to Host Digital Twins?

    IMDG provides an excellent DT platform:
    • Scalable, object-oriented data storage:
      • Offers a natural model for hosting digital twins.
      • Cleanly separates domain logic from data-parallel orchestration.
    • Integrated, in-memory computing:
      • Automatically correlates incoming events for analysis.
      • Enables both stream and data-parallel processing.
    • High performance:
      • Avoids data motion and associated network bottlenecks.
      • Fast and scales to handle large workloads.
    • Integrated high availability:
      • Uses data replication designed for live systems.
      • Can ensure that computation is highly available.

  • Scaling Event Ingestion with Kafka

    • IMDG partitions digital twin objects across servers.
    • Kafka offers partitions to scale out handling of event messages.
      • Partitions are distributed across brokers.
      • Brokers process messages in parallel.
    • IMDG can map Kafka partitions to grid partitions:
      • IMDG specifies event-mapping algorithm to Kafka.
      • IMDG listens to appropriate Kafka partitions.
    • This minimizes event handling latency.
      • Avoids store-and-forward within IMDG.
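One way to picture the partition mapping is a shared key-to-partition function: if the event producer and the grid agree on it, each grid server can listen only to the partitions whose twins it holds, and no event needs a second hop. The sketch below is plain Java with no Kafka or IMDG dependency. The hash, the round-robin ownership rule, and all names are assumptions for illustration; Kafka's built-in partitioner uses its own hash, so a real integration would have to supply a matching mapping, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionMappingSketch {
    /** Shared mapping, assumed to be used both for producing events and for placing twins. */
    static int partitionFor(String twinKey, int numPartitions) {
        return Math.floorMod(twinKey.hashCode(), numPartitions);
    }

    /** Partitions a server listens to, assuming simple round-robin ownership. */
    static List<Integer> partitionsOwnedBy(int serverIndex, int numServers, int numPartitions) {
        List<Integer> owned = new ArrayList<>();
        for (int p = serverIndex; p < numPartitions; p += numServers) {
            owned.add(p);
        }
        return owned;
    }

    /** Server that holds the twin for this key under the same ownership rule. */
    static int serverFor(String twinKey, int numServers, int numPartitions) {
        return partitionFor(twinKey, numPartitions) % numServers;
    }

    public static void main(String[] args) {
        int servers = 3;
        int partitions = 12;
        String key = "user-42";
        int p = partitionFor(key, partitions);
        int s = serverFor(key, servers, partitions);
        System.out.println(key + " -> partition " + p + " on server " + s
                + ", which listens to " + partitionsOwnedBy(s, servers, partitions));
    }
}
```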

  • Integrating Event and Data-Parallel Processing

    The IMDG:
    • Posts each incoming event to its respective digital twin object.
    • Runs the twin’s event handler method with low latency.
      • Event handler manages the event collection and can use time windows for analysis.
      • Event handler uses and updates in-memory state.
      • Event handler can use/update off-line state.
      • Event handler optionally generates alerts and feedback to its digital twin.
    • Runs data-parallel methods to analyze all digital twins in real time.
      • Results can be used for both alerting and feedback.

    [Figure labels: offline state; data-parallel analysis]

  • Example: Ecommerce Shopping Site

    Tracks web shoppers and provides real-time recommendations:
    • Each DT object holds clickstream of browsed products, preferences, and demographics.
    • Event handler analyzes this data and updates recommendations.
    • Periodic data-parallel, batch analytics across all shoppers determine aggregate trends:
      • Examples include best selling products, average basket size, etc.
      • Used for analysis and real-time feedback

  • Example: Tracking a Fleet of Vehicles

    • Goal: Track telemetry from a fleet of cars or trucks.
      • Events indicate speed, position, and other parameters.
      • Digital twin object stores information about vehicle, driver, and destination.
      • Event handler alerts on exceptional conditions (speeding, lost vehicle).
    • Periodic data-parallel analytics determines aggregate fleet performance:
      • Computes overall fuel efficiency, driver performance, vehicle availability, etc.
      • Can provide feedback to drivers to optimize operations.

  • Using Digital Twins in a Hierarchy

    Tracks complex systems as a hierarchy of digital twin objects:
    • Leaf nodes receive telemetry from physical endpoints.
    • Higher level nodes represent subsystems:
      • Receive telemetry from lower-level nodes.
      • Supply telemetry to higher-level nodes as alerts.
      • Allow successive refinement of real-time telemetry into higher-level abstractions.

    [Figure: Example: Hierarchy of Digital Twins for a Windmill]
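The hierarchy can be sketched as nodes that forward alerts upward. In this hypothetical example a leaf turns out-of-range telemetry into an alert for its parent, and a subsystem node escalates once it has collected a set number of alerts from below. The `TwinNode` class, the threshold rule, and the windmill part names are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

class TwinNode {
    final String name;
    final TwinNode parent;      // null for the root of the hierarchy
    final int escalateAfter;    // alerts from below needed before escalating
    final List<String> received = new ArrayList<>();

    TwinNode(String name, TwinNode parent, int escalateAfter) {
        this.name = name;
        this.parent = parent;
        this.escalateAfter = escalateAfter;
    }

    /** Leaf entry point: raw telemetry from a physical endpoint. */
    void onTelemetry(double value, double limit) {
        if (value > limit && parent != null) {
            parent.onAlert(name + " reading " + value + " exceeds " + limit);
        }
    }

    /** Subsystem entry point: alerts from lower-level nodes act as this node's telemetry. */
    void onAlert(String alert) {
        received.add(alert);
        if (parent != null && received.size() == escalateAfter) {
            parent.onAlert(name + " degraded: " + received.size() + " alerts");
        }
    }

    public static void main(String[] args) {
        TwinNode windmill = new TwinNode("windmill", null, 1);
        TwinNode rotor = new TwinNode("rotor", windmill, 2);
        TwinNode blade1 = new TwinNode("blade-1", rotor, 0);
        TwinNode blade2 = new TwinNode("blade-2", rotor, 0);

        blade1.onTelemetry(5.0, 3.0); // rotor records one alert, no escalation yet
        blade2.onTelemetry(4.0, 3.0); // second alert: rotor escalates to the windmill
        System.out.println("rotor: " + rotor.received);
        System.out.println("windmill: " + windmill.received);
    }
}
```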

  • Detailed Example: Heart-Rate Watch Monitoring

    Goal: Track heart-rate for a large population of runners.
    • Heart-rate events flow from smart watches to their respective digital twin objects for analysis.
    • The analysis uses wearer’s history, activity, and aggregate statistics to determine feedback and alerts.

  • Digital Twin Object (Java)

    • Holds event collection and user’s context (age, medical history, current status, etc.):

    import java.io.Serializable;
    import java.util.List;

    public class User implements Serializable {
        // User's context
        private int _id;
        private double _height;
        private double _bodyWeight;
        private Gender _gender;
        private int _age;
        private int _averageHr;
        private WorkoutProgress _status;
        private int _sessionAverageMax;
        private List _medications;
        private List _heartIncidents;

        // Event collection
        private List _runningHeartRateTelemetry;

        private long _alertTime;
        private boolean _alerted;
        ...
    }

  • Events & Alerts

    • Event holds periodic telemetry sent from watch to IMDG:

    public class HeartRateEvent {
        private int _userId;
        private int _heartRate;
        private long _timestamp;
        private WorkoutType _workoutType;
        private WorkoutProgress _workoutProgress;
        private Event _event;
        ...
    }

    • Alert holds data to be sent back to wearer and/or to medical personnel:

    public class HeartRateAlert {
        private int _userId;
        private String _alertType;
        private String _params;
        ...
    }

  • Setting Up a ReactiveX Pipeline on the IMDG

    • Define a ReactiveX observer that runs on every server in the IMDG:

    public class HeartRateObserver implements Observer, Serializable {
        @Override
        public void onNext(Event event) {
            HeartRateEvent hre = HeartRateEvent.fromBytes(event.getPayload());
            hre.setEvent(event);
            User.processRunningEvent(hre);   // call application
        }
        ...
    }

    • Create an invocation grid that initializes the ReactiveX observer at startup:

    Pipeline pipeline = new Pipeline("userCache", "userGrid");
    GridAction action = pipeline.createRemoteObserverAction("userObserver",
            new HeartRateObserver());
    InvocationGrid grid = new InvocationGridBuilder("userGrid")
            .addJar("./bin/appcode.jar")
            .addStartupAction(action)        // initialize observer
            .load();

  • Event Handler and Event Posting

    • Posting an event to the ReactiveX observer:
      • The key determines which server receives the event for posting.

    pipeline.postEvent(makeKey(UserId), "heartRateEvent",
            HeartRateEvent.toBytes(new HeartRateEvent(last, System.nanoTime(),
                    WorkoutType.Running, WorkoutProgress)));

    • Handling an event posted to the ReactiveX observer on the digital twin’s server:

    // Not private, so that HeartRateObserver.onNext can call it.
    static void processRunningEvent(HeartRateEvent hre) {
        CachedObjectId id = hre.getId();
        User u = (User) cache.retrieve(id, false);   // retrieve DT object
        ...
        executeRunningWorkoutAnalytics(hre, u);      // analyze event
        ...
        cache.update(id, u);                         // update DT object
    }

  • Event Analysis

    • Handles an event for an active user doing a running workout:

    private static void executeRunningWorkoutAnalytics(HeartRateEvent hre, User u) {
        long start = twoWeeksAgo();
        long sessionTimeout = threeHours();

        // Create time windows over the stored heart-rate telemetry.
        SessionWindowCollection swc = new SessionWindowCollection(
                u.getRunningHeartRateTelemetry(),
                heartRate -> heartRate.getTimestamp(), start, sessionTimeout);

        // Add event.
        swc.add(new HeartRate(hre.getHeartRate(), hre.getTimestamp()));

        // Analyze event history: average the per-window averages (integer arithmetic).
        int total = 0;
        int windowCount = 0;
        for (TimeWindow window : swc) {
            int avg = 0; // holds the window's sum until the division below
            for (HeartRate hr : window) {
                avg += hr.getHeartRate();
            }
            total += (avg / window.size());
            windowCount++;
        }
        u.setAverageHr(total / windowCount);

        // Analyze user's context.
        u.analyzeAndCheckForAlert(hre);
    }
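To make the session-window step concrete without relying on the windowing library used on the slide, here is a plain-Java stand-in. It splits time-ordered samples into sessions wherever the gap between neighbours exceeds a timeout, then averages the per-session averages. `SessionWindowSketch`, `Sample`, and both method names are hypothetical, and the input is assumed to be sorted by timestamp already.

```java
import java.util.ArrayList;
import java.util.List;

class SessionWindowSketch {
    /** Hypothetical heart-rate sample; stands in for the telemetry items on the slide. */
    static final class Sample {
        final long timestampMs;
        final int heartRate;

        Sample(long timestampMs, int heartRate) {
            this.timestampMs = timestampMs;
            this.heartRate = heartRate;
        }
    }

    /** Splits time-ordered samples into sessions; a gap longer than timeoutMs starts a new one. */
    static List<List<Sample>> toSessions(List<Sample> ordered, long timeoutMs) {
        List<List<Sample>> sessions = new ArrayList<>();
        List<Sample> current = null;
        long previous = 0;
        for (Sample s : ordered) {
            if (current == null || s.timestampMs - previous > timeoutMs) {
                current = new ArrayList<>();
                sessions.add(current);
            }
            current.add(s);
            previous = s.timestampMs;
        }
        return sessions;
    }

    /** Mean of the per-session mean heart rates; 0 when there are no samples. */
    static double averageOfSessionAverages(List<List<Sample>> sessions) {
        if (sessions.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (List<Sample> session : sessions) {
            double sum = 0.0;
            for (Sample s : session) {
                sum += s.heartRate;
            }
            total += sum / session.size();
        }
        return total / sessions.size();
    }

    public static void main(String[] args) {
        long threeHoursMs = 3L * 60 * 60 * 1000;
        List<Sample> samples = List.of(
                new Sample(0L, 100), new Sample(1_000L, 120),  // first session
                new Sample(20_000_000L, 80));                  // more than 3 hours later
        List<List<Sample>> sessions = toSessions(samples, threeHoursMs);
        System.out.println(sessions.size() + " sessions, average " + averageOfSessionAverages(sessions));
    }
}
```

The sketch uses `double` arithmetic and guards the empty case, whereas the slide's integer division truncates each window's average and assumes at least one window exists.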

  • Analysis Techniques Enabled by Digital Twin

    Enable detailed heart-rate monitoring for a high intensity exercise program:
    • Example of data to be tracked:
      • Exercise specifics: type of exercise, exercise-specific parameters (distance, strides, altitude change, etc.)
      • Participant background/history: age, height, weight history, heart-related medical conditions and medications, injuries, previous medical events
      • Exercise tracking: session history, average # sessions per week, average and peak heart rates, frequency of exercise types
      • Aggregate statistics: average/max/min exercise tracking statistics for all participants
    • Example of logic to be performed:
      • Notify participant if session history across time windows indicates need to change mix.
      • Notify participant if heart rate trends deviate significantly from aggregate statistics.
      • Alert participant/medical personnel if heart rate analysis across time windows indicates an imminent threat to health.
      • Report aggregate statistics to analysts and/or users.
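As one possible reading of "deviate significantly from aggregate statistics", the check below compares a participant's average heart rate with the group average and flags a relative difference above a chosen fraction. The 15% threshold in the usage line, the class name, and the rule itself are assumptions for illustration; the talk does not specify how significance is decided.

```java
class DeviationCheck {
    /** True when userAvg differs from groupAvg by more than the given fraction of groupAvg. */
    static boolean deviates(double userAvg, double groupAvg, double fraction) {
        if (groupAvg <= 0) {
            return false; // no aggregate data yet, so nothing to compare against
        }
        return Math.abs(userAvg - groupAvg) / groupAvg > fraction;
    }

    public static void main(String[] args) {
        // Hypothetical values: participant averages 180 bpm, the age group averages 150 bpm.
        System.out.println(deviates(180.0, 150.0, 0.15));
    }
}
```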

  • Data-Parallel Analysis Across All Digital Twins

    • Uses IMDG’s in-memory compute engine to create aggregate statistics in real time.
    • Results can be reported to analysts and updated every few seconds.
    • Results can be used as feedback to event analysis in digital twin objects and/or reported to users.

  • Computing Aggregate Data

    • Performs a data-parallel computation using the IMDG’s eval and merge methods:

    public class AggregateStatsInvokable implements Invokable {
        // Eval method
        @Override
        public AggregateStats eval(User u, Integer numUsers) {
            AggregateStats userStats = new AggregateStats(numUsers);
            userStats.merge(u);
            return userStats;
        }

        // Binary merge method
        @Override
        public AggregateStats merge(AggregateStats mergedStats, AggregateStats u) {
            mergedStats.merge(u);
            return mergedStats;
        }
    }

  • Computing Aggregate Data (2)

    • Computes running average of heart-rate by categories:

    public void merge(AggregateStats user) {
        numEvents += user.getNumEvents();

        // Creates groups: by age
        totalHeartRate18to34 += user.getTotalHeartRate18to34();
        totalHeartRate35to50 += user.getTotalHeartRate35to50();
        totalHeartRateOver50 += user.getTotalHeartRateOver50();
        count18to34 += user.getCount18to34();
        count35to50 += user.getCount35to50();
        countOver50 += user.getCountOver50();

        // Creates groups: by BMI
        totalHeartRateBmiUnderWeight += user.getTotalHeartRateBmiUnderWeight();
        totalHeartRateBmiNormalWeight += user.getTotalHeartRateBmiNormalWeight();
        totalHeartRateBmiOverweight += user.getTotalHeartRateBmiOverweight();
        countUnderweight += user.getCountUnderweight();
        countNormalWeight += user.getCountNormalWeight();
        countOverWeight += user.getCountOverWeight();
    }
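The merge above adds totals and counts rather than averages, which is what lets partial results be combined in any order. A compact sketch of the same idea, with the age groups held in arrays and the average derived only when it is read: `BucketStats`, its constants, and the rule that any age of 34 or less falls in the first group are assumptions of this sketch, not the talk's `AggregateStats` class.

```java
class BucketStats {
    static final int AGE_18_TO_34 = 0;
    static final int AGE_35_TO_50 = 1;
    static final int AGE_OVER_50 = 2;

    final long[] totalHeartRate = new long[3];
    final long[] count = new long[3];

    /** Assumption: every age of 34 or less is counted in the first group. */
    static int bucketFor(int age) {
        if (age <= 34) {
            return AGE_18_TO_34;
        }
        return age <= 50 ? AGE_35_TO_50 : AGE_OVER_50;
    }

    /** Adds one user's heart-rate reading to the matching age group. */
    void add(int age, int heartRate) {
        int b = bucketFor(age);
        totalHeartRate[b] += heartRate;
        count[b]++;
    }

    /** Binary merge: sums totals and counts, so the order of merging does not matter. */
    void merge(BucketStats other) {
        for (int b = 0; b < count.length; b++) {
            totalHeartRate[b] += other.totalHeartRate[b];
            count[b] += other.count[b];
        }
    }

    /** Average for one group, derived from the merged totals; 0 if the group is empty. */
    double average(int bucket) {
        return count[bucket] == 0 ? 0.0 : (double) totalHeartRate[bucket] / count[bucket];
    }

    public static void main(String[] args) {
        BucketStats serverA = new BucketStats();
        serverA.add(25, 150);
        serverA.add(30, 170);
        BucketStats serverB = new BucketStats();
        serverB.add(28, 130);
        serverA.merge(serverB);
        System.out.println("18-34 average: " + serverA.average(AGE_18_TO_34));
    }
}
```

In the usage above, averaging the two partial averages (160 and 130) would give 145, while the true average of the three readings is 150; carrying totals and counts avoids that error.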

  • Running the Data-Parallel Computation

    • Uses a single method to run a data-parallel computation and return results.
    • Publishes merged results to an IMDG object for access by user objects and/or analysts.

    public void run() {
        NamedCache usersCache = CacheFactory.getCache("userCache");
        NamedCache statsCache = CacheFactory.getCache("statsCache");
        AggregateStats stats;

        // Invoke data-parallel op
        InvokeResult result = usersCache.invoke(AggregateStatsInvokable.class, null, _numUsers,
                TimeSpan.fromMilliseconds(10000));

        // Store result in IMDG
        stats = result.getResult();
        statsCache.put("globalStats", stats);
    }

  • Data-Parallel Execution Steps

    • Eval phase: each server queries local objects and runs eval and merge methods:
      • Accessing local objects avoids data motion.
      • Completes with one result object per server.
    • Merge phase: all servers perform binary, distributed merge to create final result:
      • Merge runs in parallel to minimize completion time.
      • Returns final result object to client.

  • Predictable, Scalable Performance

    • Digital twin model enables the IMDG to scale both event-handling and integrated data-parallel analysis.
    • Correlating events to digital twin objects creates an automatic basis for performance scaling:
      • For event analysis
      • For data-parallel analysis
    • It enables access to each event’s context without requiring a network access.
    • It also co-locates and encapsulates application-specific code using object-oriented techniques.

  • Avoids Network Bottlenecks

    • Digital twin model avoids network bottlenecks associated with using an IMDG as a networked cache in a stream-processing pipeline.
      • External data storage requires network access to obtain an event’s context.
      • Network bottleneck prevents scalable throughput.

  • Wrap-Up

    Digital Twins: The Next Generation in Stateful Stream-Processing
    • Challenge: Current techniques for stateful stream-processing:
      • Lack a coherent software architecture for managing context.
      • Can suffer from performance issues due to network bottlenecks.
    • The digital twin model:
      • Offers a flexible, powerful, scalable architecture for stateful stream-processing:
        • Associates events with context about their physical sources for deeper introspection.
        • Enables flexible, object-oriented encapsulation of analysis algorithms.
        • Provides a basis for aggregate analysis and feedback.
    • Scalable, data-parallel computing with an IMDG:
      • Automatically correlates incoming events and processes them in parallel.
      • Implements integrated (real-time), aggregate analysis for immediate feedback.

  • www.scaleoutsoftware.com

    In-Memory Computing for Operational Intelligence