apex & geode: in-memory streaming, storage & analytics

Post on 19-Mar-2017


In-Memory Streaming, Storage & Analytics

Apache Apex + Apache Geode

Thomas Weise, Ashish Tadose

Apex Features…

•  In-memory Stream Processing
•  Partitioning and Scaling out
•  Windowing Support (temporal)
•  Stateful Fault-tolerance, Operability
•  Processing Guarantees
•  Compute Locality
•  Dynamic updates

Apex Platform Overview

Application Programming Model

§  Stream is a sequence of data tuples
§  Operator takes one or more input streams, performs computations & emits one or more output streams
–  Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
–  Operator has many instances that run in parallel and each instance is single-threaded
§  Directed Acyclic Graph (DAG) is made up of operators and streams
–  Iterative processing supported
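The stream/operator/DAG model above can be illustrated with a small stdlib-only sketch. It does not use the Apex API: `SketchOperator`, `LineSplitter`, `LengthFilter` and `SketchDag` are hypothetical names, and a stream is modeled as a plain callback from one operator to the next.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for an operator: consumes tuples from one input
// stream and emits tuples to one output stream. Not the Apex API.
abstract class SketchOperator<I, O> {
    private Consumer<O> output = t -> { };   // downstream stream, set when wiring the DAG

    void connect(Consumer<O> downstream) { this.output = downstream; }
    protected void emit(O tuple) { output.accept(tuple); }
    abstract void process(I tuple);          // called once per tuple, single-threaded
}

// Splits each incoming line into words and emits every non-empty word.
class LineSplitter extends SketchOperator<String, String> {
    @Override void process(String line) {
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) emit(word);
        }
    }
}

// Passes on only the words that have at least minLength characters.
class LengthFilter extends SketchOperator<String, String> {
    private final int minLength;
    LengthFilter(int minLength) { this.minLength = minLength; }
    @Override void process(String word) {
        if (word.length() >= minLength) emit(word);
    }
}

class SketchDag {
    // Wires splitter -> filter -> collecting sink and pushes the lines through.
    static List<String> run(List<String> lines, int minLength) {
        LineSplitter splitter = new LineSplitter();
        LengthFilter filter = new LengthFilter(minLength);
        List<String> sink = new ArrayList<>();
        splitter.connect(filter::process);
        filter.connect(sink::add);
        lines.forEach(splitter::process);
        return sink;
    }
}
```

Calling `SketchDag.run(lines, 4)` pushes every line through both operators in order. In a real deployment each operator would run as one or more parallel instances rather than in a single thread.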

[Figure: Directed Acyclic Graph (DAG). Labels: Operator, Tuple, Output Stream]

Apache Apex-Malhar

Apex Native Hadoop Integration

•  YARN is the resource manager
•  HDFS used for storing any persistent state

Apex Fault Tolerance

•  Operator state is checkpointed to a persistent store
–  Automatically performed by engine, no additional work needed by operator
–  In case of failure operators are restarted from checkpoint state
–  Frequency configurable per operator
–  Asynchronous and distributed by default
–  Default store is HDFS
•  Automatic detection and recovery of failed operators
–  Heartbeat mechanism
•  Buffering mechanism to ensure replay of data from recovered point so that there is no loss of data
•  Application master state checkpointed
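The checkpoint, restart and replay cycle described above can be sketched in plain Java. This is a single-threaded simulation under stated assumptions, not the engine's implementation: `SummingOperator` and `CheckpointReplaySketch` are hypothetical, the "persistent store" is a local variable, and the upstream buffer is simply the list of windows.

```java
import java.util.List;

// Hypothetical stateful operator: keeps a running sum of the tuples it has seen.
class SummingOperator {
    long sum;
    void process(int tuple) { sum += tuple; }
}

class CheckpointReplaySketch {
    /**
     * Processes the windows in order and checkpoints after every
     * checkpointInterval windows. A failure wipes the operator when it
     * reaches window failAt (0-based, -1 for no failure); the operator is
     * restored from the last checkpoint and the buffered windows after
     * that checkpoint are replayed.
     */
    static long run(List<int[]> windows, int checkpointInterval, int failAt) {
        SummingOperator op = new SummingOperator();
        long checkpointedSum = 0;      // state saved in the "persistent store"
        int checkpointedWindow = -1;   // last window covered by that state
        boolean failed = false;
        int w = 0;
        while (w < windows.size()) {
            if (w == failAt && !failed) {
                failed = true;
                op = new SummingOperator();   // operator instance is lost
                op.sum = checkpointedSum;     // restart from checkpoint state
                w = checkpointedWindow + 1;   // upstream buffer replays from here
                continue;
            }
            for (int tuple : windows.get(w)) op.process(tuple);
            if ((w + 1) % checkpointInterval == 0) {
                checkpointedSum = op.sum;
                checkpointedWindow = w;
            }
            w++;
        }
        return op.sum;
    }
}
```

Because the replay starts at the window right after the checkpoint, the final sum is the same with and without the simulated failure.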

Apex Processing Semantics

At-least-once
•  On recovery data will be replayed from a previous checkpoint
–  No messages lost
–  Default, suitable for most applications
•  Can be used to ensure data is written once to store
–  Transactions with meta information, Rewinding output, Feedback from external entity, Idempotent operations

At-most-once
•  On recovery the latest data is made available to operator
–  Useful where data loss is acceptable and latest data is sufficient

Exactly-once
–  At-least-once + idempotency + transactional mechanisms (operator logic) to achieve end-to-end exactly once behavior
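To see why idempotent operations turn at-least-once delivery into an effectively-once result, the sketch below replays the tail of a stream into two sinks. All names are hypothetical, and a `HashMap` stands in for a keyed external store.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IdempotentSinkSketch {
    final List<String> appendLog = new ArrayList<>();        // non-idempotent sink
    final Map<String, String> keyedStore = new HashMap<>();  // idempotent sink

    // Writes one tuple, given as {key, value}, to both sinks.
    void write(String[] tuple) {
        appendLog.add(tuple[1]);             // a replay appends a duplicate
        keyedStore.put(tuple[0], tuple[1]);  // a replay overwrites the same entry
    }

    // Delivers the tuples once, then replays the last replayCount of them,
    // as an at-least-once recovery would.
    static IdempotentSinkSketch deliver(List<String[]> tuples, int replayCount) {
        IdempotentSinkSketch sink = new IdempotentSinkSketch();
        tuples.forEach(sink::write);
        tuples.subList(tuples.size() - replayCount, tuples.size()).forEach(sink::write);
        return sink;
    }
}
```

After a replay the append-only log contains duplicates, while the keyed store holds exactly one entry per key.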

IMC Benefits for Apex

•  Data flow in-memory, no disk
•  Incremental recovery – buffer server
•  In-memory data for querying data

Streaming meets In-Memory Data Grid

Apex + Geode Integration

Completed

•  Operator check-pointing in Geode.
•  Output operator to store tuples in Geode region.

Proposed

•  Geode output operator with Transactional support.
•  Ingest data from Geode to Apex DAG.
•  Distributed Cache Operator.
•  Scan operator for parallel query execution & result retrieval.

Operator Checkpointing in Geode

Apex Operator check-pointing in an IMDG (Geode store)
•  Checkpointing is an essential mechanism to ensure Fault Tolerance
•  Apex checkpoints operator state to HDFS
•  Slower HDFS checkpointing hurts application performance
•  Checkpointing in Geode ensures that application performance is not impacted
•  Geode has better latency for write operations than HDFS.

Implementation: GeodeStorageAgent

https://issues.apache.org/jira/browse/APEXCORE-283
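As a rough picture of what a checkpoint store backed by an in-memory grid has to do, the sketch below saves and loads serialized operator state keyed by operator id and window id. It is not the GeodeStorageAgent code: `InMemoryCheckpointStore` is a hypothetical class, a `ConcurrentHashMap` stands in for a Geode region, and the key format is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical checkpoint store keyed by (operatorId, windowId).
// The map is a stand-in for a Geode region; no Geode API is used here.
class InMemoryCheckpointStore {
    private final Map<String, byte[]> region = new ConcurrentHashMap<>();

    private static String key(int operatorId, long windowId) {
        return operatorId + "/" + Long.toHexString(windowId);
    }

    // Serializes the operator state and stores it under its checkpoint key.
    void save(Object state, int operatorId, long windowId) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(state);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        region.put(key(operatorId, windowId), bytes.toByteArray());
    }

    // Reads back the state saved for this operator and window.
    Object load(int operatorId, long windowId) {
        byte[] data = region.get(key(operatorId, windowId));
        if (data == null) {
            throw new IllegalStateException("no checkpoint for " + key(operatorId, windowId));
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Drops a checkpoint that is no longer needed for recovery.
    void delete(int operatorId, long windowId) {
        region.remove(key(operatorId, windowId));
    }
}
```

The write path is a single in-memory put, which is the property the slide relies on when it contrasts Geode write latency with HDFS.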

Data Streams to Geode Store

Apex Output Operator to write to Geode store
•  Apex Output operator – Egress data from Apex DAG to external store
•  Store incoming tuples in binary/POJO format in Geode region
•  Geode Efficient Query integration – OQL
•  Geode region supports data replication, overflow to disk, persistence & many eviction strategies

Implementation: GeodeStore, GeodePOJOPutOperator, AbstractGeodePutOperator

https://malhar.atlassian.net/projects/MLHR/issues/MLHR-1942

Geode Transactional Writes

Apex Output Operator to write to Geode store with Transactions
•  Apex DAG uses TransactionableStore to guarantee that records are written exactly once, e.g. JdbcTransactionalStore
•  Geode provides Transaction support for efficient and safe coordinated operations
•  A Geode store using transactions guarantees that records are written exactly once
•  Put operator backed by Geode Transactional store can help to achieve Exactly-once semantics

Implementation: GeodeWindowStore as TransactionableStore
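One common way to get exactly-once writes from a transactional store is to commit each window's data together with the window id, and to skip any replayed window whose id is already committed. The sketch below shows that idea with hypothetical classes (`WindowedTxStoreSketch`, `TxPutOperatorSketch`); it is not the GeodeWindowStore code, and a `synchronized` method stands in for a real transaction.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical transactional store: data and the id of the last committed
// window change together, or not at all. A plain map stands in for Geode.
class WindowedTxStoreSketch {
    private final Map<String, Integer> data = new HashMap<>();
    private long committedWindowId = -1;

    synchronized long getCommittedWindowId() { return committedWindowId; }
    synchronized Integer get(String key) { return data.get(key); }

    // Applies the whole batch and records windowId in one atomic step.
    synchronized void commit(long windowId, Map<String, Integer> batch) {
        batch.forEach((k, v) -> data.merge(k, v, Integer::sum));
        committedWindowId = windowId;
    }
}

// Hypothetical output operator that buffers one window of increments.
class TxPutOperatorSketch {
    private final WindowedTxStoreSketch store;
    private final Map<String, Integer> pending = new LinkedHashMap<>();
    private long currentWindowId;

    TxPutOperatorSketch(WindowedTxStoreSketch store) { this.store = store; }

    void beginWindow(long windowId) { currentWindowId = windowId; pending.clear(); }
    void process(String key, int amount) { pending.merge(key, amount, Integer::sum); }

    void endWindow() {
        // A window at or below the committed id was already written before
        // the failure, so its replayed tuples are dropped.
        if (currentWindowId > store.getCommittedWindowId()) {
            store.commit(currentWindowId, pending);
        }
    }
}
```

The writes here are increments, which are not idempotent on their own, so the window-id check is what prevents a replayed window from being counted twice.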

Streaming Geode Data in Apex

Apex Input Operator to read from Geode store

•  Apex Input operators – Ingest data from external sources into Apex DAG
•  Geode provides versatile and reliable event distribution to provide Real Time updates to data
•  Use case – Apex operator to stream async events from Geode in DAG
•  Callback events reduce polling cycles over network

Implementation: GeodeRegionStreamOperator receives newly added tuples and emits them into the DAG
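The callback-driven ingestion described above can be sketched as a queue that a listener thread fills and the operator thread drains. `RegionStreamSketch` and its method names are hypothetical stand-ins, not the GeodeRegionStreamOperator or the Geode listener API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical input operator fed by push-style callbacks instead of polling.
class RegionStreamSketch<T> {
    private final Queue<T> events = new ConcurrentLinkedQueue<>();

    // Called by the event source's thread (stand-in for a region listener callback).
    void afterCreate(T newValue) { events.add(newValue); }

    // Called by the operator thread; drains at most maxTuples buffered events.
    List<T> emitTuples(int maxTuples) {
        List<T> emitted = new ArrayList<>();
        T next;
        while (emitted.size() < maxTuples && (next = events.poll()) != null) {
            emitted.add(next);
        }
        return emitted;
    }
}
```

The size check runs before `poll()`, so an event is only removed from the queue when there is room to emit it.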

Geode Cache Operator

Apex Geode Cache Operator

•  Geode provides efficient Events & Notifications
•  Register interest – update local copies
•  Continuous Query
–  Receive notification when Query condition met on server
–  E.g. SELECT * FROM /tradeOrder t WHERE t.price > 100.00
•  Use Geode events notification framework to maintain & invalidate cache.

Implementation: GeodeCacheOperator maintains consistent cache based on subscribed key set/query
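A local cache kept consistent by notifications might look like the following sketch, which mirrors the `t.price > 100.00` condition from the example query. `QueryCacheSketch` and its callbacks are hypothetical; a real operator would receive these events from Geode's notification framework rather than from direct method calls.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.DoublePredicate;

// Hypothetical local cache kept in sync by server-pushed events.
class QueryCacheSketch {
    private final Map<String, Double> cache = new ConcurrentHashMap<>();
    private final DoublePredicate condition;

    QueryCacheSketch(DoublePredicate condition) { this.condition = condition; }

    // Stand-in for a create/update notification for one order.
    void onUpdate(String orderId, double price) {
        if (condition.test(price)) {
            cache.put(orderId, price);   // entry satisfies the query: cache it
        } else {
            cache.remove(orderId);       // entry left the result set: invalidate
        }
    }

    // Stand-in for a destroy notification.
    void onDestroy(String orderId) { cache.remove(orderId); }

    Double lookup(String orderId) { return cache.get(orderId); }
    int size() { return cache.size(); }
}
```

Constructing it with `new QueryCacheSketch(p -> p > 100.00)` keeps only the orders that currently satisfy the condition, and an update that drops a price to 100.00 or below removes the entry.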

Geode Scan Operator

Apex Geode Scan Operator

•  Function Execution provides Parallel Query Execution
•  MapReduce-like execution: concurrent execution on members; results are collected from members & sent to caller.
•  Use case: Streaming application depending on large scan result from external store

Implementation: GeodeQueryOperator executes data-dependent queries on a distributed region and emits the results into the DAG
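The MapReduce-like pattern above (run the same function on every member, then gather the partial results at the caller) can be sketched with a thread pool. This is an illustration only: `ParallelScanSketch` is hypothetical, each inner list plays the role of the data held by one member, and no Geode function-execution API is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntPredicate;

class ParallelScanSketch {
    // Runs the filter on every partition concurrently (one task per "member")
    // and gathers the partial results for the caller, in partition order.
    static List<Integer> scan(List<List<Integer>> partitions, IntPredicate filter) {
        ExecutorService members = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<List<Integer>>> partials = new ArrayList<>();
            for (List<Integer> partition : partitions) {
                partials.add(members.submit(() -> {
                    List<Integer> hits = new ArrayList<>();
                    for (int value : partition) {
                        if (filter.test(value)) hits.add(value);
                    }
                    return hits;
                }));
            }
            List<Integer> result = new ArrayList<>();
            for (Future<List<Integer>> partial : partials) {
                result.addAll(partial.get());   // blocks until that member is done
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("scan interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("scan failed", e.getCause());
        } finally {
            members.shutdown();
        }
    }
}
```

Only the matching values travel back to the caller, which is the point of pushing the scan to where the data lives.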

Join the Apache Geode Community!

•  Check out: http://geode.incubator.apache.org

•  Subscribe: user-subscribe@geode.incubator.apache.org

•  Download: http://geode.incubator.apache.org/releases/

Questions???

Thank You…
