apex & geode: in-memory streaming, storage & analytics

In-Memory Streaming, Storage & Analytics: Apache Apex + Apache Geode. Thomas Weise, Ashish Tadose

Upload: ashish-tadose

Post on 19-Mar-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Apex & Geode: In-memory streaming, storage & analytics

In-Memory Streaming, Storage & Analytics
Apache Apex + Apache Geode

Thomas Weise, Ashish Tadose

Page 2: Apex & Geode: In-memory streaming, storage & analytics

Apex Features

•  In-memory stream processing
•  Partitioning and scaling out
•  Windowing support (temporal)
•  Stateful fault tolerance, operability
•  Processing guarantees
•  Compute locality
•  Dynamic updates

Page 3: Apex & Geode: In-memory streaming, storage & analytics

Apex Platform Overview

Page 4: Apex & Geode: In-memory streaming, storage & analytics

Application Programming Model

§  Stream is a sequence of data tuples
§  Operator takes one or more input streams, performs computations, and emits one or more output streams
–  Each operator is your custom business logic in Java, or a built-in operator from our open source library
–  An operator has many instances that run in parallel, and each instance is single-threaded
§  Directed Acyclic Graph (DAG) is made up of operators and streams
–  Iterative processing supported

[Figure: Directed Acyclic Graph (DAG). Operators are connected by streams of tuples and end in an output stream.]
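The programming model on this slide can be illustrated with a small, self-contained Java sketch. It deliberately does not use the Apex API: `MiniDag`, `Operator`, `connect` and `run` are hypothetical names invented for this example, and the sketch wires only a linear pipeline rather than a general DAG.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-ins for illustration only; these are NOT Apex classes.
final class MiniDag {
    /** An operator consumes one tuple at a time and may emit zero or more tuples downstream. */
    interface Operator<I, O> {
        void process(I tuple, Consumer<O> out);
    }

    /** Connects two operators with a stream: every tuple the first emits is fed to the second. */
    static <A, B, C> Operator<A, C> connect(Operator<A, B> up, Operator<B, C> down) {
        return (tuple, out) -> up.process(tuple, mid -> down.process(mid, out));
    }

    /** Pushes a finite input stream through the pipeline and collects the output stream. */
    static <I, O> List<O> run(List<I> input, Operator<I, O> pipeline) {
        List<O> output = new ArrayList<>();
        for (I tuple : input) {
            pipeline.process(tuple, output::add);
        }
        return output;
    }

    /** Two operators: split lines into words, then map each word to its length. */
    static List<Integer> demo() {
        Operator<String, String> splitter = (line, out) -> {
            for (String word : line.split(" ")) {
                out.accept(word);
            }
        };
        Operator<String, Integer> lengths = (word, out) -> out.accept(word.length());
        return run(Arrays.asList("apex geode", "dag"), connect(splitter, lengths));
    }
}
```

Each operator here is a single-threaded function from one input tuple to any number of output tuples, which mirrors the slide's description of an operator instance; partitioning into parallel instances is left out of the sketch.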

Page 5: Apex & Geode: In-memory streaming, storage & analytics

Apache Apex-Malhar

Page 6: Apex & Geode: In-memory streaming, storage & analytics

Apex Native Hadoop Integration

•  YARN is the resource manager
•  HDFS is used for storing any persistent state

Page 7: Apex & Geode: In-memory streaming, storage & analytics

Apex Fault Tolerance

•  Operator state is checkpointed to a persistent store
–  Automatically performed by the engine, no additional work needed by the operator
–  In case of failure, operators are restarted from checkpointed state
–  Frequency configurable per operator
–  Asynchronous and distributed by default
–  Default store is HDFS
•  Automatic detection and recovery of failed operators
–  Heartbeat mechanism
•  Buffering mechanism to ensure replay of data from the recovered point so that there is no loss of data
•  Application master state checkpointed
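The checkpoint-and-replay idea on this slide can be sketched in plain Java. This is a simplified illustration, not the Apex engine: `CheckpointSketch` and its methods are hypothetical, a byte array stands in for the persistent store (HDFS by default in Apex), and the replay buffer lives inside the same object for brevity, whereas a real deployment buffers data upstream of the operator.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not Apex's engine or storage-agent API.
final class CheckpointSketch {
    /** Operator state must be serializable so that it can be saved to a store. */
    static final class SumState implements Serializable {
        private static final long serialVersionUID = 1L;
        long sum;
    }

    private SumState state = new SumState();
    private byte[] lastCheckpoint;                               // stands in for the persistent store
    private final List<Long> replayBuffer = new ArrayList<>();   // tuples seen since the last checkpoint

    CheckpointSketch() {
        checkpoint(); // initial empty checkpoint, so recovery always has a starting point
    }

    void process(long tuple) {
        state.sum += tuple;
        replayBuffer.add(tuple);
    }

    /** Saves a serialized copy of the state and drops tuples that no longer need replay. */
    void checkpoint() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(state);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        lastCheckpoint = bytes.toByteArray();
        replayBuffer.clear();
    }

    /** Mimics an operator crash: the in-memory state is lost. */
    void simulateFailure() {
        state = new SumState();
    }

    /** Restores the checkpointed state, then replays the buffered tuples on top of it. */
    void recover() {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(lastCheckpoint))) {
            state = (SumState) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("checkpoint could not be restored", e);
        }
        for (long tuple : replayBuffer) {
            state.sum += tuple;
        }
    }

    long sum() {
        return state.sum;
    }
}
```

After `process(1)`, `process(2)`, `checkpoint()`, `process(3)`, a failure followed by `recover()` restores the sum of 3 from the checkpoint and replays the buffered tuple, so no data is lost.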

Page 8: Apex & Geode: In-memory streaming, storage & analytics

Apex Processing Semantics

At-least-once
•  On recovery, data will be replayed from a previous checkpoint
–  No messages lost
–  Default, suitable for most applications
•  Can be used to ensure data is written once to the store
–  Transactions with meta information, rewinding output, feedback from external entity, idempotent operations

At-most-once
•  On recovery, the latest data is made available to the operator
–  Useful where data loss is acceptable and the latest data is sufficient

Exactly-once
–  At-least-once + idempotency + transactional mechanisms (operator logic) to achieve end-to-end exactly-once behavior
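One of the techniques listed above, idempotent operations, can be sketched as follows. The example assumes each tuple carries a stable identifier that survives replay; `IdempotentSinkSketch` and `Order` are hypothetical names, and a `Map` stands in for an external key-value store.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: at-least-once delivery made effectively exactly-once by idempotent writes.
final class IdempotentSinkSketch {
    /** A tuple with a stable identity that survives replay (an assumption of this sketch). */
    static final class Order {
        final String id;
        final double price;

        Order(String id, double price) {
            this.id = id;
            this.price = price;
        }
    }

    private final Map<String, Order> store = new LinkedHashMap<>(); // stands in for an external store
    private int writeAttempts;

    void write(Order order) {
        writeAttempts++;
        store.put(order.id, order); // keyed upsert: repeating it leaves the store unchanged
    }

    int storedCount() {
        return store.size();
    }

    int writeAttempts() {
        return writeAttempts;
    }

    /** Delivers the same window twice, as happens when data is replayed after recovery. */
    static IdempotentSinkSketch demo() {
        IdempotentSinkSketch sink = new IdempotentSinkSketch();
        List<Order> window = Arrays.asList(new Order("o1", 101.0), new Order("o2", 99.5));
        window.forEach(sink::write); // first delivery
        window.forEach(sink::write); // replay from an earlier checkpoint
        return sink;
    }
}
```

The sink sees four write attempts but ends up with two stored records, which is the end-to-end effect the slide describes as exactly-once.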

Page 9: Apex & Geode: In-memory streaming, storage & analytics

IMC Benefits for Apex

•  Data flow in-memory, no disk
•  Incremental recovery – buffer server
•  In-memory data for querying data

Page 10: Apex & Geode: In-memory streaming, storage & analytics

Streaming Meets In-Memory Data Grid

Page 11: Apex & Geode: In-memory streaming, storage & analytics

Apex + Geode Integration

Completed

•  Operator checkpointing in Geode.
•  Output operator to store tuples in a Geode region.

Proposed

•  Geode output operator with transactional support.
•  Ingest data from Geode into the Apex DAG.
•  Distributed cache operator.
•  Scan operator for parallel query execution & result retrieval.

Page 12: Apex & Geode: In-memory streaming, storage & analytics

Operator Checkpointing in Geode

Apex operator checkpointing in an IMDG (Geode store)
•  Checkpointing is an essential mechanism to ensure fault tolerance
•  Apex checkpoints operator state to HDFS
•  Slower HDFS checkpointing hurts application performance
•  Checkpointing in Geode ensures that application performance is not impacted
•  Geode has better latency for write operations than HDFS.

Implementation: GeodeStorageAgent

https://issues.apache.org/jira/browse/APEXCORE-283

Page 13: Apex & Geode: In-memory streaming, storage & analytics

Data Streams to Geode Store

Apex output operator to write to Geode store
•  Apex output operator – egress data from the Apex DAG to an external store
•  Store incoming tuples in binary/POJO format in a Geode region
•  Geode efficient query integration – OQL
•  Geode region supports data replication, overflow to disk, persistence & many eviction strategies

Implementation: GeodeStore, GeodePOJOPutOperator, AbstractGeodePutOperator

https://malhar.atlassian.net/projects/MLHR/issues/MLHR-1942

Page 14: Apex & Geode: In-memory streaming, storage & analytics

Geode Transactional Writes

Apex output operator to write to Geode store with transactions
•  Apex DAG uses TransactionableStore to guarantee that records are written exactly once, e.g. JdbcTransactionalStore
•  Geode provides transaction support for efficient and safe coordinated operations
•  A Geode store using transactions guarantees that records are written exactly once
•  A put operator backed by a Geode transactional store can help to achieve exactly-once semantics

Implementation: GeodeWindowStore as TransactionableStore
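A common way to get exactly-once output from a transactional store is to commit each window's data together with its window id, then skip any replayed window whose id is not newer than the committed one. The sketch below shows that pattern in plain Java. It is an assumption about how such a store could behave, not the GeodeWindowStore implementation, and `WindowStoreSketch` with all its methods is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a transactional store; not Geode's or Malhar's API.
final class WindowStoreSketch {
    private final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();
    private long committedWindowId = -1;

    long committedWindowId() {
        return committedWindowId;
    }

    List<String> committed() {
        return committed;
    }

    /** Buffers a write inside the current transaction. */
    void put(String record) {
        pending.add(record);
    }

    /** Commits data and window id together, so either both become visible or neither does. */
    void commit(long windowId) {
        committed.addAll(pending);
        committedWindowId = windowId;
        pending.clear();
    }

    /** Discards uncommitted writes, e.g. when the operator fails mid-window. */
    void rollback() {
        pending.clear();
    }

    /** Output-operator logic: a window at or below the committed id was already written, so skip it. */
    static void writeWindow(WindowStoreSketch store, long windowId, List<String> tuples) {
        if (windowId <= store.committedWindowId()) {
            return; // replayed window, already committed
        }
        for (String tuple : tuples) {
            store.put(tuple);
        }
        store.commit(windowId);
    }
}
```

If windows 1 and 2 are written, then replayed after a recovery, the replays are skipped and only new windows add records.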

Page 15: Apex & Geode: In-memory streaming, storage & analytics

Streaming Geode Data in Apex

Apex input operator to read from Geode store

•  Apex input operators – ingest data from external sources into the Apex DAG
•  Geode provides versatile and reliable event distribution to provide real-time updates to data
  •  Use case – Apex operator to stream async events from Geode into the DAG
  •  Callback events reduce polling cycles over the network

Implementation: GeodeRegionStreamOperator receives newly added tuples and emits them into the DAG

Page 16: Apex & Geode: In-memory streaming, storage & analytics

Geode Cache Operator

Apex Geode cache operator

•  Geode provides efficient events & notifications
  •  Register interest – update local copies
  •  Continuous query
    •  Receive notification when the query condition is met on the server
    •  E.g. SELECT * FROM /tradeOrder t WHERE t.price > 100.00
•  Use the Geode events notification framework to maintain & invalidate the cache.

Implementation: GeodeCacheOperator maintains a consistent cache based on the subscribed key set/query
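The notification-driven cache described on this slide can be sketched without Geode. In this simplified illustration a local cache subscribes to change events and mirrors only the entries that satisfy a condition modelled on the slide's `t.price > 100.00` query. `Region` and `LocalCache` here are hypothetical classes written for the example, not Geode's `Region` or continuous-query API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Predicate;

// Hypothetical illustration only; these classes are NOT Geode's Region or CQ API.
final class CacheNotificationSketch {
    /** Stands in for a server-side region that pushes change events to subscribers. */
    static final class Region {
        private final Map<String, Double> data = new HashMap<>();
        private final List<BiConsumer<String, Double>> listeners = new ArrayList<>();

        void subscribe(BiConsumer<String, Double> listener) {
            listeners.add(listener);
        }

        /** Stores the value and notifies every subscriber through its callback. */
        void put(String key, double price) {
            data.put(key, price);
            for (BiConsumer<String, Double> listener : listeners) {
                listener.accept(key, price);
            }
        }
    }

    /** Local cache that mirrors only the entries matching a continuous-query-like condition. */
    static final class LocalCache {
        private final Map<String, Double> copies = new HashMap<>();

        LocalCache(Region region, Predicate<Double> condition) {
            region.subscribe((key, price) -> {
                if (condition.test(price)) {
                    copies.put(key, price); // entry matches: update the local copy
                } else {
                    copies.remove(key);     // entry no longer matches: invalidate it
                }
            });
        }

        Double get(String key) {
            return copies.get(key);
        }

        int size() {
            return copies.size();
        }
    }
}
```

Because updates arrive as callbacks, the cache never polls the region; an entry whose price drops to 100.00 or below is invalidated as soon as the change event is delivered.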

Page 17: Apex & Geode: In-memory streaming, storage & analytics

Geode Scan Operator

Apex Geode scan operator

•  Function execution provides parallel query execution
•  MapReduce-like execution – concurrent execution on members; results are collected from members & sent to the caller.
•  Use case: streaming application depending on a large scan result from an external store

Implementation: GeodeQueryOperator executes data-dependent queries on a distributed region and emits the results into the DAG
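The MapReduce-like execution described above can be sketched with a thread pool: each member filters its own partition concurrently, and the caller gathers the partial results. This is a minimal illustration that assumes the partitions are plain in-memory lists; `ParallelScanSketch.scan` is a hypothetical helper and does not use Geode's function execution service.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical illustration; not Geode's function execution API.
final class ParallelScanSketch {
    /** Runs the filter on every member's partition in parallel and collects the hits for the caller. */
    static <T> List<T> scan(List<List<T>> members, Predicate<T> filter) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, members.size()));
        try {
            // "Map" step: one task per member, each scanning only its local data.
            List<Future<List<T>>> partials = new ArrayList<>();
            for (List<T> partition : members) {
                partials.add(pool.submit(() -> {
                    List<T> hits = new ArrayList<>();
                    for (T item : partition) {
                        if (filter.test(item)) {
                            hits.add(item);
                        }
                    }
                    return hits;
                }));
            }
            // "Reduce" step: gather partial results in member order.
            List<T> result = new ArrayList<>();
            for (Future<List<T>> partial : partials) {
                result.addAll(partial.get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A streaming operator could call such a scan once per window and emit each element of the returned list as a tuple; in this sketch the results come back in member order because the futures are read in the order they were submitted.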

Page 18: Apex & Geode: In-memory streaming, storage & analytics

Join the Apache Geode Community!

•  Check out: http://geode.incubator.apache.org

•  Subscribe: [email protected]

•  Download: http://geode.incubator.apache.org/releases/

Page 19: Apex & Geode: In-memory streaming, storage & analytics

Questions???

Thank You…