In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode
TRANSCRIPT
Sandeep Deshmukh, Ashish Tadose
Project Status
Mentor List
• Ted Dunning: Apache Member, MapR
• Alan Gates: Apache Member, Hortonworks
• Taylor Goetz: Apache Member, Hortonworks
• Justin Mclean: Apache Member, Class Software
• Chris Nauroth: Apache Member, Hortonworks
• Hitesh Shah: Apache Member, Hortonworks
Apex in Apache Incubation Stage
Apache Apex (Incubating) Committer List
Open-sourced in July 2015
Over 50 committers already… and growing…
Apex Platform Overview
Enterprise Edition
Directed Acyclic Graph (DAG)
Application Programming Model
• A Stream is a sequence of data tuples
• An Operator takes one or more input streams, performs computations and emits one or more output streams
• Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
• An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
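The stream/operator model in the bullets above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the Apache Apex operator API: the `Operator` interface, `SplitWords`, `UpperCase` and `run` are hypothetical names, and the "DAG" here is just two operators chained in a line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;

// Hypothetical names throughout: this models the ideas on the slide,
// it is NOT the Apache Apex operator API.
final class DagSketch {

    /** An operator consumes one input tuple at a time and may emit output tuples. */
    interface Operator<I, O> {
        void process(I tuple, Consumer<O> emit);
    }

    /** Splits a line into words: one input stream, one output stream. */
    static final class SplitWords implements Operator<String, String> {
        @Override
        public void process(String line, Consumer<String> emit) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    emit.accept(word);
                }
            }
        }
    }

    /** Stand-in for custom business logic: upper-cases each word. */
    static final class UpperCase implements Operator<String, String> {
        @Override
        public void process(String word, Consumer<String> emit) {
            emit.accept(word.toUpperCase(Locale.ROOT));
        }
    }

    /** Wires source -> SplitWords -> UpperCase -> sink, a two-operator linear DAG. */
    static List<String> run(List<String> source) {
        List<String> sink = new ArrayList<>();
        Operator<String, String> split = new SplitWords();
        Operator<String, String> upper = new UpperCase();
        for (String line : source) {
            // Each emitted tuple flows along the stream to the next operator.
            split.process(line, word -> upper.process(word, sink::add));
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("in memory", "data grid")));
    }
}
```

Each operator instance here handles one tuple at a time on a single thread, which mirrors the single-threaded-instance point above; parallelism would come from running several instances side by side.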
[Diagram: a chain of Operators connected by streams; labels: Output Stream, Tuple, Operator]
Apex Component Overview
[Diagram labels: Hadoop Edge Node; DTRTS Management Server; CLI; REST API; Hadoop Node; YARN Container; Apex App Master; Streaming Container; Thread 1 … Thread-N; Op1, Op2, Op3; Part of Community Edition]
Apex Features…
• Native Hadoop Integration
• Partitioning and Scaling out
• Advanced Windowing Support
• Stateful Fault-tolerance
• Processing Semantics
• Compute Locality
• Dynamic updates
Apache Apex-Malhar
IMC Components in Apex
• Processing data in-motion
• Preventing data loss – buffer server
• In-memory data stores for querying data
Typical latencies
Why In-Memory Computing?
In-memory computing will have a long-term, disruptive impact by radically changing users' expectations, application design principles, products' architectures and vendors' strategies.
RAM is the new disk, disk the new tape.
In-memory computing is the future of computing... it offers massive [benefits] not only in TCO reduction but across all four value dimensions: performance, process innovation, simplification and flexibility.
In-Memory Data Grid - IMDG
What are IMDGs?
• IMDGs host data in memory and distribute it across a cluster of commodity servers
• The main access patterns are key/value access, MapReduce, various forms of HPC-like processing, and limited distributed querying and indexing capabilities
Why are they important?
• Performance – using RAM is faster than using disk
• Extremely high availability of data – by keeping it in memory and in a highly distributed cluster
• Data structure – using a key/value store allows greater flexibility for the application developer
• Object store similar in interface to a typical concurrent hash map
• Scalable data partitioning
• Transactional ACID support
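To make the "concurrent hash map spread across a cluster" idea concrete, here is a minimal sketch assuming simple hash partitioning. It is not the Geode API and not necessarily how any real IMDG assigns keys: `PartitionedGridSketch`, `ownerOf` and the fixed node count are hypothetical, and each "node" is only a local map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the Geode API: each "node" is just a local map.
final class PartitionedGridSketch {
    private final List<Map<String, String>> nodes = new ArrayList<>();

    PartitionedGridSketch(int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new ConcurrentHashMap<>());
        }
    }

    /** Picks the node that owns a key; floorMod keeps the index non-negative. */
    int ownerOf(String key) {
        return Math.floorMod(key.hashCode(), nodes.size());
    }

    /** Key/value write, routed to the owning node. */
    void put(String key, String value) {
        nodes.get(ownerOf(key)).put(key, value);
    }

    /** Key/value read, routed to the owning node; null if absent. */
    String get(String key) {
        return nodes.get(ownerOf(key)).get(key);
    }

    /** Total number of entries across all nodes. */
    int totalSize() {
        int total = 0;
        for (Map<String, String> node : nodes) {
            total += node.size();
        }
        return total;
    }

    public static void main(String[] args) {
        PartitionedGridSketch grid = new PartitionedGridSketch(4);
        grid.put("user-1", "Alice");
        System.out.println(grid.get("user-1") + " lives on node " + grid.ownerOf("user-1"));
    }
}
```

A real grid would also keep redundant copies and rebalance when servers join or leave; the sketch only shows the routing of a key to its owner.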
High Level Architecture - Geode
Geode Features
Core Features
• Linear scalability & latency-minimizing data distribution
• Performance-optimized persistence - high availability & durability
• Configurable consistency - region types {partitioned, replicated & local}
• Distributed transactions
• Cluster resilience & failover
Advanced Features
• Server Function Execution - send computation to data
• Asynchronous Events - deliver events to a receiver without impacting the write path
• Continuous Queries & Client subscriptions - useful for refreshing client cache
Application Patterns
• Caching for speed and scale – Read-through, Write-through, Write-behind
• Geode as the OLTP system of record – Data in-memory for low latency, on disk for durability
• Parallel compute engine
• Real-time analytics
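The three caching patterns named above differ only in when the backing store is touched. The following single-threaded sketch illustrates that difference under stated assumptions: `CachePatternsSketch` and its methods are hypothetical names, the backing store is a plain map standing in for a database, and nothing here is Geode's own cache loader or writer API.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical, single-threaded illustration of the three patterns.
final class CachePatternsSketch {
    private final Map<String, String> backingStore;            // stands in for a database
    private final Map<String, String> cache = new HashMap<>();
    private final Queue<String> dirtyKeys = new ArrayDeque<>(); // keys awaiting write-behind
    private int storeReads = 0;

    CachePatternsSketch(Map<String, String> backingStore) {
        this.backingStore = backingStore;
    }

    /** Read-through: on a cache miss, load from the store and keep the value. */
    String readThrough(String key) {
        String value = cache.get(key);
        if (value == null) {
            storeReads++;
            value = backingStore.get(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    /** Write-through: update the store and the cache in the same call. */
    void writeThrough(String key, String value) {
        backingStore.put(key, value);
        cache.put(key, value);
    }

    /** Write-behind: update the cache now, defer the store write. */
    void writeBehind(String key, String value) {
        cache.put(key, value);
        dirtyKeys.add(key);
    }

    /** Drains deferred writes to the store; returns how many were written. */
    int flush() {
        int written = 0;
        String key;
        while ((key = dirtyKeys.poll()) != null) {
            backingStore.put(key, cache.get(key));
            written++;
        }
        return written;
    }

    int storeReads() {
        return storeReads;
    }

    public static void main(String[] args) {
        Map<String, String> db = new HashMap<>();
        db.put("a", "1");
        CachePatternsSketch c = new CachePatternsSketch(db);
        c.readThrough("a");
        c.readThrough("a");
        System.out.println("store reads: " + c.storeReads());
    }
}
```

Write-behind trades durability for write latency: until `flush` runs, the newest value exists only in memory, which is why the pattern is usually paired with redundancy or persistence.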
Geode Reads With Consistent Latency and CPU
• Scaled from 256 clients and 2 servers to 1280 clients and 10 servers
• Partitioned region with redundancy and 1K data size
[Chart: speedup, latency (ms) and CPU % plotted against the number of server hosts (2, 4, 6, 8, 10)]
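A quick check of the numbers quoted for this benchmark: the client count and the server count both grow by the same factor, so the load per server is the same at both ends of the run. The helper name below is hypothetical and the snippet only redoes the arithmetic from the slide.

```java
// Arithmetic on the figures quoted on the slide; the helper name is hypothetical.
final class ScaleOutArithmetic {

    /** Clients per server, requiring an even split so the ratio is exact. */
    static int clientsPerServer(int clients, int servers) {
        if (servers <= 0 || clients % servers != 0) {
            throw new IllegalArgumentException("clients must divide evenly across servers");
        }
        return clients / servers;
    }

    public static void main(String[] args) {
        System.out.println("start: " + clientsPerServer(256, 2) + " clients per server");
        System.out.println("end:   " + clientsPerServer(1280, 10) + " clients per server");
        System.out.println("cluster growth factor: " + (10 / 2));
    }
}
```

Both configurations work out to 128 clients per server, with the cluster five times larger at the end, so the run can be read as a scale-out test at constant per-server load.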
Geode 3.5-4.5X Faster Than Cassandra for YCSB
Roadmap
• HDFS persistence
• Off-heap storage
• Lucene indexes
• Spark integration
• Cloud Foundry service
…and other ideas from the Geode community!
Streaming meets In-Memory Data Grid
Apex + Geode
Apex operator checkpointing in Geode store
• Better latency for checkpoint operations than HDFS checkpointing
• Makes the Apex DAG a complete in-memory pipeline
• https://issues.apache.org/jira/browse/APEXCORE-283
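The checkpointing idea can be sketched as saving serialized operator state under an (operator id, window id) key in a key/value store. This is a rough illustration only: `CheckpointSketch`, `CounterState` and the key layout are assumptions, the in-memory map merely stands in for a Geode region, and the code is not the implementation tracked in APEXCORE-283.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the map below only stands in for a key/value region.
final class CheckpointSketch {

    /** Serialized operator state, keyed by "operatorId/windowId" (assumed layout). */
    private final Map<String, byte[]> region = new ConcurrentHashMap<>();

    private static String key(int operatorId, long windowId) {
        return operatorId + "/" + windowId;
    }

    /** Serializes the state and stores the bytes, so later mutations do not leak in. */
    void save(Serializable state, int operatorId, long windowId) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(state);
        } catch (IOException e) {
            throw new IllegalStateException("could not serialize checkpoint", e);
        }
        region.put(key(operatorId, windowId), bytes.toByteArray());
    }

    /** Restores the state saved for this operator and window, or null if none exists. */
    Object load(int operatorId, long windowId) {
        byte[] data = region.get(key(operatorId, windowId));
        if (data == null) {
            return null;
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not deserialize checkpoint", e);
        }
    }

    /** Example operator state: a running count. */
    static final class CounterState implements Serializable {
        private static final long serialVersionUID = 1L;
        long count;
    }

    public static void main(String[] args) {
        CheckpointSketch store = new CheckpointSketch();
        CounterState state = new CounterState();
        state.count = 42;
        store.save(state, 1, 100L);
        CounterState restored = (CounterState) store.load(1, 100L);
        System.out.println("restored count: " + restored.count);
    }
}
```

Because a checkpoint is a small keyed write followed by an occasional keyed read, it fits the key/value access pattern of a data grid, which is the basis of the latency claim in the bullets above.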
Write Apex data streams to Geode store
• Apex output operator implementation which writes data to a Geode region
• Use cases
  • Ingest streaming data in Geode for further processing
  • Store data processed by the Apex pipeline in the Geode store to serve user queries
• https://malhar.atlassian.net/projects/MLHR/issues/MLHR-1942
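One plausible shape for such an output operator is to buffer tuples during a streaming window and write them to the store when the window ends. The sketch below assumes that design; it is not the operator from MLHR-1942. `RegionOutputSketch` and its method names are hypothetical (they only loosely echo a windowed operator lifecycle), and a plain map stands in for the Geode region.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a windowed output operator; a map stands in for the region.
final class RegionOutputSketch {
    private final Map<String, Long> region;                      // stand-in for a Geode region
    private final Map<String, Long> pending = new HashMap<>();   // tuples buffered in this window

    RegionOutputSketch(Map<String, Long> region) {
        this.region = region;
    }

    /** Start of a streaming window: begin with an empty buffer. */
    void beginWindow() {
        pending.clear();
    }

    /** Called once per incoming tuple; the last value for a key wins within a window. */
    void process(String key, long value) {
        pending.put(key, value);
    }

    /** End of the window: write the whole batch to the store in one call. */
    void endWindow() {
        region.putAll(pending);
        pending.clear();
    }

    public static void main(String[] args) {
        Map<String, Long> region = new ConcurrentHashMap<>();
        RegionOutputSketch op = new RegionOutputSketch(region);
        op.beginWindow();
        op.process("clicks", 10L);
        op.process("clicks", 12L);
        op.endWindow();
        System.out.println(region);
    }
}
```

Once the batch is written, anything that can read the store can serve queries over the processed data, which matches the second use case listed above.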
Questions???
Thank You…