Neutrino: Revisiting Memory Caching for Iterative Data Analytics
Erci Xu*, Mohit Saxena, Lawrence Chiu
Ohio State University*, IBM Research Almaden

Background
• Iterative analytics is rapidly gaining popularity
  – Data Clustering, Log Mining, Graph Processing, Machine Learning
  – Dataset is repeatedly accessed across different iterations
• In-Memory Caching best fits Iterative Analytics
  – In-Memory caching frameworks avoid frequent I/O with underlying storage systems
  – Iterative Data Analytics could have 10x – 100x speedup

Spark for In-Memory Iterative Analytics
[Figure. Ref: wikibon.org]

Spark RDD
RDD: Resilient Distributed Datasets
[Figure: an RDD shown as partitions 1 to 4 (In Memory) and blocks 1 to 4 (In HDFS), spread across Worker 1 and Worker 2.]

Memory Caching in Spark
RDD Cache Options
• Deserialized or Serialized
• On Heap or Off Heap
• In Memory or Disk
[Figure: Spark above several stripes, each labeled Mem and Disk.]

Problems: In-memory Caching for Iterative Data Analytics
1. Discrete Cache Levels
2. Manual Programmer Management
3. Not Adaptive to Runtime Changes

Problem 1: Discrete Cache Levels
Serialized Cache saves 56% to 63% of the space but is relatively slower
[Figure: bar chart of Size in Memory (GB) versus Data Size (GB) for inputs of 10, 20, 30, 40, and 50 GB, Serialized versus Deserialized. Data labels: 27.7, 53.1, 79.8, 107.3, 126.1. Annotations: 56% and 63%.]

Problem 1: Discrete Cache Levels
• Deserialized Cache is an order of magnitude faster but becomes very slow once spilled to disk
[Figure: Time (sec) versus Data Size (GB) from 10 to 100 GB, Deserialized versus Serialized. Cluster Memory: 100 GB. Annotations: 97% and -71%.]

Problem 1: Discrete Cache Levels
[Figure: Time (sec) versus Data Size (GB) from 10 to 100 GB for Deserialized, Serialized, and Optimal. The Optimal curve is marked "Goal for Neutrino".]

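The gap between the two discrete levels can be illustrated with a little arithmetic. The sketch below is an illustration only, not Neutrino's algorithm: it assumes equally sized partitions, a hypothetical per-partition deserialized footprint, and a serialized saving in the 56% to 63% range quoted above, and asks how many partitions could stay deserialized while the rest are held serialized within a fixed memory budget. The function name and all example numbers are assumptions.

```python
def max_deserialized_partitions(num_partitions, deser_gb, saving, memory_gb):
    """Largest k such that k deserialized plus (n - k) serialized partitions fit.

    deser_gb  : assumed in-memory size of one deserialized partition (GB)
    saving    : assumed fraction of space saved by serializing (e.g. 0.60)
    memory_gb : cluster memory available for caching (GB)
    Returns None when even a fully serialized cache does not fit.
    """
    if not 0.0 < saving < 1.0:
        raise ValueError("saving must be a fraction strictly between 0 and 1")
    ser_gb = deser_gb * (1.0 - saving)
    all_serialized = num_partitions * ser_gb
    if all_serialized > memory_gb:
        return None
    # Each partition promoted to deserialized costs this much extra memory.
    extra_per_partition = deser_gb - ser_gb
    k = int((memory_gb - all_serialized) // extra_per_partition)
    return min(k, num_partitions)


# Hypothetical workload: 100 partitions, 1.5 GB each when deserialized,
# 60% saving when serialized, 100 GB of cluster memory.
k = max_deserialized_partitions(100, 1.5, 0.60, 100.0)
print(k)  # 44
```

With these made-up numbers neither discrete level is a good fit: a fully deserialized cache needs 150 GB and overflows the 100 GB budget, while a fully serialized cache needs only 60 GB and leaves 40 GB idle. A mixed assignment keeps 44 partitions deserialized and the remaining 56 serialized.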
Problem 2: Manual Management
rdd_1 = sc.textFile("hdfs://file1")
rdd_2 = sc.textFile("hdfs://file2")
rdd_1.persist(Cache_Level)
rdd_2.persist(Cache_Level)
rdd_1.transformations().action()
rdd_2.transformations().action()
(Cache_Level, transformations(), and action() are placeholders.)
Questions left to the programmer: dataset size? de/serialized? access order?

Problem 3: Not Adaptive to Runtime Changes
Cache levels are statically assigned to RDDs, and such programmer decisions may not adapt to:
1. Changing memory utilization on each worker node
2. Different memory requirements for an RDD partition in deserialized/serialized cache levels

Our Solution: Neutrino
Less Manual Management
Adaptive to Runtime Changes
Fine-grained Cache Levels

1. Data Flow Generation
• Goal: To understand RDD access order between jobs
• Solution: Preliminary run on small workloads to extract RDD access order
• Example: K Nearest Neighbors Classification
  – Classical ML classification algorithm
  – 1 train dataset, 3 test datasets

KNN Example: Job Execution
[Figure: a Spark Application with three jobs. Job i (i = 1, 2, 3) runs Split and Map on Train and Test_i, followed by KNN join. For each job the Master is Scheduling Tasks and the Executor is Executing Task.]

KNN Data Flow Graph
Goal: Understand the RDD access order between jobs.
[Figure: data flow graph for Job 1. Train RDD → Train Split RDD → Train Mapped RDD, and Test_1 RDD → Test Split RDD → Test Mapped RDD; both feed KNN join (action), which produces Result. Jobs 1, 2, and 3 are shown side by side.]
RDD_seq[1] = {Train RDD, Test_1 RDD}
RDD_seq[2] = {Train RDD, Test_2 RDD}
RDD_seq[3] = {Train RDD, Test_3 RDD}

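The RDD_seq lists above can be reproduced with a small sketch. It assumes, hypothetically, that the preliminary run yields a flat log of (job id, input RDD name) events; the log format and the helper names `build_rdd_seq` and `reuse_counts` are illustrative and are not taken from Neutrino.

```python
def build_rdd_seq(access_log):
    """Group input-RDD accesses by job, keeping first-seen order and dropping repeats."""
    rdd_seq = {}
    for job_id, rdd_name in access_log:
        seen = rdd_seq.setdefault(job_id, [])
        if rdd_name not in seen:
            seen.append(rdd_name)
    return rdd_seq


def reuse_counts(rdd_seq):
    """Count in how many jobs each RDD is accessed."""
    counts = {}
    for rdds in rdd_seq.values():
        for name in rdds:
            counts[name] = counts.get(name, 0) + 1
    return counts


# Hypothetical log from a preliminary KNN run on a small workload.
log = [
    (1, "Train"), (1, "Test_1"),
    (2, "Train"), (2, "Test_2"),
    (3, "Train"), (3, "Test_3"), (3, "Train"),
]
rdd_seq = build_rdd_seq(log)
print(rdd_seq)
print(reuse_counts(rdd_seq))
```

The reuse counts make the caching opportunity explicit: Train is read by all three jobs, while each test RDD is read by only one.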
2. Adaptive Caching
• Goal: Fine-grained cache management at RDD partition level
• Solution: New cache level: Adaptive. It can move RDD partitions between cache levels at runtime
• Partition-level Operations: cache, discard and convert

Cache Operations in Spark
Caching Granularity: RDD
[Figure: state diagram over Uncached, Serialized, and Deserialized. Cache (Serialize) and Cache (Deserialize) lead out of Uncached; Discard leads back to Uncached from either cached state.]

Adaptive Cache Operations in Neutrino
Caching Granularity: Partition
[Figure: the same state diagram for an RDD at the Adaptive level, with Cache (Serialize), Cache (Deserialize), and Discard as before, plus an added Convert operation.]

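The partition-level operations can be pictured as a small state machine. The sketch below is an illustration under assumptions: the operation names are invented, and it assumes that convert moves a partition in either direction between the serialized and deserialized levels, which the slides do not spell out.

```python
from enum import Enum


class Level(Enum):
    UNCACHED = "uncached"
    SERIALIZED = "serialized"
    DESERIALIZED = "deserialized"


# (current level, operation) -> next level. Operation names are hypothetical.
TRANSITIONS = {
    (Level.UNCACHED, "cache_ser"): Level.SERIALIZED,
    (Level.UNCACHED, "cache_deser"): Level.DESERIALIZED,
    (Level.SERIALIZED, "discard"): Level.UNCACHED,
    (Level.DESERIALIZED, "discard"): Level.UNCACHED,
    # Assumption: convert works in both directions between cached levels.
    (Level.SERIALIZED, "convert"): Level.DESERIALIZED,
    (Level.DESERIALIZED, "convert"): Level.SERIALIZED,
}


def apply_op(level, op):
    """Return the level a partition reaches after op, or raise if op is invalid."""
    try:
        return TRANSITIONS[(level, op)]
    except KeyError:
        raise ValueError(f"{op!r} is not valid from {level.name}") from None


# Each partition of one RDD carries its own level (partition granularity).
partitions = {("rdd0", 1): Level.UNCACHED, ("rdd0", 2): Level.UNCACHED}
partitions[("rdd0", 1)] = apply_op(partitions[("rdd0", 1)], "cache_deser")
partitions[("rdd0", 2)] = apply_op(partitions[("rdd0", 2)], "cache_ser")
partitions[("rdd0", 1)] = apply_op(partitions[("rdd0", 1)], "convert")
print(partitions)
```

Keying the table by (RDD, partition) rather than by RDD is what distinguishes this from the RDD-granularity diagram: two partitions of the same RDD can sit at different levels at the same time.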
3. Dynamic Cache Scheduling
• Goal: Adapt to runtime changes for achieving optimal performance
• Solution: Explore cache decisions on all partitions by dynamic programming each time before scheduling
• Dynamic Programming Model
  – Inputs: RDD access order, partition status
  – Output: Cache decision for each partition in the next job
  – Cost Model: Overall execution time

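A toy version of such a dynamic program is sketched below. It is not Neutrino's scheduler: it assumes one partition per RDD, made-up read costs and memory footprints per cache level, and it plans all jobs at once rather than re-planning before each job. It memoizes on (job index, cache state) and minimizes overall execution time, taking the RDD access order and the initial partition status (all uncached) as inputs.

```python
from functools import lru_cache
from itertools import product

LEVELS = ("uncached", "ser", "deser")
# Hypothetical cost model: time units to read a partition at each level,
# and memory units a partition occupies at each level.
READ_COST = {"uncached": 10.0, "ser": 3.0, "deser": 1.0}
MEM_COST = {"uncached": 0, "ser": 1, "deser": 2}


def plan(access_order, capacity):
    """Return (total time, cache state chosen after each job)."""
    items = sorted({rdd for job in access_order for rdd in job})

    @lru_cache(maxsize=None)
    def best(job_idx, state):
        if job_idx == len(access_order):
            return 0.0, ()
        accessed = access_order[job_idx]
        level_of = dict(zip(items, state))
        # The job pays for each input according to its level before the job.
        run_cost = sum(READ_COST[level_of[rdd]] for rdd in accessed)
        best_total, best_tail = float("inf"), ()
        for nxt in product(LEVELS, repeat=len(items)):
            # Only data touched by this job or already cached can be cached.
            if any(new != "uncached" and level_of[rdd] == "uncached"
                   and rdd not in accessed for rdd, new in zip(items, nxt)):
                continue
            if sum(MEM_COST[new] for new in nxt) > capacity:
                continue
            rest, tail = best(job_idx + 1, nxt)
            if run_cost + rest < best_total:
                best_total, best_tail = run_cost + rest, (nxt,) + tail
        return best_total, best_tail

    start = tuple("uncached" for _ in items)
    total, states = best(0, start)
    return total, [dict(zip(items, s)) for s in states]


# Access order shaped like the KNN example: one shared RDD, one new RDD per job.
order = [("RDD0", "RDD1"), ("RDD0", "RDD2"), ("RDD0", "RDD3")]
total, states = plan(order, capacity=2)
print(total)                 # 42.0
print(states[0]["RDD0"])     # deser
```

Under these assumed costs the planner keeps the shared RDD0 deserialized after the first job and never caches the single-use RDDs, since caching them would only consume memory without saving any later read.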
Execution of Dynamic Cache Scheduling
[Figure: the Master performs DP Scheduling, using the RDD Access Order and Partition Status, and Scheduling Tasks. The Executor performs Executing Caching Decisions and Executing Task (Split, Map, KNN join), followed by an Update.]

Dynamic Cache Scheduling: Caching Decisions

Partition Status Table
RDD#  Part#  Node     Status
0     1      worker1  uncached
1     1      worker1  uncached
2     1      worker1  uncached
3     1      worker1  uncached

RDD Access Order
Job#  RDDs
1     RDD0, RDD1
2     RDD0, RDD2
3     RDD0, RDD3

Path 1
DP Job 1, Decision_1: RDD_0_Part_1 → Deser_Cache; RDD_1_Part_1 → Deser_Cache
DP Job 2, Decision_2: RDD_0_Part_1 → Do_Nothing; RDD_1_Part_1 → Discard; RDD_2_Part_1 → Deser_Cache
DP Job 3 (Final)

Path 2
DP Job 1, Decision_1: RDD_0_Part_1 → Deser_Cache; RDD_1_Part_1 → Do_Nothing
DP Job 2, Decision_2: RDD_0_Part_1 → Do_Nothing; RDD_2_Part_1 → Do_Nothing
DP Job 3 (Final)

Return Path 2: best caching decisions

Evaluation
• Neutrino Implementation
  – Extension to Apache Spark
• Methodology
  – 6 nodes of 4 cores, 8 GB memory each
  – Iterative machine learning workloads:
    • Classification: KNN, Logistic Regression
    • Clustering: K-Means
    • Inference: LDA
  – Systems Compared:
    • Neutrino with Adaptive Caching
    • Spark with Serialized and Deserialized Caching

Scenario 1: Abundant Memory
Deserialized data size < Cluster Memory
Neutrino deserializes all partitions and makes efficient use of unused memory
Neutrino has extra computation overhead for dynamic scheduling and additional operations
[Figure: bar chart of Relative Job Execution Time for LDA, K-Means, KNN, and LogReg under Neutrino, Deserialized, and Serialized. Annotations: 60%, 45%, and -7%.]

Scenario 2: Sufficient Memory
Deserialized data size > Cluster Memory
Serialized level has extra overhead on deserialization. Neutrino caches partially in deserialized level and partially in serialized level
Deserialized level starts to hit disk and hence requires re-computation from HDFS
[Figure: bar chart of Relative Job Execution Time for LDA, K-Means, KNN, and LogReg under Neutrino, Deserialized, and Serialized. Annotations: 66%, 35%, 16%, and 5%.]

Scenario 3: Just Enough Memory
Serialized data size = Cluster Memory
Cache misses occur more frequently for the Deserialized level
[Figure: bar chart of Relative Job Execution Time for LDA, K-Means, KNN, and LogReg under Neutrino, Deserialized, and Serialized. Annotations: 46% and 70%.]

Conclusions
• Discrete Cache Levels for In-Memory Caching
  – Inefficient memory usage → not optimal performance
• Neutrino:
  – Partition level adaptive caching
  – Data flow graph generation
  – Dynamic cache scheduling
• Neutrino improves average job execution time by up to 70% over native Spark caching

Major Problems
• Cached RDD is deserialized in Spark Memory - Checkpointed RDD is serialized and persistent
• Cached RDD is computed from HDFS - Checkpointed RDD is re-computed from cached RDD
• Cached RDD reading is locality aware - Static Delay scheduling limits Checkpointing locality

Thanks
Q&A