Post on 21-Jun-2020
CS639: Data Management for Data Science
Lecture 11: Spark
Theodoros Rekatsinas
Logistics/Announcements
• Questions on PA3?
Today’s Lecture
1. MapReduce Implementation
2. Spark
1. MapReduce Implementation
Recall: The MapReduce Abstraction for Distributed Algorithms
[Figure: Distributed Data Storage → Map (parallel map tasks) → (Shuffle) → Reduce (parallel reduce tasks)]
MapReduce: what happens in between?
MapReduce: the complete picture
Step 1: Split input files into chunks (shards)
Step 2: Fork processes
Step 3: Run Map Tasks
Step 4: Create intermediate files
Step 4a: Partitioning
Step 5: Reduce Task - sorting
Step 6: Reduce Task - reduce
Step 7: Return to user
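The steps above can be sketched on a single machine in plain Python. This is a minimal illustration, not the actual MapReduce implementation: the function names (`split_into_shards`, `map_task`, `partition`, `reduce_task`, `map_reduce`) and the partitioning rule are assumptions made for the example, intermediate "files" are in-memory lists, and forking worker processes is replaced by a loop.

```python
from collections import defaultdict
from itertools import groupby


def split_into_shards(lines, num_shards):
    """Step 1: split the input into chunks (shards)."""
    return [lines[i::num_shards] for i in range(num_shards)]


def map_task(shard):
    """Step 3: emit one (word, 1) pair per word in the shard."""
    return [(word, 1) for line in shard for word in line.split()]


def partition(pairs, num_reducers):
    """Steps 4 and 4a: bucket intermediate pairs by key, one bucket per reducer."""
    buckets = defaultdict(list)
    for key, value in pairs:
        # Assumed rule: sum of code points modulo the number of reducers,
        # chosen because it is deterministic across runs.
        buckets[sum(map(ord, key)) % num_reducers].append((key, value))
    return buckets


def reduce_task(pairs):
    """Steps 5 and 6: sort by key, then sum the values of each key."""
    pairs = sorted(pairs, key=lambda kv: kv[0])
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(pairs, key=lambda kv: kv[0])
    }


def map_reduce(lines, num_shards=3, num_reducers=2):
    shards = split_into_shards(lines, num_shards)
    # Step 2 (forking worker processes) is replaced by a plain loop here.
    intermediate = [partition(map_task(shard), num_reducers) for shard in shards]
    result = {}
    for r in range(num_reducers):
        # Each reducer gathers its bucket from every map task (the shuffle).
        pairs = [kv for buckets in intermediate for kv in buckets.get(r, [])]
        result.update(reduce_task(pairs))
    return result  # Step 7: return to the user


counts = map_reduce(["a b a", "b c", "a c c"])
```

Because every occurrence of a key lands in the same bucket, each reducer can sum its keys without talking to the other reducers.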
MapReduce: the complete picture
We need a distributed file system!
2. Spark
Intro to Spark
• Spark is really a different implementation of the MapReduce programming model
• What makes Spark different is that it operates on data in main memory
• Spark: we write programs in terms of operations on resilient distributed datasets (RDDs)
• RDD (simple view): a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel
• RDD (complex view): an RDD is an interface for data transformation. An RDD refers to data stored either in a persisted store (HDFS), in cache (memory, memory + disk, disk only), or in another RDD
RDDs in Spark
MapReduce vs Spark
RDDs
• Partitions are recomputed on failure or cache eviction
• Metadata stored for interface:
  • Partitions – set of data splits associated with this RDD
  • Dependencies – list of parent RDDs involved in computation
  • Compute – function to compute partition of the RDD given the parent partitions from the Dependencies
  • Preferred Locations – where is the best place to put computations on this partition (data locality)
  • Partitioner – how the data is split into partitions
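The metadata listed above is enough to rebuild a lost partition from its parents. The sketch below shows that idea with a toy class. `ToyRDD`, `from_list`, and `evict` are hypothetical names, not Spark's API, and the toy caches every partition it computes, which is a simplification.

```python
class ToyRDD:
    """Hypothetical stand-in for the RDD interface; not Spark's actual class."""

    def __init__(self, num_partitions, dependencies, compute, partitioner=None):
        self.num_partitions = num_partitions  # Partitions: the data splits
        self.dependencies = dependencies      # Dependencies: parent RDDs
        self._compute = compute               # Compute: builds one partition from parents
        self.partitioner = partitioner        # Partitioner: how data is split (unused here)
        self._cache = {}                      # partition index -> materialized list

    def preferred_locations(self, index):
        """Preferred Locations: this toy has no locality information."""
        return []

    def partition(self, index):
        """Return one partition, recomputing it from the parents if it is not cached."""
        if index not in self._cache:
            self._cache[index] = self._compute(index, self.dependencies)
        return self._cache[index]

    def evict(self, index):
        """Simulate a failure or cache eviction of one partition."""
        self._cache.pop(index, None)

    def map(self, fn):
        """A transformation: child partition i depends only on parent partition i."""
        return ToyRDD(
            self.num_partitions,
            [self],
            lambda i, parents: [fn(x) for x in parents[0].partition(i)],
        )


def from_list(data, num_partitions):
    """Build a source ToyRDD with no parents from an in-memory list."""
    chunks = [data[i::num_partitions] for i in range(num_partitions)]
    return ToyRDD(num_partitions, [], lambda i, _parents: list(chunks[i]))


source = from_list([1, 2, 3, 4, 5, 6], num_partitions=2)
squares = source.map(lambda x: x * x)

before = squares.partition(0)   # computed from the parent and cached
squares.evict(0)                # the cached copy is lost
after = squares.partition(0)    # recomputed through the Dependencies
```

Only the evicted partition is rebuilt. Partition 1 of `squares` is never touched in this run.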
RDDs
DAG
• Directed Acyclic Graph – sequence of computations performed on data
• Node – RDD partition
• Edge – transformation on top of the data
• Acyclic – graph cannot return to the older partition
• Directed – transformation is an action that transitions a data partition's state (from A to B)
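A small sketch of these properties, using Python's standard `graphlib` module (Python 3.9+). The dataset names in `lineage` are made up for illustration, and for brevity each node is a whole dataset rather than a single RDD partition.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical lineage: each key is a dataset, each value is the set of
# datasets it is derived from (the incoming edges).
lineage = {
    "lines": set(),
    "words": {"lines"},
    "pairs": {"words"},
    "counts": {"pairs"},
}

# Directed: every edge points from a parent to the data derived from it,
# so the graph yields an order in which the data can be computed.
order = list(TopologicalSorter(lineage).static_order())


def has_cycle(graph):
    """Acyclic check: a valid lineage must admit a topological order."""
    try:
        list(TopologicalSorter(graph).static_order())
        return False
    except CycleError:
        return True


# Making the source depend on the final result would "return to an older
# partition", which a DAG forbids.
cyclic = dict(lineage, lines={"counts"})
```

Here `has_cycle(lineage)` is false and `has_cycle(cyclic)` is true, which is the acyclic property stated above.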
Example: Word Count
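The slide's own code did not survive in this transcript. As a rough sketch of how word count reads as a chain of RDD operations, the example below uses `MiniRDD`, a hypothetical single-machine stand-in that borrows the method names `flatMap`, `map`, `reduceByKey`, and `collect` from Spark's RDD API. Unlike Spark, it runs each step eagerly and on one list.

```python
from functools import reduce


class MiniRDD:
    """Hypothetical single-machine stand-in that borrows RDD method names."""

    def __init__(self, items):
        self._items = list(items)

    def flatMap(self, fn):
        # Apply fn to every element and flatten the resulting sequences.
        return MiniRDD(y for x in self._items for y in fn(x))

    def map(self, fn):
        # Apply fn to every element, one output per input.
        return MiniRDD(fn(x) for x in self._items)

    def reduceByKey(self, fn):
        # Group (key, value) pairs by key, then fold each group's values with fn.
        groups = {}
        for key, value in self._items:
            groups.setdefault(key, []).append(value)
        return MiniRDD((key, reduce(fn, values)) for key, values in groups.items())

    def collect(self):
        # Return the elements as a plain list.
        return list(self._items)


lines = MiniRDD(["to be or", "not to be"])
word_counts = (
    lines.flatMap(lambda line: line.split())   # one element per word
         .map(lambda word: (word, 1))          # pair each word with a count of 1
         .reduceByKey(lambda a, b: a + b)      # sum the counts per word
         .collect()
)
```

The three chained calls play the same roles as the map, shuffle, and reduce phases of the MapReduce version.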
Spark Architecture
Spark Components
Spark Driver
• Entry point of the Spark Shell (Scala, Python, R)
• The place where SparkContext is created
• Translates RDD into the execution graph
• Splits graph into stages
• Schedules tasks and controls their execution
• Stores metadata about all the RDDs and their partitions
• Brings up Spark WebUI with job information
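To make "splits graph into stages" concrete, here is a minimal sketch. It assumes that a new stage starts at every operation that needs a shuffle, which is a simplification of what the driver does. `split_into_stages` and the hand-set shuffle flags in `job` are hypothetical, not part of Spark.

```python
def split_into_stages(operations):
    """Group a linear chain of (name, needs_shuffle) operations into stages.

    Assumption: a new stage starts at every operation that needs a shuffle.
    """
    stages = [[]]
    for name, needs_shuffle in operations:
        if needs_shuffle and stages[-1]:
            stages.append([])  # shuffle boundary: close the current stage
        stages[-1].append(name)
    return stages


# Hypothetical word-count job; the shuffle flags are set by hand for illustration.
job = [
    ("textFile", False),
    ("flatMap", False),
    ("map", False),
    ("reduceByKey", True),
    ("sortByKey", True),
]
stages = split_into_stages(job)
```

Operations inside one stage can run back to back on the same partition, while a stage boundary means data has to move between partitions first.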
Spark Executor
• Stores the data in cache in JVM heap or on HDDs
• Reads data from external sources
• Writes data to external sources
• Performs all the data processing
DAGScheduler
More RDD Operations
Spark’s secret is really the RDD abstraction