lecture 11 spark - github pagesintro to spark • spark is really a different implementation of the...

31
CS639: Data Management for Data Science Lecture 11: Spark Theodoros Rekatsinas 1

Upload: others

Post on 21-Jun-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

CS639:DataManagementfor

DataScienceLecture11:Spark

TheodorosRekatsinas

1

Page 2: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

Logistics/Announcements

2

• QuestionsonPA3?

Page 3: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

Today’sLecture

1. MapReduceImplementation

2. Spark

3

Page 4: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

1. MapReduceImplementation

4

Page 5: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

Recall:TheMapReduceAbstractionforDistributedAlgorithms

DistributedDataStorage

Map

Reduce

(Shuffle)

map map map map map map

reduce reduce reduce reduce

Page 6: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

MapReduce:whathappensinbetween?

Page 7: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

MapReduce:thecompletepicture

Page 8: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

Step1:Splitinputfilesintochunks(shards)

Page 9: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

Step2:Forkprocesses

Page 10: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

Step3:RunMapTasks

Page 11: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

Step4:Createintermediatefiles

Page 12: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

Step4a:Partitioning

Page 13: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

Step5:ReduceTask- sorting

Page 14: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

Step6:ReduceTask- reduce

Page 15: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

Step7:Returntouser

Page 16: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

MapReduce:thecompletepicture

Weneedadistributedfilesystem!

Page 17: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

2.Spark

17

Page 18: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

IntrotoSpark

• SparkisreallyadifferentimplementationoftheMapReduceprogrammingmodel

• WhatmakesSparkdifferentisthatitoperatesonMainMemory• Spark:wewriteprogramsintermsofoperationsonresilient

distributeddatasets(RDDs).• RDD(simpleview):acollectionofelementspartitionedacrossthe

nudesofaclusterthatcanbeoperatedoninparallel.• RDD(complexview):RDDisaninterfacefordatatransformation,

RDDreferstothedatastoredeitherinpersistedstore(HDFS)orincache(memory,memory+disk,diskonly)orinanotherRDD

Page 19: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

RDDsinSpark

Page 20: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

MapReducevsSpark

Page 21: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

RDDs

• Partitionsarerecomputedonfailureorcacheeviction• Metadatastoredforinterface:• Partitions– setofdatasplitsassociatedwiththisRDD• Dependencies– listofparentRDDsinvolvedincomputation• Compute– functiontocomputepartitionoftheRDDgiventheparent

partitionsfromtheDependencies• PreferredLocations– whereisthebestplacetoputcomputationsonthis

partition(datalocality)• Partitioner – howthedataissplitintopartitions

Page 22: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

RDDs

Page 23: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

DAG

• DirectedAcyclicGraph– sequenceofcomputationsperformedondata

• Node– RDDpartition• Edge– transformationontopofthedata• Acyclic– graphcannotreturntotheolderpartition• Directed– transformationisanactionthattransitionsdata

partitionsstate(fromAtoB)

Page 24: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

Example:WordCount

Page 25: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

SparkArchitecture

Page 26: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

SparkComponents

Page 27: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

SparkDriver

• EntrypointoftheSparkShell(Scala,Python,R)• TheplacewhereSparkContext iscreated• TranslatesRDDintotheexecutiongraph• Splitsgraphintostages• Schedulestasksandcontrolstheirexecution• StoresmetadataaboutalltheRDDsandtheirpartitions• BringsupSparkWebUI withjobinformation

Page 28: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

SparkExecutor

• StoresthedataincacheinJVMheaporonHDDs• Readsdatafromexternalsources• Writesdatatoexternalsources• Performsallthedataprocessing

Page 29: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

DagScheduler

Page 30: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

MoreRDDOperations

Page 31: Lecture 11 Spark - GitHub PagesIntro to Spark • Spark is really a different implementation of the MapReduce programming model • What makes Spark different is that it operates on

Spark’ssecretisreallytheRDDabstraction