
CS639: Data Management for Data Science

Lecture 11: Spark

Theodoros Rekatsinas


Logistics / Announcements

• Questions on PA3?

Today’s Lecture

1. MapReduce Implementation

2. Spark

1. MapReduce Implementation

Recall: The MapReduce Abstraction for Distributed Algorithms

[Figure: distributed data storage feeds a row of parallel map tasks; a shuffle step routes their output to a row of parallel reduce tasks.]
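The map, shuffle, and reduce phases recalled above can be sketched in a few lines of single-process Python. This is only an illustration of the abstraction, not a distributed implementation; the function names (`map_fn`, `shuffle`, `reduce_fn`, `map_reduce`) are made up for this sketch.

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def map_reduce(documents):
    # Every document is mapped independently; in a cluster these calls
    # would run on different machines.
    mapped = [pair for doc in documents for pair in map_fn(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

counts = map_reduce(["a b a", "b c"])
```

The user supplies only `map_fn` and `reduce_fn`; everything in between is the framework's job, which is what the next slides walk through.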

MapReduce: what happens in between?

MapReduce: the complete picture

Step 1: Split input files into chunks (shards)

Step 2: Fork processes

Step 3: Run Map Tasks

Step 4: Create intermediate files

Step 4a: Partitioning

Step 5: Reduce Task - sorting

Step 6: Reduce Task - reduce

Step 7: Return to user
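The seven steps can be traced in a small single-process simulation. This is a sketch under simplifying assumptions: lists stand in for files, a plain loop stands in for forked worker processes, and `zlib.crc32` is just one possible stable hash for the partitioning step. All function names are hypothetical.

```python
import zlib
from itertools import groupby
from operator import itemgetter

def split_into_shards(lines, shard_size):
    """Step 1: cut the input into fixed-size chunks (shards)."""
    return [lines[i:i + shard_size] for i in range(0, len(lines), shard_size)]

def run_map_task(shard):
    """Step 3: one map task emits (word, 1) pairs for its shard."""
    return [(word, 1) for line in shard for word in line.split()]

def partition(key, num_reducers):
    """Step 4a: pick the reducer for a key with a stable hash (crc32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def write_intermediate(pairs, num_reducers):
    """Step 4: bucket map output into one intermediate 'file' per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

def run_reduce_task(bucket_parts):
    """Steps 5-6: merge a reducer's inputs, sort by key, reduce each group."""
    merged = sorted((p for part in bucket_parts for p in part), key=itemgetter(0))
    return [(k, sum(v for _, v in grp))
            for k, grp in groupby(merged, key=itemgetter(0))]

def word_count(lines, shard_size=2, num_reducers=2):
    shards = split_into_shards(lines, shard_size)
    # Step 2 (forking worker processes) is simulated by a plain loop here.
    map_outputs = [write_intermediate(run_map_task(s), num_reducers) for s in shards]
    results = {}
    for r in range(num_reducers):
        # Each reducer reads its own bucket from every map task's output.
        results.update(run_reduce_task([out[r] for out in map_outputs]))
    return results  # Step 7: return to the user

result = word_count(["a b", "b c", "c c a"])
```

Because the partition function is deterministic, every occurrence of a key lands at the same reducer, which is what makes the per-key sums correct.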

MapReduce: the complete picture

We need a distributed file system!

2. Spark

Intro to Spark

• Spark is really a different implementation of the MapReduce programming model

• What makes Spark different is that it operates on main memory

• Spark: we write programs in terms of operations on resilient distributed datasets (RDDs).

• RDD (simple view): a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel.

• RDD (complex view): an RDD is an interface for data transformation. An RDD refers to data stored either in a persisted store (HDFS), in a cache (memory, memory + disk, disk only), or in another RDD
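The "simple view" of an RDD can be mimicked on one machine: split a collection into partitions and apply the same function to each partition in parallel. In this sketch, threads stand in for cluster nodes (in Spark the partitions live on different machines), and the helper names are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def partition_elements(elements, num_partitions):
    """Deal elements round-robin into num_partitions lists."""
    return [elements[i::num_partitions] for i in range(num_partitions)]

def map_partitions(partitions, fn):
    """Apply fn to every element, using one worker thread per partition."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # pool.map returns results in the same order as the input partitions.
        return list(pool.map(lambda part: [fn(x) for x in part], partitions))

partitions = partition_elements(list(range(10)), 3)
squared = map_partitions(partitions, lambda x: x * x)
total = sum(sum(part) for part in squared)
```

No partition needs data from any other partition to apply `fn`, which is why this kind of operation parallelizes without communication.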

RDDs in Spark

MapReduce vs Spark

RDDs

• Partitions are recomputed on failure or cache eviction

• Metadata stored for interface:
  • Partitions – set of data splits associated with this RDD
  • Dependencies – list of parent RDDs involved in computation
  • Compute – function to compute partition of the RDD given the parent partitions from the Dependencies
  • Preferred Locations – where is the best place to put computations on this partition (data locality)
  • Partitioner – how the data is split into partitions
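To make the metadata list concrete, here is a toy model with partitions, a single parent dependency, and a compute function, plus a cache whose entries can be evicted and rebuilt from lineage. `ToyRDD` is a hypothetical class written for this sketch and is far simpler than Spark's actual RDD; preferred locations and the partitioner are left out.

```python
class ToyRDD:
    """Toy model of RDD metadata; not Spark's actual class."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.source = partitions   # base data (only set for a root RDD)
        self.parent = parent       # Dependencies: the parent RDD, if any
        self.fn = fn               # used by compute() for derived RDDs
        self.cache = {}            # partition index -> cached result
        self.compute_calls = 0     # counts how often compute() ran

    def num_partitions(self):
        """Partitions: a derived RDD has as many splits as its parent."""
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def map(self, fn):
        """Record a new RDD that depends on this one; nothing runs yet."""
        return ToyRDD(parent=self, fn=fn)

    def compute(self, index):
        """Compute: build one partition from the parent's partition."""
        self.compute_calls += 1
        if self.parent is None:
            return list(self.source[index])
        return [self.fn(x) for x in self.parent.get(index)]

    def get(self, index):
        """Return a partition from cache, recomputing it if it is missing."""
        if index not in self.cache:
            self.cache[index] = self.compute(index)
        return self.cache[index]

    def evict(self, index):
        """Simulate cache eviction or loss of a partition."""
        self.cache.pop(index, None)

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.get(i)]


base = ToyRDD(partitions=[[1, 2], [3, 4]])
doubled = base.map(lambda x: 2 * x)
first = doubled.collect()    # computes both partitions
doubled.evict(1)             # lose partition 1
second = doubled.collect()   # only partition 1 is recomputed
```

After the eviction, only the missing partition is rebuilt, and it is rebuilt from the parent through the recorded dependency rather than from a replica.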

RDDs

DAG

• Directed Acyclic Graph – sequence of computations performed on data

• Node – RDD partition

• Edge – transformation on top of the data

• Acyclic – graph cannot return to the older partition

• Directed – transformation is an action that transitions data partitions state (from A to B)
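The "directed" and "acyclic" properties are what allow a lineage graph to be executed in a well-defined order. A minimal sketch, assuming the graph is given as (parent, child) edges with made-up node names, uses Kahn's algorithm to produce such an order and to reject graphs with cycles.

```python
from collections import deque

def topological_order(edges):
    """Return nodes in dependency order, or raise if the graph has a cycle.

    edges: list of (parent, child) pairs.
    """
    children = {}
    in_degree = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        children.setdefault(child, [])
        in_degree[child] = in_degree.get(child, 0) + 1
        in_degree.setdefault(parent, 0)

    # Start from nodes that depend on nothing.
    ready = deque(sorted(n for n, d in in_degree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                ready.append(child)

    # Nodes left over can only be part of a cycle.
    if len(order) != len(in_degree):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order

lineage = [("lines", "words"), ("words", "pairs"), ("pairs", "counts")]
order = topological_order(lineage)
```

If an edge pointed back to an older node, no valid order would exist and the function would raise instead.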

Example: Word Count
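The slide's own code did not survive extraction, so what follows is a sketch of how word count is commonly written as a chain of RDD transformations, not a reconstruction of the original. `flatMap`, `map`, `reduceByKey`, and `collect` are the names of the corresponding PySpark RDD methods; `LocalRDD` is a hypothetical in-process stand-in, added only so the sketch runs without a Spark installation.

```python
def word_count(lines):
    """Word count as a chain of RDD-style transformations.

    `lines` is any object offering flatMap, map, and reduceByKey.
    """
    return (lines.flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))


class LocalRDD:
    """Hypothetical stand-in for an RDD; eager and single-process."""

    def __init__(self, items):
        self.items = list(items)

    def flatMap(self, fn):
        return LocalRDD(y for x in self.items for y in fn(x))

    def map(self, fn):
        return LocalRDD(fn(x) for x in self.items)

    def reduceByKey(self, fn):
        acc = {}
        for key, value in self.items:
            acc[key] = fn(acc[key], value) if key in acc else value
        return LocalRDD(acc.items())

    def collect(self):
        return list(self.items)


counts = dict(word_count(LocalRDD(["to be or", "not to be"])).collect())
```

With PySpark installed, the same `word_count` function could be given `sc.textFile(path)`, where `sc` is a `SparkContext` and `path` is a placeholder for an input file. Unlike the eager stand-in, Spark would defer all work until an action such as `collect()` is called.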

Spark Architecture

Spark Components

Spark Driver

• Entry point of the Spark Shell (Scala, Python, R)

• The place where SparkContext is created

• Translates RDD into the execution graph

• Splits graph into stages

• Schedules tasks and controls their execution

• Stores metadata about all the RDDs and their partitions

• Brings up Spark WebUI with job information
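The "splits graph into stages" step can be illustrated with a simplified rule: operations that need no data movement between partitions stay in the same stage, and a new stage begins wherever a shuffle is required. The slides do not spell this rule out, so treat it as background; the function and the `needs_shuffle` flag are hypothetical, and only a linear chain of operations is handled.

```python
def split_into_stages(operations):
    """Group a linear chain of operations into stages.

    operations: list of (name, needs_shuffle) pairs in execution order.
    A new stage starts at every operation that needs a shuffle.
    """
    stages = []
    current = []
    for name, needs_shuffle in operations:
        if needs_shuffle and current:
            # Close the running stage before the shuffle boundary.
            stages.append(current)
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

plan = [("textFile", False), ("flatMap", False), ("map", False),
        ("reduceByKey", True), ("filter", False)]
stages = split_into_stages(plan)
```

For the word-count style plan above this yields two stages, with the boundary at `reduceByKey`. Tasks within one stage can run back to back on the same partition, while the next stage has to wait for the shuffle.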

Spark Executor

• Stores the data in cache in JVM heap or on HDDs

• Reads data from external sources

• Writes data to external sources

• Performs all the data processing

DAG Scheduler

More RDD Operations

Spark’s secret is really the RDD abstraction
