
Spark Basics 2



Lazy evaluation

We can combine map and reduce operations to perform more complex operations.

Suppose we want to compute the sum of the squares

$$\sum_{i=1}^{n} x_i^2$$

where the elements $x_i$ are stored in an RDD.


Create an RDD

In [2]: B = sc.parallelize(range(4))
        B.collect()

Out[2]: [0, 1, 2, 3]


Sequential syntax

Perform an assignment after each computation.

In [3]: Squares = B.map(lambda x: x*x)
        Squares.reduce(lambda x, y: x+y)

Out[3]: 14


Cascaded syntax

Combine the computations into a single cascaded command.

In [4]: B.map(lambda x: x*x)\
         .reduce(lambda x, y: x+y)

Out[4]: 14


Both syntaxes mean the same thing

The only difference:

In the sequential syntax the intermediate RDD has a name, Squares; in the cascaded syntax the intermediate RDD is anonymous.

The execution is identical!


Sequential execution

The standard way that the map and the reduce are executed is to

1. perform the map,
2. store the resulting RDD in memory,
3. perform the reduce.

Disadvantages of sequential execution

1. The intermediate result (Squares) requires memory space.
2. Two scans of memory (of B, then of Squares) - double the cache misses.


Pipelined execution

Perform the whole computation in a single pass. For each element of B:

1. Compute the square.
2. Enter the square as input to the reduce operation.

Advantages of pipelined execution

1. Less memory required - the intermediate result is not stored.
2. Faster - only one pass through the input RDD.
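The same distinction exists in plain Python, which makes for a minimal sketch of the idea (an analogy only, not Spark itself; B_local is a local list standing in for the RDD B):

B_local = range(4)

# sequential: materialize the intermediate list of squares, then scan it again
squares = [x*x for x in B_local]        # intermediate result stored in memory
print sum(squares)                      # second pass, over squares; prints 14

# pipelined: each square is fed to sum() as soon as it is produced,
# so no intermediate list is ever stored
print sum(x*x for x in B_local)         # single pass; prints 14

Spark's lazy evaluation, described next, achieves the same effect across the partitions of an RDD.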


Lazy Evaluation

This type of pipelined evaluation is related to lazy evaluation. The word "lazy" is used because the first command (computing the square) is not executed immediately. Instead, the execution is delayed as long as possible so that several commands are executed in a single pass.

The delayed commands are organized in an execution plan.
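A minimal sketch of the laziness at work, using the RDD B from In [2] (toDebugString is the standard RDD method that prints the recorded plan, i.e. the RDD's lineage):

Squares = B.map(lambda x: x*x)          # returns immediately; nothing is computed yet
print Squares.toDebugString()           # shows the execution plan Spark has recorded
print Squares.reduce(lambda x, y: x+y)  # only this action triggers the computation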


An instructive mistake

Here is another way to compute the sum of the squares using a single reduce command. What is wrong with it?

In [5]: C = sc.parallelize([1,1,1])
        C.reduce(lambda x, y: x*x + y*y)

Out[5]: 5
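The problem: reduce assumes that the combining function is associative and commutative, and f(x, y) = x*x + y*y is neither. Spark reduces each partition locally and then merges the partial results, so the answer depends on how the data happens to be partitioned; and in any case the sum of the squares of [1,1,1] is 3, not 5. A minimal sketch of both points (D1 and D2 are illustrative names introduced here; the expected values assume Spark's reduce-within-partitions-then-merge behavior):

# the correct computation: square first, then combine with an associative function
print C.map(lambda x: x*x).reduce(lambda x, y: x + y)   # prints 3

# the broken reduce gives partition-dependent answers (D1, D2 are new RDDs)
D1 = sc.parallelize([1,1,1,1], 1)        # one partition
D2 = sc.parallelize([1,1,1,1], 2)        # two partitions of two elements each
print D1.reduce(lambda x, y: x*x + y*y)  # expect 26: f(1,1)=2, f(2,1)=5, f(5,1)=26
print D2.reduce(lambda x, y: x*x + y*y)  # expect 8: each partition gives 2, then f(2,2)=8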


Getting information about an RDD

RDDs typically have hundreds of thousands of elements, so it usually makes no sense to print out the contents of a whole RDD. Here are some ways to get manageable amounts of information about an RDD.

In [6]: n = 1000000
        B = sc.parallelize([0,0,1,0]*(n/4))


In [7]: # find the number of elements in the RDD
        B.count()

Out[7]: 1000000


In [8]: # get the first few elements of an RDD
        print 'first element =', B.first()
        print 'first 5 elements =', B.take(5)

first element = 0
first 5 elements = [0, 0, 1, 0, 0]
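Another cheap piece of information is the number of partitions the RDD is split into (getNumPartitions is a standard RDD method):

# how many partitions is B spread across?
print B.getNumPartitions()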


Sampling an RDD

RDDs are often very large. Aggregates, such as averages, can be approximated efficiently by using a sample. Sampling is done in parallel and it keeps the data local.

In [9]: # get a sample whose expected size is m
        m = 5.
        B.sample(False, m/n).collect()

Out[9]: [1, 0, 1, 0, 0, 0]
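To make the claim about approximating aggregates concrete, here is a minimal sketch (the sample is random, so the estimate varies from run to run). It estimates the mean of B from a sample of expected size 10,000 instead of scanning all 1,000,000 elements; takeSample(withReplacement, num) is the related method that returns a list of exactly num elements:

# estimate the mean of B from a sample of expected size 10000
s = B.sample(False, 10000./n).collect()
print 'estimated mean:', float(sum(s))/len(s)   # should be close to the true mean, 0.25

# takeSample returns exactly num elements, as a local list rather than an RDD
s = B.takeSample(False, 10000)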


Filtering an RDD

The method RDD.filter(func) returns a new dataset formed by selecting those elements of the source on which func returns true.

In [10]: # How many positive numbers?
         B.filter(lambda n: n > 0).count()

Out[10]: 250000


Removing duplicate elements from an RDD

The method RDD.distinct(numPartitions=None) returns a new dataset that contains the distinct elements of the source dataset.

The number of partitions is specified through the numPartitions argument. Each of these partitions is potentially on a different machine.

In [11]: # remove the duplicate elements in DuplicateRDD to get a distinct RDD
         DuplicateRDD = sc.parallelize([1,1,2,2,3,3])
         DistinctRDD = DuplicateRDD.distinct()
         DistinctRDD.collect()

Out[11]: [1, 2, 3]
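The numPartitions argument is easy to observe with getNumPartitions (a small sketch; note that the order of the elements after the shuffle is not guaranteed):

# ask distinct() to leave its result in 2 partitions
DistinctRDD2 = DuplicateRDD.distinct(numPartitions=2)
print DistinctRDD2.getNumPartitions()   # expect 2
print DistinctRDD2.collect()            # [1, 2, 3], possibly in a different order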


Flatmap an RDD

The method RDD.flatMap(func) is similar to map, but each input item can be mapped to 0 or more output items (so func should return a sequence rather than a single item).

In [12]: text = ["you are my sunshine", "my only sunshine"]
         text_file = sc.parallelize(text)
         # map each line in text to a list of words
         print 'map:', text_file.map(lambda line: line.split(" ")).collect()
         # create a single list of words by combining the words from all of the lines
         print 'flatmap:', text_file.flatMap(lambda line: line.split(" ")).collect()

map: [['you', 'are', 'my', 'sunshine'], ['my', 'only', 'sunshine']]
flatmap: ['you', 'are', 'my', 'sunshine', 'my', 'only', 'sunshine']


Set operations

In this part, we explore set operations in pyspark, including union, intersection, subtract, and cartesian.

In [13]: rdd1 = sc.parallelize([1, 1, 2, 3])
         rdd2 = sc.parallelize([1, 3, 4, 5])


1. union(other): Return the union of this RDD and another one.

In [14]: rdd1.union(rdd2).collect()

Out[14]: [1, 1, 2, 3, 1, 3, 4, 5]
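Note from the output that union, unlike a mathematical set union, keeps duplicates. A set-style union can be obtained by chaining distinct after union, at the cost of a shuffle:

# remove the duplicates that union keeps
print rdd1.union(rdd2).distinct().collect()   # the elements 1, 2, 3, 4, 5, in some order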


2. intersection(other): Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did. Note that this method performs a shuffle internally.

In [15]: rdd1.intersection(rdd2).collect()

Out[15]: [1, 3]


3. subtract(other, numPartitions=None): Return each value in self that is not contained in other.

In [16]: rdd1.subtract(rdd2).collect()

Out[16]: [2]


4. cartesian(other): Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.

In [17]: print rdd1.cartesian(rdd2).collect()

[(1, 1), (1, 3), (1, 4), (1, 5), (1, 1), (1, 3), (1, 4), (1, 5), (2, 1), (2, 3), (2, 4), (2, 5), (3, 1), (3, 3), (3, 4), (3, 5)]