Overview of Spark project
Presented by Yin Zhu ([email protected])
25 March 2013

Materials from:
• http://spark-project.org/documentation/
• Hadoop in Practice by A. Holmes
• My demo code: https://bitbucket.org/blackswift/spark-example
Outline
• Review of MapReduce (15’)
• Going through the Spark (NSDI ’12) slides (30’)
• Demo (15’)
Review of MapReduce

PageRank implemented in Scala and Hadoop. Why PageRank?
» More complicated than the “hello world” WordCount
» Widely used in search engines, with very large-scale input (the whole web graph!)
» An iterative algorithm (the typical style of most numerical algorithms for data mining)
New score: 0.15 + 0.85 × 0.5 = 0.575 (for a page whose incoming contributions sum to 0.5).
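For reference, this arithmetic follows the standard damped PageRank update with damping factor 0.85 (a reconstruction; the slides only show the worked number):

```latex
\mathrm{rank}(p) \;=\; 0.15 \;+\; 0.85 \sum_{q \to p} \frac{\mathrm{rank}(q)}{|\mathrm{out}(q)|}
```

With a single incoming contribution of 0.5, the new score is 0.15 + 0.85 × 0.5 = 0.575.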
A functional implementation

```scala
def pagerank(links: Array[(UrlId, Array[UrlId])], numIters: Int): Map[UrlId, Double] = {
  val n = links.size
  // init: each node starts with rank 1.0
  var ranks = Array.tabulate(n)(i => (i, 1.0)).toMap
  for (iter <- 1 to numIters) {
    // Map phase: each node spreads its current rank evenly over its outgoing links
    val contrib = links
      .flatMap { case (url, outLinks) =>
        val score = ranks(url) / outLinks.size // the score each linked page receives
        outLinks.map(dest => (dest, score))
      }
      // Reduce phase: group the (url, score) pairs by url and sum each group
      .groupBy(_._1)
      .map { case (url, scores) =>
        (url, scores.foldLeft(0.0)((sum, p) => sum + p._2))
      }
    // Damped update: pages with no contributions keep their old rank
    ranks = ranks.map { case (url, oldRank) =>
      (url, if (contrib.contains(url)) 0.85 * contrib(url) + 0.15 else oldRank)
    }
  }
  ranks
}
```
(In the code above, the flatMap stage plays the role of Map, and the groupBy-and-sum stage plays the role of Reduce.)
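As a quick check, here is a hypothetical call on a tiny graph. `UrlId` is not defined on the slide, so this sketch assumes it is an alias for `Int` with nodes numbered 0 to n-1:

```scala
// Hypothetical driver for the pagerank function above.
type UrlId = Int

// 3-node graph: 0 links to 1 and 2; 1 and 2 link back to 0.
val links: Array[(UrlId, Array[UrlId])] = Array(
  (0, Array(1, 2)),
  (1, Array(0)),
  (2, Array(0))
)

val ranks = pagerank(links, 20)
ranks.toSeq.sortBy(_._1).foreach { case (url, rank) =>
  println("url " + url + ": " + rank)
}
```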
Hadoop implementation: https://github.com/alexholmes/hadoop-book/tree/master/src/main/java/com/manning/hip/ch7/pagerank
Hadoop/MapReduce implementation
• Fault tolerance: the result after each iteration is saved to disk
• Speed/disk I/O: disk I/O is proportional to the number of iterations; ideally the link graph should be loaded only once
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

UC Berkeley
Motivation

MapReduce greatly simplified “big data” analysis on large, unreliable clusters. But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative machine learning & graph processing)
» More interactive ad-hoc queries

Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing)
Motivation

Complex apps and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing.

In MapReduce, the only way to share data across jobs is stable storage. Slow!
Examples

[Figure: two data-sharing patterns in MapReduce. Iterative: Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → … Interactive: each of query 1, query 2, query 3, … does its own HDFS read of the input to produce result 1, result 2, result 3, …]

Slow due to replication and disk I/O, but necessary for fault tolerance.
Goal: In-Memory Data Sharing

[Figure: the same two patterns with in-memory sharing. Iterative: Input → iter. 1 → iter. 2 → …, passing data directly in memory. Interactive: one-time processing of the input, then query 1, query 2, query 3, … run against the in-memory result.]

10-100× faster than network/disk, but how to get fault tolerance?
Challenge

How to design a distributed memory abstraction that is both fault-tolerant and efficient?
Solution: Resilient Distributed Datasets (RDDs)

A restricted form of distributed shared memory:
» Immutable, partitioned collections of records
» Can only be built through coarse-grained deterministic transformations (map, filter, join, …)

Efficient fault recovery using lineage:
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails
Three core concepts of RDD
• Transformations define a new RDD from an existing RDD
• Cache and Partitioner put the dataset into memory (or other persistence media) and specify the locations of the sub-datasets
• Actions carry out the actual computation (a minimal sketch follows)
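A minimal sketch of the three concepts in action, assuming a `SparkContext` named `sc` (all names other than the Spark API are illustrative):

```scala
val nums = sc.parallelize(1 to 1000000)        // an RDD[Int]

val evens = nums.filter(_ % 2 == 0)            // transformation: lazy, only defines a new RDD
val squares = evens.map(n => n.toLong * n)     // another lazy transformation

squares.cache()                                // ask Spark to keep this dataset in memory

val total = squares.reduce(_ + _)              // action: actually runs the computation
val sample = squares.take(5)                   // action: reuses the cached partitions
```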
RDD and its lazy transformations

RDD[T]: a sequence of objects of type T. Transformations are lazy; see
https://github.com/mesos/spark/blob/master/core/src/main/scala/spark/PairRDDFunctions.scala
```scala
/** Return an RDD of grouped items. */
def groupBy[K: ClassManifest](f: T => K, p: Partitioner): RDD[(K, Seq[T])] = {
  val cleanF = sc.clean(f)
  this.map(t => (cleanF(t), t)).groupByKey(p)
}

/**
 * Group the values for each key in the RDD into a single sequence. Allows controlling the
 * partitioning of the resulting key-value pair RDD by passing a Partitioner.
 */
def groupByKey(partitioner: Partitioner): RDD[(K, Seq[V])] = {
  def createCombiner(v: V) = ArrayBuffer(v)
  def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v
  def mergeCombiners(b1: ArrayBuffer[V], b2: ArrayBuffer[V]) = b1 ++= b2
  val bufs = combineByKey[ArrayBuffer[V]](
    createCombiner _, mergeValue _, mergeCombiners _, partitioner)
  bufs.asInstanceOf[RDD[(K, Seq[V])]]
}
```
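For instance, a small usage sketch of `groupBy` (again assuming a `SparkContext` named `sc`; the single-argument overload picks a default partitioner):

```scala
val words = sc.parallelize(List("spark", "scala", "shark", "mesos"))

// Key each word by its first letter, then group: RDD[(Char, Seq[String])].
val byFirstLetter = words.groupBy(w => w(0))

byFirstLetter.collect().foreach { case (letter, ws) =>
  println(letter + " -> " + ws.mkString(", "))
}
```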
Actions: carry out the actual computation

Actions are implemented on top of SparkContext.runJob; see
https://github.com/mesos/spark/blob/master/core/src/main/scala/spark/SparkContext.scala
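As an example of the pattern, `count` is built on `runJob` roughly like this (a simplified sketch, not the exact source): each task counts one partition's elements, and the driver sums the per-partition counts:

```scala
// Inside RDD[T]: run a job that counts each partition's elements,
// then sum the Array[Long] of per-partition results on the driver.
def count(): Long = {
  sc.runJob(this, (iter: Iterator[T]) => {
    var result = 0L
    while (iter.hasNext) {
      result += 1L
      iter.next()
    }
    result
  }).sum
}
```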
Example

```scala
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))

messages.persist()  // or .cache()

messages.filter(_.contains("foo")).count
messages.filter(_.contains("bar")).count
```
Task Scheduler for actions
» Dryad-like DAGs
» Pipelines functions within a stage
» Locality & data reuse aware
» Partitioning-aware to avoid shuffles

[Figure: a job's DAG of RDDs A through G, cut into stages at shuffle boundaries: Stage 1 runs A → groupBy → B; Stage 2 runs C → map → D, with D and E feeding union → F; Stage 3 joins B and F into G. Already-cached data partitions are marked and need not be recomputed.]

A sketch of how such stage boundaries arise is below.
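Assuming a `SparkContext` named `sc`: narrow, per-partition operations like `map` are pipelined into one stage, while `groupByKey` forces a shuffle and starts a new one:

```scala
val pairs = sc.parallelize(1 to 100)
  .map(i => (i % 10, i))         // pipelined with parallelize: same stage

val grouped = pairs.groupByKey() // shuffle boundary: a new stage begins here

val sizes = grouped
  .mapValues(_.size)             // pipelined within the post-shuffle stage
  .collect()                     // action: submits the whole DAG to the scheduler
```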
RDD Recovery

[Figure: the in-memory sharing diagrams again: one-time processing of the input serving queries 1, 2, 3, …, and input flowing through iterations 1, 2, …; this time the in-memory data is recoverable.]
Fault Recovery

RDDs track the graph of transformations that built them (their lineage) to rebuild lost data. E.g.:

```scala
messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))
```

[Figure: the lineage chain HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…)).]
Fault Recovery Results

[Chart: iteration time (s) over 10 iterations: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59. The "Failure happens" marker points at the one elevated mid-run iteration (81 s vs. roughly 58 s normally): lost partitions are recomputed from lineage within that iteration instead of re-running the job.]
Generality of RDDs

Despite their restrictions, RDDs can express surprisingly many parallel algorithms:
» These naturally apply the same operation to many items

Unify many current programming models:
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, …

Support new apps that these models don’t.
Tradeoff Space

[Figure: design space plotting granularity of updates (fine vs. coarse) against write throughput (low vs. high). Fine-grained K-V stores, databases, and RAMCloud are best for transactional workloads; coarse-grained HDFS and RDDs are best for batch workloads. Network bandwidth and memory bandwidth mark the attainable throughput limits.]
Outline
• Spark programming interface
• Implementation
• Demo
• How people are using Spark
Spark Programming Interface

A DryadLINQ-like API in the Scala language, usable interactively from the Scala interpreter. Provides:
» Resilient distributed datasets (RDDs)
» Operations on RDDs: transformations (build new RDDs) and actions (compute and output results)
» Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.); see the sketch below
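A sketch of those two control knobs, assuming 2013-era Spark (package `spark`) and a `SparkContext` named `sc`:

```scala
import spark.HashPartitioner
import spark.storage.StorageLevel

val pairs = sc.parallelize(1 to 100).map(i => (i % 10, i))

// Persistence: choose where the RDD lives once computed.
pairs.persist(StorageLevel.MEMORY_ONLY)

// Partitioning: control the layout of keys across nodes.
val laidOut = pairs.partitionBy(new HashPartitioner(8))
```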
Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

```scala
lines = spark.textFile("hdfs://...")           // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
messages = errors.map(_.split('\t')(2))
messages.persist()

messages.filter(_.contains("foo")).count       // action
messages.filter(_.contains("bar")).count
```

[Figure: the driver (Master) ships tasks to Workers; each Worker reads its block of the file (Block 1-3), keeps its partition of messages (Msgs. 1-3) in memory, and sends results back.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).
Example: PageRank

1. Start each page with a rank of 1
2. On each iteration, update each page’s rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|

```scala
links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
}
```

A self-contained runnable sketch follows.
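A minimal self-contained version of the slide's simplified update (no damping term), assuming 2013-era Spark (package `spark`, constructor-style `SparkContext`) and an input file of "url neighbor" pairs, one edge per line; the file path and iteration count are illustrative:

```scala
import spark.SparkContext
import spark.SparkContext._  // brings in pair-RDD operations like join

object PageRank {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "PageRank")

    // Build (url, neighbors) once and cache it: it is reused every iteration.
    val links = sc.textFile("links.txt")
      .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
      .groupByKey()
      .cache()

    var ranks = links.mapValues(_ => 1.0)  // every page starts at rank 1

    for (i <- 1 to 10) {
      val contribs = links.join(ranks).flatMap {
        case (url, (neighbors, rank)) =>
          neighbors.map(dest => (dest, rank / neighbors.size))
      }
      ranks = contribs.reduceByKey(_ + _)
    }

    ranks.collect().foreach { case (url, rank) => println(url + ": " + rank) }
  }
}
```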
Optimizing Placement

links & ranks are repeatedly joined. We can co-partition them (e.g. hash both on URL) to avoid shuffles, and can also use app knowledge, e.g., hash on DNS name:

```scala
links = links.partitionBy(new URLPartitioner())
```

[Figure: the iteration dataflow: Links (url, neighbors) and Ranks0 (url, rank) feed a join whose contributions (Contribs0) are reduced into Ranks1; join and reduce repeat through Contribs2 into Ranks2, and so on. With co-partitioned inputs, each join runs without a shuffle.]

A sketch of such a partitioner follows.
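`URLPartitioner` is application code, not a Spark built-in; here is a sketch of what it might look like (the class name and hostname-hashing scheme are illustrative), hashing on hostname so pages from one site land in the same partition:

```scala
import spark.Partitioner

class URLPartitioner(val numPartitions: Int) extends Partitioner {
  // Hash only the hostname, so e.g. all pages under one DNS name co-locate.
  def getPartition(key: Any): Int = {
    val host = new java.net.URL(key.toString).getHost
    ((host.hashCode % numPartitions) + numPartitions) % numPartitions
  }
  // Equal partitioners let Spark prove two RDDs are co-partitioned
  // and skip the shuffle on join.
  override def equals(other: Any): Boolean = other match {
    case p: URLPartitioner => p.numPartitions == numPartitions
    case _ => false
  }
}
```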
PageRank Performance

[Chart: time per iteration (s): Hadoop 72; Basic Spark 23; Spark + Controlled Partitioning, faster still.]
Implementation

Runs on Mesos [NSDI ’11] to share clusters w/ Hadoop. Can read from any Hadoop input source (HDFS, S3, …).

[Figure: Spark, Hadoop, and MPI all running side by side on Mesos across the cluster's nodes.]

No changes to Scala language or compiler:
» Reflection + bytecode analysis to correctly ship code

www.spark-project.org
Behavior with Insufficient RAM

[Chart: iteration time (s) vs. percent of working set in memory; performance degrades gracefully as less of the working set fits.]

| Percent of working set in memory | 0% | 25% | 50% | 75% | 100% |
| --- | --- | --- | --- | --- | --- |
| Iteration time (s) | 68.8 | 58.1 | 40.7 | 29.7 | 11.5 |
Scalability

[Charts: iteration time (s) vs. number of machines for two workloads.]

Logistic Regression:

| Machines | 25 | 50 | 100 |
| --- | --- | --- | --- |
| Hadoop | 184 | 111 | 76 |
| HadoopBinMem | 116 | 80 | 62 |
| Spark | 15 | 6 | 3 |

K-Means:

| Machines | 25 | 50 | 100 |
| --- | --- | --- | --- |
| Hadoop | 274 | 157 | 106 |
| HadoopBinMem | 197 | 121 | 87 |
| Spark | 143 | 61 | 33 |
Breaking Down the Speedup

[Chart: iteration time (s) by storage layer and input format.]

| | Text input | Binary input |
| --- | --- | --- |
| In-mem HDFS | 15.4 | 8.4 |
| In-mem local file | 13.1 | 6.9 |
| Spark RDD | 2.9 | 2.9 |
Spark Operations

| | Operations |
| --- | --- |
| Transformations (define a new RDD) | map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues |
| Actions (return a result to the driver program) | collect, reduce, count, save, lookupKey |

A short chain that touches both columns is sketched below.
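Assuming a `SparkContext` named `sc` (the output path is illustrative):

```scala
val a = sc.parallelize(List((1, "a"), (2, "b"), (3, "c")))
val b = sc.parallelize(List((1, "x"), (2, "y")))

val joined = a.join(b)                                  // transformation: RDD[(Int, (String, String))]
val merged = joined.mapValues { case (l, r) => l + r }  // transformation

println(merged.count)                                   // action: returns a count to the driver
merged.saveAsTextFile("pairs-out")                      // action: writes the result out
```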
Demo
Open Source Community

15 contributors, 5+ companies using Spark, 3+ application projects at Berkeley.

User applications:
» Data mining 40× faster than Hadoop (Conviva)
» Exploratory log analysis (Foursquare)
» Traffic prediction via EM (Mobile Millennium)
» Twitter spam classification (Monarch)
» DNA sequence analysis (SNAP)
» …
Related Work

RAMCloud, Piccolo, GraphLab, parallel DBs:
» Fine-grained writes requiring replication for resilience

Pregel, iterative MapReduce:
» Specialized models; can’t run arbitrary / ad-hoc queries

DryadLINQ, FlumeJava:
» Language-integrated “distributed dataset” API, but cannot share datasets efficiently across queries

Nectar [OSDI ’10]:
» Automatic expression caching, but over distributed FS

PacMan [NSDI ’12]:
» Memory cache for HDFS, but writes still go to network/disk
Conclusion

RDDs offer a simple and efficient programming model for a broad range of applications. They leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery.

Try it out at www.spark-project.org