Spark
• One popular answer to “What’s beyond MapReduce?”
• Open-source engine for large-scale batch processing
– Supports generalized data flows
– Written in Scala, with bindings in Java and Python
Spark
• Brief history:
– Developed at UC Berkeley AMPLab in 2009
– Open-sourced in 2010
– Became top-level Apache project in February 2014
– Commercial support provided by Databricks
Motivation
• MapReduce greatly simplified “big data” analysis on large, unreliable clusters
• But as soon as it got popular, users wanted more:
– More complex, multi-stage applications (e.g. iterative machine learning & graph processing)
– More interactive ad-hoc queries
Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing)
Motivation
• Complex apps and interactive queries both need one thing that MapReduce lacks:
Efficient primitives for data sharing
In MapReduce, the only way to share data across jobs is stable storage → slow!
[Figure: query 1, query 2, query 3, … each read the input from HDFS (HDFS read) and produce result 1, result 2, result 3; all sharing goes through stable storage]
Slow due to replication and disk I/O, but necessary for fault tolerance
Example Scenarios
[Figure: two scenarios — (1) iterative processing: each iteration (iter. 1, iter. 2, …) reads its input from HDFS and writes its output back to HDFS for the next iteration; (2) one-time processing of the input, after which query 1, query 2, query 3, … each read it from HDFS again]
Goal: In-Memory Data Sharing
10-100× faster than network/disk, but how to get fault tolerance?
Challenge
• Existing storage abstractions have interfaces based on fine-grained updates to mutable state (reads and writes to cells in a table)
– E.g., databases, key-value stores, distributed memory, RAMCloud, Piccolo, …
• Requires replicating data or logs across nodes for fault tolerance
– Costly for data-intensive apps
– 10-100x slower than memory write
Resilient Distributed Datasets (RDDs)
• Distributed shared memory in a restricted form
– Immutable, partitioned collections of records
– Can only be built through coarse-grained deterministic transformations (map, filter, join, …)
• Resilient: efficient fault recovery using lineage
– Log one operation to apply to many elements
– Re-compute lost partitions on failure
– No cost if nothing fails
Fault Recovery of RDDs
• An RDD can only be created through deterministic operations (transformations) on
– (1) data in stable storage, or (2) other RDDs
• An RDD carries enough information about how it was derived from other datasets through a series of transformations (its lineage), so it can always be recomputed from stable storage (disk).
Fault Recovery of RDDs
[Figure: the earlier scenarios revisited — one-time processing of the input feeding query 1, query 2, query 3, …, and the iterative pipeline (iter. 1, iter. 2, …), now sharing data in memory]
Fault Recovery of RDDs
[Plot: iteration time (s) over iterations 1-10, comparing a no-failure run against a run with a failure in the 6th iteration — most iterations take 56-59 s, the first takes 119 s, and the failed 6th iteration rises to 81 s while lost partitions are recomputed]
Generality of RDDs
• Despite their restrictions, RDDs can express surprisingly many parallel algorithms
– These naturally apply the same operation to many items
• Unify many current programming models
– Data flow models: MapReduce, Dryad, SQL, …
– Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (Haloop), bulk incremental, …
• Support new apps that these models don’t
Trade-off Space
[Figure: trade-off space — granularity of updates (fine to coarse) vs. write throughput (low to high, spanning network bandwidth up to memory bandwidth); fine-grained systems (K-V stores, databases, RAMCloud) sit at the low-throughput end, best for transactional workloads, while coarse-grained systems (HDFS, RDDs) sit at the high-throughput end, best for batch workloads]
Programming Interface
• Spark offers over 80 high-level operators that make it easy to build parallel apps.
• Usable interactively from the Scala, Python and R shells.
Programming Interface
• Operations on RDDs
– Transformations define new RDDs
– Actions use RDDs to return results to the driver program, or export results to the storage system
Spark computes RDDs lazily, the first time they are used in an action, so that it can pipeline transformations. I.e., an RDD is defined by transformations, but its computation is only triggered by actions (see the sketch after the operator lists below).
Transformations
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions
collect, reduce, count, save, lookupKey
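A minimal sketch of this laziness, assuming the Spark shell (where sc, a SparkContext, is predefined); the dataset and names are illustrative:
val nums    = sc.parallelize(1 to 1000000)  // base RDD
val squares = nums.map(x => x.toLong * x)   // transformation: defines an RDD, computes nothing
val evens   = squares.filter(_ % 2 == 0)    // transformation: still nothing computed
val n       = evens.count()                 // action: runs the pipelined map + filter once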
Programming Interface
• map != flatMap
– Spark map means a one-to-one mapping
– Spark flatMap maps each input value to one or more outputs (similar to the map in MapReduce)
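A small sketch of the difference, again assuming the shell's sc:
val lines  = sc.parallelize(Seq("to be", "or not"))
val arrays = lines.map(_.split(" "))      // one output per input: an RDD of 2 arrays
val words  = lines.flatMap(_.split(" "))  // zero or more outputs per input: "to", "be", "or", "not"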
Programming Interface
• Control of RDDs
– persistence (storage in RAM, on disk, etc)
• Users can indicate which RDDs they will reuse and choose a storage strategy for them.
– partitioning (layout across nodes)
• Users can indicate how an RDD's elements are partitioned across machines based on a key in each record. This is useful for placement optimizations, such as ensuring that two datasets that will be joined together are hash-partitioned in the same way.
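A minimal sketch of both controls, assuming the shell's sc (the partition count 8 and the storage level are arbitrary choices):
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val byKey = pairs.partitionBy(new HashPartitioner(8))  // partitioning: layout across nodes by key
byKey.persist(StorageLevel.MEMORY_AND_DISK)            // persistence: chosen storage strategy for reuse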
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
messages = errors.map(_.split('\t')(2))
messages.persist()

messages.filter(_.contains("foo")).count  // action
messages.filter(_.contains("bar")).count

[Figure: the master sends tasks to three workers and collects results; each worker reads one block of the file from HDFS, caches its partition of messages (Msgs. 1-3) in memory, and serves subsequent queries from it]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Example: Log Mining
• Fault Recovery
– RDDs track the graph of transformations that built them (their lineage) to rebuild lost data
messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
Example: Simplified PageRank
1. Start each page with a rank of 1
2. On each iteration, update each page’s rank to
rank(p) = Σ_{i ∈ neighbors(p)} rank(i) / |neighbors(i)|
links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
}
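The setup lines are elided on the slide; a hedged sketch of what they might look like (the input path and file format are assumptions; links is cached because it is reused every iteration, and ranks must be a var for the loop above):
val links = sc.textFile("hdfs://.../links.txt")  // assumed format: "url neighbor1 neighbor2 ..."
  .map { line =>
    val parts = line.split("\\s+")
    (parts.head, parts.tail.toSeq)
  }
  .cache()
var ranks = links.mapValues(_ => 1.0)  // start each page with a rank of 1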
Example: Simplified PageRank
• Optimizing Placement
– links & ranks repeatedly joined
– Can co-partition them (e.g. hash both on URL) to avoid shuffles
– Can also use app knowledge, e.g., hash on DNS name
links = links.partitionBy(new URLPartitioner())
(a sketch of such a partitioner follows the figure below)
[Figure: iterative dataflow — Links (url, neighbors) is joined with Ranks0 (url, rank) to produce Contribs0, which a reduce turns into Ranks1; the join/reduce cycle repeats to yield Ranks2, … with Contribs2 along the way]
Example: Simplified PageRank
• Performance:
[Plot: PageRank iteration time (s) on 30 and 60 machines — Hadoop: 171 s / 80 s; Basic Spark: 72 s / 28 s; Spark + Controlled Partitioning: 23 s / 14 s]
Programming Models
• RDDs can express many existing parallel models
– MapReduce, DryadLINQ
– Iterative MapReduce
– Pregel
– SQL
• Enables apps to efficiently intermix these models
All are based on coarse-grained operations
Programming Models
• For example, MapReduce can be expressed using the flatMap and groupByKey operations in Spark, or reduceByKey if there is a combiner (a sketch follows below).
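A minimal sketch (not Spark's own API) of that encoding; myMap and myReduce stand in for user-supplied map and reduce functions:
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def mapReduce[T, K: ClassTag, V: ClassTag, R: ClassTag](
    input: RDD[T],
    myMap: T => Seq[(K, V)],
    myReduce: (K, Iterable[V]) => R): RDD[R] =
  input.flatMap(myMap)                           // map phase: each record emits zero or more pairs
       .groupByKey()                             // shuffle: group values by key
       .map { case (k, vs) => myReduce(k, vs) }  // reduce phase
With a combiner, the groupByKey + map pair collapses into a single reduceByKey call.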
Implementation
• Runs on Mesos to share clusters with Hadoop
• Can read from any Hadoop input source
– HDFS, HBase, S3, …
• No changes to Scala language or compiler
– Reflection + bytecode analysis to correctly ship code
[Figure: Spark, Hadoop, and MPI running side by side on Mesos across the cluster's nodes]
Dependencies
• Narrow
– Each partition of the parent RDD is used by at most one partition of the child RDD (e.g. map, filter)
• Wide
– Each partition of the parent RDD is used by multiple partitions of the child RDD (e.g. groupByKey)
Dependencies
• Narrow vs Wide
– Narrow dependencies allow for pipelined execution on one cluster node, while wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation.
– Recovery after a node failure is more efficient with a narrow dependency, as only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes.
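A small sketch, assuming the shell's sc — map yields a narrow dependency and pipelines, while groupByKey yields a wide one and forces a shuffle:
val pairs   = sc.parallelize(1 to 100).map(i => (i % 10, i))  // narrow: pipelined with parallelize
val grouped = pairs.groupByKey()                              // wide: shuffle boundary
println(grouped.toDebugString)  // prints the lineage; the ShuffledRDD marks the stage boundary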
Task Scheduler
• Dryad-like task DAGs
• Pipelines functions within a stage
• Cache-aware for data reuse and locality
• Partitioning-aware to avoid shuffles
[Figure: example job DAG — RDDs A-G connected by map, union, groupBy, and join operations, split into Stages 1-3 at wide dependencies; the legend marks cached data partitions]
Scalability
[Plots: iteration time (s) vs. number of machines (25 / 50 / 100) for logistic regression and k-means.
Logistic Regression — Hadoop: 184 / 111 / 76 s; HadoopBinMem: 116 / 80 / 62 s; Spark: 15 / 6 / 3 s.
K-Means — Hadoop: 274 / 157 / 106 s; HadoopBinMem: 197 / 121 / 87 s; Spark: 143 / 61 / 33 s]
Breaking Down the Speedup
[Plot: iteration time (s) for text vs. binary input — in-mem HDFS: 15.4 / 8.4 s; in-mem local file: 13.1 / 6.9 s; Spark RDD: 2.9 / 2.9 s]
Behavior with Insufficient RAM
[Plot: iteration time (s) vs. percent of working set in memory — 0%: 68.8 s; 25%: 58.1 s; 50%: 40.7 s; 75%: 29.7 s; 100%: 11.5 s]
Aspect                     | RDD                                          | DSM
Reads                      | Coarse- or fine-grained                      | Fine-grained
Writes                     | Coarse-grained                               | Fine-grained
Consistency                | Trivial (immutable)                          | Up to app / runtime
Fault recovery             | Fine-grained and low-overhead using lineage  | Requires checkpoints and program rollback
Straggler mitigation       | Possible using backup tasks                  | Difficult
Work placement             | Automatic based on data locality             | Up to app (but runtime aims for transparency)
Behavior if not enough RAM | Similar to existing data flow systems        | Poor performance (swapping?)
Motivation
• Many important applications need to process large data streams arriving in real time
– User activity statistics (e.g. Facebook’s Puma)
– Spam detection
– Traffic estimation
– Network intrusion detection
• The target: large-scale apps that must run on tens to hundreds of nodes with O(1 sec) latency
Challenge
• To run at large scale, streaming systems have to be both:
– Fault-tolerant: recover quickly from failures and stragglers
– Cost-efficient: do not require significant hardware beyond that needed for basic processing
• Traditional streaming systems based on the Continuous Operator Model don’t have both properties
Continuous Operator Model
• “Record-at-a-time” processing
– Each node has mutable state
– For each record, update state & send new records
[Figure: input records are pushed through a graph of nodes (node 1-3); each node holds mutable state and emits new records downstream]
Continuous Operator Model
Fault tolerance via replication or upstream backup:
[Figure: replication — primaries (node 1-3) and synchronized replicas (node 1'-3') both process copies of the input; upstream backup — a single standby stands ready to take over for a failed node]
– Replication: fast recovery, but 2x hardware cost
– Upstream backup: only need 1 standby, but slow to recover
Neither approach tolerates stragglers
Observation
• Batch processing models for clusters (e.g. MapReduce) provide fault tolerance efficiently
– Divide job into deterministic tasks
– Rerun failed/slow tasks in parallel on other nodes
Idea
• Discretized Stream (D-Stream) Model
– Run a streaming computation as a series of small, stateless, and deterministic batch computations
• Same recovery schemes at much smaller timescale
• Work to make batch size as small as possible
[Figure: Discretized Stream Model — for each interval (t = 1, t = 2, …), a batch operation pulls that interval's input, an immutable dataset stored reliably, and produces immutable output/state datasets (D-Stream 1, D-Stream 2) stored in memory without replication]
Parallel Recovery
• Checkpoint state datasets periodically
• If a node fails/straggles, recompute its dataset partitions in parallel on other nodes
Faster recovery than upstream backup, without the cost of replication
[Figure: a map over an input dataset producing an output dataset; lost partitions are recomputed in parallel on other nodes]
Programming Interface
• A D-Stream is just a sequence of immutable, partitioned datasets
– Specifically, resilient distributed datasets (RDDs), the storage abstraction in Spark
• Deterministic transformation operators produce new streams
Example: Page Views
[Figure: at each interval (t = 1, t = 2, …), the pageViews D-Stream is mapped to ones and reduced into counts; each D-Stream is a sequence of RDDs, shown with their partitions]
pageViews = readStream("...", "1s")
ones = pageViews.map(ev => (ev.url, 1))  // Scala function literal
counts = ones.runningReduce(_ + _)

sliding = ones.reduceByWindow("5s", _ + _, _ - _)  // incremental version with "add" and "subtract" functions
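The slide uses the D-Streams paper's notation (readStream, runningReduce, reduceByWindow); a minimal sketch of the windowed count in the released Spark Streaming API, where the input source, port, checkpoint path, and batch/window sizes are assumptions:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("page-views").setMaster("local[2]"), Seconds(1))
ssc.checkpoint("hdfs://.../checkpoints")  // required for the incremental window below

val pageViews = ssc.socketTextStream("localhost", 9999)  // assumed source: one URL per line
val ones      = pageViews.map(url => (url, 1))
val sliding   = ones.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // "add" new records entering the window
  (a: Int, b: Int) => a - b,   // "subtract" old records leaving the window
  Seconds(5), Seconds(1))
sliding.print()

ssc.start()
ssc.awaitTermination()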
Timing Considerations
• D-streams group input into intervals based on when records arrive at the system
• For apps that need to group by an “external” time and tolerate network delays, support:
– Slack time: delay starting a batch for a short fixed time to give records a chance to arrive
– Application-level correction: e.g. give a result for time t at time t+1, then use later records to update incrementally at time t+5
How Fast Can It Go?
• Process 2 GB/s (20M records/s) of data on 50 nodes at sub-second latency
[Plots: cluster throughput (GB/s, 0-3) vs. number of nodes in cluster (0-60) for Grep and WordCount, with separate curves for 1 sec and 2 sec latency bounds]
Max throughput within a given latency bound (1 or 2s)
How Fast Can It Go?
• Recovers from failures within 1 second
[Plot: interval processing time (s, 0-2) vs. time (0-75 s); processing time spikes at the point marked "Failure Happens", then returns to normal within a second]
Sliding WordCount on 10 nodes with 30s checkpoint interval
Other Benefits
• Consistency
– Each record is processed atomically
• Unification with batch processing
– Combining streams with historical data
pageViews.join(historicCounts).map(...)
– Interactive ad-hoc queries on stream state
pageViews.slice("21:00", "21:05").topK(10)
Aspect             | Discretized Streams                                            | Continuous Operator
Latency            | 0.5-2 s                                                        | 1-100 ms unless records are batched for consistency
Consistency        | Records processed atomically with the interval they arrive in | Some systems wait a short time to sync operators before proceeding
Late records       | Slack time, or app-level correction                            | Slack time, out-of-order processing
Fault recovery     | Fast parallel recovery                                         | Replication or serial recovery on one node
Straggler recovery | Possible via speculative execution                             | Typically not handled
Mixing with batch  | Simple unification through RDD APIs                            | In some DBs; not in message queueing systems
GraphX = Spark for Graphs
• Integration of record-oriented and graph-oriented processing
• Extends RDDs to Resilient Distributed Property Graphs
– Present different views of the graph (vertices, edges, triplets)
– Support map-like operations
– Support distributed Pregel-like aggregations