
Apache Spark

Dell Zhang

Birkbeck, University of London

2017/18

Cloud Computing

Spark

• One popular answer to “What’s beyond MapReduce?”

• Open-source engine for large-scale batch processing

– Supports generalized data flows

– Written in Scala, with bindings in Java and Python

Spark

• Brief history:

– Developed at UC Berkeley AMPLab in 2009

– Open-sourced in 2010

– Became top-level Apache project in February 2014

– Commercial support provided by Databricks

Source: Zaharia et al. (NSDI 2012)

Spark

Motivation

• MapReduce greatly simplified “big data” analysis on large, unreliable clusters

• But as soon as it got popular, users wanted more:

– More complex, multi-stage applications

(e.g. iterative machine learning & graph processing)

– More interactive ad-hoc queries

Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing)

Motivation

• Complex apps and interactive queries both need one thing that MapReduce lacks:

Efficient primitives for data sharing

In MapReduce, the only way to share data across jobs is stable storage → slow!

[Diagram: each query reads the input from HDFS (HDFS read) and produces its own result (query 1 → result 1, query 2 → result 2, query 3 → result 3, …).]

Slow due to replication and disk I/O, but necessary for fault tolerance

Example Scenarios

[Diagram of example scenarios: iterative processing, where each iteration (iter. 1, iter. 2, …) writes its result to HDFS and the next iteration reads it back; and interactive querying, where query 1, query 2, query 3, … each re-read the input from HDFS instead of sharing the result of one-time processing.]

Goal: In-Memory Data Sharing

10-100× faster than network/disk, but how to get fault tolerance?

Challenge

How to design a distributed memory abstraction that is both fault-tolerant and efficient?

Challenge

• Existing storage abstractions have interfaces based on fine-grained updates to mutable state (reads and writes to cells in a table)

– E.g., databases, key-value stores, distributed memory, RAMCloud, Piccolo, …

• Requires replicating data or logs across nodes for fault tolerance

– Costly for data-intensive apps

– 10-100× slower than memory write

Resilient Distributed Datasets (RDDs)

• Distributed shared memory in a restricted form

– Immutable, partitioned collections of records

– Can only be built through coarse-grained deterministic transformations (map, filter, join, …)

• Resilient: efficient fault recovery using lineage

– Log one operation to apply to many elements

– Re-compute lost partitions on failure

– No cost if nothing fails

Fault Recovery of RDDs

• An RDD can only be created through deterministic operations (transformations) on

– (1) data in stable storage, or (2) other RDDs.

• An RDD has enough information about how it was derived from other datasets through a series of transformations (its lineage), so it can always be recomputed from stable storage (disk).

Fault Recovery of RDDs

[Diagram: the same scenarios with in-memory sharing; the input is loaded once (one-time processing) and then reused in memory across iterations (iter. 1, iter. 2, …) and across queries 1, 2, 3, ….]

Fault Recovery of RDDs

[Chart: iteration time (s) for iterations 1-10, comparing a run with no failure against one with a failure in the 6th iteration. Iterations normally take about 57-59 s (the first takes 119 s); the failed 6th iteration takes 81 s while lost partitions are recomputed, after which iteration times return to normal.]

Generality of RDDs

• Despite their restrictions, RDDs can express surprisingly many parallel algorithms

– These naturally apply the same operation to many items

• Unify many current programming models

– Data flow models: MapReduce, Dryad, SQL, …

– Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (Haloop), bulk incremental, …

• Support new apps that these models don’t

Trade-off Space

[Diagram: trade-off space with granularity of updates (fine → coarse) on one axis and write throughput (low → high) on the other. Fine-grained, low-throughput systems (K-V stores, databases, RAMCloud) are best for transactional workloads; coarse-grained systems are best for batch workloads, with HDFS limited by network bandwidth and RDDs operating at memory bandwidth.]

System Architecture

Programming Interface

• Spark offers over 80 high-level operators that make it easy to build parallel apps.

• Usable interactively from the Scala, Python and R shells.

Programming Interface

• Operations on RDDs

– Transformations define new RDDs

– Actions use RDDs to return results to the driver program, or export results to the storage system

Spark computes RDDs lazily, the first time they are used in an action, so that it can pipeline transformations. That is, an RDD is defined by transformations, but its computation is only triggered by actions.
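For example, a minimal sketch (assuming a SparkContext named sc and a text file nums.txt containing one integer per line): the transformations below only build up lineage, and work starts when the action count is called.

val lines = sc.textFile("nums.txt")      // base RDD from stable storage; nothing read yet
val nums  = lines.map(_.toInt)           // transformation: defines a new RDD
val evens = nums.filter(_ % 2 == 0)      // transformation: still no computation

val howMany = evens.count()              // action: triggers one pipelined pass
                                         // (read → parse → filter → count)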

Transformations

map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions

collect, reduce, count, save, lookupKey

Programming Interface

• map != flatMap

– Spark map means a one-to-one mapping

– Spark flatMap maps each input value to zero or more outputs (similar to the map in MapReduce)
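A minimal sketch of the difference (assuming a SparkContext sc):

val lines = sc.parallelize(Seq("to be", "or not"))

lines.map(_.split(" ")).collect()
// one output per input: Array(Array(to, be), Array(or, not))

lines.flatMap(_.split(" ")).collect()
// zero or more outputs per input, flattened: Array(to, be, or, not)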

Programming Interface

• Control of RDDs

– persistence (storage in RAM, on disk, etc)

• Users can indicate which RDDs they will reuse and choose a storage strategy for them.

– partitioning (layout across nodes)

• Users can indicate how an RDD’s elements should be partitioned across machines based on a key in each record. This is useful for placement optimizations, such as ensuring that two datasets that will be joined together are hash-partitioned in the same way.
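Both controls in one small sketch (assuming a SparkContext sc; the file path and field layout are made up for illustration):

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// (key, record) pairs, keyed by the first tab-separated field
val pairs = sc.textFile("hdfs://...").map(line => (line.split("\t")(0), line))

// Partitioning control: hash-partition by key into 100 partitions, so that a
// later join with a dataset partitioned the same way needs no extra shuffle.
val partitioned = pairs.partitionBy(new HashPartitioner(100))

// Persistence control: keep the partitioned RDD in RAM because it will be reused.
partitioned.persist(StorageLevel.MEMORY_ONLY)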

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")

errors = lines.filter(_.startsWith("ERROR"))

messages = errors.map(_.split('\t')(2))

messages.persist()

[Diagram: the driver (master) ships tasks to workers; each worker reads one block of the log from HDFS (Block 1, 2, 3), caches its partition of the transformed RDD in memory (Msgs. 1, 2, 3), and returns results to the driver. lines is the base RDD; errors and messages are transformed RDDs; count is an action.]

messages.filter(_.contains("foo")).count

messages.filter(_.contains("bar")).count

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Example: Log Mining

• Fault Recovery

– RDDs track the graph of transformations that built them (their lineage) to rebuild lost data

messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))

[Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(…)) → MappedRDD (func = _.split(…))]
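In released versions of Spark this lineage can be inspected directly; a small sketch (assuming a SparkContext sc):

val messages = sc.textFile("hdfs://...")
                 .filter(_.contains("ERROR"))
                 .map(_.split('\t')(2))

// Prints the chain of RDDs this one was built from, which is exactly what
// the scheduler replays to rebuild lost partitions.
println(messages.toDebugString)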

Example: Simplified PageRank

1. Start each page with a rank of 1

2. On each iteration, update each page’s rank to

Σ_{i ∈ neighbors} rank_i / |neighbors_i|

links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
}

Example: Simplified PageRank

• Optimizing Placement

– links & ranks are repeatedly joined

– Can co-partition them (e.g. hash both on URL) to avoid shuffles

– Can also use app knowledge, e.g., hash on DNS name

links = links.partitionBy(new URLPartitioner())
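URLPartitioner is not defined on the slide; a minimal sketch of such a custom partitioner, which hashes URLs by hostname so that pages from the same site land in the same partition, might look like this:

import org.apache.spark.Partitioner

// Hypothetical partitioner: all URLs with the same hostname map to the same partition.
class URLPartitioner(val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    val host = Option(new java.net.URI(key.toString).getHost).getOrElse("")
    (host.hashCode & Integer.MAX_VALUE) % numPartitions   // non-negative bucket index
  }
}

// e.g. links = links.partitionBy(new URLPartitioner(100))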

[Diagram: iterative dataflow; Links (url, neighbors) is joined with Ranks_0 (url, rank) to produce Contribs_0, which are reduced into Ranks_1; joining again yields the next contributions and Ranks_2, and so on.]

Example: Simplified PageRank

• Performance

[Chart: PageRank iteration time (s) on 30 and 60 machines. Hadoop: 171 s / 80 s; Basic Spark: 72 s / 28 s; Spark + Controlled Partitioning: 23 s / 14 s.]

Programming Models

• RDDs can express many existing parallel models

– MapReduce, DryadLINQ

– Iterative MapReduce

– Pregel

– SQL

• Enables apps to efficiently intermix these models

All are based on coarse-grained operations

Programming Models

• For example, MapReduce

– can be expressed using the flatMap and groupByKey operations in Spark, or reduceByKey if there is a combiner.
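For instance, word count, the canonical MapReduce job, becomes the following sketch (assuming a SparkContext sc):

// "Map" phase: emit (word, 1) pairs
val ones = sc.textFile("hdfs://...")
             .flatMap(_.split(" "))
             .map(word => (word, 1))

// "Reduce" phase without a combiner: group all values for a key, then sum them
val counts1 = ones.groupByKey().mapValues(_.sum)

// With a combiner: reduceByKey sums partially on the map side before the shuffle
val counts2 = ones.reduceByKey(_ + _)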

Implementation

• Runs on Mesos to share clusters with Hadoop

• Can read from any Hadoop input source

– HDFS, HBase, S3, …

• No changes to Scala language or compiler

– Reflection + bytecode analysis to correctly ship code

[Diagram: Spark, Hadoop, and MPI run side by side on top of Mesos, which manages the cluster nodes.]

Dependencies

• Narrow

– Each partition of the parent RDD is used by at most one partition of the child RDD

• Wide

– Each partition of the parent RDD is used by multiple partitions of the child RDD


Dependencies

• Narrow vs Wide

– Narrow dependencies allow for pipelined execution on one cluster node, while wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation.

– Recovery after a node failure is more efficient with a narrow dependency, as only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes.
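In code, the distinction shows up as whether an operation forces a shuffle; a small sketch (assuming a SparkContext sc):

val pairs = sc.textFile("hdfs://...").map(line => (line.split("\t")(0), 1))

// Narrow dependencies: each output partition depends on a single parent
// partition, so these operations pipeline together within one stage.
val filtered = pairs.filter { case (key, _) => key.nonEmpty }
val bumped   = filtered.mapValues(_ + 1)

// Wide dependency: an output partition may need data from every parent
// partition, so groupByKey shuffles the data and starts a new stage.
val grouped  = bumped.groupByKey()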

Task Scheduler

• Dryad-like task DAGs

• Pipelines functions within a stage

• Cache-aware for data reuse and locality

• Partitioning-aware to avoid shuffles

[Diagram: an example job DAG over RDDs A-G built from groupBy, map, union, and join; the scheduler cuts the DAG into Stages 1-3 at wide dependencies and pipelines the narrow transformations within each stage; a legend marks cached data partitions.]

Scalability

[Charts: iteration time (s) vs number of machines (25, 50, 100). Logistic Regression: Hadoop 184 / 111 / 76 s, HadoopBinMem 116 / 80 / 62 s, Spark 15 / 6 / 3 s. K-Means: Hadoop 274 / 157 / 106 s, HadoopBinMem 197 / 121 / 87 s, Spark 143 / 61 / 33 s.]

Breaking Down the Speedup

[Chart: iteration time (s) when reading from in-memory HDFS, from an in-memory local file, and from a cached Spark RDD. Text input: 15.4 s / 13.1 s / 2.9 s; binary input: 8.4 s / 6.9 s / 2.9 s.]

Behavior with Insufficient RAM

[Chart: iteration time (s) vs percent of the working set held in memory: 68.8 s at 0%, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, 11.5 s at 100%; performance degrades gradually as less of the working set fits in RAM.]

RDDs vs. Distributed Shared Memory (DSM)

• Reads: RDDs coarse- or fine-grained; DSM fine-grained

• Writes: RDDs coarse-grained; DSM fine-grained

• Consistency: trivial for RDDs (immutable); up to the app / runtime for DSM

• Fault recovery: fine-grained and low-overhead using lineage for RDDs; DSM requires checkpoints and program rollback

• Straggler mitigation: possible for RDDs using backup tasks; difficult for DSM

• Work placement: automatic based on data locality for RDDs; up to the app for DSM (but runtimes aim for transparency)

• Behaviour if not enough RAM: RDDs behave similarly to existing data flow systems; DSM suffers poor performance (swapping?)

Spark Streaming

Source: Zaharia et al. (SOSP 2013)

Motivation

• Many important applications need to process large data streams arriving in real time

– User activity statistics (e.g. Facebook’s Puma)

– Spam detection

– Traffic estimation

– Network intrusion detection

• The target: large-scale apps that must run on tens to hundreds of nodes with O(1 sec) latency

Challenge

• To run at large scale, streaming systems have to be both:

– Fault-tolerant: recover quickly from failures and stragglers

– Cost-efficient: do not require significant hardware beyond that needed for basic processing

• Traditional streaming systems based on the Continuous Operator Model don’t have both properties

Continuous Operator Model

• “Record-at-a-time” processing

– Each node has mutable state

– For each record, update state & send new records

[Diagram: input records are pushed through a pipeline of nodes (node 1 and node 2 feed node 3), each holding mutable state.]

Continuous Operator Model

Fault tolerance via replication or upstream backup:

[Diagram: two schemes. Replication: each node (node 1, 2, 3) has a synchronized replica (node 1’, 2’, 3’) processing the same input. Upstream backup: a single standby node takes over for a failed node.]


Replication: fast recovery, but 2× hardware cost. Upstream backup: only needs one standby, but slow to recover.


Neither approach tolerates stragglers

Observation

• Batch processing models for clusters (e.g. MapReduce) provide fault tolerance efficiently

– Divide job into deterministic tasks

– Rerun failed/slow tasks in parallel on other nodes

Idea

• Discretized Stream (D-Stream) Model

– Run a streaming computation as a series of small, stateless, and deterministic batch computations

• Same recovery schemes at much smaller timescale

• Work to make batch size as small as possible

Discretized Stream Model

[Diagram: for each interval (t = 1, t = 2, …), input is pulled into an immutable dataset that is stored reliably; a batch operation over it produces another immutable dataset (output or state) that is stored in memory without replication; the per-interval datasets form D-Stream 1 and D-Stream 2.]

Parallel Recovery

• Checkpoint state datasets periodically

• If a node fails/straggles, recompute its dataset partitions in parallel on other nodes

Faster recovery than upstream backup,without the cost of replication


Programming Interface

• A D-Stream is just a sequence of immutable, partitioned datasets

– Specifically, resilient distributed datasets (RDDs), the storage abstraction in Spark

• Deterministic transformation operators produce new streams

Example: Page Views

[Diagram: at each interval (t = 1, t = 2, …), the pageViews D-Stream is mapped into the ones D-Stream, which is reduced into the counts D-Stream; each box is an RDD made up of partitions.]

pageViews = readStream("...", "1s")

ones = pageViews.map(ev => (ev.url, 1))   // ev => (ev.url, 1) is a Scala function literal

counts = ones.runningReduce(_ + _)

sliding = ones.reduceByWindow("5s", _ + _, _ - _)

Incremental version of the windowed count, with “add” (_ + _) and “subtract” (_ - _) functions
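The slide uses the pseudo-API of the D-Streams paper (readStream, runningReduce). In the released Spark Streaming API, roughly the same computation might be sketched as follows (assuming a SparkContext sc; the socket source, host, and port are made up for illustration):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))     // 1 s batch interval
ssc.checkpoint("hdfs://...")                       // required for stateful operators

// Hypothetical source: each line arriving on the socket is treated as a URL
val pageViews = ssc.socketTextStream("host", 9999)
val ones      = pageViews.map(url => (url, 1))

// Running count per URL across all batches (the runningReduce of the slide)
val counts = ones.updateStateByKey[Int] { (newVals: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + newVals.sum)
}

// 5 s sliding window, recomputed every batch, with add and subtract functions
val sliding = ones.reduceByKeyAndWindow(_ + _, _ - _, Seconds(5), Seconds(1))

ssc.start()
ssc.awaitTermination()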

Timing Considerations

• D-streams group input into intervals based on when records arrive at the system

• For apps that need to group by an “external” time and tolerate network delays, support:

– Slack time: delay starting a batch for a short fixed time to give records a chance to arrive

– Application-level correction: e.g. give a result for time t at time t+1, then use later records to update incrementally at time t+5

How Fast Can It Go?

• Process 2 GB/s (20M records/s) of data on 50 nodes at sub-second latency

[Charts: cluster throughput (GB/s) vs number of nodes in the cluster for Grep and WordCount, showing the maximum throughput attainable within a given latency bound (1 sec or 2 sec).]

How Fast Can It Go?

• Recovers from failures within 1 second

[Chart: interval processing time (s) over time (s) for sliding WordCount on 10 nodes with a 30 s checkpoint interval; processing time spikes briefly at the point where the failure happens, then returns to normal.]

Other Benefits

• Consistency

– Each record is processed atomically

• Unification with batch processing

– Combining streams with historical data

pageViews.join(historicCounts).map(...)

– Interactive ad-hoc queries on stream state

pageViews.slice("21:00", "21:05").topK(10)

Discretized Streams vs. Continuous Operators

• Latency: D-Streams 0.5-2 s; continuous operators 1-100 ms unless records are batched for consistency

• Consistency: D-Streams process records atomically with the interval they arrive in; some continuous-operator systems wait a short time to sync operators before proceeding

• Late records: D-Streams use slack time or app-level correction; continuous operators use slack time or out-of-order processing

• Fault recovery: fast parallel recovery for D-Streams; replication or serial recovery on one node for continuous operators

• Straggler recovery: possible for D-Streams via speculative execution; typically not handled by continuous operators

• Mixing with batch: simple unification through RDD APIs for D-Streams; supported in some DBs but not in message queueing systems for continuous operators

GraphX: Motivation

Source: Gonzalez et al. (OSDI 2014)

GraphX = Spark for Graphs

• Integration of record-oriented and graph-oriented processing

• Extends RDDs to Resilient Distributed Property Graphs

– Present different views of the graph (vertices, edges, triplets)

– Support map-like operations

– Support distributed Pregel-like aggregations
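A small sketch of what this looks like in the GraphX API (assuming a SparkContext sc; the vertex and edge data are made up for illustration):

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Vertices are (id, property) pairs; edges are Edge(srcId, dstId, property)
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

val graph = Graph(users, follows)

// Different views of the same property graph
graph.vertices.count()
graph.edges.count()
graph.triplets.map(t => s"${t.srcAttr} follows ${t.dstAttr}").collect()

// Pregel-like aggregation: count each user's followers by sending a message
// along every edge to its destination and summing the messages per vertex
val followers: VertexRDD[Int] =
  graph.aggregateMessages[Int](ctx => ctx.sendToDst(1), _ + _)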

Property Graph: Example

Underneath the Covers

Take Home Messages

• Spark

– Resilient Distributed Datasets

– Discretized Streams

– Property Graphs