Spark Streaming

Large-scale near-real-time stream processing

UC BERKELEY

Tathagata Das (TD)

Motivation

Many important applications must process large data streams at second-scale latencies
– Check-ins, status updates, site statistics, spam filtering, …

Require large clusters to handle workloads

Require latencies of a few seconds

Case study: Conviva, Inc.

Real-time monitoring of online video metadata

Custom-built distributed streaming system
– 1000s of complex metrics on millions of video sessions
– Requires many dozens of nodes for processing

Hadoop backend for offline analysis
– Generating daily and monthly reports
– Similar computation as the streaming system

Painful to maintain two stacks

Goals

Framework for large-scale stream processing

Scalable to large clusters (~100 nodes) with near-real-time latency (~1 second)

Efficiently recovers from faults and stragglers

Simple programming model that integrates well with batch & interactive queries

Existing systems do not achieve all of them

Existing Streaming Systems

Record-at-a-time processing model
– Each node has mutable state
– For each record, update state & send new records

[Figure: input records are pushed to node 1 and node 2, each holding mutable state; both send new records on to node 3]

Existing Streaming Systems

Storm
– Replays records if not processed due to failure
– Processes each record at least once
– May update mutable state twice!
– Mutable state can be lost due to failure!

Trident
– Uses transactions to update state
– Processes each record exactly once
– Per-state transaction updates are slow

No integration with batch processing

Cannot handle stragglers
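The "may update mutable state twice" problem can be illustrated with a toy model. This is a minimal sketch in plain Scala, not Storm's API: `CountingNode` and `deliverAtLeastOnce` are hypothetical names, and the lost acknowledgement is simulated with a flag.

```scala
import scala.collection.mutable

object ReplayDemo {
  // Hypothetical record-at-a-time operator that keeps mutable per-key counts.
  final class CountingNode {
    val counts: mutable.Map[String, Int] =
      mutable.Map.empty[String, Int].withDefaultValue(0)

    def process(record: String): Unit = counts(record) += 1
  }

  // Simulates at-least-once delivery: the record is applied, and if the
  // acknowledgement is lost the source replays it, so it is applied again.
  def deliverAtLeastOnce(node: CountingNode, record: String, ackLost: Boolean): Unit = {
    node.process(record)
    if (ackLost) node.process(record) // replay after the (simulated) timeout
  }
}
```

Delivering the same record once normally and once with a lost acknowledgement leaves a count of 3 rather than 2, which is the double update the slide warns about.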

Spark Streaming

Discretized Stream Processing

Run a streaming computation as a series of very small, deterministic batch jobs

Batch processing models, like MapReduce, recover from faults and stragglers efficiently
– Divide job into deterministic tasks
– Rerun failed/slow tasks in parallel on other nodes

Same recovery techniques at lower time scales
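The idea of running a stream as a series of small deterministic batch jobs can be sketched without Spark. The following is a simplified model in plain Scala collections, not the Spark Streaming implementation; `Record`, its `timeMs` field, and the word-count job are assumptions chosen for illustration.

```scala
object MicroBatchDemo {
  // A timestamped record; the millisecond timestamp is a hypothetical field.
  final case class Record(timeMs: Long, value: String)

  // Group records into consecutive intervals of batchMs, ordered by interval start.
  def discretize(records: Seq[Record], batchMs: Long): Seq[(Long, Seq[String])] =
    records
      .groupBy(r => (r.timeMs / batchMs) * batchMs)
      .toSeq
      .sortBy(_._1)
      .map { case (start, rs) => (start, rs.sortBy(_.timeMs).map(_.value)) }

  // A deterministic batch job: word count over one batch.
  def wordCount(batch: Seq[String]): Map[String, Int] =
    batch.groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // Run the same batch job once per interval.
  def run(records: Seq[Record], batchMs: Long): Seq[(Long, Map[String, Int])] =
    discretize(records, batchMs).map { case (start, batch) => (start, wordCount(batch)) }
}
```

Because each per-interval job is a pure function of its input batch, a failed or slow job can simply be rerun on the same batch and will produce the same result.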

Spark Streaming

State between batches kept in memory as immutable, fault-tolerant dataset
– Specifically as Spark’s Resilient Distributed Dataset

Batch sizes can be reduced to as low as 1/2 second to achieve ~ 1 second latency

Potentially combine streaming and batch workloads to build a single unified stack

Discretized Stream Processing

[Figure: for each interval (time = 0 - 1, time = 1 - 2, …), the input from the input stream is held as an immutable distributed dataset (replicated in memory); batch operations on it produce the state / output, an immutable distributed dataset stored in memory as an RDD, and these form the state stream]

Fault Recovery

State stored as Resilient Distributed Dataset (RDD)
– Deterministically re-computable parallel collection
– Remembers lineage of operations used to create them

Fault / straggler recovery is done in parallel on other nodes

[Figure: an operation maps the input dataset (replicated and fault-tolerant) to the state RDD (not replicated)]

Fast recovery from faults without full data replication
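Lineage-based recovery can be sketched with a toy dataset type that remembers how it was derived. This is an illustration in plain Scala, not Spark's RDD implementation; `Dataset`, `Source`, `Mapped`, and `loseCache` are hypothetical names.

```scala
object LineageDemo {
  // A toy dataset that can always be computed from its lineage.
  sealed trait Dataset[A] { def compute(): Vector[A] }

  // Source data is assumed to be stored reliably (replicated) elsewhere.
  final case class Source[A](data: Vector[A]) extends Dataset[A] {
    def compute(): Vector[A] = data
  }

  // A derived dataset keeps its parent and the deterministic function applied
  // to it, plus an unreplicated in-memory copy of the result.
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    private var cached: Option[Vector[B]] = None

    def compute(): Vector[B] = cached.getOrElse {
      val result = parent.compute().map(f) // replay the lineage
      cached = Some(result)
      result
    }

    // Simulates losing the in-memory copy when a worker fails.
    def loseCache(): Unit = cached = None

    def isCached: Boolean = cached.isDefined
  }
}
```

After `loseCache()`, calling `compute()` again rebuilds the same contents from the replicated source, so the derived state itself never needs to be replicated.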

Programming Model

A Discretized Stream or DStream is a series of RDDs representing a stream of data
– API very similar to RDDs

DStreams can be created…
– Either from live streaming data
– Or by transforming other DStreams

DStream Data Sources

Many sources out of the box
– HDFS
– Kafka
– Flume
– Twitter
– TCP sockets
– Akka actor
– ZeroMQ

Easy to add your own

Contributed by external developers

Transformations

Build new streams from existing streams
– RDD-like operations
  • map, flatMap, filter, count, reduce,
  • groupByKey, reduceByKey, sortByKey, join
  • etc.
– New window and stateful operations
  • window, countByWindow, reduceByWindow
  • countByValueAndWindow, reduceByKeyAndWindow
  • updateStateByKey
  • etc.

Output Operations

Send data to outside world
– saveAsHadoopFiles
– print – prints on the driver’s screen
– foreach – arbitrary operation on every RDD

Example

Process a stream of Tweets to find the 20 most popular hashtags in the last 10 mins

1. Get the stream of Tweets and isolate the hashtags
2. Count the hashtags over 10 minute window
3. Sort the hashtags by their counts
4. Get the top 20 hashtags

1. Get the stream of Hashtags

val tweets = ssc.twitterStream(<username>, <password>)

val hashTags = tweets.flatMap(status => getTags(status))

[Figure: tweets is a DStream, shown as one RDD per interval (t-1, t, t+1, t+2, t+3, t+4); the flatMap transformation is applied to each RDD of tweets to produce the corresponding RDD of hashTags]

2. Count the hashtags over 10 min

val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1))
                        .map(tag => (tag, 1)).reduceByKey(_ + _)

[Figure: sliding window operation over the hashTags RDDs at times t-1, t, t+1, t+2, t+3, t+4, producing tagCounts]

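2. Count the hashtags over 10 min

val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))

[Figure: over the hashTags RDDs at times t-1, t, t+1, t+2, t+3, t+4, the new tagCounts is derived from the previous one by adding (+) the counts entering the window and subtracting (–) the counts leaving it]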

Smart window-based reduce

Technique with count generalizes to reduce
– Need a function to “subtract”
– Applies to invertible reduce functions

Could have implemented counting as:

hashTags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(1), …)
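The incremental technique can be checked against a naive recomputation in plain Scala. This is a sketch under simplifying assumptions, not Spark's implementation: the window is measured in whole batches, and `countBatch`, `combine`, `naive`, and `incremental` are hypothetical helper names.

```scala
object WindowReduceDemo {
  type Counts = Map[String, Int]

  // Per-batch count of each tag.
  def countBatch(tags: Seq[String]): Counts =
    tags.groupBy(identity).map { case (t, ts) => (t, ts.size) }

  // Merge two count maps with op, dropping keys whose result is zero.
  def combine(a: Counts, b: Counts)(op: (Int, Int) => Int): Counts =
    (a.keySet ++ b.keySet).iterator
      .map(k => k -> op(a.getOrElse(k, 0), b.getOrElse(k, 0)))
      .filter(_._2 != 0)
      .toMap

  // Naive: re-reduce every batch inside each window.
  def naive(batches: Seq[Seq[String]], window: Int): Seq[Counts] =
    batches.indices.map { i =>
      batches.slice(math.max(0, i - window + 1), i + 1)
        .map(countBatch)
        .foldLeft(Map.empty[String, Int])((acc, c) => combine(acc, c)(_ + _))
    }

  // Incremental: previous window + entering batch - leaving batch.
  def incremental(batches: Seq[Seq[String]], window: Int): Seq[Counts] = {
    val perBatch = batches.map(countBatch).toVector
    val scanned = perBatch.indices.scanLeft(Map.empty[String, Int]) { (prev, i) =>
      val added = combine(prev, perBatch(i))(_ + _)
      if (i >= window) combine(added, perBatch(i - window))(_ - _) else added
    }
    scanned.tail // drop the initial empty state
  }
}
```

Both functions return the same window results; the incremental version touches only the batch entering and the batch leaving the window, which is why the reduce function needs an inverse.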

3. Sort the hashtags by their counts

val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))

val sortedTags = tagCounts.map { case (tag, cnt) => (cnt, tag) }
                          .transform(_.sortByKey(false))

transform allows arbitrary RDD operations to create a new DStream

4. Get the top 20 hashtags

val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
val sortedTags = tagCounts.map { case (tag, cnt) => (cnt, tag) }
                          .transform(_.sortByKey(false))

sortedTags.foreach(showTopTags(20) _)

foreach is an output operation

10 popular hashtags in last 10 min

// Create the stream of tweets
val tweets = ssc.twitterStream(<username>, <password>)

// Count the tags over a 10 minute window
val tagCounts = tweets.flatMap(status => getTags(status))
                      .countByValueAndWindow(Minutes(10), Seconds(1))

// Sort the tags by counts
val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }
                          .transform(_.sortByKey(false))

// Show the top 10 tags
sortedTags.foreach(showTopTags(10) _)
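The slides do not define `getTags` or `showTopTags`. To make the per-window logic concrete, here is a minimal sketch in plain Scala collections that runs without a cluster. The `getTags` below is a hypothetical stand-in that treats every whitespace-separated token starting with `#` as a hashtag, and `topTags` is an assumed name for the count, sort, and take steps applied to the tweets of one window.

```scala
object TopTagsDemo {
  // Hypothetical stand-in for the slides' getTags.
  def getTags(status: String): Seq[String] =
    status.split("\\s+").toSeq.filter(t => t.startsWith("#") && t.length > 1)

  // Count tags across the tweets in one window, then sort by descending count.
  // Ties are broken alphabetically so the result is deterministic.
  def topTags(tweetsInWindow: Seq[String], n: Int): Seq[(Int, String)] =
    tweetsInWindow
      .flatMap(getTags)
      .groupBy(identity)
      .toSeq // convert before mapping so equal counts do not collide as Map keys
      .map { case (tag, occurrences) => (occurrences.size, tag) }
      .sortBy { case (count, tag) => (-count, tag) }
      .take(n)
}
```

The shape mirrors the streaming version: flatMap to tags, count by value, swap to (count, tag), sort descending, and take the top entries.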

Demo

Other Operations

Maintaining arbitrary state, tracking sessions

tweets.updateStateByKey(tweet => updateMood(tweet))

Selecting data directly from a DStream

tagCounts.slice(<from Time>, <to Time>).sortByKey()

[Figure: the user mood state is updated from the tweets RDDs at times t-1, t, t+1, t+2, t+3, t+4]
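The call on the slide is abbreviated. In Spark Streaming, the update function passed to `updateStateByKey` receives the new values for a key together with that key's previous state and returns the new state as an `Option`. The sketch below models this per-batch update with plain Scala maps; it is not Spark code, and the `updateMood` shown here (a running sum of integer sentiment scores) is a hypothetical example.

```scala
object StateDemo {
  // Apply one batch of (key, value) pairs to the previous state.
  // update receives the new values for a key and its previous state, if any;
  // returning None drops the key from the state.
  def updateStateByKey[K, V, S](
      state: Map[K, S],
      batch: Seq[(K, V)]
  )(update: (Seq[V], Option[S]) => Option[S]): Map[K, S] = {
    val newValues: Map[K, Seq[V]] =
      batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    (state.keySet ++ newValues.keySet).iterator.flatMap { k =>
      update(newValues.getOrElse(k, Seq.empty), state.get(k)).map(s => k -> s)
    }.toMap
  }

  // Hypothetical mood score: running sum of per-tweet sentiment values.
  def updateMood(scores: Seq[Int], previous: Option[Int]): Option[Int] =
    Some(previous.getOrElse(0) + scores.sum)
}
```

Keys with no new values in a batch are still passed to the update function with an empty sequence, which is what lets a session expire by returning `None`.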

Performance

Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency

[Figure: two panels, WordCount and Grep, each plotting Cluster Throughput (GB/s) against # Nodes in Cluster (0 to 100), with series labeled 1 sec and 2 sec]

Comparison with others

Higher throughput than Storm
– Spark Streaming: 670k records/second/node
– Storm: 115k records/second/node
– Apache S4: 7.5k records/second/node

[Figure: two panels, WordCount and Grep, each plotting Throughput per node (MB/s) against Record Size (bytes) for Spark and Storm]

Fast Fault Recovery

Recovers from faults/stragglers within 1 sec

Real Applications: Conviva

Real-time monitoring of video metadata
• Implemented Shadoop – a wrapper for Hadoop jobs to run over Spark / Spark Streaming
• Ported parts of Conviva’s Hadoop stack to run on Spark Streaming

[Figure: Shadoop wraps a Hadoop Job to run on Spark Streaming]

val shJob = new SparkHadoopJob[…]( <Hadoop job> )
shJob.run( <Spark context> )

Real Applications: Conviva

Real-time monitoring of video metadata
• Achieved 1-2 second latency
• Millions of video sessions processed scales linearly with cluster size

[Figure: Active sessions (millions) against # Nodes in Cluster (0 to 70)]

Real Applications: Mobile Millennium Project

Traffic estimation using online machine learning
• Markov chain Monte Carlo simulations on GPS observations
• Very CPU intensive, requires 10s of machines for useful computation
• Scales linearly with cluster size

[Figure: GPS observations per second against # Nodes in Cluster (0 to 80)]

Failure Semantics

Input data replicated by the system

Lineage of deterministic ops used to recompute RDD from input data if a worker node fails

Transformations – exactly once

Output operations – at least once
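Since an output operation may run more than once for the same batch, one common way to cope is to make the write idempotent, for example by keying it on the batch time. The sketch below contrasts that with an appending sink. It is an illustration only: both sink classes are hypothetical in-memory stand-ins, not part of Spark Streaming.

```scala
import scala.collection.mutable

object OutputDemo {
  // Hypothetical store keyed by batch time: a repeated write overwrites itself.
  final class IdempotentSink {
    private val byBatch = mutable.Map.empty[Long, Map[String, Int]]

    def write(batchTimeMs: Long, counts: Map[String, Int]): Unit =
      byBatch.update(batchTimeMs, counts)

    def total(tag: String): Int = byBatch.values.map(_.getOrElse(tag, 0)).sum
  }

  // Hypothetical appending store: a repeated write is counted twice.
  final class AppendingSink {
    private val rows = mutable.ArrayBuffer.empty[Map[String, Int]]

    def write(batchTimeMs: Long, counts: Map[String, Int]): Unit = {
      rows += counts
    }

    def total(tag: String): Int = rows.map(_.getOrElse(tag, 0)).sum
  }
}
```

Writing the same batch twice leaves the idempotent sink's total unchanged, while the appending sink double counts it.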

Java API for Streaming

Developed by Patrick Wendell

Similar to Spark Java API

Don’t need to know Scala to try streaming!

Contributors

5 contributors from UCB, 3 external contributors
– Matei Zaharia, Haoyuan Li
– Patrick Wendell
– Denny Britz
– Sean McNamara*
– Prashant Sharma*
– Nick Pentreath*
– Tathagata Das

Vision - one stack to rule them all

[Figure: Ad-hoc Queries, Batch Processing, and Stream Processing all served by Spark + Spark Streaming]

Conclusion

Alpha to be released with Spark 0.7 by weekend

Look at the new Streaming Programming Guide

More about Spark Streaming system in our paper

http://tinyurl.com/dstreams

Join us at Strata on Feb 26 in Santa Clara
