Spark Streaming

Transcript
Page 1: Spark Streaming

Spark Streaming

Large-scale near-real-time stream processing

UC BERKELEY

Tathagata Das (TD)

Page 2: Spark Streaming

Motivation

Many important applications must process large data streams at second-scale latencies
– Check-ins, status updates, site statistics, spam filtering, …

Require large clusters to handle workloads

Require latencies of a few seconds

Page 3: Spark Streaming

Case study: Conviva, Inc.

Real-time monitoring of online video metadata

Custom-built distributed streaming system
– 1000s of complex metrics on millions of video sessions
– Requires many dozens of nodes for processing

Hadoop backend for offline analysis
– Generating daily and monthly reports
– Similar computation as the streaming system

Painful to maintain two stacks

Page 4: Spark Streaming

Goals

Framework for large-scale stream processing:
– Scalable to large clusters (~100 nodes) with near-real-time latency (~1 second)
– Efficient recovery from faults and stragglers
– Simple programming model that integrates well with batch & interactive queries

Existing systems do not achieve all of these goals

Page 5: Spark Streaming

Existing Streaming Systems

Record-at-a-time processing model
– Each node has mutable state
– For each record, update state & send new records

[Diagram: input records flow into nodes 1 and 2, each holding mutable state; updated records are pushed on to node 3]

Page 6: Spark Streaming

Existing Streaming Systems

Storm
– Replays records if not processed due to failure
– Processes each record at least once
– May update mutable state twice!
– Mutable state can be lost due to failure!

Trident
– Uses transactions to update state
– Processes each record exactly once
– Per-state transaction updates are slow

No integration with batch processing & cannot handle stragglers

Page 7: Spark Streaming


Spark Streaming

Page 8: Spark Streaming

Discretized Stream Processing

Run a streaming computation as a series of very small, deterministic batch jobs

Batch processing models, like MapReduce, recover from faults and stragglers efficiently
– Divide the job into deterministic tasks
– Rerun failed/slow tasks in parallel on other nodes

The same recovery techniques apply at lower time scales

Page 9: Spark Streaming

Spark Streaming

State between batches is kept in memory as an immutable, fault-tolerant dataset
– Specifically, as Spark’s Resilient Distributed Dataset (RDD)

Batch sizes can be reduced to as low as 1/2 second to achieve ~1 second latency

Potentially combine streaming and batch workloads to build a single unified stack

Page 10: Spark Streaming

Discretized Stream Processing

[Diagram: for time = 0–1, time = 1–2, …, each interval’s input becomes an immutable distributed dataset, replicated in memory; batch operations turn it into state/output, also stored in memory as an RDD. The input stream is thus processed as a state stream.]

Page 11: Spark Streaming

Fault Recovery

State stored as a Resilient Distributed Dataset (RDD)
– A deterministically re-computable parallel collection
– Remembers the lineage of operations used to create it

Fault / straggler recovery is done in parallel on other nodes

[Diagram: an operation maps the input dataset (replicated and fault-tolerant) to a state RDD (not replicated)]

Fast recovery from faults without full data replication

Page 12: Spark Streaming

Programming Model

A Discretized Stream, or DStream, is a series of RDDs representing a stream of data
– API very similar to RDDs

DStreams can be created…
– Either from live streaming data
– Or by transforming other DStreams
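
As a rough illustration of this model, a minimal sketch (using the later, stable org.apache.spark.streaming package names, which postdate this talk; the host and port are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamSketch")
val ssc = new StreamingContext(conf, Seconds(1))    // discretize input into 1-second batches

// A DStream created from live streaming data: lines of text arriving on a TCP socket
val lines = ssc.socketTextStream("localhost", 9999)

// A DStream created by transforming another DStream
val words = lines.flatMap(_.split(" "))

words.print()            // an output operation, so the lazily-built job actually runs
ssc.start()
ssc.awaitTermination()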

Page 13: Spark Streaming

DStream Data Sources

Many sources out of the box:
– HDFS
– Kafka
– Flume
– Twitter
– TCP sockets
– Akka actors
– ZeroMQ

Several of these were contributed by external developers, and it is easy to add your own
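
For illustration, two of the built-in sources, continuing the sketch above (the directory and host are placeholders; a source like Kafka additionally needs its connector dependency):

// Files appearing in an HDFS directory become one RDD per batch
val fileLines = ssc.textFileStream("hdfs://namenode:8020/incoming")

// Lines of text received over a raw TCP socket
val socketLines = ssc.socketTextStream("collector.example.com", 9999)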

Page 14: Spark Streaming

Transformations

Build new streams from existing streams

– RDD-like operations
  • map, flatMap, filter, count, reduce
  • groupByKey, reduceByKey, sortByKey, join
  • etc.

– New window and stateful operations
  • window, countByWindow, reduceByWindow
  • countByValueAndWindow, reduceByKeyAndWindow
  • updateStateByKey
  • etc.
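
A hedged sketch of both flavors, continuing from the words DStream defined above (the window and slide durations here are arbitrary):

// RDD-like, per-batch operations
val pairs = words.map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)    // word counts within each single batch

// Window operations: look back over the last 30 seconds, sliding every 10 seconds
val windowedWords = words.window(Seconds(30), Seconds(10))
val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))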

Page 15: Spark Streaming

Output Operations

Send data to the outside world
– saveAsHadoopFiles
– print – prints on the driver’s screen
– foreach – arbitrary operation on every RDD
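
A hedged sketch of these, continuing the windowedCounts example above (note: foreach on DStreams was later renamed foreachRDD, and saveAsTextFiles is the plain-text sibling of saveAsHadoopFiles; the path is a placeholder):

windowedCounts.print()    // prints the first elements of each batch on the driver

windowedCounts.saveAsTextFiles("hdfs:///out/counts")    // one directory of output files per batch

// Arbitrary per-RDD logic, e.g. pushing each batch to an external sink
windowedCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    partition.foreach(record => println(record))    // stand-in for a real sink connection
  }
}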

Page 16: Spark Streaming

Example

Process a stream of Tweets to find the 20 most popular hashtags in the last 10 mins

1. Get the stream of Tweets and isolate the hashtags
2. Count the hashtags over a 10 minute window
3. Sort the hashtags by their counts
4. Get the top 20 hashtags

Page 17: Spark Streaming

1. Get the stream of hashtags

val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))

[Diagram: the tweets DStream is a series of RDDs for intervals t-1, t, t+1, t+2, t+3, t+4; the flatMap transformation converts each RDD into the corresponding RDD of the hashTags DStream]

Page 18: Spark Streaming

2. Count the hashtags over 10 min

val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1))
                        .map(tag => (tag, 1))
                        .reduceByKey(_ + _)

[Diagram: a sliding window operation over the hashTags RDDs (t-1 … t+4) produces the tagCounts DStream]

Page 19: Spark Streaming

2. Count the hashtags over 10 min

val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))

[Diagram: as the window slides over hashTags, counts from the batch entering the window are added (+) and counts from the batch leaving it are subtracted (–) to produce tagCounts incrementally]

Page 20: Spark Streaming

Smart window-based reduce

The technique used for count generalizes to reduce
– Need a function to “subtract”
– Applies to invertible reduce functions

Could have implemented the counting as:

hashTags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), …)
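
Spelled out with explicit types and the key-value pairing the shorthand glosses over (a sketch; in later Spark versions this incremental form also requires a checkpoint directory, e.g. ssc.checkpoint(...)):

val tagCounts = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b,    // add counts from the batch entering the window
    (a: Int, b: Int) => a - b,    // subtract counts from the batch leaving the window
    Minutes(10),                  // window length
    Seconds(1))                   // slide interval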

Page 21: Spark Streaming

3. Sort the hashtags by their counts

val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
val sortedTags = tagCounts.map { case (tag, cnt) => (cnt, tag) }
                          .transform(_.sortByKey(false))

transform allows arbitrary RDD operations to be used to create a new DStream

Page 22: Spark Streaming

4. Get the top 20 hashtags

val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
val sortedTags = tagCounts.map { case (tag, cnt) => (cnt, tag) }
                          .transform(_.sortByKey(false))

sortedTags.foreach(showTopTags(20) _)

foreach is the output operation here

Page 23: Spark Streaming

10 popular hashtags in the last 10 min

// Create the stream of tweets
val tweets = ssc.twitterStream(<username>, <password>)

// Count the tags over a 10 minute window
val tagCounts = tweets.flatMap(status => getTags(status))
                      .countByValueAndWindow(Minutes(10), Seconds(1))

// Sort the tags by counts
val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }
                          .transform(_.sortByKey(false))

// Show the top 10 tags
sortedTags.foreach(showTopTags(10) _)

Page 24: Spark Streaming


Demo

Page 25: Spark Streaming

Other Operations

Maintaining arbitrary state, e.g. tracking user sessions:

tweets.updateStateByKey(tweet => updateMood(tweet))

[Diagram: the tweets DStream (t-1 … t+4) feeds a continuously-updated per-user mood state stream]

Selecting data directly from a DStream:

tagCounts.slice(<from Time>, <to Time>).sortByKey()
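
A hedged expansion of the slide’s pseudocode: updateStateByKey operates on (key, value) pairs and folds each batch’s new values into per-key state. The user field and moodOf helper are hypothetical stand-ins, and stateful operations also need a checkpoint directory:

val moodPerUser = tweets
  .map(t => (t.user, moodOf(t)))    // key each tweet by its author (hypothetical accessors)
  .updateStateByKey[Double] { (newMoods: Seq[Double], oldMood: Option[Double]) =>
    Some(oldMood.getOrElse(0.0) + newMoods.sum)    // returning None would drop the key's state
  }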

Page 26: Spark Streaming

Performance

Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency

[Charts: cluster throughput (GB/s) vs. # nodes in cluster (up to 100) for WordCount and Grep, each with 1 sec and 2 sec batch intervals]

Page 27: Spark Streaming

Comparison with other systems

Higher throughput than Storm
– Spark Streaming: 670k records/second/node
– Storm: 115k records/second/node
– Apache S4: 7.5k records/second/node

[Charts: per-node throughput (MB/s) vs. record size (100–10000 bytes) for WordCount and Grep, Spark vs. Storm]

Page 28: Spark Streaming

Fast Fault Recovery

Recovers from faults/stragglers within 1 sec

Page 29: Spark Streaming

Real Applications: Conviva

Real-time monitoring of video metadata
• Implemented Shadoop, a wrapper that lets Hadoop jobs run over Spark / Spark Streaming
• Ported parts of Conviva’s Hadoop stack to run on Spark Streaming

[Diagram: a Hadoop job, wrapped by Shadoop, runs on Spark Streaming]

val shJob = new SparkHadoopJob[…](<Hadoop job>)
shJob.run(<Spark context>)

Page 30: Spark Streaming

Real Applications: Conviva

Real-time monitoring of video metadata
• Achieved 1–2 second latency
• Millions of video sessions processed; scales linearly with cluster size

[Chart: active sessions vs. # nodes in cluster (up to ~70 nodes, up to ~4 million sessions)]

Page 31: Spark Streaming

Real Applications: Mobile Millennium Project

Traffic estimation using online machine learning
• Markov chain Monte Carlo simulations on GPS observations
• Very CPU intensive; requires 10s of machines for useful computation
• Scales linearly with cluster size

[Chart: GPS observations processed per second vs. # nodes in cluster (up to ~80 nodes, up to ~2000 observations/sec)]

Page 32: Spark Streaming

Failure Semantics

Input data is replicated by the system

Lineage of deterministic operations is used to recompute RDDs from the input data if worker nodes fail

Transformations – exactly once

Output operations – at least once
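
One consequence worth making explicit: since an output operation may re-run after a failure, sinks should be written so that re-execution is harmless. A sketch (the path is a placeholder):

tagCounts.foreachRDD { (rdd, time) =>
  // The output path is keyed by batch time, so a retried batch targets the same
  // location; an already-existing directory fails fast instead of silently
  // writing duplicate records
  rdd.saveAsTextFile(s"hdfs:///counts/batch-${time.milliseconds}")
}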

Page 33: Spark Streaming

Java API for Streaming

Developed by Patrick Wendell

Similar to the Spark Java API

You don’t need to know Scala to try streaming!

Page 34: Spark Streaming

Contributors

5 contributors from UC Berkeley, 3 external contributors (marked *)
– Matei Zaharia, Haoyuan Li
– Patrick Wendell
– Denny Britz
– Sean McNamara*
– Prashant Sharma*
– Nick Pentreath*
– Tathagata Das

Page 35: Spark Streaming

Vision - one stack to rule them all

Spark + Spark Streaming:
– Ad-hoc queries
– Batch processing
– Stream processing

Page 36: Spark Streaming


Page 37: Spark Streaming

Conclusion

Alpha to be released with Spark 0.7 by the weekend

Look at the new Streaming Programming Guide

More about the Spark Streaming system in our paper: http://tinyurl.com/dstreams

Join us at Strata on Feb 26 in Santa Clara

