Spark Streaming: Large-scale near-real-time stream processing
Tathagata Das (TD), UC Berkeley

Post on 23-Feb-2016






Motivation
Many important applications must process large data streams at second-scale latencies
  Check-ins, status updates, site statistics, spam filtering, ...
Require large clusters to handle workloads
Require latencies of a few seconds

Case study: Conviva, Inc.
Real-time monitoring of online video metadata
Custom-built distributed streaming system
  1000s of complex metrics on millions of video sessions
  Requires many dozens of nodes for processing
Hadoop backend for offline analysis
  Generating daily and monthly reports
  Similar computation as the streaming system
Painful to maintain two stacks

Goals
Framework for large-scale stream processing
Scalable to large clusters (~100 nodes) with near-real-time latency (~1 second)
Efficiently recovers from faults and stragglers
Simple programming model that integrates well with batch & interactive queries

Existing systems do not achieve all of them

Existing Streaming Systems
Record-at-a-time processing model
  Each node has mutable state
  For each record, update state & send new records
[Diagram labels: input records, node 1, node 2, node 3, mutable state, push]

Traditional streaming systems have what we call a record-at-a-time processing model. Each node in the cluster processing a stream has mutable state. As records arrive one at a time, the mutable state is updated, and a newly generated record is pushed to downstream nodes. Making this mutable state fault-tolerant is hard.

Existing Streaming Systems
Storm
  Replays records if not processed due to failure
  Processes each record at least once
  May update mutable state twice!
  Mutable state can be lost due to failure!

Trident
  Uses transactions to update state
  Processes each record exactly once
  Per-state transaction updates are slow
No integration with batch processing & cannot handle stragglers

Spark Streaming

Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
Batch processing models, like MapReduce, recover from faults and stragglers efficiently
  Divide job into deterministic tasks
  Rerun failed/slow tasks in parallel on other nodes
Same recovery techniques at lower time scales
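The idea of running a stream as a series of small deterministic batch jobs can be illustrated without a cluster. The following is a minimal sketch in plain Scala collections, not Spark code; `Record`, `discretize`, `countWords`, and `run` are hypothetical names chosen for illustration.

```scala
// Sketch only: plain Scala collections stand in for a cluster; no Spark API is used.
object MicroBatchSketch {
  // A record tagged with its arrival time in milliseconds (hypothetical type).
  final case class Record(timeMs: Long, value: String)

  // Group records into consecutive batches of `batchMs` milliseconds,
  // ordered by batch start time. Intervals with no records are omitted.
  def discretize(records: Seq[Record], batchMs: Long): Seq[(Long, Seq[Record])] =
    records
      .groupBy(r => (r.timeMs / batchMs) * batchMs)
      .toSeq
      .sortBy(_._1)

  // A deterministic batch job: the same input batch always gives the same
  // output, so a failed or slow run can simply be repeated elsewhere.
  def countWords(batch: Seq[Record]): Map[String, Int] =
    batch.groupBy(_.value).map { case (word, rs) => (word, rs.size) }

  // Run the batch job once per interval.
  def run(records: Seq[Record], batchMs: Long): Seq[(Long, Map[String, Int])] =
    discretize(records, batchMs).map { case (start, batch) => (start, countWords(batch)) }
}
```

Because `countWords` depends only on its input batch, rerunning it for any interval reproduces the same result, which is the property the recovery technique relies on.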

Spark Streaming
State between batches kept in memory as an immutable, fault-tolerant dataset
  Specifically as Spark's Resilient Distributed Dataset
Batch sizes can be reduced to as low as 1/2 second to achieve ~1 second latency
Potentially combine streaming and batch workloads to build a single unified stack

Discretized Stream Processing
[Diagram labels: time = 0 - 1, time = 1 - 2; input stream; input: immutable distributed dataset (replicated in memory); batch operations; state stream; state / output: immutable distributed dataset, stored in memory as RDD]

Fault Recovery
State stored as Resilient Distributed Dataset (RDD)
  Deterministically re-computable parallel collection
  Remembers lineage of operations used to create it
Fault / straggler recovery is done in parallel on other nodes
[Diagram labels: input dataset (replicated and fault-tolerant), operation, state RDD (not replicated)]
Fast recovery from faults without full data replication

Programming Model
A Discretized Stream or DStream is a series of RDDs representing a stream of data
  API very similar to RDDs
DStreams can be created
  Either from live streaming data
  Or by transforming other DStreams
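To make "a series of RDDs" concrete, here is a toy model in plain Scala. It is an illustration under stated assumptions, not Spark's DStream class: `ToyDStream` is a hypothetical name, and each inner `Seq` stands in for one RDD.

```scala
// Toy model, not Spark's DStream class: each inner Seq stands in for one RDD.
final case class ToyDStream[T](batches: Seq[Seq[T]]) {
  // Transformations apply the same collection operation to every batch,
  // producing a new stream and leaving the original untouched.
  def map[U](f: T => U): ToyDStream[U] = ToyDStream(batches.map(_.map(f)))
  def flatMap[U](f: T => Seq[U]): ToyDStream[U] = ToyDStream(batches.map(_.flatMap(f)))
  def filter(p: T => Boolean): ToyDStream[T] = ToyDStream(batches.map(_.filter(p)))

  // Count per batch, producing a stream of single-element batches.
  def count: ToyDStream[Long] = ToyDStream(batches.map(b => Seq(b.size.toLong)))
}
```

For example, `ToyDStream(Seq(Seq("a b", "c"), Seq("d"))).flatMap(_.split(" ").toSeq)` yields a new stream whose first batch holds three words and whose second holds one.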

DStream Data Sources
Many sources out of the box
  HDFS
  Kafka
  Flume
  Twitter
  TCP sockets
  Akka actor
  ZeroMQ
Easy to add your own
Contributed by external developers

Transformations
Build new streams from existing streams
RDD-like operations
  map, flatMap, filter, count, reduce,
  groupByKey, reduceByKey, sortByKey, join
  etc.
New window and stateful operations
  window, countByWindow, reduceByWindow
  countByValueAndWindow, reduceByKeyAndWindow
  updateStateByKey
  etc.

Output Operations
Send data to outside world
  saveAsHadoopFiles
  print - prints on the driver's screen
  foreach - arbitrary operation on every RDD

Example
Process a stream of Tweets to find the 20 most popular hashtags in the last 10 mins

Get the stream of Tweets and isolate the hashtags
Count the hashtags over 10 minute window
Sort the hashtags by their counts
Get the top 20 hashtags

1. Get the stream of Hashtags
val tweets = ssc.twitterStream(, )
val hashTags = tweets.flatMap(status => getTags(status))
[Diagram labels: transformation; DStream = RDD; batches t-1, t, t+1, t+2, t+3, t+4; flatMap applied to each batch; tweets, hashTags, tagCounts]

2. Count the hashtags over 10 min
val tweets = ssc.twitterStream(, )
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1))
                        .map(tag => (tag, 1)).reduceByKey(_ + _)

[Diagram labels: sliding window operation; hashTags batches t-1, t, t+1, t+2, t+3, t+4]

2. Count the hashtags over 10 min
val tweets = ssc.twitterStream(, )
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
[Diagram labels: hashTags batches t-1, t, t+1, t+2, t+3, t+4; +, +; tagCounts]

Smart window-based reduce
Technique with count generalizes to reduce
  Need a function to subtract
  Applies to invertible reduce functions
Could have implemented counting as:
hashTags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(1), )
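The incremental technique can be sketched outside Spark: each new window result is the previous result plus the batch entering the window, minus the batch leaving it. The code below is an illustration in plain Scala, assuming per-batch counts are already available as maps; `WindowReduceSketch` and its members are hypothetical names, and windows are measured in batches rather than time.

```scala
object WindowReduceSketch {
  type Counts = Map[String, Long]
  val empty: Counts = Map.empty

  // Combine two count maps with `op`, dropping keys whose result is zero.
  private def combine(a: Counts, b: Counts, op: (Long, Long) => Long): Counts =
    (a.keySet ++ b.keySet).iterator
      .map(k => k -> op(a.getOrElse(k, 0L), b.getOrElse(k, 0L)))
      .filter(_._2 != 0L)
      .toMap

  def add(a: Counts, b: Counts): Counts = combine(a, b, _ + _)
  def subtract(a: Counts, b: Counts): Counts = combine(a, b, _ - _)

  // Naive: re-reduce every batch in each window of `size` batches.
  def naive(batches: Seq[Counts], size: Int): Seq[Counts] =
    batches.indices.map { i =>
      batches.slice(math.max(0, i - size + 1), i + 1).foldLeft(empty)(add)
    }

  // Incremental: new window = previous window + entering batch - leaving batch.
  def incremental(batches: Seq[Counts], size: Int): Seq[Counts] =
    batches.indices.scanLeft(empty) { (prev, i) =>
      val withNew = add(prev, batches(i))
      if (i >= size) subtract(withNew, batches(i - size)) else withNew
    }.tail
}
```

Both functions produce the same windows, but the incremental version touches only two batches per step regardless of window length. This only works because addition has an inverse; a reduce such as max would not qualify.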

3. Sort the hashtags by their counts
val tweets = ssc.twitterStream(, )
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
val sortedTags = tagCounts.map { case (tag, cnt) => (cnt, tag) }
                          .transform(_.sortByKey(false))

transform allows arbitrary RDD operations to create a new DStream

4. Get the top 20 hashtags
val tweets = ssc.twitterStream(, )
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
val sortedTags = tagCounts.map { case (tag, cnt) => (cnt, tag) }
                          .transform(_.sortByKey(false))
sortedTags.foreach(showTopTags(20) _)

foreach is an output operation

10 popular hashtags in last 10 min
// Create the stream of tweets
val tweets = ssc.twitterStream(, )
// Count the tags over a 10 minute window
val tagCounts = tweets.flatMap(status => getTags(status))
                      .countByValueAndWindow(Minutes(10), Seconds(1))
// Sort the tags by counts
val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }
                          .transform(_.sortByKey(false))
// Show the top 10 tags
sortedTags.foreach(showTopTags(10) _)
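The whole pipeline can also be simulated on in-memory batches, which may help when reasoning about what each step produces. This is a sketch in plain Scala, not the Spark Streaming API: `TopTagsSketch` is a hypothetical name, its `getTags` is an assumed stand-in for the helper used on the slides, and the window is counted in batches rather than minutes.

```scala
object TopTagsSketch {
  // Hypothetical stand-in for the slides' getTags: words that start with '#'.
  def getTags(status: String): Seq[String] =
    status.split("\\s+").toSeq.filter(w => w.startsWith("#") && w.length > 1)

  // For each batch index, the k most frequent tags among the
  // last `windowBatches` batches of tweet texts.
  def topTags(batches: Seq[Seq[String]], windowBatches: Int, k: Int): Seq[Seq[(String, Int)]] =
    batches.indices.map { i =>
      val window = batches.slice(math.max(0, i - windowBatches + 1), i + 1).flatten
      window.flatMap(getTags)
        .groupBy(identity)
        .map { case (tag, occurrences) => (tag, occurrences.size) }
        .toSeq
        .sortBy { case (tag, count) => (-count, tag) } // ties broken alphabetically
        .take(k)
    }
}
```

The simulation recounts every window from scratch for clarity; the invertible-reduce approach avoids that cost.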

Demo

Other Operations
Maintaining arbitrary state, tracking sessions
tweets.updateStateByKey(tweet => updateMood(tweet))
Selecting data directly from a DStream
tagCounts.slice(, ).sortByKey()
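Stateful processing of this kind can be pictured as folding each batch into a per-key state map. The sketch below uses plain Scala collections and is not the Spark implementation; `StateSketch` is a hypothetical name. The update function here takes the new values for a key and its previous state, a shape assumed for illustration rather than taken from the slide.

```scala
object StateSketch {
  // Fold each batch of (key, value) pairs into the running per-key state.
  // `update` receives the new values for a key and its previous state, if any;
  // returning None removes the key from the state.
  def updateStateByKey[K, V, S](batches: Seq[Seq[(K, V)]])(
      update: (Seq[V], Option[S]) => Option[S]): Seq[Map[K, S]] =
    batches.scanLeft(Map.empty[K, S]) { (state, batch) =>
      val newValues: Map[K, Seq[V]] =
        batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
      val keys = (state.keySet ++ newValues.keySet).toSeq
      keys.flatMap { k =>
        update(newValues.getOrElse(k, Seq.empty), state.get(k)).map(s => k -> s).toList
      }.toMap
    }.tail // drop the initial empty state
}
```

Each element of the result is the complete state after one batch, so keys that receive no new values in a batch carry their previous state forward unless `update` drops them.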

[Diagram labels: tweets batches t-1, t, t+1, t+2, t+3, t+4; user mood]

Performance
Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency

Comparison with others
Higher throughput than Storm
  Spark Streaming: 670k records/second/node
  Storm: 115k records/second/node
  Apache S4: 7.5k records/second/node
Streaming Spark offers similar speed while providing fault-tolerance (FT) and consistency guarantees that these systems lack

Fast Fault Recovery
Recovers from faults/stragglers within 1 sec

Real Applications: Conviva
Real-time monitoring of video metadata
Implemented Shadoop, a wrapper for Hadoop jobs to run over Spark / Spark Streaming
Ported parts of Conviva's Hadoop stack to run on Spark Streaming
[Diagram labels: HadoopJob, Shadoop, Spark Streaming]
val shJob = new SparkHadoopJob[]( )
val )

Real Applications: Conviva
Real-time monitoring of video metadata
Achieved 1-2 second latency
Millions of video sessions processed; scales linearly with cluster size

Real Applications: Mobile Millennium Project
Traffic estimation using online machine learning
  Markov chain Monte Carlo simulations on GPS observations
  Very CPU intensive, requires 10s of machines for useful computation
  Scales linearly with cluster size

Failure Semantics
Input data replicated by the system
Lineage of deterministic ops used to recompute RDD from input data if a worker node fails
Transformations: exactly once
Output operations: at least once
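Lineage-based recovery can be illustrated with a small model: a derived dataset keeps a reference to its parent and the deterministic function that produced it, so a lost in-memory copy can be rebuilt instead of restored from a replica. This is a sketch in plain Scala, not Spark's RDD implementation; `LineageSketch`, `Input`, `Mapped`, and `loseCache` are hypothetical names.

```scala
object LineageSketch {
  // A dataset that remembers how it was derived instead of being replicated.
  sealed trait Dataset[T] {
    def compute(): Seq[T]
  }

  // Input data: assumed to be replicated, so it is always available.
  final case class Input[T](data: Seq[T]) extends Dataset[T] {
    def compute(): Seq[T] = data
  }

  // Derived data: a parent plus a deterministic function, cached in memory.
  final class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    private var cache: Option[Seq[B]] = None
    var recomputations = 0

    // Simulate a worker failure that discards the in-memory copy.
    def loseCache(): Unit = { cache = None }

    // Return the cached result, or rebuild it from the parent via lineage.
    def compute(): Seq[B] = cache.getOrElse {
      recomputations += 1
      val result = parent.compute().map(f)
      cache = Some(result)
      result
    }
  }
}
```

Because `f` is deterministic, the rebuilt data equals the lost data, so a transformation's result looks as if it had been computed exactly once. An output operation that already ran before the failure may run again, which is why outputs are only at least once.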

Java API for Streaming
Developed by Patrick Wendell
Similar to Spark Java API
Don't need to know Scala to try streaming!

Contributors
5 contributors from UCB, 3 external contributors
  Matei Zaharia, Haoyuan Li
  Patrick Wendell
  Denny Britz
  Sean McNamara*
  Prashant Sharma*
  Nick Pentreath*
  Tathagata Das

Vision - one stack to rule them all
Spark + Spark Streaming

Conclusion
Alpha to be released with Spark 0.7 by the weekend

Look at the new Streaming Programming Guide

More about Spark Streaming system in our paper

Join us at Strata on Feb 26 in Santa Clara
