Spark Streaming Preview
Fault-Tolerant Stream Processing at Scale

Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica
UC BERKELEY

Post on 24-Feb-2016







Motivation

Many important applications need to process large data streams arriving in real time:
- User activity statistics (e.g. Facebook's Puma)
- Spam detection
- Traffic estimation
- Network intrusion detection

Our target: large-scale apps that need to run on tens to hundreds of nodes with O(1 sec) latency

System Goals

- Simple programming interface
- Automatic fault recovery (including state)
- Automatic straggler recovery
- Integration with batch & ad-hoc queries (want one API for all your data analysis)

Traditional Streaming Systems

- Record-at-a-time processing model
- Each node has mutable state
- Event-driven API: for each record, update state and send out new records

[Figure: input records are pushed through node 1, node 2 and node 3, each holding mutable state]

Examples: S4, Storm, Amazon SQS

Challenges with Traditional Systems

- Fault tolerance: either replicate the whole system (costly) or use upstream backup (slow to recover)
- Stragglers (typically not handled)
- Consistency (few guarantees across nodes)
- Hard to unify with batch processing

Our Model: Discretized Streams

- Run each streaming computation as a series of very small, deterministic batch jobs (e.g. a MapReduce every second to count tweets)

- Keep state in memory across jobs; new Spark operators allow stateful processing
- Recover from faults/stragglers in the same way as MapReduce (by rerunning tasks in parallel)
- This is already how some users were using Spark!

Discretized Streams in Action

[Figure: at t = 1, t = 2, ..., the input of stream 1 and stream 2 is an immutable dataset (stored reliably); a batch operation produces an immutable dataset (output or state), stored in memory as a Spark RDD]

Example: View Count

Keep a running count of views to each webpage

    views = readStream("http:...", "1s")
    ones = views.map(ev => (ev.url, 1))
    counts = ones.runningReduce(_ + _)

[Figure: at t = 1, t = 2, ..., views -> (map) -> ones -> (reduce) -> counts; each box is a dataset, subdivided into partitions]

Fault Recovery

- Checkpoint state datasets periodically
- If a node fails/straggles, build its data in parallel on other nodes using dependency graph

- Fast recovery without the cost of full replication

[Figure: a map from the input dataset to the output dataset]

How Fast Can It Go?

- Currently handles 4 GB/s of data (42 million records/s) on 100 nodes at sub-second latency
- Recovers from failures/stragglers within 1 sec
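The view-count example and the checkpoint-based recovery described above can be illustrated without a cluster. The following is a minimal single-machine sketch in plain Scala, not Spark Streaming code: the names `ViewCountSketch`, `step`, `runningCounts` and `recover` are hypothetical, the batches are hard-coded, and recovery here runs sequentially, whereas the slides describe rebuilding lost data in parallel on other nodes.

```scala
object ViewCountSketch {
  // One interval's worth of input: the URLs viewed during that interval.
  type Batch = Seq[String]
  // Running state: number of views seen so far for each URL.
  type Counts = Map[String, Int]

  // Deterministic batch job for one interval: map each view to (url, 1)
  // and merge the pairs into the previous state, yielding a new immutable state.
  def step(prev: Counts, batch: Batch): Counts = {
    val ones = batch.map(url => (url, 1))
    ones.foldLeft(prev) { case (acc, (url, n)) =>
      acc.updated(url, acc.getOrElse(url, 0) + n)
    }
  }

  // State dataset after every interval (t = 1, t = 2, ...).
  def runningCounts(batches: Seq[Batch]): Seq[Counts] =
    batches.scanLeft(Map.empty[String, Int])(step).tail

  // Recovery: rebuild a lost state from the last checkpoint by re-running
  // the same deterministic jobs on the input received since the checkpoint.
  def recover(checkpoint: Counts, batchesSince: Seq[Batch]): Counts =
    batchesSince.foldLeft(checkpoint)(step)

  def main(args: Array[String]): Unit = {
    val input = Seq(Seq("/a", "/b", "/a"), Seq("/b"), Seq("/a", "/c"))
    val states = runningCounts(input)
    // Pretend the state after t = 1 was checkpointed and later states were lost.
    val rebuilt = recover(states.head, input.drop(1))
    println(states.last)
    println(rebuilt == states.last)
  }
}
```

Because each step is deterministic and its inputs are immutable, re-running the steps from a checkpoint reproduces exactly the state that was lost, which is the property the recovery scheme relies on.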


Outline

- Introduction
- Programming interface
- Early results
- Future development

D-Streams

- A discretized stream is a sequence of immutable, partitioned datasets
- Specifically, each dataset is an RDD (resilient distributed dataset), the storage abstraction in Spark
- Each RDD remembers how it was created, and can recover if any part of the data is lost
- D-Streams can be created either from live streaming data or by transforming other D-Streams
- Programming with D-Streams is very similar to programming with RDDs in Spark
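To make the idea of "a sequence of immutable datasets that remember how they were created" concrete, here is a toy model in plain Scala. It is only a sketch under simplifying assumptions: `Dataset`, `DStream` and `fromBatches` are hypothetical names rather than Spark classes, partitioning is left out, and a dataset stores nothing but the function that computes it.

```scala
object DStreamSketch {
  // Toy dataset with lineage: it holds no data, only the recipe to compute it,
  // so its contents can be rebuilt at any time by calling compute() again.
  final case class Dataset[A](compute: () => Seq[A]) {
    def map[B](f: A => B): Dataset[B] = Dataset(() => compute().map(f))
  }

  // Toy D-Stream: one immutable dataset per time interval.
  final case class DStream[A](intervals: Seq[Dataset[A]]) {
    // A transformation acts on each interval in isolation and yields a new stream.
    def map[B](f: A => B): DStream[B] = DStream(intervals.map(_.map(f)))
    // Materialize every interval (for inspection only).
    def collect(): Seq[Seq[A]] = intervals.map(_.compute())
  }

  // Build a stream from already-received input batches.
  def fromBatches[A](batches: Seq[Seq[A]]): DStream[A] =
    DStream(batches.map(b => Dataset(() => b)))

  def main(args: Array[String]): Unit = {
    val words = fromBatches(Seq(Seq("spark", "streaming"), Seq("rdd")))
    val lengths = words.map(_.length)
    println(lengths.collect())
  }
}
```

Calling `collect()` twice yields the same result, since every derived dataset is recomputed from its parent through the recorded function rather than read from mutable state.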

D-Stream Operators

Transformations
- Build new streams from existing streams
- Include existing Spark operators, which act on each interval in isolation, plus new stateful operators

Output operators
- Send data to outside world (save results to external storage, print to screen, etc.)

Example 1

Count the words received every second

    words = readStream("http://...", Seconds(1))
    counts = words.count()

[Figure: words and counts are D-Streams with one RDD per interval (time = 0 - 1, 1 - 2, 2 - 3); the count transformation turns each RDD of words into an RDD of counts]

Demo Setup

- 10 EC2 m1.xlarge instances
- Each instance receiving a stream of sentences at rate of 1 MB/s, total 10 MB/s
- Spark Streaming receives the sentences and processes them

Example 2

Count frequency of words received every second

    words = readStream("http://...", Seconds(1))
    ones = words.map(w => (w, 1))
    freqs = ones.reduceByKey(_ + _)

Here w => (w, 1) is a Scala function literal.

[Figure: for each interval (time = 0 - 1, 1 - 2, 2 - 3), words -> (map) -> ones -> (reduce) -> freqs]

Demo

Example 3

Count frequency of words received in last minute

    ones = words.map(w => (w, 1))
    freqs = ones.reduceByKey(_ + _)
    freqs_60s = freqs.window(Seconds(60), Seconds(1)).reduceByKey(_ + _)

window is the sliding window operator; Seconds(60) is the window length and Seconds(1) is the window movement.

[Figure: for each interval (time = 0 - 1, 1 - 2, 2 - 3), words -> (map) -> ones -> (reduce) -> freqs -> (window reduce) -> freqs_60s]

Simpler running reduce:

    freqs = ones.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(1))

Demo

Incremental window operators

[Figure: words, freqs and freqs_60s over intervals t-1 to t+4, computed first with an aggregation function (+) and then with an invertible aggregation function]

    freqs = ones.reduceByKey(_ + _)
    freqs_60s = freqs.window(Seconds(60), Seconds(1)).reduceByKey(_ + _)

    freqs = ones.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(1))

    freqs = ones.reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(1))
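Why an inverse function helps can be shown with a small simulation. The sketch below is plain Scala over in-memory maps, not the Spark Streaming implementation: `WindowSketch`, `combine`, `naive` and `incremental` are hypothetical names, and the window is given as a number of intervals (a Seconds(60) window moving by Seconds(1) would correspond to 60 intervals). The naive version re-reduces every interval in the window each time; the incremental version adds the interval entering the window and subtracts the one leaving it.

```scala
object WindowSketch {
  // Per-interval word frequencies, i.e. the result of reduceByKey(_ + _).
  type Counts = Map[String, Int]

  // Merge two count maps key by key with the given function.
  def combine(a: Counts, b: Counts)(f: (Int, Int) => Int): Counts =
    (a.keySet ++ b.keySet).iterator
      .map(k => k -> f(a.getOrElse(k, 0), b.getOrElse(k, 0)))
      .toMap

  // Naive windowed reduce: for every interval t, reduce all intervals
  // in the window ending at t from scratch.
  def naive(freqs: Seq[Counts], window: Int): Seq[Counts] =
    freqs.indices.map { t =>
      freqs.slice(math.max(0, t - window + 1), t + 1)
        .foldLeft(Map.empty[String, Int])((acc, c) => combine(acc, c)(_ + _))
    }

  // Incremental windowed reduce with an invertible function:
  // new window = previous window + entering interval - leaving interval.
  def incremental(freqs: Seq[Counts], window: Int): Seq[Counts] =
    freqs.indices.scanLeft(Map.empty[String, Int]) { (prev, t) =>
      val added = combine(prev, freqs(t))(_ + _)
      val result =
        if (t >= window) combine(added, freqs(t - window))(_ - _) else added
      // Drop keys whose count fell back to zero.
      result.filter(_._2 != 0)
    }.tail

  def main(args: Array[String]): Unit = {
    val freqs = Seq(
      Map("a" -> 1), Map("a" -> 2, "b" -> 1), Map("b" -> 3), Map("c" -> 1))
    println(naive(freqs, 2))
    println(incremental(freqs, 2) == naive(freqs, 2))
  }
}
```

Per output interval, the naive version touches every interval in the window, while the incremental version touches only two, which matters when the window is much longer than its movement.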

Smarter running reduce: the reduceByKeyAndWindow call that also takes the inverse function (_ - _).

Output Operators

- save: write results to any Hadoop-compatible storage system (e.g. HDFS, HBase)
- foreachRDD: run a Spark function on each RDD

    foreachRDD(rdd => {
      // any Spark/Scala processing, maybe save to database
    })

Live + Batch + Interactive

- Combining D-streams with historical datasets

    pageViews.join(historicCounts).map(...)

- Interactive queries on stream state from the Spark interpreter

    pageViews.slice(21:00, 21:05).topK(10)

Outline

- Introduction
- Programming interface
- Early results
- Future development

System Architecture

- Built on an optimized version of Spark

[Figure: Clients; a Master with D-stream lineage, task scheduler and block tracker; Workers, each with task execution, block manager and input receiver; replication of input & checkpoint RDDs]

The whole system is built on top of an optimized version of Spark; the optimizations are explained later.

The figure shows the architecture of the system, which consists of three components: a master that tracks the D-stream lineage and schedules tasks to compute RDD partitions for each interval; a worker daemon running on every machine that receives data, stores the partitions of input and computed RDDs, and executes tasks; and a client library that applications can use to send data into the system.

Implementation

Optimizations on current Spark:
- New block store
  - APIs: Put(key, value, storage level), Get(key)
- Optimized scheduling for