learning spark ch10 - spark streaming

Chapter 10: Spark Streaming. Learning Spark by Holden Karau et al.

Post on 26-Jan-2017







Overview: Spark Streaming
- A Simple Example
- Architecture and Abstraction
- Transformations
  - Stateless
  - Stateful
- Output Operations
- Input Sources
  - Core Sources
  - Additional Sources
  - Multiple Sources and Cluster Sizing
- 24/7 Operation
  - Checkpointing
  - Driver Fault Tolerance
  - Worker Fault Tolerance
  - Receiver Fault Tolerance
  - Processing Guarantees
- Streaming UI
- Performance Considerations
  - Batch and Window Sizes
  - Level of Parallelism
  - Garbage Collection and Memory Usage
- Conclusion

10.1 A Simple Example

Before we dive into the details of Spark Streaming, let's consider a simple example. We will receive a stream of newline-delimited lines of text from a server running at port 7777, keep only the lines that contain the word "error", and print them. Spark Streaming programs are best run as standalone applications built using Maven or sbt. Spark Streaming, while part of Spark, ships as a separate Maven artifact and has some additional imports you will want to add to your project.
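The example described above could look roughly like the following. This is a minimal sketch, not the slide's original listing: the host "localhost", the 1-second batch interval, the "local[2]" master, and the object name are assumptions made for illustration, and the Spark Streaming artifact must be on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingErrorLines {
  def main(args: Array[String]): Unit = {
    // Assumption: local test run with two threads (one for the receiver, one for processing)
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingErrorLines")
    // Assumption: a batch interval of 1 second
    val ssc = new StreamingContext(conf, Seconds(1))

    // DStream of text lines read from the server on port 7777 (host is an assumption)
    val lines = ssc.socketTextStream("localhost", 7777)
    // Keep only the lines that contain the word "error"
    val errorLines = lines.filter(_.contains("error"))
    // Print the first few elements of every batch
    errorLines.print()

    // Nothing runs until the context is started; then block until the job is stopped
    ssc.start()
    ssc.awaitTermination()
  }
}
```

For a quick local trial, a tool such as netcat (for example `nc -lk 7777`) can play the role of the server, with lines typed into it arriving in the stream.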

10.2 Architecture and Abstraction

Spark Streaming uses a micro-batch architecture, where the streaming computation is treated as a continuous series of batch computations on small batches of data. Spark Streaming receives data from various input sources and groups it into small batches. New batches are created at regular time intervals. At the beginning of each time interval a new batch is created, and any data that arrives during that interval gets added to that batch. At the end of the time interval the batch is done growing. The size of the time intervals is determined by a parameter called the batch interval. The batch interval is typically between 500 milliseconds and several seconds, as configured by the application developer. Each input batch forms an RDD, and is processed using Spark jobs to create other RDDs. The processed results can then be pushed out to external systems in batches.
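The grouping of arriving data into batches can be illustrated without Spark at all. The following plain-Scala sketch is a simulation under stated assumptions, not Spark's implementation: the `Event` type and `toBatches` helper are hypothetical names, and each event is simply assigned to the interval that contains its arrival time.

```scala
object MicroBatchSketch {
  // Hypothetical event type: arrival time in milliseconds plus a payload
  final case class Event(timeMs: Long, payload: String)

  // Batch k collects every event that arrives in [k * interval, (k + 1) * interval)
  def toBatches(events: Seq[Event], batchIntervalMs: Long): Map[Long, Seq[String]] =
    events
      .groupBy(_.timeMs / batchIntervalMs)
      .map { case (batch, es) => batch -> es.map(_.payload) }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100, "a"), Event(900, "b"), Event(1000, "c"), Event(2500, "d"))
    // With a 1-second batch interval, "a" and "b" share batch 0, "c" is batch 1, "d" is batch 2
    toBatches(events, batchIntervalMs = 1000).toSeq.sortBy(_._1).foreach {
      case (batch, payloads) => println(s"batch $batch: ${payloads.mkString(",")}")
    }
  }
}
```

In Spark Streaming each such batch becomes one RDD of the DStream.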


edX and Coursera Courses
- Introduction to Big Data with Apache Spark
- Spark Fundamentals I
- Functional Programming Principles in Scala


10.3 Transformations

Stateless
- The processing of each batch does not depend on the data of its previous batches.
- Includes the common RDD transformations such as map(), filter(), and reduceByKey().

Stateful
- Uses data or intermediate results from previous batches to compute the results of the current batch.
- Includes transformations based on sliding windows and on tracking state across time.

10.3.1 Stateless Transformations

10.3.2 Stateful Transformations

Windowed transformations
- Compute results across a longer time period than the StreamingContext's batch interval, by combining results from multiple batches.

A windowed stream with a window duration of 3 batches and a slide duration of 2 batches; every two time steps, we compute a result over the previous 3 time steps
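The window in the caption above (window duration of 3 batches, slide duration of 2 batches) can be traced with a small plain-Scala simulation. This is a sketch only: `windowedSums` is a hypothetical helper, the per-batch values are made up, and windows near the start of the stream are assumed to cover whatever batches exist so far.

```scala
object WindowSketch {
  // batchCounts(i) is the value produced by batch i (0-based).
  // A result is emitted after every `slide`-th batch and sums the last `window` batches.
  def windowedSums(batchCounts: Seq[Int], window: Int, slide: Int): Seq[(Int, Int)] =
    (slide to batchCounts.length by slide).map { end =>
      val start = math.max(0, end - window)
      (end, batchCounts.slice(start, end).sum)
    }

  def main(args: Array[String]): Unit = {
    // Six batches; results appear after batches 2, 4 and 6
    val sums = windowedSums(Seq(1, 2, 3, 4, 5, 6), window = 3, slide = 2)
    sums.foreach { case (end, sum) => println(s"after batch $end: $sum") }
  }
}
```

In Spark Streaming itself the same idea is expressed with operations such as `window(windowDuration, slideDuration)` or `reduceByKeyAndWindow()`, where both durations must be multiples of the batch interval.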

10.3.2 Stateful Transformations (cont.)

updateStateByKey() transformation
- updateStateByKey() maintains state across the batches in a DStream by providing access to a state variable for DStreams of key/value pairs.
- It takes a function update(events, oldState) that returns a newState:
  - events is a list of events that arrived in the current batch (may be empty).
  - oldState is an optional state object, stored within an Option; it might be missing if there was no previous state for the key.
  - newState is also an Option; we can return an empty Option to specify that we want to delete the state.
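An update function of the shape described above is ordinary Scala and can be tried without a cluster. The sketch below keeps a running sum per key; the names `updateRunningSum` and `updateOrExpire` are hypothetical, and the expiry rule (drop the state when a batch brings no events) is just one illustrative policy.

```scala
object UpdateStateSketch {
  // Running sum for one key: add this batch's events to the previous total
  def updateRunningSum(events: Seq[Long], oldState: Option[Long]): Option[Long] =
    Some(oldState.getOrElse(0L) + events.sum)

  // Variant that returns None, deleting the state, when a batch has no events for the key
  def updateOrExpire(events: Seq[Long], oldState: Option[Long]): Option[Long] =
    if (events.isEmpty) None else updateRunningSum(events, oldState)

  def main(args: Array[String]): Unit = {
    // Events for a single key over three consecutive batches; the second batch is empty
    val batches = Seq(Seq(1L, 1L), Seq.empty[Long], Seq(1L))
    val finalState = batches.foldLeft(Option.empty[Long]) { (state, events) =>
      updateRunningSum(events, state)
    }
    println(finalState)
  }
}
```

On a DStream of (key, Long) pairs, such a function would be passed as `pairs.updateStateByKey(updateRunningSum _)`; note that updateStateByKey() requires checkpointing to be enabled on the StreamingContext.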

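10.4 Output Operations

Output operations specify what needs to be done with the final transformed data in a stream.
- print()
- Save operations, such as saveAsTextFiles() and saveAsHadoopFiles()

Saving a DStream to text files in Scala:

    ipAddressRequestCount.saveAsTextFiles("outputDir", "txt")

Saving SequenceFiles from a DStream in Scala:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.SequenceFileOutputFormat

    // Convert each (ip, count) pair to Hadoop Writable types before saving
    val writableIpAddressRequestCount = ipAddressRequestCount.map {
      case (ip, count) => (new Text(ip), new LongWritable(count))
    }
    writableIpAddressRequestCount.saveAsHadoopFiles[
      SequenceFileOutputFormat[Text, LongWritable]]("outputDir", "txt")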

10.5 Input Sources

Spark Streaming has built-in support for a number of different data sources.
- Core sources are built into the Spark Streaming Maven artifact.
- Others are available through additional artifacts, e.g., spark-streaming-kafka.

10.5.1 Core Sources

Stream of files
- Allows a stream to be created from files written in a directory of a Hadoop-compatible filesystem.
- Needs a consistent date format for the directory names, and the files have to be created atomically.
- E.g., streaming text files written to a directory in Scala:

    val logData = ssc.textFileStream(logDirectory)

Akka actor stream
- Allows using Akka actors as a source for streaming.
- To construct an actor stream: create an Akka actor and implement the org.apache.spark.streaming.receiver.ActorHelper interface.

10.5.2 Additional Sources
- Apache Kafka
- Apache Flume
  - Push-based receiver
  - Pull-based receiver
- Custom input sources

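10.5.3 Multiple Sources and Cluster Sizing

- We can combine data from multiple input DStreams using operations like union().
- Using multiple receivers means running all of them in the Spark cluster: each receiver runs as a long-running task within Spark's executors, and hence occupies CPU cores allocated to the application.
- The application therefore needs at least as many cores as it has receivers, plus however many are needed to process the data.

Note: Do not run Spark Streaming programs locally with the master configured as "local" or "local[1]". That allocates a single core, and if a receiver occupies it, nothing is left to process the received data. Use at least "local[2]".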
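The sizing rule can be written down as a small check. The helpers below are hypothetical and only encode the arithmetic stated above (one core per receiver plus the cores needed for processing); they are not part of Spark.

```scala
object ClusterSizingSketch {
  // Minimum cores: one per receiver plus whatever the processing itself needs
  def minCores(numReceivers: Int, processingCores: Int): Int = {
    require(numReceivers >= 0, "numReceivers must not be negative")
    require(processingCores >= 1, "at least one core must be left for processing")
    numReceivers + processingCores
  }

  // Master URL for a local test run with enough threads for receivers and processing
  def localMaster(numReceivers: Int, processingCores: Int = 1): String =
    s"local[${minCores(numReceivers, processingCores)}]"

  def main(args: Array[String]): Unit = {
    // One receiver plus one processing thread
    println(localMaster(numReceivers = 1))
  }
}
```

With a single receiver this yields "local[2]", the smallest local setting that leaves a thread free for processing.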

10.6 24/7 Operation

Spark Streaming provides strong fault tolerance guarantees. As long as the input data is stored reliably, Spark Streaming will always compute the correct result from it, offering "exactly once" semantics, even if workers or the driver fail.

To run Spark Streaming applications 24/7:
- Set up checkpointing to a reliable storage system, such as HDFS or Amazon S3.
- Consider the fault tolerance of the driver program and of unreliable input sources.

10.6.1 Checkpointing

- The main mechanism that needs to be set up for fault tolerance.
- Allows periodically saving data about the application to a reliable storage system, such as HDFS or Amazon S3, for use in recovery.
- Serves two purposes:
  - Limiting the state that must be recomputed on failure
  - Providing fault tolerance for the driver

Limiting the state that must be recomputed on failure. As discussed in "Architecture and Abstraction" on page 186, Spark Streaming can recompute state using the lineage graph of transformations, but checkpointing controls how far back it must go.

Providing fault tolerance for the driver. If the driver program in a streaming application crashes, you can launch it again and tell it to recover from a checkpoint, in which case Spark Streaming will read how far the previous run of the program got in processing the data and take over from there.


10.6.2 Driver Fault Tolerance

- Requires a special way of creating our StreamingContext, which takes in the checkpoint directory: use the StreamingContext.getOrCreate() function.
- In addition to writing the initialization code with getOrCreate(), you need to actually restart your driver program when it crashes.
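A driver set up for recovery might be structured as follows. This is a sketch under assumptions: the checkpoint path, the socket source on localhost:7777, the 1-second batch interval, and the object and method names are all illustrative, and the master is expected to be supplied at submit time.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableApp {
  // Hypothetical checkpoint location; in practice use reliable storage such as HDFS or S3
  val checkpointDir = "hdfs://namenode:8020/checkpoints/recoverable-app"

  // Called only when no checkpoint exists yet; it must build the whole streaming graph
  def createStreamingContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("RecoverableApp")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(checkpointDir)
    ssc.socketTextStream("localhost", 7777).filter(_.contains("error")).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if one exists, otherwise create a fresh context
    val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

All DStream setup lives inside the creating function, because on recovery that function is skipped and the graph is rebuilt from the checkpoint. Restarting the crashed driver is a separate concern handled by the cluster manager or an external monitor.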


10.6.3 Worker Fault Tolerance

Spark Streaming uses the same techniques as Spark for its fault tolerance.
- All the data received from external sources is replicated among the Spark workers.
- All RDDs created through transformations of this replicated input data are tolerant to failure of a worker node, as the RDD lineage allows the system to recompute the lost data all the way from the surviving replica of the input data.

10.6.4 Receiver Fault Tolerance

Spark Streaming restarts failed receivers on other nodes in the cluster. Receivers provide the following guarantees:
- All data read from a reliable filesystem (e.g., with StreamingContext.hadoopFiles) is reliable, because the underlying filesystem is replicated.
- For unreliable sources such as Kafka, push-based Flume, or Twitter, Spark replicates the input data to other nodes, but it can briefly lose data if a receiver task is down.

10.6.5 Processing Guarantees

- Spark Streaming provides exactly-once semantics for all transformations. Even if a worker fails and some data gets reprocessed, the final transformed result (that is, the transformed RDDs) will be the same as if the data were processed exactly once.
- When the transformed result is pushed to external systems using output operations, the task pushing the result may get executed multiple times due to failures, and some data can get pushed multiple times.
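One common way to tolerate repeated pushes is to make the write idempotent, so that pushing the same batch twice leaves the external system unchanged. The slides do not prescribe this; the sketch below is an assumption-laden illustration in which a hypothetical in-memory `Sink` stands in for the external store and rows are keyed by batch time and key.

```scala
import scala.collection.mutable

object IdempotentSinkSketch {
  // Stand-in for an external store. Rows are keyed by (batchTimeMs, key), so a
  // retried write overwrites the earlier copy instead of adding a duplicate.
  final class Sink {
    private val rows = mutable.Map.empty[(Long, String), Long]

    def upsert(batchTimeMs: Long, key: String, count: Long): Unit =
      rows((batchTimeMs, key)) = count

    def rowCount: Int = rows.size

    def total(key: String): Long =
      rows.collect { case ((_, k), v) if k == key => v }.sum
  }

  def main(args: Array[String]): Unit = {
    val sink = new Sink
    sink.upsert(1000L, "10.0.0.1", 3L)
    sink.upsert(1000L, "10.0.0.1", 3L) // same batch pushed again after a task retry
    sink.upsert(2000L, "10.0.0.1", 2L)
    val total = sink.total("10.0.0.1")
    println(s"rows=${sink.rowCount}, total=$total")
  }
}
```

Because the retried push for batch 1000 overwrites its own earlier row, the total counts that batch once.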

10.7 Streaming UI

A UI page that lets us look at what applications are doing (typically http:// :4040).

10.8 Performance Considerations
- Batch and window sizes
- Level of parallelism
- Garbage collection and memory usage

10.8.1 Batch and Window Sizes

- A good minimum batch size for many applications: 500 milliseconds.
- The best approach: start with a larger batch size (around 10 seconds) and work your way down to a smaller batch size.
- If the processing times reported in the Streaming UI remain consistent, you can continue to decrease the batch size.
- Note: if they are increasing, you may have reached the limit for your application.
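The "consistent versus increasing" judgment can be made a little more concrete. The heuristic below is hypothetical and not from the book: it takes processing times read off the Streaming UI and calls them increasing when the mean of the later half exceeds the mean of the earlier half by more than a tolerance (20% by default, an arbitrary choice).

```scala
object BatchSizeTuningSketch {
  // Hypothetical heuristic: compare the mean of the later samples with the earlier ones
  def isIncreasing(processingTimesMs: Seq[Double], tolerance: Double = 0.2): Boolean = {
    require(processingTimesMs.length >= 2, "need at least two samples")
    val (early, late) = processingTimesMs.splitAt(processingTimesMs.length / 2)
    def mean(xs: Seq[Double]): Double = xs.sum / xs.length
    mean(late) > mean(early) * (1 + tolerance)
  }

  def main(args: Array[String]): Unit = {
    // Stable processing times: the batch size could be reduced further
    println(isIncreasing(Seq(400.0, 410.0, 395.0, 405.0)))
    // Growing processing times: the application has probably reached its limit
    println(isIncreasing(Seq(400.0, 600.0, 900.0, 1400.0)))
  }
}
```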

10.8.2 Level of Parallelism

Increasing the parallelism is a common way to reduce the processing time of batches. There are three ways to do so:
- Increasing the number of receivers
- Explicitly repartitioning received data
- Increasing parallelism in aggregation
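The three options could be combined in one pipeline roughly as follows. This is a sketch only: the word-count workload, the socket source, and the parameter names are assumptions, the source is assumed to accept several concurrent connections, and older Spark releases may additionally need `import org.apache.spark.streaming.StreamingContext._` for the pair operations.

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

object ParallelismSketch {
  def wordCounts(ssc: StreamingContext, host: String, port: Int,
                 numReceivers: Int, numPartitions: Int): DStream[(String, Long)] = {
    // 1. More receivers: create several input DStreams and merge them with union()
    val inputs = (1 to numReceivers).map(_ => ssc.socketTextStream(host, port))
    val merged = ssc.union(inputs)

    // 2. Explicitly repartition the received data before heavy processing
    val spread = merged.repartition(numPartitions)

    // 3. Pass a partition count to the aggregation itself
    spread
      .flatMap(_.split(" "))
      .map(word => (word, 1L))
      .reduceByKey(_ + _, numPartitions)
  }
}
```

Each extra receiver occupies another core, so the first option only helps when the cluster has cores to spare.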

10.8.3 Garbage Collection and Memory Usage

Java's garbage collection is another aspect that can cause problems.
- To minimize large pauses due to GC, enable Java's Concurrent Mark-Sweep garbage collector. It consumes more resources overall, but introduces fewer pauses.
- To reduce GC pressure:
  - Cache RDDs in serialized form
  - Use Kryo serialization
  - Use an LRU cache
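Two of these settings can be expressed as Spark configuration. The fragment below is a sketch: the object and method names are hypothetical, and the Concurrent Mark-Sweep flag only applies to JVM versions that still ship that collector (it was removed in JDK 14).

```scala
import org.apache.spark.SparkConf

object GcFriendlyConf {
  def build(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      // Kryo produces more compact serialized data than Java serialization
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Ask executors to use the Concurrent Mark-Sweep collector (older JVMs only)
      .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")
}
```

The same options can instead be passed at submit time with `--conf`, which keeps them out of the application code.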


10.9 Conclusion

In this chapter, we have seen how to work with streaming data using DStreams. Since DStreams are composed of RDDs, the techniques and knowledge you have gained from the earlier chapters remain applicable for streaming and real-time applications. In the next chapter, we will look at machine learning with Spark.