
Spark & Spark Streaming Internals

Akhil Das, akhil@sigmoidanalytics.com

Apache Spark

Spark Stack

Spark Internals

Resilient Distributed Dataset (RDD)

Restricted form of distributed shared memory
➔ Immutable
➔ Can only be built through deterministic transformations (textFile, map, filter, join, …)

Efficient fault recovery using lineage
➔ Recompute lost partitions on failure
➔ No cost if nothing fails
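
As a minimal sketch of lineage-based recovery (assuming a SparkContext named sc, e.g. from spark-shell; all names are illustrative): each transformation only records how its RDD derives from its parent, so a lost partition can be rebuilt by replaying that chain over the corresponding input split.

val nums   = sc.parallelize(1 to 1000)    // base RDD, built deterministically
val evens  = nums.filter(_ % 2 == 0)      // transformation: lineage recorded, no work yet
val scaled = evens.map(_ * 10)            // extends the lineage chain

println(scaled.toDebugString)             // prints the recorded lineage graph
println(scaled.count())                   // action: triggers the actual computation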

RDD Operations

Transformations
➔ map/flatMap
➔ filter
➔ union/join/groupBy
➔ cache
…

Actions
➔ collect/count
➔ save
➔ take
…
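
A minimal sketch combining a few of these (again assuming a SparkContext sc; the data is illustrative). The transformations, including cache, only build a plan; the job runs when the action at the end is called.

val a = sc.parallelize(Seq(("spark", 1), ("streaming", 2)))
val b = sc.parallelize(Seq(("spark", 3)))

val joined = a.join(b)             // transformation on key/value pairs: (key, (left, right))
joined.cache()                     // mark the result for in-memory reuse

println(joined.collect().toSeq)    // action: runs the job, yielding the pair ("spark", (1, 3))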

Log Mining Example

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")            // Base RDD
errors = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
messages = errors.map(_.split('\t')(2))         // Transformed RDD
messages.persist()

messages.filter(_.contains("foo")).count        // Action
messages.filter(_.contains("bar")).count        // Action

What is Spark Streaming?

Framework for large-scale stream processing

➔ Scales to 100s of nodes

➔ Can achieve second-scale latencies

➔ Provides a simple batch-like API for implementing complex algorithms

➔ Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis, etc.

Overview

Run a streaming computation as a series of very small, deterministic batch jobs

(Diagram: live input stream → Spark Streaming → batches of X seconds → Spark → processed results in batches)

- Chop up the live stream into batches of X seconds

- Spark treats each batch of data as RDDs and processes them using RDD operations

- Finally, the processed results of the RDD operations are returned in batches
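
A minimal sketch of this model (assuming the spark-streaming artifact is on the classpath; the app name, master, host, port, and the 1-second interval are all illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingOverview").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))     // chop the live stream into 1-second batches

val lines = ssc.socketTextStream("localhost", 9999)   // each batch of lines becomes one RDD
lines.count().print()                                 // processed results returned batch by batch

ssc.start()                                           // start receiving and processing
ssc.awaitTermination()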

Key Concepts

➔ DStream – sequence of RDDs representing a stream of data
- Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets

➔ Transformations – modify data from one DStream to another
- Standard RDD operations – map, countByValue, reduce, join, …
- Stateful operations – window, countByValueAndWindow, …

➔ Output Operations – send data to external entity
- saveAsHadoopFiles – saves to HDFS
- foreach – do anything with each batch of results
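
Continuing the previous sketch (reusing ssc and lines; the checkpoint path and window durations are illustrative, and foreach on the slide corresponds to foreachRDD in later APIs), a stateful window transformation plus two output operations could look like:

val words = lines.flatMap(_.split(" "))

ssc.checkpoint("/tmp/streaming-checkpoint")   // window state requires a checkpoint directory

// Stateful operation: counts over a sliding 30-second window, updated every 10 seconds
val windowedCounts = words.countByValueAndWindow(Seconds(30), Seconds(10))

windowedCounts.print()                        // output operation: print each batch of results
words.foreachRDD(rdd => println("words in batch: " + rdd.count()))   // do anything per batch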

E.g.: Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))   // Transformation
hashTags.saveAsHadoopFiles("hdfs://...")   // Output operation

Sample hashtags: #Ebola, #India, #Mars ...
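
Note that ssc.twitterStream reflects an early Spark Streaming API; in later releases the Twitter source moved to the spark-streaming-twitter module. A rough equivalent (OAuth credentials supplied via twitter4j configuration; saveAsTextFiles stands in for the pair-DStream saveAsHadoopFiles):

import org.apache.spark.streaming.twitter.TwitterUtils

val tweets   = TwitterUtils.createStream(ssc, None)   // None: use the default twitter4j OAuth settings
val hashTags = tweets.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))
hashTags.saveAsTextFiles("hdfs://...", "txt")         // writes one directory of files per batch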

Thank You
