What’s new with Apache Spark’s Structured Streaming?
Miklos Christine
3/21/2017
$ whoami
Solutions Architect @ Databricks
• Apache Spark Advocate
• Build and architect big data platforms for streaming and batch processing
Previously:
• Sales Engineer @ Cloudera
• Software Engineer @ Cisco
Building robust stream processing apps is hard
Complexities in stream processing

Complex Data
• Diverse data formats (JSON, Avro, binary, …)
• Data can be dirty, late, out-of-order

Complex Systems
• Diverse storage systems and formats (SQL, NoSQL, Parquet, …)
• System failures

Complex Workloads
• Event time processing
• Combining streaming with interactive queries, machine learning
Spark Streaming 1.x APIs (DStreams)
Difficulties:
• Separate StreamingContext()
• Additional library packages and dependencies
• Kafka / Kinesis libraries
• State management / window functions
• Serialization issues & code changes (upgrade issues)
• 1 StreamingContext() per application
Spark Streaming 1.x (aka DStreams)

// Function to create a new StreamingContext and set it up
def creatingFunc(): StreamingContext = {
// Create a StreamingContext
val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds))
// Get the input stream from the source
val topic1_Stream = createKafkaStream(ssc, kafkaTopic1, kafkaBrokers)
// … … … logic
// To make sure data is not deleted by the time we query it interactively
ssc.remember(Minutes(1))
println("Creating function called to create new StreamingContext")
ssc
}
Spark Streaming 1.x (aka DStreams)

// Get or create a streaming context.
val ssc = StreamingContext.getActiveOrCreate(creatingFunc)
// This starts the streaming context in the background.
ssc.start()
// This ensures that we wait for some time before the background streaming job
// starts. This will put this cell on hold for 5 times the batchIntervalSeconds.
ssc.awaitTerminationOrTimeout(batchIntervalSeconds * 5 * 1000)
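The createKafkaStream() helper used above isn't defined on the slides. A minimal sketch, assuming the Spark 1.x direct stream API from the spark-streaming-kafka (Kafka 0.8) connector:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical helper: create a direct Kafka stream for a single topic.
// kafkaBrokers is a comma-separated broker list, e.g. "host1:9092,host2:9092".
def createKafkaStream(
    ssc: StreamingContext,
    topic: String,
    kafkaBrokers: String): DStream[(String, String)] = {
  val kafkaParams = Map("metadata.broker.list" -> kafkaBrokers)
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set(topic))
}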
Structured Streaming
Stream processing on the Spark SQL engine: fast, scalable, fault-tolerant
Rich, unified, high-level APIs: deal with complex data and complex workloads
Rich ecosystem of data sources: integrate with many storage systems
You should not have to reason about streaming.
You should write simple batch queries & Spark should automatically streamify them.
Treat Streams as Unbounded Tables
[Diagram: data stream → unbounded input table]
New data in the data stream = new rows appended to an unbounded table
New Model
[Diagram: input table grows at each trigger (every 1 sec); the query runs on data up to t = 1, 2, 3]
Input: data from source as an append-only table
Trigger: how frequently to check input for new data
Query: operations on input; usual map/filter/reduce plus new window, session ops
New Model
[Diagram: at each trigger, the query runs on input data up to t and produces the result up to t]
Result: final operated table, updated after every trigger
Output: what part of the result to write to storage after every trigger
Complete output mode: write all rows in the result table to storage every time
New Model
Result: final operated table, updated after every trigger
Output: what part of the result to write to storage after every trigger
Complete output: write the full result table every time
Append output: write only the new rows added to the result table since the previous batch
[Diagram: in append mode, only the rows added to the result since the last trigger are written to storage]
static data = bounded table
streaming data = unbounded table
API: Dataset/DataFrame
Single API!
Batch Queries with DataFrames
input = spark.read.format("json").load("source-path")
result = input.select("device", "signal").where("signal > 15")
result.write.format("parquet").save("dest-path")
Read from JSON file
Select some devices
Write to Parquet file
Streaming Queries with DataFrames
input = spark.readStream.format("json").load("source-path")
result = input.select("device", "signal").where("signal > 15")
result.writeStream.format("parquet").start("dest-path")
Read from JSON file stream: replace read with readStream
Select some devices: code does not change
Write to Parquet file stream: replace save() with start()
DataFrames, Datasets, SQL
Logical Plan: Streaming Source → Project (device, signal) → Filter (signal > 15) → Streaming Sink
Spark automatically streamifies!
Spark SQL converts the batch-like query to a series of incremental execution plans operating on new batches of data.
[Diagram: series of incremental execution plans, each processing new files at t = 1, 2, 3]
input = spark.readStream.format("json").load("source-path")
result = input.select("device", "signal").where("signal > 15")
result.writeStream.format("parquet").start("dest-path")
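As an aside (not on the slides), the handle returned by start() can print the incremental plan Spark generated for the most recent batch; the checkpoint path below is a placeholder:

val query = result.writeStream
  .format("parquet")
  .option("checkpointLocation", "/checkpoint") // placeholder path
  .start("dest-path")

// Prints the physical plan of the most recently executed incremental batch
query.explain()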
Fault-tolerance with Checkpointing
Checkpointing: metadata (e.g. offsets) of the current batch is stored in a write-ahead log in HDFS/S3
The query can be restarted from the log
Streaming sources can replay the exact data range in case of failure
Streaming sinks can dedupe reprocessed data when writing, idempotent by design
end-to-end exactly-once guarantees
[Diagram: at t = 1, 2, 3, new files are processed; batch metadata is recorded in the write-ahead log]
Complex Streaming ETL
Traditional ETL
Raw, dirty, un/semi-structured data is dumped as files
Periodic jobs run every few hours to convert raw data to structured data ready for further analytics
[Diagram: data lands in a file dump within seconds, but takes hours to reach a queryable table]
Traditional ETL
Hours of delay before decisions can be taken on the latest data
Unacceptable when time is of the essence [intrusion detection, anomaly detection, etc.]
[Diagram: data lands in a file dump within seconds, but takes hours to reach a queryable table]
Streaming ETL w/ Structured Streaming
Structured Streaming enables raw data to be available as structured data as soon as possible
[Diagram: raw data becomes a queryable structured table within seconds]
Streaming ETL w/ Structured Streaming
Example
- JSON data being received in Kafka
- Parse nested JSON and flatten it
- Store in structured Parquet table
- Get end-to-end failure guarantees
val rawData = spark.readStream
  .format("kafka")
  .option("subscribe", "topic")
  .option("kafka.bootstrap.servers", ...)
  .load()

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json("json").as("data"))
  .select("data.*")

val query = parsedData.writeStream
  .option("checkpointLocation", "/checkpoint")
  .partitionBy("date")
  .format("parquet")
  .start("/parquetTable")
Reading from Kafka [Spark 2.1]
Supports Kafka 0.10.0.1
Specify options to configure:

How?
kafka.bootstrap.servers => broker1

What?
subscribe => topic1,topic2,topic3 // fixed list of topics
subscribePattern => topic* // dynamic list of topics
assign => {"topicA":[0,1]} // specific partitions

Where?
startingOffsets => latest (default) / earliest / {"topicA":{"0":23,"1":345}}
val rawData = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
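Combining a few of these options in one sketch (the broker address is a placeholder; note that subscribePattern takes a regex):

val rawData = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
  .option("subscribePattern", "topic.*")             // dynamic list of topics
  .option("startingOffsets", "earliest")             // replay from the beginning
  .load()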
Reading from Kafka
rawData DataFrame has the following columns:

key      | value    | topic    | partition | offset | timestamp
[binary] | [binary] | "topicA" | 0         | 345    | 1486087873
[binary] | [binary] | "topicB" | 3         | 2890   | 1486086721

val rawData = spark.readStream
  .format("kafka")
  .option("subscribe", "topic")
  .option("kafka.bootstrap.servers", ...)
  .load()
Transforming Data
Cast binary value to string; name the column json

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json("json").as("data"))
  .select("data.*")
Transforming Data
val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json("json").as("data"))
  .select("data.*")

Cast binary value to string; name the column json
Parse the json string and expand into nested columns; name it data

json:
{ "timestamp": 1486087873, "device": "devA", … }
{ "timestamp": 1486082418, "device": "devX", … }

from_json("json") as "data" →

data (nested):
timestamp  | device | …
1486087873 | devA   | …
1486086721 | devX   | …
Transforming Data
val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json("json").as("data"))
  .select("data.*")

Cast binary value to string; name the column json
Parse the json string and expand into nested columns; name it data
Flatten the nested columns

data (nested):
timestamp  | device | …
1486087873 | devA   | …
1486086721 | devX   | …

select("data.*") →

(not nested):
timestamp  | device | …
1486087873 | devA   | …
1486086721 | devX   | …
Transforming Data
Cast binary value to string; name the column json
Parse the json string and expand into nested columns; name it data
Flatten the nested columns

Powerful built-in APIs to perform complex data transformations: from_json, to_json, explode, …
100s of functions (see our blog post)

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json("json").as("data"))
  .select("data.*")
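The slides abbreviate from_json; the actual Spark function also takes a schema for the JSON payload. A minimal sketch with an explicit schema (the field list is an assumption for illustration):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Assumed schema; match it to your actual JSON payload.
val jsonSchema = new StructType()
  .add("timestamp", LongType)
  .add("device", StringType)
  .add("signal", DoubleType)

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json(col("json"), jsonSchema).as("data"))
  .select("data.*")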
Save parsed data as Parquet table in the given path
Partition files by date so that future queries on time slices of the data are fast
e.g. query on last 48 hours of data
Writing to Parquet table
val query = parsedData.writeStream
  .option("checkpointLocation", ...)
  .partitionBy("date")
  .format("parquet")
  .start("/parquetTable")
Checkpointing
Enable checkpointing by setting the checkpoint location to save offset logs
start() actually starts a continuously running StreamingQuery in the Spark cluster
val query = parsedData.writeStream
  .option("checkpointLocation", ...)
  .format("parquet")
  .partitionBy("date")
  .start("/parquetTable/")
Streaming Query
query is a handle to the continuously running StreamingQuery
Used to monitor and manage the execution
StreamingQuery

val query = parsedData.writeStream
  .option("checkpointLocation", ...)
  .format("parquet")
  .partitionBy("date")
  .start("/parquetTable/")
[Diagram: the running query processes new data at t = 1, 2, 3]
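A few of the monitoring and management hooks on the handle, as a sketch (these methods exist on StreamingQuery in Spark 2.1):

query.status             // current status, e.g. waiting for the next trigger
query.lastProgress       // metrics for the most recent completed trigger
query.awaitTermination() // block until the query stops
query.stop()             // stop the continuously running query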
Data Consistency on Ad-hoc Queries
Data available for complex, ad-hoc analytics within seconds
Parquet table is updated atomically; ensures prefix integrity
Even if distributed, ad-hoc queries will see either all updates from the streaming query or none; read more in our blog:
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Complex, ad-hoc queries on the latest data within seconds!
Advanced Streaming Analytics
Event time Aggregations
Many use cases require aggregate statistics by event time
E.g. what's the #errors in each system in 1-hour windows?
Many challenges: extracting event time from data; handling late, out-of-order data
DStream APIs were insufficient for event-time processing
Windowing is just another type of grouping in Structured Streaming
Supports UDAFs!

Number of records every hour:
parsedData
  .groupBy(window("timestamp", "1 hour"))
  .count()
Event time Aggregations
Average signal strength of each device every 10 mins:
parsedData
  .groupBy(
    "device",
    window("timestamp", "10 mins"))
  .avg("signal")
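The slide code abbreviates column references; in the actual API, window() takes a Column plus durations, and also accepts an optional slide interval for overlapping windows. A sketch with explicit syntax (the 5-minute slide is an assumption for illustration):

import org.apache.spark.sql.functions.{col, window}

// Average signal per device over 10-minute windows sliding every 5 minutes
parsedData
  .groupBy(
    col("device"),
    window(col("timestamp"), "10 minutes", "5 minutes"))
  .avg("signal")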
Stateful Processing for Aggregations
Aggregates have to be saved as distributed state between triggers
Each trigger reads previous state and writes updated state
State stored in memory, backed by write ahead log in HDFS/S3
Fault-tolerant, exactly-once guarantee!
[Diagram: at each trigger (t = 1, 2, 3), the query reads new data from the source, reads the previous state, writes updated state, and writes results to the sink; state updates are written to a write-ahead log for checkpointing]
Watermarking and Late Data
Watermark [Spark 2.1] - threshold on how late an event is expected to be, in event time
Trails behind the max seen event time
Trailing gap is configurable
[Diagram: with a max event time of 12:30 PM and a trailing gap of 10 mins, the watermark sits at 12:20 PM; data older than the watermark is not expected]
Watermarking and Late Data
Data newer than the watermark may be late, but is allowed to aggregate
Data older than the watermark is "too late" and dropped
Windows older than the watermark are automatically deleted to limit the amount of intermediate state
[Diagram: late data newer than the watermark (12:30 PM) is allowed to aggregate; data older than the watermark is dropped]
Watermarking and Late Data

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window("timestamp", "5 minutes"))
  .count()

[Diagram: allowed lateness of 10 mins; late data newer than the watermark is allowed to aggregate, data older than the watermark is not expected]
Watermarking to Limit State [Spark 2.1]

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window("timestamp", "5 minutes"))
  .count()

[Diagram: processing time vs. event time, 5-minute windows. The system tracks the max observed event time (12:14); the watermark is updated to 12:14 - 10 min = 12:04 for the next trigger, and state for windows before 12:04 is deleted. An event at 12:08 arrives late but is still counted; data older than the watermark is ignored in counts and its state dropped.]
more details in online programming guide
Arbitrary Stateful Operations [Spark 2.2]

mapGroupsWithState allows any user-defined stateful ops on a user-defined state
-- supports timeouts
-- fault-tolerant, exactly-once
-- supports Scala and Java
dataset.groupByKey(groupingFunc).mapGroupsWithState(mappingFunc)
def mappingFunc(
    key: K,
    values: Iterator[V],
    state: KeyedState[S]): U = {
  // update or remove state
  // set timeouts
  // return mapped value
}
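A minimal sketch of a running count per key in this shape, assuming a Dataset[Event] named events (note the preview API's KeyedState shown on the slide became GroupState in the released Spark 2.2):

case class Event(device: String, timestamp: Long)

// Keep a running event count per device across triggers
def countEvents(
    device: String,
    events: Iterator[Event],
    state: KeyedState[Long]): (String, Long) = {
  val previous = if (state.exists) state.get else 0L
  val updated = previous + events.size
  state.update(updated) // persist the new running count
  (device, updated)
}

val counts = events
  .groupByKey(_.device)
  .mapGroupsWithState(countEvents _)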
Many more updates!

StreamingQueryListener [Spark 2.1]
Receive regular progress heartbeats for health and perf monitoring
Automatic in Databricks, come to the Databricks booth for a demo!!
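Wiring one up, as a sketch (the listener class and callbacks are the Spark 2.1 API; the println bodies are illustrative):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"Query started: ${event.id}")
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"Progress heartbeat: ${event.progress.json}") // rates, durations, offsets
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"Query terminated: ${event.id}")
})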
Kafka Batch Queries [Spark 2.2]
Run batch queries on Kafka just like a file system
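For example, a bounded batch read over a topic might look like this sketch (broker and topic are placeholders; startingOffsets/endingOffsets bound the range):

val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
  .option("subscribe", "topic1")                     // placeholder
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()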
Kafka Sink [Spark 2.2]
Write to Kafka; can only give an at-least-once guarantee as Kafka doesn't support transactional updates
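A sketch of streaming out to Kafka (the sink expects a value column of string or binary; broker, topic, and checkpoint path are placeholders):

parsedData
  .selectExpr("to_json(struct(*)) AS value") // serialize each row as JSON
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
  .option("topic", "output-topic")                   // placeholder
  .option("checkpointLocation", "/checkpoint/kafka-sink")
  .start()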
Kinesis Source
Read from Amazon Kinesis
More Info

Structured Streaming Programming Guide:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Databricks blog posts for more focused discussions:
https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
and more to come, stay tuned!!
GET TICKETS NOW!!
EARLY BIRD PRICING!!
Need time to convince your manager?
Spark Summit Code: ChicagoMU
Good for 15% off starting 4/8.
Comparison with Other Engines
Read our blog to understand this table
Thank you!!