Apache Spark Streaming


TRANSCRIPT

Page 1: Apache Spark Streaming

[email protected], Scalapolis 2016

Make yourself a scalable pipeline with Apache Spark

Page 2: Apache Spark Streaming

Google Data Flow, 2014

The future of data processing is unbounded data. Though bounded data will always have an important and useful place, it is semantically subsumed by its unbounded counterpart.

Page 3: Apache Spark Streaming

Jaikumar Vijayan, eWeek 2015

Analyst firms like Forrester expect demand for streaming analytics services and technologies to grow in the next few years as more organisations try to extract value from the huge volumes of data being generated these days from transactions, Web clickstreams, mobile applications and cloud services.

Page 4: Apache Spark Streaming

❖ Integrate user activity information

❖ Enable nearly real-time analytics

❖ Scale to millions of visits per day

❖ Respond to rapidly emerging requirements

❖ Enable data-science techniques on top of collected data

❖ Do the above with reasonable cost

Page 5: Apache Spark Streaming

Canonical architecture

Diagram: sources (web, sensor, audit-event, micro-service) feed into an ingestion layer.

Page 6: Apache Spark Streaming

Apache Spark

❖ Started in 2009

❖ Developed in Scala with Akka

❖ Polyglot: Currently supports Scala, Java, Python and R

❖ The largest BigData community as of 2015

Page 7: Apache Spark Streaming

Spark use-cases

❖ Data integration and ETL

❖ Interactive analytics

❖ Machine learning and advanced analytics

Page 8: Apache Spark Streaming

Apache Spark

❖ Scalable

❖ Fast

❖ Elegant programming model

❖ Fault tolerant

Scalable

• Scalable by design

• Scales to hundreds of nodes

• Proven in production by many companies

Page 9: Apache Spark Streaming

Apache Spark

❖ Scalable

❖ Fast

❖ Elegant programming model

❖ Fault tolerant

Fast

• You can optimise both for latency and throughput

• Reduced hardware appetite due to various optimisations

• Further improvements added with Structured Streaming in Spark 2.0

Page 10: Apache Spark Streaming

Apache Spark

❖ Scalable

❖ Fast

❖ Elegant programming model

❖ Fault tolerant

Programming model

• Functional paradigm

• Easy to run, easy to test

• Polyglot (R, Scala, Python, Java)

• Batch and streaming APIs are very similar

• REPL - a.k.a. Spark shell

Page 11: Apache Spark Streaming

Apache Spark

❖ Scalable

❖ Fast

❖ Elegant programming model

❖ Fault tolerant

Fault tolerance

• Data is distributed and replicated

• Seamlessly recovers from node failure

• Zero data loss guarantees due to the write-ahead log

Page 12: Apache Spark Streaming

Runtime model

Diagram: your code and the Spark Context live in the Driver Program, which coordinates Executors #1-#4; the data is split into partitions p1-p6 spread across the executors.
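A minimal driver-side sketch of that picture, assuming a stand-alone cluster; the application name and master URL are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("scalapolis-demo")    // hypothetical application name
  .setMaster("spark://master:7077") // placeholder master URL
val sc = new SparkContext(conf)     // schedules work on the executors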

Page 13: Apache Spark Streaming

RDD - Resilient Distributed Dataset

val textFile = sc.textFile("hdfs://…")

Diagram: the Driver Program hands the work to Executors #1-#4, each reading the partitions stored on its co-located Data node #1-#4.

Page 14: Apache Spark Streaming

val rdd: RDD[String] = sc.textFile(…)

val wordsRDD = rdd
  .flatMap(line => line.split(" "))

val lengthHistogram = wordsRDD
  .groupBy(word => word.length)
  .collect

wordsRDD
  .filter(word => word.startsWith("a"))
  .saveAsTextFile("hdfs://…")

Meet DAG

Diagram: the transformations above form a directed acyclic graph of RDDs (A through F), which Spark uses to plan execution into stages.

Page 15: Apache Spark Streaming

DStream

❖ Series of small and deterministic batch jobs

❖ Spark chops live stream into batches

❖ Processing each micro-batch produces a result

Timeline: at each batch interval (times 1-6 s) the stream yields one RDD, RDD1 through RDD6.
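A minimal sketch of where the batch interval comes from, reusing the SparkContext sc from the earlier slides; the one-second duration is an example value:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Every second, the data received so far becomes one new RDD of the DStream.
val ssc = new StreamingContext(sc, Seconds(1))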

Page 16: Apache Spark Streaming

val dstream: DStream[String] = …

val wordsStream = dstream
  .flatMap(line => line.split(" "))
  .transform(_.map(_.toUpper))
  .countByValue()

wordsStream.print()

Streaming program
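One detail the slide leaves implicit: a DStream pipeline is only a description, and nothing flows until the streaming context is started. A sketch:

ssc.start()             // begin receiving data and scheduling micro-batches
ssc.awaitTermination()  // block the driver until the stream is stopped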

Page 17: Apache Spark Streaming

It’s not a free lunch

❖ The abstractions are leaking

❖ You need to control level of parallelism

❖ You need to understand the impact of transformations

❖ Don't materialise whole partitions in the foreachPartition operation (see the sketch below)
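A sketch of that last pitfall; writeToSink is a hypothetical output function standing in for your real sink:

def writeToSink(record: String): Unit = println(record) // stand-in sink

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Bad: records.toList.foreach(writeToSink) -- pulls the whole partition into memory
    records.foreach(writeToSink) // good: streams through the iterator, one record at a time
  }
}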

Page 18: Apache Spark Streaming

Performance factors

• Network operations

• Data locality

• Total number of cores

• How finely you can chunk your work

• Memory usage and GC

• Serialization

Page 19: Apache Spark Streaming

Level of parallelism

❖ Number of receivers aligned with number of executors

❖ Number of threads aligned with number of cores and nature of operations - blocking or non-blocking

❖ Your data needs to be chunked to make use of your hardware (see the sketch below)
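A sketch of those three points together, assuming a receiver-based socket source; the host, port and counts are placeholders to be matched to your executors and cores:

val numReceivers = 4 // e.g. one receiver per executor
val streams = (1 to numReceivers).map(_ => ssc.socketTextStream("host", 9999))
val unified = ssc.union(streams)      // one logical DStream from all receivers
val chunked = unified.repartition(16) // e.g. match the total number of cores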

Page 20: Apache Spark Streaming

Stateful transformations

❖ Stateful transformation example (below)

❖ Stateful DStream operators can have infinite lineages

❖ That leads to high failure-recovery time

❖ Spark solves that problem with checkpointing

val actions: DStream[(String, UserAction)] = …
val hotCategories = actions.mapWithState(StateSpec.function(stateFunction _))
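A sketch of what stateFunction might look like; UserAction and the per-key counting logic are assumptions, not the talk's actual code:

import org.apache.spark.streaming.State

case class UserAction(category: String) // hypothetical event type

def stateFunction(key: String, action: Option[UserAction], state: State[Long]): Option[(String, Long)] = {
  val newCount = state.getOption.getOrElse(0L) + 1 // accumulate a count per key
  state.update(newCount)
  Some((key, newCount))
}

// Checkpointing truncates the otherwise infinite lineage:
ssc.checkpoint("hdfs://…/checkpoints") // placeholder checkpoint directory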

Page 21: Apache Spark Streaming

Monitoring

❖ Spark Web UI

❖ Metrics:

❖ Console

❖ Ganglia Sink

❖ Graphite Sink (works great with Grafana; see the sketch below)

❖ JMX

❖ REST API
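A sketch of wiring up the Graphite sink through Spark's metrics configuration; the host and prefix are placeholders:

# conf/metrics.properties
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite-host
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark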

Page 26: Apache Spark Streaming

Types of sources

❖ Basic sources:

❖ Sockets, HDFS, Akka actors

❖ Advanced sources:

❖ Kafka, Kinesis, Flume, MQTT

❖ Custom sources:

❖ Receiver interface (see the sketch below)
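A minimal sketch of a custom source built on the Receiver interface; the data-producing loop is a stand-in for a real client:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class DummyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart(): Unit = {
    new Thread("dummy-receiver") {
      override def run(): Unit =
        while (!isStopped()) store("event") // stand-in for reading a real source
    }.start()
  }
  def onStop(): Unit = () // the isStopped() flag terminates the thread above
}

val custom = ssc.receiverStream(new DummyReceiver)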

Page 27: Apache Spark Streaming

Apache Kafka

Greasing the wheels for big data

❖ Incredibly fast message bus

❖ Distributed and fault tolerant

❖ Highly scalable

❖ Strong order guarantees

❖ Easy to replicate across multiple regions

Diagram: a Producer writes to Broker 1 and Broker 2, and a Consumer reads from them.

Page 28: Apache Spark Streaming

Spark 💕 Kafka

❖ Native integration through the direct-stream API

❖ Offset information is stored in write-ahead logs

❖ Restarting the Spark driver reloads offsets that weren't yet processed

❖ This needs to be explicitly enabled (see the sketch below)
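A sketch of the direct-stream API from the 2016-era spark-streaming-kafka module (Kafka 0.8); broker addresses and the topic name are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("user-events"))

// Offset recovery across driver restarts only works once checkpointing is
// enabled (ssc.checkpoint, as on the stateful slide); receiver-based sources
// additionally need spark.streaming.receiver.writeAheadLog.enable=true.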

Page 29: Apache Spark Streaming

Storage considerations

❖ HDFS works well for large, batch workloads

❖ HBase works well for random reads and writes

❖ HDFS is well suited for analytical queries

❖ HBase is well suited for interaction with web pages and certain types of range queries

❖ It pays off to persist all data in raw format (see the sketch below)
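A sketch of the raw-format advice: dump every micro-batch to HDFS before any transformation, so the data can always be reprocessed later (the path prefix is a placeholder):

dstream.saveAsTextFiles("hdfs://…/raw/events") // one directory per batch interval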

Page 30: Apache Spark Streaming

Lessons learnt

Page 31: Apache Spark Streaming

Architecture

Diagram: the production architecture, with web traffic as the source.

Page 32: Apache Spark Streaming

Final thoughts

❖ Start with a reasonably large batch duration, ~10 seconds

❖ Adapt your level of parallelism

❖ Use Kryo for faster serialisation (see the sketch below)

❖ Don't even start without good monitoring

❖ Find bottlenecks using the Spark UI and monitoring

❖ The issues are usually in the environment surrounding Spark
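A sketch of the Kryo tip, reusing the SparkConf import from the runtime-model sketch; UserAction stands in for whatever classes actually travel over the wire in your job:

val tunedConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[UserAction])) // register your own classes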

Page 33: Apache Spark Streaming

?

Page 34: Apache Spark Streaming

The End

Bartosz Jankiewicz

@oborygen

[email protected]

Page 35: Apache Spark Streaming

References

❖ http://spark.apache.org/docs/latest/streaming-programming-guide.html

❖ https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details

❖ http://milinda.pathirage.org/kappa-architecture.com/

❖ http://lambda-architecture.net

❖ http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/