Dean Wampler, Ph.D. [email protected] @deanwampler Stream All the Things! Architectures for Data that Never Ends


Post on 17-Apr-2020


lightbend.com/fast-data-platform (2nd Edition coming soon!)

Free as in 🍺


Streaming in Context…


Hadoop: Classic Batch Architecture

[Diagram: a Hadoop cluster. MapReduce jobs and Spark jobs are submitted to the Master Node, which runs the YARN Resource Manager and the HDFS Name Node. Worker Nodes #1, #2, … each run a Node Manager and a Data Node on top of several local disks.]


Storage


Compute


Resource Management


Database Deconstructed!

Optimized for storing lots of data at rest, with subsequent processing, but not optimized for data in motion.

Characteristics
• Batch and interactive queries
• Massive storage - HDFS is the data “backplane”
• Integrate jobs through HDFS
• Multiuser jobs

Use Cases
• Data warehouse replacement
• Interactive exploration
• Offline ML model training
• …


New Streaming, “Fast Data” Architecture

[Diagram: the fast data architecture. Sources (files, sockets, REST) feed events, streams, and storage traffic into a Kafka cluster of brokers. Streaming engines (Spark for mini-batch and batch; Flink, Kafka Streams, Akka Streams, and Beam under “Low Latency”) and microservices (Reactive Platform, Go, Node.js, …) process the data. Persistence is in S3, HDFS, local disks, SQL/NoSQL databases, and search. Everything runs on Kubernetes, Mesos, YARN, …, in the cloud or on-premise.]

While YARN can be used, it’s not flexible enough for today’s dynamic workloads.

Kubernetes and Mesos provide the job and resource management needed for dynamic, heterogeneous workloads.

Deploy in the cloud or on premise.

“Events” - e.g., REST messages, sessions, alerts, …

“Streams” - one-way data flows, e.g., sockets or files, including logs, metrics, other telemetry, click streams, etc.

“Storage” - JDBC, async reads/writes to storage

Each has different volumes, velocities, latency characteristics, protocols, etc.

Kafka is deployed as a cluster of “Brokers” for scalability and resiliency.

Data backplane - like an Enterprise Service Bus (ESB), but without the flaws…

Why Kafka?

Organized into topics. Topics are partitioned, replicated, and distributed.

[Diagram: Kafka holding Topic A (Partition 1, Partition 2) and Topic B (Partition 1).]
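To make "partitioned" concrete, here is a minimal sketch of how a keyed record might be mapped to a partition. It is an illustration only: `PartitionSketch`, `partitionFor`, and `assign` are hypothetical names, and the hash used here is the JVM's `String.hashCode`, which is an assumption chosen for simplicity (Kafka's own default partitioner uses a different hash function).

```scala
// Hypothetical sketch: spread keyed records over a fixed number of partitions.
object PartitionSketch {
  // Map a record key to one of numPartitions partitions. Math.floorMod keeps
  // the result non-negative even when hashCode is negative.
  def partitionFor(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Group record keys by the partition each one maps to.
  def assign(keys: Seq[String], numPartitions: Int): Map[Int, Seq[String]] =
    keys.groupBy(partitionFor(_, numPartitions))
}
```

The property that matters is that the same key always lands in the same partition, so records for one key stay in order relative to each other.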

Why Kafka?

Logs, not queues!

• M producers, who write new entries.
• N consumers, who start reading where they want.
• Unlike queues, consumers don’t delete entries; Kafka manages their lifecycles.

[Diagram: Partition 1 of Topic B holds offsets 0 (earliest) through 15 (latest). Producer 1 and Producer 2 write to it. Consumer 1 reads at offset 14, Consumer 2 at offset 10, and Consumer 3 at offset 6.]
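The log-versus-queue distinction can be sketched with a toy, in-memory append-only log. This is not the Kafka client API; `LogSketch` and `PartitionLog` are hypothetical names, and the sketch ignores replication, retention, and concurrency.

```scala
// A toy stand-in for one partition: an append-only log with offsets.
object LogSketch {
  final class PartitionLog[A] {
    private var entries = Vector.empty[A]

    // Producers append; the returned value is the new entry's offset.
    def append(entry: A): Long = {
      entries = entries :+ entry
      entries.size - 1L
    }

    // Consumers read from any offset they choose; reading deletes nothing.
    def read(fromOffset: Long, max: Int): Vector[A] =
      entries.slice(fromOffset.toInt, fromOffset.toInt + max)

    def latestOffset: Long = entries.size - 1L
  }

  // Fill a log with offsets 0 through 15, as in the diagram.
  def sampleLog(): PartitionLog[String] = {
    val log = new PartitionLog[String]
    (0 to 15).foreach(i => log.append(s"event-$i"))
    log
  }
}
```

Each consumer only has to remember its own offset, so three consumers at offsets 14, 10, and 6 can share one partition without interfering with each other, and a consumer can re-read old entries by starting from an earlier offset.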

Using Kafka

Before: N * M links between producers and consumers (Service 1, Service 2, Service 3, other services, log & other files, the Internet). Messy and fragile; what if “Service 1” goes down?

After: N + M links, with Kafka between producers and consumers. Simpler and more robust! Loss of Service 1 means no data loss.
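The link counts are simple arithmetic, sketched below. The object and method names are hypothetical, and the counts in the usage note are made-up sample values, not numbers from the talk.

```scala
// Hypothetical sketch: count integration links with and without a broker.
object LinkCount {
  // Every producer wired directly to every consumer.
  def pointToPoint(producers: Int, consumers: Int): Int = producers * consumers

  // Every producer and every consumer holds a single link to Kafka.
  def viaKafka(producers: Int, consumers: Int): Int = producers + consumers
}
```

With 6 producers and 6 consumers, `pointToPoint` gives 36 links while `viaKafka` gives 12, and the gap widens as either side grows.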

Lots of streaming engine options… too many.

The streaming analog of a deconstructed database!

Standard APIs allow almost any storage you want

Use your regular microservice tools…


Streaming Engines


Features to Consider…

• Low latency? How low?
• High Volume: How high?
• Which kinds of data processing?
• Process data individually or in bulk?
• Preferred application architecture and DevOps processes?
• Integration with other services

• Low latency? How low?
• Picoseconds to a few microseconds? True “Real Time” (image: http://www.spacex.com/news)
  • Custom hardware (FPGAs).
  • “Kernel bypass” network HW/SW.
  • Custom C++ code.

• < 100 microseconds? (images: http://tradinghub.co/watch-list-for-mar-26th-2015/ http://www.usa.philips.com/)
  • Fast JVM message handlers.
  • Akka Actors
  • LMAX Disruptor

• < 10 milliseconds? (image: http://money.cnn.com/2017/05/12/pf/credit-card-mistakes/index.html)
  • Fast data streaming tools like Flink and, more recently, Spark, Akka (and Akka Streams), and Kafka Streams.

• < hundreds of milliseconds? (images: https://github.com/keen/dashboards https://www.coursera.org/learn/machine-learning)
  • “Micro-batches”
  • Processing records in bulk, e.g., Spark’s micro-batch model and “streaming SQL” over windows.

• < 1 second to minutes?

[Diagrams: ETL, where logs pass through Kafka (a raw logs topic, a Kafka Streams job, and a parsed logs topic); and model training, where storage and data feed model training, model serving, and other logic.]

• > 1 minute?
  • Consider periodic batch jobs!
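The latency tiers above amount to a rough lookup from latency budget to tool category. The sketch below restates them as code; `LatencyGuide` and `suggest` are hypothetical names, and the exact cutoffs chosen for "a few microseconds" (5 µs) and "hundreds of milliseconds" (1 second) are assumptions made only so the function is total.

```scala
import scala.concurrent.duration._

// Hypothetical sketch: turn the slides' latency tiers into a lookup.
object LatencyGuide {
  // The 5 µs and 1 s cutoffs are assumptions; the slides give no exact values.
  def suggest(budget: FiniteDuration): String =
    if (budget <= 5.micros) "custom hardware (FPGAs), kernel-bypass networking, custom C++"
    else if (budget < 100.micros) "fast JVM message handlers: Akka Actors, LMAX Disruptor"
    else if (budget < 10.millis) "per-event streaming: Flink, Spark, Akka Streams, Kafka Streams"
    else if (budget < 1.second) "micro-batches, streaming SQL over windows"
    else if (budget <= 1.minute) "streaming pipelines such as ETL and model training"
    else "periodic batch jobs"
}
```

Real choices also depend on volume, processing style, and operations, which the following slides cover, so treat this as a first filter rather than a decision procedure.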

• High Volume: How high?
• < 10,000 events/second? (image: http://www.drdobbs.com/web-development/soa-web-services-and-restful-systems/199902676)
  • REST
  • One at a time…

• < 100,000 per second?
  • Nonblocking REST!
  • Parallelism - Akka worker actors
  • Switch to bulk processing?

• 1,000,000s per second? (image: https://store.nest.com/product/thermostat/)
  • Flink or Spark Streaming
  • Process in bulk
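The difference between handling events one at a time and processing them in bulk can be sketched with plain Scala collections. Everything here is hypothetical: the `Event` type, its fields, and the fixed batch size stand in for whatever a real engine would provide.

```scala
// Hypothetical sketch: per-event handling versus fixed-size batches.
object BulkSketch {
  final case class Event(zipCode: String, value: Double)

  // Per-event style: handle each event on its own as it arrives.
  def handleOne(e: Event): String = s"${e.zipCode}:${e.value}"

  // Bulk style: cut the stream into fixed-size batches and aggregate each one.
  def countsPerBatch(events: Iterator[Event], batchSize: Int): Iterator[Map[String, Int]] =
    events.grouped(batchSize).map { batch =>
      batch.groupBy(_.zipCode).map { case (zip, es) => zip -> es.size }
    }
}
```

Batching amortizes per-record overhead across many records, which is why it helps at high volume, but a record may wait for its batch to fill, so latency rises.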

• Which kinds of data processing?
• Extract, transform, and load (ETL)?

[Diagram: logs pass through Kafka: a raw logs topic, a Kafka Streams job, and a parsed logs topic.]
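The "transform" step of such a log ETL job can be sketched without Kafka at all. The raw line format below (timestamp, level, message, separated by spaces) is an assumption made for illustration, and `EtlSketch`, `ParsedLog`, and `etl` are hypothetical names; a real job would read from the raw topic and write to the parsed topic.

```scala
// Hypothetical sketch: parse raw log lines into structured records.
object EtlSketch {
  final case class ParsedLog(timestamp: String, level: String, message: String)

  // Assumed raw format: "<timestamp> <LEVEL> <message>", space separated.
  def parse(line: String): Option[ParsedLog] =
    line.split(" ", 3) match {
      case Array(ts, level, msg) => Some(ParsedLog(ts, level, msg))
      case _                     => None // malformed lines are dropped
    }

  // Raw lines in, parsed records out; unparseable lines are skipped.
  def etl(rawLogs: Iterator[String]): Iterator[ParsedLog] =
    rawLogs.map(parse).collect { case Some(parsed) => parsed }
}
```

Dropping malformed lines silently is a design shortcut for the sketch; a production job would more likely route them to a separate topic for inspection.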

• “Dataflow” pipelines

import org.apache.spark.SparkContext

val sc = new SparkContext("local[*]", "Inverted Idx")
sc.textFile("data/crawl")
  .map { line =>
    // Each line holds a path and its text, separated by a tab.
    val Array(path, text) = line.split("\t", 2)
    (path, text)
  }
  .flatMap { case (path, text) =>
    // Emit a (word, path) pair for every word in the text.
    text.split("""\W+""").map((_, path))
  }
  .map { case (w, p) => ((w, p), 1) }
  // Count the occurrences of each (word, path) pair.
  .reduceByKey { case (n1, n2) => n1 + n2 }
  // … (the slide's code is cut off here)
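To see what the first stages of this pipeline compute without a Spark cluster, the same steps can be run over an in-memory sequence. This is a sketch with ordinary Scala collections, not Spark; `InvertedIndexSketch` and `counts` are hypothetical names, and `groupBy` plus `size` stands in for `reduceByKey`.

```scala
// Hypothetical sketch: the (word, path) counting step on plain collections.
object InvertedIndexSketch {
  // (path, text) pairs stand in for the parsed lines of data/crawl.
  def counts(docs: Seq[(String, String)]): Map[(String, String), Int] =
    docs
      .flatMap { case (path, text) =>
        // Split on non-word characters and pair every word with its path.
        text.split("""\W+""").filter(_.nonEmpty).map(word => (word, path)).toSeq
      }
      .groupBy(identity)
      .map { case (wordAndPath, occurrences) => wordAndPath -> occurrences.size }
}
```

The result maps each (word, path) pair to how often the word occurs at that path, which is the input a later stage would regroup by word to form the inverted index.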

• SQL?

// Read Parquet files as a stream. (A streaming file source normally needs
// an explicit schema; it is omitted here, as on the slide.)
val input = spark.readStream
  .format("parquet")
  .load("my-iot-data")

input.groupBy("zip-code").count()

The equivalent SQL:

SELECT COUNT(*) FROM `my-iot-data` GROUP BY `zip-code`

• Train and serve ML models?

[Diagram: storage and data feed model training, model serving, and other logic.]

• Process data individually or in bulk?

[Diagram: a spectrum. At one end, event-driven μ-services: events pass through a router actor to service actors and on to other microservices. At the other end, “record-centric” μ-services: records processed in bulk, e.g., SELECT COUNT(*) FROM my-iot-data GROUP BY zip-code, and model training and serving over stored data.]

• Preferred application architecture?
  • Streaming library in an app?
  • or, distributed services running your job?

Already have a microservices-based, DevOps CI/CD workflow? Stream processing with microservices may fit better into your environment!

• Integration with other tools.
  • Akka, Flink, & Spark integrate with databases, Kafka, file systems, REST, …
  • Kafka Streams only reads & writes Kafka topics.

Best of Breed Streaming Engines

[Diagram: the streaming engines. Spark (mini-batch and batch); Flink, Kafka Streams, Akka Streams, and Beam under “Low Latency”.]

The streaming engines form two groups:

• Run as distributed services: you submit jobs, and they are partitioned into tasks.

• Libraries you embed in your microservices

• Apache Beam
  • (Google Dataflow)
  • Requires a “runner”
  • Most sophisticated streaming semantics

See these blog posts: https://www.oreilly.com/people/09f01-tyler-akidau

[Diagram: a timeline in minutes (0, 1, 2, 3, …) with rows for Server 1, Server 2, and Analysis. Events from both servers are propagated to Analysis, which accumulates them: collect data, then process. Key: “n” marks an event at Server n propagated to Analysis.]
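The "accumulate, then process" idea in this diagram can be sketched as counting events per one-minute window. The sketch assumes each event carries its own timestamp and ignores late or out-of-order arrival, which is exactly the hard part that engines with richer streaming semantics address. `WindowSketch`, `Event`, and `countsPerMinute` are hypothetical names.

```scala
// Hypothetical sketch: tumbling one-minute windows over timestamped events.
object WindowSketch {
  final case class Event(server: Int, timestampSeconds: Long)

  // Assign each event to the one-minute window containing its timestamp,
  // then count events per (window index, server).
  def countsPerMinute(events: Seq[Event]): Map[(Long, Int), Int] =
    events
      .groupBy(e => (e.timestampSeconds / 60, e.server))
      .map { case (windowAndServer, es) => windowAndServer -> es.size }
}
```

Because the window is derived from the event's own timestamp rather than its arrival time, an event that reaches Analysis late would still be counted in the minute when it happened, provided the window's result has not already been emitted.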

• Spark Structured Streaming
  • “Dataset” - SQL
  • Millisecond latency
  • Ideal for rich SQL, ML.

• Spark Streaming
  • Mini-batch model
  • “RDD” (dataflow) based
  • ~0.5 sec latency
  • Original model - obsolete

• Spark Batch
  • Same Dataset and RDD features as streaming.
  • Massive scalability
  • Excellent performance

• Apache Flink
  • High volume, low latency
  • Sophisticated streaming (Beam) semantics
  • SQL, evolving ML support

• Akka Streams
  • Low latency
  • Complex Event Processing
  • Efficient, per event
  • Mid-volume pipelines

• Kafka Streams
  • Low overhead Kafka topic processing
  • Ideal for ETL and aggregations

• Akka and Kafka Streams
  • “Exactly once” with transactions
  • Neither has built-in support for state checkpointing

[Diagram: logs pass through Kafka: a raw logs topic, a streaming app, and a parsed logs topic.]

Each grew out of one end of the spectrum between event-driven μ-services, which process events individually, and “record-centric” μ-services, which process records in bulk.

• Akka Streams vs. Kafka Streams talk
  • Also at polyglotprogramming.com/talks/


Microservices and Fast Data

Recall this diagram?

[Diagram: the fast data architecture shown earlier, now with a ZooKeeper cluster and numbered callouts 1 through 10.]

Use your regular microservice tools…

… but why are microservices in this diagram??

How is this…

[Diagram: the fast data architecture.]

… like this?

[Diagram: event-driven microservices. Events pass through a router actor to service actors and on to other microservices.]

• A data app / microservice:
  • A single responsibility.
  • The input never ends.
  • So, both must be available, responsive, resilient, & scalable. I.e., reactive

http://www.reactivemanifesto.org/

• Going the other way, “small” microservice architectures become data-centric, as the data grows.

The Recent Past

[Diagram: “Services” and “Big Data”, with some overlap: concerns, architecture.]


The Present

Much More Overlap

Microservices & Fast Data


The Future?

Much more microservice focused?

Microservices for Fast Data

Why? Since streams process data incrementally, there is less need for large-scale tools like Spark and Flink…

… and using microservices for everything simplifies development, deployment, and operations.

It is unclear whether this helps bridge the divide between data science and data engineering.


Lightbend Fast Data Platform

lightbend.com/fast-data-platform

What we discussed

Plus management & monitoring tools

Questions?

Dean Wampler, Ph.D. [email protected] @deanwampler polyglotprogramming.com/talks