staging reactive data pipelines using simon souter...

52

Upload: others

Post on 25-May-2020

19 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time
Page 2: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Jaakko Pallari (@lepovirta)

Simon Souter (@simonsouter)

Staging Reactive data pipelines using Kafka as the backbone

/cakesolutions /scala-kafka-client

Page 3: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

MANCHESTER LONDON NEW YORK

Reactive Solutions at Cake

Page 4: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Contents

1. Reactive Data Pipelines

2. Kafka as a Reactive Message Queue

3. Architecture & Consumer Patterns

4. Streaming Application Development

Page 5: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Stream Processing

● Big Data● Processing in Real-time● Event Throughput vs Number of Queries● IoT

Source Service Sink

Page 6: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Distributed Streaming Engines

● Server Applications● Stream topologies deployed to cluster● Framework design

Page 7: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Streaming from ground-up

● Custom Streaming Applications● Leverage existing tool stack

Source Application Sink

Page 8: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Staged data pipelines

● Staged Event Driven Architecture● Processes separated by a queue● Processing in stages

Process Queue Process QueueQueue

Page 9: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Reactive data pipelines

● Responsive● Resilient● Elastic● Message Driven

Process Queue ProcessSource Sink

Page 10: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Streaming from ground-up

● Microservices as processing components

Source Microservice 1 Microservice 2

Microservice 1

Microservice 1

Microservice 2

Microservice 2

SinkQueue

Page 11: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

● Deployment via cluster orchestration services

Streaming from ground-up

Source Microservice 1 Queue Microservice 2

Microservice 1

Microservice 1

Microservice 2

Microservice 2

Sink

Orchestration Service

Scale up

Page 12: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Streaming from ground-up

● Messaging middleware for resilient data distribution between microservices

Source Microservice 1 Queue Microservice 2

Microservice 1

Microservice 1

Microservice 2

Microservice 2

Sink

Page 13: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

What is Kafka?

● Distributed Message Broker● Supports Parallel Streaming● Kafka as a Reactive MQ

Source Microservice 1 Kafka Microservice 2

Microservice 1

Microservice 1

Microservice 2

Microservice 2

Sink

Page 14: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Kafka Topic:“Electric_Readings”

Kafka: topic and message anatomy

Key: “meter1”Value: 1.34

Electric BillCalculation

Auditing

Message Driven

Page 15: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Kafka: at-least-once delivery

Kafka Topic:“Electric_Readings”

Electric meterConsumptionAggregator

Deliver

ACK

Deliver

ACK

Resilient

Page 16: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Kafka node 2

Kafka node 1

Kafka: clustering - arrangement

KafkaTopic

Partition 1

Partition 2

Elastic

Page 17: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Kafka: clustering - replication

Resilient

Kafka node 2

Kafka node 1

KafkaTopic

Partition 1

Partition 2

Partition 2Replica

Partition 1Replica

Page 18: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Kafka: clustering - consumer

Partition #1

Partition #2

Partition #3

Consumer #1

Consumer #2

Consumer #3

KafkaTopic

Responsive

Same consumer group

Page 19: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Kafka: clustering - consumer

Partition #1

Partition #2

Partition #3

Consumer #1

Consumer #2

Consumer #3

KafkaTopic

Responsive

Page 20: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Kafka: clustering - consumer

Partition #1

Partition #2

Partition #3

Consumer #1

Consumer #2

Consumer #3

KafkaTopic

Responsive

Page 21: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Kafka: clustering - consumer

Partition #1

Partition #2

Partition #3

Consumer #1

Consumer #2

Consumer #3

KafkaTopic

Responsive

Page 22: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Kafka: clustering - consumer

Partition #1

Partition #2

Partition #3

Consumer #1

Consumer #2

Consumer #3

KafkaTopic

Responsive

Page 23: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Kafka: clustering - consumer

Partition #1

Partition #2

Partition #3

Consumer #1

Consumer #2KafkaTopic

Responsive

Consumer #3

Consumer #4 No Data

Page 24: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Kafka: high throughput

● Single partition consumer: 20-90 Mb/sec

Responsive

Page 25: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Kafka the Reactive MQ

Message Driven● Key-value messages

Responsive● Consumer clustering● High throughput

Resilient● At-least-once delivery● Replication

Elastic● Linear scalability

Page 26: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Kafka consumer patterns

Source Microservice 1 Kafka Microservice 2

Microservice 1

Microservice 1

Microservice 2

Microservice 2

Sink

Page 27: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Simple message queue

PartitionElectricMeter

AuditingElectricReadings

Partition replica

Partition replica

Kafka Terminology:- Partition Count: 1

Page 28: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Simple message queue - fanout

PartitionElectricMeter

AuditingElectricReadings

Partition replica

Partition replicaBilling

Kafka Terminology:- Partition Count: 1- Multiple Consumer Groups

Page 29: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

DB

Simple message queue - consumer

AuditingServiceConsumer

Client

App logic

Kafka Partition

1. Consume a batch of messages from Kafka

2. Process messages and send results to wherever necessary (e.g. another Kafka topic)

3. Confirm delivery to Kafka

Kafka Terminology:- Commit Mode: Manual

Page 30: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Partition

Kafka: message confirmation

● Messages confirmed by offset (not individually)

Commit point

Consumer

Consumed:

Kafka Terminology:- Commit Mode: Manual

Page 31: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Partition

Kafka: message confirmation

● Messages confirmed by offset (not individually)

Commit point

ConsumerCommit

Consumed:

Kafka Terminology:- Commit Mode: Manual

Page 32: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Parallel workers

Partition #1

Partition #2

Partition #N

ElectricMeter

Auditing node #1

Auditing node #2

Auditing node #N

ElectricReadings

ElectricMeterElectric

Meter

Kafka Terminology:- Partition Count: >1- Single Consumer Group

Page 33: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Kafka PartitionKafka Partition

Consumer for parallel processing

DB

AuditingServiceConsumer

Client

App logic

Kafka Partition

● Same arrangement from consumer perspective

Kafka Terminology:- Partition Count: >1- Commit Mode: Manual

Page 34: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Orchestration● Provide Scaling Capability● Restart or replace failed nodes

Partition #1

Partition #2

Partition #N

ElectricMeter

Auditing node #1

Auditing node #2

Auditing node #N

ElectricReadings

ElectricMeterElectric

Meter

Mesos/ Marathon New node

Page 35: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Stateful Processing● Example:

Average electricity consumption per meter for the last hour

ElectricMeter

AggregationElectricReadings

PartitionPartition

Partition

ElectricMeterElectric

Meter

Page 36: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Aggregator for

Stream and state

Partition #1

Partition #2

Aggregator for

ElectricReadings

● Data locality

Page 37: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Aggregator for

Stream and state

Partition #1

Partition #2

Aggregator for

Key: "meter 1"Value: 9.2

Key: "meter 2"Value: 2.7

ElectricReadings

● Data locality

Page 38: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Aggregator for

Fault tolerance

Partition #1

Partition #2

Aggregator for

ElectricReadings

● State persistence and recovery

Page 39: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Aggregator for

Fault tolerance

Partition #1

Partition #2

Aggregator for

ElectricReadings

Persistence

● State persistence and recovery

Page 40: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Persistence

Stateful Processing app

Persistence

Kafka PartitionKafka Partition

Kafka/DB/?

AggregationServiceConsumer

Client

Aggregation logic

Kafka Partition

Page 41: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

AggregationServiceConsumer

Client

Aggregation logic

AggregationServiceConsumer

Client

Aggregation logic

Stateful Processing app

Persistence

Kafka PartitionKafka Partition

Kafka/DB/?

Kafka Partition

Duplicated message processing after recovery.

Page 42: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Stateful Processing app

PersistencePersist state with partition offsets

Don't commit!Just fetch more data

Kafka PartitionKafka Partition

Kafka/DB/?

AggregationServiceConsumer

Client

Aggregation logic

Kafka Partition

Kafka Terminology:- Commit Mode: Self Managed

Offsets

Page 43: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Partition #1

Partition #1

Stateful Processing architecture● Dynamic partition assignment● Shared Persistence for State

Aggregator 2

Aggregator 1

Persistence

Kafka/DB/?Partition #4

Partition #6

Partition #1Partition #2

Orchestration Service

Aggregator 3

Page 44: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Partition #1

Partition #1

Stateful Processing architecture● Dynamic partition assignment● Shared Persistence for State

Aggregator 2

Aggregator 1

Persistence

Kafka/DB/?Partition #4

Partition #6

Partition #1Partition #2

Orchestration Service

Aggregator 3

Page 45: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Streaming Patterns

Stateful Processing● Self-managed processing

state

Single Partition Topic● Strong ordering guarantees● Limited failure recovery● Scalability is limited

Multi Partition Topic● Parallel processing● Limited ordering guarantees● Kafka managed processing

state

Fanout● Independent consumer

groups

Page 46: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Kafka libraries● Kafka client support in many languages● Scala, Java, C● C bindings -> Haskell, OCaml, Python etc.

Source Microservice 1 Kafka Microservice 2

Microservice 1

Microservice 1

Microservice 2

Microservice 2

Sink

Page 47: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Reactive Streaming APIs● Similar paradigm as in real-time streaming platforms● Reactive Kafka

○ Based on Akka Reactive Streams API○ Scala + Java○ Developed by Akka team

● Kafka Streams○ Official streaming API for Kafka○ Java○ Developed by Confluent

Page 48: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

scala-kafka-client

● Kafka client developed for Scala● Async and non-blocking● Built on top off the official Java driver● Easy API with high performance

/cakesolutions /scala-kafka-client

Page 49: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

scala-kafka-client

● Leverage extensive Akka feature set● Processing logic implemented using

Actor Model

KafkaConsumer

Actor

KafkaProducer

Actor

ReceiverActor

Kafka Kafka

/cakesolutions /scala-kafka-client

Page 50: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Summary

● Leverage Microservice based techniques.● Streaming topologies can be varied and complex

○ Many use-cases fall under a small set of consumer patterns.

● Challenges around scalable and reactive data pipelines● Kafka provides first-class support for reactive streaming to

your applications.● Stateful processing remains a challenging area.

Page 51: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

We didn’t discuss...

● Data serialisation● Application rolling updates● Complex streaming topologies

Page 52: Staging Reactive data pipelines using Simon Souter ...datascienceassn.org/sites/default/files/Staging Reactive Data Pipelin… · Reactive Streaming APIs Similar paradigm as in real-time

Questions?

MANCHESTER LONDON NEW YORK

/cakesolutions /scala-kafka-client

@cakesolutions

+44 845 617 1200

[email protected]