staging reactive data pipelines using simon souter...
TRANSCRIPT
Jaakko Pallari (@lepovirta)
Simon Souter (@simonsouter)
Staging Reactive data pipelines using Kafka as the backbone
/cakesolutions /scala-kafka-client
MANCHESTER LONDON NEW YORK
Reactive Solutions at Cake
Contents
1. Reactive Data Pipelines
2. Kafka as a Reactive Message Queue
3. Architecture & Consumer Patterns
4. Streaming Application Development
Stream Processing
● Big Data● Processing in Real-time● Event Throughput vs Number of Queries● IoT
Source Service Sink
Distributed Streaming Engines
● Server Applications● Stream topologies deployed to cluster● Framework design
Streaming from ground-up
● Custom Streaming Applications● Leverage existing tool stack
Source Application Sink
Staged data pipelines
● Staged Event Driven Architecture● Processes separated by a queue● Processing in stages
Process Queue Process QueueQueue
Reactive data pipelines
● Responsive● Resilient● Elastic● Message Driven
Process Queue ProcessSource Sink
Streaming from ground-up
● Microservices as processing components
Source Microservice 1 Microservice 2
Microservice 1
Microservice 1
Microservice 2
Microservice 2
SinkQueue
● Deployment via cluster orchestration services
Streaming from ground-up
Source Microservice 1 Queue Microservice 2
Microservice 1
Microservice 1
Microservice 2
Microservice 2
Sink
Orchestration Service
Scale up
Streaming from ground-up
● Messaging middleware for resilient data distribution between microservices
Source Microservice 1 Queue Microservice 2
Microservice 1
Microservice 1
Microservice 2
Microservice 2
Sink
What is Kafka?
● Distributed Message Broker● Supports Parallel Streaming● Kafka as a Reactive MQ
Source Microservice 1 Kafka Microservice 2
Microservice 1
Microservice 1
Microservice 2
Microservice 2
Sink
Kafka Topic:“Electric_Readings”
Kafka: topic and message anatomy
Key: “meter1”Value: 1.34
Electric BillCalculation
Auditing
Message Driven
Kafka: at-least-once delivery
Kafka Topic:“Electric_Readings”
Electric meterConsumptionAggregator
Deliver
ACK
Deliver
ACK
Resilient
Kafka node 2
Kafka node 1
Kafka: clustering - arrangement
KafkaTopic
Partition 1
Partition 2
Elastic
Kafka: clustering - replication
Resilient
Kafka node 2
Kafka node 1
KafkaTopic
Partition 1
Partition 2
Partition 2Replica
Partition 1Replica
Kafka: clustering - consumer
Partition #1
Partition #2
Partition #3
Consumer #1
Consumer #2
Consumer #3
KafkaTopic
Responsive
Same consumer group
Kafka: clustering - consumer
Partition #1
Partition #2
Partition #3
Consumer #1
Consumer #2
Consumer #3
KafkaTopic
Responsive
Kafka: clustering - consumer
Partition #1
Partition #2
Partition #3
Consumer #1
Consumer #2
Consumer #3
KafkaTopic
Responsive
Kafka: clustering - consumer
Partition #1
Partition #2
Partition #3
Consumer #1
Consumer #2
Consumer #3
KafkaTopic
Responsive
Kafka: clustering - consumer
Partition #1
Partition #2
Partition #3
Consumer #1
Consumer #2
Consumer #3
KafkaTopic
Responsive
Kafka: clustering - consumer
Partition #1
Partition #2
Partition #3
Consumer #1
Consumer #2KafkaTopic
Responsive
Consumer #3
Consumer #4 No Data
Kafka: high throughput
● Single partition consumer: 20-90 Mb/sec
Responsive
Kafka the Reactive MQ
Message Driven● Key-value messages
Responsive● Consumer clustering● High throughput
Resilient● At-least-once delivery● Replication
Elastic● Linear scalability
Kafka consumer patterns
Source Microservice 1 Kafka Microservice 2
Microservice 1
Microservice 1
Microservice 2
Microservice 2
Sink
Simple message queue
PartitionElectricMeter
AuditingElectricReadings
Partition replica
Partition replica
Kafka Terminology:- Partition Count: 1
Simple message queue - fanout
PartitionElectricMeter
AuditingElectricReadings
Partition replica
Partition replicaBilling
Kafka Terminology:- Partition Count: 1- Multiple Consumer Groups
DB
Simple message queue - consumer
AuditingServiceConsumer
Client
App logic
Kafka Partition
1. Consume a batch of messages from Kafka
2. Process messages and send results to wherever necessary (e.g. another Kafka topic)
3. Confirm delivery to Kafka
Kafka Terminology:- Commit Mode: Manual
Partition
Kafka: message confirmation
● Messages confirmed by offset (not individually)
Commit point
Consumer
Consumed:
Kafka Terminology:- Commit Mode: Manual
Partition
Kafka: message confirmation
● Messages confirmed by offset (not individually)
Commit point
ConsumerCommit
Consumed:
Kafka Terminology:- Commit Mode: Manual
Parallel workers
Partition #1
Partition #2
Partition #N
ElectricMeter
Auditing node #1
Auditing node #2
Auditing node #N
ElectricReadings
ElectricMeterElectric
Meter
Kafka Terminology:- Partition Count: >1- Single Consumer Group
Kafka PartitionKafka Partition
Consumer for parallel processing
DB
AuditingServiceConsumer
Client
App logic
Kafka Partition
● Same arrangement from consumer perspective
Kafka Terminology:- Partition Count: >1- Commit Mode: Manual
Orchestration● Provide Scaling Capability● Restart or replace failed nodes
Partition #1
Partition #2
Partition #N
ElectricMeter
Auditing node #1
Auditing node #2
Auditing node #N
ElectricReadings
ElectricMeterElectric
Meter
Mesos/ Marathon New node
Stateful Processing● Example:
Average electricity consumption per meter for the last hour
ElectricMeter
AggregationElectricReadings
PartitionPartition
Partition
ElectricMeterElectric
Meter
Aggregator for
Stream and state
Partition #1
Partition #2
Aggregator for
ElectricReadings
● Data locality
Aggregator for
Stream and state
Partition #1
Partition #2
Aggregator for
Key: "meter 1"Value: 9.2
Key: "meter 2"Value: 2.7
ElectricReadings
● Data locality
Aggregator for
Fault tolerance
Partition #1
Partition #2
Aggregator for
ElectricReadings
● State persistence and recovery
Aggregator for
Fault tolerance
Partition #1
Partition #2
Aggregator for
ElectricReadings
Persistence
● State persistence and recovery
Persistence
Stateful Processing app
Persistence
Kafka PartitionKafka Partition
Kafka/DB/?
AggregationServiceConsumer
Client
Aggregation logic
Kafka Partition
AggregationServiceConsumer
Client
Aggregation logic
AggregationServiceConsumer
Client
Aggregation logic
Stateful Processing app
Persistence
Kafka PartitionKafka Partition
Kafka/DB/?
Kafka Partition
Duplicated message processing after recovery.
Stateful Processing app
PersistencePersist state with partition offsets
Don't commit!Just fetch more data
Kafka PartitionKafka Partition
Kafka/DB/?
AggregationServiceConsumer
Client
Aggregation logic
Kafka Partition
Kafka Terminology:- Commit Mode: Self Managed
Offsets
Partition #1
Partition #1
Stateful Processing architecture● Dynamic partition assignment● Shared Persistence for State
Aggregator 2
Aggregator 1
Persistence
Kafka/DB/?Partition #4
Partition #6
Partition #1Partition #2
Orchestration Service
Aggregator 3
Partition #1
Partition #1
Stateful Processing architecture● Dynamic partition assignment● Shared Persistence for State
Aggregator 2
Aggregator 1
Persistence
Kafka/DB/?Partition #4
Partition #6
Partition #1Partition #2
Orchestration Service
Aggregator 3
Streaming Patterns
Stateful Processing● Self-managed processing
state
Single Partition Topic● Strong ordering guarantees● Limited failure recovery● Scalability is limited
Multi Partition Topic● Parallel processing● Limited ordering guarantees● Kafka managed processing
state
Fanout● Independent consumer
groups
Kafka libraries● Kafka client support in many languages● Scala, Java, C● C bindings -> Haskell, OCaml, Python etc.
Source Microservice 1 Kafka Microservice 2
Microservice 1
Microservice 1
Microservice 2
Microservice 2
Sink
Reactive Streaming APIs● Similar paradigm as in real-time streaming platforms● Reactive Kafka
○ Based on Akka Reactive Streams API○ Scala + Java○ Developed by Akka team
● Kafka Streams○ Official streaming API for Kafka○ Java○ Developed by Confluent
scala-kafka-client
● Kafka client developed for Scala● Async and non-blocking● Built on top off the official Java driver● Easy API with high performance
/cakesolutions /scala-kafka-client
scala-kafka-client
● Leverage extensive Akka feature set● Processing logic implemented using
Actor Model
KafkaConsumer
Actor
KafkaProducer
Actor
ReceiverActor
Kafka Kafka
/cakesolutions /scala-kafka-client
Summary
● Leverage Microservice based techniques.● Streaming topologies can be varied and complex
○ Many use-cases fall under a small set of consumer patterns.
● Challenges around scalable and reactive data pipelines● Kafka provides first-class support for reactive streaming to
your applications.● Stateful processing remains a challenging area.
We didn’t discuss...
● Data serialisation● Application rolling updates● Complex streaming topologies
Questions?
MANCHESTER LONDON NEW YORK
/cakesolutions /scala-kafka-client
@cakesolutions
+44 845 617 1200