Data Pipelines with Apache Kafka Ben Stopford @confluentinc

Post on 16-Apr-2017

Today

• What is Kafka? (High level fluffy stuff)

• What makes it tick? (Low level geeky stuff)

• How can you use it? (Architect oriented stuff)

What is Kafka?

Kafka: a Streaming Platform

[Diagram: The Log at the center, with Connectors on either side, a Producer, a Consumer, and the Streaming Engine]

The Log: scalable, fault tolerant, concurrent, strongly ordered, stateful

Clients: JVM and C native implementations; Go, Python, and many more open source clients

Connectors: plug into your database of choice

Streaming Engine: the declarative power of a database, wrapped into a Kafka client

Today we'll focus on Kafka: the distributed log

The log is a type of messaging system

What is messaging in essence?

•  Take a message, keep it safe, make it available to consumers.

•  Track what messages have been consumed

Kafka attacks these problems separately

What is a message broker in essence?

[Diagram: a Sender and a Receiver, with the Broker (the log) between them]

The log is a simple idea

Messages are added at the end of the log

Just think of the log as a file

[Diagram: the log drawn from Old to New]
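The append-only idea can be sketched in a few lines of plain Python. This is only an in-memory illustration, not Kafka's API: the `Log` class and its `append` and `read_from` methods are hypothetical names chosen for this example.

```python
class Log:
    """A toy in-memory append-only log (hypothetical, not Kafka's API)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Add a message at the end of the log and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset):
        """Return every message at or after the given offset, oldest first."""
        return self._messages[offset:]


log = Log()
first = log.append("order-created")
second = log.append("order-paid")
```

Messages are never updated in place; the only write operation is adding to the end, which is what makes the file analogy work.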

Consumers have a position

[Diagram: Sally, George, and Fred each sit at their own position in the log, somewhere between Old and New]
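A minimal sketch of per-consumer positions, assuming the log is a plain Python list and each position is the index of the next message to read. The `positions` dict and the `poll` helper are illustrative names, not part of any Kafka client.

```python
log = ["m0", "m1", "m2", "m3", "m4"]  # oldest first

# Each consumer tracks only its own position: the index of the next message to read.
positions = {"Sally": 1, "George": 3, "Fred": 4}


def poll(consumer, max_messages=2):
    """Scan forward from the consumer's position and advance that position."""
    start = positions[consumer]
    batch = log[start:start + max_messages]
    positions[consumer] = start + len(batch)
    return batch


sally_batch = poll("Sally")
fred_batch = poll("Fred")
```

Reading does not change the log itself. One consumer moving forward has no effect on where the others are.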

[Diagram: each consumer scans forward from its own position]

Only Sequential Access

Read to an offset, then scan

No Random Access

[Diagram: an index sitting in front of the disk]

Kafka avoids Indexes by keeping the approach simple (indexes impede scalability in this context)

Topics are Broadcast

[Diagram: the Broker broadcasts to two Consumers]

Can also behave as a queue

[Diagram: one Sender, one Receiver]
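Both behaviours can be illustrated with one position per group of readers. This is a simplified sketch: in Kafka the sharing inside a group happens by dividing partitions among its members rather than by handing out single messages, and the `group_positions` dict and `next_message` function are hypothetical names.

```python
log = ["a", "b", "c", "d"]
group_positions = {}  # one shared position per group of readers


def next_message(group):
    """Hand the group's next unread message to whichever member asks."""
    position = group_positions.get(group, 0)
    if position >= len(log):
        return None  # the group has caught up with the log
    group_positions[group] = position + 1
    return log[position]


# Broadcast: two different groups each read the same first message.
billing_first = next_message("billing")
audit_first = next_message("audit")

# Queue: a second worker in the "billing" group gets the next message instead.
billing_second = next_message("billing")
```

Readers in different groups all see every message, while readers that share a group split the work between them.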

The problem: if you built a messaging system for internet scale, what would it look like?

Shard data to get scalability

Messages are sent to different partitions

[Diagram: Producer (1), Producer (2), and Producer (3) send to a cluster of machines]

Partitions live on different machines
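One common way to shard is to hash a message key and take the result modulo the partition count. The sketch below uses `zlib.crc32` only because it is deterministic and in the standard library; Kafka's default partitioner uses a different hash, and `partition_for` is a hypothetical helper.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition; the same key always lands on the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}

for key, value in [("alice", "login"), ("bob", "login"), ("alice", "logout")]:
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, value))
```

Because a key always maps to the same partition, messages for one key stay in order even though the topic as a whole is spread across machines.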

Replicate to get fault tolerance

[Diagram in two steps: (1) a msg reaches the leader and is replicated across Machine A and Machine B; (2) mastership moves machines]

Kafka goes a step further

A single topic can be spread over multiple consumers

(4 consuming machines process a single topic)

Linearly Scalable Architecture

Single topic:

- Many producer machines

- Many consumer machines

- Many broker machines

No Bottleneck!!

Distributed Commit Log: different from a traditional messaging system

Data is replicated

Strong Consistency

[Diagram: a message is sent to 3 replicas on different machines]

•  Only 1 elected leader

•  Only the leader can be written to or read from

Replication provides resiliency

Another replica takes over on machine failure
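A toy model of leader failover, assuming replication is instantaneous (which real replication is not) and that any surviving replica may be promoted. The `ReplicatedPartition` class is hypothetical and leaves out leader election details entirely.

```python
class ReplicatedPartition:
    """Toy model: one leader plus followers, each on its own machine."""

    def __init__(self, machines):
        self.logs = {machine: [] for machine in machines}
        self.leader = machines[0]

    def write(self, message):
        # Writes go to the leader first, then the followers copy them.
        self.logs[self.leader].append(message)
        for machine, log in self.logs.items():
            if machine != self.leader:
                log.append(message)

    def fail(self, machine):
        """Remove a machine; if it was the leader, promote a surviving replica."""
        del self.logs[machine]
        if machine == self.leader:
            if not self.logs:
                raise RuntimeError("no replicas left")
            self.leader = next(iter(self.logs))


partition = ReplicatedPartition(["A", "B", "C"])
partition.write("m1")
partition.fail("A")   # the leader's machine dies
partition.write("m2") # writes continue against the new leader
```

The new leader already holds everything the old one acknowledged, so writers and readers carry on without losing messages.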

Replication Protocol

Optimistic Write (single machine delivery)

[Diagram: send message, get ack (optimistic)]

Pessimistic Write (wait for replication to complete)

[Diagram: send message, get ack (pessimistic)]
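On the producer side, the choice between the two write styles is usually made with the `acks` setting. The dicts below are a sketch of how that might look; mapping "optimistic" to `acks=1` and "pessimistic" to `acks=all` is an interpretation of the slides, and the broker address is a placeholder.

```python
# Producer settings expressed as plain dicts. "bootstrap.servers" and "acks"
# are standard Kafka producer properties; the address is a placeholder.
optimistic_producer = {
    "bootstrap.servers": "localhost:9092",
    "acks": "1",  # the leader acknowledges as soon as it has the message
}

pessimistic_producer = {
    "bootstrap.servers": "localhost:9092",
    "acks": "all",  # the leader waits for the in-sync replicas before acknowledging
}
```

The optimistic setting gives lower latency but can lose a message if the leader fails before followers copy it; the pessimistic setting trades latency for durability.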

Replication Protocol

Messages can be read only after replication completes

[Diagram: a Writer and a Reader on the replicated log]
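The read rule can be sketched as a "high watermark": the point in the log that every in-sync replica has reached. The function names and the shape of `end_offsets` below are assumptions made for this example.

```python
def high_watermark(replica_end_offsets):
    """Offsets below this value exist on every in-sync replica."""
    return min(replica_end_offsets.values())


def readable(leader_log, replica_end_offsets):
    """Readers only see the fully replicated prefix of the leader's log."""
    return leader_log[:high_watermark(replica_end_offsets)]


leader_log = ["m0", "m1", "m2", "m3"]
# How many messages each in-sync replica currently holds.
end_offsets = {"leader": 4, "follower-1": 3, "follower-2": 2}

visible = readable(leader_log, end_offsets)
```

Here the leader holds four messages, but the slowest follower has only two, so readers see only the first two.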

Replication Protocol

Number of replicas is a soft quorum (set min/max tolerable values)

[Diagram: a Writer and a Reader on the replicated log]
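One way to read the "min/max" idea: the replication factor is the most replicas a partition has, and a minimum in-sync count is the fewest it may drop to before pessimistic writes are refused. Kafka exposes the minimum as the `min.insync.replicas` setting; the function below is a hypothetical sketch of the check, not broker code.

```python
REPLICATION_FACTOR = 3   # upper bound: replicas the partition is meant to have
MIN_INSYNC_REPLICAS = 2  # lower bound: fewest in-sync replicas that is tolerable


def accept_write(in_sync_replicas, min_insync=MIN_INSYNC_REPLICAS):
    """Accept a fully acknowledged write only while enough replicas are in sync."""
    return len(in_sync_replicas) >= min_insync


all_healthy = accept_write({"A", "B", "C"})
one_down = accept_write({"A", "B"})
two_down = accept_write({"A"})
```

The quorum is "soft" because the set of in-sync replicas shrinks and grows as machines fall behind or catch up, instead of requiring a fixed majority.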

Replication is used for resiliency. No need to flush to disk synchronously. You can flush if you wish, but no one does.

Advanced Features

Consumers cluster too!

[Diagram: consumers arranged into consumer groups]
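Within a consumer group, each partition is owned by one member at a time. A round-robin assignment is sketched below; Kafka ships several assignment strategies, and this `assign` function is a simplified stand-in for them.

```python
def assign(partitions, members):
    """Spread partitions over group members in round-robin order."""
    assignment = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


# Four partitions shared by two consumers in the same group.
before = assign(range(4), ["consumer-1", "consumer-2"])

# If one consumer leaves, the remaining member takes over its partitions.
after = assign(range(4), ["consumer-1"])
```

Adding members up to the partition count spreads the load further; losing a member triggers a reassignment so no partition is left unread.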

Compacted Topics (Tabular View)

[Diagram: on one side, a log holding all versions of each key (Version 1 through Version 5); on the other, the compacted view holding the latest version of each key only]
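Compaction can be sketched as "keep the last record for each key". The `compact` function below works on an in-memory list of `(key, value)` pairs and is only an illustration; the real process runs in the background on log segments.

```python
def compact(records):
    """Keep only the last record for each key, preserving log order."""
    latest = {}
    for offset, (key, value) in enumerate(records):
        latest[key] = (offset, value)  # later records overwrite earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (offset, value) in survivors]


all_versions = [
    ("k1", "Version 1"),
    ("k2", "Version 1"),
    ("k1", "Version 2"),
    ("k1", "Version 3"),
    ("k2", "Version 2"),
]

latest_only = compact(all_versions)
```

The compacted result behaves like a table: one row per key, holding its most recent value.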

Multi Tenancy

Users isolated using security features

Bandwidth segregated per user

Use Cases

Microservice Backbone

Always on, Event-Driven Services

The Log (streams & tables)

Ingestion Services

Services with polyglot persistence

Simple Services

Streaming Services

Event Buffer

[Diagram: many producers with small messages feed Kafka, which feeds Hadoop etc.]

Stream Processing for enrichment & transformation

Kafka Streams Example

[Diagram: an Orders stream is joined with a Customer stream (compacted) by Kafka Streams; a Dashboard queries the result]

Join, aggregate, intermediary state stored in Kafka
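The join in this example can be sketched in plain Python: the compacted customer topic is materialised as a table, and each order is enriched by looking up its customer. This is not the Kafka Streams API (which is a Java library); the record shapes and the `enrich` function are assumptions made for illustration.

```python
# Customer changelog (compacted topic): later records replace earlier ones per key.
customers_changelog = [
    ("c1", {"name": "Ada"}),
    ("c2", {"name": "Bob"}),
    ("c1", {"name": "Ada L."}),
]

# Orders stream: (order id, customer id, amount).
orders = [("o1", "c1", 30), ("o2", "c2", 12), ("o3", "c3", 5)]

# Materialise the changelog as a table holding the latest value per key.
customer_table = {}
for customer_id, customer in customers_changelog:
    customer_table[customer_id] = customer


def enrich(order_stream, table):
    """Inner join: emit each order together with its customer, if one is known."""
    for order_id, customer_id, amount in order_stream:
        customer = table.get(customer_id)
        if customer is None:
            continue  # no matching customer, so the order is dropped
        yield {"order": order_id, "customer": customer["name"], "amount": amount}


enriched = list(enrich(orders, customer_table))
```

In this sketch the table lives in a local dict. In the architecture on the slide, that intermediary state is itself stored in Kafka.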

Stream Data Platform (Kappa Architecture)

[Diagram labels: All your data, Connectors, Kafka, Stream processor, Stream Data platform, Views, Client, Client]

Kafka: a Streaming Platform

[Diagram: The Log at the center, with Connectors on either side, a Producer, a Consumer, and the Streaming Engine]

The end

@benstopford http://benstopford.com