Kafka Connect by Datio

Kafka Connect

Contents

1. Introduction to Kafka

2. Data Ingestion

3. Kafka Connect

1. INTRODUCTION TO KAFKA

Let’s start with the first set of slides

Data Pipeline Problem

[Diagram: frontend servers, a database server, and a chat server each push metrics over ad hoc, point-to-point inter-process communication channels to a metrics server and its metrics UI.]

Data Pipeline Problem

[Diagram: a single publish/subscribe system decouples the frontend, backend, database, shopping cart, and chat servers from the metrics server and metrics UI; as log search and other consumers appear, separate metrics and logging pub/sub systems multiply.]

A publish/subscribe system vs. multiple publish/subscribe systems

Kafka Goals

✓ Decouple data pipelines

✓ Provide persistence for message data to allow multiple consumers

✓ Optimize for high throughput of messages

✓ Allow for horizontal scaling of the system to grow as the data streams grow

[Diagram: frontend, backend, database, and shopping cart servers publish into Kafka; the metrics server, metrics UI, and log search consume from it.]

Kafka Architecture

[Diagram: producers write to and consumers read from a Kafka cluster of three brokers. Topic A's Partition 0 lives on Broker 1 and its Partition 1 on Broker 2; Broker 3 hosts Topic B, and the cluster also holds Topic C. ZooKeeper coordinates the cluster.]

Disk-based retention · Scalable · High throughput

USE CASES: Gold Data · Data Lake · Data

2. DATA INGESTION

Kafka Ingestion

[Diagram: a Kafka producer writes into the Kafka cluster; a Kafka consumer reads from it.]

Use Case Requirements

DATA LOSS?

EXACTLY ONCE?

LATENCY?

THROUGHPUT?

Synchronous send

[Diagram: a producer record (topic, partition, key, value) is sent to the Kafka broker with producer.send(record).get; the call waits for the broker to return metadata or an exception.]

Asynchronous send

[Diagram: the same producer record is sent with producer.send(record); metadata or an exception comes back later without blocking the caller.]
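A minimal Java sketch of the two send modes, assuming a plain string producer; the broker address, topic ("metrics"), key, and value are placeholders, not part of the slides.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SendModes {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("metrics", "host-1", "cpu=0.42");

            // Synchronous send: block until the broker answers with metadata or an exception
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("sync: partition=%d offset=%d%n", meta.partition(), meta.offset());

            // Asynchronous send: return immediately and handle the result in a callback
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("async: partition=%d offset=%d%n",
                            metadata.partition(), metadata.offset());
                }
            });
            producer.flush();   // make sure the async send completes before closing
        }
    }
}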

Producer internals

[Diagram: send() takes a producer record (topic, partition, key, value) through the serializer and then the partitioner, which appends it to a batch for its topic-partition (Topic A Partition 0, Topic B Partition 1, ...). Batches are shipped to the broker; if a send fails and retries remain it is retried, otherwise an exception is raised; on success the broker commits the batch and returns metadata (topic, partition, offset).]
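A sketch of how those stages map onto producer configuration; the broker address, topic, and the specific batch size, linger time, and retry count are illustrative values, not recommendations from the slides.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        // Serializer stage: keys and values are turned into byte arrays
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Batching stage: records for the same topic-partition are grouped
        // until the batch is full or linger.ms expires
        props.put("batch.size", 16384);
        props.put("linger.ms", 10);
        // Retry stage: failed batches are re-sent before the exception surfaces
        props.put("retries", 3);

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Partitioner stage: with no explicit partition, the key ("host-1")
            // is hashed to choose the target partition
            producer.send(new ProducerRecord<>("metrics", "host-1", "cpu=0.42"));
        }
    }
}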

Consumers and consumer groups

[Diagram: Topic T1 has four partitions, each an ordered log of messages at increasing offsets (1, 2, 3, ...). The partitions are split across the consumers of Consumer Group 1: one consumer reads all four partitions, two consumers read two each, four consumers read one each, and once there are more consumers than partitions the extra consumers sit idle.]
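A minimal Java consumer sketch for that model; the group id ("metrics-readers"), topic, and broker address are placeholders. Every consumer started with the same group.id joins the same group and takes over a share of the partitions.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "metrics-readers");           // consumers sharing this id split the partitions
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("metrics"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}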

Apache Flume

[Diagram: a Flume agent chains a Source, through a Channel Processor with Interceptors #1..#N, into a Channel, and from the Channel into a Sink.]

Reliable · Fault tolerant · Customizable · Manageable · Centralized sources

Flume + Kafka

[Diagram: two integrations: Kafka acting as a reliable Flume channel between Source and Sink, and Flume acting as a Kafka producer/consumer.]
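A sketch of the "Kafka as a reliable Flume channel" layout as a Flume agent configuration. It assumes Flume 1.7-style Kafka channel property names; the agent name, port, topic, and broker address are placeholders.

# hypothetical agent "a1": netcat source -> Kafka-backed channel -> logger sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# the channel itself is a Kafka topic, so buffered events survive agent restarts
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = localhost:9092
a1.channels.c1.kafka.topic = flume-channel

a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel    = c1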

3. KAFKA CONNECT

Kafka Connect

[Diagram: a Data Source feeds Kafka Source Connect, which writes into Kafka alongside the Schema Registry; Kafka Sink Connect reads from Kafka and delivers to a Data Sink.]

Ingestion integration

Streaming and batch

Scales to the application

Failover control

Accessible connector API
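To give a feel for that connector API, here is a toy source connector sketch in Java; the class names, the "target.topic" setting, and the random-number source are hypothetical and only illustrate the SourceConnector/SourceTask contract.

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.util.Collections;
import java.util.List;
import java.util.Map;

public class RandomNumberSourceConnector extends SourceConnector {
    private Map<String, String> config;

    @Override public String version() { return "0.1.0"; }

    @Override public void start(Map<String, String> props) { this.config = props; }

    @Override public Class<? extends Task> taskClass() { return RandomNumberSourceTask.class; }

    @Override public List<Map<String, String>> taskConfigs(int maxTasks) {
        // one task is enough for this toy source; real connectors split their
        // inputs (files, tables, partitions) across up to maxTasks config maps
        return Collections.singletonList(config);
    }

    @Override public void stop() { }

    @Override public ConfigDef config() {
        return new ConfigDef().define("target.topic", ConfigDef.Type.STRING,
                ConfigDef.Importance.HIGH, "Topic to write generated records to");
    }

    public static class RandomNumberSourceTask extends SourceTask {
        private String topic;

        @Override public String version() { return "0.1.0"; }

        @Override public void start(Map<String, String> props) { topic = props.get("target.topic"); }

        @Override public List<SourceRecord> poll() throws InterruptedException {
            Thread.sleep(1000);
            double value = Math.random();
            // the source partition/offset maps let Connect track progress for this source
            return Collections.singletonList(new SourceRecord(
                    Collections.singletonMap("source", "random"),
                    Collections.singletonMap("position", System.currentTimeMillis()),
                    topic, Schema.FLOAT64_SCHEMA, value));
        }

        @Override public void stop() { }
    }
}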

Worker Mode

[Diagram: Connect runs inside workers, which embed an ordinary Kafka producer and Kafka consumer; each worker executes its tasks on threads. Workers run in one of two modes: standalone or distributed.]

[Diagram: a connector splits its inputs (Input 1, Input 2, ..., Input N) into tasks, and the tasks are executed by a worker.]
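For standalone mode, a sketch of the two properties files the worker takes on the command line, following the file-source example shipped with Kafka; the file paths and topic name are placeholders.

# connect-standalone.properties (worker config)
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.file.filename=/tmp/connect.offsets

# connect-file-source.properties (connector config)
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/test.txt
topic=connect-test

Started with bin/connect-standalone.sh connect-standalone.properties connect-file-source.properties, the single worker runs the connector and its tasks in one process.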

[Diagram: the producer send path from section 2 (serializer, partitioner, per-partition batches, retries, broker metadata) shown again; a Connect worker publishes records through this same producer.]

Worker settings to ensure no data loss:

request.timeout.ms=MAX_VALUE

retries=MAX_VALUE

max.in.flight.requests.per.connection=1

acks=all

max.block.ms=MAX_VALUE
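One way these could be applied, assuming a Connect worker properties file where producer.-prefixed settings are passed through to the embedded producer, with MAX_VALUE spelled out as the maximum of each setting's type:

producer.acks=all
producer.retries=2147483647
producer.max.in.flight.requests.per.connection=1
producer.request.timeout.ms=2147483647
producer.max.block.ms=9223372036854775807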

[Diagram: distributed mode. Workers 1, 2, and 3 share a common worker config, and source (connector) configs are submitted to the cluster. The connectors' tasks are spread across the workers: Connector 1 has Task 1 (partitions 1,2), Task 2 (partitions 3,4), and Task 3 (partitions 5,6); Connector 2 has Task 1 (partitions 1,2) and Task 2 (partitions 3,4).]

Standalone worker: simple; one worker runs N connectors/tasks.

Distributed worker: scalability, fault tolerance, connectors & tasks shared across workers.
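A sketch of a distributed worker config; the topic names and group id are placeholders. In this mode connectors are not given on the command line but submitted to the workers' REST API (by default on port 8083).

# distributed worker config (the same file on every worker in the cluster)
bootstrap.servers=localhost:9092
group.id=connect-cluster                 # workers sharing this id form one Connect cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter

# connector configs, offsets and task status are stored in Kafka topics
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status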

Schema Registry

[Diagram: a worker task's sendRecords() converts SourceRecords into ProducerRecords for the embedded producer. The Schema Registry stores schemas per subject (tied to a topic) and per version.]

REST API · Serializers · Formatters · Multiple versions of the same schema

[Diagram: when a record is produced, its schema is registered and only the schema id travels with the data; the consumer's deserializer uses that id to fetch the schema back and decode the stream.]
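A sketch of a producer using the registry through Confluent's Avro serializer; the schema, topic, and URLs are placeholders, and the io.confluent.kafka.serializers.KafkaAvroSerializer class plus the schema.registry.url setting come from the Confluent platform rather than from these slides.

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroMetricsProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // the Avro serializer registers the schema and embeds its id in each message
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");   // placeholder registry

        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Metric\","
                + "\"fields\":[{\"name\":\"value\",\"type\":\"double\"}]}");
        GenericRecord metric = new GenericData.Record(schema);
        metric.put("value", 0.42);

        try (Producer<String, Object> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("metrics", "host-1", metric));
        }
    }
}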

THANKS! Any questions?

@datiobd · rbravo@datiobd.com · datio-big-data
