stream processing with big data: knowledgent big data palooza meet-up

Upload: knowledgent

Post on 27-Jun-2015

DESCRIPTION

On September 17, 2014, at the NJ Big Data Palooza MeetUp, Kishore Veleti, Big Data Engineer at Knowledgent, presented on Stream Processing with Big Data using Apache Kafka. This presentation contains the content he covered during the event, including an overview of Kafka terminology and processes.

TRANSCRIPT

Page 1: Stream Processing with Big Data: Knowledgent Big Data Palooza Meet-Up

©2014 Knowledgent Group Inc. All Rights Reserved

Stream Processing with Big Data

Learn Apache Kafka

Kishore Veleti

Big Data Engineer

Page 2: Stream Processing with Big Data: Knowledgent Big Data Palooza Meet-Up

About Me

• Big Data Engineer at Knowledgent

• Background in enterprise application development using the Hadoop stack, Java, and PHP

• Worked on healthcare, banking, and social media applications

• Passionate about sharing knowledge

Page 3: Stream Processing with Big Data: Knowledgent Big Data Palooza Meet-Up

Tutorial

Page 4: Stream Processing with Big Data: Knowledgent Big Data Palooza Meet-Up

We will discuss:

• What is Apache Kafka?

• Apache Kafka Terminology

• Apache Kafka: About Topic & Partition

• Apache Kafka hands-on

Page 5: Stream Processing with Big Data: Knowledgent Big Data Palooza Meet-Up

What is Apache Kafka?

• Apache Kafka is a publish-subscribe messaging system implemented as a distributed commit log

• It is written in Java and Scala

• Built by LinkedIn to process activity stream data from their website

Page 6: Stream Processing with Big Data: Knowledgent Big Data Palooza Meet-Up

What is Apache Kafka?

• Messages are delivered in near real time

• A message can have many subscribers

• Kafka persists messages to disk

• Messages are retained for a configured time period

• Subscribers (clients) store the state of their own reads

• Messages are easy to replay
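The retention, client-side read state, and replay points above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `RetainedLog` and `Subscriber` are hypothetical names invented for this example, not part of any Kafka client API, and timestamps are passed in by hand so the behavior is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class RetainedLog:
    """Toy in-memory stand-in for a Kafka log (hypothetical, not Kafka's API)."""
    retention_seconds: float
    _entries: list = field(default_factory=list)  # (offset, timestamp, payload)
    _next_offset: int = 0

    def append(self, payload, now):
        """Store a message under the next offset and return that offset."""
        offset = self._next_offset
        self._entries.append((offset, now, payload))
        self._next_offset += 1
        return offset

    def expire(self, now):
        """Drop messages whose timestamp is older than the retention period."""
        cutoff = now - self.retention_seconds
        self._entries = [e for e in self._entries if e[1] >= cutoff]

    def read_from(self, offset):
        """Return (offset, payload) pairs at or after the given offset."""
        return [(o, p) for (o, _, p) in self._entries if o >= offset]


class Subscriber:
    """Each subscriber keeps its own read position; the log does not track it."""

    def __init__(self, log):
        self.log = log
        self.position = 0

    def poll(self):
        """Read everything from the stored position and advance past it."""
        batch = self.log.read_from(self.position)
        if batch:
            self.position = batch[-1][0] + 1
        return [payload for _, payload in batch]

    def rewind(self, offset):
        """Replaying is just moving the stored position backwards."""
        self.position = offset


log = RetainedLog(retention_seconds=60)
for t, event in enumerate(["login", "click", "logout"]):
    log.append(event, now=t)

fast, slow = Subscriber(log), Subscriber(log)
first_read = fast.poll()   # fast reads all three messages
fast.rewind(1)             # replay from offset 1
replayed = fast.poll()
slow_read = slow.poll()    # slow is unaffected by fast's position

log.expire(now=61.5)       # messages stamped before 1.5 are dropped
remaining = log.read_from(0)
```

Because the read position lives in the subscriber, adding another subscriber or replaying old messages costs the log nothing extra, which is the design point the bullets describe.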

Page 7: Stream Processing with Big Data: Knowledgent Big Data Palooza Meet-Up

Apache Kafka Terminology

• Message: A unit of data to send

• Topic: Kafka maintains messages in categories called “topics”

• Partition: A logical division of a topic

• Producer: An API to publish messages to a Kafka topic

• Broker: A Kafka server

• Cluster: A Kafka cluster comprises one or more brokers

• Consumer: An API to consume published messages for further processing

• Replication: Kafka replicates the log for each partition across servers

Page 8: Stream Processing with Big Data: Knowledgent Big Data Palooza Meet-Up

Apache Kafka Terminology & Big Picture

[Diagram labels: Message, Topic, Partition, Producer, Broker, Consumer]

At a high level, producers send messages over the network to the Kafka cluster, which in turn serves them to consumers.

Page 9: Stream Processing with Big Data: Knowledgent Big Data Palooza Meet-Up

Apache Kafka Terminology & Big Picture

Let’s do a hands-on Kafka exercise using what we have learned so far.

Page 10: Stream Processing with Big Data: Knowledgent Big Data Palooza Meet-Up

Apache Kafka: About Topic and Partition

For each topic, Kafka maintains a partitioned log.

Each partition is an ordered, immutable sequence of messages that is continually appended to.

Each message in a partition is assigned a sequential ID number called the offset.

[Figure: a topic's log shown as Partition 1, Partition 2, and Partition 3, with a "Writes" label]
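The offset assignment described on this slide can be sketched with plain Python lists. The `Partition` and `Topic` classes and the topic name `page-views` are hypothetical and exist only for illustration; partitions are numbered from 0 here, whereas the slide's figure counts from 1.

```python
class Partition:
    """Toy append-only log; a message's offset is its index in the list."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append to the end of the log and return the assigned offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset):
        """Look up a message by its offset within this partition."""
        return self._messages[offset]

    def __len__(self):
        return len(self._messages)


class Topic:
    """A named group of independent partitions (illustrative only)."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]

    def append(self, partition_id, message):
        return self.partitions[partition_id].append(message)


topic = Topic("page-views", num_partitions=3)
offsets = [
    topic.append(0, "a"),  # first message in partition 0
    topic.append(0, "b"),  # second message in partition 0
    topic.append(2, "c"),  # first message in partition 2
    topic.append(0, "d"),  # third message in partition 0
]
```

Offsets are sequential within a single partition only, so ordering is defined per partition rather than across the whole topic.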

Page 11: Stream Processing with Big Data: Knowledgent Big Data Palooza Meet-Up

Apache Kafka: About Topic and Partition

In Kafka, a Producer is an API to publish messages to a topic.
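As a rough illustration of one decision a producer makes when publishing to a topic, the sketch below picks a partition for each message. It assumes a common scheme: messages with a key are hashed so that the same key always lands in the same partition, and messages without a key are spread round-robin. `ToyProducer` is a hypothetical class, and `zlib.crc32` is a stand-in hash chosen for this example, not the hash a real Kafka client uses.

```python
import zlib
from itertools import count


class ToyProducer:
    """Illustrative partition selection only; nothing is sent over a network."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = count()
        self.sent = []  # (partition, key, value) records, kept for inspection

    def choose_partition(self, key):
        """Same key -> same partition; no key -> rotate through partitions."""
        if key is None:
            return next(self._round_robin) % self.num_partitions
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def send(self, value, key=None):
        partition = self.choose_partition(key)
        self.sent.append((partition, key, value))
        return partition


producer = ToyProducer(num_partitions=3)
p1 = producer.send("view", key="user-42")
p2 = producer.send("click", key="user-42")        # same key, same partition
unkeyed = [producer.send("ping") for _ in range(3)]  # spread across partitions
```

Keeping one key in one partition means all messages for that key are read back in the order they were written.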

Page 12: Stream Processing with Big Data: Knowledgent Big Data Palooza Meet-Up

Apache Kafka: About Topic and Partition

In Kafka, a Consumer is an API to consume messages from topics.

Page 13: Stream Processing with Big Data: Knowledgent Big Data Palooza Meet-Up

Apache Kafka Terminology & Big Picture

Let’s do a hands-on Kafka exercise using what we have learned so far.

Page 14: Stream Processing with Big Data: Knowledgent Big Data Palooza Meet-Up

Apache Kafka Use Cases

• Trading systems: identifying risk in real time

• Change data capture: capturing changed data into a data lake environment

• Online gaming: identifying the top scorers of a game
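The online gaming use case can be sketched as a small aggregation over score events. In a real deployment the events would be consumed from a Kafka topic; here a hard-coded list stands in for that stream, and the function `top_scorers`, the player names, and the scores are all made up for illustration.

```python
import heapq
from collections import defaultdict


def top_scorers(score_events, n):
    """Sum points per player and return the n highest (player, total) pairs."""
    totals = defaultdict(int)
    for player, points in score_events:
        totals[player] += points
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])


# Stand-in for messages consumed from a hypothetical score topic.
events = [("ana", 50), ("bo", 30), ("ana", 20), ("cy", 90), ("bo", 45)]
leaders = top_scorers(events, n=2)
```

With these sample events the totals are ana 70, bo 75, and cy 90, so `leaders` holds cy followed by bo.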

Page 15: Stream Processing with Big Data: Knowledgent Big Data Palooza Meet-Up

Thank you!

Questions?