Apache Samza* Reliable Stream Processing atop Apache Kafka and Yarn Sriram Subramanian Me on Linkedin Me on twitter - @sriramsub1 * Incubating

Post on 23-Feb-2016

TRANSCRIPT

Page 1: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Apache Samza*: Reliable Stream Processing atop Apache Kafka and YARN

Sriram Subramanian

Me on LinkedIn

Me on Twitter - @sriramsub1

* Incubating

Page 3: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Agenda

• Why Stream Processing?
• What is Samza’s Design?
• How is Samza’s Design Implemented?
• How can you use Samza?
• Example usage at LinkedIn

Page 4: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Why Stream Processing?

Page 5: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Response latency

0 ms

Page 6: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Response latency

RPC

Synchronous

0 ms

Page 7: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Response latency

RPC

Synchronous Later. Possibly much later.

0 ms

Page 8: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Response latency

Samza

Milliseconds to minutes

RPC

Synchronous Later. Possibly much later.

0 ms

Page 9: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Newsfeed

Ad Relevance

Page 10: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Search Index

Metrics and Monitoring

Page 11: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

What is Samza’s Design?

Page 12: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Stream A

JOB

Stream B

Stream C

Page 13: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Stream A

JOB 1

Stream B

Stream C

Stream D

JOB 2

Stream E

Stream F

JOB 3

Stream G

Page 14: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Streams

Partition 0 Partition 1 Partition 2

Page 15: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Streams

Partition 0: 1 2 3 4 5 6
Partition 1: 1 2 3 4 5
Partition 2: 1 2 3 4 5 6 7

Page 20: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Streams

Partition 0: 1 2 3 4 5 6
Partition 1: 1 2 3 4 5
Partition 2: 1 2 3 4 5 6 7

next append
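The partition picture above can be sketched as a small in-memory model: each partition is an ordered log, new messages are only ever appended at the end, and every message is addressed by its offset within the partition. This is an illustration only, not Kafka's or Samza's API. The class name `PartitionedStream` and its methods are hypothetical, and the 1-based offsets simply follow the numbering drawn on the slide.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory model of a partitioned stream (hypothetical, for illustration). */
class PartitionedStream {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedStream(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    /** Appends to the end of one partition and returns the new message's offset. */
    long append(int partition, String message) {
        List<String> log = partitions.get(partition);
        log.add(message);
        return log.size(); // offsets are 1-based here, as drawn on the slide
    }

    /** Reads the message stored at a 1-based offset within a partition. */
    String read(int partition, long offset) {
        return partitions.get(partition).get((int) (offset - 1));
    }

    /** Offset that the next append to this partition will receive. */
    long nextOffset(int partition) {
        return partitions.get(partition).size() + 1;
    }
}
```

Ordering is only defined within a partition in this model: offsets in different partitions are independent of each other.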

Page 21: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Jobs

Stream A Stream B

Task 1 Task 2 Task 3

Stream C

Page 22: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Jobs

AdViews AdClicks

Task 1 Task 2 Task 3

AdClickThroughRate

Page 23: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Tasks

AdViewsCounterTask

Partition 0 Partition 1

Ad Views - Partition 0

1234

Output Count Stream

Page 31: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Tasks

AdViewsCounterTask

Partition 0 Partition 1

1234

2

Partition 1

Checkpoint Stream

Ad Views - Partition 0

Output Count Stream
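The checkpoint slides can be read as follows: a task periodically records the offset it has reached, and after a failure it resumes from the last recorded offset rather than from the beginning. The sketch below models that idea for one task and one partition. The class `CheckpointingTask`, the `checkpointEvery` interval, and the `maxMessages` crash simulation are all assumptions made for illustration; the field stands in for the durable checkpoint stream.

```java
import java.util.List;

/** Toy model of offset checkpointing for one task and one input partition. */
class CheckpointingTask {
    // Stands in for the durable checkpoint stream: last checkpointed offset
    // (1-based, 0 means nothing has been checkpointed yet).
    private int checkpoint = 0;
    private final int checkpointEvery; // hypothetical interval, in messages

    CheckpointingTask(int checkpointEvery) {
        this.checkpointEvery = checkpointEvery;
    }

    int checkpoint() {
        return checkpoint;
    }

    /**
     * Processes the log starting just after the checkpoint. Stops after
     * maxMessages to simulate a crash. Returns how many messages were processed.
     */
    int run(List<String> log, int maxMessages) {
        int processed = 0;
        for (int offset = checkpoint + 1;
             offset <= log.size() && processed < maxMessages;
             offset++) {
            // A real task would handle log.get(offset - 1) here.
            processed++;
            if (offset % checkpointEvery == 0) {
                checkpoint = offset;
            }
        }
        return processed;
    }
}
```

With a four-message log and a checkpoint every two messages, a run that stops after three messages leaves the checkpoint at offset 2, so the next run processes offsets 3 and 4. Offset 3 is therefore handled twice, which is why a task in this style should tolerate repeated messages.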

Page 40: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Dataflow

Stream A Stream B Stream C

Stream E

Stream B

Job 1 Job 2

Stream D

Job 3

Page 42: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Stateful Processing

• Windowed Aggregation
– Counting the number of page views for each user per hour

• Stream-Stream Join
– Join a stream of ad clicks to a stream of ad views to identify the view that led to the click

• Stream-Table Join
– Join user region info to a stream of page views to create an augmented stream
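As an illustration of the windowed aggregation case above, the following sketch counts page views per user per hour by bucketing each event's timestamp into an hour number. It is a minimal in-memory version under stated assumptions: the class `HourlyPageViewCounter`, the millisecond timestamps, and the `user@hour` key format are hypothetical and not taken from the talk.

```java
import java.util.HashMap;
import java.util.Map;

/** Counts page views per user per hour (illustrative, in-memory only). */
class HourlyPageViewCounter {
    private static final long HOUR_MS = 60L * 60L * 1000L;

    // Key is "userId@hourBucket"; value is the view count within that hour.
    private final Map<String, Integer> counts = new HashMap<>();

    /** Records one view and returns the running count for that user and hour. */
    int record(String userId, long timestampMs) {
        long hourBucket = timestampMs / HOUR_MS;
        String key = userId + "@" + hourBucket;
        return counts.merge(key, 1, Integer::sum);
    }
}
```

The map in this sketch is exactly the kind of task state that the following slides discuss: it grows with the number of users and hours, and it is lost if the task dies unless it is made durable somehow.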

Page 43: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

How do people do this?

• In-memory state with checkpointing
– Periodically save out the task’s in-memory data
– As state grows, this becomes very expensive
– Some implementations checkpoint only diffs, but that adds complexity

Page 44: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

How do people do this?

• Using an external store
– Push state to an external store
– Performance suffers because of remote queries
– Lack of isolation
– Limited query capabilities

Page 45: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Stateful Tasks

Stream A

Task 1 Task 2 Task 3

Stream B

Page 47: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Stateful Tasks

Stream A

Task 1 Task 2 Task 3

Stream B Changelog Stream
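The changelog idea in this sequence of slides can be sketched as a local store that also appends every write to a log, so that a replacement task can rebuild the same state by replaying that log. This is a simplified model, not Samza's implementation: the class `ChangelogBackedStore` is hypothetical, a plain `List` stands in for the changelog stream, and a `null` value is assumed to mark a delete.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Local key-value state whose writes are mirrored to a changelog (toy model). */
class ChangelogBackedStore {
    private final Map<String, String> local = new HashMap<>();
    // Each entry is {key, value}; a null value marks a delete (an assumption).
    private final List<String[]> changelog;

    /** Restores local state by replaying every entry already in the changelog. */
    ChangelogBackedStore(List<String[]> changelog) {
        this.changelog = changelog;
        for (String[] entry : changelog) {
            apply(entry[0], entry[1]);
        }
    }

    void put(String key, String value) {
        changelog.add(new String[] {key, value});
        apply(key, value);
    }

    void delete(String key) {
        changelog.add(new String[] {key, null});
        apply(key, null);
    }

    String get(String key) {
        return local.get(key);
    }

    private void apply(String key, String value) {
        if (value == null) {
            local.remove(key);
        } else {
            local.put(key, value);
        }
    }
}
```

Reads stay local and fast, while durability comes from the log. Constructing a second store over the same changelog models a task restarting on another machine and recovering its state.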

Page 59: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Key-Value Store

• put(table_name, key, value)
• get(table_name, key)
• delete(table_name, key)
• range(table_name, key1, key2)
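To make the four operations listed above concrete, here is a minimal in-memory sketch that keeps one sorted map per table. The class name `TableStore` is hypothetical, and the half-open range semantics (key1 inclusive, key2 exclusive) are an assumption, since the slide does not say how the bounds are treated.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** In-memory sketch of the put/get/delete/range operations from the slide. */
class TableStore {
    private final Map<String, TreeMap<String, String>> tables = new HashMap<>();

    private TreeMap<String, String> table(String tableName) {
        return tables.computeIfAbsent(tableName, n -> new TreeMap<>());
    }

    void put(String tableName, String key, String value) {
        table(tableName).put(key, value);
    }

    String get(String tableName, String key) {
        return table(tableName).get(key);
    }

    void delete(String tableName, String key) {
        table(tableName).remove(key);
    }

    /** Entries with key1 <= key < key2, in key order (bounds are an assumption). */
    SortedMap<String, String> range(String tableName, String key1, String key2) {
        return new TreeMap<>(table(tableName).subMap(key1, key2));
    }
}
```

A sorted map is used so that `range` can return keys in order without scanning the whole table.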

Page 60: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

How is Samza’s Design Implemented?

Page 61: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Apache Kafka

• Persistent, reliable, distributed message queue

Page 62: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

At LinkedIn

10+ billion writes per day

172k messages per second (average)

60+ billion messages per day to real-time consumers

Page 63: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Apache Kafka

• Models streams as topics

• Each topic is partitioned and each partition is replicated

• Producer sends messages to a topic

• Messages are stored in brokers

• Consumers consume from a topic (pull from broker)
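One common way a keyed message ends up in a particular partition of a topic is by hashing its key modulo the partition count, so that all messages with the same key land in the same partition. The sketch below shows only that idea. It is not Kafka's actual partitioner, and the class `KeyPartitioner` and its use of `String.hashCode` are assumptions for illustration.

```java
/** Illustrative key-to-partition mapping (not Kafka's own partitioner). */
class KeyPartitioner {
    /** Maps a key to a partition index in the range [0, partitionCount). */
    static int partitionFor(String key, int partitionCount) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), partitionCount);
    }
}
```

Because the mapping depends only on the key and the partition count, every producer that uses it sends a given key to the same partition.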

Page 64: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

YARN: Yet Another Resource Negotiator

• Framework to run your code on a grid of machines

• Distributes our tasks across multiple machines

• Notifies our framework when a task has died

• Isolates our tasks from each other

Page 65: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Jobs

Stream A

Task 1 Task 2 Task 3

Stream B

Page 66: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Containers

Task 1 Task 2 Task 3

Stream B

Stream A

Page 67: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Containers

Stream B

Stream A

Samza Container 1 Samza Container 2
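The container slides show a job's tasks being grouped into a smaller number of containers. One simple way to do such a grouping is round-robin by task number, sketched below. The slides do not state which policy is used, so the round-robin rule, the class `ContainerAssignment`, and the 1-based task numbers are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Round-robin grouping of tasks into containers (assumed policy, for illustration). */
class ContainerAssignment {
    /** Returns, for each container, the list of task numbers (1-based) it runs. */
    static List<List<Integer>> assign(int taskCount, int containerCount) {
        List<List<Integer>> containers = new ArrayList<>();
        for (int c = 0; c < containerCount; c++) {
            containers.add(new ArrayList<>());
        }
        for (int task = 1; task <= taskCount; task++) {
            containers.get((task - 1) % containerCount).add(task);
        }
        return containers;
    }
}
```

With three tasks and two containers, this places Task 1 and Task 3 in the first container and Task 2 in the second.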

Page 68: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Containers

Samza Container 1 Samza Container 2

Page 69: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

YARN

Samza Container 1 Samza Container 2

Host 1 Host 2

Page 70: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

YARN

Samza Container 1 Samza Container 2

NodeManager NodeManager

Host 1 Host 2

Page 71: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

YARN

Samza Container 1 Samza Container 2

NodeManager NodeManager

Samza YARN AM

Host 1 Host 2

Page 72: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

YARN

Samza Container 1 Samza Container 2

NodeManager

Kafka Broker

NodeManager

Samza YARN AM

Kafka Broker

Host 1 Host 2

Page 73: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

YARN

MapReduce Container

MapReduce Container

NodeManager

HDFS

NodeManager

MapReduce YARN AM

HDFS

Host 1 Host 2

Page 74: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

YARN

Samza Container 1

NodeManager

Kafka Broker

Host 1

Stream C

Stream A

Samza Container 1 Samza Container 2

Page 78: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

YARN

Samza Container 1 Samza Container 2

NodeManager

Kafka Broker

NodeManager

Samza YARN AM

Kafka Broker

Host 1 Host 2

Page 79: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

CGroups

Samza Container 1 Samza Container 2

NodeManager

Kafka Broker

NodeManager

Samza YARN AM

Kafka Broker

Host 1 Host 2

Page 80: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

How can you use Samza?

Page 81: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

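Tasks

Partition 0

class PageKeyViewsCounterTask implements StreamTask {
  // Output stream for the counts; the system and stream names here are assumed.
  private final SystemStream countStream = new SystemStream("kafka", "page-key-counts");
  // Running view count per page key, held in memory.
  private final Map<String, AtomicInteger> pageKeyViews = new HashMap<>();

  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    GenericRecord record = (GenericRecord) envelope.getMessage();
    String pageKey = record.get("page-key").toString();
    int newCount = pageKeyViews.computeIfAbsent(pageKey, k -> new AtomicInteger()).incrementAndGet();
    collector.send(new OutgoingMessageEnvelope(countStream, pageKey, newCount));
  }
}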

Page 91: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Stateful Stream Task

public class SimpleStatefulTask implements StreamTask, InitableTask {
  private KeyValueStore<String, String> store;

  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // getStore returns Object in this API, so cast to the configured store type.
    this.store = (KeyValueStore<String, String>) context.getStore("mystore");
  }

  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    GenericRecord record = (GenericRecord) envelope.getMessage();
    // GenericRecord.get returns Object; convert the fields to String.
    String memberId = record.get("member_id").toString();
    String name = record.get("name").toString();
    System.out.println("old name: " + store.get(memberId));
    store.put(memberId, name);
  }
}

Page 95: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Example usage at LinkedIn

Page 96: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Call graph assembly

get_unread_msg_count()

get_PYMK()

get_Pulse_news()

get_relevant_ads()

get_news_updates()

Page 97: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Lots of calls == lots of machines, logs

get_unread_msg_count()

get_PYMK()

get_Pulse_news()

get_relevant_ads()

get_news_updates()

unread_msg_service_call

get_PYMK_service_call

pulse_news_service_call

ad_relevance_service_call

news_update_service_call

Page 98: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

TreeID: Unique identifier

page_view_event (123456)

unread_msg_service_call (123456)

another_service_call (123456)

silly_service_call (123456)

get_PYMK_service_call (123456)

counter_service_call (123456)

unread_msg_service_call (123456)

count_invites_service_call (123456)

count_msgs_service_call (123456)

Page 99: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

OK, now lots of streams with TreeIDs…

*_service_call → Samza job: Repartition-By-TreeID → all_service_calls (partitioned by TreeID) → Samza job: Assemble Call Graph → service_call_graphs

• Near real-time holistic view of how we’re actually serving data
• Compare day-over-day, cost, changes, outages
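The second job in this pipeline can be sketched as a grouping step: because the input is partitioned by TreeID, every service call belonging to one page view reaches the same task, which can collect the calls under that TreeID. The sketch below shows only that grouping. The class `CallGraphAssembler` and its method names are hypothetical, and it keeps a flat list of calls per tree, since the slides do not describe how parent and child calls are linked.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Groups service-call events by TreeID (illustrative sketch only). */
class CallGraphAssembler {
    private final Map<Long, List<String>> callsByTree = new HashMap<>();

    /** Records one service call observed for the given TreeID. */
    void onServiceCall(long treeId, String callName) {
        callsByTree.computeIfAbsent(treeId, id -> new ArrayList<>()).add(callName);
    }

    /** Returns the calls seen so far for a TreeID, in arrival order. */
    List<String> callsFor(long treeId) {
        return callsByTree.getOrDefault(treeId, Collections.emptyList());
    }
}
```

For example, feeding it `unread_msg_service_call` and `get_PYMK_service_call` with TreeID 123456 yields both calls under that one identifier, while calls carrying another TreeID are kept apart.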

Page 100: Apache  Samza * Reliable Stream Processing atop  Apache Kafka and Yarn

Thank you

• Quick start: bit.ly/hello-samza
• Project homepage: samza.incubator.apache.org
• Newbie issues: bit.ly/samza_newbie_issues
• Detailed Samza and YARN talk: bit.ly/samza_and_yarn
• A must-read: http://bit.ly/jay_on_logs
• Twitter: @samzastream
• Me on Twitter: @sriramsub1