Papers We Love: Realtime Data Processing at Facebook

Gwen Shapira, Confluent Inc.

Upload: gwen-chen-shapira

Post on 28-Jan-2018

Page 1: Papers we love   realtime at facebook

Papers We Love:

Realtime Data Processing at Facebook

Gwen Shapira, Confluent Inc.

Page 3: Papers we love   realtime at facebook

Published in 2016 (!)

Page 4: Papers we love   realtime at facebook

What kind of paper is this?

Page 5: Papers we love   realtime at facebook

This is NOT the one true architecture.

Please don’t cargo-cult this paper

Page 6: Papers we love   realtime at facebook

A few real-time systems at Facebook

• Chorus – aggregate trends

• Realtime feedback for mobile app developers

• Page analytics – likes, engagement…

• Offload CPU-intensive dashboard queries

Page 10: Papers we love   realtime at facebook

Looking for trending topics in 5-minute windows
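
A minimal sketch of what counting trends in 5-minute windows can look like. It assumes tumbling (non-overlapping) windows and a hypothetical event shape of (epoch seconds, topic); the slide does not say how the real system defines its windows or ranks topics.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 5 * 60  # 5-minute tumbling windows, as on the slide


def window_start(timestamp: int) -> int:
    """Return the start (epoch seconds) of the window containing timestamp."""
    return timestamp - (timestamp % WINDOW_SECONDS)


def trending_by_window(events, top_n=2):
    """Count topics per window and return the top_n topics for each window.

    events is an iterable of (epoch_seconds, topic) pairs (hypothetical schema).
    """
    counts = defaultdict(Counter)
    for timestamp, topic in events:
        counts[window_start(timestamp)][topic] += 1
    return {start: c.most_common(top_n) for start, c in sorted(counts.items())}


events = [
    (0, "soccer"), (10, "soccer"), (200, "election"),
    (299, "soccer"), (300, "election"), (450, "election"), (599, "weather"),
]
result = trending_by_window(events)
```

Events at 0 to 299 seconds land in the first window and events at 300 to 599 in the second, so each window gets its own ranking.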

Page 11: Papers we love   realtime at facebook

The Tofu & Potatoes of the paper:

Design Decisions

Page 12: Papers we love   realtime at facebook

/ KafkaStreams

+ exactly once

Page 13: Papers we love   realtime at facebook

Decision #1 – Language Paradigm

• Declarative (SQL) – easy & limited

• Functional

• Procedural (C++, Java, Python) – most flexibility, control, performance. Longer dev cycle.
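
To make the three paradigms concrete, here is a small sketch that computes the same per-kind count three ways. The event data and names are made up for illustration, and sqlite3 merely stands in for a declarative engine; none of this is taken from the paper.

```python
import sqlite3
from functools import reduce

events = [("like", 1), ("comment", 1), ("like", 1), ("share", 1), ("like", 1)]

# Declarative: state what result is wanted and let the engine decide how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (kind TEXT, n INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)
declarative = dict(
    conn.execute("SELECT kind, SUM(n) FROM events GROUP BY kind").fetchall()
)
conn.close()


# Functional: fold the stream with a side-effect-free function.
def add_event(acc, event):
    kind, n = event
    return {**acc, kind: acc.get(kind, 0) + n}


functional = reduce(add_event, events, {})

# Procedural: explicit loop and mutable state; most control, most code to own.
procedural = {}
for kind, n in events:
    if kind not in procedural:
        procedural[kind] = 0
    procedural[kind] += n
```

All three produce the same counts; the difference is how much of the "how" the author has to write and maintain.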

Page 15: Papers we love   realtime at facebook

Decision #2: Data Transfer

• RPC (MillWheel, Flink, Spark Streaming)

  • All about speed

• Message-forwarding broker (Heron)

  • Applies back-pressure, multiplexes

• Persistent stream storage (Samza, Kafka’s Streams API)

  • Most reliable

  • Decouples processors
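
A rough sketch of why persistent stream storage decouples processors: each reader keeps its own offset into an append-only log, so a slow or restarted reader does not hold anyone else back. `StreamLog` and `Consumer` are hypothetical toy classes, not the API of Scribe, Kafka, or any other system named here.

```python
class StreamLog:
    """A toy append-only log standing in for a persistent stream store."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        return self._records[offset:]


class Consumer:
    """Each consumer tracks its own offset, so readers never block each other."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        records = self.log.read_from(self.offset)
        self.offset += len(records)
        return records


log = StreamLog()
fast, slow = Consumer(log), Consumer(log)

log.append("a")
log.append("b")
first_batch = fast.poll()   # the fast reader keeps up
log.append("c")
second_batch = fast.poll()
catch_up = slow.poll()      # a slow (or restarted) reader catches up later
```

The writer never waits for either reader, which is the trade the slide describes: more latency than direct RPC in exchange for reliability and independence.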

Page 17: Papers we love   realtime at facebook

Love Song to Scribe

Independent stream processing nodes

And storing inputs / outputs

Made everything great

Page 18: Papers we love   realtime at facebook

Decision #3 – Processing Semantics

Facebook Verdict: It depends on requirements

• Ranker writes to an idempotent system – at least once

• Scuba can lose data but cannot handle duplicates – at most once

• … Exactly once is REALLY HARD and requires transactions
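
The difference between at-least-once and at-most-once comes down to ordering: is the checkpoint saved before or after the output is emitted? The sketch below simulates a crash between those two steps. The function and parameter names are invented for illustration.

```python
def run(events, checkpoint_first, crash_at_event):
    """Process events, 'crash' once at a chosen event, then restart.

    checkpoint_first=True saves the offset before emitting (at most once);
    checkpoint_first=False emits before saving the offset (at least once).
    The crash hits between the two steps for events[crash_at_event].
    """
    output = []
    saved_offset = 0  # durable checkpoint; survives the crash

    def process(start, crash_at=None):
        nonlocal saved_offset
        for i in range(start, len(events)):
            if checkpoint_first:
                saved_offset = i + 1
                if i == crash_at:
                    return  # crashed after checkpoint, before output
                output.append(events[i])
            else:
                output.append(events[i])
                if i == crash_at:
                    return  # crashed after output, before checkpoint
                saved_offset = i + 1

    process(0, crash_at=crash_at_event)
    process(saved_offset)  # restart from the last durable checkpoint
    return output


events = ["e0", "e1", "e2"]
at_most_once = run(events, checkpoint_first=True, crash_at_event=1)
at_least_once = run(events, checkpoint_first=False, crash_at_event=1)
```

With checkpoint-first, "e1" is lost; with output-first, "e1" is emitted twice. An idempotent sink makes the duplicate harmless, which is why the choice depends on the consumer.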

Page 20: Papers we love   realtime at facebook

Don’t miss the side-note on side-effects

• Exactly once means writing output + offsets to a transactional system

• This takes time

• Why just wait when you can deserialize? And maybe do other stateless stuff?

Page 21: Papers we love   realtime at facebook

Decision #4 – State Saving

• In-memory state with replication (old VoltDB)

  • Requires lots of hardware and network

• Local database (Samza, Kafka Streams API)

• Remote database (MillWheel)

• Upstream (i.e. replay everything on failure)

• Global consistent snapshot (Flink)
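
Two of the options above can be contrasted in a few lines: rebuilding state by replaying every input ("upstream"), versus restoring a saved snapshot and replaying only what came after it. This is a toy sketch with made-up names; real systems differ in where the snapshot lives and how it is kept consistent.

```python
from collections import Counter


def apply(state, event):
    """Fold one event into the state (here: a running count per key)."""
    state[event] += 1


def full_replay(log):
    """'Upstream' style recovery: rebuild state by replaying every input."""
    state = Counter()
    for event in log:
        apply(state, event)
    return state


def restore_from_snapshot(snapshot, snapshot_offset, log):
    """Snapshot style recovery: load saved state, replay only what came after."""
    state = Counter(snapshot)
    for event in log[snapshot_offset:]:
        apply(state, event)
    return state


log = ["x", "y", "x", "z", "x", "y"]

# Take a snapshot after the first four events, as a checkpointing job might.
snapshot_offset = 4
snapshot = dict(full_replay(log[:snapshot_offset]))

recovered_fast = restore_from_snapshot(snapshot, snapshot_offset, log)
recovered_slow = full_replay(log)
```

Both paths arrive at the same state; the snapshot path does less work at recovery time but needs somewhere durable to keep the snapshot.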

Facebook Verdict: It depends

Rhode Island vs. Alaska

Page 23: Papers we love   realtime at facebook

Best Part of the Paper – by far

How to efficiently work with state in remote DB?
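
The slide text does not preserve the answer. One approach, sketched here under the assumption that the aggregate is associative with an identity element (for example, counts or sums), is to combine updates locally and send only partial results for the database to merge, instead of a read-modify-write per event. `RemoteDB` and its `merge` operation are hypothetical stand-ins, not a real client API.

```python
from collections import Counter


class RemoteDB:
    """Stand-in for a remote key-value store; counts round trips."""

    def __init__(self):
        self.data = {}
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1
        return self.data.get(key, 0)

    def put(self, key, value):
        self.round_trips += 1
        self.data[key] = value

    def merge(self, key, delta):
        """Server-side merge (hypothetical operator): add delta to the value."""
        self.round_trips += 1
        self.data[key] = self.data.get(key, 0) + delta


def read_modify_write(db, events):
    for key in events:
        db.put(key, db.get(key) + 1)  # two round trips per event


def merge_partials(db, events):
    partial = Counter(events)         # combine locally first
    for key, delta in partial.items():
        db.merge(key, delta)          # one round trip per distinct key


events = ["a", "b", "a", "a", "b", "c"]
naive, batched = RemoteDB(), RemoteDB()
read_modify_write(naive, events)
merge_partials(batched, events)
```

Both stores end with the same counts, but the merged version needs one round trip per distinct key rather than two per event. This only works when the order of combining partial results does not matter.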

Page 24: Papers we love   realtime at facebook

Decision #5 - Reprocessing

• Stream only – requires long retention in the stream store

• Maintain both batch and stream systems

• Develop systems that can run in streams and batch (Flink, Spark)

Facebook Verdict:

SQL runs everywhere

And binary generation FTW
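
A small sketch of the "SQL runs everywhere" idea: one query definition, executed both over stored history and over chunks as they arrive. sqlite3 stands in for both engines here, and the helper names are invented; this is not how the systems in the paper are wired together.

```python
import sqlite3

# One query definition, shared by both execution modes.
QUERY = "SELECT topic, COUNT(*) FROM events GROUP BY topic ORDER BY topic"


def run_query(rows):
    """Load rows into a scratch table and run the shared query over them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (topic TEXT)")
    conn.executemany("INSERT INTO events VALUES (?)", [(r,) for r in rows])
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


def stream_mode(source, window_size):
    """Apply the query to each fixed-size chunk as it arrives."""
    window = []
    for row in source:
        window.append(row)
        if len(window) == window_size:
            yield run_query(window)
            window = []
    if window:
        yield run_query(window)


history = ["soccer", "election", "soccer", "weather"]

batch_result = run_query(history)                     # reprocess stored data
stream_results = list(stream_mode(iter(history), 2))  # process as it arrives
```

The point of the design is that the business logic (the query) is written once, so reprocessing old data does not require maintaining a second implementation.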

Page 26: Papers we love   realtime at facebook

Applications – Or a whirlwind tour of good patterns

One example:

Page 27: Papers we love   realtime at facebook

Lessons Learned!

The biggest win is pipelines composed of independent processors

• Mixing multiple systems let us move fast

• High level abstractions let us improve implementation

• Ease of debugging – Independent nodes and ability to replay

• Ease of deployment – Puma as-a-service

• Ease of monitoring – Lag is the most important metric. Everything is instrumented out of the box.

• In the future – auto-scale based on lag
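
As an illustration of lag as a metric and of scaling on it, here is a sketch that measures lag as the gap between the newest offset in the stream and the last offset the processor committed. The scaling rule, thresholds, and function names are assumptions for the example, not the policy described in the talk.

```python
import math


def consumer_lag(latest_offset: int, committed_offset: int) -> int:
    """Records written to the stream but not yet processed."""
    return max(latest_offset - committed_offset, 0)


def desired_workers(lag: int, current_workers: int,
                    target_lag_per_worker: int, max_workers: int) -> int:
    """Pick a worker count that keeps per-worker lag under the target.

    This sketch never scales below the current count; a scale-down policy
    is left out on purpose.
    """
    needed = math.ceil(lag / target_lag_per_worker) if lag else current_workers
    return min(max(needed, current_workers), max_workers)


lag = consumer_lag(latest_offset=10_500, committed_offset=4_500)
workers = desired_workers(lag, current_workers=2,
                          target_lag_per_worker=2_000, max_workers=8)
```

Lag is easy to compute when inputs live in a persistent stream with offsets, which is one more thing the independent-processor design makes simple.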

Page 28: Papers we love   realtime at facebook

Thank You!