papers we love realtime at facebook

1

Papers We Love:

Realtime Data Processing at Facebook

Gwen Shapira, Confluent Inc.

3

Published in 2016 (!)

4

What kind of paper is this?

5

This is NOT the one true architecture.

Please don’t cargo-cult this paper

6

A few real-time systems at Facebook

• Chorus – aggregate trends

• Realtime feedback for mobile app developers

• Page analytics – likes, engagement…

• Offload CPU-intensive dashboard queries

10

Looking for trending topics in 5-minute windows
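
A minimal sketch of the windowing idea, assuming events arrive as (timestamp, topic) pairs and that "trending" simply means "most frequent within the window". The function names and the sample data are made up for illustration; the real pipeline is certainly more involved.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 5 * 60  # 5-minute tumbling windows


def window_start(ts: int) -> int:
    """Round an event timestamp (in seconds) down to the start of its window."""
    return ts - ts % WINDOW_SECONDS


def top_topics(events, k=3):
    """Count topics per window and return the k most frequent in each window.

    `events` is an iterable of (timestamp_seconds, topic) pairs.
    """
    counts = defaultdict(Counter)
    for ts, topic in events:
        counts[window_start(ts)][topic] += 1
    return {start: c.most_common(k) for start, c in sorted(counts.items())}


# Hypothetical sample data: two windows, starting at 0 s and 300 s.
events = [(0, "cats"), (10, "cats"), (200, "dogs"),
          (310, "dogs"), (320, "dogs"), (599, "cats")]
result = top_topics(events, k=1)
```

Here `result` maps each window start to its top topic: "cats" for the first window and "dogs" for the second.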

11

The Tofu & Potatoes of the paper:

Design Decisions

12

/ Kafka Streams

+ exactly once

13

Decision #1 – Language Paradigm

• Declarative (SQL) – easy & limited

• Functional

• Procedural (C++, Java, Python) – most flexibility, control, performance. Longer dev cycle.
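
To make the trade-off concrete, here is a small sketch of the same aggregation written both ways. It uses Python's built-in `sqlite3` purely as a stand-in for a SQL engine; the table, column names, and data are assumptions for illustration, not anything from the paper.

```python
import sqlite3

# Hypothetical input: (topic, likes) pairs.
events = [("cats", 3), ("dogs", 1), ("cats", 2)]

# Declarative: say what you want and let the engine decide how to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (topic TEXT, likes INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)
sql_totals = dict(
    conn.execute("SELECT topic, SUM(likes) FROM events GROUP BY topic")
)
conn.close()

# Procedural: spell out every step yourself (more control, more code).
loop_totals = {}
for topic, likes in events:
    loop_totals[topic] = loop_totals.get(topic, 0) + likes
```

Both versions produce the same totals. The SQL one is shorter but can only express what SQL can express; the loop can do anything, at the cost of writing and maintaining it.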

15

Decision #2 – Data Transfer

• RPC (MillWheel, Flink, Spark Streaming)

  • All about speed

• Message-forwarding broker (Heron)

  • Applies back-pressure, multiplexing

• Persistent stream storage (Samza, Kafka’s Streams API)

  • Most reliable

  • Decouples processors
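
A toy sketch of why persistent stream storage decouples processors: the store only keeps an append-only log, and every reader tracks its own offset. `StreamStore` and `Consumer` are hypothetical names for illustration, not the API of Scribe, Kafka, or any real system.

```python
class StreamStore:
    """Toy append-only log; a stand-in for a persistent stream store."""

    def __init__(self):
        self._log = []

    def append(self, record):
        self._log.append(record)

    def read(self, offset, max_records=10):
        """Return records starting at `offset`; the store tracks no readers."""
        return self._log[offset:offset + max_records]


class Consumer:
    """Each consumer owns its offset, so consumers never block each other."""

    def __init__(self, store):
        self.store = store
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.store.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


store = StreamStore()
for i in range(5):
    store.append(f"event-{i}")

fast, slow = Consumer(store), Consumer(store)
fast_seen = fast.poll()            # reads everything available
slow_seen = slow.poll(2)           # a slow reader takes only two records
replayed = Consumer(store).poll()  # a new or restarted reader starts at offset 0
```

The slow reader does not hold back the fast one, and a restarted reader can replay from the beginning. With RPC there is no such buffer between the two sides.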

17

Love Song to Scribe

Independent stream processing nodes

And storing inputs / outputs

Made everything great

19

Decision #3 – Processing Semantics

Facebook Verdict: It depends on requirements

• Ranker writes to an idempotent system – at least once

• Scuba can tolerate lost data but cannot handle duplicates – at most once

• … Exactly once is REALLY HARD and requires transactions
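
The difference between the first two comes down to ordering: do you emit output before or after saving the checkpoint? A simplified simulation, assuming a single processor that crashes once and restarts from its last saved checkpoint (the `run` function and its parameters are hypothetical):

```python
class Crash(Exception):
    """Simulated processor failure."""


def run(events, mode, crash_at=None):
    """Process events, crashing once at index `crash_at`, then resume."""
    output, checkpoint = [], 0
    crashed = False
    while checkpoint < len(events):
        i = checkpoint
        try:
            if mode == "at_least_once":
                output.append(events[i])   # 1. emit output
                if i == crash_at and not crashed:
                    raise Crash
                checkpoint = i + 1         # 2. save checkpoint
            else:  # "at_most_once"
                checkpoint = i + 1         # 1. save checkpoint
                if i == crash_at and not crashed:
                    raise Crash
                output.append(events[i])   # 2. emit output
        except Crash:
            crashed = True                 # restart from the saved checkpoint
    return output


events = ["a", "b", "c"]
at_least = run(events, "at_least_once", crash_at=1)
at_most = run(events, "at_most_once", crash_at=1)
```

With the crash between the two steps, at-least-once emits "b" twice (fine if the sink is idempotent) and at-most-once drops "b" (fine if the sink tolerates loss).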

20

Don’t miss the side-note on side-effects

• Exactly once means writing output + offsets to a transactional system

• This takes time

• Why just wait when you can deserialize? And maybe do other stateless stuff?
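
A sketch of the first bullet, assuming the output and the stream offset live in the same transactional store. SQLite stands in for that store here; the table names, the `process_batch` function, and the `fail` flag used to simulate a crash are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counts (topic TEXT PRIMARY KEY, n INTEGER NOT NULL)")
conn.execute("CREATE TABLE offsets (stream TEXT PRIMARY KEY, next_offset INTEGER NOT NULL)")
conn.execute("INSERT INTO offsets VALUES ('events', 0)")
conn.commit()


def process_batch(conn, log, batch_size=2, fail=False):
    """Apply one batch and advance the offset in a single transaction."""
    (start,) = conn.execute(
        "SELECT next_offset FROM offsets WHERE stream = 'events'").fetchone()
    batch = log[start:start + batch_size]
    if not batch:
        return False
    with conn:  # commits on success, rolls back if an exception escapes
        for topic in batch:
            conn.execute("INSERT OR IGNORE INTO counts VALUES (?, 0)", (topic,))
            conn.execute("UPDATE counts SET n = n + 1 WHERE topic = ?", (topic,))
        if fail:
            raise RuntimeError("simulated crash before commit")
        conn.execute(
            "UPDATE offsets SET next_offset = ? WHERE stream = 'events'",
            (start + len(batch),))
    return True


log = ["cats", "dogs", "cats"]
try:
    process_batch(conn, log, fail=True)
except RuntimeError:
    pass  # rolled back: neither the counts nor the offset changed

while process_batch(conn, log):
    pass

totals = dict(conn.execute("SELECT topic, n FROM counts"))
(final_offset,) = conn.execute("SELECT next_offset FROM offsets").fetchone()
```

Because counts and offset commit together, the failed attempt leaves no trace and the retry does not double count. The sketch does not show the overlap trick from the last bullet: while a commit is in flight, the processor could already be deserializing the next batch, since that work has no side effects.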

21

Decision #4 – State Saving

• In-memory state with replication (Old VoltDB)

  • Requires lots of hardware and network

• Local database (Samza, Kafka Streams API)

• Remote database (MillWheel)

• Upstream (i.e. replay everything on failure)

• Global consistent snapshot (Flink)

Facebook Verdict: It depends

Rhode Island Alaska

23

Best Part of the Paper – by far

How to efficiently work with state in remote DB?
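
One answer, as I read the paper: when updates are associative (counts, sums, and similar), aggregate locally and send append-only merges instead of doing a read-modify-write per event. The sketch below is a simplification under that assumption; `RemoteDB` and its `merge` operator are hypothetical stand-ins, and the round-trip counter exists only to make the difference visible.

```python
from collections import Counter


class RemoteDB:
    """Hypothetical remote store that counts round trips."""

    def __init__(self):
        self.data = {}
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1
        return self.data.get(key, 0)

    def put(self, key, value):
        self.round_trips += 1
        self.data[key] = value

    def merge(self, key, delta):
        """Server-side merge: add `delta` to the stored value, no client read."""
        self.round_trips += 1
        self.data[key] = self.data.get(key, 0) + delta


def read_modify_write(db, events):
    for key in events:
        db.put(key, db.get(key) + 1)  # two round trips per event


def append_only(db, events):
    partial = Counter(events)         # aggregate locally first
    for key, delta in partial.items():
        db.merge(key, delta)          # one round trip per distinct key


events = ["a", "b", "a", "a"]
rmw_db, merge_db = RemoteDB(), RemoteDB()
read_modify_write(rmw_db, events)
append_only(merge_db, events)
```

Both stores end up with the same counts, but the append-only version needs 2 round trips here instead of 8. This only works because addition is associative, so partial results can be combined in any grouping.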

24

Decision #5 – Reprocessing

• Stream only – requires long retention in the stream store

• Maintain both batch and stream systems

• Develop systems that can run in streams and batch (Flink, Spark)

Facebook Verdict:

SQL runs everywhere

And binary generation FTW

26

Applications – Or a whirlwind tour of good patterns

One example:

27

Lessons Learned!

The biggest win is pipelines composed of independent processors

• Mixing multiple systems let us move fast

• High level abstractions let us improve implementation

• Ease of debugging – Independent nodes and ability to replay

• Ease of deployment – Puma as-a-service

• Ease of monitoring – Lag is the most important metric. Everything is instrumented out of the box.

• In the future – auto-scale based on lag
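
A small sketch of what "lag as the key metric" could look like, assuming lag is measured as the gap between the newest event in the stream and the last event a processor has handled. The scaling rule is a made-up proportional policy for illustration, not something the paper specifies.

```python
import math


def processing_lag(newest_event_ts: float, last_processed_ts: float) -> float:
    """Seconds between the newest event written and the last one processed."""
    return max(0.0, newest_event_ts - last_processed_ts)


def suggested_workers(current: int, lag: float,
                      target_lag: float = 60.0, max_workers: int = 32) -> int:
    """Hypothetical policy: grow the worker count in proportion to excess lag."""
    if lag <= target_lag:
        return current
    return min(max_workers, math.ceil(current * lag / target_lag))


lag = processing_lag(newest_event_ts=1000.0, last_processed_ts=820.0)
workers = suggested_workers(current=4, lag=lag)
```

With 180 seconds of lag against a 60-second target, this policy would go from 4 to 12 workers. A real auto-scaler would also need to scale down and avoid flapping.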

28

Thank You!
