lambda-less stream processing - linked in

Lambda-less Stream Processing @Scale in LinkedIn Yi Pan (Apache Samza PMC/Committer) Kartik Paramasivam (Mgr - Streams Infra) June, 2016

Upload: yi-pan

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Lambda-less stream processing - linked in

Lambda-less Stream Processing @Scale in LinkedIn

Yi Pan (Apache Samza PMC/Committer)

Kartik Paramasivam (Mgr - Streams Infra)

June, 2016

Page 2: Lambda-less stream processing - linked in

Agenda
• Rise of Stream Processing Applications
• Some Hard Problems in Stream Processing
  – Data Accuracy
  – Reprocessing
• Conclusion

Page 3: Lambda-less stream processing - linked in

Newsfeed

Page 4: Lambda-less stream processing - linked in

Cyber-security

Page 5: Lambda-less stream processing - linked in

Internet of Things

Page 6: Lambda-less stream processing - linked in

Agenda
• Rise of Stream Processing Applications
• Some Hard Problems in Stream Processing
  – Data Accuracy
  – Reprocessing
• Conclusion

Page 7: Lambda-less stream processing - linked in

Data Accuracy
• Can Stream Processing generate accurate results?
  – Yes, but it is not trivial.

Page 8: Lambda-less stream processing - linked in

Case Study

Ads HTML

1:00pm

AdViewEvents

AdQuality processor

Page 9: Lambda-less stream processing - linked in

Case Study

Ads HTML

1:01pm

AdViewEvents

AdQuality processor

Click! AdClickEvents

Page 10: Lambda-less stream processing - linked in

Case Study

Ads HTML

1:01pm

AdViewEvents

AdQuality processor

Click! AdClickEvents

Did AdClick happen within 2min of AdView?

– Yes: Good Ad
– No: Bad Ad
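The decision on this slide can be sketched as a small function. This is only an illustration in Python, not the AdQuality processor's actual code: the name `classify_ad`, the use of `datetime` values, and the treatment of a missing click as "Bad Ad" are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

CLICK_WINDOW = timedelta(minutes=2)  # "within 2min of AdView" from the slide


def classify_ad(view_time: datetime, click_time: Optional[datetime]) -> str:
    """Return "Good Ad" if a click followed the view within 2 minutes."""
    if click_time is None:
        return "Bad Ad"  # assumption: no click at all counts as a bad ad
    delay = click_time - view_time
    if timedelta(0) <= delay <= CLICK_WINDOW:
        return "Good Ad"
    return "Bad Ad"
```

The rule itself is trivial; the following pages are about why the two timestamps may not be available together when the processor needs them.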

Page 11: Lambda-less stream processing - linked in

Delays in Event Stream

[Diagram: DATACENTER 1 and DATACENTER 2 each contain a Services Tier, Kafka, and an Ad Quality Processor (Samza). Kafka is labeled "Mirrored". An AdViewEvent enters through the LB.]

Page 12: Lambda-less stream processing - linked in

Delays in Event Stream

Late Arrival

[Diagram: DATACENTER 1 and DATACENTER 2 each contain a Services Tier, Kafka, and Real Time Processing (Samza). Kafka is labeled "Mirrored". An AdClick Event enters through the LB.]

Page 13: Lambda-less stream processing - linked in

Delays in Event Stream

Out of Order Arrival

[Diagram: DATACENTER 1 and DATACENTER 2 each contain a Services Tier, Kafka, and Real Time Processing (Samza). Kafka is labeled "Mirrored". An AdClick Event enters through the LB.]

Page 14: Lambda-less stream processing - linked in

Lambda at LinkedIn

Real Time Processing

(Samza)

Batch Processing

(Hadoop/Spark)

Voldemort R/O

Processing

Bulk upload

Espresso

Services Tier

Ingestion Serving

Clients (browser, devices, ...)

Kafka

Page 15: Lambda-less stream processing - linked in

Data Accuracy - with Lambda

• Basic Assumption: Batch jobs have full data-set
• But, how about edges?

Smaller batch size == more edges!

[Figure: x-axis is system time, with ticks at 10:00, 11:00, 12:00, 13:00.]

Graph kudos to Stream Processing 101 from Tyler Akidau (https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101)
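To make the "edges" point concrete, the sketch below counts how many view/click pairs are cut in two by a batch boundary, so that no single batch sees both events. The data is made up for illustration (one view per minute, each clicked one minute later), and `split_pairs` is a hypothetical helper, not part of any system named in the talk.

```python
def split_pairs(pairs, batch_minutes):
    """Count (view, click) pairs whose two events fall into different batches.

    Times are minutes since midnight; a batch covers
    [k * batch_minutes, (k + 1) * batch_minutes).
    """
    return sum(
        1
        for view_min, click_min in pairs
        if view_min // batch_minutes != click_min // batch_minutes
    )


# Hypothetical data: one view per minute from 10:00 to 12:59,
# each followed by a click one minute later.
pairs = [(m, m + 1) for m in range(10 * 60, 13 * 60)]

hourly = split_pairs(pairs, 60)   # pairs cut by hourly batch boundaries
quarter = split_pairs(pairs, 15)  # pairs cut by 15-minute batch boundaries
```

With this data the hourly batches split 3 pairs and the 15-minute batches split 12, which is the "smaller batch size == more edges" effect.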

Page 16: Lambda-less stream processing - linked in

Fixing Lambda

Real Time Processing

(Samza)

Batch Processing

(Hadoop/Spark)

Voldemort R/O

Processing

Bulk upload

Espresso

Services Tier

Ingestion Serving

Clients (browser, devices, ...)

Kafka

Kafka Audit

Check Safe Start Time

Page 17: Lambda-less stream processing - linked in

Observation
• Data Accuracy is still very hard with Lambda
  – Additional system (e.g. Kafka Audit) has to be used to safely start the batch jobs
• Duplication in Online/Offline system:
  – Development cost
  – Operational overhead
  – Maintenance overhead
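One way to picture a "safe start time" check is to compare, per time bucket, how many messages were produced with how many have been delivered, and to stop at the first bucket that is still incomplete. The sketch below assumes that count-based model; the function name, the bucket format, and the counts are hypothetical and do not describe Kafka Audit's real interface.

```python
def safe_start_time(produced, delivered, buckets):
    """Return the end of the last bucket (in order) whose counts all match.

    `produced` and `delivered` map a bucket's start time to a message count.
    Returns None when even the first bucket is incomplete.
    """
    safe = None
    for start, end in buckets:
        if delivered.get(start, 0) < produced.get(start, 0):
            break  # data for this bucket is still in flight; stop here
        safe = end
    return safe
```

A batch job gated this way only processes data up to the returned time, which is the extra moving part the slide refers to.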

Page 18: Lambda-less stream processing - linked in

Going Lambda-less
• Handle late arrivals and out-of-order arrivals
• Eventually correct results
  – Compute results at the end of the ‘window’.
  – Re-compute when events arrive late.
• Influenced by “Google MillWheel”

Page 19: Lambda-less stream processing - linked in

Going Lambda-less

[Diagram: AdViewEvent and AdClickEvent streams, Kafka, and the AdQuality processor. Event timestamps shown: 1:00pm, 1:01pm, 1:01pm, 1:02pm, 1:02pm; 1:00pm, 1:02pm.]

window(“1:00pm”, “2min”)

Window output is computed at the end of the window (2min after the window is created).
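A minimal sketch of "output at the end of the window", under stated assumptions: times are minutes since midnight, windows are fixed 2-minute buckets, and the window output is just an event count rather than a real view/click join. The class `WindowedCounter` is hypothetical and is not Samza's window operator.

```python
from collections import defaultdict

WINDOW_MIN = 2  # window length in minutes, as in window("1:00pm", "2min")


class WindowedCounter:
    """Groups events into fixed 2-minute windows keyed by window start."""

    def __init__(self):
        self.windows = defaultdict(list)  # window start -> events
        self.emitted = {}                 # window start -> emitted output

    def window_start(self, event_min):
        return event_min - event_min % WINDOW_MIN

    def add(self, event_min, event):
        self.windows[self.window_start(event_min)].append(event)

    def on_timer(self, now_min):
        """Emit output for every window that has ended and not been emitted."""
        out = {}
        for start, events in self.windows.items():
            if start + WINDOW_MIN <= now_min and start not in self.emitted:
                self.emitted[start] = len(events)
                out[start] = len(events)
        return out
```

For example, after adding events at 780, 781, and 781 (1:00pm and 1:01pm), calling `on_timer(782)` emits the 1:00pm window once; a second call at the same time emits nothing.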

Page 20: Lambda-less stream processing - linked in

Handling ‘late arrival’

[Diagram: AdViewEvent and AdClickEvent streams, Kafka, and the AdQuality processor. Event timestamps shown: 1:00pm, 1:01pm, 1:01pm, 1:02pm, 1:02pm; 1:00pm, 1:02pm; plus a 1:01pm event marked "Late-arrival".]

Re-compute window(“1:00pm”, “2min”)
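The re-compute step can be sketched as follows, assuming window state is kept after the first output so that a late event can trigger a corrected result. Events are `(arrival_min, event_min)` pairs in minutes since midnight and the output is a simple count; both are simplifications, and `process` is a hypothetical name.

```python
def process(events, window_min=2):
    """Replay (arrival_min, event_min) pairs and return emitted outputs in order.

    Each output is (window_start, count, kind), where kind is "initial" for
    the output at the end of the window and "recompute" for a correction
    triggered by a late arrival.
    """
    counts = {}     # window start -> events seen so far (kept after closing)
    closed = set()  # windows whose initial output has been emitted
    outputs = []
    for arrival_min, event_min in sorted(events):
        # Close every open window whose end has passed by this arrival time.
        for start in sorted(counts):
            if start not in closed and start + window_min <= arrival_min:
                closed.add(start)
                outputs.append((start, counts[start], "initial"))
        start = event_min - event_min % window_min
        counts[start] = counts.get(start, 0) + 1
        if start in closed:
            # Late arrival: the window already produced output, so correct it.
            outputs.append((start, counts[start], "recompute"))
    return outputs
```

The cost of this approach is that window state has to be retained after the window ends, which is why the local store discussed later matters.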

Page 21: Lambda-less stream processing - linked in

Handling ‘out of order arrival’

[Diagram: AdViewEvent and AdClickEvent streams, Kafka, and the AdQuality processor. Event timestamps shown: 1:01pm, 1:02pm; 1:00pm, 1:02pm.]

null join result in window(“1:00pm”, “2min”)

Page 22: Lambda-less stream processing - linked in

Handling ‘out of order arrival’

[Diagram: AdViewEvent and AdClickEvent streams, Kafka, and the AdQuality processor. Event timestamps shown: 1:01pm, 1:02pm, 1:00pm, 1:01pm; 1:00pm, 1:02pm. One event is marked "Out-of-order arrival".]

Re-compute window(“1:00pm”, “2min”)
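One way to get from a null join result to a re-computed one is to buffer both sides of the join and attempt the join whenever either side arrives. The sketch below assumes events carry an ad id and a time in minutes; the class `ViewClickJoin` and its method names are hypothetical.

```python
class ViewClickJoin:
    """Joins AdView and AdClick events by ad id, whichever side arrives first."""

    def __init__(self, max_gap_min=2):
        self.max_gap_min = max_gap_min
        self.views = {}   # ad_id -> view time (minutes)
        self.clicks = {}  # ad_id -> click time (minutes)

    def _join(self, ad_id):
        view, click = self.views.get(ad_id), self.clicks.get(ad_id)
        if view is None or click is None:
            return None  # null join result: the other side has not arrived yet
        return (ad_id, 0 <= click - view <= self.max_gap_min)

    def on_view(self, ad_id, minute):
        self.views[ad_id] = minute
        return self._join(ad_id)

    def on_click(self, ad_id, minute):
        self.clicks[ad_id] = minute
        return self._join(ad_id)
```

If the click from 1:01pm is processed before the view from 1:00pm, `on_click` returns `None` and the later `on_view` call produces the joined result.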

Page 23: Lambda-less stream processing - linked in

Samza based Solution

[Diagram: AdView and AdClicks streams in Kafka; a Samza Job with SamzaContainer-0 and SamzaContainer-1 running Task1, Task2, and Task3; Changelog in Kafka.]

Events are saved into a RocksDB-based local message store, which is backed up durably in Kafka.
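The idea of a local store backed by a changelog can be sketched with plain Python containers. Here a dict stands in for the RocksDB store and a list stands in for the Kafka changelog topic; the class and method names are hypothetical and this is not Samza's storage API.

```python
class ChangelogBackedStore:
    """In-memory key-value store that appends every write to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # stand-in for a Kafka changelog topic
        self.data = {}              # stand-in for the local RocksDB store

    def put(self, key, value):
        self.changelog.append((key, value))  # record the durable copy first
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state by replaying the changelog from the start."""
        store = cls(changelog)
        for key, value in changelog:
            store.data[key] = value
        return store
```

Reads and writes stay local, and a task that restarts elsewhere can rebuild its state by replaying the changelog.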

Page 24: Lambda-less stream processing - linked in

Performance

[Diagram: AdView and AdClicks streams in Kafka; a Samza Job with SamzaContainer-0 and SamzaContainer-1 running Task1, Task2, and Task3; Changelog in Kafka.]

Performance of Samza’s local RocksDB store:
- 1.1 Million TPS (read/write) on single machine (ssd)
- Largest production job has 1.5 Terabyte of local state

Page 25: Lambda-less stream processing - linked in

Agenda
• Rise of Stream Processing Applications
• Some Hard Problems in Stream Processing
  – Data Accuracy
  – Reprocessing
• Conclusion

Page 26: Lambda-less stream processing - linked in

Reprocessing
• What is reprocessing?
  – Process events that happened in the past.

Page 27: Lambda-less stream processing - linked in

Case Study : Title Standardization

LinkedIn Profile

change ‘Title’:

Before: Architect
After: Chief Technology Nerd

Title Standardizer

Search Ads ….

Page 28: Lambda-less stream processing - linked in

Title Standardizer - Implementation

output

Member Database (Espresso)

Profile Updates

(Samza) Title-Standardizer

Machine Learning model

Kafka

Databus

Page 29: Lambda-less stream processing - linked in

Reprocessing - dealing with bugs

output

Member Database (Espresso)

Profile Updates

(Samza) Title-Standardizer

Kafka

Databus

rewind 4 hours

Machine Learning model
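Rewinding by a time span means finding the offset to restart from. A minimal sketch, assuming each message has a timestamp and that timestamps do not decrease with offset; `rewind_offset` is a hypothetical helper, not a Kafka or Databus call.

```python
from bisect import bisect_left


def rewind_offset(timestamps, now, rewind_seconds):
    """Return the first offset whose timestamp is >= now - rewind_seconds.

    `timestamps[i]` is the timestamp (seconds) of the message at offset i and
    is assumed to be non-decreasing.
    """
    return bisect_left(timestamps, now - rewind_seconds)
```

For "rewind 4 hours", `rewind_seconds` would be `4 * 3600`; the job then resumes from the returned offset with the fixed code.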

Page 30: Lambda-less stream processing - linked in

Reprocessing - entire Dataset

output

Member Database (Espresso)

Profile Updates

(Samza) Title-Standardizer

Kafka

Databus

Bootstrap

Backup

Database Backup (NFS)

set offset=0

Machine Learning model (NEW)

Page 31: Lambda-less stream processing - linked in

Reprocessing - entire Dataset

Profile Updates

(Samza) Title-Standardizer

(Samza) Title-Standardizer

Bootstrap

Backup

Machine Learning model (NEW)

output

Kafka

Databus

Databus

Member Database (Espresso)

Database Backup (NFS)

set offset=0

Page 32: Lambda-less stream processing - linked in

Reprocessing - entire Dataset

Profile Updates

(Samza) Title-Standardizer

(Samza) Title-Standardizer

Bootstrap

Backup

Machine Learning model (NEW)

output

Kafka

Databus

Databus

(Samza) Merge and Store Results
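The slide names a "Merge and Store Results" stage but does not say how the merge decides between an online result and a reprocessed one. One plausible rule, shown here purely as an assumption, is to keep whichever result was derived from the newer version of the profile; `merge_result` and `source_version` are hypothetical names.

```python
def merge_result(store, member_id, title, source_version):
    """Write `title` unless the store already holds a newer source version.

    Returns True if the store was updated.
    """
    current = store.get(member_id)
    if current is not None and current[1] > source_version:
        return False  # a result from a newer profile update is already stored
    store[member_id] = (title, source_version)
    return True
```

With a rule like this, a slow reprocessing run cannot overwrite a result that the online job produced from a more recent profile change.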

Page 33: Lambda-less stream processing - linked in

Reprocessing - Caveats
• Stream processors are fast. They can DoS the system if you reprocess
  – Control max-concurrency of your job
  – Quotas for Kafka, Databases
• Reprocessing a 100 TB source?
  – Capacity?
  – Saturation of NICs, top-of-rack switches
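A back-of-the-envelope estimate shows why a 100 TB source raises capacity questions. Only the 100 TB figure comes from the slide; the link speed (one 10 Gbit/s NIC) and the share of it given to reprocessing (50%) are assumptions chosen for illustration.

```python
def transfer_hours(source_bytes, link_bits_per_sec, utilization):
    """Hours needed to read `source_bytes` over one link at a given utilization."""
    seconds = source_bytes * 8 / (link_bits_per_sec * utilization)
    return seconds / 3600


TB = 10**12
# Assumed figures: one 10 Gbit/s NIC, half of it given to reprocessing.
hours = transfer_hours(100 * TB, 10 * 10**9, 0.5)
```

Under these assumptions a single machine needs roughly 44 hours just to move the data, so the work has to be spread out, which is where NIC and switch saturation come in.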

Page 34: Lambda-less stream processing - linked in

Reprocessing larger datasets

Profile Updates

(Samza) Title-Standardizer

Machine Learning model

output

Kafka

Databus

(Samza) Merge and Store Results

Database Dump in HDFS

(Samza) Title-Standardizer

ML Model in HDFS

Hadoop

Page 35: Lambda-less stream processing - linked in

Experimentation

Database Dump in HDFS

(Samza) Title-Standardizer

Hadoop

ML Model in HDFS

Output in HDFS

● Offline experimentation before pushing the logic online
  ○ Most datasets are already available in Hadoop (at LinkedIn)
  ○ Fast iteration with minimum impact to production
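The point of running the same Title-Standardizer offline and online is that the processing logic is written once. A minimal sketch of that structure, with a plain dict standing in for the machine learning model; all function names and the example mapping are hypothetical.

```python
def standardize_title(raw_title, model):
    """Map a free-form title to a standard one; unknown titles pass through."""
    cleaned = raw_title.strip()
    return model.get(cleaned.lower(), cleaned)


def run_online(update_stream, model, emit):
    """Stream path: handle profile updates one at a time as they arrive."""
    for member_id, raw_title in update_stream:
        emit(member_id, standardize_title(raw_title, model))


def run_offline(dump_rows, model):
    """Batch path: apply the same function to a full database dump."""
    return {
        member_id: standardize_title(raw_title, model)
        for member_id, raw_title in dump_rows
    }
```

Both paths call `standardize_title`, so a change tested against the offline dump is the same code that later runs on the live stream.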

Page 36: Lambda-less stream processing - linked in

Conclusion
1. It is possible to avoid code duplication (hot/cold path) while supporting
  – Accuracy
  – Reprocessing
2. Some Lambda-related problems still linger when reprocessing entire datasets
  – e.g. merging online/reprocessing results

Page 37: Lambda-less stream processing - linked in

References
• MillWheel: http://research.google.com/pubs/pub41378.html
• DataFlow: http://research.google.com/pubs/pub41378.html
• Samza: http://samza.apache.org/
• Window Operator in Samza: https://issues.apache.org/jira/browse/SAMZA-552
• Lambda Architecture: https://www.manning.com/books/big-data
• Stream Processing 101: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
• Stream Processing 102: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

Page 38: Lambda-less stream processing - linked in

Thank You!