Keystone - Leverage Big Data 2016

The Netflix way to deal with real-time data: How we built a 1T/day stream processing cloud platform in a year

Upload: peter-bakas

Post on 14-Apr-2017


The Netflix way to deal with real-time data

How we built a 1T/day stream processing cloud platform in a year

What should I expect?

- Keystone Season 1 - Who, What, How and Why
- Keystone Season 2 - Preview Trailer

1. The Who?

Hello!

I am Peter Bakas. I lead the Real-Time Data Infrastructure team @ Netflix.

You can find me at @peter_bakas

2. The What?

Publish + Collect + Process + Move Data

1,000,000,000,000 events processed every day

Whoa! That’s a big number.

By the numbers

Daily averages
- 700B unique events ingested
- 1T events processed
- 1.4 PB

Peak
- 1T unique events ingested/day
- 12.5M/sec
- 35 GB/sec
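To put these figures on a per-second footing, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes the 1.4 PB figure is the daily data volume, that PB and GB are decimal units, and that the peak event rate and peak throughput were measured at the same moment. None of that is stated on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Figures quoted on the slide.
events_processed_per_day = 1_000_000_000_000  # 1T events processed (daily average)
bytes_per_day = 1.4e15                        # 1.4 PB (assumed: per day, decimal PB)
peak_events_per_sec = 12_500_000              # 12.5M/sec
peak_bytes_per_sec = 35e9                     # 35 GB/sec (assumed: decimal GB)

# Derived values (not on the slide).
avg_events_per_sec = events_processed_per_day / SECONDS_PER_DAY
avg_bytes_per_sec = bytes_per_day / SECONDS_PER_DAY
peak_bytes_per_event = peak_bytes_per_sec / peak_events_per_sec

print(f"average events/sec:   {avg_events_per_sec / 1e6:.2f}M")
print(f"average throughput:   {avg_bytes_per_sec / 1e9:.1f} GB/sec")
print(f"peak bytes per event: {peak_bytes_per_event:.0f}")
```

Under those assumptions the daily average works out to roughly 11.6M events/sec, close to the quoted 12.5M/sec peak.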

Trending on Netflix

- 1/2014: 80B/day
- 1/2015: 300B/day
- 1/2016: 1T/day
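The year-over-year growth implied by these three data points can be worked out directly. A small sketch follows; the variable names are illustrative, and the only inputs are the three daily volumes from the slide.

```python
# Daily event volume each January, as quoted on the slide.
daily_events = {2014: 80e9, 2015: 300e9, 2016: 1e12}

years = sorted(daily_events)

# Growth factor between each pair of consecutive years.
growth = {
    f"{prev}->{curr}": daily_events[curr] / daily_events[prev]
    for prev, curr in zip(years, years[1:])
}

# Growth factor across the whole two-year span.
overall = daily_events[years[-1]] / daily_events[years[0]]

for span, factor in growth.items():
    print(f"{span}: {factor:.2f}x")
print(f"{years[0]}->{years[-1]}: {overall:.1f}x")
```

That is roughly 3.75x in the first year, 3.33x in the second, and 12.5x over the two years.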

Growth has its season

In the beginning - Chukwa

Q4 2014 - Chukwa/Suro

Q4 2015 - Keystone

[Architecture diagram. Components: Event Producer, HTTP Proxy, Fronting Kafka, Internal Routing Service, Consumer Kafka, Control Plane, Stream Consumers, EMR]

Kafka


Keystone Kafka Footprint

                       Fronting Kafka    Consumer Kafka
Number of clusters     24                8
Number of instances    3000+             900+
Retention period       8 to 24 hrs       2 to 4 hrs


Keystone Internal Routing Service

[Diagram. Components: Keystone Internal Routing Service, Checkpointing Cluster]

Keystone Internal Routing Service Footprint

                        S3      ElasticSearch    Consumer Kafka
Number of containers    7000    1500             4500

Keystone avg end-to-end metrics

                        S3       ElasticSearch    Consumer Kafka
Avg end-to-end          1 sec    13 sec           800 ms
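The slides name three sinks for the routing service (S3, ElasticSearch, Consumer Kafka) and mention a checkpointing cluster, but do not show how routing is implemented. The following is only a minimal sketch of the general idea of fanning events out to several sinks, each with its own checkpoint. The `Router` class, its methods, and the in-memory lists standing in for Kafka and the sinks are all hypothetical and are not Keystone code.

```python
from collections import defaultdict


class Router:
    """Hypothetical fan-out router: each sink is fed independently
    and keeps its own checkpoint (next offset to read)."""

    def __init__(self, sinks):
        self.sinks = sinks                   # sink name -> callable(event)
        self.checkpoints = defaultdict(int)  # sink name -> next offset

    def run_once(self, sink_name, log):
        """Deliver unread events from `log` to one sink and advance
        that sink's checkpoint. Returns the number delivered."""
        write = self.sinks[sink_name]
        start = self.checkpoints[sink_name]
        for offset in range(start, len(log)):
            write(log[offset])
            # Checkpoint after each successful write so a restart
            # resumes from the first undelivered event.
            self.checkpoints[sink_name] = offset + 1
        return self.checkpoints[sink_name] - start


# In-memory stand-ins for the three sinks named on the slide.
delivered = {"s3": [], "elasticsearch": [], "consumer_kafka": []}
router = Router({name: bucket.append for name, bucket in delivered.items()})

# In-memory stand-in for a fronting Kafka topic.
fronting_log = [{"id": 1}, {"id": 2}, {"id": 3}]
for name in delivered:
    router.run_once(name, fronting_log)
```

Keeping a separate checkpoint per sink means a slow sink (for example, the one with the 13 sec average above) does not hold back the others. That is a design choice of this sketch, not something the slides confirm.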

3. The How?

Netflix Culture

Freedom and Responsibility

“It may well be the most important document to ever come out of the Valley.”

Sheryl Sandberg, COO @ Facebook

What does culture have to do with how?

Sounds easy

Build Team

Build Product

A true story

Keystone went live 10/27/15

2 days later...

80% “loss” over a 6-hour period


Lessons learned

- There are times when things can go wrong… and no turning back
- Reduce complexity
- Minimize blast radius
- Find a way to start over fresh

Failover

- Cold standby Kafka cluster with different instance type
- Different ZooKeeper cluster with no state
- Fully automated

Time is of the essence

Failover as fast as 5 minutes

Fully Automated Failover
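The slides describe the failover setup (a cold standby cluster, fully automated) but not its mechanics. As a rough illustration of what "fully automated" could mean, here is a sketch of a controller that switches traffic to a standby after several consecutive failed health checks. The `Cluster` and `FailoverController` names, the failure threshold, and the health flag are all assumptions made for this example. It does not model Kafka or ZooKeeper.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical stand-in for a Kafka cluster."""
    name: str
    healthy: bool = True


class FailoverController:
    """Hypothetical controller: move traffic to a cold standby after
    `max_failures` consecutive failed health checks."""

    def __init__(self, primary, standby, max_failures=3):
        self.active = primary
        self.standby = standby
        self.max_failures = max_failures
        self.failures = 0

    def check(self):
        """Run one health check and return the cluster to send to."""
        if self.active.healthy:
            self.failures = 0
            return self.active
        self.failures += 1
        if self.failures >= self.max_failures and self.standby is not None:
            # Start over fresh: switch to the standby rather than
            # trying to repair the failing cluster in place.
            self.active, self.standby = self.standby, None
            self.failures = 0
        return self.active


primary = Cluster("fronting-primary")
standby = Cluster("fronting-standby")
controller = FailoverController(primary, standby)

primary.healthy = False  # simulate an outage
for _ in range(3):
    active = controller.check()
print(active.name)
```

Requiring several consecutive failures before switching is one way to avoid flapping on a single bad check. The threshold of 3 here is arbitrary.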

Best Practices

- Full Automation
- Self Healing
- Kafka Kong

4. The Why?

“We didn’t do anything wrong, but somehow, we lost...”

Stephen Elop, Nokia CEO

Global Launch 1/6/16

125,000,000 hrs/day

That’s a lot of hours!

37% of North America internet traffic @ peak!

81,000,000 members

and a lot of members

If you don’t change, you will be eliminated from the competition

5. Coming Soon

Our philosophy

Create Duplo® blocks: let reusability drive new value

Evolution

Keystone Management

Keystone Messaging

Keystone Stream Processing

Keystone Messaging

Keystone
- Unified event publishing, collection, routing for batch and stream processing
- 85% of data volume

Ad-hoc Messaging
- 15% of data volume

Keystone Stream Processing

Consumers weary of
- Complexity of self-managed infrastructure
- Multiple runtimes across different platforms

Consumers want
- Simple unified model/API/UI/system

Keystone Management

Simple and intuitive interface to manage all Keystone services

thanks!

Any questions?

You can find me at @peter_bakas

[email protected]

Credits

Special thanks to all the people who made and released these awesome resources for free:
- Presentation template by SlidesCarnival
- Photographs by Unsplash