the forces that disrupt netflixthe forces that disrupt netflix haley tucker nov. 7, 2016 acrobat...

74
The Forces That Disrupt Netflix Nov. 7, 2016 Haley Tucker

Upload: others

Post on 20-May-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

The Forces That Disrupt Netflix

Nov. 7, 2016Haley Tucker

ACROBAT

FLEA

our world

parallel world

# A distributed system is

one in which the failure

of a computer you didn't

even know existed can

render your own computer

unusable.

--Leslie Lamport

ACROBAT

FLEA

our world

parallel world

computing

ENGINEER

PROLOGUEDISTRIBUTED SYSTEMS

Proxy/Routing

DECOMPOSING THE MONOLITHDevices

Netflix

ServiceNetflix

ServiceEdge

Service

Traffic

Netflix

Playback

Service

Netflix

Playback

Service

Edge

Service

Edge

Service

Edge

Service

Playback

Service

Traffic

Notes on Distributed Systems for Young Bloods

# Distributed

systems are

different because

they fail often.

--Jeff Hodges

TABLE OF CONTENTS

CHAPTER 1: THE WEIRD DATA IN THE CATALOG

• Metadata impacts on availability

CHAPTER 2: THE VANISHING OF CRITICAL SERVICES

• Crashing services and cascading failures

CHAPTER 3: THE THROTTLE

• Latency spikes and the impact of fallbacks

FORCES AT WORK

Whoops, something went wrong…Netflix Streaming ErrorWe’re having trouble playing this title right now. Please try again later or select a different title.

CHAPTER ONETHE WEIRD DATA IN THE CATALOG

VIDEO METADATAARCHITECTURE

Video

Metadata

Service

Amazon S3

Source

System

Source

System

Netflix

ServicesNetflix

ServicesNetflix

ServicesNetflix

ServicesNetflix

Service

Traffic

Amazon S3

Netflix

Playback

Service

{

String msg = “This should never

happen!”;

throw new IllegalStateException(msg);

}

MITIGATIONBLAST RADIUS

Explosion, CC BY 2.0, Andrew Kuznetsov 2008, Flikr

Amazon WS Global Infrastructure

Amazon WS Global Infrastructure

STAGGERED ROLLOUT

Pager

Dia

gnosis

?

Canary, CC BY 2.0, Steve P2008 2014, Flikr

PREVENTIONCANARIES

TRADITIONAL CANARY

Canary

(New Code)

Baseline

(Old Code)

TrafficTraffic

Video

Metadata

Service

Amazon S3

Netflix

ServicesNetflix

ServicesNetflix

ServicesNetflix

ServicesNetflix

Service

Source

System

Source

System

Traffic

CONSISTENCYVALID STATE TRANSITIONS

DATA CANARY

Netflix

ServicesNetflix

ServicesNetflix

ServicesNetflix

Services

Video

Metadata

Service

Amazon S3

Source

System

Source

System

Netflix

Service

Netflix

Data

Canary

Service

Data

Tester

Netflix

Service

Traffic

Verify consistency prior to

applying state changes.

…one tool is a data canary.

CHAPTER TWOTHE VANISHING OF CRITICAL SERVICES

# A distributed system is

one in which the failure

of a computer you didn't

even know existed can

render your own computer

unusable.

--Leslie Lamport

Proxy/Routing

Devices

LOG DATA

Log Data Service

Traffic

Cassandra

Playback ServiceNetflix

Playback

Service

Netflix

Playback

Service

Edge

Service

Edge

Service

Edge

Service

Playback

Service

Traffic

Proxy/Routing

Devices

Proxy

Log Data Service

Traffic

Cassandra

Playback ServiceNetflix

Playback

Service

Netflix

Playback

Service

Edge

Service

Edge

Service

Edge

Service

Playback

Service

Traffic

Proxy/Routing

Devices

CASCADING FAILURE

Log Data Service

Traffic

Cassandra

Playback ServiceNetflix

Playback

Service

Netflix

Playback

Service

Edge

Service

Edge

Service

Edge

Service

Playback

Service

Traffic

{

throw new OutOfMemoryError();

}

Log Data Service

Cassandra

Playback Service

PREVENTIONMANAGING RESOURCE CONSTRAINTS

Whatever you ask, CC BY-SA 2.0, Kreg Steppe 2008, Flikr

1

Keep Only

Dependencies

which are

Necessary

SO MANY JARS!!

DEV

FAVOR IMMUTABILITY

Playback Service

TEST

Playback Service

PROD

Playback Service

4

try {

remoteService.call();

} catch( Throwable t ){

//Oops!

System.exit(1);

}

Log Data

Service

Cassandra

Playback

Service

Proxy/Routing

Traffic

It's Electric, CC BY ND 2.0, Alan Hochberg 2008, Flikr

MITIGATIONCIRCUIT BREAKERS

Wrecking Ball in Building, CC BY 2.0, Jason Eppink 2008, Flikr

FAILURE TESTING

Proxy/Routing

Devices

FAILURE TESTING

Log Data Service

Traffic

Cassandra

Playback ServiceAutomating Chaos

Experiments in

Production

by Ali Basiri

Applying Failure

Testing Research

@Netflix

by Kolton Andrus and

Peter Alvaro

Manage resource constraints by

reducing surface area.

Leverage circuit breakers and

rigorously test failures.

CHAPTER THREETHE THROTTLE

Proxy/Routing

Devices

PLAYBACK ARCHITECTURE

Edge Service Edge Service Edge Service

Playback Service

Traffic Traffic

URL Service

NETFLIX CLIENT JARS

Playback Service

URL

Service

URL Client

Circuit-breakers and Fallbacks

Metrics Retries and Timeouts

RPCService Discovery

Playback

Service

Traffic Concurrent

Requests

Throttled Requests

(HTTP 503)

THROTTLING

}

System.gc();

}

URL

Service

Playback Service

Edge Service

Proxy/Routing

Traffic

NETFLIX CLIENT JARS

Playback Service

URL

Service

URL Client

Circuit-breakers

MetricsRetries and

Timeouts

RPCService Discovery

Heavy

Fallback

FALLBACK TESTING

With 100% Fallback,

CPU held at 90%

15 RPSNo fallback, CPU held at 90%

58 RPSSiege: https://github.com/JoeDog/siege

SELECTING FALLBACKS

CACHESTATIC

FALLBACK

SERVICE

URL

Service

Playback Service

Edge Service

Proxy/Routing

Traffic

}

return Response

.status(503)

.build();

}

REQUEST BUCKETING

NON-CRITICALExperience or

Performance

Impact

CRITICALCustomer

Streaming

Impact

Fire Buckets at Oakworth Statione, CC BY 2.0, Tim Greene 2015, Flikr

APPLICATION SHARDING

Non-Critical

Playback Service

Proxy/RoutingDevices

Edge Service Edge Service Edge Service

Traffic Traffic

URL Service

Critical Playback

Service

Non-Critical URL

Service

APPLICATION SHARDING

Non-Critical

Playback Service

Proxy/RoutingDevices

Edge Service Edge Service Edge Service

Traffic Traffic

Critical Playback

Service

URL ServiceNon-Critical URL

Service

No heavy fallbacks!!

Fallbacks should be light and fast.

Shard your application based on

operational characteristics.

EPILOGUEKEY TAKEAWAYS

KEY TAKEAWAYSCHAPTER 1: THE WEIRD DATA IN THE CATALOG

• Verify consistency prior to applying state changes.

• One tool is a data canary.

CHAPTER 2: THE VANISHING OF CRITICAL SERVICES

• Manage resource constraints by reducing surface area.

• Leverage circuit breakers and rigorously test failures.

CHAPTER 3: THE THROTTLE

• No heavy fallbacks!! Fallbacks should be light and fast.

• Shard your application based on operational characteristics.

The unexpected will happen.

Plan to fail.

PARTING THOUGHTDISTRIBUTED SYSTEMS

SOCIAL

Questions?

Haley Tucker