the forces that disrupt netflix - qconsf · the forces that disrupt netflix haley tucker nov. 7,...

74
The Forces That Disrupt Netflix Nov. 7, 2016 Haley Tucker

Upload: others

Post on 20-May-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

The Forces That Disrupt Netflix

Nov. 7, 2016Haley Tucker

Page 2: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

ACROBAT

FLEA

our world

parallel world

Page 3: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

# A distributed system is

one in which the failure

of a computer you didn't

even know existed can

render your own computer

unusable.

--Leslie Lamport

Page 4: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

ACROBAT

FLEA

our world

parallel world

computing

ENGINEER

Page 5: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which
Page 6: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

PROLOGUEDISTRIBUTED SYSTEMS

Page 7: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which
Page 8: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Proxy/Routing

DECOMPOSING THE MONOLITHDevices

Netflix

ServiceNetflix

ServiceEdge

Service

Traffic

Netflix

Playback

Service

Netflix

Playback

Service

Edge

Service

Edge

Service

Edge

Service

Playback

Service

Traffic

Page 9: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Notes on Distributed Systems for Young Bloods

# Distributed

systems are

different because

they fail often.

--Jeff Hodges

Page 10: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

TABLE OF CONTENTS

CHAPTER 1: THE WEIRD DATA IN THE CATALOG

• Metadata impacts on availability

CHAPTER 2: THE VANISHING OF CRITICAL SERVICES

• Crashing services and cascading failures

CHAPTER 3: THE THROTTLE

• Latency spikes and the impact of fallbacks

FORCES AT WORK

Page 11: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which
Page 12: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Whoops, something went wrong…Netflix Streaming ErrorWe’re having trouble playing this title right now. Please try again later or select a different title.

Page 13: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

CHAPTER ONETHE WEIRD DATA IN THE CATALOG

Page 14: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which
Page 16: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which
Page 17: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

VIDEO METADATAARCHITECTURE

Video

Metadata

Service

Amazon S3

Source

System

Source

System

Netflix

ServicesNetflix

ServicesNetflix

ServicesNetflix

ServicesNetflix

Service

Traffic

Page 18: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Amazon S3

Netflix

Playback

Service

{

String msg = “This should never

happen!”;

throw new IllegalStateException(msg);

}

Page 19: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

MITIGATIONBLAST RADIUS

Explosion, CC BY 2.0, Andrew Kuznetsov 2008, Flikr

Page 20: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Amazon WS Global Infrastructure

Page 21: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Amazon WS Global Infrastructure

STAGGERED ROLLOUT

Page 22: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Pager

Dia

gnosis

?

Page 23: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which
Page 24: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Canary, CC BY 2.0, Steve P2008 2014, Flikr

PREVENTIONCANARIES

Page 25: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

TRADITIONAL CANARY

Canary

(New Code)

Baseline

(Old Code)

TrafficTraffic

Video

Metadata

Service

Amazon S3

Netflix

ServicesNetflix

ServicesNetflix

ServicesNetflix

ServicesNetflix

Service

Source

System

Source

System

Traffic

Page 26: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

CONSISTENCYVALID STATE TRANSITIONS

Page 27: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

DATA CANARY

Netflix

ServicesNetflix

ServicesNetflix

ServicesNetflix

Services

Video

Metadata

Service

Amazon S3

Source

System

Source

System

Netflix

Service

Netflix

Data

Canary

Service

Data

Tester

Netflix

Service

Traffic

Page 29: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Verify consistency prior to

applying state changes.

…one tool is a data canary.

Page 30: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

CHAPTER TWOTHE VANISHING OF CRITICAL SERVICES

Page 31: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which
Page 32: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which
Page 33: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

# A distributed system is

one in which the failure

of a computer you didn't

even know existed can

render your own computer

unusable.

--Leslie Lamport

Page 34: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Proxy/Routing

Devices

LOG DATA

Log Data Service

Traffic

Cassandra

Playback ServiceNetflix

Playback

Service

Netflix

Playback

Service

Edge

Service

Edge

Service

Edge

Service

Playback

Service

Traffic

Page 35: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which
Page 36: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Proxy/Routing

Devices

Proxy

Log Data Service

Traffic

Cassandra

Playback ServiceNetflix

Playback

Service

Netflix

Playback

Service

Edge

Service

Edge

Service

Edge

Service

Playback

Service

Traffic

Page 37: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which
Page 38: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Proxy/Routing

Devices

CASCADING FAILURE

Log Data Service

Traffic

Cassandra

Playback ServiceNetflix

Playback

Service

Netflix

Playback

Service

Edge

Service

Edge

Service

Edge

Service

Playback

Service

Traffic

Page 39: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

{

throw new OutOfMemoryError();

}

Log Data Service

Cassandra

Playback Service

Page 40: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

PREVENTIONMANAGING RESOURCE CONSTRAINTS

Whatever you ask, CC BY-SA 2.0, Kreg Steppe 2008, Flikr

Page 42: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

1

Keep Only

Dependencies

which are

Necessary

SO MANY JARS!!

Page 45: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

DEV

FAVOR IMMUTABILITY

Playback Service

TEST

Playback Service

PROD

Playback Service

4

Page 46: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

try {

remoteService.call();

} catch( Throwable t ){

//Oops!

System.exit(1);

}

Log Data

Service

Cassandra

Playback

Service

Proxy/Routing

Traffic

Page 47: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

It's Electric, CC BY ND 2.0, Alan Hochberg 2008, Flikr

MITIGATIONCIRCUIT BREAKERS

Page 48: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which
Page 49: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Wrecking Ball in Building, CC BY 2.0, Jason Eppink 2008, Flikr

FAILURE TESTING

Page 50: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Proxy/Routing

Devices

FAILURE TESTING

Log Data Service

Traffic

Cassandra

Playback ServiceAutomating Chaos

Experiments in

Production

by Ali Basiri

Applying Failure

Testing Research

@Netflix

by Kolton Andrus and

Peter Alvaro

Page 51: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Manage resource constraints by

reducing surface area.

Leverage circuit breakers and

rigorously test failures.

Page 52: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

CHAPTER THREETHE THROTTLE

Page 53: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which
Page 54: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Proxy/Routing

Devices

PLAYBACK ARCHITECTURE

Edge Service Edge Service Edge Service

Playback Service

Traffic Traffic

URL Service

Page 55: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

NETFLIX CLIENT JARS

Playback Service

URL

Service

URL Client

Circuit-breakers and Fallbacks

Metrics Retries and Timeouts

RPCService Discovery

Page 56: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which
Page 57: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Playback

Service

Traffic Concurrent

Requests

Throttled Requests

(HTTP 503)

THROTTLING

Page 58: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which
Page 59: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

}

System.gc();

}

URL

Service

Playback Service

Edge Service

Proxy/Routing

Traffic

Page 60: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

NETFLIX CLIENT JARS

Playback Service

URL

Service

URL Client

Circuit-breakers

MetricsRetries and

Timeouts

RPCService Discovery

Heavy

Fallback

Page 61: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

FALLBACK TESTING

With 100% Fallback,

CPU held at 90%

15 RPSNo fallback, CPU held at 90%

58 RPSSiege: https://github.com/JoeDog/siege

Page 62: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

SELECTING FALLBACKS

CACHESTATIC

FALLBACK

SERVICE

Page 63: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

URL

Service

Playback Service

Edge Service

Proxy/Routing

Traffic

}

return Response

.status(503)

.build();

}

Page 64: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

REQUEST BUCKETING

NON-CRITICALExperience or

Performance

Impact

CRITICALCustomer

Streaming

Impact

Fire Buckets at Oakworth Statione, CC BY 2.0, Tim Greene 2015, Flikr

Page 65: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

APPLICATION SHARDING

Non-Critical

Playback Service

Proxy/RoutingDevices

Edge Service Edge Service Edge Service

Traffic Traffic

URL Service

Critical Playback

Service

Non-Critical URL

Service

Page 68: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

APPLICATION SHARDING

Non-Critical

Playback Service

Proxy/RoutingDevices

Edge Service Edge Service Edge Service

Traffic Traffic

Critical Playback

Service

URL ServiceNon-Critical URL

Service

Page 69: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

No heavy fallbacks!!

Fallbacks should be light and fast.

Shard your application based on

operational characteristics.

Page 70: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

EPILOGUEKEY TAKEAWAYS

Page 71: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

KEY TAKEAWAYSCHAPTER 1: THE WEIRD DATA IN THE CATALOG

• Verify consistency prior to applying state changes.

• One tool is a data canary.

CHAPTER 2: THE VANISHING OF CRITICAL SERVICES

• Manage resource constraints by reducing surface area.

• Leverage circuit breakers and rigorously test failures.

CHAPTER 3: THE THROTTLE

• No heavy fallbacks!! Fallbacks should be light and fast.

• Shard your application based on operational characteristics.

Page 72: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

The unexpected will happen.

Plan to fail.

Page 73: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

PARTING THOUGHTDISTRIBUTED SYSTEMS

SOCIAL

Page 74: The Forces That Disrupt Netflix - QConSF · The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016. ACROBAT FLEA our world parallel world # A distributed system is one in which

Questions?

Haley Tucker