gluecon 2013 - netflixoss cloud native tutorial introduction

73
Introduction: Building Using The NetflixOSS Architecture May 2013 Adrian Cockcroft @adrianco #netflixcloud @NetflixOSS http://www.linkedin.com/in/adriancockcroft

Upload: adrian-cockcroft

Post on 07-May-2015

13.726 views

Category:

Technology


4 download

DESCRIPTION

Same basic flow as the keynote, but with a lot more detail, and we had a lot more interactive discussion rather than a presentation format. See part 2 for some more specific detail and links to other presentations.

TRANSCRIPT

Page 1: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Introduction:Building Using The NetflixOSS

Architecture

May 2013Adrian Cockcroft

@adrianco #netflixcloud @NetflixOSShttp://www.linkedin.com/in/adriancockcroft

Page 2: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Presentation vs. Tutorial

• Presentation– Short duration, focused subject– One presenter to many anonymous audience– A few questions at the end

• Tutorial– Time to explore in and around the subject– Tutor gets to know the audience– Discussion, rat-holes, “bring out your dead”

Page 3: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Introduction – Who are you?

Netflix Open Source Cloud Prize

Cloud Native – More details

NetflixOSS – Cloud Native On-Ramp

Page 4: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Adrian Cockcroft• Director, Architecture for Cloud Systems, Netflix Inc.

– Previously Director for Personalization Platform

• Distinguished Availability Engineer, eBay Inc. 2004-7– Founding member of eBay Research Labs

• Distinguished Engineer, Sun Microsystems Inc. 1988-2004– 2003-4 Chief Architect High Performance Technical Computing– 2001 Author: Capacity Planning for Web Services– 1999 Author: Resource Management– 1995 & 1998 Author: Sun Performance and Tuning– 1996 Japanese Edition of Sun Performance and Tuning

• SPARC & Solaris パフォーマンスチューニング ( サンソフトプレスシリーズ )

• More– Twitter @adrianco – Blog http://perfcap.blogspot.com– Presentations at http://www.slideshare.net/adrianco

Page 5: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Attendee Introductions

• Who are you, where do you work• Why are you here today, what do you need• “Bring out your dead”– Do you have a specific problem or question?– One sentence elevator pitch

• What instrument do you play?

Page 6: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Boosting the @NetflixOSS EcosystemSee netflix.github.com

Page 7: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

In 2012 Netflix Engineering won this..

Page 8: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

We’d like to give out prizes too

But what for?Contributions to NetflixOSS!Shared under Apache license

Located on github

Page 9: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Best example application mash-up

Best new monkey

Best contribution to code quality

Best new feature

Best contribution to operational tools

Best portability enhancement

Best datastore integration

Best contribution to performance

Best usability enhancement

Judges choice award

Page 10: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

How long do you have?

Entries open March 13th

Entries close September 15th

Six months…

Page 11: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Who can win?

Almost anyone, anywhere…Except current or former Netflix or

AWS employees

Page 12: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Who decides who wins?

Nominating CommitteePanel of Judges

Page 13: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Judges

Aino CorryProgram Chair for Qcon/GOTO

Martin FowlerChief Scientist ThoughtworksSimon Wardley

Strategist

Yury IzrailevskyVP Cloud Netflix

Werner VogelsCTO Amazon Joe Weinman

SVP Telx, Author “Cloudonomics”

Page 14: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

What are Judges Looking For?Eligible, Apache 2.0 licensed

NetflixOSS project pull requests

Original and useful contribution to NetflixOSS

Good code quality and structure

Documentation on how to build and run it

Code that successfully builds and passes a test suite

Evidence that code is in use by other projects, or is running in production

A large number of watchers, stars and forks on github

Page 15: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

What do you win?

One winner in each of the 10 categoriesTicket and expenses to attend AWS

Re:Invent 2013 in Las VegasA Trophy

$10,000 cash and $5,000 in AWS Credits

Page 16: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

How do you enter?

Get a (free) github accountFork github.com/netflix/cloud-prize

Send us your email addressDescribe and build your entry

Twitter #cloudprize

Page 17: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Entrants

NetflixEngineering

Six Judges Winners

Nominations

Conforms to Rules

Working Code

Community Traction

Categories

Registration Opened

March 13Github

Apache Licensed

ContributionsGithub Close Entries

September 15GithubAward

Ceremony Dinner

November

AWS Re:Invent

Ten Prize Categories

$10K cash$5K AWS

AWS Re:Invent

TicketsTrophy

Page 18: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Cloud Native

Recap the keynote in much more detail and discussion

Page 19: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

A new engineering challenge

Construct a highly agile and highly available service from ephemeral and

often broken components

Page 20: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Inspiration

Page 21: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Netflix Streaming

A Cloud Native Application based on an open source platform

Page 22: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Netflix Member Web Site Home PagePersonalization Driven – How Does It Work?

Page 23: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

How Netflix Streaming Works

Customer Device (PC, PS3, TV…)

Web Site or Discovery API

User Data

Personalization

Streaming API

DRM

QoS Logging

OpenConnect CDN Boxes

CDN Management and

Steering

Content Encoding

Consumer Electronics

AWS Cloud Services

CDN Edge Locations

Page 24: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Real Web Server Dependencies Flow(Netflix Home page business transaction as seen by AppDynamics)

Start Here

memcached

Cassandra

Web service

S3 bucket

Personalization movie group choosers (for US, Canada and Latam)

Each icon is three to a few hundred instances across three AWS zones

Page 25: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

New Cloud Native Patterns

Micro-services and Chaos enginesHighly available systems composed

from ephemeral componentsOpen Source is the default

Page 26: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Some Strategic Questions

What changed…

Page 27: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

The AWS Question

Why does Netflix use AWS when Amazon Prime is a competitor?

Page 28: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Netflix vs. Amazon Prime

• Do retailers competing with Amazon use AWS?– Yes, lots of them, Netflix is no different

• Does Prime have a platform advantage?– No, because Netflix gets to run on AWS

• Does Netflix take Amazon Prime seriously?– Yes, but so far Prime isn’t impacting our business

Page 29: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Amazon Video 1.31%

18x Prime

25x Prime

Nov2012StreamingBandwidth

March2013

MeanBandwidth+39% 6mo

Page 30: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

The Google Cloud Question

Why doesn’t Netflix use Google Cloud as well as AWS?

Page 31: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Google Cloud – Wait and See

Pro’s• Cloud Native• Huge scale for internal apps• Exposing internal services• Nice clean API model• Starting a price war• Fast for what it does• Rapid start & minute billing

Con’s• In beta until last week• No big customers yet• Missing many key features• Different arch model• Missing billing options• No SSD or huge instances• Zone maintenance windows

But: Anyone interested is welcome to port NetflixOSS components to Google Cloud

Page 32: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Cloud Wars: Price and Performance

AWS vs. GCS War

Private Cloud

Power cost increase

Labor cost increase

Aging system failures

Maintenance renewal

Storage cost reduction

Instance cost reduction

Faster newer systems

What Changed:Everyone using AWS or GCS gets the price cuts and performance improvements, as they happen. No need to switch vendor.

No Change:Locked in for three years.

Page 33: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

The DIY Question

Why doesn’t Netflix build and run its own cloud?

Page 34: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Fitting Into Public Scale

Public Grey Area Private

1,000 Instances 100,000 Instances

Netflix FacebookStartups

Page 35: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

How big is Public?

AWS upper bound estimate based on the number of public IP AddressesEvery provisioned instance gets a public IP by default

AWS Maximum Possible Instance Count 3.7 MillionGrowth >10x in Three Years, >2x Per Annum

Page 36: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

The Alternative Supplier Question

What if there is no clear leader for a feature, or AWS doesn’t have what

we need?

Page 37: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Things We Don’t Use AWS For

SaaS Applications – Pagerduty, AppdynamicsContent Delivery Service

DNS Service

Page 38: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

CDN Scale

AWS CloudFrontAkamai

LimelightLevel 3

Netflix Openconnect

YouTube

Gigabits Terabits

NetflixFacebookStartups

Page 39: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Content Delivery ServiceOpen Source Hardware Design + FreeBSD, bird, nginx

see openconnect.netflix.com

Page 40: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

DNS Service

AWS Route53 is missing too many featuresMultiple vendor strategy Dyn, Ultra, Route53

Abstracted (broken) DNS APIs with Denominator

Page 41: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Availability Questions

Is it running yet?How many places is it running in?How far apart are those places?

Page 42: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
Page 43: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Netflix Outages

• Running very fast with scissors– Mostly self inflicted – bugs, mistakes from pace of change– Some caused by AWS bugs and mistakes

• Incident Life-cycle Management by Platform Team– No runbooks, no operational changes by the SREs– Tools to identify what broke and call the right developer

• Next step is multi-region– Investigating and building in stages during 2013– Could have prevented some of our 2012 outages

Page 44: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Managing Multi-Region Availability

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

Regional Load Balancers

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

Regional Load Balancers

UltraDNSDynECT

DNS

AWS Route53

Denominator – manage traffic via multiple DNS providers

Denominator

Page 45: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Cloud Native Big Data

Size the cluster to the dataSize the cluster to the questionsNever wait for space or answers

Page 46: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Netflix Dataoven

Data WarehouseOver 2 Petabytes

Ursula

Aegisthus

Data Pipelines

From cloud Services~100 BillionEvents/day

From C*Terabytes ofDimensiondata

Hadoop Clusters – AWS EMR

1300 nodes 800 nodes Multiple 150 nodes Nightly

RDS

Metadata

Gateways

Tools

Page 47: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Cloud Native Patterns

Master copies of data are cloud residentDynamically provisioned micro-servicesServices are distributed and ephemeral

Page 48: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Cloud Native Architecture

Distributed Quorum NoSQL Datastores

Autoscaled Micro Services

Autoscaled Micro Services

Clients Things

JVM JVM

JVM JVM

Cassandra Cassandra Cassandra

Memcached

JVM

Zone A Zone B Zone C

Page 49: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Non-Native Cloud Architecture

Datacenter Dinosaurs

Cloudy Buffer

Agile Mobile Mammals

iOS/Android

App Servers

MySQL Legacy Apps

Page 50: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

How to get to Cloud Native?

Freedom and Responsibility for DevelopersDecentralize and Automate Ops Activities

Integrate DevOps into the Business Organization

Re-Org!

Page 51: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Four Transitions

• Management: Integrated Roles in a Single Organization– Business, Development, Operations -> BusDevOps

• Developers: Denormalized Data – NoSQL– Decentralized, scalable, available, polyglot

• Responsibility from Ops to Dev: Continuous Delivery– Decentralized small daily production updates

• Responsibility from Ops to Dev: Agile Infrastructure - Cloud– Hardware in minutes, provisioned directly by developers

Page 52: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Netflix BusDevOps OrganizationChief Product

Officer

VP Product Management

Directors Product

VP UI Engineering

Directors Development

Developers + DevOps

UI Data Sources

AWS

VP Discovery Engineering

Directors Development

Developers + DevOps

Discovery Data Sources

AWS

VP Platform

Directors Platform

Developers + DevOps

Platform Data Sources

AWS

Denormalized, independently updated and scaled data

Cloud, independently updated and scaled infrastructure

Code, independently updated continuous delivery

Page 53: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Decentralized Deployment

Page 54: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Asgardhttp://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html

Page 55: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Ephemeral Instances

• Largest services are autoscaled• Average lifetime of an instance is 36 hours

Push

Autoscale UpAutoscale Down

Page 56: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

A Cloud Native Open Source PlatformSee netflix.github.com

Page 57: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Three Questions

Why is Netflix doing this?

How does it all fit together?

What is coming next?

Page 58: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Beware of Geeks Bearing Gifts: Strategies for an Increasingly Open Economy

Simon Wardley - Researcher at the Leading Edge Forum

Page 59: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

How did Netflix get ahead?

Netflix BusDevOps Org• Doing it since 2009• SaaS Applications• PaaS for agility• Public IaaS for AWS features• Big data in the cloud• Integrating many APIs• FOSS from github• Renting hardware for 1hr• Coding in Java/Groovy/Scala

Traditional IT Operations• Taking their time• Pilot private cloud projects• Beta quality installations• Small scale• Integrating several vendors• Paying big $ for software• Paying big $ for consulting• Buying hardware for 3yrs• Hacking at scripts

Page 60: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Netflix Platform Evolution

Bleeding Edge Innovation

Common Pattern

Shared Pattern

2009-2010 2011-2012 2013-2014

Netflix ended up several years ahead of the industry, but it’s becoming commoditized now

Page 61: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Making it easy to follow

Exploring the wild west each time vs. laying down a shared route

Page 62: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Establish our solutions as Best

Practices / Standards

Hire, Retain and Engage Top Engineers

Build up Netflix Technology Brand

Benefit from a shared ecosystem

Goals

Page 63: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

How does it all fit together?

Page 64: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Example Application – RSS Reader

Page 65: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

GithubNetflixOSS

Source

AWSBase AMI

MavenCentral

Cloudbees Jenkins

AminatorBakery

DynaslaveAWS Build

Slaves

Asgard(+ Frigga)Console

AWSBaked AMIs

OdinOrchestration

API

AWS Account

NetflixOSS Continuous Build and Deployment

Coming Soon!

Page 66: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

AWS AccountAsgard Console

Archaius Config Service

Cross region Priam C*

PytheasDashboards

AtlasMonitoring

Genie, LipstickHadoop Services

AWS UsageCost Monitoring

Multiple AWS RegionsEureka Registry

Exhibitor ZK

Edda History

Simian Army

3 AWS ZonesApplication

ClustersAutoscale Groups

Instances

PriamCassandra

Persistent Storage

EvcacheMemcached

Ephemeral Storage

NetflixOSS Services Scope

Page 67: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

• Baked AMI – Tomcat, Apache, your code• Governator – Guice based dependency injection• Archaius – dynamic configuration properties client• Eureka - service registration client

Initialization

• Karyon - Base Server for inbound requests• RxJava – Reactive pattern• Hystrix/Turbine – dependencies and real-time status• Ribbon - REST Client for outbound calls

Service Requests

• Astyanax – Cassandra client and pattern library• Evcache – Zone aware Memcached client• Curator – Zookeeper patterns• Denominator – DNS routing abstraction

Data Access

• Blitz4j – non-blocking logging• Servo – metrics export for autoscaling• Atlas – high volume instrumentation

Logging

NetflixOSS Instance Libraries

Page 68: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

• CassJmeter – Load testing for Cassandra• Circus Monkey – Test account reservation rebalancingTest Tools

• Janitor Monkey – Cleans up unused resources• Efficiency Monkey• Doctor Monkey• Howler Monkey – Complains about AWS limits

Maintenance

• Chaos Monkey – Kills Instances• Chaos Gorilla – Kills Availability Zones• Chaos Kong – Kills Regions• Latency Monkey – Latency and error injection

Availability

• Security Monkey – security group and S3 bucket permissions• Conformity Monkey – architectural pattern warningsSecurity

NetflixOSS Testing and Automation

Page 69: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

More Use Cases

More Features

Better portability

Higher availability

Easier to deploy

Contributions from end users

Contributions from vendors

What’s Coming Next?

Page 70: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Vendor Driven PortabilityInterest in using NetflixOSS for Enterprise Private Clouds

“It’s done when it runs Asgard”Functionally completeDemonstrated MarchRelease 3.3 in 2Q13

Some vendor interestNeeds AWS compatible Autoscaler

Some vendor interestMany missing features“Confused” AWS API strategy

Page 71: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

AWS 2009Baseline features needed to support NetflixOSS

Eucalyptus 3.3

Page 72: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Functionality and scale now, portability coming

Moving from parts to a platform in 2013

Netflix is fostering a cloud native ecosystem

Rapid Evolution - Low MTBIAMSH(Mean Time Between Idea And Making Stuff Happen)

Page 73: Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction

Takeaway

NetflixOSS makes it easier for everyone to become Cloud Native

http://netflix.github.comhttp://techblog.netflix.comhttp://slideshare.net/Netflix

http://www.linkedin.com/in/adriancockcroft

@adrianco #netflixcloud @NetflixOSS