Learning about NetflixOSS For Oct 2013 @TriangleDevops Andrew Spyker @aspyker Some content from @ma4jpb

Upload: aspyker

Post on 06-May-2015


DESCRIPTION

My @TriangleDevops talk from 2013-10-17. I covered the work that led us to @NetflixOSS (Acme Air), the work we did on the cloud prize (NetflixOSS on IBM SoftLayer/RightScale) and the @NetflixOSS platform (Karyon, Archaius, Eureka, Ribbon, Asgard, Hystrix, Turbine, Zuul, Servo, Edda, Ice, Denominator, Aminator, Janitor/Conformity/Chaos Monkeys of the Simian Army).

TRANSCRIPT

Page 1: NetflixOSS for Triangle Devops Oct 2013

Learning about NetflixOSS

For Oct 2013 @TriangleDevops

Andrew Spyker @aspyker

Some content from @ma4jpb

Page 2: NetflixOSS for Triangle Devops Oct 2013

Agenda

• How did I get here?

• Netflix and Netflix OSS platform overview

• Runtime components

• Management components

• Build components

• Automated test and cleanliness components

Page 3: NetflixOSS for Triangle Devops Oct 2013

About me …

• IBM STSM of Performance Architect and Strategy
• Eleven years in performance in WebSphere
  – Led the App Server Performance team for years
  – Small sabbatical focused on IBM XML technology
  – Work in Emerging Technology Institute and CTO Office
  – Starting to look at cloud service operations
• Email: [email protected]
  – Blog: http://ispyker.blogspot.com/
  – Linkedin: http://www.linkedin.com/in/aspyker
  – Twitter: http://twitter.com/aspyker
  – Github: http://www.github.com/aspyker
• Triangle dad who enjoys technology as well as running, wine and poker

Page 4: NetflixOSS for Triangle Devops Oct 2013

Develop or maintain a service today?

• Develop – starting

• Maintain – starting

• More on this later ….


http://www.flickr.com/photos/stevendepolo/

Page 5: NetflixOSS for Triangle Devops Oct 2013

What qualifies me to talk?

• My shirt?

• Of cloud prize ~25 nominees
  – Personally
    • Best example mash-up sample
  – My IBM team
    • Best portability enhancement
  – More on this coming …
• http://techblog.netflix.com/2013/09/netflixoss-meetup-s1e4-cloud-prize.html

Page 6: NetflixOSS for Triangle Devops Oct 2013

Seriously, how did I get here?

• Plenty of experience with performance and scale on standardized benchmarks (SPEC/TPC)
  – Not representative of how to (web) scale
    • Pinning, biggest monolithic DB “wins”, hand tuned for fixed size
  – Out of date on modern architecture for mobile/cloud
• Created Acme Air
  – http://bit.ly/acmeairblog
• Demonstrated that we could achieve (web) scale runs
  – 4B+ mobile/browser requests/day
  – With modern mobile and cloud best practices

Page 7: NetflixOSS for Triangle Devops Oct 2013

Demo


Page 8: NetflixOSS for Triangle Devops Oct 2013

What was shown?

• Peak performance and scale – You betcha!

• Operational visibility – Only during the run via nmon collection and post-run visualization

• True operational visibility – nope

• Devops – nope

• HA and DR – nope

• Manual and automatic elastic scaling – nope

Page 9: NetflixOSS for Triangle Devops Oct 2013

What next?

• Went looking for the best industry practices around devops and high availability at web scale
  – Many have documented theirs via research papers and on highscalability.com
  – Google, Twitter, Facebook, Linkedin, etc.
• Why Netflix?
  – Documented not only on their tech blog, but also released working OSS on github
  – Also, given their dependence on Amazon, they are a clear bellwether of web scale public cloud availability

Page 10: NetflixOSS for Triangle Devops Oct 2013

Steps to NetflixOSS understanding

• Recoded Acme Air application to make use of NetflixOSS runtime components

• Worked to implement a NetflixOSS devops and high availability setup around Acme Air (on EC2) run at previous levels of scale and performance

• Worked to port NetflixOSS runtime and devops/high availability servers to IBM Cloud (SoftLayer) and RightScale

• Through public collaboration with Netflix technical team
  – Google groups, github and meetups

Page 11: NetflixOSS for Triangle Devops Oct 2013

Why?

• To prove that an advanced cloud high availability and devops platform wasn’t “tied” to Amazon

• To understand how we can advance IBM cloud platforms for our customers

• To understand how we can host our IBM public cloud services better


Page 12: NetflixOSS for Triangle Devops Oct 2013

Agenda

• How did I get here?

• Netflix and Netflix OSS platform overview

• Runtime components

• Management components

• Build components

• Automated test and cleanliness components

Page 13: NetflixOSS for Triangle Devops Oct 2013

My view of Netflix goals

• As a business
  – Be the best streaming media provider in the world
  – Make best content deals based on real data/analysis
• Technology wise
  – Have the most availability possible
  – Measure all things by “stream starts per unit of time”
    • Any dip in that relates back to the business
  – Do this at web scale

Page 14: NetflixOSS for Triangle Devops Oct 2013

Standing on the shoulders of giants

• Public Cloud (Amazon)
  – When adding streaming, Netflix decided they
    • Shouldn’t invest in building data centers worldwide
    • Had to plan for the streaming business to be very big
  – Embraced cloud architecture, paying only for what they need
• Open Source
  – Many parts of runtime depend on open source
    • Linux, Apache Tomcat, Apache Cassandra, etc.
  – Realized that Amazon wasn’t enough
    • Started a cloud platform on top that would eventually be open sourced - NetflixOSS

http://en.wikipedia.org/wiki/File:Andre_in_the_late_%2780s.jpg

Page 15: NetflixOSS for Triangle Devops Oct 2013

Failure

• What is failing?
  – Underlying IaaS problems
    • Instances, racks, availability zones, regions
  – Software issues
    • Operating system, servers, application code
  – Surrounding services
    • Other application services, DNS, user registries, etc.
• How is a component failing?
  – Fails and disappears altogether
  – Intermittently fails
  – Works, but is responding slowly
  – Works, but is causing users a poor experience

Inspiration


Page 16: NetflixOSS for Triangle Devops Oct 2013

Overview of Amazon EC2

• Amazon launches instances into availability zones
  – Instances of various sizes (compute, storage, etc.)
• Organized into regions and availability zones
  – Regions are independent of each other
  – Regions are only connected over the Internet
  – Regions contain availability zones
  – Availability zones are isolated from each other
  – Availability zones are connected with low-latency links
• This gives a high level of resilience to outages
  – Outages are unlikely to affect multiple availability zones or regions
• Amazon requires customers to be aware of this topology to take advantage of its benefits within their application

[Diagram: two EC2 regions (US East and US West), each containing three availability zones, connected to each other over the Internet]

Page 17: NetflixOSS for Triangle Devops Oct 2013

NetflixOSS

• “Technical indigestion as a service” - @adrianco

• netflix.github.io

• 30+ OSS projects

• Expanding every day

Page 18: NetflixOSS for Triangle Devops Oct 2013

NetflixOSS – for today

• For today
  – Focus on mid tier web app and micro service servers
  – Devops servers and tools
  – Skipping some just for simplicity
• For another time
  – Big data
  – Data tier
  – Caching

Page 19: NetflixOSS for Triangle Devops Oct 2013

Agenda

• How did I get here?

• Netflix and Netflix OSS platform overview

• Runtime components

• Management components

• Build components

• Automated test and cleanliness components

Page 20: NetflixOSS for Triangle Devops Oct 2013

Acme Air As A Sample

[Diagram: ELB, Web App Front End (REST services), App Service (Authentication), Data Tier]

Greatly simplified …

Page 21: NetflixOSS for Triangle Devops Oct 2013

Micro-services architecture

• Decompose system into isolated services that can be developed separately
• Why?
  – They can fail independently vs. fail together monolithically
  – They can be developed and released with different velocities by different teams
• To show this we created a separate “auth service” for Acme Air
• In a typical customer facing application any single front end invocation could spawn 20-30 calls to services and data sources

Page 22: NetflixOSS for Triangle Devops Oct 2013

How do services advertise themselves?

• Upon web app startup, Karyon server is started
  – Karyon will configure (via Archaius) the application
  – Karyon will register the location of the instance with Eureka
    • Others can know of the existence of the service
    • Lease based, so instances continue to check in, updating the list of available instances
  – Karyon will also expose a JMX console and a healthcheck URL
    • Devops can change things about the service via JMX
    • The system can monitor the health of the instance

[Diagram: App Service (Authentication) running Karyon in Tomcat, configured from config.properties, auth-service.properties or remote Archaius stores, registers its name, port, IP address and healthcheck URL with the Eureka server(s)]
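The lease idea on this slide can be sketched in a few lines of plain Java. This is not the Eureka or Karyon API; the class name, method names and the 30 second lease are made up for illustration. It only shows why an instance that stops checking in drops out of the list that consumers see.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy lease-based registry; NOT the Eureka API, only the idea behind it. */
class LeaseRegistrySketch {
    private final long leaseMillis;
    // service name -> (instance address -> time of last check-in)
    private final Map<String, Map<String, Long>> leases = new ConcurrentHashMap<>();

    LeaseRegistrySketch(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /** Called at startup and on every later check-in; both simply refresh the lease. */
    void renew(String service, String address, long nowMillis) {
        leases.computeIfAbsent(service, k -> new ConcurrentHashMap<>()).put(address, nowMillis);
    }

    /** Returns only the instances whose lease has not expired at nowMillis. */
    List<String> lookup(String service, long nowMillis) {
        List<String> live = new ArrayList<>();
        Map<String, Long> instances = leases.getOrDefault(service, Collections.emptyMap());
        for (Map.Entry<String, Long> e : instances.entrySet()) {
            if (nowMillis - e.getValue() <= leaseMillis) {
                live.add(e.getKey());
            }
        }
        return live;
    }

    public static void main(String[] args) {
        LeaseRegistrySketch registry = new LeaseRegistrySketch(30_000);
        registry.renew("auth-service", "10.0.0.1:8080", 0);
        registry.renew("auth-service", "10.0.0.2:8080", 0);
        registry.renew("auth-service", "10.0.0.1:8080", 25_000); // only the first instance keeps checking in
        System.out.println(registry.lookup("auth-service", 40_000));
    }
}
```

Time is passed in explicitly so the behaviour is easy to follow; a real registry would use its own clock and would also replicate state between servers.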

Page 23: NetflixOSS for Triangle Devops Oct 2013

How do consumers find services?

• Service consumers query Eureka at startup and periodically to determine the location of dependencies
  – Can query based on availability zone and cross availability zone

[Diagram: the Eureka client in the Web App Front End (REST services), running in Tomcat, asks the Eureka server(s) “What ‘auth-service’ instances exist?”]

Page 24: NetflixOSS for Triangle Devops Oct 2013

Demo


Page 25: NetflixOSS for Triangle Devops Oct 2013

How does the consumer call the service?

• Protocol implementations have Eureka-aware load balancing support built in
  – In-client load balancing -- does not require a separate LB tier
• Ribbon – REST client
  – Pluggable load balancing scheme
  – Built in failure recovery support (retry next server, mark instance as failing, etc.)
• Other Eureka enabled clients – memcached (EVCache), Astyanax coming (Priam and Cassandra)

[Diagram: the Web App Front End (REST services) uses the Ribbon REST client and Eureka client to call “auth-service” on one of several App Service (Authentication) instances]
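To make "in-client load balancing" concrete, here is a minimal sketch of round-robin selection with "retry next server" and "mark instance as failing". It is plain Java written for this transcript, not the Ribbon API; the class and method names are hypothetical, and a real client would also re-probe servers it has marked down.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Toy client-side load balancer; NOT the Ribbon API. */
class ClientSideLoadBalancerSketch {
    private final List<String> servers;  // e.g. the instance list obtained from discovery
    private final Set<String> markedDown = ConcurrentHashMap.newKeySet();
    private final AtomicInteger next = new AtomicInteger();

    ClientSideLoadBalancerSketch(List<String> servers) {
        this.servers = new ArrayList<>(servers);
    }

    /** Round-robin over the servers; on failure mark the server down and try the next one. */
    <T> T execute(Function<String, T> call) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < servers.size(); attempt++) {
            String server = servers.get(Math.floorMod(next.getAndIncrement(), servers.size()));
            if (markedDown.contains(server)) {
                continue; // skip instances already known to be failing
            }
            try {
                return call.apply(server);
            } catch (RuntimeException e) {
                markedDown.add(server); // simplification: never re-probed in this sketch
                last = e;
            }
        }
        throw new IllegalStateException("no healthy server available", last);
    }

    public static void main(String[] args) {
        ClientSideLoadBalancerSketch lb =
                new ClientSideLoadBalancerSketch(Arrays.asList("auth-1", "auth-2", "auth-3"));
        String reply = lb.execute(server -> {
            if (server.equals("auth-1")) {
                throw new RuntimeException("connection refused");
            }
            return "called " + server;
        });
        System.out.println(reply);
    }
}
```

Because the choice happens inside the caller, there is no separate load balancer tier between the web app and the auth service, which is the point the slide makes.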

Page 26: NetflixOSS for Triangle Devops Oct 2013

How to deploy this with HA?

Instances?
• Deploy across AZs
• Using AutoScalingGroups in EC2 managed by Asgard
  – ASG manages recovery

Eureka?
• DNS and Elastic IP trickery
• Deployed across AZs
• For clients to find Eureka servers
  – DNS TXT record for domain lists AZ TXT records
  – AZ TXT records have list of Eureka servers
• For new Eureka servers
  – Look for the list of Eureka server IPs for the AZ it’s coming up in
  – Look for unassigned elastic IPs, grab one and assign it to itself
  – Sync with other already assigned IPs that likely are hosting Eureka server instances
• Simpler configurations with less HA are available
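The two-level TXT record lookup described above can be sketched with an in-memory map standing in for DNS. Everything here is an assumption made for illustration: the record names, the addresses and the helper names are hypothetical, and no real DNS query is made. Preferring the caller's own zone is one plausible policy, not necessarily what the Eureka client does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy two-level TXT lookup; the map stands in for DNS and all names are hypothetical. */
class EurekaTxtLookupSketch {

    /** The domain-level record lists the per-AZ records; each AZ record lists servers. */
    static Map<String, List<String>> serversByZone(Map<String, List<String>> txt, String domainRecord) {
        Map<String, List<String>> byZone = new LinkedHashMap<>();
        for (String zoneRecord : txt.getOrDefault(domainRecord, Collections.emptyList())) {
            byZone.put(zoneRecord, txt.getOrDefault(zoneRecord, Collections.emptyList()));
        }
        return byZone;
    }

    /** Servers in the caller's own zone first, the other zones afterwards as fallback. */
    static List<String> preferOwnZone(Map<String, List<String>> byZone, String ownZoneRecord) {
        List<String> ordered = new ArrayList<>(byZone.getOrDefault(ownZoneRecord, Collections.emptyList()));
        for (Map.Entry<String, List<String>> e : byZone.entrySet()) {
            if (!e.getKey().equals(ownZoneRecord)) {
                ordered.addAll(e.getValue());
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        Map<String, List<String>> txt = new LinkedHashMap<>();
        txt.put("txt.region.example.com",
                Arrays.asList("txt.zone-a.example.com", "txt.zone-b.example.com"));
        txt.put("txt.zone-a.example.com", Arrays.asList("10.0.0.1"));
        txt.put("txt.zone-b.example.com", Arrays.asList("10.0.1.1", "10.0.1.2"));

        Map<String, List<String>> byZone = serversByZone(txt, "txt.region.example.com");
        System.out.println(preferOwnZone(byZone, "txt.zone-b.example.com"));
    }
}
```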

Page 27: NetflixOSS for Triangle Devops Oct 2013

Protect yourself from unhealthy services

• Wrap all calls to services with Hystrix command pattern
  – Hystrix implements circuit breaker pattern
  – Executes command using semaphore or separate thread pool to guarantee return within finite time to caller
  – If an unhealthy service is detected, start to call fallback implementation (broken circuit) and periodically check if main implementation works (reset circuit)

[Diagram: in the Web App Front End (REST services), Hystrix wraps the Ribbon REST client call to “auth-service”; it either executes the auth-service call against the App Service (Authentication) instances or runs the fallback implementation]
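The broken circuit / reset circuit behaviour can be sketched as a small state machine. This is not Hystrix code or its API; the class, the threshold and the timing parameters are invented for illustration, and the semaphore or thread pool isolation that bounds call time is left out entirely.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Toy circuit breaker; NOT the Hystrix API. Timeouts and thread isolation are omitted. */
class CircuitBreakerSketch {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock;
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    CircuitBreakerSketch(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** synchronized for simplicity; a real implementation would not hold a lock during the call. */
    synchronized <T> T execute(Supplier<T> primary, Supplier<T> fallback) {
        boolean open = openedAt >= 0;
        if (open && clock.getAsLong() - openedAt < openMillis) {
            return fallback.get(); // broken circuit: do not touch the unhealthy service
        }
        try {
            T result = primary.get(); // normal call, or a trial call after the wait
            consecutiveFailures = 0;
            openedAt = -1;            // reset circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (open || consecutiveFailures >= failureThreshold) {
                openedAt = clock.getAsLong(); // trip (or re-trip) the circuit
            }
            return fallback.get();
        }
    }

    synchronized boolean isOpen() {
        return openedAt >= 0;
    }

    public static void main(String[] args) {
        CircuitBreakerSketch breaker = new CircuitBreakerSketch(2, 5_000, System::currentTimeMillis);
        for (int i = 0; i < 3; i++) {
            String reply = breaker.execute(
                    () -> { throw new RuntimeException("auth-service down"); },
                    () -> "fallback: treat user as not logged in");
            System.out.println(reply + " (circuit open: " + breaker.isOpen() + ")");
        }
    }
}
```

The caller always gets an answer, either the real one or the fallback, which is what keeps one slow dependency from stalling the front end.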

Page 28: NetflixOSS for Triangle Devops Oct 2013

Does Hystrix do more?

• Main reason for Hystrix is to protect yourself from dependencies, but …
• Once you have a layer of indirection, take advantage of it. Hystrix can provide
  – Caching
  – Visualization
    • Aggregated via Turbine
  – Request collapsing
• Programming models
  – Sync, Async, Reactive (RxJava)

Page 29: NetflixOSS for Triangle Devops Oct 2013

Agenda

• How did I get here?

• Netflix and Netflix OSS platform overview

• Runtime components

• Management components

• Build components

• Automated test and cleanliness components

Page 30: NetflixOSS for Triangle Devops Oct 2013

Ability to reconfigure - Archaius

• Using dynamic properties, can easily change properties across cluster of applications, either
  – NetflixOSS named props
    • Hystrix timeouts for example
  – Custom dynamic props
• High throughput achieved by polling approach
• HA of configuration source dependent on what source you use
  – HTTP server, database, etc.

[Diagram: property hierarchy (Container, Libraries, Application Props, Persisted DB, Runtime, URL) feeding the Application, alongside the JMX/Karyon console]

DynamicIntProperty prop = DynamicPropertyFactory.getInstance().getIntProperty("myProperty", DEFAULT_VALUE);
int value = prop.get(); // value will change over time based on configuration
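Why polling gives high throughput can be shown with a small stand-in for the real library. The sketch below is not Archaius; `PolledConfigSketch` and its methods are hypothetical. Reads hit a local cache, and only the `poll()` call, which a scheduler would run every few seconds in a real setup, touches the configuration source.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Toy polled configuration; NOT the Archaius API. */
class PolledConfigSketch {
    private final Supplier<Map<String, String>> source; // stands in for an HTTP server, database, etc.
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    PolledConfigSketch(Supplier<Map<String, String>> source) {
        this.source = source;
        poll();
    }

    /** Refreshes the local cache; assumed to be called periodically by a scheduler. */
    void poll() {
        Map<String, String> latest = source.get();
        cache.keySet().retainAll(latest.keySet()); // drop properties that were removed
        cache.putAll(latest);
    }

    /** Cheap read: never contacts the source, so it is safe on hot code paths. */
    int getInt(String name, int defaultValue) {
        String raw = cache.get(name);
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        Map<String, String> remote = new HashMap<>();
        PolledConfigSketch config = new PolledConfigSketch(() -> remote);
        System.out.println(config.getInt("myProperty", 10)); // default until the source defines it
        remote.put("myProperty", "42");
        config.poll();                                       // the next poll picks up the change
        System.out.println(config.getInt("myProperty", 10));
    }
}
```

The trade-off is that a change is only visible after the next poll, so the polling interval bounds how quickly a new value spreads across a cluster.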

Page 31: NetflixOSS for Triangle Devops Oct 2013

Asgard

• Asgard is the missing EC2 console for AutoScalingGroup mgmt.
  – EC2 only has CLI for ASG management

[Diagram: EC2 Region (US East) with three availability zones, each running one Web App (REST Services) instance and several App Service (Authentication) instances; Asgard tells EC2 to start these instances and keep this many instances running]

Page 32: NetflixOSS for Triangle Devops Oct 2013

Asgard creates an “application”

• Enforces common practices for deploying code
  – Common approach to linking auto scaling groups to launch configs, ELBs, security groups, scaling policies and AMIs
• Adds missing concept to the EC2 domain model – “application”
  – Extends clustering to applications vs. AMIs
• Example
  – Application – app1
  – Cluster – app1-env
  – Autoscaling group version n – app1-env-v009
  – Autoscaling group version n+1 – app1-env-v010
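The naming example above implies a simple convention that can be parsed mechanically. The sketch below is an illustration only, not Asgard code. It assumes a three digit `-vNNN` suffix, that the application name is the part before the first hyphen, that an unversioned group is followed by `-v000`, and that numbering wraps after `-v999`; the last three are assumptions rather than facts from the slide.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy parser for names like app1-env-v009; NOT Asgard code. */
class AsgNameSketch {
    private static final Pattern VERSIONED = Pattern.compile("^(.+)-v(\\d{3})$");

    /** Cluster name: the group name without its version suffix. */
    static String clusterOf(String asgName) {
        Matcher m = VERSIONED.matcher(asgName);
        return m.matches() ? m.group(1) : asgName;
    }

    /** Application name: assumed to be everything before the first hyphen. */
    static String applicationOf(String asgName) {
        String cluster = clusterOf(asgName);
        int dash = cluster.indexOf('-');
        return dash < 0 ? cluster : cluster.substring(0, dash);
    }

    /** Name of the next group in the cluster (assumption: wraps after v999). */
    static String nextVersion(String asgName) {
        Matcher m = VERSIONED.matcher(asgName);
        if (!m.matches()) {
            return asgName + "-v000"; // assumption: first versioned push after an unversioned group
        }
        int next = (Integer.parseInt(m.group(2)) + 1) % 1000;
        return String.format(Locale.ROOT, "%s-v%03d", m.group(1), next);
    }

    public static void main(String[] args) {
        String current = "app1-env-v009";
        System.out.println("application: " + applicationOf(current));
        System.out.println("cluster:     " + clusterOf(current));
        System.out.println("next group:  " + nextVersion(current));
    }
}
```

Keeping the old and new groups side by side under one cluster name is what makes the rollback and red/black procedures on the next slide possible.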

Page 33: NetflixOSS for Triangle Devops Oct 2013

Asgard devops procedures

• Fast rollback
• Canary testing
• Red/Black pushes
• More through REST interfaces
  – Ad hoc processes but enforced through Asgard model
• More coming using Glisten and Amazon SWF

Page 34: NetflixOSS for Triangle Devops Oct 2013

Demo


Page 35: NetflixOSS for Triangle Devops Oct 2013

Augmenting the ELB tier - Zuul

• Zuul adds devops support in the front tier routing
  – Stress testing (squeeze testing)
  – Canary testing
  – Dynamic routing
  – Load shedding
  – Debugging
• And some common function
  – Authentication
  – Security
  – Static response handling
  – Multi-region resiliency (DR for ELB tier)
  – Insight
• Through dynamically deployable filters (written in Groovy)
• Eureka aware using Ribbon and Archaius, as shown in the runtime section

[Diagram: Amazon ELB, Zuul instances with their filters, and Edge Service instances]
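The filter idea can be illustrated with an ordered chain to which new filters are added at runtime. This is a sketch in Java rather than Groovy and does not use Zuul's classes; the `deploy` method, the `x-canary` header and the route names are hypothetical. It shows how a canary routing rule could be layered on top of a default route without redeploying the edge tier.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Toy edge filter chain; NOT the Zuul API. */
class EdgeFilterChainSketch {
    static final class Filter {
        final int order;
        final Predicate<Map<String, String>> shouldRun;
        final Consumer<Map<String, String>> action;

        Filter(int order, Predicate<Map<String, String>> shouldRun, Consumer<Map<String, String>> action) {
            this.order = order;
            this.shouldRun = shouldRun;
            this.action = action;
        }
    }

    // Copy-on-write so filters can be added while requests are being handled.
    private final List<Filter> filters = new CopyOnWriteArrayList<>();

    /** Adds a filter at runtime; it applies to every request handled afterwards. */
    void deploy(int order, Predicate<Map<String, String>> shouldRun, Consumer<Map<String, String>> action) {
        filters.add(new Filter(order, shouldRun, action));
    }

    /** Runs the applicable filters in ascending order over a copy of the request. */
    Map<String, String> handle(Map<String, String> request) {
        Map<String, String> context = new HashMap<>(request);
        List<Filter> ordered = new ArrayList<>(filters);
        ordered.sort(Comparator.comparingInt(f -> f.order));
        for (Filter f : ordered) {
            if (f.shouldRun.test(context)) {
                f.action.accept(context);
            }
        }
        return context;
    }

    public static void main(String[] args) {
        EdgeFilterChainSketch edge = new EdgeFilterChainSketch();
        edge.deploy(10, ctx -> true, ctx -> ctx.put("route", "auth-service"));
        edge.deploy(20, ctx -> "true".equals(ctx.get("x-canary")),
                ctx -> ctx.put("route", "auth-service-canary"));

        Map<String, String> request = new HashMap<>();
        request.put("x-canary", "true");
        System.out.println(edge.handle(request).get("route"));
    }
}
```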

Page 36: NetflixOSS for Triangle Devops Oct 2013

Monitoring - Servo

• Annotation based publishing through JMX of application metrics

• Filters, Observers, and Pollers to publish metrics
  – Can export metrics to CloudWatch and other monitors

• The entire Netflix monitoring infrastructure hasn’t been open sourced due to complexity and priority

Page 37: NetflixOSS for Triangle Devops Oct 2013

A note on the next three projects

• I haven’t personally worked with these projects

• Given the audience, I included them as I believe they will be of interest

Page 38: NetflixOSS for Triangle Devops Oct 2013

Edda

• Polls Amazon config and stores the data in a queryable database

• Provides a searchable view of Amazon deployments
  – Searchable in ways not possible from Amazon APIs

• Provides a historical view
  – For correlation of problems to changes
  – Likely less of an issue in clouds that expose all changes

Page 39: NetflixOSS for Triangle Devops Oct 2013

Ice

• Cloud spend and usage analytics

• Communicates with billing API to give a bird’s-eye view of cloud spend with drill down to region, availability zone, and service team through application groups

• Watches on-demand, used and unused reserved instances and instance sizes to help optimize

• Not point in time
  – Shows trends to help predict future optimizations

Page 40: NetflixOSS for Triangle Devops Oct 2013

Denominator

• Java library and CLI for configuring DNS across providers

• Allows for common, quicker (than using the various DNS provider UIs) and automated DNS updates

• Plugins have been developed by various DNS providers

Page 41: NetflixOSS for Triangle Devops Oct 2013

Agenda

• How did I get here?

• Netflix and Netflix OSS platform overview

• Runtime components

• Management components

• Build components

• Automated test and cleanliness components

Page 42: NetflixOSS for Triangle Devops Oct 2013

Get baked!

• Caution: Flame/troll bait ahead!!
• Netflix takes the approach of baking images as part of build such that
  – Instance boot-up doesn’t depend on outside servers
  – Instance boot-up only starts servers already set to run
  – New code = new instances (never update instances in place)
• Why?
  – Critical when launching hundreds of servers at a time
  – Goal to reduce the failure points in places where dynamic system configuration doesn’t provide value
  – Speed of elastic scaling, boot and go
  – Discourages ad hoc changes to server instances
• Criticism – “Netflix is ruining the cloud”
  – Overhead of AMIs for every code version
  – Ties to Amazon AMIs (would this work for containers – I think yes)

Page 43: NetflixOSS for Triangle Devops Oct 2013

AMInator

• Starting image/volume
  – Foundational image created (maybe via loopback), base AMI with common software created/tested independently
• Aminator running – Bakery
  – Bakery obtains a known EBS volume of the base image from a pool
  – Bakery mounts volume and provisions the application (apt/deb or yum/rpm)
  – Bakery snapshots and registers snapshot
• Recent work to add other provisioning such as chef as plugins
• I have used hand built AMIs thus far, but blog states developers can go through CI builds and have running test instances within 15 minutes of code being checked in

Page 44: NetflixOSS for Triangle Devops Oct 2013

Agenda

• How did I get here?

• Netflix and Netflix OSS platform overview

• Runtime components

• Management components

• Build components

• Automated test and cleanliness components

Page 45: NetflixOSS for Triangle Devops Oct 2013

The Simian Army

• A bunch of automated “monkeys” that perform automated system administration tasks

• Anything that is done by a human more than once can and should be automated

• Absolutely necessary at web scale


Page 46: NetflixOSS for Triangle Devops Oct 2013

Good Monkeys

• Janitor Monkey
  – Somewhat a mitigation for the baking approach
  – Will mark and sweep unused resources (instances, volumes, snapshots, ASGs, launch configs, images, etc.)
  – Owners are notified, then the resources are removed
• Conformity Monkey
  – Checks that instances conform to rules around security, ASG/ELB, age, status/health check, etc.

http://www.flickr.com/photos/sonofgroucho/5852049290
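The mark, notify, then sweep cycle for Janitor Monkey can be sketched as two passes over a resource list. This is not Simian Army code; the `Resource` fields, the grace period and the notification text are made up for illustration, and "unused" is reduced to a boolean flag.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy mark-and-sweep janitor; NOT Simian Army code. */
class JanitorSketch {
    static final class Resource {
        final String id;
        final String owner;
        boolean inUse;
        long markedAt = -1; // -1 means "not marked for cleanup"

        Resource(String id, String owner, boolean inUse) {
            this.id = id;
            this.owner = owner;
            this.inUse = inUse;
        }
    }

    private final long graceMillis;
    final List<String> notifications = new ArrayList<>();

    JanitorSketch(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    /** Mark pass: flag newly unused resources and notify owners; unmark ones back in use. */
    void mark(List<Resource> resources, long now) {
        for (Resource r : resources) {
            if (r.inUse) {
                r.markedAt = -1;
            } else if (r.markedAt < 0) {
                r.markedAt = now;
                notifications.add(r.owner + ": " + r.id + " is scheduled for cleanup");
            }
        }
    }

    /** Sweep pass: remove resources that stayed unused for the whole grace period. */
    List<String> sweep(List<Resource> resources, long now) {
        List<String> removed = new ArrayList<>();
        for (Iterator<Resource> it = resources.iterator(); it.hasNext();) {
            Resource r = it.next();
            if (!r.inUse && r.markedAt >= 0 && now - r.markedAt >= graceMillis) {
                it.remove();
                removed.add(r.id);
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        List<Resource> pool = new ArrayList<>();
        pool.add(new Resource("vol-1", "team-a", true));
        pool.add(new Resource("snap-7", "team-b", false));
        JanitorSketch janitor = new JanitorSketch(1_000);
        janitor.mark(pool, 0);
        System.out.println(janitor.notifications);
        System.out.println(janitor.sweep(pool, 2_000));
    }
}
```

The gap between marking and sweeping is what gives owners a chance to react before anything is deleted.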

Page 47: NetflixOSS for Triangle Devops Oct 2013

Back to high availability

• Failure is inevitable. Don’t try to avoid it!

• How do you know if your backup is good?
  – Try to restore from your backup every so often
  – Better to ensure backup works before you have a crashed system and find out your backup is broken

• How do you know if your system is HA?
  – Try to force failures every so often
  – Better to force those failures during office hours
  – Better to ensure HA before you have a down system and angry users
  – Best to learn from failures and add automated tests

Page 48: NetflixOSS for Triangle Devops Oct 2013

Bad Monkeys

• Open Sourced – Chaos Monkey
  – Originally only terminated instances at random
  – Now can also block network, burn cpu, kill processes, fail amazon api, fail dns, fail dynamo, fail s3, introduce network errors/latency, detach volumes, fill disk, burn I/O
• Not yet open sourced
  – Chaos Gorilla
    • Kill all instances in an availability zone
  – Chaos Kong
    • Kill all instances in an entire region
  – Latency Monkey
    • Introduce latency into service calls directly (ribbon server side)

http://www.flickr.com/photos/27261720@N00/132750805
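The original Chaos Monkey behaviour, random instance termination during office hours, can be sketched as a victim picker. This is not Simian Army code. The 9 to 17 window, the per-group probability and the "at most one instance per group" rule are assumptions chosen to keep the example small, and nothing is actually terminated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Toy chaos victim picker; NOT Simian Army code. */
class ChaosSketch {
    private final Random random;
    private final double probability; // chance that a given group loses an instance per run

    ChaosSketch(Random random, double probability) {
        this.random = random;
        this.probability = probability;
    }

    /** Picks at most one instance per group, and only during (assumed) office hours. */
    List<String> pickVictims(Map<String, List<String>> instancesByGroup, int hourOfDay) {
        List<String> victims = new ArrayList<>();
        if (hourOfDay < 9 || hourOfDay >= 17) {
            return victims; // force failures only while people are around to react
        }
        for (List<String> instances : instancesByGroup.values()) {
            if (!instances.isEmpty() && random.nextDouble() < probability) {
                victims.add(instances.get(random.nextInt(instances.size())));
            }
        }
        return victims;
    }

    public static void main(String[] args) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        groups.put("app1-env-v010", Arrays.asList("i-1", "i-2", "i-3"));
        groups.put("web-env-v003", Arrays.asList("i-4", "i-5"));
        ChaosSketch monkey = new ChaosSketch(new Random(), 0.5);
        System.out.println("would terminate: " + monkey.pickVictims(groups, 10));
    }
}
```

With auto scaling groups replacing whatever is terminated, a run like this doubles as a continuous test that recovery really works.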

Page 49: NetflixOSS for Triangle Devops Oct 2013

Agenda

• Blah, blah, blah

• How can I learn more?

• How do I play with this?

• Let’s write some code!


Page 50: NetflixOSS for Triangle Devops Oct 2013

Want to play?

• NetflixOSS blog and github
  – http://techblog.netflix.com
  – http://github.com/Netflix
• Acme Air, NetflixOSS AMIs
  – Try Asgard/Eureka with a real application
  – http://bit.ly/aa-AMIs
• See what we ported to IBM Cloud (video)
  – http://bit.ly/noss-sl-blog
• Fork and submit pull requests to Acme Air
  – http://github.com/aspyker/acmeair-netflix