Netflix Cloud Platform and Open Source

Andrew Spyker @aspyker Netflix Cloud Platform and Open Source

Upload: aspyker

Post on 06-Aug-2015



TRANSCRIPT

Page 1: Netflix Cloud Platform and Open Source

Netflix Cloud Platform and Open Source

Andrew Spyker (@aspyker)

Page 2: Netflix Cloud Platform and Open Source

Agenda
● Introduction
  ○ Big enterprise/datacenter to consumer/cloud
● Netflix Cloud Platform
  ○ Elastic and Web Scale
  ○ High Availability and Auto Recovery
  ○ Continuous Delivery
  ○ Operational Visibility
  ○ Security

Page 3: Netflix Cloud Platform and Open Source

About me, road to Netflix
● Working for IBM on Java/Middleware performance
  ○ Cloud & mobile deemed Enterprise Java benchmarks non-interesting
  ○ Monolithic DBs; resiliency and code updates not required
● Acme Air (Benchmark) Example App
  ○ Showed web/cloud scale
    ■ 4B+ per day mobile requests end to end, hundreds of nodes
    ■ But, wasn’t operable
  ○ Rewrote using NetflixOSS libraries & services
    ■ Now operable, with same levels of scale
    ■ Also enabled Microservices and CI/CD
    ■ Won Netflix Cloud Prize

Page 4: Netflix Cloud Platform and Open Source

About me, road to Netflix
● Now that NetflixOSS was understood
  ○ Ported libraries & services to IBM middleware and cloud
    ■ POCs for OpenStack, Docker, Mesos, Kubernetes
  ○ Started to onboard and operate IBM SaaS businesses
    ■ Most interestingly … IBM Watson
● 2014 - “Should I work on applying this platform to more systems or help build the next cloud platform?”
● Joined Netflix in the cloud platform team
  ○ Focusing on performance/scalability
  ○ Also helping with architecture, containers, open source

@aspyker
ispyker.blogspot.com

Page 5: Netflix Cloud Platform and Open Source

Elastic, Web and Hyper Scale

Doing this

Not doing that

Page 6: Netflix Cloud Platform and Open Source

Elastic, Web and Hyper Scale

[Diagram labels: Load Balancers; Front end API (browser and mobile); Authentication Service; Booking Service; …; Temporal caching; Durable Storage]

Strategy | Benefit
Make deployments automated | Without automation impossible
Expose well designed API to users | Offloads presentation complexity to clients
Remove state for mid tier services | Allows easy elastic scale out
Push temporal state to client and caching tier | Leverage clients, avoids data tier overload
Use partitioned data storage | Data design and storage scales with HA
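To make the "use partitioned data storage" strategy on this slide concrete, here is a minimal Python sketch of hash partitioning. It is an illustration only: `partition_for` and `PartitionedStore` are hypothetical names, the partitions are plain in-memory dictionaries, and nothing here reflects the storage technology Netflix actually uses.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a stable hash (hypothetical helper)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class PartitionedStore:
    """Toy key-value store split across in-memory partitions (illustration only)."""

    def __init__(self, num_partitions: int = 4) -> None:
        self.partitions = [dict() for _ in range(num_partitions)]

    def put(self, key: str, value) -> None:
        index = partition_for(key, len(self.partitions))
        self.partitions[index][key] = value

    def get(self, key: str, default=None):
        index = partition_for(key, len(self.partitions))
        return self.partitions[index].get(key, default)


if __name__ == "__main__":
    store = PartitionedStore(num_partitions=4)
    for i in range(1000):
        store.put(f"booking-{i}", {"seat": i})
    # Each partition holds only a share of the keys, so load spreads out.
    print([len(p) for p in store.partitions])
```

Because every stateless mid tier instance computes the same partition for a given key, any instance can serve any request, which is what makes the elastic scale out in the table above practical.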

Page 7: Netflix Cloud Platform and Open Source

HA and Automatic Recovery

Feeling This

Not Feeling That

Page 8: Netflix Cloud Platform and Open Source

Highly Available Service Runtime Recipe

[Diagram labels: Web App Front End (REST services); Call “Auth Service”; Ribbon REST client with Eureka; Hystrix; Execute auth-service call; Fallback Implementation; App Service (auth-service); Micro service Implementation; Karyon; Eureka Server(s)]

Implementation Detail | Benefits
Decompose into micro services
  • Key user path always available
  • Failure does not propagate across service boundaries
Karyon w/ automatic Eureka registration
  • New instances are quickly found
  • Failing individual instances disappear
Ribbon client with Eureka awareness
  • Load balances & retries across instances with “smarts”
  • Handles temporal instance failure
Hystrix as dependency circuit breaker
  • Allows for fast failure
  • Provides graceful cross service degradation/recovery
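The circuit breaker idea on this slide (fast failure plus a fallback) can be sketched in a few lines of Python. This is not the Hystrix API, which is a Java library with many more features. The class name `CircuitBreaker`, its parameters, and the injectable `clock` are all assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker sketch (hypothetical, not the Hystrix API)."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Circuit is open: fail fast without touching the dependency.
                return fallback()
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_timeout=1.0)

    def call_auth_service():
        raise ConnectionError("auth-service unavailable")

    for _ in range(3):
        print(breaker.call(call_auth_service, lambda: "anonymous session"))
```

The fallback keeps the key user path available while the dependency is down, and the fail-fast branch stops callers from piling up on a service that is already struggling.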

Page 9: Netflix Cloud Platform and Open Source

IaaS High Availability

[Diagram labels: Global Load Balancers; Region (us-east-1); us-east-1c; us-east-1d; us-east-1e; Eureka; Web App; Service1; Service2; Cluster Auto Recovery and Scaling Services (Auto Scaling Groups)]

Rule | Why?
Always > 2 of everything | 1 is a SPOF; 2 doesn’t web scale and makes DR recovery slow
Including IaaS and cloud services | You’re only as strong as your weakest dependency
Use auto scaler/recovery monitoring | Clusters guarantee availability and service latency
Use application level health checks | Instance on the network != healthy
Worldwide availability | Data replication, global front-end routing, cross region traffic
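The rules on this slide combine application level health checks with automatic recovery across zones. The Python sketch below shows one way that reconciliation could look. It is an assumption-laden illustration, not the AWS Auto Scaling API: `Instance`, `plan_replacements`, and the zone-balancing policy are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    instance_id: str
    zone: str
    app_healthy: bool  # result of an application level check, not a network ping


def plan_replacements(instances, zones, desired):
    """Return the zones in which to launch instances to reach `desired` healthy ones.

    Unhealthy instances are ignored, and each launch goes to the zone that
    currently has the fewest healthy instances (hypothetical policy).
    """
    per_zone = Counter({zone: 0 for zone in zones})
    per_zone.update(i.zone for i in instances if i.app_healthy)
    launches = []
    while sum(per_zone.values()) < desired:
        zone = min(zones, key=lambda z: per_zone[z])
        per_zone[zone] += 1
        launches.append(zone)
    return launches


if __name__ == "__main__":
    zones = ["us-east-1c", "us-east-1d", "us-east-1e"]
    cluster = [
        Instance("i-1", "us-east-1c", True),
        Instance("i-2", "us-east-1d", False),  # reachable on the network but failing
        Instance("i-3", "us-east-1e", True),
    ]
    print(plan_replacements(cluster, zones, desired=3))
```

Counting only instances that pass the application check is the point of the "instance on the network != healthy" rule: a reachable but broken instance gets replaced like a dead one.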

Page 10: Netflix Cloud Platform and Open Source

Testing is the only way to prove HA
● Chaos Monkey
  ○ Kills instances in production - runs regularly
● Chaos Gorilla
  ○ Kills availability zones (single datacenter)
  ○ Testing for split brain is also important
● Chaos Kong
  ○ Kills an entire region and shifts traffic globally
  ○ Run frequently but with prior scheduling
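As a rough illustration of the Chaos Monkey idea, the Python sketch below picks one random instance per cluster to terminate. It is not the real Chaos Monkey code. The function `pick_victims` and the `min_survivors` guard are assumptions, and actual termination is left out entirely.

```python
import random


def pick_victims(clusters, rng, min_survivors=2):
    """Choose one instance per cluster to kill (hypothetical selection policy).

    clusters: dict mapping cluster name to a list of instance ids.
    Clusters that would drop below `min_survivors` are skipped.
    """
    victims = {}
    for name, instances in sorted(clusters.items()):
        if len(instances) > min_survivors:
            victims[name] = rng.choice(instances)
    return victims


if __name__ == "__main__":
    clusters = {
        "auth-service": ["i-a1", "i-a2", "i-a3"],
        "booking-service": ["i-b1", "i-b2", "i-b3", "i-b4"],
        "tiny-service": ["i-t1", "i-t2"],  # too small, left alone in this sketch
    }
    print(pick_victims(clusters, random.Random()))
```

Running a selection like this on a regular schedule is what turns instance failure from a rare surprise into a routinely exercised path.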

Page 11: Netflix Cloud Platform and Open Source

Continuous Delivery

Reading This

Not This

Page 12: Netflix Cloud Platform and Open Source

Continuous Delivery

[Diagram labels: Continuous Build Server; Baked to images (AMIs); Cluster v1; Canary v2; Cluster v2; …]

Step | Technology
Developers test locally | Unit test frameworks
Continuous build | Continuous build server based on Gradle builds
Build “bakes” full instance image | Aminator and deployment pipeline bake images from build artifacts
Developers work across dev and test | Archaius allows for environment based context
Developers do canary tests, red/black deployments in prod | Asgard console provides app cluster common devops approach, security patterns, and visibility
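The canary and red/black steps on this slide can be sketched as a simple decision in Python. This is a hedged illustration, not how Asgard works internally: `ClusterStats`, `canary_passes`, `red_black_decision`, and the thresholds are all hypothetical, and a real canary analysis would compare many more signals than error rate.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    name: str
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_passes(baseline, canary, max_ratio=1.5, min_requests=100):
    """Pass the canary if its error rate stays close to the baseline's."""
    if canary.requests < min_requests:
        return False  # not enough traffic to judge
    allowed = max(baseline.error_rate * max_ratio, 0.001)
    return canary.error_rate <= allowed


def red_black_decision(baseline, canary):
    """Decide which cluster takes traffic; the other is kept for fast rollback."""
    if canary_passes(baseline, canary):
        return {"serving": "v2", "standby": "v1"}
    return {"serving": "v1", "standby": None}


if __name__ == "__main__":
    v1 = ClusterStats("cluster-v1", requests=10_000, errors=50)
    v2_canary = ClusterStats("canary-v2", requests=1_000, errors=6)
    print(red_black_decision(v1, v2_canary))
```

Keeping the old cluster on standby rather than deleting it is what makes rollback a traffic switch instead of a redeploy.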

Page 13: Netflix Cloud Platform and Open Source

Operational Visibility

If you can’t see it, you can’t improve it

Page 14: Netflix Cloud Platform and Open Source

Operational Visibility

[Diagram labels: Web App; Auth Service; Servo; Hystrix/Turbine; Vector; Atlas; External Uptime Monitoring; Metric/Event Repositories; LogStash/ElasticSearch/Kibana; Incidents; …]

Visibility Point | Technology
Basic IaaS instance monitoring | Not enough (not scalable, not app specific)
User-like external monitoring | SaaS offerings or OSS like Uptime
Targeted performance, sampling | Vector performance and app level metrics
Service to service interconnects | Hystrix streams ➔ Turbine aggregation ➔ Hystrix dashboard
Application centric metrics | Servo gauges, counters, timers sent to metrics store like Atlas
Remote logging | Logstash/Kibana or similar log aggregation and analysis frameworks
Threshold monitoring and alerts | Services like Atlas and PagerDuty for incident management
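To illustrate application centric metrics and threshold alerts, here is a small Python sketch of counters, gauges, and timers with a threshold check. It does not use the Servo or Atlas APIs, which are separate Java projects. `Metrics`, `check_thresholds`, and the metric names are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """In-process counters, gauges, and timers (hypothetical, not Servo)."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.gauges = {}
        self.timers = defaultdict(list)

    def increment(self, name, amount=1):
        self.counters[name] += amount

    def set_gauge(self, name, value):
        self.gauges[name] = value

    def record_time(self, name, seconds):
        self.timers[name].append(seconds)

    @contextmanager
    def timed(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record_time(name, time.perf_counter() - start)

    def snapshot(self):
        """Flatten everything into one dict, as might be sent to a metrics store."""
        snap = dict(self.counters)
        snap.update(self.gauges)
        for name, samples in self.timers.items():
            snap[name + ".count"] = len(samples)
            snap[name + ".max"] = max(samples)
        return snap


def check_thresholds(snapshot, thresholds):
    """Return the sorted names of metrics whose value exceeds its limit."""
    return sorted(n for n, limit in thresholds.items() if snapshot.get(n, 0) > limit)


if __name__ == "__main__":
    metrics = Metrics()
    metrics.increment("auth.errors", 3)
    metrics.set_gauge("auth.queue_depth", 12)
    with metrics.timed("auth.latency"):
        time.sleep(0.01)
    print(check_thresholds(metrics.snapshot(), {"auth.errors": 2, "auth.queue_depth": 50}))
```

In a real deployment the snapshot would be shipped to a central store and the threshold check would feed an incident system rather than a print statement.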

Page 15: Netflix Cloud Platform and Open Source

Security

Solid Security

Done in new ways

NOT

Page 16: Netflix Cloud Platform and Open Source

Security

Security must consider fluid environment

Security must be automated!

Security Monkey
● Monitors security policies, tracks changes, alerts on situations

Scumblr
● Searches the web, social media for security “nuggets” (credentials, hacking discussions, etc.). Collect via Sketchy.

Sketchy
● A safe way to collect text and screenshots from websites
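The "monitors security policies, tracks changes, alerts" idea can be sketched in Python as a diff between two policy snapshots. This is only an illustration of automated policy monitoring, not Security Monkey's code or data model. The rule format, `diff_policies`, `risky_changes`, and `open_to_world` are assumptions.

```python
def diff_policies(previous, current):
    """Compare two snapshots (name -> rule dict) and list what changed."""
    changes = []
    for name in sorted(set(previous) | set(current)):
        old, new = previous.get(name), current.get(name)
        if old is None:
            changes.append(("added", name, new))
        elif new is None:
            changes.append(("removed", name, old))
        elif old != new:
            changes.append(("changed", name, new))
    return changes


def open_to_world(rule):
    """Hypothetical risk check: the rule allows traffic from any address."""
    return rule.get("cidr") == "0.0.0.0/0"


def risky_changes(changes, is_risky=open_to_world):
    """Keep only added or changed rules that the risk check flags."""
    return [c for c in changes if c[0] != "removed" and is_risky(c[2])]


if __name__ == "__main__":
    yesterday = {"sg-web": {"port": 443, "cidr": "10.0.0.0/8"}}
    today = {
        "sg-web": {"port": 443, "cidr": "0.0.0.0/0"},
        "sg-admin": {"port": 22, "cidr": "10.0.0.0/8"},
    }
    for kind, name, rule in risky_changes(diff_policies(yesterday, today)):
        print(f"ALERT: {name} {kind}: {rule}")
```

Running a comparison like this on every change is one way to keep up with a fluid environment where manual review cannot.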

Page 17: Netflix Cloud Platform and Open Source

What did we not cover?

Over 50 GitHub projects
● “Technical indigestion as a service”

Big Data and User Interface Engineering
● Both deserve their own sections
● Extensive existing and upcoming open source projects (Falcor)

Page 18: Netflix Cloud Platform and Open Source

How do I get started?
● All of the previous slides show NetflixOSS components
  ○ Code: http://netflix.github.io
  ○ Announcements: http://techblog.netflix.com/
● Want to get running a bit faster?
● ZeroToCloud
  ○ Workshop for getting started with build/bake/deploy in Amazon EC2
● ZeroToDocker
  ○ Docker images containing running Netflix technologies (not production ready, but easy to understand)