Python Through the Back Door: Netflix Presentation at CodeMash 2014

Post on 17-Oct-2014

Category:

Technology

DESCRIPTION

A discussion of the technical problems that resulted in adopting Python as a first-class language within Netflix's cloud environment

TRANSCRIPT

Python Through the Back Door

CodeMash 2014

Roy Rapoport @royrapoport rsr@netflix.com

www.linkedin.com/in/royrapoport

A Word About Me

• About 20 years in technology

• Systems engineering, networking, software development, QA, release management

• Time at Netflix: 1655 days (4y:6m:11d)

• Before, at Netflix: Service Delivery in IT/Ops, troubleshooter, Builder of Python Things[tm]

• Current role: Insight Engineering

• Real-Time Operational Insight

Stories We Tell

• Technical Problems

• Howler Monkey

• Alerting

• Python As a First-Class Language

• Culture and People

Problem, Python, People

“Netflix Company Profile Now via self service*

People

Go to your favorite Python REPL and type the following:

import re, requests
content = requests.get("http://ir.netflix.com").content
content = content.replace(" ", " ")
p = re.compile(r".*over (\d+) .*in (\d+)", re.S)
m = re.match(p, content)
print "Netflix is the world's leading Internet \
 subscription service for enjoying TV and movies, \
 with more than {} million subscribers in {} \
 countries.".format(m.group(1), m.group(2))

*No whining. Remember that you’ll never again need to wait for me to update this slide like you had to wait for database access when you started your last job.”

- Jay Zarfoss, http://www.slideshare.net/zarfide
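Jay's snippet is Python 2 and depends on the live page. A Python 3 sketch of the same idea follows. The regular expression is the one from the slide; the function name `subscriber_summary` and the sample text are made up for illustration, and the code runs against a string instead of the network so it can be tried offline.

```python
import re

# Same pattern as the slide; re.S lets "." match across line breaks.
PATTERN = re.compile(r".*over (\d+) .*in (\d+)", re.S)


def subscriber_summary(page_text):
    """Return the profile sentence, or None if the page wording no longer matches."""
    match = PATTERN.match(page_text)
    if match is None:
        return None
    return (
        "Netflix is the world's leading Internet subscription service "
        "for enjoying TV and movies, with more than {} million "
        "subscribers in {} countries.".format(match.group(1), match.group(2))
    )


# Hypothetical sample text standing in for the fetched page.
sample = "Netflix has over 44 million members\nin 41 countries."
print(subscriber_summary(sample))
```

Returning None on a failed match avoids the AttributeError the original would raise if the page wording changed.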

Design Your Culture for Desired Outcomes

1. Speed of innovation

2. Availability

3. Cost

People

Design For What’s Important

Freedom and Responsibility

Hire Smart Experienced People

Set them Loose

Watch Magic Happen

People

Policies

Raise your hand if you love them

People

Policies (How They Usually Work)

11/27/2006 “Sorry, but the standard monitor...is the HP 17" flat panel. I actually told a director last week that they couldn't have a 19" for a new office so I am not picking on just you.”

People

Policies (How They Usually Work)

6/18/2007

“There is a request for quantity 2 17” flat panels. We have received direction from the CIO that no one will have more than 1 flat panel monitor. I just wanted to let you know that there will only be one monitor ordered ... The 17” is our only standard except for Legal.”

People

Policies (How They Usually Work)

• Prescriptive

• Inflexible

• Determined by others

• Slow to change

People

Policies @nflx

01/30/2013, 15:22 PST

I'd like to request a 15” MBP w/ Retina Display. I don't know how much you guys care about CPU specs -- it looks like the bump from 2.3GHz to 2.6GHz is reasonably priced at only about $100, so if it works for you that'd be nice. 16GB RAM and at least 512GB drive.

01/31/2013

12:00 PST: “Forwarded to IT Purchasing to provide a quote to Roy for the requested configuration.”

13:33 PST: “Requesting quote from vendor”

15:32 PST: “Attached is the quote, please approve and I’ll place order”

15:46 PST: “Thanks for the rapid response. Please order.”

15:52 PST: “Ordered. PO #...”

People

Policies @nflx

• Descriptive

• As flexible as we are

• Describe what we choose to do/get

• Evolve quickly

People

The Before Time

Dozens of SSL Certificates

Decentralized

Kept Expiring

Hilarity would ensue

Amazon Resources

“No Preset Limit”

You know when you hit it

Hilarity would ensue

Problem
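The expiring-certificate half of this problem comes down to comparing each certificate's notAfter date with the current time. A minimal sketch using only the standard library follows; the function names and the 30-day warning threshold are assumptions for illustration, not details from the talk.

```python
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now):
    """Whole days from `now` (epoch seconds) to a certificate's notAfter string."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    return int((expires_at - now) // SECONDS_PER_DAY)


def needs_nag(not_after, now, warn_days=30):
    """True once the certificate is within `warn_days` of expiry, or already expired."""
    return days_until_expiry(not_after, now) <= warn_days


# Fixed timestamps so the result does not depend on today's date.
now = ssl.cert_time_to_seconds("Jan  1 00:00:00 2014 GMT")
print(days_until_expiry("Jan 31 00:00:00 2014 GMT", now))
print(needs_nag("Jan 31 00:00:00 2014 GMT", now))
print(needs_nag("Jun  1 00:00:00 2014 GMT", now))
```

Passing `now` explicitly keeps the check easy to test; a scheduled job would pass `time.time()`.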

The Before Time

• Well-developed Developer Ecosystem

• Discovery

• DB Client

• Credentials Management

• Memory Object Cache

• Server Infrastructure

• Telemetry

• You wanted that for Java, right?

Python

The Before Time

• Just moved from IT/Ops

• Formally tasked with SSL cert issue as quarterly goal

• Limits issue “tacked” on

• Happily hackily Pythonic

• Didn’t know Java

Presenter Selfie

Problem, Python

Architecture

ELB

EC2

Filesystem

IP Range

DNS Domain

Cassandra

Certificate

Nagger

CherryPy

Problem

7/10/2011 Ready for beta

Persistence

• Started with SimpleDB

• Then Cassandra

• Drove creation of …

• import Discovery

• import Cassandra

• And a design error

Python

Abstraction

Python

• “The process of separating ideas from specific instances of those ideas at work.”

• Some abstraction: Good

• Too much abstraction burns your tongue*

• Known bug

* Mixed metaphor is mixed
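One way to read "some abstraction: good" in light of the move from SimpleDB to Cassandra is to keep application code behind a small storage interface so the backend can change underneath it. The classes below are hypothetical and use an in-memory dictionary; they are not the talk's actual design.

```python
class MemoryStore:
    """In-memory stand-in for a real backend such as SimpleDB or Cassandra."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key, default=None):
        return self._rows.get(key, default)


class CertificateRegistry:
    """Application code only calls put/get, so the backend can be swapped."""

    def __init__(self, store):
        self._store = store

    def record(self, hostname, not_after):
        self._store.put(hostname, not_after)

    def expiry_for(self, hostname):
        return self._store.get(hostname)


registry = CertificateRegistry(MemoryStore())
registry.record("www.example.com", "Jan 31 00:00:00 2014 GMT")
print(registry.expiry_for("www.example.com"))
```

Two methods is about as thin as such a layer gets; each extra layer on top of it is where the "burns your tongue" warning starts to apply.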

Architecture

Problem

Alerting

Problem

• Enterprise IT Solution

• Managed by the Enterprise IT Alerting People

• File Tickets

• Send alerts to NOC

• Completely separate from telemetry system

Copyright USAID Microlinks. CC Attribution 2.0

Copyright: http://www.flickr.com/photos/s_w_ellis

CC Attribution 2.0 License

Alerting

Problem

•Already had a good telemetry system

•Outsourced notification to PagerDuty

•No alert routing (and deduplication)

Monitoring Alerting Notification
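The gap named on this slide, routing and deduplication, can be illustrated with a small in-memory sketch. Everything here (the class name, the route table, the five-minute window) is hypothetical and is not the gateway Netflix built; a real one would call a notification service instead of returning a string.

```python
class AlertGateway:
    """Toy gateway: drops repeats of the same alert and picks a destination."""

    def __init__(self, routes, default_target, dedup_window=300):
        self.routes = routes                  # alert source -> destination name
        self.default_target = default_target
        self.dedup_window = dedup_window      # seconds
        self._last_sent = {}                  # (source, name) -> time last forwarded

    def handle(self, source, name, timestamp):
        """Return the destination for this alert, or None if it is a duplicate."""
        key = (source, name)
        last = self._last_sent.get(key)
        if last is not None and timestamp - last < self.dedup_window:
            return None
        self._last_sent[key] = timestamp
        return self.routes.get(source, self.default_target)


gateway = AlertGateway({"billing": "pager", "batch": "email"}, default_target="email")
first = gateway.handle("billing", "high_error_rate", 1000)
repeat = gateway.handle("billing", "high_error_rate", 1100)
later = gateway.handle("billing", "high_error_rate", 1400)
other = gateway.handle("unknown", "disk_full", 1400)
print(first, repeat, later, other)
```

The repeat 100 seconds after the first alert is suppressed, while the one 400 seconds later is forwarded again because it falls outside the window.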

Alerting

People

•Space crunch

•New cube mate: @jedberg

•One Month Deadline

Alerting

Problem

alerting api

Central Alert Gateway

PagerDuty

Amazon SES

Atlas

Let’s Wake Someone Up (Livecoding for Fun and Profit)

But Now We Need…

Python

•import Discovery.publish

•import EVCache

•import EpicMetrics

•import Archaius

•import Asgard.Registry

•import AKMS

AKMS?

Python

In [1]: import AKMS
In [2]: ak = AKMS.AKMS("RoyWasHere")
In [3]: ak.keys()
Out[3]: ['MLQBAYLLDIGXPBQB', 'eMr+Mdhv+E4xD+paPCxXF+']
In [4]: a, s = ak.keys()
In [5]: s3_object = boto.connect_s3(a, s)
In [6]: ak = AKMS.AKMS("RoyWasHere", version=2)
In [7]: ak.keys()
Out[7]: ['yn[…]G', 'rV[…]bKfSUHDSA', 'reallyLongStringElided']
In [8]: ak.expiration
Out[8]: 1389165118
In [9]: a, s, s2 = ak.keys()
In [10]: s3_object = boto.connect_s3(a, s, security_token=s2)
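The session shows two shapes of result: two keys from the default call, and three keys plus an expiration with `version=2`. A caller that has to handle both could normalize them before calling boto. The helper below is a hypothetical sketch and not part of AKMS. It only builds the keyword arguments (`aws_access_key_id`, `aws_secret_access_key`, `security_token`) that `boto.connect_s3` accepts and never contacts AWS. Treating `expiration` as epoch seconds is an assumption based on the value shown.

```python
import time


def s3_connect_kwargs(keys, expiration=None, now=None):
    """Map a two- or three-element key list onto boto.connect_s3 keyword arguments."""
    if now is None:
        now = time.time()
    # Assumption: expiration is epoch seconds, as the sample value suggests.
    if expiration is not None and expiration <= now:
        raise ValueError("credentials have expired; fetch a fresh set")
    if len(keys) == 2:
        access, secret = keys
        return {"aws_access_key_id": access, "aws_secret_access_key": secret}
    if len(keys) == 3:
        access, secret, token = keys
        return {
            "aws_access_key_id": access,
            "aws_secret_access_key": secret,
            "security_token": token,
        }
    raise ValueError("expected 2 or 3 keys, got {}".format(len(keys)))


# Placeholder strings, not real credentials.
print(sorted(s3_connect_kwargs(["ACCESS", "SECRET"])))
print(sorted(s3_connect_kwargs(["ACCESS", "SECRET", "TOKEN"], expiration=200, now=100)))
```

With boto installed, the result would be used as `boto.connect_s3(**s3_connect_kwargs(ak.keys(), ak.expiration))`.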

So AKMS

People

•Server more paranoid than most

•Making Python library was a pain

•Remember Jay?

•High lateral trust

•Prioritization autonomy

•Never ask for permission

Lateral Trust

People

•Humans are good game players

•What are the rules?

•Zero-sum games: I want you to lose

•Stack ranking

•Fixed bonus / raise pools

Lateral Trust @nflx

People

•No fixed pools for anything

•No ranking (at all)

•Reviews != raises

•Smart people generally make good decisions

•Global optimization

Subordinate Trust @nflx

People

•Focus on results

•Unleash employees

•Encourage disagreement

•Accept dissent

•Job #1: Attract and retain world-class talent

Manager Trust @nflx

People

•Question, question, question

•Drive for context, not decisions

•Nobody is above questioning

Field of Dreams

Python

•Turned out I wasn’t the only one

•Striking the right balance between MVP and future growth (maybe)

•And if it hadn’t … it’d still have been the right choice

A Virtuous Cycle

Python

•Requirement for high impact

•No process for permission

•Unorthodox language choice

•Lateral support for development

•Increased adoption

•…

•Profit!*

People, Problem

* (or at least a new standard)

http://bit.ly/netflixcmpython

Tell me what you think. You know you want to.

Attributions

http://www.flickr.com/photos/watchsmart/

http://www.flickr.com/photos/yaketyyakyak/

Pem Dorjee Sherpa
