API World 2013 - Transforming the Netflix API
DESCRIPTION
The Netflix API has undergone a transformation since its inception in 2008. It has transitioned from being a public API with a generic RESTful interface to a platform for creating highly optimized, device-centric APIs that are critical to delivering the Netflix streaming experience on over 1000 different device types. This talk covers the design principles that shaped the transformation of the API as well as the technology that powers it, enabling rapid user experience iteration and bringing Netflix streaming to almost 38 million subscribers around the world.
TRANSCRIPT
Transforming the Netflix API
Ben Schmaus, Netflix
October 2013, API World
@schmaus
Streaming TV Shows & Movies Globally
> 1000 Devices
1/3 of Internet at peak
Exclusive Content
Almost 38 million subscribers in over 40 countries
Personalization Engine, User Info, Movie Metadata, Ratings, Similar Movies, Instant Queue, A/B Test Engine
API
Enable UX Innovation
Insulate from Failure
It wasn’t always this way
Foster Innovation from Outside
REST for Easy Interop, Integration
RESTful API
Model resources like users, movies, series, ratings, etc
/catalog/titles
/catalog/titles/movies
/catalog/titles/series
/catalog/titles/series/episodes
/users/lists
/users/title_states
/users/ratings
Enter Streaming Devices
Lots of devices, lots of variety across platforms
External apps
Requests per endpoint:
1 /apps/{app_id}/config
1 /users/{user_id}
1 /users/{user_id}/queues/instant/available
1 /users/{user_id}/lists
26 /catalog/titles?...
Doesn’t include JS, CSS, images, etc.
Device variance: interaction models & form factors
A/B Tests Add to the Challenge
“Bolt on” functionality
expand=@queue_item@title,@box_art_exists,@short_synopsis,@directors,@cast,@episodes,@formats,@format_availability,@subtitle_languages,@languages_and_audio,@maturity_ratings
Resource-based API wasn’t scaling
Model Resources: /catalog, /users
Model Experiences: /tv/home
Reduce network chattiness
Support device optimizations
Enable fast dev iterations
GET /users/{user_id}/lists
apiGateway.getLists(userId)
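The contrast in the two lines above can be sketched roughly as follows: a server-side endpoint calls an in-process method where a device would otherwise issue an HTTP request. This is a minimal Python sketch, not Netflix code. Only `getLists` and its `userId` argument come from the slide; the `ApiGateway` class, its data, and the `tv_home` handler are hypothetical stand-ins.

```python
class ApiGateway:
    """Hypothetical in-process facade over backend services."""

    def __init__(self, lists_by_user):
        # Stand-in data; a real gateway would call mid-tier services.
        self._lists_by_user = lists_by_user

    def getLists(self, userId):
        # In-process counterpart of GET /users/{user_id}/lists.
        return self._lists_by_user.get(userId, [])


def tv_home(api_gateway, user_id):
    """Hypothetical /tv/home handler: assembles one response on the server."""
    lists = api_gateway.getLists(user_id)
    return {"user": user_id, "lists": lists}


gateway = ApiGateway({"u1": ["Top Picks", "Instant Queue"]})
response = tv_home(gateway, "u1")
print(response)
```

The device then makes one call to the handler and never sees the individual resource requests.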
Discrete HTTP requests pay WAN tax repeatedly
Single, optimized request; pay WAN tax once
Client data assembly logic pushed to server
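A back-of-the-envelope version of the "WAN tax" argument, using the request counts listed earlier in the talk (1, 1, 1, 1 and 26 per endpoint). The round-trip times are assumed values picked only for illustration, and the chatty case assumes every request runs sequentially, which real devices may partly avoid by issuing requests in parallel.

```python
# Request counts per endpoint, taken from the earlier slide.
REQUESTS_PER_ENDPOINT = [1, 1, 1, 1, 26]

# Assumed values, for illustration only.
WAN_ROUND_TRIP_MS = 100  # device <-> API over the Internet
LAN_ROUND_TRIP_MS = 2    # API <-> backend inside the data center


def chatty_cost_ms(request_count, wan_ms):
    """Worst case: each request is sequential and pays the WAN round trip."""
    return request_count * wan_ms


def single_request_cost_ms(request_count, wan_ms, lan_ms):
    """One WAN round trip; the fan-out happens server-side over the LAN."""
    return wan_ms + request_count * lan_ms


total_requests = sum(REQUESTS_PER_ENDPOINT)
chatty = chatty_cost_ms(total_requests, WAN_ROUND_TRIP_MS)
single = single_request_cost_ms(
    total_requests, WAN_ROUND_TRIP_MS, LAN_ROUND_TRIP_MS
)
print(total_requests, chatty, single)
```

Under these assumptions, 30 sequential requests cost 3000 ms of network time, against 160 ms for a single request whose fan-out stays on the server side.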
Add server-side scripting capability
Enable independent dev and device optimization
[Diagram sequence, Client | Internet | Server: the client Application calls /tv/home on the API over the Internet. Labels added in later builds: UI Teams (Application and /tv/home), API Team (API, Java Service Layer with RxJava and Hystrix), Mid-tier Services, ELB, Zuul.]
Scriptable Backend + API Layer
https://github.com/Netflix/Hystrix (resilience patterns for distributed systems)
https://github.com/Netflix/RxJava (async, event-based programming)
https://github.com/Netflix/zuul (dynamic request routing, realtime monitoring, resiliency)
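The two ideas behind the first two libraries, concurrent calls to backend services and a fallback when a dependency fails or is slow, can be illustrated without them. The sketch below is Python using only `concurrent.futures` from the standard library. It is not Hystrix or RxJava code and does not mirror their APIs; the three `fetch_*` functions and their data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch_ratings():
    return {"title-1": 4.5}


def fetch_similar():
    raise RuntimeError("similar-movies service is down")


def fetch_queue():
    time.sleep(0.5)  # simulates a slow dependency
    return ["title-9"]


def gather_with_fallbacks(calls, timeout_s):
    """Run all calls concurrently; use a call's fallback on error or timeout.

    `calls` maps a name to a (function, fallback_value) pair. The timeout is
    applied to each wait in turn. Leaving the `with` block still waits for
    slow threads to finish; a production library would abandon them instead.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = {name: pool.submit(fn) for name, (fn, _) in calls.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=timeout_s)
            except Exception:
                # Failure or timeout: degrade instead of failing the request.
                results[name] = calls[name][1]
    return results


calls = {
    "ratings": (fetch_ratings, {}),
    "similar": (fetch_similar, []),
    "queue": (fetch_queue, []),
}
results = gather_with_fallbacks(calls, timeout_s=0.1)
print(results)
```

The response still contains the ratings, with empty placeholders for the failed and the slow dependency, which is the "insulate from failure" goal in miniature.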
Need well-appointed toolkit
[Diagram sequence: script lifecycle across Cassandra, Scriptable Backend, and Device Client]
upload script.groovy for /script
activate /script
Scriptable Backend: load /script
Device Client: GET /script
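The upload, activate, load, GET sequence above can be sketched as a small registry. This is an illustrative Python sketch under several assumptions: an in-memory dictionary stands in for Cassandra, scripts are Python source rather than Groovy, version numbers and the `handle(request)` entry point are invented for the example, and none of the class or method names are Netflix's.

```python
class ScriptStore:
    """In-memory stand-in for the Cassandra-backed script store."""

    def __init__(self):
        self._sources = {}  # (path, version) -> source text
        self._active = {}   # path -> active version

    def upload(self, path, source):
        # Uploading stores a new version but does not make it live.
        version = sum(1 for (p, _) in self._sources if p == path) + 1
        self._sources[(path, version)] = source
        return version

    def activate(self, path, version):
        if (path, version) not in self._sources:
            raise KeyError(f"no version {version} uploaded for {path}")
        self._active[path] = version

    def load_active(self, path):
        return self._sources[(path, self._active[path])]


class ScriptableBackend:
    """Loads active scripts from the store and serves requests with them."""

    def __init__(self, store):
        self._store = store
        self._handlers = {}

    def load(self, path):
        namespace = {}
        # exec is acceptable here only because the source is trusted.
        exec(self._store.load_active(path), namespace)
        self._handlers[path] = namespace["handle"]

    def get(self, path, request):
        return self._handlers[path](request)


store = ScriptStore()
v1 = store.upload(
    "/script", "def handle(request):\n    return {'hello': request['device']}\n"
)
store.activate("/script", v1)
backend = ScriptableBackend(store)
backend.load("/script")
response = backend.get("/script", {"device": "ps3"})
print(response)
```

Keeping upload and activate separate means a new script version can sit in the store without serving traffic until someone activates it.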
Deployment & Ops
Historical & realtime metrics, sort realtime by error/request rate
Distributed grep + tail
2013-05-09.20:38:54 MX 200 us-east-1c i-1824cb73 i-1c61b77f prod NFPS3-001-8G50FJCX... 288404769389848058 90ms api-global.netflix.com GET /tvui/release/470/plus/pathEvaluator -
amazon.ami-id: ami-502eb039
amazon.availability-zone: us-east-1c
amazon.instance-id: i-1824cb73
amazon.instance-type: m2.2xlarge
amazon.local-ipv4: 10.6.213.112
amazon.public-hostname: ec2-54-243-4-69.compute-1.amazonaws.com
amazon.public-ipv4: 54.243.4.69
cookie_esn: NFPS3-001-8G50FJCX...
country: MX
currentTime: 1368131934468
duration-millis: 90
esn: NFPS3-001-8G50FJCX...
geo.city: CIUDADOBREGON...
$ ./simple_stream.py -f -q 'e["country"]=="MX" && e["esn"]==~/NFPS3.*/' -r us
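What the query in that command selects can be shown on a local list of events. The sketch below is plain Python and says nothing about how `simple_stream.py` works internally. It assumes `==~` means a whole-string regular expression match, as it does in Groovy, and the sample events and ESN values are hypothetical.

```python
import re

ESN_PATTERN = re.compile(r"NFPS3.*")


def matches(event):
    """Python rendering of: e["country"]=="MX" && e["esn"]==~/NFPS3.*/"""
    return (
        event.get("country") == "MX"
        and ESN_PATTERN.fullmatch(event.get("esn", "")) is not None
    )


def grep_events(events, predicate):
    """Yield only the events that satisfy the predicate."""
    for event in events:
        if predicate(event):
            yield event


# Hypothetical events; field names follow the log record shown above.
events = [
    {"country": "MX", "esn": "NFPS3-001-EXAMPLE", "duration-millis": 90},
    {"country": "US", "esn": "NFPS3-001-EXAMPLE", "duration-millis": 70},
    {"country": "MX", "esn": "OTHER-001-EXAMPLE", "duration-millis": 120},
]
hits = list(grep_events(events, matches))
print(hits)
```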
Go for haystack handing you the needle
Or at least be able to make smaller haystacks
Run 1% of your traffic on the new code and see how it does
API ami-123 vs. API ami-456
Compare: 2xx, 4xx, 5xx, latency, busy threads, load, ...
Confidence score for each AMI based on comparison of 1000+ metrics
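The talk does not say how the confidence score is computed. One simple possibility, shown here purely as an assumption, is the percentage of metrics on which the canary stays within a relative tolerance of the baseline. The metric values, the 10% tolerance, and the function name are hypothetical.

```python
def confidence_score(baseline, canary, tolerance=0.10):
    """Percentage of shared metrics where the canary is within tolerance."""
    shared = sorted(set(baseline) & set(canary))
    if not shared:
        raise ValueError("no metrics in common")
    passing = 0
    for name in shared:
        base, cand = baseline[name], canary[name]
        if base == 0:
            ok = cand == 0
        else:
            ok = abs(cand - base) / abs(base) <= tolerance
        if ok:
            passing += 1
    return 100.0 * passing / len(shared)


# Hypothetical metric snapshots for the two AMIs named on the slide.
baseline = {"2xx": 9800, "4xx": 150, "5xx": 50, "latency_ms": 90, "busy_threads": 40}
canary = {"2xx": 9790, "4xx": 155, "5xx": 80, "latency_ms": 93, "busy_threads": 41}
score = confidence_score(baseline, canary)
print(score)
```

Here the jump in 5xx responses is the only metric outside tolerance, so four of five metrics pass. A real scorer would presumably weight metrics by importance rather than count them equally.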
Scannable visualization of metric space
More important
Less important
Your basic red/black push
Doing red/black by hand for multiple clusters across multiple regions is not fun
Automate multi-cluster/region pushes
Don't forget to automate rollbacks, too!
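A rough sketch of an automated multi-region red/black push that also automates the rollback. Clusters are plain dictionaries and the health check is a caller-supplied function; both are simplifications, and none of the names reflect Netflix tooling. The AMI ids reuse the placeholders from the earlier slide, and the second region is hypothetical.

```python
def red_black_push(cluster, new_ami, health_check):
    """Shift a cluster to new_ami; switch back to the old AMI if unhealthy."""
    old_ami = cluster["active"]
    # Bring up the new version and send traffic to it, keeping the old one.
    cluster["active"], cluster["standby"] = new_ami, old_ami
    if health_check(cluster):
        return True
    # Rollback: the old version is still around, so just switch back.
    cluster["active"], cluster["standby"] = old_ami, new_ami
    return False


def push_everywhere(clusters, new_ami, health_check):
    """Push cluster by cluster; on failure, roll back clusters already pushed."""
    pushed = []
    for name, cluster in clusters.items():
        previous = cluster["active"]
        if red_black_push(cluster, new_ami, health_check):
            pushed.append((name, previous))
            continue
        for done_name, done_previous in reversed(pushed):
            done = clusters[done_name]
            done["active"], done["standby"] = done_previous, new_ami
        return False
    return True


clusters = {
    "us-east-1": {"region": "us-east-1", "active": "ami-123", "standby": None},
    "eu-west-1": {"region": "eu-west-1", "active": "ami-123", "standby": None},
}


def unhealthy_in_eu(cluster):
    # Simulated check: the new code misbehaves in one region only.
    return cluster["region"] != "eu-west-1"


ok = push_everywhere(clusters, "ami-456", unhealthy_in_eu)
print(ok, clusters)
```

Because the failed region triggers a rollback of the region that had already succeeded, every cluster ends up back on the previous AMI.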
$Who, $What, $Where, $When
e.g., "bschmaus, ami-123, Sandbox Canary, 2013-05-06 19:05"
Latest prod change in chat topic
Quickly see status of all clusters in a region
API for APIs
Control network interactions
Device optimization
Development agility
Automation & insights
Continuously experiment to make hard things easier