A/B Testing at Netflix: Experimentation Platform
Steve [email protected]
Importance of A/B Testing at Netflix
• Technology is just one part of the equation: a culture of experimentation is the other essential part
• All product ideas are subjected to the scientific method, with actual data supporting changes before changes are rolled out to all users
• The effectiveness of any idea is measured without bias - the seniority of the person proposing the idea is irrelevant
Our Users
A/B testing enables product decisions throughout Netflix, with our users spread across all departments
• Data Scientists: Does this new ranking algorithm result in more plays?
• Product Managers: Does this new UI reduce the time for users to find content?
• Marketing: Which email campaign resulted in more new subscribers?
• Content: Which thumbnail image resulted in more streams of Daredevil?
• Engineers: Is the new implementation of this streaming algorithm more performant when internet connectivity is spotty?
• and so on...
A/B Testing Platform Objectives
• Being an internal tool is not an excuse for poor UX
• Given the diverse expertise of our users, workflows must be simple and effective while providing value
• Cover all generic test management scenarios
• Easily accommodate unique experimentation needs as they come up
• Ingest and combine real-time behavioral and batch metadata from numerous sources
We’re Hiring
We’re looking for a Full-Stack Engineer to help across the board:
• Collaborate with users across Netflix to understand their UI needs
• Be part of a team of engineers and UX experts
• Tech stack: Java, React, Node
• Data visualization experience is a plus
Netflix has a unique culture. Read about it here.
We need a Server-Side Engineer with
expertise designing distributed systems:
• Help design and rebuild our allocation
engine
• Experience processing large datasets -
including efficient incorporation of near
real-time data
• Expertise with various Big Data databases
• Machine learning experience is a plus
WAIT, THAT’S NOT ENOUGH
I WANT TO GO DEEPER
A or B
Which Version is Better?
Which set of recommendations is better?
A or B
Given that I Watched House of Cards...
Hard to Answer Without Disciplined Experimentation
A? or B?
A/B Testing Process
Target Population
Hypothesis: Retention and/or engagement will improve with new recommendation algorithm.
Process: Randomly group users into different buckets. Other than the tests, all other factors are constant.
Control Group: Continue to experience the current version (A)
Test Group B: Experience version B
Test Group C: Experience version C
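The random grouping of users into buckets can be illustrated with a small sketch. This is not the platform's allocation engine: the function name `assign_cell`, the experiment ID `new-recs-algo`, and the choice of SHA-256 hashing are all assumptions made for the example. Hashing the customer ID together with the experiment ID gives an assignment that looks random across customers but is repeatable for any one customer.

```python
import hashlib

def assign_cell(customer_id: str, experiment_id: str, cells: list) -> str:
    """Deterministically map a customer to one cell of an experiment.

    Hypothetical helper: hashing (experiment_id, customer_id) yields a
    pseudo-random but repeatable assignment, so a customer always sees
    the same experience for a given experiment.
    """
    key = f"{experiment_id}:{customer_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer, then reduce modulo cell count.
    bucket = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[bucket]

cells = ["A (control)", "B", "C"]
counts = {cell: 0 for cell in cells}
for i in range(30_000):
    counts[assign_cell(f"customer-{i}", "new-recs-algo", cells)] += 1
print(counts)  # each cell receives roughly a third of the customers
```

Salting the hash with the experiment ID means the same customer can land in different cells in different experiments, which keeps assignments across tests from being correlated.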
A/B Testing Process Continued
Analyze & Compare Key Results
Algorithm A (Control)
• Viewing hours delta: N/A, as this is what we are measuring other options against
Algorithm B
• Viewing hours delta: +2.3%
• Statistically Significant: Yes
• 2.3% better than the control, and we’re confident about it
Algorithm C?...
• Viewing hours delta: -5.7%
• Statistically Significant: Yes
• Ouch! Don’t use this algorithm.
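A comparison of this kind, a relative delta plus a significance call, can be sketched as follows. The slides do not say which statistical test is used, so the two-sample z-test below is an assumption, as are the function name `compare_to_control`, the 0.05 threshold, and the synthetic viewing-hours data (generated so that the true lift is about +2.3%).

```python
import math
import random
from statistics import NormalDist, mean, variance

def compare_to_control(control, test, alpha=0.05):
    """Return (relative delta, p-value, significant?) for test vs. control."""
    m_c, m_t = mean(control), mean(test)
    # Standard error of the difference in means (unequal variances allowed).
    se = math.sqrt(variance(control) / len(control) + variance(test) / len(test))
    z = (m_t - m_c) / se
    # Two-sided p-value under the normal approximation (large samples).
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return (m_t - m_c) / m_c, p_value, p_value < alpha

# Synthetic viewing hours, made up purely for illustration.
rng = random.Random(42)
control = [rng.gauss(20.0, 5.0) for _ in range(20_000)]
algo_b = [rng.gauss(20.46, 5.0) for _ in range(20_000)]  # true lift ~ +2.3%

delta, p, significant = compare_to_control(control, algo_b)
print(f"delta={delta:+.1%}  p={p:.4f}  significant={significant}")
```

The normal approximation is reasonable only for large samples; with small cells a t-test or a resampling method would be the safer choice.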
Data Driven Results
A or B
Technology Stack
[Architecture diagram. Recoverable labels: Experimentation Service; Define Experiments; Experiment Criteria; Sampling; Evaluate Eligibility; Persist/Retrieve Allocations; Metadata; Allocations; Ad Hoc queries; REST API (Allocate Customers, Retrieve Allocations); Real-time Analysis & Monitoring; Persist Metrics; Health Metrics; Visualize; Other Netflix Services]
Allocation & Stratification
All US Regions
● Randomly distribute and assign customers to a variant in the experiment utilizing Stratified Sampling
● Start, Stop, and Track allocations in near real-time
Percentage of Users*:
North East 22%
South East 13%
South West 17%
... ...
*Numerical values are for illustrative purposes only and are totally made up
“Random sampling” with enforcement of sample
proportions across regions
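The idea of random sampling with enforced proportions can be sketched as below. This is a minimal illustration rather than the platform's implementation: the function `stratified_sample`, the 10% sampling fraction, and the population sizes are all hypothetical (the sizes loosely echo the made-up percentages above).

```python
import random
from collections import defaultdict

def stratified_sample(customers, stratum_of, fraction, seed=0):
    """Sample the same fraction of customers from every stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for customer in customers:
        by_stratum[stratum_of(customer)].append(customer)
    sample = []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        k = round(len(members) * fraction)
        # Random within the stratum, but the stratum's share is enforced.
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: (customer_id, region) pairs with made-up sizes.
population = (
    [(f"ne-{i}", "North East") for i in range(2200)]
    + [(f"se-{i}", "South East") for i in range(1300)]
    + [(f"sw-{i}", "South West") for i in range(1700)]
)
sample = stratified_sample(population, stratum_of=lambda c: c[1], fraction=0.10)
share = defaultdict(int)
for _, region in sample:
    share[region] += 1
print(dict(share))  # {'North East': 220, 'South East': 130, 'South West': 170}
```

A plain random sample of the same size would match these proportions only on average; sampling per stratum guarantees them in every draw.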
Segmentation
Target Population
● Divide a broad target population into subsets with similar properties
● Some tests are meant to measure impact on specific populations
● Must maintain scale and low latencies
Segmentation by specific properties
Haven’t used a tablet to access Netflix in n days
Used a game console to access Netflix within last n days
Smart TV users
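Segment rules like these can be expressed as simple predicates over customer properties. The sketch below is illustrative only: the `Customer` class, its `last_access` field, the device names, and the reference date are assumptions, and a production system would evaluate such rules against precomputed data to keep latencies low.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class Customer:
    customer_id: str
    # Hypothetical field: device type -> date of last access.
    last_access: dict = field(default_factory=dict)

def used_within(device, n_days, today):
    """Segment rule: customer used `device` within the last n days."""
    cutoff = today - timedelta(days=n_days)
    return lambda c: device in c.last_access and c.last_access[device] >= cutoff

def not_used_within(device, n_days, today):
    """Segment rule: customer has not used `device` in the last n days."""
    rule = used_within(device, n_days, today)
    return lambda c: not rule(c)

today = date(2016, 1, 31)  # arbitrary reference date for the example
customers = [
    Customer("c1", {"tablet": date(2016, 1, 30), "smart_tv": date(2016, 1, 29)}),
    Customer("c2", {"game_console": date(2016, 1, 25)}),
    Customer("c3", {"tablet": date(2015, 11, 1)}),
]
no_recent_tablet = not_used_within("tablet", 30, today)
recent_console = used_within("game_console", 30, today)
print([c.customer_id for c in customers if no_recent_tablet(c)])  # ['c2', 'c3']
print([c.customer_id for c in customers if recent_console(c)])    # ['c2']
```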
Test Health
● All test experiences are not equal, but we must ensure this isn’t due to buggy implementations
● Issues can be device specific, so we must monitor at device, test, and experience granularity
● The example below is super-simplified - we need to create visualizations which effectively convey test health, internationally, across thousands of devices
Control Cell: No errors/fallbacks
Experience A: Issue on TV UI detected
Experience B: No errors/fallbacks
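Monitoring at device, test, and experience granularity amounts to grouping error events by that triple and flagging outliers. The sketch below assumes a hypothetical event shape of `(test, experience, device, is_error)` and made-up thresholds; the real platform's metrics and alerting rules are not described in the slides.

```python
from collections import defaultdict

def unhealthy_cells(events, max_error_rate=0.05, min_events=100):
    """Flag (test, experience, device) groups whose error rate is too high.

    `events` is an iterable of (test, experience, device, is_error) tuples.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for test, experience, device, is_error in events:
        key = (test, experience, device)
        totals[key] += 1
        errors[key] += int(is_error)
    return {
        key: errors[key] / totals[key]
        for key in totals
        if totals[key] >= min_events and errors[key] / totals[key] > max_error_rate
    }

# Made-up events: Experience A misbehaves on the TV UI only.
events = (
    [("t1", "control", "tv", False)] * 200
    + [("t1", "A", "tv", True)] * 30 + [("t1", "A", "tv", False)] * 170
    + [("t1", "A", "mobile", False)] * 200
    + [("t1", "B", "tv", False)] * 200
)
print(unhealthy_cells(events))  # {('t1', 'A', 'tv'): 0.15}
```

The `min_events` guard keeps a handful of errors on a rarely used device from raising an alert on too little data.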
ABlaze UI: Test Lifecycle Management
Initial Planning: Test Configuration Screens
● Determine hypothesis
● Implement each test experience
Schedule Test: Scheduler View
● Define real-time rules & conditions
● Consider potential conflicts
Monitor Test: Dashboard and Alert Views
● Monitor test health over time
○ Real-time analysis and alerting on metrics and allocations
● Pull test if bugs/issues present themselves
Hypothesis Evaluation: Comparison Views
● Interactive filtering, analysis, & visualization of data
● Call success or failure of test
Implement or Re-Test
● Devise plan to roll winning experience (if any) out to production
● Else, potentially revise hypothesis and retest
Some Challenges
• Operate resiliently and at low latencies, despite:
  • Customer allocations taking place in real-time
  • Need for near real-time insights into test health over massive datasets
  • Data that is distributed across multiple clusters
• Data processing:
  • Joins across billions of rows of data from many sources can cause a massive increase in number of rows
  • Efficient management of datasets to support interactive analysis, dashboards, etc.
  • Rich and flexible filtering to support interactive analysis
  • Extract forecasts and insights
• Oh, and make it as easy to use as possible for the users...
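The row-count increase from joins can be seen even on a toy dataset. The sketch below uses made-up tables and shows one common mitigation, aggregating each source to one row per customer before joining; the slides do not say how the platform itself handles this.

```python
from collections import Counter

# Made-up rows: each customer has one allocation and many events.
allocations = [("c1", "A"), ("c2", "B")]
plays = [("c1", 1.5), ("c1", 0.5), ("c1", 2.0), ("c2", 1.0)]  # hours per play
sessions = [("c1", "tv"), ("c1", "mobile"), ("c2", "tv")]

# Naive join on customer ID: every play pairs with every session.
naive = [
    (cust, cell, hours, device)
    for cust, cell in allocations
    for p_cust, hours in plays if p_cust == cust
    for s_cust, device in sessions if s_cust == cust
]
print(len(naive))  # 7 rows from 2 allocations (3*2 for c1, 1*1 for c2)

# Aggregating each source to one row per customer first keeps the join 1:1.
hours_by_customer = Counter()
for cust, hours in plays:
    hours_by_customer[cust] += hours
sessions_by_customer = Counter(cust for cust, _ in sessions)
joined = [
    (cust, cell, hours_by_customer[cust], sessions_by_customer[cust])
    for cust, cell in allocations
]
print(joined)  # [('c1', 'A', 4.0, 2), ('c2', 'B', 1.0, 1)]
```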