experimentation platform at netflix

A/B Testing at Netflix: Experimentation Platform

Steve [email protected]

Importance of A/B Testing at Netflix

• Technology is just one part of the equation: a culture of experimentation is the other essential part

• All product ideas are subjected to the scientific method, with actual data supporting changes before changes are rolled out to all users

• The effectiveness of any idea is measured without bias - the seniority of the person proposing the idea is irrelevant

Our Users

A/B testing enables product decisions throughout Netflix, with our users spread across all departments

• Data Scientists: Does this new ranking algorithm result in more plays?

• Product Managers: Does this new UI reduce the time for users to find content?

• Marketing: Which email campaign resulted in more new subscribers?

• Content: Which thumbnail image resulted in more streams of Daredevil?

• Engineers: Is the new implementation of this streaming algorithm more performant when internet connectivity is spotty?

• and so on...

A/B Testing Platform Objectives

• Being an internal tool is not an excuse for poor UX

• Given the diverse expertise of our users, workflows must be simple and effective while providing value

• Cover all generic test management scenarios

• Easily accommodate unique experimentation needs as they come up

• Ingest and combine real-time behavioral and batch metadata from numerous sources

We’re Hiring

Netflix has a unique culture. Read about it here: http://www.slideshare.net/reed2001/culture-1798664

We’re looking for a Full-Stack Engineer to help across the board:

• Collaborate with users across Netflix to understand their UI needs

• Be part of a team of engineers and UX experts

• Tech stack: Java, React, Node

• Data visualization experience is a plus

We need a Server-Side Engineer with expertise designing distributed systems:

• Help design and rebuild our allocation engine

• Experience processing large datasets - including efficient incorporation of near real-time data

• Expertise with various Big Data databases

• Machine learning experience is a plus

https://jobs.netflix.com/jobs/448


WAIT, THAT’S NOT ENOUGH

I WANT TO GO DEEPER

Which Version is Better?

A or B

Which set of recommendations is better?

Given that I watched House of Cards...

A or B

Hard to Answer Without Disciplined Experimentation

A? or B?

A/B Testing Process

Target Population

Hypothesis: Retention and/or engagement will improve with the new recommendation algorithm.

Process: Randomly group users into different buckets. Apart from the experience being tested, all other factors are held constant.

Control Group: Continue to experience the current version (A)

Test Group B: Experience version B

Test Group C: Experience version C
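
The random grouping described above can be sketched with a deterministic hash, so that a given user always lands in the same bucket for a given experiment. This is only an illustration: the slides do not say how buckets are assigned, and the function name, cell labels, and experiment id below are hypothetical.

```python
import hashlib
from collections import Counter

# Hypothetical cell labels mirroring the Control Group and Test Groups B and C.
CELLS = ["A (control)", "B", "C"]


def assign_cell(user_id: str, experiment_id: str, cells=CELLS) -> str:
    """Deterministically map a user to one cell of an experiment."""
    # Salting with the experiment id keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[bucket]


if __name__ == "__main__":
    counts = Counter(assign_cell(f"user-{i}", "new-recs-algo") for i in range(30_000))
    for cell in CELLS:
        print(cell, counts[cell])
```

Hashing rather than drawing a fresh random number means the assignment can be recomputed anywhere without a lookup, although a real platform would still persist allocations as the later slides describe.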

A/B Testing Process Continued

Analyze & Compare Key Results

Algorithm A (Control)

• Viewing hours delta: N/A, as this is what we are measuring the other options against

Algorithm B

• Viewing hours delta: +2.3%

• Statistically Significant: Yes

• 2.3% better than the control, and we’re confident about it

Algorithm C

• Viewing hours delta: -5.7%

• Statistically Significant: Yes

• Ouch! Don’t use this algorithm.

...
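
As a rough sketch of how a "viewing hours delta" and a "statistically significant" verdict might be computed, the example below runs a two-sample z-test on simulated data. The slides do not say which statistical test Netflix uses; the test choice, the 0.05 threshold, the function names, and the simulated viewing hours are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, mean, variance


def compare_to_control(control, treatment, alpha=0.05):
    """Return (relative delta in %, two-sided p-value, significant?)."""
    mean_c, mean_t = mean(control), mean(treatment)
    delta_pct = 100.0 * (mean_t - mean_c) / mean_c
    # Standard error of the difference in means, allowing unequal variances.
    se = math.sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    z = (mean_t - mean_c) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return delta_pct, p_value, p_value < alpha


def simulate_hours(mean_hours, n, rng):
    """Made-up viewing hours per user, normally distributed for simplicity."""
    return [rng.gauss(mean_hours, 3.0) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(42)
    control = simulate_hours(10.0, 20_000, rng)
    for name, true_mean in [("B", 10.23), ("C", 9.43)]:
        treatment = simulate_hours(true_mean, 20_000, rng)
        delta, p, significant = compare_to_control(control, treatment)
        print(f"Algorithm {name}: delta {delta:+.1f}%, p = {p:.3g}, significant: {significant}")
```

The normal approximation is reasonable here only because the samples are large; small or heavily skewed samples would call for a different test.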

Data Driven Results

A or B

Technology Stack

[Architecture diagram. Component labels: Experimentation Service; Define Experiments; Experiment Criteria; Sampling; Evaluate Eligibility; Persist/Retrieve Allocations; Metadata; Allocations; Ad Hoc queries; REST API (Allocate Customers, Retrieve Allocations); Real-time Analysis & Monitoring; Persist Metrics; Health Metrics; Visualize; Other Netflix Services]

Allocation & Stratification

All US Regions

● Randomly distribute and assign customers to a variant in the experiment utilizing Stratified Sampling

● Start, Stop, and Track allocations in near real-time

Percentage of Users*:

North East    22%
South East    13%
South West    17%
...           ...

*Numerical values are for illustrative purposes only and are totally made up

“Random sampling” with enforcement of sample proportions across regions

http://en.wikipedia.org/wiki/Stratified_sampling
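
A minimal sketch of stratified allocation, assuming region is the only stratum: users are shuffled within each region and dealt round-robin to the cells, so every cell ends up with the same regional mix as the population. The function names and user counts are hypothetical (the counts only loosely echo the illustrative table), and this is not a description of Netflix's allocation engine.

```python
import random
from collections import Counter, defaultdict


def stratified_allocate(users, cells, seed=0):
    """users: iterable of (user_id, region) pairs. Returns {user_id: cell}."""
    rng = random.Random(seed)
    by_region = defaultdict(list)
    for user_id, region in users:
        by_region[region].append(user_id)

    allocation = {}
    for region in sorted(by_region):
        members = by_region[region]
        rng.shuffle(members)                # random order within the stratum
        offset = rng.randrange(len(cells))  # avoid always favouring the first cell
        for i, user_id in enumerate(members):
            allocation[user_id] = cells[(i + offset) % len(cells)]
    return allocation


def cell_counts_by_region(users, allocation):
    """Count how many users of each region landed in each cell."""
    counts = defaultdict(Counter)
    for user_id, region in users:
        counts[region][allocation[user_id]] += 1
    return counts


if __name__ == "__main__":
    users = []
    for region, count in [("North East", 220), ("South East", 130), ("South West", 170)]:
        users += [(f"{region}-{i}", region) for i in range(count)]
    allocation = stratified_allocate(users, ["A", "B"])
    for region, counts in cell_counts_by_region(users, allocation).items():
        print(region, dict(counts))
```

Plain random assignment would only match the regional proportions on average; dealing within each stratum enforces them up to a difference of one user per region.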

Segmentation

Target Population

● Divide a broad target population into subsets with similar properties

● Some tests are meant to measure impact on specific populations

● Must maintain scale and low latencies

Segmentation by specific properties

Haven’t used a tablet to access Netflix in n days

Used a game console to access Netflix within last n days

Smart TV users
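
The three example segments above can be expressed as simple predicates over a customer record. The sketch below is hypothetical throughout: the `Customer` shape, the device keys, the segment names, and the choice to treat "never used" as "not used in n days" are assumptions, not details from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Customer:
    customer_id: str
    # Days since the customer last used each device type; a missing key means never used.
    days_since_last_use: Dict[str, int] = field(default_factory=dict)


def used_within(device: str, n: int):
    """Predicate: the customer used `device` within the last n days."""
    def predicate(customer: Customer) -> bool:
        last = customer.days_since_last_use.get(device)
        return last is not None and last <= n
    return predicate


def not_used_within(device: str, n: int):
    """Predicate: the customer has not used `device` in the last n days (or ever)."""
    within = used_within(device, n)
    return lambda customer: not within(customer)


def build_segments(n: int):
    return {
        "tablet_lapsed": not_used_within("tablet", n),
        "console_recent": used_within("game_console", n),
        "smart_tv": lambda customer: "smart_tv" in customer.days_since_last_use,
    }


def segments_for(customer: Customer, n: int = 30):
    """Names of all segments the customer belongs to, sorted alphabetically."""
    return sorted(name for name, pred in build_segments(n).items() if pred(customer))


if __name__ == "__main__":
    print(segments_for(Customer("c1", {"tablet": 45, "game_console": 3})))
    print(segments_for(Customer("c2", {"smart_tv": 1})))
```

Evaluating predicates in memory like this is cheap per customer; meeting the scale and latency requirement would depend mostly on how the underlying device-usage data is stored and fetched.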

Test Health

● Test experiences will not all perform equally, but we must ensure the differences aren’t due to buggy implementations

● Issues can be device specific, so we must monitor at device, test, and experience granularity

● The example below is super-simplified - we need to create visualizations which effectively convey test health, internationally, across thousands of devices

Control Cell: No errors/fallbacks

Experience A: Issue on TV UI detected

Experience B: No errors/fallbacks
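
Monitoring at device, test, and experience granularity can be sketched as an error-rate roll-up keyed by that triple. The event format, the thresholds, and the function name below are assumptions made for illustration, and the sample events are synthetic.

```python
from collections import defaultdict


def health_report(events, max_error_rate=0.02, min_events=100):
    """events: iterable of (test_id, experience, device, is_error) tuples.

    Returns a sorted list of ((test_id, experience, device), error_rate)
    for every combination whose error rate exceeds max_error_rate.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [event count, error count]
    for test_id, experience, device, is_error in events:
        counts = totals[(test_id, experience, device)]
        counts[0] += 1
        counts[1] += int(is_error)

    flagged = []
    for key, (n, errors) in sorted(totals.items()):
        # Ignore combinations with too little traffic to judge.
        if n >= min_events and errors / n > max_error_rate:
            flagged.append((key, errors / n))
    return flagged


def synthetic_events():
    """Made-up traffic in which only Experience A on TV misbehaves."""
    events = []
    for experience in ("control", "A", "B"):
        for device in ("tv", "phone"):
            for i in range(1000):
                broken = experience == "A" and device == "tv"
                is_error = i % 10 == 0 if broken else i % 200 == 0
                events.append(("test-1", experience, device, is_error))
    return events


if __name__ == "__main__":
    for key, rate in health_report(synthetic_events()):
        print(key, f"{rate:.1%}")
```

Keying on the full triple is what lets a TV-only problem in one experience surface, instead of being averaged away across healthy devices.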

ABlaze UI: Test Lifecycle Management

Initial Planning: Test Configuration Screens

● Determine hypothesis

● Implement each test experience

Schedule Test: Scheduler View

● Define real-time rules & conditions

● Consider potential conflicts

Monitor Test: Dashboard and Alert Views

● Monitor test health over time

○ Real-time analysis and alerting on metrics and allocations

● Pull test if bugs/issues present themselves

Hypothesis Evaluation: Comparison Views

● Interactive filtering, analysis, & visualization of data

● Call success or failure of test

Implement or Re-Test

● Devise plan to roll winning experience (if any) out to production

● Else, potentially revise hypothesis and retest
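
One way to read the lifecycle above is as a small state machine. The sketch below is an interpretation, not ABlaze's actual model: the stage names, the allowed transitions, and the decision to treat a pulled test as terminal are all assumptions.

```python
from enum import Enum


class Stage(Enum):
    PLANNING = "Initial Planning"
    SCHEDULED = "Schedule Test"
    MONITORING = "Monitor Test"
    EVALUATION = "Hypothesis Evaluation"
    ROLLOUT = "Implement"
    RETEST = "Re-Test"
    PULLED = "Pulled"


# Allowed moves between stages, following the order of the lifecycle steps.
TRANSITIONS = {
    Stage.PLANNING: {Stage.SCHEDULED},
    Stage.SCHEDULED: {Stage.MONITORING},
    Stage.MONITORING: {Stage.EVALUATION, Stage.PULLED},  # pull if bugs/issues appear
    Stage.EVALUATION: {Stage.ROLLOUT, Stage.RETEST},
    Stage.RETEST: {Stage.PLANNING},  # revise the hypothesis and start over
    Stage.ROLLOUT: set(),
    Stage.PULLED: set(),
}


def advance(current: Stage, nxt: Stage) -> Stage:
    """Return nxt if the move is allowed, otherwise raise ValueError."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value!r} to {nxt.value!r}")
    return nxt


if __name__ == "__main__":
    stage = Stage.PLANNING
    for nxt in (Stage.SCHEDULED, Stage.MONITORING, Stage.EVALUATION, Stage.RETEST):
        stage = advance(stage, nxt)
        print(stage.value)
```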

Some Challenges

• Operate resiliently and at low latencies, despite:

    • Customer allocations taking place in real-time

    • Need for near real-time insights into test health over massive datasets

    • Data that is distributed across multiple clusters

• Data processing:

    • Joins across billions of rows of data from many sources can cause massive increase in number of rows

• Efficient management of datasets to support interactive analysis, dashboards, etc.

• Rich and flexible filtering to support interactive analysis

• Extract forecasts and insights

• Oh, and make it as easy to use as possible for the users...
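
The row blow-up from joins mentioned under data processing can be shown on a toy scale: joining one-row-per-allocation data to one-row-per-event data multiplies the two per-customer counts. The data shapes and function names below are hypothetical, and pre-aggregating before the join is just one common mitigation, not necessarily what the platform does.

```python
from collections import Counter, defaultdict


def join_row_count(left_keys, right_keys):
    """Number of rows an inner equi-join on these keys would produce."""
    left, right = Counter(left_keys), Counter(right_keys)
    return sum(n * right[key] for key, n in left.items())


def total_hours_per_customer(events):
    """Collapse (customer_id, hours) events to one total per customer."""
    totals = defaultdict(float)
    for customer_id, hours in events:
        totals[customer_id] += hours
    return dict(totals)


if __name__ == "__main__":
    customers = ["c1", "c2", "c3"]
    # Each customer is allocated to 4 tests and has 50 playback events.
    allocations = [(c, f"test-{t}") for c in customers for t in range(4)]
    events = [(c, 0.5) for c in customers for _ in range(50)]

    allocation_keys = [c for c, _ in allocations]
    raw = join_row_count(allocation_keys, [c for c, _ in events])
    reduced = join_row_count(allocation_keys, total_hours_per_customer(events).keys())
    print("rows when joining raw events:", raw)
    print("rows after pre-aggregating events:", reduced)
```

With 12 allocation rows and 150 event rows, the raw join yields 3 x 4 x 50 = 600 rows, while aggregating events to one row per customer first keeps the result at 12.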