
1

The Quest for the Optimal Experiment
RecSys 10-06-14

2

‘Science & Algorithms’ at Netflix

[Diagram: spectrum from Correlation to Causation]

Experimentation: science, methodology, and statistical analysis of experiments

Algorithm R&D: mathematical algorithms that get embedded into automated processes, such as our recommendation system

Predictive models: standalone mathematical models to support decision making (e.g., title demand prediction)

3

Numbers shown in this presentation are not representative of Netflix’s overall metric values.

4

Netflix Experimentation: Common

“Product” is a set of controlled, randomized experiments, many running at once

Experiment in all areas

Plenty of rigor and attention around statistics, metrics, analysis

5

Netflix Experimentation: Distinctive

Core to culture (not just process)

Curated approach: decisions not automated; scrutiny of each test (and by many people)

Paying customers who are always logged in

Monthly subscription: tests last several months; sampling (test allocation) of new members can take weeks or even months

Many devices

6

Retention is our core metric (OEC): continually improve member enjoyment

Streaming Hours is our main engagement metric

[Chart: Cancel Rate (0%-20%) vs. Customers’ Stream Hours in the past 28 days (0-50)]

8

Probability of retaining at each future billing cycle, based on streaming S hours at N days of tenure

[Chart: Retention vs. total hours consumed during N days of membership]
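The deck only shows the resulting curves, not how they are produced. As a minimal sketch of one way to estimate such a probability (not necessarily the presenters' actual model), a logistic regression on a hypothetical per-member table with first-N-day streaming hours and a next-cycle retention flag (column names are made up):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical per-member table: total hours streamed in the first N days of
# membership, plus whether the member retained at the next billing cycle.
members = pd.DataFrame({
    "hours_first_n_days": [0.5, 3.0, 12.0, 40.0, 1.0, 25.0, 8.0, 60.0],
    "retained_next_cycle": [0, 0, 1, 1, 0, 1, 1, 1],
})

# Fit P(retain | hours); log1p tames the long right tail of streaming hours.
X = np.log1p(members[["hours_first_n_days"]])
y = members["retained_next_cycle"]
model = LogisticRegression().fit(X, y)

# Retention probability across a grid of streaming-hour values.
grid = pd.DataFrame({"hours_first_n_days": np.arange(0.0, 61.0, 5.0)})
probs = model.predict_proba(np.log1p(grid))[:, 1]
for hours, p in zip(grid["hours_first_n_days"], probs):
    print(f"{hours:5.0f} h -> P(retain next cycle) = {p:.2f}")
```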

Streaming measurement: Streaming score

Streaming measurement: Kolmogorov-Smirnov (KS) visual & Mann-Whitney U test statistic

KS test statistic
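Both tests compare the full distribution of streaming hours between cells rather than just the means. A minimal sketch with SciPy, using made-up per-member hours for a control cell and a test cell:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up per-member streaming hours for a control cell and a test cell.
control_hours = rng.gamma(shape=2.0, scale=8.0, size=5000)
test_hours = rng.gamma(shape=2.0, scale=8.8, size=5000)

# Two-sample Kolmogorov-Smirnov test: the statistic is the maximum gap
# between the two empirical CDFs (what the "KS visual" plots).
ks_stat, ks_p = stats.ks_2samp(control_hours, test_hours)

# Mann-Whitney U test: are test-cell hours stochastically larger?
u_stat, u_p = stats.mannwhitneyu(test_hours, control_hours, alternative="greater")

print(f"KS statistic   = {ks_stat:.3f} (p = {ks_p:.3g})")
print(f"Mann-Whitney U = {u_stat:.0f} (p = {u_p:.3g})")
```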

10

Streaming measurement: Thresholds with z-tests for proportions
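A minimal sketch of the threshold idea, using made-up counts of members above an illustrative hours cutoff in each cell, compared with statsmodels' two-sample z-test for proportions:

```python
from statsmodels.stats.proportion import proportions_ztest

# Made-up counts: members streaming above an illustrative hours threshold
# in each cell, out of all members allocated to that cell.
above_control, n_control = 2100, 5000
above_test, n_test = 2230, 5000

# Two-sample z-test for proportions: is the share of members above the
# threshold different in the test cell than in the control cell?
z, p = proportions_ztest(count=[above_test, above_control],
                         nobs=[n_test, n_control])

print(f"control: {above_control / n_control:.1%} above threshold")
print(f"test:    {above_test / n_test:.1%} above threshold")
print(f"z = {z:.2f}, p = {p:.3g}")
```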

Much experimentation on the recommender system

Row selection

Video ranking

Video-video similarity

User-user similarity

Search recommendations

Popularity vs personalization

Diversity

Novelty/Freshness

Evidence

12

Sample and Subject Purity

13

Same test, different populations

14

Who should Netflix sample?

Geography: global, US, international, region-specific

Tenure: 1 month (free trial), 2-6 months, 7+ months

Classes of experience with Netflix: signups who are not rejoining members; rejoining members; existing members (any tenure); existing members who are beyond their free trial; newly activating a device

15

Two considerations

1. For whom/what do you want to optimize?

2. Who will experience the winning treatment once it is launched?

16

“New members” by country region

[Chart: new-member volume by country region over time]

17

Membership by tenure

[Chart: membership over time by tenure group: free trial, medium tenure, longer tenure]

18

Hard to impact long-tenured members

[Chart: Cancel Rate over time for free trial, medium tenure, and long tenure cohorts]

19

Current favored samples in algorithm testing

Global signups who are not rejoining within a year

Secondarily: US existing members who are beyond their free trial; international (non-US) existing members who are beyond their free trial

20

Addressing Sampling Bias

Stratified sampling on attributes that are correlated with the core metric and independent of the test treatment (sketched after this list)

Regression tests for any systematic randomization process

Bias monitoring for each test’s sample

Large sample sizes

Re-testing

Good judgment to recognize that the “story” makes sense
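The deck does not show the allocation mechanics. Below is a minimal sketch of stratified allocation over a hypothetical member table, where country and tenure bucket stand in for attributes correlated with retention but independent of the treatment:

```python
import pandas as pd

def stratified_allocation(members, cells=("control", "test"),
                          strata=("country", "tenure_bucket"), seed=42):
    """Randomly assign members to cells within each stratum, so every cell
    ends up with (nearly) the same mix of the stratification attributes."""
    def assign(group):
        shuffled = group.sample(frac=1.0, random_state=seed)
        return shuffled.assign(
            cell=[cells[i % len(cells)] for i in range(len(shuffled))])

    return members.groupby(list(strata), group_keys=False).apply(assign)

# Hypothetical member table; attribute names are illustrative.
members = pd.DataFrame({
    "member_id": range(8),
    "country": ["US", "US", "US", "US", "BR", "BR", "BR", "BR"],
    "tenure_bucket": ["trial", "trial", "7m+", "7m+",
                      "trial", "trial", "7m+", "7m+"],
})

allocated = stratified_allocation(members)
# Bias check: the cell mix should match within every stratum.
print(allocated.groupby(["country", "tenure_bucket", "cell"]).size())
```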

21

In the words of Nate Silver

On predicting the 2008 recession in a world of noisy data and dependent variables:

“Not only was Hatzius’s forecast correct, but it was also right for the right reasons, explaining the causes of the collapse and anticipating the effects. Hatzius refers to this chain of cause and effect as a ‘story’… In contrast, if you just look at the economy as a series of variables and equations without any underlying structure, you are almost certain to mistake noise for a signal…”

The Signal and the Noise: Why So Many Predictions Fail – but Some Don’t, by Nate Silver

22

Short- versus long-term engagement metrics

23

Short-term metrics we consider

Daily cancel requests

Daily streaming hours

Daily visits

Session length

Failed sessions (no play)

“Take rates” (CTR where the click is a play): page-level, row-level, title-level (sketched below)
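A minimal sketch of computing such take rates from a hypothetical impression log (column names are illustrative, not Netflix's schema):

```python
import pandas as pd

# Hypothetical impression log: one row per title impression, with a flag for
# whether the impression led to a play (column names are illustrative).
impressions = pd.DataFrame({
    "page_id":  [1, 1, 1, 1, 2, 2, 2, 2],
    "row_id":   [10, 10, 11, 11, 10, 10, 12, 12],
    "title_id": [100, 101, 102, 103, 100, 104, 105, 106],
    "played":   [1, 0, 0, 1, 0, 0, 1, 0],
})

def take_rate(df, keys):
    """Take rate = plays / impressions at the chosen granularity."""
    return df.groupby(keys)["played"].mean().rename("take_rate")

print(take_rate(impressions, ["page_id"]))             # page-level
print(take_rate(impressions, ["page_id", "row_id"]))   # row-level
print(take_rate(impressions, ["title_id"]))            # title-level
```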

24

Statistically significant differences in churn rarely stabilize until after Day 45

[Two charts: churn difference significance vs. Test Duration]

25

Short-term metrics we consider

Daily cancel requests

Daily streaming hours

Daily visits

Session length

Failed sessions (no play)

“Take rates” (CTR where the click is a play): page-level, row-level, title-level

26

How well do your short-term metrics correlate with your OEC, and how much improvement do you see in that correlation if you increase the time interval?
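One illustrative way to quantify this (simulated data, not the deck's method): correlate streaming hours accumulated over 1 week, 1 month, and 2 months with the 4-month retention outcome, and see how the correlation grows with the window.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 10_000

# Simulated per-member data standing in for real logs: streaming hours
# accumulated over three windows, plus the 4-month retention outcome (OEC).
week1 = rng.gamma(2.0, 3.0, n)
month1 = week1 + rng.gamma(2.0, 9.0, n)
month2 = month1 + rng.gamma(2.0, 9.0, n)
retained_4mo = (rng.random(n) < 1.0 / (1.0 + np.exp(-0.03 * (month2 - 60)))).astype(int)

df = pd.DataFrame({"hours_1_week": week1, "hours_1_month": month1,
                   "hours_2_months": month2, "retained_4mo": retained_4mo})

# Spearman (rank) correlation of each window's hours with 4-month retention;
# the longer the accumulation window, the more closely it tracks the OEC.
print(df.corr(method="spearman")["retained_4mo"].drop("retained_4mo"))
```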

27

Streaming signal that appears over time

[Charts of the streaming metric at 1 week, 1 month, and 2 months]

28

Or disappears over time

[Charts of the streaming metric at 1 week, 1 month, and 2 months]

29

Ability to predict 4-month retention using streaming hours improves with longer-term data

30

Key Takeaways

Exercise rigor in selecting the population to sample; it should be representative of: the population you want to optimize for, and the population that will receive the experience if launched

Remain open-minded about changing the target population as business shifts occur

Address bias on an ongoing basis

Know and apply the time duration necessary for your OEC to stabilize

Additional short-term metrics need to have sufficient duration to correlate well with your OEC