n=10^9: automated experimentation at scale

Post on 20-Aug-2015


N=10^9: Automated Experimentation at Scale

Wojciech (Wojtek) Galuba (wgaluba@fb, @wgaluba)
Decision Tools Team Lead, Data Science Infrastructure, Facebook

History of Data Science Infra at FB
• Founded April 2012
• A group of data scientists and software engineers
• Experienced first hand the need for better infrastructure
• Need continues to grow
• Team doubled over the past year
• Expect continued rapid growth this year

Why do we experiment?

Experimentation

[Diagram: Product changes → Metrics; experiment to study this link]

Experiment to:
• Catch problems before they arise
• Choose between multiple options
• Challenge intuitions about product
• Not only evaluate ideas but generate new ones

Challenges

Many experiments
• Experiments running in parallel
• Modifying many different aspects of the product
• Overlaps are possible and may conflict

Many metric dimensions
• Different contexts of user actions
• Thousands of device types
• Geography
• Demographics
• Time
• Enormous space of possible questions

Many teams
• Many ways to run an experiment
• Diverse audience for results
• Huge set of results from every experiment
• Many ways to interpret results

Experimentation at Facebook

An experiment

QuickExperiment

[Figure: divide people randomly into groups, each group receiving its own parameter values]
• color: blue, size: medium
• color: blue, size: big
• color: green, size: medium

QuickExperiment
• Centralized experiment management
• Purely config-level: no code pushes to iterate
• Automatic exposure logging
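The random split on this slide can be sketched in a few lines. This is a minimal illustration, not QuickExperiment's actual implementation: the config layout, the function names, the experiment name, and the choice of SHA-1 hashing are all assumptions made for the example. Only the three parameter combinations come from the slide.

```python
import hashlib

# Hypothetical config; the parameter values mirror the slide's example.
EXPERIMENT = {
    "name": "button_test",
    "groups": [
        {"color": "blue", "size": "medium"},
        {"color": "blue", "size": "big"},
        {"color": "green", "size": "medium"},
    ],
}

exposure_log = []  # stand-in for a real logging pipeline


def assign_group(experiment, user_id):
    """Deterministically map a user to a group index by hashing (experiment, user)."""
    key = f"{experiment['name']}:{user_id}".encode("utf-8")
    return int(hashlib.sha1(key).hexdigest(), 16) % len(experiment["groups"])


def get_params(experiment, user_id):
    """Return the user's parameter values and record the exposure automatically."""
    bucket = assign_group(experiment, user_id)
    exposure_log.append((experiment["name"], user_id, bucket))
    return experiment["groups"][bucket]


params = get_params(EXPERIMENT, user_id=42)
```

Because the assignment depends only on the experiment name and the user ID, changing the group definitions is a config edit rather than a code push, and every parameter lookup leaves an exposure record behind.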

PlanOut

PlanOut
• Open sourced: http://facebook.github.io/planout/
• Flexible experimental design
• Full, programmatic control over param values

Experiment evaluation

Exposures

Metrics

[Figure: % change from control to test for the "posts" metric; x-axis from -3 to 3; confidence levels 95%, 99%, 99.9%]

Assess decision risk

[Figure legend: confidence levels 95%, 99%, 99.9%]
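The chart on these slides reports the % change from control to test at three confidence levels. One way such numbers could be produced is sketched below. It is not the method used at Facebook: it assumes a normal approximation, treats the control mean in the denominator as fixed, and the sample values are made up for illustration.

```python
from statistics import NormalDist, mean, variance


def percent_change_ci(control, test, confidence=0.95):
    """Percent change of the test mean over the control mean, with an interval.

    Simplification: the control mean used for scaling is treated as a constant.
    """
    m_c, m_t = mean(control), mean(test)
    se = (variance(control) / len(control) + variance(test) / len(test)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = m_t - m_c
    scale = 100.0 / m_c
    return diff * scale, (diff - z * se) * scale, (diff + z * se) * scale


# Made-up per-user post counts, for illustration only.
control = [3, 5, 4, 6, 5, 4, 5, 6, 4, 5]
test = [5, 6, 5, 7, 6, 5, 6, 7, 5, 6]

for level in (0.95, 0.99, 0.999):
    change, low, high = percent_change_ci(control, test, level)
    print(f"{level:.1%} confidence: {change:+.1f}% [{low:+.1f}%, {high:+.1f}%]")
```

The interval widens as the confidence level rises. An effect whose interval still excludes zero at 99.9% carries less decision risk than one that only clears the 95% bar.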

Lessons learned

Computing answers to exponential number of possible questions

Pre-compute
• low specificity
• low dimensionality
• long-term

Compute on-the-fly
• high specificity
• high dimensionality
• short-term

A balancing act
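The trade-off can be illustrated with a toy example: coarse aggregates are computed once and served from a lookup table, while narrow slices are computed by scanning raw rows on demand. The row layout, field names, and data are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw rows: one per user, with a few dimensions and one metric.
ROWS = [
    {"exp": "exp1", "group": "control", "country": "US", "device": "ios", "posts": 2.0},
    {"exp": "exp1", "group": "control", "country": "PL", "device": "android", "posts": 4.0},
    {"exp": "exp1", "group": "test", "country": "US", "device": "ios", "posts": 3.0},
    {"exp": "exp1", "group": "test", "country": "PL", "device": "android", "posts": 5.0},
]


def precompute_means(rows):
    """Aggregate once per (experiment, group): low specificity, cheap to serve."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        key = (row["exp"], row["group"])
        sums[key] += row["posts"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def mean_on_the_fly(rows, **filters):
    """Scan raw rows for an arbitrary slice: high specificity, paid per query."""
    values = [r["posts"] for r in rows if all(r[k] == v for k, v in filters.items())]
    return sum(values) / len(values) if values else None


PRECOMPUTED = precompute_means(ROWS)
overall = PRECOMPUTED[("exp1", "test")]
sliced = mean_on_the_fly(ROWS, exp="exp1", group="test", country="PL", device="android")
```

Pre-computing every combination of dimensions is infeasible because the number of slices grows exponentially, which is why only the coarse table is materialized here.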

Tackling many dimensions: two sets of tools
• For exploration
• For extraction

Automated exploration

Enforce a lifecycle; in particular, clear experiment end dates

Why lifecycle policy?
• Unifies methodology across teams
• Prevents tech debt buildup
• Minimizes bad impact on product
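A lifecycle policy with mandatory end dates can be enforced at the point where an experiment is defined. The sketch below is one possible shape for such a check; the class name, fields, and the 90-day limit are assumptions, not a description of Facebook's tooling.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_DURATION = timedelta(days=90)  # hypothetical policy limit


@dataclass(frozen=True)
class ExperimentConfig:
    name: str
    start: date
    end: date  # required: an experiment cannot be defined without an end date

    def __post_init__(self):
        if self.end <= self.start:
            raise ValueError(f"{self.name}: end date must be after start date")
        if self.end - self.start > MAX_DURATION:
            raise ValueError(f"{self.name}: runs longer than {MAX_DURATION.days} days")

    def is_active(self, today):
        """True while the experiment should still assign and log exposures."""
        return self.start <= today < self.end


config = ExperimentConfig("button_test", date(2015, 1, 1), date(2015, 1, 15))
print(config.is_active(date(2015, 1, 10)))
```

Rejecting open-ended experiments at definition time means stale test branches have a known removal date instead of lingering as tech debt.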

Ease of rapid iteration; Safe and scientifically valid iteration

Fast, but not too fast
• Novelty effect vs. top engaged users bump
• Understand if waiting helps

Ensure mutual exclusion; Across platforms, features and infra

Why mutual exclusion?
• Fewer experiment conflicts
• Lower metrics variance
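One common way to get mutual exclusion is to hash users into a fixed number of segments and hand each experiment a disjoint set of segments. The sketch below assumes that approach; the `Layer` class, the segment count, and the experiment names are hypothetical.

```python
import hashlib

NUM_SEGMENTS = 1000  # hypothetical granularity


def segment_of(user_id, layer_name):
    """Hash a user into one of NUM_SEGMENTS segments for a given layer."""
    digest = hashlib.sha1(f"{layer_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SEGMENTS


class Layer:
    """A pool of segments; each segment belongs to at most one experiment."""

    def __init__(self, name):
        self.name = name
        self.owner = {}  # segment index -> experiment name
        self.next_free = 0

    def allocate(self, experiment, num_segments):
        if self.next_free + num_segments > NUM_SEGMENTS:
            raise ValueError("not enough free segments in this layer")
        for seg in range(self.next_free, self.next_free + num_segments):
            self.owner[seg] = experiment
        self.next_free += num_segments

    def experiment_for(self, user_id):
        """Return the single experiment this user is in, or None."""
        return self.owner.get(segment_of(user_id, self.name))


layer = Layer("feed_ui")
layer.allocate("exp_a", 100)  # roughly 10% of users
layer.allocate("exp_b", 200)  # roughly 20% of users
print(layer.experiment_for(42))
```

Since a user hashes to exactly one segment and a segment has at most one owner, no user can be in two experiments of the same layer at once.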

Exposure log everything
• Measure effects on the exposed only
• Condition analyses on the time since last exposure
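Both bullets can be expressed as a filter over the exposure log: drop actions by users who were never exposed, and keep only actions that fall within a window after the user's most recent exposure. The log layouts, function name, and data below are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def count_actions_after_exposure(exposures, actions, window):
    """Count actions per group, restricted to exposed users and to actions
    occurring within `window` of the user's last exposure before the action."""
    by_user = defaultdict(list)
    for user, group, ts in exposures:
        by_user[user].append((ts, group))

    counts = defaultdict(int)
    for user, ts in actions:
        prior = [(e_ts, g) for e_ts, g in by_user.get(user, []) if e_ts <= ts]
        if not prior:
            continue  # user was not exposed before acting: excluded
        last_ts, group = max(prior)
        if ts - last_ts <= window:
            counts[group] += 1
    return dict(counts)


t0 = datetime(2015, 1, 1, 12, 0)
exposures = [(1, "test", t0), (2, "control", t0)]  # (user, group, time)
actions = [
    (1, t0 + timedelta(minutes=5)),   # counted for "test"
    (1, t0 + timedelta(hours=3)),     # too long after exposure
    (2, t0 + timedelta(minutes=10)),  # counted for "control"
    (3, t0 + timedelta(minutes=1)),   # user 3 was never exposed
]
result = count_actions_after_exposure(exposures, actions, timedelta(hours=1))
```

Restricting the analysis to exposed users removes the noise contributed by people who could not have been affected by the change.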

The culture

Experimentation gives focus; But watch out for tunnel vision!

The culture

Cultivate sound practices; Safe and low-impact experimentation

The culture

Educate on data interpretation; Uniform decision-making across teams

Understanding uncertainty

“Robust misinterpretation of confidence intervals”, Rink Hoekstra et al., Psychonomic Bulletin & Review

• Only 3% of scientists got all 6 answers right...

• How do we educate the users of the tools?
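One way to teach the correct reading of a confidence interval is by simulation: a 95% interval is a statement about the procedure, which captures the true value in roughly 95% of repeated experiments, not about any single interval. The sketch below is an illustrative teaching aid with made-up parameters; it uses the normal critical value 1.96, so observed coverage at this sample size lands slightly under 95%.

```python
import random
from statistics import mean, stdev


def coverage(true_mean=10.0, sigma=2.0, n=50, trials=2000, z=1.96, seed=0):
    """Fraction of simulated experiments whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        half_width = z * stdev(sample) / n ** 0.5
        if abs(mean(sample) - true_mean) <= half_width:
            hits += 1
    return hits / trials


rate = coverage()
print(f"intervals containing the true mean: {rate:.1%}")
```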

The three stages of experimentation infrastructure

Stage 1: Artisanal

Photo credit: Abhisek Sarda

Stage 2: Power tools


Stage 3: Industrialized

Photo credit: Steve Jurvetson

Conclusions

Empower, but don’t overwhelm

Conclusions

Filter and automate, but maintain broad focus

Conclusions

Clean data and powerful tools are great, but building the right experimentation culture is equally important

