Randomized experiments, A/B tests and sequential monitoring

Steve Howard
April 26, 2018



Guess the winner!

From Kohavi, Longbotham, et al. (2009).

10x revenue!

Guess the winner!

From Kohavi, Deng, et al. (2012).

+10% revenue

Yes, the color thing is real.

+$10MM annually

For years, Microsoft, like many other companies, had relied on expert designers—rather than the behavior of actual users—to define corporate style guides and colors.

From Kohavi and Thomke (2017).

“Which version is better?” is a thorny question.

• What I’d really like to know:
  1. What would happen if I were to show everyone version A?
  2. What would happen if I were to show everyone version B?

• A simpler version:
  1. What would happen if I were to show you version A?
  2. What would happen if I were to show you version B?

• I can never (reliably) answer this question!

“The fundamental problem of causal inference”

Carefully designed, randomized controlled experiments are the only reliable way to learn what works best.

Sometimes the only thing you can do with a poorly designed experiment is to try to find out what it died of. (Fisher)

From Box, J. S. Hunter, and W. G. Hunter (2005).

1. The need for randomized experiments

• Prediction, estimation, and causal inference
• The two benefits of randomization

2. Design choices

3. Sequential experimentation

Prediction, estimation, and causal inference

Prediction, estimation, and causal inference: a coarse classification of statistical problems

Suppose I have data on birth weights at a certain hospital, and whether each mother smoked.

A prediction problem: Can you predict the birth weight of the next baby, given the mother’s smoking status and other info? Use any algorithm we want, check accuracy on held-out data.

An estimation problem: What is the (adjusted) difference in birth weight between smokers and nonsmokers, in the population? How precise is that estimate? Need a probability model.

A causal inference problem: What will be the effect on birth weight of telling mothers to stop smoking? Need two groups of mothers similar except for treatment. Randomized assignment yields such groups.

The two benefits of randomization

First benefit of randomization: freedom from bias

Designing an experiment is like gambling with the devil: only a random strategy can defeat all his betting systems. (Fisher)

From Box, J. S. Hunter, and W. G. Hunter (2005).

From Freedman, Pisani, and Purves (2007).

Second benefit of randomization: a “reasoned basis for inference”

By putting known randomness into the world, we justify probability calculations by design.

An idea due to Fisher. Also known as “putting a rabbit into the hat”. (Freedman)

Without randomization, probabilities are justified purely by a model.

Don’t fall in love with a model.

From Box, J. S. Hunter, and W. G. Hunter (2005).

1. The need for randomized experiments

2. Design choices

• Choosing the unit of randomization
• Choosing who to enroll
• Choosing an outcome metric

3. Sequential experimentation

Choosing the unit of randomization

Pricing is a tricky thing to experiment on.

From keepa.com

How would you run a pricing experiment?

• Randomize by session?
• Randomize by user?
• Randomize by product?
• Randomize by product category?
• Randomize by day?

The right unit of randomization is sometimes not obvious!

The unit of analysis should be the same as the unit of randomization.

Whatever your unit of randomization,

• compute one summary outcome per unit, and
• analyze results with these outcomes.

Sample size = number of randomized units!
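In code, the rule above amounts to aggregating before analyzing: collapse raw events to one summary per randomized unit, then treat those summaries as the sample. A minimal sketch (the event data and user IDs here are made up for illustration):

```python
from collections import defaultdict

# Raw events: (user_id, converted) pairs, possibly many rows per user.
events = [("u1", 0), ("u1", 1), ("u2", 0), ("u2", 0), ("u3", 1)]

per_user = defaultdict(list)
for user, outcome in events:
    per_user[user].append(outcome)

# One summary outcome per unit of randomization (here: per user).
summaries = {user: sum(vals) / len(vals) for user, vals in per_user.items()}
print(summaries)       # {'u1': 0.5, 'u2': 0.0, 'u3': 1.0}
print(len(summaries))  # sample size = 3 randomized units, not 5 events
```

Any downstream test (t-test, proportion test) should then run on `summaries.values()`, not on the raw event rows.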

Making the unit of analysis differ from the unit of randomization is dangerous.

Be wary of finer-grained analysis, e.g.,

• randomizing by city, analyzing by user.
• randomizing by category, analyzing by product.

It can be done, but requires delicate modeling assumptions.

An extreme example: imagine we have just two groups, say San Francisco and Los Angeles. There are only two possible randomizations. The randomization implies only two possible outcomes.

Choosing who to enroll

An aside: sample size planning / power calculations

The guarantee of a hypothesis test:

“If the treatment has no effect, the chance of false discovery is at most 5%.”

What if the treatment does have an effect?

“If the treatment has at least a 20% lift, the chance of detecting it is at least 80%.”

How to guarantee the second statement? Sample size planning.

An aside: sample size planning / power calculations

• Type I error rate: 5%
• Minimum planned-for effect: 20% lift
• Power: 80%

>>> power_prop_test(p1=0.1, p2=0.1 * 1.2,
...                 significance_level=0.05,
...                 power=0.8).n_per_group
3840.847482436278
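The `power_prop_test` call above mirrors R's `power.prop.test`. As a sketch of what it computes, here is the classical normal-approximation formula for the per-group sample size in plain Python (the function name is ours, not the talk's):

```python
from math import sqrt
from statistics import NormalDist

def n_per_group(p1, p2, significance_level=0.05, power=0.8):
    """Per-group sample size for a two-sample test of proportions,
    using the standard normal approximation (as in R's power.prop.test)."""
    z_alpha = NormalDist().inv_cdf(1 - significance_level / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2  # pooled proportion under the null
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return (numerator / abs(p2 - p1)) ** 2

print(n_per_group(0.1, 0.1 * 1.2))  # ≈ 3840.85, matching the slide
```

Small effects on small baseline rates drive the denominator toward zero, which is why the required n explodes: detecting a 20% lift on a 10% baseline takes ~3,841 subjects per group.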

Randomize as lazily as possible.

50,000 visitors/week → 1% advance to checkout → 20% complete

Say our variation increases conversion rate from 20% to 25%.

Bad idea: enroll everyone.

power_prop_test(p1=.20 * .01, p2=.25 * .01, power=.8)
→ need 280,000 visitors.

Good idea: enroll only those who get to the checkout page.

power_prop_test(p1=.20, p2=.25, power=.8)
→ need 2,200 visitors at checkout
→ need 220,000 visitors total (20% decrease in sample size)

Sometimes it helps to enroll a highly-affected subgroup.

Suppose treatment is expensive, and

• among all users at checkout, 20% complete the purchase;
• among users with at least two items in their cart, 40% complete the purchase.

Idea #1: enroll everyone.

power_prop_test(p1=.20, p2=.25, power=.8)
→ need 2,200 visitors.

Idea #2: enroll only those with two items.

power_prop_test(p1=.40, p2=.50, power=.8)
→ need 770 visitors at checkout
(65% decrease in enrolled sample size)

Focusing on a subgroup may help internal validity, but hurt external validity.

Internal validity: are my conclusions valid for enrolled subjects?

External validity: do my conclusions generalize to other groups?

Sometimes a difficult tradeoff.

Choosing an outcome metric

Imagine we’re testing changes to a search ranking algorithm.

How should we evaluate search quality?

We need a metric to run an experiment.

We’d like to improve market share:

  (# queries to our search engine) / (# queries to all search engines)

We can’t measure the denominator. So just use the numerator?

Finding the right metric isn’t easy!

Kohavi, Deng, et al. (2012): a buggy experiment showing very poor search results caused

• 10% lift in queries per user, and
• 30% lift in revenue per user!

What happened here?

• Bad results → issue more queries
• Bad organic results → more clicks on ads

Queries / Month = (Users / Month) × (Sessions / User) × (Queries / Session)

• # Users is fixed by design.
• Sessions / User seems the best metric.
• Queries / Session is difficult to interpret.

From Kohavi, Deng, et al. (2012).

What if I want to look at multiple outcome metrics?

My suggestion for multiple outcome metrics:

• Pick one primary metric—a “key performance indicator” or KPI.
• If the others are important, correct for multiple testing.
• Otherwise, look at them, but educate yourself and others about multiplicity.
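For the "correct for multiple testing" bullet, the simplest option is a Bonferroni correction: compare each secondary metric's p-value against alpha divided by the number of metrics. A minimal sketch (the helper name and the p-values are hypothetical):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each metric as significant or not, controlling the
    family-wise error rate at alpha by testing against alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Three secondary metrics: only the first survives the correction,
# since each is now held to 0.05 / 3 ≈ 0.0167.
print(bonferroni_significant([0.004, 0.03, 0.2]))  # [True, False, False]
```

Bonferroni is conservative; the point here is only that some explicit correction beats eyeballing many uncorrected p-values.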

1. The need for randomized experiments

2. Design choices

3. Sequential experimentation

• A lesson about random walks
• Repeated looks inflate error
• Simulation-based sequential p-values

A lesson about random walks

The coin-flipping game: long leads

Every day for one year, I flip a fair coin.

• Heads → you pay me $1.
• Tails → I pay you $1.

What’s the chance that, after the first eight days, one of us stays in the lead the entire rest of the year?

(a) One in 10,000
(b) One in 1,000
(c) One in 100
(d) One in 10
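Before peeking at the slides' answer, the probability can be estimated by simulation: count how often the running profit never returns to zero after day 8, so that one player stays in the lead for the rest of the year. (The trial count and seed here are our own choices, not from the talk.)

```python
import random

def one_player_leads(days=365, lead_after=8, rng=random):
    """True if the running profit never returns to zero after day
    `lead_after`, i.e., one player leads the whole rest of the year."""
    total = 0
    for day in range(1, days + 1):
        total += 1 if rng.random() < 0.5 else -1
        if day > lead_after and total == 0:
            return False
    return True

rng = random.Random(0)
trials = 20_000
frac = sum(one_player_leads(rng=rng) for _ in range(trials)) / trials
print(f"{frac:.3f}")  # roughly 0.1 — answer (d), one in 10
```

Intuitively surprising: a fair game produces year-long leads about one time in ten.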

One random walk path

[Plot: my profit/loss vs. number of coin flips.]

Random walks with long leads

[Plot: 100 sample paths of my profit/loss vs. number of coin flips.]

10 / 100 walks have one leader for the last 357 / 365 days.

Random walks with a dominant leader

[Plot: 100 simulated profit/loss paths, −60 to 60, over 0–365 coin flips.]

15 / 100 walks have one player ahead on at least 357 / 365 days.

31

A highly uneven outcome is the norm, not the exception.

[Histogram: number of days I was ahead, over 0–365 days.]

(These are called arcsine laws.)

32
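A quick way to convince yourself of the answer to the coin-flipping game is to simulate it. A minimal sketch (the function name, sample sizes, and seed are my own choices, not from the slides):

```python
import random

def lead_fraction(n_days=365, burn_in=8, n_walks=2000, seed=0):
    """Fraction of fair-coin games in which one player stays strictly
    in the lead for the entire year after the first `burn_in` days."""
    rng = random.Random(seed)
    long_leads = 0
    for _ in range(n_walks):
        total, path = 0, []
        for _ in range(n_days):
            total += rng.choice((-1, 1))  # heads: I gain $1; tails: I lose $1
            path.append(total)
        rest = path[burn_in:]  # days 9 through 365
        # one player led the whole rest of the year iff the sign never flips
        if all(x > 0 for x in rest) or all(x < 0 for x in rest):
            long_leads += 1
    return long_leads / n_walks
```

With these defaults the estimate should land in the neighborhood of one in ten, consistent with the slides' 10 / 100 walks and answer (d).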

What’s going on here?

In a random walk, outcomes at different times are highly correlated.

Our usual notions about long-run behavior don’t apply.

These examples are based on Feller (1971, §III.4).

33


Repeated looks inflate error

33


Sequential monitoring of A/B tests is desirable but problematic.

From https://crozdesk.com/analytics-intelligence/a-b-testing-software/optimizely

34

Testing a coin for fairness

How to determine if a coin is fair? Start flipping it!

T H T H H T H H H T H H T T H T H T H H H T H H ...

Is 15 / 24 heads surprising? This is what p-values are for.

35
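For this small example, the p-value can be computed exactly from the binomial distribution rather than by simulation. A short sketch (the helper name is my own):

```python
from math import comb

def two_sided_binomial_p(heads=15, flips=24):
    """Exact two-sided p-value for a fair coin: the chance of a count at
    least as extreme as `heads` out of `flips`, in either direction."""
    k = max(heads, flips - heads)          # the more extreme tail count
    tail = sum(comb(flips, j) for j in range(k, flips + 1)) / 2 ** flips
    return min(1.0, 2 * tail)              # double one tail, cap at 1
```

Here `two_sided_binomial_p(15, 24)` is about 0.31, so 15 / 24 heads is not at all surprising for a fair coin.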


p-values control the chance of false discovery

The guarantee of a hypothesis test:

“If the treatment has no effect, the chance of false discovery is at most 5%.”

The key property of p-values: if the treatment has no effect,

P(p-value ≤ 0.05) ≤ 0.05.

Declare a discovery when p-value ≤ 0.05 → chance of false discovery is at most 5%.

36


One path of p-values from a fair coin.

[Plot: fixed-sample p-value vs. number of flips (0–437), with the 0.05 threshold marked.]

37

With no bias, we only rarely conclude the coin is biased.

[Plot: fixed-sample p-value paths vs. number of flips (0–437), with the 0.05 threshold marked.]

38

Continuous monitoring of fixed-sample p-values breaks the guarantee.

[Plot: fixed-sample p-value paths vs. number of flips (0–437), with the 0.05 threshold marked.]

Here, with a fair coin, 35% of paths reach significance.

39

For a fair coin, the chance of false discovery grows arbitrarily large with enough flips.

[Plot: probability of false discovery, 0.05 up to 0.5, vs. number of samples from 10⁰ to 10⁵ (log scale).]

40
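This inflation is easy to reproduce by simulation. A hedged sketch (my own choices, not from the slides: I use the normal approximation "|T_n|/√n ≥ 1.96" as a stand-in for "fixed-sample p ≤ 0.05", and only start looking at flip 30 where that approximation is reasonable, so the rate will not exactly match the slide's 35%):

```python
import math
import random

def continuous_monitoring_rate(n_flips=437, first_look=30, n_sims=2000, seed=1):
    """Fraction of fair-coin paths that ever reach 'significance' when a
    fixed-sample p-value is recomputed after every flip.

    Normal approximation: p <= 0.05 roughly when |T_n| / sqrt(n) >= 1.96.
    """
    z_crit = 1.96
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        s = 0
        for n in range(1, n_flips + 1):
            s += rng.choice((-1, 1))
            if n >= first_look and abs(s) / math.sqrt(n) >= z_crit:
                hits += 1  # this path would be (falsely) declared significant
                break
    return hits / n_sims
```

The estimated rate comes out far above the nominal 5%, which is the whole point: each individual look is a valid 5% test, but the chance that *some* look crosses the threshold is much larger.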

Sequential monitoring happens all over the place.

• In A/B testing.
• In clinical trials.
• In lab experiments.
• ...

41


Simulation-based sequential p-values

41

Reminder: fixed-sample p-values by simulation

Standard p-value to test whether a coin is fair:

• Flip the coin 1,000 times.
• Compute t^obs_1000 = # heads − # tails after 1,000 flips.
• Simulate T_1000 many times and estimate

p-value = P(|T_1000| ≥ t^obs_1000).

[Histogram: simulated T_1000 values, −100 to 100, with t^obs_1000 marked.]

Example: t^obs_1000 = 70 → p ≈ 0.029.

42
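The simulation described above can be sketched in a few lines (function name, simulation count, and seed are my own):

```python
import random

def fixed_sample_p(t_obs=70, n_flips=1000, n_sims=5000, seed=2):
    """Estimate the two-sided fixed-sample p-value P(|T_1000| >= t_obs)
    by simulating T_1000 = (# heads - # tails) for a fair coin."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        t = sum(rng.choice((-1, 1)) for _ in range(n_flips))
        if abs(t) >= t_obs:
            hits += 1
    return hits / n_sims
```

With `t_obs = 70` the estimate should land near the slide's p ≈ 0.029 (up to simulation noise).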


Now we have a sequence of test statistics.

Now say we want to compute a p-value after every 100 flips.

We need to consider the sequence of test statistics

T_100, T_200, . . . , T_900, T_1000.

For a fair coin, T_n is a sum of n i.i.d. random variables, each taking values ±1 with probability 1/2 each.

The variance of this ±1 random variable is one.
⇒ The variance of T_n is n (standard deviation √n).
⇒ The variance of T_n/√n is 1.

43
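The variance calculation above is easy to check empirically (a sketch; the function name and defaults are mine):

```python
import random

def empirical_variance(n=100, n_sims=5000, seed=3):
    """Sample variance of T_n, the sum of n independent +/-1 steps of a
    fair coin; the derivation above says this should be close to n."""
    rng = random.Random(seed)
    draws = [sum(rng.choice((-1, 1)) for _ in range(n)) for _ in range(n_sims)]
    mean = sum(draws) / n_sims
    return sum((d - mean) ** 2 for d in draws) / (n_sims - 1)
```

For `n = 100` the sample variance should come out close to 100, so T_n/√n has variance close to 1 at every checkpoint.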


[Plot: T_n vs. number of flips n, checked at n = 100, 200, . . . , 1,000; values range over about −100 to 100.]

44

[Plot: T_n/√n vs. number of flips n, checked at n = 100, 200, . . . , 1,000; values range over about −4 to 4.]

45

We’ll use a maximal test statistic to compute sequential p-values.

Now we’ll simulate the test statistic

T⋆_1000 = max{ T_100/√100, T_200/√200, . . . , T_900/√900, T_1000/√1000 }.

Our sequential procedure:

• After every 100 flips, compute

t^obs_n = (# heads − # tails after n flips)/√n
t⋆_n = max{ t^obs_100, t^obs_200, . . . , t^obs_n }.

• Simulate T⋆_1000 many times and estimate

p-value = P(|T⋆_1000| ≥ t⋆_n).

46


Example: t⋆_1000 = 70/√1000 → p ≈ 0.054.

(Compare to p ≈ 0.029 earlier.)

[Histogram: simulated T⋆_1000 values from −2.5 to 2.5, with t⋆_n marked.]

47
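The maximal-statistic procedure can be sketched as follows. All names and defaults are mine, and I read the slides' p-value literally as P(|T⋆_1000| ≥ t⋆_n), with T⋆ the signed max of T_n/√n over the checkpoints:

```python
import math
import random

def t_star_path(flips, every=100):
    """T* for one experiment: the max of T_n / sqrt(n) over the
    checkpoints n = every, 2*every, ... (signed max, as on the slide)."""
    s = 0
    best = -math.inf
    for n, x in enumerate(flips, start=1):
        s += x
        if n % every == 0:
            best = max(best, s / math.sqrt(n))
    return best

def sequential_p(t_star_obs, n_max=1000, every=100, n_sims=4000, seed=4):
    """Estimate the sequential p-value P(|T*_1000| >= t_star_obs)
    by simulating fair-coin experiments."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        flips = (rng.choice((-1, 1)) for _ in range(n_max))
        if abs(t_star_path(flips, every)) >= t_star_obs:
            hits += 1
    return hits / n_sims
```

Calling `sequential_p(70 / math.sqrt(1000))` should land near the slide's p ≈ 0.054, roughly double the fixed-sample 0.029: the price of ten looks instead of one.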

We can look at these p-values repeatedly.

Now we can compute a p-value after every 100 flips, stop as soon as p ≤ 0.05, and still have the guarantee

P(any p-value ≤ 0.05) ≤ 0.05

if the coin is fair.

If the coin is biased, we have a chance to stop early.

Remember:

• We must choose the maximum sample size in advance (here, 1,000).
• We can only look as often as we do in the simulation (here, every 100 flips).

48
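The any-look guarantee can itself be checked by simulation. A sketch under my own reading of the procedure: because t⋆_n only grows with n, stopping at the first look with p ≤ 0.05 rejects exactly when the end-of-experiment statistic crosses the null 95% quantile, so we can estimate the any-look error rate from the final statistic alone.

```python
import math
import random

def max_stat(rng, n_max=1000, every=100):
    """|T*| for one fair-coin experiment, where T* is the max of
    T_n / sqrt(n) over the checkpoints n = every, 2*every, ..."""
    s = 0
    best = -math.inf
    for n in range(1, n_max + 1):
        s += rng.choice((-1, 1))
        if n % every == 0:
            best = max(best, s / math.sqrt(n))
    return abs(best)

def any_look_error_rate(alpha=0.05, n_null=3000, n_runs=2000, seed=5):
    """Estimate P(any sequential p-value <= alpha) for a fair coin."""
    rng = random.Random(seed)
    # Null distribution of |T*|; its (1 - alpha) quantile is the boundary
    # that "p <= alpha at some look" corresponds to crossing.
    null = sorted(max_stat(rng) for _ in range(n_null))
    boundary = null[int((1 - alpha) * n_null)]
    # Fresh fair-coin experiments: count how many ever cross the boundary.
    crossings = sum(max_stat(rng) >= boundary for _ in range(n_runs))
    return crossings / n_runs
```

The estimated rate should hover around 5% rather than the 35% seen with naive continuous monitoring.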


Recap

1. Randomized assignment

• protects from bias, and
• justifies probability calculations.

2. Before running an experiment, carefully choose

• the unit of randomization (and analysis!),
• the enrolled population, and
• the outcome metric.

3. If you want to monitor sequentially, use sequential methods!

49


Box, G. E. P., J. S. Hunter, and W. G. Hunter (2005). Statistics for Experimenters: Design, Innovation, and Discovery. Wiley-Interscience.

Feller, W. (1971). An Introduction to Probability Theory and Its Applications. 3rd ed. Wiley.

Freedman, D., R. Pisani, and R. Purves (2007). Statistics. W. W. Norton & Company.

Kohavi, R., A. Deng, B. Frasca, R. Longbotham, T. Walker, and Y. Xu (2012). “Trustworthy online controlled experiments: Five puzzling outcomes explained”. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 786–794.

Kohavi, R., R. Longbotham, D. Sommerfield, and R. M. Henne (2009). “Controlled experiments on the web: survey and practical guide”. Data Mining and Knowledge Discovery 18 (1), pp. 140–181.

Kohavi, R. and S. H. Thomke (2017). “The Surprising Power of Online Experiments”. Harvard Business Review 95 (5), pp. 74–82.

50


Thank you.

50