a/b testing: avoiding common pitfalls · 34 1,000 experiments with 200,000 fake participants...
TRANSCRIPT
![Page 1: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/1.jpg)
März 6, 2014
Danielle Jabin
A/B Testing: Avoiding Common Pitfalls
![Page 2: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/2.jpg)
2
Make all the world’s music available instantly to everyone, wherever and whenever they
want it
![Page 3: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/3.jpg)
3
![Page 4: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/4.jpg)
4
Over 24 million active users
![Page 5: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/5.jpg)
5
Access to more than 20 million songs
![Page 6: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/6.jpg)
6
![Page 7: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/7.jpg)
7
But can we make it even easier?
![Page 8: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/8.jpg)
8
We can try… …with A/B testing!
![Page 9: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/9.jpg)
9
So…what’s an A/B test?
![Page 10: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/10.jpg)
10
Control A
![Page 11: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/11.jpg)
Pitfall #1: Not limiting your error rate
![Page 12: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/12.jpg)
12
Source: assets.20bits.com/20081027/normal-‐curve-‐small.png
![Page 13: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/13.jpg)
13
What if I flip a coin 100 times and get 51 heads?
![Page 14: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/14.jpg)
14
What if I flip a coin 100 times and get 5 heads?
![Page 15: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/15.jpg)
15
![Page 16: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/16.jpg)
16
The likelihood of obtaining a certain value under a given
distribution is measured by its p-value
![Page 17: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/17.jpg)
17
If there is a low likelihood that a change is due to chance alone, we call our results statistically
significant
![Page 18: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/18.jpg)
18
What if I flip a coin 100 times and get 5 heads?
![Page 19: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/19.jpg)
19
● alpha levels of 5% and 1% are most commonly used – Alternatively: P(significant) = .05 or .01
Statistical significance is measured by alpha
![Page 20: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/20.jpg)
20
Each alpha has a corresponding Z-score
alpha Z-‐score (two-‐sided test)
.10 1.65
.05 1.96
.01 2.58
![Page 21: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/21.jpg)
21
The Z-score tells us how far a particular value is from the
mean (and what the corresponding likelihood is)
![Page 22: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/22.jpg)
22
Source: assets.20bits.com/20081027/normal-‐curve-‐small.png
![Page 23: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/23.jpg)
23
Compute the Z-score at the end of the test
![Page 24: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/24.jpg)
24
Standard deviation (σ) tells us how spread out the numbers
are
![Page 25: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/25.jpg)
25
![Page 26: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/26.jpg)
26
To lock in error rates before you start, fix your sample size
![Page 27: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/27.jpg)
27
● To lock in error rates before you start a test, fix your sample size
What should my sample size be?
Sample size in each group (assumes equal sized groups)
Represents the desired power (typically .84 for 80% power).
Represents the desired level of staJsJcal significance (typically 1.96).
Standard deviaJon of the outcome variable Effect Size (the
difference in means)
n =2σ 2 (Zβ +Zα /2 )
2
difference2
Source: www.stanford.edu/~kcobb/hrp259/lecture11.ppt
![Page 28: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/28.jpg)
28
● Compute your sample size – Using alpha, beta, standard deviation of your metric, and effect size
● Run your test! But stop once you’ve reached the fixed sample size stopping point ● Compute your z-score and compare it with the z-score for the chosen alpha level
Recap: running an A/B test
![Page 29: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/29.jpg)
29
Control A
![Page 30: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/30.jpg)
30
Resulting Z-score?
![Page 31: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/31.jpg)
31
33.3
![Page 32: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/32.jpg)
Pitfall #2: Stopping your test before the fixed sample size stopping point
![Page 33: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/33.jpg)
33
● With σ = 10, difference in means = 1
Sample size for varying alpha levels
Two-‐sided test
alpha = .10, beta = .80 1230
alpha = .05, beta = .80
1568
alpha = .01, beta = .80 2339
![Page 34: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/34.jpg)
34
● 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion rate
Let’s see some numbers
Stop at first point of significance
Ended as significant
90% significance reached
654 of 1,000 100 of 1,000
95% significance reached
427 of 1,000 49 of 1,000
99% significance reached
146 of 1,000 14 of 1,000
Source: destack.home.xs4all.nl/projects/significance/
![Page 35: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/35.jpg)
35
● Don’t peek ● Okay, maybe you can peek, but don’t stop or make a decision before you reach the fixed
sample size stopping point ● Sequential sampling
Remedies
![Page 36: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/36.jpg)
36
Control A B
![Page 37: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/37.jpg)
Pitfall #3: Making multiple comparisons in one test
![Page 38: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/38.jpg)
38
● P(significant) + P(not significant) = 1 ● Let’s take an alpha of .05 – P(significant) = .05 – P(not significant) = 1 – P(significant) = 1 - .05 = .95
A test can be one of two things: significant or not significant
![Page 39: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/39.jpg)
39
● P(at least 1 significant) = 1 - P(none of the 2 are significant) ● P(none of the 2 are significant) = P(not significant)*P(not significant) = .95*.95 = .9025 ● P(at least 1 significant) = 1 - .9025 = .0975
What about for two comparisons?
![Page 40: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/40.jpg)
40
● That’s almost 2x (1.95x, to be precise) your .05 significance rate!
What about for two comparisons?
![Page 41: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/41.jpg)
41
And it just gets worse…L
P(at least 1 signifcant) An increase of…
5 variaJons 1 – (1-‐.05)^5 = .23 4.6x
10 variaJons 1 – (1-‐.05)^10 = .40 8x
20 variaJons 1 – (1-‐.05)^20 = .64 12.8x
![Page 42: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/42.jpg)
42
● Bonferroni correction – Divide P(significant), your alpha, by the number of variations you are testing, n – alpha/n becomes the new level of statistical significance
How can we remedy this?
![Page 43: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/43.jpg)
43
● Our new P(significant) = .05/2 = .025 ● Our new P(not significant) = 1 - .025 = .975 ● P(at least 1 significant) = 1 - P(none of the 2 are significant) ● P(none of the 2 are significant) = P(not significant)*P(not significant) = .975*.975 = .951 ● P(at least 1 significant) = 1 - .951 = .0499
So what about two comparisons now?
![Page 44: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/44.jpg)
44
P(significant) stays under .05 J
Corrected alpha P(at least 1 signifcant)
5 variaJons .05/5 = .01 1 – (1-‐.01)^5 = .049
10 variaJons .05/10 = .005 1 – (1-‐.005)^10 = .049
20 variaJons .05/20 = .0025 1 – (1-‐.0025)^20 = .049
![Page 45: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/45.jpg)
Questions?
![Page 46: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/46.jpg)
Appendix
![Page 47: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/47.jpg)
47
1. Decide what to test 2. Determine a metric to test 3. Formulate your hypothesis
1. Select an effect size threshold: what change of the metric would make a rollout worthwhile?
4. Calculate sample size (your stopping point) 1. Decide your Type I (alpha) and Type 2 (beta) error levels and the corresponding z-
scores 2. Determine the standard deviation of your metric
5. Run your test! But stop once you’ve reached the fixed sample size stopping point 6. Compute your z-score and compare it with the z-score for your chosen alpha level
A/B test steps:
![Page 48: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/48.jpg)
48
● Type I error: incorrectly reject a true null hypothesis – alpha
● Type II error: incorrectly accept a false null hypothesis – beta – Power: 1 - beta
Type I and Type II error
![Page 49: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/49.jpg)
49
Z-score reference table
alpha One-‐sided test Two-‐sided test
.10 1.28 1.65
.05 1.65 1.96
.01 2.33 2.58
![Page 50: A/B Testing: Avoiding Common Pitfalls · 34 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion](https://reader034.vdocument.in/reader034/viewer/2022042319/5f08f5577e708231d4248cbc/html5/thumbnails/50.jpg)
50
Z-score for proportions (e.g. conversion)