A/B Testing Pitfalls and Lessons Learned at Spotify
![Page 1: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/1.jpg)
April 9, 2023
Danielle Jabin
A/B Testing: Avoiding Common Pitfalls
![Page 2: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/2.jpg)
Who are you?
![Page 3: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/3.jpg)
![Page 4: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/4.jpg)
![Page 5: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/5.jpg)
Babies named Ava caused the housing bubble
![Page 6: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/6.jpg)
Correlation does not equal causation!
![Page 7: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/7.jpg)
Correlation does not equal causation!
![Page 8: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/8.jpg)
What about in a product context?
![Page 9: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/9.jpg)
Do your improvements actually affect user behavior or are the variations due to chance?
A/B Testing!
![Page 10: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/10.jpg)
But…what’s an A/B test?
![Page 11: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/11.jpg)
![Page 12: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/12.jpg)
Pitfall #1: Not limiting your error rate
![Page 13: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/13.jpg)
![Page 14: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/14.jpg)
What if I flip a coin 100 times, and I get 51 heads?
![Page 15: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/15.jpg)
What if I flip a coin 100 times, and I get 5 heads?
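One way to put numbers on the two coin-flip questions is an exact binomial calculation. The sketch below is an illustration rather than anything from the slides: the helper name `two_sided_p_value` is hypothetical, and it assumes a fair coin as the null hypothesis.

```python
from math import comb

def two_sided_p_value(heads: int, flips: int = 100) -> float:
    """Chance, with a fair coin, of a result at least as far from flips/2 as the one observed."""
    expected = flips / 2
    distance = abs(heads - expected)
    # Count every outcome that is at least as extreme as the observed one.
    extreme = sum(comb(flips, k) for k in range(flips + 1)
                  if abs(k - expected) >= distance)
    return extreme / 2 ** flips

for heads in (51, 5):
    print(f"{heads} heads out of 100: p = {two_sided_p_value(heads):.3g}")
```

For 51 heads the p-value comes out at roughly 0.92, so the result is entirely consistent with a fair coin. For 5 heads it is vanishingly small, so chance alone is a very poor explanation.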
![Page 16: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/16.jpg)
![Page 17: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/17.jpg)
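Statistical significance: a result is statistically significant when it would be unlikely to occur by chance alone if there were no real difference
● The significance level, alpha, is the false positive rate you are willing to accept; alpha levels of 5% and 1% are the most common
– Alternatively: P(significant) = .05 or .01 when there is no real difference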
![Page 18: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/18.jpg)
Remedies
● To lock in error rates before you start a test, fix your sample size
![Page 19: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/19.jpg)
What should my sample size be?
● To lock in error rates before you start a test, fix your sample size

$$n = \frac{2\sigma^2 \left(Z_\beta + Z_{\alpha/2}\right)^2}{\text{difference}^2}$$

– $n$: sample size in each group (assumes equal-sized groups)
– $Z_\beta$: represents the desired power (typically .84 for 80% power)
– $Z_{\alpha/2}$: represents the desired level of statistical significance (typically 1.96)
– $\sigma$: standard deviation of the outcome variable
– difference: effect size (the difference in means)
Source: www.stanford.edu/~kcobb/hrp259/lecture11.ppt
![Page 20: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/20.jpg)
Let’s see some numbers
● With σ = 10, difference in means = 1

| | One-sided test | Two-sided test |
|---|---|---|
| alpha = .10, power = .80 | 841 | 1230 |
| alpha = .05, power = .80 | 1230 | 1568 |
| alpha = .01, power = .80 | 2010 | 2339 |
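The sample sizes above can be approximated in code. This is a minimal sketch of the per-group formula n = 2σ²(Z_β + Z_α)² / difference², using Python's standard-library `NormalDist` for the z-values. The function name `sample_size_per_group` and its parameters are hypothetical, and the one-sided case is assumed to use Z_α in place of Z_α/2.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(sigma: float, difference: float, alpha: float,
                          power: float = 0.80, two_sided: bool = True) -> int:
    """Per-group n = 2 * sigma^2 * (Z_beta + Z_alpha)^2 / difference^2, rounded up."""
    std_normal = NormalDist()
    tail = alpha / 2 if two_sided else alpha   # split alpha across both tails if two-sided
    z_alpha = std_normal.inv_cdf(1 - tail)     # about 1.96 for two-sided alpha = .05
    z_beta = std_normal.inv_cdf(power)         # about .84 for 80% power
    return ceil(2 * sigma ** 2 * (z_beta + z_alpha) ** 2 / difference ** 2)

for alpha in (0.10, 0.05, 0.01):
    one = sample_size_per_group(10, 1, alpha, two_sided=False)
    two = sample_size_per_group(10, 1, alpha, two_sided=True)
    print(f"alpha = {alpha:.2f}, power = 0.80: one-sided {one}, two-sided {two}")
```

Because this uses unrounded z-values, most results land within a few units of the slide's figures, which appear to use z-values rounded to two decimals. The exception is the one-sided alpha = .10 case, where the formula gives about 900 rather than the 841 shown, so that cell may have been computed with a different z-value.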
![Page 21: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/21.jpg)
Pitfall #2: Stopping your test before the fixed sample size stopping point
![Page 22: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/22.jpg)
Let’s see some more numbers
● 1,000 experiments with 200,000 fake participants divided randomly into two groups, both receiving the exact same version, A, with a 3% conversion rate

| | Stop at first point of significance | Ended as significant |
|---|---|---|
| 90% significance reached | 654 of 1,000 | 100 of 1,000 |
| 95% significance reached | 427 of 1,000 | 49 of 1,000 |
| 99% significance reached | 146 of 1,000 | 14 of 1,000 |
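A scaled-down version of this kind of A/A simulation can be sketched as follows. It is not the original experiment: the experiment count, group size, peek interval (`step`), and the pooled two-proportion z-test are all assumptions chosen so the sketch runs quickly with the standard library alone.

```python
import random
from math import sqrt
from statistics import NormalDist

def z_score(conv_a: int, conv_b: int, n: int) -> float:
    """Absolute z statistic of a pooled two-proportion test with n users per group."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return abs(conv_a - conv_b) / n / se if se > 0 else 0.0

def simulate(experiments=200, per_group=5000, step=250, rate=0.03, alpha=0.05, seed=0):
    """Run A/A tests; count those ever significant at a peek and those significant at the end."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff
    ever = ended = 0
    for _ in range(experiments):
        conv_a = conv_b = 0
        hit_at_any_peek = significant = False
        for n in range(step, per_group + 1, step):
            # Both groups get the same version, so any "win" is a false positive.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            significant = z_score(conv_a, conv_b, n) > critical
            hit_at_any_peek = hit_at_any_peek or significant
        ever += hit_at_any_peek
        ended += significant
    return ever, ended

ever, ended = simulate()
print(f"significant at some peek: {ever} of 200")
print(f"significant at the end:   {ended} of 200")
```

Every experiment that ends as significant is also counted as significant at some peek, so the first count can never be smaller than the second. The gap between the two is the extra false positive rate that comes from stopping at the first significant look.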
![Page 23: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/23.jpg)
Remedies
● Don’t peek
● Okay, maybe you can peek, but don’t stop or make a decision before you reach the fixed sample size stopping point
● Sequential sampling
![Page 24: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/24.jpg)
Pitfall #3: Making multiple comparisons in one test
![Page 25: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/25.jpg)
Let’s look at an A/B test from our Discovery feature:
![Page 26: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/26.jpg)
A test can be one of two things: significant or not significant
● P(significant) + P(not significant) = 1
● Let’s take an alpha of .05
– P(significant) = .05
– P(not significant) = 1 – P(significant) = 1 - .05 = .95
![Page 27: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/27.jpg)
What about for two comparisons?
● P(at least 1 significant) = 1 - P(none of the 2 are significant)
● P(none of the 2 are significant) = P(not significant)*P(not significant) = .95*.95 = .9025
● P(at least 1 significant) = 1 - .9025 = .0975
![Page 28: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/28.jpg)
What about for two comparisons?
● That’s almost 2x (1.95x, to be precise) your .05 significance rate!
![Page 29: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/29.jpg)
And it just gets worse…

| | P(at least 1 significant) | An increase of… |
|---|---|---|
| 5 comparisons | 1 – (1-.05)^5 = .23 | 4.6x |
| 10 comparisons | 1 – (1-.05)^10 = .40 | 8x |
| 20 comparisons | 1 – (1-.05)^20 = .64 | 12.8x |
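The same arithmetic generalizes to any number of comparisons. Here is a small sketch, assuming the comparisons are independent and that none has a real effect; the function name `familywise_error_rate` is hypothetical.

```python
def familywise_error_rate(alpha: float, comparisons: int) -> float:
    """P(at least 1 significant) across independent comparisons with no real effect."""
    return 1 - (1 - alpha) ** comparisons

for k in (1, 2, 5, 10, 20):
    rate = familywise_error_rate(0.05, k)
    print(f"{k:>2} comparisons: P(at least 1 significant) = {rate:.4f} "
          f"({rate / 0.05:.2f}x alpha)")
```

The multipliers printed here are computed from unrounded probabilities, so they can differ in the last digit from figures derived from rounded values.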
![Page 30: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/30.jpg)
How can we remedy this?
● Bonferroni correction
– Divide P(significant), your alpha, by the number of alternate hypotheses you have, n
– alpha/n becomes the new level of statistical significance
![Page 31: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/31.jpg)
So what about two comparisons now?
● Our new P(significant) = .05/2 = .025
● Our new P(not significant) = 1 - .025 = .975
● P(at least 1 significant) = 1 - P(none of the 2 are significant)
● P(none of the 2 are significant) = P(not significant)*P(not significant) = .975*.975 = .9506
● P(at least 1 significant) = 1 - .9506 = .0494
![Page 32: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/32.jpg)
P(at least 1 significant) stays under .05

| | Corrected alpha | P(at least 1 significant) |
|---|---|---|
| 5 comparisons | .05/5 = .01 | 1 – (1-.01)^5 = .049 |
| 10 comparisons | .05/10 = .005 | 1 – (1-.005)^10 = .049 |
| 20 comparisons | .05/20 = .0025 | 1 – (1-.0025)^20 = .049 |
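In practice the correction is applied by comparing each comparison's p-value against alpha/n instead of alpha. The sketch below shows one way to do that; the `bonferroni` helper is hypothetical and the p-values are made up purely for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the corrected threshold alpha/n and, per comparison, whether it passes."""
    threshold = alpha / len(p_values)
    return threshold, [p <= threshold for p in p_values]

# Hypothetical p-values for five comparisons in one test (not real data).
p_values = [0.004, 0.03, 0.20, 0.012, 0.65]
threshold, passes = bonferroni(p_values)

print(f"corrected alpha = {threshold:.4f}")
for p, ok in zip(p_values, passes):
    print(f"p = {p}: {'significant' if ok else 'not significant'}")
```

With these made-up values, 0.03 and 0.012 would count as significant against an uncorrected alpha of .05 but not against the corrected threshold of .01. Only 0.004 survives the correction.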
![Page 33: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocument.in/reader035/viewer/2022070302/547bb8185806b503408b4629/html5/thumbnails/33.jpg)
Questions?