Crash Course in A/B testing
A statistical perspective
Wayne Tai Lee
Roadmap
• What is A/B testing?
• Good experiments and the role of statistics
• Similar to proof by contradiction
• “Tests”
• Big data meets classic asymptotics
• Complaints with classical hypothesis testing
• Alternatives?
What is A/B Testing
• An industry term for a controlled and randomized experiment between treatment/control groups.
• Age-old problem… especially with humans
What most people know:
1) Gather samples
2) Assign treatments (A or B)
3) Apply treatments
4) Measure outcome
5) Compare
Only difference is in the treatment!
Reality:
• Variability from samples/inputs
• Variability from treatment/function
• Variability from measurement
How do we account for all that?
Confounding:
• If there is variability in addition to the treatment effect, how can we identify and isolate the effect of the treatment?
3 Types of Variability:
• Controlled variability
  • Systematic and desired
  • i.e. our treatment
• Bias
  • Systematic but not desired
  • Anything that can confound our study
• Noise
  • Random error but not desired
  • Won’t confound the study but makes it hard to make a decision.
How do we categorize each?
• Variability from samples/inputs
• Variability from treatment/function
• Variability from measurement
Reality:
• Good instrumentation!
• Randomize assignment! Convert bias to noise.
• Your population can be skewed or biased… but that only restricts the generalizability of the results.
• Think about what you want to measure and how! Minimize the noise level/variability in the metric.
A good experiment in general:
- Good design and implementation should be used to avoid bias.
- For unavoidable biases, use randomization to turn them into noise.
- Good planning to minimize noise in data.
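The randomization step can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `assign_groups`, the made-up user IDs, and the fixed seed are not from the talk.

```python
import random


def assign_groups(unit_ids, seed=0):
    """Randomly split units into treatment and control (hypothetical helper).

    Shuffling before splitting makes any hidden trait of a unit (age, device,
    signup time, ...) equally likely to land in either group, so it shows up
    as noise rather than as bias.
    """
    rng = random.Random(seed)      # fixed seed only so the example is repeatable
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Made-up user IDs, purely for illustration.
groups = assign_groups([f"user_{i}" for i in range(10)])
print(groups)
```

The important property is that assignment depends only on the random shuffle, never on anything about the unit itself.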
How do we deal with noise?
- Bread and butter of statisticians!
- Quantify the magnitude of the treatment
- Quantify the magnitude of the noise
- Just compare… most of the time
Formalizing the Comparison
Similar to proof by contradiction
- You assume the difference is by chance (noise)
- See how the data contradicts the assumption
- If the surprise surpasses a threshold, we reject the assumption.
- …nothing is “100%”
Difference due to chance?

| ID | PV |
|---|---|
| Person 1 | 39 |
| Person 2 | 209 |
| Person 3 | 31 |
| Person 4 | 98 |
| Person 5 | 9 |
| Person 6 | 151 |

Red -> treatment; Black -> control
Let’s measure the difference in means!
Group means: 72 (Persons 1, 2, 3 and 5) and 124.5 (Persons 4 and 6)
Diff = 72 - 124.5 = -52.5… so what?
Difference due to chance?
If there was no difference from the treatment, shuffling the treatment status can emulate the randomization of the samples.
- Shuffle 1 (Persons 1 and 5 form the smaller group): Diff = 122.25 - 24 = 98.25
- Shuffle 2 (Persons 4 and 5 form the smaller group): Diff = 107.5 - 53.5 = 54
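The differences quoted on these slides can be reproduced with a short sketch. One assumption is worth stating loudly: the red/black coloring did not survive, so the 4-versus-2 split and the subtraction order used here are inferred from the printed means, and the function name `diff_in_means` is made up for the example.

```python
from statistics import mean

# Page views (PV) for Persons 1..6, copied from the slide.
pv = {1: 39, 2: 209, 3: 31, 4: 98, 5: 9, 6: 151}


def diff_in_means(small_group):
    """Mean of the other four people minus the mean of the two in `small_group`.

    ASSUMPTION: the 4-vs-2 split and the order of subtraction are inferred
    from the means printed on the slides, not stated by the author.
    """
    rest = [value for person, value in pv.items() if person not in small_group]
    picked = [pv[person] for person in small_group]
    return mean(rest) - mean(picked)


print(diff_in_means({4, 6}))   # original labeling: -52.5
print(diff_in_means({1, 5}))   # first shuffle:      98.25
print(diff_in_means({4, 5}))   # second shuffle:     54.0
```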
Difference due to chance?
50000 repeats later…
Our original: -52.5
46.5% of the permutations yielded a difference at least as large in magnitude as our original sample. Are you surprised by the initial results?
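With only six people there is no need for 50000 random shuffles: the 15 possible ways to pick the two-person group can be enumerated exactly. The sketch below does that, again assuming the 4-versus-2 split inferred from the printed means (Persons 4 and 6 in the smaller group).

```python
from itertools import combinations
from statistics import mean

pv = [39, 209, 31, 98, 9, 151]     # Persons 1..6, from the slide
observed_small = (3, 5)            # zero-based indices of Persons 4 and 6 (inferred split)


def diff(small_idx):
    """Mean of the four remaining people minus the mean of the chosen two."""
    small = [pv[i] for i in small_idx]
    rest = [pv[i] for i in range(len(pv)) if i not in small_idx]
    return mean(rest) - mean(small)


observed = diff(observed_small)
all_diffs = [diff(c) for c in combinations(range(len(pv)), 2)]

# Count relabelings at least as extreme as the observed one (in magnitude).
# The small tolerance guards against floating-point ties.
extreme = sum(abs(d) >= abs(observed) - 1e-9 for d in all_diffs)
p_value = extreme / len(all_diffs)
print(observed, extreme, len(all_diffs), round(p_value, 4))   # -52.5 7 15 0.4667
```

The exact answer, 7/15 ≈ 46.7%, agrees with the 46.5% obtained from random shuffling. For realistic sample sizes full enumeration is infeasible, which is why random shuffles are used instead.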
“Tests”
Congratulations!
- You just learned the permutation test!
- The 46.5% is the p-value under the permutation test.
Problems:
- Permuting the labels can be computationally costly.
  - Not possible before computers!
- Statistical theory says there are many tests out there.
“Tests”
Standard t-test:
1) Calculate delta: Δ = mean_treatment - mean_control
2) Assume Δ follows a Normal distribution, then calculate the p-value.
3) If p-value < 0.05, we reject the assumption that there is no difference between treatment and control.
p-value = sum of red areas (the two tails beyond -Δ and +Δ, on either side of 0)
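The three steps above can be sketched as a function. Note the assumption: this is a large-sample z-test with unpooled variances, used as a stand-in for the "standard t-test" on the slide; with small samples a Student-t reference distribution would be more appropriate. The function name and the sample data are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z_test(treatment, control, alpha=0.05):
    """Two-sided test on the difference in means via a Normal approximation.

    Returns (delta, p_value, reject), where `reject` is True when the
    "no difference" assumption is rejected at level `alpha`.
    """
    delta = mean(treatment) - mean(control)                      # step 1
    se = sqrt(variance(treatment) / len(treatment)
              + variance(control) / len(control))                # noise magnitude
    z = delta / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))                 # step 2: both tails
    return delta, p_value, p_value < alpha                       # step 3


# Made-up data purely for illustration.
control = [10, 12, 11, 13, 12, 11, 10, 12] * 10
treatment = [x + 1 for x in control]
print(two_sample_z_test(treatment, control))
```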
Big Data meets Classic Stats
Wait, our metrics may not be Normal!
We care about the “mean of the metric” and not the actual metric distribution.
Central Limit Theorem: The “mean of the metric” will be approximately Normal if the sample size is LARGE!
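A small simulation makes the Central Limit Theorem claim concrete. The setup is an assumption chosen for illustration: the "metric" is drawn from an exponential distribution (heavily skewed), and skewness is used as a rough indicator of how far a distribution is from Normal.

```python
import random
from statistics import fmean, pstdev


def skewness(xs):
    """Sample skewness: close to 0 for a symmetric distribution such as the Normal."""
    m, s = fmean(xs), pstdev(xs)
    return fmean(((x - m) / s) ** 3 for x in xs)


rng = random.Random(42)   # fixed seed so the run is repeatable

# 2000 raw observations of a skewed metric.
raw = [rng.expovariate(1.0) for _ in range(2000)]

# 2000 experiments, each reporting the mean of 100 observations.
sample_means = [fmean(rng.expovariate(1.0) for _ in range(100)) for _ in range(2000)]

print("skewness of raw metric:  ", round(skewness(raw), 2))
print("skewness of sample means:", round(skewness(sample_means), 2))
```

The raw metric stays strongly skewed, while the distribution of the means is much closer to symmetric, which is what the t-test relies on.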
Big Data meets Classic Stats
Assumptions with t-test
- Normality of %delta
  - Holds approximately with large sample sizes
- Independent samples
- Not too many 0’s
That’s IT!!!
- Easy to automate.
- Simple and general.
What are “Tests”?
• Statistical tests are just procedures that depend on data to make a decision.
• Engineerify: Statistical tests are functions that take in data and treatments, and return a boolean.
Guarantees:
• By comparing the p-value to a 5% threshold, we control
  P( Test says difference exists | In reality NO difference ) <= 5%
• By setting the power of the test to be 80%, we control
  P( Test says difference exists | In reality difference exists ) >= 80%
  • Increasing this often requires more data
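How much data the power guarantee requires can be estimated with the textbook sample-size formula for comparing two means, n = 2 (z_{1-α/2} + z_{power})² σ² / δ² per group. This formula is not on the slides; it assumes equal group sizes, a common known standard deviation `sigma`, and a Normal approximation. The function name is made up.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect, sigma, alpha=0.05, power=0.80):
    """Approximate per-group sample size to detect a true difference `effect`.

    ASSUMPTIONS: two-sided test, equal group sizes, equal and known standard
    deviation `sigma`, Normal approximation.
    """
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sigma / effect) ** 2)


print(samples_per_group(effect=0.1, sigma=1.0))               # 1570
print(samples_per_group(effect=0.05, sigma=1.0))              # halving the effect needs about 4x the data
print(samples_per_group(effect=0.1, sigma=1.0, power=0.90))   # more power needs more data
```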
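Meaning:
All treatments are either useless (in reality, no difference) or impactful (in reality, a difference exists).

| Reality | Test decision | Guarantee through conventional thresholds | Jargon |
|---|---|---|---|
| No difference (useless treatments) | No difference | >= 95% | |
| No difference (useless treatments) | Difference exists | <= 5% | Significance level |
| Difference exists (impactful treatments) | No difference | <= 20% | |
| Difference exists (impactful treatments) | Difference exists | >= 80% | Power |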
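These guarantees are statements about repeated experiments, so they can be checked by simulation. The sketch below runs many synthetic experiments, first with no true difference and then with a true difference, and counts how often the test says "difference exists". The Normal data, sample size, effect size, and helper names are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, variance


def rejects(a, b, alpha=0.05):
    """True when a two-sided large-sample z-test says the means differ."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (fmean(a) - fmean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def rejection_rate(true_effect, n=100, trials=2000, seed=7):
    """Fraction of simulated experiments in which the test rejects."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        treatment = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        hits += rejects(treatment, control)
    return hits / trials


false_positive_rate = rejection_rate(true_effect=0.0)   # should be near 5%
power = rejection_rate(true_effect=0.4)                 # should be near 80% for this n
print(false_positive_rate, power)
```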
Meaning:
- Most appropriate over repeated decision making
  - E.g. spammer or not
- Not seeing a difference could mean
  - There is no difference
  - Not enough power
- Seeing a difference could mean
  - There is a difference
  - Got unlucky/lucky
- Your specific test is either impactful or not (100% or 0%). Not what most people want to hear…
Complaints with Hypothesis Testing
• People get really stuck on p-values and tests.
  • Confusing, boring, and formulaic.
• Statistical significance != Scientific significance
  • You could detect a .000001 difference, so what?
• Multiple hypothesis testing
  • 5% false positive is 1 out of 20. Quite high!
  • http://xkcd.com/882/
  • Most published results are still false (Ioannidis 2005)
• What is it answering?
  • Nothing specific about your test… probabilities are over repeated trials.
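The multiple-testing complaint is easy to quantify. The sketch below assumes 20 independent tests in which no treatment has any real effect. It also shows the Bonferroni correction, which is one common remedy and is not something the talk itself proposes.

```python
alpha = 0.05
n_tests = 20   # assumed number of independent tests, all with no real effect

# Chance of at least one false positive across all the tests.
family_wise = 1 - (1 - alpha) ** n_tests
print(round(family_wise, 3))   # 0.642

# Bonferroni correction: compare each p-value to alpha / n_tests instead.
bonferroni_alpha = alpha / n_tests
family_wise_corrected = 1 - (1 - bonferroni_alpha) ** n_tests
print(bonferroni_alpha, round(family_wise_corrected, 3))
```

With the corrected threshold the chance of any false positive drops back below 5%, at the cost of lower power for each individual test.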
Abuse: Prosecutor’s Fallacy
Both children of a British mother died within a short period of time. The mother was convicted of murder because the p-value was low:
If she was innocent, the chance of both children dying is low.
p-value = P( two deaths | innocent )
In fact, we should be looking at P( innocent | two deaths ).
Confusing the two is the prosecutor’s fallacy.
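Example: baseline matters!
All mothers are either guilty mothers or innocent mothers, and each group contains some mothers with two deaths.
The p-value can be small, but the baseline can be huge: innocent mothers vastly outnumber guilty ones, so many of the mothers with two deaths are still innocent.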
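Bayes' rule shows how the baseline enters. Every number in this sketch is hypothetical, chosen only to make the arithmetic visible; none of them comes from the talk or from any real case.

```python
from fractions import Fraction

# ALL NUMBERS ARE HYPOTHETICAL, chosen only to illustrate the arithmetic.
p_guilty = Fraction(1, 1_000_000)                 # baseline: guilty mothers are very rare
p_deaths_given_innocent = Fraction(1, 100_000)    # the small "p-value"
p_deaths_given_guilty = Fraction(1, 1)

p_innocent = 1 - p_guilty

# Total probability of observing two deaths, then Bayes' rule.
p_deaths = (p_deaths_given_innocent * p_innocent
            + p_deaths_given_guilty * p_guilty)
p_innocent_given_deaths = p_deaths_given_innocent * p_innocent / p_deaths

print(float(p_deaths_given_innocent))     # 1e-05, looks damning on its own
print(float(p_innocent_given_deaths))     # about 0.909
```

Under these made-up numbers the p-value is one in a hundred thousand, yet the probability of innocence given two deaths is still about 91%.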
Any Alternatives?
P( innocent | two deaths ) is what we want… but does it make sense?
Bayesian methodology: P( difference exists | data )
This requires knowing P( difference exists ), i.e. the prior
- Philosophical debate: “What is a probability?”
- Easy to cheat the numbers
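As a rough sketch of what a Bayesian comparison can look like, the code below estimates P( rate_B > rate_A | data ) for a conversion metric using Beta priors and Monte Carlo draws. This is a related quantity, not exactly the P( difference exists | data ) on the slide, and the conversion counts, the Beta(1, 1) prior, and the function name are all assumptions.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior=(1, 1), draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A | data) under Beta priors.

    `prior` holds the (alpha, beta) parameters shared by both groups.
    Changing it changes the answer, which is the "easy to cheat" concern.
    """
    rng = random.Random(seed)
    a0, b0 = prior
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(a0 + conv_a, b0 + n_a - conv_a)
        rate_b = rng.betavariate(a0 + conv_b, b0 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical data: 100/1000 conversions in A versus 130/1000 in B.
print(prob_b_beats_a(100, 1000, 130, 1000))
# The same data under a strong prior centered on a 10% rate for both groups.
print(prob_b_beats_a(100, 1000, 130, 1000, prior=(1000, 9000)))
```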
Questions?
- How to deal with multiple hypothesis testing?
- What are we doing in the company?
- Rumor has it that “Multi-armed bandit > A/B testing”?