A/B Testing - Design, Analysis and Pitfalls
Business Package Test
Additional Advertising Test
Agenda
• Design the Experiment
▫ 2 main questions – how many users and how long to run the test
▫ Define reasonable number of KPIs
▫ Pay attention to seasonality and weekday effects
• Analyze the Experiment
▫ Statistical methods for checking significance
▫ Non-parametric methods
▫ Outliers/bots/fraud
• Data-driven culture
• Pitfalls
• Open Discussion
Design
• Test Duration & Sample Size
▫ Duration needs to be defined before the experiment is started!
▫ Depends on distribution of main KPIs
About 80% of KPIs follow a Binomial distribution (conversion rate, CTR, etc.); the CLT can help.
The other 20% follow other distributions (event counts, revenue).
Use power calculations to define N (sample size) and t (duration), OR use rules of thumb.
General rule – the smaller the difference you want to detect, the more data you’ll need to collect.
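The power calculation for a Binomial KPI can be sketched as follows. This is a minimal illustration using the standard normal-approximation formula for a two-sided test of two proportions; the function name, the default significance level (5%) and power (80%), and the example conversion rates are assumptions, not values from the slides.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p_base, p_variant, alpha=0.05, power=0.80):
    """Approximate users needed PER GROUP to detect p_base vs p_variant.

    Uses the normal approximation for a two-sided z-test on two proportions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p_base + p_variant) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_base * (1 - p_base) + p_variant * (1 - p_variant))
    ) ** 2
    return ceil(numerator / (p_base - p_variant) ** 2)


# Hypothetical conversion rates: a 1-point lift needs far more users than a 2-point lift.
n_small_lift = sample_size_two_proportions(0.10, 0.11)
n_large_lift = sample_size_two_proportions(0.10, 0.12)
print(n_small_lift, n_large_lift)
```

Halving the detectable difference roughly quadruples the required sample, which is the general rule stated above.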
Design
▫ Example – # of searches per user (SweetIM)
Poisson assumption for count events
Not appropriate when variance >> mean
Negative Binomial (NB) was found appropriate
Power limitation of NB
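A quick way to check whether the Poisson assumption is plausible is to compare the sample variance with the sample mean. The sketch below is illustrative only: the search counts are made up, and the dispersion estimate assumes the common NB parameterization Var = μ + α·μ², which may differ from the one used for the SweetIM analysis.

```python
from statistics import mean, variance


def overdispersion_summary(counts):
    """Compare variance to mean and estimate NB dispersion by method of moments."""
    m = mean(counts)
    v = variance(counts)  # sample variance (n - 1 in the denominator)
    if m <= 0:
        raise ValueError("mean must be positive to assess overdispersion")
    # Assumed parameterization: Var = mu + alpha * mu**2, so alpha = (Var - mu) / mu**2.
    alpha_hat = max((v - m) / m ** 2, 0.0)
    return {"mean": m, "variance": v, "ratio": v / m, "alpha_hat": alpha_hat}


# Made-up searches-per-user counts with a heavy right tail.
searches = [0, 0, 0, 1, 0, 2, 0, 0, 7, 1, 0, 0, 3, 0, 15, 0, 1, 0, 0, 2]
summary = overdispersion_summary(searches)
print(summary)
```

A variance-to-mean ratio close to 1 is consistent with Poisson; a ratio far above 1 points to an overdispersed model such as NB.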
Statistical Power and Sensitivity
Required sample size by sensitivity (rows) and statistical power (columns)
Sensitivity 50% 55% 60% 65% 70% 75% 80% 85% 90% 95%
0.5% 1,001,243 1,133,747 1,276,816 1,433,622 1,608,696 1,808,942 2,045,743 2,340,142 2,738,670 3,386,960
1.0% 251,556 284,847 320,792 360,188 404,175 454,485 513,980 587,946 688,074 850,953
1.5% 112,359 127,228 143,284 160,880 180,527 202,998 229,572 262,609 307,332 380,083
2.0% 63,516 71,922 80,998 90,945 102,052 114,755 129,777 148,453 173,734 214,860
2.5% 40,853 46,259 52,097 58,494 65,638 73,808 83,470 95,482 111,743 138,194
3.0% 28,511 32,284 36,358 40,823 45,809 51,511 58,254 66,637 77,985 96,446
3.5% 21,051 23,837 26,845 30,142 33,823 38,033 43,011 49,201 57,580 71,210
4.0% 16,197 18,341 20,655 23,192 26,024 29,264 33,094 37,857 44,304 54,791
4.5% 12,861 14,564 16,401 18,416 20,665 23,237 26,279 30,060 35,180 43,507
5.0% 10,470 11,855 13,351 14,991 16,821 18,915 21,391 24,470 28,637 35,416
5.5% 8,696 9,846 11,089 12,451 13,971 15,710 17,767 20,324 23,785 29,415
6.0% 7,343 8,315 9,364 10,514 11,798 13,267 15,003 17,162 20,085 24,839
6.5% 6,288 7,120 8,018 9,003 10,103 11,360 12,847 14,696 17,199 21,270
7.0% 5,449 6,170 6,948 7,801 8,754 9,844 11,132 12,735 14,903 18,431
7.5% 4,770 5,401 6,083 6,830 7,664 8,618 9,746 11,148 13,047 16,135
8.0% 4,213 4,771 5,373 6,032 6,769 7,612 8,608 9,847 11,524 14,252
8.5% 3,750 4,247 4,783 5,370 6,026 6,776 7,663 8,766 10,259 12,687
9.0% 3,362 3,807 4,287 4,814 5,402 6,074 6,869 7,858 9,196 11,373
9.5% 3,032 3,434 3,867 4,342 4,872 5,478 6,196 7,087 8,294 10,258
10.0% 2,750 3,114 3,507 3,938 4,419 4,969 5,619 6,428 7,523 9,303
Sample size as a function of sensitivity and statistical power; Negative Binomial parameter α = 0.31, average μ = 0.69, and length of the test t = 30
Design
• Define Reasonable Number of KPIs
▫ It’s impossible to draw a conclusion from 20 KPIs
• Project your KPI on Main Business (Lead) Indicators
• Consider Weighted KPIs or GPI (General Performance Indicator)
• Seasonality
▫ Weekends may have different user behavior than Weekdays
▫ Holidays can be unpredictable
• 7-day rule of thumb
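One reading of the 7-day rule of thumb is to run the test for whole weeks, so that every weekday is represented equally in both groups. The helper below is a hypothetical sketch of that reading: the function name, the traffic figures, and the assumption that all daily new users enter the experiment are illustrative and not taken from the slides.

```python
from math import ceil


def duration_in_full_weeks(users_per_group, groups, daily_new_users, cycle_days=7):
    """Days needed to collect the sample, rounded up to whole weekly cycles."""
    total_users = users_per_group * groups
    raw_days = ceil(total_users / daily_new_users)
    # Round up to a multiple of cycle_days so each weekday appears equally often.
    return ceil(raw_days / cycle_days) * cycle_days


# Hypothetical traffic: 15,000 users per group, 2 groups, 3,000 new users per day.
days = duration_in_full_weeks(15_000, 2, 3_000)
print(days)
```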
Analysis
• Statistical Parametric Methods
• Non-Parametric Methods
• Permutation Tests
• Outliers/Bots/Fraud
Analysis
• Statistical Parametric Methods
▫ Use confidence intervals based on KPI distribution
▫ T-test, Chi-square test, etc. will work, but…
The t-test assumes a normally distributed test statistic
The Chi-square test can be weak when observed frequencies are low
▫ Try hypothesis testing based on the KPI distribution – it’s not simple, but it’s worth it
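For a Binomial KPI such as conversion rate, a confidence interval and a significance test based on the KPI distribution can be sketched as below. This is a minimal example using the normal approximation to the Binomial (reasonable only for large samples); the function name and the conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def compare_conversion_rates(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (difference, confidence interval, two-sided p-value) for B vs A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Confidence interval for the difference, using the unpooled standard error.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)

    # Two-sided z-test, using the pooled standard error under H0: p_a == p_b.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z_stat = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return diff, ci, p_value


# Hypothetical counts: 1,000 of 10,000 users convert in A, 1,100 of 10,000 in B.
diff, ci, p_value = compare_conversion_rates(1_000, 10_000, 1_100, 10_000)
print(diff, ci, p_value)
```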
• The Negative Binomial (NB) distribution can be used as a generalization of Poisson in overdispersed cases (Var >> Mean).
• It has been used before in other domains to analyze count data (genetics, traffic modeling).
• It fits the real distribution well.
[Figure: frequency vs. number of searches, comparing real data with the fitted NB and fitted Poisson distributions]
Analysis
• Non-parametric tests
▫ When it’s hard to estimate the distribution
▫ As QA (a sanity check) for parametric tests
• Mann-Whitney, Kolmogorov-Smirnov
▫ Pros:
Can be appropriate for unknown or not Normal distributions
More robust than t-test
▫ Cons
Less sensitive, with less power than parametric tests (the median is the parameter compared)
Assume (under the null hypothesis) that both samples come from the same distribution
Rely on a normal approximation in large samples
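To make the Mann-Whitney test concrete, here is a simplified standard-library sketch. It assigns midranks to ties and uses the large-sample normal approximation mentioned above, but it omits the tie correction of the variance and the continuity correction, so it is an illustration and not a replacement for a statistics package. The function name and the sample data are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for x, approximate p-value).
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1

    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction (simplification)
    z = (u_x - mean_u) / sd_u
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_x, p_value


# Made-up samples: group_b is group_a shifted upward by 20.
group_a = list(range(1, 31))
group_b = [v + 20 for v in group_a]
u_stat, p_approx = mann_whitney_u(group_a, group_b)
print(u_stat, p_approx)
```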
Analysis
• Permutation tests
1. Calculate the test statistic on the original groups
2. Pool the data, shuffle it, and split it into 2 random groups of the original sizes
3. Calculate the test statistic again
4. Compare it to the original statistic; if it is more extreme, increment the counter: k = k + 1
5. Repeat steps 2-4 N times
6. The share of shuffles with a result more extreme than the original, k/N, is your p-value
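The permutation procedure can be sketched in a few lines. This example assumes the test statistic is the difference in group means and counts shuffles that are at least as extreme as the observed result (a common convention that treats ties as extreme); the function name, the seed, and the data are illustrative.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in means.

    Returns (observed difference, p-value estimated as k / N).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a = len(group_a)

    def mean_diff(values):
        first, second = values[:n_a], values[n_a:]
        return sum(second) / len(second) - sum(first) / len(first)

    pooled = list(group_a) + list(group_b)
    observed = mean_diff(pooled)            # step 1: original statistic
    k = 0
    for _ in range(n_permutations):         # step 5: repeat N times
        rng.shuffle(pooled)                 # step 2: shuffle and re-split
        shuffled = mean_diff(pooled)        # step 3: recompute the statistic
        if abs(shuffled) >= abs(observed):  # step 4: at least as extreme?
            k += 1
    return observed, k / n_permutations     # step 6: p-value = k / N


# Made-up data with an obvious difference between the groups.
control = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
treatment = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
observed_diff, p_perm = permutation_test(control, treatment, n_permutations=2_000)
print(observed_diff, p_perm)
```

The p-value resolution is limited to 1/N, so a larger N is needed when very small p-values matter.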
Analysis
• Check for outliers
▫ Plot your data on daily/hourly level
▫ Descriptive statistics can help (variance)
• Try to filter bots and crawlers
▫ It is almost impossible to filter all non-human activity on the web.
▫ Automatic bots and crawlers can bias the results and lead to wrong conclusions.
• Continuous A/A test as a sanity check for the whole system
▫ What difference do you observe between the A groups, and is it insignificant?
▫ Technical and tracking issues
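Besides plotting, a simple robust rule can flag suspicious days or users for manual review. The sketch below uses the modified z-score based on the median absolute deviation (MAD) with the conventional 3.5 cutoff; this specific rule is a swapped-in technique and not one named in the slides, and the daily values are made up.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices whose modified z-score (MAD-based) exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median; the rule cannot be applied
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, v in enumerate(values) if 0.6745 * abs(v - med) / mad > threshold]


# Made-up daily KPI values with one suspicious spike (for example, bot traffic).
daily_kpi = [102, 98, 105, 99, 101, 97, 480, 103, 100, 96]
suspicious_days = flag_outliers(daily_kpi)
print(suspicious_days)
```

Flagged points are candidates for investigation, not automatic removal.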
Data-Driven Culture
• Avoid the HiPPO (Highest Paid Person’s Opinion) when it is not supported by data
• Be clear about your KPIs and how they affect your business
• Fight your ego – numbers don’t lie
• 80%-90% of tests won’t give a positive result
• Learn from failed tests
Pitfalls
• Picking an easy-to-beat KPI without relation to lead business metrics
▫ Example – focusing on increasing click-through rate for banners/buttons while ignoring other metrics like user retention or revenue.
• Using incorrect statistical methods or violating their assumptions
▫ Example 1 – assuming that a KPI has a Normal distribution without actually checking it.
▫ Example 2 – using online significance calculators without understanding the data distribution.
Pitfalls
• Combining ratios from different proportions over time - Simpson’s Paradox
▫ Example:
• Ignoring outliers and bots | not plotting data on a timeline
▫ Example: One outlier can change the test results
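Simpson’s Paradox, mentioned in the first pitfall above, is easiest to see with numbers. The figures below are entirely hypothetical: group B converts better than A in each period, but because the traffic split differs between periods, the pooled rate makes B look worse.

```python
# Hypothetical (conversions, users) per group and period; traffic split changes over time.
periods = {
    "period_1_low_conversion": {"A": (50, 1_000), "B": (540, 9_000)},
    "period_2_high_conversion": {"A": (1_350, 9_000), "B": (160, 1_000)},
}


def rate(conversions, users):
    return conversions / users


# Conversion rate per group within each period.
per_period = {
    name: {group: rate(*counts) for group, counts in groups.items()}
    for name, groups in periods.items()
}

# Pooled conversion rate per group, summing conversions and users across periods.
pooled = {}
for group in ("A", "B"):
    total_conversions = sum(p[group][0] for p in periods.values())
    total_users = sum(p[group][1] for p in periods.values())
    pooled[group] = rate(total_conversions, total_users)

for name, rates in per_period.items():
    print(name, rates)
print("pooled", pooled)
```

Keeping the traffic split constant during the test, or analyzing each period separately, avoids this trap.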
Pitfalls
• Starting a test without validation (A/A test as a solution)
• Changing the control group during the test (solution: change both groups!)
• Technical issues with the experiment group
▫ Example – redirects, cache, new technology
• Running your experiment “until it reaches a significant difference”
• Not “anchoring” users to one group only (also cookie problems)
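One common way to anchor users to a single group is to derive the assignment deterministically from a stable user identifier instead of a per-session random draw. The sketch below assumes such an identifier exists (for example, an account ID and not a cookie that may be cleared); the function name, the experiment name, and the group labels are hypothetical.

```python
import hashlib


def assign_group(user_id, experiment, groups=("control", "treatment")):
    """Deterministically map a user to one group for a given experiment."""
    # Including the experiment name keeps assignments independent across experiments.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return groups[int(digest, 16) % len(groups)]


# The same user always lands in the same group, on every visit and every device
# that reports the same identifier.
first_visit = assign_group("user-42", "new-toolbar-test")
second_visit = assign_group("user-42", "new-toolbar-test")
print(first_visit, second_visit)

# The split across many hypothetical users should be close to 50/50.
assignments = [assign_group(f"user-{i}", "new-toolbar-test") for i in range(10_000)]
treatment_share = assignments.count("treatment") / len(assignments)
print(treatment_share)
```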
References
▫ How Not To Run An A/B Test: http://www.evanmiller.org/how-not-to-run-an-ab-test.html
▫ Microsoft Experimentation Platform: http://www.exp-platform.com/Pages/ExPpitfalls.aspx
▫ Simpson’s Paradox: http://vudlab.com/simpsons/