A/B Testing – Design, Analysis and Pitfalls


Business Package Test

Additional Advertising Test

Agenda

• Design the Experiment

▫ 2 main questions – how many users and how long to run the test

▫ Define a reasonable number of KPIs

▫ Pay attention to seasonality and weekday effects

• Analyze the Experiment

▫ Statistical methods for checking significance

▫ Non-parametric methods

▫ Outliers/bots/fraud

• Data-driven culture

• Pitfalls

• Open Discussion

Design

• Test Duration & Sample Size

▫ Duration needs to be defined before the experiment is started!

▫ Depends on the distribution of the main KPIs

About 80% of KPIs have a binomial distribution (conversion rate, CTR, etc.), where the central limit theorem (CLT) can help.

The other 20% are count events and revenue.

Use power calculations to define N (sample size) and t (duration), or use rules of thumb.

General rule: the smaller the difference you want to detect, the more data you need to collect.
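
A minimal sketch of a power calculation for a binomial KPI such as conversion rate, assuming a two-sided two-proportion z-test with the normal approximation. The function name, its parameters and the default alpha and power values are illustrative assumptions, not taken from the slides.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p_control, relative_lift, alpha=0.05, power=0.80):
    """Users needed per group to detect a relative lift in a conversion rate.

    Assumes a two-sided two-proportion z-test (normal approximation)
    and equally sized groups.
    """
    p1 = p_control
    p2 = p_control * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Halving the lift you want to detect needs far more users per group.
print(sample_size_two_proportions(0.10, 0.10))
print(sample_size_two_proportions(0.10, 0.05))
```

Dividing the required sample size by the expected daily traffic per group gives a rough test duration, which can then be rounded up to full weeks.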

Design

▫ Example – number of searches per user (SweetIM)

Poisson is the usual assumption for count events

It is not appropriate when variance >> mean (overdispersion)

The Negative Binomial (NB) distribution was found appropriate

Power limitation of NB
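
A quick way to check whether the Poisson assumption is plausible is to compare the sample variance with the sample mean. The sketch below does that and adds a method-of-moments estimate of an NB dispersion parameter, assuming the parameterization Var = mu + alpha * mu^2; that parameterization, the function name and the sample counts are assumptions for illustration and are not SweetIM data.

```python
from statistics import mean, variance


def dispersion_summary(counts):
    """Compare variance to mean for count data (Poisson implies they are equal)."""
    m = mean(counts)
    if m == 0:
        raise ValueError("all counts are zero; dispersion is undefined")
    v = variance(counts)  # sample variance
    # Method-of-moments estimate assuming Var = mu + alpha * mu**2.
    alpha = max((v - m) / m**2, 0.0)
    return {
        "mean": m,
        "variance": v,
        "dispersion_index": v / m,  # about 1 for Poisson, much larger when overdispersed
        "nb_alpha": alpha,
    }


# Made-up searches-per-user counts: many zeros and a few heavy users.
searches_per_user = [0, 0, 0, 1, 0, 2, 0, 0, 7, 0, 1, 0, 0, 3, 0, 12]
print(dispersion_summary(searches_per_user))
```

A dispersion index far above 1 suggests that Poisson-based confidence intervals would be too narrow.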

Statistical Power and Sensitivity

Rows: sensitivity. Columns: statistical power. Cells: required sample size.

Sensitivity        50%        55%        60%        65%        70%        75%        80%        85%        90%        95%
0.5%         1,001,243  1,133,747  1,276,816  1,433,622  1,608,696  1,808,942  2,045,743  2,340,142  2,738,670  3,386,960
1.0%           251,556    284,847    320,792    360,188    404,175    454,485    513,980    587,946    688,074    850,953
1.5%           112,359    127,228    143,284    160,880    180,527    202,998    229,572    262,609    307,332    380,083
2.0%            63,516     71,922     80,998     90,945    102,052    114,755    129,777    148,453    173,734    214,860
2.5%            40,853     46,259     52,097     58,494     65,638     73,808     83,470     95,482    111,743    138,194
3.0%            28,511     32,284     36,358     40,823     45,809     51,511     58,254     66,637     77,985     96,446
3.5%            21,051     23,837     26,845     30,142     33,823     38,033     43,011     49,201     57,580     71,210
4.0%            16,197     18,341     20,655     23,192     26,024     29,264     33,094     37,857     44,304     54,791
4.5%            12,861     14,564     16,401     18,416     20,665     23,237     26,279     30,060     35,180     43,507
5.0%            10,470     11,855     13,351     14,991     16,821     18,915     21,391     24,470     28,637     35,416
5.5%             8,696      9,846     11,089     12,451     13,971     15,710     17,767     20,324     23,785     29,415
6.0%             7,343      8,315      9,364     10,514     11,798     13,267     15,003     17,162     20,085     24,839
6.5%             6,288      7,120      8,018      9,003     10,103     11,360     12,847     14,696     17,199     21,270
7.0%             5,449      6,170      6,948      7,801      8,754      9,844     11,132     12,735     14,903     18,431
7.5%             4,770      5,401      6,083      6,830      7,664      8,618      9,746     11,148     13,047     16,135
8.0%             4,213      4,771      5,373      6,032      6,769      7,612      8,608      9,847     11,524     14,252
8.5%             3,750      4,247      4,783      5,370      6,026      6,776      7,663      8,766     10,259     12,687
9.0%             3,362      3,807      4,287      4,814      5,402      6,074      6,869      7,858      9,196     11,373
9.5%             3,032      3,434      3,867      4,342      4,872      5,478      6,196      7,087      8,294     10,258
10.0%            2,750      3,114      3,507      3,938      4,419      4,969      5,619      6,428      7,523      9,303

Sample size as a function of sensitivity and statistical power; Negative Binomial parameter α = 0.31, test length t = 30 and average μ = 0.69

Design

• Define Reasonable Number of KPIs

▫ It’s impossible to reach a conclusion based on 20 KPIs

• Map your KPIs to the main business (lead) indicators

• Consider weighted KPIs or a GPI (General Performance Indicator)

• Seasonality

▫ Weekends may have different user behavior than Weekdays

▫ Holidays can be unpredictable

• 7-day rule of thumb (run the test over full weeks)
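
Returning to the point about the number of KPIs: one reason 20 KPIs are hard to conclude from is that the chance of at least one false positive grows quickly with the number of tests. The sketch below shows the arithmetic, assuming independent KPIs each tested at the same significance level; the function name and the 5% level are illustrative assumptions.

```python
def family_wise_error_rate(n_kpis, alpha=0.05):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_kpis


for n in (1, 5, 20):
    print(n, f"{family_wise_error_rate(n):.1%}")
```

Real KPIs are usually correlated, so the exact figure differs, but the direction of the effect is the same.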

Analysis

• Statistical Parametric Methods

• Non-Parametric Methods

• Permutation Tests

• Outliers/Bots/Fraud

Analysis

• Statistical Parametric Methods

▫ Use confidence intervals based on KPI distribution

▫ T-test, Chi-square test, etc. will work, but…

The t-test assumes a normal distribution of the test statistic

The Chi-square test can be weak when low frequencies are observed

▫ Try hypothesis testing based on the KPI distribution – it’s not simple, but it’s worth it
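
For a binomial KPI, a confidence interval and test based on the KPI distribution can be sketched as a two-proportion z-test. This is a minimal example under the normal approximation, which needs reasonably large samples; the function name and the conversion numbers are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (z, two-sided p-value, CI for the difference p_b - p_a)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error is used for the test under H0: p_a == p_b.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error is used for the confidence interval.
    se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z, p_value, (diff - z_crit * se_diff, diff + z_crit * se_diff)


# Hypothetical test: 500/10,000 conversions in A, 580/10,000 in B.
z, p_value, ci = two_proportion_z_test(500, 10_000, 580, 10_000)
print(f"z={z:.2f}, p={p_value:.4f}, CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

For count KPIs such as searches per user, the same idea applies with a distribution that fits the data, which is where the extra effort comes in.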

• The Negative Binomial (NB) distribution can be used as a generalization of Poisson in overdispersed cases (Var >> Mean).

• It has been used before in other domains to analyze count data (genetics, traffic modeling).

• It fits the real distribution well.

[Figure: frequency vs. number of searches (0 to 500), comparing real data with fitted NB and fitted Poisson distributions]

Analysis

• Non-parametric tests

▫ When it’s hard to estimate the distribution

▫ As QA (a cross-check) for parametric tests

• Mann-Whitney, Kolmogorov-Smirnov

▫ Pros:

Can be appropriate for unknown or not Normal distributions

More robust than t-test

▫ Cons:

Less sensitive and less powerful than parametric tests (they work with ranks or medians rather than means)

Assume that both samples come from distributions of the same shape

Rely on a normal approximation of the test statistic in large samples
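
To make the rank-based idea concrete, here is a minimal pure-Python sketch of the Mann-Whitney U test using the large-sample normal approximation mentioned above, with average ranks for ties and no continuity correction. The function name is an assumption; in practice a statistics library would normally be used instead.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected)."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    combined = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the block of tied values starting at position i.
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every value is identical, so there is nothing to compare
        return u, 1.0
    z = (u - mean_u) / sqrt(var_u)
    return u, 2 * (1 - NormalDist().cdf(abs(z)))


u, p_value = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(u, round(p_value, 4))
```

The normal approximation is rough for samples this small; it is shown here only to keep the example short.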

Analysis

• Permutations tests

1. Calculate the test statistic on the original groups

2. Pool the data, shuffle it and split it into 2 random groups

3. Calculate the test statistic again

4. Compare it to your original statistic; if it is as extreme or more extreme, increment the counter: k = k + 1

5. Return to step 2 and repeat N times

6. k/N is the probability of getting a result as extreme as your original one; this is your p-value
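
The steps above can be sketched as follows, assuming the test statistic is the absolute difference in group means. The function name, the default number of permutations and the sample data are illustrative assumptions.

```python
import random


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in means; returns k / N."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible

    def statistic(x, y):
        return abs(sum(x) / len(x) - sum(y) / len(y))

    observed = statistic(a, b)            # step 1
    pooled = list(a) + list(b)
    k = 0
    for _ in range(n_permutations):       # step 5
        rng.shuffle(pooled)               # step 2
        new_a, new_b = pooled[:len(a)], pooled[len(a):]
        if statistic(new_a, new_b) >= observed:  # steps 3 and 4
            k += 1
    return k / n_permutations             # step 6


group_a = [10, 11, 12, 13, 14, 15, 16, 17]
group_b = [1, 2, 3, 4, 5, 6, 7, 8]
print(permutation_test(group_a, group_b, n_permutations=2_000))
```

Because no distribution is assumed, the same function works for revenue or count KPIs; only the statistic needs to change.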

Analysis

• Check for outliers

▫ Plot your data on daily/hourly level

▫ Descriptive statistics can help (variance)

• Try to filter bots and crawlers

▫ It is almost impossible to filter all non-human activity on the web.

▫ Automatic bots and crawlers can bias the results and lead to wrong conclusions.

• Run a continuous A/A test as a sanity check for the whole system

▫ What difference do you observe between the A groups, and is it insignificant?

▫ Technical and tracking issues
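
One simple way to surface suspicious days before plotting them is the interquartile-range (IQR) rule. This is a sketch of one common heuristic, not a bot filter; the function name, the multiplier of 1.5 and the daily counts are assumptions for illustration.

```python
from statistics import quantiles


def flag_outliers_iqr(values, k=1.5):
    """Return (index, value) pairs lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]


# Made-up daily conversion counts; day 7 looks like a bot or tracking spike.
daily_conversions = [102, 98, 105, 99, 101, 97, 100, 480, 103, 96]
print(flag_outliers_iqr(daily_conversions))
```

Flagged days still need a manual look: a spike can be a crawler, a tracking bug or a real marketing event.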

Data-Driven Culture

• Avoid HiPPO (Highest Paid Person’s Opinion) that is not supported by data

• Be clear about your KPIs and how they affect your business

• Fight your ego – numbers don’t lie

• 80%-90% of tests won’t give a positive result

• Learn from failed tests

Pitfalls

• Picking an easy-to-beat KPI that is unrelated to lead business metrics

▫ Example – focusing on increasing click-through rate for banners/buttons and ignoring other metrics like user retention or revenue.

• Using incorrect statistical methods or violating their assumptions

▫ Example 1 – assuming that the KPI has a Normal distribution without actually checking it.

▫ Example 2 – Using online significance calculators without understanding the data distribution

Pitfalls

• Combining ratios computed over different traffic proportions over time (Simpson’s Paradox)

▫ Example:

• Ignoring outliers and bots | not plotting data on a timeline

▫ Example: One outlier can change the test results
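
To illustrate the Simpson's Paradox pitfall above, here is a sketch with made-up numbers in which group B converts better than A on each day, yet looks worse once the days are pooled, because the traffic split between the groups changed from day to day. All figures and names are hypothetical.

```python
# (conversions, users) per group and day; all numbers are made up.
days = {
    "day 1": {"A": (2, 100), "B": (297, 9_900)},      # 1% / 99% split, low-converting day
    "day 2": {"A": (250, 5_000), "B": (275, 5_000)},  # 50% / 50% split, high-converting day
}


def rate(conversions, users):
    return conversions / users


def pooled_rate(group):
    """Conversion rate after summing conversions and users over all days."""
    conversions = sum(day[group][0] for day in days.values())
    users = sum(day[group][1] for day in days.values())
    return conversions / users


for name, day in days.items():
    print(name, f"A={rate(*day['A']):.2%}", f"B={rate(*day['B']):.2%}")
print("pooled", f"A={pooled_rate('A'):.2%}", f"B={pooled_rate('B'):.2%}")
```

Keeping the split between groups constant for the whole test, or comparing within each period, avoids the reversal.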

Pitfalls

• Starting the test without validation (an A/A test is a solution)

• Changing the control group during the test (solution: change both groups!)

• Technical issues with the experiment group

▫ Example – redirects, cache, new technology

• Running your experiment “until it reaches a significant difference”

• Not “anchoring” users to one group only (also cookie problems)
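
One common way to anchor users to a single group, without relying only on cookies, is to derive the group from a hash of a stable user identifier. This is a minimal sketch assuming such an identifier exists; the function name, the experiment key and the group labels are hypothetical.

```python
import hashlib


def assign_group(user_id, experiment, groups=("control", "treatment")):
    """Deterministically map a user to one group for a given experiment."""
    # Including the experiment name keeps assignments independent across tests.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return groups[int(digest, 16) % len(groups)]


# The same user always lands in the same group, on any server or device.
print(assign_group("user-42", "business-package-test"))
print(assign_group("user-42", "business-package-test"))
```

This only helps when the identifier itself is stable; users who clear cookies or are not logged in can still switch groups.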

References

▫ How Not To Run An A/B Test: http://www.evanmiller.org/how-not-to-run-an-ab-test.html

▫ Microsoft Experimentation Platform: http://www.exp-platform.com/Pages/ExPpitfalls.aspx

▫ Simpson’s Paradox: http://vudlab.com/simpsons/
