An Evaluation Framework for Personalization Strategy Experiment Designs
C. H. Bryan Liu
[email protected]
Imperial College London & ASOS.com, UK
Emma J. McCoy
Imperial College London, UK
ABSTRACT
Online Controlled Experiments (OCEs) are the gold standard in evaluating the effectiveness of changes to websites. An important type of OCE evaluates different personalization strategies, which present challenges in low test power and lack of full control in group assignment. We argue that getting the right experiment setup, the allocation of users to treatment/analysis groups, should take precedence over post-hoc variance reduction techniques in order to enable the scaling of the number of experiments. We present an evaluation framework that, along with a few simple rules of thumb, allows experimenters to quickly compare which experiment setup will lead to the highest probability of detecting a treatment effect under their particular circumstances.
KEYWORDS
Experimentation, Experiment design, Personalization strategies, A/B testing.
1 INTRODUCTION
The use of Online Controlled Experiments (OCEs, e.g. A/B tests) has become popular in measuring the impact of digital products and services, and in guiding business decisions on the Web [10]. Major companies report running thousands of OCEs on any given day [6, 8, 14], and many startups exist purely to manage OCEs [2, 7].
A large number of OCEs address simple variations on elements of the user experience based on random splits, e.g. showing a different colored button to users based on a user ID hash bucket. Here, we are interested in experiments that compare personalization strategies, complex sets of targeted user interactions that are common in e-commerce and digital marketing, and measure the change to a performance metric of interest (metric hereafter). Examples of personalization strategies include the scheduling, budgeting, and ordering of marketing activities directed at a user based on their purchase history.
Experiments for personalization strategies face two unique challenges. Firstly, strategies are often only applicable to a small fraction of the user base, and thus many simple experiment designs suffer from either a lack of sample size / statistical power, or diluted metric movement from including irrelevant samples [3]. Secondly, as users are not randomly assigned a priori, but must qualify to be treated with a strategy via their actions or attributes, groups of users subjected to different strategies cannot be assumed to be statistically equivalent and hence are not directly comparable.
While there are a number of variance reduction techniques (including stratification and control variates [4, 12]) that partially address these challenges, the strata and control variates involved can vary dramatically from one personalization strategy experiment to another, requiring many ad hoc adjustments. As a result, such techniques may not scale well when organizations design and run hundreds or thousands of experiments at any given time.
We argue that personalization strategy experiments should focus on the assignment of users from the strategies they qualified for to the treatment/analysis groups. We call this mapping process an experiment setup. Identifying the best experiment setup increases the chance of detecting any treatment effect. An experimentation framework can also reuse and switch between different setups quickly with little custom input, ensuring the operation can scale. More importantly, the process does not hinder the subsequent application of variance reduction techniques, meaning that we can still apply them post hoc if required.
To date, many experiment setups exist to compare personalization strategies. An increasingly popular approach is to compare the strategies using multiple control groups: Quantcast calls it a dual control [1], and Facebook calls it a multi-cell lift study [9]. In the two-strategy case, this involves running two experiments on two random partitions of the user base in parallel, with each experiment further splitting the respective partition into treatment/control and measuring the incrementality (the change in a metric as compared to the case where nothing is done) of each strategy. The incrementalities of the strategies are then compared against each other.
Despite the setup above gaining traction in display advertising, there is a lack of literature on whether it is a better setup, i.e. one that has a higher sensitivity and/or apparent effect size than other setups. While [9] noted that multi-cell lift studies require a large number of users, they did not discuss how the number compares to other setups.1 The ability to identify and adopt a better experiment setup can reduce the required sample size, and hence enable more cost-effective experimentation.
We address the gap in the literature by introducing an evaluation framework that compares experiment setups given two personalization strategies. The framework is designed to be flexible so that it is able to deal with the wide range of baselines and changes in user responses presented by any pair of strategies (situations hereafter). We also recognize the need to quickly compare common setups, and provide some simple rules of thumb on situations where one setup will be better than another. In particular, we outline the conditions where employing a multi-cell setup, as well as metric dilution, is desirable.
To summarize, our contributions are: (i) We develop a flexible evaluation framework for personalization strategy experiments, where one can compare two experiment setups given the situation presented by two competing strategies (Section 2); (ii) We provide simple rules of thumb to enable experimenters who do not require the full flexibility of the framework to quickly compare common
1A single-cell lift study is often used to measure the incrementality of a single personalization strategy, and hence is not a representative comparison.
arXiv:2007.11638v2 [stat.ME] 17 Dec 2020
AdKDD '20, Aug 2020, Online. Liu and McCoy.
Figure 1: Venn diagram of the user groups in our evaluation framework. The outer, left inner (red), and right inner (blue) boxes represent the entire user base, those who qualify for strategy 1, and those who qualify for strategy 2 respectively. Group 0 (qualify for neither strategy) sits outside both inner boxes; Groups 1, 2, and 3 qualify for strategy 1 only, strategy 2 only, and both strategies respectively.
setups (Section 3); and (iii) We make our results useful to practitioners by making the code used in the paper (Section 4) publicly available.2
2 EVALUATION FRAMEWORK
We first present our evaluation framework for personalization strategy experiments. The experiments compare two personalization strategies, which we refer to as strategy 1 and strategy 2. Often one of them is the existing strategy, and the other is a new strategy we intend to test and learn from. In this section we introduce (i) how users qualifying themselves into strategies creates non-statistically-equivalent groups, (ii) how experimenters usually assign the users, and (iii) when we would consider an assignment to be better.
2.1 User grouping
As users qualify themselves into the two strategies, four disjoint groups emerge: those who qualify for neither strategy, those who qualify only for strategy 1, those who qualify only for strategy 2, and those who qualify for both strategies. We denote these (user) groups 0, 1, 2, and 3 respectively (see Figure 1). It is perhaps obvious that we cannot assume those in different user groups are statistically equivalent and compare them directly.
We assume groups 0, 1, 2, and 3 have n_0, n_1, n_2, and n_3 users respectively. We also assume responses from users (which we aggregate, often by taking the mean, to obtain our metric) are distributed differently between groups, and, within the same group, between the scenario where the group is subjected to the treatment associated with the corresponding strategy and that where nothing is done (baseline). We list all group-scenario combinations in Table 1, and denote the mean and variance of the responses (μ_G, σ²_G) for a combination G.3
2.2 Experiment setups
Many experiment setups exist and are in use in different organizations. Here we introduce four common setups of varying sophistication, which we also illustrate in Figure 2.
2The code and supplementary document are available at: https://github.com/liuchbryan/experiment_design_evaluation.
3For example, the responses for group 1 without any intervention have mean and variance (μ_C1, σ²_C1), and those for group 2 with the treatment prescribed under strategy 2 have mean and variance (μ_I2, σ²_I2).
Figure 2: Experiment setups overlaid on the user grouping Venn diagram in Figure 1. The hatched boxes indicate which users are included in the analysis, and the downward triangles and dots indicate who are subjected to the treatment prescribed under strategies 1 and 2 respectively. See Section 2.2 for a detailed description.
Setup 1 (Users in the intersection only). The setup considers only users who qualify for both strategies. These users are randomly split (usually 50/50) into two (analysis) groups A and B, and are prescribed the treatment specified by strategies 1 and 2 respectively. The setup is easy to implement, though it is difficult to translate any learnings obtained from it to other user groups (e.g. those who qualify for one strategy only) [5].
Setup 2 (All samples). The setup is a simple A/B test that considers all users, regardless of whether they qualify for any strategy. The users are randomly split into two analysis groups A and B, and are prescribed the treatment specified by strategy 1 (resp. 2) if (i) they qualify under the strategy and (ii) they are in analysis group A (resp. B). This setup is the easiest to implement but usually suffers severely from a dilution in metric [3].
Setup 3 (Qualified users only). The setup is similar to Setup 2 except that only those who qualify for at least one strategy ("triggered" users in some literature [3]) are included in the analysis groups. The setup sits between Setup 1 and Setup 2 in terms of user coverage, and has the advantage of capturing the largest number of useful samples while having the least metric dilution. However, the setup also prevents one from measuring the incrementality of a strategy itself; it yields only the difference in incrementality between the two strategies.
Setup 4 (Dual control / multi-cell lift test). As described in Section 1, the setup first splits the users randomly into two randomization groups. For the first randomization group, we consider those who qualify for strategy 1, and split them into analysis groups A1 and A2. Group A2 receives the treatment prescribed under strategy 1, and group A1 acts as control. The incrementality for strategy 1 is then the difference in metric between groups A2 and A1. We
                                 Group 0   Group 1   Group 2   Group 3
Baseline (Control)               C0        C1        C2        C3
Under treatment (Intervention)   /         I1        I2        under strategy 1: Ia; under strategy 2: Ib

Table 1: All group-scenario combinations in our evaluation framework for personalization strategy experiments. The columns represent the groups described in Figure 1. The baseline represents the scenario where nothing is done. We assume those who qualify for both strategies (Group 3) can only receive treatment(s) associated with either strategy.
apply the same process to the second randomization group, with strategy 2 and analysis groups B1 and B2 in place, and compare the incrementalities for strategies 1 and 2. The setup allows one to obtain the incrementality of each individual strategy and minimizes metric dilution. However, it also leaves a number of samples unused and creates extra analysis groups, and hence generally suffers from low test power [9].
2.3 Evaluation criteria
There are a number of considerations when one evaluates competing experiment setups. These include technical considerations, such as the complexity of implementing the setups on an OCE framework, and business considerations, such as whether the incrementality of individual strategies is required.
Here we focus on the statistical aspect and propose two evaluation criteria: (i) the actual average treatment effect size as presented by the two analysis groups in an experiment setup, and (ii) the sensitivity of the experiment, represented by the minimum detectable effect (MDE) under a pre-specified test power. Both criteria are necessary, as the former indicates whether a setup suffers from metric dilution, whereas the latter indicates whether the setup suffers from a lack of power/sample size. An ideal setup should yield a high actual effect size and a high sensitivity (i.e. a low MDE),4 though as we observe in the next section there is usually a trade-off.
We formally define the two evaluation criteria from first principles, introducing the relevant notation along the way. Let A and B be the two analysis groups in an experiment setup, with user responses randomly distributed with mean and variance (μ_A, σ²_A) and (μ_B, σ²_B) respectively. We first recall that if there are sufficient samples, the sample means of the two groups approximately follow normal distributions by the Central Limit Theorem:

    Ā ∼ N(μ_A, σ²_A / n_A),   B̄ ∼ N(μ_B, σ²_B / n_B)   (approximately),   (1)
where n_A and n_B are the number of samples taken from A and B respectively. The difference in the sample means then also approximately follows a normal distribution:

    D̂ ≜ B̄ − Ā ∼ N( Δ ≜ μ_B − μ_A,  σ²_D̂ ≜ σ²_A/n_A + σ²_B/n_B )   (approximately).   (2)
Here, Δ is the actual effect size that we are interested in. The definition of the MDE θ* requires a primer on the power of a statistical test. A common null hypothesis statistical test in personalization strategy experiments uses the two-tailed hypotheses H0: Δ = 0 and H1: Δ ≠ 0, with the test statistic under H0 being:

    Z ≜ D̂ / σ_D̂ ∼ N(0, 1)   (approximately).   (3)
4We will use the terms "high(er) sensitivity" and "low(er) MDE" interchangeably.
We recall that the null hypothesis is rejected if |Z| > z_{1−α/2}, where α is the significance level and z_{1−α/2} is the 1 − α/2 quantile of a standard normal. Under a specific alternative hypothesis Δ = θ, the power is specified as

    1 − β_θ ≜ Pr( |Z| > z_{1−α/2} | Δ = θ ) ≈ 1 − Φ( z_{1−α/2} − |θ|/σ_D̂ ),   (4)

where Φ denotes the cumulative distribution function of a standard normal.5 To achieve a minimum test power π_min, we require that 1 − β_θ > π_min. Substituting Equation (4) into the inequality and rearranging to make θ the subject yields the effect sizes that the test will be able to detect with the specified power:

    |θ| > (z_{1−α/2} − z_{1−π_min}) σ_D̂.   (5)

θ* is then defined as the positive minimum θ that satisfies Inequality (5), i.e. that specified by the RHS of the inequality.
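The RHS of Inequality (5) is straightforward to compute with standard normal quantiles. A minimal sketch in Python (the function name and default arguments are ours, not from the paper's released code):

```python
from statistics import NormalDist

def mde(sigma_d: float, alpha: float = 0.05, power: float = 0.8) -> float:
    """Minimum detectable effect theta* per Inequality (5):
    (z_{1-alpha/2} - z_{1-power}) * sigma_D-hat, where sigma_D-hat is
    the standard deviation of the difference in sample means."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1-alpha/2}
    z_power = NormalDist().inv_cdf(1 - power)      # z_{1-pi_min}
    return (z_crit - z_power) * sigma_d
```

At the conventional α = 0.05 and 80% power, the multiplier is z_0.975 − z_0.2 ≈ 1.96 + 0.84 ≈ 2.80.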
We finally define what it means to be better under these evaluation criteria. WLOG we assume the actual effect sizes of the two competing experiment setups are positive,6 and say a setup X is superior to another setup Y if, all else being equal,
(i) X produces a higher actual effect size (Δ_X > Δ_Y) and a lower minimum detectable effect size (θ*_X < θ*_Y), or
(ii) the gain in actual effect is greater than the loss in sensitivity:

    Δ_X − Δ_Y > θ*_X − θ*_Y,   (6)

which means an actual effect still stands a higher chance of being observed under X.
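The two criteria translate directly into a comparison function; a sketch under our notation (assuming, as above, positive actual effect sizes):

```python
def is_superior(delta_x: float, theta_x: float,
                delta_y: float, theta_y: float) -> bool:
    """Is setup X superior to setup Y under the criteria of Section 2.3?
    Criterion (i): higher actual effect AND lower MDE.
    Criterion (ii): the gain in actual effect exceeds the loss in
    sensitivity (Inequality (6))."""
    criterion_i = delta_x > delta_y and theta_x < theta_y
    criterion_ii = (delta_x - delta_y) > (theta_x - theta_y)
    return criterion_i or criterion_ii
```

Note that criterion (i) implies criterion (ii); listing both mirrors the two-step argument used in Section 3.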
3 COMPARING EXPERIMENT SETUPS
Having described the evaluation framework above, in this section we use the framework to compare the common experiment setups described in Section 2.2. We first derive the actual effect size and MDE for each setup in Section 3.1, and then use the results to create rules of thumb on (i) whether diluting the metric by including users who qualify for neither strategy is beneficial (Section 3.2) and (ii) whether dual control is a better setup for personalization strategy experiments (Section 3.3), two questions that are often discussed among e-commerce and marketing-focused experimenters. For brevity, we relegate most of the intermediate algebraic work in deriving the actual and minimum detectable effect sizes, as well as the conditions that lead to a setup being superior, to our supplementary document.2
5The approximation in Equation (4) is tight for experiment design purposes, where α < 0.2 and 1 − β > 0.6 in nearly all cases.
6If both actual effect sizes are negative, we simply swap the analysis groups. If the actual effect sizes are of opposite signs, it is likely an error.
3.1 Actual & minimum detectable effect sizes
We first present the actual effect size and MDE of the four experiment setups. For each setup we first compute the sample size, mean response, and response variance in each analysis group, which arises as a mixture of the user groups described in Section 2.1. For brevity, we only present the quantities for one analysis group per setup; expressions for the other analysis group(s) can easily be obtained by substituting in the corresponding user groups. We then substitute the computed quantities into the definitions of Δ (see Equation (2)) and θ* (see Inequality (5)) to obtain the setup-specific actual effect size and MDE. We assume all random splits are done 50/50 in these setups to maximize the test power.
Setup 1 (Users in the intersection only). The setup randomly splits user group 3 into two analysis groups, each with n_3/2 samples. Users in analysis group A are provided treatment under strategy 1, and hence the group's responses have mean and variance (μ_Ia, σ²_Ia). The actual effect size and MDE for Setup 1 are hence:

    Δ_S1 = μ_Ib − μ_Ia,   (7)

    θ*_S1 = (z_{1−α/2} − z_{1−π_min}) √( 2(σ²_Ia + σ²_Ib) / n_3 ).   (8)
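Equations (7) and (8) can be sketched in a few lines of Python (function and argument names are ours):

```python
from statistics import NormalDist

def setup1_effects(mu_ia, mu_ib, var_ia, var_ib, n3,
                   alpha=0.05, power=0.8):
    """Actual effect (Eq. (7)) and MDE (Eq. (8)) for Setup 1, which
    splits the n3 users of group 3 50/50 between the two strategies."""
    z = NormalDist().inv_cdf(1 - alpha / 2) - NormalDist().inv_cdf(1 - power)
    delta = mu_ib - mu_ia                              # Eq. (7)
    theta = z * (2 * (var_ia + var_ib) / n3) ** 0.5    # Eq. (8)
    return delta, theta
```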
Setup 2 (All samples). This setup also contains two analysis groups, A and B, each taking half of the population (i.e. (n_0 + n_1 + n_2 + n_3)/2). The mean response and response variance for groups A and B are the weighted mean response and response variance of the constituent user groups respectively. As we only provide treatment to those who qualify for strategy 1 in group A, and likewise for group B with strategy 2, each user group will give different responses, e.g. for group A:

    μ_A = (n_0 μ_C0 + n_1 μ_I1 + n_2 μ_C2 + n_3 μ_Ia) / (n_0 + n_1 + n_2 + n_3),   (9)

    σ²_A = (n_0 σ²_C0 + n_1 σ²_I1 + n_2 σ²_C2 + n_3 σ²_Ia) / (n_0 + n_1 + n_2 + n_3).   (10)
Substituting the above (and the analogues for group B) into the definitions of actual effect size and MDE, we have:

    Δ_S2 = [ n_1(μ_C1 − μ_I1) + n_2(μ_I2 − μ_C2) + n_3(μ_Ib − μ_Ia) ] / (n_0 + n_1 + n_2 + n_3),   (11)

    θ*_S2 = (z_{1−α/2} − z_{1−π_min}) √( 2[ 2 n_0 σ²_C0 + n_1(σ²_I1 + σ²_C1) + n_2(σ²_C2 + σ²_I2) + n_3(σ²_Ia + σ²_Ib) ] / (n_0 + n_1 + n_2 + n_3)² ).   (12)
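Equations (11) and (12) are, again, simple weighted sums; a Python sketch (parameter layout is ours — the sizes, means, and variances follow the Table 1 labels):

```python
from statistics import NormalDist

def setup2_effects(n, mu, var, alpha=0.05, power=0.8):
    """Actual effect (Eq. (11)) and MDE (Eq. (12)) for Setup 2.
    n maps group index (0..3) to size; mu and var map the Table 1
    group-scenario labels ('C0', 'I1', 'Ia', ...) to means/variances."""
    z = NormalDist().inv_cdf(1 - alpha / 2) - NormalDist().inv_cdf(1 - power)
    total = n[0] + n[1] + n[2] + n[3]
    delta = (n[1] * (mu['C1'] - mu['I1'])
             + n[2] * (mu['I2'] - mu['C2'])
             + n[3] * (mu['Ib'] - mu['Ia'])) / total       # Eq. (11)
    var_sum = (2 * n[0] * var['C0']
               + n[1] * (var['I1'] + var['C1'])
               + n[2] * (var['C2'] + var['I2'])
               + n[3] * (var['Ia'] + var['Ib']))
    theta = z * (2 * var_sum) ** 0.5 / total               # Eq. (12)
    return delta, theta
```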
Setup 3 (Qualified users only). The setup is very similar to Setup 2, with members of user group 0 excluded. This leads to both analysis groups having (n_1 + n_2 + n_3)/2 users. The absence of group 0 users means they do not feature in the weighted mean response and response variance of the two analysis groups, e.g. for group A:

    μ_A = (n_1 μ_I1 + n_2 μ_C2 + n_3 μ_Ia) / (n_1 + n_2 + n_3),   σ²_A = (n_1 σ²_I1 + n_2 σ²_C2 + n_3 σ²_Ia) / (n_1 + n_2 + n_3).   (13)
This leads to the following actual effect size and MDE for Setup 3:

    Δ_S3 = [ n_1(μ_C1 − μ_I1) + n_2(μ_I2 − μ_C2) + n_3(μ_Ib − μ_Ia) ] / (n_1 + n_2 + n_3),   (14)

    θ*_S3 = (z_{1−α/2} − z_{1−π_min}) √( 2[ n_1(σ²_I1 + σ²_C1) + n_2(σ²_C2 + σ²_I2) + n_3(σ²_Ia + σ²_Ib) ] / (n_1 + n_2 + n_3)² ).   (15)
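In code, Setup 3 is the Setup 2 mixture with group 0 dropped entirely; a self-contained sketch under the same (our) parameter layout:

```python
from statistics import NormalDist

def setup3_effects(n, mu, var, alpha=0.05, power=0.8):
    """Actual effect (Eq. (14)) and MDE (Eq. (15)) for Setup 3.
    n maps group index (1..3) to size; mu and var map the Table 1
    group-scenario labels to response means/variances."""
    z = NormalDist().inv_cdf(1 - alpha / 2) - NormalDist().inv_cdf(1 - power)
    m = n[1] + n[2] + n[3]
    delta = (n[1] * (mu['C1'] - mu['I1'])
             + n[2] * (mu['I2'] - mu['C2'])
             + n[3] * (mu['Ib'] - mu['Ia'])) / m           # Eq. (14)
    var_sum = (n[1] * (var['I1'] + var['C1'])
               + n[2] * (var['C2'] + var['I2'])
               + n[3] * (var['Ia'] + var['Ib']))
    theta = z * (2 * var_sum) ** 0.5 / m                   # Eq. (15)
    return delta, theta
```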
Setup 4 (Dual control). The setup is the odd one out as it has four analysis groups. Two of the analysis groups (A1 and A2) are drawn from those who qualify for strategy 1 and are allocated to the first randomization group, and the other two (B1 and B2) are drawn from those who qualify for strategy 2 and are allocated to the second randomization group:

    n_A1 = n_A2 = (n_1 + n_3)/4,   n_B1 = n_B2 = (n_2 + n_3)/4.   (16)

The mean response and response variance for group A1 are:

    μ_A1 = (n_1 μ_C1 + n_3 μ_C3) / (n_1 + n_3),   σ²_A1 = (n_1 σ²_C1 + n_3 σ²_C3) / (n_1 + n_3).   (17)

As the setup takes the difference of differences in the metric (i.e. the difference between the mean response for groups B2 and B1, and the difference between the mean response for groups A2 and A1),7 the actual effect size is as follows:

    Δ_S4 = (μ_B2 − μ_B1) − (μ_A2 − μ_A1)
         = [ n_2(μ_I2 − μ_C2) + n_3(μ_Ib − μ_C3) ] / (n_2 + n_3) − [ n_1(μ_I1 − μ_C1) + n_3(μ_Ia − μ_C3) ] / (n_1 + n_3).   (18)

The MDE for Setup 4 is similar to that specified by the RHS of Inequality (5), albeit with more groups:

    θ*_S4 = (z_{1−α/2} − z_{1−π_min}) √( σ²_A1/n_A1 + σ²_A2/n_A2 + σ²_B1/n_B1 + σ²_B2/n_B2 )
          = 2 (z_{1−α/2} − z_{1−π_min}) √( [ n_1(σ²_C1 + σ²_I1) + n_3(σ²_C3 + σ²_Ia) ] / (n_1 + n_3)² + [ n_2(σ²_C2 + σ²_I2) + n_3(σ²_C3 + σ²_Ib) ] / (n_2 + n_3)² ).   (19)
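A Python sketch of Equations (18) and (19), under the same (our) parameter layout as the earlier setups; unlike Setups 1–3 this also requires the group 3 baseline (C3), since group 3 users can sit in a control cell:

```python
from statistics import NormalDist

def setup4_effects(n, mu, var, alpha=0.05, power=0.8):
    """Actual effect (Eq. (18)) and MDE (Eq. (19)) for Setup 4
    (dual control), with analysis groups A1/A2 and B1/B2."""
    z = NormalDist().inv_cdf(1 - alpha / 2) - NormalDist().inv_cdf(1 - power)
    na, nb = n[1] + n[3], n[2] + n[3]
    # Eq. (18): difference of the two per-strategy incrementalities
    delta = ((n[2] * (mu['I2'] - mu['C2']) + n[3] * (mu['Ib'] - mu['C3'])) / nb
             - (n[1] * (mu['I1'] - mu['C1']) + n[3] * (mu['Ia'] - mu['C3'])) / na)
    # Eq. (19): four analysis groups, each of size na/2 or nb/2... /2 again
    theta = 2 * z * (
        (n[1] * (var['C1'] + var['I1']) + n[3] * (var['C3'] + var['Ia'])) / na ** 2
        + (n[2] * (var['C2'] + var['I2']) + n[3] * (var['C3'] + var['Ib'])) / nb ** 2
    ) ** 0.5
    return delta, theta
```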
3.2 Is dilution always bad?
The use of responses from users who do not qualify for any of the strategies being compared, an act known as metric dilution, has stirred countless debates in experimentation teams. On one hand, responses from these users make any treatment effect less pronounced by contributing exactly zero; on the other hand, dilution might be necessary, as one may not know who actually qualifies for a strategy [9], or desirable, as these users can be leveraged to reduce the variance of the treatment effect estimator [3].
Here, we are interested in whether we should engage in dilution given the assumed user responses prior to an experiment. This can be clarified by understanding the conditions under which Setup 3 would emerge superior (as defined in Section 2.3) to Setup 2. By inspecting Equations (11) and (14), it is clear that Δ_S3 > Δ_S2
7Not to be confused with the difference-in-differences method, which captures the metric for the control and treatment groups at both the beginning (pre-intervention) and the end (post-intervention) of the experiment. Here we only capture the metric once, at the end of the experiment.
if n_0 > 0. Thus, Setup 3 is superior to Setup 2 under the first criterion if θ*_S3 < θ*_S2, which is the case if σ²_C0, the metric variance of users who qualify for neither strategy, is large. This can be shown by substituting Equations (12) and (15) into the θ*-inequality and rearranging the terms to obtain:

    [ n_1(σ²_I1 + σ²_C1) + n_2(σ²_C2 + σ²_I2) + n_3(σ²_Ia + σ²_Ib) ] · (n_0 + 2n_1 + 2n_2 + 2n_3) / [ 2(n_1 + n_2 + n_3)² ] < σ²_C0.   (20)
If we assume the response variances are similar across groups of users who qualify for at least one strategy, i.e. σ²_I1 ≈ σ²_C1 ≈ ⋯ ≈ σ²_Ib ≈ σ²_q, Inequality (20) can be simplified to

    σ²_q ( n_0 / (n_1 + n_2 + n_3) + 2 ) < σ²_C0,   (21)

which can be used to quickly determine whether one should consider dilution at all.
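The quick check of Inequality (21) is a one-liner; a sketch (the function name is ours):

```python
def dilution_hurts_sensitivity(var_q, var_c0, n0, n_qualified):
    """Rule-of-thumb check of Inequality (21): assuming roughly equal
    response variances var_q across all qualified group-scenario
    combinations, returns True when the undiluted Setup 3 has a
    strictly lower MDE than the diluted Setup 2. n_qualified is
    n1 + n2 + n3 and n0 is the number of non-qualifying users."""
    return var_q * (n0 / n_qualified + 2) < var_c0
```

For instance, with equal numbers of qualifying and non-qualifying users, dilution hurts sensitivity as soon as the non-qualifying users' variance exceeds three times the qualified users' variance.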
If Inequality (20) does not hold (i.e. θ*_S3 ≥ θ*_S2), we should then consider when the second criterion (i.e. Δ_S3 − Δ_S2 > θ*_S3 − θ*_S2) is met. Writing

    η = n_1(μ_C1 − μ_I1) + n_2(μ_I2 − μ_C2) + n_3(μ_Ib − μ_Ia),
    ξ = n_1(σ²_C1 + σ²_I1) + n_2(σ²_I2 + σ²_C2) + n_3(σ²_Ia + σ²_Ib), and
    z = z_{1−α/2} − z_{1−π_min},

we can substitute Equations (11), (12), (14), and (15) into the inequality and rearrange to obtain

    [ (n_1 + n_2 + n_3) / n_0 ] √( 2 n_0 σ²_C0 + ξ ) > [ (n_0 + n_1 + n_2 + n_3) / n_0 ] √ξ − η / (√2 z).   (22)
As the LHS of Inequality (22) is always positive, Setup 3 is superior if the RHS ≤ 0. Noting that

    Δ_S3 = η / (n_1 + n_2 + n_3) and θ*_S3 = √2 · z · √ξ / (n_1 + n_2 + n_3),

the trivial case is satisfied if

    [ (n_0 + n_1 + n_2 + n_3) / n_0 ] · θ*_S3 ≤ Δ_S3.   (23)
If the RHS of Inequality (22) is positive, we can safely square both sides and use the identities for Δ_S3 and θ*_S3 to get

    2σ²_C0 / n_0 > [ ( θ*_S3 − Δ_S3 + ((n_1+n_2+n_3)/n_0) θ*_S3 )² − ( ((n_1+n_2+n_3)/n_0) θ*_S3 )² ] / (2z²).   (24)

As the LHS is always positive, the second criterion is met if

    θ*_S3 ≤ Δ_S3.   (25)
Note this is a weaker, and thus more easily satisfied, condition than that introduced in Inequality (23). This suggests an experiment setup is always superior to a diluted alternative if the experiment is already adequately powered; introducing any dilution will simply make things worse.
Failing the condition in Inequality (25), we can always fall back on Inequality (24). While the inequality operates in squared space, it essentially compares the standard error of user group 0 (LHS), those who qualify for neither strategy, to the gap between the minimum detectable and actual effects (θ*_S3 − Δ_S3). The gap can be interpreted as the existing noise level: a higher standard error means mixing in group 0 users will introduce extra noise, and one is better off without them. Conversely, a smaller standard error means group 0 users can lower the noise level, i.e. stabilize the metric fluctuation, and one should take advantage of them.
To summarize, diluting a personalization strategy experiment setup is not helpful if:
(i) users who do not qualify for any strategy have a large metric variance (Inequality (20)), or
(ii) the experiment is already adequately powered (Inequality (25)).
It could help if the experiment has not yet gained sufficient power and users who do not qualify for any strategy provide low-variance responses, such that they exhibit a stabilizing effect when included in the analysis (complement of Inequality (24)).
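The summary above can be wrapped into a small decision helper; a sketch (the function and its three-way verdict are ours, and the 'maybe' branch defers to the full Inequality (24) rather than deciding):

```python
def dilution_verdict(theta_s3, delta_s3, var_q, var_c0, n0, n_qualified):
    """Combines the Section 3.2 rules of thumb. Returns 'no' when
    dilution is ruled out by Inequality (21) (noisy group 0 users)
    or Inequality (25) (already adequately powered), and 'maybe'
    when one must fall back to the full Inequality (24)."""
    if var_q * (n0 / n_qualified + 2) < var_c0:  # Ineq. (21)
        return 'no'
    if theta_s3 <= delta_s3:                     # Ineq. (25)
        return 'no'
    return 'maybe'
```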
3.3 When is a dual control more effective?
Often when advertisers compare two personalization strategies, the question of whether to use a dual control / multi-cell design comes up. Proponents of such an approach celebrate its ability to tell a story by making the incrementality of an individual strategy available, while opponents voice concerns about the complexity of setting up the design. Here we are interested in whether Setup 4 (dual control) is superior to Setup 3 (a simple A/B test) from a power/detectable-effect perspective, and if so, under what circumstances.
We first observe that θ*_S4 > θ*_S3 always holds, and hence a dual control setup will never be superior to a simpler setup under the first criterion. This can be verified by substituting in Equations (19) and (15) and rearranging the terms to show the inequality is equivalent to

    2 ( n_1(σ²_C1 + σ²_I1)/(n_1+n_3)² + n_2(σ²_C2 + σ²_I2)/(n_2+n_3)²
        + n_3 σ²_Ia/(n_1+n_3)² + n_3 σ²_Ib/(n_2+n_3)²
        + [ n_3/(n_1+n_3)² + n_3/(n_2+n_3)² ] σ²_C3 )
    > n_1(σ²_C1 + σ²_I1)/(n_1+n_2+n_3)² + n_2(σ²_C2 + σ²_I2)/(n_1+n_2+n_3)²
      + n_3 σ²_Ia/(n_1+n_2+n_3)² + n_3 σ²_Ib/(n_1+n_2+n_3)²,   (26)

which is always true given that the n's are non-negative and the σ²'s are positive: not only are the coefficients of the σ²-terms larger on the LHS than their RHS counterparts, the LHS also carries an extra σ²_C3 term with non-negative coefficient, and a factor of two.
Moving on to the second evaluation criterion, we recall that Setup 4 is superior if Δ_S4 − Δ_S3 > θ*_S4 − θ*_S3; otherwise Setup 3 is superior under the same criterion. The full flexibility of the model can be seen by substituting Equations (14), (15), (18), and (19) into the inequality and rearranging to obtain

    ( n_1 [ n_2(μ_I2 − μ_C2) + n_3(μ_Ib − μ_C3) ] / (n_2 + n_3)
      − n_2 [ n_1(μ_I1 − μ_C1) + n_3(μ_Ia − μ_C3) ] / (n_1 + n_3) )
    / √( n_1(σ²_C1 + σ²_I1) + n_2(σ²_C2 + σ²_I2) + n_3(σ²_Ia + σ²_Ib) )
    >
    √2 z [ √( 2 { (1 + n_2/(n_1+n_3))² [ n_1(σ²_C1 + σ²_I1) + n_3(σ²_C3 + σ²_Ia) ]
                + (1 + n_1/(n_2+n_3))² [ n_2(σ²_C2 + σ²_I2) + n_3(σ²_C3 + σ²_Ib) ] }
              / [ n_1(σ²_C1 + σ²_I1) + n_2(σ²_C2 + σ²_I2) + n_3(σ²_Ia + σ²_Ib) ] ) − 1 ],   (27)

where z = z_{1−α/2} − z_{1−π_min}.
A key observation from inspecting Inequality (27) is that the LHS of the inequality scales as O(√n), where n is the number of users, while the RHS remains constant. This leads to the insight
that Setup 4 is more likely to be superior if the n's are large. Here we assume the ratio n_1 : n_2 : n_3 remains unchanged when we scale the number of samples, an assumption that generally holds when an organization increases its reach while maintaining its user mix. It is worth pointing out that our claim is stronger than that in previous work: we have shown that having a large user base not only fulfills the requirement for running a dual control experiment as described in [9], it also makes a dual control experiment a better setup than its simpler counterparts in terms of apparent and detectable effect sizes.
The scaling relationship can be seen more clearly if we apply some simplifying assumptions to the σ²- and n-terms. If we assume the response variances are similar across user groups (i.e. σ²_C1 ≈ σ²_I1 ≈ ⋯ ≈ σ²_Ib ≈ σ²), the RHS of Inequality (27) becomes

    √2 z [ √( 2(n_1 + n_2 + n_3)/(n_1 + n_3) + 2(n_1 + n_2 + n_3)/(n_2 + n_3) ) − 1 ],   (28)
which remains constant if the ratio n_1 : n_2 : n_3 remains unchanged. If we assume the numbers of users in groups 1, 2, and 3 are similar (i.e. n_1 = n_2 = n_3 = n), the LHS of Inequality (27) becomes

    √n [ (μ_I2 − μ_C2) − (μ_I1 − μ_C1) + μ_Ib − μ_Ia ] / ( 2 √( σ²_C1 + σ²_I1 + σ²_C2 + σ²_I2 + σ²_Ia + σ²_Ib ) ),   (29)

which clearly scales as O(√n).
We conclude the section by providing an indication of what a large n may look like. If we assume both the response variances and the numbers of users are similar across user groups, we can rearrange Inequality (27) to make n the subject:

    n > ( 2√12 (√6 − 1) z )² σ² / Δ²,   (30)

where Δ = (μ_I2 − μ_C2) − (μ_I1 − μ_C1) + μ_Ib − μ_Ia is the difference in actual effect sizes between Setups 4 and 3. Under a 5% significance level and 80% power, the first coefficient amounts to around 791, which is roughly 50 times the coefficient one would use to determine the sample size of a simple A/B test [11]. This suggests a dual control setup is perhaps a luxury accessible only to the largest advertising platforms and their top advertisers. For example, consider an experiment to optimize conversion rate where the baselines attain 20% (hence having a variance of 0.2(1 − 0.2) = 0.16). If there is a 2.5% relative (i.e. 0.5% absolute) effect between the competing strategies, the dual control setup will only be superior if there are more than 5M users in each user group.
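Inequality (30), and the worked example above, can be checked numerically; a sketch (the function name is ours, and the effect_diff argument plays the role of the Δ in Inequality (30)):

```python
from statistics import NormalDist

def setup4_min_group_size(var, effect_diff, alpha=0.05, power=0.8):
    """Inequality (30): minimum number of users per user group for the
    dual control (Setup 4) to beat Setup 3 under the second criterion,
    assuming similar variances and similar group sizes."""
    z = NormalDist().inv_cdf(1 - alpha / 2) - NormalDist().inv_cdf(1 - power)
    coef = (2 * 12 ** 0.5 * (6 ** 0.5 - 1) * z) ** 2  # ~791 at 5% / 80%
    return coef * var / effect_diff ** 2
```

Plugging in the conversion-rate example (variance 0.16, a 0.5% absolute effect difference) reproduces the "more than 5M users per group" figure.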
4 EXPERIMENTS
Having performed theoretical calculations for the actual and detectable effects, and for the conditions under which one experiment setup is superior to another, we now verify those calculations using simulation results. We focus on the results presented in Section 3.1, as the rest of the results follow from those calculations.
In each experiment setup evaluation, we randomly select the values of the parameters (i.e. the μ's, σ²'s, and n's), and take 1,000 actual effect samples, each by (i) sampling the responses from the user groups under the specified parameters, (ii) computing the mean for the analysis groups, and (iii) taking the difference of the means.
              Actual effect size    Minimum detectable effect
    Setup 1   1049/1099 (95.45%)    66/81 (81.48%)
    Setup 2   853/999 (85.38%)      87/106 (82.08%)
    Setup 3   922/1099 (83.89%)     93/116 (80.18%)
    Setup 4   240/333 (72.07%)      149/185 (80.54%)

Table 2: Number of evaluations where the theoretical value of the quantities (columns) falls within the 95% bootstrap confidence interval for each experiment setup (rows). See Section 4 for a detailed description of the evaluations.
We also take 100 MDE samples in separate evaluations, each by (i) sampling a critical value under the null hypothesis; (ii) computing the test power under a large number of possible effect sizes, each using the critical value and sampled metric means under the alternate hypothesis; and (iii) searching the effect size space for the value that gives the predefined power. As the power vs. effect size curve is noisy given the use of simulated power samples, we use the bisection algorithm provided by the noisyopt package to perform the search. The algorithm dynamically adjusts the number of samples taken from the same point on the curve to ensure the noise does not send us down the wrong search space.
We expect the mean of the sampled actual effects and MDEs to match the theoretical values. To verify this, we perform 1,000 bootstrap resamplings on the samples obtained above to obtain an empirical bootstrap distribution of the sample mean in each evaluation. The 95% bootstrap resampling confidence interval (BRCI) should then contain the theoretical mean 95% of the time. The histogram of the percentile rank of the theoretical quantity in relation to the bootstrap samples across multiple evaluations should also follow a uniform distribution [13].
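The BRCI construction described above can be sketched in a few lines of stdlib Python (a sketch of the procedure, not the paper's released implementation, which may differ in details):

```python
import random

def bootstrap_ci(samples, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval (BRCI) of the sample
    mean: resample with replacement n_boot times, then read off the
    (1-level)/2 and (1+level)/2 quantiles of the resampled means."""
    rng = random.Random(seed)
    k = len(samples)
    means = sorted(
        sum(rng.choice(samples) for _ in range(k)) / k
        for _ in range(n_boot)
    )
    lo = means[int((1 - level) / 2 * n_boot)]
    hi = means[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi
```

The theoretical actual effect (or MDE) should then fall inside the interval returned for roughly 95% of the evaluations.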
The result is shown in Table 2. One can observe that more evaluations have their theoretical quantity lying outside the BRCI than expected. Upon further investigation, we observed a characteristic ∩-shape in the histograms of the percentile ranks for the actual effects. This suggests the bootstrap samples may be under-dispersed but otherwise centered on the theoretical quantities.
We also observed the histograms for MDEs curving upward to the right, suggesting that the theoretical value is a slight overestimate (of < 1% relative to the bootstrap mean in all cases). We believe this is likely due to a small bias in the bisection algorithm. The algorithm tests whether the mean of the power samples is less than the target power to decide which half of the search space to continue along. Given we can bisect up to 10 times in each evaluation, a false positive is likely even when we set the significance level for individual comparisons to 1%. This leads to the algorithm favoring a smaller MDE sample. That said, since we have tested a wide range of parameters and the overall bias is small, we are satisfied with the theoretical quantities for experiment design purposes.
5 CONCLUSION
We have addressed the problem of comparing experiment designs for personalization strategies by presenting an evaluation framework that allows experimenters to evaluate which experiment setup
An Evaluation Framework for Personalization Strategy Experiment Designs AdKDD â20, Aug 2020, Online
should be adopted given their situation. The flexible framework can easily be extended to compare setups that involve more than two strategies by adding more user groups (i.e. new sets to the Venn diagram in Figure 1). A new setup can also be incorporated quickly, as it is essentially a different weighting of the user group-scenario combinations shown in Table 1. The framework also allows the development of simple rules of thumb such as:
(i) Metric dilution should never be employed if the experiment already has sufficient power, though it can be useful if the experiment is under-powered and the non-qualifying users provide a "stabilizing effect"; and
(ii) A dual control setup is superior to simpler setups only if one has access to the user base of the largest organizations.
We have validated the theoretical results via simulations, and made the code available² so that practitioners can benefit from the results immediately when designing their upcoming experiments.
Future Work. So far we assume the responses from each user group-scenario combination are randomly distributed with the mean independent of the variance, with the evaluation criteria calculated assuming the metric (being the weighted sample mean of the responses) is approximately normally distributed under the Central Limit Theorem. We are interested in whether the evaluation framework and its results still hold if we deviate from these assumptions, e.g. with binary responses (where the mean and variance are linked) and heavy-tailed responses (where the sample mean converges to a normal distribution slowly).
ACKNOWLEDGMENTS
The work is partially funded by the EPSRC CDT in Modern Statistics and Statistical Machine Learning at Imperial and Oxford (StatML.IO) and ASOS.com. The authors thank the anonymous reviewers for providing many improvements to the original manuscript.
REFERENCES
[1] Brooke Bengier and Amanda Knupp. [n.d.]. Selling More Stuff: The What, Why & How of Incrementality Testing. https://www.quantcast.com/blog/selling-more-stuff-the-what-why-how-of-incrementality-testing/. Blog post.
[2] Will Browne and Mike Swarbrick Jones. 2017. What Works in e-commerce - a Meta-analysis of 6700 Online Experiments. http://www.qubit.com/wp-content/uploads/2017/12/qubit-research-meta-analysis.pdf. White paper.
[3] Alex Deng and Victor Hu. 2015. Diluted Treatment Effect Estimation for Trigger Analysis in Online Controlled Experiments. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (Shanghai, China) (WSDM '15). Association for Computing Machinery, New York, NY, USA, 349–358. https://doi.org/10.1145/2684822.2685307
[4] Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-experiment Data. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (Rome, Italy) (WSDM '13). ACM, New York, NY, USA, 123–132. https://doi.org/10.1145/2433396.2433413
[5] Pavel Dmitriev, Somit Gupta, Dong Woo Kim, and Garnet Vaz. 2017. A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Halifax, NS, Canada) (KDD '17). ACM, New York, NY, USA, 1427–1436.
[6] Henning Hohnhold, Deirdre O'Brien, and Diane Tang. 2015. Focusing on the Long-term: It's Good for Users and Business. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Sydney, NSW, Australia) (KDD '15). ACM, New York, NY, USA, 1849–1858.
[7] Ramesh Johari, Pete Koomen, Leonid Pekelis, and David Walsh. 2017. Peeking at A/B Tests: Why It Matters, and What to Do About It. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Halifax, NS, Canada) (KDD '17). ACM, New York, NY, USA, 1517–1525.
[8] Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. Online Controlled Experiments at Large Scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chicago, Illinois, USA) (KDD '13). ACM, New York, NY, USA, 1168–1176.
[9] C. H. Bryan Liu, Elaine M. Bettaney, and Benjamin Paul Chamberlain. 2018. Designing Experiments to Measure Incrementality on Facebook. arXiv:1806.02588 [stat.ME]. 2018 AdKDD & TargetAd Workshop.
[10] C. H. Bryan Liu, Benjamin Paul Chamberlain, and Emma J. McCoy. 2020. What is the Value of Experimentation and Measurement?: Quantifying the Value and Risk of Reducing Uncertainty to Make Better Decisions. Data Science and Engineering 5, 2 (2020), 152–167. https://doi.org/10.1007/s41019-020-00121-5
[11] Evan Miller. 2010. How Not To Run an A/B Test. https://www.evanmiller.org/how-not-to-run-an-ab-test.html. Blog post.
[12] Alexey Poyarkov, Alexey Drutsa, Andrey Khalyavin, Gleb Gusev, and Pavel Serdyukov. 2016. Boosted Decision Tree Regression Adjustment for Variance Reduction in Online Controlled Experiments. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD '16). ACM, New York, NY, USA, 235–244. https://doi.org/10.1145/2939672.2939688
[13] Sean Talts, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018. Validating Bayesian Inference Algorithms with Simulation-Based Calibration. arXiv:1804.06788 [stat.ME]
[14] Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015. From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Sydney, NSW, Australia) (KDD '15). ACM, New York, NY, USA, 2227–2236.
SUPPLEMENTARY DOCUMENT
In this supplementary document, we revisit Section 3 of the paper "An Evaluation Framework for Personalization Strategy Experiment Designs" by Liu and McCoy (2020), and expand on the intermediate algebraic work when deriving:
(i) The actual & minimum detectable effect sizes, and
(ii) The conditions that lead to a setup being superior.
We will use the equation numbers featured in the original paper, and append letters for intermediate steps.
A EFFECT SIZE OF EXPERIMENT SETUPS
We begin with the actual effect size and the MDE of the four experiment setups that are featured in Section 3.1 of the paper. For each setup we first compute the sample size, mean response, and response variance in each analysis group (denoted $n_i$, $\mu_i$, and $\sigma^2_i$ respectively for each analysis group $i$). These quantities arise as a mixture of user groups described in Section 2.1 of the paper. We then substitute the quantities computed into the definitions of $\Delta$ and $\theta^*$:

$$\Delta \triangleq \mu_B - \mu_A, \qquad \text{(see (2))}$$

$$\theta^* \triangleq (z_{1-\alpha/2} - z_{1-\beta_{\min}})\,\sigma_{\hat{\Delta}}, \;\text{where}\; \sigma_{\hat{\Delta}} = \sqrt{\sigma^2_A/n_A + \sigma^2_B/n_B}, \qquad \text{(see (5))}$$

and $z_q$ represents the $q$-th quantile of a standard normal distribution, to obtain the setup-specific actual effect size and MDE for a setup with two analysis groups. For setups with more than two analysis groups, we will specify the actual and minimum detectable effect when we discuss specifics for each of the setups. We assume all random splits are done 50/50 in these setups to maximize the test power.
Setup 1 (Users in the intersection only). We recall the setup, which considers only users who qualify for both personalization strategies (i.e. the intersection), randomly splits user group 3 into two analysis groups, $A$ and $B$, each with the following number of samples:

$$n_A = n_B = \frac{n_3}{2}.$$

Users in analysis group $A$ are provided treatment under strategy 1, and users in analysis group $B$ are provided treatment under strategy 2. This leads to the groups' responses having the following metric mean and variance (where $\mu_{I\cap 1}$, $\sigma^2_{I\cap 1}$ and $\mu_{I\cap 2}$, $\sigma^2_{I\cap 2}$ denote the mean and variance of intersection users' responses when treated under strategies 1 and 2 respectively):

$$\mu_A = \mu_{I\cap 1}, \quad \mu_B = \mu_{I\cap 2}, \quad \sigma^2_A = \sigma^2_{I\cap 1}, \quad \sigma^2_B = \sigma^2_{I\cap 2}.$$

The actual effect size and MDE for Setup 1 are hence:

$$\Delta_{S1} = \mu_{I\cap 2} - \mu_{I\cap 1}, \qquad (7)$$

$$\theta^*_{S1} = (z_{1-\alpha/2} - z_{1-\beta_{\min}})\sqrt{\frac{\sigma^2_{I\cap 1}}{n_3/2} + \frac{\sigma^2_{I\cap 2}}{n_3/2}}. \qquad (8)$$
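Equations (7) and (8) can be evaluated directly. The sketch below is our own helper with illustrative parameter values; it treats the paper's $\beta_{\min}$ as the required power, so the multiplier $z_{1-\alpha/2} - z_{1-\beta_{\min}}$ is positive.

```python
from math import sqrt
from statistics import NormalDist

def z(q):
    """q-th quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(q)

def setup1_effect_and_mde(mu_int1, mu_int2, var_int1, var_int2, n3,
                          alpha=0.05, power=0.8):
    """Actual effect size (7) and MDE (8) for Setup 1 (intersection only).
    `power` plays the role of the paper's beta_min, so the multiplier
    z_{1-alpha/2} - z_{1-power} is positive (~2.80 at the defaults)."""
    delta = mu_int2 - mu_int1                                        # Eq. (7)
    mult = z(1 - alpha / 2) - z(1 - power)
    theta = mult * sqrt(var_int1 / (n3 / 2) + var_int2 / (n3 / 2))   # Eq. (8)
    return delta, theta

delta_s1, theta_s1 = setup1_effect_and_mde(1.0, 1.2, 4.0, 4.0, n3=10_000)
```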
Setup 2 (All samples). The setup, which considers all users regardless of whether they qualify for any strategy or not, also contains two analysis groups, $A$ and $B$, each taking half of the population:

$$n_A = n_B = \frac{n_0+n_1+n_2+n_3}{2}.$$

The mean response and response variance for groups $A$ and $B$ are the weighted mean response and response variance of the constituent user groups respectively, weighted by the constituent groups' size. As we only provide treatment to those who qualify for strategy 1 in group $A$, and those who qualify for strategy 2 in group $B$, this leads to different responses in different constituent user groups:

$$\mu_A = \frac{n_0\mu_{C0} + n_1\mu_{I1} + n_2\mu_{C2} + n_3\mu_{I\cap 1}}{n_0+n_1+n_2+n_3}, \qquad (9)$$
$$\mu_B = \frac{n_0\mu_{C0} + n_1\mu_{C1} + n_2\mu_{I2} + n_3\mu_{I\cap 2}}{n_0+n_1+n_2+n_3};$$
$$\sigma^2_A = \frac{n_0\sigma^2_{C0} + n_1\sigma^2_{I1} + n_2\sigma^2_{C2} + n_3\sigma^2_{I\cap 1}}{n_0+n_1+n_2+n_3}, \qquad (10)$$
$$\sigma^2_B = \frac{n_0\sigma^2_{C0} + n_1\sigma^2_{C1} + n_2\sigma^2_{I2} + n_3\sigma^2_{I\cap 2}}{n_0+n_1+n_2+n_3}.$$

Substituting the above into the definitions of actual effect size and MDE and simplifying the resultant expressions, we have:

$$\Delta_{S2} = \frac{n_1(\mu_{C1}-\mu_{I1}) + n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{I\cap 2}-\mu_{I\cap 1})}{n_0+n_1+n_2+n_3}, \qquad (11)$$

$$\theta^*_{S2} = (z_{1-\alpha/2}-z_{1-\beta_{\min}})\sqrt{\frac{2\big(n_0(2\sigma^2_{C0}) + n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})\big)}{(n_0+n_1+n_2+n_3)^2}}. \qquad (12)$$
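Since both analysis groups are size-weighted mixtures of user groups, Equations (9)-(12) can be computed by one generic helper. This is our own sketch with illustrative numbers, not the paper's code.

```python
from math import sqrt
from statistics import NormalDist

def z(q):
    """q-th quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(q)

def two_group_effect_and_mde(groups_a, groups_b, alpha=0.05, power=0.8):
    """Actual effect and MDE for a setup whose two analysis groups are 50/50
    random splits. groups_a / groups_b list (n, mu, var) triples for the
    constituent user groups (full group sizes): means and variances are
    size-weighted as in Eqs. (9)-(10), and the MDE follows Eq. (5) with
    n_A = n_B = N/2. `power` plays the role of the paper's beta_min."""
    n = sum(g[0] for g in groups_a)  # total population N (same for both)
    mu_a = sum(g[0] * g[1] for g in groups_a) / n
    mu_b = sum(g[0] * g[1] for g in groups_b) / n
    var_a = sum(g[0] * g[2] for g in groups_a) / n
    var_b = sum(g[0] * g[2] for g in groups_b) / n
    mult = z(1 - alpha / 2) - z(1 - power)
    return mu_b - mu_a, mult * sqrt(var_a / (n / 2) + var_b / (n / 2))

# Setup 2: group 0 is untreated in both halves; half A treats strategy-1
# qualifiers, half B treats strategy-2 qualifiers. All numbers illustrative.
n0, n1, n2, n3 = 4000, 3000, 2000, 1000
v = 4.0
groups_a = [(n0, 1.00, v), (n1, 1.20, v), (n2, 1.05, v), (n3, 1.25, v)]
groups_b = [(n0, 1.00, v), (n1, 1.10, v), (n2, 1.30, v), (n3, 1.35, v)]
delta_s2, theta_s2 = two_group_effect_and_mde(groups_a, groups_b)
```

The same helper covers Setup 3: simply omit the group-0 triples from both lists.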
Setup 3 (Qualified users only). The setup is very similar to Setup 2, with members of user group 0 excluded:

$$n_A = n_B = \frac{n_1+n_2+n_3}{2}.$$

The absence of members of user group 0 means they are not featured in the weighted mean response and response variance of the two analysis groups:

$$\mu_A = \frac{n_1\mu_{I1} + n_2\mu_{C2} + n_3\mu_{I\cap 1}}{n_1+n_2+n_3}, \quad \mu_B = \frac{n_1\mu_{C1} + n_2\mu_{I2} + n_3\mu_{I\cap 2}}{n_1+n_2+n_3};$$
$$\sigma^2_A = \frac{n_1\sigma^2_{I1} + n_2\sigma^2_{C2} + n_3\sigma^2_{I\cap 1}}{n_1+n_2+n_3}, \quad \sigma^2_B = \frac{n_1\sigma^2_{C1} + n_2\sigma^2_{I2} + n_3\sigma^2_{I\cap 2}}{n_1+n_2+n_3}. \qquad (13)$$

This leads to the following actual effect size and MDE for Setup 3:

$$\Delta_{S3} = \frac{n_1(\mu_{C1}-\mu_{I1}) + n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{I\cap 2}-\mu_{I\cap 1})}{n_1+n_2+n_3}, \qquad (14)$$

$$\theta^*_{S3} = (z_{1-\alpha/2}-z_{1-\beta_{\min}})\sqrt{\frac{2\big(n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})\big)}{(n_1+n_2+n_3)^2}}. \qquad (15)$$
Setup 4 (Dual control). Setup 4 is unique amongst the experiment setups introduced as it has four analysis groups. Two of the analysis groups are drawn from those who qualify for strategy 1 and are allocated to the first half of the population, and the other two are drawn from those who qualify for strategy 2 and are allocated to the second half:

$$n_{A1} = n_{A2} = \frac{n_1+n_3}{4}, \qquad n_{B1} = n_{B2} = \frac{n_2+n_3}{4}. \qquad (16)$$

The mean response and response variance of each analysis group are the weighted metric mean response and response variance of
the user groups involved respectively:

$$\mu_{A1} = \frac{n_1\mu_{C1} + n_3\mu_{C3}}{n_1+n_3}, \quad \mu_{A2} = \frac{n_1\mu_{I1} + n_3\mu_{I\cap 1}}{n_1+n_3},$$
$$\mu_{B1} = \frac{n_2\mu_{C2} + n_3\mu_{C3}}{n_2+n_3}, \quad \mu_{B2} = \frac{n_2\mu_{I2} + n_3\mu_{I\cap 2}}{n_2+n_3};$$
$$\sigma^2_{A1} = \frac{n_1\sigma^2_{C1} + n_3\sigma^2_{C3}}{n_1+n_3}, \quad \sigma^2_{A2} = \frac{n_1\sigma^2_{I1} + n_3\sigma^2_{I\cap 1}}{n_1+n_3},$$
$$\sigma^2_{B1} = \frac{n_2\sigma^2_{C2} + n_3\sigma^2_{C3}}{n_2+n_3}, \quad \sigma^2_{B2} = \frac{n_2\sigma^2_{I2} + n_3\sigma^2_{I\cap 2}}{n_2+n_3}. \qquad (17)$$

As the setup takes the difference of differences (i.e. the difference of groups $B2$ and $B1$, and the difference of groups $A2$ and $A1$), the actual effect size is specified, post-simplification, as follows:

$$\Delta_{S4} = (\mu_{B2} - \mu_{B1}) - (\mu_{A2} - \mu_{A1}) = \frac{n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{I\cap 2}-\mu_{C3})}{n_2+n_3} - \frac{n_1(\mu_{I1}-\mu_{C1}) + n_3(\mu_{I\cap 1}-\mu_{C3})}{n_1+n_3}. \qquad (18)$$

The MDE for Setup 4 is similar to that specified in the RHS of Inequality (5), albeit with more groups:

$$\theta^*_{S4} = (z_{1-\alpha/2}-z_{1-\beta_{\min}})\sqrt{\sigma^2_{A1}/n_{A1} + \sigma^2_{A2}/n_{A2} + \sigma^2_{B1}/n_{B1} + \sigma^2_{B2}/n_{B2}}$$
$$= 2\,(z_{1-\alpha/2}-z_{1-\beta_{\min}})\sqrt{\frac{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_3(\sigma^2_{C3}+\sigma^2_{I\cap 1})}{(n_1+n_3)^2} + \frac{n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{C3}+\sigma^2_{I\cap 2})}{(n_2+n_3)^2}}. \qquad (19)$$
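A sketch of Equations (18) and (19) as a helper of our own; the parameter names mirror the paper's per-group means and variances, and all numbers are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z(q):
    """q-th quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(q)

def setup4_effect_and_mde(n1, n2, n3,
                          mu_c1, mu_i1, mu_c2, mu_i2, mu_c3, mu_int1, mu_int2,
                          v_c1, v_i1, v_c2, v_i2, v_c3, v_int1, v_int2,
                          alpha=0.05, power=0.8):
    """Difference-in-differences effect (18) and MDE (19) for the dual
    control setup; `power` plays the role of the paper's beta_min."""
    delta = ((n2 * (mu_i2 - mu_c2) + n3 * (mu_int2 - mu_c3)) / (n2 + n3)
             - (n1 * (mu_i1 - mu_c1) + n3 * (mu_int1 - mu_c3)) / (n1 + n3))
    mult = z(1 - alpha / 2) - z(1 - power)
    theta = 2 * mult * sqrt(
        (n1 * (v_c1 + v_i1) + n3 * (v_c3 + v_int1)) / (n1 + n3) ** 2
        + (n2 * (v_c2 + v_i2) + n3 * (v_c3 + v_int2)) / (n2 + n3) ** 2)
    return delta, theta

delta_s4, theta_s4 = setup4_effect_and_mde(
    3000, 2000, 1000,
    1.10, 1.20, 1.05, 1.30, 1.15, 1.25, 1.35,
    4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0)
```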
B METRIC DILUTION
We then expand the calculations in Section 3.2 of the paper, where we discuss the conditions under which an experiment setup with metric dilution (Setup 2) will emerge superior to one without metric dilution (Setup 3), and vice versa.
B.1 The first criterion
We first show that $\theta^*_{S3} < \theta^*_{S2}$, the condition which will lead to Setup 3 being superior to Setup 2 under the first criterion, is equivalent to

$$\big(n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})\big)\cdot\frac{n_0+2n_1+2n_2+2n_3}{2(n_1+n_2+n_3)^2} < \sigma^2_{C0}. \qquad (20)$$

We start by substituting the expressions for $\theta^*_{S2}$ (Equation (12)) and $\theta^*_{S3}$ (Equation (15)) into the inequality $\theta^*_{S3} < \theta^*_{S2}$ to obtain

$$(z_{1-\alpha/2}-z_{1-\beta_{\min}})\sqrt{\frac{2\big(n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})\big)}{(n_1+n_2+n_3)^2}}$$
$$< (z_{1-\alpha/2}-z_{1-\beta_{\min}})\sqrt{\frac{2\big(n_0(2\sigma^2_{C0}) + n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})\big)}{(n_0+n_1+n_2+n_3)^2}}. \qquad (20a)$$

Canceling the $z_{1-\alpha/2}-z_{1-\beta_{\min}}$ and $\sqrt{2}$ terms on both sides, and then squaring, yields

$$\frac{n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}{(n_1+n_2+n_3)^2} < \frac{n_0(2\sigma^2_{C0}) + n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}{(n_0+n_1+n_2+n_3)^2}. \qquad (20b)$$

We then write $\xi = n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})$ and move the $\xi$ term on the RHS to the LHS:

$$\xi\left(\frac{1}{(n_1+n_2+n_3)^2} - \frac{1}{(n_0+n_1+n_2+n_3)^2}\right) < \frac{n_0(2\sigma^2_{C0})}{(n_0+n_1+n_2+n_3)^2}. \qquad (20c)$$

As the partial fractions can be consolidated as

$$\frac{1}{(n_1+n_2+n_3)^2} - \frac{1}{(n_0+n_1+n_2+n_3)^2} = \frac{(n_0+n_1+n_2+n_3)^2-(n_1+n_2+n_3)^2}{(n_1+n_2+n_3)^2(n_0+n_1+n_2+n_3)^2} = \frac{(n_0+2n_1+2n_2+2n_3)\,n_0}{(n_1+n_2+n_3)^2(n_0+n_1+n_2+n_3)^2}, \qquad (20d)$$

where the second step utilizes the identity $a^2-b^2=(a+b)(a-b)$, Inequality (20c) can be written as

$$\xi\left(\frac{(n_0+2n_1+2n_2+2n_3)\,n_0}{(n_1+n_2+n_3)^2(n_0+n_1+n_2+n_3)^2}\right) < \frac{n_0(2\sigma^2_{C0})}{(n_0+n_1+n_2+n_3)^2}. \qquad (20e)$$

We finally cancel the $n_0$ and $(n_0+n_1+n_2+n_3)^2$ terms on both sides of Inequality (20e), move the factor of two to the LHS, and write $\xi$ in its full form to arrive at Inequality (20):

$$\big(n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})\big)\cdot\frac{n_0+2n_1+2n_2+2n_3}{2(n_1+n_2+n_3)^2} < \sigma^2_{C0}.$$
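The equivalence between $\theta^*_{S3} < \theta^*_{S2}$ and Inequality (20) can be spot-checked numerically. The sketch below treats $\xi$ as a free positive scalar (which suffices for the algebraic check) and drops the common $z_{1-\alpha/2}-z_{1-\beta_{\min}}$ multiplier that cancels in (20a).

```python
from math import sqrt
import random

random.seed(3)

def theta_s3_over_z(n1, n2, n3, xi):
    """theta*_{S3} of Eq. (15) without the common quantile-difference
    multiplier; xi is the bracketed variance sum shared by Setups 2 and 3."""
    return sqrt(2 * xi / (n1 + n2 + n3) ** 2)

def theta_s2_over_z(n0, n1, n2, n3, v_c0, xi):
    """theta*_{S2} of Eq. (12) without the common multiplier."""
    return sqrt(2 * (n0 * 2 * v_c0 + xi) / (n0 + n1 + n2 + n3) ** 2)

# Inequality (20) should hold exactly when theta*_{S3} < theta*_{S2}.
mismatches, checked = 0, 0
for _ in range(1000):
    n0, n1, n2, n3 = (random.randint(1, 10_000) for _ in range(4))
    v_c0 = random.uniform(0.1, 10.0)
    xi = random.uniform(0.1, 1e5)
    lhs20 = xi * (n0 + 2 * n1 + 2 * n2 + 2 * n3) / (2 * (n1 + n2 + n3) ** 2)
    if abs(lhs20 - v_c0) < 1e-9 * v_c0:
        continue  # skip draws too close to the boundary for float comparison
    checked += 1
    direct = theta_s3_over_z(n1, n2, n3, xi) < theta_s2_over_z(n0, n1, n2, n3, v_c0, xi)
    mismatches += (lhs20 < v_c0) != direct
```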
B.2 The second criterion
In the case where Inequality (20) does not hold, we consider when Setup 3 will emerge superior to Setup 2 under the second criterion:

$$\Delta_{S3} - \Delta_{S2} > \theta^*_{S3} - \theta^*_{S2}.$$

If this inequality does not hold either (and both sides are not equal), we consider Setup 2 as superior to Setup 3 under the same criterion, as the following holds:

$$\Delta_{S3} - \Delta_{S2} < \theta^*_{S3} - \theta^*_{S2} \iff \Delta_{S2} - \Delta_{S3} > \theta^*_{S2} - \theta^*_{S3}.$$

The master inequality. We first show that the inequality $\Delta_{S3} - \Delta_{S2} > \theta^*_{S3} - \theta^*_{S2}$ is equivalent to

$$\frac{n_1+n_2+n_3}{n_0}\sqrt{2n_0\sigma^2_{C0} + \xi} > \frac{n_0+n_1+n_2+n_3}{n_0}\sqrt{\xi} - \frac{\eta}{\sqrt{2}\,z}, \qquad (22)$$

where
$$\eta = n_1(\mu_{C1}-\mu_{I1}) + n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{I\cap 2}-\mu_{I\cap 1}),$$
$$\xi = n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{I2}+\sigma^2_{C2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2}), \text{ and}$$
$$z = z_{1-\alpha/2} - z_{1-\beta_{\min}}.$$
Writing $\eta$, $\xi$, and $z$ as shown above, we substitute in the expressions for $\Delta_{S2}$, $\theta^*_{S2}$, $\Delta_{S3}$, and $\theta^*_{S3}$ (Equations (11), (12), (14), and (15) respectively) into the initial inequality to obtain

$$\frac{\eta}{n_1+n_2+n_3} - \frac{\eta}{n_0+n_1+n_2+n_3} > z\sqrt{\frac{2\xi}{(n_1+n_2+n_3)^2}} - z\sqrt{\frac{2\big(n_0(2\sigma^2_{C0})+\xi\big)}{(n_0+n_1+n_2+n_3)^2}}. \qquad (22a)$$

Pulling out the common factors on each side we have

$$\eta\left(\frac{1}{n_1+n_2+n_3} - \frac{1}{n_0+n_1+n_2+n_3}\right) > \sqrt{2}\,z\left(\frac{\sqrt{\xi}}{n_1+n_2+n_3} - \frac{\sqrt{2n_0\sigma^2_{C0}+\xi}}{n_0+n_1+n_2+n_3}\right). \qquad (22b)$$

Writing the partial fraction on the LHS of Inequality (22b) as a composite fraction we have

$$\eta\left(\frac{n_0}{(n_1+n_2+n_3)(n_0+n_1+n_2+n_3)}\right) > \sqrt{2}\,z\left(\frac{\sqrt{\xi}}{n_1+n_2+n_3} - \frac{\sqrt{2n_0\sigma^2_{C0}+\xi}}{n_0+n_1+n_2+n_3}\right). \qquad (22c)$$

We then move the composite fraction to the RHS and the $\sqrt{2}\,z$ term to the LHS:

$$\frac{\eta}{\sqrt{2}\,z} > \frac{(n_1+n_2+n_3)(n_0+n_1+n_2+n_3)}{n_0}\left(\frac{\sqrt{\xi}}{n_1+n_2+n_3} - \frac{\sqrt{2n_0\sigma^2_{C0}+\xi}}{n_0+n_1+n_2+n_3}\right), \qquad (22d)$$

and expand the brackets, canceling terms that appear on both sides of the fractions in the RHS:

$$\frac{\eta}{\sqrt{2}\,z} > \frac{n_0+n_1+n_2+n_3}{n_0}\sqrt{\xi} - \frac{n_1+n_2+n_3}{n_0}\sqrt{2n_0\sigma^2_{C0}+\xi}. \qquad (22e)$$

Finally, we swap the position of the leftmost term with that of the rightmost term in Inequality (22e) to arrive at Inequality (22):

$$\frac{n_1+n_2+n_3}{n_0}\sqrt{2n_0\sigma^2_{C0} + \xi} > \frac{n_0+n_1+n_2+n_3}{n_0}\sqrt{\xi} - \frac{\eta}{\sqrt{2}\,z}.$$
The trivial case: RHS ≤ 0. We first observe that the LHS of Inequality (22) is always positive, and hence the inequality trivially holds if the RHS is non-positive. Here we show RHS ≤ 0 is equivalent to

$$\frac{n_0+n_1+n_2+n_3}{n_0}\,\theta^*_{S3} \le \Delta_{S3}. \qquad (23)$$

This can be done by writing RHS ≤ 0 in full:

$$\frac{n_0+n_1+n_2+n_3}{n_0}\sqrt{\xi} - \frac{\eta}{\sqrt{2}\,z} \le 0, \qquad (23a)$$

and moving the second term on the LHS to the RHS:

$$\frac{n_0+n_1+n_2+n_3}{n_0}\sqrt{\xi} \le \frac{\eta}{\sqrt{2}\,z}. \qquad (23b)$$

We then multiply both sides by a factor of $\sqrt{2}\,z/(n_1+n_2+n_3)$ (which is positive) to get

$$\frac{n_0+n_1+n_2+n_3}{n_0}\cdot\frac{\sqrt{2}\,z\,\sqrt{\xi}}{n_1+n_2+n_3} \le \frac{\eta}{n_1+n_2+n_3}. \qquad (23c)$$

Noting from Equations (14) and (15) that

$$\Delta_{S3} = \frac{\eta}{n_1+n_2+n_3} \quad\text{and}\quad \theta^*_{S3} = \sqrt{2}\cdot z\cdot\frac{\sqrt{\xi}}{n_1+n_2+n_3},$$

we finally replace the terms in Inequality (23c) with $\Delta_{S3}$ and $\theta^*_{S3}$ to arrive at Inequality (23):

$$\frac{n_0+n_1+n_2+n_3}{n_0}\,\theta^*_{S3} \le \Delta_{S3}.$$
The non-trivial case: RHS > 0. We then tackle the case where the RHS of the master inequality (Inequality (22)) is greater than zero. We show that in this non-trivial case, Inequality (22) is equivalent to

$$\frac{2\sigma^2_{C0}}{n_0} > \frac{\left(\theta^*_{S3} - \Delta_{S3} + \frac{n_1+n_2+n_3}{n_0}\theta^*_{S3}\right)^2 - \left(\frac{n_1+n_2+n_3}{n_0}\theta^*_{S3}\right)^2}{2z^2}. \qquad (24)$$

We first multiply both sides of Inequality (22) with the fraction $n_0\sqrt{2}\,z/(n_1+n_2+n_3)$ to get

$$\frac{n_1+n_2+n_3}{n_0}\sqrt{2n_0\sigma^2_{C0}+\xi}\cdot\frac{n_0\sqrt{2}\,z}{n_1+n_2+n_3} > \left(\frac{n_0+n_1+n_2+n_3}{n_0}\sqrt{\xi} - \frac{\eta}{\sqrt{2}\,z}\right)\frac{n_0\sqrt{2}\,z}{n_1+n_2+n_3}. \qquad (24a)$$

Canceling terms on both sides of the fractions we have

$$\sqrt{2n_0\sigma^2_{C0}+\xi}\cdot\sqrt{2}\,z > (n_0+n_1+n_2+n_3)\,\frac{\sqrt{\xi}\,\sqrt{2}\,z}{n_1+n_2+n_3} - \frac{n_0\,\eta}{n_1+n_2+n_3}. \qquad (24b)$$

Again noting the identities for $\Delta_{S3}$ and $\theta^*_{S3}$ stated above, we can replace the fractions on the RHS and obtain

$$\sqrt{2n_0\sigma^2_{C0}+\xi}\cdot\sqrt{2}\,z > (n_0+n_1+n_2+n_3)\,\theta^*_{S3} - n_0\,\Delta_{S3}. \qquad (24c)$$

We then square both sides of Inequality (24c) and move the $2z^2$ term to the RHS:

$$2n_0\sigma^2_{C0} + \xi > \frac{\big((n_0+n_1+n_2+n_3)\,\theta^*_{S3} - n_0\,\Delta_{S3}\big)^2}{2z^2}. \qquad (24d)$$

Note the squaring still allows the implication to go both ways, as both sides of Inequality (24c) are positive (the RHS being positive is exactly the non-trivial case assumption). Based on the identity for $\theta^*_{S3}$, we observe $\xi$ can also be written as

$$\xi = \frac{(n_1+n_2+n_3)^2\,(\theta^*_{S3})^2}{2z^2}. \qquad (24e)$$

Thus, we can group all terms with a $2z^2$ denominator by moving $\xi$ in Inequality (24d) to the RHS and substituting Equation (24e) into the resultant inequality:

$$2n_0\sigma^2_{C0} > \frac{\big((n_0+n_1+n_2+n_3)\,\theta^*_{S3} - n_0\,\Delta_{S3}\big)^2 - \big((n_1+n_2+n_3)\,\theta^*_{S3}\big)^2}{2z^2}. \qquad (24f)$$

We finally normalize the inequality to one with unit $\Delta_{S3}$ and $\theta^*_{S3}$ to enable effective comparison. We divide both sides of Inequality (24f) by $n_0^2$:

$$\frac{2\sigma^2_{C0}}{n_0} > \frac{\left(\frac{n_0+n_1+n_2+n_3}{n_0}\theta^*_{S3} - \Delta_{S3}\right)^2 - \left(\frac{n_1+n_2+n_3}{n_0}\theta^*_{S3}\right)^2}{2z^2}, \qquad (24g)$$
and split the coefficient of $\theta^*_{S3}$ in the first squared term into an integer ($1$) and a fractional ($(n_1+n_2+n_3)/n_0$) part to arrive at Inequality (24):

$$\frac{2\sigma^2_{C0}}{n_0} > \frac{\left(\theta^*_{S3} - \Delta_{S3} + \frac{n_1+n_2+n_3}{n_0}\theta^*_{S3}\right)^2 - \left(\frac{n_1+n_2+n_3}{n_0}\theta^*_{S3}\right)^2}{2z^2}.$$
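The chain (22) ⟺ (24) can also be spot-checked numerically. The sketch below draws random parameters, skips the trivial case (where the RHS of (22) is non-positive) and near-boundary draws, and uses an illustrative value for the quantile difference $z$.

```python
from math import sqrt
import random

random.seed(4)

z = 2.8  # illustrative value of z_{1-alpha/2} - z_{1-beta_min}

mismatches, checked = 0, 0
for _ in range(1000):
    n0 = random.randint(1, 10_000)
    n123 = random.randint(1, 10_000)           # n1 + n2 + n3
    v_c0 = random.uniform(0.1, 10.0)           # sigma^2_C0
    xi = random.uniform(1.0, 1e5)              # the variance sum xi
    eta = random.uniform(-100.0, 100.0)        # the mean-difference sum eta
    delta_s3 = eta / n123                      # Eq. (14)
    theta_s3 = sqrt(2) * z * sqrt(xi) / n123   # Eq. (15)
    rhs22 = (n0 + n123) / n0 * sqrt(xi) - eta / (sqrt(2) * z)
    if rhs22 <= 0:
        continue  # trivial case: Inequality (22) holds via (23)
    lhs22 = n123 / n0 * sqrt(2 * n0 * v_c0 + xi)
    if abs(lhs22 - rhs22) < 1e-9 * max(lhs22, rhs22):
        continue  # too close to the boundary for float comparison
    checked += 1
    ineq22 = lhs22 > rhs22
    rhs24 = ((theta_s3 - delta_s3 + n123 / n0 * theta_s3) ** 2
             - (n123 / n0 * theta_s3) ** 2) / (2 * z ** 2)
    ineq24 = 2 * v_c0 / n0 > rhs24
    mismatches += ineq22 != ineq24
```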
C DUAL CONTROL
We finally clarify the calculations in Section 3.3 of the paper, where we determine the sample size required for Setup 4 (a.k.a. a dual control setup) to emerge superior to Setup 3 (a simpler A/B test setup). In the paper we showed that $\theta^*_{S4} > \theta^*_{S3}$ always holds, and hence Setup 4 will never be superior to Setup 3 under the first evaluation criterion. We thus focus on the second evaluation criterion $\Delta_{S4} - \Delta_{S3} > \theta^*_{S4} - \theta^*_{S3}$, and first show that the criterion is equivalent to Inequality (27). Assuming the ratio of user group sizes $n_1 : n_2 : n_3$ remains unchanged, we then show how
(i) The RHS of Inequality (27) remains a constant, and
(ii) The LHS of Inequality (27) scales as $O(\sqrt{n})$, where $n$ is the number of users.
The results mean Setup 4 could emerge superior to Setup 3 if we have a sufficiently large number of users. We also show from the inequality that
(iii) The number of users required for a dual control setup to emerge superior to simpler setups is accessible only to the largest organizations and their top affiliates.
The master inequality. We first show the criterion $\Delta_{S4} - \Delta_{S3} > \theta^*_{S4} - \theta^*_{S3}$ is equivalent to

$$\frac{\frac{n_1\big(n_2(\mu_{I2}-\mu_{C2})+n_3(\mu_{I\cap 2}-\mu_{C3})\big)}{n_2+n_3} - \frac{n_2\big(n_1(\mu_{I1}-\mu_{C1})+n_3(\mu_{I\cap 1}-\mu_{C3})\big)}{n_1+n_3}}{\sqrt{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}}$$
$$> \sqrt{2}\,z\left[\sqrt{2\cdot\frac{\left(1+\frac{n_2}{n_1+n_3}\right)^2\big[n_1(\sigma^2_{C1}+\sigma^2_{I1})+n_3(\sigma^2_{C3}+\sigma^2_{I\cap 1})\big] + \left(1+\frac{n_1}{n_2+n_3}\right)^2\big[n_2(\sigma^2_{C2}+\sigma^2_{I2})+n_3(\sigma^2_{C3}+\sigma^2_{I\cap 2})\big]}{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}} - 1\right], \qquad (27)$$

where $z = z_{1-\alpha/2} - z_{1-\beta_{\min}}$. The number of terms involved is large, and hence we first simplify the LHS and RHS independently, and combine them in the final step.
For the LHS (i.e. $\Delta_{S4} - \Delta_{S3}$), we substitute in Equations (14) and (18) to obtain

$$\frac{n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{I\cap 2}-\mu_{C3})}{n_2+n_3} - \frac{n_1(\mu_{I1}-\mu_{C1}) + n_3(\mu_{I\cap 1}-\mu_{C3})}{n_1+n_3} - \frac{n_1(\mu_{C1}-\mu_{I1}) + n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{I\cap 2}-\mu_{I\cap 1})}{n_1+n_2+n_3}. \qquad (27a)$$

The expression can be rewritten in terms of multiplicative products between the $n$-terms and the (differences between) $\mu$-terms:

$$n_1(\mu_{I1}-\mu_{C1})\left[-\frac{1}{n_1+n_3}+\frac{1}{n_1+n_2+n_3}\right] + n_2(\mu_{I2}-\mu_{C2})\left[\frac{1}{n_2+n_3}-\frac{1}{n_1+n_2+n_3}\right]$$
$$+\, n_3\mu_{I\cap 2}\left[\frac{1}{n_2+n_3}-\frac{1}{n_1+n_2+n_3}\right] + n_3\mu_{I\cap 1}\left[-\frac{1}{n_1+n_3}+\frac{1}{n_1+n_2+n_3}\right] + n_3\mu_{C3}\left[-\frac{1}{n_2+n_3}+\frac{1}{n_1+n_3}\right]. \qquad (27b)$$

We then extract a $1/(n_1+n_2+n_3)$ term from Expression (27b):

$$(n_1+n_2+n_3)^{-1}\Bigg[n_1(\mu_{I1}-\mu_{C1})\left(-\frac{n_1+n_2+n_3}{n_1+n_3}+1\right) + n_2(\mu_{I2}-\mu_{C2})\left(\frac{n_1+n_2+n_3}{n_2+n_3}-1\right)$$
$$+\, n_3\mu_{I\cap 2}\left(\frac{n_1+n_2+n_3}{n_2+n_3}-1\right) + n_3\mu_{I\cap 1}\left(-\frac{n_1+n_2+n_3}{n_1+n_3}+1\right) + n_3\mu_{C3}\left(-\frac{n_1+n_2+n_3}{n_2+n_3}+\frac{n_1+n_2+n_3}{n_1+n_3}\right)\Bigg]. \qquad (27c)$$

This allows us to perform some cancellation with the RHS, which also has a $1/(n_1+n_2+n_3)$ term, in the final step. Noting

$$\frac{n_1+n_2+n_3}{n_1+n_3} = 1 + \frac{n_2}{n_1+n_3} \quad\text{and}\quad \frac{n_1+n_2+n_3}{n_2+n_3} = 1 + \frac{n_1}{n_2+n_3},$$

we can write the inner square brackets as

$$(n_1+n_2+n_3)^{-1}\Bigg[n_1(\mu_{I1}-\mu_{C1})\left(-\frac{n_2}{n_1+n_3}\right) + n_2(\mu_{I2}-\mu_{C2})\left(\frac{n_1}{n_2+n_3}\right) + n_3\mu_{I\cap 2}\left(\frac{n_1}{n_2+n_3}\right)$$
$$+\, n_3\mu_{I\cap 1}\left(-\frac{n_2}{n_1+n_3}\right) + n_3\mu_{C3}\left(-1-\frac{n_1}{n_2+n_3}+1+\frac{n_2}{n_1+n_3}\right)\Bigg], \qquad (27d)$$
and group the $n_1/(n_2+n_3)$ and $n_2/(n_1+n_3)$ terms to arrive at

$$\frac{1}{n_1+n_2+n_3}\left[\frac{n_1}{n_2+n_3}\big(n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{I\cap 2}-\mu_{C3})\big) - \frac{n_2}{n_1+n_3}\big(n_1(\mu_{I1}-\mu_{C1}) + n_3(\mu_{I\cap 1}-\mu_{C3})\big)\right]. \qquad (27e)$$

For the RHS (i.e. $\theta^*_{S4} - \theta^*_{S3}$), we substitute in Equations (15) and (19) to obtain

$$2z\sqrt{\frac{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_3(\sigma^2_{C3}+\sigma^2_{I\cap 1})}{(n_1+n_3)^2} + \frac{n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{C3}+\sigma^2_{I\cap 2})}{(n_2+n_3)^2}}$$
$$-\,\sqrt{2}\,z\sqrt{\frac{n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}{(n_1+n_2+n_3)^2}}, \qquad (27f)$$

where $z = z_{1-\alpha/2} - z_{1-\beta_{\min}}$. We then extract a $\sqrt{2}\,z/(n_1+n_2+n_3)$ term from Expression (27f) to arrive at

$$\frac{\sqrt{2}\,z}{n_1+n_2+n_3}\Bigg[\sqrt{2}\sqrt{\left(\frac{n_1+n_2+n_3}{n_1+n_3}\right)^2\big[n_1(\sigma^2_{C1}+\sigma^2_{I1})+n_3(\sigma^2_{C3}+\sigma^2_{I\cap 1})\big] + \left(\frac{n_1+n_2+n_3}{n_2+n_3}\right)^2\big[n_2(\sigma^2_{C2}+\sigma^2_{I2})+n_3(\sigma^2_{C3}+\sigma^2_{I\cap 2})\big]}$$
$$-\,\sqrt{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}\Bigg], \qquad (27g)$$

where $(n_1+n_2+n_3)/(n_2+n_3)$ and $(n_1+n_2+n_3)/(n_1+n_3)$ can also be written as $1+n_1/(n_2+n_3)$ and $1+n_2/(n_1+n_3)$ respectively.

We finally combine both sides of the inequality by taking Expressions (27e) and (27g):

$$\frac{1}{n_1+n_2+n_3}\left[\frac{n_1}{n_2+n_3}\big(n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{I\cap 2}-\mu_{C3})\big) - \frac{n_2}{n_1+n_3}\big(n_1(\mu_{I1}-\mu_{C1}) + n_3(\mu_{I\cap 1}-\mu_{C3})\big)\right]$$
$$> \frac{\sqrt{2}\,z}{n_1+n_2+n_3}\Bigg[\sqrt{2}\sqrt{\left(1+\frac{n_2}{n_1+n_3}\right)^2\big[n_1(\sigma^2_{C1}+\sigma^2_{I1})+n_3(\sigma^2_{C3}+\sigma^2_{I\cap 1})\big] + \left(1+\frac{n_1}{n_2+n_3}\right)^2\big[n_2(\sigma^2_{C2}+\sigma^2_{I2})+n_3(\sigma^2_{C3}+\sigma^2_{I\cap 2})\big]}$$
$$-\,\sqrt{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}\Bigg]. \qquad (27h)$$
Canceling the common $1/(n_1+n_2+n_3)$ terms on both sides, and dividing both sides by $\sqrt{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}$, leads us to Inequality (27):

$$\frac{\frac{n_1\big(n_2(\mu_{I2}-\mu_{C2})+n_3(\mu_{I\cap 2}-\mu_{C3})\big)}{n_2+n_3} - \frac{n_2\big(n_1(\mu_{I1}-\mu_{C1})+n_3(\mu_{I\cap 1}-\mu_{C3})\big)}{n_1+n_3}}{\sqrt{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}}$$
$$> \sqrt{2}\,z\left[\sqrt{2\cdot\frac{\left(1+\frac{n_2}{n_1+n_3}\right)^2\big[n_1(\sigma^2_{C1}+\sigma^2_{I1})+n_3(\sigma^2_{C3}+\sigma^2_{I\cap 1})\big] + \left(1+\frac{n_1}{n_2+n_3}\right)^2\big[n_2(\sigma^2_{C2}+\sigma^2_{I2})+n_3(\sigma^2_{C3}+\sigma^2_{I\cap 2})\big]}{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}} - 1\right].$$

RHS remains a constant. We then simplify the $\sigma^2$-terms in Inequality (27) by assuming that they are similar in magnitude, i.e.

$$\sigma^2_{C1} \approx \sigma^2_{I1} \approx \cdots \approx \sigma^2_{I\cap 2} \approx \sigma^2_R,$$

and show the RHS of the inequality is equal to

$$\sqrt{2}\,z\left[\sqrt{\frac{2(n_1+n_2+n_3)}{n_1+n_3} + \frac{2(n_1+n_2+n_3)}{n_2+n_3}} - 1\right]. \qquad (28)$$

As long as the group size ratio $n_1 : n_2 : n_3$ remains unchanged, Expression (28) will remain a constant. It is safe to apply the simplifying assumption, as we know from the evaluation framework specification that there are three classes of parameters in the inequality: the user group sizes ($n$), the mean responses ($\mu$), and the response variances ($\sigma^2$). Among these three classes of parameters, only the user group sizes have the potential to scale in any practical setting, and thus we can effectively treat the means and variances as constants below.
We begin by substituting $\sigma^2_R$ into Inequality (27):

$$\frac{\frac{n_1\big(n_2(\mu_{I2}-\mu_{C2})+n_3(\mu_{I\cap 2}-\mu_{C3})\big)}{n_2+n_3} - \frac{n_2\big(n_1(\mu_{I1}-\mu_{C1})+n_3(\mu_{I\cap 1}-\mu_{C3})\big)}{n_1+n_3}}{\sqrt{n_1(2\sigma^2_R) + n_2(2\sigma^2_R) + n_3(2\sigma^2_R)}}$$
$$> \sqrt{2}\,z\left[\sqrt{2\cdot\frac{\left(1+\frac{n_2}{n_1+n_3}\right)^2\big[n_1(2\sigma^2_R)+n_3(2\sigma^2_R)\big] + \left(1+\frac{n_1}{n_2+n_3}\right)^2\big[n_2(2\sigma^2_R)+n_3(2\sigma^2_R)\big]}{n_1(2\sigma^2_R) + n_2(2\sigma^2_R) + n_3(2\sigma^2_R)}} - 1\right]. \qquad (28a)$$

Moving the common $2\sigma^2_R$ terms out and canceling the common terms in the RHS fraction we have

$$\frac{\frac{n_1\big(n_2(\mu_{I2}-\mu_{C2})+n_3(\mu_{I\cap 2}-\mu_{C3})\big)}{n_2+n_3} - \frac{n_2\big(n_1(\mu_{I1}-\mu_{C1})+n_3(\mu_{I\cap 1}-\mu_{C3})\big)}{n_1+n_3}}{\sqrt{2\sigma^2_R(n_1+n_2+n_3)}}$$
$$> \sqrt{2}\,z\left[\sqrt{2\cdot\frac{\left(1+\frac{n_2}{n_1+n_3}\right)^2(n_1+n_3) + \left(1+\frac{n_1}{n_2+n_3}\right)^2(n_2+n_3)}{n_1+n_2+n_3}} - 1\right]. \qquad (28b)$$

We can already see the LHS of Inequality (28b) scales as $O(\sqrt{n})$; we will demonstrate this result in greater detail below. Focusing on the RHS of the inequality, we express the squared terms as rational fractions and divide each term in the numerator by the denominator to obtain

$$\sqrt{2}\,z\left[\sqrt{2\left(\left(\frac{n_1+n_2+n_3}{n_1+n_3}\right)^2\frac{n_1+n_3}{n_1+n_2+n_3} + \left(\frac{n_1+n_2+n_3}{n_2+n_3}\right)^2\frac{n_2+n_3}{n_1+n_2+n_3}\right)} - 1\right]. \qquad (28c)$$

Canceling the common $n_1+n_2+n_3$ terms leads to that presented in Expression (28):

$$\sqrt{2}\,z\left[\sqrt{\frac{2(n_1+n_2+n_3)}{n_1+n_3} + \frac{2(n_1+n_2+n_3)}{n_2+n_3}} - 1\right].$$
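A quick numerical check (with an arbitrary $z$ multiplier and illustrative group sizes) that the RHS of Inequality (27) collapses to Expression (28) once all response variances are set to a common $\sigma^2_R$:

```python
from math import sqrt

def rhs27(n1, n2, n3, v_c1, v_i1, v_c2, v_i2, v_c3, v_int1, v_int2, z=2.8):
    """RHS of Inequality (27); z is the (illustrative) quantile difference."""
    xi = n1 * (v_c1 + v_i1) + n2 * (v_c2 + v_i2) + n3 * (v_int1 + v_int2)
    a = (1 + n2 / (n1 + n3)) ** 2 * (n1 * (v_c1 + v_i1) + n3 * (v_c3 + v_int1))
    b = (1 + n1 / (n2 + n3)) ** 2 * (n2 * (v_c2 + v_i2) + n3 * (v_c3 + v_int2))
    return sqrt(2) * z * (sqrt(2 * (a + b) / xi) - 1)

def rhs28(n1, n2, n3, z=2.8):
    """Expression (28): the same quantity when all variances are equal."""
    N = n1 + n2 + n3
    return sqrt(2) * z * (sqrt(2 * N / (n1 + n3) + 2 * N / (n2 + n3)) - 1)

v = 4.0  # common response variance sigma^2_R
diff = abs(rhs27(3000, 2000, 1000, v, v, v, v, v, v, v) - rhs28(3000, 2000, 1000))
```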
LHS scales as $O(\sqrt{n})$. We demonstrate the scaling relation between the LHS of Inequality (27) and the number of users in each group by simplifying the $n$-terms (but not the $\sigma^2$-terms as above), assuming $n_1 \approx n_2 \approx n_3 \approx n$, and showing the LHS of the inequality is equal to

$$\frac{\sqrt{n}\,\big((\mu_{I2}-\mu_{C2}) - (\mu_{I1}-\mu_{C1}) + \mu_{I\cap 2} - \mu_{I\cap 1}\big)}{2\sqrt{\sigma^2_{C1}+\sigma^2_{I1}+\sigma^2_{C2}+\sigma^2_{I2}+\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2}}}. \qquad (29)$$

While the relationship (that the LHS of the inequality scales as $O(\sqrt{n})$) is evident by inspecting the LHS of Inequality (28b) or even Inequality (27) itself, we believe the simplification allows us to show the relationship more clearly.

We begin by substituting $n$ into Inequality (27) to obtain

$$\frac{\frac{n\big(n(\mu_{I2}-\mu_{C2})+n(\mu_{I\cap 2}-\mu_{C3})\big)}{n+n} - \frac{n\big(n(\mu_{I1}-\mu_{C1})+n(\mu_{I\cap 1}-\mu_{C3})\big)}{n+n}}{\sqrt{n(\sigma^2_{C1}+\sigma^2_{I1}) + n(\sigma^2_{C2}+\sigma^2_{I2}) + n(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}}$$
$$> \sqrt{2}\,z\left[\sqrt{2\cdot\frac{\left(1+\frac{n}{n+n}\right)^2\big[n(\sigma^2_{C1}+\sigma^2_{I1})+n(\sigma^2_{C3}+\sigma^2_{I\cap 1})\big] + \left(1+\frac{n}{n+n}\right)^2\big[n(\sigma^2_{C2}+\sigma^2_{I2})+n(\sigma^2_{C3}+\sigma^2_{I\cap 2})\big]}{n(\sigma^2_{C1}+\sigma^2_{I1}) + n(\sigma^2_{C2}+\sigma^2_{I2}) + n(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}} - 1\right]. \qquad (29a)$$

Moving the common $n$-terms out and canceling them in the fractions where appropriate leads to

$$\frac{\sqrt{n}\,\frac{1}{2}\big[((\mu_{I2}-\mu_{C2}) + (\mu_{I\cap 2}-\mu_{C3})) - ((\mu_{I1}-\mu_{C1}) + (\mu_{I\cap 1}-\mu_{C3}))\big]}{\sqrt{\sigma^2_{C1}+\sigma^2_{I1}+\sigma^2_{C2}+\sigma^2_{I2}+\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2}}}$$
$$> \sqrt{2}\,z\left[\sqrt{2\cdot\frac{\left(1+\frac{1}{2}\right)^2(\sigma^2_{C1}+\sigma^2_{I1}+\sigma^2_{C3}+\sigma^2_{I\cap 1}) + \left(1+\frac{1}{2}\right)^2(\sigma^2_{C2}+\sigma^2_{I2}+\sigma^2_{C3}+\sigma^2_{I\cap 2})}{\sigma^2_{C1}+\sigma^2_{I1}+\sigma^2_{C2}+\sigma^2_{I2}+\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2}}} - 1\right], \qquad (29b)$$

where the LHS is equal to Expression (29) as claimed above.

It is clear that there are no $n$-terms left on the RHS of Inequality (29b), and hence the RHS remains a constant as shown previously. Setting up the inequality to demonstrate the third result, that the number of users required for a dual control setup to emerge superior is large, we further simplify the RHS of the inequality by
rearranging the terms in the square root:

$$\frac{\sqrt{n}\,\big[((\mu_{I2}-\mu_{C2}) + (\mu_{I\cap 2}-\mu_{C3})) - ((\mu_{I1}-\mu_{C1}) + (\mu_{I\cap 1}-\mu_{C3}))\big]}{2\sqrt{\sigma^2_{C1}+\sigma^2_{I1}+\sigma^2_{C2}+\sigma^2_{I2}+\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2}}}$$
$$> \sqrt{2}\,z\left[\sqrt{2\left(\frac{3}{2}\right)^2\left(1 + \frac{2\sigma^2_{C3}}{\sigma^2_{C1}+\sigma^2_{I1}+\sigma^2_{C2}+\sigma^2_{I2}+\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2}}\right)} - 1\right]. \qquad (29c)$$
Required number of users is large. We finally show that while Setup 4 could emerge superior to Setup 3 as the number of users increases, the number of users required is high. We do so by assuming both the $\sigma^2$- and $n$-terms are similar in magnitude, i.e. $\sigma^2_{C1} \approx \sigma^2_{I1} \approx \cdots \approx \sigma^2_{I\cap 2} \approx \sigma^2_R$ and $n_1 \approx n_2 \approx n_3 \approx n$, and show that Inequality (27) is equivalent to

$$n > \left(2\sqrt{12}\,\big(\sqrt{6}-1\big)\,z\right)^2 \frac{\sigma^2_R}{\Delta^2}, \qquad (30)$$

where $\Delta = (\mu_{I2}-\mu_{C2}) - (\mu_{I1}-\mu_{C1}) + \mu_{I\cap 2} - \mu_{I\cap 1}$ is the actual effect size difference between Setups 4 and 3. Note we are determining when Setup 4 is superior to Setup 3 under the second evaluation criterion, that the gain in actual effect is greater than the loss in sensitivity, and thus assume $\Delta$ is positive.

The equivalence can be shown by substituting $\sigma^2_R$ into Inequality (29c), which already assumes the $n$-terms are similar in magnitude:⁸

$$\frac{\sqrt{n}\,\big[((\mu_{I2}-\mu_{C2}) + (\mu_{I\cap 2}-\mu_{C3})) - ((\mu_{I1}-\mu_{C1}) + (\mu_{I\cap 1}-\mu_{C3}))\big]}{2\sqrt{6\sigma^2_R}} > \sqrt{2}\,z\left[\sqrt{2\left(\frac{3}{2}\right)^2\left(1 + \frac{2\sigma^2_R}{6\sigma^2_R}\right)} - 1\right]. \qquad (30a)$$

Noting the expression within the LHS square brackets is equal to $\Delta$, we simplify the expression within the RHS square root, and move every non-$n$ term to the RHS of the inequality to obtain

$$\sqrt{n} > \sqrt{2}\,z\,\big[\sqrt{6}-1\big]\,\frac{2\sqrt{6\sigma^2_R}}{\Delta}. \qquad (30b)$$

As all quantities in the inequality are positive, we can square both sides and consolidate the coefficients on the RHS to arrive at Inequality (30):

$$n > \left(2\sqrt{12}\,\big(\sqrt{6}-1\big)\,z\right)^2 \frac{\sigma^2_R}{\Delta^2}.$$
⁸ Alternatively we can substitute $n$ into Inequality (28b), which already assumes the $\sigma^2$-terms are similar in magnitude. Simplifying the resultant inequality would yield the same end result.
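Inequality (30) translates directly into a sample size calculator. The sketch below uses an illustrative response variance and effect gain; with $\sigma^2_R = 25$ and $\Delta = 0.05$ it returns roughly 7.9 million users per group, illustrating why only the largest organizations can benefit from a dual control setup.

```python
from math import sqrt
from statistics import NormalDist

def required_users_per_group(var_r, delta, alpha=0.05, power=0.8):
    """Users required in each of groups 1-3 (n1 = n2 = n3 = n) for the dual
    control setup to beat the simple A/B setup under the second criterion,
    per Inequality (30). `delta` is the effect-size gain
    (mu_I2 - mu_C2) - (mu_I1 - mu_C1) + mu_Icap2 - mu_Icap1, assumed > 0;
    `power` plays the role of the paper's beta_min."""
    z = NormalDist().inv_cdf
    z_term = z(1 - alpha / 2) - z(1 - power)
    return (2 * sqrt(12) * (sqrt(6) - 1) * z_term) ** 2 * var_r / delta ** 2

n_required = required_users_per_group(var_r=25.0, delta=0.05)
```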