An Evaluation Framework for Personalization Strategy Experiment Designs
C. H. Bryan Liu
[email protected]
Imperial College London & ASOS.com, UK
Emma J. McCoy
Imperial College London, UK
ABSTRACT
Online Controlled Experiments (OCEs) are the gold standard in evaluating the effectiveness of changes to websites. An important type of OCE evaluates different personalization strategies, which present challenges in low test power and lack of full control in group assignment. We argue that getting the right experiment setup, the allocation of users to treatment/analysis groups, should take precedence over post-hoc variance reduction techniques in order to enable the scaling of the number of experiments. We present an evaluation framework that, along with a few simple rules of thumb, allows experimenters to quickly compare which experiment setup will lead to the highest probability of detecting a treatment effect under their particular circumstances.
KEYWORDS
Experimentation, Experiment design, Personalization strategies, A/B testing.
1 INTRODUCTION
The use of Online Controlled Experiments (OCEs, e.g. A/B tests) has become popular in measuring the impact of digital products and services, and in guiding business decisions on the Web [10]. Major companies report running thousands of OCEs on any given day [6, 8, 14], and many startups exist purely to manage OCEs [2, 7].
A large number of OCEs address simple variations on elements of the user experience based on random splits, e.g. showing a different colored button to users based on a user ID hash bucket. Here, we are interested in experiments that compare personalization strategies, complex sets of targeted user interactions that are common in e-commerce and digital marketing, and measure the change to a performance metric of interest (metric hereafter). Examples of personalization strategies include the scheduling, budgeting, and ordering of marketing activities directed at a user based on their purchase history.
Experiments for personalization strategies face two unique challenges. Firstly, strategies are often only applicable to a small fraction of the user base, and thus many simple experiment designs suffer from either a lack of sample size / statistical power, or diluted metric movement from including irrelevant samples [3]. Secondly, as users are not randomly assigned a priori, but must qualify to be treated with a strategy via their actions or attributes, groups of users subjected to different strategies cannot be assumed to be statistically equivalent and hence are not directly comparable.
While there are a number of variance reduction techniques (including stratification and control variates [4, 12]) that partially address these challenges, the strata and control variates involved can vary dramatically from one personalization strategy experiment to another, requiring many ad hoc adjustments. As a result, such techniques may not scale well when organizations design and run hundreds or thousands of experiments at any given time.
We argue that personalization strategy experiments should focus on the assignment of users from the strategies they qualified for to the treatment/analysis groups. We call this mapping process an experiment setup. Identifying the best experiment setup increases the chance of detecting any treatment effect. An experimentation framework can also reuse and switch between different setups quickly with little custom input, ensuring the operation can scale. More importantly, the process does not hinder the subsequent application of variance reduction techniques, meaning that we can still apply them post hoc if required.
To date, many experiment setups exist to compare personalization strategies. An increasingly popular approach is to compare the strategies using multiple control groups: Quantcast calls it a dual control [1], and Facebook calls it a multi-cell lift study [9]. In the two-strategy case, this involves running two experiments on two random partitions of the user base in parallel, with each experiment further splitting the respective partition into treatment/control and measuring the incrementality (the change in a metric as compared to the case where nothing is done) of each strategy. The incrementalities of the strategies are then compared against each other.
Despite the setup above gaining traction in display advertising, there is a lack of literature on whether it is a better setup, i.e. one that has a higher sensitivity and/or apparent effect size than other setups. While [9] noted that multi-cell lift studies require a large number of users, they did not discuss how the number compares to other setups.1 The ability to identify and adopt a better experiment setup can reduce the required sample size, and hence enable more cost-effective experimentation.
We address the gap in the literature by introducing an evaluation framework that compares experiment setups given two personalization strategies. The framework is designed to be flexible so that it is able to deal with the wide range of baselines and changes in user responses presented by any pair of strategies (situations hereafter). We also recognize the need to quickly compare common setups, and provide some simple rules of thumb on situations where one setup will be better than another. In particular, we outline the conditions where employing a multi-cell setup, as well as metric dilution, is desirable.
To summarize, our contributions are: (i) We develop a flexible evaluation framework for personalization strategy experiments, where one can compare two experiment setups given the situation presented by two competing strategies (Section 2); (ii) We provide simple rules of thumb to enable experimenters who do not require the full flexibility of the framework to quickly compare common
1A single-cell lift study is often used to measure the incrementality of a single personalization strategy, and hence is not a representative comparison.
arXiv:2007.11638v2 [stat.ME] 17 Dec 2020
AdKDD '20, Aug 2020, Online. Liu and McCoy.
Figure 1: Venn diagram of the user groups in our evaluation framework. The outer, left inner (red), and right inner (blue) boxes represent the entire user base, those who qualify for strategy 1, and those who qualify for strategy 2 respectively. Group 0 (qualify for neither strategy) sits outside both inner boxes; Groups 1, 2, and 3 qualify for strategy 1 only, strategy 2 only, and both strategies respectively.
setups (Section 3); and (iii) We make our results useful to practitioners by making the code used in the paper (Section 4) publicly available.2
2 EVALUATION FRAMEWORK
We first present our evaluation framework for personalization strategy experiments. The experiments compare two personalization strategies, which we refer to as strategy 1 and strategy 2. Often one of them is the existing strategy, and the other is a new strategy we intend to test and learn from. In this section we introduce (i) how users qualifying themselves into strategies creates non-statistically-equivalent groups, (ii) how experimenters usually assign the users, and (iii) when we would consider an assignment to be better.
2.1 User grouping
As users qualify themselves into the two strategies, four disjoint groups emerge: those who qualify for neither strategy, those who qualify only for strategy 1, those who qualify only for strategy 2, and those who qualify for both strategies. We denote these (user) groups 0, 1, 2, and 3 respectively (see Figure 1). It is perhaps obvious that we cannot assume those in different user groups are statistically equivalent and compare them directly.
We assume groups 0, 1, 2, and 3 have n_0, n_1, n_2, and n_3 users respectively. We also assume responses from users (which we aggregate, often by taking the mean, to obtain our metric) are distributed differently between groups, and, within the same group, between the scenario where the group is subjected to the treatment associated with the corresponding strategy and that where nothing is done (baseline). We list all group-scenario combinations in Table 1, and denote the mean and variance of the responses (μ_G, σ²_G) for a combination G.3
2.2 Experiment setups
Many experiment setups exist and are in use in different organizations. Here we introduce four common setups of varying sophistication, which we also illustrate in Figure 2.
2The code and supplementary document are available at: https://github.com/liuchbryan/experiment_design_evaluation.
3For example, the responses for group 1 without any intervention have mean and variance (μ_C1, σ²_C1), and those for group 2 with the treatment prescribed under strategy 2 have mean and variance (μ_I2, σ²_I2).
Figure 2: Experiment setups overlaid on the user grouping Venn diagram in Figure 1. The hatched boxes indicate which users are included in the analysis, and the downward triangles and dots indicate who are subjected to the treatment prescribed under strategies 1 and 2 respectively. See Section 2.2 for a detailed description.
Setup 1 (Users in the intersection only). The setup considers only users who qualify for both strategies. These users are randomly split (usually 50/50) into two (analysis) groups A and B, and are prescribed the treatment specified by strategies 1 and 2 respectively. The setup is easy to implement, though it is difficult to translate any learnings obtained from it to other user groups (e.g. those who qualify for one strategy only) [5].
Setup 2 (All samples). The setup is a simple A/B test that considers all users, regardless of whether they qualify for any strategy. The users are randomly split into two analysis groups A and B, and are prescribed the treatment specified by strategy 1 (resp. 2) if (i) they qualify under the strategy and (ii) they are in analysis group A (resp. B). This setup is the easiest to implement but usually suffers severely from a dilution in metric [3].
Setup 3 (Qualified users only). The setup is similar to Setup 2 except that only those who qualify for at least one strategy ("triggered" users in some literature [3]) are included in the analysis groups. The setup sits between Setup 1 and Setup 2 in terms of user coverage, and has the advantage of capturing the largest number of useful samples while having the least metric dilution. However, the setup also prevents one from measuring the incrementality of a strategy itself; it yields only the difference in incrementality between the two strategies.
Setup 4 (Dual control / multi-cell lift test). As described in Section 1, the setup first splits the users randomly into two randomization groups. For the first randomization group, we consider those who qualify for strategy 1, and split them into analysis groups A1 and A2. Group A2 receives the treatment prescribed under strategy 1, and group A1 acts as control. The incrementality for strategy 1 is then the difference in metric between groups A2 and A1. We
                                 Group 0   Group 1   Group 2   Group 3
Baseline (Control)               C0        C1        C2        C3
Under treatment (Intervention)   /         I1        I2        under strategy 1: Ia; under strategy 2: Ib

Table 1: All group-scenario combinations in our evaluation framework for personalization strategy experiments. The columns represent the groups described in Figure 1. The baseline represents the scenario where nothing is done. We assume those who qualify for both strategies (Group 3) can only receive treatment(s) associated with either strategy.
apply the same process to the second randomization group, with strategy 2 and analysis groups B1 and B2 in place, and compare the incrementalities for strategies 1 and 2. The setup allows one to obtain the incrementality of each individual strategy and minimizes metric dilution. However, it also leaves a number of samples unused and creates extra analysis groups, and hence generally suffers from low test power [9].
2.3 Evaluation criteria
There are a number of considerations when one evaluates competing experiment setups. These include technical considerations, such as the complexity of implementing the setups on an OCE framework, and business considerations, such as whether the incrementality of individual strategies is required.
Here we focus on the statistical aspect and propose two evaluation criteria: (i) the actual average treatment effect size as presented by the two analysis groups in an experiment setup, and (ii) the sensitivity of the experiment, represented by the minimum detectable effect (MDE) under a pre-specified test power. Both criteria are necessary, as the former indicates whether a setup suffers from metric dilution, whereas the latter indicates whether the setup suffers from a lack of power/sample size. An ideal setup should yield a high actual effect size and a high sensitivity (i.e. a low MDE),4 though as we observe in the next section there is usually a trade-off.
We formally define the two evaluation criteria from first principles, introducing the relevant notation along the way. Let A and B be the two analysis groups in an experiment setup, with user responses randomly distributed with mean and variance (μ_A, σ²_A) and (μ_B, σ²_B) respectively. We first recall that if there are sufficient samples, the sample means of the two groups approximately follow normal distributions by the Central Limit Theorem:

    Ā ∼ N(μ_A, σ²_A / n_A),   B̄ ∼ N(μ_B, σ²_B / n_B)   (approximately),   (1)
where n_A and n_B are the number of samples taken from A and B respectively. The difference in the sample means then also approximately follows a normal distribution:

    D̂ ≜ B̄ − Ā ∼ N( Δ ≜ μ_B − μ_A,  σ²_D̂ ≜ σ²_A/n_A + σ²_B/n_B )   (approximately).   (2)
Here, Δ is the actual effect size that we are interested in. The definition of the MDE θ* requires a primer on the power of a statistical test. A common null hypothesis statistical test in personalization strategy experiments uses the two-tailed hypotheses H0: Δ = 0 and H1: Δ ≠ 0, with the test statistic under H0 being:

    Z ≜ D̂ / σ_D̂ ∼ N(0, 1)   (approximately).   (3)
4We will use the terms "high(er) sensitivity" and "low(er) MDE" interchangeably.
We recall that the null hypothesis is rejected if |Z| > z_{1−α/2}, where α is the significance level and z_{1−α/2} is the 1 − α/2 quantile of a standard normal. Under a specific alternative hypothesis Δ = θ, the power is specified as

    1 − β_θ ≜ Pr( |Z| > z_{1−α/2} | Δ = θ ) ≈ 1 − Φ( z_{1−α/2} − |θ|/σ_D̂ ),   (4)

where Φ denotes the cumulative distribution function of a standard normal.5 To achieve a minimum test power π_min, we require that 1 − β_θ > π_min. Substituting Equation (4) into the inequality and rearranging to make θ the subject yields the effect sizes that the test will be able to detect with the specified power:

    |θ| > (z_{1−α/2} − z_{1−π_min}) σ_D̂.   (5)

θ* is then defined as the positive minimum θ that satisfies Inequality (5), i.e. that specified by the RHS of the inequality.
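The RHS of Inequality (5) is straightforward to compute with standard normal quantiles. A minimal sketch in Python (the function name and default arguments are ours, not from the paper's released code):

```python
from statistics import NormalDist

def mde(sigma_d: float, alpha: float = 0.05, power: float = 0.8) -> float:
    """Minimum detectable effect theta* per Inequality (5):
    (z_{1-alpha/2} - z_{1-power}) * sigma_D-hat, where sigma_D-hat is
    the standard deviation of the difference in sample means."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1-alpha/2}
    z_power = NormalDist().inv_cdf(1 - power)      # z_{1-pi_min}
    return (z_crit - z_power) * sigma_d
```

At the conventional α = 0.05 and 80% power, the multiplier is z_0.975 − z_0.2 ≈ 1.96 + 0.84 ≈ 2.80.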
We finally define what it means to be better under these evaluation criteria. WLOG we assume the actual effect sizes of the two competing experiment setups are positive,6 and say a setup X is superior to another setup Y if, all else being equal,
(i) X produces a higher actual effect size (Δ_X > Δ_Y) and a lower minimum detectable effect size (θ*_X < θ*_Y), or
(ii) the gain in actual effect is greater than the loss in sensitivity:

    Δ_X − Δ_Y > θ*_X − θ*_Y,   (6)

which means an actual effect still stands a higher chance of being observed under X.
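The two criteria translate directly into a comparison function; a sketch under our notation (assuming, as above, positive actual effect sizes):

```python
def is_superior(delta_x: float, theta_x: float,
                delta_y: float, theta_y: float) -> bool:
    """Is setup X superior to setup Y under the criteria of Section 2.3?
    Criterion (i): higher actual effect AND lower MDE.
    Criterion (ii): the gain in actual effect exceeds the loss in
    sensitivity (Inequality (6))."""
    criterion_i = delta_x > delta_y and theta_x < theta_y
    criterion_ii = (delta_x - delta_y) > (theta_x - theta_y)
    return criterion_i or criterion_ii
```

Note that criterion (i) implies criterion (ii); listing both mirrors the two-step argument used in Section 3.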
3 COMPARING EXPERIMENT SETUPS
Having described the evaluation framework above, in this section we use the framework to compare the common experiment setups described in Section 2.2. We first derive the actual effect size and MDE for each setup in Section 3.1, and then use the results to create rules of thumb on (i) whether diluting the metric by including users who qualify for neither strategy is beneficial (Section 3.2) and (ii) whether dual control is a better setup for personalization strategy experiments (Section 3.3), two questions that are often discussed among e-commerce and marketing-focused experimenters. For brevity, we relegate most of the intermediate algebraic work in deriving the actual and minimum detectable effect sizes, as well as the conditions that lead to a setup being superior, to our supplementary document.2
5The approximation in Equation (4) is tight for experiment design purposes, where α < 0.2 and 1 − β > 0.6 in nearly all cases.
6If both actual effect sizes are negative, we simply swap the analysis groups. If the actual effect sizes are of opposite signs, it is likely an error.
3.1 Actual & minimum detectable effect sizes
We first present the actual effect size and MDE of the four experiment setups. For each setup we first compute the sample size, mean response, and response variance in each analysis group, which arises as a mixture of the user groups described in Section 2.1. For brevity, we only present the quantities for one analysis group per setup; expressions for the other analysis group(s) can easily be obtained by substituting in the corresponding user groups. We then substitute the computed quantities into the definitions of Δ (see Equation (2)) and θ* (see Inequality (5)) to obtain the setup-specific actual effect size and MDE. We assume all random splits are done 50/50 in these setups to maximize the test power.
Setup 1 (Users in the intersection only). The setup randomly splits user group 3 into two analysis groups, each with n_3/2 samples. Users in analysis group A are provided treatment under strategy 1, and hence the group's responses have mean and variance (μ_Ia, σ²_Ia). The actual effect size and MDE for Setup 1 are hence:

    Δ_S1 = μ_Ib − μ_Ia,   (7)

    θ*_S1 = (z_{1−α/2} − z_{1−π_min}) √( 2(σ²_Ia + σ²_Ib) / n_3 ).   (8)
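Equations (7) and (8) can be sketched in a few lines of Python (function and argument names are ours):

```python
from statistics import NormalDist

def setup1_effects(mu_ia, mu_ib, var_ia, var_ib, n3,
                   alpha=0.05, power=0.8):
    """Actual effect (Eq. (7)) and MDE (Eq. (8)) for Setup 1, which
    splits the n3 users of group 3 50/50 between the two strategies."""
    z = NormalDist().inv_cdf(1 - alpha / 2) - NormalDist().inv_cdf(1 - power)
    delta = mu_ib - mu_ia                              # Eq. (7)
    theta = z * (2 * (var_ia + var_ib) / n3) ** 0.5    # Eq. (8)
    return delta, theta
```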
Setup 2 (All samples). This setup also contains two analysis groups, A and B, each taking half of the population (i.e. (n_0 + n_1 + n_2 + n_3)/2). The mean response and response variance for groups A and B are the weighted mean response and response variance of the constituent user groups respectively. As we only provide treatment to those who qualify for strategy 1 in group A, and likewise for group B with strategy 2, each user group will give different responses, e.g. for group A:

    μ_A = (n_0 μ_C0 + n_1 μ_I1 + n_2 μ_C2 + n_3 μ_Ia) / (n_0 + n_1 + n_2 + n_3),   (9)

    σ²_A = (n_0 σ²_C0 + n_1 σ²_I1 + n_2 σ²_C2 + n_3 σ²_Ia) / (n_0 + n_1 + n_2 + n_3).   (10)
Substituting the above (and the analogues for group B) into the definitions of actual effect size and MDE, we have:

    Δ_S2 = [ n_1(μ_C1 − μ_I1) + n_2(μ_I2 − μ_C2) + n_3(μ_Ib − μ_Ia) ] / (n_0 + n_1 + n_2 + n_3),   (11)

    θ*_S2 = (z_{1−α/2} − z_{1−π_min}) √( 2[ 2 n_0 σ²_C0 + n_1(σ²_I1 + σ²_C1) + n_2(σ²_C2 + σ²_I2) + n_3(σ²_Ia + σ²_Ib) ] / (n_0 + n_1 + n_2 + n_3)² ).   (12)
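Equations (11) and (12) are, again, simple weighted sums; a Python sketch (parameter layout is ours — the sizes, means, and variances follow the Table 1 labels):

```python
from statistics import NormalDist

def setup2_effects(n, mu, var, alpha=0.05, power=0.8):
    """Actual effect (Eq. (11)) and MDE (Eq. (12)) for Setup 2.
    n maps group index (0..3) to size; mu and var map the Table 1
    group-scenario labels ('C0', 'I1', 'Ia', ...) to means/variances."""
    z = NormalDist().inv_cdf(1 - alpha / 2) - NormalDist().inv_cdf(1 - power)
    total = n[0] + n[1] + n[2] + n[3]
    delta = (n[1] * (mu['C1'] - mu['I1'])
             + n[2] * (mu['I2'] - mu['C2'])
             + n[3] * (mu['Ib'] - mu['Ia'])) / total       # Eq. (11)
    var_sum = (2 * n[0] * var['C0']
               + n[1] * (var['I1'] + var['C1'])
               + n[2] * (var['C2'] + var['I2'])
               + n[3] * (var['Ia'] + var['Ib']))
    theta = z * (2 * var_sum) ** 0.5 / total               # Eq. (12)
    return delta, theta
```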
Setup 3 (Qualified users only). The setup is very similar to Setup 2, with members of user group 0 excluded. This leads to both analysis groups having (n_1 + n_2 + n_3)/2 users. The absence of group 0 users means they do not feature in the weighted mean response and response variance of the two analysis groups, e.g. for group A:

    μ_A = (n_1 μ_I1 + n_2 μ_C2 + n_3 μ_Ia) / (n_1 + n_2 + n_3),   σ²_A = (n_1 σ²_I1 + n_2 σ²_C2 + n_3 σ²_Ia) / (n_1 + n_2 + n_3).   (13)
This leads to the following actual effect size and MDE for Setup 3:

    Δ_S3 = [ n_1(μ_C1 − μ_I1) + n_2(μ_I2 − μ_C2) + n_3(μ_Ib − μ_Ia) ] / (n_1 + n_2 + n_3),   (14)

    θ*_S3 = (z_{1−α/2} − z_{1−π_min}) √( 2[ n_1(σ²_I1 + σ²_C1) + n_2(σ²_C2 + σ²_I2) + n_3(σ²_Ia + σ²_Ib) ] / (n_1 + n_2 + n_3)² ).   (15)
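In code, Setup 3 is the Setup 2 mixture with group 0 dropped entirely; a self-contained sketch under the same (our) parameter layout:

```python
from statistics import NormalDist

def setup3_effects(n, mu, var, alpha=0.05, power=0.8):
    """Actual effect (Eq. (14)) and MDE (Eq. (15)) for Setup 3.
    n maps group index (1..3) to size; mu and var map the Table 1
    group-scenario labels to response means/variances."""
    z = NormalDist().inv_cdf(1 - alpha / 2) - NormalDist().inv_cdf(1 - power)
    m = n[1] + n[2] + n[3]
    delta = (n[1] * (mu['C1'] - mu['I1'])
             + n[2] * (mu['I2'] - mu['C2'])
             + n[3] * (mu['Ib'] - mu['Ia'])) / m           # Eq. (14)
    var_sum = (n[1] * (var['I1'] + var['C1'])
               + n[2] * (var['C2'] + var['I2'])
               + n[3] * (var['Ia'] + var['Ib']))
    theta = z * (2 * var_sum) ** 0.5 / m                   # Eq. (15)
    return delta, theta
```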
Setup 4 (Dual control). The setup is the odd one out as it has four analysis groups. Two of the analysis groups (A1 and A2) are drawn from those who qualify for strategy 1 and are allocated to the first randomization group, and the other two (B1 and B2) are drawn from those who qualify for strategy 2 and are allocated to the second randomization group:

    n_A1 = n_A2 = (n_1 + n_3)/4,   n_B1 = n_B2 = (n_2 + n_3)/4.   (16)

The mean response and response variance for group A1 are:

    μ_A1 = (n_1 μ_C1 + n_3 μ_C3) / (n_1 + n_3),   σ²_A1 = (n_1 σ²_C1 + n_3 σ²_C3) / (n_1 + n_3).   (17)

As the setup takes the difference of differences in the metric (i.e. the difference between the mean response for groups B2 and B1, and the difference between the mean response for groups A2 and A1),7 the actual effect size is as follows:

    Δ_S4 = (μ_B2 − μ_B1) − (μ_A2 − μ_A1)
         = [ n_2(μ_I2 − μ_C2) + n_3(μ_Ib − μ_C3) ] / (n_2 + n_3) − [ n_1(μ_I1 − μ_C1) + n_3(μ_Ia − μ_C3) ] / (n_1 + n_3).   (18)

The MDE for Setup 4 is similar to that specified by the RHS of Inequality (5), albeit with more groups:

    θ*_S4 = (z_{1−α/2} − z_{1−π_min}) √( σ²_A1/n_A1 + σ²_A2/n_A2 + σ²_B1/n_B1 + σ²_B2/n_B2 )
          = 2 (z_{1−α/2} − z_{1−π_min}) √( [ n_1(σ²_C1 + σ²_I1) + n_3(σ²_C3 + σ²_Ia) ] / (n_1 + n_3)² + [ n_2(σ²_C2 + σ²_I2) + n_3(σ²_C3 + σ²_Ib) ] / (n_2 + n_3)² ).   (19)
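A Python sketch of Equations (18) and (19), under the same (our) parameter layout as the earlier setups; unlike Setups 1–3 this also requires the group 3 baseline (C3), since group 3 users can sit in a control cell:

```python
from statistics import NormalDist

def setup4_effects(n, mu, var, alpha=0.05, power=0.8):
    """Actual effect (Eq. (18)) and MDE (Eq. (19)) for Setup 4
    (dual control), with analysis groups A1/A2 and B1/B2."""
    z = NormalDist().inv_cdf(1 - alpha / 2) - NormalDist().inv_cdf(1 - power)
    na, nb = n[1] + n[3], n[2] + n[3]
    # Eq. (18): difference of the two per-strategy incrementalities
    delta = ((n[2] * (mu['I2'] - mu['C2']) + n[3] * (mu['Ib'] - mu['C3'])) / nb
             - (n[1] * (mu['I1'] - mu['C1']) + n[3] * (mu['Ia'] - mu['C3'])) / na)
    # Eq. (19): four analysis groups, each of size na/2 or nb/2... /2 again
    theta = 2 * z * (
        (n[1] * (var['C1'] + var['I1']) + n[3] * (var['C3'] + var['Ia'])) / na ** 2
        + (n[2] * (var['C2'] + var['I2']) + n[3] * (var['C3'] + var['Ib'])) / nb ** 2
    ) ** 0.5
    return delta, theta
```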
3.2 Is dilution always bad?
The use of responses from users who do not qualify for any of the strategies being compared, an act known as metric dilution, has stirred countless debates in experimentation teams. On one hand, responses from these users make any treatment effect less pronounced by contributing exactly zero; on the other hand, dilution might be necessary, as one may not know who actually qualifies for a strategy [9], or desirable, as these users can be leveraged to reduce the variance of the treatment effect estimator [3].
Here, we are interested in whether we should engage in dilution given the assumed user responses prior to an experiment. This can be clarified by understanding the conditions under which Setup 3 would emerge superior (as defined in Section 2.3) to Setup 2. By inspecting Equations (11) and (14), it is clear that Δ_S3 > Δ_S2
7Not to be confused with the difference-in-differences method, which captures the metric for the control and treatment groups at both the beginning (pre-intervention) and the end (post-intervention) of the experiment. Here we only capture the metric once, at the end of the experiment.
if n_0 > 0. Thus, Setup 3 is superior to Setup 2 under the first criterion if θ*_S3 < θ*_S2, which is the case if σ²_C0, the metric variance of users who qualify for neither strategy, is large. This can be shown by substituting Equations (12) and (15) into the θ*-inequality and rearranging the terms to obtain:

    [ n_1(σ²_I1 + σ²_C1) + n_2(σ²_C2 + σ²_I2) + n_3(σ²_Ia + σ²_Ib) ] · (n_0 + 2n_1 + 2n_2 + 2n_3) / [ 2(n_1 + n_2 + n_3)² ] < σ²_C0.   (20)
If we assume the response variances are similar across groups of users who qualify for at least one strategy, i.e. σ²_I1 ≈ σ²_C1 ≈ ⋯ ≈ σ²_Ib ≈ σ²_q, Inequality (20) can be simplified to

    σ²_q ( n_0 / (n_1 + n_2 + n_3) + 2 ) < σ²_C0,   (21)

which can be used to quickly determine whether one should consider dilution at all.
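The quick check of Inequality (21) is a one-liner; a sketch (the function name is ours):

```python
def dilution_hurts_sensitivity(var_q, var_c0, n0, n_qualified):
    """Rule-of-thumb check of Inequality (21): assuming roughly equal
    response variances var_q across all qualified group-scenario
    combinations, returns True when the undiluted Setup 3 has a
    strictly lower MDE than the diluted Setup 2. n_qualified is
    n1 + n2 + n3 and n0 is the number of non-qualifying users."""
    return var_q * (n0 / n_qualified + 2) < var_c0
```

For instance, with equal numbers of qualifying and non-qualifying users, dilution hurts sensitivity as soon as the non-qualifying users' variance exceeds three times the qualified users' variance.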
If Inequality (20) does not hold (i.e. θ*_S3 ≥ θ*_S2), we should then consider when the second criterion (i.e. Δ_S3 − Δ_S2 > θ*_S3 − θ*_S2) is met. Writing

    η = n_1(μ_C1 − μ_I1) + n_2(μ_I2 − μ_C2) + n_3(μ_Ib − μ_Ia),
    ξ = n_1(σ²_C1 + σ²_I1) + n_2(σ²_I2 + σ²_C2) + n_3(σ²_Ia + σ²_Ib), and
    z = z_{1−α/2} − z_{1−π_min},

we can substitute Equations (11), (12), (14), and (15) into the inequality and rearrange to obtain

    [ (n_1 + n_2 + n_3) / n_0 ] √( 2 n_0 σ²_C0 + ξ ) > [ (n_0 + n_1 + n_2 + n_3) / n_0 ] √ξ − η / (√2 z).   (22)
As the LHS of Inequality (22) is always positive, Setup 3 is superior if the RHS ≤ 0. Noting that

    Δ_S3 = η / (n_1 + n_2 + n_3) and θ*_S3 = √2 · z · √ξ / (n_1 + n_2 + n_3),

the trivial case is satisfied if

    [ (n_0 + n_1 + n_2 + n_3) / n_0 ] · θ*_S3 ≤ Δ_S3.   (23)
If the RHS of Inequality (22) is positive, we can safely square both sides and use the identities for Δ_S3 and θ*_S3 to get

    2σ²_C0 / n_0 > [ ( θ*_S3 − Δ_S3 + ((n_1+n_2+n_3)/n_0) θ*_S3 )² − ( ((n_1+n_2+n_3)/n_0) θ*_S3 )² ] / (2z²).   (24)

As the LHS is always positive, the second criterion is met if

    θ*_S3 ≤ Δ_S3.   (25)
Note this is a weaker, and thus more easily satisfied, condition than that introduced in Inequality (23). This suggests an experiment setup is always superior to a diluted alternative if the experiment is already adequately powered; introducing any dilution will simply make things worse.
Failing the condition in Inequality (25), we can always fall back on Inequality (24). While the inequality operates in squared space, it essentially compares the standard error of user group 0 (LHS), those who qualify for neither strategy, to the gap between the minimum detectable and actual effects (θ*_S3 − Δ_S3). The gap can be interpreted as the existing noise level: a higher standard error means mixing in group 0 users will introduce extra noise, and one is better off without them. Conversely, a smaller standard error means group 0 users can lower the noise level, i.e. stabilize the metric fluctuation, and one should take advantage of them.
To summarize, diluting a personalization strategy experiment setup is not helpful if:
(i) users who do not qualify for any strategy have a large metric variance (Inequality (20)), or
(ii) the experiment is already adequately powered (Inequality (25)).
It could help if the experiment has not yet gained sufficient power and users who do not qualify for any strategy provide low-variance responses, such that they exhibit a stabilizing effect when included in the analysis (complement of Inequality (24)).
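The summary above can be wrapped into a small decision helper; a sketch (the function and its three-way verdict are ours, and the 'maybe' branch defers to the full Inequality (24) rather than deciding):

```python
def dilution_verdict(theta_s3, delta_s3, var_q, var_c0, n0, n_qualified):
    """Combines the Section 3.2 rules of thumb. Returns 'no' when
    dilution is ruled out by Inequality (21) (noisy group 0 users)
    or Inequality (25) (already adequately powered), and 'maybe'
    when one must fall back to the full Inequality (24)."""
    if var_q * (n0 / n_qualified + 2) < var_c0:  # Ineq. (21)
        return 'no'
    if theta_s3 <= delta_s3:                     # Ineq. (25)
        return 'no'
    return 'maybe'
```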
3.3 When is a dual control more effective?
Often when advertisers compare two personalization strategies, the question of whether to use a dual control / multi-cell design comes up. Proponents of such an approach celebrate its ability to tell a story by making the incrementality of an individual strategy available, while opponents voice concerns about the complexity of setting up the design. Here we are interested in whether Setup 4 (dual control) is superior to Setup 3 (a simple A/B test) from a power/detectable-effect perspective, and if so, under what circumstances.
We first observe that θ*_S4 > θ*_S3 always holds, and hence a dual control setup will never be superior to a simpler setup under the first criterion. This can be verified by substituting in Equations (19) and (15) and rearranging the terms to show the inequality is equivalent to

    2 ( n_1(σ²_C1 + σ²_I1)/(n_1+n_3)² + n_2(σ²_C2 + σ²_I2)/(n_2+n_3)²
        + n_3 σ²_Ia/(n_1+n_3)² + n_3 σ²_Ib/(n_2+n_3)²
        + [ n_3/(n_1+n_3)² + n_3/(n_2+n_3)² ] σ²_C3 )
    > n_1(σ²_C1 + σ²_I1)/(n_1+n_2+n_3)² + n_2(σ²_C2 + σ²_I2)/(n_1+n_2+n_3)²
      + n_3 σ²_Ia/(n_1+n_2+n_3)² + n_3 σ²_Ib/(n_1+n_2+n_3)²,   (26)

which is always true given that the n's are non-negative and the σ²'s are positive: not only are the coefficients of the σ²-terms larger on the LHS than their RHS counterparts, the LHS also carries an extra σ²_C3 term with non-negative coefficient, and a factor of two.
Moving on to the second evaluation criterion, we recall that Setup 4 is superior if Δ_S4 − Δ_S3 > θ*_S4 − θ*_S3; otherwise Setup 3 is superior under the same criterion. The full flexibility of the model can be seen by substituting Equations (14), (15), (18), and (19) into the inequality and rearranging to obtain

    ( n_1 [ n_2(μ_I2 − μ_C2) + n_3(μ_Ib − μ_C3) ] / (n_2 + n_3)
      − n_2 [ n_1(μ_I1 − μ_C1) + n_3(μ_Ia − μ_C3) ] / (n_1 + n_3) )
    / √( n_1(σ²_C1 + σ²_I1) + n_2(σ²_C2 + σ²_I2) + n_3(σ²_Ia + σ²_Ib) )
    >
    √2 z [ √( 2 { (1 + n_2/(n_1+n_3))² [ n_1(σ²_C1 + σ²_I1) + n_3(σ²_C3 + σ²_Ia) ]
                + (1 + n_1/(n_2+n_3))² [ n_2(σ²_C2 + σ²_I2) + n_3(σ²_C3 + σ²_Ib) ] }
              / [ n_1(σ²_C1 + σ²_I1) + n_2(σ²_C2 + σ²_I2) + n_3(σ²_Ia + σ²_Ib) ] ) − 1 ],   (27)

where z = z_{1−α/2} − z_{1−π_min}.
A key observation from inspecting Inequality (27) is that the LHS of the inequality scales as O(√n), where n is the number of users, while the RHS remains constant. This leads to the insight
that Setup 4 is more likely to be superior if the n's are large. Here we assume the ratio n_1 : n_2 : n_3 remains unchanged when we scale the number of samples, an assumption that generally holds when an organization increases its reach while maintaining its user mix. It is worth pointing out that our claim is stronger than that in previous work: we have shown that having a large user base not only fulfills the requirement for running a dual control experiment as described in [9], it also makes a dual control experiment a better setup than its simpler counterparts in terms of apparent and detectable effect sizes.
The scaling relationship can be seen more clearly if we apply some simplifying assumptions to the σ²- and n-terms. If we assume the response variances are similar across user groups (i.e. σ²_C1 ≈ σ²_I1 ≈ ⋯ ≈ σ²_Ib ≈ σ²), the RHS of Inequality (27) becomes

    √2 z [ √( 2(n_1 + n_2 + n_3)/(n_1 + n_3) + 2(n_1 + n_2 + n_3)/(n_2 + n_3) ) − 1 ],   (28)
which remains constant if the ratio n_1 : n_2 : n_3 remains unchanged. If we assume the numbers of users in groups 1, 2, and 3 are similar (i.e. n_1 = n_2 = n_3 = n), the LHS of Inequality (27) becomes

    √n [ (μ_I2 − μ_C2) − (μ_I1 − μ_C1) + μ_Ib − μ_Ia ] / ( 2 √( σ²_C1 + σ²_I1 + σ²_C2 + σ²_I2 + σ²_Ia + σ²_Ib ) ),   (29)

which clearly scales as O(√n).
We conclude the section by providing an indication of what a large n may look like. If we assume both the response variances and the numbers of users are similar across user groups, we can rearrange Inequality (27) to make n the subject:

    n > ( 2√12 (√6 − 1) z )² σ² / Δ²,   (30)

where Δ = (μ_I2 − μ_C2) − (μ_I1 − μ_C1) + μ_Ib − μ_Ia is the difference in actual effect sizes between Setups 4 and 3. Under a 5% significance level and 80% power, the first coefficient amounts to around 791, which is roughly 50 times the coefficient one would use to determine the sample size of a simple A/B test [11]. This suggests a dual control setup is perhaps a luxury accessible only to the largest advertising platforms and their top advertisers. For example, consider an experiment to optimize conversion rate where the baselines attain 20% (hence having a variance of 0.2(1 − 0.2) = 0.16). If there is a 2.5% relative (i.e. 0.5% absolute) effect between the competing strategies, the dual control setup will only be superior if there are more than 5M users in each user group.
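Inequality (30), and the worked example above, can be checked numerically; a sketch (the function name is ours, and the effect_diff argument plays the role of the Δ in Inequality (30)):

```python
from statistics import NormalDist

def setup4_min_group_size(var, effect_diff, alpha=0.05, power=0.8):
    """Inequality (30): minimum number of users per user group for the
    dual control (Setup 4) to beat Setup 3 under the second criterion,
    assuming similar variances and similar group sizes."""
    z = NormalDist().inv_cdf(1 - alpha / 2) - NormalDist().inv_cdf(1 - power)
    coef = (2 * 12 ** 0.5 * (6 ** 0.5 - 1) * z) ** 2  # ~791 at 5% / 80%
    return coef * var / effect_diff ** 2
```

Plugging in the conversion-rate example (variance 0.16, a 0.5% absolute effect difference) reproduces the "more than 5M users per group" figure.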
4 EXPERIMENTS
Having performed theoretical calculations for the actual and detectable effects, and for the conditions under which one experiment setup is superior to another, we now verify those calculations using simulation results. We focus on the results presented in Section 3.1, as the rest of the results follow from those calculations.
In each experiment setup evaluation, we randomly select the values of the parameters (i.e. the μ's, σ²'s, and n's), and take 1,000 actual effect samples, each by (i) sampling the responses from the user groups under the specified parameters, (ii) computing the mean for the analysis groups, and (iii) taking the difference of the means.
              Actual effect size    Minimum detectable effect
    Setup 1   1049/1099 (95.45%)    66/81 (81.48%)
    Setup 2   853/999 (85.38%)      87/106 (82.08%)
    Setup 3   922/1099 (83.89%)     93/116 (80.18%)
    Setup 4   240/333 (72.07%)      149/185 (80.54%)

Table 2: Number of evaluations where the theoretical value of the quantities (columns) falls within the 95% bootstrap confidence interval for each experiment setup (rows). See Section 4 for a detailed description of the evaluations.
We also take 100 MDE samples in separate evaluations, each by (i) sampling a critical value under the null hypothesis; (ii) computing the test power under a large number of possible effect sizes, each using the critical value and sampled metric means under the alternate hypothesis; and (iii) searching the effect size space for the value that gives the predefined power. As the power vs. effect size curve is noisy given the use of simulated power samples, we use the bisection algorithm provided by the noisyopt package to perform the search. The algorithm dynamically adjusts the number of samples taken from the same point on the curve to ensure the noise does not send us down the wrong search space.
We expect the mean of the sampled actual effects and MDEs to match the theoretical values. To verify this, we perform 1,000 bootstrap resamplings on the samples obtained above to obtain an empirical bootstrap distribution of the sample mean in each evaluation. The 95% bootstrap resampling confidence interval (BRCI) should then contain the theoretical mean 95% of the time. The histogram of the percentile rank of the theoretical quantity in relation to the bootstrap samples across multiple evaluations should also follow a uniform distribution [13].
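The BRCI construction described above can be sketched in a few lines of stdlib Python (a sketch of the procedure, not the paper's released implementation, which may differ in details):

```python
import random

def bootstrap_ci(samples, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval (BRCI) of the sample
    mean: resample with replacement n_boot times, then read off the
    (1-level)/2 and (1+level)/2 quantiles of the resampled means."""
    rng = random.Random(seed)
    k = len(samples)
    means = sorted(
        sum(rng.choice(samples) for _ in range(k)) / k
        for _ in range(n_boot)
    )
    lo = means[int((1 - level) / 2 * n_boot)]
    hi = means[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi
```

The theoretical actual effect (or MDE) should then fall inside the interval returned for roughly 95% of the evaluations.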
The result is shown in Table 2. One can observe that more evaluations have their theoretical quantity lying outside the BRCI than expected. Upon further investigation, we observed a characteristic ∩-shape in the histograms of the percentile ranks for the actual effects. This suggests the bootstrap samples may be under-dispersed but otherwise centered on the theoretical quantities.
We also observed the histograms for MDEs curving upward to the right, suggesting that the theoretical value is a slight overestimate (of < 1% relative to the bootstrap mean in all cases). We believe this is likely due to a small bias in the bisection algorithm. The algorithm tests whether the mean of the power samples is less than the target power to decide which half of the search space to continue along. Given we can bisect up to 10 times in each evaluation, a false positive is likely even when we set the significance level for individual comparisons to 1%. This leads to the algorithm favoring a smaller MDE sample. That said, since we have tested a wide range of parameters and the overall bias is small, we are satisfied with the theoretical quantities for experiment design purposes.
5 CONCLUSION
We have addressed the problem of comparing experiment designs for personalization strategies by presenting an evaluation framework that allows experimenters to evaluate which experiment setup
An Evaluation Framework for Personalization Strategy Experiment Designs AdKDD â20, Aug 2020, Online
should be adopted given their situation. The flexible framework can easily be extended to compare setups that involve more than two strategies by adding more user groups (i.e. new sets to the Venn diagram in Figure 1). A new setup can also be incorporated quickly, as it is essentially a different weighting of the user group-scenario combinations shown in Table 1. The framework also allows the development of simple rules of thumb such as:
(i) Metric dilution should never be employed if the experiment already has sufficient power, though it can be useful if the experiment is under-powered and the non-qualifying users provide a "stabilizing effect"; and
(ii) A dual control setup is superior to simpler setups only if one has access to the user base of the largest organizations.
We have validated the theoretical results via simulations, and made the code available² so that practitioners can benefit from the results immediately when designing their upcoming experiments.
Future Work. So far we assume the responses from each user group-scenario combination are randomly distributed with the mean independent of the variance, with the evaluation criteria calculated assuming the metric (being the weighted sample mean of the responses) is approximately normally distributed under the Central Limit Theorem. We are interested in whether the evaluation framework and its results still hold if we deviate from these assumptions, e.g. with binary responses (where the mean and variance are linked) and heavy-tailed responses (where the sample mean converges to a normal distribution slowly).
ACKNOWLEDGMENTS
The work is partially funded by the EPSRC CDT in Modern Statistics and Statistical Machine Learning at Imperial and Oxford (StatML.IO) and ASOS.com. The authors thank the anonymous reviewers for providing many improvements to the original manuscript.
REFERENCES
[1] Brooke Bengier and Amanda Knupp. [n.d.]. Selling More Stuff: The What, Why & How of Incrementality Testing. https://www.quantcast.com/blog/selling-more-stuff-the-what-why-how-of-incrementality-testing/. Blog post.
[2] Will Browne and Mike Swarbrick Jones. 2017. What Works in e-commerce - a Meta-analysis of 6700 Online Experiments. http://www.qubit.com/wp-content/uploads/2017/12/qubit-research-meta-analysis.pdf. White paper.
[3] Alex Deng and Victor Hu. 2015. Diluted Treatment Effect Estimation for Trigger Analysis in Online Controlled Experiments. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (Shanghai, China) (WSDM '15). Association for Computing Machinery, New York, NY, USA, 349–358. https://doi.org/10.1145/2684822.2685307
[4] Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-experiment Data. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (Rome, Italy) (WSDM '13). ACM, New York, NY, USA, 123–132. https://doi.org/10.1145/2433396.2433413
[5] Pavel Dmitriev, Somit Gupta, Dong Woo Kim, and Garnet Vaz. 2017. A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Halifax, NS, Canada) (KDD '17). ACM, New York, NY, USA, 1427–1436.
[6] Henning Hohnhold, Deirdre O'Brien, and Diane Tang. 2015. Focusing on the Long-term: It's Good for Users and Business. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Sydney, NSW, Australia) (KDD '15). ACM, New York, NY, USA, 1849–1858.
[7] Ramesh Johari, Pete Koomen, Leonid Pekelis, and David Walsh. 2017. Peeking at A/B Tests: Why It Matters, and What to Do About It. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Halifax, NS, Canada) (KDD '17). ACM, New York, NY, USA, 1517–1525.
[8] Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. Online Controlled Experiments at Large Scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chicago, Illinois, USA) (KDD '13). ACM, New York, NY, USA, 1168–1176.
[9] C. H. Bryan Liu, Elaine M. Bettaney, and Benjamin Paul Chamberlain. 2018. Designing Experiments to Measure Incrementality on Facebook. arXiv:1806.02588 [stat.ME]. 2018 AdKDD & TargetAd Workshop.
[10] C. H. Bryan Liu, Benjamin Paul Chamberlain, and Emma J. McCoy. 2020. What is the Value of Experimentation and Measurement?: Quantifying the Value and Risk of Reducing Uncertainty to Make Better Decisions. Data Science and Engineering 5, 2 (2020), 152–167. https://doi.org/10.1007/s41019-020-00121-5
[11] Evan Miller. 2010. How Not To Run an A/B Test. https://www.evanmiller.org/how-not-to-run-an-ab-test.html. Blog post.
[12] Alexey Poyarkov, Alexey Drutsa, Andrey Khalyavin, Gleb Gusev, and Pavel Serdyukov. 2016. Boosted Decision Tree Regression Adjustment for Variance Reduction in Online Controlled Experiments. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD '16). ACM, New York, NY, USA, 235–244. https://doi.org/10.1145/2939672.2939688
[13] Sean Talts, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018. Validating Bayesian Inference Algorithms with Simulation-Based Calibration. arXiv:1804.06788 [stat.ME]
[14] Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015. From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Sydney, NSW, Australia) (KDD '15). ACM, New York, NY, USA, 2227–2236.
SUPPLEMENTARY DOCUMENT
In this supplementary document, we revisit Section 3 of the paper "An Evaluation Framework for Personalization Strategy Experiment Designs" by Liu and McCoy (2020), and expand on the intermediate algebraic work when deriving:
(i) The actual & minimum detectable effect sizes, and
(ii) The conditions that lead to a setup being superior.
We will use the equation numbers featured in the original paper, and append letters for intermediate steps.
A EFFECT SIZE OF EXPERIMENT SETUPS
We begin with the actual effect size and the MDE of the four experiment setups that are featured in Section 3.1 of the paper. For each setup we first compute the sample size, mean response, and response variance in each analysis group (denoted $n_i$, $\mu_i$, and $\sigma^2_i$ respectively for each analysis group $i$). These quantities arise as a mixture of user groups described in Section 2.1 of the paper. We then substitute the quantities computed into the definitions of $\Delta$ and $\theta^*$:

$$\Delta \triangleq \mu_B - \mu_A, \qquad \text{(see (2))}$$

$$\theta^* \triangleq (z_{1-\alpha/2} - z_{1-\beta_{\min}})\,\sigma_{\hat{\Delta}}, \;\text{where}\; \sigma_{\hat{\Delta}} = \sqrt{\sigma^2_A/n_A + \sigma^2_B/n_B}, \qquad \text{(see (5))}$$

and $z_q$ represents the $q$-th quantile of a standard normal distribution, to obtain the setup-specific actual effect size and MDE for a setup with two analysis groups. For setups with more than two analysis groups, we will specify the actual and minimum detectable effect when we discuss specifics for each of the setups. We assume all random splits are done 50/50 in these setups to maximize the test power.
Setup 1 (Users in the intersection only). We recall the setup, which considers only users who qualify for both personalization strategies (i.e. the intersection), randomly splits user group 3 into two analysis groups, $A$ and $B$, each with the following number of samples:

$$n_A = n_B = \frac{n_3}{2}.$$

Users in analysis group $A$ are provided treatment under strategy 1, and users in analysis group $B$ are provided treatment under strategy 2. This leads to the groups' responses having the following metric mean and variance (where $\mu_{I\cap 1}$, $\sigma^2_{I\cap 1}$ and $\mu_{I\cap 2}$, $\sigma^2_{I\cap 2}$ denote the mean and variance of intersection users' responses when treated under strategies 1 and 2 respectively):

$$\mu_A = \mu_{I\cap 1}, \quad \mu_B = \mu_{I\cap 2}, \quad \sigma^2_A = \sigma^2_{I\cap 1}, \quad \sigma^2_B = \sigma^2_{I\cap 2}.$$

The actual effect size and MDE for Setup 1 are hence:

$$\Delta_{S1} = \mu_{I\cap 2} - \mu_{I\cap 1}, \qquad (7)$$

$$\theta^*_{S1} = (z_{1-\alpha/2} - z_{1-\beta_{\min}})\sqrt{\frac{\sigma^2_{I\cap 1}}{n_3/2} + \frac{\sigma^2_{I\cap 2}}{n_3/2}}. \qquad (8)$$
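Equations (7) and (8) can be evaluated directly. The sketch below is our own helper with illustrative parameter values; it treats the paper's $\beta_{\min}$ as the required power, so the multiplier $z_{1-\alpha/2} - z_{1-\beta_{\min}}$ is positive.

```python
from math import sqrt
from statistics import NormalDist

def z(q):
    """q-th quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(q)

def setup1_effect_and_mde(mu_int1, mu_int2, var_int1, var_int2, n3,
                          alpha=0.05, power=0.8):
    """Actual effect size (7) and MDE (8) for Setup 1 (intersection only).
    `power` plays the role of the paper's beta_min, so the multiplier
    z_{1-alpha/2} - z_{1-power} is positive (~2.80 at the defaults)."""
    delta = mu_int2 - mu_int1                                        # Eq. (7)
    mult = z(1 - alpha / 2) - z(1 - power)
    theta = mult * sqrt(var_int1 / (n3 / 2) + var_int2 / (n3 / 2))   # Eq. (8)
    return delta, theta

delta_s1, theta_s1 = setup1_effect_and_mde(1.0, 1.2, 4.0, 4.0, n3=10_000)
```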
Setup 2 (All samples). The setup, which considers all users regardless of whether they qualify for any strategy or not, also contains two analysis groups, $A$ and $B$, each taking half of the population:

$$n_A = n_B = \frac{n_0+n_1+n_2+n_3}{2}.$$

The mean response and response variance for groups $A$ and $B$ are the weighted mean response and response variance of the constituent user groups respectively, weighted by the constituent groups' size. As we only provide treatment to those who qualify for strategy 1 in group $A$, and those who qualify for strategy 2 in group $B$, this leads to different responses in different constituent user groups:

$$\mu_A = \frac{n_0\mu_{C0} + n_1\mu_{I1} + n_2\mu_{C2} + n_3\mu_{I\cap 1}}{n_0+n_1+n_2+n_3}, \qquad (9)$$
$$\mu_B = \frac{n_0\mu_{C0} + n_1\mu_{C1} + n_2\mu_{I2} + n_3\mu_{I\cap 2}}{n_0+n_1+n_2+n_3};$$
$$\sigma^2_A = \frac{n_0\sigma^2_{C0} + n_1\sigma^2_{I1} + n_2\sigma^2_{C2} + n_3\sigma^2_{I\cap 1}}{n_0+n_1+n_2+n_3}, \qquad (10)$$
$$\sigma^2_B = \frac{n_0\sigma^2_{C0} + n_1\sigma^2_{C1} + n_2\sigma^2_{I2} + n_3\sigma^2_{I\cap 2}}{n_0+n_1+n_2+n_3}.$$

Substituting the above into the definitions of actual effect size and MDE and simplifying the resultant expressions, we have:

$$\Delta_{S2} = \frac{n_1(\mu_{C1}-\mu_{I1}) + n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{I\cap 2}-\mu_{I\cap 1})}{n_0+n_1+n_2+n_3}, \qquad (11)$$

$$\theta^*_{S2} = (z_{1-\alpha/2}-z_{1-\beta_{\min}})\sqrt{\frac{2\big(n_0(2\sigma^2_{C0}) + n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})\big)}{(n_0+n_1+n_2+n_3)^2}}. \qquad (12)$$
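Since both analysis groups are size-weighted mixtures of user groups, Equations (9)-(12) can be computed by one generic helper. This is our own sketch with illustrative numbers, not the paper's code.

```python
from math import sqrt
from statistics import NormalDist

def z(q):
    """q-th quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(q)

def two_group_effect_and_mde(groups_a, groups_b, alpha=0.05, power=0.8):
    """Actual effect and MDE for a setup whose two analysis groups are 50/50
    random splits. groups_a / groups_b list (n, mu, var) triples for the
    constituent user groups (full group sizes): means and variances are
    size-weighted as in Eqs. (9)-(10), and the MDE follows Eq. (5) with
    n_A = n_B = N/2. `power` plays the role of the paper's beta_min."""
    n = sum(g[0] for g in groups_a)  # total population N (same for both)
    mu_a = sum(g[0] * g[1] for g in groups_a) / n
    mu_b = sum(g[0] * g[1] for g in groups_b) / n
    var_a = sum(g[0] * g[2] for g in groups_a) / n
    var_b = sum(g[0] * g[2] for g in groups_b) / n
    mult = z(1 - alpha / 2) - z(1 - power)
    return mu_b - mu_a, mult * sqrt(var_a / (n / 2) + var_b / (n / 2))

# Setup 2: group 0 is untreated in both halves; half A treats strategy-1
# qualifiers, half B treats strategy-2 qualifiers. All numbers illustrative.
n0, n1, n2, n3 = 4000, 3000, 2000, 1000
v = 4.0
groups_a = [(n0, 1.00, v), (n1, 1.20, v), (n2, 1.05, v), (n3, 1.25, v)]
groups_b = [(n0, 1.00, v), (n1, 1.10, v), (n2, 1.30, v), (n3, 1.35, v)]
delta_s2, theta_s2 = two_group_effect_and_mde(groups_a, groups_b)
```

The same helper covers Setup 3: simply omit the group-0 triples from both lists.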
Setup 3 (Qualified users only). The setup is very similar to Setup 2, with members of user group 0 excluded:

$$n_A = n_B = \frac{n_1+n_2+n_3}{2}.$$

The absence of members of user group 0 means they are not featured in the weighted mean response and response variance of the two analysis groups:

$$\mu_A = \frac{n_1\mu_{I1} + n_2\mu_{C2} + n_3\mu_{I\cap 1}}{n_1+n_2+n_3}, \quad \mu_B = \frac{n_1\mu_{C1} + n_2\mu_{I2} + n_3\mu_{I\cap 2}}{n_1+n_2+n_3};$$
$$\sigma^2_A = \frac{n_1\sigma^2_{I1} + n_2\sigma^2_{C2} + n_3\sigma^2_{I\cap 1}}{n_1+n_2+n_3}, \quad \sigma^2_B = \frac{n_1\sigma^2_{C1} + n_2\sigma^2_{I2} + n_3\sigma^2_{I\cap 2}}{n_1+n_2+n_3}. \qquad (13)$$

This leads to the following actual effect size and MDE for Setup 3:

$$\Delta_{S3} = \frac{n_1(\mu_{C1}-\mu_{I1}) + n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{I\cap 2}-\mu_{I\cap 1})}{n_1+n_2+n_3}, \qquad (14)$$

$$\theta^*_{S3} = (z_{1-\alpha/2}-z_{1-\beta_{\min}})\sqrt{\frac{2\big(n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})\big)}{(n_1+n_2+n_3)^2}}. \qquad (15)$$
Setup 4 (Dual control). Setup 4 is unique amongst the experiment setups introduced as it has four analysis groups. Two of the analysis groups are drawn from those who qualify for strategy 1 and are allocated to the first half of the population, and the other two are drawn from those who qualify for strategy 2 and are allocated to the second half:

$$n_{A1} = n_{A2} = \frac{n_1+n_3}{4}, \qquad n_{B1} = n_{B2} = \frac{n_2+n_3}{4}. \qquad (16)$$

The mean response and response variance of each analysis group are the weighted metric mean response and response variance of
the user groups involved respectively:

$$\mu_{A1} = \frac{n_1\mu_{C1} + n_3\mu_{C3}}{n_1+n_3}, \quad \mu_{A2} = \frac{n_1\mu_{I1} + n_3\mu_{I\cap 1}}{n_1+n_3},$$
$$\mu_{B1} = \frac{n_2\mu_{C2} + n_3\mu_{C3}}{n_2+n_3}, \quad \mu_{B2} = \frac{n_2\mu_{I2} + n_3\mu_{I\cap 2}}{n_2+n_3};$$
$$\sigma^2_{A1} = \frac{n_1\sigma^2_{C1} + n_3\sigma^2_{C3}}{n_1+n_3}, \quad \sigma^2_{A2} = \frac{n_1\sigma^2_{I1} + n_3\sigma^2_{I\cap 1}}{n_1+n_3},$$
$$\sigma^2_{B1} = \frac{n_2\sigma^2_{C2} + n_3\sigma^2_{C3}}{n_2+n_3}, \quad \sigma^2_{B2} = \frac{n_2\sigma^2_{I2} + n_3\sigma^2_{I\cap 2}}{n_2+n_3}. \qquad (17)$$

As the setup takes the difference of differences (i.e. the difference of groups $B2$ and $B1$, and the difference of groups $A2$ and $A1$), the actual effect size is specified, post-simplification, as follows:

$$\Delta_{S4} = (\mu_{B2} - \mu_{B1}) - (\mu_{A2} - \mu_{A1}) = \frac{n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{I\cap 2}-\mu_{C3})}{n_2+n_3} - \frac{n_1(\mu_{I1}-\mu_{C1}) + n_3(\mu_{I\cap 1}-\mu_{C3})}{n_1+n_3}. \qquad (18)$$

The MDE for Setup 4 is similar to that specified in the RHS of Inequality (5), albeit with more groups:

$$\theta^*_{S4} = (z_{1-\alpha/2}-z_{1-\beta_{\min}})\sqrt{\sigma^2_{A1}/n_{A1} + \sigma^2_{A2}/n_{A2} + \sigma^2_{B1}/n_{B1} + \sigma^2_{B2}/n_{B2}}$$
$$= 2\,(z_{1-\alpha/2}-z_{1-\beta_{\min}})\sqrt{\frac{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_3(\sigma^2_{C3}+\sigma^2_{I\cap 1})}{(n_1+n_3)^2} + \frac{n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{C3}+\sigma^2_{I\cap 2})}{(n_2+n_3)^2}}. \qquad (19)$$
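A sketch of Equations (18) and (19) as a helper of our own; the parameter names mirror the paper's per-group means and variances, and all numbers are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z(q):
    """q-th quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(q)

def setup4_effect_and_mde(n1, n2, n3,
                          mu_c1, mu_i1, mu_c2, mu_i2, mu_c3, mu_int1, mu_int2,
                          v_c1, v_i1, v_c2, v_i2, v_c3, v_int1, v_int2,
                          alpha=0.05, power=0.8):
    """Difference-in-differences effect (18) and MDE (19) for the dual
    control setup; `power` plays the role of the paper's beta_min."""
    delta = ((n2 * (mu_i2 - mu_c2) + n3 * (mu_int2 - mu_c3)) / (n2 + n3)
             - (n1 * (mu_i1 - mu_c1) + n3 * (mu_int1 - mu_c3)) / (n1 + n3))
    mult = z(1 - alpha / 2) - z(1 - power)
    theta = 2 * mult * sqrt(
        (n1 * (v_c1 + v_i1) + n3 * (v_c3 + v_int1)) / (n1 + n3) ** 2
        + (n2 * (v_c2 + v_i2) + n3 * (v_c3 + v_int2)) / (n2 + n3) ** 2)
    return delta, theta

delta_s4, theta_s4 = setup4_effect_and_mde(
    3000, 2000, 1000,
    1.10, 1.20, 1.05, 1.30, 1.15, 1.25, 1.35,
    4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0)
```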
B METRIC DILUTION
We then expand the calculations in Section 3.2 of the paper, where we discuss the conditions under which an experiment setup with metric dilution (Setup 2) will emerge superior to one without metric dilution (Setup 3), and vice versa.
B.1 The first criterion
We first show that $\theta^*_{S3} < \theta^*_{S2}$, the condition which will lead to Setup 3 being superior to Setup 2 under the first criterion, is equivalent to

$$\big(n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})\big)\cdot\frac{n_0+2n_1+2n_2+2n_3}{2(n_1+n_2+n_3)^2} < \sigma^2_{C0}. \qquad (20)$$

We start by substituting the expressions for $\theta^*_{S2}$ (Equation (12)) and $\theta^*_{S3}$ (Equation (15)) into the inequality $\theta^*_{S3} < \theta^*_{S2}$ to obtain

$$(z_{1-\alpha/2}-z_{1-\beta_{\min}})\sqrt{\frac{2\big(n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})\big)}{(n_1+n_2+n_3)^2}}$$
$$< (z_{1-\alpha/2}-z_{1-\beta_{\min}})\sqrt{\frac{2\big(n_0(2\sigma^2_{C0}) + n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})\big)}{(n_0+n_1+n_2+n_3)^2}}. \qquad (20a)$$

Canceling the $z_{1-\alpha/2}-z_{1-\beta_{\min}}$ and $\sqrt{2}$ terms on both sides, and then squaring, yields

$$\frac{n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}{(n_1+n_2+n_3)^2} < \frac{n_0(2\sigma^2_{C0}) + n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}{(n_0+n_1+n_2+n_3)^2}. \qquad (20b)$$

We then write $\xi = n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})$ and move the $\xi$ term on the RHS to the LHS:

$$\xi\left(\frac{1}{(n_1+n_2+n_3)^2} - \frac{1}{(n_0+n_1+n_2+n_3)^2}\right) < \frac{n_0(2\sigma^2_{C0})}{(n_0+n_1+n_2+n_3)^2}. \qquad (20c)$$

As the partial fractions can be consolidated as

$$\frac{1}{(n_1+n_2+n_3)^2} - \frac{1}{(n_0+n_1+n_2+n_3)^2} = \frac{(n_0+n_1+n_2+n_3)^2-(n_1+n_2+n_3)^2}{(n_1+n_2+n_3)^2(n_0+n_1+n_2+n_3)^2} = \frac{(n_0+2n_1+2n_2+2n_3)\,n_0}{(n_1+n_2+n_3)^2(n_0+n_1+n_2+n_3)^2}, \qquad (20d)$$

where the second step utilizes the identity $a^2-b^2=(a+b)(a-b)$, Inequality (20c) can be written as

$$\xi\left(\frac{(n_0+2n_1+2n_2+2n_3)\,n_0}{(n_1+n_2+n_3)^2(n_0+n_1+n_2+n_3)^2}\right) < \frac{n_0(2\sigma^2_{C0})}{(n_0+n_1+n_2+n_3)^2}. \qquad (20e)$$

We finally cancel the $n_0$ and $(n_0+n_1+n_2+n_3)^2$ terms on both sides of Inequality (20e), move the factor of two to the LHS, and write $\xi$ in its full form to arrive at Inequality (20):

$$\big(n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})\big)\cdot\frac{n_0+2n_1+2n_2+2n_3}{2(n_1+n_2+n_3)^2} < \sigma^2_{C0}.$$
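The equivalence between $\theta^*_{S3} < \theta^*_{S2}$ and Inequality (20) can be spot-checked numerically. The sketch below treats $\xi$ as a free positive scalar (which suffices for the algebraic check) and drops the common $z_{1-\alpha/2}-z_{1-\beta_{\min}}$ multiplier that cancels in (20a).

```python
from math import sqrt
import random

random.seed(3)

def theta_s3_over_z(n1, n2, n3, xi):
    """theta*_{S3} of Eq. (15) without the common quantile-difference
    multiplier; xi is the bracketed variance sum shared by Setups 2 and 3."""
    return sqrt(2 * xi / (n1 + n2 + n3) ** 2)

def theta_s2_over_z(n0, n1, n2, n3, v_c0, xi):
    """theta*_{S2} of Eq. (12) without the common multiplier."""
    return sqrt(2 * (n0 * 2 * v_c0 + xi) / (n0 + n1 + n2 + n3) ** 2)

# Inequality (20) should hold exactly when theta*_{S3} < theta*_{S2}.
mismatches, checked = 0, 0
for _ in range(1000):
    n0, n1, n2, n3 = (random.randint(1, 10_000) for _ in range(4))
    v_c0 = random.uniform(0.1, 10.0)
    xi = random.uniform(0.1, 1e5)
    lhs20 = xi * (n0 + 2 * n1 + 2 * n2 + 2 * n3) / (2 * (n1 + n2 + n3) ** 2)
    if abs(lhs20 - v_c0) < 1e-9 * v_c0:
        continue  # skip draws too close to the boundary for float comparison
    checked += 1
    direct = theta_s3_over_z(n1, n2, n3, xi) < theta_s2_over_z(n0, n1, n2, n3, v_c0, xi)
    mismatches += (lhs20 < v_c0) != direct
```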
B.2 The second criterion
In the case where Inequality (20) does not hold, we consider when Setup 3 will emerge superior to Setup 2 under the second criterion:

$$\Delta_{S3} - \Delta_{S2} > \theta^*_{S3} - \theta^*_{S2}.$$

If this inequality does not hold either (and both sides are not equal), we consider Setup 2 as superior to Setup 3 under the same criterion, as the following holds:

$$\Delta_{S3} - \Delta_{S2} < \theta^*_{S3} - \theta^*_{S2} \iff \Delta_{S2} - \Delta_{S3} > \theta^*_{S2} - \theta^*_{S3}.$$

The master inequality. We first show that the inequality $\Delta_{S3} - \Delta_{S2} > \theta^*_{S3} - \theta^*_{S2}$ is equivalent to

$$\frac{n_1+n_2+n_3}{n_0}\sqrt{2n_0\sigma^2_{C0} + \xi} > \frac{n_0+n_1+n_2+n_3}{n_0}\sqrt{\xi} - \frac{\eta}{\sqrt{2}\,z}, \qquad (22)$$

where
$$\eta = n_1(\mu_{C1}-\mu_{I1}) + n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{I\cap 2}-\mu_{I\cap 1}),$$
$$\xi = n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{I2}+\sigma^2_{C2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2}), \text{ and}$$
$$z = z_{1-\alpha/2} - z_{1-\beta_{\min}}.$$
Writing $\eta$, $\xi$, and $z$ as shown above, we substitute in the expressions for $\Delta_{S2}$, $\theta^*_{S2}$, $\Delta_{S3}$, and $\theta^*_{S3}$ (Equations (11), (12), (14), and (15) respectively) into the initial inequality to obtain

$$\frac{\eta}{n_1+n_2+n_3} - \frac{\eta}{n_0+n_1+n_2+n_3} > z\sqrt{\frac{2\xi}{(n_1+n_2+n_3)^2}} - z\sqrt{\frac{2\big(n_0(2\sigma^2_{C0})+\xi\big)}{(n_0+n_1+n_2+n_3)^2}}. \qquad (22a)$$

Pulling out the common factors on each side we have

$$\eta\left(\frac{1}{n_1+n_2+n_3} - \frac{1}{n_0+n_1+n_2+n_3}\right) > \sqrt{2}\,z\left(\frac{\sqrt{\xi}}{n_1+n_2+n_3} - \frac{\sqrt{2n_0\sigma^2_{C0}+\xi}}{n_0+n_1+n_2+n_3}\right). \qquad (22b)$$

Writing the partial fraction on the LHS of Inequality (22b) as a composite fraction we have

$$\eta\left(\frac{n_0}{(n_1+n_2+n_3)(n_0+n_1+n_2+n_3)}\right) > \sqrt{2}\,z\left(\frac{\sqrt{\xi}}{n_1+n_2+n_3} - \frac{\sqrt{2n_0\sigma^2_{C0}+\xi}}{n_0+n_1+n_2+n_3}\right). \qquad (22c)$$

We then move the composite fraction to the RHS and the $\sqrt{2}\,z$ term to the LHS:

$$\frac{\eta}{\sqrt{2}\,z} > \frac{(n_1+n_2+n_3)(n_0+n_1+n_2+n_3)}{n_0}\left(\frac{\sqrt{\xi}}{n_1+n_2+n_3} - \frac{\sqrt{2n_0\sigma^2_{C0}+\xi}}{n_0+n_1+n_2+n_3}\right), \qquad (22d)$$

and expand the brackets, canceling terms that appear on both sides of the fractions in the RHS:

$$\frac{\eta}{\sqrt{2}\,z} > \frac{n_0+n_1+n_2+n_3}{n_0}\sqrt{\xi} - \frac{n_1+n_2+n_3}{n_0}\sqrt{2n_0\sigma^2_{C0}+\xi}. \qquad (22e)$$

Finally, we swap the position of the leftmost term with that of the rightmost term in Inequality (22e) to arrive at Inequality (22):

$$\frac{n_1+n_2+n_3}{n_0}\sqrt{2n_0\sigma^2_{C0} + \xi} > \frac{n_0+n_1+n_2+n_3}{n_0}\sqrt{\xi} - \frac{\eta}{\sqrt{2}\,z}.$$
The trivial case: RHS ≤ 0. We first observe that the LHS of Inequality (22) is always positive, and hence the inequality trivially holds if the RHS is non-positive. Here we show RHS ≤ 0 is equivalent to

$$\frac{n_0+n_1+n_2+n_3}{n_0}\,\theta^*_{S3} \le \Delta_{S3}. \qquad (23)$$

This can be done by writing RHS ≤ 0 in full:

$$\frac{n_0+n_1+n_2+n_3}{n_0}\sqrt{\xi} - \frac{\eta}{\sqrt{2}\,z} \le 0, \qquad (23a)$$

and moving the second term on the LHS to the RHS:

$$\frac{n_0+n_1+n_2+n_3}{n_0}\sqrt{\xi} \le \frac{\eta}{\sqrt{2}\,z}. \qquad (23b)$$

We then multiply both sides by a factor of $\sqrt{2}\,z/(n_1+n_2+n_3)$ (which is positive) to get

$$\frac{n_0+n_1+n_2+n_3}{n_0}\cdot\frac{\sqrt{2}\,z\,\sqrt{\xi}}{n_1+n_2+n_3} \le \frac{\eta}{n_1+n_2+n_3}. \qquad (23c)$$

Noting from Equations (14) and (15) that

$$\Delta_{S3} = \frac{\eta}{n_1+n_2+n_3} \quad\text{and}\quad \theta^*_{S3} = \sqrt{2}\cdot z\cdot\frac{\sqrt{\xi}}{n_1+n_2+n_3},$$

we finally replace the terms in Inequality (23c) with $\Delta_{S3}$ and $\theta^*_{S3}$ to arrive at Inequality (23):

$$\frac{n_0+n_1+n_2+n_3}{n_0}\,\theta^*_{S3} \le \Delta_{S3}.$$
The non-trivial case: RHS > 0. We then tackle the case where the RHS of the master inequality (Inequality (22)) is greater than zero. We show that in this non-trivial case, Inequality (22) is equivalent to

$$\frac{2\sigma^2_{C0}}{n_0} > \frac{\left(\theta^*_{S3} - \Delta_{S3} + \frac{n_1+n_2+n_3}{n_0}\theta^*_{S3}\right)^2 - \left(\frac{n_1+n_2+n_3}{n_0}\theta^*_{S3}\right)^2}{2z^2}. \qquad (24)$$

We first multiply both sides of Inequality (22) with the fraction $n_0\sqrt{2}\,z/(n_1+n_2+n_3)$ to get

$$\frac{n_1+n_2+n_3}{n_0}\sqrt{2n_0\sigma^2_{C0}+\xi}\cdot\frac{n_0\sqrt{2}\,z}{n_1+n_2+n_3} > \left(\frac{n_0+n_1+n_2+n_3}{n_0}\sqrt{\xi} - \frac{\eta}{\sqrt{2}\,z}\right)\frac{n_0\sqrt{2}\,z}{n_1+n_2+n_3}. \qquad (24a)$$

Canceling terms on both sides of the fractions we have

$$\sqrt{2n_0\sigma^2_{C0}+\xi}\cdot\sqrt{2}\,z > (n_0+n_1+n_2+n_3)\,\frac{\sqrt{\xi}\,\sqrt{2}\,z}{n_1+n_2+n_3} - \frac{n_0\,\eta}{n_1+n_2+n_3}. \qquad (24b)$$

Again noting the identities for $\Delta_{S3}$ and $\theta^*_{S3}$ stated above, we can replace the fractions on the RHS and obtain

$$\sqrt{2n_0\sigma^2_{C0}+\xi}\cdot\sqrt{2}\,z > (n_0+n_1+n_2+n_3)\,\theta^*_{S3} - n_0\,\Delta_{S3}. \qquad (24c)$$

We then square both sides of Inequality (24c) and move the $2z^2$ term to the RHS:

$$2n_0\sigma^2_{C0} + \xi > \frac{\big((n_0+n_1+n_2+n_3)\,\theta^*_{S3} - n_0\,\Delta_{S3}\big)^2}{2z^2}. \qquad (24d)$$

Note the squaring still allows the implication to go both ways, as both sides of Inequality (24c) are positive (the RHS being positive is exactly the non-trivial case assumption). Based on the identity for $\theta^*_{S3}$, we observe $\xi$ can also be written as

$$\xi = \frac{(n_1+n_2+n_3)^2\,(\theta^*_{S3})^2}{2z^2}. \qquad (24e)$$

Thus, we can group all terms with a $2z^2$ denominator by moving $\xi$ in Inequality (24d) to the RHS and substituting Equation (24e) into the resultant inequality:

$$2n_0\sigma^2_{C0} > \frac{\big((n_0+n_1+n_2+n_3)\,\theta^*_{S3} - n_0\,\Delta_{S3}\big)^2 - \big((n_1+n_2+n_3)\,\theta^*_{S3}\big)^2}{2z^2}. \qquad (24f)$$

We finally normalize the inequality to one with unit $\Delta_{S3}$ and $\theta^*_{S3}$ to enable effective comparison. We divide both sides of Inequality (24f) by $n_0^2$:

$$\frac{2\sigma^2_{C0}}{n_0} > \frac{\left(\frac{n_0+n_1+n_2+n_3}{n_0}\theta^*_{S3} - \Delta_{S3}\right)^2 - \left(\frac{n_1+n_2+n_3}{n_0}\theta^*_{S3}\right)^2}{2z^2}, \qquad (24g)$$
and split the coefficient of $\theta^*_{S3}$ in the first squared term into an integer ($1$) and a fractional ($(n_1+n_2+n_3)/n_0$) part to arrive at Inequality (24):

$$\frac{2\sigma^2_{C0}}{n_0} > \frac{\left(\theta^*_{S3} - \Delta_{S3} + \frac{n_1+n_2+n_3}{n_0}\theta^*_{S3}\right)^2 - \left(\frac{n_1+n_2+n_3}{n_0}\theta^*_{S3}\right)^2}{2z^2}.$$
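The chain (22) ⟺ (24) can also be spot-checked numerically. The sketch below draws random parameters, skips the trivial case (where the RHS of (22) is non-positive) and near-boundary draws, and uses an illustrative value for the quantile difference $z$.

```python
from math import sqrt
import random

random.seed(4)

z = 2.8  # illustrative value of z_{1-alpha/2} - z_{1-beta_min}

mismatches, checked = 0, 0
for _ in range(1000):
    n0 = random.randint(1, 10_000)
    n123 = random.randint(1, 10_000)           # n1 + n2 + n3
    v_c0 = random.uniform(0.1, 10.0)           # sigma^2_C0
    xi = random.uniform(1.0, 1e5)              # the variance sum xi
    eta = random.uniform(-100.0, 100.0)        # the mean-difference sum eta
    delta_s3 = eta / n123                      # Eq. (14)
    theta_s3 = sqrt(2) * z * sqrt(xi) / n123   # Eq. (15)
    rhs22 = (n0 + n123) / n0 * sqrt(xi) - eta / (sqrt(2) * z)
    if rhs22 <= 0:
        continue  # trivial case: Inequality (22) holds via (23)
    lhs22 = n123 / n0 * sqrt(2 * n0 * v_c0 + xi)
    if abs(lhs22 - rhs22) < 1e-9 * max(lhs22, rhs22):
        continue  # too close to the boundary for float comparison
    checked += 1
    ineq22 = lhs22 > rhs22
    rhs24 = ((theta_s3 - delta_s3 + n123 / n0 * theta_s3) ** 2
             - (n123 / n0 * theta_s3) ** 2) / (2 * z ** 2)
    ineq24 = 2 * v_c0 / n0 > rhs24
    mismatches += ineq22 != ineq24
```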
C DUAL CONTROL
We finally clarify the calculations in Section 3.3 of the paper, where we determine the sample size required for Setup 4 (a.k.a. a dual control setup) to emerge superior to Setup 3 (a simpler A/B test setup). In the paper we showed that $\theta^*_{S4} > \theta^*_{S3}$ always holds, and hence Setup 4 will never be superior to Setup 3 under the first evaluation criterion. We thus focus on the second evaluation criterion $\Delta_{S4} - \Delta_{S3} > \theta^*_{S4} - \theta^*_{S3}$, and first show that the criterion is equivalent to Inequality (27). Assuming the ratio of user group sizes $n_1 : n_2 : n_3$ remains unchanged, we then show how
(i) The RHS of Inequality (27) remains a constant, and
(ii) The LHS of Inequality (27) scales as $O(\sqrt{n})$, where $n$ is the number of users.
The results mean Setup 4 could emerge superior to Setup 3 if we have a sufficiently large number of users. We also show from the inequality that
(iii) The number of users required for a dual control setup to emerge superior to simpler setups is accessible only to the largest organizations and their top affiliates.
The master inequality. We first show the criterion $\Delta_{S4} - \Delta_{S3} > \theta^*_{S4} - \theta^*_{S3}$ is equivalent to

$$\frac{\frac{n_1\big(n_2(\mu_{I2}-\mu_{C2})+n_3(\mu_{I\cap 2}-\mu_{C3})\big)}{n_2+n_3} - \frac{n_2\big(n_1(\mu_{I1}-\mu_{C1})+n_3(\mu_{I\cap 1}-\mu_{C3})\big)}{n_1+n_3}}{\sqrt{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}}$$
$$> \sqrt{2}\,z\left[\sqrt{2\cdot\frac{\left(1+\frac{n_2}{n_1+n_3}\right)^2\big[n_1(\sigma^2_{C1}+\sigma^2_{I1})+n_3(\sigma^2_{C3}+\sigma^2_{I\cap 1})\big] + \left(1+\frac{n_1}{n_2+n_3}\right)^2\big[n_2(\sigma^2_{C2}+\sigma^2_{I2})+n_3(\sigma^2_{C3}+\sigma^2_{I\cap 2})\big]}{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}} - 1\right], \qquad (27)$$

where $z = z_{1-\alpha/2} - z_{1-\beta_{\min}}$. The number of terms involved is large, and hence we first simplify the LHS and RHS independently, and combine them in the final step.
For the LHS (i.e. $\Delta_{S4} - \Delta_{S3}$), we substitute in Equations (14) and (18) to obtain

$$\frac{n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{I\cap 2}-\mu_{C3})}{n_2+n_3} - \frac{n_1(\mu_{I1}-\mu_{C1}) + n_3(\mu_{I\cap 1}-\mu_{C3})}{n_1+n_3} - \frac{n_1(\mu_{C1}-\mu_{I1}) + n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{I\cap 2}-\mu_{I\cap 1})}{n_1+n_2+n_3}. \qquad (27a)$$

The expression can be rewritten in terms of multiplicative products between the $n$-terms and the (differences between) $\mu$-terms:

$$n_1(\mu_{I1}-\mu_{C1})\left[-\frac{1}{n_1+n_3}+\frac{1}{n_1+n_2+n_3}\right] + n_2(\mu_{I2}-\mu_{C2})\left[\frac{1}{n_2+n_3}-\frac{1}{n_1+n_2+n_3}\right]$$
$$+\, n_3\mu_{I\cap 2}\left[\frac{1}{n_2+n_3}-\frac{1}{n_1+n_2+n_3}\right] + n_3\mu_{I\cap 1}\left[-\frac{1}{n_1+n_3}+\frac{1}{n_1+n_2+n_3}\right] + n_3\mu_{C3}\left[-\frac{1}{n_2+n_3}+\frac{1}{n_1+n_3}\right]. \qquad (27b)$$

We then extract a $1/(n_1+n_2+n_3)$ term from Expression (27b):

$$(n_1+n_2+n_3)^{-1}\Bigg[n_1(\mu_{I1}-\mu_{C1})\left(-\frac{n_1+n_2+n_3}{n_1+n_3}+1\right) + n_2(\mu_{I2}-\mu_{C2})\left(\frac{n_1+n_2+n_3}{n_2+n_3}-1\right)$$
$$+\, n_3\mu_{I\cap 2}\left(\frac{n_1+n_2+n_3}{n_2+n_3}-1\right) + n_3\mu_{I\cap 1}\left(-\frac{n_1+n_2+n_3}{n_1+n_3}+1\right) + n_3\mu_{C3}\left(-\frac{n_1+n_2+n_3}{n_2+n_3}+\frac{n_1+n_2+n_3}{n_1+n_3}\right)\Bigg]. \qquad (27c)$$

This allows us to perform some cancellation with the RHS, which also has a $1/(n_1+n_2+n_3)$ term, in the final step. Noting

$$\frac{n_1+n_2+n_3}{n_1+n_3} = 1 + \frac{n_2}{n_1+n_3} \quad\text{and}\quad \frac{n_1+n_2+n_3}{n_2+n_3} = 1 + \frac{n_1}{n_2+n_3},$$

we can write the inner square brackets as

$$(n_1+n_2+n_3)^{-1}\Bigg[n_1(\mu_{I1}-\mu_{C1})\left(-\frac{n_2}{n_1+n_3}\right) + n_2(\mu_{I2}-\mu_{C2})\left(\frac{n_1}{n_2+n_3}\right) + n_3\mu_{I\cap 2}\left(\frac{n_1}{n_2+n_3}\right)$$
$$+\, n_3\mu_{I\cap 1}\left(-\frac{n_2}{n_1+n_3}\right) + n_3\mu_{C3}\left(-1-\frac{n_1}{n_2+n_3}+1+\frac{n_2}{n_1+n_3}\right)\Bigg], \qquad (27d)$$
and group the $n_1/(n_2+n_3)$ and $n_2/(n_1+n_3)$ terms to arrive at

$$\frac{1}{n_1+n_2+n_3}\left[\frac{n_1}{n_2+n_3}\big(n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{I\cap 2}-\mu_{C3})\big) - \frac{n_2}{n_1+n_3}\big(n_1(\mu_{I1}-\mu_{C1}) + n_3(\mu_{I\cap 1}-\mu_{C3})\big)\right]. \qquad (27e)$$

For the RHS (i.e. $\theta^*_{S4} - \theta^*_{S3}$), we substitute in Equations (15) and (19) to obtain

$$2z\sqrt{\frac{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_3(\sigma^2_{C3}+\sigma^2_{I\cap 1})}{(n_1+n_3)^2} + \frac{n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{C3}+\sigma^2_{I\cap 2})}{(n_2+n_3)^2}}$$
$$-\,\sqrt{2}\,z\sqrt{\frac{n_1(\sigma^2_{I1}+\sigma^2_{C1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}{(n_1+n_2+n_3)^2}}, \qquad (27f)$$

where $z = z_{1-\alpha/2} - z_{1-\beta_{\min}}$. We then extract a $\sqrt{2}\,z/(n_1+n_2+n_3)$ term from Expression (27f) to arrive at

$$\frac{\sqrt{2}\,z}{n_1+n_2+n_3}\Bigg[\sqrt{2}\sqrt{\left(\frac{n_1+n_2+n_3}{n_1+n_3}\right)^2\big[n_1(\sigma^2_{C1}+\sigma^2_{I1})+n_3(\sigma^2_{C3}+\sigma^2_{I\cap 1})\big] + \left(\frac{n_1+n_2+n_3}{n_2+n_3}\right)^2\big[n_2(\sigma^2_{C2}+\sigma^2_{I2})+n_3(\sigma^2_{C3}+\sigma^2_{I\cap 2})\big]}$$
$$-\,\sqrt{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}\Bigg], \qquad (27g)$$

where $(n_1+n_2+n_3)/(n_2+n_3)$ and $(n_1+n_2+n_3)/(n_1+n_3)$ can also be written as $1+n_1/(n_2+n_3)$ and $1+n_2/(n_1+n_3)$ respectively.

We finally combine both sides of the inequality by taking Expressions (27e) and (27g):

$$\frac{1}{n_1+n_2+n_3}\left[\frac{n_1}{n_2+n_3}\big(n_2(\mu_{I2}-\mu_{C2}) + n_3(\mu_{I\cap 2}-\mu_{C3})\big) - \frac{n_2}{n_1+n_3}\big(n_1(\mu_{I1}-\mu_{C1}) + n_3(\mu_{I\cap 1}-\mu_{C3})\big)\right]$$
$$> \frac{\sqrt{2}\,z}{n_1+n_2+n_3}\Bigg[\sqrt{2}\sqrt{\left(1+\frac{n_2}{n_1+n_3}\right)^2\big[n_1(\sigma^2_{C1}+\sigma^2_{I1})+n_3(\sigma^2_{C3}+\sigma^2_{I\cap 1})\big] + \left(1+\frac{n_1}{n_2+n_3}\right)^2\big[n_2(\sigma^2_{C2}+\sigma^2_{I2})+n_3(\sigma^2_{C3}+\sigma^2_{I\cap 2})\big]}$$
$$-\,\sqrt{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}\Bigg]. \qquad (27h)$$
Canceling the common $1/(n_1+n_2+n_3)$ terms on both sides, and dividing both sides by $\sqrt{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}$, leads us to Inequality (27):

$$\frac{\frac{n_1\big(n_2(\mu_{I2}-\mu_{C2})+n_3(\mu_{I\cap 2}-\mu_{C3})\big)}{n_2+n_3} - \frac{n_2\big(n_1(\mu_{I1}-\mu_{C1})+n_3(\mu_{I\cap 1}-\mu_{C3})\big)}{n_1+n_3}}{\sqrt{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}}$$
$$> \sqrt{2}\,z\left[\sqrt{2\cdot\frac{\left(1+\frac{n_2}{n_1+n_3}\right)^2\big[n_1(\sigma^2_{C1}+\sigma^2_{I1})+n_3(\sigma^2_{C3}+\sigma^2_{I\cap 1})\big] + \left(1+\frac{n_1}{n_2+n_3}\right)^2\big[n_2(\sigma^2_{C2}+\sigma^2_{I2})+n_3(\sigma^2_{C3}+\sigma^2_{I\cap 2})\big]}{n_1(\sigma^2_{C1}+\sigma^2_{I1}) + n_2(\sigma^2_{C2}+\sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}} - 1\right].$$

RHS remains a constant. We then simplify the $\sigma^2$-terms in Inequality (27) by assuming that they are similar in magnitude, i.e.

$$\sigma^2_{C1} \approx \sigma^2_{I1} \approx \cdots \approx \sigma^2_{I\cap 2} \approx \sigma^2_R,$$

and show the RHS of the inequality is equal to

$$\sqrt{2}\,z\left[\sqrt{\frac{2(n_1+n_2+n_3)}{n_1+n_3} + \frac{2(n_1+n_2+n_3)}{n_2+n_3}} - 1\right]. \qquad (28)$$

As long as the group size ratio $n_1 : n_2 : n_3$ remains unchanged, Expression (28) will remain a constant. It is safe to apply the simplifying assumption, as we know from the evaluation framework specification that there are three classes of parameters in the inequality: the user group sizes ($n$), the mean responses ($\mu$), and the response variances ($\sigma^2$). Among these three classes of parameters, only the user group sizes have the potential to scale in any practical setting, and thus we can effectively treat the means and variances as constants below.
We begin by substituting $\sigma^2_R$ into Inequality (27):

$$\frac{\frac{n_1\big(n_2(\mu_{I2}-\mu_{C2})+n_3(\mu_{I\cap 2}-\mu_{C3})\big)}{n_2+n_3} - \frac{n_2\big(n_1(\mu_{I1}-\mu_{C1})+n_3(\mu_{I\cap 1}-\mu_{C3})\big)}{n_1+n_3}}{\sqrt{n_1(2\sigma^2_R) + n_2(2\sigma^2_R) + n_3(2\sigma^2_R)}}$$
$$> \sqrt{2}\,z\left[\sqrt{2\cdot\frac{\left(1+\frac{n_2}{n_1+n_3}\right)^2\big[n_1(2\sigma^2_R)+n_3(2\sigma^2_R)\big] + \left(1+\frac{n_1}{n_2+n_3}\right)^2\big[n_2(2\sigma^2_R)+n_3(2\sigma^2_R)\big]}{n_1(2\sigma^2_R) + n_2(2\sigma^2_R) + n_3(2\sigma^2_R)}} - 1\right]. \qquad (28a)$$

Moving the common $2\sigma^2_R$ terms out and canceling the common terms in the RHS fraction we have

$$\frac{\frac{n_1\big(n_2(\mu_{I2}-\mu_{C2})+n_3(\mu_{I\cap 2}-\mu_{C3})\big)}{n_2+n_3} - \frac{n_2\big(n_1(\mu_{I1}-\mu_{C1})+n_3(\mu_{I\cap 1}-\mu_{C3})\big)}{n_1+n_3}}{\sqrt{2\sigma^2_R(n_1+n_2+n_3)}}$$
$$> \sqrt{2}\,z\left[\sqrt{2\cdot\frac{\left(1+\frac{n_2}{n_1+n_3}\right)^2(n_1+n_3) + \left(1+\frac{n_1}{n_2+n_3}\right)^2(n_2+n_3)}{n_1+n_2+n_3}} - 1\right]. \qquad (28b)$$

We can already see the LHS of Inequality (28b) scales as $O(\sqrt{n})$; we will demonstrate this result in greater detail below. Focusing on the RHS of the inequality, we express the squared terms as rational fractions and divide each term in the numerator by the denominator to obtain

$$\sqrt{2}\,z\left[\sqrt{2\left(\left(\frac{n_1+n_2+n_3}{n_1+n_3}\right)^2\frac{n_1+n_3}{n_1+n_2+n_3} + \left(\frac{n_1+n_2+n_3}{n_2+n_3}\right)^2\frac{n_2+n_3}{n_1+n_2+n_3}\right)} - 1\right]. \qquad (28c)$$

Canceling the common $n_1+n_2+n_3$ terms leads to that presented in Expression (28):

$$\sqrt{2}\,z\left[\sqrt{\frac{2(n_1+n_2+n_3)}{n_1+n_3} + \frac{2(n_1+n_2+n_3)}{n_2+n_3}} - 1\right].$$
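A quick numerical check (with an arbitrary $z$ multiplier and illustrative group sizes) that the RHS of Inequality (27) collapses to Expression (28) once all response variances are set to a common $\sigma^2_R$:

```python
from math import sqrt

def rhs27(n1, n2, n3, v_c1, v_i1, v_c2, v_i2, v_c3, v_int1, v_int2, z=2.8):
    """RHS of Inequality (27); z is the (illustrative) quantile difference."""
    xi = n1 * (v_c1 + v_i1) + n2 * (v_c2 + v_i2) + n3 * (v_int1 + v_int2)
    a = (1 + n2 / (n1 + n3)) ** 2 * (n1 * (v_c1 + v_i1) + n3 * (v_c3 + v_int1))
    b = (1 + n1 / (n2 + n3)) ** 2 * (n2 * (v_c2 + v_i2) + n3 * (v_c3 + v_int2))
    return sqrt(2) * z * (sqrt(2 * (a + b) / xi) - 1)

def rhs28(n1, n2, n3, z=2.8):
    """Expression (28): the same quantity when all variances are equal."""
    N = n1 + n2 + n3
    return sqrt(2) * z * (sqrt(2 * N / (n1 + n3) + 2 * N / (n2 + n3)) - 1)

v = 4.0  # common response variance sigma^2_R
diff = abs(rhs27(3000, 2000, 1000, v, v, v, v, v, v, v) - rhs28(3000, 2000, 1000))
```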
LHS scales as $O(\sqrt{n})$. We demonstrate the scaling relation between the LHS of Inequality (27) and the number of users in each group by simplifying the $n$-terms (but not the $\sigma^2$-terms as above), assuming $n_1 \approx n_2 \approx n_3 \approx n$, and showing the LHS of the inequality is equal to

$$\frac{\sqrt{n}\,\big((\mu_{I2}-\mu_{C2}) - (\mu_{I1}-\mu_{C1}) + \mu_{I\cap 2} - \mu_{I\cap 1}\big)}{2\sqrt{\sigma^2_{C1}+\sigma^2_{I1}+\sigma^2_{C2}+\sigma^2_{I2}+\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2}}}. \qquad (29)$$

While the relationship (that the LHS of the inequality scales as $O(\sqrt{n})$) is evident by inspecting the LHS of Inequality (28b) or even Inequality (27) itself, we believe the simplification allows us to show the relationship more clearly.

We begin by substituting $n$ into Inequality (27) to obtain

$$\frac{\frac{n\big(n(\mu_{I2}-\mu_{C2})+n(\mu_{I\cap 2}-\mu_{C3})\big)}{n+n} - \frac{n\big(n(\mu_{I1}-\mu_{C1})+n(\mu_{I\cap 1}-\mu_{C3})\big)}{n+n}}{\sqrt{n(\sigma^2_{C1}+\sigma^2_{I1}) + n(\sigma^2_{C2}+\sigma^2_{I2}) + n(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}}$$
$$> \sqrt{2}\,z\left[\sqrt{2\cdot\frac{\left(1+\frac{n}{n+n}\right)^2\big[n(\sigma^2_{C1}+\sigma^2_{I1})+n(\sigma^2_{C3}+\sigma^2_{I\cap 1})\big] + \left(1+\frac{n}{n+n}\right)^2\big[n(\sigma^2_{C2}+\sigma^2_{I2})+n(\sigma^2_{C3}+\sigma^2_{I\cap 2})\big]}{n(\sigma^2_{C1}+\sigma^2_{I1}) + n(\sigma^2_{C2}+\sigma^2_{I2}) + n(\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2})}} - 1\right]. \qquad (29a)$$

Moving the common $n$-terms out and canceling them in the fractions where appropriate leads to

$$\frac{\sqrt{n}\,\frac{1}{2}\big[((\mu_{I2}-\mu_{C2}) + (\mu_{I\cap 2}-\mu_{C3})) - ((\mu_{I1}-\mu_{C1}) + (\mu_{I\cap 1}-\mu_{C3}))\big]}{\sqrt{\sigma^2_{C1}+\sigma^2_{I1}+\sigma^2_{C2}+\sigma^2_{I2}+\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2}}}$$
$$> \sqrt{2}\,z\left[\sqrt{2\cdot\frac{\left(1+\frac{1}{2}\right)^2(\sigma^2_{C1}+\sigma^2_{I1}+\sigma^2_{C3}+\sigma^2_{I\cap 1}) + \left(1+\frac{1}{2}\right)^2(\sigma^2_{C2}+\sigma^2_{I2}+\sigma^2_{C3}+\sigma^2_{I\cap 2})}{\sigma^2_{C1}+\sigma^2_{I1}+\sigma^2_{C2}+\sigma^2_{I2}+\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2}}} - 1\right], \qquad (29b)$$

where the LHS is equal to Expression (29) as claimed above.

It is clear that there are no $n$-terms left on the RHS of Inequality (29b), and hence the RHS remains a constant as shown previously. Setting up the inequality to demonstrate the third result, that the number of users required for a dual control setup to emerge superior is large, we further simplify the RHS of the inequality by
rearranging the terms in the square root:

$$\frac{\sqrt{n}\,\big[((\mu_{I2}-\mu_{C2}) + (\mu_{I\cap 2}-\mu_{C3})) - ((\mu_{I1}-\mu_{C1}) + (\mu_{I\cap 1}-\mu_{C3}))\big]}{2\sqrt{\sigma^2_{C1}+\sigma^2_{I1}+\sigma^2_{C2}+\sigma^2_{I2}+\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2}}}$$
$$> \sqrt{2}\,z\left[\sqrt{2\left(\frac{3}{2}\right)^2\left(1 + \frac{2\sigma^2_{C3}}{\sigma^2_{C1}+\sigma^2_{I1}+\sigma^2_{C2}+\sigma^2_{I2}+\sigma^2_{I\cap 1}+\sigma^2_{I\cap 2}}\right)} - 1\right]. \qquad (29c)$$
Required number of users is large. We finally show that while Setup 4 could emerge superior to Setup 3 as the number of users increases, the number of users required is high. We do so by assuming both the $\sigma^2$- and $n$-terms are similar in magnitude, i.e. $\sigma^2_{C1} \approx \sigma^2_{I1} \approx \cdots \approx \sigma^2_{I\cap 2} \approx \sigma^2_R$ and $n_1 \approx n_2 \approx n_3 \approx n$, and show that Inequality (27) is equivalent to

$$n > \left(2\sqrt{12}\,\big(\sqrt{6}-1\big)\,z\right)^2 \frac{\sigma^2_R}{\Delta^2}, \qquad (30)$$

where $\Delta = (\mu_{I2}-\mu_{C2}) - (\mu_{I1}-\mu_{C1}) + \mu_{I\cap 2} - \mu_{I\cap 1}$ is the actual effect size difference between Setups 4 and 3. Note we are determining when Setup 4 is superior to Setup 3 under the second evaluation criterion, that the gain in actual effect is greater than the loss in sensitivity, and thus assume $\Delta$ is positive.

The equivalence can be shown by substituting $\sigma^2_R$ into Inequality (29c), which already assumes the $n$-terms are similar in magnitude:⁸

$$\frac{\sqrt{n}\,\big[((\mu_{I2}-\mu_{C2}) + (\mu_{I\cap 2}-\mu_{C3})) - ((\mu_{I1}-\mu_{C1}) + (\mu_{I\cap 1}-\mu_{C3}))\big]}{2\sqrt{6\sigma^2_R}} > \sqrt{2}\,z\left[\sqrt{2\left(\frac{3}{2}\right)^2\left(1 + \frac{2\sigma^2_R}{6\sigma^2_R}\right)} - 1\right]. \qquad (30a)$$

Noting the expression within the LHS square brackets is equal to $\Delta$, we simplify the expression within the RHS square root, and move every non-$n$ term to the RHS of the inequality to obtain

$$\sqrt{n} > \sqrt{2}\,z\,\big[\sqrt{6}-1\big]\,\frac{2\sqrt{6\sigma^2_R}}{\Delta}. \qquad (30b)$$

As all quantities in the inequality are positive, we can square both sides and consolidate the coefficients on the RHS to arrive at Inequality (30):

$$n > \left(2\sqrt{12}\,\big(\sqrt{6}-1\big)\,z\right)^2 \frac{\sigma^2_R}{\Delta^2}.$$
⁸ Alternatively we can substitute $n$ into Inequality (28b), which already assumes the $\sigma^2$-terms are similar in magnitude. Simplifying the resultant inequality would yield the same end result.
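Inequality (30) translates directly into a sample size calculator. The sketch below uses an illustrative response variance and effect gain; with $\sigma^2_R = 25$ and $\Delta = 0.05$ it returns roughly 7.9 million users per group, illustrating why only the largest organizations can benefit from a dual control setup.

```python
from math import sqrt
from statistics import NormalDist

def required_users_per_group(var_r, delta, alpha=0.05, power=0.8):
    """Users required in each of groups 1-3 (n1 = n2 = n3 = n) for the dual
    control setup to beat the simple A/B setup under the second criterion,
    per Inequality (30). `delta` is the effect-size gain
    (mu_I2 - mu_C2) - (mu_I1 - mu_C1) + mu_Icap2 - mu_Icap1, assumed > 0;
    `power` plays the role of the paper's beta_min."""
    z = NormalDist().inv_cdf
    z_term = z(1 - alpha / 2) - z(1 - power)
    return (2 * sqrt(12) * (sqrt(6) - 1) * z_term) ** 2 * var_r / delta ** 2

n_required = required_users_per_group(var_r=25.0, delta=0.05)
```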