
Customizing Marketing Decisions Using Field Experiments

Duncan Simester*, Artem Timoshenko*, and Spyros I. Zoumpoulis†

*Marketing, MIT Sloan School of Management, Massachusetts Institute of Technology
†Decision Sciences, INSEAD

October 3, 2016

Abstract

We investigate how firms can use the results of field experiments to optimize the targeting of promotions. We evaluate seven widely-used segmentation methods using a series of two large-scale field experiments. The first field experiment is used to generate a common pool of training data for each of the seven methods. We then validate the seven optimized policies provided by each method together with uniform benchmark policies in a second field experiment. The findings reveal that model-driven methods (Lasso regression and Finite Mixture Models) performed the best. Some distance-driven methods also performed well (particularly k-Nearest Neighbors). However, the classification methods we tested performed relatively poorly. The precision of the data varied with the level of aggregation, and the model-driven methods made the best use of the information when it was more precise. The model-driven methods also performed best in parts of the parameter space that are well represented in the training data. The relative performance of the methods is robust to modest differences in the settings used to create the training and validation data. We explain the poor performance of the classification methods in our setting and describe how it could be improved.

1 Introduction

The marketing literature has a long tradition of developing methods for optimizing marketing decisions. Although field experiments have been used to validate these methods, there has been little research investigating the feasibility of using a sequence of field experiments to optimize marketing decisions themselves.1 In this paper we investigate how firms can use the results of a field experiment to optimize the allocation of different treatments to different customer segments. In doing so, we bridge two literatures: the customer segmentation literature and the literature studying firms' use of field experiments.

We contemplate settings in which firms can randomly vary treatments to individual customers or groups of customers. After implementing an experiment, many firms will want to learn more than just which treatment performed best. They will generally want to optimize the policy by choosing different treatments for different segments of customers. The problem of segmenting observations using a set of observed characteristics is a well-established problem. Within marketing, the customer segmentation literature is extensive, and includes a wide variety of proposed methods. The different methods offer different strengths, and it is unclear which methods are best suited to designing an optimal policy using experimental data. We will compare the performance of a broad range of methods for a customer segmentation problem for a large retailer.

The research is conducted in two stages. In the first (training) stage we implement a large field experiment involving approximately 1.2 million households. We randomly assign households to three experimental conditions, including two promotion conditions and a control. We use the results of the first-stage experiment as training data that we submit to seven different segmentation methods. These methods group households into segments using demographic and geographic variables. From these seven groupings we obtain seven optimized policies that we validate in the second stage.

In the second (validation) stage we conduct a second experiment in which we randomly assign 4.1 million households into ten experimental conditions. Three experimental conditions correspond to assigning uniformly the same two promotion conditions and the control (no mail) condition from Stage 1. The other seven conditions implement the optimal mailing policy suggested by each of the methods used to analyze the Stage 1 data. Comparing these optimized conditions reveals which segmentation method yields the most profitable policy. Moreover, comparison of the optimal policies with the uniform policies, in which all customers receive the same promotion (or no promotion), provides a measure of the extent to which the optimization improves upon a uniform naive policy.

We find that model-driven methods (Lasso regression and Finite Mixture Models) yielded the largest profits. Some distance-driven methods also performed well, in particular k-Nearest Neighbors. However, the classification methods we tested (CHAID and Support Vector Machines) performed relatively poorly.

Further investigation reveals that the model-driven methods were the best performers in parts of the parameter space that are well represented in the training data. The model-driven methods also made the best use of the increased precision of the available information. When aggregation meant that the precision was low, all methods performed similarly. As the precision of the information improved because it was aggregated across fewer households, the performance of model-based methods exceeded that of the other methods.

1 Examples of studies that use field experiments to validate optimization methods include pricing models (Mantrala et al., 2006; Belloni et al., 2012), advertising models (Skiera and Nabout, 2013; Simester et al., 2006; Li and Kannan, 2014; Urban et al., 2014) and market research or product development models (Neslin et al., 2009; Urban et al., 1997).


We investigate why the model-driven and distance-driven methods outperformed the classification methods in our setting. Model-driven and distance-driven methods construct optimal policies based upon the expected profit of each treatment. In contrast, classification methods only use information about which treatment does better in different areas of the parameter space. They discard information about the magnitude of the profit differences.

There are several features of our research setting that are worth considering. First, we focus on profit maximization for new customers. As a result, the firm cannot use purchase histories of the prospective customers as a basis for segmentation, because the customers have not yet purchased. Instead, the segmentation variables include variables describing demographics, distance measures, and past response measures from each geography. These types of variables are often used in customer acquisition problems. However, we recognize that the method that performs best in the customer acquisition setting may not be the method that performs best when targeting existing customers, using the customers' own purchase histories.

We also distinguish our setting from digital settings, in which firms are able to run tens of thousands of randomized experiments each day. In these settings, the opportunity to run large volumes of experiments introduces a different segmentation problem. In particular, a firm may choose to use a sequence of experiments rather than a single experiment to design customer segments. It is possible that the method that works best when designing segments from a single experiment is not the best method to use when there is an opportunity to use a large number of experiments. Nevertheless, our basic approach could be extended to A/B testing in a digital setting by having more than two stages with a longer sequence of experiments.

The paper proceeds in Section 2, where we describe the Stage 1 and Stage 2 field experiments. In Section 3 we introduce the seven segmentation methods and in Section 4 we present our results. We discuss the performance of classification methods in Section 5 and conclude in Section 6.

2 Data Overview

Data for this study was provided by a large retailer that operates a large number of stores. The retailer sells a broad range of products including perishables, sundries, and durables. Customers can only purchase if they have signed up for a membership. The retailer regularly mails promotions to prospective members in order to increase its membership base. We study two types of promotional offers. The first offer is a $25 paid 12-month membership, which represents a 50% discount off the full price. The second offer is a 120-day free trial. If a customer wants to remain a member after the trial period, she purchases a membership at the full price.

When new customers register for a membership they provide their name and mailing address. This is used to identify which households responded to which direct mail promotions.

In the first-stage (training) experiment, the 1.2 million households in the mailing list were all located in a single geographic region. Assignment to the three experimental conditions was randomized at the household level. These included a control group that received no promotion offer, a group that received the $25 paid offer, and a group that received the 120-day free trial. The promotions were highlighted on the front cover, inside front cover and back cover of a 48-page book of product-specific coupons. The product-specific coupons did not vary across the three experimental treatments (the only experimental variation involved the front cover, inside front cover and back cover).

The treatments were repeated twice approximately six weeks apart, so customers received the same offer twice. The first treatment was implemented in early February 2015, and it was then repeated in late March. We calculated the outcome using 63 days after the in-home date to identify new members, and 77 days of transactions after the in-home date from these new members. The data was used to construct a profit measure, which is the response variable. This measure takes into account mailing costs, membership revenue and an average profit margin earned on purchases in the stores.2 Although the specific calculation of the profit measure is confidential, it was the same between conditions (and also the same for both waves of experiments).

The retailer purchased different mailing lists of households for each of the two stages of the experiment. The first list contained 1,185,141 households. Thirteen descriptive variables were available at the household or zip-code level including: distances to the closest retailer's and competitors' stores, type of housing, income and age characteristics, and a prior membership history. A complete list of variables together with their definitions and summary statistics are presented in the Appendix. The housing type, income, and age variables were measured at the household level, while the remaining variables were measured at the zip-code level.

When analyzing the Stage 1 experiment, we aggregate the households up to the carrier route level.3 This aggregation yielded 5,976 carrier routes. We calculated the average of the household-level characteristics to obtain a carrier-route level measure.

Because we randomized at the household level in Stage 1, each carrier route includes households randomly assigned to all three experimental treatments. Therefore, for each carrier route we can calculate the profit earned under each of the experimental treatments: $25 paid offer, 120-day free trial, and control.

The mailing list in the second-stage (validation) experiment contained 4,119,244 households organized into 10,419 carrier routes.4 The Stage 2 experiment shared many of the same features as the Stage 1 experiment. The study involved the same retailer sending direct mail solicitations to prospective customers identified through rented mailing lists. As in the Stage 1 experiment, the mailing treatments in the Stage 2 experiment were repeated six weeks apart, with customers receiving the same treatment in each wave. The promotions were also the same, including a $25 discounted membership, a 120-day free trial, and a no-mail (control) condition. We also have the same thirteen descriptive variables that were available in the first-stage experiment. Summary statistics for the second-stage mailing list are provided in the Appendix.

However, there are several differences in the design of Stages 1 and 2. First, the households were in a much broader geographic area in Stage 2, with just 2% of the households located in the same geographic area as the first-stage experiment. Second, the second-stage experiment was conducted in the fall (starting on August 23 2015), while the first-stage experiment was conducted in the spring (of the same year). Third, the Stage 1 treatments were randomly assigned at the household level, while in Stage 2 the ten mailing policies were randomly assigned at the carrier route level.5 This decision was motivated by lower mailing costs. The United States Postal Service offers cheaper mailing rates if every household in a carrier route receives exactly the same mailing. Fourth, in Stage 2 the promotional offers were mailed to prospective customers using a postcard, which was printed on both sides and highlighted the offer on each side. Recall that in Stage 1 the offers were printed on the outside and inside covers of a 48-page book of coupons. Finally, the Stage 2 experiment coincided with a mass media advertising campaign by the retailer. There was no such mass media advertising during the Stage 1 experiment.

2 There is a small number of customers that purchased products (primarily cigarettes) from the retailer to sell at their own retail locations. This introduced outliers to measures of store revenue. At the suggestion of the retailer, store revenue over $15,000 was truncated at $15,000. This affected just eight of the 4,119,244 observations in the fall validation data. It did not affect any of the 1,185,141 observations in the spring training data.

3 The United States Postal Service organizes households into carrier routes. These routes identify the households that each letter carrier visits. There are typically 200 to 400 households per carrier route and they all fall within the same zip code.

4 The carrier routes in the Stage 2 experiment contained an average of 395 households, while the carrier routes in the Stage 1 experiment contained an average of only 198 households. This reflects differences in the geographic regions in which these carrier routes were located.

5 For Stage 2 we randomized the 4,119,244 households into seven optimized conditions (one for each of the seven segmentation methods), and three uniform conditions ($25 paid offer, 120-day free trial, no mailing).

Table 1: Summary of notation

Term     Description
N        The number of carrier route observations in Stage 1, i.e., 5,976
p        The number of descriptive variables, i.e., 13
i        The index of the observation (carrier route)
X        An N × p matrix capturing the descriptive variables for all observations
x_i      The ith row of matrix X, i.e., a 1 × p vector that holds the values of all the descriptive variables for observation i
y^t      An N × 1 response vector capturing realized profit for all observations under treatment t
y^t_i    Realized profit for observation i under treatment t
ŷ^t_i    Predicted profit for observation i under treatment t
t*_i     Optimal treatment for observation i

As we will discuss, these types of differences between the experiments are typical in a field setting. They are likely to lead to differences in the response functions between the training and validation experiments, which will tend to diminish the accuracy of the optimized policies. However, the randomization ensures that this affects all of the optimization methods. Moreover, this variation provides an opportunity to compare the methods in realistic conditions. In practice, these types of differences are common, and a comparison of the methods across identical experiments is perhaps less informative than a comparison of the methods in realistic conditions.

3 Customer Segmentation Methods

3.1 Taxonomy

For every observation (i.e., carrier route) in the training set, we obtained the profit outcomes associated with the three types of treatment. The problem of assigning an optimal treatment to new customers based on descriptive variables can be approached in two distinct ways. The regression-based approach involves training three separate models (one per treatment) to predict profit outcomes for each new observation, and assigning the treatment which yields the highest predicted profit. To make a prediction, we consider both non-parametric distance-driven and parametric model-driven methods. Another approach to the problem is to consider classification methods. The classification methods assign an optimal treatment to new observations based upon which treatment performed best for the observations in the training set. In this section, we provide an overview of the considered methods and describe the details of their implementation.
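To make the regression-based branch of this taxonomy concrete, the following minimal Python sketch shows the final step shared by all distance- and model-driven methods: predict profit under every treatment and assign each new observation the treatment with the highest prediction. The function name and the `predict_fns` container are our own illustrative choices, not code from the study.

```python
import numpy as np

def assign_treatments(predict_fns, X_new):
    """Regression-based assignment: predict profit under every treatment and,
    for each new observation, pick the treatment with the highest prediction.

    predict_fns : dict mapping a treatment label to a function X_new -> predicted profit
    X_new       : (M, p) array of descriptive variables for the new observations
    """
    treatments = list(predict_fns)
    # Stack the per-treatment predictions into an (M, T) matrix
    preds = np.column_stack([predict_fns[t](X_new) for t in treatments])
    best = preds.argmax(axis=1)
    return [treatments[j] for j in best]
```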


3.2 Distance-driven Methods

Distance-driven methods make the best possible profit predictions for each new (i.e., Stage 2) observation, for each treatment, by using the Stage 1 observations that are the closest to the new observation, where “closest” is meant in terms of some distance metric computed from the observations' descriptive variables. Each observation is then assigned the treatment that results in the highest predicted profit.

3.2.1 Kernel Regression

Overview. Kernel regression is a non-parametric technique that finds a non-linear relation between the descriptive variables and the response variable, i.e., profit. This non-linear function is estimated using a kernel as a weighting function. We estimate a different function for each treatment, based on the Stage 1 observations, and we assign to each new observation the treatment that results in the highest predicted profit.

Implementation and cross-validation. In the kernel regression approach, we estimate the following function for each treatment t for a new (i.e., Stage 2) observation with descriptive variables $x_{\text{new}}$:

$$\hat{y}^t_{\text{new}} = \frac{\sum_{i=1}^{N} K_\gamma(x_{\text{new}}, x_i)\, w^t_i\, y^t_i}{\sum_{i=1}^{N} K_\gamma(x_{\text{new}}, x_i)\, w^t_i}, \qquad (1)$$

where N is the number of observations in Stage 1, $K_\gamma(x, x_i) = e^{-\gamma \|x - x_i\|^2}$ is a Gaussian kernel, $w^t_i$ is a weight reflecting the number of households in carrier route i that were treated with treatment t in Stage 1, and $y^t_i$ is the average effect of treatment t in Stage 1 over the households in carrier route i that were treated with treatment t.

There are four key elements to be calibrated in the kernel regression. First, we decide on the form of the conditional expectation function. In our study, we use the Nadaraya-Watson kernel estimator. Second, we decide on the kernel, i.e., the weighting function. The radial basis function (also called Gaussian) kernel is most often used and is our choice.

Third, we select a distance metric. Our goal is to demonstrate the taxonomy of segmentation techniques and their comparison, so we favor simplicity across all methods. In particular, we use the Euclidean distance for all distance-driven methods:

$$\|x_i - x_j\| = \sqrt{(x_{i1} - x_{j1})^2 + \cdots + (x_{ip} - x_{jp})^2}, \qquad (2)$$

where p is the number of descriptive variables of each carrier route i.

The last parameter to be specified for the kernel regression estimator is the bandwidth γ of the kernel. We use cross-validation to find the best bandwidth. At every iteration of the cross-validation, we randomly split the Stage 1 observations into a training set (80%) and a validation set (20%). For each treatment, and for each observation in the validation set, we derive a prediction using the kernel regression estimator with the training set observations, and we compute a prediction mean squared error (MSE). The optimal bandwidth is fine-tuned separately for each treatment to minimize the average MSE over 200 cross-validation iterations.
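A minimal sketch of this estimator and the bandwidth search, assuming the Stage 1 data are held in numpy arrays; the function names and the candidate grid are illustrative, not the implementation used in the study.

```python
import numpy as np

def kernel_predict(X_train, y_train, w_train, X_new, gamma):
    """Weighted Nadaraya-Watson prediction (Equation 1) for one treatment.

    X_train : (N, p) descriptive variables of the Stage 1 carrier routes
    y_train : (N,)   average profit under the treatment
    w_train : (N,)   number of households treated in each carrier route
    X_new   : (M, p) descriptive variables of the new observations
    gamma   : bandwidth of the Gaussian kernel
    """
    # Squared Euclidean distances between every new and every training observation
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-gamma * d2)                      # Gaussian kernel weights, shape (M, N)
    return (K @ (w_train * y_train)) / (K @ w_train)

def select_bandwidth(X, y, w, gammas, n_iter=200, seed=0):
    """Pick the bandwidth minimizing average validation MSE over random 80/20 splits."""
    rng = np.random.default_rng(seed)
    errors = np.zeros(len(gammas))
    for _ in range(n_iter):
        idx = rng.permutation(len(y))
        cut = int(0.8 * len(y))
        tr, va = idx[:cut], idx[cut:]
        for j, g in enumerate(gammas):
            pred = kernel_predict(X[tr], y[tr], w[tr], X[va], g)
            errors[j] += np.mean((pred - y[va]) ** 2)
    return gammas[int(np.argmin(errors))]
```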


Assigning a treatment to new observations in the Stage 2 experiment. Having cross-validated the kernel regression estimator, for each new (i.e., Stage 2) observation we derive a predicted profit based on the kernel regression, for each of the three treatments. We then assign to the new observation the treatment that results in the highest predicted profit.

3.2.2 k-Nearest Neighbors (k-NN)

Overview. The k-nearest neighbors approach approximates the effectiveness of each treatment for each new observation by averaging the profit under each treatment across the k Stage 1 observations that are the closest to the new observation. It then selects for each new observation the treatment with the highest predicted profit.

Implementation and cross-validation. In the k-nearest neighbors approach, there are two key elements to be calibrated: the distance measure and the number of neighbors k to be considered. The choice of the distance measure for the k-nearest neighbors method is discussed in Stone (1977). To be consistent across the proposed distance-based methods, and for simplicity, we use the Euclidean distance.

The optimal number of neighbors k to use generally depends upon the dimensionality of the space of explanatory variables and the distribution of explanatory variables and observations. To fine-tune the number of neighbors, we conduct cross-validation similar to the cross-validation for the kernel regression. For each cross-validation iteration and for each observation in the validation set, we identify the observation's k nearest neighbors among the training set. The observation's predicted profit under each treatment is then computed as a weighted average of the profit (under the respective treatment) of the observation's k nearest neighbors in the training set. We repeat this process for a range of different k's. The number of neighbors k is fine-tuned separately for each treatment to minimize the average MSE of the predictions over all cross-validation iterations.
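A short sketch of the prediction step for one treatment, under the same assumptions (numpy arrays, Euclidean distance); weighting neighbors by their household counts is our reading of the weighted average described above.

```python
import numpy as np

def knn_predict(X_train, y_train, w_train, X_new, k):
    """Predict profit for each new observation as the household-weighted average
    profit of its k nearest Stage 1 carrier routes (Euclidean distance)."""
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d2, axis=1)[:, :k]            # indices of the k closest routes
    w = w_train[nn]                                # household counts of the neighbors
    y = y_train[nn]                                # profits of the neighbors
    return (w * y).sum(axis=1) / w.sum(axis=1)
```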

Assigning a treatment to new observations in the Stage 2 experiment. For each new (i.e., Stage 2) observation we derive a predicted profit by weight-averaging profit across its k nearest neighbors from the Stage 1 observations, for each of the three treatments. We then assign to the new observation the treatment that results in the highest predicted profit.

3.2.3 Hierarchical Clustering (HC)

Overview. Hierarchical clustering is a classic greedy clustering technique that links pairs of Stage 1 observations that are in close proximity. These binary clusters are grouped into larger clusters, until a hierarchical tree is formed. The hierarchical tree is then cut to create a partition of the observations into the desired number of clusters. For each new observation, a predicted profit is derived as a weighted average of the profit of the Stage 1 observations in the cluster that are the closest to the new observation, and the new observation is assigned the treatment that achieves the highest predicted profit.

Implementation and cross-validation. Three key elements need to be calibrated: the distance measure, the linkage criterion, and the desired number of clusters. For the distance measure between pairs of observations, we use Euclidean distance to be consistent with the other distance-based methods that we employ. The linkage criterion determines how clusters will be grouped with other clusters and observations to form higher-level clusters. We use a minimum distance linkage criterion, which sets the distance between two clusters to be the minimum distance between observations in the two clusters.

To fine-tune the number of clusters, we cross-validate using a cross-validation procedure similar to the one described previously. For each cross-validation iteration, we perform hierarchical clustering on the training set. We do so for a range of values for the number of clusters. We make predictions as follows: for each observation in the validation set, its closest cluster from the training set is identified. Then the predicted profit under each treatment is calculated as the weighted average of the profit (under the respective treatment) of the training set observations that lie in the closest cluster. The closest cluster is selected based on the minimum distance criterion. The number of clusters is fine-tuned separately for each treatment to minimize the average MSE of the predictions over all cross-validation iterations.
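The sketch below illustrates this procedure with SciPy's single-linkage (minimum distance) clustering; the household-weighted cluster average and the nearest-point rule for choosing the closest cluster follow the description above, but the code itself is only an illustrative reconstruction, not the study's implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hc_predict(X_train, y_train, w_train, X_new, n_clusters):
    """Cluster the Stage 1 routes with single-linkage hierarchical clustering,
    then predict profit for each new observation from its closest cluster."""
    Z = linkage(X_train, method="single", metric="euclidean")   # minimum-distance linkage
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")    # cut the tree into clusters
    preds = np.empty(len(X_new))
    for m, x in enumerate(X_new):
        d2 = ((X_train - x) ** 2).sum(axis=1)
        cluster = labels[np.argmin(d2)]            # cluster of the nearest training point
        mask = labels == cluster
        preds[m] = (w_train[mask] * y_train[mask]).sum() / w_train[mask].sum()
    return preds
```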

Assigning a treatment to new observations in the Stage 2 experiment. For each new (i.e., Stage 2) observation we derive a predicted profit by averaging profit across its closest cluster of Stage 1 observations, for each of the three treatments. We then assign to the new observation the treatment that results in the highest predicted profit.

3.3 Model-driven Methods

3.3.1 Lasso Regression

Overview. The Lasso regression, a regularized regression method proposed by Tibshirani (1996), minimizes the sum of squared errors subject to a constraint on the l1-norm. For each new observation, we predict a profit using Lasso for each treatment and assign the treatment that results in the highest predicted profit.

Implementation and cross-validation. The Lasso regression estimates for treatment t are given by

$$\hat{\beta}^t = \arg\min_{\beta} \left( (y^t - X\beta)^T W^t (y^t - X\beta) + \lambda \|\beta\|_1 \right), \qquad (3)$$

where $y^t$ is the effect of treatment t in Stage 1 on the households in each carrier route that were treated with treatment t, $W^t$ is a diagonal matrix whose ith diagonal entry is $w^t_i$, i.e., the number of households in carrier route i that were treated with treatment t in Stage 1, and $\lambda \geq 0$ is a regularization parameter. The predicted profit for some observation i can then be calculated as

$$\hat{y}^t_i = x_i \hat{\beta}^t. \qquad (4)$$

We use the Glmnet implementation of the elastic net (Qian et al., 2013) to train a Lasso model. Glmnet uses cyclical coordinate descent for the optimization,6 and performs ten-fold cross-validation7 to fine-tune the hyper-parameter λ. As every observation in the dataset pertains to a different carrier route, we weight observations by the number of households in the carrier route. The hyper-parameter λ is configured separately for each treatment.

6 Cyclical coordinate descent successively optimizes the objective function over each parameter with the others fixed, and cycles repeatedly until convergence.

7 Split the data set into ten buckets. Estimate β on data from nine buckets and cross-validate on the tenth. Rotate and do this for all ten buckets and calculate the average error.


Assigning a treatment to new observations in the Stage 2 experiment. Having cross-validated the Lasso estimator from Stage 1 observations, for each new (i.e., Stage 2) observation we derive a predicted profit based on the Lasso regression, for each of the three treatments. We then assign to the new observation the treatment that results in the highest predicted profit.
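The paper estimates this model with Glmnet in R. As a rough Python analogue, the sketch below fits a household-weighted Lasso for one treatment and selects the penalty by ten-fold cross-validation on weighted prediction MSE; the penalty grid and scikit-learn's parameterization of λ are illustrative assumptions (and the snippet assumes a scikit-learn version whose `Lasso.fit` accepts `sample_weight`).

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def fit_weighted_lasso(X, y, w, alphas, n_splits=10, seed=0):
    """Household-weighted Lasso for one treatment, with the penalty chosen by
    10-fold cross-validation on weighted prediction MSE."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    cv_mse = []
    for a in alphas:
        fold_mse = []
        for tr, va in kf.split(X):
            model = Lasso(alpha=a, max_iter=10000)
            model.fit(X[tr], y[tr], sample_weight=w[tr])   # weight routes by household counts
            resid = y[va] - model.predict(X[va])
            fold_mse.append(np.average(resid ** 2, weights=w[va]))
        cv_mse.append(np.mean(fold_mse))
    best = Lasso(alpha=alphas[int(np.argmin(cv_mse))], max_iter=10000)
    best.fit(X, y, sample_weight=w)
    return best    # best.predict(X_new) then gives the predicted profit per new route
```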

3.3.2 Finite Mixture Models (FMM)

Overview. Finite mixture models express the response as a finite mixture of regression models, where the regression models can have different specifications. Maximum likelihood estimates of the segment proportions, the regression coefficients, and the distribution parameters for each segment are obtained using the expectation-maximization (EM) algorithm on Stage 1 observations. For each new observation, the finite mixture model makes a profit prediction for each treatment; we assign the treatment that results in the highest predicted profit.

Implementation and cross-validation. We assume the response variable $y^t_i$ of carrier route i under treatment t is distributed according to the finite mixture model

$$y^t_i \sim f(y^t_i \mid x_i; \theta^t, \pi^t) = \sum_{\ell=1}^{K} \pi^t_\ell\, f_\ell(y^t_i \mid x_i; \theta^t_\ell), \qquad (5)$$

where $\pi^t_\ell \geq 0$ and $\sum_{\ell=1}^{K} \pi^t_\ell = 1$.

We use the Flexmix package in R (Grun and Leisch, 2008) to estimate the model.8 The maximum likelihood estimation of the regression coefficients, the distribution parameters, and the weight $\pi^t_\ell$ for each segment is carried out using the EM algorithm, which iterates between evaluating the expectation of the log-likelihood using current estimates (E step), and updating the estimates to maximize the expectation of the log-likelihood (M step).

The Stage 1 promotion campaign had a low response rate, and therefore the revenue is zero for a significant number of carrier routes. To deal with zero inflation, we use revenue (and not profit) as the response variable $y^t_i$ and consider a zero-inflated Poisson model, which is approximated by setting an intercept at the first component fixed to $-\infty$ and other coefficients to zero, while the rest of the model is a usual mixture of Poisson distributions. The estimated model takes the following form:

$$f_\ell(y^t_i \mid x_i; \theta^t_\ell) = \frac{e^{-\mu_i} \mu_i^{y^t_i}}{y^t_i!}, \quad \text{where} \quad \log(\mu_i) = x_i' \theta^t_\ell.$$

Zero-inflated Poisson models require the response variable to be discrete. For this reason, we bucket the observed revenue. The size of the buckets is a parameter we fine-tune in cross-validation, along with the number of segments K. At each iteration of the cross-validation, the mixture model is estimated from observations in the training set, and then a prediction is calculated for observations in the validation set. The optimal parameters are selected separately for different treatments to minimize the average MSE of the predictions over all cross-validation iterations.

8 https://cran.r-project.org/web/packages/flexmix/index.html


Assigning a treatment to new observations in the Stage 2 experiment. Having learnt and cross-validated the finite mixture estimator from Stage 1 observations, for each new (i.e., Stage 2) observation we derive a predicted revenue based on the estimated model for each of the three treatments, and then subtract the mailing costs to retrieve predicted profit. We then assign to the new observation the treatment that results in the highest predicted profit.
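Given estimated segment shares and coefficients (however they were obtained, e.g., from Flexmix), the predicted revenue is a share-weighted sum of segment-level Poisson means, and the mailing cost is then subtracted. The sketch below is illustrative only; the argument names, the bucket-size rescaling, and the mailing-cost constant are our assumptions, not the study's code.

```python
import numpy as np

def fmm_expected_profit(X_new, pi, theta, bucket_size, mailing_cost):
    """Expected profit under one treatment from a fitted zero-inflated Poisson mixture.

    pi    : (K-1,)   estimated shares of the Poisson segments (the structural-zero
                     segment is omitted because it contributes no revenue)
    theta : (K-1, p) Poisson regression coefficients for each segment
    """
    mu = np.exp(X_new @ theta.T)               # (M, K-1) segment-level Poisson means
    expected_revenue = bucket_size * (mu * pi).sum(axis=1)
    return expected_revenue - mailing_cost
```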

3.4 Classification Methods

3.4.1 CHi-square Automatic Interaction Detection (CHAID)

Overview. CHi-square Automatic Interaction Detection (CHAID), a multiway classification tree technique introduced by Kass (1976), became popular in marketing practice because of its interpretability and convenience for segmentation analysis. CHAID recursively partitions the training observations into subsegments, maximizing at each round the significance of a chi-squared statistic for cross-tabulations between the dependent variable, which is the optimal treatment decision, and the descriptive variables at each partition. By the end of the process, the Stage 1 observations are partitioned into mutually exclusive and collectively exhaustive segments that best describe the optimal treatment decision. New observations are assigned the optimal treatment of the segment in which they are classified.

Implementation and cross-validation. To obtain the decision tree, at each split CHAID looks for the descriptive variable that best explains the response variable if split. In order to decide whether to create a particular split based on this variable, the algorithm performs a chi-squared test for independence between the split variable and the categorical response. If the test decides that the split variable and the response are independent, the tree stops growing; otherwise, the split is created, and the next best split is searched. The process terminates when none of the leaves can be split.

We used the CHAID package in R.9 This requires all variables to be categorical, so we categorized continuous variables into five quantiles.

Computationally, CHAID is the most expensive method that we implemented. Cross-validation is used to select seven model parameters: the levels of significance used for merging of predictor categories and splitting of previously merged categories, the level of significance used for splitting of a node in the most significant predictor, the number of observations in split response at which no further split is desired, the minimum number and frequency of observations in terminal nodes, and the maximum height of the tree. As the number of parameters is high, we consider a small grid of three values for each parameter in the process of cross-validation. The optimal parameters were selected to maximize the average classification accuracy. We weight observations by the number of households in the corresponding carrier route (similar to other methods).
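The split criterion itself is easy to illustrate. The Python sketch below (the study used the CHAID package in R) runs the chi-squared test of independence between one categorized descriptive variable and the optimal-treatment label; the variable names are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def split_p_value(categorized_variable, optimal_treatment):
    """Chi-squared test of independence between one categorized descriptive variable
    and the optimal-treatment label; small p-values favor creating the split."""
    table = pd.crosstab(categorized_variable, optimal_treatment)   # cross-tabulation of counts
    _, p_value, _, _ = chi2_contingency(table)
    return p_value

# Continuous variables are first categorized, e.g. into five quantile bins:
# bins = pd.qcut(income, q=5, labels=False)
# p = split_p_value(bins, best_treatment_label)
```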

Assigning a treatment to new observations in the Stage 2 experiment. CHAID assigns new observations to classes, where each class identifies the optimal treatment.

9 https://r-forge.r-project.org/R/?group_id=343

3.4.2 Support Vector Machines (SVM)

Overview. The method first labels each Stage 1 observation according to the treatment that has the highest profit for that observation. It then divides the space of descriptive variables with separating hyperplanes that separate the Stage 1 observations, so that the separation between observations of different labels is maximized. New observations are then assigned the treatment that corresponds to their spatial representation in the high-dimensional space of descriptive variables.

Implementation and cross-validation. Support vector machines (SVM) is, inherently, a two-class classification technique. Given the training set of labeled pairs $(x_i, z_i)$, $i = 1, \ldots, N$, where labels $z \in \{+1, -1\}^N$ indicate the class, the SVM technique finds a separating hyperplane between the two classes that is a solution to the following optimization problem:

$$\begin{aligned}
\min_{\theta, \theta_0, \xi} \quad & \tfrac{1}{2}\theta^T\theta + C \sum_{i=1}^{N} \xi_i \\
\text{subject to} \quad & z_i\left(\theta^T \phi(x_i) + \theta_0\right) \geq 1 - \xi_i, \\
& \xi_i \geq 0,
\end{aligned} \qquad (6)$$

where $\phi(x_i)$ are feature vectors. We refer to $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ as the kernel function. We use the Gaussian (radial basis function) kernel $K_\gamma(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$, which is known to be a reasonable modeling choice for a broad range of applications, as it can handle a nonlinear relation between class labels and attributes, while having a moderate number of hyperparameters.

We use the LibSVM multiclass SVM library for MATLAB (Chang and Lin, 2011) to do a one-versus-one multi-class classification.10 We have three classes, one for each of the three treatments; we find a two-class SVM for all $\binom{3}{2} = 3$ pairs of classes, and assign new observations to the class which is selected by the most classifiers. In the cross-validation stage, we fine-tune the misclassification penalty parameter C and the bandwidth parameter for the Gaussian kernel γ to achieve the highest prediction accuracy on the validation set. The same cost and bandwidth parameters are used for all three two-class SVMs.
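As an illustrative Python counterpart to this setup (the paper used LibSVM in MATLAB), the sketch below fits a multi-class RBF-kernel SVM, with C and γ tuned by cross-validated classification accuracy; the parameter grids and the five-fold split are our assumptions rather than the study's settings.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def fit_svm_policy(X_train, best_treatment, X_new,
                   Cs=(0.1, 1, 10, 100), gammas=(0.01, 0.1, 1)):
    """Multi-class SVM with a Gaussian (RBF) kernel. `best_treatment` holds, for each
    Stage 1 carrier route, the label of the treatment with the highest observed profit.
    SVC resolves the three-class problem with one-versus-one voting internally."""
    search = GridSearchCV(SVC(kernel="rbf"),
                          {"C": list(Cs), "gamma": list(gammas)},
                          scoring="accuracy", cv=5)
    search.fit(X_train, best_treatment)
    return search.best_estimator_.predict(X_new)   # recommended treatment per new route
```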

Assigning a treatment to new observations in the Stage 2 experiment. Like CHAID, SVM assigns each new observation to a class, where the class identifies the optimal treatment.

3.5 Uniform Policies

We assign each new observation the same treatment. We evaluate three uniform policies: the policy assigning the $25 paid membership uniformly, the policy assigning the 120-day free trial uniformly, and the policy assigning the no-mail treatment uniformly.

4 Results

4.1 Equivalence of Stage 1 and Stage 2 Outcomes

We begin the analysis by comparing the outcomes for the uniform (fixed) policies in the Stage 1 (spring training) data and the Stage 2 (fall validation) data. Recall that in both spring and fall we implemented uniform offers to a randomly selected sample of households. These uniform policies included $25 paid offers, 120-day free trials and a no-mail policy. Comparing the outcomes between Stage 1 and Stage 2 offers a basis for assessing the equivalence of the training data and the validation data.

10 https://www.csie.ntu.edu.tw/~cjlin/libsvm/


The findings are summarized in the Appendix, where we report the average store revenue per household, and the average membership revenue per household. We observe a consistent pattern: store revenue and membership revenue are higher in the Stage 2 (fall validation) data than in the Stage 1 (spring training) data. This is true for all three uniform policies. This is not necessarily surprising, as there are several important differences between the Stage 1 and Stage 2 datasets. First, the populations are not equivalent. The Stage 1 households were all located in a single geographic region, while the Stage 2 households were from a much larger geographic region. Geographic differences in the propensity to respond are likely for a variety of reasons, including income levels, dilution of the populations through past marketing activities, and the local strength of the retailer's brand. Second, the mailing vehicles were different between Stage 1 and 2. In Stage 1, the offers were printed on the outside and inside covers of a 48-page book of coupons; in Stage 2, the offers were mailed using a postcard. Third, we might expect that the propensity to respond to promotional mailings may change over time. The average propensity to respond could decline because prospects who responded in the past are now excluded from the mailing pool. As a result, the mailing pool becomes increasingly diluted. Alternatively, the propensity to respond may increase if prospects require multiple exposures to a promotional mailing before they will respond. Either possibility implies that the response probabilities will be non-stationary over time. Fourth, the market context may also change. In particular, promotions in other media may change over time, and this could also change the response probabilities. The retailer attributed the differences between the Stage 1 and Stage 2 outcomes to this last possibility. In particular, a senior manager at the retailer stated: “I think it's largely due to mass media being higher” in the fall.

These differences between the training and validation data are not unusual. In contrast to experiments conducted in the laboratory, experiments conducted in the field are often subjected to varying environmental conditions. Reassuringly, this variation between the training and validation data is common to all of the optimization methods. It does, however, introduce the possibility that these differences favored some methods over others. We will investigate the robustness of the findings on the relative performance of the methods by experimentally introducing an additional source of variation between the training and validation data, and evaluating whether the findings are robust to this additional variation.

In our next analysis we compare the average profit earned in each experimental condition. In all of the tables and figures that follow, we index the average profit to preserve confidentiality.

4.2 Average Profit in Each Experimental Condition

In Figure 1 we compare the average one-year profit earned from the households in each of the experimental conditions (complete findings are reported in the Appendix). The findings reveal that the policy produced by Lasso yielded the highest average profit. The average profit was significantly higher than CHAID and SVM, and it also significantly outperformed all of the uniform policies (p < 0.05). The k-NN method also significantly outperformed CHAID (p < 0.01). However, it did not significantly outperform the uniform $25 condition.

We can also compare the performance of the methods when grouping the methods using our taxonomy of methods. In particular, in our earlier discussion we identified three categories of methods:

• Distance-driven methods: kernel regression, k-nearest neighbors, hierarchical clustering;

• Model-driven methods: Lasso, finite mixture models;


Figure 1: The figure reports the average one-year profit, averaged across each of the households in each experimental condition. To preserve confidentiality, the profits are indexed to 100 for the CHAID data point. The error bars represent 95% confidence intervals. Complete findings including sample sizes and standard errors are reported in the Appendix.

• Classification methods: CHAID, support vector machine.

In Figure 2 we report the analysis when pooling the households according to this taxonomy. Notice that this grouping preserves the benefits of randomization, although the sample sizes vary because there are more distance-driven methods than model-driven or classification methods. The results clearly favor the distance- and model-driven methods over the classification methods. The two classification methods (CHAID and SVM) are the two worst performing methods. Indeed, when we pool the outcomes by method type, the distance- and model-driven methods yield policies that are both significantly better than the classification outcomes (p < 0.01). The difference between the distance- and model-driven methods is not significant.

4.3 Comparison with the Uniform $25 Policy

Although the Lasso method significantly outperforms the uniform policy of sending every household the $25 paid offer, it is the only optimization method to do so. While most of the other optimal policies earn higher average profits than the uniform policies, the differences compared to the uniform $25 paid policy are generally not statistically significant. Although this may seem disappointing, it is not surprising. Most of the methods choose to send the $25 paid offer to the majority of households. For households for which a policy would recommend sending the $25 paid offer, there is obviously no difference in the expected profit compared to the uniform $25 policy. If we want to focus on the differences between an optimal policy and this uniform policy, we need to focus on the households for which the policies are different. In particular, we need to compare the households for which the optimal policy would not send the $25 paid offer. Because of the experimental variation, we can make this comparison using a randomly selected group of customers who received the $25 paid offer (those assigned to the uniform policy) and a randomly selected group of customers that received another treatment (those assigned to the optimized policy).


Figure 2: The figure reports the average one-year profit when pooling the households using the taxonomy of methods. To preserve confidentiality, the profits are indexed to 100 for the Classification methods data point. The error bars represent 95% confidence intervals. Complete findings including standard errors and sample sizes are reported in the Appendix.

We can illustrate this logic more clearly using an example, in which we focus on the households in the Lasso condition and the households in the uniform $25 paid condition. We can ask what the Lasso policy would have recommended for all approximately 800,000 customers (in both conditions). It would have recommended sending the $25 paid offer to 73.63% of them, sending the 120-day free trial to 8.00% of them, and not mailing to the remaining 18.37%. These three sub-groups of households are obviously systematically different (otherwise Lasso would not have chosen to treat them differently). However, within each sub-group there is a sample of customers randomly assigned to the Lasso policy, and an equivalent sample randomly assigned to the uniform $25 paid policy. We can safely compare these equivalent samples.

The results are summarized in Figure 3. Where Lasso chose the 120-day or no-mail treatments over the $25 paid treatment, it outperformed the uniform $25 policy. The difference is statistically significant (p < 0.01) for the no-mail treatment, and when pooling the 120-day and no-mail groups (the columns at the far right in the figure).

Reassuringly, we do not observe a significant difference between conditions for the households for which Lasso recommended sending the $25 paid offer. This is true even though this is the largest sub-group of households (73.63% of them). For these households, the comparison between conditions represents a randomization check. They received the same treatments and so any differences in the outcome could only be attributed to differences in the households themselves.

The findings also reveal another distinctive pattern. Lasso recommends mailing the $25 paid offer to the most valuable households. It is only the less valuable households to which it chooses to send the 120-day free trial (or not to mail).

In Table 2 we repeat this analysis for all of the optimized models. In particular, we first pool the households in the optimized condition with the households in the uniform $25 paid condition, and then ask what treatment the optimized policy would have recommended. This yields sub-groups of households, within which we can compare the uniform $25 paid and optimized policy outcomes.

Figure 3: The figure focuses on households in the Lasso and uniform $25 paid conditions. It reports the average one-year profit when grouping households according to the treatment recommended by Lasso. To preserve confidentiality, the profits are indexed to 100 for the Lasso Policy $25 Paid offer data point. The error bars represent 95% confidence intervals. Complete findings, including standard errors, are reported in Table 2. The sample sizes are reported in the Appendix.

For the households for which an optimized policy recommends sending the $25 paid offer, we do not observe any significant differences between the optimized policy and the uniform $25 paid condition. As we discussed above, this is reassuring, as this comparison serves as a randomization check. Any differences in these outcomes could only be attributed to differences in the households (there is no difference in the experimental treatment). For the households for which the optimized policy did not recommend sending the $25 paid offer, there is a sharp difference in outcomes for the best-performing and worst-performing methods. For HC, Kernel, FMM, k-NN, and Lasso, the optimal methods consistently outperformed the uniform policy. However, for CHAID and SVM the reverse occurred: when these methods chose not to mail the $25 paid offer, the uniform policy tended to perform better than the optimized methods. Notably these were also the two methods that chose to mail the $25 paid offer least frequently; they mailed the $25 paid offer to less than 20% of the households, while all of the other methods mailed this promotion to over 55% of the households. In Section 5 we provide an explanation for these differences.

Table 2: Comparison with the Uniform $25 Policy

           Optimized Policy:   Optimized Policy:    Optimized Policy:   Optimized Policy:
           $25 paid offer      120-day free trial   no mail             120-day free trial or no mail
CHAID        $0.034              $0.161†             −$0.118**           −$0.055
            ($0.096)            ($0.096)             ($0.045)            ($0.041)
SVM          $0.106             −$4.440**            −$0.026             −$0.032
            ($0.141)            ($1.647)             ($0.033)            ($0.033)
HC          −$0.011              $1.074               $0.182              $0.602*
            ($0.040)            ($0.707)             ($0.131)            ($0.285)
Kernel       $0.030              $0.420**             $0.046              $0.131**
            ($0.056)            ($0.143)             ($0.040)            ($0.042)
FMM          $0.021              $0.386               $0.199**            $0.200**
            ($0.058)            ($0.557)             ($0.033)            ($0.033)
k-NN         $0.040              $0.104               $0.109†             $0.082†
            ($0.061)            ($0.084)             ($0.058)            ($0.048)
Lasso        $0.059              $0.081               $0.223**            $0.186**
            ($0.053)            ($0.062)             ($0.024)            ($0.025)

The table focuses on households in the uniform $25 paid and each optimized policy condition. Households are grouped according to the treatment recommended by the optimized policy. The table reports the difference in profit for the (randomly assigned) sub-group that received the optimized policy and the randomized sub-group that received the uniform $25 treatment. A positive value indicates that profits were higher in the optimized policy sub-group. Significance: † for p < 0.1; * for p < 0.05; ** for p < 0.01. Standard errors are in parentheses and sample sizes are reported in the Appendix.

4.4 Robustness Check

Recall from the comparison of the uniform policies that there were differences in the Stage 1 and Stage 2 settings, resulting in different average outcomes. A senior manager from the retailer attributed these differences to more mass market advertising in the fall. This raises the question: how robust are the findings to differences between the training setting and the validation setting?

The robustness of the findings was of interest not just to the research team, but also to the retailer. As a result, the retailer insisted on including additional experimental conditions designed to test robustness. In particular, they implemented an additional set of conditions where we used the same training data to design the optimized policies (with $25 paid, 120-day trial, no-mail treatments), but in the Stage 2 implementation we replaced the 120-day trial with a 90-day trial (i.e., the treatments were $25 paid, 90-day trial, or no mail). In these treatments, whenever an optimized policy said to mail a 120-day promotion, a 90-day promotion was mailed instead. Because the training data did not change, the optimized policies did not otherwise change.

The households in these experimental conditions were randomly selected (by carrier route) and were therefore equivalent to the households used in the main part of the study.11 This comparison thus provides a basis for evaluating how robust the optimized policies are to changing the promotions themselves. The results are summarized in Figure 4 (complete findings are reported in the Appendix).

The pair-wise correlation between the performance with the 120-day and 90-day trials is 0.79 (using the seven data points representing each of the optimized methods). Moreover, the Lasso continues to be the best-performing method, significantly outperforming CHAID, SVM and the uniform $25 condition. The k-NN, Kernel and HC methods also significantly outperform CHAID, SVM and the uniform $25 condition.

11 The retailer chose to limit the sample sizes in these conditions to approximately 300,000 households (instead of approximately 400,000 in the other test conditions).

Figure 4: The figure is a scatter plot of the average profit earned in the original experiment (the x-axis) and the average profit earned in the modified experiment, where the 120-day free trial offer is replaced with a 90-day free trial offer (the y-axis). To preserve confidentiality, the profits are indexed to 100 for the CHAID 120-day free trial offer data point. Each observation represents one of the optimized policies. Complete findings are reported in the Appendix.

We can also check the robustness of the earlier results in which we grouped the optimized methods using our taxonomy of methods. These findings are reported in Figure 5. The pattern of results is essentially unchanged. The distance- and model-driven methods are not statistically different, but the classification methods perform significantly worse than the other methods (p < 0.01).

The robustness of the findings is reassuring. It suggests that the relative performance of the methods is not sensitive to details of the promotion. When the methods chose to send the free trial, it did not matter too much whether the trial was 90 days or 120 days. More generally, in practice there will often be differences between the setting used to create the training data and the setting in which optimized models are deployed. These findings suggest that the relative performance of the methods may be robust to modest changes in these settings.

We caution that a limitation of this robustness check is that it effectively only varies the promotion when an optimized policy recommends a household should receive the 120-day free trial. For some methods, this only affects a relatively small proportion of the households. We also caution that these findings should not be interpreted as evidence that the relative performance of the methods will be robust to all differences between the training data and validation data. Rather, they provide initial evidence of robustness to changing details of the promotions.

In our next analysis we investigate how the methods performed for households in the validation data that were inside versus outside the range of the training data.

4.5 Performance Inside vs. Outside the Range of the Training Data

One factor that could influence the performance of the models is the extent to which the distribution of the data in the Stage 2 validation sample matches the distribution in the Stage 1 training sample.


Figure 5: The figure focuses on the experimental conditions in which the 120-day free trial was replaced with the 90-day free trial. It reports the average one-year profit when pooling the households using the taxonomy of methods. To preserve confidentiality, the profits are indexed to 100 for the Classification methods data point. The error bars represent 95% confidence intervals. Complete findings including standard errors and sample sizes are reported in the Appendix.

We would expect the models to perform better compared to the uniform policies in parts of the parameter space that are well-represented in the training data. To investigate this possibility, we calculated the mean and standard deviation of each of the thirteen descriptive variables using the Stage 1 training data. We then identified households in the Stage 2 validation data for which one or more of the variables was at least two standard deviations away from the mean.

This procedure revealed that 60.5% of Stage 2 households fell outside the two standard deviation range on at least one of the thirteen descriptive variables. We classified these validation sample households as “Outside the Range” of the training data, and the remaining households as “Inside the Range”. We then recalculated the average outcome for each method using the two groups of households. The findings are summarized in Figures 6 and 7.
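A small sketch of this screening step, assuming the thirteen descriptive variables are held in numpy arrays (the column-wise means and standard deviations come from the Stage 1 data); the function name is illustrative.

```python
import numpy as np

def outside_training_range(X_train, X_valid, n_sd=2.0):
    """Flag validation households whose value on any descriptive variable lies more
    than n_sd standard deviations from the training mean of that variable."""
    mean = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    return (np.abs(X_valid - mean) > n_sd * sd).any(axis=1)   # True = "Outside the Range"
```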

In Figure 6 the findings for the seven methods are reported in order of their original ranks (when including all observations). They reveal several interesting patterns. First, we see that the households in the Stage 2 validation data that were outside the range of the Stage 1 training data were generally more valuable households. This is consistent with the findings reported in Section 4.1 that average store revenue and membership revenue were higher in the Stage 2 data than in the Stage 1 data. This observation will also be helpful as we interpret other features of Figure 6.

Second, the $25 paid uniform policy outperforms the other uniform policies on the households that are outside the range of the training data, but the 120-day trial is the best uniform policy within the range of the training data. This result is consistent with our discussion in Section 4.3 that the optimal policy recommends mailing the $25 paid offer to the most valuable households, and the 120-day free trial to the less valuable households.

Third, the seven optimized methods all perform similarly on the households outside the range of the training data. Moreover, all seven methods yield optimized policies that underperform the $25 paid uniform policy. This confirms that training models on data that is not representative of the parameter space in validation will yield relatively poor policies.


Figure 6: The figure summarizes the average profit when grouping the households according to whether they are inside or outside the range of the training data. To preserve confidentiality, the profits are indexed to 100 for the CHAID Inside the Range data point. We categorize the households in the Stage 2 validation data as falling within the range of the Stage 1 training data if they have values on all thirteen descriptive variables within two standard deviations of the average value in the training data. Complete findings are reported in the Appendix.

Figure 7: The figure summarizes the average profit when grouping the households according to whether they are inside or outside the range of the training data, and when pooling the households using the taxonomy of methods. To preserve confidentiality, the profits are indexed to 100 for the Inside the Range Classification data point. We categorize the households in the Stage 2 validation data as falling within the range of the Stage 1 training data if they have values on all thirteen descriptive variables within two standard deviations of the average value in the training data. The error bars represent 95% confidence intervals. Complete findings are reported in the Appendix.

Finally, when we focus on the households that were within the range of the training data, the power of the model-driven optimized methods is revealed. Lasso and FMM sharply outperform the other methods, and also all three of the uniform policies. The implication is that these methods make the best use of the information in the training data. We highlight the relative performance of the model-driven methods in Figure 7 by reporting the results for each of the three groups of models. Inside the range of the training data, the model-driven methods significantly outperform the distance-driven methods, and the distance-driven methods significantly outperform the classification methods (p < 0.01). However, outside the range of the training data there is no significant difference in the performance of the three groups of methods.

4.6 Amount of Information: Size of Carrier Routes

Recall that the households are described by thirteen variables. In the Stage 2 (fall validation) data, these thirteen variables are calculated at the carrier route level, so that every household in the same carrier route has the same values on these descriptive variables. Because the number of households in a carrier route varies across carrier routes, the amount of information we have about each household also varies. For households in carrier routes with relatively few other households (i.e., small carrier routes), the descriptive variables contain more information about the households than in carrier routes with many households.

To investigate how the precision of the information affected the outcomes, we calculated the findings separately using a median split of the number of households in each carrier route. The findings are summarized in Figure 8, where we again present the seven methods in order of their original ranks (when including all observations). We also repeat this analysis when grouping the methods according to our taxonomy of methods in Figure 9.
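As a rough sketch of the median split (the data frame and column names here are hypothetical), the grouping amounts to the following:

```python
import pandas as pd

# Hypothetical carrier-route-level frame: one row per carrier route, with the number of
# households it contains, the method that generated its policy, and the realized profit.
def split_by_route_size(routes: pd.DataFrame) -> pd.DataFrame:
    median_size = routes["n_households"].median()
    size_group = routes["n_households"].gt(median_size).map(
        {True: "Above Median Size", False: "Below Median Size"}
    )
    # Average one-year profit by method and carrier-route size group.
    return (routes.assign(size_group=size_group)
                  .groupby(["method", "size_group"])["profit"]
                  .mean()
                  .unstack())
```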

Figure 8: The figure summarizes the average one-year profit when grouping the carrier routes according to whether the households they contain are more or fewer than the median number of households. To preserve confidentiality, the profits are indexed to 100 for the CHAID Below Median Size data point. Complete findings are reported in the Appendix.

In the larger carrier routes, where the descriptive variables contain less precise information about each household, all seven methods perform similarly. However, in the smaller carrier routes, where the descriptive variables are more informative, Lasso performs significantly better. When grouping the methods using our taxonomy of methods, the findings reveal an even clearer picture. The model-driven methods (Lasso and FMM) appear to make the best use of the increased precision of the information in smaller carrier routes. In particular, these methods perform significantly better on the smaller carrier routes than on carrier routes containing more households. Notice that this finding cannot be explained simply by smaller carrier routes being more profitable. We would expect this to benefit all of the methods.

Figure 9: The figure summarizes the average one-year profit when grouping the carrier routes according to whether the households they contain are more or fewer than the median number of households, and when pooling the households using the taxonomy of methods. To preserve confidentiality, the profits are indexed to 100 for the Below Median Size Classification data point. The error bars indicate 95% confidence intervals. Complete findings are reported in the Appendix.

5 Why did the Classification Methods Perform so Poorly?

A common thread in our findings is that the classification methods (CHAID and SVM) performed poorly compared to the distance- and model-driven methods. Understanding the reasons for this poor performance will help us evaluate the extent to which this result generalizes to other settings. In this section we show that the poor performance of the classification methods is at least partly attributable to a loss of information. This limitation is particularly costly when the probability of a response is low, as in the prospecting problem that we study. We then propose a modification to the classification methods that addresses this limitation and validate this modification using simulations.

To help understand why classifiers perform poorly in our study, we start by considering the following simplified setting at the household level (instead of at the carrier route level). Suppose the training set consists of $N$ households, each described by descriptive variables $x_i$. For each household $i$ in the training set, we know the profit outcomes for the no-mail and mail treatments, $(y_i^0, y_i^1)$.12 We also have a new household with descriptive variables $x_{new}$ for which we want to select an optimal treatment $t^*_{new}$.

We will refer to the distance-driven and model-driven methods as “regression-based” methods. In the regression-based methods, we use $(X, y^0, y^1)$ to train two models that predict $(y_i^0, y_i^1)$ given $x_i$. We then predict $(y_{new}^0, y_{new}^1)$ for $x_{new}$ and assign a treatment that yields the largest predicted profit:

$$t^*_{new} = \arg\max_t \, y^t_{new}.$$

In the classification-based approaches, we first transform $(y^0, y^1) \rightarrow t^*$, where $t^*_i = \arg\max_t y^t_i$ is the optimal treatment for observation $i$ in the training set. We then use $(X, t^*)$ to train a classifier that predicts $t^*_i$ given $x_i$ and assign to a new observation the treatment that the classifier predicts is optimal:

$$t^*_{new} = \hat{t}(x_{new}).$$

12 We recognize that knowing the profit outcomes for both treatments is a more reasonable assumption when the unit of observation is a carrier route rather than a household. We refer to “households” merely for clarity, and note that we use a carrier route as the unit of observation in our analysis in the previous sections.
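The contrast between the two families can be written compactly in code. The sketch below is a minimal illustration assuming generic scikit-learn-style estimators; the estimator objects are placeholders and not the specific implementations used in the paper.

```python
import numpy as np

def regression_based_policy(reg0, reg1, X, y0, y1, X_new):
    """Fit one profit model per treatment and assign the treatment with the
    largest predicted profit (t*_new = arg max_t y^t_new)."""
    reg0.fit(X, y0)   # model for the no-mail outcome y_i^0
    reg1.fit(X, y1)   # model for the mail outcome y_i^1
    preds = np.column_stack([reg0.predict(X_new), reg1.predict(X_new)])
    return preds.argmax(axis=1)

def classification_based_policy(clf, X, y0, y1, X_new):
    """Collapse the outcomes to labels t*_i = arg max_t y_i^t, then train a
    classifier to predict the optimal treatment directly."""
    t_star = (y1 > y0).astype(int)   # margin information is discarded at this step
    clf.fit(X, t_star)
    return clf.predict(X_new)
```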

Our explanation for why the classification approaches perform poorly focuses on the loss of information due to the transformation $(y^0, y^1) \rightarrow t^*$. This transformation recognizes which treatment is optimal for each household, but discards information about the margin with which it is optimal. In particular, this transformation discards information about the magnitude of the difference between $y_i^0$ and $y_i^1$.

Discarding information about the magnitude of this difference is particularly costly in the setting we study because of asymmetry. The difference tends to be large when it is optimal to mail and small when it is optimal not to mail. In particular, if a household does not respond, the difference in profit between mailing and not mailing is small, merely reflecting the cost of mailing. However, when a household responds, the difference in profit is large, reflecting the profit that the firm earns from that household’s transactions. In other words, when no mail is optimal, the outperformance margin $(y_i^0 - y_i^1)$ is small, but when mail is optimal, the outperformance margin $(y_i^1 - y_i^0)$ is large.

Discarding this information about the asymmetry in outperformance margins means that classifiers only consider how frequently a policy is optimal. They cannot consider the expected profit of each treatment. This puts classifiers at a disadvantage compared to regression-based approaches.

To highlight the cost of not using this information, we use an illustrative example. Mailing a promotion offer to a household costs $1, and 95% of households never respond. The remaining 5% of households always respond and yield $1,000 profit after deducting the mailing cost. Not mailing yields a payoff of zero with certainty. We will initially assume that these two types of households cannot be distinguished using the descriptive variables $X$. We know the response from mailing and not mailing promotions to a training set of households. Now we want to decide whether a new sample of households should be mailed the same promotion. In this setting, regression-based methods use the average response in the training sample to predict the expected profit in the mail condition and correctly mail to every new household. However, classification-based methods make a decision based on the following information: (i) it is optimal to not mail in 95% of cases; (ii) it is optimal to mail in 5% of cases; and (iii) these cases cannot be distinguished using the descriptive variables $X$. Because not mailing is optimal more frequently, the classification methods choose not to mail, even though mailing to all households is optimal. Choosing the treatment that is more frequently optimal, rather than the treatment with the higher expected profit, yields a suboptimal policy.

To confirm the argument, we implement this example using simulated data. We train all seven methods and evaluate them using a hold-out sample. The details are described in the Appendix and the outcomes are reported in Table 3. While the distance- and model-driven methods almost perfectly identify the optimal policy (i.e., mailing to every household), the two classification methods recommend mailing to no households, or to almost none.
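A back-of-the-envelope version of the example (our own sketch, not the simulation described in the Appendix) shows why the expected-profit rule and the majority-label rule disagree:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
responder = rng.random(N) < 0.05        # 5% of households always respond

y0 = np.zeros(N)                        # not mailing always yields $0
y1 = np.where(responder, 1000.0, -1.0)  # mailing: $1,000 net profit or a $1 loss

# Regression-style rule: mail if the expected profit from mailing is positive.
print(y1.mean())                        # roughly 0.05*1000 - 0.95*1 = 49.05 > 0, so mail everyone

# Classification-style rule with uninformative X: choose the majority label,
# which is "no mail" because 95% of households are better off not being mailed.
print((y1 > y0).mean())                 # roughly 0.05, so the majority label is "no mail"
```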


Table 3: Profit and Mailing Rate for Each Method for the Illustrative Example

                 CHAID     SVM       HC        Kernel    FMM       k-NN      Lasso
Profit           $0.00     $0.12     $48.24    $49.01    $49.05    $45.84    $49.05
Mailing Rate     0.0%      0.0%      98.9%     99.2%     100.0%    81.8%     100.0%

This table reports the average profit and average mailing rate from simulated data constructed using the illustrative example described above. A detailed description of the simulations is provided in the Appendix.

This example relies on two features. First, the descriptive variables do not distinguish between the two customer segments. When the descriptive variables are able to perfectly distinguish the segments, so that there is no heterogeneity in the outcomes within a segment, then choosing the treatment that is more frequently optimal within each segment yields the same policy as choosing the treatment with the higher expected profit. Second, the outperformance margins are asymmetric. For households in the training set for which no mail is optimal, the outperformance margin $(y_i^0 - y_i^1)$ is small, but for households for which mailing is optimal, the outperformance margin $(y_i^1 - y_i^0)$ is large. If these outperformance margins were identical, then choosing the treatment that is more frequently optimal would again be a proxy for choosing the treatment that yields the highest expected profit.

We can illustrate both of these features using additional simulations. We start with the first feature and modify the example so that the descriptive variables clearly identify two clusters of observations. It is optimal to mail to one cluster and not mail to the other cluster. In the training stage, classification methods observe that the mail treatment is optimal for most households in the first cluster, and no mail is optimal for most households in the second cluster. To illustrate this, we use the same model as in the illustrative example above and increase the distance between the two clusters. In Figure 10, red dots and blue crosses represent the training set observations for which $t_i = 0$ (no mail) and $t_i = 1$ (mail), respectively. The decision regions for the trained SVM model are indicated by the colored background. The SVM model recommends mailing to all observations in the blue areas. We see that as the distance between the two segments grows, the SVM method is better able to discriminate between the segments, and this improves the resulting policy.

(a) Distance=0 (b) Distance=2 (c) Distance=4 (d) Distance=6

Figure 10: These figures illustrate the training data in the space of the descriptive variables as we vary the distance between the segments for which mail or no mail is optimal. The red and blue spaces indicate the regions in which SVM recommends mailing (blue) and not mailing (red). The red dots and blue crosses identify the households in the training data, with blue crosses identifying households with higher profit from mailing, and red dots identifying households with higher profit from not mailing. A detailed description of the simulations is provided in the Appendix.

The other feature of the example that contributes to the results in Table 3 is the asymmetric nature of the outperformance margins. This is a distinctive feature of the prospecting application that we study. Very few households respond, and the firm earns relatively large profits from the responding households, while it incurs a small loss when mailing to households that do not respond. Unfortunately, marketing decisions often involve low probability outcomes with large payoffs. This is true not just when prospecting for new customers, but also when sending promotions to existing customers.

A low response rate can confuse classification-based methods, even when the clusters are clearly separated. To see this, assume that in the previous simulation the clusters are sufficiently separated that the classification methods can distinguish between them. However, assume now that the response is stochastic in the cluster for which it is optimal to mail. If the response rate is sufficiently low, then the no-mail treatment will yield a higher profit for more households in that cluster than the mail treatment (even though mailing is optimal in this cluster). For example, with a 20% response rate in this cluster, 80% of the households in this cluster would have the label “no mail” and only 20% would be labeled “mail”. Because variation in the response within the cluster cannot be explained by the descriptive variables, the decision in this cluster effectively reverts to our first illustrative example, and the classification methods do not mail to either segment. We demonstrate this argument in Figure 11. As the response rate in the mail segment decreases, SVM finds it harder to distinguish between households for which the different treatments are optimal. Even though the two segments are clearly separated in the space of the descriptive variables, because not mailing outperforms mailing for a higher proportion of households in the “mail” segment, the recommended policy for the two segments converges. This is not true for the distance- and model-driven methods, which focus on the expected profits from each segment. We report findings for these methods in the Appendix. Even when the response rate for the “mail” segment falls, the expected profits remain different for the two segments.

(a) Resp. Rate = 1 (b) Resp. Rate = 0.7 (c) Resp. Rate = 0.5 (d) Resp. Rate = 0.2

Figure 11: These figures illustrate the training data in the space of the descriptive variables as we vary the response rate in the “mail” segment. The red and blue spaces indicate the regions in which SVM recommends mailing (blue) and not mailing (red). The red dots and blue crosses identify the households in the training data, with blue crosses identifying households with higher profit from mailing, and red dots identifying households with higher profit from not mailing. A detailed description of the simulations is provided in the Appendix.
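The following sketch reproduces this pattern with simulated data. It is our own illustration using scikit-learn's SVC, not the simulation reported in the Appendix, and the cluster locations and profit values are assumptions chosen to mirror the illustrative example.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n = 2000
# Two well-separated clusters of households in the space of descriptive variables.
X = np.vstack([rng.normal(0.0, 1.0, (n, 2)),    # cluster 1: mailing only incurs the $1 cost
               rng.normal(6.0, 1.0, (n, 2))])   # cluster 2: mailing is optimal on average

response_rate = 0.2
responds = rng.random(n) < response_rate
y0 = np.zeros(2 * n)                             # no-mail profit is always zero
y1 = np.concatenate([np.full(n, -1.0), np.where(responds, 1000.0, -1.0)])

labels = (y1 > y0).astype(int)   # in cluster 2, only ~20% of households get the label "mail"
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict([[6.0, 6.0]])) # likely 0: the classifier undermails despite clear separation
```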

We conclude this section by proposing a modification to the classification methods to address the limitation that we have identified. When training the classifiers, the classification methods minimize an objective function, referred to as the cost function. A standard implementation of the classification methods uses the same penalty for false negatives (not mailing to households that should be mailed) as for false positives (mailing to households that should not be mailed). Our explanation for the poor performance of the classification methods suggests that we can improve performance by making these penalties asymmetric to account for the difference in the outperformance margins. In our illustrative example, the following two events are equally costly:


• mail 1,000 promotions to households that do not respond;

• do not mail a promotion to a household that would respond.

We can account for the asymmetry in the outperformance margins by making the penalty for false negatives 1,000 times higher than the penalty for false positives. In Figure 12 we return to the first illustrative example and simulate the impact of gradually making the cost function more asymmetric. We hold the penalty for false positives at 1, and allow the penalty for false negatives to vary from 1 to 20, 100, and 1,000. We see that SVM yields the optimal policy of mailing to every household when the penalty for false negatives reaches 1,000.

(a) Cost = [1,1] (b) Cost = [1,20] (c) Cost = [1,100] (d) Cost = [1,1000]

Figure 12: These figures illustrate the training data in the space of the descriptive variables as we vary the cost of an error. A cost of [e, f] indicates a cost of e for false positives and a cost of f for false negatives. The red and blue spaces indicate the regions in which SVM recommends mailing (blue) and not mailing (red). The red dots and blue crosses identify the households in the training data, with blue crosses identifying households with higher profit from mailing, and red dots identifying households with higher profit from not mailing. A detailed description of the simulations is provided in the Appendix.

This proposed modification requires that we know the size of the difference in the outperformance margins. Fortunately, an estimate of this difference in a real application can be obtained from the training data.
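In a LIBSVM-style SVM, this asymmetry can be introduced through per-class misclassification weights. The sketch below uses scikit-learn's SVC class_weight parameter as one possible implementation; the 1,000:1 ratio simply mirrors the illustrative example, and in practice the ratio would be estimated from the outperformance margins in the training data.

```python
from sklearn.svm import SVC

# Penalize errors on the "mail" class (false negatives) 1,000 times more heavily
# than errors on the "no mail" class (false positives), reflecting the asymmetric
# outperformance margins in the illustrative example.
weighted_svm = SVC(kernel="rbf", class_weight={0: 1, 1: 1000})
# weighted_svm.fit(X, labels)   # X and labels as in the earlier sketch
```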

6 Conclusion

As more firms embrace field experiments as a way to optimize their marketing activities, the focus on how to derive more value from field experiments is likely to intensify. One way to derive additional value is to segment customers and evaluate how to target different customers with different marketing treatments. Fortunately, the literature on customer segmentation methods is vast. Firms have a broad range of segmentation methods to choose from, but little guidance as to which of these methods are most effective when applied to data generated by a field experiment.

In this paper, we have evaluated seven widely-used segmentation methods using a series of two large scale field experiments. The first field experiment is used to generate a common pool of training data for each of the seven methods. We then validate the optimized policies provided by each method in a second field experiment.

The findings reveal that model-driven methods (Lasso and Finite Mixture Models) performed the best. Some distance-driven methods also performed well (particularly k-Nearest Neighbors). However, the two classification methods we tested (CHAID and SVM) performed relatively poorly.

The findings also reveal that the model-driven methods performed best in parts of the parameter space that are well represented in the training data. The implication is that these methods make the best use of the information in the training data.


We also compared how well the methods performed as the precision of the available information improved. When aggregation meant that the precision was low, Lasso performed relatively poorly. As the quality of the information improved because it was aggregated across fewer households, the performance of Lasso and Finite Mixture Models exceeded that of the other methods.

We investigated why the model-driven and distance-driven methods outperformed the classification methods in our setting. Model- and distance-driven methods construct optimal policies based upon the expected profits from each treatment. In contrast, classification methods only consider which treatment is optimal and discard information about the extent to which it is optimal. This leads classification methods to undermail in low-response-rate settings, such as a promotional campaign. To address this limitation, we modify the misclassification penalties to reflect the asymmetry in the outperformance margins between the treatments.

As with most comparisons of methods, there are two important limitations to these findings. First, each of the methods that we implemented is representative of a general class of related methods. It is obviously not possible to test every version of every class of methods. While we chose what we believe to be the most commonly implemented version of each class of methods, we recognize that other versions may perform differently. In defense of the findings, we also note that we tuned the methods (where possible) using extensive cross-validation. In this respect, the version of each method that we implemented was the best performing of the variants that we cross-validated.

A second limitation common to most comparisons of methods is that the findings may not generalize to every market setting. Although our findings include modest robustness testing, we cannot claim that the findings will hold in every setting. In particular, the experiments in this study involved prospecting for new customers. When prospecting for new customers, firms generally have a lot less information than if they are targeting existing customers. With existing customers, firms can often observe each customer’s past purchasing from the firm, which provides a rich source of information for segmenting customers and predicting future responses. The performance of the segmentation methods may be very different if they have access to this type of information. In future research we hope to extend our findings to targeting existing customers, as well as to focus on other types of marketing decisions.

Acknowledgments

The authors thank Eric Bradlow, Wayne DeSarbo, Daria Dzyabura, John Hauser, Kris Johnson Ferreira, Matthew Selove, Peng Sun, and Olivier Toubia for their helpful comments and suggestions.
