
Optimal Large-Scale Internet Media Selection

Courtney Paulson, Lan Luo, and Gareth M. James∗

July 10, 2015

Abstract

Internet advertising is vital in today's business world: it is uncommon for a major Internet advertising campaign not to include an online display component. Nevertheless, research on optimal Internet media selection has been sparse. Firms face considerable challenges in their budget allocation decisions: the large number of websites they may potentially choose, the vast variation in traffic and costs across websites, and the inevitable correlations in viewership among these sites. Selecting the optimal subset of websites among all possible combinations is generally an NP-hard problem, so existing approaches can only handle Internet media selection in settings on the order of ten websites. We propose an optimization method that makes it computationally feasible to allocate advertising budgets among thousands of websites. While performing similarly to extant approaches in settings scalable to prior methods, our approach successfully tackles the challenging task of large-scale optimal Internet media selection. Our method is also flexible enough to accommodate practical Internet advertising considerations such as targeted consumer demographics, mandatory media coverage of matched content websites, and target frequency of ad exposure.

1. Introduction

With the growing role of the Internet in the United States economy, Internet advertising is becoming vital for company survival. In 2012, U.S. digital advertising spending (including display, search, and video advertising) totaled 37 billion dollars, of which Internet display advertising accounted for 40% (eMarketer, 2012). Internet display ad spending is also expected to grow to 45.6% of the total in 2016, outpacing paid search ad spending (eMarketer, 2012). Such an increasing trend in Internet display advertising is related to a wide range of benefits offered by this advertising format, including building awareness and recognition, forming attitudes, and generating direct responses such as website visits and downstream purchases (Danaher et al., 2010; Hoban and Bucklin, 2015; Manchanda et al., 2006).

∗Marshall School of Business, University of Southern California.

Nevertheless, firms face considerable challenges in optimal Internet media selection for online display ads. Because each website represents a unique advertising opportunity, the number of websites firms may potentially choose to advertise among is extremely high. These websites also vary vastly in their traffic and advertising costs. Furthermore, when optimizing advertising budgets across a large number of websites, it is crucial for firms to account for the inevitable correlations in viewership among these sites. For example, the 2011 comScore Media Metrix data show over 95% correlation in the viewership of Businessweek.com and Reuters.com. In such cases, heavy advertising on both websites will inefficiently cause firms to advertise twice to mostly the same viewers.

These challenges are so formidable that, although Internet advertising is increasingly recommended as a way to reach consumers (e.g., Unit, 2005; Chapman, 2009), companies often have to rely on advertising exchanges such as DoubleClick to manage their Internet ad campaigns (Lothia et al., 2003). These exchanges are recent innovations in advertising that allow firms to outsource their Internet ad campaigns, giving firms the opportunity to expand online advertising without having to combat the challenges themselves (Muthukrishnan, 2009). Generally, a company will specify campaign characteristics (such as which types of consumers to target) and pay a certain amount of money to the exchange to conduct a campaign with those characteristics.

One advantage of ad exchanges is their ability to employ behavioral ad targeting, that is, targeting ads to consumers based on their Internet browsing histories (Chen et al., 2009). This is usually accomplished by installing cookies or web bugs on users' computers to track their online activity. However, this practice has led to numerous privacy concerns and, in some cases, legal action against behavioral targeters (Hemphill, 2000; Goldfarb and Tucker, 2011). Another major concern with outsourcing Internet display ad campaigns to ad exchanges is that companies must turn over control of the campaign to the exchange, which creates a classical principal-agent problem. While the focal firm can request target demographics, the exchange ultimately has sole discretion over how funds are allocated (Muthukrishnan, 2009). In such cases, the ad exchange serves as a broker who maximizes its own profit by distributing ad impressions across multiple campaigns from multiple firms, rather than allocating funds in each individual firm's best interest. Consequently, when running an online ad campaign through an ad exchange, the focal firm's budget allocation may be suboptimal compared with the alternative of managing its own campaign.

In this paper, we propose a method to overcome the above challenges and concerns. We emphasize a scenario in which firms wish to retain control of their online advertising campaigns, rather than entirely outsourcing such campaigns to advertising exchanges. In particular, we consider a setting in which a company wishes to maximize reach, i.e., the fraction of customers who are exposed to a given ad at least once. In such cases, firms still face the same Internet advertising challenges of overwhelming scope and variety. Historically, to remain in full control of their own online advertising campaigns, firms often had to employ heuristics to choose a select number of websites on which to advertise. These heuristics include advertising only on big-name websites like Amazon or Yahoo, or allocating evenly over the most visited websites under consideration (Cho and Cheon, 2004). While such heuristics have been adopted in practice, they can lead to substantially suboptimal budget allocations. For example, the five highest-traffic websites are likely not the optimal sites on which to advertise. Consider again the case of Businessweek and Reuters: both websites are high in traffic, but they share highly similar users. A firm that advertises heavily on both will waste money without gaining many new ad viewers, even if it wishes to target primarily frequent viewers of such websites. In addition, a very popular, high-traffic website may be expensive to advertise on and may have a large percentage of repeat visitors, so it may not be cost-effective for firms to spend a considerable portion of their advertising budgets on such websites. In many cases, choosing a less visited but also less expensive website could be a better choice.

Despite the considerable importance of optimal Internet media selection for online display ads, very few researchers have proposed methods to alleviate the above challenges. Danaher's Sarmanov-based model (Danaher et al., 2010) was among the first and most successful attempts to optimally allocate budget across multiple online media vehicles, and it has been proven to work well for budget allocation in settings on the order of 10 websites. While Danaher's work represents the state-of-the-art method for allocating Internet advertising budget, under this method each additional website increases the optimization difficulty exponentially, such that the Sarmanov criterion becomes very difficult to optimize over more than approximately 10 websites (Danaher, 2007; Danaher et al., 2010). For example, even if firms know they wish to advertise across only 10 out of 50 potential websites, they must test each possible 10-website combination, resulting in over 10 billion individual problem calculations. Since each website represents a separate advertising opportunity, such methods are hindered by the huge volume of websites on which firms could potentially choose to advertise.

The primary goal of this research is to develop a method that allows firms to efficiently select and allocate budget among a large set of websites (e.g., thousands). One reason for the difficulty in considering a large number of websites is that the problem of choosing a subset of websites is generally NP-hard. In a setting involving $p$ potential websites, each of the $2^p$ possible website subsets must be considered separately, leading to a computationally infeasible problem.

In a linear regression setting, a similar problem is encountered when performing variable selection involving large numbers of independent variables. A common solution, adopted in the statistical literature, involves optimizing a constrained convex loss function, a relaxed version of the NP-hard variable selection problem. A selection of recent papers includes the Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001), the elastic net (Zou and Hastie, 2005), the adaptive Lasso (Zou, 2006), CAP (Zhao et al., 2009), the Dantzig selector (Candes and Tao, 2007), the relaxed Lasso (Meinshausen, 2007), and VISA (Radchenko and James, 2008).

Building upon this stream of research, we develop an analogous constrained criterion approach in our setting, i.e., a relaxed version of the NP-hard website selection problem. Our method is related to the well-known Lasso formulation (Tibshirani, 1996), but diverges in that our optimization criterion does not involve a quadratic loss function. Our empirical investigation illustrates that, for a small number of websites, the proposed method performs similarly to Danaher et al. (2010). Furthermore, our method can be used effectively in major online advertising campaigns where a large number of websites is under consideration: even with 5000 websites, optimizing for a particular budget takes under twenty seconds on a personal laptop computer.

We further demonstrate that this method is flexible enough to accommodate common practical Internet advertising considerations such as targeted consumer demographics, mandatory media coverage of matched content websites, and target frequency of ad exposure. Consequently, firms could use our method to fully control their own Internet advertising campaigns instead of being forced to rely on advertising exchanges, without having to give up specific targeting of particular demographic groups and/or websites. Additionally, our algorithmic efficiency allows firms to quickly compare expected reach across numerous budgets and various Internet advertising opportunities, giving firms a broad range of advertising campaign and cost options.

The remainder of the paper is structured as follows. In Section 2, we describe our constrained optimization approach as an efficient high-dimensional alternative to existing methods for large-scale Internet advertising optimization. In Section 3, we discuss simulation studies that compare our optimization to Danaher et al.'s existing method and demonstrate that the proposed method can handle budget allocation across thousands of websites. Also in Section 3, we provide two case studies (McDonald's McRib Advertising Campaign and Norwegian Cruise Lines' Wave Season Advertising Campaign) using 2011 comScore Media Metrix data. We conclude in Section 4 with a summary of our findings, contributions, and avenues for future work.

2. Methodology

2.1 Model Formulation

Consider a firm that has a budget $B$ for a campaign to be run over a particular time span (e.g., one month or one quarter). A common goal for such a campaign is to allocate the firm's budget across a set of $p$ possible websites to maximize the probability that an Internet user views the ad at least once during the campaign. This probability is known as the reach of the campaign. Let $w_j$ represent the budget allocated to advertising at the $j$th website, $j = 1, \dots, p$. Further, let $X_{ij}$ represent the number of times an ad appears to customer $i$ during her visits to website $j$ over the course of the ad campaign, $i = 1, \dots, n$. Hence, $Y_i = \sum_{j=1}^{p} X_{ij}$ corresponds to the total number of ad appearances to customer $i$ over all websites. Let us also denote by $Z$ an $n \times p$ matrix with entries $z_{ij}$ corresponding to the number of visits of customer $i$ to website $j$ during the time span of the ad campaign. In practice, such data (e.g., the comScore Media Metrix data) are available from commercial browsing-tracking companies such as comScore.


Within this context, our problem can be formulated as a fairly common marketing scenario: given that we are constrained by a budget $B$, how do we allocate that budget to maximize reach during our Internet display ad campaign? Mathematically this is equivalent to the following optimization problem:
\[
\min_{\mathbf{w}} \; \frac{1}{n} \sum_{i=1}^{n} P(Y_i = 0 \mid \mathbf{z}_i, \mathbf{w}) \quad \text{subject to} \quad \sum_{j=1}^{p} w_j \le B, \;\; w_j \ge 0, \; j = 1, \dots, p, \tag{1}
\]
where $\mathbf{w} = (w_1, \dots, w_p)$ denotes the budget allocation to the $p$ websites, and $\mathbf{z}_i = (z_{i1}, \dots, z_{ip})$ represents the number of times consumer $i$ visits the $p$ websites over the course of the Internet ad campaign.

Equation (1) is challenging to solve because $p$ may be in the thousands, making this an extremely high-dimensional optimization problem. Additionally, the optimal solution to Equation (1) should be able to accommodate corner solutions (i.e., the solution should allow $w_j = 0$ to arise as optimal for certain websites). We discuss how we address both challenges below.

We first express $P(Y_i = 0 \mid \mathbf{z}_i, \mathbf{w})$ as a function of $\mathbf{z}_i$ and $\mathbf{w}$, where $Y_i = \sum_{j=1}^{p} X_{ij}$. A natural approach is to model $X_{ij}$ as a Poisson random variable with expectation $\gamma_{ij}$, i.e., $X_{ij} \mid z_{ij}, w_j \sim \mathrm{Pois}(\gamma_{ij})$, or equivalently,
\[
P(X_{ij} = x \mid z_{ij}, w_j) = \frac{e^{-\gamma_{ij}} \gamma_{ij}^{x}}{x!}. \tag{2}
\]

In Equation (2), we model $\gamma_{ij}$ as the expected number of ad appearances to consumer $i$ at website $j$, given the consumer's number of visits to the site ($z_{ij}$) and the amount of money the focal firm spends on advertising at the site ($w_j$). This expected number of ad appearances is given by the probability of an ad appearing on a random visit to website $j$ (denoted $s_j$) multiplied by the number of visits ($z_{ij}$), i.e., $\gamma_{ij} = s_j z_{ij}$. For example, if a firm buys 20% of ad impressions at a particular website, and a consumer visits that website ten times during the course of the ad campaign, then $\gamma_{ij} = 0.2 \times 10 = 2$; on average, we expect the consumer to see the ad twice during the ten visits. The probability that the ad appears is simply the number of ad impressions bought at the website divided by the total number of expected visits by all customers to the site, so $s_j$ is called the share of ad impressions (Danaher et al., 2010). Note that, because of this, $s_j$ is interchangeable with $w_j$: buying all ad impressions for website $j$ means $s_j = 1$ (or, equivalently, $w_j$ is maximized such that the ad appears on every visit to the site), while buying no impressions means $s_j = 0$ (or, equivalently, $w_j = 0$). In the paragraph below, we provide the formula that gives the exact correspondence between $s_j$ and $w_j$.

Let $\tau_j$ represent the expected total number of visits to the $j$th website during the course of the ad campaign. Following Danaher et al. (2010), we operationalize $\tau_j$ as $\tau_j = \phi_j N$, with $\phi_j$ being the expected number of per-person visits to site $j$ during the ad campaign and $N$ being the total Internet population. Let $c_j$ represent the cost to purchase 1000 impressions (an industry standard, popularly referred to as CPM). Then the total number of impressions purchased is $1000 w_j / c_j$. Hence, we obtain the corresponding relationship between $s_j$ (share of ad impressions) and $w_j$ (budget spent) as $s_j = \frac{1000 w_j}{\tau_j c_j}$. For example, if the CPM of a particular website is \$2, the expected total number of visits to the website during the entire ad campaign is 10 million, and the firm spends \$500 advertising on the website, then the firm has bought 2.5% of the ad impressions at that website.
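This budget-to-share correspondence is simple enough to verify numerically. Below is a minimal Python sketch (function and variable names are ours, not the paper's) implementing $s_j = 1000 w_j / (\tau_j c_j)$ and $\gamma_{ij} = s_j z_{ij}$, checked against the worked examples above:

```python
def share_of_impressions(w_j, cpm_j, tau_j):
    """Share of ad impressions bought at site j: s_j = 1000 * w_j / (tau_j * c_j)."""
    return 1000.0 * w_j / (tau_j * cpm_j)

def expected_exposures(s_j, z_ij):
    """gamma_ij = s_j * z_ij: expected ad views for a consumer with z_ij visits."""
    return s_j * z_ij

# The example above: CPM of $2, 10 million expected visits, $500 spend.
s = share_of_impressions(w_j=500.0, cpm_j=2.0, tau_j=10_000_000)
print(s)                            # 0.025, i.e. 2.5% of impressions
print(expected_exposures(0.2, 10))  # 2.0 expected ad views
```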

Given $\gamma_{ij} = s_j z_{ij}$ and substituting $s_j$ with $\frac{1000 w_j}{\tau_j c_j}$, we can express $\gamma_{ij}$ as a function of $z_{ij}$ and $w_j$:
\[
\gamma_{ij} = \theta_{ij} \times w_j \quad \text{where} \quad \theta_{ij} = \frac{z_{ij}}{\tau_j c_j / 1000}. \tag{3}
\]
In Equation (3), $\theta_{ij}$ is a known quantity given values of $z_{ij}$, $\tau_j$, and $c_j$. With this setup, correlations in viewership among the $p$ websites are directly captured in the $z_{ij}$ terms, which carry into $\theta_{ij}$ and then into $\gamma_{ij}$. In Appendix A, we provide a simple illustration of how correlations in the $Z$ matrix are incorporated in our method.

Thus we can model $Y_i = \sum_{j=1}^{p} X_{ij}$ as a Poisson random variable with expected value $\gamma_i = \sum_{j=1}^{p} \gamma_{ij}$, i.e.,
\[
P(Y_i = y \mid \mathbf{z}_i, \mathbf{w}) = \frac{e^{-\gamma_i} \gamma_i^{y}}{y!}. \tag{4}
\]

Combining Equation (4) with our original Equation (1) gives the criterion we wish to optimize:
\[
\min_{\mathbf{w}} \; \frac{1}{n} \sum_{i=1}^{n} e^{-\gamma_i} \quad \text{subject to} \quad \sum_{j} w_j \le B \;\text{ and }\; w_j \ge 0, \; j = 1, \dots, p. \tag{5}
\]

The optimization in Equation (5) has the following appealing properties. First, because the objective function is a well-behaved convex and smooth function, the optimization is relatively easy to solve, even for large values of $p$. This transforms the original NP-hard problem into one that is relatively easy to optimize, and the algorithm will not stall at suboptimal local minima. Second, the form of Equation (5) encourages sparsity in the solution: under a given budget, as the number of websites under consideration increases, our optimization criterion automatically sets a budget of zero for more websites (hence the corner solutions we desired; see further discussion in Hastie et al., 2009, p. 71). Lastly, given the convex and smooth nature of the objective function, prior budget solutions can be used as effective starting points for neighboring budgets. Therefore, we are able to efficiently optimize over a range of budgets rather than merely solving for one particular budget at a time.
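For a fixed allocation, the objective in Equation (5), and hence reach, its complement, can be evaluated directly from the visit matrix. A sketch assuming NumPy; the inputs below are illustrative, not from the paper's data:

```python
import numpy as np

def expected_reach(Z, w, cpm, tau):
    """Reach implied by Equation (5): 1 - (1/n) * sum_i exp(-gamma_i),
    with gamma_i = sum_j theta_ij * w_j and theta_ij = z_ij / (tau_j * c_j / 1000)."""
    theta = Z / (tau * cpm / 1000.0)  # (n, p); broadcasts over websites
    gamma = theta @ w                 # expected ad exposures per panelist
    return 1.0 - np.mean(np.exp(-gamma))

# Two panelists, two sites: spending only on site 1 reaches only its visitor.
Z = np.array([[10.0, 0.0], [0.0, 5.0]])
w = np.array([100.0, 0.0])
r = expected_reach(Z, w, cpm=np.array([2.0, 2.0]), tau=np.array([1000.0, 1000.0]))
print(round(r, 4))  # 0.5
```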

2.2 The Optimization Algorithm

In order to solve Equation (5), we reformulate the optimization using a Lagrangean¹:
\[
\min_{\mathbf{w}} \; \frac{1}{n} \sum_{i=1}^{n} e^{-\gamma_i} + \frac{\lambda}{n} \Big( \sum_{j} w_j - B \Big) \quad \text{subject to} \quad w_j \ge 0, \; j = 1, \dots, p, \tag{6}
\]
where $\lambda > 0$ is the Lagrangean multiplier. (Note that $\lambda$ must be greater than zero in our setting, given the constraint that the budget must always be nonnegative.)

It is evident that, for each given budget, there is a corresponding Lagrangean multiplier $\lambda$. For a given number of websites, as the budget increases, $\lambda$ decreases and the algorithm allocates more budget to more websites; as the budget decreases, $\lambda$ increases and we get a sparser solution.

Since we optimize over the $w$ terms, Equation (6) can be simplified as Equation (7), with $B$ dropping out of the first order conditions:
\[
\min_{\mathbf{w}} \; \frac{1}{n} \sum_{i=1}^{n} e^{-\gamma_i} + \frac{\lambda}{n} \sum_{j} w_j \quad \text{subject to} \quad w_j \ge 0, \; j = 1, \dots, p. \tag{7}
\]

Although there is no direct closed form solution to Equation (7), similar problems have been extensively studied in the recent literature, particularly in statistics (e.g., Efron et al., 2004; Friedman et al., 2010; Goeman, 2010; Hesterberg et al., 2008; Rosset and Zhu, 2007; Schmidt et al., 2007). As a result, very efficient algorithms exist for solving such problems. In this paper, we utilize one of the most efficient and easy-to-implement algorithms, known as coordinate descent, to solve Equation (7) over a grid of values for $\lambda$, which in turn provides optimal allocations for a range of possible campaign budgets. The idea behind coordinate descent reduces our optimization to a sequence of one-dimensional optimizations, as described below (see Appendix B for more details of the algorithm):

¹In the statistical literature, this is commonly referred to as a penalized optimization equation. In statistics, the $\frac{\lambda}{n} \sum_j w_j$ penalty would frequently be written as an $\ell_1$ penalty rather than a summation penalty. However, for our setup the two are identical, since we have the condition $w_j \ge 0$ for all $j$.

Algorithm 1 Coordinate Descent Algorithm for Budget Optimization

1. Specify a maximum budget, $B_{\max}$.

2. Initialize the algorithm with $\tilde{\mathbf{w}} = \mathbf{0}$, $j = 1$, and $\lambda$ corresponding to $B = 0$.

3. For $j$ in 1 to $p$:

   (a) Marginally optimize Equation (7) over a single website budget $w_j$, keeping $w_1, \dots, w_{j-1}, w_{j+1}, \dots, w_p$ fixed.

   (b) Iterate until convergence.

4. Increase the budget by incrementally decreasing $\lambda$ over a grid of values, with each $\lambda$ corresponding to a budget, and repeat Step 3 until reaching $B_{\max}$.

What makes this approach so efficient is that each update step is fast to compute, and typically few iterations are required to reach convergence in Step 3 of the algorithm above. Note that convergence for the form of Equation (7) in Step 3 is guaranteed by Luo and Tseng (1992). Thus our optimization becomes very efficient to solve for a range of budgets at once.

However, because there is no closed form solution to Equation (7), we use a quadratic approximation of the objective function in Step 3 of Algorithm 1. Specifically, since we are using a coordinate descent approach around the current estimate $\tilde{\mathbf{w}}$, we employ a second order Taylor approximation of $e^{-\gamma_i}$ around $\tilde{\mathbf{w}}$ as follows:
\[
e^{-\gamma_i} \approx e^{-\tilde{\gamma}_i} \Bigg( 1 - \sum_{j=1}^{p} \theta_{ij} (w_j - \tilde{w}_j) + \frac{1}{2} \sum_{j=1}^{p} \sum_{k=1}^{p} \theta_{ij} \theta_{ik} (w_j - \tilde{w}_j)(w_k - \tilde{w}_k) \Bigg) \quad \text{s.t. } w_j, w_k \ge 0, \; j, k = 1, \dots, p, \tag{8}
\]


where $\tilde{\gamma}_i = \sum_{j=1}^{p} \theta_{ij} \tilde{w}_j$, and $\tilde{w}_j$ is our most recent estimate of $w_j$ from the last iteration of the algorithm.

Substituting (8) into (7) and computing the first order condition with respect to $w_j$, all terms involving $w_1, \dots, w_{j-1}, w_{j+1}, \dots, w_p$ drop out of our criterion. Hence, up to an additive constant (i.e., the first term of the Taylor expansion), we can approximate Equation (7) for a particular coordinate $w_j$ as:

\[
\min_{w_j} \; \frac{1}{n} \sum_{i=1}^{n} e^{-\tilde{\gamma}_i} \Big( -\theta_{ij} (w_j - \tilde{w}_j) + \frac{1}{2} \theta_{ij}^2 (w_j - \tilde{w}_j)^2 \Big) + \frac{\lambda}{n} w_j \quad \text{subject to} \quad w_j \ge 0. \tag{9}
\]

With our simplified criterion, we show in Appendix B that the first order condition for Equation (9) can be written as Equation (10), with the otherwise condition enforcing $w_j \ge 0$:
\[
w_j =
\begin{cases}
\tilde{w}_j + \dfrac{\sum_{i=1}^{n} e^{-\tilde{\gamma}_i} \theta_{ij} - \lambda}{\sum_{i=1}^{n} e^{-\tilde{\gamma}_i} \theta_{ij}^2} & \text{for } H_j > \lambda \\[1.5ex]
0 & \text{otherwise,}
\end{cases} \tag{10}
\]
where $H_j = \sum_{i=1}^{n} e^{-\tilde{\gamma}_i} \theta_{ij} (\tilde{w}_j \theta_{ij} + 1)$ (note that $H_j$ is always positive here). Equation (10) incorporates the $w_j \ge 0$ condition by testing whether the update has forced the $w_j$ coefficient below zero; if it has, we set the coefficient to 0, the minimum value allowed (since budget cannot be negative). This equation can be computed quite efficiently.

Therefore, the optimization in Equation (7) can be solved by iteratively computing Equation (10) for $j$ from 1 to $p$ and repeating until convergence.² Appendix B also demonstrates the computational efficiency of our algorithm: when the number of websites under consideration increases to 5000, it takes less than twenty seconds to optimize for a particular budget on a personal laptop computer with a 2.30 GHz processor.
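The per-coordinate update of Equation (10), embedded in the loop of Algorithm 1, can be sketched in a few lines of NumPy. This is our illustrative reconstruction for a single value of $\lambda$, not the authors' code; `theta` is the known $n \times p$ matrix of $\theta_{ij}$ values from Equation (3):

```python
import numpy as np

def optimize_allocation(theta, lam, max_iter=500, tol=1e-10):
    """Coordinate descent for Equation (7):
    minimize (1/n) sum_i exp(-gamma_i) + (lam/n) sum_j w_j  with  w_j >= 0,
    using the per-coordinate update of Equation (10)."""
    n, p = theta.shape
    w = np.zeros(p)
    gamma = theta @ w
    for _ in range(max_iter):
        w_prev = w.copy()
        for j in range(p):
            e = np.exp(-gamma)
            a = e @ theta[:, j]           # sum_i exp(-gamma_i) * theta_ij
            b = e @ theta[:, j] ** 2      # sum_i exp(-gamma_i) * theta_ij**2
            h = a + w[j] * b              # H_j = sum_i e * theta_ij * (w_j*theta_ij + 1)
            w_new = w[j] + (a - lam) / b if h > lam else 0.0
            gamma += theta[:, j] * (w_new - w[j])  # keep gamma_i in sync with w
            w[j] = w_new
        if np.max(np.abs(w - w_prev)) < tol:
            break
    return w
```

As a sanity check, for a one-site toy problem with $\theta = 1$ the objective is $e^{-w} + \lambda w$, whose minimizer is $w = \ln(1/\lambda)$; the sketch recovers this, and a sufficiently large $\lambda$ drives every coordinate to the zero corner solution.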

2.3 Model Extensions

In what follows we discuss three extensions to the proposed method. We provide an illustration of each extension in Section 3.

²Because we employ a Taylor approximation in our algorithm, we also conducted empirical evaluations to verify the convergence of the approximation. We ran our algorithm with numerous initialization points to determine whether the optimization had converged to a global optimum. In all cases, we obtained identical solutions regardless of initialization point, and convergence was achieved in very few iterations.


2.3.1 Extension 1: Targeted Consumer Demographics

In this subsection we describe how the method discussed above can be modified to accommodate targeted consumer demographics. Suppose that each individual belongs to one of $m$ possible demographic groups. For example, if we wished to target people based on household income and whether or not they have children, we could have $m = 4$ possible demographic groups (low household income with or without children, and high household income with or without children). It will often be the case that the "actual" proportions of individuals with these demographics in our data, $P_{1,a}, \dots, P_{m,a}$, differ from the firm's targeted demographic makeup, $P_{1,d}, \dots, P_{m,d}$. For instance, it may be that the fraction of individuals with low household income and with children in our data $Z$ is $P_{LC,a} = 0.3$, while the focal firm's target consumer base consists of a much greater percentage of such consumers, e.g., $P_{LC,d} = 0.6$. Within this context, we would like to upweight individuals with low household income and children in our data sample.

This is easily accomplished with a simple adaptation of Equation (7):
\[
\min_{\mathbf{w}} \; \frac{1}{n} \sum_{i=1}^{n} p_i e^{-\gamma_i} + \frac{\lambda}{n} \sum_{j} w_j \quad \text{subject to} \quad w_j \ge 0, \; j = 1, \dots, p, \tag{11}
\]
where $p_i = P_{D_i,d} / P_{D_i,a}$ and $D_i$ represents the demographic group into which individual $i$ falls. Since $P_{D_i,a}$ is computed from observed data and $P_{D_i,d}$ is based on the focal firm's target customer base, $p_i$ is a fixed and known quantity. Therefore, optimizing Equation (11) is accomplished in exactly the same fashion as for Equation (7).

2.3.2 Extension 2: Mandatory Media Coverage to Matched Content Websites

Aside from targeted consumer demographics, a firm might wish to impose mandatory media coverage on certain subsets of websites. For example, when planning the online advertising campaign for its annual "wave season," Norwegian Cruise Lines may want to allocate a certain minimum budget to advertising on aggregate travel sites such as Orbitz or Expedia in addition to other websites. In this subsection we discuss how the proposed method can be modified to accommodate such requirements. Specifically, we can modify Equation (7) to require $w_j$ to be above a certain threshold, say $w_j \ge \min_j$, to ensure that a minimum budget is allocated to each aggregate travel website $j$.

Using the same approach as for optimizing Equation (7), we can show that the new optimization is accomplished by setting the "otherwise" condition in Equation (10) to a minimum non-zero amount. Specifically, we would replace Equation (10) with the following:
\[
w_j =
\begin{cases}
\tilde{w}_j + \dfrac{\sum_{i=1}^{n} e^{-\tilde{\gamma}_i} \theta_{ij} - \lambda}{\sum_{i=1}^{n} e^{-\tilde{\gamma}_i} \theta_{ij}^2} & \text{for } H_j - \lambda > \min_j \\[1.5ex]
\min_j & \text{otherwise.}
\end{cases} \tag{12}
\]
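Relative to Equation (10), only the threshold and the fallback value change. A sketch of the modified single-coordinate update (our naming); with min_j = 0 it reduces exactly to Equation (10):

```python
def floored_update(w_j, a, b, lam, min_j=0.0):
    """Equation (12): coordinate update with a mandatory minimum budget min_j.
    a = sum_i exp(-gamma_i) * theta_ij,  b = sum_i exp(-gamma_i) * theta_ij**2."""
    h = a + w_j * b  # H_j from Equation (10)
    return w_j + (a - lam) / b if h - lam > min_j else min_j

print(floored_update(0.0, a=1.0, b=1.0, lam=0.1))             # 0.9 (same as Eq. 10)
print(floored_update(0.0, a=0.0, b=1.0, lam=0.5, min_j=10))   # 10  (floor enforced)
```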

2.3.3 Extension 3: Target Frequency of Ad Exposure

Another practical consideration in an online advertising campaign is the target frequency of ad exposures (e.g., Krugman, 1972; Naples, 1979; Danaher et al., 2010). For example, sales conversions and profits from online display ads might be highest when the consumer is served an ad within a certain range of frequencies (e.g., one to three times) during the ad campaign. The proposed method can be readily modified to accommodate such considerations. Within our context, this corresponds to $P(k_a \le Y_i \le k_b \mid \mathbf{z}_i, \mathbf{w})$, where $k_a < k_b$ respectively represent lower and upper bounds on ad exposures. Given prior experience, the firm might determine the lower bound ($k_a$) and the upper bound ($k_b$) for the target range of ad exposures. This is known as effective frequency or frequency capping (the latter typically sets the lower bound at 1 and imposes an upper bound on the number of ad exposures).

Within our context, we can modify Equation (5) as follows to accommodate such considerations:
\[
\max_{\mathbf{w}} \; \frac{1}{n} \sum_{i=1}^{n} \sum_{y=k_a}^{k_b} P(Y_i = y \mid \mathbf{z}_i, \mathbf{w}) \quad \text{subject to} \quad \sum_{j} w_j \le B, \;\text{ and }\; w_j \ge 0, \tag{13}
\]
where, as before, $P(Y_i = y \mid \mathbf{z}_i, \mathbf{w}) = \frac{e^{-\gamma_i} \gamma_i^{y}}{y!}$. Using the example of $1 \le Y_i \le 3$, our problem involves maximizing
\[
\frac{1}{n} \sum_{i=1}^{n} e^{-\gamma_i} \Big( \gamma_i + \frac{1}{2} \gamma_i^2 + \frac{1}{6} \gamma_i^3 \Big). \tag{14}
\]
Again we take a second-order Taylor expansion, resulting in equations of a similar form to Equations (9) and (10).
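The windowed objective in Equation (13) is just a truncated Poisson sum, and for $1 \le Y_i \le 3$ it collapses to the closed form in Equation (14). A quick numerical check (the value of $\gamma$ is illustrative):

```python
import math

def window_prob(gamma, k_a, k_b):
    """P(k_a <= Y <= k_b) for Y ~ Poisson(gamma), the inner sum of Equation (13)."""
    return sum(math.exp(-gamma) * gamma**y / math.factorial(y)
               for y in range(k_a, k_b + 1))

# For 1 <= Y <= 3, the sum equals e^{-gamma} (gamma + gamma^2/2 + gamma^3/6):
g = 2.0
closed = math.exp(-g) * (g + g**2 / 2 + g**3 / 6)
print(abs(window_prob(g, 1, 3) - closed) < 1e-12)  # True
```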


3. Empirical Investigation

In Section 3.1, we compare the proposed method with the method of Danaher et al. (2010). In Section 3.2, we demonstrate how our method can be used for optimal budget allocation when the number of websites under consideration is very large (e.g., 5000 websites), which is computationally prohibitive for extant methods. In Sections 3.3 and 3.4, we discuss two case studies in which we use the proposed method and its extensions for McDonald's McRib and Norwegian Cruise Lines' Wave Season online advertising campaigns.

Our empirical illustrations are based on the 2011 comScore Media Metrix data, which come from the Wharton Research Data Service (www.wrds.upenn.edu). comScore uses proprietary software to record daily webpage usage information from a panel of 100,000 Internet users (recorded anonymously by individual computer). Therefore, the comScore data can be used to construct a matrix of all websites visited and the number of times each computer visited each website during a particular time period. A number of prior studies in marketing have utilized comScore Media Metrix data (e.g., Danaher, 2007; Liaukonyte et al., 2015; Montgomery et al., 2004; Park and Fader, 2004).³

3.1 Comparison between Proposed Method and Danaher et al. (2010)

3.1.1 Comparison using Data Simulated from Danaher et al.'s Sarmanov Function

To date, the state-of-the-art method for optimal budget allocation of Internet display ads is

by Danaher et al. (2010). A basic premise of this method is that the number of visits individuals have to websites (denoted as an n by p matrix Z in our context) can be characterized

by a multivariate negative binomial distribution (referred to as MNBD hereafter). Within

this setup, Danaher et al. (2010) propose an optimization method to maximize reach for

each given budget.

3We followed Danaher et al. (2010) to calculate the effective Internet population size for our data (denoted as N in Section 2). We first consider the size of the U.S. population at the time of our data set, which is 310.5 million (Schlesinger, 2010). We then multiply it by the proportion of users who actually visited at least one website in our data set (for example, 48.63% in our comScore January 2011 data). We then define N as 155.25 million (48.63%*310.5 million). It is worth noting that, because the specific value of N simply serves as a baseline effective Internet population estimate in our reach estimates, the relative performance of various methods remains qualitatively intact if N is defined as a smaller/greater proportion of the U.S. population.


To examine how our method performs under the basic premise of Danaher et al.’s ap-

proach, we first simulate a Z matrix from an MNBD distribution with a set of known

parameters. Based on the simulated Z matrix, we know the true optimal reach under each

budget. Next, we apply both methods on the simulated Z matrix and compare the discrep-

ancies between the true optimal reach and the reach obtained based on the budget allocations

suggested by the two methods.

Because the Z matrix in this case originates from the MNBD distribution (which is the

basic premise of Danaher et al.'s method), we expect that Danaher et al.'s (2010) method

would perform better than the proposed method under such comparisons. Nevertheless, we

aim to evaluate the extent to which the proposed method could achieve a reach that is similar

to the true optimal or the reach obtained under Danaher et al.’s (2010) method. Because

Danaher et al.’s (2010) method is only computationally efficient for budget allocation across

a relatively small number of websites, we demonstrate such comparisons for the case of seven

websites below.

We first generate the Internet usage matrix, Z, with 5000 rows (users) and 7 columns

(websites), based on an MNBD with αj and rj, j = 1, ..., 7, the usual parameters associated with an MNBD, and ωj,j′, a set of correlation parameters denoting the correlation coefficient

in viewership between websites j and j′. To make our simulation as realistic as possible,

we establish αj , rj , and ωj,j′ as the values from the seven most visited websites from the

December 2011 comScore data. We also use the CPMs provided by comScore's 2010 Media Metrix (Lipsman, 2010) in this simulation. See Appendix C for more details on our data

generation method.

We then employ the following procedure to compare the two methods. We first obtain

the true optimal reach under each budget based on the true αj , rj , and ωj,j′ parameters and

the optimal criterion in Danaher et al.’s (2010) method. Next, we apply both the proposed

and Danaher et al.’s (2010) methods on the simulated Z matrix to obtain the corresponding

reach estimates. Note that Danaher's methodology optimizes over share of impressions, s_j, instead of monetary spending, w_j. Nevertheless, we can readily convert s_j to w_j using the formula s_j = 1000 w_j/(τ_j c_j) as given in Section 2.4
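For concreteness, this conversion between spend and impression share can be sketched as follows (the impression total τ_j and CPM values below are hypothetical):

```python
def spend_to_share(w_j, tau_j, cpm_j):
    """Share of site j's impressions bought with spend w_j dollars,
    where tau_j is the site's total impressions and cpm_j its cost
    per thousand impressions: s_j = 1000 * w_j / (tau_j * cpm_j)."""
    return 1000.0 * w_j / (tau_j * cpm_j)

def share_to_spend(s_j, tau_j, cpm_j):
    """Inverse conversion: dollars needed to buy share s_j."""
    return s_j * tau_j * cpm_j / 1000.0

# Buying every impression (s_j = 1) at a site with 2 million
# impressions and a $4.00 CPM costs tau_j * cpm_j / 1000 = $8,000.
assert share_to_spend(1.0, 2_000_000, 4.0) == 8_000.0
assert spend_to_share(8_000.0, 2_000_000, 4.0) == 1.0
```

The cap w_j ≤ τ_j c_j/1000 (i.e., s_j ≤ 1) corresponds to the stopping condition described in footnote 4.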

4Since Danaher et al.'s reach function is highly nonconvex, it can find local optima during optimization. Consequently, we run this optimization with several initialization points and choose the results with the highest reach in our result comparisons. Since a firm cannot buy more than 100% of ad impressions at a website (i.e., 0 ≤ s_j ≤ 1), we force our algorithm's optimization to stop allocating budget to a website once w_j = τ_j c_j/1000 is reached (corresponding to s_j = 1).

[Figure 1: Performance Comparison between Proposed and Danaher's (Simulated Data). Reach is plotted against budget (in millions, 0.5 to 2.0) under Danaher's reach definition (left panel) and the proposed method's reach definition (right panel), for the true optimal, proposed, and Danaher allocations.]

Given that the proposed and Danaher et al.'s (2010) methods each have their own definitions of the reach function, we report the reach comparisons using both reach definitions. (See Danaher et al. 2010 for the formal definition of their reach function.) Figure 1 shows the reach curves for the average reach estimate at each budget across the 100 simulation runs using both definitions of reach. Note that the true optimal reach is in solid black, the Danaher estimate is in dashed red, and our estimate is in dotted blue.

When using Danaher et al.'s reach function (left panel), both methods yield reach fairly close to the true optimal reach. As expected, Danaher's method performs slightly better under this comparison, because not only are we using Danaher's definition of reach, but

we also generate the data from the MNBD assumed by Danaher’s method. When using

our reach definition (right panel), again, both methods perform reasonably well. Our reach

estimate even slightly outperforms the optimal reach toward the higher budgets. This occurs

because the optimal reach was computed under Danaher's reach definition, whereas this panel evaluates reach under our definition.

Overall, the comparisons above demonstrate that, even when the Internet usage matrix

Z is simulated from the MNBD as assumed in Danaher’s method, our method performs

reasonably well. Comparing the computation speed of the two methods, we discover that the

computation speed of the proposed method is over ten times faster under this setting. Given

the highly non-convex nature of the optimization criterion in Danaher et al.'s (2010) method, the

discrepancies in computation speed would increase exponentially for larger-scale problems.

3.1.2 Comparison using comScore Data

In this subsection, we compare the two methods using the December 2011 comScore Media

Metrix data. Specifically, we use Internet usage data from the top seven most visited websites

that support Internet display advertisements. The full month of data contained 51,093 users

who visited one of the seven websites at least once in December 2011. We fit both the

proposed and Danaher’s method to 100 randomly chosen subsets of these users, each of size

5,109 (approximately ten percent of the population). Again, we use the CPMs as given in

comScore Inc.'s Media Metrix data from May 2010 (Lipsman, 2010).

Figure 2 shows the reach curves for the average reach at each budget across the 100

sample runs, using both reach functions. Within this context, we define the true optimal

reach (black solid) as that obtained from our method applied to the entire data set of 51,093

users. Danaher's (red dashed) and our (blue dotted) estimates are both computed from the

10% subsets of the data. This also approximates real-world conditions in which a company

has access to only part of the total browsing history of all Internet users. All reach curves in

Figure 2 are then calculated on the ninety percent holdout data to ensure fair comparisons

across methods.

When using Danaher’s definition of reach (left panel), the three methods yield relatively

similar reach. Similar to the right panel of Figure 1, the Danaher reach estimate outperforms

the “optimal” reach in the left panel because the optimal reach in this case was computed using the proposed method. When using our definition of reach (right panel), the full data set performs best as expected, followed by results from applying our method to the 10% subset, and finally the Danaher estimates.

[Figure 2: Performance Comparison between Proposed and Danaher's (Real Data). Reach is plotted against budget (in millions, 0.5 to 2.0) under Danaher's reach definition (left panel) and the proposed method's reach definition (right panel), for the optimal, proposed, and Danaher allocations.]

To conclude, both comparisons in Section 3.1 illustrate that the proposed and Danaher’s

methods perform similarly when applied to problem settings scalable to the latter. However,

due to its non-convex optimization criterion, the Danaher approach is considerably slower to

compute and as a result encounters significant computational difficulties in settings involving

a large number of websites. In Section 3.2, we demonstrate that, while computationally

prohibitive for extant methods, the proposed method can be used to optimally allocate

advertising budget across a very large number of websites.


3.2 Simulated Large-Scale Problem: 5000 Websites

In practice, most Internet media selection problems involve far more than a handful of web-

sites. In this subsection we illustrate how the proposed method can optimize over thousands

of websites. To demonstrate this, we simulate an Internet usage matrix of 50,000 people over

5000 websites.5 The visits to each website are randomly generated from a standard normal

distribution (after both rounding and taking the absolute value, since website views are nonnegative integers), which are then multiplied by a random integer from zero to ten with higher

weight on a value of zero. This ensures that our simulated data set has similar characteristics

to the observed comScore data, since we observe a high percentage of matrix entries in the

real data as zeros. The CPMs of these websites are randomly generated, chosen from 0.25

to 8.00 in increments of 0.25.
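A sketch of this data-generating process (scaled down for illustration; the exact zero-weighting is not specified in the text, so the probabilities below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_sites = 5_000, 500  # scaled down from 50,000 x 5,000

# Nonnegative integer visit counts: |round(N(0, 1))| ...
base = np.abs(np.rint(rng.standard_normal((n_users, n_sites)))).astype(int)

# ... multiplied by an integer in 0..10 drawn with extra weight on
# zero, so most entries are zero, as in the comScore matrix.
p = np.array([0.70] + [0.03] * 10)  # assumed weights; sums to 1
multiplier = rng.choice(11, size=(n_users, n_sites), p=p)
Z = base * multiplier

# CPMs drawn uniformly from {0.25, 0.50, ..., 8.00}.
cpms = rng.choice(np.arange(0.25, 8.25, 0.25), size=n_sites)
```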

As before, we run our method over a 10% subset of the 50,000 users, then calculate reach on the 90% holdout data in our result comparisons. Because it is computationally

prohibitive for Danaher et al.’s method to optimize over 5000 websites, we compare the

proposed method to the following benchmark approaches: 1) equal allocation over all 5000

websites; and 2) cost-adjusted equal allocation (i.e. average number of visits/CPM) over the

most visited 10, 25, and 50 websites. We believe that these alternative approaches mimic

approaches often used in practice when the sheer number of websites is infeasible to examine

individually, such as those outlined in Cho and Cheon (2004).
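The cost-adjusted benchmark can be sketched as follows (this is our reading of "average number of visits/CPM" as proportional weighting over the top-k sites; the matrix Z and CPM vector are hypothetical inputs):

```python
import numpy as np

def cost_adjusted_allocation(Z, cpms, budget, k):
    """Benchmark allocation: restrict to the k most-visited sites and
    split the budget in proportion to average visits / CPM."""
    avg_visits = Z.mean(axis=0)
    top_k = np.argsort(avg_visits)[::-1][:k]
    scores = avg_visits[top_k] / cpms[top_k]
    w = np.zeros(Z.shape[1])
    w[top_k] = budget * scores / scores.sum()
    return w

# Toy example: 3 users, 4 sites.
Z = np.array([[5, 0, 1, 0],
              [4, 1, 0, 0],
              [6, 1, 1, 0]])
cpms = np.array([2.0, 1.0, 1.0, 1.0])
w = cost_adjusted_allocation(Z, cpms, budget=100.0, k=2)
assert w[3] == 0.0 and abs(w.sum() - 100.0) < 1e-9
```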

Figure 3 shows the result comparisons. Even with only a ten percent subset of the data,

the proposed method yields reach estimates very similar to the optimal reach estimate based

on the entire dataset. In addition, the proposed method outperforms all the benchmark

approaches. We find that the equal allocation approach is by far the worst. The cost-

adjusted approaches perform better, but still worse than our method. Overall, we show that

the proposed method can be used to effectively allocate advertising budget across a very

large set of websites.

5We choose to simulate this dataset because data cleaning in comScore for 5000 websites is highly time consuming. For simplicity, the data is generated independently without correlations. Since the proposed method is designed to leverage correlations across sites, this setup provides a lower bound with respect to advantages from our approach.

[Figure 3: Simulated Data Reach, 5000 Websites. Reach is plotted against budget (in millions) for the optimal and proposed methods and the top-site and equal allocation benchmarks.]

3.3 Case Study 1: McDonald’s McRib Sandwich Online Advertising Campaign

We now demonstrate how the proposed method can be applied in real-world settings. In our

first case study, we consider a yearly promotion for McDonald’s McRib Sandwich, which is

only available for a limited time each year (approximately one month).

Because McRib is often offered in or around December (Morrison, 2012), we consider

the comScore data from December 2011 to approximate a McRib advertising campaign. In

particular, we manually went through the comScore data set to identify the 500 most visited

websites that also supported Internet display ads. Our data then contains a record of every

computer that visited at least one of these 500 websites at least once (56,666 users). Thus

Z is a 56,666 by 500 matrix. We then separate our full data set into a ten percent training


data set (5,667 users) and a ninety percent holdout data set. As before, we use the

training data to fit the method, then calculate reach on the holdout data.
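The calibration/holdout split used throughout can be sketched as follows (a hypothetical helper; the paper does not give its exact sampling code):

```python
import numpy as np

def split_users(Z, calib_frac=0.10, seed=0):
    """Randomly split the rows (users) of the usage matrix Z into a
    small calibration set and the remaining holdout set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(Z.shape[0])
    cut = int(round(calib_frac * Z.shape[0]))
    return Z[idx[:cut]], Z[idx[cut:]]

Z = np.arange(60).reshape(20, 3)   # toy 20-user matrix
calib, holdout = split_users(Z)    # 2 calibration users, 18 holdout
assert calib.shape == (2, 3) and holdout.shape == (18, 3)
```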

Table 1 provides the categorical makeup of the 500 websites we consider in this applica-

tion. We include sixteen categories of websites: Social Networking, Portals, Entertainment,

E-mail, Community, General News, Sports, Newspapers, Online Gaming, Photos, Fileshar-

ing, Information, Online Shopping, Retail, Service, and Travel. The Total Number column

provides the total number of websites in each category. For simplicity, the CPM values for

each website are based on average costs of the website categories provided by comScore Inc.’s

Media Metrix data from May 2010 (Lipsman, 2010).6 Table 1 shows that Entertainment and

Gaming are by far the largest categories (with 92 and 77 websites out of 500, respectively),

while Sports, Newspaper, and General News are the most expensive at which to advertise

(all over $6.00). Additionally, it appears in Table 1 that advertising costs vary considerably

across these website categories. In Appendix D (Table A3), we also provide an overview of

viewership correlations within and across each of the sixteen website categories.

Table 1 also shows the number of websites chosen in each of the sixteen website categories

over three different methods: 1) the original approach that maximizes overall reach, 2) our

extension to maximize reach among targeted consumer demographics, and 3) our extension

to maximize effective reach with target frequency of ad exposures. This table also provides

the number of websites chosen in each category when we only account for the top 25 and

top 50 most visited sites as benchmarks to our approach. More details about our result

comparisons are provided below.7

3.3.1 McRib Campaign: Maximizing Overall Reach

In this subsection, we assume that McDonald’s simply attempts to reach as many users

as possible during its McRib campaign. Again, because Danaher et al’s (2010) method

cannot optimize over 500 websites, we use the following benchmark methods in our model

comparisons: equal allocation over all 500 websites, and cost-adjusted equal allocation across

the top 10, 25, and 50 most visited websites.8

6In practice firms could readily apply actual CPMs of all sites in such an optimization.

7Detailed budget allocation results for each budget and each website are available from the authors upon request.

8Note that, while included in Figure 4, the 10-website benchmark method is omitted from Table 1 for space considerations.


                                      Proposed Method                                    Benchmark
                            Budget = $500K                  Budget = $2 million
                 Total          Targeted   Targeted             Targeted   Targeted   Top  Top
Category         Number  CPM   Original Consumers Exposures  Original Consumers Exposures  25   50
Community           23   2.10      8        8        11         14       14        20       1    4
E-mail               7   0.94      7        7         7          7        7         7       3    5
Entertainment       92   4.75      2        1        10         13       10        29       0    0
Fileshare           28   1.08     23       20        26         24       22        28       2    7
Gaming              77   2.68     30       40        44         37       45        59       0    1
General News        12   6.14      0        0         0          0        0         0       0    0
Information         47   2.52     24       25        29         27       27        36       1    3
Newspaper           27   6.99      0        0         0          0        0         0       0    0
Online Shop         29   2.52     11       12        15         15       15        26       1    1
Photos               9   1.08      6        6         9          8        9         9       0    2
Portal              30   2.60     13       14        17         16       16        26       5    7
Retail              57   2.52     33       39        39         36       41        49       2    7
Service             18   2.52     13       14        10         14       14        12       2    2
Social Network      17   0.56     16       17        17         17       17        17       8   11
Sports              17   6.29      0        0         1          1        0         1       0    0
Travel              10   2.52      6        7         8          8        8         8       0    0

Table 1: Website Categories Chosen by Method, McRib

Table 1 reports the categorical makeup of chosen sites under two budgets ($500K and $2

million). This categorical makeup shows how many websites in each category were chosen

with non-zero budget allocation in the solutions of the optimization. It is not surprising

that the optimization does not select many websites in relatively expensive categories such

as Sports, Newspaper, and General News. Advertising at a relatively expensive website is

only desirable when that website can reach an otherwise unreachable audience. In this case,

other websites offer reach without the high price. Social Networking, for example, offers a

relatively inexpensive way to reach consumers who are visiting other websites as well. Note

that in Table A3 in Appendix D, social networking sites have relatively high correlations

in viewership across other site categories with the only exception being email and gaming

sites. Consequently, the optimization ultimately includes all 17 Social Networking websites

and leaves out the expensive categories where reach would be duplicated.

[Figure 4: McRib Campaign, Maximizing Overall Reach. Reach is plotted against budget (in millions, 0.0 to 2.5) for the optimal, proposed, top-50, top-25, top-10, and equal allocation methods.]

It is also worth noting that our optimization selects all websites in the Email category. In

addition to the relatively low cost of advertising on these websites, there is a very low within-category correlation in viewership among email sites (0.01 absolute average correlation; see

Appendix D). This indicates that the same consumer often does not visit more than one

email site, so including an additional email website in the optimization can result in a larger

increase in reach.

Figure 4 shows the results from the proposed method with the comparison methods.

This figure demonstrates that the proposed method again performs well with ten percent

calibration data. The reach estimates based on the ten percent calibration data are very close

to those from the true optimal based on the entire data. Additionally, the reach estimates

from the naive approaches are significantly below both.


                           Actual   Desired
No Children                 0.344     0.25
Children                    0.656     0.75
Income below 15,000         0.135     0.25
Income 15,000–24,999        0.074     0.20
Income 25,000–34,999        0.100     0.20
Income 35,000–49,999        0.150     0.15
Income 50,000–74,999        0.260     0.10
Income 75,000–99,999        0.140     0.05
Income above 100,000        0.141     0.05

Table 2: True and Desired Proportions in Data

3.3.2 McRib Campaign with Targeted Consumer Demographics

In practice, companies often have specific target demographics in mind when running online

display ads. In this section we demonstrate that our method could be readily modified to

accommodate such needs. For illustration purposes, we consider two demographic variables

(children and income level).

We chose these two demographic variables because McDonald’s has historically targeted

families with children (Mintel, 2014). We also know fast food in general tends to target

lower-income households (Drewnowski and Darmon, 2005). Because of this, we illustrate

our approach in a scenario where the McRib campaign wishes to reweight the comScore data

set with greater emphasis on individuals from lower-income households with children.

Following the procedure outlined in Section 2.3.1, we reweight the comScore data with

target population makeup in each variable category as shown in Table 2. For example,

for “children present,” since we want to give individuals with children greater weights than

those who do not have children, we assign a weight of 0.75 to having children and 0.25 to

not having children. We do a similar weighting for income level. We choose these desired

weights arbitrarily to demonstrate our method, but in practice companies would presumably

have data on target proportions before running the campaign.
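One simple way to implement such reweighting (a sketch of the idea only; the paper's exact procedure is given in its Section 2.3.1) is to weight each panelist by the ratio of desired to actual proportions in each demographic category:

```python
def user_weight(user, actual, desired):
    """Weight a user by the product, over demographic variables, of
    desired/actual proportion for the user's category."""
    w = 1.0
    for var, cat in user.items():
        w *= desired[var][cat] / actual[var][cat]
    return w

# Proportions for the "children present" variable from Table 2.
actual = {"children": {"yes": 0.656, "no": 0.344}}
desired = {"children": {"yes": 0.75, "no": 0.25}}

# Users with children are upweighted; users without are downweighted.
assert user_weight({"children": "yes"}, actual, desired) > 1.0
assert user_weight({"children": "no"}, actual, desired) < 1.0
```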


Table 1 shows the number of websites chosen in the reweighted setup compared to the

standard setup. In this example, reweighting the data does not drastically change the types of

websites chosen during our optimization. Families with children and lower-income households

did not represent a significant deviation from the overall data set in terms of their Internet

browsing behavior. However, we do observe some slight changes. For example, the number

of Gaming websites increases when we reweight our data. Most of the gaming websites in our

data set are online flash-based game websites which primarily target young players (360i,

2008). Hence, it is likely that proportionally more McDonald’s consumers frequently visit

such sites.

3.3.3 McRib Campaign with Target Frequency of Ad Exposure

In this subsection we demonstrate a case in which McDonald’s wishes to allocate its ad

budget such that each individual is exposed to the ad no more than three times during the

course of the McRib campaign. For simplicity, we use the data set without demographic

reweighting, although both approaches could readily be used together. In this case, the

“effective reach” is the value of the function \(e^{-\gamma}\left(\gamma + \tfrac{1}{2}\gamma^{2} + \tfrac{1}{6}\gamma^{3}\right)\).

Again, Table 1 shows the optimization allocation across website categories for this ex-

tension. In general, under this extension, our method chooses more websites, with a corre-

spondingly lower average budget at each one. This allows more viewers to be reached with

the ad, but limits the probability an ad will appear to a particular viewer more than three

times. One example of this is the increase in number of Gaming websites chosen by the

algorithm. Gaming websites have many repeat visitors, but low correlation among visitation

to websites within the Gaming category. The algorithm chooses to advertise a small amount

at a number of Gaming sites, which gives consumers a low probability of seeing the ad on

any particular visit, but will ultimately reach different consumers with each ad appearance.

Overall, the algorithm less often includes websites with high repeat visitation. This helps

ensure that a consumer does not see the ad more times than desired. Another example of this

is that the algorithm chooses more Entertainment websites. Although the Entertainment

category is more expensive than others, we observe low repeat visitation for Entertainment

websites in our Z matrix. These websites appear to be visited more universally, so advertising on an Entertainment website exposes the ad to a larger number of distinct consumers.


3.4 Case Study 2: Norwegian Cruise Lines' Wave Season Online Advertising Campaign with Mandatory Media Coverage to Travel Aggregate Sites

Each year, the cruise industry advertises for its annual “wave season”, which begins in

January. Norwegian Cruise Lines (NCL) is among the cruise lines that participate heavily in

wave season (Satchell, 2011). Because consumers who are interested in booking a cruise often

use travel aggregation sites like Orbitz and Priceline to compare offerings across multiple

cruise lines, we use this case study to demonstrate the extension in which the proposed

method is applied in such a scenario. We consider that NCL wants to allocate at least a

minimum amount of budget to a set of major aggregate travel websites. While this is a

hypothetical example, it is realistic and can be readily applied to similar scenarios.

Our method handles such scenarios using the extension described in Section 2.3.2. Imag-

ine NCL wants to allocate at least twenty percent of any given budget to eight major ag-

gregate websites (CheapTickets.com, Expedia.com, Hotwire.com, Kayak.com, Orbitz.com,

Priceline.com, Travelocity.com, and TripAdvisor.com). We require our optimization to place

at least 2.5 percent of the budget at each of these eight sites.
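These floor constraints can be sketched as follows (an illustration of the bookkeeping only; the constrained optimizer itself is described in Section 2.3.2):

```python
def reserve_minimums(budget, must_buy_sites, min_frac=0.025):
    """Reserve min_frac of the budget for each mandatory site and
    return the per-site floors plus the freely allocatable remainder."""
    floors = {site: min_frac * budget for site in must_buy_sites}
    remainder = budget - sum(floors.values())
    if remainder < 0:
        raise ValueError("mandatory minimums exceed the budget")
    return floors, remainder

sites = ["CheapTickets.com", "Expedia.com", "Hotwire.com", "Kayak.com",
         "Orbitz.com", "Priceline.com", "Travelocity.com",
         "TripAdvisor.com"]
floors, rest = reserve_minimums(100_000.0, sites)
# 8 sites x 2.5% = 20% reserved, leaving 80% for the optimizer.
assert rest == 80_000.0 and floors["Orbitz.com"] == 2_500.0
```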

We follow the same procedure as in the previous case study to obtain the 500 most

visited websites in January 2011 that supported online display advertisements. These 500

websites are also divided into sixteen categories and assigned an average CPM based on

their category. 48,628 users visited at least one of these 500 websites during January 2011,

meaning our Z matrix is 48,628 by 500. We again divide this data into a 10% subset (4,863

users) of calibration data and use the remaining 90% as holdout data.9

Figure 5 demonstrates our reach curves under this extension. We refer to the optimization

with mandatory media coverage of aggregate travel sites as constrained optimization (in dashed blue), and the standard optimization approach as unconstrained (in solid black). We also

include a naive method, allocating the entire budget evenly to the eight aggregate sites (in

dotted green).

[Figure 5: Reach with Mandatory Coverage in Aggregate Travel Sites. The left panel plots reach based on the overall data and the right panel plots reach based on the subset of travel site users, each against budget (in millions), for the unconstrained, constrained, and equal allocation methods.]

The curves on the left show the calculation of reach using the entire data set, i.e., the full 90% holdout data. As we expect, the unconstrained curve performs slightly better than the constrained curve, since we cannot do better in overall reach by constraining our optimization. In addition, the naive approach performs poorly. Because the aggregate travel websites do not reach a majority of the users of the data set, allocating budget only to these eight websites will naturally limit the ad's exposure to all Internet users.

9We omit the website category makeup description of this application due to its similarity to Table 1 and page limits. It is available from the authors upon request.

The curves on the right show the reach for the subset of users who visited at least one of

the eight aggregate travel websites in January 2011 (there are 6,431 such individuals in our

data set). Presumably these consumers are more likely to be interested in searching for travel

deals compared to the others. In this case, the constrained curve significantly outperforms

the unconstrained curve. By constraining the optimization to allocate a percentage of the

budget to each aggregate travel website, we reach far more of the users who actually visit

these sites, which is the group NCL would like to target. In this case, NCL can meet its

aggregate travel site requirements without sacrificing much overall reach, meaning that most


users will still view the ad in general, but we are also confident that we have reached the

subset of people most likely to book a cruise.

In the right panel of Figure 5, the naive approach of equal allocation across the eight

travel aggregate sites performs slightly better than the proposed method when the reach is

calculated based on the subset of aggregate travel site users (i.e., constrained reach). But

this result is expected. NCL is most likely to reach users on the aggregate sites by putting

as much budget as possible into those eight sites. As we see from the overall reach curves on

the left, that method will not capture users on other websites who might also be attracted

to NCL’s Wave Season campaign but did not visit one of the eight aggregate travel websites.

Depending on whether the firm wants to reach a broader audience or a targeted audience,

either the constrained or the unconstrained optimization could be employed in such online

ad campaigns.

4. Conclusion and Future Work

In the current advertising climate, firms need an online presence more than ever. Never-

theless, the ever-increasing number of websites presents not only endless opportunities but

also tremendous challenges for firms' online display ad campaigns. While the opportunities for online advertising are bounded only by the sheer number of websites, optimal Internet media selection among thousands of websites has remained a prohibitively challenging task.

While existing methods can only solve Internet budget optimization for moderately-sized

problems (e.g. 10 websites), we propose a method that allows firms to efficiently allocate

budget across a large number of websites (e.g. 5000). We demonstrate the applicability

and scalability of our algorithm in real-world settings using the comScore data on Inter-

net usage. We also illustrate that the proposed method extends easily to accommodate

common practical Internet advertising considerations, including targeted consumer demo-

graphics, mandatory media coverage to matched content websites, and target frequency of

ad exposures. Furthermore, the low computational cost means that the proposed method

can rapidly examine a range of possible budgets. As a result, firms can easily examine the

correspondence between budget and reach, providing them with the ability to spend only as

much money as required to achieve a desired level of reach.


Consequently, the proposed method provides firms with great flexibility and adaptability in their online display advertising campaigns. Accordingly, firms can retain full control of their own Internet display ad campaigns, alleviating the need to turn to ad agencies or large advertising exchanges that afford firms little to no oversight of their own campaigns.

Our research also offers some promising avenues for further research. For example, while

the proposed method emphasizes maximizing the reach of online display ads, firms could

readily modify our approach and use Internet browsing-tracking data to maximize click-

through and/or downstream purchases of their Internet display ad campaigns. Additionally,

in the current paper, we consider the perspective of an individual firm that wishes to maxi-

mize reach for its particular campaign. This method could be further extended for use by an

advertising broker who wishes to maximize reach over a set of clients. Advertising brokers

must provide clients with the best possible campaigns but also use as much of their existing

ad space inventory as possible. Thus an interesting extension of our method would be to

maximize over multiple campaigns from the perspective of an advertising agency. We will

leave such endeavors for future work.

References

360i (2008). Point of view on gaming. Technical report.

Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much

larger than n. The Annals of Statistics, 35(6):2313–2351.

Chapman, M. (2009). Digital advertising’s surprising economics. Adweek, 50(10):8.

Chen, Y., Pavlov, D., and Canny, J. (2009). Large-scale behavioral targeting. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Cho, C. and Cheon, H. (2004). Why do people avoid advertising on the internet? Journal

of Advertising, 33(4):89–97.

Danaher, P. (2007). Modeling page views across multiple websites with an application to

internet reach and frequency prediction. Marketing Science, 26(3):422–437.


Danaher, P., Janghyuk, L., and Kerbache, L. (2010). Optimal internet media selection.

Marketing Science, 29(2):336–347.

Drewnowski, A. and Darmon, N. (2005). Food choices and diet costs: an economic analysis.

The Journal of Nutrition, 135(4):900–904.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression (with

discussion). The Annals of Statistics, 32(2):407–451.

eMarketer (2012). Digital ad spending tops 37 billion. URL:
http://www.emarketer.com/newsroom/index.php/digital-ad-spending-top-37-billion-2012-market-consolidates. Accessed 4 Jun 2015.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its

oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized

linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.

Goeman, J. (2010). L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal, 52(1):70–84.

Goldfarb, A. and Tucker, C. (2011). Online display advertising: Targeting and obtrusiveness.

Marketing Science, 30(3):389–404.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning:

Data Mining, Inference, and Prediction. Springer, second edition.

Hemphill, T. (2000). Doubleclick and consumer online privacy: An e-commerce lesson

learned. Business and Society Review, 105(3):361–372.

Hesterberg, T., Choi, N., Meier, L., and Fraley, C. (2008). Least angle and l1 penalized

regression: A review. Statistics Surveys, 2:61–93.

Hoban, P. R. and Bucklin, R. E. (2015). Effects of internet display advertising in the purchase

funnel: Model-based insights from a randomized field experiment. Journal of Marketing

Research, LII:375–393.


Krugman, H. (1972). Why three exposures may be enough. Journal of Advertising Research,

12(6):11–14.

Liaukonyte, J., Teixeira, T., and Wilbur, K. (2015). Television advertising and online shop-

ping. Marketing Science, 34(3):311–330.

Lipsman, A. (2010). The New York Times ranks as top online newspaper according to May

2010 U.S. comScore Media Metrix data. Technical report, ComScore, Inc.

Lohtia, R., Donthu, N., and Hershberger, E. (2003). The impact of content and design
elements on banner advertising click-through rates. Journal of Advertising Research,
43(4):410–418.

Luo, Z. and Tseng, P. (1992). On the convergence of the coordinate descent method for

convex differentiable minimization. Journal of Optimization Theory and Applications,

72(1):7–35.

Manchanda, P., Dubé, J.-P., Goh, K., and Chintagunta, P. (2006). The effect of banner
advertising on internet purchasing. Journal of Marketing Research, 43:98–108.

Meinshausen, N. (2007). Relaxed lasso. Computational Statistics and Data Analysis,
52(1):374–393.

Mintel (2014). Kids as influencers–U.S. Technical report, Mintel.

Montgomery, A. L., Li, S., Srinivasan, K., and Liechty, J. (2004). Modeling online browsing

and path analysis using clickstream data. Marketing Science, 23(4):579–595.

Morrison, M. (2012). Can the McRib save Christmas? Ad Age.

Muthukrishnan, S. (2009). Ad exchanges: Research issues. Technical report, Google, Inc.

Naples, M. (1979). Effective frequency: The relationship between frequency and advertising

effectiveness. Technical report, Association of National Advertisers, New York.

Park, Y. and Fader, P. (2004). Modeling browsing behavior at multiple websites. Marketing

Science, 23(3):280–303.


Radchenko, P. and James, G. (2008). Variable inclusion and shrinkage algorithms. Journal

of the American Statistical Association, 103(483):1304–1315.

Rosset, S. and Zhu, J. (2007). Piecewise linear regularized solution paths. The Annals of

Statistics, 35(3):1012–1030.

Satchell, A. (2011). Norwegian: Cruise fares to increase up to 10 percent April 1. South

Florida Sun-Sentinel.

Schlesinger, R. (2010). U.S. population, 2011: 310 million and growing. U.S. News.

Schmidt, M., Fung, G., and Rosales, R. (2007). Fast optimization methods for l1 regular-

ization: A comparative study and two new approaches. Machine Learning: ECML 2007,

4701:286–297.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal

Statistical Society, Series B, 58:267–288.

The Economist Intelligence Unit (2005). Business: The online ad attack. The Economist, 375(8424):63.

Zhao, P., Rocha, G., and Yu, B. (2009). The composite absolute penalties family for grouped

and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American

Statistical Association, 101:1418–1429.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net.

Journal of the Royal Statistical Society, Series B, 67:301–320.


Appendix A Simple Illustration of Correlation in Website View-

ership

In this appendix, we provide a simple illustration of how the proposed method handles

correlation in the Z data matrix. To demonstrate basic intuitions, we illustrate the effects

of correlation on budget allocation by considering a case with three websites, all generated

from the same distribution with the same cost. However, the viewership for websites 1 and

2 has a measurable correlation ranging from 0.0 (fully independent) to 1.0 (perfect positive

correlation), and website 3’s viewership is generated entirely independently of the other two

websites (correlation of 0).

Figure A1 shows the change in budget allocation across the three websites as the cor-

relation between websites 1 and 2 changes, where the red line is website 1’s allocation, the

blue line is website 2’s allocation, and the green line is website 3’s allocation. When the

correlation between websites 1 and 2 is zero, all three websites are completely independent.

In this case, the algorithm allocates one-third of the budget to each of the three websites,

since no website has a clear advantage over the other two. As the correlation between websites 1 and 2 increases, the algorithm gradually allocates more budget to website 3 and splits

the remaining budget among websites 1 and 2. When these two websites become perfectly

correlated, the algorithm divides the budget in half, allocating one half to website 3 and the

other half across websites 1 and 2.


[Figure A1 here: proportion of budget allocated (0.25–0.50) plotted against the correlation between websites 1 and 2 (0.0–1.0), with one line each for websites 1, 2, and 3.]

Figure A1: Illustration of Budget Allocation with Varying Correlations in Website Viewer-

ship

Appendix B Algorithm Details, Convergence, and Efficiency

Our optimization criterion of Equation (7) in Section 2 can be written in statistical form as
an ℓ1-penalized problem:

$$\frac{1}{n}\sum_{i=1}^{n} e^{-\gamma_i} + \frac{\lambda}{n}\|w\|_1, \qquad (A1)$$

where $\|w\|_1 = \sum_{j=1}^{p} |w_j|$. This is a common statistical form, namely,

$$f(w) = g(w) + \sum_{j=1}^{p} k_j(w_j), \qquad (A2)$$

where $g(w) = \frac{1}{n}\sum_{i=1}^{n} e^{-\gamma_i}$ is a differentiable convex function of $w$, and
$\sum_{j=1}^{p} k_j(w_j) = \frac{\lambda}{n}\|w\|_1$ is a separable convex but not differentiable function.
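To make the criterion concrete, the following Python sketch evaluates (A1) for a candidate allocation w. The names are our own illustrative choices (theta denotes the n × p matrix of exposure terms with γ_i = Σ_j θ_ij w_j, and lam denotes λ); this is a sketch, not the paper's implementation.

```python
import numpy as np

def penalized_criterion(w, theta, lam):
    """Evaluate the l1-penalized criterion of Equation (A1).

    Illustrative names (ours): theta is the n x p matrix of exposure
    terms, so gamma_i = sum_j theta_ij * w_j, and lam is lambda.
    """
    n = theta.shape[0]
    gamma = theta @ w                       # gamma_i for each respondent
    g = np.exp(-gamma).mean()               # smooth convex part g(w)
    penalty = (lam / n) * np.abs(w).sum()   # separable l1 part
    return g + penalty
```

At w = 0 the smooth part equals 1 and the penalty vanishes, so the criterion starts at 1 and decreases as budget is allocated to effective websites.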


It has been shown in Luo and Tseng (1992) that a coordinate descent algorithm, which

iteratively minimizes the criterion as a function of one coordinate at a time, will achieve

a global minimum for functions of the form in (A2). Thus convergence using coordinate

descent is guaranteed for our criterion, since it is in the form specified by Luo and Tseng.

Because no closed-form solution exists for Equation (7), we employ a Taylor approximation
to (A2), resulting in Equation (9) from Section 2. To minimize Equation (9) over $w_j$,
with all $w_k$, $k \neq j$, fixed, we first compute the partial derivative with respect to $w_j$, which is
given by

$$\frac{1}{n}\sum_{i=1}^{n} \theta_{ij}\, e^{-\sum_j \theta_{ij}\tilde{w}_j} \left[-1 + \theta_{ij}(w_j - \tilde{w}_j)\right] + \lambda \qquad (A3)$$

for $w_j > 0$. Setting (A3) equal to zero gives Equation (10) in Section 2.

We can also use Equation (A3) to find a starting point for our algorithm, i.e., the λ
value corresponding to B = 0. To do this, we employ the same calculation of Hj as used in
Equation (10), where Hj determines whether coefficient j should be set to zero. In particular,
we first define w̃j = 0 for all j = 1, . . . , p, which corresponds to a zero budget. Then we
calculate Hj for each website and set our initial λ value to λmax = max Hj over j = 1, . . . , p.
To compute solutions for increasing budgets, we start at λmax and incrementally decrease λ
in steps. The step size and number of steps are both parameters of the algorithm, specified
by the researcher depending on the desired granularity and maximum budget. For example,
for the McRib case study in Section 3.3, we ran the algorithm with 500 steps at a step size
of 0.01.
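A minimal Python sketch of this initialization follows. It rests on our own reading (not stated explicitly above) that at the all-zero budget Hj reduces to (1/n) Σi θij, so λmax = maxj Hj; the linear decrease of λ is likewise an illustrative schedule.

```python
import numpy as np

def lambda_path(theta, n_steps=500, step_size=0.01):
    """Sketch of the lambda-path initialization (assumptions ours).

    At w~ = 0 we take H_j = (1/n) * sum_i theta_ij, so lambda_max =
    max_j H_j is the smallest lambda at which B = 0; lambda is then
    decreased in fixed steps, floored at zero.
    """
    n = theta.shape[0]
    H = theta.sum(axis=0) / n                        # H_j at zero budget
    lam_max = H.max()                                # initial lambda (B = 0)
    return [max(lam_max - k * step_size, 0.0) for k in range(n_steps)]
```

Each λ on the returned grid corresponds to a successively larger budget, tracing out the full allocation path.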

We further demonstrate the efficiency of the algorithm in Figure A2. It shows the time

(in seconds) to run the algorithm at a particular budget over a range of p websites. To create

Figure A2, we used the same method as described in Section 3.2 to generate the Z matrix

and the CPMs. As the figure shows, our algorithm is highly computationally efficient for
large-scale problems.


[Figure A2 here: time to run in seconds (axis from 0 to 15) plotted against the number of websites (axis from 0 to 5,000).]

Figure A2: Algorithm Computational Efficiency

Appendix C Supplementary Information for Model Comparisons

with Danaher et al.’s Model

We first describe how we generate the Z matrix using the multivariate negative binomial

marginal distribution as described in Danaher et al. (2010)’s method. Under Danaher et

al.’s approach, page impressions are equivalent to ad appearances; thus what Danaher et al.

refer to as X is equivalent to the proposed Z in our paper. To keep terminology consistent,

we will use Zj for the methodology in this section.

We first generate website 1’s data, Z1, from a typical negative binomial distribution (i.e.,

the marginal f1(Z1) distribution). Then, from Danaher 2007 (p. 425) we note that the

conditional distribution of Z2 given Z1 is given by $f(Z_2|Z_1) = f(Z_1, Z_2)/f(Z_1)$, or

$$f(Z_2 = z_2 \mid Z_1 = z_1) = f_2(z_2)\left[1 + \omega\left(e^{-z_1} - \left(\frac{\alpha_1}{1 - e^{-1} + \alpha_1}\right)^{r_1}\right)\left(e^{-z_2} - \left(\frac{\alpha_2}{1 - e^{-1} + \alpha_2}\right)^{r_2}\right)\right]. \qquad (A4)$$

We then use the following approach to generate the Z matrix following Danaher’s method-

ology:

1. Randomly generate n synthetic respondents from a negative binomial distribution,

corresponding to Z11, . . . , Zn1.

2. For each Zi1 randomly generate Zi2 by sampling from the probability distribution given

by (A4).

3. Repeat the process for Z3, Z4, etc., until the desired number of websites is reached.
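Steps 1–2 above can be sketched in Python for a single draw of Z2 given Z1. The NBD parameterization (success probability αj/(1+αj), matching the zero probability (αj/(1+αj))^rj used later in this appendix) and the truncation of the support at max_z are our assumptions for illustration.

```python
import numpy as np
from scipy.stats import nbinom

def sample_z2_given_z1(z1, alpha, r, omega, rng, max_z=200):
    """Draw Z2 | Z1 = z1 from the conditional pmf of Equation (A4).

    Assumptions (ours): the NBD marginal uses success probability
    p_j = alpha_j / (1 + alpha_j), and the support is truncated at
    max_z so the pmf can be renormalized and sampled directly.
    """
    a1, a2 = alpha
    r1, r2 = r
    c1 = (a1 / (1.0 - np.exp(-1.0) + a1)) ** r1
    c2 = (a2 / (1.0 - np.exp(-1.0) + a2)) ** r2
    z2 = np.arange(max_z + 1)
    f2 = nbinom.pmf(z2, r2, a2 / (1.0 + a2))       # marginal of Z2
    pmf = f2 * (1.0 + omega * (np.exp(-z1) - c1) * (np.exp(-z2) - c2))
    pmf = np.clip(pmf, 0.0, None)                  # guard numerical negatives
    pmf /= pmf.sum()                               # renormalize after truncation
    return rng.choice(z2, p=pmf)
```

Repeating this draw across synthetic respondents, and chaining the conditionals for later websites, yields the columns of the Z matrix.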

Note the calculation of the conditional f(Zj|Z1, ..., Zj−1) will become increasingly com-

plex as each successive website’s viewership is calculated. For example, f(Z1, Z2, Z3) =

f(Z1)f(Z2|Z1)f(Z3|Z1, Z2). Thus we extend only to seven websites for the example used in
Section 3.1. By combining these vectors, we can create the Z matrix based on the multivariate
negative binomial distribution.

To make our simulated data as realistic as possible, we generate the simulated data using

values of α and r estimated from the top seven most visited websites in the December 2011
comScore data as the true parameter values of the MNBD. Since E(Zj) = rj/αj, we have
r̂j/α̂j = Z̄j, or alternatively r̂j = Z̄jα̂j. We can find Z̄j easily from the data, as it is simply

the mean of the visit values for a particular website j. Further, given that the probability of
an NBD random variable taking the value zero is $(\alpha_j/(1+\alpha_j))^{r_j}$, we can estimate
$\hat{\alpha}_j$ as the solution to

$$y_j = \left(\frac{\hat{\alpha}_j}{1 + \hat{\alpha}_j}\right)^{\bar{Z}_j \hat{\alpha}_j}, \qquad (A5)$$

where $y_j$ denotes the observed fraction of zero visits to a given website j. Equation (A5) can
easily be solved using a root-finding function, and in turn $r_j$ estimated via $\hat{r}_j = \bar{Z}_j \hat{\alpha}_j$.
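A minimal sketch of this estimation step, assuming SciPy's brentq root solver and a bracket wide enough to contain the root of (A5):

```python
import numpy as np
from scipy.optimize import brentq

def estimate_nbd_params(z):
    """Estimate (alpha, r) for one website's visit counts via Equation (A5).

    We solve y = (alpha/(1+alpha))**(zbar*alpha) for alpha, where y is
    the observed fraction of zeros and zbar the sample mean, then set
    r = zbar * alpha. The bracket [1e-8, 1e4] is an assumption.
    """
    zbar = z.mean()
    y = np.mean(z == 0)

    def gap(a):
        return (a / (1.0 + a)) ** (zbar * a) - y

    alpha_hat = brentq(gap, 1e-8, 1e4)   # root of Equation (A5)
    return alpha_hat, zbar * alpha_hat
```

Applied column by column to the comScore visit matrix, this recovers the α̂j and r̂j reported in Tables A1 and A2.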

We used this approach to estimate α and r from Amazon, AOL, Edgesuite, Live, MSN,

Weatherbug and Yahoo, which provided the basis for the seven website simulation in Sec-

tion 3.1.1. Table A1 shows a comparison between the estimated and true α and r for the

simulated data. Here, the true values are from the seven previously mentioned websites,


while the estimated values are mean values from 100 simulation runs with matrices of 50,000

users each.10 The table also shows the mean squared error between the true and estimated

values over the 100 runs, as well as the mean absolute deviation. It is evident that the

estimated and true α and r values are reasonably close to one another.

Website 1 Website 2 Website 3 Website 4 Website 5 Website 6 Website 7

α 0.187 0.017 0.093 0.038 0.043 0.025 0.032

α̂ 0.187 0.018 0.093 0.039 0.043 0.025 0.033

MSE 2e−5 2e−5 4e−5 6e−5 6e−5 3e−5 3e−5

MAD 0.008 0.001 0.004 0.001 0.002 0.002 0.002

r 0.287 0.056 0.174 0.093 0.167 0.051 0.444

r̂ 0.287 0.057 0.174 0.093 0.168 0.051 0.445

MSE 8e−5 7e−5 2e−5 8e−5 6e−5 6e−5 7e−5

MAD 0.009 0.003 0.005 0.004 0.005 0.004 0.006

Table A1: True and Estimated Mean α, r Values, Simulated Data

Table A2 shows a comparison between the estimated and full α and r for the seven-

website data from comScore (Section 3.1.2), where the full values are values based on the

entire December 2011 comScore data set, and the estimated values are the mean values across

100 runs on random 10% subsets. The table also shows the mean squared error between the

full and estimated values over the 100 runs, as well as the mean absolute deviation. Again,

the estimated values based on the subset data highly resemble the values obtained from the

full data.

10. Note the simulation used in Section 3.1 is done with 5,000 synthetic respondents due to the computational complexity involved in estimating Danaher et al.'s method for 50,000 synthetic respondents.


Amazon AOL Edgesuite Live MSN Weatherbug Yahoo

Full α 0.187 0.017 0.093 0.038 0.043 0.025 0.032

Estimated α 0.188 0.017 0.094 0.038 0.043 0.025 0.032

MSE 2e−5 2e−6 3e−5 8e−6 6e−6 4e−6 2e−6

MAD 0.010 0.001 0.005 0.002 0.002 0.002 0.001

Full r 0.287 0.056 0.174 0.093 0.167 0.051 0.444

Estimated r 0.288 0.056 0.175 0.093 0.167 0.051 0.444

MSE 1e−4 4e−6 3e−5 1e−5 2e−5 4e−6 1e−4

MAD 0.010 0.002 0.005 0.003 0.004 0.002 0.008

Table A2: True and Estimated Mean α, r Values, Real Data

Appendix D Website Category Viewership Correlation Table

Table A3 provides an overview of correlation in viewership among the 16 website groups

in the McRib example, both within and among groups. Within-group correlation in

the table is calculated by taking the mean of all absolute correlations between websites

in a particular group. These are displayed in the diagonal of the table. For example, the

Newspaper category shows moderately high average correlation in viewership among websites

with a value of 0.48. In contrast, there is not much correlation in viewership among websites

in the E-mail category, only 0.01 on average.

The off-diagonal elements of Table A3 show the maximum absolute correlation between

each pair of groups. This is calculated by taking the maximum absolute correlation between
two websites drawn from the respective groups. For example, there is a high correlation of 0.96 between

Newspaper and Portal sites. In contrast, there is a low correlation between Filesharing and

E-mail sites, only 0.03.
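The within- and between-group summaries of Table A3 can be computed from a website correlation matrix as in the following sketch; the group names and index lists are illustrative.

```python
import numpy as np

def correlation_summary(corr, groups):
    """Summarize a p x p website correlation matrix by group (names ours).

    Diagonal entries: mean absolute correlation over distinct website
    pairs within a group. Off-diagonals: maximum absolute correlation
    between any pair of websites from the two groups.
    """
    names = list(groups)
    k = len(names)
    out = np.zeros((k, k))
    for a in range(k):
        idx_a = groups[names[a]]
        pairs = [abs(corr[i, j]) for i in idx_a for j in idx_a if i < j]
        out[a, a] = np.mean(pairs) if pairs else 0.0
        for b in range(a + 1, k):
            idx_b = groups[names[b]]
            out[a, b] = out[b, a] = max(
                abs(corr[i, j]) for i in idx_a for j in idx_b)
    return names, out
```

Feeding in the 16 category index sets from the McRib data would reproduce the layout of Table A3.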


Category Com Email Ent File Game Gen Info News Onl Photo Port Ret Serv Soc Sport Travel

Community 0.02 0.14 0.82 0.14 0.77 0.14 0.47 0.16 0.55 0.88 0.21 0.39 0.21 0.26 0.12 0.15

Email . 0.01 0.07 0.03 0.28 0.04 0.07 0.05 0.09 0.04 0.87 0.10 0.10 0.06 0.12 0.04

Entertainment . . 0.02 0.78 0.32 0.90 0.76 0.92 0.28 0.83 0.90 0.30 0.69 0.24 0.79 0.10

Fileshare . . . 0.05 0.27 0.05 0.15 0.56 0.67 0.13 0.17 0.10 0.13 0.14 0.10 0.07

Gaming . . . . 0.01 0.12 0.82 0.32 0.85 0.12 0.25 0.14 0.95 0.09 0.51 0.09

General News . . . . . 0.28 0.76 0.94 0.08 0.04 0.96 0.08 0.10 0.34 0.85 0.11

Information . . . . . . 0.02 0.77 0.51 0.18 0.76 0.30 0.11 0.24 0.65 0.27

Newspaper . . . . . . . 0.48 0.10 0.05 0.96 0.36 0.12 0.26 0.86 0.15

Online Shop . . . . . . . . 0.03 0.49 0.16 0.26 0.75 0.42 0.19 0.10

Photos . . . . . . . . . 0.02 0.11 0.09 0.09 0.41 0.04 0.05

Portal . . . . . . . . . . 0.06 0.19 0.19 0.12 0.87 0.09

Retail . . . . . . . . . . . 0.04 0.19 0.18 0.25 0.12

Service . . . . . . . . . . . . 0.01 0.15 0.19 0.05

Social Network . . . . . . . . . . . . . 0.02 0.10 0.26

Sports . . . . . . . . . . . . . . 0.07 0.08

Travel . . . . . . . . . . . . . . . 0.18

Table A3: Overview of viewership correlation within and across the sixteen website categories in Section 3.3.
