
Social Learning from Online Reviews with Product Choice

Stefano Vaccari∗1, Costis Maglaras†2, and Marco Scarsini‡3

1,3 Dipartimento di Economia e Finanza, LUISS, Rome, Italy

2 Columbia Business School, Columbia University, New York, USA

Abstract

Product choice when consumers engage in social learning has significant implications on

learning outcomes and on the information accumulation rate. In many practical settings, con-

sumers have a choice on which product to buy, if any, among several possible alternatives. The

quality of these products may be unknown to consumers, but online platforms provide prod-

uct reviews so that, as time goes by, customers accumulate information about products’ quality.

This paper studies a model where consumers estimate the quality of products from online binary

product reviews (like/dislike), and subsequently make a choice among competing alternatives

using a multinomial logit model. The consumer learning model is naïve, i.e., consumers take

the ratio of likes over the total number of reviews as a proxy for quality. We explore the im-

pact of choice on the learning outcome, and show that consumers correctly learn the ranking

of the product qualities, but not the actual quality vector. We provide the conditions that

allow consumers to get arbitrarily close to the truth and characterize the consistency of their

product choices relative to the full information benchmark. Using a large market (fluid model)

approximation, we study how choice and product parameters affect learning speed and derive

some intuition on the primitives that matter the most. Finally, we address the following plat-

form control problem: assuming that consumers suffer some search cost to go down the list of

displayed products, which order of products should the platform use to speed up learning and

purchases? Without search costs, the platform has no leverage to accelerate learning, but if

search costs exist, and are significant, e.g., most people do not see past the top 10 or so options,

then (a) disregarding the search cost leads to significantly optimistic results in terms of infor-

mation accumulation speed, and (b) by carefully selecting the order in which product options

are displayed, the platform may in fact reduce the time-to-learn even when compared to the

case where there are no search costs.

Keywords: social learning, information aggregation, bounded rationality, online reviews.

∗ [email protected]. † [email protected]. ‡ [email protected]. This author is a member of GNAMPA-INdAM.


1 Introduction

It is common for consumers to rely on online reviews and price comparisons when they plan

to purchase a new product or a new service. In fact, due to the remarkable variety of options

that are usually available to them, finding the product that best fits their needs can be quite

a difficult task. As a consequence, consumers, who may not be accurately informed about the

product attributes, often refer to information aggregator websites that compile product reviews,

such as TripAdvisor, Yelp, IMDb, etc., to quickly gather information and estimate the quality of

each alternative. As consumers observe and report online reviews over time, they engage in what

is called a social learning process. Namely, consumers share their experiences and opinions about

the different products and services with other consumers, who, in turn, use this information to

estimate the quality of the different alternatives to make better-informed purchase decisions.

However, given that users in such online marketplaces typically choose among many alternatives,

what matters for the selection is not only the amount of information about a particular product but

also the amount and level of information about its close substitutes. As a result, learning transients

are correlated across products: information accumulates faster for products that consumers perceive

as more appealing, as these products attract most of the observations. Despite the increasingly

central role of online reviews in the decision process of consumers in online marketplaces, social

learning in the presence of product choice is a relatively unexplored topic. This paper investigates

the interplay between the information dispersion effects induced by the presence of product choice

and the learning dynamics of consumers who communicate through the mechanism of online reviews.

In a bit more detail, this paper investigates a model of a marketplace where consumers arrive

sequentially over time and decide whether to buy one of the available products or to take an outside

option. Consumers are heterogeneous in the sense that their preferences towards the observable

features of the products are represented by a sequence of independent and identically distributed

(i.i.d.) Gumbel random variables. Although initially uninformed about the intrinsic quality of the

different products, consumers estimate it by observing the binary online reviews reported by other

purchasers prior to their arrival. Given their quality estimates, their prior quality beliefs, their

idiosyncratic preference parameters, and the prices, consumers choose on the basis of a Multinomial

Logit model. We assume that purchasers of a given product experience its true quality plus a

small random perturbation, and report a binary online review expressing their level of satisfaction:

consumers report a “like” review if their ex-post utility exceeds the utility they expected before the

purchase, and a “dislike” review otherwise. Online reviews do not reveal idiosyncratic preference

parameters, nor consumers’ ex-post utility, nor the level of quality they experienced. As a result,

online reviews are only partially informative. Reviews are gathered and displayed to the upcoming


consumers by an information aggregator, the platform.

The classical fully-rational Bayesian update requires consumers to have computational capa-

bilities that appear unrealistic when consumers choose among multiple alternatives that differ in

prices, in technical features, and in the amount and level of available information about their quality.

We therefore assume that consumers are boundedly rational and adopt a naïve estimation rule

that uses the ratio of positive reviews as a quality proxy, i.e., if 80% of the reviews are positive,

then consumers perceive that the value of the quality is 80% of the maximum possible quality, as

opposed to estimating the quality value that would be consistent with such reviews taking into

account the probabilistic model primitives, the competing alternatives, and the sequence of reviews

of all products.

We analyze this process and study what consumers learn in the long run. We establish that per-

fect social learning does not take place, i.e., consumers’ quality estimates do converge almost surely,

albeit not to the correct quality vector. However, despite the naïve estimation rule, consumers do

learn to rank products correctly according to their intrinsic quality. Further, we show that, even

if perfect learning fails, consumers get relatively close to the truth, i.e., the bias that the naïve

rule induces in the asymptotic quality estimates can be made arbitrarily small under some mild

conditions that we provide. Finally, we analyze how this inferential inefficiency affects the quality

of consumers’ choices, by estimating what fraction of consumers makes choices that are consistent

with the full information benchmark.

To characterize the learning transient, we derive a large market asymptotic (fluid) model and

study the dynamics of the learning process and the impact of product choice on the quality of

consumers’ decisions. We show that, if the arrival rate of consumers grows large, learning tra-

jectories can be approximated with arbitrary precision using the solution of a system of ordinary

differential equations (ODE). We study the dynamics of the learning paths in this deterministic

approximation, and we provide tight bounds for the rate of information accumulation for the differ-

ent products. Our analysis shows how the rate of information accumulation depends on the prices,

the distribution of product qualities, the prior quality vector, and the number of product options.

In particular, we prove that the time-to-learn is (roughly) inversely proportional to the number of

available options, and further quantify its dependence on the dispersion of product prices, the prior

quality estimates, and the distribution of true product qualities.

Finally, we address the placement problem of the platform that displays the online reviews. We

assume that consumers perceive an additional “search cost,” which is an increasing function of the

ranking in which the product is displayed by the platform. So, a product-display ranking induces a

vector of search costs that, in turn, affect the multinomial choice probabilities. The presence of these

search costs allows the platform to influence consumer choice, information acquisition, and the learning


transient, by choosing the ordering in which the products are displayed to arriving consumers. First,

we show that, through its ranking policy, the platform affects only the rate of information

accumulation for the different products, but has no influence on what consumers learn in the

long run, i.e., the asymptotic outcome of the learning process remains the same as if search costs

were absent. The speed of learning is, however, dramatically impacted by these search costs. For

example, consider a fixed (say alphabetical) product display ordering and linearly increasing search

costs. In this case, the search costs will ultimately render many of the products very undesirable

alternatives, and as a result the top-ranked products will not be slowed down in their learning

transient, while the bottom-ranked products will experience a slowdown that is exponential in the

number of products (as opposed to inversely proportional). In general, disregarding the effect of

search costs may lead to substantial overestimation of the information accumulation speed.

We motivate two variants of a platform control problem where the platform focuses on maximiz-

ing discounted cumulative revenue on one hand, and on minimizing the time-to-learn on the other.

While finding the optimal dynamic ranking policy is intractable, we study a family of heuristics that

either greedily maximize next period revenue (exploit), or maximize the speed of the next period

information acquisition (explore), or some variants of these two. We show that both the exploit and

explore product rankings are efficiently computable. We resort to a set of numerical simulations

to demonstrate that ranking policies that allow for a suitable amount of exploration may lead to

higher discounted profits compared to a full-exploit policy. And, perhaps surprisingly, in some

cases the platform can achieve a faster overall time-to-learn than for a system with no search costs;

this is achievable because search costs provide the platform with a mechanism to affect product

choice and therefore direct the information acquisition process as most needed – something that it

cannot do if there are no search costs and consumers are assumed to make choices by considering

all alternatives without any utility degradation due to their display rank.

We conclude this introduction with a brief literature review. Some of the first examples of papers

focusing on social learning trace back to the observational learning models studied in Banerjee

(1992) and Bikhchandani, Hirshleifer, and Welch (1992), which demonstrated in a model with

private signals, observable actions, and Bayesian updating, that rational agents eventually ignore

their private signals and decide to imitate their predecessors (herd). Actions stop being informative

and learning is not achieved. Smith and Sørensen (2000) showed that if agents receive signals with

unbounded precision, herding is avoided and asymptotic learning is achieved. Callander and Hörner

(2009) established that the results of Smith and Sørensen (2000) hold even if agents observe the

total number of actions of each type, as opposed to observing the whole history of decisions. This

approach is close to ours, as agents in our model (who, unlike those in Callander and Hörner

(2009), are non-Bayesian) only observe the total number of likes and dislikes reported by prior


agents.

Our work is more closely related to the non-Bayesian social learning literature that replaces the

rigorous Bayesian updating with simpler heuristic rules of thumb.1 The standard reference in this

stream of literature is DeGroot (1974). Here, the opinion of an agent in each period is the weighted

average of her private signals with the signals she received from her neighbors in the previous

period. Golub and Jackson (2010) showed that consensus in this model is achieved under mild

assumptions on the network structure. Other standard references for non-Bayesian social learning

include Ellison and Fudenberg (1993, 1995) who studied models where agents communicate their

experienced utility. We use a similar approach, albeit with online reviews instead of utilities. More

recently, Molavi, Tahbaz-Salehi, and Jadbabaie (2016) used an axiomatic approach to investigate

naïve social learning in great generality.

In recent years, part of the social learning literature has studied the interplay between the learning

process and the topological features of the social network of agents.2 An influential contribution

of this stream of literature is Acemoglu, Dahleh, Lobel, and Ozdaglar (2011), who showed that

learning fails in networks with non-expanding observations, i.e., if the network contains groups

of excessively influential agents, thus extending the results of the herding literature to a class of

networks. The speed-of-learning analysis that we propose in Section 4.2 is close in spirit to the

work of Golub and Jackson (2012). Here, in a non-Bayesian context, the authors studied how the

speed of learning is affected by the presence of correlations among the preferences of neighbors in

the social network, a phenomenon called homophily. Similar issues, however with Bayesian agents,

were examined in Lobel and Sadler (2016).

Several papers have considered price optimization problems in the presence of herd behavior

of consumers. Welch (1992) studied a monopolistic static pricing problem in a market where

consumers have private signals about the intrinsic value of a good and observe the history of

purchases, and Bose, Orosel, Ottaviani, and Vesterlund (2008) extended that model to allow for

dynamic pricing policies. Bergemann and Välimäki (1996, 1997) studied pricing considerations

respectively in an oligopoly and a duopoly where purchasers do not know the quality of the products

and sellers are uncertain about buyers’ preferences.

A recent stream of papers has studied social learning from consumer reviews as opposed to signals.

Crapis, Ifrach, Maglaras, and Scarsini (2017) dealt with social learning from binary reviews in a

market with non-Bayesian agents, and the resulting pricing decision of a monopolist; their analysis

1 Grimm and Mengel (2012), Mueller-Frank and Neri (2013), and Chandrasekhar, Larreguy, and Xandri (2015) experimentally showed that individuals appear more inclined to use simple update rules, as opposed to performing a fully rational Bayesian procedure.

2 See Jackson (2011), Acemoglu and Ozdaglar (2011), and Golub and Sadler (2017) for extensive surveys of social learning models on networks.


is closely related to the one used in this paper. Ifrach, Maglaras, Scarsini, and Zseleva (2018) studied

social learning in a monopolistic market with Bayesian consumers and binary reviews, and Besbes

and Scarsini (2018), again in a Bayesian setting, addressed the issue of self-selection bias when

consumers report their ex-post utility. Dynamic monopolistic pricing in the presence of consumer

reviews is studied in Shin and Zeevi (2017). Papanastasiou and Savva (2017) solved a two-period

dynamic pricing problem in a monopolistic market where strategic Bayesian consumers can delay their

decision in order to observe, at a cost, the reviews of first-period buyers. Acemoglu, Makhdoumi, Malekian, and

Ozdaglar (2017) studied a monopolistic model of Bayesian social learning and characterized the speed

of learning under different rating systems.

Mostly owing to the proliferation of online platforms, the analysis of information disclosure

policies designed to tactically influence consumers’ decisions has recently attracted quite a lot of

attention.3 For instance, L’Ecuyer, Maille, Stier-Moses, and Tuffin (2017) have considered the

static optimal ranking policy of a revenue-maximizing search engine. They solved the platform’s

dilemma of balancing between ranking policies that minimize expected consumers’ regret, and

ranking policies designed to maximize long-term profits. We use a similar approach in Section 5,

with the additional difficulty of the presence of learning effects on the consumers’ side. Using a

multi-armed bandit approach, Papanastasiou, Bimpikis, and Savva (2017) investigated the optimal

messaging provision policy of an online platform that wants to maximize the sum of discounted

consumers’ surpluses. At a high level, both their conclusions about information obscuration and

our heuristic analysis of Section 5 point in the direction of policies that balance between exploration

and exploitation. Van Hentenryck, Abeliuk, Berbeglia, Maldonado, and Berbeglia (2016) studied

market predictability in a dynamic trial-offer market where consumers are sensitive to products’

visibility and observe the number of purchases for each product. They showed that if products are

ranked according to their intrinsic quality, the market converges to a monopoly for the product

with the highest quality.

The Multinomial Logit Positioning Problem (MLPP) we solve in Section 5.2 belongs to the family

of assortment optimization problems with multiple products under a general discrete choice model.

Starting from Talluri and Van Ryzin (2004), this type of problem has been investigated when the

distribution of consumers’ preferences is a priori known (Davis, Gallego, and Topaloglu (2013)),

when preferences have to be learned by the designer along the selling horizon (Rusmevichientong,

Shen, and Shmoys (2010), Sauré and Zeevi (2013), Agrawal, Avadhanula, Goyal, and Zeevi (2017)),

and when consumers are subject to position biases (Abeliuk, Berbeglia, Cebrian, and Van Henten-

ryck (2016)). The proof of Proposition 8 in this paper builds on the ideas and techniques introduced

3 See, for instance, Kamenica and Gentzkow (2011), Kremer, Mansour, and Perry (2014), Che and Hörner (2015), and Bimpikis, Ehsani, and Mostagir (2015).


in Rusmevichientong et al. (2010) and in Abeliuk et al. (2016).

2 Model Configuration

We consider a marketplace where a set PK := {1, 2, . . . ,K} of K substitutable goods or

services—henceforth, called the products—is simultaneously launched at time t = 0. An infinite

sequence of consumers indexed by i = 1, 2, . . . have to decide whether to buy one of the products

or to choose a zero-price outside option. Consumers arrive at random times t1, t2, . . . distributed

according to a Poisson process of parameter Λ, make a once-and-for-all decision and never re-enter

the market. Product k has an intrinsic quality qk ∈ [0, 1] and a fixed price pk. The no-purchase

option is indicated with the index k = 0 and has an intrinsic quality q0, which is common knowledge

in the market. Without loss of generality, we thus set q0 = 0.

Initially, consumers do not know the quality of the products and, in order to make their pur-

chase decision, use their available information to compute a vector of quality estimates qi :=

(q1,i, . . . , qK,i), where qk,i denotes the estimate of the quality of product k evaluated by consumer

i. We also assume that consumers share an initial prior quality belief q0 := (q1,0, . . . , qK,0) about

the vector of intrinsic qualities q := (q1, q2, . . . , qK).

In case consumer i decides to buy product k, she experiences a quality qk + εk,i, where the

quality disturbance parameters {εk,i} are i.i.d. across consumers and products. We assume that

the quality shocks are drawn from a zero-mean symmetric distribution Fε(x) := P(εk,i ≤ x) such

that $\mathrm{Var}(\varepsilon_{k,i}) = s_\varepsilon^2$, and we denote by $\bar F_\varepsilon(x) := 1 - F_\varepsilon(x)$ the corresponding survival function. Moreover,

we assume that Fε has a differentiable density fε, which is bounded by some positive constant $\bar f_\varepsilon$

and which has connected support that is either [−ε, ε] with ε > 0, or (−∞,∞).

2.1 Purchase Decision

Consumers are heterogeneous, i.e., they differ in their personal inclination for the observable at-

tributes of the products. Specifically, consumer i has a preference for product k which is represented

by an idiosyncratic random parameter αk,i, and assigns a utility αk,i + qk,i − pk to the purchase

of product k.4 Given their vector of quality estimates, consumers decide according to a Multi-

nomial Logit (MNL) model. Specifically, the base valuation parameters {αk,i} are assumed to be

i.i.d. random variables distributed according to a standard Gumbel distribution.5 Consumer i buys

the product that maximizes her estimated utility, i.e., $c_i = \arg\max_{k \in \mathcal{P}^0_K} (\alpha_{k,i} + q_{k,i} - p_k)$, where

{ci = k} denotes the decision of consumer i to purchase product k, and where $\mathcal{P}^0_K := \{0\} \cup \mathcal{P}_K$

4 The utility assigned to the outside option is not deterministic either, and is given by α0,i + q0 − p0 = α0,i.

5 Namely, P(αk,i ≤ x) = exp(− exp(−x)) for all i and k.


indicates the augmented product set. Under the above assumptions, the probability that consumer

i chooses product k is given by the multinomial logit demand function λk(qi,p), i.e.,6

$$ \mathbb{P}(c_i = k) = \lambda_k(q_i, p) := \frac{\exp(q_{k,i} - p_k)}{1 + \sum_{j=1}^{K} \exp(q_{j,i} - p_j)}. \tag{1} $$

The distributions of the idiosyncratic preferences {αk,i}, of the ex-post quality noise {εk,i}, of the

arrival times {ti}, are assumed to be mutually independent and common knowledge in the market.
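As an illustration of (1), the following minimal Python sketch computes the choice probabilities for a given vector of quality estimates and prices. The function name `mnl_choice_probabilities` and the numerical inputs are assumptions made for this example only; they are not part of the model description.

```python
import math


def mnl_choice_probabilities(quality_estimates, prices):
    """Return [P(outside option), P(product 1), ..., P(product K)] as in equation (1)."""
    # Deterministic part of the utility of each product: estimated quality minus price.
    utilities = [q - p for q, p in zip(quality_estimates, prices)]
    # The outside option has quality 0 and price 0, so it contributes exp(0) = 1.
    denominator = 1.0 + sum(math.exp(u) for u in utilities)
    return [1.0 / denominator] + [math.exp(u) / denominator for u in utilities]


# Hypothetical inputs: two products whose estimated quality equals their price.
equal_case = mnl_choice_probabilities([0.5, 0.5], [0.5, 0.5])
# Hypothetical inputs: same prices, but product 2 is perceived as better.
unequal_case = mnl_choice_probabilities([0.3, 0.7], [0.5, 0.5])
```

In the first call every option has deterministic utility zero, so the outside option and the two products are chosen with equal probability. In the second call the product with the higher estimate attracts a larger share of purchases, and therefore a larger share of reviews.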

2.2 Information Structure and Estimation Procedure

In our model, consumers are initially uninformed about the intrinsic quality of products and make

purchase decisions based on the observation of the binary online reviews reported by the consumers

who purchased the products before them. Namely, we assume that purchasers of a given product

truthfully declare their level of satisfaction reporting a binary online review (“like” or “dislike”).

An information aggregator—henceforth, the platform—gathers the online reviews reported by past

purchasers, and displays them to future potential buyers, who in turn use this information to

estimate the quality of products and make purchase decisions accordingly.

Specifically, in case consumer i purchases product k, she reports a binary online review rk,i

expressing whether she “liked” (rk,i = L) or “disliked” (rk,i = D) that product. Consumer i

reports a like review if αk,i + qk + εk,i − pk, the ex-post utility she experiences from buying product

k, exceeds the utility αk,i + qk,i − pk she estimated before buying. This is equivalent to saying that,

conditional on consumer i purchasing product k, i.e., {ci = k}, we have

$$ r_{k,i} = L \iff q_{k,i} \le q_k + \varepsilon_{k,i}. \tag{2} $$

Notice that condition (2) implies that consumers may report a negative review even if their expe-

rienced utility is positive. Consumers expect a given level of utility when they purchase a given

product, and they will be satisfied only if they receive as much utility as they estimated ex ante,

regardless of whether this utility is positive or negative. Furthermore, if we assume that Fε does not

depend on the parameters of the model (prior beliefs, true qualities, prices, number of products),

then the base valuation parameters and the prices play a role only in the purchase decision, and

not when consumers report their reviews. We will see later in this section how this fact affects

the asymptotic outcome of the learning process of consumers.

The quantities $L_{k,i} := \sum_{j=1}^{i-1} \mathbf{1}\{r_{k,j} = L\}$ and $D_{k,i} := \sum_{j=1}^{i-1} \mathbf{1}\{r_{k,j} = D\}$ respectively represent

the number of positive and negative reviews for product k observed by consumer i. The quantity

6 See, for instance, McFadden (1973) for a detailed proof.


Bk,i := Lk,i + Dk,i is the total number of reviews for product k available to the i-th consumer. The

vector Ik,i := (Lk,i, Dk,i) represents the information regarding product k available to consumer i,

while we denote Ii := (I1,i, I2,i, . . . , IK,i) the global information available to consumer i.

At time t = 0, consumers have a prior belief q0, which is incorporated in their estimation by

introducing some fictitious like and dislike reviews L1,0, L2,0, . . . , LK,0 and D1,0, D2,0, . . . , DK,0,

such that qk,i = qk,0 := Lk,0/(Lk,0 + Dk,0) whenever Bk,i = 0.

We make the following bounded rationality assumption on consumers’ behavior. Specifically,

consumers naïvely identify the quality of products with the fraction of positive reviews over the

total number of reviews for that product, including fictitious observations, i.e., we assume

$$ q_{k,i} := \frac{L_{k,i} + L_{k,0}}{L_{k,i} + L_{k,0} + D_{k,i} + D_{k,0}} = \frac{L_{k,i} + L_{k,0}}{B_{k,i} + B_{k,0}}, \tag{3} $$

for all k ∈ PK and i = 1, 2, . . . , where Bk,0 := Lk,0 + Dk,0. Notice that, if Bk,i = 0 for some i, k,

then qk,i = qk,0, as required.7

The total number of fictitious observations Bk,0 can be interpreted as the weight that consumers

assign to the prior belief qk,0, as it determines how the fictitious likes and dislikes influence

consumers’ final estimate qk,i. In fact, when only a few consumers have purchased product k and Bk,i

is small with respect to Bk,0, the online information is dominated by the prior beliefs, and, as a

result, qk,i is closer to qk,0. As more reviews accumulate over time, consumers gradually forget their

prior beliefs, and the online information dominates the inference procedure.
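The role of the prior weight in (3) can be illustrated with a short Python sketch. The function name `naive_estimate` and all review counts below are hypothetical and serve only to show how the estimate moves away from the prior as reviews accumulate.

```python
def naive_estimate(likes, dislikes, prior_likes, prior_dislikes):
    """Naive quality estimate of equation (3): share of likes, fictitious reviews included."""
    total = likes + dislikes + prior_likes + prior_dislikes
    return (likes + prior_likes) / total


# Hypothetical prior: 3 fictitious likes and 1 fictitious dislike,
# i.e. a prior belief of 0.75 carrying a weight of 4 observations.
prior_only = naive_estimate(0, 0, 3, 1)        # no real reviews: the prior belief itself
few_reviews = naive_estimate(1, 3, 3, 1)       # 4 real reviews weigh as much as the prior
many_reviews = naive_estimate(250, 750, 3, 1)  # 1000 real reviews dominate the prior
```

With as many real reviews as fictitious ones, the estimate sits halfway between the prior belief 0.75 and the observed like ratio 0.25. With a thousand reviews it is essentially the observed ratio.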

3 Asymptotic Learning and Consistency of Choices

3.1 Asymptotic Learning

In this section we establish that if consumers use (3) to estimate the quality of products, they

fail to learn the true qualities of products asymptotically. We characterize what consumers learn

in the long run and we provide bounds for the asymptotic estimation bias. Then, for a particular

choice of the quality shocks, we explicitly characterize this asymptotic bias and also investigate the

number of consumers who, despite this inferential inefficiency, choose the same product they would

have selected in the complete information case.

The first result of this section establishes that the quality estimates of consumers converge almost

surely to a vector q∞ := (q1,∞, q2,∞, . . . , qK,∞) of asymptotically perceived qualities, characterized

below, which—in general—is different from the true quality vector q. In particular, Proposition 1

7 The structure of (3) is similar to the linear credibility estimators used in actuarial science (see, for instance, Bühlmann and Gisler (2005)), and to the linear estimators that arise in Bayesian updating with conjugate priors (see Diaconis and Ylvisaker (1979)).


shows that the asymptotic quality estimate of a given product only depends on the distribution of

the ex-post quality disturbances and on the true quality of that product.

Proposition 1. Consider the learning process described in the previous sections, where consumer

i estimates the quality of product k as qk,i using (3). Then, almost surely, qi → q∞ as i→∞. In

particular, qk,∞ is the unique solution of

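$$ q_{k,\infty} = \mathbb{P}(r_{k,i} = L \mid c_i = k) = \bar F_\varepsilon(q_{k,\infty} - q_k), \tag{4} $$

and qk,∞ = qk if and only if qk = 1/2.

Moreover, for all k, k1, k2 ∈ PK, we have

$$ q_{k,\infty} > (<)\ q_k \iff q_k < (>)\ \tfrac{1}{2}, \tag{5a} $$

$$ q_{k_1,\infty} \le q_{k_2,\infty} \iff q_{k_1} \le q_{k_2}, \tag{5b} $$

$$ |q_{k,\infty} - q_k| \le s_\varepsilon^{2/3}. \tag{5c} $$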

The result of Proposition 1 states that, under the naïve rule, consumers’ estimates stabilize in

the long run, and converge almost surely to a quality vector q∞ whose components are implicitly

defined by (4).

Proposition 1 establishes three important properties of the information aggregation process of

consumers. First, the vector of quality estimates qi does not necessarily converge to the correct

vector q. In fact, consumers fail to learn the intrinsic quality of any product k for which qk ≠ 1/2.

Indeed, as (5a) highlights, they asymptotically overestimate (respectively, underestimate) the

quality of products with true quality qk < 1/2 (qk > 1/2). To see why this occurs, suppose that

qk > 1/2. If qk,i were to converge to qk, then, because of the review rule and of the symmetric

zero-mean distribution of the εk,i’s, consumers’ reviews in the steady state would be equally split

between likes and dislikes. Such distribution of reviews would push qk,i towards 1/2. However,

equilibrium cannot be at qk,i = 1/2 either, as this would lead to more likes than dislikes, pushing

qk,i towards qk. As a result, balance between the review mechanism and the naïve estimation rule

will be achieved at a quality level that is higher than 1/2 but lower than qk.

Second, a similar intuition explains (5b), which establishes that qk,∞ is strictly increasing in qk,

i.e., quality estimates stabilize at higher levels for products with higher intrinsic qualities. The

result of (5b) implies that, despite the naïveté of the inference mechanism and despite consumers’

prior quality beliefs, the ranking of the perceived qualities will reflect the ranking of the true

qualities in the long run. Namely, even if learning may not be achieved for all products, consumers

asymptotically learn to correctly rank products according to their intrinsic qualities.


Third, (5c) provides a bound for the bias that the naïve inference rule asymptotically creates in

the quality estimates of consumers. This bound is valid under mild assumptions on the distribution

Fε and only depends on sε, the standard deviation of the random ex-post quality disturbances

{εk,i}. In particular, (5c) shows that this estimation bias can be arbitrarily small if sε is small, i.e.,

despite the fact that perfect learning may fail, if the variance of the ex-post quality shocks is small,

consumers’ estimates eventually get relatively close to the correct values.
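The fixed point in (4) generally has no closed form, but it is easy to compute numerically. The Python sketch below assumes, purely for illustration, normally distributed quality shocks with standard deviation 0.1 (one of the unbounded-support cases allowed by the model) and solves (4) by bisection. The function names and the parameter values are hypothetical.

```python
import math


def normal_survival(x, std):
    """Survival function 1 - F(x) of a zero-mean normal shock with standard deviation std."""
    return 0.5 * math.erfc(x / (std * math.sqrt(2.0)))


def asymptotic_estimate(true_quality, std, iterations=60):
    """Solve x = survival(x - true_quality) on [0, 1] by bisection, i.e. the fixed point in (4)."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        # The map x -> survival(x - q) - x is strictly decreasing, so the root is unique.
        if normal_survival(mid - true_quality, std) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


std = 0.1  # assumed standard deviation of the quality shocks
limits = {q: asymptotic_estimate(q, std) for q in (0.2, 0.5, 0.8)}
bias_bound = std ** (2.0 / 3.0)  # right-hand side of (5c)
```

Under these assumptions the computed limits reproduce the three properties discussed above: the estimate for a product of quality 1/2 is unbiased, low qualities are overestimated and high qualities underestimated, the ranking is preserved, and every bias stays below the bound in (5c).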

In the remainder of this paper we assume that the quality random disturbances are uniformly

distributed on the interval [−ε, ε], where ε > 0, i.e.,

Fε(x) =

0 for x < −ε,

(x+ ε)/2ε for − ε ≤ x < ε,

1 for x ≥ ε.

(6)

The following proposition characterizes explicitly q∞ under this assumption, and, in particular, it

shows that this choice improves the bound in (5c).

Proposition 2. Let εk,i ∼ U[−ε, ε] with ε > 0, for all i and k. Then q∞ is independent of the

number of products, of prior beliefs, of prior weights, and of prices. In particular, for all k ∈PK ,

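$$ q_{k,\infty} = \frac{q_k + \varepsilon}{1 + 2\varepsilon}. \tag{7} $$

Moreover, for all k, k1, k2 ∈ PK,

$$ |q_{k,\infty} - q_k| \le \frac{\varepsilon}{1 + 2\varepsilon} < \varepsilon, \tag{8a} $$

$$ |q_{k_1,\infty} - q_{k_2,\infty}| \le |q_{k_1} - q_{k_2}|. \tag{8b} $$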
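For the reader's convenience, the expressions in Proposition 2 can be recovered from (4) and (6) by a short computation. The sketch below assumes that the fixed point lies in the region where the uniform distribution function is linear; this is checked in the third line.

```latex
\begin{align*}
\bar F_\varepsilon(x) &= 1 - \frac{x + \varepsilon}{2\varepsilon}
  = \frac{\varepsilon - x}{2\varepsilon}, \qquad -\varepsilon \le x < \varepsilon, \\
q_{k,\infty} &= \frac{\varepsilon - (q_{k,\infty} - q_k)}{2\varepsilon}
  \;\Longrightarrow\; (1 + 2\varepsilon)\, q_{k,\infty} = q_k + \varepsilon
  \;\Longrightarrow\; q_{k,\infty} = \frac{q_k + \varepsilon}{1 + 2\varepsilon}, \\
q_{k,\infty} - q_k &= \frac{\varepsilon\,(1 - 2 q_k)}{1 + 2\varepsilon},
  \qquad |1 - 2 q_k| \le 1 \ \text{for } q_k \in [0, 1], \\
q_{k_1,\infty} - q_{k_2,\infty} &= \frac{q_{k_1} - q_{k_2}}{1 + 2\varepsilon}.
\end{align*}
```

The third line gives the bias bound and shows that the bias is smaller than ε in absolute value, so the linear branch of the distribution function is indeed the relevant one. Its sign also agrees with (5a). The fourth line gives the contraction of quality differences.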

Proposition 2 explicitly characterizes qk,∞ as a function of the true quality qk under the uniform

assumption on the ex-post quality noise {εk,i}. In particular, it establishes that qk,∞ depends neither

on the prices, nor on the prior beliefs, nor on the prior weights, nor on the number of products.

This property of qk,∞ stems from the fact that consumers report reviews only on the basis of the

difference between expected and experienced quality. In fact, combining the independence of the

{εk,i}’s from the {αk,i}’s and (1), we obtain the probability of observing a positive review for

product k, conditionally on the purchase of that product:

$$ \mathbb{P}(r_{k,i} = L \mid c_i = k) = \frac{\mathbb{P}(q_{k,i} \le q_k + \varepsilon_{k,i},\ c_i = k)}{\mathbb{P}(c_i = k)} = \bar F_\varepsilon(q_{k,i} - q_k). \tag{9} $$


Namely, given the quality estimate qk,i, the probability of reporting a positive or negative review

for product k conditional on the purchase of product k only depends on the distribution Fε, which,

in turn, is independent of prices, of prior beliefs, of prior weights, and of the number of products.
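The review mechanism behind (9) can be illustrated with a minimal Python sketch for a single product. It assumes, for simplicity, that every arriving consumer buys the product (so choice plays no role) and that the prior is encoded as pseudo-counts of likes and reviews; the function and parameter names are ours and purely illustrative.

```python
import random

def simulate_naive_estimate(q_true, eps, n_reviews,
                            prior_likes=0.5, prior_weight=1.0, seed=0):
    """Simulate like/dislike reviews for one product; return the final naive estimate."""
    rng = random.Random(seed)
    likes, total = prior_likes, prior_weight
    for _ in range(n_reviews):
        estimate = likes / total                        # naive estimate: share of likes
        experienced = q_true + rng.uniform(-eps, eps)   # true quality plus uniform shock
        if estimate <= experienced:                     # like iff experience meets expectation
            likes += 1.0
        total += 1.0
    return likes / total

q_true, eps = 0.8, 0.1
estimate = simulate_naive_estimate(q_true, eps, n_reviews=50_000)
limit = (q_true + eps) / (1 + 2 * eps)   # value at which P(like) equals the estimate
print(f"simulated estimate: {estimate:.4f}, fixed point: {limit:.4f}")
```

With these (arbitrary) parameter values the simulated estimate settles near (q + ε)/(1 + 2ε), the point at which the probability of a like equals the current estimate, rather than at the true quality q.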

Moreover, (8a) shows that assuming uniform ex-post quality disturbances improves the bound

provided in (5c). Especially if ε is small, we can thus speak of ε-learning to stress the fact that, even if asymptotic learning may not be achieved for some product, agents learn the values of the true qualities within an O(ε)-error. It is also worth noting that this property suggests that consumers’ failed learning is not the effect of some herd behavior but, rather, derives from the naïve inference procedure.

Finally, (8b) highlights the presence of a shrinking effect in the asymptotic estimates of consumers, that is, the naïve estimation procedure leads consumers to perceive qualities as closer than they really are.
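The closed form (7) and the bounds (8a) and (8b) are easy to check numerically. The following Python sketch does so for a few illustrative quality values; the function name and the chosen numbers are assumptions made for the example only.

```python
def limiting_estimate(q, eps):
    """Asymptotic naive estimate from (7) for a true quality q in [0, 1]."""
    return (q + eps) / (1 + 2 * eps)

eps = 0.05
true_qualities = [0.1, 0.5, 0.9]
limits = [limiting_estimate(q, eps) for q in true_qualities]

for q, q_inf in zip(true_qualities, limits):
    bias = q_inf - q
    # (8a): the absolute bias never exceeds eps / (1 + 2 * eps)
    assert abs(bias) <= eps / (1 + 2 * eps) + 1e-12
    print(f"q = {q:.2f}  q_inf = {q_inf:.4f}  bias = {bias:+.4f}")

# (8b): the gap between two limiting estimates is the true gap divided by (1 + 2 * eps)
gap_true = true_qualities[2] - true_qualities[0]
gap_limit = limits[2] - limits[0]
print(f"true gap = {gap_true:.4f}, perceived gap = {gap_limit:.4f}")
```

Low qualities are overestimated and high qualities underestimated, while the ranking of the products is preserved.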

3.2 Fraction of Consumers Making the Correct Purchase

In our model, due to the presence of learning effects, consumers may make mistakes, that is, they

may end up choosing a different product from the one they would select if they knew the values

of the intrinsic qualities. If consumers learned the qualities of products over time, only a finite

number of them would select the wrong product and, as a result, the fraction of consumers making

correct choices throughout the process would converge asymptotically to one. This section shows

that this is not the case in our model and that, due to the bias induced by the naïve estimation

procedure, consumers keep making mistakes with positive probability.

In order to see this, let

\[
\{c^{CI}_i = k\} := \Big\{\alpha_{k,i} + q_k - p_k \ge \max_{j \in \mathcal{P}^0_K} (\alpha_{j,i} + q_j - p_j)\Big\}
\]

denote the choice of consumer i to purchase product k in the complete information case. Instead, the choice of consumer i to purchase product k if she uses the naïve rule (3) is given by

\[
\{c_i = k\} := \Big\{\alpha_{k,i} + q_{k,i} - p_k \ge \max_{j \in \mathcal{P}^0_K} (\alpha_{j,i} + q_{j,i} - p_j)\Big\}.
\]

We say that consumer i makes the correct choice buying product k when the choices in the naïve and in the complete information cases coincide, i.e.,

\[
\{\text{consumer } i \text{ makes the correct choice buying product } k\} := \{c_i = k\} \cap \{c^{CI}_i = k\}.
\]

If we define $M_{k,i} := \sum_{j=1}^{i-1} 1\{c_j = k\} \cdot 1\{c^{CI}_j = k\}$, then the fraction of consumers that made the correct choice buying product k prior to the arrival of consumer i can be written as

\[
m_{k,i} :=
\begin{cases}
0 & \text{for } B_{k,i} = 0, \\
M_{k,i}/B_{k,i} & \text{for } B_{k,i} > 0.
\end{cases}
\tag{10}
\]


The following proposition establishes that the K-dimensional process mi := (m1,i,m2,i, . . . ,mK,i)

converges almost surely to the limit m∞ := (m1,∞,m2,∞, . . . ,mK,∞), where

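\[
m_{k,\infty} :=
\frac{\exp\left(-\dfrac{q_k + \varepsilon}{1 + 2\varepsilon}\right) + \sum_{j=1}^{K} \exp\left(\dfrac{q_j - q_k}{1 + 2\varepsilon} - p_j\right)}
{\exp\left(\dfrac{2\varepsilon}{1 + 2\varepsilon}\left[q_k - \dfrac{1}{2}\right]^{+} - q_k\right) + \sum_{j=1}^{K} \exp\left(\dfrac{q_j - q_k}{(1 + 2\varepsilon)^{1\{q_j \le q_k\}}} - p_j\right)}.
\tag{11}
\]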

Proposition 3. Let mi be defined by (10). Then, almost surely, mi → m∞ as i → ∞. Moreover, for all k ∈ PK, we have

\[
\lim_{\varepsilon \to 0} m_{k,\infty} = 1. \tag{12}
\]

The interpretation of (12) is immediate: the estimation bias shrinks as ε → 0 (see (8a)), so consumers learn the true qualities in the long run and eventually take the correct action.
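To make the limit (12) concrete, the Python sketch below evaluates m_{k,∞} for a small example. It reads (11) as a ratio: the numerator is the normalization of the naïve purchase probability of product k at q∞, and the denominator is the corresponding normalization for the event that k is chosen under both the naïve and the complete information rules, with the indicator 1{qj ≤ qk} acting as an exponent of (1 + 2ε). This reading, as well as the function name and the numerical values, are assumptions made for illustration.

```python
import math

def correct_choice_fraction(k, q, p, eps):
    """Limiting share of product-k buyers who would also pick k under full information."""
    scale = 1 + 2 * eps
    # Numerator: exp(-p_k) divided by the naive purchase probability of product k.
    num = math.exp(-(q[k] + eps) / scale)
    num += sum(math.exp((qj - q[k]) / scale - pj) for qj, pj in zip(q, p))
    # Denominator: same normalization for the event "k is chosen under both rules".
    den = math.exp(2 * eps / scale * max(q[k] - 0.5, 0.0) - q[k])
    for qj, pj in zip(q, p):
        shrink = scale if qj <= q[k] else 1.0   # (1 + 2 * eps) ** 1{q_j <= q_k}
        den += math.exp((qj - q[k]) / shrink - pj)
    return num / den

q, p = [0.2, 0.5, 0.9], [1.0, 1.0, 1.0]
for eps in (0.2, 0.1, 0.01, 0.001):
    shares = [correct_choice_fraction(k, q, p, eps) for k in range(len(q))]
    print(eps, [round(m, 4) for m in shares])
```

Under this reading every value lies in (0, 1], and the values approach one as ε decreases, in line with (12).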

4 The Fluid Approximation of the Learning Dynamics

The asymptotic analysis of this section provides insight into how the learning transient depends on the number of product options, the prices, and the distance between consumers’ prior estimates and the asymptotic perceived qualities. We study the learning process in a large market setting,

and show that the learning trajectories converge to the solutions of a system of non-linear ODEs,

which we subsequently analyze.

4.1 Formulation

We consider a sequence of systems indexed by n = 1, 2, . . . . The n-th system describes a market where consumers arrive according to a Poisson process with parameter $\Lambda^n = n\Lambda$, and where the number of likes, of dislikes, and of purchases for product k before time t are respectively denoted by $L^n_k(t)$, $D^n_k(t)$ and $B^n_k(t)$. The scaled state variables for product k are defined as $I^n_k(t) = (\bar L^n_k(t), \bar D^n_k(t)) := (L^n_k(t)/n, D^n_k(t)/n)$, and $I^n(t) := (I^n_1(t), I^n_2(t), \dots, I^n_K(t))$ is the total information available at time t in the n-th system.8 The following lemma establishes that, if the arrival rate of consumers grows without bound, there exist two deterministic processes $L_k(t)$ and $D_k(t)$ that approximate with arbitrary precision the scaled state variables $\bar L^n_k(t)$ and $\bar D^n_k(t)$. In particular,

8 A similar formulation is used in Crapis et al. (2017) to investigate the speed of the learning transient and related pricing issues in the single-product case.


defining $B_k(t) := L_k(t) + D_k(t)$ and

\[
q_k(t) := \frac{L_k(t) + L_{k,0}}{L_k(t) + L_{k,0} + D_k(t) + D_{k,0}} = \frac{L_k(t) + L_{k,0}}{B_k(t) + B_{k,0}},
\]

this deterministic approximation allows us to describe the learning trajectories as the continuous-time solution q(t) := (q1(t), q2(t), . . . , qK(t)) of a system of ODEs.

Lemma 1. For every t > 0 and every k ∈ PK we have $\sup_{0 \le s \le t} |\bar L^n_k(s) - L_k(s)| \to 0$ and $\sup_{0 \le s \le t} |\bar D^n_k(s) - D_k(s)| \to 0$ as n → ∞ almost surely. Moreover, for all k ∈ PK, the processes Lk(t) and Dk(t) are deterministic and satisfy the differential relations

\[
\dot L_k(t) = \Lambda\, P(r_k(t) = L \mid I(t)) = \Lambda\, \lambda_k(q(t), p)\, \bar F_\varepsilon(q_k(t) - q_k), \tag{13}
\]

\[
\dot D_k(t) = \Lambda\, P(r_k(t) = D \mid I(t)) = \Lambda\, \lambda_k(q(t), p)\, F_\varepsilon(q_k(t) - q_k), \tag{14}
\]

where I(t) := (I1(t), I2(t), . . . , IK(t)), Ik(t) := (Lk(t), Dk(t)), and $\bar F_\varepsilon := 1 - F_\varepsilon$.

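In the context of this fluid approximation, the dynamics of the quality estimate qk(t) are governed by the following non-linear system of ODEs:

\[
\dot q_k(t) = \frac{\dot B_k(t)}{B_k(t) + B_{k,0}} \times
\begin{cases}
1 - q_k(t) & \text{for } q_k(t) \le q_k - \varepsilon, \\
\dfrac{1 + 2\varepsilon}{2\varepsilon}\,[q_{k,\infty} - q_k(t)] & \text{for } q_k - \varepsilon < q_k(t) \le q_k + \varepsilon, \\
-q_k(t) & \text{for } q_k(t) \ge q_k + \varepsilon,
\end{cases}
\tag{15}
\]

where $\dot B_k(t) = \dot L_k(t) + \dot D_k(t) = \Lambda \lambda_k(q(t), p)$.9 The following proposition establishes some structural properties of the solution of (15) and serves as a sanity check that q(t) converges to q∞.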

Proposition 4. Let q(t) := (q1(t), q2(t), . . . , qK(t)) be the vector of quality estimates in the pro-

posed fluid approximation, which is the solution of the ODE system (15). Then q(t) → q∞ =

(q1,∞, q2,∞, . . . , qK,∞) for t → ∞, where qk,∞ was defined in (7). Moreover, if qk,0 < qk,∞, then

qk(t) is monotonically increasing for all t ≥ 0; otherwise, if qk,0 > qk,∞, then qk(t) is strictly

monotonically decreasing for all t ≥ 0.

For instance, Proposition 4 establishes that, in the underestimating prior belief case qk,0 < qk − ε, qk(t) increases monotonically towards qk,∞. In particular, given that qk − ε < qk,∞ (see (8a)), qk(t) will reach the point qk − ε in finite time. This fact becomes particularly relevant in the next section, where we study the speed of the learning transient in the underestimating prior belief case.

9 See the Appendix for the proof of (15).
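The convergence and monotonicity properties of the fluid estimates can be explored with a simple forward-Euler integration. The Python sketch below is a rough illustration, not the authors' numerical procedure: it assumes MNL purchase probabilities with an outside option of utility zero, uniform shocks on [−ε, ε], and arbitrary illustrative values for the qualities, prices, step size and horizon; all function names are hypothetical.

```python
import math

def mnl_shares(q_est, p):
    """MNL purchase probabilities with a no-purchase option of utility zero."""
    w = [math.exp(qe - pk) for qe, pk in zip(q_est, p)]
    denom = 1.0 + sum(w)
    return [wk / denom for wk in w]

def like_probability(q_est, q_true, eps):
    """P(like | purchase) = P(shock >= q_est - q_true) for a uniform shock on [-eps, eps]."""
    return min(1.0, max(0.0, (eps - (q_est - q_true)) / (2 * eps)))

def fluid_trajectory(q_true, q0, p, eps, B0=1.0, Lam=1.0, T=200.0, dt=0.01):
    """Forward-Euler integration of the fluid quality estimates."""
    q = list(q0)
    B = [0.0] * len(q)                     # cumulative fluid purchases per product
    path = [tuple(q)]
    for _ in range(round(T / dt)):
        shares = mnl_shares(q, p)          # evaluated at the current estimates
        for k in range(len(q)):
            dB = Lam * shares[k] * dt      # purchases of product k in [t, t + dt)
            drift = like_probability(q[k], q_true[k], eps) - q[k]
            q[k] += dB * drift / (B[k] + B0)
            B[k] += dB
        path.append(tuple(q))
    return path

q_true, p, eps = [0.8, 0.6, 0.4], [1.0, 1.0, 1.0], 0.05
path = fluid_trajectory(q_true, q0=[0.1, 0.1, 0.1], p=p, eps=eps)
q_inf = [(qk + eps) / (1 + 2 * eps) for qk in q_true]
print([round(x, 4) for x in path[-1]], [round(x, 4) for x in q_inf])
```

The drift term equals 1 − q when the estimate is below qk − ε, equals −q when it is above qk + ε, and is linear in between, which reproduces the three regimes of the ODE. Starting from underestimating priors, the trajectories in this example increase towards (qk + ε)/(1 + 2ε).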


4.2 Speed of the Learning Transient

Despite the simplification introduced by the fluid approximation, the ODE in (15) remains intractable in most cases. However, assuming that consumers initially underestimate the quality of products, i.e., qk,0 ≤ qk,∞ for all k ∈ PK, further simplifies the structure of (15) and allows us to provide insight into how learning transients are correlated across products.10

In particular, in this section we assume qk,0 < qk − ε and we focus on the phase of the learning

process such that t ≤ τk, where τk := inf{t : t > 0, |qk(t) − qk| ≤ ε } is the ε-time-to-learn for

product k. The following proposition establishes that qk(t) converges exponentially fast to qk,∞ for

t ≤ τk, and it provides tight bounds for the rate of exponential convergence.

Proposition 5. Let qk(t) be the quality estimate for product k in the proposed fluid approximation. Suppose that qk,0 < qk − ε, and that qj,0 ≤ qj,∞ for all j ≠ k. Then, for 0 ≤ t ≤ τk, we have

\[
|q_k(t) - q_{k,\infty}| \le |q_{k,0} - q_{k,\infty}| \exp\left(-\frac{v_k t}{2}\right). \tag{16}
\]

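The rate of convergence for product k, vk, is such that $\underline{v}_k \le v_k \le \bar{v}_k$, where

\[
\underline{v}_k := \frac{1}{B_{k,0} + B_k(\tau_k)} \cdot \frac{\exp(q_k - \varepsilon - p_k)}{1 + \exp(q_k - \varepsilon - p_k) + \sum_{j \ne k} \exp(q_{j,\infty} - p_j)}, \tag{17}
\]

\[
\bar{v}_k := \frac{1}{B_{k,0} + B_k(\tau_k)} \cdot \frac{\exp(q_k - \varepsilon - p_k)}{1 + \exp(q_k - \varepsilon - p_k) + \sum_{j \ne k} \exp(q_{j,0} - p_j)}, \tag{18}
\]

and $B_k(\tau_k) = B_{k,0}\,\dfrac{q_k - q_{k,0} - \varepsilon}{1 - (q_k - \varepsilon)}$ is the amount of observations needed to learn qk within an O(ε)-error. Moreover, if qj,0 = qj,∞ for j ≠ k we have $\underline{v}_k = v_k = \bar{v}_k$, i.e., the bounds (17) and (18) are tight.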
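The expression for Bk(τk) in Proposition 5 can be recovered with a short calculation. The derivation below is a sketch that assumes the prior estimate is encoded as $q_{k,0} = L_{k,0}/B_{k,0}$ and uses the fact that, as long as $q_k(t) \le q_k - \varepsilon$, every purchase of product k in the fluid model results in a like.

```latex
% Before \tau_k all reviews are likes, so L_k(t) = B_k(t) and
\[
q_k(t) = \frac{q_{k,0} B_{k,0} + B_k(t)}{B_{k,0} + B_k(t)}, \qquad 0 \le t \le \tau_k .
\]
% Imposing q_k(\tau_k) = q_k - \varepsilon and solving for B_k(\tau_k):
\[
B_k(\tau_k)\,\bigl[1 - (q_k - \varepsilon)\bigr] = B_{k,0}\,\bigl(q_k - \varepsilon - q_{k,0}\bigr)
\quad\Longrightarrow\quad
B_k(\tau_k) = B_{k,0}\,\frac{q_k - q_{k,0} - \varepsilon}{1 - (q_k - \varepsilon)} .
\]
% Illustrative numbers: B_{k,0} = 1, q_{k,0} = 0.2, q_k = 0.8, \varepsilon = 0.05
% give B_k(\tau_k) = 0.55 / 0.25 = 2.2 .
```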

Proposition 5 allows us to perform some comparative statics on $\underline{v}_k$ and $\bar{v}_k$, and to highlight which factors influence the learning transients the most, as we briefly discuss in the sequel.

(i) Number of product options: $\underline{v}_k$ and $\bar{v}_k$ decrease with the number of product options K.11 In particular, the expressions in (17) and (18) highlight that information dispersion can dramatically slow down learning transients if K grows large, given that both $\underline{v}_k$ and $\bar{v}_k$ go to zero roughly like $K^{-1}$ as K → ∞.

(ii) Prior belief/true quality distance: From (16) we see that the learning transient also depends on the distance of the prior quality estimate from the asymptotic limiting estimate (which,

10 Analogous results can be obtained in the case where consumers initially overestimate the quality of products, that is, when qk,0 ≥ qk,∞ for all k ∈ PK.

11 The quantities $\underline{v}_k$ and $\bar{v}_k$ depend on K through the K − 1 (strictly positive) summands in $\sum_{j \ne k} \exp(q_{j,\infty} - p_j)$ and $\sum_{j \ne k} \exp(q_{j,0} - p_j)$, respectively.


in general, is different from the true quality). Separately, focusing on the exponent that governs the learning transient, we note that $\underline{v}_k$ and $\bar{v}_k$ depend on the inverse of Bk(τk), which is the number of observations needed to learn qk within an O(ε)-error, and is itself increasing in the distance of the prior quality estimate from the true quality. So, the time-to-learn is increasing in the distance of the prior from the truth and from the limiting quality that the market will eventually learn.

(iii) Relative attractiveness vis-à-vis competing alternatives: The definitions of $\underline{v}_k$ and $\bar{v}_k$ show that the time-to-learn the quality of a product k depends on its eventual relative attractiveness vis-à-vis the competing alternatives, and on the relative attractiveness of its quality vis-à-vis the prior estimates of the competing alternatives. Products with a higher perceived-quality/price difference than their alternatives (either because they started from a higher prior belief, or because they have higher intrinsic quality, or because they are cheaper) will be selected more frequently by consumers, hindering information accumulation for their competitors.

(iv) Weight of prior estimate: $\underline{v}_k$ and $\bar{v}_k$ are both decreasing in Bk,0, i.e., if consumers assign a higher weight to their prior beliefs they are less inclined to forget their prior belief qk,0 and, as a result, the update of their quality beliefs will be slower.
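The comparative statics above can be reproduced numerically. The Python sketch below evaluates the two rate bounds for a market of identical products and an increasing number of options K. It assumes that (17) is the lower bound (built on the limiting estimates of the competitors) and (18) the upper bound (built on their priors), that (18) has the same numerator as (17), and that the competitors' priors do not exceed their limiting estimates; function names and parameter values are illustrative.

```python
import math

def observations_to_learn(qk, qk0, eps, Bk0=1.0):
    """B_k(tau_k): fluid purchases needed before the estimate reaches q_k - eps."""
    return Bk0 * (qk - qk0 - eps) / (1.0 - (qk - eps))

def rate_bounds(k, q, q0, p, eps, B0=1.0):
    """Return (lower, upper) bounds on the convergence rate of product k."""
    own = math.exp(q[k] - eps - p[k])
    scale = 1.0 / (B0 + observations_to_learn(q[k], q0[k], eps, B0))
    others = [j for j in range(len(q)) if j != k]
    rivals_limit = sum(math.exp((q[j] + eps) / (1 + 2 * eps) - p[j]) for j in others)
    rivals_prior = sum(math.exp(q0[j] - p[j]) for j in others)
    lower = scale * own / (1.0 + own + rivals_limit)   # competitors at their limits
    upper = scale * own / (1.0 + own + rivals_prior)   # competitors at their priors
    return lower, upper

for K in (2, 10, 50, 250):
    q, q0, p = [0.8] * K, [0.2] * K, [1.0] * K
    lower, upper = rate_bounds(0, q, q0, p, eps=0.05)
    print(f"K = {K:3d}  lower = {lower:.5f}  upper = {upper:.5f}  K * lower = {K * lower:.4f}")
```

In this example the product K × lower stabilizes as K grows, which is the roughly 1/K decay described in item (i); increasing B0 or the gap between prior and true quality lowers both bounds, as in items (ii) and (iv).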

5 The Platform Control Problem

Typically, when consumers search for a particular product on online markets, the available

options are displayed according to a ranking that is decided by the online platform. It is an

empirically well-established fact that products’ positioning in the search results highly influences

consumers’ purchase decisions, as products that occupy the first positions are picked with a higher

probability than the products that are displayed in lower positions, all other things being equal.12

In this section, we modify the model studied in the previous sections to introduce a search cost

perceived by consumers as they evaluate alternatives. This search cost is an increasing function

of the ordering in which products are displayed to the consumer. We formalize the concept of

randomized ranking policies, and provide a heuristic analysis of the information control problem of

the online platform which focuses on different information provision policies that the platform can

implement to influence consumers’ purchase decisions.

12 See, e.g., Kempe and Mahdian (2008), Craswell, Zoeter, Taylor, and Ramsey (2008), and Lerman and Hogg (2014) for evidence from online markets, search engines and online recommendation systems, respectively.


5.1 Formulation and Learning Results

In the remainder of the paper we assume that the platform receives a share 0 < s ≤ 1 of every

transaction that takes place on the website. The utility that consumer i expects to obtain from

the purchase of product k is given by αk,i + qk,i − pk − g(σk,i), where σk,i ∈ {1, 2, . . . ,K} is the

position of product k in the search results of consumer i, and where g : {1, 2, . . . ,K} → R is a

non-decreasing cost function.

Let ZK indicate the set of all permutations of {1, 2, . . . , K} and let ∆(ZK) be the space of

all probability distributions over ZK . We assume that, upon arrival of consumer i, the platform

chooses a probability distribution Πi ∈ ∆(ZK), and then displays products according to a random-

ized position assignment σi := (σ1,i, σ2,i, . . . , σK,i) ∼ Πi.13 Conditional on observing the position

assignment z ∈ ZK , the probability that consumer i purchases product k is

\[
\lambda_k(q_i, p, z) := P(c_i = k \mid \sigma_i = z) = \frac{\exp(q_{k,i} - p_k - g(z_k))}{1 + \sum_{j=1}^{K} \exp(q_{j,i} - p_j - g(z_j))}. \tag{19}
\]

Notice that if we set g(x) = 0 for all x we recover the demand function λk(qi,p) introduced in (1).
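As a quick illustration of (19), the following Python sketch computes the purchase probabilities for a given position assignment. The function name, the linear search cost and the numerical values are assumptions chosen for the example.

```python
import math

def purchase_probabilities(q_est, p, positions, g):
    """Purchase probabilities (19); positions[k] is the slot (1 = top) of product k."""
    w = [math.exp(qk - pk - g(zk)) for qk, pk, zk in zip(q_est, p, positions)]
    denom = 1.0 + sum(w)          # the 1 accounts for the no-purchase option
    return [wk / denom for wk in w]

gamma = 0.8
g = lambda slot: gamma * (slot - 1)   # illustrative linear search cost
q_est, p = [0.5, 0.5, 0.5], [1.0, 1.0, 1.0]

top_first = purchase_probabilities(q_est, p, positions=[1, 2, 3], g=g)
reversed_order = purchase_probabilities(q_est, p, positions=[3, 2, 1], g=g)
no_cost = purchase_probabilities(q_est, p, positions=[1, 2, 3], g=lambda slot: 0.0)
print(top_first)
print(reversed_order)
print(no_cost)
```

With identical products, the purchase probability is driven entirely by the display position, and with a zero search cost the three products are chosen with equal probability.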

A ranking policy is a function that, given an information state Ii, returns a probability distribution

Πi ∈ ∆(ZK). The platform’s objective is to choose the position assignment policy that maximizes

the expected discounted cumulative revenues

\[
s\, E\left[\sum_{i=1}^{\infty} e^{-\delta t_i} \sum_{k=1}^{K} p_k\, 1\{c_i = k\}\right], \tag{20}
\]

where δ > 0 is the discount factor of the platform. Notice that the expectation in (20) is taken

with respect to consumers’ arrival times {ti}, the idiosyncratic preference parameters {αk,i}, the

ex-post quality disturbances {εk,i}, and the position assignments {σk,i}.

The first result of this section establishes that, irrespective of the ranking policy adopted by the

platform, consumers’ quality estimates converge to the point q∞ characterized in Proposition 2.

Proposition 6. Assume that the platform commits to a randomized ranking policy and that con-

sumers choose using a MNL model. Then, qi → q∞ for i → ∞ almost surely, where q∞ was

defined in (7).

The key observation at the basis of the proof of Proposition 6 is that, despite the introduction of

search costs in consumers’ utilities, the MNL model at the basis of consumers’ choices guarantees

that each product is selected infinitely often. Thus, the platform cannot hinder the disclosure of

13 L’Ecuyer et al. (2017) consider similar randomized ranking policies, however in a static information environment with no learning on the consumers’ side.


new information and, as a result, consumers’ estimates will eventually converge to q∞. Moreover,

as established in (5b), consumers learn to rank products according to their true qualities in the

long run, which allows us to introduce the concept of asymptotic myopic optimality of the platform’s

ranking policy.

Definition 1. Any ranking policy that, in the limit i → ∞, displays products according to $\sigma_\infty \sim \Pi_\infty = \arg\max_{\Pi \in \Delta(Z_K)} E_\Pi\left[\sum_{k=1}^{K} p_k\, \lambda_k(q_\infty, p, \sigma)\right]$ is asymptotically optimal.

Before moving to the definition of the different ranking policies examined in our heuristic analysis,

we make a comment regarding the search costs introduced in the present section. As shown in

Section 4.2, when search costs are absent, the presence of a high number of alternatives in the

market can severely slow down information accumulation for all products. This is no longer the

case when search costs are present. To see this, assume that the platform commits to a static

ranking policy that displays products, for instance, in alphabetic order, i.e., σk,i = k for all i and

let g(k) = γ (k− 1), where γ is a positive constant. It is not difficult to see that the bounds in (17)

and (18) for the rate of convergence in the fluid model can be now reformulated as follows:

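\[
\underline{v}_k := \frac{1}{B_{k,0} + B_k(\tau_k)} \cdot \frac{\exp[q_k - \varepsilon - p_k - \gamma(k - 1)]}{1 + \exp[q_k - \varepsilon - p_k - \gamma(k - 1)] + \sum_{j \ne k} \exp[q_{j,\infty} - p_j - \gamma(j - 1)]},
\]

\[
\bar{v}_k := \frac{1}{B_{k,0} + B_k(\tau_k)} \cdot \frac{\exp[q_k - \varepsilon - p_k - \gamma(k - 1)]}{1 + \exp[q_k - \varepsilon - p_k - \gamma(k - 1)] + \sum_{j \ne k} \exp[q_{j,0} - p_j - \gamma(j - 1)]}.
\]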

The key difference from the case when search costs are absent is that the summations $\sum_{j \ne k} \exp[q_{j,\infty} - p_j - \gamma(j - 1)]$ and $\sum_{j \ne k} \exp[q_{j,0} - p_j - \gamma(j - 1)]$ in the denominators of the above fractions now converge to a finite constant as K → ∞. This implies that the rate of information accumulation for products in the first $O(\gamma^{-1})$ positions (products with a relatively low associated search cost) becomes independent of K as K → ∞, and, in particular, it is higher compared to the case with no search costs, where $v_k \sim O(K^{-1})$ uniformly across products. On the other hand, for bottom-displayed products information accumulation slows down roughly as exp(−γK) as K grows large, which is much slower than in the case with no search costs. In other words, when K is large, considering bottom-ranked products is too costly for consumers, who ultimately restrict their option

set to the top-ranked products. For instance, for normalized prices and estimated qualities, so that

|qk,i − pk| ≤ 1 for all i, k, the expression for the expected demand in (19) suggests that consumers

consider only the first 15-20 products when γ = .2, whereas when γ = .8 this is true only for the

top 4-6 products.14

14 A possible way to see this is to fix 0 < η ≪ 1 and to observe that, if g(σk,i) = γ(k − 1) and |qk,i − pk| ≤ 1 for all i, k, then λk(qi,p,σi) < exp[1 − γ(k − 1)] ≤ η whenever k > 1 + (1 − ln η)/γ.
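The threshold on the display position can be computed directly. The Python sketch below returns the smallest position k at which the bound exp[1 − γ(k − 1)] falls below a tolerance η; the function name and the tolerance values are illustrative assumptions, since the text does not specify which η underlies the quoted ranges.

```python
import math

def consideration_cutoff(gamma, eta):
    """Smallest position k such that exp(1 - gamma * (k - 1)) <= eta."""
    return math.ceil(1 + (1 - math.log(eta)) / gamma)

for gamma in (0.2, 0.8):
    for eta in (0.1, 0.05):
        k = consideration_cutoff(gamma, eta)
        print(f"gamma = {gamma}, eta = {eta}: purchase probability below eta from position {k} on")
```

For tolerances of this order the cutoff falls at around 20 positions when γ = 0.2 and at around 6 positions when γ = 0.8, broadly in line with the ranges mentioned above.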


The above observation also implies that, when search costs are present, the information accu-

mulation rate for a given product goes exponentially to zero with its position in the ranking. As a

result, the platform may want to consider dynamic policies that alternate which products occupy

the top positions to accelerate the information accumulation process for unexplored products.

However, direct solution of the sequential ranking problem (20) is intractable. In the sequel we

will briefly examine a set of myopic heuristic ranking policies designed either to maximize immediate

revenues (greedy ranking policies), or to optimize a measure of the information acquired by the

market in each transaction (explorative ranking policies).

5.2 The Greedy Ranking Policy

The first ranking policy considered in this section exploits the available information in the market

to maximize immediate profits. Formally, the greedy ranking policy displays a random position assignment drawn from the probability distribution $\Pi^G_i \in \Delta(Z_K)$ such that

\[
\Pi^G_i = \arg\max_{\Pi \in \Delta(Z_K)} E_\Pi\left[\sum_{k=1}^{K} p_k\, \lambda_k(q_i, p, \sigma)\right]. \tag{21}
\]

Notice that (21) satisfies Definition 1 as i → ∞, which allows us to state the following corollary.

Corollary 2. The greedy ranking policy is asymptotically optimal.

The following proposition establishes that solving the optimization problem in (21) is equiv-

alent to finding the solution of a combinatorial optimization problem over the space of possible

deterministic position assignments ZK .

Proposition 7. Let $\Pi^G_i$ be defined as in (21). Then, we have

\[
E_{\Pi^G_i}\left[\sum_{k=1}^{K} p_k\, \lambda_k(q_i, p, \sigma_i)\right] = \max_{z \in Z_K} \sum_{k=1}^{K} p_k\, \lambda_k(q_i, p, z). \tag{22}
\]

Proposition 7 establishes that under the myopically greedy ranking policy it is optimal for the

platform to choose a position assignment that guarantees the highest expected revenue. The optimal

position assignment is the solution of the combinatorial optimization problem in the r.h.s. of (22),

which is called the Multinomial Logit Positioning Problem (MLPP).15 A naïve approach to solving MLPP would be to find the set of optimal position assignments by evaluating the expected profit in the r.h.s. of (22) for each permutation z ∈ ZK. This approach is, of course, infeasible because the number of permutations of {1, 2, . . . , K} grows faster than exponentially with K. However, MLPP can be solved in polynomial time, as the following proposition guarantees.

15 Abeliuk et al. (2016) studied a similar combinatorial optimization problem, however in the presence of capacity constraints.

Proposition 8. MLPP can be solved in polynomial time. Moreover, any optimal position assignment σ∗i and the corresponding optimal profit ρ∗ are such that

\[
\sigma^*_{k_1,i} \le \sigma^*_{k_2,i} \iff (p_{k_1} - \rho^*) \exp(q_{k_1,i} - p_{k_1}) \ge (p_{k_2} - \rho^*) \exp(q_{k_2,i} - p_{k_2}), \tag{23}
\]

for all k1, k2 ∈ PK. In particular, if there exist k1, k2 ∈ PK such that pk1 = pk2, then σ∗k1,i ≤ σ∗k2,i iff qk1,i ≥ qk2,i.

It is not difficult to see that MLPP may have multiple solutions. This is indeed the case when

there are ties between products, that is, when there exist some i and k1, k2 ∈ PK such that

pk1 = pk2 and qk1,i = qk2,i. If this is the case, to guarantee that equally profitable products

occupy (on average) the same position in the ranking displayed to consumers, it suffices to assume

that, when the solution of (22) is not unique, the platform chooses one of the optimal position

assignments uniformly at random. This can still be done in polynomial time, starting from any

optimal position assignment σ∗i and taking random permutations of equally profitable products.
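One possible way to compute a greedy position assignment is sketched below in Python. It is not necessarily the algorithm behind Proposition 8: it uses a Dinkelbach-style iteration for the fractional objective in (22), in which, for a trial profit level ρ, products are sorted by (pk − ρ) exp(qk,i − pk) as suggested by (23), the resulting profit becomes the new ρ, and the procedure stops when ρ no longer improves. The sketch assumes a non-decreasing search cost g; all names and numbers are illustrative, and the brute-force comparison is feasible only for small K.

```python
import math
from itertools import permutations

def revenue(p, w, slots, g):
    """Expected revenue per arrival, sum_k p_k * lambda_k, for a given slot assignment."""
    a = [wk * math.exp(-g(z)) for wk, z in zip(w, slots)]
    return sum(pk * ak for pk, ak in zip(p, a)) / (1.0 + sum(a))

def greedy_ranking(q_est, p, g, tol=1e-12):
    """Return (slots, revenue), where slots[k] is the position (1 = top) of product k."""
    K = len(p)
    w = [math.exp(qk - pk) for qk, pk in zip(q_est, p)]
    rho = 0.0
    while True:
        # For fixed rho, sorting by (p_k - rho) * w_k maximizes
        # sum_k (p_k - rho) * w_k * exp(-g(z_k)) by the rearrangement inequality.
        order = sorted(range(K), key=lambda k: (p[k] - rho) * w[k], reverse=True)
        slots = [0] * K
        for position, k in enumerate(order, start=1):
            slots[k] = position
        new_rho = revenue(p, w, slots, g)
        if new_rho <= rho + tol:      # no further improvement: stop
            return slots, new_rho
        rho = new_rho

# Small check against exhaustive enumeration.
g = lambda z: 0.5 * (z - 1)
q_est, p = [0.9, 0.4, 0.7, 0.2], [1.0, 1.5, 0.8, 2.0]
slots, best = greedy_ranking(q_est, p, g)
w = [math.exp(qk - pk) for qk, pk in zip(q_est, p)]
brute = max(revenue(p, w, z, g) for z in permutations(range(1, 5)))
print(slots, round(best, 6), round(brute, 6))
```

Each iteration costs one sort, and the trial profit ρ never decreases along the iterations, so the procedure terminates after finitely many steps on a finite set of assignments.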

5.3 Explorative Ranking Policies

The greedy ranking policy defined in the previous section gives higher visibility to the prod-

ucts that guarantee higher immediate profits, disregarding the fact that there may be unexplored

products whose profit may be higher in the long run (for instance, high quality products with low

perceived quality). In this section we present a ranking policy that, introducing some randomness,

balances between exploitation and exploration.

First, we define the probability distribution $\Pi^E_i$ as follows:

\[
\Pi^E_i = \arg\max_{\Pi \in \Delta(Z_K)} E\left[\sum_{k=1}^{K} \bigl|q_{k,i} - q_{k,i-1}\bigr|\right]. \tag{24}
\]

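The following proposition explicitly characterizes the objective function in the r.h.s. of (24), and shows that the optimal full-explore position assignment can be obtained as the solution of a suitable MLPP. First, we define the explorative profits h1(Ii), h2(Ii), . . . , hK(Ii) as follows:

\[
h_k(I_i) = \frac{1}{B_{k,i} + B_{k,0}} \times
\begin{cases}
1 - q_{k,i} & \text{for } q_{k,i} < q_k - \varepsilon, \\
q_{k,i} + (1 - 2 q_{k,i})\, \bar F_\varepsilon(q_{k,i} - q_k) & \text{for } |q_{k,i} - q_k| \le \varepsilon, \\
q_{k,i} & \text{for } q_{k,i} > q_k + \varepsilon,
\end{cases}
\tag{25}
\]

where $\bar F_\varepsilon := 1 - F_\varepsilon$.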


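Proposition 9. Let the explorative profits h1(Ii), h2(Ii), . . . , hK(Ii) be defined as in (25). Then

\[
\max_{\Pi \in \Delta(Z_K)} E\left[\sum_{k=1}^{K} \bigl|q_{k,i} - q_{k,i-1}\bigr|\right] = \max_{z \in Z_K} \sum_{k=1}^{K} h_k(I_i)\, \lambda_k(q_i, p, z).
\]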

Observe that an exact computation of the explorative profits requires the knowledge of the

intrinsic qualities. We therefore introduce an approximation that can be used by the platform

to compute these parameters and is independent of the knowledge of the qk’s. The proposed

approximation builds on the fluid formulation of the model introduced in Section 4. Suppose that

qk,0 < qk − ε. Using (13) and (14), we can see that, for 0 ≤ t ≤ τk, Lk(t) = Bk(t) and Dk(t) = 0.

This implies that, before qk(t) reaches the level qk − ε, all purchasers of product k like it. Instead,

when t ≥ τk, both Lk(t) and Dk(t) are strictly positive, and product k receives both likes and

dislikes. In other words, in the fluid approximation of the dynamics, when qk,0 < qk− ε, the arrival

of the first dislike for product k at time t is a signal that qk(t) has reached the level qk− ε.16 Based

on this intuition, we define the approximated explorative profits as follows:

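Definition 3. The approximated explorative profits h1(Ii), h2(Ii), . . . , hK(Ii) are defined as

\[
h_k(I_i) = \frac{1}{B_{k,i} + B_{k,0}} \times
\begin{cases}
1 - q_{k,i} & \text{for } L_{k,i} > 0,\ D_{k,i} = 0, \\
0 & \text{for } L_{k,i} > 0,\ D_{k,i} > 0, \\
q_{k,i} & \text{for } L_{k,i} = 0,\ D_{k,i} > 0, \\
(1 - q_{k,0})\,\eta_i + q_{k,0}\,(1 - \eta_i) & \text{for } B_{k,i} = 0,
\end{cases}
\tag{26}
\]

where $\eta_i \sim \mathrm{Ber}(1/2)$ for all i = 1, 2, . . . .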

Notice that, in the case Bk,i = 0, given that the platform is unable to distinguish whether

qk,0 < qk − ε or qk,0 > qk + ε, the proposed approximation selects one of the two options with

equal probability. Also notice that the approximation proposed in (26) assigns zero profit to the

products that have received enough observations, so that their quality estimate is ε-close to their

true quality, i.e., |qk,i − qk| ≤ ε. Since $h_k(I_i) \sim (B_{k,i} + B_{k,0})^{-1}$, unexplored products with a low

value of (Bk,i +Bk,0) receive higher explorative profits.
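The four cases of (26) translate directly into code. The Python sketch below is one possible rendering for a single product; it assumes that the like and dislike counts exclude the prior pseudo-observations, and the function and argument names are hypothetical.

```python
import random

def approx_explorative_profit(likes, dislikes, q_est, q_prior,
                              prior_weight=1.0, rng=random):
    """Approximate explorative profit of one product, following the four cases of (26)."""
    purchases = likes + dislikes
    scale = 1.0 / (purchases + prior_weight)
    if purchases == 0:
        # No reviews yet: a fair coin decides between the under- and over-estimation guess.
        eta = rng.random() < 0.5
        return scale * ((1 - q_prior) if eta else q_prior)
    if dislikes == 0:        # only likes so far: the estimate is treated as still too low
        return scale * (1 - q_est)
    if likes == 0:           # only dislikes so far: the estimate is treated as still too high
        return scale * q_est
    return 0.0               # both review types observed: estimate treated as eps-close

print(approx_explorative_profit(likes=3, dislikes=0, q_est=0.6, q_prior=0.2))
print(approx_explorative_profit(likes=3, dislikes=2, q_est=0.6, q_prior=0.2))
print(approx_explorative_profit(likes=0, dislikes=4, q_est=0.1, q_prior=0.5))
```

Products with few reviews and no mixed feedback receive the largest values, so they are the ones pushed towards the top positions by an explorative ranking.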

Notice that any fully-explorative ranking policy keeps exploring even after consumers have

learned to correctly rank products with respect to their true qualities. Therefore, it is more conve-

nient to introduce a new class of partially explorative policies, the β-explore ranking policies, which

display a position assignment drawn from $\Pi^E_i$ with probability β, and from $\Pi^G_i$ with probability 1 − β. It is worth noting that, although the β-explore ranking policy is not asymptotically optimal either, as it continues to explore and sacrifice revenues throughout the process with positive probability, a small amount of constant exploration may be beneficial in terms of discounted profits for a sufficiently patient platform, as our numerical experiments in Section 6 highlight.

16 A similar argument can be used in the overestimating case qk,0 > qk + ε, where we have Dk(t) = Bk(t) and Lk(t) = 0 for all t ≤ τk.

We want to find an explorative ranking policy that achieves asymptotic optimality. To do this

we propose a mechanism for which the probability β of displaying an explorative ranking changes

over time as a function of the learning process. We then replace β with

\[
\beta_i = \frac{1}{K} \sum_{k=1}^{K} 1\{L_{k,i} \cdot D_{k,i} = 0\},
\]

that is, the fraction of products that have not yet received both a like and a dislike, and whose estimated quality is therefore not yet deemed ε-close to the true quality. Then, the learning-dependent (LD) ranking policy displays products to consumer i according to a position assignment $\sigma_i \sim \Pi^E_i$ with probability βi, and according to $\sigma_i \sim \Pi^G_i$ with probability 1 − βi. Notice

that Proposition 6 guarantees that βi → 0 for i → ∞ almost surely, i.e., the learning-dependent

ranking policy converges to the greedy ranking policy in the long run, which allows us to state the

following corollary.

Corollary 4. The learning-dependent ranking policy is asymptotically optimal.
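The exploration probability of the learning-dependent policy is simple to compute from the review counts. The Python sketch below shows the indicator average and the resulting randomization between the explorative and the greedy ranking; the function names and the string labels are illustrative assumptions.

```python
import random

def exploration_probability(likes, dislikes):
    """beta_i: share of products that have not yet received both a like and a dislike."""
    K = len(likes)
    return sum(1 for L, D in zip(likes, dislikes) if L * D == 0) / K

def choose_ranking(likes, dislikes, rng):
    """Pick the explorative ranking with probability beta_i, the greedy one otherwise."""
    beta = exploration_probability(likes, dislikes)
    return "explore" if rng.random() < beta else "greedy"

rng = random.Random(0)
early = exploration_probability(likes=[3, 0, 5], dislikes=[1, 0, 0])
late = exploration_probability(likes=[30, 12, 51], dislikes=[9, 14, 4])
print(early, late, choose_ranking([30, 12, 51], [9, 14, 4], rng))
```

Once every product has received at least one like and one dislike, the indicator average is zero and the policy coincides with the greedy ranking.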

6 Numerical Analysis

In this section we numerically evaluate the performance of the various heuristics proposed in

the previous section. As highlighted in Section 5.3, the approximation for the explorative profits

presented in (26) is based on the simplifying assumption that the prior belief q0 is such that

|qk,0 − qk| > ε for all k. Furthermore, to focus our attention on learning aspects, we assume that

pk = 1 and Bk,0 = 1 for all k, ε = 0.05, and g(x) = γ(x − 1), where γ > 0. The choice of the

price level serves as an offset in the MNL choice probabilities and is not significant in the following

experiments; the specific choice of pk = 1, is close to what a monopolist seller would select when

serving a market where consumers know the true product qualities for a variety of quality vectors

in [0, 1]K .

The effect of the discount factor. In Figure 1 we compare the greedy ranking policy with the exploit/explore and the learning-dependent policies for different discount factors. We study the quantity $R^{\mathrm{rank}}_\infty$, which is the sample average of the platform’s discounted cumulative revenues evaluated over $N = 10^5$ independent realizations of the process under the ranking policy “rank.”


The idea behind Figure 1 is to answer the following question: given the prior estimate and the

discount rate, to what extent is it beneficial for an uninformed platform to implement explorative

strategies?

To answer this question, we fix γ = 1, n = 5, qk,0 = 0.08 for all k. Since the platform is not aware

of the intrinsic qualities of products, we randomize the values of the qk’s, i.e., in each simulated

realization of the process we assume that the qk’s are i.i.d. with qk ∼ U[0, 1] for k = 1, 2, . . . , 5.

Then, in Figure 1, we report the ratio $R^{\beta\text{-explore}}_\infty / R^{\mathrm{greedy}}_\infty$ as a function of β (continuous lines), and the ratio $R^{\mathrm{LD}}_\infty / R^{\mathrm{greedy}}_\infty$ (dotted lines). The latter ratio, of course, is constant with respect to β. The experiment is repeated for different values of the discount factor δ.

Figure 1 shows that, for sufficiently small values of δ, there exists βc > 0 such that $R^{\beta\text{-explore}}_\infty / R^{\mathrm{greedy}}_\infty \ge 1$ for β ∈ [0, βc], i.e., a moderate explorative strategy achieves higher discounted revenues than

the greedy ranking policy, despite the fact that the β-explore policy is not asymptotically optimal,

since it continues to explore and sacrifice revenue even after the learning process has converged.

This result can be explained as follows. When the platform commits to explorative policies, con-

sumers learn sooner how to rank products correctly. If the platform is patient enough (that is, if δ

is small) this phenomenon takes place before the discount factor exp(−δt) starts “killing” the impact of future revenue. As a result, there is a larger time window where the greedy strategy (which is implemented a fraction 1 − β of the time) is maximally effective, ultimately leading to better performance in terms of discounted profits. However, if β is too high, there is not enough exploitation

after consumers’ beliefs have reached the correct ranking, and the platform may not benefit from

this excessive exploration. If the platform is impatient (high values of δ), there is not enough time

for learning to be achieved, and the benefits of exploration vanish.

Figure 1 also highlights that, for this particular configuration of the parameters, and for small enough values of δ, $R^{\mathrm{LD}}_\infty > R^{\beta\text{-explore}}_\infty$ for all β ≥ 0. This can be explained by noticing that the online

tuning mechanism at the base of LD gradually reduces the amount of exploration after the learning

process gets closer to its target, thus being less wasteful and increasingly assigning more visibility

to more profitable products.

The effect of the magnitude of the search costs. We continue to assume that search costs

are given by g(x) = γ (x− 1), and study the behavior of the different ranking policies with respect

to the parameter γ, both in terms of average asymptotic discounted revenues and time-to-learn.

Notice that, since the case γ = 0 reproduces the model introduced in Section 2, where search costs are absent and the platform has no direct influence on consumers’ decisions, this analysis provides an insight on whether the platform can exploit the presence of search costs to either improve profits or speed up the learning process.

Figure 1: $R^{\beta\text{-explore}}_\infty / R^{\mathrm{greedy}}_\infty$ (continuous lines) and $R^{\mathrm{LD}}_\infty / R^{\mathrm{greedy}}_\infty$ (dotted lines) vs β, with randomized true qualities and for different values of the platform discount factor δ. Here, n = 5, qk,0 = 0.016 for all k.

Figure 2: $R^{\mathrm{greedy}}_\infty$, $R^{\mathrm{LD}}_\infty$, and $R^{0.6\text{-explore}}_\infty$ vs γ for n = 5, δ = 0.01, qk = 0.8 and qk,0 = 0.2 for 1 ≤ k ≤ 4, and q5 = 0.2 and q5,0 = 0.8.

In Figure 2 we fix n = 5, δ = 0.01, qk = 0.8 and qk,0 = 0.2 for 1 ≤ k ≤ 4, and q5 = 0.2

and q5,0 = 0.8. Then, we plot $R^{\mathrm{rank}}_\infty$ for the different policies “rank” as a function of γ. Figure 2 shows that asymptotic discounted revenues appear to be strictly decreasing in γ, irrespective of the particular ranking policy considered. This suggests that the platform cannot benefit from the presence of search costs for consumers. Moreover, our numerical experiments suggest that $R^{\mathrm{rank}}_\infty$

is in general monotone with respect to γ, irrespective of the parameters of the model and the

platform’s ranking policy.

In Figure 3, we consider the case n = 5, qk = 0.9 for all k and for an initially underestimating

prior estimate qk,0 = k/10. In particular, Figure 3 reports the average ε-time-to-learn τk for

different ranking policies for the products k = 1, 3, 5. As we can see from the rightmost plot in

Figure 3, the average ε-time-to-learn for the product with the highest quality prior belief (k = 5)

under the greedy policy is strictly decreasing in γ, i.e., the ranking policy implemented by the

platform significantly reduces the time-to-learn for the top-ranked product compared to the case

where search costs are absent.17 In particular, if prices and intrinsic qualities are uniform, the

top-displayed product under the greedy ranking policy is with high probability the product with

the highest prior quality belief. Thus, as γ increases, this product will be selected more and more

frequently with respect to its competitors, which ultimately results in a lower average ε-time-to-

learn. This is consistent with the comments made earlier about the impact of search costs on the

17The role of γ when search costs are linear in the position in the ranking has been discussed in Section 5.1.


Figure 3: Average ε-time-to-learn τk (TtL) vs γ for the ranking policies greedy, learning-dependent (LD) and 0.8-explore. Here, n = 5, qk = 0.9, and qk,0 = k/10 for all k. The plot displays the cases k = 1, 3, 5.

speed of learning for high-ranked products. The opposite effect prevails for products with a low

prior belief: these products will be displayed last under the greedy policy, and, as a result, their

time-to-learn will be disproportionately slowed down in the presence of search costs. On the other

hand, from the left panel of Figure 3, we see that experimentation will speed up the learning for

these products that would otherwise be ranked and displayed last.

A similar intuition also explains why the average ε-time-to-learn can undergo significant vari-

ations over the range of values considered for γ. Indeed, observe that, given that we only have

5 products and maximum qualities normalized to 1, moderate values of γ are around (.25, .75).

Looking at Figure 3, if one compares the ε-time-to-learn for γ = 0 (i.e., if search costs are disregarded) with the actual ε-time-to-learn at moderate search costs, the bias could be significant, i.e.,

τk when γ = 1 can be up to 10 times higher compared to the case with no search costs. As such, it

is important to incorporate search costs in the platform’s analysis, and in designing how to display

products during the learning transient. These observations become more acute in practical settings

where, usually, consumers can choose among 10-100 different alternatives.

References

Abeliuk, A., Berbeglia, G., Cebrian, M., and Van Hentenryck, P. (2016) Assortment

optimization under a multinomial logit model with position bias and social influence. 4OR 14,

57–75.

URL http://dx.doi.org/10.1007/s10288-015-0302-y.

Acemoglu, D., Dahleh, M. A., Lobel, I., and Ozdaglar, A. (2011) Bayesian learning in


social networks. Rev. Econom. Stud. 78, 1201–1236.

Acemoglu, D., Makhdoumi, A., Malekian, A., and Ozdaglar, A. (2017) Fast and slow

learning from reviews. Working Paper 24046, National Bureau of Economic Research.

URL http://www.nber.org/papers/w24046.

Acemoglu, D. and Ozdaglar, A. (2011) Opinion dynamics and learning in social networks.

Dyn. Games Appl. 1, 3–49.

URL https://doi.org/10.1007/s13235-010-0004-1.

Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. (2017) MNL-bandit: a dynamic

learning approach to assortment selection. Technical report, arXiv:1706.03880.

URL https://arxiv.org/abs/1706.03880.

Banerjee, A. V. (1992) A simple model of herd behavior. Quart. J. Econ. 797–817.

Bergemann, D. and Valimaki, J. (1996) Learning and strategic pricing. Econometrica 64,

1125–1149.

Bergemann, D. and Valimaki, J. (1997) Market diffusion with two-sided learning. Rand J.

Econom. 28, 773–795.

Besbes, O. and Scarsini, M. (2018) On information distortions in online ratings. Oper. Res.

forthcoming.

Bikhchandani, S., Hirshleifer, D., and Welch, I. (1992) A theory of fads, fashion, custom,

and cultural change as informational cascades. J. Polit. Econ. 100, 992–1026.

Bimpikis, K., Ehsani, S., and Mostagir, M. (2015) Designing dynamic contests. In Proceedings

of the Sixteenth ACM Conference on Economics and Computation, EC ’15, 281–282. ACM, New

York, NY, USA.

URL http://doi.acm.org/10.1145/2764468.2764473.

Bose, S., Orosel, G., Ottaviani, M., and Vesterlund, L. (2008) Monopoly pricing in the

binary herding model. Econom. Theory 37, 203–214.

URL https://doi.org/10.1007/s00199-007-0313-9.

Buhlmann, H. and Gisler, A. (2005) A Course in Credibility Theory and its Applications.

Springer-Verlag, Berlin.


Callander, S. and Horner, J. (2009) The wisdom of the minority. J. Econom. Theory 144,

1421–1439.e2.

URL http://dx.doi.org/10.1016/j.jet.2009.02.001.

Chandrasekhar, A. G., Larreguy, H., and Xandri, J. P. (2015) Testing models of social

learning on networks: Evidence from a lab experiment in the field. Technical report, National

Bureau of Economic Research.

Che, Y.-K. and Horner, J. (2015) Optimal design for social learning. Technical report, Cowles

Foundation Discussion Paper No. 2000.

URL https://ssrn.com/abstract=2600931.

Crapis, D., Ifrach, B., Maglaras, C., and Scarsini, M. (2017) Monopoly pricing in the

presence of social learning. Management Sci. 63, 3586–3608.

URL https://doi.org/10.1287/mnsc.2016.2526.

Craswell, N., Zoeter, O., Taylor, M., and Ramsey, B. (2008) An experimental comparison

of click position-bias models. In Proceedings of the 2008 International Conference on Web Search

and Data Mining, 87–94. ACM.

Davis, J., Gallego, G., and Topaloglu, H. (2013) Assortment planning under the multinomial

logit model with totally unimodular constraint structures. Technical report, Department of IEOR,

Columbia University.

URL http://www.columbia.edu/gmg2/logit_const.pdf.

DeGroot, M. H. (1974) Reaching a consensus. J. Amer. Statist. Assoc. 69, 118–121.

Diaconis, P. and Ylvisaker, D. (1979) Conjugate priors for exponential families. Ann. Statist.

7, 269–281.

URL http://links.jstor.org/sici?sici=0090-5364(197903)7:2<269:CPFEF>2.0.CO;2-5&

origin=MSN.

Ellison, G. and Fudenberg, D. (1993) Rules of thumb for social learning. J. Polit. Econ. 101,

612–643.

Ellison, G. and Fudenberg, D. (1995) Word-of-mouth communication and social learning.

Quart. J. Econ. 110, 93–125.

Golub, B. and Jackson, M. O. (2010) Naive learning in social networks and the wisdom of

crowds. Amer. Econ. J. Microeconomics 2, 112–149.


Golub, B. and Jackson, M. O. (2012) How homophily affects the speed of learning and best-

response dynamics. Quart. J. Econ. 127, 1287–1338.

Golub, B. and Sadler, E. D. (2017) Learning in social networks. Technical report, SSRN

eLibrary.


Grimm, V. and Mengel, F. (2012) An experiment on learning in a multiple games environment.

J. Econom. Theory 147, 2220–2259.

URL https://doi.org/10.1016/j.jet.2012.05.011.

Hardy, G. H., Littlewood, J. E., and Polya, G. (1988) Inequalities. Cambridge University

Press, Cambridge. Reprint of the 1952 edition.

Ifrach, B., Maglaras, C., Scarsini, M., and Zseleva, A. (2018) Bayesian social learning with

consumer reviews. Technical report, SSRN eLibrary.

URL http://ssrn.com/abstract=2293158.

Jackson, M. O. (2011) An overview of social networks and economic applications. In Benhabib,

J., Bisin, A., and Jackson, M. (eds.), The Handbook of Social Economics, volume 1, 511–585.

North Holland Press Amsterdam.

Kamenica, E. and Gentzkow, M. (2011) Bayesian persuasion. Amer. Econ. Rev. 101, 2590–

2615.

Kempe, D. and Mahdian, M. (2008) A cascade model for externalities in sponsored search. In

Papadimitriou, C. and Zhang, S. (eds.), Internet and Network Economics: 4th International

Workshop, WINE 2008, Shanghai, China, December 17-20, 2008. Proceedings, 585–596. Springer

Berlin Heidelberg.

URL https://doi.org/10.1007/978-3-540-92185-1_65.

Kremer, I., Mansour, Y., and Perry, M. (2014) Implementing the “wisdom of the crowd”. J.

Polit. Econ. 122, 988–1012.

Kurtz, T. G. (1977/78) Strong approximation theorems for density dependent Markov chains.

Stochastic Processes Appl. 6, 223–240.

Kushner, H. J. and Yin, G. G. (2003) Stochastic Approximation and Recursive Algorithms and

Applications. Springer-Verlag, New York, second edition.


L’Ecuyer, P., Maille, P., Stier-Moses, N. E., and Tuffin, B. (2017) Revenue-maximizing

rankings for online platforms with quality-sensitive consumers. Oper. Res. 65, 408–423.

URL http://dx.doi.org/10.1287/opre.2016.1569.

Lerman, K. and Hogg, T. (2014) Leveraging position bias to improve peer recommendation.

PloS one 9, e98914.

URL https://doi.org/10.1371/journal.pone.0098914.

Lobel, I. and Sadler, E. (2016) Preferences, homophily, and social learning. Oper. Res. 64,

564–584.

URL https://doi.org/10.1287/opre.2015.1364.

McFadden, D. (1973) Conditional logit analysis of qualitative choice behavior. In Zarembka,

P. (ed.), Frontiers in Econometrics, 105–142. Academic Press.

Molavi, P., Tahbaz-Salehi, A., and Jadbabaie, A. (2016) Foundations of non-Bayesian social

learning. Technical report, Columbia Business School Research Paper No. 15-95.


Mueller-Frank, M. and Neri, C. (2013) Social learning in networks: theory and experiments.

Technical report, SSRN eLibrary.


Papanastasiou, Y., Bimpikis, K., and Savva, N. (2017) Crowdsourcing exploration. Manage-

ment Sci. forthcoming.


Papanastasiou, Y. and Savva, N. (2017) Dynamic pricing in the presence of social learning and

strategic consumers. Management Sci. 63, 919–939.


Rusmevichientong, P., Shen, Z.-J. M., and Shmoys, D. B. (2010) Dynamic assortment opti-

mization with a multinomial logit choice model and capacity constraint. Oper. Res. 58, 1666–

1680.

URL https://doi.org/10.1287/opre.1100.0866.

Saure, D. and Zeevi, A. (2013) Optimal dynamic assortment planning with demand learning.

Manuf. Serv. Oper. Management 15, 387–404.

Shin, D. and Zeevi, A. (2017) Dynamic pricing and learning with online product reviews. Tech-

nical report, Columbia University.


Smith, L. and Sørensen, P. (2000) Pathological outcomes of observational learning. Economet-

rica 68, 371–398.

URL https://doi.org/10.1111/1468-0262.00113.

Talluri, K. and Van Ryzin, G. (2004) Revenue management under a general discrete choice

model of consumer behavior. Management Sci. 50, 15–33.

Van Hentenryck, P., Abeliuk, A., Berbeglia, F., Maldonado, F., and Berbeglia, G.

(2016) Aligning popularity and quality in online cultural markets. In Proceedings of the Tenth

International AAAI Conference on Web and Social Media, 398–407.

Welch, I. (1992) Sequential sales, learning, and cascades. J. Finance 47, 695–732.

A Appendix

Proofs of Section 3

The following lemma is instrumental for the proofs of Proposition 1, Proposition 3, and Proposition 6. We omit the proof of the lemma, which can be found on page 13 of Crapis et al. (2017).

Lemma A.1. (Lemma 2, Crapis et al. (2017)). Let $B_{k,i} := \sum_{j=1}^{i-1} \mathbf{1}\{c_j = k\}$ and suppose that there exists $C > 0$ such that $\mathbb{P}(c_i = k) \ge 2C > 0$ for all $i$. Then, for every positive $A \in \mathbb{R}$ we have $\sum_{j=1}^{\infty} (B_{k,j} + A)^{-2} < \infty$ almost surely.

Before introducing the following lemma, which is needed to prove Proposition 1, we introduce the variation operator

$$Y_{k,i} := (1 - \hat q_{k,i})\,\mathbf{1}\{\hat q_{k,i} \le q_k + \varepsilon_{k,i}\} - \hat q_{k,i}\,\mathbf{1}\{\hat q_{k,i} > q_k + \varepsilon_{k,i}\}. \tag{27}$$

Lemma A.2. Consider the learning process described in Section 2. Then, for all $i = 1, 2, \ldots$, and for all $k \in \mathcal{P}_K$, the process $\hat q_{k,i}$ obeys the stochastic recursion $\hat q_{k,i+1} - \hat q_{k,i} = \mathbf{1}\{c_i = k\}(B_{k,i+1} + B_{k,0})^{-1} Y_{k,i}$. Moreover, $\hat q_j$ satisfies all the assumptions of the Main Convergence Theorem 2.1 of Chapter 5 in Kushner and Yin (2003).

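Proof. Observe that, by the independence of the $\alpha_{k,i}$'s and the $\varepsilon_{k,j}$'s, we have $\mathbf{1}\{r_{k,i} = L\} = \mathbf{1}\{c_i = k\}\mathbf{1}\{\hat q_{k,i} \le q_k + \varepsilon_{k,i}\}$ and $\mathbf{1}\{r_{k,i} = D\} = \mathbf{1}\{c_i = k\}\mathbf{1}\{\hat q_{k,i} > q_k + \varepsilon_{k,i}\}$. Then

$$\begin{aligned}
\hat q_{k,i+1} &= \frac{L_{k,i+1} + L_{k,0}}{B_{k,i+1} + B_{k,0}} \\
&= \frac{L_{k,i} + 1 + L_{k,0}}{B_{k,i} + 1 + B_{k,0}}\,\mathbf{1}\{r_{k,i} = L\} + \frac{L_{k,i} + L_{k,0}}{B_{k,i} + 1 + B_{k,0}}\,\mathbf{1}\{r_{k,i} = D\} + \hat q_{k,i}\,\mathbf{1}\{c_i \ne k\} \\
&= \hat q_{k,i} + \left(\frac{L_{k,i} + 1 + L_{k,0}}{B_{k,i+1} + B_{k,0}} - \frac{L_{k,i} + L_{k,0}}{B_{k,i} + B_{k,0}}\right)\mathbf{1}\{r_{k,i} = L\} + \left(\frac{L_{k,i} + L_{k,0}}{B_{k,i+1} + B_{k,0}} - \frac{L_{k,i} + L_{k,0}}{B_{k,i} + B_{k,0}}\right)\mathbf{1}\{r_{k,i} = D\} \\
&= \hat q_{k,i} + (B_{k,i+1} + B_{k,0})^{-1}\left[(1 - \hat q_{k,i})\,\mathbf{1}\{r_{k,i} = L\} - \hat q_{k,i}\,\mathbf{1}\{r_{k,i} = D\}\right] \\
&= \hat q_{k,i} + \mathbf{1}\{c_i = k\}(B_{k,i+1} + B_{k,0})^{-1} Y_{k,i},
\end{aligned} \tag{28}$$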

which proves that qj satisfies the recursion relation mentioned in the statement of the lemma. In

the remainder of the proof we show that the process qj satisfies all the assumptions of the Main

Convergence Theorem 2.1 of Chapter 5 in Kushner and Yin (2003).

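Assumption (A.2.1). It is straightforward to find that, for all $i, k$, $|Y_{k,i}| = (1 - \hat q_{k,i})\,\mathbf{1}\{\hat q_{k,i} \le q_k + \varepsilon_{k,i}\} + \hat q_{k,i}\,\mathbf{1}\{\hat q_{k,i} > q_k + \varepsilon_{k,i}\}$, which implies, for all $i, k$,

$$|Y_{k,i}|^2 = (1 - \hat q_{k,i})^2\,\mathbf{1}\{\hat q_{k,i} \le q_k + \varepsilon_{k,i}\} + \hat q_{k,i}^2\,\mathbf{1}\{\hat q_{k,i} > q_k + \varepsilon_{k,i}\} \le 1.$$

Hence, $\sup_i \mathbb{E}|Y_{k,i}|^2 \le 1 < \infty$ for all $k \in \mathcal{P}_K$, which proves that Assumption (A.2.1) is satisfied.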

Assumption (A.2.2). We have

$$\mathbb{E}[Y_{k,i} \mid Y_{k,0}, \ldots, Y_{k,i-1}] = (1 - \hat q_{k,i})\,\mathbb{P}(\hat q_{k,i} \le q_k + \varepsilon_{k,i}) - \hat q_{k,i}\,\mathbb{P}(\hat q_{k,i} > q_k + \varepsilon_{k,i}) = F_\varepsilon(\hat q_{k,i} - q_k) - \hat q_{k,i}.$$

Hence, for $k \in \mathcal{P}_K$, we can define the drift functions $g_k : [0, 1] \to [0, 1]$ such that

$$g_k(\hat q_i) := \mathbb{E}[Y_{k,i} \mid Y_{k,0}, \ldots, Y_{k,i-1}] = F_\varepsilon(\hat q_{k,i} - q_k) - \hat q_{k,i}. \tag{29}$$

Given that $F_\varepsilon$ is measurable and continuous, so is $g_k$, which establishes that Assumption (A.2.2) and Assumption (A.2.3) are verified. Moreover, using (19), we conclude that there exist $K$ strictly positive real numbers $\eta_1, \eta_2, \ldots, \eta_K$ such that $\mathbb{P}(c_i = k) \ge 2\eta_k > 0$ for all $i, k$. Hence, we can apply Lemma A.1 and establish that $\sum_{i=1}^{\infty} (B_{k,i} + B_{k,0})^{-2} < \infty$ for all $k$ almost surely. As $\sum_{i=1}^{\infty} \mathbf{1}\{c_i = k\}^2 (B_{k,i} + B_{k,0})^{-2} \le \sum_{i=1}^{\infty} (B_{k,i} + B_{k,0})^{-2}$, this proves that Assumption (A.2.4) is satisfied. Finally, Assumption (A.2.5) holds as the finite bias terms $\beta_{k,i}$ in (29) are null for all $i, k$.
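The recursion (28) can be checked numerically. The sketch below is only illustrative: it tracks a single product, assumes uniform review noise on [-eps, eps], and replaces the MNL purchase probability with a constant `p_buy`. The function name and all parameter values are hypothetical. It updates the estimate both as the ratio of likes to reviews and through (28), and records the largest gap between the two.

```python
import random


def simulate_estimate(q_true, prior_likes, prior_reviews, n_consumers,
                      p_buy=0.5, eps=0.2, seed=0):
    """Track one product's estimate via the like ratio and via recursion (28)."""
    rng = random.Random(seed)
    likes, reviews = 0, 0
    q_hat = prior_likes / prior_reviews
    max_gap = 0.0
    for _ in range(n_consumers):
        q_rec = q_hat  # no purchase: the estimate does not move
        if rng.random() < p_buy:  # assumption: constant purchase probability
            noise = rng.uniform(-eps, eps)  # assumption: uniform review noise
            liked = q_hat <= q_true + noise
            y = (1.0 - q_hat) if liked else -q_hat  # variation operator (27)
            reviews += 1
            likes += int(liked)
            q_rec = q_hat + y / (reviews + prior_reviews)  # recursion (28)
        q_ratio = (likes + prior_likes) / (reviews + prior_reviews)
        max_gap = max(max_gap, abs(q_rec - q_ratio))
        q_hat = q_ratio
    return q_hat, max_gap
```

Under these assumptions the two updates agree up to floating-point error, and the estimate settles near the solution of x = F_ε(x − q_k) rather than at q_k itself.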

Proof of Proposition 1. Observe that Lemma A.2 allows us to apply the Main Convergence Theorem 2.1 in Kushner and Yin (2003) to the process $\hat q_j$. In order to apply the theorem, we rewrite (28) in a more convenient way. We define the martingale noise differences as $H_{k,i} := Y_{k,i} - g_k(\hat q_i)$ for $k \in \mathcal{P}_K$. Then we have $\hat q_{k,i} = \hat q_{k,i-1} + (B_{k,i-1})^{-1} g_k(\hat q_i) + (B_{k,i-1})^{-1} H_{k,i-1}$.18 By invoking Theorem 2.1 in Kushner and Yin (2003), we conclude that $\hat q_i$ converges almost surely to the set of asymptotically stable points of the $K$-dimensional ODE

$$\dot{\hat q} = g(\hat q), \tag{30}$$

where $g : [0, 1]^K \to [0, 1]^K$, $g(\hat q) := (g_1(\hat q), g_2(\hat q), \ldots, g_K(\hat q))$, and the functions $g_k(\hat q)$ were defined in (29).

18The projection terms $Z_{k,i}$ have been omitted given that, as $0 \le \hat q_{k,i} \le 1$, then $Z_{k,i} = 0$ for all $i, k$.

We will now show that the ODE (30) has a unique globally asymptotically stable fixed point $\hat q_\infty$. It is immediate to see that, for all $k \in \mathcal{P}_K$, $g_k(\hat q_i) = 0$ if and only if $\hat q_i$ satisfies (4). To see that (4) has a unique solution, consider the function $g_k(x) = F_\varepsilon(x - q_k) - x$. Since $F_\varepsilon(0) = \tfrac{1}{2}$ and since $F_\varepsilon(-q_k)$ is a non-decreasing function of $q_k$, we have $g_k(0) = F_\varepsilon(-q_k) \ge \tfrac{1}{2} > 0$. Moreover, $g_k(1) = F_\varepsilon(1 - q_k) - 1 \le 0$. As $g_k(x)$ is strictly decreasing in $x$, we conclude that, for all $k \in \mathcal{P}_K$, there exists a unique point $\hat q_{k,\infty}$ such that

$$g_k(x) \ge (<)\ 0 \quad \text{for } x \le (>)\ \hat q_{k,\infty}, \tag{31}$$

which implies that (4) has a unique solution $\hat q_{k,\infty} \in [0, 1]$.

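In order to prove that $\hat q_\infty$ is globally asymptotically stable, let our candidate Lyapunov function be $V(\hat q) := \tfrac{1}{2} \sum_{k=1}^{K} (\hat q_k - \hat q_{k,\infty})^2$. It is immediate to see that $V(\hat q) \ge 0$ for all $\hat q \in [0, 1]^K$ and that $V(\hat q) = 0$ if and only if $\hat q = \hat q_\infty$. Moreover,

$$\dot V(\hat q) = \sum_{k=1}^{K} (\hat q_k - \hat q_{k,\infty})\,\dot{\hat q}_k = \sum_{k=1}^{K} (\hat q_k - \hat q_{k,\infty})\,g_k(\hat q) = \sum_{k=1}^{K} (\hat q_k - \hat q_{k,\infty})\left[F_\varepsilon(\hat q_k - q_k) - \hat q_k\right] \le 0, \tag{32}$$

where the last inequality follows from (31). This shows that $\dot V(\hat q) < 0$ for all $\hat q \in [0, 1]^K \setminus \{\hat q_\infty\}$ and that $\dot V(\hat q) = 0$ if and only if $\hat q = \hat q_\infty$, which proves that $\hat q_\infty$ is globally asymptotically stable for (30). Invoking the Main Convergence Theorem, we finally establish that $\hat q_j \to \hat q_\infty$ almost surely.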

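In the remainder of the proof we show (5a), (5b), and (5c). The first part of (5a) is immediate as $\hat q_{k,\infty} = q_k$ implies $\hat q_{k,\infty} = F_\varepsilon(\hat q_{k,\infty} - q_k) = F_\varepsilon(0) = \tfrac{1}{2}$ and as $F_\varepsilon(\hat q_{k,\infty} - q_k) = \tfrac{1}{2}$ implies $\hat q_{k,\infty} = q_k$. Using the latter result and the fact that $F_\varepsilon$ is non-increasing, we conclude that if $q_k < \hat q_{k,\infty}$ then $q_k < \hat q_{k,\infty} = F_\varepsilon(\hat q_{k,\infty} - q_k) < \tfrac{1}{2}$, which proves that $\hat q_{k,\infty} > q_k$ implies $q_k < \tfrac{1}{2}$. To prove that $q_k < \tfrac{1}{2}$ implies $\hat q_{k,\infty} > q_k$, assume by contradiction $q_k < \tfrac{1}{2}$ and $\hat q_{k,\infty} < q_k$. Then the result follows as we find the contradiction $\tfrac{1}{2} > q_k > \hat q_{k,\infty} = F_\varepsilon(\hat q_{k,\infty} - q_k) > \tfrac{1}{2}$. Since the proof of $\hat q_{k,\infty} < q_k \iff q_k > \tfrac{1}{2}$ follows identical logical steps, this establishes (5a).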

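To prove (5b), first assume $\hat q_{k_2,\infty} \ge \hat q_{k_1,\infty}$. Then, $\hat q_{k_2,\infty} = F_\varepsilon(\hat q_{k_2,\infty} - q_{k_2}) \ge F_\varepsilon(\hat q_{k_1,\infty} - q_{k_1}) = \hat q_{k_1,\infty}$, which, as $F_\varepsilon$ is non-increasing, implies $\hat q_{k_2,\infty} - q_{k_2} \le \hat q_{k_1,\infty} - q_{k_1}$. Rearranging the last inequality we obtain $q_{k_2} - q_{k_1} \ge \hat q_{k_2,\infty} - \hat q_{k_1,\infty} \ge 0$, proving that $\hat q_{k_2,\infty} \ge \hat q_{k_1,\infty}$ implies $q_{k_2} \ge q_{k_1}$. To prove the opposite direction, assume by contradiction that $q_{k_1} \le q_{k_2}$ and $\hat q_{k_1,\infty} > \hat q_{k_2,\infty}$. Then, $\hat q_{k_1,\infty} - q_{k_1} > \hat q_{k_2,\infty} - q_{k_2}$, which implies the contradiction $\hat q_{k_1,\infty} = F_\varepsilon(\hat q_{k_1,\infty} - q_{k_1}) \le F_\varepsilon(\hat q_{k_2,\infty} - q_{k_2}) = \hat q_{k_2,\infty}$. Therefore, $q_{k_1} \le q_{k_2}$ implies $\hat q_{k_1,\infty} \le \hat q_{k_2,\infty}$, concluding the proof of (5b).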

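In order to prove (5c), we first show that $|\hat q_{k,\infty} - q_k| \le \hat q_{k,\infty}$. This is trivial if $\hat q_{k,\infty} \ge q_k$. Moreover, notice that (5a) and (5b) allow us to conclude that $\hat q_{k,\infty} < q_k$ implies $\hat q_{k,\infty} > \tfrac{1}{2}$. Thus in the case $\hat q_{k,\infty} < q_k$ we have $2\hat q_{k,\infty} > 1 \ge q_k$ and $|\hat q_{k,\infty} - q_k| = q_k - \hat q_{k,\infty} < \hat q_{k,\infty}$. Moreover, observe that for all $a \in \mathbb{R}$ and for any random variable $X$ we can write $\mathbb{P}(X \ge a) \le \mathbb{P}(|X| \ge |a|)$. Combining this observation with (4), and applying Chebyshev's inequality, we can write

$$|\hat q_{k,\infty} - q_k| \le \hat q_{k,\infty} = F_\varepsilon(\hat q_{k,\infty} - q_k) = \mathbb{P}(\varepsilon_{k,i} \ge \hat q_{k,\infty} - q_k) \le \mathbb{P}(|\varepsilon_{k,i}| \ge |\hat q_{k,\infty} - q_k|) \le \frac{s_\varepsilon^2}{|\hat q_{k,\infty} - q_k|^2},$$

which proves $|\hat q_{k,\infty} - q_k|^3 \le s_\varepsilon^2$ and concludes the proof.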

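Proof of Proposition 2. First, we establish that $\hat q_{k,\infty}$ lies in the interior of the interval $[q_k - \varepsilon, q_k + \varepsilon]$. In fact, assume by contradiction that $\hat q_{k,\infty} \le q_k - \varepsilon$. Then, as $q_k \in [0, 1]$ and $\varepsilon > 0$, we have $\hat q_{k,\infty} = F_\varepsilon(\hat q_{k,\infty} - q_k) = 1 > q_k - \varepsilon$. Similarly, assuming $\hat q_{k,\infty} \ge q_k + \varepsilon$, we find $\hat q_{k,\infty} = F_\varepsilon(\hat q_{k,\infty} - q_k) = 0 < q_k + \varepsilon$. Having established $|\hat q_{k,\infty} - q_k| \le \varepsilon$, (7) follows from solving $\hat q_{k,\infty} = [-(\hat q_{k,\infty} - q_k) + \varepsilon]/(2\varepsilon)$. Finally, using (7), we find the following two inequalities

$$|\hat q_{k,\infty} - q_k| = \frac{\varepsilon\,|{-2 q_k} + 1|}{1 + 2\varepsilon} < \frac{\varepsilon}{1 + 2\varepsilon}, \qquad |\hat q_{k_1,\infty} - \hat q_{k_2,\infty}| = \frac{|q_{k_1} - q_{k_2}|}{1 + 2\varepsilon} < |q_{k_1} - q_{k_2}|,$$

for all $k, k_1, k_2 \in \mathcal{P}_K$, which prove both (8a) and (8b), and conclude the proof.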
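The fixed point used in this proof can also be located numerically. The sketch below is a minimal illustration, assuming uniform noise on [-eps, eps], with hypothetical helper names. It finds the root of g_k(x) = F_ε(x − q_k) − x by bisection, which is justified by the sign change and strict monotonicity established in the proof of Proposition 1. It then compares the root with (q_k + ε)/(1 + 2ε), the solution of the linear equation solved in the proof above.

```python
def survival_uniform(x, eps):
    """P(noise >= x) for noise uniform on [-eps, eps] (assumed noise law)."""
    if x <= -eps:
        return 1.0
    if x >= eps:
        return 0.0
    return (eps - x) / (2.0 * eps)


def limit_estimate(q_true, eps, tol=1e-12):
    """Bisection for the unique root of g(x) = survival(x - q_true) - x on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if survival_uniform(mid - q_true, eps) - mid > 0.0:
            lo = mid  # g is strictly decreasing, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


def closed_form_limit(q_true, eps):
    """Solution of x = (eps - (x - q_true)) / (2 * eps)."""
    return (q_true + eps) / (1.0 + 2.0 * eps)
```

For instance, `limit_estimate(0.9, 0.3)` and `closed_form_limit(0.9, 0.3)` should agree to within the bisection tolerance, and both lie strictly below the true quality 0.9, in line with (5a).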

We now present a lemma that is needed in the proof of Proposition 3.

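Lemma A.3. Let $k \in \mathcal{P}^0_K$, and let the functions $\varphi_k : [0, 1]^K \to [0, 1]$ be defined as $\varphi_k(\hat q_i) := \mathbb{P}(c^{CI}_i = k \mid c_i = k)$. Then $\varphi_k(\hat q_\infty) = m_{k,\infty}$, defined as in (11).

Proof. For notational convenience, in the proof we omit the consumer index $i$. We have

$$\begin{aligned}
\mathbb{P}(c^{CI} = k, c = k) &= \mathbb{P}\Big\{\alpha_k + q_k - p_k \ge \max_{j \in \mathcal{P}^0_K} (\alpha_j + q_j - p_j),\ \alpha_k + \hat q_k - p_k \ge \max_{j \in \mathcal{P}^0_K} (\alpha_j + \hat q_j - p_j)\Big\} \\
&= \mathbb{P}\Big\{\alpha_k - p_k \ge \max\Big[\max_{j \in \mathcal{P}^0_K} (\alpha_j + q_j - q_k - p_j),\ \max_{j \in \mathcal{P}^0_K} (\alpha_j + \hat q_j - \hat q_k - p_j)\Big]\Big\} \\
&= \mathbb{P}\Big\{\alpha_k - p_k \ge \max_{j \in \mathcal{P}^0_K} \big[\alpha_j + \max(q_j - q_k, \hat q_j - \hat q_k) - p_j\big]\Big\} \\
&= \frac{\exp(-p_k)}{\exp[-\min(q_k, \hat q_k)] + \sum_{j=1}^{K} \exp[\max(q_j - q_k, \hat q_j - \hat q_k) - p_j]},
\end{aligned}$$

where the last equality stems from the properties of the MNL model. Then, applying Bayes' theorem and using (1), we obtain

$$\mathbb{P}(c^{CI} = k \mid c = k) = \frac{\exp(-\hat q_k) + \sum_{j=1}^{K} \exp[\hat q_j - \hat q_k - p_j]}{\exp[-\min(q_k, \hat q_k)] + \sum_{j=1}^{K} \exp[\max(q_j - q_k, \hat q_j - \hat q_k) - p_j]}. \tag{33}$$

Using the definition of $\hat q_{k,\infty}$ in (5a), we obtain:

$$\min(q_k, \hat q_{k,\infty}) = q_k - \frac{2\varepsilon}{1 + 2\varepsilon}\left[q_k - \frac{1}{2}\right]^+, \tag{34}$$

$$\max(q_j - q_k, \hat q_{j,\infty} - \hat q_{k,\infty}) = \max\left(q_j - q_k, \frac{q_j - q_k}{1 + 2\varepsilon}\right) = \frac{q_j - q_k}{1 + 2\varepsilon\,\mathbf{1}\{q_j \le q_k\}}. \tag{35}$$

Then, (11) follows by plugging (34) and (35) into (33).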
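The identities (34) and (35) are easy to sanity-check numerically. The sketch below assumes uniform noise, so that the limit estimate is (q + ε)/(1 + 2ε), the solution of the linear equation used in the proof of Proposition 2. The function names are hypothetical, and the code only compares both sides of each identity.

```python
def q_inf(q_true, eps):
    """Limit estimate under uniform noise on [-eps, eps] (assumed closed form)."""
    return (q_true + eps) / (1.0 + 2.0 * eps)


def identity_gaps(q_j, q_k, eps):
    """Absolute gaps between the two sides of (34) and of (35)."""
    lhs_34 = min(q_k, q_inf(q_k, eps))
    rhs_34 = q_k - 2.0 * eps / (1.0 + 2.0 * eps) * max(q_k - 0.5, 0.0)
    lhs_35 = max(q_j - q_k, q_inf(q_j, eps) - q_inf(q_k, eps))
    indicator = 1.0 if q_j <= q_k else 0.0
    rhs_35 = (q_j - q_k) / (1.0 + 2.0 * eps * indicator)
    return abs(lhs_34 - rhs_34), abs(lhs_35 - rhs_35)
```

Evaluating `identity_gaps` on a grid of qualities in [0, 1] should return gaps at the level of floating-point error.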

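Proof of Proposition 3. First, notice that, for all $k \in \mathcal{P}_K$ and for all $i \ge 1$, the process $m_{k,i}$ obeys the stochastic recursion

$$\begin{aligned}
m_{k,i} &= \frac{M_{k,i-1} + 1}{B_{k,i-1} + 1}\,\mathbf{1}\{c^{CI}_{i-1} = k, c_{i-1} = k\} + \frac{M_{k,i-1}}{B_{k,i-1} + 1}\,\mathbf{1}\{c^{CI}_{i-1} \ne k, c_{i-1} = k\} + m_{k,i-1}\,\mathbf{1}\{c_{i-1} \ne k\} \\
&= m_{k,i-1} + \frac{1}{B_{k,i-1} + 1}\left[(1 - m_{k,i-1})\,\mathbf{1}\{c^{CI}_{i-1} = k, c_{i-1} = k\} - m_{k,i-1}\,\mathbf{1}\{c^{CI}_{i-1} \ne k, c_{i-1} = k\}\right].
\end{aligned}$$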

Moreover, Lemma A.2 establishes that $\hat q_{k,i} - \hat q_{k,i-1} = \mathbf{1}\{c_{i-1} = k\}(B_{k,i} + B_{k,0})^{-1} Y_{k,i-1}$ for all $k \in \mathcal{P}_K$ and $i \ge 1$. For all $i \ge 1$ and $k = K + 1, \ldots, 2K$, we define $m_{k,i} := m_{k-K,i}$ and

$$\eta_{k,i} := \begin{cases} \mathbf{1}\{c_i = k\}(B_{k,i} + B_{k,0})^{-1} & \text{for } k = 1, \ldots, K, \\ (B_{k-K,i} + 1)^{-1} & \text{for } k = K + 1, \ldots, 2K. \end{cases}$$

If we define

$$Y_{k,i} := \begin{cases} (1 - \hat q_{k,i})\,\mathbf{1}\{\hat q_{k,i} \le q_k + \varepsilon_{k,i}\} - \hat q_{k,i}\,\mathbf{1}\{\hat q_{k,i} > q_k + \varepsilon_{k,i}\} & \text{for } k = 1, \ldots, K, \\ (1 - m_{k,i})\,\mathbf{1}\{c^{CI}_i = k - K, c_i = k - K\} - m_{k,i}\,\mathbf{1}\{c^{CI}_i \ne k - K, c_i = k - K\} & \text{for } k = K + 1, \ldots, 2K, \end{cases}$$

then the $2K$-dimensional process $x_i := (\hat q_{1,i}, \hat q_{2,i}, \ldots, \hat q_{K,i}, m_{K+1,i}, m_{K+2,i}, \ldots, m_{2K,i})$ satisfies the recursion relation $x_{k,i+1} = x_{k,i} + \eta_{k,i} Y_{k,i}$. We will now show that $x_i$ satisfies all the assumptions of the Main Convergence Theorem in Kushner and Yin (2003). We will only prove the cases $k = K + 1, \ldots, 2K$, as the proof for the remaining cases is provided in Lemma A.2.

Assumption (A.2.1). We have $|Y_{k,i}| \le (1 - m_{k,i})\,\mathbf{1}\{c^{CI}_i = k - K, c_i = k - K\} + m_{k,i}\,\mathbf{1}\{c^{CI}_i \ne k - K, c_i = k - K\}$ for all $i \ge 1$ and $k = K + 1, \ldots, 2K$. Then,

$$|Y_{k,i}|^2 = (1 - m_{k,i})^2\,\mathbf{1}\{c^{CI}_i = k - K, c_i = k - K\} + m_{k,i}^2\,\mathbf{1}\{c^{CI}_i \ne k - K, c_i = k - K\} \le 1.$$

This proves that $\sup_i \mathbb{E}|Y_{k,i}|^2 < \infty$ for all $k = K + 1, \ldots, 2K$, as required by Assumption (A.2.1).

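Assumption (A.2.2). We define $Y_i := (Y_{1,i}, \ldots, Y_{2K,i})$ for all $i \ge 1$. We can write

$$\begin{aligned}
\mathbb{E}[Y_{k,i} \mid Y_1, \ldots, Y_{i-1}] &= (1 - m_{k,i})\,\mathbb{P}(c^{CI}_i = k - K, c_i = k - K) - m_{k,i}\,\mathbb{P}(c^{CI}_i \ne k - K, c_i = k - K) \\
&= \mathbb{P}(c_i = k - K)\left[\mathbb{P}(c^{CI}_i = k - K \mid c_i = k - K) - m_{k,i}\right] = \lambda_k(\hat q_i, \mathbf{p})\left[\varphi_k(\hat q_i) - m_{k,i}\right],
\end{aligned}$$

where the last equality follows from Lemma A.3. Hence, we can define the following drift functions:

$$\phi_k(x_i) := \mathbb{E}[Y_{k,i} \mid Y_1, \ldots, Y_{i-1}] = \begin{cases} F_\varepsilon(\hat q_{k,i} - q_k) - \hat q_{k,i} & \text{for } k = 1, \ldots, K, \\ \lambda_k(\hat q_i, \mathbf{p})\left[\varphi_k(\hat q_i) - m_{k,i}\right] & \text{for } k = K + 1, \ldots, 2K, \end{cases} \tag{36}$$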

which are measurable and continuous, as $F_\varepsilon$ and $\lambda_k$ are measurable and continuous, which shows that both Assumption (A.2.2) and Assumption (A.2.3) are satisfied. Moreover, observe that the structure of the MNL choice probabilities guarantees that we can always find $2K$ positive real numbers $\eta_1, \eta_2, \ldots, \eta_{2K}$ such that $\mathbb{P}(c_i = k) \ge 2\eta_k > 0$ for all $i, k$. Hence, we can apply Lemma A.1 to conclude that $\sum_{i=1}^{\infty} \eta_{k,i}^2 < \infty$, which validates Assumption (A.2.4). Finally, Assumption (A.2.5) is immediately verified, given that $\beta_{k,i} = 0$ for all $i, k$, as shown in (36).

We define $H_{k,i} := Y_{k,i} - \phi_k(x_i)$ for $k = 1, 2, \ldots, 2K$ so that $x_{k,i} = x_{k,i-1} + (B_{k,i-1})^{-1} \phi_k(x_i) + (B_{k,i-1})^{-1} H_{k,i-1}$.19 We can now invoke Theorem 2.1 in Kushner and Yin (2003) and conclude that the process $x_i$ converges to the set of asymptotically stable points of the $2K$-dimensional ODE

$$\dot x = \phi(x), \tag{37}$$

where the vector function $\phi : [0, 1]^{2K} \to [0, 1]^{2K}$ is such that $\phi(x) := (\phi_1(x), \phi_2(x), \ldots, \phi_{2K}(x))$, and where $\phi_k$ was defined in (36).

In the remainder we prove that (37) has a unique globally asymptotically stable fixed point $x_\infty := (\hat q_{1,\infty}, \hat q_{2,\infty}, \ldots, \hat q_{K,\infty}, m_{K+1,\infty}, m_{K+2,\infty}, \ldots, m_{2K,\infty})$. Recall that, for $k = K + 1, \ldots, 2K$, $m_{k,\infty} = m_{k-K,\infty}$, where $m_{k,\infty} = \varphi_k(\hat q_\infty)$ was evaluated in (11) of Lemma A.3. Moreover, notice that the first $K$ components of (37) correspond to the ODE (30) studied in Proposition 1. Hence, by applying a Lyapunov argument as in the proof of Proposition 1 (see (32)), we can show that $x_k(t) \to \hat q_{k,\infty}$ for $t \to \infty$ for $k = 1, 2, \ldots, K$. In particular, this implies that $\varphi_k(\hat q(t)) \to m_{k,\infty}$ for $t \to \infty$ for $k = K + 1, \ldots, 2K$.

19The projection terms $Z_{k,i}$ have been omitted because they are null for all $i, k$.

In the remainder of the proof we assume $k = K + 1, \ldots, 2K$, and we show that $x_k(t) \to m_{k,\infty}$ for $t \to \infty$. Let $\delta > 0$ and let $0 < \delta^* < \delta$. Since $\varphi_k(\hat q(t)) \to m_{k,\infty}$ for $t \to \infty$, we can always find $T > 0$ such that $|\varphi_k(\hat q(t)) - m_{k,\infty}| < \delta^*$ for all $t > T$ and for all $k = K + 1, \ldots, 2K$. Moreover, for all $t > T$ and for all $k$, we have

$$\lambda_k(\hat q(t), \mathbf{p})\,(m_{k,\infty} - \delta^* - m_k(t)) \le \dot m_k(t) \le \lambda_k(\hat q(t), \mathbf{p})\,(m_{k,\infty} + \delta^* - m_k(t)).$$

Thus, for all $k$, one can find two functions $m^-_k(t)$ and $m^+_k(t)$ such that $m^-_k(T) = m^+_k(T) = m_k(T)$,

$$\dot m^-_k(t) = \lambda_k(\hat q(t), \mathbf{p})\,(m_{k,\infty} - \delta^* - m^-_k(t)), \qquad \dot m^+_k(t) = \lambda_k(\hat q(t), \mathbf{p})\,(m_{k,\infty} + \delta^* - m^+_k(t)),$$

and that therefore satisfy $m^-_k(t) \le m_k(t) \le m^+_k(t)$, for all $t \ge T$. Moreover, since $\dot m^-_k(t) < 0$ ($> 0$) if and only if $m^-_k(t) > m_{k,\infty} - \delta^*$ ($< m_{k,\infty} - \delta^*$) and $\dot m^-_k(t) = 0$ if and only if $m^-_k(t) = m_{k,\infty} - \delta^*$, we have that $m^-_k(t) \to m_{k,\infty} - \delta^*$ for $t \to \infty$. Similarly, we can show that $m^+_k(t) \to m_{k,\infty} + \delta^*$ for $t \to \infty$. In particular, this implies that there exists $T_{\delta^*} > 0$ such that $m_{k,\infty} - \delta \le m_k(t) \le m_{k,\infty} + \delta$ for all $t \ge T_{\delta^*}$. Since $\delta$ is arbitrary, this proves that $m_k(t) \to m_{k,\infty}$ for $k = K + 1, K + 2, \ldots, 2K$ as $t \to \infty$, and hence that $m_k(t) \to m_{k,\infty}$ for all $k \in \mathcal{P}_K$ as $t \to \infty$. Since (12) follows from a direct check in (11), this concludes the proof.

Proofs of Section 4

Proof of Lemma 1. The proof consists in verifying the conditions of Theorem 2.2 of Kurtz (1977/78). First, observe that $I^n(t) \in \{z/n \mid z \in \mathbb{Z}^{2K}_+\}$, where by $\mathbb{Z}^d_+$ we denote the $d$-dimensional integer lattice. To validate the remaining hypotheses of the theorem, we first need to show that the scaled numbers of likes and dislikes $L^n_k(t)$ and $D^n_k(t)$ can be expressed as suitable Poisson processes with time-dependent rate, and then we must prove that the following inequalities hold

$$\gamma^L_k(x) \le \Gamma^L_1 (1 + |x|), \qquad \gamma^D_k(x) \le \Gamma^D_1 (1 + |x|), \tag{38}$$

$$|\gamma^L_k(x) - \gamma^L_k(y)| \le \Gamma^L_2 |x - y|, \qquad |\gamma^D_k(x) - \gamma^D_k(y)| \le \Gamma^D_2 |x - y|, \tag{39}$$

for some positive constants $\Gamma^L_1$, $\Gamma^L_2$, $\Gamma^D_1$, and $\Gamma^D_2$, and for all $k \in \mathcal{P}_K$ and all $x, y \in \mathbb{R}^{2K}$.

We define, for $k \in \mathcal{P}_K$, the following functions:

$$\gamma^L_k(I(t)) := \mathbb{P}(r_k(t) = L \mid I(t)), \qquad \gamma^D_k(I(t)) := \mathbb{P}(r_k(t) = D \mid I(t)).$$

Furthermore, let $A^n$ be a Poisson process with parameter $\Lambda n$ and let $N^L_k(a)$, $N^D_k(a)$ be independent Poisson processes with arrival rate $a$. Then we can write:

$$L^n_k(t) = \frac{1}{n} \int_0^t \mathbf{1}\{r_k(s) = L \mid I^n(s)\}\,dA^n(s) = \frac{1}{n} N^L_k\left(\Lambda n \int_0^t \mathbb{P}(r_k(s) = L \mid I^n(s))\,ds\right), \tag{40}$$

where in the last equality we used a Poisson thinning argument to replace the counting process of consumers who liked product $k$ with a Poisson process whose arrival rate is proportional to the probability of observing a like for product $k$. Similarly, one can show that $D^n_k(t) = \frac{1}{n} N^D_k\left(\Lambda n \int_0^t \mathbb{P}(r_k(s) = D \mid I^n(s))\,ds\right)$.

It remains to prove the inequalities (38) and (39). Since $\gamma^L_k$ and $\gamma^D_k$ are probabilities, the inequalities in (38) hold with $\Gamma^L_1 = \Gamma^D_1 = 1$ for all $k \in \mathcal{P}_K$. Moreover, from (13), we observe that $\gamma^L_k$ depends on $I(t)$ only through the quality estimates $\hat q_1(t), \hat q_2(t), \ldots, \hat q_K(t)$. We now show that the quality estimate $\hat q_k(t) = (L_k(t) + L_{k,0})/(B_k(t) + B_{k,0})$ is Lipschitz continuous in $I(t)$. In fact, since $\hat q_k(t)$ does not depend on $L_j(t)$ and $D_j(t)$ if $j \ne k$, it is trivially Lipschitz continuous in $I_j(t) = (L_j(t), D_j(t))$ if $j \ne k$. Moreover, for all $I_k(t) = (L_k(t), D_k(t))$ and $I'_k(t) = (L'_k(t), D'_k(t))$, we have

$$\begin{aligned}
\left|\frac{L_k(t) + L_{k,0}}{B_k(t) + B_{k,0}} - \frac{L'_k(t) + L_{k,0}}{B'_k(t) + B_{k,0}}\right| &= \left|\frac{(L_k(t) + L_{k,0})(D'_k(t) + D_{k,0}) - (L'_k(t) + L_{k,0})(D_k(t) + D_{k,0})}{(B_k(t) + B_{k,0})(B'_k(t) + B_{k,0})}\right| \\
&= \left|\frac{(L_k(t) + L_{k,0})(D'_k(t) + D_{k,0}) - (L'_k(t) + L_{k,0})(D'_k(t) + D_{k,0})}{(B_k(t) + B_{k,0})(B'_k(t) + B_{k,0})} + \frac{(L'_k(t) + L_{k,0})(D'_k(t) + D_{k,0}) - (L'_k(t) + L_{k,0})(D_k(t) + D_{k,0})}{(B_k(t) + B_{k,0})(B'_k(t) + B_{k,0})}\right| \\
&\le \frac{d'_k(t)}{B_k(t) + B_{k,0}}\,|L_k(t) - L'_k(t)| + \frac{l'_k(t)}{B_k(t) + B_{k,0}}\,|D_k(t) - D'_k(t)| \\
&\le \frac{2}{B_{k,0}}\,|I_k(t) - I'_k(t)|,
\end{aligned}$$

for all $k \in \mathcal{P}_K$. Finally, noticing that $\lambda_k(\cdot, \mathbf{p}) \in C^\infty([0, 1]^K)$, that $[0, 1]^K$ is trivially a compact convex set, and that $F_\varepsilon(x)$ has a bounded first derivative, we conclude that $\gamma^L_k$ is Lipschitz continuous for all $k \in \mathcal{P}_K$. Since an analogous proof can be provided for $\gamma^D_k$ for all $k \in \mathcal{P}_K$, this proves the inequalities in (39), and concludes the proof.


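Proof of Equation (15). We have

$$\dot{\hat q}_k(t) = \frac{d}{dt}\,\frac{L_k(t) + L_{k,0}}{B_k(t) + B_{k,0}} = \frac{\dot L_k(t) - l_k(t)\,\dot B_k(t)}{B_k(t) + B_{k,0}} = \Lambda\,\frac{\dot B_k(t)}{B_k(t) + B_{k,0}}\left[F_\varepsilon(\hat q_k(t) - q_k) - \hat q_k(t)\right].$$

Observe that if $\varepsilon_{k,i} \sim U[-\varepsilon, \varepsilon]$ for all $i, k$, then $\hat q_k(t) > q_k + \varepsilon$ (respectively, $\hat q_k(t) < q_k - \varepsilon$) implies $F_\varepsilon(\hat q_k(t) - q_k) = 0$ ($= 1$). Moreover, if $|\hat q_k(t) - q_k| \le \varepsilon$, we have $F_\varepsilon(\hat q_k(t) - q_k) - \hat q_k(t) = \frac{1}{2\varepsilon}(\varepsilon - \hat q_k(t) + q_k) - \hat q_k(t) = \frac{1 + 2\varepsilon}{2\varepsilon}\,[\hat q_{k,\infty} - \hat q_k(t)]$, which concludes the proof.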

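Proof of Proposition 4. Observe that $\dot B_k(t) = \dot L_k(t) + \dot D_k(t) = \Lambda \lambda_k(\hat q(t), \mathbf{p}) > 0$. Then, the result follows immediately from noticing that, for all $k \in \mathcal{P}_K$ and $t > 0$, we have $\dot{\hat q}_k(t) = 0$ if and only if $\hat q_k(t) = \hat q_{k,\infty}$, and that $\dot{\hat q}_k(t) > 0$ (respectively, $< 0$) if and only if $\hat q_k(t) < \hat q_{k,\infty}$ ($> \hat q_{k,\infty}$).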
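The monotone behaviour described in Proposition 4 can be illustrated by integrating the fluid dynamics numerically. The sketch below is not the paper's code. It assumes uniform noise on [-eps, eps], an MNL purchase probability of the form exp(q̂_k − p_k)/(1 + Σ_j exp(q̂_j − p_j)), and the dynamics dq̂_k/dt = Ḃ_k/(B_k + B_{k,0}) [F_ε(q̂_k − q_k) − q̂_k] with Ḃ_k = Λλ_k. The function names, the forward-Euler scheme, the step size, and the parameter values are all illustrative choices.

```python
import math


def survival_uniform(x, eps):
    """P(noise >= x) for noise uniform on [-eps, eps] (assumed noise law)."""
    if x <= -eps:
        return 1.0
    if x >= eps:
        return 0.0
    return (eps - x) / (2.0 * eps)


def mnl_probs(q_hat, prices):
    """MNL purchase probabilities with an outside option of utility 0."""
    weights = [math.exp(q - p) for q, p in zip(q_hat, prices)]
    denom = 1.0 + sum(weights)
    return [w / denom for w in weights]


def integrate_fluid(q_true, q0, b0, prices, eps, rate=1.0, dt=0.01, horizon=500.0):
    """Forward-Euler path of the fluid estimates q_hat_k(t)."""
    q_hat = list(q0)
    reviews = [0.0] * len(q0)  # B_k(t), starting from B_k(0) = 0
    path = [list(q_hat)]
    for _ in range(int(horizon / dt)):
        lam = mnl_probs(q_hat, prices)  # evaluated once per step for all products
        for k in range(len(q_hat)):
            b_dot = rate * lam[k]
            drift = survival_uniform(q_hat[k] - q_true[k], eps) - q_hat[k]
            q_hat[k] += dt * b_dot / (reviews[k] + b0[k]) * drift
            reviews[k] += dt * b_dot
        path.append(list(q_hat))
    return path
```

With a product that starts below its limit and another that starts above, the first path should increase and the second should decrease, each without crossing its limit point.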

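Proof of Proposition 5. First observe that, for all $j \in \mathcal{P}_K$, $\hat q_{j,0} < \hat q_{j,\infty}$ implies $\hat q_j(t) < \hat q_{j,\infty}$ for all $t \ge 0$ (Proposition 4), and that $\hat q_{k,0} < q_k - \varepsilon$ implies $\dot{\hat q}_k(t)/(1 - \hat q_k(t)) = \dot B_k(t)/(B_k(t) + B_{k,0})$ for all $0 \le t \le \tau_k$ (see (15)). Then, for $0 \le t \le \tau_k$ we have

$$\frac{d}{dt}\,\frac{1}{2}\,|\hat q_k(t) - \hat q_{k,\infty}|^2 = \frac{\dot B_k(t)}{B_k(t) + B_{k,0}}\,(\hat q_k(t) - \hat q_{k,\infty})(1 - \hat q_k(t)) \le -v_k\,(\hat q_{k,\infty} - \hat q_k(t))^2, \tag{41}$$

where $v_k := \min_{t \in [0, \tau_k]} \left[\dot B_k(t)/(B_k(t) + B_{k,0})\right]$, and where $\dot B_k(t) = \lambda_k(\hat q(t), \mathbf{p})$. Notice that (41) implies (16). We now show that the function $\dot B_k(t)/(B_k(t) + B_{k,0})$ is strictly decreasing for all $t \ge 0$, and hence that $v_k = \dot B_k(\tau_k)/(B_k(\tau_k) + B_{k,0})$. First, notice that Proposition 4 establishes that $\hat q_{j,0} \le \hat q_{j,\infty}$ implies $\dot{\hat q}_j(t) > 0$ for all $t \ge 0$, which allows us to write

$$\begin{aligned}
\frac{d}{dt}\,\frac{\dot B_k(t)}{B_k(t) + B_{k,0}} &= \frac{\ddot B_k(t)}{B_k(t) + B_{k,0}} - \frac{\dot B_k(t)^2}{(B_k(t) + B_{k,0})^2} \\
&= \frac{\dot B_k(t)}{B_k(t) + B_{k,0}}\left[\dot{\hat q}_k(t) - \sum_{j=1}^{n} \dot B_j(t)\,\dot{\hat q}_j(t) - \frac{\dot B_k(t)}{B_k(t) + B_{k,0}}\right] \\
&= \frac{\dot B_k(t)}{B_k(t) + B_{k,0}}\left[-\frac{\dot B_k(t)}{B_k(t) + B_{k,0}}\,\hat q_k(t) - \sum_{j=1}^{n} \dot B_j(t)\,\dot{\hat q}_j(t)\right] < 0.
\end{aligned} \tag{42}$$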

In particular, (42) implies that $v_k = \dot B_k(\tau_k)/(B_k(\tau_k) + B_{k,0})$. The quantity $B_k(\tau_k)$ can be evaluated rewriting (15) as $\dot{\hat q}_k(t)/(1 - \hat q_k(t)) = \dot B_k(t)/(B_k(t) + B_{k,0})$ and integrating on both sides for $0 \le t \le \tau_k$, which gives

$$\hat q_k(t) = 1 - \frac{B_{k,0}}{B_k(t) + B_{k,0}}\,(1 - \hat q_{k,0}), \qquad 0 \le t \le \tau_k. \tag{43}$$
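The integration step behind (43) can be spelled out as follows. This is a sketch that assumes $B_k(0) = 0$ and that the estimate stays below 1 on $[0, \tau_k]$, so that both logarithms are well defined.

```latex
\int_0^t \frac{\dot{\hat q}_k(s)}{1 - \hat q_k(s)}\,ds
  = \int_0^t \frac{\dot B_k(s)}{B_k(s) + B_{k,0}}\,ds
\;\Longrightarrow\;
-\ln\frac{1 - \hat q_k(t)}{1 - \hat q_{k,0}}
  = \ln\frac{B_k(t) + B_{k,0}}{B_{k,0}}
\;\Longrightarrow\;
\hat q_k(t) = 1 - \frac{B_{k,0}}{B_k(t) + B_{k,0}}\,(1 - \hat q_{k,0}).
```

Solving the last expression for $B_k(t)$ at the time when the estimate reaches a given level is the inversion used in the next step of the proof.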


Since qk,0 ≤ qk − ε implies qk(τk) = qk − ε, Bk(τk) can be computed by inverting (43). Then,

vk =Bk(τk)

Bk(τk) +Bk,0=

1

Bk,0

1− qk,01− (qk − ε)


∑j 6=k exp(qj(τk)− pj)

. (44)

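The bounds (17) and (18) follow immediately from noticing that $\lambda_k(x, \mathbf{p})$ is strictly decreasing in $x_j$ for $j \ne k$ (to see this, observe that $\partial \lambda_k(x, \mathbf{p})/\partial x_j = -\lambda_j(x, \mathbf{p})\,\lambda_k(x, \mathbf{p}) < 0$ for $j \ne k$). Finally, observe that, if we fix $\hat q_{j,0} = \hat q_{j,\infty}$ for all $j \ne k$, then we have $\hat q_j(t) = \hat q_{j,0} = \hat q_{j,\infty}$ for all $t \ge 0$, which implies that the lower and upper bounds on $v_k$ coincide with $v_k$ itself. This proves that the bounds (17) and (18) are tight, and concludes the proof.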

Proofs of Section 5

Proof of Proposition 6. (Sketch Only). The proof follows the line of the proof of Proposition 1. In fact, notice that (28) still holds. Moreover, we can show that Assumptions (A.2.1)–(A.2.3) and Assumption (A.2.5) of the Main Convergence Theorem of Kushner and Yin (2003) are satisfied, using the same argument used in the proof of Proposition 1. Moreover, given that the search costs are bounded, $\mathbb{P}(c_i = k) = \mathbb{E}_{\Pi_i}[\lambda_k(\hat q_i, \mathbf{p}, \sigma_i)] \ge 2\eta_k > 0$ for some $\eta_k > 0$ for all $i, k$. Hence, using the same argument used in the proof of Proposition 1, we have $\sum_{i=1}^{\infty} \mathbf{1}\{c_i = k\}^2 (B_{k,i} + B_{k,0})^{-2} \le \sum_{i=1}^{\infty} (B_{k,i} + B_{k,0})^{-2}$ almost surely, which proves that also Assumption (A.2.4) is satisfied. The remainder of the proof is identical to the proof of Proposition 1, and is therefore omitted.

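Proof of Proposition 7. Let $\Pi_i(z) := \mathbb{P}(\sigma_i = z)$. Then,

$$\mathbb{E}_{\Pi_i}\left[\sum_{k=1}^{K} p_k\,\lambda_k(\hat q_i, \mathbf{p}, \sigma_i)\right] = \sum_{k=1}^{K} p_k \sum_{z \in \mathcal{Z}_K} \Pi_i(z)\,\lambda_k(\hat q_i, \mathbf{p}, z) = \sum_{z \in \mathcal{Z}_K} \Pi_i(z) \sum_{k=1}^{K} p_k\,\lambda_k(\hat q_i, \mathbf{p}, z).$$

Suppose that there exists $z_i \in \mathcal{Z}_K$ such that $\sum_{k=1}^{K} p_k\,\lambda_k(\hat q_i, \mathbf{p}, z) \le \sum_{k=1}^{K} p_k\,\lambda_k(\hat q_i, \mathbf{p}, z_i)$ for all $z \in \mathcal{Z}_K$. Then, since $\sum_{z \in \mathcal{Z}_K} \Pi_i(z) = 1$ for all $i$, we have

$$\sum_{z \in \mathcal{Z}_K} \Pi_i(z) \sum_{k=1}^{K} p_k\,\lambda_k(\hat q_i, \mathbf{p}, z) \le \sum_{k=1}^{K} p_k\,\lambda_k(\hat q_i, \mathbf{p}, z_i) \sum_{z \in \mathcal{Z}_K} \Pi_i(z) = \sum_{k=1}^{K} p_k\,\lambda_k(\hat q_i, \mathbf{p}, z_i),$$

which concludes the proof.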
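The averaging argument above can be illustrated by brute force on a small instance. The sketch below is hypothetical: it assumes MNL purchase probabilities with a linear search cost γ times the display position, enumerates all rankings, and compares the revenue of the best deterministic ranking with the expected revenue under a randomly drawn distribution over rankings. The function names and numeric values are illustrative only.

```python
import itertools
import math
import random


def choice_probs(q_hat, prices, positions, gamma):
    """MNL probabilities when product k sits at positions[k] (linear search cost)."""
    weights = [math.exp(q - p - gamma * pos)
               for q, p, pos in zip(q_hat, prices, positions)]
    denom = 1.0 + sum(weights)  # outside option has utility 0
    return [w / denom for w in weights]


def revenue(q_hat, prices, positions, gamma):
    """Expected one-period revenue of a deterministic position assignment."""
    probs = choice_probs(q_hat, prices, positions, gamma)
    return sum(p * lam for p, lam in zip(prices, probs))


def best_vs_randomized(q_hat, prices, gamma, seed=0):
    """Best deterministic revenue and the revenue of a random mixture of rankings."""
    rng = random.Random(seed)
    rankings = list(itertools.permutations(range(1, len(prices) + 1)))
    revenues = [revenue(q_hat, prices, z, gamma) for z in rankings]
    raw = [rng.random() for _ in rankings]
    total = sum(raw)
    expected = sum(w / total * r for w, r in zip(raw, revenues))
    return max(revenues), expected
```

Because the mixed revenue is a convex combination of the deterministic ones, it can never exceed the best deterministic ranking, whatever mixing distribution is drawn.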

The following lemma is instrumental for the proof of Proposition 8.

Lemma A.4 (Hardy, Littlewood, and Polya (1988, Section 10.2, Theorem 368)). Let x1, x2, . . . , xN

and y1, y2, . . . , yN be real numbers such that x1 ≤ x2 ≤ · · · ≤ xN and y1 ≤ y2 ≤ · · · ≤ yN . Let σ be

any permutation of {1, 2, . . . , N}. Then, x1yσ1 + x2yσ2 + · · ·+ xNyσN ≤ x1y1 + x2y2 + · · ·+ xNyN .
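The rearrangement inequality in Lemma A.4 can be verified exhaustively for short sequences. The following sketch uses a hypothetical helper name and relies on brute-force enumeration, which is feasible only for small N. It sorts both sequences and checks that no permutation of the second one beats the aligned pairing.

```python
import itertools


def rearrangement_holds(xs, ys, tol=1e-12):
    """Check sum(x_i * y_sigma(i)) <= sum(x_i * y_i) for all permutations sigma."""
    xs_sorted = sorted(xs)
    ys_sorted = sorted(ys)
    aligned = sum(x * y for x, y in zip(xs_sorted, ys_sorted))
    return all(
        sum(x * y for x, y in zip(xs_sorted, perm)) <= aligned + tol
        for perm in itertools.permutations(ys_sorted)
    )
```

In the ranking problem this is the reason why, for a fixed value of ρ, pairing the largest position weights with the largest product terms maximizes the weighted sum.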


Proof of Proposition 8. For notational convenience, in the present proof we omit the consumer index $i$. This proof deals with the following generalized version of MLPP

$$\max_{\sigma \in \mathcal{Z}_K} \frac{\sum_{k=1}^{K} \rho_k\,\theta_{\sigma_k} w_k}{1 + \sum_{k=1}^{K} \theta_{\sigma_k} w_k}, \tag{45}$$

in which the multinomial logit probabilities associated to each product are independent of prices. Notice that, if $\rho_k = p_k$, $w_k := \exp(\hat q_k - p_k)$ and $\theta_k := \exp(-g(k))$, (45) reduces to (22).

Let $A_1, A_2, \ldots, A_N$ be the disjoint subsets of $\mathcal{P}_K$ containing clusters of exchangeable products, i.e., for all $j$ and for all $a_1, a_2 \in A_j$ we have $w_{a_1} = w_{a_2}$ and $\rho_{a_1} = \rho_{a_2}$. The quantity $N$, with $1 \leq N \leq K$, is the number of clusters of exchangeable products in $\mathcal{P}_K$. Notice that, for all $j_1 \neq j_2$, if $a_1 \in A_{j_1}$ and $a_2 \in A_{j_2}$, then either $\rho_{a_1} \neq \rho_{a_2}$ or $w_{a_1} \neq w_{a_2}$. Let $\mathcal{P}_N := \{k_1, k_2, \ldots, k_N : k_j \in A_j,\ j = 1, 2, \ldots, N\} \subseteq \mathcal{P}_K$ be the set constructed by taking a single element from each of the sets $A_1, A_2, \ldots, A_N$. Notice that, by construction, $\mathcal{P}_N$ cannot contain exchangeable products, and that it can be constructed in polynomial time. In the remainder of the proof, without loss of generality, we assume that $\mathcal{P}_N = \{1, 2, \ldots, N\}$.

Let $\sigma^*$ be an optimal position assignment. The corresponding optimal profit $\rho^*$ is such that

$$\rho^* = \frac{\sum_{k=1}^{K} \rho_k\,\theta_{\sigma^*_k} w_k}{1 + \sum_{k=1}^{K} \theta_{\sigma^*_k} w_k} \iff \rho^* = \sum_{k=1}^{K} \theta_{\sigma^*_k} w_k\,(\rho_k - \rho^*). \tag{46}$$
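The equivalence in (46) is plain algebra: multiplying the ratio by its denominator and rearranging gives the fixed-point form, and this holds for any fixed assignment, not only the optimal one. The sketch below checks it numerically; the function names, the 0-based position indexing, and the parameter values are assumptions made for illustration.

```python
def ratio_value(sigma, rho, w, theta):
    """Left-hand form of (46): profit of assignment sigma (sigma[k] = position of k)."""
    num = sum(rho[k] * theta[sigma[k]] * w[k] for k in range(len(w)))
    den = 1.0 + sum(theta[sigma[k]] * w[k] for k in range(len(w)))
    return num / den

def fixed_point_residual(value, sigma, rho, w, theta):
    """value - sum_k theta_{sigma_k} w_k (rho_k - value); zero iff value is the ratio."""
    return value - sum(
        theta[sigma[k]] * w[k] * (rho[k] - value) for k in range(len(w))
    )

# Hypothetical instance with three products and three positions.
rho = [1.0, 2.0, 0.5]
w = [2.0, 0.5, 3.0]
theta = [1.0, 0.6, 0.3]
sigma = (2, 0, 1)

value = ratio_value(sigma, rho, w, theta)
residual_at_value = fixed_point_residual(value, sigma, rho, w, theta)
residual_elsewhere = fixed_point_residual(value + 0.1, sigma, rho, w, theta)
print(value, residual_at_value, residual_elsewhere)
```

The residual is linear in `value` with a strictly positive slope (the denominator of the ratio), so it vanishes at exactly one point.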

Notice that, by optimality of $\rho^*$, Lemma A.4 guarantees that $\sigma^*$ satisfies (23). Thus, we can define $B : \mathbb{R} \to \mathcal{Z}_K$ as the function that, for a specific value of $\rho$, gives the position assignment that maximizes the sum $\sum_{k=1}^{K} \theta_{\sigma_k} w_k\,(\rho_k - \rho)$, i.e., $B(\rho) := \arg\max_{\sigma \in \mathcal{Z}_K} \sum_{k=1}^{K} \theta_{\sigma_k} w_k\,(\rho_k - \rho)$. Notice that, for each $\rho \in \mathbb{R}$, $B(\rho)$ satisfies (23).

Let $f_k : \mathbb{R} \to \mathbb{R}$, $k \in \mathcal{P}_K$, be the linear functions $f_k(x) = w_k(\rho_k - x)$. Then, for any $k_1, k_2 \in \mathcal{P}_N$, the lines $f_{k_1}(x)$ and $f_{k_2}(x)$ are either parallel or intersect in a single point. Let $\bar{N} \leq \binom{N}{2} = O(K^2)$ be the number of intersection points between the lines $f_1, f_2, \ldots, f_N$. Let us denote by $x_1, x_2, \ldots, x_{\bar{N}}$ these intersection points and suppose, without loss of generality, that $x_1 \leq x_2 \leq \cdots \leq x_{\bar{N}}$. Therefore, for $\rho \in [x_{j-1}, x_j)$, $B(\rho)$ outputs a position assignment that corresponds to the constant ordering of the functions $f_1, f_2, \ldots, f_N$ on $[x_{j-1}, x_j)$. Observe that $B(\rho)$ is not single-valued, as switching the positions of two exchangeable products in an optimal position assignment produces another optimal position assignment. However, we can define a function $B' : \mathbb{R} \to \mathcal{Z}_K$ that, for each value of $\rho$, outputs the position assignment $\sigma^*$ corresponding to the ordering of $f_1, f_2, \ldots, f_N$ in which exchangeable products are placed in lexicographic order (any other ordering rule would still work). The result then follows because one only needs to enumerate at most $\bar{N} = O(K^2)$ values of $B'(\rho)$ to find $\rho^*$ and $\sigma^*$, and because $\mathcal{P}_N$ can also be constructed in polynomial time.
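The enumeration argument above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: positions are 0-based indices into `theta`, ties between exchangeable products are broken by product index, one probe point is taken inside each interval between consecutive intersection points, and a brute-force search over all permutations is included only to cross-check small instances. All function names and numbers are hypothetical.

```python
from itertools import combinations, permutations

def profit(sigma, rho, w, theta):
    """Objective (45); sigma[k] is the 0-based position assigned to product k."""
    num = sum(rho[k] * theta[sigma[k]] * w[k] for k in range(len(w)))
    den = 1.0 + sum(theta[sigma[k]] * w[k] for k in range(len(w)))
    return num / den

def assignment_for(x, rho, w, theta):
    """Give the largest theta to the largest f_k(x) = w_k (rho_k - x); ties by index."""
    K = len(w)
    products = sorted(range(K), key=lambda k: (-w[k] * (rho[k] - x), k))
    positions = sorted(range(K), key=lambda s: (-theta[s], s))
    sigma = [0] * K
    for k, s in zip(products, positions):
        sigma[k] = s
    return tuple(sigma)

def intersection_points(rho, w):
    """Sorted abscissas where two non-parallel lines f_a and f_b cross."""
    points = set()
    for a, b in combinations(range(len(w)), 2):
        if w[a] != w[b]:
            points.add((w[a] * rho[a] - w[b] * rho[b]) / (w[a] - w[b]))
    return sorted(points)

def solve_by_enumeration(rho, w, theta):
    """Evaluate one candidate assignment per interval and keep the best."""
    xs = intersection_points(rho, w)
    if xs:
        probes = [xs[0] - 1.0]
        probes += [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
        probes += [xs[-1] + 1.0]
    else:
        probes = [0.0]
    candidates = {assignment_for(x, rho, w, theta) for x in probes}
    best = max(candidates, key=lambda s: profit(s, rho, w, theta))
    return best, profit(best, rho, w, theta)

def solve_by_brute_force(rho, w, theta):
    """Exhaustive search over all K! assignments (cross-check only)."""
    best = max(permutations(range(len(w))), key=lambda s: profit(s, rho, w, theta))
    return best, profit(best, rho, w, theta)

# Hypothetical instance; products 0 and 1 are exchangeable.
rho = [1.0, 1.0, 2.0, 0.5, 1.5]
w = [1.5, 1.5, 0.7, 3.0, 1.0]
theta = [1.0, 0.7, 0.45, 0.3, 0.1]
sigma_enum, value_enum = solve_by_enumeration(rho, w, theta)
sigma_bf, value_bf = solve_by_brute_force(rho, w, theta)
print(sigma_enum, value_enum, value_bf)
```

The enumeration evaluates at most one assignment per interval, so the number of candidates grows quadratically in the number of products, whereas the brute-force check grows factorially and is only practical for very small instances.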

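Proof of Proposition 9. Using Lemma A.2 and (27), we find

$$\mathbb{E}\left[\sum_{k=1}^{K} \big|\,\cdots\,\big|\right] = \sum_{k=1}^{K} (B_{k,i} + B_{k,0})^{-1}\,\mathbb{E}\big[\mathbf{1}\{c_{i-1} = k\}\,|Y_{k,i-1}|\big].$$

The result follows from using (6) in $\mathbb{E}\big[\mathbf{1}\{c_i = k\}\,|Y_{k,i}|\big] = (1 - q_{k,i})\,\mathbb{P}(r_{k,i} = L) + q_{k,i}\,\mathbb{P}(r_{k,i} = D)$.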
