Strong Stochastic Dominance
Ehud Lehrer∗ and Tao Wang†
January 7, 2020
Abstract
One probability distribution is said to strongly stochastically dominate another probability distribution if the former is a convex transformation of the latter. The main contribution of this paper is the introduction of an equivalent condition phrased in expectation/behavioral terms. To show the usefulness of this equivalent condition, we give several economic applications. These include a Bayesian learning and dynamics model in which we examine the consequences of one prior strongly stochastically dominating another. We also show implications of our main theorem in models of monotone comparative statics under uncertainty, the pricing of risky assets, a portfolio's value-at-risk, and production expansion.
Keywords: Strong stochastic dominance; Convex transformation; Bayesian learning;
Monotone comparative statics
JEL Codes: C11, D80, D83.
∗School of Mathematical Sciences, Tel Aviv University, Tel Aviv 69978, Israel and INSEAD, Bd. de Constance, 77305 Fontainebleau Cedex, France. Email: [email protected]. Lehrer acknowledges the support of ISF Grant #963/15 and NSFC-ISF Grant #2510/17.
†School of Mathematical Sciences, Tel Aviv University, Tel Aviv 69978, Israel. Email: [email protected]. Wang acknowledges the support of NSFC Grant #11761141007.
1 Introduction
The notion of stochastic dominance has been extensively discussed in the literature.
Many well-known stochastic dominance relations admit equivalent conditions expressed
in terms of expectations of functions with certain properties. The most famous ex-
amples are first and second order stochastic dominance: A distribution F first-order
stochastically dominates (FOSD) another distribution G if and only if the expected
value of any increasing function with respect to (henceforth, w.r.t.) F is greater than
the expected value w.r.t. G. Similarly, F second-order stochastically dominates (SOSD) G if and only if the expected value of any increasing and concave function w.r.t. F is greater than the expected value w.r.t. G. Blackwell's equivalence theorem for
comparing experiments (Blackwell (1953)) is another example: One experiment S (a stochastic matrix) dominates¹ another experiment T if and only if the ex-ante expected
value of any continuous and convex function under S is greater than the expected value
of the same function under T . The significance of these stochastic dominance relations
stems in part from the existence of such equivalent conditions, since they allow us to
abstract away from the underlying distributions and focus directly on expectations.
In this paper, we consider a notion of stochastic dominance that we call strong stochastic dominance (SSD) and provide an equivalent condition for SSD expressed in
terms of expectations. We then demonstrate that our equivalent condition has a wide
range of applications in a variety of economic models.
We say that a cumulative distribution function F strongly stochastically dominates
another cumulative distribution function G if F is a convex transformation of G. This
notion applies to arbitrary cumulative distribution functions, not only continuous or discrete ones: the distributions may have atoms or be non-atomic, and may or may not admit densities. The notion of SSD is therefore more general than the widely used monotone
likelihood ratio property (MLRP), the property that the ratio of two density functions
is monotone. Studies of convex transformations of distribution functions have already
appeared in the literature in different contexts with a varying degree of generality
(see for instance, Chan et al. (1991), Bikhchandani et al. (1992), Chew et al. (1987)
and Chateauneuf et al. (2004)). However, none of these works has provided an equivalent condition in terms of expectations.
The main result of this paper shows that F strongly stochastically dominates G if and only if the following condition holds: for every bounded, non-negative measurable function h and every bounded and increasing function u,

E_F(uh) E_G(h) ≥ E_G(uh) E_F(h).
¹ More precisely, we say that S dominates T if there exists a stochastic matrix M such that T = SM, i.e., experiment T can be obtained from S by adding noise according to M.
Since no assumptions other than boundedness, monotonicity, non-negativity and mea-
surability are made regarding the functions u and h, we have considerable latitude in
specifying u and h, which makes wide applications possible.
Although overlooked by researchers so far, the equivalent condition we introduce
here seems to be fundamental. We justify this statement by providing applications to
both dynamic and static models.
Consider the framework of Bayesian learning and dynamic decision making, where
a decision maker receives sequential binary signals about an unknown state of nature.
Using our equivalent condition, we show that the following conditions are equivalent:
(i) a prior belief F dominates another prior belief G in the sense of SSD; (ii) the
posterior distribution given any set of histories of the same length under F FOSD the
posterior distribution under G; (iii) for any set of histories of the same length, the
distribution over this set of histories under F FOSD that under G. Conditions (i) and
(ii) are in terms of the prior and posterior beliefs, respectively, whereas condition (iii)
is in terms of distributions over a set of histories.
Our equivalent condition yields monotone comparative statics results in a class
of stochastic optimization problems. Specifically, we extend a well-known result of
Athey (2002) that states the following. Suppose that a utility function satisfies the
single-crossing property in action and state of nature, where the latter is governed by
a parameterized family of distributions, ordered by MLRP. Then, the optimal actions
that maximize the expected utility, subject to a constraint, are monotone in the sense
of strong set order w.r.t. the distribution and the constraint. We show that this result
still holds if we replace the MLRP by the weaker notion of SSD using the equivalent
condition stated in our main theorem.
Our equivalent condition can also be applied to other static models. One application
is concerned with the equilibrium market prices and the “value-at-risk” of a risky asset
under different distributions over the unknown rate of return of the asset. We show
that under a strongly stochastically dominating distribution, the equilibrium asset
price and the expected return above any percentile are higher. In another application,
we consider a firm’s decision regarding the output expansion investment when facing
uncertain market demand. In this setting, our result implies that under a strongly
stochastically dominating distribution of market demand, the firm expects a higher
rate of return from the output expansion investment.
To the best of our knowledge, Shaked & Shanthikumar (2007) and Muller & Stoyan
(2002) contain the most comprehensive reviews of known results about stochastic or-
ders. It is well known that when density functions exist, one distribution dominating another in the sense of MLRP is equivalent to the former being a convex transformation of the latter (e.g., Section 1.C.1 of Shaked & Shanthikumar (2007) and Theorem 1.4.3 of Muller & Stoyan (2002)). Our paper does not require the existence of density functions. More
importantly, it provides a widely applicable condition which is equivalent to the existence of a convex transformation from one distribution to another. Convex (concave)
transformations of distribution functions also appear in a few models in the context
of decision theory. Bikhchandani et al. (1992) focus on Bayesian learning, where it
is well known that the FOSD relation of prior distribution functions may not be pre-
served after Bayesian updating. They provide equivalent conditions under which the
posterior belief over a set of probability distributions after any history of signals FOSD
the posterior of another belief. In the rank-dependent expected utility models, such
as Chew et al. (1987) and Chateauneuf et al. (2004), convex (concave) transforma-
tions of probability distributions are used to characterize “more risk averse” preference
relations.
The rest of the paper is organized as follows. Section 2 introduces the notion of SSD and discusses its connection to convex transformations. Section 3 presents the main result and its immediate implications. Sections 4 and 5 contain economic applications of the main result to dynamic and static models, respectively. Omitted proofs can be found in the Appendix.
2 Strong Stochastic Dominance
Let F and G be two cumulative distribution functions (CDF) defined on R.
Definition 1. We say that F strongly stochastically dominates G, denoted F ≽_SSD G, if for any x1 > x2 > x3,

[F(x1) − F(x2)][G(x1) − G(x3)] ≥ [F(x1) − F(x3)][G(x1) − G(x2)].  (1)
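Eq. (1) can be checked mechanically when both distributions are supported on finitely many points. The following Python sketch (our illustration, not part of the paper; the two mass functions are hypothetical examples) enumerates all triples x1 > x2 > x3:

```python
import itertools

def cdf(masses, xs, x):
    """CDF of a discrete distribution placing mass masses[i] at point xs[i]."""
    return sum(m for m, s in zip(masses, xs) if s <= x)

def ssd_dominates(f_m, g_m, xs, tol=1e-12):
    """Check Eq. (1) for every triple x1 > x2 > x3 from the common support xs."""
    for x3, x2, x1 in itertools.combinations(sorted(xs), 3):
        F1, F2, F3 = (cdf(f_m, xs, x) for x in (x1, x2, x3))
        G1, G2, G3 = (cdf(g_m, xs, x) for x in (x1, x2, x3))
        if (F1 - F2) * (G1 - G3) < (F1 - F3) * (G1 - G2) - tol:
            return False
    return True

xs = [0, 1, 2]
F = [0.1, 0.3, 0.6]  # mass function; F/G is increasing, so F should dominate G
G = [0.6, 0.3, 0.1]
print(ssd_dominates(F, G, xs), ssd_dominates(G, F, xs))  # True False
```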
Chan et al. (1991) and Bikhchandani et al. (1992)² consider transformations of CDF's by a function ϕ which is convex on the range of G. It turns out that there is a close connection between F ≽_SSD G and the existence of a convex functional transformation ϕ, as stated in the following proposition.
Proposition 1. Let x̄ := inf{x | G(x) = 1}.³ F ≽_SSD G if and only if the transformation ϕ such that F(x) = ϕ(G(x)) on (−∞, x̄] is a convex function.
² Chan et al. (1991) consider convex-ordering among continuous distribution functions, which is equivalent to MLRP. Bikhchandani et al. (1992) focus on discrete distributions. Here we impose no restrictions on the type of distribution functions: they are allowed to be discontinuous or not absolutely continuous with respect to the Lebesgue measure.
³ Since G is right-continuous, this implies G(x̄) = 1.
[Two panels plotting F against G together with the transformation ϕ. Panel (a): F ⋡_SSD G. Panel (b): F ⋡_SSD G.]
Figure 1: Transformations that fail to satisfy F ≽_SSD G
Given any two cumulative distribution functions F and G, one can find a correspondence ϕ : [0, 1] ⇒ [0, 1] such that F(x) = ϕ(G(x)) on (−∞, x̄]. Indeed, define ϕ(y) = F(G⁻¹(y)), where y is in the range of G and G⁻¹(y) = {x ∈ (−∞, x̄] | G(x) = y}. The reason why ϕ(·) is a correspondence, rather than a function, is that G might be constant on some interval while F is strictly increasing on the same interval. Such a case is illustrated in Panel (a) of Figure 1. In this figure, G equals 1/2 on [x3, x2] while F(x2) > F(x3), and ϕ maps 1/2 to a closed interval containing F(x2) and F(x3).
Proposition 1 states that F ≽_SSD G implies that, on the domain where G is not constantly 1, ϕ is a function rather than a correspondence. To see the reason, consider again Panel (a) of Figure 1. Take x1 to be such that F(x1) = G(x1) = 1. In this case, the slope of ϕ between the points (G(x1), F(x1)) and (G(x2), F(x2)) is strictly less than the slope between (G(x1), F(x1)) and (G(x3), F(x3)). This violates Eq. (1).
The purpose of Panel (b) in Figure 1 is to show that the condition F(x) = ϕ(G(x)) on (−∞, x̄] cannot be replaced by the weaker condition that F(x) = ϕ(G(x)) on the open ray (−∞, x̄). In this figure, G has an atom at x1 = x̄. Despite the fact that ϕ is a convex function for G < 1, and that F(x) = ϕ(G(x)) on the open ray (−∞, x̄), F ≽_SSD G fails to hold. To see this, consider x1, x2 and x3 as illustrated. The slope of ϕ between the points (G(x1), F(x1)) and (G(x2), F(x2)) is strictly less than the slope between the points (G(x1), F(x1)) and (G(x3), F(x3)). This contradicts Eq. (1). Therefore, one needs to refer also to the point x̄ and require the transformation ϕ satisfying F(x) = ϕ(G(x)) on (−∞, x̄] to be a convex function. Proposition 1 ensures that this is sufficient for F ≽_SSD G.
Figure 2 shows a case in which F ≽_SSD G. Here x̄ = 2/3, and the transformation ϕ such that F(x) = ϕ(G(x)) on (−∞, 2/3] is a convex function.
[Panel (a): the distribution functions F and G. Panel (b): the transformation ϕ.]
Figure 2: A graphical illustration of SSD
Remarks:
1. The definition of SSD applies to any CDF’s, not only to continuous or discrete
ones. These distributions could be with atoms or non-atomic, with or without
densities.
2. The set of CDF’s that strongly stochastically dominates a given CDF is convex.
This can be seen immediately from Eq. (1).
3. When F ≽_SSD G, since both F and G are CDF's, w.l.o.g. the transformation ϕ such that F(x) = ϕ(G(x)) on (−∞, x̄] is increasing (here and in what follows, increasing means weakly increasing) and satisfies ϕ(0) = 0.⁴ This implies that either ϕ(x) < x for every x ∈ (0, 1), or ϕ(x) = x throughout. We therefore obtain that F(x) = ϕ(G(x)) ≤ G(x) when x ≤ x̄, and G(x) = 1 ≥ F(x) when x > x̄, implying G ≥ F. We conclude that F ≽_SSD G implies F FOSD G.
4. If F ≽_SSD G, then an atom of G in the interior of the support⁵ of F is also an atom of F. Figure 2 is an illustration. In Panel (a), we see that both F and G have
⁴ In general, the transformation ϕ such that F = ϕ ∘ G on (−∞, x̄] may not satisfy ϕ(0) = 0 and monotonicity. This happens, for example, when 0 is not in the range of both F and G (e.g., when a family of normal distributions with the same variance is ordered by the mean). Whenever such situations arise, one may let ϕ(0) = 0, in which case the function ϕ is assured to be increasing.
⁵ The support of a probability measure is the smallest closed set with measure 1.
an atom at x = 2/3 and they satisfy F ≽_SSD G. If, on the contrary, x is an atom of G but not an atom of F (here x is in the interior of the support of F), then the situation is similar to Panel (b) of Figure 1, in which F ≽_SSD G fails to hold. More generally, we can show that F ≽_SSD G implies that the probability measure induced by G is absolutely continuous⁶ w.r.t. that induced by F on a certain set. As a result, the Radon–Nikodym derivative of the probability measure induced by G w.r.t. that induced by F exists and is decreasing. The reader is referred to the more detailed discussion of absolute continuity given in Section 6.1.
Proposition 2. In the special case in which the densities f and g of F and G exist (w.r.t. the Lebesgue measure), Definition 1 is equivalent to the monotone likelihood ratio property (MLRP), namely⁷

f(x)g(y) ≥ f(y)g(x), ∀x > y.  (2)

Proof. Assume (2) and let x1 > x2 > x3. Integrating (2) implies

∫_{x2}^{x1} f(x)dx · ∫_{x3}^{x2} g(y)dy ≥ ∫_{x3}^{x2} f(y)dy · ∫_{x2}^{x1} g(x)dx,

which, by Lemma 1 below, is equivalent to (1); therefore F ≽_SSD G. On the other hand, if F ≽_SSD G, divide both sides of this form of (1) by (x1 − x2)(x2 − x3) and let⁸ x1 ↓ x2, x2 ↓ x3.
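For a concrete instance of condition (2), here is a numerical sketch of ours (the shifted normal densities are an assumed example, not taken from the paper): two normal densities with equal variance and ordered means satisfy the MLRP inequality pointwise.

```python
import math

def npdf(x, mu, sigma=1.0):
    """Normal density; two such densities with shifted means satisfy MLRP."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

xs = [i / 10 for i in range(-50, 51)]
f = [npdf(x, mu=0.5) for x in xs]
g = [npdf(x, mu=0.0) for x in xs]

# Eq. (2): f(x) g(y) >= f(y) g(x) for all x > y
# (here the likelihood ratio f/g = exp(0.5 x - 0.125) is increasing in x).
ok = all(f[i] * g[j] >= f[j] * g[i] for i in range(len(xs)) for j in range(i))
print(ok)  # True
```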
We complete this section with a lemma providing a few conditions equivalent to
the one given in the definition of SSD, Eq. (1).
Lemma 1. The following conditions are equivalent:
1. for any x1 > x2 > x3, [F(x1) − F(x2)][G(x1) − G(x3)] ≥ [F(x1) − F(x3)][G(x1) − G(x2)];
2. for any x1 > x2 > x3, [F(x1) − F(x2)][G(x2) − G(x3)] ≥ [F(x2) − F(x3)][G(x1) − G(x2)];
3. for any x1 > x2 > x3, [F(x1) − F(x3)][G(x2) − G(x3)] ≥ [F(x2) − F(x3)][G(x1) − G(x3)];
4. for any x1 > x2 ≥ x3 > x4, [F(x1) − F(x2)][G(x3) − G(x4)] ≥ [F(x3) − F(x4)][G(x1) − G(x2)].
3 The Main Theorem
Well-known stochastic dominance relations, such as FOSD, SOSD and Blackwell’s com-
parison of experiments, have equivalent properties expressed in terms of expectations.
⁶ A measure ν is absolutely continuous w.r.t. another measure µ if for any measurable set A, µ(A) = 0 implies ν(A) = 0.
⁷ We follow the literature and ignore issues related to sets of points whose Lebesgue probability is 0.
⁸ Again, ignoring a set of points with Lebesgue measure 0.
The following theorem, which is the main result of the paper, gives a condition equivalent to F ≽_SSD G, phrased in terms of expectations (i.e., decision-theoretic terms).⁹
The proof can be found in the Appendix.
Theorem 1. Consider two distribution functions F, G on R. Then, F ≽_SSD G if and only if for every bounded, non-negative measurable function h and every bounded and increasing function u,

E_F(uh) E_G(h) ≥ E_F(h) E_G(uh).  (3)
The significance of the equivalent condition provided in Theorem 1 stems from the fact that one can abstract away from the underlying distributions and focus directly on expectations, which are very often tied to significant consequences of the stochastic dominance relation. This statement is supported by the variety of applications provided below. Beyond mild conditions like boundedness, monotonicity, non-negativity and measurability, Theorem 1 imposes no other requirements on the functions u and h. This allows us considerable latitude in specifying u and h, which makes a wide range of applications possible.
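As a numerical sanity check of Eq. (3) (a sketch of ours; the mass functions and the randomly sampled test functions are assumptions chosen for illustration), one can draw bounded increasing functions u and non-negative functions h on a finite support:

```python
import random

random.seed(0)
xs = [0, 1, 2, 3]
F = [0.05, 0.15, 0.30, 0.50]  # mass functions with F/G increasing in x
G = [0.40, 0.30, 0.20, 0.10]

def E(masses, vals):
    """Expectation of a function (given by its values on xs) under a mass function."""
    return sum(m * v for m, v in zip(masses, vals))

for _ in range(1000):
    u = sorted(random.uniform(-1, 1) for _ in xs)   # bounded increasing u
    h = [random.uniform(0, 1) for _ in xs]          # bounded non-negative h
    uh = [ui * hi for ui, hi in zip(u, h)]
    assert E(F, uh) * E(G, h) >= E(G, uh) * E(F, h) - 1e-12
print("Eq. (3) holds for all sampled (u, h)")
```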
As an illustration, we show that the following implications, which are well-known results under MLRP (see, for instance, Theorem 1.C.6 in Shaked & Shanthikumar (2007) and Theorem 1.4.3 in Muller & Stoyan (2002)), can be easily extended by directly applying the equivalent condition (3).
Implications. First, notice that Eq. (3) implies F FOSD G. Indeed, by setting h ≡ 1, Eq. (3) becomes E_F(u) ≥ E_G(u) for any bounded and increasing function u. This is equivalent to F FOSD G.
Second, Theorem 1 implies that if F ≽_SSD G, then the expectation of any monotone function, conditional on any subset of the support with positive measure, is greater under F. Formally,
Corollary 1. F ≽_SSD G if and only if for any bounded and increasing function u and any measurable set A ⊆ R with a positive probability w.r.t. both F and G,

E_F(u|A) ≥ E_G(u|A).  (4)

Proof. To show the "only if" part, let 1_A be the indicator function of A. Take h(x) = 1_A(x) in Theorem 1. Then

E_F(u|A) = E_F(u 1_A)/E_F(1_A) ≥ E_G(u 1_A)/E_G(1_A) = E_G(u|A),
9 The equivalent condition in Eq. (3) can also be derived directly from the MLRP condition.
However, to the best of our knowledge, this condition has not been stated explicitly and used in the
literature even in the case of MLRP.
as claimed.
The proof of the “if” part is relegated to the Appendix.
Note that, for any measurable set A of positive probability, the statement that Eq. (4) holds for every bounded and increasing function u is equivalent to F FOSD G conditional on A (i.e., F|A FOSD G|A). Thus,
Corollary 2. F ≽_SSD G if and only if for any measurable subset A ⊆ R with a positive probability w.r.t. both F and G, F|A FOSD G|A.
When F dominates G in the sense of MLRP, it is a well-known result that F|A FOSD G|A for any measurable set A ⊆ R with positive probability (see Theorem 1.C.6 of Shaked & Shanthikumar (2007)). Our result shows that a similar property holds under the weaker notion of SSD. Corollary 2 also suggests that our notion of SSD implies the "uniform conditional stochastic order"¹⁰ due to Whitt (1980).
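On a finite support, the characterization in Corollary 2 can be checked exhaustively over every conditioning set A (a sketch of ours; the mass functions are hypothetical examples):

```python
from itertools import combinations

xs = [0, 1, 2, 3]
F = [0.05, 0.15, 0.30, 0.50]  # F/G increasing, so F SSD-dominates G
G = [0.40, 0.30, 0.20, 0.10]

def conditional_cdf(masses, idx):
    """Running CDF of the distribution conditioned on the points xs[i], i in idx."""
    total = sum(masses[i] for i in idx)
    acc, out = 0.0, []
    for i in idx:
        acc += masses[i] / total
        out.append(acc)
    return out

ok = True
for r in range(1, len(xs) + 1):
    for idx in combinations(range(len(xs)), r):  # every conditioning set A
        cF = conditional_cdf(F, idx)
        cG = conditional_cdf(G, idx)
        # F|A FOSD G|A: the CDF under F lies weakly below that under G
        ok = ok and all(a <= b + 1e-12 for a, b in zip(cF, cG))
print(ok)  # True
```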
Another immediate implication of Theorem 1 is that the SSD relation is preserved when conditioning on any subset with positive probability.
Corollary 3. If F ≽_SSD G, then for any measurable subset A of the support with a positive probability w.r.t. both F and G, F|A ≽_SSD G|A.
Proof. It suffices to show that F ≽_SSD G implies that for any bounded and increasing function u, and any non-negative, bounded measurable function h, both defined on A, the inequality E_F(uh|A) E_G(h|A) ≥ E_G(uh|A) E_F(h|A) holds. This is equivalent to

[E_F(ū h̄)/E_F(1_A)] · [E_G(h̄)/E_G(1_A)] ≥ [E_G(ū h̄)/E_G(1_A)] · [E_F(h̄)/E_F(1_A)],

or

E_F(ū h̄) E_G(h̄) ≥ E_G(ū h̄) E_F(h̄),

where h̄(x) = h(x) if x ∈ A and h̄(x) = 0 if x ∉ A, and ū is an extension of u (which is defined on A) to the entire domain such that ū is increasing and coincides with u on the set A.¹¹
This inequality holds due to Theorem 1: ū is increasing and bounded and h̄ is non-negative, bounded and measurable.
¹⁰ We say that one probability measure is less than or equal to another in the sense of uniform conditional stochastic order (UCSO) if FOSD holds for each pair of conditional probability measures obtained by conditioning on certain appropriate subsets. The notion was first introduced in Whitt (1980) and extended further to the multivariate case in Whitt (1982). See also Rinott & Scarsini (2006) for a connection with the total positivity order.
¹¹ Clearly, such a monotone extension exists. For instance, for any x ∉ A, define ū(x) := sup_{x′∈A, x′<x} u(x′) if the set {x′ ∈ A | x′ < x} is not empty; otherwise, define ū(x) := M, where M is a constant satisfying M < inf_{x′∈A} u(x′) (such an M exists because u is bounded). One can check that the function ū thus defined is indeed increasing on the entire domain.
Corollary 4. Suppose F_n ≽_SSD F_{n−1} ≽_SSD ··· ≽_SSD F_1. Let p = (p_1, ..., p_n) and q = (q_1, ..., q_n) be probability distributions over {1, ..., n} with p ≽_SSD q. Then Σ_{i=1}^n p_i F_i ≽_SSD Σ_{i=1}^n q_i F_i.

Proof. Let F_p := Σ_{i=1}^n p_i F_i and F_q := Σ_{i=1}^n q_i F_i. W.l.o.g., assume that for every i ∈ {1, ..., n}, p_i and q_i are not both zero; otherwise F_i is not in the support of p and q and one can ignore such F_i. The assumption p ≽_SSD q implies that p_i/q_i is increasing in i. Hence, for every bounded, non-negative measurable function h such that E_{F_i}(h) > 0 for at least one i,¹²

( p_i E_{F_i}(h) / Σ_{j=1}^n p_j E_{F_j}(h) )_i  FOSD  ( q_i E_{F_i}(h) / Σ_{j=1}^n q_j E_{F_j}(h) )_i .

Denote I := {i ∈ {1, ..., n} | E_{F_i}(h) > 0}. The fact that F_n ≽_SSD F_{n−1} ≽_SSD ··· ≽_SSD F_1 implies that for any bounded, increasing function u, E_{F_i}(uh)/E_{F_i}(h) is increasing in i, i ∈ I. It follows that

Σ_{i∈I} [ p_i E_{F_i}(h) / Σ_{j=1}^n p_j E_{F_j}(h) ] · [ E_{F_i}(uh)/E_{F_i}(h) ] ≥ Σ_{i∈I} [ q_i E_{F_i}(h) / Σ_{j=1}^n q_j E_{F_j}(h) ] · [ E_{F_i}(uh)/E_{F_i}(h) ],

which implies (noting that for i ∉ I, p_i E_{F_i}(uh) / Σ_{j=1}^n p_j E_{F_j}(h) = 0 and q_i E_{F_i}(uh) / Σ_{j=1}^n q_j E_{F_j}(h) = 0)

Σ_{i=1}^n p_i E_{F_i}(uh) / Σ_{j=1}^n p_j E_{F_j}(h) ≥ Σ_{i=1}^n q_i E_{F_i}(uh) / Σ_{j=1}^n q_j E_{F_j}(h),

or E_{F_p}(uh)/E_{F_p}(h) ≥ E_{F_q}(uh)/E_{F_q}(h). Hence F_p ≽_SSD F_q.
Note that Corollary 4 may not hold if we assume p FOSD q instead. The following is a counterexample. Let F_1 be the uniform distribution on [0, 1]; let F_2 be degenerate at x = 1; and let F_3 be uniform on [1, 2]. Consider p = (0, 1/2, 1/2) and q = (1/2, 0, 1/2). In this case, we have p FOSD q but p ⋡_SSD q. One can verify that Σ_{i=1}^3 p_i F_i ⋡_SSD Σ_{i=1}^3 q_i F_i.
4 Bayesian Learning and Dynamics
Our main result, Theorem 1, has natural applications to Bayesian learning and dynamic
decision making. To see this, consider a decision maker (DM) who faces an unknown
state of nature p with support [0, 1]. The DM’s prior belief about p is captured by a
CDF F (p). Depending on the true state of nature, in each period, a random outcome
(either “U” or “D”) is generated with probability P(U) = p. Assume the random
outcomes are independent.
¹² The case in which E_{F_i}(h) = 0 for all i ∈ {1, ..., n} is trivial, since clearly E_{F_p}(uh) E_{F_q}(h) ≥ E_{F_q}(uh) E_{F_p}(h) holds. We therefore focus on the case in which E_{F_i}(h) > 0 for at least one i.
The parameter p may correspond, for instance, to the overall performance of a
company, and one may interpret “U” and “D” as upward and downward movements of
a company’s stock price, as in Cox et al. (1979). The parameter p may also correspond
to the failure rate of a machine. A new machine that functions properly has a known
failure rate of p0, published by the producer. A user keeps track of the performance of the machine and its failure history, and updates accordingly his expectation about the prospect of future failures. In this context, when the user has a prior on failure rates that allows a variety of possibilities, he will focus particularly on p0. It is then natural to assume that the prior over p's assigns a positive probability to p0.
In another setting, “U” and “D” may also be regarded as increase and decrease in
a firm’s output or sales, or “thumbs up” and “thumbs down” in the context of binary
consumer rating (Example 1 below).
A history h^t of t observations is an ordered sequence in {D, U}^t. Denote the set of all histories of length t by ℋ^t := {D, U}^t and denote a subset of ℋ^t by H^t. Conditional on the true state of nature being p, we denote the probability of observing a particular history h^t by B(p; h^t).¹³ The probability of observing a history from a given set H^t is therefore B(p; H^t) = Σ_{h^t ∈ H^t} B(p; h^t).
We say that a distribution F is non-degenerate if the probability that p ∈ (0, 1) (i.e., not 0 or 1) is positive. Otherwise, we say that the distribution F is degenerate. Formally, F is non-degenerate if and only if E_F(1_{{0,1}}) < 1. Note that if F is non-degenerate, then the probability of any set H^t of histories is positive: P_F(H^t) > 0. Moreover, F is degenerate if and only if F(p) = F(0) for every p ∈ (0, 1).
The following theorem shows that F ≽_SSD G is equivalent to F FOSD G conditional on any set of histories that has a positive probability. The case in which either F or G is degenerate is treated in Section 6.2.
Theorem 2. Two non-degenerate priors F and G satisfy F ≽_SSD G if and only if for any t and any H^t ⊆ ℋ^t, F|H^t FOSD G|H^t.
Proof. We prove here the “only if” direction using Theorem 1. The proof of the “if”
direction is relegated to the Appendix.
Assume that F ≽_SSD G and take any p ∈ (0, 1).¹⁴ We have

1 − F(p|H^t) = E_F(B(p; H^t) · 1_{(p,1]}) / E_F(B(p; H^t)) ≥ E_G(B(p; H^t) · 1_{(p,1]}) / E_G(B(p; H^t)) = 1 − G(p|H^t).

This inequality holds due to Theorem 1: 1_{(p,1]} is an increasing function on [0, 1] and B(p; H^t) is non-negative.
¹³ If h^t has k observations of outcome "U", then B(p; h^t) = p^k (1 − p)^{t−k}.
¹⁴ At the risk of abusing the notation, we use here the same symbol "p" for both the random variable and a specific value of the random variable.
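The "only if" direction of Theorem 2 can be illustrated numerically (a sketch of ours; the three-point prior supports and masses are assumed for the example, and singleton history sets are used for brevity even though the theorem covers arbitrary subsets H^t):

```python
from itertools import product

ps = [0.2, 0.5, 0.8]          # possible states of nature p
F = [0.1, 0.3, 0.6]           # prior masses with F/G increasing, so F SSD G
G = [0.6, 0.3, 0.1]

def B(p, hist):
    """Probability of one particular history (a tuple of 'U'/'D') given p."""
    k = hist.count('U')
    return p ** k * (1 - p) ** (len(hist) - k)

def posterior(prior, hists):
    """Posterior masses on ps after learning that the history lies in hists."""
    num = [w * sum(B(p, h) for h in hists) for w, p in zip(prior, ps)]
    z = sum(num)
    return [n / z for n in num]

ok = True
for t in (1, 2, 3):
    for hist in product('UD', repeat=t):
        postF, postG = posterior(F, [hist]), posterior(G, [hist])
        cF = cG = 0.0
        for wF, wG in zip(postF[:-1], postG[:-1]):
            cF, cG = cF + wF, cG + wG
            ok = ok and cF <= cG + 1e-12   # F|h FOSD G|h
print(ok)  # True
```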
As a special case of Theorem 2, we have F|h^t FOSD G|h^t for any history h^t. It is worth noting that in Theorem 2 the set of histories H^t can be any nonempty subset of ℋ^t. In particular, H^t does not have to contain all histories with the same number of "U". When F ≽_MLRP G, this result is well known and can be easily derived. Here, our purpose is to demonstrate that it can be extended to SSD using the equivalent condition we characterized in Theorem 1.
Theorem 2 has natural applications to dynamic decision problems in which the
DM’s utility depends on the unknown parameter p, but is unaffected by the sequence
of noisy signals. The signals only provide information about p. In particular, when
the DM’s utility function is increasing in p, Theorem 2 implies that his expected util-
ity conditional on any set of histories of observations is greater under the strongly
stochastically dominating prior.
Example 1. A store starts to operate through an online platform. Suppose consumers
arrive sequentially to purchase from the store. Let p, p ∈ [0, 1], denote the unknown
probability that a random consumer is satisfied with the store’s service and leaves a
positive feedback.15 We may regard the parameter p as a measure of the quality of the
store’s service. Let F be a consumer’s prior belief about the store’s quality, starting
with no observation of past feedbacks.
A history is an ordered sequence of feedbacks. The online platform, however, may not disclose the entire history of realized feedbacks. Instead, it often discloses some statistics of the realized history, or a particular set of feedback histories to which the realized one belongs. For instance, some platforms only disclose the feedbacks from the last T periods; others only disclose the total numbers of positive and negative feedbacks accumulated so far. In other words, each disclosure policy of feedbacks corresponds to a subset H^t ⊆ ℋ^t that contains the realized history.
What Theorem 2 tells us in this setup is the following: regardless of the disclosure policy the online platform adopts, as long as one consumer's prior belief dominates another prior in the sense of SSD, the posterior beliefs about the store's quality preserve the FOSD relation after observing the disclosed information. Consequently, if the consumers' utility functions are increasing in the service quality, those consumers with more favorable prior beliefs (in the sense of SSD) will expect to obtain higher utility from the store's service.
In some dynamic decision problems, however, a DM’s utility does hinge directly
on the history of signals. For instance, consider a firm that faces an unknown market
condition. The realized demand each period serves as a signal about the underlying
15Assume for simplicity that all consumers rate automatically and that feedbacks are binary: either
a positive or a negative feedback.
market condition, but at the same time, the firm's profit is directly affected by its demand level. In such problems, it might be useful to examine the probability distributions over a given set of histories induced by different prior beliefs in order to evaluate and compare the DM's expected utility. This is the motivation for our next result, Proposition 3 below.
Before we state and prove the proposition, notice first that with binary outcomes, we can order the elements of ℋ^t by the number of outcome "U" observed. Let h^t, h̄^t ∈ ℋ^t be two histories of length t. We say that h^t ranks above h̄^t, and write h^t ≽ h̄^t, if h^t contains (weakly) more "U"s than h̄^t.¹⁶ We then regard ℋ^t as a well-ordered set. For any subset H^t, its elements inherit the same ordering as in ℋ^t, and it is therefore meaningful to compare two posteriors over H^t in the sense of stochastic dominance.
Fix a set of histories H^t ⊆ ℋ^t. A history h̄^t partitions H^t into two disjoint subsets: H^t_+(h̄^t) := {h^t ∈ H^t | h^t ≽ h̄^t} and H^t_−(h̄^t) := {h^t ∈ H^t | h̄^t ≻ h^t}. The subset H^t_+(h̄^t) consists of those histories in H^t that are ranked above h̄^t (and including h̄^t itself), whereas H^t_−(h̄^t) consists of those that rank below it. Similarly, define another set ℋ^t_+(h̄^t) := {h^t ∈ ℋ^t | h^t ≽ h̄^t}. Clearly, H^t_+(h̄^t) = ℋ^t_+(h̄^t) ∩ H^t.
The following proposition establishes the relation between SSD and FOSD over any set of histories of the same length.

Proposition 3. Let F and G be two non-degenerate prior beliefs over [0, 1] such that F ≽_SSD G. Then for any set of histories H^t ⊆ ℋ^t, the probability distribution over H^t under F FOSD that under G. That is, for any h̄^t ∈ H^t, P_F(H^t_+(h̄^t)|H^t) ≥ P_G(H^t_+(h̄^t)|H^t).¹⁷
Proof. Let H^t ⊆ ℋ^t be a set of histories and h̄^t a particular history in H^t. The distribution function on H^t under F is therefore P_F(H^t_−(h̄^t)|H^t) and the corresponding
¹⁶ Note that given any prior distribution, the probabilities of observing two histories h^t, h̄^t that have the same number of outcome "U" are equal. In this case the two histories are equivalent.
¹⁷ We use P_F and P_G to denote also the probability functions on histories induced by F and G, respectively.
survival function is P_F(H^t_+(h̄^t)|H^t) = 1 − P_F(H^t_−(h̄^t)|H^t). With these notations,

P_F(H^t_+(h̄^t)|H^t) = P_F(H^t_+(h̄^t) ∩ H^t) / P_F(H^t)
= P_F(ℋ^t_+(h̄^t) ∩ H^t) / P_F(H^t)
= E_F(B(p; ℋ^t_+(h̄^t)) · B(p; H^t)) / E_F(B(p; H^t))
≥ E_G(B(p; ℋ^t_+(h̄^t)) · B(p; H^t)) / E_G(B(p; H^t))
= P_G(H^t_+(h̄^t)|H^t).

The inequality holds due to Theorem 1: B(p; ℋ^t_+(h̄^t)) is increasing in p (see Lemma 5 in the Appendix for a proof) and B(p; H^t) is non-negative. This completes the proof.
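Proposition 3 can be checked on a discrete example by comparing the ex-ante distributions of the number of "U"s over ℋ^t (a sketch of ours; the three-point priors are hypothetical):

```python
from math import comb

ps = [0.2, 0.5, 0.8]
F = [0.1, 0.3, 0.6]   # F/G increasing, so F SSD-dominates G
G = [0.6, 0.3, 0.1]
t = 4

def prob_k(prior, k):
    """Ex-ante probability of observing exactly k 'U's in t periods."""
    return sum(w * comb(t, k) * p ** k * (1 - p) ** (t - k) for w, p in zip(prior, ps))

# FOSD over H^t = all histories of length t, ordered by the number of 'U's:
ok = all(
    sum(prob_k(F, j) for j in range(k, t + 1))
    >= sum(prob_k(G, j) for j in range(k, t + 1)) - 1e-12
    for k in range(t + 1)
)
print(ok)  # True
```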
Example 2. Consider a DM who faces a one-armed bandit machine that yields a payoff of 1 or −1 in each period with unknown probabilities p and 1 − p, respectively. Let F be a prior belief on [0, 1]. In each period, the DM decides whether to switch to a safe arm with a known payoff forever, or to pay a fixed cost to continue with the risky arm. Suppose that the DM follows a particular stopping strategy. Proposition 3 implies the following: ex ante, conditional on the DM not having switched to the safe arm by a particular period t (this event corresponds to a set of histories of length t), his expected discounted net payoff in that period is greater under a strongly stochastically dominating prior belief.
Theorem 2 and Proposition 3 demonstrate that in the setup of Bayesian learning,
the SSD relation between two prior beliefs is related to two different conditions. On
the one hand, SSD is equivalent to the FOSD relation of posterior beliefs conditional
on any set of histories of the same length. On the other hand, it implies the FOSD
relation over any set of histories of the same length. The natural question is whether
these three conditions are equivalent. The answer is affirmative. We relegate the proof
of the following result to the Appendix.
Theorem 3. Let F and G be two non-degenerate priors. The following conditions are equivalent:

1. F ≽SSD G;

2. for any t and any H^t ⊆ ℋ^t, F|H^t FOSD G|H^t;

3. for any t and any H^t ⊆ ℋ^t, the probability distribution over H^t under F FOSD that under G.
As a final application of Theorem 1 in the setup of learning and dynamics, consider
the expected probability of observing outcome “U” in the next period conditional on
any given set of length-t histories H t. This expected probability is simply
$$E_F(p \mid H^t) = \frac{E_F\big(p\,B(p;H^t)\big)}{E_F\big(B(p;H^t)\big)}.$$
Suppose F ≽SSD G. Then, by taking u(p) = p and h(p) = B(p;H^t), it follows from
Theorem 1 that
EF (p|H t) ≥ EG(p|H t). (5)
This implies that the expected probability of receiving outcome U conditional on any
set of histories H t is greater under a strongly stochastically dominating prior. If we
regard (1− EF (p|H t),EF (p|H t)) and (1− EG(p|H t),EG(p|H t)) as probability distri-
butions on {D,U} under the respective prior beliefs F and G, Eq. (5) also implies
that the distribution on outcomes {D,U} under F FOSD that under G. This echoes
Theorem 1 of Bikhchandani et al. (1992).18
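Eq. (5) can likewise be illustrated numerically. In the sketch below the two priors, the convex map φ(y) = y², and the history set H^t are illustrative assumptions; the check is that the posterior mean E(p | H^t) is larger under the SSD-dominating prior.

```python
import math

# Illustrative discrete priors: F = phi(G) with phi(y) = y**2 convex,
# so F SSD-dominates G.  H^t is taken to be all length-t histories with
# between 3 and 6 outcomes "U"; all of these choices are assumptions.
grid = [(i + 1) / 10 for i in range(9)]
n = len(grid)
g_cdf = [(i + 1) / n for i in range(n)]
f_cdf = [y ** 2 for y in g_cdf]
f_pmf = [f_cdf[0]] + [f_cdf[i] - f_cdf[i - 1] for i in range(1, n)]
g_pmf = [g_cdf[0]] + [g_cdf[i] - g_cdf[i - 1] for i in range(1, n)]

t = 8
def B(p):   # B(p; H^t) for the chosen history set
    return sum(math.comb(t, k) * p**k * (1 - p)**(t - k) for k in range(3, 7))

def post_mean(w):
    num = sum(wi * p * B(p) for p, wi in zip(grid, w))
    den = sum(wi * B(p) for p, wi in zip(grid, w))
    return num / den          # E(p B(p; H^t)) / E(B(p; H^t))

ef, eg = post_mean(f_pmf), post_mean(g_pmf)
assert ef >= eg               # Eq. (5)
print(round(ef, 4), round(eg, 4))
```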
5 Other Economic Applications: Extensions of Existing Results
In this section we demonstrate more applications of the equivalent condition in Theorem
1. A few of these applications, including the first one (monotone comparative statics),
extend existing models and show that the corresponding results can be proven by using
the condition given in Theorem 1.
5.1 Monotone Comparative Statics
One of the most important roles of MLRP in economic applications is to derive com-
parative statics results, in particular, in stochastic optimization problems (e.g., Karlin
et al. (1956), Athey (2002)). The seminal paper Athey (2002) introduced conditions
under which a decision maker’s optimal actions are monotone w.r.t. an exogenous pa-
rameter, which determines the distribution of a payoff-relevant state. One fundamental
result of Athey (2002) is that the single-crossing property of the utility function, com-
bined with a family of distributions ordered by MLRP, guarantees the monotonicity
18Bikhchandani et al. (1992) focus on the expected probability distribution over the signal space
conditional on any single history, rather than on any set of histories. In addition, the state space
in Bikhchandani et al. (1992) is restricted to be finite. In their paper, the authors also consider the
more general situation in which the signal space has more than two elements. They show that extra
structures on prior beliefs are needed in order to obtain similar results.
of the optimal actions. In this subsection we show that this result can be generalized
to the family of distributions (atomic and non-atomic alike), ordered by SSD. This is
done by using the equivalent condition stated in Theorem 1.
Consider a utility function u : X × S → R, where X ⊆ R is the action set and S ⊆ R is the set of unknown states. Assume u(x, s) satisfies the single-crossing property in
(x, s).19 Let {Ft(s)}t∈T be a family of CDFs on S parameterized by t, where T is a
partially ordered set. Define
$$X^*(t, B) := \operatorname*{argmax}_{x \in B} \int_S u(x, s)\,dF_t(s).$$
This is the set of optimal actions,20 given t and the constraint B. We say that X∗(t, B)
satisfies the monotone comparative statics if X∗(t, B) is increasing in t and B in the
sense of strong set order. The following result is due to Athey (2002) (Theorem 2):
Theorem 4. If a measurable function u satisfies the single-crossing property in (x, s)
and the family {Ft(s)}t∈T is ordered by MLRP,21 then X∗(t, B) satisfies the monotone
comparative statics.
We show that a similar result holds under the weaker condition that the family of
distributions {Ft(s)}t∈T is ordered by SSD.
Theorem 5. If a measurable function u satisfies the single-crossing property in (x, s)
and the family {Ft(s)}t∈T is ordered by SSD, then X∗(t, B) satisfies the monotone
comparative statics.
Proof. Since u(x, s) satisfies the single-crossing property in (x, s), if we take any x_H, x_L ∈ X with x_H > x_L, the function g(s) := u(x_H, s) − u(x_L, s) satisfies the single-crossing property in s. Hence there exist s_0, s_1 ∈ S with s_1 ≥ s_0 such that g(s) ≤ 0 for all s ≤ s_0 (with g(s) < 0 for all s < s_0), and g(s) ≥ 0 for all s ≥ s_1 (with g(s) > 0 for all s > s_1). Define G(t) := ∫_S g(s) dF_t(s) = E_{F_t}(g). We first show that G(t) satisfies the single-crossing property in t.

Let g⁺ := max{g, 0} and g⁻ := −min{g, 0}. Then G(t) = E_{F_t}(g⁺) − E_{F_t}(g⁻). It suffices to show that if t_H > t_L, t_H, t_L ∈ T, then E_{F_{t_H}}(g) ≤ (<) 0 implies E_{F_{t_L}}(g) ≤ (<) 0, or, equivalently, that E_{F_{t_H}}(g⁺) ≤ (<) E_{F_{t_H}}(g⁻) implies E_{F_{t_L}}(g⁺) ≤ (<) E_{F_{t_L}}(g⁻).
19 Let X ⊆ R. A function g : X → R satisfies the single-crossing property in x if there exist x_0, x_1 ∈ [inf X, sup X] with x_0 ≤ x_1 such that g(x) < 0 (g(x) ≤ 0) for all x ∈ X with x < x_0 (x ≤ x_0), g(x) > 0 (g(x) ≥ 0) for all x ∈ X with x > x_1 (x ≥ x_1), and g(x) = 0 for x ∈ [x_0, x_1]. A function of two variables u : X × S → R satisfies the single-crossing property in (x, s) if for all x_H > x_L, x_H, x_L ∈ X, the function g(s) := u(x_H, s) − u(x_L, s) satisfies the single-crossing property in s.

20 Assume maximizers exist. The focus is on sufficient conditions that guarantee monotonicity of the set of maximizers, rather than existence of maximizers.

21 Namely, for any t_H, t_L ∈ T with t_H > t_L, F_{t_H} ≽MLRP F_{t_L}. This is equivalent to the density f_t(s) being log-supermodular in (s, t).
Since F_{t_H} ≽SSD F_{t_L}, |g| is non-negative, and 1_{[s_1,∞)∩S} and 1_{(−∞,s_0]∩S} are increasing and decreasing on S, respectively, it follows from Theorem 1 that

$$E_{F_{t_L}}\big(|g|\,1_{[s_1,\infty)\cap S}\big)\,E_{F_{t_H}}(|g|) \le E_{F_{t_H}}\big(|g|\,1_{[s_1,\infty)\cap S}\big)\,E_{F_{t_L}}(|g|), \tag{6}$$

$$E_{F_{t_L}}\big(|g|\,1_{(-\infty,s_0]\cap S}\big)\,E_{F_{t_H}}(|g|) \ge E_{F_{t_H}}\big(|g|\,1_{(-\infty,s_0]\cap S}\big)\,E_{F_{t_L}}(|g|). \tag{7}$$

On the other hand, the assumption E_{F_{t_H}}(g⁺) ≤ (<) E_{F_{t_H}}(g⁻) is equivalent to

$$E_{F_{t_H}}\big(|g|\,1_{[s_1,\infty)\cap S}\big)\,E_{F_{t_L}}(|g|) \le (<)\; E_{F_{t_H}}\big(|g|\,1_{(-\infty,s_0]\cap S}\big)\,E_{F_{t_L}}(|g|). \tag{8}$$

Combining Eqs. (6), (7) and (8), we conclude that

$$E_{F_{t_L}}\big(|g|\,1_{[s_1,\infty)\cap S}\big)\,E_{F_{t_H}}(|g|) \le (<)\; E_{F_{t_L}}\big(|g|\,1_{(-\infty,s_0]\cap S}\big)\,E_{F_{t_H}}(|g|),$$

which implies that E_{F_{t_L}}(g⁺) ≤ (<) E_{F_{t_L}}(g⁻). This establishes that G(t) is single-crossing in t, and hence U(x, t) := ∫_S u(x, s) dF_t(s) is single-crossing in (x, t).

It follows from a well-known monotonicity theorem due to Milgrom & Shannon (1994) (Theorem 4)²² that X*(t, B) is increasing in (t, B) in the sense of strong set order. This completes the proof of Theorem 5.
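A small numerical sketch of Theorem 5 (not from the paper; the quadratic payoff, the action grid, and the discrete SSD-ordered beliefs are all illustrative assumptions): the difference u(x_H, s) − u(x_L, s) below is increasing in s, hence single-crossing, and the optimal action comes out weakly higher under the dominating belief.

```python
# Two discrete beliefs over the state s, ordered by SSD: F = phi(G) with
# phi(y) = y**2 convex.  The payoff and the action grid are assumptions.
states = [(i + 1) / 10 for i in range(9)]
n = len(states)
g_cdf = [(i + 1) / n for i in range(n)]
f_cdf = [y ** 2 for y in g_cdf]
f_pmf = [f_cdf[0]] + [f_cdf[i] - f_cdf[i - 1] for i in range(1, n)]
g_pmf = [g_cdf[0]] + [g_cdf[i] - g_cdf[i - 1] for i in range(1, n)]

def u(x, s):
    return x * s - 0.5 * x * x     # u(xH, .) - u(xL, .) is increasing in s

actions = [i / 20 for i in range(21)]              # the constraint set B

def best_action(pmf):
    return max(actions,
               key=lambda x: sum(w * u(x, s) for s, w in zip(states, pmf)))

xF, xG = best_action(f_pmf), best_action(g_pmf)
assert xF >= xG                    # monotone comparative statics
print(xF, xG)
```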
5.2 Production Expansion
Production Expansion. Consider a monopolistic firm facing an inverse demand function p(Q) = a − bQ, where p is the market price of its product, Q is the firm's output choice, and b is a positive constant. The intercept a is an uncertain parameter with support [a̲, ā], which captures the uncertain level of market demand. For each realized a, the firm chooses output Q to maximize its net profit (p(Q) − c)Q, where c is the constant marginal cost. However, due to fixed factory size, capital stock, etc., the firm's total output level is bounded by a fixed capacity Q̄, which prevents the firm from capturing extra demand once the realized market demand exceeds a certain level. With the constraint Q̄, the firm's optimal output strategy when the demand level is a is

$$Q^*(a) := \begin{cases} \dfrac{1}{2b}(a - c), & \text{if } a \in [\underline{a}, \hat{a}),\\[4pt] \bar{Q}, & \text{if } a \in [\hat{a}, \bar{a}], \end{cases}$$

where the threshold â is pinned down by Q̄ = (1/2b)(â − c). The corresponding optimal profit function is

$$\pi^*(a) := \begin{cases} \dfrac{1}{4b}(a - c)^2, & \text{if } a \in [\underline{a}, \hat{a}),\\[4pt] \dfrac{1}{2b}(\hat{a} - c)(a - c) - \dfrac{1}{4b}(\hat{a} - c)^2, & \text{if } a \in [\hat{a}, \bar{a}]. \end{cases}$$
22 The theorem states the following. Let f : X × T → R, where X is a lattice and T is a partially ordered set, and let B ⊆ X. Then argmax_{x∈B} f(x, t) is increasing in (t, B) if and only if f is quasi-supermodular in x and satisfies the single-crossing property in (x, t). For our case, where X ⊆ R, the requirement that f is quasi-supermodular in x is satisfied.
The management of the firm is considering whether it is worthwhile to invest a certain amount of money to expand production so as to remove the output constraint.²³ Suppose the firm decides in favor of executing such an investment. The optimal output and profit then become Q**(a) := (1/2b)(a − c) and π**(a) := (1/4b)(a − c)², respectively, for all a ∈ [a̲, ā]. In particular, π**(a) = u(a)π*(a), where

$$u(a) = \begin{cases} 1, & \text{if } a \in [\underline{a}, \hat{a}),\\[6pt] \dfrac{(a-c)/(\hat{a}-c)}{\,2 - (\hat{a}-c)/(a-c)\,}, & \text{if } a \in [\hat{a}, \bar{a}], \end{cases}$$

is a non-negative, bounded and weakly increasing function of a.

Now consider two different beliefs F and G about the uncertain demand parameter a, where F ≽SSD G. Theorem 1 implies

$$\frac{E_F(\pi^{**}(a))}{E_F(\pi^*(a))} = \frac{E_F(u(a)\pi^*(a))}{E_F(\pi^*(a))} \ge \frac{E_G(u(a)\pi^*(a))}{E_G(\pi^*(a))} = \frac{E_G(\pi^{**}(a))}{E_G(\pi^*(a))}.$$

Thus, the expected rate of return increases when the belief increases in the sense of SSD. It follows that, given the production expansion cost k, if the expansion is worthwhile under one belief, then it is also worthwhile under any belief that is more favorable in the sense of SSD.
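The rate-of-return comparison can be checked on a discretized example. The demand parameters, the capacity threshold, and the two beliefs below are illustrative assumptions, with F a convex transformation of G.

```python
# Illustrative monopoly parameters and two discrete beliefs about the demand
# intercept a; F = phi(G) with phi(y) = y**2 convex, so F SSD-dominates G.
b, c = 1.0, 1.0
a_hat = 3.0                                  # capacity binds for a >= a_hat
avals = [2.0 + i * 0.25 for i in range(9)]   # support of a: 2.0, ..., 4.0
n = len(avals)
g_cdf = [(i + 1) / n for i in range(n)]
f_cdf = [y ** 2 for y in g_cdf]
f_pmf = [f_cdf[0]] + [f_cdf[i] - f_cdf[i - 1] for i in range(1, n)]
g_pmf = [g_cdf[0]] + [g_cdf[i] - g_cdf[i - 1] for i in range(1, n)]

def profit_capped(a):    # pi*(a): optimal profit with the capacity constraint
    if a < a_hat:
        return (a - c) ** 2 / (4 * b)
    return (a_hat - c) * (a - c) / (2 * b) - (a_hat - c) ** 2 / (4 * b)

def profit_free(a):      # pi**(a): optimal profit without the constraint
    return (a - c) ** 2 / (4 * b)

def ratio(pmf):
    num = sum(w * profit_free(a) for a, w in zip(avals, pmf))
    den = sum(w * profit_capped(a) for a, w in zip(avals, pmf))
    return num / den     # expected rate of return on the expansion

rF, rG = ratio(f_pmf), ratio(g_pmf)
assert rF >= rG >= 1.0
print(round(rF, 4), round(rG, 4))
```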
The same conclusion can be reached by applying Theorem 5 directly. Suppose that a firm chooses between two actions: invest an amount k in order to expand production (x = 1), or not (x = 0). The net profit depends on the firm's action and the underlying state of demand a according to the following formula:

$$\Pi(x, a) := \begin{cases} \dfrac{1}{4b}(a - c)^2 - k, & \text{if } x = 1,\\[4pt] \dfrac{1}{4b}(a - c)^2, & \text{if } x = 0,\ a \in [\underline{a}, \hat{a}),\\[4pt] \dfrac{1}{2b}(\hat{a} - c)(a - c) - \dfrac{1}{4b}(\hat{a} - c)^2, & \text{if } x = 0,\ a \in [\hat{a}, \bar{a}]. \end{cases}$$

The function Π(x, a) satisfies the single-crossing property in (x, a). Indeed,

$$\Pi(1, a) - \Pi(0, a) = \begin{cases} -k, & \text{if } a \in [\underline{a}, \hat{a}),\\[4pt] \dfrac{1}{4b}(a - c)^2 - k - \dfrac{1}{2b}(\hat{a} - c)(a - c) + \dfrac{1}{4b}(\hat{a} - c)^2, & \text{if } a \in [\hat{a}, \bar{a}], \end{cases}$$

is strictly increasing in a for a ∈ [â, ā]. Therefore, if F ≽SSD G, then by Theorem 5, the optimal production expansion decision under F is weakly higher than the optimal decision under G.
23 When the production constraint is only partially relaxed (namely, when the firm can produce up to a level higher than the current constraint Q̄, but lower than the maximal market demand (1/2b)(ā − c)), the result still holds. In this case, one can easily check that π**(a)/π*(a) is increasing in a.
5.3 Securities Markets
Consider the following simple model of a securities market as in Milgrom (1981). There
are two security types: a riskless one that yields return 1, and a risky one that yields
a random return θ. Investors in the market are assumed to be homogeneous: all hold
1 unit of the riskless security and 1 unit of the risky security, and all share the same
concave and strictly increasing utility function of wealth U . Additionally, assume the
function U is differentiable and U ′(·) is bounded. We normalize the market price for
the riskless security to be 1 and let the price of the risky security be p. After observing
the security prices and taking into account budget constraints, each investor decides
how many units of each security to hold.
For the market to clear, the price p of the risky security must ensure that no trading takes place; namely, holding 1 unit of each kind of security is the equilibrium. More specifically, let each individual choose to hold x units of the risky security and 1 + p − xp units of the riskless security to maximize expected utility, given price p:

$$\max_{x \ge 0}\; E\big[U\big((1 + p - xp) + x\theta\big)\big] \quad \text{s.t.}\quad 1 + p - xp \ge 0.$$

To clear the market, x* = 1 must be the solution, so it has to satisfy the first-order condition

$$E\big[\theta\, U'\big((1 + p - x^*p) + x^*\theta\big)\big] = p\,E\big[U'\big((1 + p - x^*p) + x^*\theta\big)\big],$$

which can be written as

$$p = \frac{E\big[\theta\, U'(1 + \theta)\big]}{E\big[U'(1 + \theta)\big]}.$$

Take u(θ) = θ and h(θ) = U′(1 + θ) in Theorem 1. It follows that if F ≽SSD G, then p_F ≥ p_G, i.e., the equilibrium price of the risky security is higher under F.
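A numerical sketch of the pricing comparison (the utility U(w) = 2√w, the return grid, and the two discrete return distributions are illustrative assumptions):

```python
import math

# Two discrete distributions of the risky return theta, ordered by SSD:
# F = phi(G) with phi(y) = y**2 convex.  All parameters are assumptions.
thetas = [0.5 + i * 0.25 for i in range(9)]      # possible risky returns
n = len(thetas)
g_cdf = [(i + 1) / n for i in range(n)]
f_cdf = [y ** 2 for y in g_cdf]
f_pmf = [f_cdf[0]] + [f_cdf[i] - f_cdf[i - 1] for i in range(1, n)]
g_pmf = [g_cdf[0]] + [g_cdf[i] - g_cdf[i - 1] for i in range(1, n)]

def U_prime(w):
    # U(w) = 2*sqrt(w): concave, strictly increasing, U' bounded on the support
    return 1.0 / math.sqrt(w)

def price(pmf):
    num = sum(w * th * U_prime(1 + th) for th, w in zip(thetas, pmf))
    den = sum(w * U_prime(1 + th) for th, w in zip(thetas, pmf))
    return num / den     # p = E[theta U'(1+theta)] / E[U'(1+theta)]

pF, pG = price(f_pmf), price(g_pmf)
assert pF >= pG          # equilibrium price is higher under F
print(round(pF, 4), round(pG, 4))
```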
5.4 Value-at-Risk of a Portfolio
Let X be the random return of a portfolio. The ‘value-at-risk’ measurement is con-
cerned with the performance of the portfolio at a specific low percentile. For simplicity,
assume that both F and G are strictly increasing (but may possess atoms) on the same
interval in R. Suppose furthermore that F �SSD G.
Fix a probability p and denote c(F, p) := F⁻¹(p) and c(G, p) := G⁻¹(p). This means that P_F(X ≤ c(F, p)) = p and P_G(X ≤ c(G, p)) = p. As F ≽SSD G implies F FOSD G, it follows that c(F, p) ≥ c(G, p), and therefore E_G(X | X > c(F, p)) ≥ E_G(X | X > c(G, p)). Combining this with Corollary 1, which yields the first inequality below, we obtain

$$E_F\big(X \mid X > c(F, p)\big) \ge E_G\big(X \mid X > c(F, p)\big) \ge E_G\big(X \mid X > c(G, p)\big).$$

This implies that when F strongly stochastically dominates G, the performance of X above any percentile is better w.r.t. F than w.r.t. G.
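The percentile comparison can be verified on a discrete example; the return grid and the two distributions below are illustrative assumptions, with F a convex transformation of G.

```python
# Illustrative discrete portfolio-return distributions with F SSD-dominating G.
xs = [(i + 1) / 10 for i in range(9)]            # portfolio return values
n = len(xs)
g_cdf = [(i + 1) / n for i in range(n)]
f_cdf = [y ** 2 for y in g_cdf]                  # F = phi(G), phi convex
f_pmf = [f_cdf[0]] + [f_cdf[i] - f_cdf[i - 1] for i in range(1, n)]
g_pmf = [g_cdf[0]] + [g_cdf[i] - g_cdf[i - 1] for i in range(1, n)]

def quantile(cdf, p):                            # c(F, p) = F^{-1}(p)
    return next(x for x, c in zip(xs, cdf) if c >= p)

def tail_mean(pmf, cutoff):                      # E(X | X > cutoff)
    num = sum(w * x for x, w in zip(xs, pmf) if x > cutoff)
    den = sum(w for x, w in zip(xs, pmf) if x > cutoff)
    return num / den

p = 0.25
cF, cG = quantile(f_cdf, p), quantile(g_cdf, p)
assert cF >= cG                                  # F FOSD G
assert tail_mean(f_pmf, cF) >= tail_mean(g_pmf, cF) >= tail_mean(g_pmf, cG)
print(cF, cG)
```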
6 Final Comments
6.1 On Absolute Continuity
Suppose that F ≽SSD G. Denote x̲ := inf{x | F(x) > 0} and D := (x̲, ∞). The following proposition states that the probability measure induced by G is absolutely continuous w.r.t. that induced by F on D. In other words, every event in D with positive probability w.r.t. G has positive probability also w.r.t. F. This extends the observation that any atom of G which is greater than x̲ is also an atom of F. One can find examples where G has an atom at x̲ which is not an atom of F. For instance, consider F and G with support [0, 1], where F(x) = x for x ∈ [0, 1] and

$$G(x) = \begin{cases} \tfrac{1}{2}, & \text{if } x = 0,\\[2pt] \tfrac{1}{2} + \tfrac{1}{2}x, & \text{if } x \in (0, 1]. \end{cases}$$

Proposition 4. If F ≽SSD G, then P_G(·|D) is absolutely continuous w.r.t. P_F(·|D). The Radon–Nikodym derivative dP_G(·|D)/dP_F(·|D) is non-increasing, except on a set of measure 0 w.r.t. P_F(·|D).
The proof is relegated to the Appendix (A.9).
6.2 More on Dynamics and Learning — The Degenerate Case
Recall that Theorem 2 assumes that both priors F and G are non-degenerate. Here
we consider the case in which at least one of them is degenerate.
Proposition 5. Suppose that at least one of F and G is degenerate. Then, F ≽SSD G if and only if for any set of histories H^t with positive probability, F|H^t FOSD G|H^t.
The proof can be found in the Appendix (A.9).
References
Athey, S. (2002). Monotone comparative statics under uncertainty. The Quarterly
Journal of Economics, 117 (1), 187–223.
Bikhchandani, S., Segal, U., & Sharma, S. (1992). Stochastic dominance under Bayesian learning. Journal of Economic Theory, 56 (2), 352–377.
Blackwell, D. (1953). Equivalent comparisons of experiments. The Annals of Mathe-
matical Statistics, 24 (2), 265–272.
Chan, W., Proschan, F., & Sethuraman, J. (1991). Convex-ordering among functions, with applications to reliability and mathematical statistics. In Topics in Statistical Dependence (pp. 121–134). IMS Lecture Notes–Monograph Series, Vol. 16.
Chateauneuf, A., Cohen, M., & Meilijson, I. (2004). Four notions of mean-preserving
increase in risk, risk attitudes and applications to the rank-dependent expected utility
model. Journal of Mathematical Economics, 40 (5), 547–571.
Chew, S. H., Karni, E., & Safra, Z. (1987). Risk aversion in the theory of expected
utility with rank dependent probabilities. Journal of Economic Theory, 42 (2), 370–
381.
Cox, J. C., Ross, S. A., & Rubinstein, M. (1979). Option pricing: A simplified approach.
Journal of Financial Economics, 7 (3), 229–263.
Karlin, S., Rubin, H., et al. (1956). The theory of decision procedures for distributions
with monotone likelihood ratio. The Annals of Mathematical Statistics, 27 (2), 272–
299.
Milgrom, P. & Shannon, C. (1994). Monotone comparative statics. Econometrica,
62 (1), 157–180.
Milgrom, P. R. (1981). Good news and bad news: Representation theorems and appli-
cations. The Bell Journal of Economics, 12 (2), 380–391.
Müller, A. & Stoyan, D. (2002). Comparison methods for stochastic models and risks. Wiley, New York.
Rinott, Y. & Scarsini, M. (2006). Total positivity order and the normal distribution.
Journal of Multivariate Analysis, 97 (5), 1251–1261.
Shaked, M. & Shanthikumar, J. G. (2007). Stochastic orders. Springer Science &
Business Media.
Whitt, W. (1980). Uniform conditional stochastic order. Journal of Applied Probability,
17 (1), 112–123.
Whitt, W. (1982). Multivariate monotone likelihood ratio and uniform conditional
stochastic order. Journal of Applied Probability, 19 (3), 695–701.
A Appendix
A.1 Proof of Proposition 1
“If” Part. Suppose that, on the contrary, F ⋡SSD G. Then there exist x1 > x2 > x3
such that (see Lemma 1)
[F (x1)− F (x2)][G(x2)−G(x3)] < [G(x1)−G(x2)][F (x2)− F (x3)]. (9)
This implies that G(x1) − G(x2) > 0 and F(x2) − F(x3) > 0, hence x2 < x̄. There are
two possibilities: either G(x2)−G(x3) = 0 or G(x2)−G(x3) > 0.
• When G(x2) − G(x3) = 0, it implies that ϕ maps the value G(x2) to a set containing both F(x3) and F(x2), i.e., {F(x3), F(x2)} ⊆ ϕ(G(x2)), which contradicts the assumption that ϕ, with the property F(x) = ϕ(G(x)) for x ∈ (−∞, x̄], is a function.

• When G(x2) − G(x3) > 0, by Eq. (9),

$$\frac{F(x_1) - F(x_2)}{G(x_1) - G(x_2)} < \frac{F(x_2) - F(x_3)}{G(x_2) - G(x_3)}. \tag{10}$$

  – If x1 ≤ x̄, then Eq. (10) implies that ϕ is not convex, a contradiction.

  – If x1 > x̄, then G(x1) = G(x̄) = 1 and F(x̄) − F(x2) ≤ F(x1) − F(x2) (notice that x2 < x̄), hence Eq. (10) implies

$$\frac{F(\bar{x}) - F(x_2)}{G(\bar{x}) - G(x_2)} \le \frac{F(x_1) - F(x_2)}{G(x_1) - G(x_2)} < \frac{F(x_2) - F(x_3)}{G(x_2) - G(x_3)}.$$

  So we have found points x̄ > x2 > x3 such that ϕ is not convex, which again is a contradiction.
“Only If” Part. Suppose Eq. (1) holds, but the transformation ϕ such that F(x) = ϕ(G(x)), x ∈ (−∞, x̄], is not a convex function. Then either ϕ is not a function, or ϕ is a function but is not convex.

• If ϕ is not a function for x ∈ (−∞, x̄], then there exist x2 and x3 smaller than x̄ such that G(x2) = G(x3) < 1 (w.l.o.g., assume x3 < x2), but F(x3) < F(x2). Namely, ϕ maps G(x2) to a set containing both F(x2) and F(x3). Take x1 = x̄. We then have G(x1) = 1 > G(x2) = G(x3) and F(x1) ≥ F(x2) > F(x3). Hence,

$$[F(x_1) - F(x_2)][G(x_2) - G(x_3)] = 0 < [F(x_2) - F(x_3)][G(x_1) - G(x_2)],$$

which contradicts condition 2 in Lemma 1.

• If ϕ is a function, but is not convex for x ∈ (−∞, x̄], then one can find x1, x2, x3 with x̄ ≥ x1 > x2 > x3 such that Eq. (1) does not hold, which leads to a contradiction.
A.2 Proof of Lemma 1
To establish the equivalence of the first two conditions, simply add the constant
[F (x1) − F (x2)][G(x1) − G(x2)] to both sides of the second inequality. Similarly, the
equivalence between the second and third conditions can be established by adding the
constant [F (x2)− F (x3)][G(x2)−G(x3)] to both sides of the second inequality.
Next, we show that condition 2 implies condition 4. Let x1 > x2 ≥ x3 > x4. By
condition 2,
[F (x1)− F (x2)][G(x2)−G(x3)] ≥ [F (x2)− F (x3)][G(x1)−G(x2)], (11)
[F (x2)− F (x3)][G(x3)−G(x4)] ≥ [F (x3)− F (x4)][G(x2)−G(x3)], (12)
[F (x1)− F (x3)][G(x3)−G(x4)] ≥ [F (x3)− F (x4)][G(x1)−G(x3)]. (13)
In case [F(x2) − F(x3)][G(x2) − G(x3)] > 0, multiplying Eqs. (11) and (12) and dividing by this positive product gives condition 4. Otherwise, [F(x2) − F(x3)][G(x2) − G(x3)] = 0, which leads to one of two cases. (a) Both F(x2) − F(x3) = 0 and G(x2) − G(x3) = 0, implying F(x1) − F(x3) = F(x1) − F(x2) and G(x1) − G(x3) = G(x1) − G(x2). These equations imply, by Eq. (13), condition 4. (b) Exactly one of F(x2) − F(x3) = 0 and G(x2) − G(x3) = 0 holds. If F(x2) − F(x3) = 0 (so G(x2) − G(x3) > 0), then by Eq. (12), F(x3) − F(x4) = 0, which in turn guarantees condition 4. If G(x2) − G(x3) = 0 (so F(x2) − F(x3) > 0), then by Eq. (11), G(x1) − G(x2) = 0, which again guarantees condition 4.

The fact that condition 4 implies condition 2 is immediate: simply take x2 = x3. This completes the proof of Lemma 1.
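The three- and four-point conditions used in this proof can be verified by brute force on a discrete example in which F is an explicit convex transformation of G (the grid and the map φ(y) = y³ are illustrative assumptions):

```python
import itertools

# F = phi(G) with phi(y) = y**3 convex, evaluated on a grid of n points.
n = 12
G = [(i + 1) / n for i in range(n)]
F = [y ** 3 for y in G]

# Condition 2 (as in Eq. (11)): for x1 > x2 > x3,
# [F(x1)-F(x2)][G(x2)-G(x3)] >= [F(x2)-F(x3)][G(x1)-G(x2)]
for i3, i2, i1 in itertools.combinations(range(n), 3):    # i3 < i2 < i1
    lhs = (F[i1] - F[i2]) * (G[i2] - G[i3])
    rhs = (F[i2] - F[i3]) * (G[i1] - G[i2])
    assert lhs >= rhs - 1e-12

# Condition 4: for x1 > x2 >= x3 > x4,
# [F(x1)-F(x2)][G(x3)-G(x4)] >= [F(x3)-F(x4)][G(x1)-G(x2)]
ok4 = all((F[i1] - F[i2]) * (G[i3] - G[i4])
          >= (F[i3] - F[i4]) * (G[i1] - G[i2]) - 1e-12
          for i4, i3, i2, i1 in itertools.combinations(range(n), 4))
assert ok4
print("conditions 2 and 4 hold on the grid")
```

Both inequalities reduce to comparing slopes of the convex map φ over intervals of G-values, which is why they hold for any such pair.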
A.3 Proof of Theorem 1
“If” Part. We first prove that Eq. (3) implies F ≽SSD G (or Eq. (1)). Suppose, by contradiction, that F ⋡SSD G. Then there exist x1 > x2 > x3 such that

$$[F(x_1) - F(x_2)][G(x_1) - G(x_3)] < [F(x_1) - F(x_3)][G(x_1) - G(x_2)]. \tag{14}$$

The monotonicity of F and G implies that for Eq. (14) to hold, it is necessary that F(x1) − F(x3) > 0 and G(x1) − G(x2) > 0. It follows that G(x1) − G(x3) > 0 as well. Dividing both sides of Eq. (14) by [G(x1) − G(x3)][F(x1) − F(x3)], we obtain

$$\frac{F(x_1) - F(x_2)}{F(x_1) - F(x_3)} < \frac{G(x_1) - G(x_2)}{G(x_1) - G(x_3)}. \tag{15}$$

Set h = 1_{(x3,x1]} and u = 1_{(x2,∞)}. Then uh = 1_{(x2,x1]}. By Eq. (3),

$$\frac{F(x_1) - F(x_2)}{F(x_1) - F(x_3)} = \frac{E_F(uh)}{E_F(h)} \ge \frac{E_G(uh)}{E_G(h)} = \frac{G(x_1) - G(x_2)}{G(x_1) - G(x_3)},$$

which contradicts Eq. (15).
“Only if” Part. Take any bounded, non-negative measurable function h. Consider the probability measures P^H_F and P^H_G defined as

$$P^H_F(A) = \int_A \frac{h(x)}{E_F(h)}\,dP_F \qquad\text{and}\qquad P^H_G(A) = \int_A \frac{h(x)}{E_G(h)}\,dP_G,$$

respectively, and let H_F and H_G be the corresponding CDFs. Here we assume that E_F(h) > 0 and E_G(h) > 0; otherwise, if one of them is 0, then Eq. (3) holds trivially. We prove that H_F FOSD H_G.

Since h is bounded, non-negative and measurable w.r.t. P_F and P_G, there exists an increasing sequence of simple functions {h_N} with h = lim_{N→∞} h_N a.e., where h_N = ∑_{i=1}^N α_i 1_{A_i} and the A_i are disjoint intervals measurable w.r.t. P_F and P_G.
Now fix x2, let x3 < x2, and consider any measurable set A_i > x2.²⁴ Using condition 4 in Lemma 1, we have

$$E_F(1_{A_i})\,E_G(1_{(x_3,x_2]}) \ge E_F(1_{(x_3,x_2]})\,E_G(1_{A_i}).$$

We then obtain

$$E_F\Big(\sum_{i=1}^N \alpha_i 1_{A_i\cap(x_2,\infty)}\Big)\,E_G\big(1_{(x_3,x_2]}\big) \ge E_F\big(1_{(x_3,x_2]}\big)\,E_G\Big(\sum_{i=1}^N \alpha_i 1_{A_i\cap(x_2,\infty)}\Big),$$

and it follows that

$$E_F\Big(\sum_{i=1}^N \alpha_i 1_{A_i\cap(x_2,\infty)}\Big)\,E_G\Big(\sum_{i=1}^N \alpha_i 1_{A_i\cap(-\infty,x_2]}\Big) \ge E_F\Big(\sum_{i=1}^N \alpha_i 1_{A_i\cap(-\infty,x_2]}\Big)\,E_G\Big(\sum_{i=1}^N \alpha_i 1_{A_i\cap(x_2,\infty)}\Big),$$

or

$$E_F\big(h_N 1_{(x_2,\infty)}\big)\,E_G\big(h_N 1_{(-\infty,x_2]}\big) \ge E_G\big(h_N 1_{(x_2,\infty)}\big)\,E_F\big(h_N 1_{(-\infty,x_2]}\big).$$

24 This means that inf A_i > x2.
This is equivalent to (again, assume both E_F(h_N) > 0 and E_G(h_N) > 0 — the statement is trivial when either one is 0)

$$E_F\Big(\frac{h_N 1_{(x_2,\infty)}}{E_F(h_N)}\Big)\,E_G\Big(\frac{h_N 1_{(-\infty,x_2]}}{E_G(h_N)}\Big) \ge E_F\Big(\frac{h_N 1_{(-\infty,x_2]}}{E_F(h_N)}\Big)\,E_G\Big(\frac{h_N 1_{(x_2,\infty)}}{E_G(h_N)}\Big).$$

In the limit, we have

$$E_F\Big(\frac{h\, 1_{(x_2,\infty)}}{E_F(h)}\Big)\,E_G\Big(\frac{h\, 1_{(-\infty,x_2]}}{E_G(h)}\Big) \ge E_F\Big(\frac{h\, 1_{(-\infty,x_2]}}{E_F(h)}\Big)\,E_G\Big(\frac{h\, 1_{(x_2,\infty)}}{E_G(h)}\Big),$$

or

$$\big(1 - H_F(x_2)\big)\,H_G(x_2) \ge H_F(x_2)\,\big(1 - H_G(x_2)\big).$$

It follows that H_F FOSD H_G. Thus, for an increasing u, one obtains

$$\frac{E_F(uh)}{E_F(h)} = E_{H_F}(u) \ge E_{H_G}(u) = \frac{E_G(uh)}{E_G(h)},$$

which is Eq. (3). This completes the proof of Theorem 1.
A.4 Proof of Corollary 1 — “If” Part
Suppose F ⋡SSD G. By condition 3 of Lemma 1, there exist x1 > x2 > x3 such that

$$[F(x_1) - F(x_3)][G(x_2) - G(x_3)] < [F(x_2) - F(x_3)][G(x_1) - G(x_3)]. \tag{16}$$

The monotonicity of F and G implies that for Eq. (16) to hold, it is necessary that F(x2) − F(x3) > 0 and G(x1) − G(x3) > 0, hence F(x1) − F(x3) > 0. Dividing both sides of Eq. (16) by [F(x1) − F(x3)][G(x1) − G(x3)], we obtain

$$\frac{F(x_2) - F(x_3)}{F(x_1) - F(x_3)} > \frac{G(x_2) - G(x_3)}{G(x_1) - G(x_3)}. \tag{17}$$

Since Eq. (4) is equivalent to F|A FOSD G|A for any measurable set A with positive probability, taking A = (x3, x1] yields

$$\frac{F(x_2) - F(x_3)}{F(x_1) - F(x_3)} \le \frac{G(x_2) - G(x_3)}{G(x_1) - G(x_3)},$$

which contradicts Eq. (17).
A.5 Proof of Theorem 2
To establish the “if” part of Theorem 2, we need the following two lemmas. Lemma 2
states that for a sequence of Bernoulli random variables with mean p, as the length of
the sequence becomes sufficiently large, the frequency of occurrence is arbitrarily close
to the mean p.
24
Lemma 2. Suppose X1, ..., Xt are i.i.d. Bernoulli random variables with parameter p, p ∈ [0, 1]. Let Y_t := (1/t)∑_{i=1}^t X_i. Then, for any δ > 0,

$$P\big(|Y_t - p| > \delta\big) < \exp(-t\delta^2).$$

Proof. In case p ∈ {0, 1} the assertion is clear. Otherwise, by the Chernoff–Hoeffding theorem (applied to Bernoulli random variables), we have

$$P\big(|Y_t - p| > \delta\big) < \exp\Big(-\frac{t\delta^2}{2p(1-p)}\Big) < \exp(-t\delta^2).$$
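The bound in Lemma 2 can be checked exactly for a given parameter choice by summing the binomial tail (the values of t, p and δ below are illustrative):

```python
import math

# Exact two-sided binomial tail P(|Y_t - p| > delta) versus exp(-t*delta**2).
# The parameters t, p, delta are an illustrative choice for this check.
t, p, delta = 200, 0.3, 0.1

def binom_pmf(k):
    return math.comb(t, k) * p**k * (1 - p)**(t - k)

tail = sum(binom_pmf(k) for k in range(t + 1) if abs(k / t - p) > delta)
bound = math.exp(-t * delta**2)
assert tail < bound
print(round(tail, 6), "<", round(bound, 6))
```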
Given a history ht of binary outcomes, let k(ht) denote the number of outcomes
“U” among the t observations.
Lemma 3. Let F be a CDF on [0, 1] and let p1, p3 ∈ (0, 1] be given numbers with p3 < p1. Assume that p3 is not an atom and that either p1 = 1, or p1 is not an atom. Consider the set of histories

$$H^t := \Big\{ h^t \in \mathcal{H}^t \;\Big|\; p_3 < \frac{k(h^t)}{t} \le p_1 \Big\}. \tag{18}$$

Then, P_F(H^t) → F(p1) − F(p3) as t → ∞.
Proof. We prove the assertion for the case where both p1 and p3 are not atoms and p1 < 1. The proof for the case where p1 = 1 is similar but easier, and is therefore omitted.

Fix ε > 0. Since p1 and p3 are not atoms, F is continuous at these points. Thus, one can find δ > 0 such that F(p3 + δ) − F(p3 − δ) + F(p1 + δ) − F(p1 − δ) < ε.

Recall that for a history h^t with k observations of outcome “U”, we defined B(p; h^t) = p^k(1 − p)^{t−k} and B(p; H^t) = ∑_{h^t∈H^t} B(p; h^t). Hence, we have

$$\begin{aligned}
P_F(H^t) &= \int_0^1 B(p;H^t)\,dF(p)\\
&= \int_0^{p_3-\delta} B(p;H^t)\,dF(p) + \int_{p_3-\delta}^{p_3+\delta} B(p;H^t)\,dF(p) + \int_{p_3+\delta}^{p_1-\delta} B(p;H^t)\,dF(p)\\
&\quad + \int_{p_1-\delta}^{p_1+\delta} B(p;H^t)\,dF(p) + \int_{p_1+\delta}^{1} B(p;H^t)\,dF(p).
\end{aligned}\tag{19}$$

Since B(p; H^t) ≤ 1 and by the choice of δ,

$$\int_{p_3-\delta}^{p_3+\delta} B(p;H^t)\,dF(p) + \int_{p_1-\delta}^{p_1+\delta} B(p;H^t)\,dF(p) < \varepsilon. \tag{20}$$

By Lemma 2 and the definition of H^t, we have

$$\int_0^{p_3-\delta} B(p;H^t)\,dF(p) + \int_{p_1+\delta}^{1} B(p;H^t)\,dF(p) < \exp(-t\delta^2), \tag{21}$$

and (by the choice of δ as well),

$$\int_{p_3+\delta}^{p_1-\delta} B(p;H^t)\,dF(p) > \int_{p_3+\delta}^{p_1-\delta} \big(1 - \exp(-t\delta^2)\big)\,dF(p) \ge F(p_1-\delta) - F(p_3+\delta) - \exp(-t\delta^2) \ge F(p_1) - F(p_3) - \varepsilon - \exp(-t\delta^2). \tag{22}$$

Combining Eqs. (19) to (22) yields

$$\big| P_F(H^t) - \big(F(p_1) - F(p_3)\big) \big| < 2\big(\varepsilon + \exp(-t\delta^2)\big).$$

Thus, when t is sufficiently large, |P_F(H^t) − (F(p1) − F(p3))| < 4ε, which completes the proof.
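The convergence in Lemma 3 can be illustrated numerically for a uniform prior F on [0, 1]; the interval (p3, p1] and the midpoint integration grid are illustrative assumptions.

```python
import math

# P_F(H^t) = ∫ B(p; H^t) dF(p) for uniform F, computed by a midpoint rule,
# should approach F(p1) - F(p3) = p1 - p3 as t grows.
p3, p1 = 0.3, 0.7

def prob_Ht(t, m=1000):
    # ∫_0^1 P(p3 < k/t <= p1 | p) dp over a grid of m midpoints
    total = 0.0
    for j in range(m):
        p = (j + 0.5) / m
        total += sum(math.comb(t, k) * p**k * (1 - p)**(t - k)
                     for k in range(t + 1) if p3 < k / t <= p1)
    return total / m

err_small = abs(prob_Ht(20) - (p1 - p3))
err_large = abs(prob_Ht(200) - (p1 - p3))
assert err_large < err_small      # the error shrinks as t grows
print(round(err_small, 4), round(err_large, 4))
```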
Proof of Theorem 2 — “If” Part. Suppose F ⋡SSD G. Then, by Eq. (1) in Definition 1, there exist p1 > p2 > p3 such that

$$[F(p_1) - F(p_2)][G(p_1) - G(p_3)] < [F(p_1) - F(p_3)][G(p_1) - G(p_2)].$$

The monotonicity of F and G implies that in order for the inequality above to hold, it must be true that F(p1) − F(p3) > 0 and G(p1) − G(p2) > 0, hence G(p1) − G(p3) > 0. Consequently,

$$\frac{F(p_1) - F(p_2)}{F(p_1) - F(p_3)} < \frac{G(p_1) - G(p_2)}{G(p_1) - G(p_3)}. \tag{23}$$

We show that this leads to a contradiction.

Since the functions F and G are right-continuous and each has at most countably many atoms, we can assume w.l.o.g. that there are no atoms (neither of F, nor of G) at p3, p2 and p1, or that p1 = 1. Indeed, suppose that p1 < 1 and p1 is an atom. One can find a point q1 > p1 which is not an atom of F or G (because there are at most countably many atoms). Moreover, since a CDF is right-continuous, q1 can be chosen to be very close to p1 (from the right), so that F(q1) and G(q1) are close to F(p1) and G(p1) to the extent that the inequality in Eq. (23) remains valid when we replace p1 by q1. For the same reason, it is w.l.o.g. to assume that p3 and p2 are not atoms.

Step 1. Consider the set of histories H^t defined by Eq. (18). We show that, as t → ∞,

$$E_F\big(1_{(p_2,1]}\mid H^t\big) \to \frac{F(p_1) - F(p_2)}{F(p_1) - F(p_3)} \quad\text{and}\quad E_G\big(1_{(p_2,1]}\mid H^t\big) \to \frac{G(p_1) - G(p_2)}{G(p_1) - G(p_3)}.$$
Clearly,

$$E_F\big(1_{(p_2,1]}\mid H^t\big) = \sum_{h^t\in H^t} P_F(h^t\mid H^t)\,E_F\big(1_{(p_2,1]}\mid h^t\big) = \sum_{h^t\in H^t} \frac{P_F(h^t)}{P_F(H^t)}\,E_F\big(1_{(p_2,1]}\mid h^t\big).$$

For the denominator P_F(H^t), Lemma 3 implies that P_F(H^t) → F(p1) − F(p3) as t → ∞.

Now consider ∑_{h^t∈H^t} P_F(h^t) E_F(1_{(p2,1]} | h^t); we show that it converges to F(p1) − F(p2). Since

$$P_F(h^t) = \int_0^1 B(p;h^t)\,dF(p) \quad\text{and}\quad E_F\big(1_{(p_2,1]}\mid h^t\big) = \int_0^1 \frac{1_{(p_2,1]}(p)\,B(p;h^t)}{\int_0^1 B(p;h^t)\,dF(p)}\,dF(p),$$

it follows that

$$\begin{aligned}
\sum_{h^t\in H^t} P_F(h^t)\,E_F\big(1_{(p_2,1]}\mid h^t\big) &= \sum_{h^t\in H^t} \int_0^1 1_{(p_2,1]}(p)\,B(p;h^t)\,dF(p) = \int_0^1 1_{(p_2,1]}(p)\,B(p;H^t)\,dF(p)\\
&= \int_{p_2}^{p_2+\delta} B(p;H^t)\,dF(p) + \int_{p_2+\delta}^{p_1-\delta} B(p;H^t)\,dF(p)\\
&\quad + \int_{p_1-\delta}^{p_1+\delta} B(p;H^t)\,dF(p) + \int_{p_1+\delta}^{1} B(p;H^t)\,dF(p).
\end{aligned}\tag{24}$$

Fix ε > 0. Since p1 and p2 are not atoms, F is continuous at these points. Thus, one can find δ > 0 such that F(p2 + δ) − F(p2) + F(p1 + δ) − F(p1 − δ) < ε. Since B(p; H^t) ≤ 1, we have

$$\int_{p_2}^{p_2+\delta} B(p;H^t)\,dF(p) + \int_{p_1-\delta}^{p_1+\delta} B(p;H^t)\,dF(p) < \varepsilon. \tag{25}$$

By Lemma 2 and the definition of H^t,

$$\int_{p_1+\delta}^{1} B(p;H^t)\,dF(p) < \exp(-t\delta^2), \tag{26}$$

and (by the choice of δ as well),

$$\int_{p_2+\delta}^{p_1-\delta} B(p;H^t)\,dF(p) > \int_{p_2+\delta}^{p_1-\delta} \big(1 - \exp(-t\delta^2)\big)\,dF(p) \ge F(p_1-\delta) - F(p_2+\delta) - \exp(-t\delta^2) \ge F(p_1) - F(p_2) - \varepsilon - \exp(-t\delta^2). \tag{27}$$

We combine Eqs. (24) to (27) to obtain

$$\Big|\sum_{h^t\in H^t} P_F(h^t)\,E_F\big(1_{(p_2,1]}\mid h^t\big) - \big(F(p_1) - F(p_2)\big)\Big| < 2\big(\varepsilon + \exp(-t\delta^2)\big).$$

Thus, for t sufficiently large, the left-hand side is smaller than 4ε. This, together with P_F(H^t) → F(p1) − F(p3), establishes the desired result.
Step 2. We show that E_F(1_{(p2,1]} | H^t) ≥ E_G(1_{(p2,1]} | H^t).

Since 1_{(p2,1]} is an increasing function on [0, 1] and F|H^t FOSD G|H^t for any H^t ⊆ ℋ^t, it follows that E_F(1_{(p2,1]} | H^t) ≥ E_G(1_{(p2,1]} | H^t), as desired.

Step 1 and Step 2 together imply that

$$\frac{F(p_1) - F(p_2)}{F(p_1) - F(p_3)} \ge \frac{G(p_1) - G(p_2)}{G(p_1) - G(p_3)},$$

which contradicts Eq. (23). This completes the proof of the “if” part of Theorem 2.
A.6 Supplementary Results Related to Proposition 3
Lemma 4 and Lemma 5 below establish the result that the function B(p; H t+(ht)) is
increasing in p, which is used in the proof of Proposition 3.
Lemma 4. For any given t, k ∈ N with 0 ≤ k ≤ t, the function

$$\Gamma(p; t, k) := \sum_{\ell=k}^{t} \binom{t}{\ell} (1-p)^{t-\ell}\, p^{\ell}$$

is increasing in p.
Proof. Taking the derivative of Γ(p; t, k) w.r.t. p, we have

$$\begin{aligned}
\frac{d\Gamma(p; t, k)}{dp} &= \sum_{\ell=k}^{t-1} \binom{t}{\ell} \Big[\ell p^{\ell-1}(1-p)^{t-\ell} - (t-\ell)p^{\ell}(1-p)^{t-\ell-1}\Big] + t p^{t-1}\\
&= \sum_{\ell=k}^{t-1} \binom{t}{\ell} p^{\ell-1}(1-p)^{t-\ell-1}(\ell - tp) + t p^{t-1}\\
&= t\sum_{\ell=k}^{t-1} \binom{t-1}{\ell-1} p^{\ell-1}(1-p)^{t-\ell-1} - t\sum_{\ell=k}^{t-1} \binom{t}{\ell} p^{\ell}(1-p)^{t-\ell-1} + t p^{t-1}.
\end{aligned}$$

Thus, dΓ(p; t, k)/dp ≥ 0 if and only if

$$\sum_{\ell=k}^{t-1} \binom{t-1}{\ell-1} p^{\ell-1}(1-p)^{t-\ell-1} + p^{t-1} \ge \sum_{\ell=k}^{t-1} \binom{t}{\ell} p^{\ell}(1-p)^{t-\ell-1},$$

if and only if (after multiplying both sides by (1 − p) and adding p^t),

$$\sum_{\ell=k}^{t-1} \binom{t-1}{\ell-1} p^{\ell-1}(1-p)^{t-\ell} + p^{t-1} \ge \sum_{\ell=k}^{t-1} \binom{t}{\ell} p^{\ell}(1-p)^{t-\ell} + p^t,$$

if and only if

$$\sum_{\ell=k-1}^{t-2} \binom{t-1}{\ell} p^{\ell}(1-p)^{t-\ell-1} + p^{t-1} = \sum_{\ell=k-1}^{t-1} \binom{t-1}{\ell} p^{\ell}(1-p)^{t-1-\ell} \ge \sum_{\ell=k}^{t} \binom{t}{\ell} p^{\ell}(1-p)^{t-\ell}.$$

Notice that the LHS equals Γ(p; t − 1, k − 1), while the RHS equals Γ(p; t, k). Since Γ(p; t − 1, k − 1) ≥ Γ(p; t, k), we conclude that dΓ(p; t, k)/dp ≥ 0 and therefore Γ(p; t, k) is non-decreasing in p.
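Both monotonicity claims in the proof of Lemma 4 can be spot-checked numerically; the grid of p-values and the (t, k) pairs below are illustrative choices.

```python
import math

# Gamma(p; t, k): upper tail of a Binomial(t, p) distribution at k.
def Gamma(p, t, k):
    return sum(math.comb(t, l) * (1 - p)**(t - l) * p**l
               for l in range(k, t + 1))

ps = [j / 50 for j in range(51)]
for t, k in [(5, 2), (10, 4), (12, 12)]:
    vals = [Gamma(p, t, k) for p in ps]
    # Gamma(.; t, k) is non-decreasing in p
    assert all(v2 >= v1 - 1e-12 for v1, v2 in zip(vals, vals[1:]))
    # Gamma(p; t-1, k-1) >= Gamma(p; t, k): dropping the last trial while
    # lowering the target by one can only raise the tail probability
    assert all(Gamma(p, t - 1, k - 1) >= Gamma(p, t, k) - 1e-12 for p in ps)
print("Lemma 4 checks passed")
```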
Lemma 5. Consider any h^t ∈ ℋ^t. The function B(p; H^t_+(h^t)) is increasing in p.

Proof. Suppose the history h^t has k observations of outcome “U”. Let r be the number of those histories that rank above h^t but have the same number of observations of outcome “U” as h^t. Clearly, 0 ≤ r ≤ C(t, k). We have

$$B\big(p; H^t_+(h^t)\big) = \sum_{\ell=k+1}^{t} \binom{t}{\ell} (1-p)^{t-\ell}\, p^{\ell} + r\, p^{k}(1-p)^{t-k} = \Gamma(p; t, k+1) + r\, p^{k}(1-p)^{t-k}.$$

Notice that, as a function of p, the expression p^k(1 − p)^{t−k} is increasing if and only if p/(1 − p) ≤ k/(t − k). Therefore, for those p that satisfy p/(1 − p) ≤ k/(t − k), the term r p^k(1 − p)^{t−k} is increasing in p, and it follows from Lemma 4 that Γ(p; t, k + 1) is increasing in p. Hence in this case, B(p; H^t_+(h^t)) is increasing in p.

For those p that satisfy p/(1 − p) > k/(t − k), the expression −(C(t, k) − r) p^k(1 − p)^{t−k} is strictly increasing in p, and Lemma 4 implies that Γ(p; t, k) is increasing in p. Hence, in this case,

$$B\big(p; H^t_+(h^t)\big) = \Gamma(p; t, k) - \Big(\binom{t}{k} - r\Big)\, p^{k}(1-p)^{t-k}$$

is also increasing in p. We thus conclude that B(p; H^t_+(h^t)) is an increasing function of p.
A.7 Proof of Theorem 3
The equivalence of conditions 1 and 2 was established in Theorem 2. By Proposition 3,
we know that condition 1 also implies condition 3. Therefore, it suffices to show that
condition 3 implies condition 1.
Suppose condition 1 does not hold, namely, F ⋡SSD G. Then there exist p1, p2, p3 ∈ [0, 1] with p1 > p2 > p3 such that

$$\frac{F(p_2) - F(p_3)}{F(p_1) - F(p_3)} > \frac{G(p_2) - G(p_3)}{G(p_1) - G(p_3)}. \tag{28}$$

As in the proof of Theorem 2 (see the discussion after Eq. (23)), w.l.o.g. one can assume that p1, p2, p3 are not atoms (or p1 = 1).

For any p < q, consider the set H^t(p, q) := {h^t ∈ ℋ^t | p < k(h^t)/t ≤ q}, where k(h^t) is the number of outcomes “U” observed in the history h^t. By Lemma 3, P_F(H^t(p3, p2)) → F(p2) − F(p3) and P_F(H^t(p3, p1)) → F(p1) − F(p3) as t → ∞. Therefore,

$$\frac{P_F(H^t(p_3, p_2))}{P_F(H^t(p_3, p_1))} \to \frac{F(p_2) - F(p_3)}{F(p_1) - F(p_3)}.$$

Similarly, under prior belief G,

$$\frac{P_G(H^t(p_3, p_2))}{P_G(H^t(p_3, p_1))} \to \frac{G(p_2) - G(p_3)}{G(p_1) - G(p_3)}.$$

By condition 3, the probability distribution on the set of histories H^t(p3, p1) under F FOSD that under G. Consequently, P_F(H^t(p3, p2))/P_F(H^t(p3, p1)) ≤ P_G(H^t(p3, p2))/P_G(H^t(p3, p1)) for any t. Hence, in the limit, we have

$$\frac{F(p_2) - F(p_3)}{F(p_1) - F(p_3)} \le \frac{G(p_2) - G(p_3)}{G(p_1) - G(p_3)},$$

which contradicts Eq. (28). This completes the proof of the theorem.
A.8 Proof of Proposition 4
Let ℱ₀ be the set of finite or countable unions of intervals of the form (a, b] with a < b. The smallest σ-algebra that contains ℱ₀ is the Borel σ-algebra, ℬ.

Suppose the first claim does not hold. Then there exist x1 > x2 > x̲ such that P_F((x2, x1]|D) = 0 (i.e., F(x1) − F(x2) = 0), but P_G((x2, x1]|D) > 0 (i.e., G(x1) − G(x2) > 0). Take x3 < x̲; then F(x3) = 0. Since x̲ := inf{x | F(x) > 0}, the fact that x1 > x̲ implies that F(x1) − F(x3) > 0. Hence,

$$[F(x_1) - F(x_2)][G(x_1) - G(x_3)] = 0 < [F(x_1) - F(x_3)][G(x_1) - G(x_2)],$$

which contradicts Eq. (1).

Now we show that the Radon–Nikodym derivative g := dP_G(·|D)/dP_F(·|D) is non-increasing, except on a P_F(·|D)-null set. We show the following equivalent condition: for any λ > 0 such that the sets {x ∈ D | g(x) < λ} and {x ∈ D | g(x) > λ} both have positive measure w.r.t. P_F(·|D),²⁵ we have {x ∈ D | g(x) < λ} > {x ∈ D | g(x) > λ}, except on a P_F(·|D)-null set.

Suppose that the claim that g is non-increasing almost everywhere w.r.t. P_F(·|D) does not hold. Then there exist λ > 0 and two subsets E_k, E_ℓ whose P_F(·|D)-measures are positive such that E_k ⊆ {x ∈ D | g(x) < λ} and E_ℓ ⊆ {x ∈ D | g(x) > λ}, but E_ℓ > E_k. Since P_G(E_k|D) = ∫_{E_k} g dP_F(·|D) and P_G(E_ℓ|D) = ∫_{E_ℓ} g dP_F(·|D), we have P_G(E_k|D) < λP_F(E_k|D) and P_G(E_ℓ|D) > λP_F(E_ℓ|D), and consequently,

$$\frac{P_G(E_k|D)}{P_F(E_k|D)} < \lambda < \frac{P_G(E_\ell|D)}{P_F(E_\ell|D)}. \tag{29}$$

However, since F ≽SSD G, it follows that P_G(E_k|D)/P_F(E_k|D) ≥ P_G(E_ℓ|D)/P_F(E_ℓ|D), which contradicts Eq. (29).
A.9 Proof of Proposition 5
Assume that one of the priors F or G is degenerate. To establish the “only if” part,
consider the following two cases. (a) F is degenerate at 1. Then for any set of histories
H t with positive probability, F |H t is still degenerate at 1, so F |H t FOSD G|H t holds
trivially. (b) F is not degenerate at 1. Then it cannot happen that G is degenerate
at 1, since this would violate F �SSD G (note that F �SSD G implies F FOSD G).
Similarly, it cannot be true that F is degenerate at 0 while G is not degenerate at 0.
Hence, we only need to consider the case that G is degenerate at 0. But this implies
that G|H t is also degenerate at 0, for any H t with positive probability. So F |H t FOSD
G|H t holds trivially.
To establish the “if” part, consider again the following two cases. (a) F is degenerate
at 1. Then the transformation ϕ that satisfies F (p) = ϕ(G(p)), p ∈ [0, p], is ϕ(y) ={0, if y < 1,
1, if y = 1,where y is in the range of G. Clearly ϕ is a convex function, hence
F �SSD G. (b) F is not degenerate at 1. Then one cannot have that G is degenerate
at 1, or that F is degenerate at 0 while G is not degenerate at 0, since then one can
easily find H t with positive probability that violates F |H t FOSD G|H t. So we focus
on the case in which G is degenerate ar 0. Then, for any p1 > p2 > p3, we have
G(p1) − G(p3) = 0 and G(p1) − G(p2) = 0, so Eq. (1) holds trivially and F �SSD G.
This completes the proof of the proposition.
25 Let A and B be two sets of real numbers. We write A > B if x ∈ A and y ∈ B imply x > y.