Strong Stochastic Dominance
Ehud Lehrer∗ and Tao Wang†
January 7, 2020
Abstract
One probability distribution is said to strongly stochastically dominate another probability distribution if the former is a convex transformation of the latter. The main contribution of this paper is the introduction of an equivalent condition phrased in expectation/behavioral terms. To show the usefulness of this equivalent condition, we give several economic applications. These include a Bayesian learning and dynamics model in which we examine the consequences of one prior strongly stochastically dominating another. We also show implications of our main theorem in models of monotone comparative statics under uncertainty, the pricing of risky assets, a portfolio's value-at-risk, and production expansion.
Keywords: Strong stochastic dominance; Convex transformation; Bayesian learning;
Monotone comparative statics
JEL Codes: C11, D80, D83.
∗School of Mathematical Sciences, Tel Aviv University, Tel Aviv 69978, Israel and INSEAD, Bd. de Constance, 77305 Fontainebleau Cedex, France. Email: [email protected]. Lehrer acknowledges the support of ISF Grant #963/15 and NSFC-ISF Grant #2510/17.
†School of Mathematical Sciences, Tel Aviv University, Tel Aviv 69978, Israel. Email: [email protected]. Wang acknowledges the support of NSFC Grant #11761141007.
1 Introduction
The notion of stochastic dominance has been extensively discussed in the literature.
Many well-known stochastic dominance relations admit equivalent conditions expressed
in terms of expectations of functions with certain properties. The most famous ex-
amples are first and second order stochastic dominance: A distribution F first-order
stochastically dominates (FOSD) another distribution G if and only if the expected
value of any increasing function with respect to (henceforth, w.r.t.) F is greater than
the expected value w.r.t. G. Similarly, F second-order stochastically dominates (SOSD) G if and only if the expected value of any increasing and concave function w.r.t. F is greater than the expected value w.r.t. G. Blackwell's equivalence theorem for
comparing experiments (Blackwell (1953)) is another example: One experiment S (a stochastic matrix) dominates¹ another experiment T if and only if the ex-ante expected
value of any continuous and convex function under S is greater than the expected value
of the same function under T . The significance of these stochastic dominance relations
stems in part from the existence of such equivalent conditions, since they allow us to
abstract away from the underlying distributions and focus directly on expectations.
In this paper, we consider a notion of stochastic dominance that we call strong stochastic dominance (SSD) and provide an equivalent condition for SSD expressed in
terms of expectations. We then demonstrate that our equivalent condition has a wide
range of applications in a variety of economic models.
We say that a cumulative distribution function F strongly stochastically dominates
another cumulative distribution function G if F is a convex transformation of G. This
notion applies to arbitrary cumulative distribution functions, not only continuous or discrete ones: the distributions may have atoms or be non-atomic, and may or may not admit densities. The notion of SSD is therefore more general than the widely used monotone
likelihood ratio property (MLRP), the property that the ratio of two density functions
is monotone. Studies of convex transformations of distribution functions have already
appeared in the literature in different contexts with a varying degree of generality
(see for instance, Chan et al. (1991), Bikhchandani et al. (1992), Chew et al. (1987)
and Chateauneuf et al. (2004)). However, none of these works has provided an equivalent condition in terms of expectations.
The main result of this paper shows that F strongly stochastically dominates G if and only if the following condition holds: for every bounded, non-negative measurable function h and every bounded and increasing function u,

E_F(uh) E_G(h) ≥ E_G(uh) E_F(h).
¹ More precisely, we say that S dominates T if there exists a stochastic matrix M such that T = SM, i.e., experiment T can be obtained from S by adding noise according to M.
Since no assumptions other than boundedness, monotonicity, non-negativity and mea-
surability are made regarding the functions u and h, we have considerable latitude in
specifying u and h, which makes wide applications possible.
Although overlooked by researchers so far, the equivalent condition we introduce
here seems to be fundamental. We justify this statement by providing applications to
both dynamic and static models.
Consider the framework of Bayesian learning and dynamic decision making, where
a decision maker receives sequential binary signals about an unknown state of nature.
Using our equivalent condition, we show that the following conditions are equivalent:
(i) a prior belief F dominates another prior belief G in the sense of SSD; (ii) the
posterior distribution given any set of histories of the same length under F FOSD the
posterior distribution under G; (iii) for any set of histories of the same length, the
distribution over this set of histories under F FOSD that under G. Conditions (i) and
(ii) are in terms of the prior and posterior beliefs, respectively, whereas condition (iii)
is in terms of distributions over a set of histories.
Our equivalent condition yields monotone comparative statics results in a class
of stochastic optimization problems. Specifically, we extend a well-known result of
Athey (2002) that states the following. Suppose that a utility function satisfies the
single-crossing property in action and state of nature, where the latter is governed by
a parameterized family of distributions, ordered by MLRP. Then, the optimal actions
that maximize the expected utility, subject to a constraint, are monotone in the sense
of strong set order w.r.t. the distribution and the constraint. We show that this result
still holds if we replace the MLRP by the weaker notion of SSD using the equivalent
condition stated in our main theorem.
Our equivalent condition can also be applied to other static models. One application
is concerned with the equilibrium market prices and the “value-at-risk” of a risky asset
under different distributions over the unknown rate of return of the asset. We show
that under a strongly stochastically dominating distribution, the equilibrium asset
price and the expected return above any percentile are higher. In another application,
we consider a firm’s decision regarding the output expansion investment when facing
uncertain market demand. In this setting, our result implies that under a strongly
stochastically dominating distribution of market demand, the firm expects a higher
rate of return from the output expansion investment.
To the best of our knowledge, Shaked & Shanthikumar (2007) and Muller & Stoyan
(2002) contain the most comprehensive reviews of known results about stochastic or-
ders. It is well known that when density functions exist, one distribution dominating another in the sense of MLRP is equivalent to the former being a convex transformation of the latter (e.g., Section 1.C.1 of Shaked & Shanthikumar (2007) and Theorem 1.4.3 of Muller & Stoyan (2002)). Our paper does not require the existence of density functions. More
importantly, it provides a widely applicable condition which is equivalent to the existence of a convex transformation from one distribution to another. Convex (concave)
transformations of distribution functions also appear in a few models in the context
of decision theory. Bikhchandani et al. (1992) focus on Bayesian learning, where it
is well known that the FOSD relation of prior distribution functions may not be pre-
served after Bayesian updating. They provide equivalent conditions under which the
posterior belief over a set of probability distributions after any history of signals FOSD
the posterior of another belief. In the rank-dependent expected utility models, such
as Chew et al. (1987) and Chateauneuf et al. (2004), convex (concave) transforma-
tions of probability distributions are used to characterize “more risk averse” preference
relations.
The rest of the paper is organized as follows. Section 2 introduces the notion of SSD and discusses its connection to convex transformations. Section 3 presents the main result and its immediate implications. Sections 4 and 5 contain economic applications of the main result to dynamic and static models, respectively. Omitted proofs can be found in the Appendix.
2 Strong Stochastic Dominance
Let F and G be two cumulative distribution functions (CDF) defined on R.
Definition 1. We say that F strongly stochastically dominates G, denoted F ≽_SSD G, if for any x1 > x2 > x3,

[F(x1) − F(x2)][G(x1) − G(x3)] ≥ [F(x1) − F(x3)][G(x1) − G(x2)].  (1)
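Eq. (1) can be checked mechanically when both distributions are supported on finitely many points. The following Python sketch (our illustration, not part of the paper; the two mass functions are hypothetical examples) enumerates all triples x1 > x2 > x3:

```python
import itertools

def cdf(masses, xs, x):
    """CDF of a discrete distribution placing mass masses[i] at point xs[i]."""
    return sum(m for m, s in zip(masses, xs) if s <= x)

def ssd_dominates(f_m, g_m, xs, tol=1e-12):
    """Check Eq. (1) for every triple x1 > x2 > x3 from the common support xs."""
    for x3, x2, x1 in itertools.combinations(sorted(xs), 3):
        F1, F2, F3 = (cdf(f_m, xs, x) for x in (x1, x2, x3))
        G1, G2, G3 = (cdf(g_m, xs, x) for x in (x1, x2, x3))
        if (F1 - F2) * (G1 - G3) < (F1 - F3) * (G1 - G2) - tol:
            return False
    return True

xs = [0, 1, 2]
F = [0.1, 0.3, 0.6]  # mass function; F/G is increasing, so F should dominate G
G = [0.6, 0.3, 0.1]
print(ssd_dominates(F, G, xs), ssd_dominates(G, F, xs))  # True False
```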
Chan et al. (1991) and Bikhchandani et al. (1992)² consider transformations of CDF's by a function ϕ which is convex on the range of G. It turns out that there is a close connection between F ≽_SSD G and the existence of a convex functional transformation ϕ, as stated in the following proposition.
Proposition 1. Let x̄ := inf{x | G(x) = 1}.³ F ≽_SSD G if and only if the transformation ϕ such that F(x) = ϕ(G(x)) on (−∞, x̄] is a convex function.
² Chan et al. (1991) consider convex-ordering among continuous distribution functions, which is equivalent to MLRP. Bikhchandani et al. (1992) focus on discrete distributions. Here we impose no restrictions on the type of distribution functions: they are allowed to be discontinuous or not absolutely continuous with respect to the Lebesgue measure.
³ Since G is right-continuous, this implies G(x̄) = 1.
[Two panels plotting F against G together with the transformation ϕ. Panel (a): F ⋡_SSD G. Panel (b): F ⋡_SSD G.]
Figure 1: Transformations that fail to satisfy F ≽_SSD G
Given any two cumulative distribution functions F and G, one can find a correspondence ϕ : [0, 1] ⇒ [0, 1] such that F(x) = ϕ(G(x)) on (−∞, x̄]. Indeed, define ϕ(y) = F(G⁻¹(y)), where y is in the range of G and G⁻¹(y) = {x ∈ (−∞, x̄] | G(x) = y}. The reason why ϕ(·) is a correspondence, rather than a function, is that G might be constant on some interval while F is strictly increasing on the same interval. Such a case is illustrated in Panel (a) of Figure 1. In this figure, G equals 1/2 on [x3, x2] while F(x2) > F(x3), and ϕ maps 1/2 to a closed interval containing F(x2) and F(x3).
Proposition 1 states that F ≽_SSD G implies that, on the domain where G is not constantly 1, ϕ is a function rather than a correspondence. To see the reason, consider again Panel (a) of Figure 1. Take x1 to be such that F(x1) = G(x1) = 1. In this case, the slope of ϕ between the points (G(x1), F(x1)) and (G(x2), F(x2)) is strictly less than the slope between (G(x1), F(x1)) and (G(x3), F(x3)). This violates Eq. (1).
The purpose of Panel (b) in Figure 1 is to show that the condition F(x) = ϕ(G(x)) on (−∞, x̄] cannot be replaced by the weaker condition that F(x) = ϕ(G(x)) on the open ray (−∞, x̄). In this figure, G has an atom at x1 = x̄. Despite the fact that ϕ is a convex function for G < 1, and that F(x) = ϕ(G(x)) on the open ray (−∞, x̄), F ≽_SSD G fails to hold. To see this, consider x1, x2 and x3 as illustrated. The slope of ϕ between the points (G(x1), F(x1)) and (G(x2), F(x2)) is strictly less than the slope between the points (G(x1), F(x1)) and (G(x3), F(x3)). This contradicts Eq. (1). Therefore, one needs to refer also to the point x̄ and require the transformation ϕ satisfying F(x) = ϕ(G(x)) on (−∞, x̄] to be a convex function. Proposition 1 ensures that this is sufficient for F ≽_SSD G.
Figure 2 shows a case in which F ≽_SSD G. Here x̄ = 2/3, and the transformation ϕ such that F(x) = ϕ(G(x)) on (−∞, 2/3] is a convex function.
[Panel (a): the distribution functions F and G. Panel (b): the transformation ϕ.]
Figure 2: A graphical illustration of SSD
Remarks:
1. The definition of SSD applies to any CDF’s, not only to continuous or discrete
ones. These distributions could be with atoms or non-atomic, with or without
densities.
2. The set of CDF’s that strongly stochastically dominates a given CDF is convex.
This can be seen immediately from Eq. (1).
3. When F ≽_SSD G, since both F and G are CDF's, w.l.o.g. the transformation ϕ such that F(x) = ϕ(G(x)) on (−∞, x̄] is increasing (here and in what follows, increasing means weakly increasing) and satisfies ϕ(0) = 0.⁴ This implies that either ϕ(x) < x for every x ∈ (0, 1), or ϕ(x) = x throughout. We therefore obtain that F(x) = ϕ(G(x)) ≤ G(x) when x ≤ x̄, and G(x) = 1 ≥ F(x) when x > x̄, implying G ≥ F. We conclude that F ≽_SSD G implies F FOSD G.
4. If F ≽_SSD G, then an atom of G in the interior of the support⁵ of F is also an atom of F. Figure 2 is an illustration. In Panel (a), we see that both F and G have
⁴ In general, the transformation ϕ such that F = ϕ ∘ G on (−∞, x̄] may not satisfy ϕ(0) = 0 and monotonicity. This happens, for example, when 0 is not in the range of both F and G (e.g., when a family of normal distributions with the same variance is ordered by the mean). Whenever such situations arise, one may let ϕ(0) = 0, in which case the function ϕ is assured to be increasing.
⁵ The support of a probability measure is the smallest closed set with measure 1.
an atom at x = 2/3 and they satisfy F ≽_SSD G. If, on the contrary, x is an atom of G but not an atom of F (here x is in the interior of the support of F), then the situation is similar to Panel (b) of Figure 1, in which F ≽_SSD G fails to hold. More generally, we can show that F ≽_SSD G implies that the probability measure induced by G is absolutely continuous⁶ w.r.t. that induced by F on a certain set. As a result, the Radon–Nikodym derivative of the probability measure induced by G w.r.t. that induced by F exists and is decreasing. The reader is referred to the more detailed discussion of absolute continuity given in Section 6.1.
Proposition 2. In the special case in which the densities f and g of F and G exist (w.r.t. the Lebesgue measure), Definition 1 is equivalent to the monotone likelihood ratio property (MLRP), namely⁷

f(x)g(y) ≥ f(y)g(x), ∀x > y.  (2)

Proof. Assume (2) and let x1 > x2 > x3. Integrating (2) implies

∫_{x2}^{x1} f(x)dx · ∫_{x3}^{x2} g(y)dy ≥ ∫_{x3}^{x2} f(y)dy · ∫_{x2}^{x1} g(x)dx,

which, by Lemma 1 below, is equivalent to (1); therefore F ≽_SSD G. On the other hand, if F ≽_SSD G, divide both sides of this form of (1) by (x1 − x2)(x2 − x3) and let⁸ x1 ↓ x2, x2 ↓ x3.
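For a concrete instance of condition (2), here is a numerical sketch of ours (the shifted normal densities are an assumed example, not taken from the paper): two normal densities with equal variance and ordered means satisfy the MLRP inequality pointwise.

```python
import math

def npdf(x, mu, sigma=1.0):
    """Normal density; two such densities with shifted means satisfy MLRP."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

xs = [i / 10 for i in range(-50, 51)]
f = [npdf(x, mu=0.5) for x in xs]
g = [npdf(x, mu=0.0) for x in xs]

# Eq. (2): f(x) g(y) >= f(y) g(x) for all x > y
# (here the likelihood ratio f/g = exp(0.5 x - 0.125) is increasing in x).
ok = all(f[i] * g[j] >= f[j] * g[i] for i in range(len(xs)) for j in range(i))
print(ok)  # True
```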
We complete this section with a lemma providing a few conditions equivalent to
the one given in the definition of SSD, Eq. (1).
Lemma 1. The following conditions are equivalent:
1. for any x1 > x2 > x3, [F(x1) − F(x2)][G(x1) − G(x3)] ≥ [F(x1) − F(x3)][G(x1) − G(x2)];
2. for any x1 > x2 > x3, [F(x1) − F(x2)][G(x2) − G(x3)] ≥ [F(x2) − F(x3)][G(x1) − G(x2)];
3. for any x1 > x2 > x3, [F(x1) − F(x3)][G(x2) − G(x3)] ≥ [F(x2) − F(x3)][G(x1) − G(x3)];
4. for any x1 > x2 ≥ x3 > x4, [F(x1) − F(x2)][G(x3) − G(x4)] ≥ [F(x3) − F(x4)][G(x1) − G(x2)].
3 The Main Theorem
Well-known stochastic dominance relations, such as FOSD, SOSD and Blackwell’s com-
parison of experiments, have equivalent properties expressed in terms of expectations.
⁶ A measure ν is absolutely continuous w.r.t. another measure µ if for any measurable set A, µ(A) = 0 implies ν(A) = 0.
⁷ We follow the literature and ignore issues related to sets of points whose Lebesgue probability is 0.
⁸ Again, ignoring a set of points with Lebesgue measure 0.
The following theorem, which is the main result of the paper, gives a condition equivalent to F ≽_SSD G, phrased in terms of expectations (i.e., decision-theoretic terms).⁹
The proof can be found in the Appendix.
Theorem 1. Consider two distribution functions F, G on R. Then, F ≽_SSD G if and only if for every bounded, non-negative measurable function h and every bounded and increasing function u,

E_F(uh) E_G(h) ≥ E_F(h) E_G(uh).  (3)
The significance of the equivalent condition provided in Theorem 1 stems from the fact that one can abstract away from the underlying distributions and focus directly on expectations, which are very often tied to significant consequences of the stochastic dominance relation. This statement is supported by the variety of applications provided below. Beyond mild conditions like boundedness, monotonicity, non-negativity and measurability, Theorem 1 imposes no other requirements on the functions u and h. This allows us considerable latitude in specifying u and h, which makes a wide range of applications possible.
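As a numerical sanity check of Eq. (3) (a sketch of ours; the mass functions and the randomly sampled test functions are assumptions chosen for illustration), one can draw bounded increasing functions u and non-negative functions h on a finite support:

```python
import random

random.seed(0)
xs = [0, 1, 2, 3]
F = [0.05, 0.15, 0.30, 0.50]  # mass functions with F/G increasing in x
G = [0.40, 0.30, 0.20, 0.10]

def E(masses, vals):
    """Expectation of a function (given by its values on xs) under a mass function."""
    return sum(m * v for m, v in zip(masses, vals))

for _ in range(1000):
    u = sorted(random.uniform(-1, 1) for _ in xs)   # bounded increasing u
    h = [random.uniform(0, 1) for _ in xs]          # bounded non-negative h
    uh = [ui * hi for ui, hi in zip(u, h)]
    assert E(F, uh) * E(G, h) >= E(G, uh) * E(F, h) - 1e-12
print("Eq. (3) holds for all sampled (u, h)")
```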
As an illustration, we show that the following implications, which are well-known results under MLRP (see, for instance, Theorem 1.C.6 in Shaked & Shanthikumar (2007) and Theorem 1.4.3 in Muller & Stoyan (2002)), can be easily extended by directly applying the equivalent condition (3).
Implications. First, notice that Eq. (3) implies F FOSD G. Indeed, by setting h ≡ 1, Eq. (3) becomes E_F(u) ≥ E_G(u) for any bounded and increasing function u. This is equivalent to F FOSD G.
Second, Theorem 1 implies that if F ≽_SSD G, then the expectation of any monotone function, conditional on any subset of the support with positive measure, is greater under F. Formally,
Corollary 1. F ≽_SSD G if and only if for any bounded and increasing function u and any measurable set A ⊆ R with a positive probability w.r.t. both F and G,

E_F(u|A) ≥ E_G(u|A).  (4)

Proof. To show the "only if" part, let 1_A be the indicator function of A. Take h(x) = 1_A(x) in Theorem 1. Then

E_F(u|A) = E_F(u 1_A)/E_F(1_A) ≥ E_G(u 1_A)/E_G(1_A) = E_G(u|A),
9 The equivalent condition in Eq. (3) can also be derived directly from the MLRP condition.
However, to the best of our knowledge, this condition has not been stated explicitly and used in the
literature even in the case of MLRP.
as claimed.
The proof of the “if” part is relegated to the Appendix.
Note that, for any measurable set A of positive probability, the statement that Eq. (4) holds for every bounded and increasing function u is equivalent to F FOSD G conditional on A (i.e., F|A FOSD G|A). Thus,
Corollary 2. F ≽_SSD G if and only if for any measurable subset A ⊆ R with a positive probability w.r.t. both F and G, F|A FOSD G|A.
When F dominates G in the sense of MLRP, it is a well-known result that F|A FOSD G|A for any measurable set A ⊆ R with positive probability (see Theorem 1.C.6 of Shaked & Shanthikumar (2007)). Our result shows that a similar property holds under the weaker notion of SSD. Corollary 2 also suggests that our notion of SSD implies the "uniform conditional stochastic order"¹⁰ due to Whitt (1980).
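On a finite support, the characterization in Corollary 2 can be checked exhaustively over every conditioning set A (a sketch of ours; the mass functions are hypothetical examples):

```python
from itertools import combinations

xs = [0, 1, 2, 3]
F = [0.05, 0.15, 0.30, 0.50]  # F/G increasing, so F SSD-dominates G
G = [0.40, 0.30, 0.20, 0.10]

def conditional_cdf(masses, idx):
    """Running CDF of the distribution conditioned on the points xs[i], i in idx."""
    total = sum(masses[i] for i in idx)
    acc, out = 0.0, []
    for i in idx:
        acc += masses[i] / total
        out.append(acc)
    return out

ok = True
for r in range(1, len(xs) + 1):
    for idx in combinations(range(len(xs)), r):  # every conditioning set A
        cF = conditional_cdf(F, idx)
        cG = conditional_cdf(G, idx)
        # F|A FOSD G|A: the CDF under F lies weakly below that under G
        ok = ok and all(a <= b + 1e-12 for a, b in zip(cF, cG))
print(ok)  # True
```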
Another immediate implication of Theorem 1 is that the SSD relation is preserved when conditioning on any subset with positive probability.
Corollary 3. If F ≽_SSD G, then for any measurable subset A of the support with a positive probability w.r.t. both F and G, F|A ≽_SSD G|A.
Proof. It suffices to show that F ≽_SSD G implies that for any bounded and increasing function u, and any non-negative, bounded measurable function h, both defined on A, the inequality E_F(uh|A) E_G(h|A) ≥ E_G(uh|A) E_F(h|A) holds. This is equivalent to

[E_F(ū h̄)/E_F(1_A)] · [E_G(h̄)/E_G(1_A)] ≥ [E_G(ū h̄)/E_G(1_A)] · [E_F(h̄)/E_F(1_A)],

or

E_F(ū h̄) E_G(h̄) ≥ E_G(ū h̄) E_F(h̄),

where h̄(x) = h(x) if x ∈ A and h̄(x) = 0 if x ∉ A, and ū is an extension of u (which is defined on A) to the entire domain such that ū is increasing and coincides with u on the set A.¹¹
This inequality holds due to Theorem 1: ū is increasing and bounded and h̄ is non-negative, bounded and measurable.
¹⁰ We say that one probability measure is less than or equal to another in the sense of uniform conditional stochastic order (UCSO) if FOSD holds for each pair of conditional probability measures obtained by conditioning on certain appropriate subsets. The notion was first introduced in Whitt (1980) and extended further to the multivariate case in Whitt (1982). See also Rinott & Scarsini (2006) for a connection with the total positivity order.
¹¹ Clearly, such a monotone extension exists. For instance, for any x ∉ A, define ū(x) := sup_{x′∈A, x′<x} u(x′) if the set {x′ ∈ A | x′ < x} is not empty; otherwise, define ū(x) := M, where M is a constant satisfying M < inf_{x′∈A} u(x′) (such an M exists because u is bounded). One can check that the function ū thus defined is indeed increasing on the entire domain.
Corollary 4. Suppose F_n ≽_SSD F_{n−1} ≽_SSD ··· ≽_SSD F_1. Let p = (p_1, ..., p_n) and q = (q_1, ..., q_n) be probability distributions over {1, ..., n} with p ≽_SSD q. Then Σ_{i=1}^n p_i F_i ≽_SSD Σ_{i=1}^n q_i F_i.

Proof. Let F_p := Σ_{i=1}^n p_i F_i and F_q := Σ_{i=1}^n q_i F_i. W.l.o.g., assume that for every i ∈ {1, ..., n}, p_i and q_i are not both zero; otherwise F_i is not in the support of p and q and one can ignore such F_i. The assumption p ≽_SSD q implies that p_i/q_i is increasing in i. Hence, for every bounded, non-negative measurable function h such that E_{F_i}(h) > 0 for at least one i,¹²

( p_i E_{F_i}(h) / Σ_{j=1}^n p_j E_{F_j}(h) )_i  FOSD  ( q_i E_{F_i}(h) / Σ_{j=1}^n q_j E_{F_j}(h) )_i .

Denote I := {i ∈ {1, ..., n} | E_{F_i}(h) > 0}. The fact that F_n ≽_SSD F_{n−1} ≽_SSD ··· ≽_SSD F_1 implies that for any bounded, increasing function u, E_{F_i}(uh)/E_{F_i}(h) is increasing in i, i ∈ I. It follows that

Σ_{i∈I} [ p_i E_{F_i}(h) / Σ_{j=1}^n p_j E_{F_j}(h) ] · [ E_{F_i}(uh)/E_{F_i}(h) ] ≥ Σ_{i∈I} [ q_i E_{F_i}(h) / Σ_{j=1}^n q_j E_{F_j}(h) ] · [ E_{F_i}(uh)/E_{F_i}(h) ],

which implies (noting that for i ∉ I, p_i E_{F_i}(uh) / Σ_{j=1}^n p_j E_{F_j}(h) = 0 and q_i E_{F_i}(uh) / Σ_{j=1}^n q_j E_{F_j}(h) = 0)

Σ_{i=1}^n p_i E_{F_i}(uh) / Σ_{j=1}^n p_j E_{F_j}(h) ≥ Σ_{i=1}^n q_i E_{F_i}(uh) / Σ_{j=1}^n q_j E_{F_j}(h),

or E_{F_p}(uh)/E_{F_p}(h) ≥ E_{F_q}(uh)/E_{F_q}(h). Hence F_p ≽_SSD F_q.
Note that Corollary 4 may not hold if we assume p FOSD q instead. The following is a counterexample. Let F_1 be the uniform distribution on [0, 1]; let F_2 be degenerate at x = 1; and let F_3 be uniform on [1, 2]. Consider p = (0, 1/2, 1/2) and q = (1/2, 0, 1/2). In this case, we have p FOSD q but p ⋡_SSD q. One can verify that Σ_{i=1}^3 p_i F_i ⋡_SSD Σ_{i=1}^3 q_i F_i.
4 Bayesian Learning and Dynamics
Our main result, Theorem 1, has natural applications to Bayesian learning and dynamic
decision making. To see this, consider a decision maker (DM) who faces an unknown
state of nature p with support [0, 1]. The DM’s prior belief about p is captured by a
CDF F (p). Depending on the true state of nature, in each period, a random outcome
(either “U” or “D”) is generated with probability P(U) = p. Assume the random
outcomes are independent.
¹² The case in which E_{F_i}(h) = 0 for all i ∈ {1, ..., n} is trivial, since clearly E_{F_p}(uh) E_{F_q}(h) ≥ E_{F_q}(uh) E_{F_p}(h) holds. We therefore focus on the case in which E_{F_i}(h) > 0 for at least one i.
The parameter p may correspond, for instance, to the overall performance of a
company, and one may interpret “U” and “D” as upward and downward movements of
a company’s stock price, as in Cox et al. (1979). The parameter p may also correspond
to the failure rate of a machine. A new machine that functions properly has a known
failure rate of p0, published by the producer. A user keeps track of the performance of the machine and its failure history, and updates accordingly his expectation about the prospect of future failures. In this context, when the user has a prior on failure rates that allows a variety of possibilities, he will focus particularly on p0. It is then natural to assume that the prior over p's assigns a positive probability to p0.
In another setting, “U” and “D” may also be regarded as increase and decrease in
a firm’s output or sales, or “thumbs up” and “thumbs down” in the context of binary
consumer rating (Example 1 below).
A history h^t of t observations is an ordered sequence in {D, U}^t. Denote the set of all histories of length t by ℋ^t := {D, U}^t and denote a subset of ℋ^t by H^t. Conditional on the true state of nature being p, we denote the probability of observing a particular history h^t by B(p; h^t).¹³ The probability of observing a history from a given set H^t is therefore B(p; H^t) = Σ_{h^t ∈ H^t} B(p; h^t).
We say that a distribution F is non-degenerate if the probability that p ∈ (0, 1) (i.e., not 0 or 1) is positive. Otherwise, we say that the distribution F is degenerate. Formally, F is non-degenerate if and only if E_F(1_{{0,1}}) < 1. Note that if F is non-degenerate, then the probability of any set H^t of histories is positive: P_F(H^t) > 0. Moreover, F is degenerate if and only if F(p) = F(0) for every p ∈ (0, 1).
The following theorem shows that F ≽_SSD G is equivalent to F FOSD G conditional on any set of histories that has a positive probability. The case in which either F or G is degenerate is treated in Section 6.2.
Theorem 2. Two non-degenerate priors F and G satisfy F ≽_SSD G if and only if for any t and any H^t ⊆ ℋ^t, F|H^t FOSD G|H^t.
Proof. We prove here the “only if” direction using Theorem 1. The proof of the “if”
direction is relegated to the Appendix.
Assume that F ≽_SSD G and take any p ∈ (0, 1).¹⁴ We have

1 − F(p|H^t) = E_F(B(p; H^t) · 1_{(p,1]}) / E_F(B(p; H^t)) ≥ E_G(B(p; H^t) · 1_{(p,1]}) / E_G(B(p; H^t)) = 1 − G(p|H^t).

This inequality holds due to Theorem 1: 1_{(p,1]} is an increasing function on [0, 1] and B(p; H^t) is non-negative.
¹³ If h^t has k observations of outcome "U", then B(p; h^t) = p^k (1 − p)^{t−k}.
¹⁴ At the risk of abusing the notation, we use here the same symbol "p" for both the random variable and a specific value of the random variable.
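The "only if" direction of Theorem 2 can be illustrated numerically (a sketch of ours; the three-point prior supports and masses are assumed for the example, and singleton history sets are used for brevity even though the theorem covers arbitrary subsets H^t):

```python
from itertools import product

ps = [0.2, 0.5, 0.8]          # possible states of nature p
F = [0.1, 0.3, 0.6]           # prior masses with F/G increasing, so F SSD G
G = [0.6, 0.3, 0.1]

def B(p, hist):
    """Probability of one particular history (a tuple of 'U'/'D') given p."""
    k = hist.count('U')
    return p ** k * (1 - p) ** (len(hist) - k)

def posterior(prior, hists):
    """Posterior masses on ps after learning that the history lies in hists."""
    num = [w * sum(B(p, h) for h in hists) for w, p in zip(prior, ps)]
    z = sum(num)
    return [n / z for n in num]

ok = True
for t in (1, 2, 3):
    for hist in product('UD', repeat=t):
        postF, postG = posterior(F, [hist]), posterior(G, [hist])
        cF = cG = 0.0
        for wF, wG in zip(postF[:-1], postG[:-1]):
            cF, cG = cF + wF, cG + wG
            ok = ok and cF <= cG + 1e-12   # F|h FOSD G|h
print(ok)  # True
```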
As a special case of Theorem 2, we have F|h^t FOSD G|h^t for any history h^t. It is worth noting that in Theorem 2 the set of histories H^t can be any nonempty subset of ℋ^t. In particular, H^t does not have to contain all histories with the same number of "U". When F ≽_MLRP G, this result is well known and can be easily derived. Here, our purpose is to demonstrate that it can be extended to SSD using the equivalent condition we characterized in Theorem 1.
Theorem 2 has natural applications to dynamic decision problems in which the
DM’s utility depends on the unknown parameter p, but is unaffected by the sequence
of noisy signals. The signals only provide information about p. In particular, when
the DM’s utility function is increasing in p, Theorem 2 implies that his expected util-
ity conditional on any set of histories of observations is greater under the strongly
stochastically dominating prior.
Example 1. A store starts to operate through an online platform. Suppose consumers
arrive sequentially to purchase from the store. Let p, p ∈ [0, 1], denote the unknown
probability that a random consumer is satisfied with the store’s service and leaves a
positive feedback.15 We may regard the parameter p as a measure of the quality of the
store’s service. Let F be a consumer’s prior belief about the store’s quality, starting
with no observation of past feedbacks.
A history is an ordered sequence of feedbacks. The online platform, however, may not disclose the entire history of realized feedbacks. Instead, it often discloses some statistics of the realized history, or a particular set of feedback histories to which the realized one belongs. For instance, some platforms only disclose the feedbacks from the last T periods; others only disclose the total numbers of positive and negative feedbacks accumulated so far. In other words, each disclosure policy of feedbacks corresponds to a subset H^t ⊆ ℋ^t that contains the realized history.
What Theorem 2 tells us in this setup is the following: regardless of the disclosure policy the online platform adopts, as long as one consumer's prior belief dominates another prior in the sense of SSD, the posterior beliefs about the store's quality preserve the FOSD relation after observing the disclosed information. Consequently, if the consumers' utility functions are increasing in the service quality, those consumers with more favorable prior beliefs (in the sense of SSD) will expect to obtain higher utility from the store's service.
In some dynamic decision problems, however, a DM’s utility does hinge directly
on the history of signals. For instance, consider a firm that faces an unknown market
condition. The realized demand each period serves as a signal about the underlying
15Assume for simplicity that all consumers rate automatically and that feedbacks are binary: either
a positive or a negative feedback.
market condition, but at the same time, the firm's profit is directly affected by its demand level. In such problems, it might be useful to examine the probability distributions over a given set of histories induced by different prior beliefs in order to evaluate and compare the DM's expected utility. This is the motivation for our next result, Proposition 3 below.
Before we state and prove the proposition, notice first that with binary outcomes, we can order the elements of ℋ^t by the number of outcome "U" observed. Let h^t, h̄^t ∈ ℋ^t be two histories of length t. We say that h^t ranks above h̄^t, and write h^t ≽ h̄^t, if h^t contains (weakly) more "U"s than h̄^t.¹⁶ We then regard ℋ^t as a well-ordered set. For any subset H^t, its elements inherit the same ordering as in ℋ^t, and it is therefore meaningful to compare two posteriors over H^t in the sense of stochastic dominance.
Fix a set of histories H^t ⊆ ℋ^t. A history h̄^t partitions H^t into two disjoint subsets: H^t_+(h̄^t) := {h^t ∈ H^t | h^t ≽ h̄^t} and H^t_−(h̄^t) := {h^t ∈ H^t | h̄^t ≻ h^t}. The subset H^t_+(h̄^t) consists of those histories in H^t that are ranked above h̄^t (and including h̄^t itself), whereas H^t_−(h̄^t) consists of those that rank below it. Similarly, define another set ℋ^t_+(h̄^t) := {h^t ∈ ℋ^t | h^t ≽ h̄^t}. Clearly, H^t_+(h̄^t) = ℋ^t_+(h̄^t) ∩ H^t.
The following proposition establishes the relation between SSD and FOSD over any set of histories of the same length.

Proposition 3. Let F and G be two non-degenerate prior beliefs over [0, 1] such that F ≽_SSD G. Then for any set of histories H^t ⊆ ℋ^t, the probability distribution over H^t under F FOSD that under G. That is, for any h̄^t ∈ H^t, P_F(H^t_+(h̄^t)|H^t) ≥ P_G(H^t_+(h̄^t)|H^t).¹⁷
Proof. Let H^t ⊆ ℋ^t be a set of histories and h̄^t a particular history in H^t. The distribution function on H^t under F is therefore P_F(H^t_−(h̄^t)|H^t) and the corresponding
¹⁶ Note that given any prior distribution, the probabilities of observing two histories h^t, h̄^t that have the same number of outcome "U" are equal. In this case the two histories are equivalent.
¹⁷ We use P_F and P_G to denote also the probability functions on histories induced by F and G, respectively.
survival function is P_F(H^t_+(h̄^t)|H^t) = 1 − P_F(H^t_−(h̄^t)|H^t). With these notations,

P_F(H^t_+(h̄^t)|H^t) = P_F(H^t_+(h̄^t) ∩ H^t) / P_F(H^t)
= P_F(ℋ^t_+(h̄^t) ∩ H^t) / P_F(H^t)
= E_F(B(p; ℋ^t_+(h̄^t)) · B(p; H^t)) / E_F(B(p; H^t))
≥ E_G(B(p; ℋ^t_+(h̄^t)) · B(p; H^t)) / E_G(B(p; H^t))
= P_G(H^t_+(h̄^t)|H^t).

The inequality holds due to Theorem 1: B(p; ℋ^t_+(h̄^t)) is increasing in p (see Lemma 5 in the Appendix for a proof) and B(p; H^t) is non-negative. This completes the proof.
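Proposition 3 can be checked on a discrete example by comparing the ex-ante distributions of the number of "U"s over ℋ^t (a sketch of ours; the three-point priors are hypothetical):

```python
from math import comb

ps = [0.2, 0.5, 0.8]
F = [0.1, 0.3, 0.6]   # F/G increasing, so F SSD-dominates G
G = [0.6, 0.3, 0.1]
t = 4

def prob_k(prior, k):
    """Ex-ante probability of observing exactly k 'U's in t periods."""
    return sum(w * comb(t, k) * p ** k * (1 - p) ** (t - k) for w, p in zip(prior, ps))

# FOSD over H^t = all histories of length t, ordered by the number of 'U's:
ok = all(
    sum(prob_k(F, j) for j in range(k, t + 1))
    >= sum(prob_k(G, j) for j in range(k, t + 1)) - 1e-12
    for k in range(t + 1)
)
print(ok)  # True
```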
Example 2. Consider a DM who faces a one-armed bandit machine that yields a payoff of 1 or −1 in each period with unknown probabilities p and 1 − p, respectively. Let F be a prior belief on [0, 1]. In each period, the DM decides whether to switch to a safe arm with a known payoff forever, or to pay a fixed cost to continue with the risky arm. Suppose that the DM follows a particular stopping strategy. Proposition 3 implies the following: ex ante, conditional on the DM not having switched to the safe arm by a particular period t (this event corresponds to a set of histories of length t), his expected discounted net payoff in that period is greater under a strongly stochastically dominating prior belief.
Theorem 2 and Proposition 3 demonstrate that in the setup of Bayesian learning,
the SSD relation between two prior beliefs is related to two different conditions. On
the one hand, SSD is equivalent to the FOSD relation of posterior beliefs conditional
on any set of histories of the same length. On the other hand, it implies the FOSD
relation over any set of histories of the same length. The natural question is whether
these three conditions are equivalent. The answer is affirmative. We relegate the proof
of the following result to the Appendix.
Theorem 3. Let F and G be two non-degenerate priors. The following conditions are equivalent:

1. F ≽SSD G;

2. for any t and any H^t ⊆ ℋ^t, F|H^t FOSD G|H^t;

3. for any t and any H^t ⊆ ℋ^t, the probability distribution over H^t under F FOSD that under G.
As a final application of Theorem 1 in the setup of learning and dynamics, consider
the expected probability of observing outcome “U” in the next period conditional on
any given set of length-t histories H t. This expected probability is simply
$$E_F(p \mid H^t) = \frac{E_F\big(p\,B(p;H^t)\big)}{E_F\big(B(p;H^t)\big)}.$$
Suppose F ≽SSD G. Then, by taking u(p) = p and h(p) = B(p;H^t), it follows from
Theorem 1 that
EF (p|H t) ≥ EG(p|H t). (5)
This implies that the expected probability of receiving outcome U conditional on any
set of histories H t is greater under a strongly stochastically dominating prior. If we
regard (1− EF (p|H t),EF (p|H t)) and (1− EG(p|H t),EG(p|H t)) as probability distri-
butions on {D,U} under the respective prior beliefs F and G, Eq. (5) also implies
that the distribution on outcomes {D,U} under F FOSD that under G. This echoes
Theorem 1 of Bikhchandani et al. (1992).18
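Eq. (5) can likewise be illustrated numerically. In the sketch below the two priors, the convex map φ(y) = y², and the history set H^t are illustrative assumptions; the check is that the posterior mean E(p | H^t) is larger under the SSD-dominating prior.

```python
import math

# Illustrative discrete priors: F = phi(G) with phi(y) = y**2 convex,
# so F SSD-dominates G.  H^t is taken to be all length-t histories with
# between 3 and 6 outcomes "U"; all of these choices are assumptions.
grid = [(i + 1) / 10 for i in range(9)]
n = len(grid)
g_cdf = [(i + 1) / n for i in range(n)]
f_cdf = [y ** 2 for y in g_cdf]
f_pmf = [f_cdf[0]] + [f_cdf[i] - f_cdf[i - 1] for i in range(1, n)]
g_pmf = [g_cdf[0]] + [g_cdf[i] - g_cdf[i - 1] for i in range(1, n)]

t = 8
def B(p):   # B(p; H^t) for the chosen history set
    return sum(math.comb(t, k) * p**k * (1 - p)**(t - k) for k in range(3, 7))

def post_mean(w):
    num = sum(wi * p * B(p) for p, wi in zip(grid, w))
    den = sum(wi * B(p) for p, wi in zip(grid, w))
    return num / den          # E(p B(p; H^t)) / E(B(p; H^t))

ef, eg = post_mean(f_pmf), post_mean(g_pmf)
assert ef >= eg               # Eq. (5)
print(round(ef, 4), round(eg, 4))
```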
5 Other Economic Applications: Extensions of Existing Results
In this section we demonstrate more applications of the equivalent condition in Theorem
1. A few of these applications, including the first one (monotone comparative statics),
extend existing models and show that the corresponding results can be proven by using
the condition given in Theorem 1.
5.1 Monotone Comparative Statics
One of the most important roles of MLRP in economic applications is to derive com-
parative statics results, in particular, in stochastic optimization problems (e.g., Karlin
et al. (1956), Athey (2002)). The seminal paper Athey (2002) introduced conditions
under which a decision maker’s optimal actions are monotone w.r.t. an exogenous pa-
rameter, which determines the distribution of a payoff-relevant state. One fundamental
result of Athey (2002) is that the single-crossing property of the utility function, com-
bined with a family of distributions ordered by MLRP, guarantees the monotonicity
18Bikhchandani et al. (1992) focus on the expected probability distribution over the signal space
conditional on any single history, rather than on any set of histories. In addition, the state space
in Bikhchandani et al. (1992) is restricted to be finite. In their paper, the authors also consider the
more general situation in which the signal space has more than two elements. They show that extra
structures on prior beliefs are needed in order to obtain similar results.
of the optimal actions. In this subsection we show that this result can be generalized
to the family of distributions (atomic and non-atomic alike), ordered by SSD. This is
done by using the equivalent condition stated in Theorem 1.
Consider a utility function u : X × S → R, where X ⊆ R is the action set and S ⊆ R is the set of unknown states. Assume u(x, s) satisfies the single-crossing property in
(x, s).19 Let {Ft(s)}t∈T be a family of CDFs on S parameterized by t, where T is a
partially ordered set. Define
$$X^*(t, B) := \operatorname*{argmax}_{x \in B} \int_S u(x, s)\,dF_t(s).$$
This is the set of optimal actions,20 given t and the constraint B. We say that X∗(t, B)
satisfies the monotone comparative statics if X∗(t, B) is increasing in t and B in the
sense of strong set order. The following result is due to Athey (2002) (Theorem 2):
Theorem 4. If a measurable function u satisfies the single-crossing property in (x, s)
and the family {Ft(s)}t∈T is ordered by MLRP,21 then X∗(t, B) satisfies the monotone
comparative statics.
We show that a similar result holds under the weaker condition that the family of
distributions {Ft(s)}t∈T is ordered by SSD.
Theorem 5. If a measurable function u satisfies the single-crossing property in (x, s)
and the family {Ft(s)}t∈T is ordered by SSD, then X∗(t, B) satisfies the monotone
comparative statics.
Proof. Since u(x, s) satisfies the single-crossing property in (x, s), if we take any x_H, x_L ∈ X with x_H > x_L, the function g(s) := u(x_H, s) − u(x_L, s) satisfies the single-crossing property in s. Hence there exist s_0, s_1 ∈ S with s_1 ≥ s_0 such that g(s) ≤ 0 for all s ≤ s_0 (with g(s) < 0 for all s < s_0), and g(s) ≥ 0 for all s ≥ s_1 (with g(s) > 0 for all s > s_1). Define G(t) := ∫_S g(s) dF_t(s) = E_{F_t}(g). We first show that G(t) satisfies the single-crossing property in t.

Let g⁺ := max{g, 0} and g⁻ := −min{g, 0}. Then G(t) = E_{F_t}(g⁺) − E_{F_t}(g⁻). It suffices to show that if t_H > t_L, t_H, t_L ∈ T, then E_{F_{t_H}}(g) ≤ (<) 0 implies E_{F_{t_L}}(g) ≤ (<) 0, or, equivalently, that E_{F_{t_H}}(g⁺) ≤ (<) E_{F_{t_H}}(g⁻) implies E_{F_{t_L}}(g⁺) ≤ (<) E_{F_{t_L}}(g⁻).
19 Let X ⊆ R. A function g : X → R satisfies the single-crossing property in x if there exist x_0, x_1 ∈ [inf X, sup X] with x_0 ≤ x_1 such that g(x) < 0 (g(x) ≤ 0) for all x ∈ X with x < x_0 (x ≤ x_0), g(x) > 0 (g(x) ≥ 0) for all x ∈ X with x > x_1 (x ≥ x_1), and g(x) = 0 for x ∈ [x_0, x_1]. A function of two variables u : X × S → R satisfies the single-crossing property in (x, s) if for all x_H > x_L, x_H, x_L ∈ X, the function g(s) := u(x_H, s) − u(x_L, s) satisfies the single-crossing property in s.

20 Assume maximizers exist. The focus is on sufficient conditions that guarantee monotonicity of the set of maximizers, rather than existence of maximizers.

21 Namely, for any t_H, t_L ∈ T with t_H > t_L, F_{t_H} ≽MLRP F_{t_L}. This is equivalent to the density f_t(s) being log-supermodular in (s, t).
Since F_{t_H} ≽SSD F_{t_L}, |g| is non-negative, and 1_{[s_1,∞)∩S} and 1_{(−∞,s_0]∩S} are increasing and decreasing on S, respectively, it follows from Theorem 1 that

$$E_{F_{t_L}}\big(|g|\,1_{[s_1,\infty)\cap S}\big)\,E_{F_{t_H}}(|g|) \le E_{F_{t_H}}\big(|g|\,1_{[s_1,\infty)\cap S}\big)\,E_{F_{t_L}}(|g|), \tag{6}$$

$$E_{F_{t_L}}\big(|g|\,1_{(-\infty,s_0]\cap S}\big)\,E_{F_{t_H}}(|g|) \ge E_{F_{t_H}}\big(|g|\,1_{(-\infty,s_0]\cap S}\big)\,E_{F_{t_L}}(|g|). \tag{7}$$

On the other hand, the assumption E_{F_{t_H}}(g⁺) ≤ (<) E_{F_{t_H}}(g⁻) is equivalent to

$$E_{F_{t_H}}\big(|g|\,1_{[s_1,\infty)\cap S}\big)\,E_{F_{t_L}}(|g|) \le (<)\; E_{F_{t_H}}\big(|g|\,1_{(-\infty,s_0]\cap S}\big)\,E_{F_{t_L}}(|g|). \tag{8}$$

Combining Eqs. (6), (7) and (8), we conclude that

$$E_{F_{t_L}}\big(|g|\,1_{[s_1,\infty)\cap S}\big)\,E_{F_{t_H}}(|g|) \le (<)\; E_{F_{t_L}}\big(|g|\,1_{(-\infty,s_0]\cap S}\big)\,E_{F_{t_H}}(|g|),$$

which implies that E_{F_{t_L}}(g⁺) ≤ (<) E_{F_{t_L}}(g⁻). This establishes that G(t) is single-crossing in t, and hence U(x, t) := ∫_S u(x, s) dF_t(s) is single-crossing in (x, t).

It follows from a well-known monotonicity theorem due to Milgrom & Shannon (1994) (Theorem 4)²² that X*(t, B) is increasing in (t, B) in the sense of strong set order. This completes the proof of Theorem 5.
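A small numerical sketch of Theorem 5 (not from the paper; the quadratic payoff, the action grid, and the discrete SSD-ordered beliefs are all illustrative assumptions): the difference u(x_H, s) − u(x_L, s) below is increasing in s, hence single-crossing, and the optimal action comes out weakly higher under the dominating belief.

```python
# Two discrete beliefs over the state s, ordered by SSD: F = phi(G) with
# phi(y) = y**2 convex.  The payoff and the action grid are assumptions.
states = [(i + 1) / 10 for i in range(9)]
n = len(states)
g_cdf = [(i + 1) / n for i in range(n)]
f_cdf = [y ** 2 for y in g_cdf]
f_pmf = [f_cdf[0]] + [f_cdf[i] - f_cdf[i - 1] for i in range(1, n)]
g_pmf = [g_cdf[0]] + [g_cdf[i] - g_cdf[i - 1] for i in range(1, n)]

def u(x, s):
    return x * s - 0.5 * x * x     # u(xH, .) - u(xL, .) is increasing in s

actions = [i / 20 for i in range(21)]              # the constraint set B

def best_action(pmf):
    return max(actions,
               key=lambda x: sum(w * u(x, s) for s, w in zip(states, pmf)))

xF, xG = best_action(f_pmf), best_action(g_pmf)
assert xF >= xG                    # monotone comparative statics
print(xF, xG)
```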
5.2 Production Expansion
Production Expansion. Consider a monopolistic firm facing an inverse demand function p(Q) = a − bQ, where p is the market price of its product, Q is the firm's output choice, and b is a positive constant. The intercept a is an uncertain parameter with support [a̲, ā], which captures the uncertain level of market demand. For each realized a, the firm chooses output Q to maximize its net profit (p(Q) − c)Q, where c is the constant marginal cost. However, due to fixed factory size, capital stock, etc., the firm's total output level is bounded by a fixed capacity Q̄, which prevents the firm from capturing extra demand once the realized market demand exceeds a certain level. With the constraint Q̄, the firm's optimal output strategy when the demand level is a is

$$Q^*(a) := \begin{cases} \dfrac{1}{2b}(a - c), & \text{if } a \in [\underline{a}, \hat{a}),\\[4pt] \bar{Q}, & \text{if } a \in [\hat{a}, \bar{a}], \end{cases}$$

where the threshold â is pinned down by Q̄ = (1/2b)(â − c). The corresponding optimal profit function is

$$\pi^*(a) := \begin{cases} \dfrac{1}{4b}(a - c)^2, & \text{if } a \in [\underline{a}, \hat{a}),\\[4pt] \dfrac{1}{2b}(\hat{a} - c)(a - c) - \dfrac{1}{4b}(\hat{a} - c)^2, & \text{if } a \in [\hat{a}, \bar{a}]. \end{cases}$$
22 The theorem states the following. Let f : X × T → R, where X is a lattice and T is a partially ordered set, and let B ⊆ X. Then argmax_{x∈B} f(x, t) is increasing in (t, B) if and only if f is quasi-supermodular in x and satisfies the single-crossing property in (x, t). For our case, where X ⊆ R, the requirement that f is quasi-supermodular in x is satisfied.
The management of the firm is considering whether it is worthwhile to invest a certain amount of money to expand production so as to remove the output constraint.²³ Suppose the firm decides in favor of executing such an investment. The optimal output and profit then become Q**(a) := (1/2b)(a − c) and π**(a) := (1/4b)(a − c)², respectively, for all a ∈ [a̲, ā]. In particular, π**(a) = u(a)π*(a), where

$$u(a) = \begin{cases} 1, & \text{if } a \in [\underline{a}, \hat{a}),\\[6pt] \dfrac{(a-c)/(\hat{a}-c)}{\,2 - (\hat{a}-c)/(a-c)\,}, & \text{if } a \in [\hat{a}, \bar{a}], \end{cases}$$

is a non-negative, bounded and weakly increasing function of a.

Now consider two different beliefs F and G about the uncertain demand parameter a, where F ≽SSD G. Theorem 1 implies

$$\frac{E_F(\pi^{**}(a))}{E_F(\pi^*(a))} = \frac{E_F(u(a)\pi^*(a))}{E_F(\pi^*(a))} \ge \frac{E_G(u(a)\pi^*(a))}{E_G(\pi^*(a))} = \frac{E_G(\pi^{**}(a))}{E_G(\pi^*(a))}.$$

Thus, the expected rate of return increases when the belief increases in the sense of SSD. It follows that, given the production expansion cost k, if the expansion is worthwhile under one belief, then it is also worthwhile under any belief that is more favorable in the sense of SSD.
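The rate-of-return comparison can be checked on a discretized example. The demand parameters, the capacity threshold, and the two beliefs below are illustrative assumptions, with F a convex transformation of G.

```python
# Illustrative monopoly parameters and two discrete beliefs about the demand
# intercept a; F = phi(G) with phi(y) = y**2 convex, so F SSD-dominates G.
b, c = 1.0, 1.0
a_hat = 3.0                                  # capacity binds for a >= a_hat
avals = [2.0 + i * 0.25 for i in range(9)]   # support of a: 2.0, ..., 4.0
n = len(avals)
g_cdf = [(i + 1) / n for i in range(n)]
f_cdf = [y ** 2 for y in g_cdf]
f_pmf = [f_cdf[0]] + [f_cdf[i] - f_cdf[i - 1] for i in range(1, n)]
g_pmf = [g_cdf[0]] + [g_cdf[i] - g_cdf[i - 1] for i in range(1, n)]

def profit_capped(a):    # pi*(a): optimal profit with the capacity constraint
    if a < a_hat:
        return (a - c) ** 2 / (4 * b)
    return (a_hat - c) * (a - c) / (2 * b) - (a_hat - c) ** 2 / (4 * b)

def profit_free(a):      # pi**(a): optimal profit without the constraint
    return (a - c) ** 2 / (4 * b)

def ratio(pmf):
    num = sum(w * profit_free(a) for a, w in zip(avals, pmf))
    den = sum(w * profit_capped(a) for a, w in zip(avals, pmf))
    return num / den     # expected rate of return on the expansion

rF, rG = ratio(f_pmf), ratio(g_pmf)
assert rF >= rG >= 1.0
print(round(rF, 4), round(rG, 4))
```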
The same conclusion can be reached by applying Theorem 5 directly. Suppose that a firm chooses between two actions: invest an amount k in order to expand production (x = 1), or not (x = 0). The net profit depends on the firm's action and the underlying state of demand a according to the following formula:

$$\Pi(x, a) := \begin{cases} \dfrac{1}{4b}(a - c)^2 - k, & \text{if } x = 1,\\[4pt] \dfrac{1}{4b}(a - c)^2, & \text{if } x = 0,\ a \in [\underline{a}, \hat{a}),\\[4pt] \dfrac{1}{2b}(\hat{a} - c)(a - c) - \dfrac{1}{4b}(\hat{a} - c)^2, & \text{if } x = 0,\ a \in [\hat{a}, \bar{a}]. \end{cases}$$

The function Π(x, a) satisfies the single-crossing property in (x, a). Indeed,

$$\Pi(1, a) - \Pi(0, a) = \begin{cases} -k, & \text{if } a \in [\underline{a}, \hat{a}),\\[4pt] \dfrac{1}{4b}(a - c)^2 - k - \dfrac{1}{2b}(\hat{a} - c)(a - c) + \dfrac{1}{4b}(\hat{a} - c)^2, & \text{if } a \in [\hat{a}, \bar{a}], \end{cases}$$

is strictly increasing in a for a ∈ [â, ā]. Therefore, if F ≽SSD G, then by Theorem 5, the optimal production expansion decision under F is weakly higher than the optimal decision under G.
23 When the production constraint is only partially relaxed (namely, when the firm can produce up to a level higher than the current constraint Q̄, but lower than the maximal market demand (1/2b)(ā − c)), the result still holds. In this case, one can easily check that π**(a)/π*(a) is increasing in a.
5.3 Securities Markets
Consider the following simple model of a securities market as in Milgrom (1981). There
are two security types: a riskless one that yields return 1, and a risky one that yields
a random return θ. Investors in the market are assumed to be homogeneous: all hold
1 unit of the riskless security and 1 unit of the risky security, and all share the same
concave and strictly increasing utility function of wealth U . Additionally, assume the
function U is differentiable and U ′(·) is bounded. We normalize the market price for
the riskless security to be 1 and let the price of the risky security be p. After observing
the security prices and taking into account budget constraints, each investor decides
how many units of each security to hold.
For the market to clear, the price p of the risky security must ensure that no trading takes place; namely, holding 1 unit of each kind of security is the equilibrium. More specifically, let each individual choose to hold x units of the risky security and 1 + p − xp units of the riskless security to maximize expected utility, given price p:

$$\max_{x \ge 0}\; E\big[U\big((1 + p - xp) + x\theta\big)\big] \quad \text{s.t.}\quad 1 + p - xp \ge 0.$$

To clear the market, x* = 1 must be the solution, so it has to satisfy the first-order condition

$$E\big[\theta\, U'\big((1 + p - x^*p) + x^*\theta\big)\big] = p\,E\big[U'\big((1 + p - x^*p) + x^*\theta\big)\big],$$

which can be written as

$$p = \frac{E\big[\theta\, U'(1 + \theta)\big]}{E\big[U'(1 + \theta)\big]}.$$

Take u(θ) = θ and h(θ) = U′(1 + θ) in Theorem 1. It follows that if F ≽SSD G, then p_F ≥ p_G, i.e., the equilibrium price of the risky security is higher under F.
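A numerical sketch of the pricing comparison (the utility U(w) = 2√w, the return grid, and the two discrete return distributions are illustrative assumptions):

```python
import math

# Two discrete distributions of the risky return theta, ordered by SSD:
# F = phi(G) with phi(y) = y**2 convex.  All parameters are assumptions.
thetas = [0.5 + i * 0.25 for i in range(9)]      # possible risky returns
n = len(thetas)
g_cdf = [(i + 1) / n for i in range(n)]
f_cdf = [y ** 2 for y in g_cdf]
f_pmf = [f_cdf[0]] + [f_cdf[i] - f_cdf[i - 1] for i in range(1, n)]
g_pmf = [g_cdf[0]] + [g_cdf[i] - g_cdf[i - 1] for i in range(1, n)]

def U_prime(w):
    # U(w) = 2*sqrt(w): concave, strictly increasing, U' bounded on the support
    return 1.0 / math.sqrt(w)

def price(pmf):
    num = sum(w * th * U_prime(1 + th) for th, w in zip(thetas, pmf))
    den = sum(w * U_prime(1 + th) for th, w in zip(thetas, pmf))
    return num / den     # p = E[theta U'(1+theta)] / E[U'(1+theta)]

pF, pG = price(f_pmf), price(g_pmf)
assert pF >= pG          # equilibrium price is higher under F
print(round(pF, 4), round(pG, 4))
```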
5.4 Value-at-Risk of a Portfolio
Let X be the random return of a portfolio. The ‘value-at-risk’ measurement is con-
cerned with the performance of the portfolio at a specific low percentile. For simplicity,
assume that both F and G are strictly increasing (but may possess atoms) on the same
interval in R. Suppose furthermore that F �SSD G.
Fix a probability p and denote c(F, p) := F⁻¹(p) and c(G, p) := G⁻¹(p). This means that P_F(X ≤ c(F, p)) = p and P_G(X ≤ c(G, p)) = p. As F ≽SSD G implies F FOSD G, it follows that c(F, p) ≥ c(G, p), and therefore E_G(X | X > c(F, p)) ≥ E_G(X | X > c(G, p)). Combining this with Corollary 1, which yields the first inequality below, we obtain

$$E_F\big(X \mid X > c(F, p)\big) \ge E_G\big(X \mid X > c(F, p)\big) \ge E_G\big(X \mid X > c(G, p)\big).$$

This implies that when F strongly stochastically dominates G, the performance of X above any percentile is better w.r.t. F than w.r.t. G.
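The percentile comparison can be verified on a discrete example; the return grid and the two distributions below are illustrative assumptions, with F a convex transformation of G.

```python
# Illustrative discrete portfolio-return distributions with F SSD-dominating G.
xs = [(i + 1) / 10 for i in range(9)]            # portfolio return values
n = len(xs)
g_cdf = [(i + 1) / n for i in range(n)]
f_cdf = [y ** 2 for y in g_cdf]                  # F = phi(G), phi convex
f_pmf = [f_cdf[0]] + [f_cdf[i] - f_cdf[i - 1] for i in range(1, n)]
g_pmf = [g_cdf[0]] + [g_cdf[i] - g_cdf[i - 1] for i in range(1, n)]

def quantile(cdf, p):                            # c(F, p) = F^{-1}(p)
    return next(x for x, c in zip(xs, cdf) if c >= p)

def tail_mean(pmf, cutoff):                      # E(X | X > cutoff)
    num = sum(w * x for x, w in zip(xs, pmf) if x > cutoff)
    den = sum(w for x, w in zip(xs, pmf) if x > cutoff)
    return num / den

p = 0.25
cF, cG = quantile(f_cdf, p), quantile(g_cdf, p)
assert cF >= cG                                  # F FOSD G
assert tail_mean(f_pmf, cF) >= tail_mean(g_pmf, cF) >= tail_mean(g_pmf, cG)
print(cF, cG)
```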
6 Final Comments
6.1 On Absolute Continuity
Suppose that F ≽SSD G. Denote x̲ := inf{x | F(x) > 0} and D := (x̲, ∞). The following proposition states that the probability measure induced by G is absolutely continuous w.r.t. that induced by F on D. In other words, every event in D with positive probability w.r.t. G has positive probability also w.r.t. F. This extends the observation that any atom of G which is greater than x̲ is also an atom of F. One can find examples where G has an atom at x̲ which is not an atom of F. For instance, consider F and G with support [0, 1], where F(x) = x for x ∈ [0, 1] and

$$G(x) = \begin{cases} \tfrac{1}{2}, & \text{if } x = 0,\\[2pt] \tfrac{1}{2} + \tfrac{1}{2}x, & \text{if } x \in (0, 1]. \end{cases}$$

Proposition 4. If F ≽SSD G, then P_G(·|D) is absolutely continuous w.r.t. P_F(·|D). The Radon–Nikodym derivative dP_G(·|D)/dP_F(·|D) is non-increasing, except on a set of measure 0 w.r.t. P_F(·|D).
The proof is relegated to the Appendix (A.9).
6.2 More on Dynamics and Learning — The Degenerate Case
Recall that Theorem 2 assumes that both priors F and G are non-degenerate. Here
we consider the case in which at least one of them is degenerate.
Proposition 5. Suppose that at least one of F and G is degenerate. Then, F ≽SSD G if and only if for any set of histories H^t with positive probability, F|H^t FOSD G|H^t.
The proof can be found in the Appendix (A.9).
References
Athey, S. (2002). Monotone comparative statics under uncertainty. The Quarterly
Journal of Economics, 117 (1), 187–223.
Bikhchandani, S., Segal, U., & Sharma, S. (1992). Stochastic dominance under Bayesian learning. Journal of Economic Theory, 56 (2), 352–377.
Blackwell, D. (1953). Equivalent comparisons of experiments. The Annals of Mathe-
matical Statistics, 24 (2), 265–272.
Chan, W., Proschan, F., & Sethuraman, J. (1991). Convex-ordering among functions, with applications to reliability and mathematical statistics. In Topics in Statistical Dependence (pp. 121–134). IMS Lecture Notes–Monograph Series, Vol. 16.
Chateauneuf, A., Cohen, M., & Meilijson, I. (2004). Four notions of mean-preserving
increase in risk, risk attitudes and applications to the rank-dependent expected utility
model. Journal of Mathematical Economics, 40 (5), 547–571.
Chew, S. H., Karni, E., & Safra, Z. (1987). Risk aversion in the theory of expected
utility with rank dependent probabilities. Journal of Economic Theory, 42 (2), 370–
381.
Cox, J. C., Ross, S. A., & Rubinstein, M. (1979). Option pricing: A simplified approach.
Journal of Financial Economics, 7 (3), 229–263.
Karlin, S., Rubin, H., et al. (1956). The theory of decision procedures for distributions
with monotone likelihood ratio. The Annals of Mathematical Statistics, 27 (2), 272–
299.
Milgrom, P. & Shannon, C. (1994). Monotone comparative statics. Econometrica,
62 (1), 157–180.
Milgrom, P. R. (1981). Good news and bad news: Representation theorems and appli-
cations. The Bell Journal of Economics, 12 (2), 380–391.
Müller, A. & Stoyan, D. (2002). Comparison methods for stochastic models and risks. Wiley, New York.
Rinott, Y. & Scarsini, M. (2006). Total positivity order and the normal distribution.
Journal of Multivariate Analysis, 97 (5), 1251–1261.
Shaked, M. & Shanthikumar, J. G. (2007). Stochastic orders. Springer Science &
Business Media.
Whitt, W. (1980). Uniform conditional stochastic order. Journal of Applied Probability,
17 (1), 112–123.
Whitt, W. (1982). Multivariate monotone likelihood ratio and uniform conditional
stochastic order. Journal of Applied Probability, 19 (3), 695–701.
A Appendix
A.1 Proof of Proposition 1
“If” Part. Suppose that, on the contrary, F ⋡SSD G. Then there exist x1 > x2 > x3
such that (see Lemma 1)
[F (x1)− F (x2)][G(x2)−G(x3)] < [G(x1)−G(x2)][F (x2)− F (x3)]. (9)
This implies that G(x1) − G(x2) > 0 and F(x2) − F(x3) > 0, hence x2 < x̄. There are
two possibilities: either G(x2)−G(x3) = 0 or G(x2)−G(x3) > 0.
• When G(x2) − G(x3) = 0, it implies that ϕ maps the value G(x2) to a set containing both F(x3) and F(x2), i.e., {F(x3), F(x2)} ⊆ ϕ(G(x2)), which contradicts the assumption that ϕ, with the property F(x) = ϕ(G(x)) for x ∈ (−∞, x̄], is a function.

• When G(x2) − G(x3) > 0, by Eq. (9),

$$\frac{F(x_1) - F(x_2)}{G(x_1) - G(x_2)} < \frac{F(x_2) - F(x_3)}{G(x_2) - G(x_3)}. \tag{10}$$

  – If x1 ≤ x̄, then Eq. (10) implies that ϕ is not convex, a contradiction.

  – If x1 > x̄, then G(x1) = G(x̄) = 1 and F(x̄) − F(x2) ≤ F(x1) − F(x2) (notice that x2 < x̄), hence Eq. (10) implies

$$\frac{F(\bar{x}) - F(x_2)}{G(\bar{x}) - G(x_2)} \le \frac{F(x_1) - F(x_2)}{G(x_1) - G(x_2)} < \frac{F(x_2) - F(x_3)}{G(x_2) - G(x_3)}.$$

  So we have found points x̄ > x2 > x3 such that ϕ is not convex, which again is a contradiction.
“Only If” Part. Suppose Eq. (1) holds, but the transformation ϕ such that F(x) = ϕ(G(x)), x ∈ (−∞, x̄], is not a convex function. Then either ϕ is not a function, or ϕ is a function but is not convex.

• If ϕ is not a function for x ∈ (−∞, x̄], then there exist x2 and x3 smaller than x̄ such that G(x2) = G(x3) < 1 (w.l.o.g., assume x3 < x2), but F(x3) < F(x2). Namely, ϕ maps G(x2) to a set containing both F(x2) and F(x3). Take x1 = x̄. We then have G(x1) = 1 > G(x2) = G(x3) and F(x1) ≥ F(x2) > F(x3). Hence,

$$[F(x_1) - F(x_2)][G(x_2) - G(x_3)] = 0 < [F(x_2) - F(x_3)][G(x_1) - G(x_2)],$$

which contradicts condition 2 in Lemma 1.

• If ϕ is a function, but is not convex for x ∈ (−∞, x̄], then one can find x1, x2, x3 with x̄ ≥ x1 > x2 > x3 such that Eq. (1) does not hold, which leads to a contradiction.
A.2 Proof of Lemma 1
To establish the equivalence of the first two conditions, simply add the constant
[F (x1) − F (x2)][G(x1) − G(x2)] to both sides of the second inequality. Similarly, the
equivalence between the second and third conditions can be established by adding the
constant [F (x2)− F (x3)][G(x2)−G(x3)] to both sides of the second inequality.
Next, we show that condition 2 implies condition 4. Let x1 > x2 ≥ x3 > x4. By
condition 2,
[F (x1)− F (x2)][G(x2)−G(x3)] ≥ [F (x2)− F (x3)][G(x1)−G(x2)], (11)
[F (x2)− F (x3)][G(x3)−G(x4)] ≥ [F (x3)− F (x4)][G(x2)−G(x3)], (12)
[F (x1)− F (x3)][G(x3)−G(x4)] ≥ [F (x3)− F (x4)][G(x1)−G(x3)]. (13)
In case [F(x2) − F(x3)][G(x2) − G(x3)] > 0, multiplying Eqs. (11) and (12) and dividing by this positive product gives condition 4. Otherwise, [F(x2) − F(x3)][G(x2) − G(x3)] = 0, which leads to one of two cases. (a) Both F(x2) − F(x3) = 0 and G(x2) − G(x3) = 0, implying F(x1) − F(x3) = F(x1) − F(x2) and G(x1) − G(x3) = G(x1) − G(x2). These equations imply, by Eq. (13), condition 4. (b) Exactly one of F(x2) − F(x3) = 0 and G(x2) − G(x3) = 0 holds. If F(x2) − F(x3) = 0 (so G(x2) − G(x3) > 0), then by Eq. (12), F(x3) − F(x4) = 0, which in turn guarantees condition 4. If G(x2) − G(x3) = 0 (so F(x2) − F(x3) > 0), then by Eq. (11), G(x1) − G(x2) = 0, which again guarantees condition 4.

The fact that condition 4 implies condition 2 is immediate: simply take x2 = x3. This completes the proof of Lemma 1.
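The three- and four-point conditions used in this proof can be verified by brute force on a discrete example in which F is an explicit convex transformation of G (the grid and the map φ(y) = y³ are illustrative assumptions):

```python
import itertools

# F = phi(G) with phi(y) = y**3 convex, evaluated on a grid of n points.
n = 12
G = [(i + 1) / n for i in range(n)]
F = [y ** 3 for y in G]

# Condition 2 (as in Eq. (11)): for x1 > x2 > x3,
# [F(x1)-F(x2)][G(x2)-G(x3)] >= [F(x2)-F(x3)][G(x1)-G(x2)]
for i3, i2, i1 in itertools.combinations(range(n), 3):    # i3 < i2 < i1
    lhs = (F[i1] - F[i2]) * (G[i2] - G[i3])
    rhs = (F[i2] - F[i3]) * (G[i1] - G[i2])
    assert lhs >= rhs - 1e-12

# Condition 4: for x1 > x2 >= x3 > x4,
# [F(x1)-F(x2)][G(x3)-G(x4)] >= [F(x3)-F(x4)][G(x1)-G(x2)]
ok4 = all((F[i1] - F[i2]) * (G[i3] - G[i4])
          >= (F[i3] - F[i4]) * (G[i1] - G[i2]) - 1e-12
          for i4, i3, i2, i1 in itertools.combinations(range(n), 4))
assert ok4
print("conditions 2 and 4 hold on the grid")
```

Both inequalities reduce to comparing slopes of the convex map φ over intervals of G-values, which is why they hold for any such pair.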
A.3 Proof of Theorem 1
“If” Part. We first prove that Eq. (3) implies F ≽SSD G (or Eq. (1)). Suppose, by contradiction, that F ⋡SSD G. Then there exist x1 > x2 > x3 such that

$$[F(x_1) - F(x_2)][G(x_1) - G(x_3)] < [F(x_1) - F(x_3)][G(x_1) - G(x_2)]. \tag{14}$$

The monotonicity of F and G implies that for Eq. (14) to hold, it is necessary that F(x1) − F(x3) > 0 and G(x1) − G(x2) > 0. It follows that G(x1) − G(x3) > 0 as well. Dividing both sides of Eq. (14) by [G(x1) − G(x3)][F(x1) − F(x3)], we obtain

$$\frac{F(x_1) - F(x_2)}{F(x_1) - F(x_3)} < \frac{G(x_1) - G(x_2)}{G(x_1) - G(x_3)}. \tag{15}$$

Set h = 1_{(x3,x1]} and u = 1_{(x2,∞)}. Then uh = 1_{(x2,x1]}. By Eq. (3),

$$\frac{F(x_1) - F(x_2)}{F(x_1) - F(x_3)} = \frac{E_F(uh)}{E_F(h)} \ge \frac{E_G(uh)}{E_G(h)} = \frac{G(x_1) - G(x_2)}{G(x_1) - G(x_3)},$$

which contradicts Eq. (15).
“Only if” Part. Take any bounded, non-negative measurable function h. Consider the probability measures P^H_F and P^H_G defined as

$$P^H_F(A) = \int_A \frac{h(x)}{E_F(h)}\,dP_F \qquad\text{and}\qquad P^H_G(A) = \int_A \frac{h(x)}{E_G(h)}\,dP_G,$$

respectively, and let H_F and H_G be the corresponding CDFs. Here we assume that E_F(h) > 0 and E_G(h) > 0; otherwise, if one of them is 0, then Eq. (3) holds trivially. We prove that H_F FOSD H_G.

Since h is bounded, non-negative and measurable w.r.t. P_F and P_G, there exists an increasing sequence of simple functions {h_N} with h = lim_{N→∞} h_N a.e., where h_N = ∑_{i=1}^N α_i 1_{A_i} and the A_i are disjoint intervals measurable w.r.t. P_F and P_G.
Now fix x2, let x3 < x2, and consider any measurable set A_i > x2.²⁴ Using condition 4 in Lemma 1, we have

$$E_F(1_{A_i})\,E_G(1_{(x_3,x_2]}) \ge E_F(1_{(x_3,x_2]})\,E_G(1_{A_i}).$$

We then obtain

$$E_F\Big(\sum_{i=1}^N \alpha_i 1_{A_i\cap(x_2,\infty)}\Big)\,E_G\big(1_{(x_3,x_2]}\big) \ge E_F\big(1_{(x_3,x_2]}\big)\,E_G\Big(\sum_{i=1}^N \alpha_i 1_{A_i\cap(x_2,\infty)}\Big),$$

and it follows that

$$E_F\Big(\sum_{i=1}^N \alpha_i 1_{A_i\cap(x_2,\infty)}\Big)\,E_G\Big(\sum_{i=1}^N \alpha_i 1_{A_i\cap(-\infty,x_2]}\Big) \ge E_F\Big(\sum_{i=1}^N \alpha_i 1_{A_i\cap(-\infty,x_2]}\Big)\,E_G\Big(\sum_{i=1}^N \alpha_i 1_{A_i\cap(x_2,\infty)}\Big),$$

or

$$E_F\big(h_N 1_{(x_2,\infty)}\big)\,E_G\big(h_N 1_{(-\infty,x_2]}\big) \ge E_G\big(h_N 1_{(x_2,\infty)}\big)\,E_F\big(h_N 1_{(-\infty,x_2]}\big).$$

24 This means that inf A_i > x2.
This is equivalent to (again, assume both E_F(h_N) > 0 and E_G(h_N) > 0 — the statement is trivial when either one is 0)

$$E_F\Big(\frac{h_N 1_{(x_2,\infty)}}{E_F(h_N)}\Big)\,E_G\Big(\frac{h_N 1_{(-\infty,x_2]}}{E_G(h_N)}\Big) \ge E_F\Big(\frac{h_N 1_{(-\infty,x_2]}}{E_F(h_N)}\Big)\,E_G\Big(\frac{h_N 1_{(x_2,\infty)}}{E_G(h_N)}\Big).$$

In the limit, we have

$$E_F\Big(\frac{h\, 1_{(x_2,\infty)}}{E_F(h)}\Big)\,E_G\Big(\frac{h\, 1_{(-\infty,x_2]}}{E_G(h)}\Big) \ge E_F\Big(\frac{h\, 1_{(-\infty,x_2]}}{E_F(h)}\Big)\,E_G\Big(\frac{h\, 1_{(x_2,\infty)}}{E_G(h)}\Big),$$

or

$$\big(1 - H_F(x_2)\big)\,H_G(x_2) \ge H_F(x_2)\,\big(1 - H_G(x_2)\big).$$

It follows that H_F FOSD H_G. Thus, for an increasing u, one obtains

$$\frac{E_F(uh)}{E_F(h)} = E_{H_F}(u) \ge E_{H_G}(u) = \frac{E_G(uh)}{E_G(h)},$$

which is Eq. (3). This completes the proof of Theorem 1.
A.4 Proof of Corollary 1 — “If” Part
Suppose F ⋡SSD G. By condition 3 of Lemma 1, there exist x1 > x2 > x3 such that

$$[F(x_1) - F(x_3)][G(x_2) - G(x_3)] < [F(x_2) - F(x_3)][G(x_1) - G(x_3)]. \tag{16}$$

The monotonicity of F and G implies that for Eq. (16) to hold, it is necessary that F(x2) − F(x3) > 0 and G(x1) − G(x3) > 0, hence F(x1) − F(x3) > 0. Dividing both sides of Eq. (16) by [F(x1) − F(x3)][G(x1) − G(x3)], we obtain

$$\frac{F(x_2) - F(x_3)}{F(x_1) - F(x_3)} > \frac{G(x_2) - G(x_3)}{G(x_1) - G(x_3)}. \tag{17}$$

Since Eq. (4) is equivalent to F|A FOSD G|A for any measurable set A with positive probability, taking A = (x3, x1] yields

$$\frac{F(x_2) - F(x_3)}{F(x_1) - F(x_3)} \le \frac{G(x_2) - G(x_3)}{G(x_1) - G(x_3)},$$

which contradicts Eq. (17).
A.5 Proof of Theorem 2
To establish the “if” part of Theorem 2, we need the following two lemmas. Lemma 2
states that for a sequence of Bernoulli random variables with mean p, as the length of
the sequence becomes sufficiently large, the frequency of occurrence is arbitrarily close
to the mean p.
24
Lemma 2. Suppose X1, ..., Xt are i.i.d. Bernoulli random variables with parameter p, p ∈ [0, 1]. Let Y_t := (1/t)∑_{i=1}^t X_i. Then, for any δ > 0,

$$P\big(|Y_t - p| > \delta\big) < \exp(-t\delta^2).$$

Proof. In case p ∈ {0, 1} the assertion is clear. Otherwise, by the Chernoff–Hoeffding theorem (applied to Bernoulli random variables), we have

$$P\big(|Y_t - p| > \delta\big) < \exp\Big(-\frac{t\delta^2}{2p(1-p)}\Big) < \exp(-t\delta^2).$$
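The bound in Lemma 2 can be checked exactly for a given parameter choice by summing the binomial tail (the values of t, p and δ below are illustrative):

```python
import math

# Exact two-sided binomial tail P(|Y_t - p| > delta) versus exp(-t*delta**2).
# The parameters t, p, delta are an illustrative choice for this check.
t, p, delta = 200, 0.3, 0.1

def binom_pmf(k):
    return math.comb(t, k) * p**k * (1 - p)**(t - k)

tail = sum(binom_pmf(k) for k in range(t + 1) if abs(k / t - p) > delta)
bound = math.exp(-t * delta**2)
assert tail < bound
print(round(tail, 6), "<", round(bound, 6))
```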
Given a history ht of binary outcomes, let k(ht) denote the number of outcomes
“U” among the t observations.
Lemma 3. Let F be a CDF on [0, 1] and let p1, p3 ∈ (0, 1] be given numbers with p3 < p1. Assume that p3 is not an atom and that either p1 = 1, or p1 is not an atom. Consider the set of histories

$$H^t := \Big\{ h^t \in \mathcal{H}^t \;\Big|\; p_3 < \frac{k(h^t)}{t} \le p_1 \Big\}. \tag{18}$$

Then, P_F(H^t) → F(p1) − F(p3) as t → ∞.
Proof. We prove the assertion for the case where both p1 and p3 are not atoms and p1 < 1. The proof for the case where p1 = 1 is similar but easier, and is therefore omitted.

Fix ε > 0. Since p1 and p3 are not atoms, F is continuous at these points. Thus, one can find δ > 0 such that F(p3 + δ) − F(p3 − δ) + F(p1 + δ) − F(p1 − δ) < ε.

Recall that for a history h^t with k observations of outcome “U”, we defined B(p; h^t) = p^k(1 − p)^{t−k} and B(p; H^t) = ∑_{h^t∈H^t} B(p; h^t). Hence, we have

$$\begin{aligned}
P_F(H^t) &= \int_0^1 B(p;H^t)\,dF(p)\\
&= \int_0^{p_3-\delta} B(p;H^t)\,dF(p) + \int_{p_3-\delta}^{p_3+\delta} B(p;H^t)\,dF(p) + \int_{p_3+\delta}^{p_1-\delta} B(p;H^t)\,dF(p)\\
&\quad + \int_{p_1-\delta}^{p_1+\delta} B(p;H^t)\,dF(p) + \int_{p_1+\delta}^{1} B(p;H^t)\,dF(p).
\end{aligned}\tag{19}$$

Since B(p; H^t) ≤ 1 and by the choice of δ,

$$\int_{p_3-\delta}^{p_3+\delta} B(p;H^t)\,dF(p) + \int_{p_1-\delta}^{p_1+\delta} B(p;H^t)\,dF(p) < \varepsilon. \tag{20}$$

By Lemma 2 and the definition of H^t, we have

$$\int_0^{p_3-\delta} B(p;H^t)\,dF(p) + \int_{p_1+\delta}^{1} B(p;H^t)\,dF(p) < \exp(-t\delta^2), \tag{21}$$

and (by the choice of δ as well),

$$\int_{p_3+\delta}^{p_1-\delta} B(p;H^t)\,dF(p) > \int_{p_3+\delta}^{p_1-\delta} \big(1 - \exp(-t\delta^2)\big)\,dF(p) \ge F(p_1-\delta) - F(p_3+\delta) - \exp(-t\delta^2) \ge F(p_1) - F(p_3) - \varepsilon - \exp(-t\delta^2). \tag{22}$$

Combining Eqs. (19) to (22) yields

$$\big| P_F(H^t) - \big(F(p_1) - F(p_3)\big) \big| < 2\big(\varepsilon + \exp(-t\delta^2)\big).$$

Thus, when t is sufficiently large, |P_F(H^t) − (F(p1) − F(p3))| < 4ε, which completes the proof.
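The convergence in Lemma 3 can be illustrated numerically for a uniform prior F on [0, 1]; the interval (p3, p1] and the midpoint integration grid are illustrative assumptions.

```python
import math

# P_F(H^t) = ∫ B(p; H^t) dF(p) for uniform F, computed by a midpoint rule,
# should approach F(p1) - F(p3) = p1 - p3 as t grows.
p3, p1 = 0.3, 0.7

def prob_Ht(t, m=1000):
    # ∫_0^1 P(p3 < k/t <= p1 | p) dp over a grid of m midpoints
    total = 0.0
    for j in range(m):
        p = (j + 0.5) / m
        total += sum(math.comb(t, k) * p**k * (1 - p)**(t - k)
                     for k in range(t + 1) if p3 < k / t <= p1)
    return total / m

err_small = abs(prob_Ht(20) - (p1 - p3))
err_large = abs(prob_Ht(200) - (p1 - p3))
assert err_large < err_small      # the error shrinks as t grows
print(round(err_small, 4), round(err_large, 4))
```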
Proof of Theorem 2 — “If” Part. Suppose F ⋡SSD G. Then, by Eq. (1) in Definition 1, there exist p1 > p2 > p3 such that

$$[F(p_1) - F(p_2)][G(p_1) - G(p_3)] < [F(p_1) - F(p_3)][G(p_1) - G(p_2)].$$

The monotonicity of F and G implies that in order for the inequality above to hold, it must be true that F(p1) − F(p3) > 0 and G(p1) − G(p2) > 0, hence G(p1) − G(p3) > 0. Consequently,

$$\frac{F(p_1) - F(p_2)}{F(p_1) - F(p_3)} < \frac{G(p_1) - G(p_2)}{G(p_1) - G(p_3)}. \tag{23}$$

We show that this leads to a contradiction.

Since the functions F and G are right-continuous and each has at most countably many atoms, we can assume w.l.o.g. that there are no atoms (neither of F, nor of G) at p3, p2 and p1, or that p1 = 1. Indeed, suppose that p1 < 1 and p1 is an atom. One can find a point q1 > p1 which is not an atom of F or G (because there are at most countably many atoms). Moreover, since a CDF is right-continuous, q1 can be chosen to be very close to p1 (from the right), so that F(q1) and G(q1) are close to F(p1) and G(p1) to the extent that the inequality in Eq. (23) remains valid when we replace p1 by q1. For the same reason, it is w.l.o.g. to assume that p3 and p2 are not atoms.

Step 1. Consider the set of histories H^t defined by Eq. (18). We show that, as t → ∞,

$$E_F\big(1_{(p_2,1]}\mid H^t\big) \to \frac{F(p_1) - F(p_2)}{F(p_1) - F(p_3)} \quad\text{and}\quad E_G\big(1_{(p_2,1]}\mid H^t\big) \to \frac{G(p_1) - G(p_2)}{G(p_1) - G(p_3)}.$$
Clearly,

$$E_F\big(1_{(p_2,1]}\mid H^t\big) = \sum_{h^t\in H^t} P_F(h^t\mid H^t)\,E_F\big(1_{(p_2,1]}\mid h^t\big) = \sum_{h^t\in H^t} \frac{P_F(h^t)}{P_F(H^t)}\,E_F\big(1_{(p_2,1]}\mid h^t\big).$$

For the denominator P_F(H^t), Lemma 3 implies that P_F(H^t) → F(p1) − F(p3) as t → ∞.

Now consider ∑_{h^t∈H^t} P_F(h^t) E_F(1_{(p2,1]} | h^t); we show that it converges to F(p1) − F(p2). Since

$$P_F(h^t) = \int_0^1 B(p;h^t)\,dF(p) \quad\text{and}\quad E_F\big(1_{(p_2,1]}\mid h^t\big) = \int_0^1 \frac{1_{(p_2,1]}(p)\,B(p;h^t)}{\int_0^1 B(p;h^t)\,dF(p)}\,dF(p),$$

it follows that

$$\begin{aligned}
\sum_{h^t\in H^t} P_F(h^t)\,E_F\big(1_{(p_2,1]}\mid h^t\big) &= \sum_{h^t\in H^t} \int_0^1 1_{(p_2,1]}(p)\,B(p;h^t)\,dF(p) = \int_0^1 1_{(p_2,1]}(p)\,B(p;H^t)\,dF(p)\\
&= \int_{p_2}^{p_2+\delta} B(p;H^t)\,dF(p) + \int_{p_2+\delta}^{p_1-\delta} B(p;H^t)\,dF(p)\\
&\quad + \int_{p_1-\delta}^{p_1+\delta} B(p;H^t)\,dF(p) + \int_{p_1+\delta}^{1} B(p;H^t)\,dF(p).
\end{aligned}\tag{24}$$

Fix ε > 0. Since p1 and p2 are not atoms, F is continuous at these points. Thus, one can find δ > 0 such that F(p2 + δ) − F(p2) + F(p1 + δ) − F(p1 − δ) < ε. Since B(p; H^t) ≤ 1, we have

$$\int_{p_2}^{p_2+\delta} B(p;H^t)\,dF(p) + \int_{p_1-\delta}^{p_1+\delta} B(p;H^t)\,dF(p) < \varepsilon. \tag{25}$$

By Lemma 2 and the definition of H^t,

$$\int_{p_1+\delta}^{1} B(p;H^t)\,dF(p) < \exp(-t\delta^2), \tag{26}$$

and (by the choice of δ as well),

$$\int_{p_2+\delta}^{p_1-\delta} B(p;H^t)\,dF(p) > \int_{p_2+\delta}^{p_1-\delta} \big(1 - \exp(-t\delta^2)\big)\,dF(p) \ge F(p_1-\delta) - F(p_2+\delta) - \exp(-t\delta^2) \ge F(p_1) - F(p_2) - \varepsilon - \exp(-t\delta^2). \tag{27}$$

We combine Eqs. (24) to (27) to obtain

$$\Big|\sum_{h^t\in H^t} P_F(h^t)\,E_F\big(1_{(p_2,1]}\mid h^t\big) - \big(F(p_1) - F(p_2)\big)\Big| < 2\big(\varepsilon + \exp(-t\delta^2)\big).$$

Thus, for t sufficiently large, the left-hand side is smaller than 4ε. This, together with P_F(H^t) → F(p1) − F(p3), establishes the desired result.
Step 2. We show that E_F(1_{(p2,1]} | H^t) ≥ E_G(1_{(p2,1]} | H^t).

Since 1_{(p2,1]} is an increasing function on [0, 1] and F|H^t FOSD G|H^t for any H^t ⊆ ℋ^t, it follows that E_F(1_{(p2,1]} | H^t) ≥ E_G(1_{(p2,1]} | H^t), as desired.

Step 1 and Step 2 together imply that

$$\frac{F(p_1) - F(p_2)}{F(p_1) - F(p_3)} \ge \frac{G(p_1) - G(p_2)}{G(p_1) - G(p_3)},$$

which contradicts Eq. (23). This completes the proof of the “if” part of Theorem 2.
A.6 Supplementary Results Related to Proposition 3
Lemma 4 and Lemma 5 below establish the result that the function B(p; H t+(ht)) is
increasing in p, which is used in the proof of Proposition 3.
Lemma 4. For any given t, k ∈ N with 0 ≤ k ≤ t, the function

$$\Gamma(p; t, k) := \sum_{\ell=k}^{t} \binom{t}{\ell} (1-p)^{t-\ell}\, p^{\ell}$$

is increasing in p.
Proof. Taking the derivative of Γ(p; t, k) w.r.t. p, we have

$$\begin{aligned}
\frac{d\Gamma(p; t, k)}{dp} &= \sum_{\ell=k}^{t-1} \binom{t}{\ell} \Big[\ell p^{\ell-1}(1-p)^{t-\ell} - (t-\ell)p^{\ell}(1-p)^{t-\ell-1}\Big] + t p^{t-1}\\
&= \sum_{\ell=k}^{t-1} \binom{t}{\ell} p^{\ell-1}(1-p)^{t-\ell-1}(\ell - tp) + t p^{t-1}\\
&= t\sum_{\ell=k}^{t-1} \binom{t-1}{\ell-1} p^{\ell-1}(1-p)^{t-\ell-1} - t\sum_{\ell=k}^{t-1} \binom{t}{\ell} p^{\ell}(1-p)^{t-\ell-1} + t p^{t-1}.
\end{aligned}$$

Thus, dΓ(p; t, k)/dp ≥ 0 if and only if

$$\sum_{\ell=k}^{t-1} \binom{t-1}{\ell-1} p^{\ell-1}(1-p)^{t-\ell-1} + p^{t-1} \ge \sum_{\ell=k}^{t-1} \binom{t}{\ell} p^{\ell}(1-p)^{t-\ell-1},$$

if and only if (after multiplying both sides by (1 − p) and adding p^t),

$$\sum_{\ell=k}^{t-1} \binom{t-1}{\ell-1} p^{\ell-1}(1-p)^{t-\ell} + p^{t-1} \ge \sum_{\ell=k}^{t-1} \binom{t}{\ell} p^{\ell}(1-p)^{t-\ell} + p^t,$$

if and only if

$$\sum_{\ell=k-1}^{t-2} \binom{t-1}{\ell} p^{\ell}(1-p)^{t-\ell-1} + p^{t-1} = \sum_{\ell=k-1}^{t-1} \binom{t-1}{\ell} p^{\ell}(1-p)^{t-1-\ell} \ge \sum_{\ell=k}^{t} \binom{t}{\ell} p^{\ell}(1-p)^{t-\ell}.$$

Notice that the LHS equals Γ(p; t − 1, k − 1), while the RHS equals Γ(p; t, k). Since Γ(p; t − 1, k − 1) ≥ Γ(p; t, k), we conclude that dΓ(p; t, k)/dp ≥ 0 and therefore Γ(p; t, k) is non-decreasing in p.
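Both monotonicity claims in the proof of Lemma 4 can be spot-checked numerically; the grid of p-values and the (t, k) pairs below are illustrative choices.

```python
import math

# Gamma(p; t, k): upper tail of a Binomial(t, p) distribution at k.
def Gamma(p, t, k):
    return sum(math.comb(t, l) * (1 - p)**(t - l) * p**l
               for l in range(k, t + 1))

ps = [j / 50 for j in range(51)]
for t, k in [(5, 2), (10, 4), (12, 12)]:
    vals = [Gamma(p, t, k) for p in ps]
    # Gamma(.; t, k) is non-decreasing in p
    assert all(v2 >= v1 - 1e-12 for v1, v2 in zip(vals, vals[1:]))
    # Gamma(p; t-1, k-1) >= Gamma(p; t, k): dropping the last trial while
    # lowering the target by one can only raise the tail probability
    assert all(Gamma(p, t - 1, k - 1) >= Gamma(p, t, k) - 1e-12 for p in ps)
print("Lemma 4 checks passed")
```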
Lemma 5. Consider any h^t ∈ ℋ^t. The function B(p; H^t_+(h^t)) is increasing in p.

Proof. Suppose the history h^t has k observations of outcome “U”. Let r be the number of those histories that rank above h^t but have the same number of observations of outcome “U” as h^t. Clearly, 0 ≤ r ≤ C(t, k). We have

$$B\big(p; H^t_+(h^t)\big) = \sum_{\ell=k+1}^{t} \binom{t}{\ell} (1-p)^{t-\ell}\, p^{\ell} + r\, p^{k}(1-p)^{t-k} = \Gamma(p; t, k+1) + r\, p^{k}(1-p)^{t-k}.$$

Notice that, as a function of p, the expression p^k(1 − p)^{t−k} is increasing if and only if p/(1 − p) ≤ k/(t − k). Therefore, for those p that satisfy p/(1 − p) ≤ k/(t − k), the term r p^k(1 − p)^{t−k} is increasing in p, and it follows from Lemma 4 that Γ(p; t, k + 1) is increasing in p. Hence in this case, B(p; H^t_+(h^t)) is increasing in p.

For those p that satisfy p/(1 − p) > k/(t − k), the expression −(C(t, k) − r) p^k(1 − p)^{t−k} is strictly increasing in p, and Lemma 4 implies that Γ(p; t, k) is increasing in p. Hence, in this case,

$$B\big(p; H^t_+(h^t)\big) = \Gamma(p; t, k) - \Big(\binom{t}{k} - r\Big)\, p^{k}(1-p)^{t-k}$$

is also increasing in p. We thus conclude that B(p; H^t_+(h^t)) is an increasing function of p.
A.7 Proof of Theorem 3
The equivalence of conditions 1 and 2 was established in Theorem 2. By Proposition 3,
we know that condition 1 also implies condition 3. Therefore, it suffices to show that
condition 3 implies condition 1.
Suppose condition 1 does not hold, namely, F ⋡SSD G. Then there exist p1, p2, p3 ∈ [0, 1] with p1 > p2 > p3 such that

$$\frac{F(p_2) - F(p_3)}{F(p_1) - F(p_3)} > \frac{G(p_2) - G(p_3)}{G(p_1) - G(p_3)}. \tag{28}$$

As in the proof of Theorem 2 (see the discussion after Eq. (23)), w.l.o.g. one can assume that p1, p2, p3 are not atoms (or p1 = 1).

For any p < q, consider the set H^t(p, q) := {h^t ∈ ℋ^t | p < k(h^t)/t ≤ q}, where k(h^t) is the number of outcomes “U” observed in the history h^t. By Lemma 3, P_F(H^t(p3, p2)) → F(p2) − F(p3) and P_F(H^t(p3, p1)) → F(p1) − F(p3) as t → ∞. Therefore,

$$\frac{P_F(H^t(p_3, p_2))}{P_F(H^t(p_3, p_1))} \to \frac{F(p_2) - F(p_3)}{F(p_1) - F(p_3)}.$$

Similarly, under prior belief G,

$$\frac{P_G(H^t(p_3, p_2))}{P_G(H^t(p_3, p_1))} \to \frac{G(p_2) - G(p_3)}{G(p_1) - G(p_3)}.$$

By condition 3, the probability distribution on the set of histories H^t(p3, p1) under F FOSD that under G. Consequently, P_F(H^t(p3, p2))/P_F(H^t(p3, p1)) ≤ P_G(H^t(p3, p2))/P_G(H^t(p3, p1)) for any t. Hence, in the limit, we have

$$\frac{F(p_2) - F(p_3)}{F(p_1) - F(p_3)} \le \frac{G(p_2) - G(p_3)}{G(p_1) - G(p_3)},$$

which contradicts Eq. (28). This completes the proof of the theorem.
A.8 Proof of Proposition 4
Let ℱ₀ be the set of finite or countable unions of intervals of the form (a, b] with a < b. The smallest σ-algebra that contains ℱ₀ is the Borel σ-algebra, ℬ.

Suppose the first claim does not hold. Then there exist x1 > x2 > x̲ such that P_F((x2, x1]|D) = 0 (i.e., F(x1) − F(x2) = 0), but P_G((x2, x1]|D) > 0 (i.e., G(x1) − G(x2) > 0). Take x3 < x̲; then F(x3) = 0. Since x̲ := inf{x | F(x) > 0}, the fact that x1 > x̲ implies that F(x1) − F(x3) > 0. Hence,

$$[F(x_1) - F(x_2)][G(x_1) - G(x_3)] = 0 < [F(x_1) - F(x_3)][G(x_1) - G(x_2)],$$

which contradicts Eq. (1).

Now we show that the Radon–Nikodym derivative g := dP_G(·|D)/dP_F(·|D) is non-increasing, except on a P_F(·|D)-null set. We show the following equivalent condition: for any λ > 0 such that the sets {x ∈ D | g(x) < λ} and {x ∈ D | g(x) > λ} both have positive measure w.r.t. P_F(·|D),²⁵ we have {x ∈ D | g(x) < λ} > {x ∈ D | g(x) > λ}, except on a P_F(·|D)-null set.

Suppose that the claim that g is non-increasing almost everywhere w.r.t. P_F(·|D) does not hold. Then there exist λ > 0 and two subsets E_k, E_ℓ whose P_F(·|D)-measures are positive such that E_k ⊆ {x ∈ D | g(x) < λ} and E_ℓ ⊆ {x ∈ D | g(x) > λ}, but E_ℓ > E_k. Since P_G(E_k|D) = ∫_{E_k} g dP_F(·|D) and P_G(E_ℓ|D) = ∫_{E_ℓ} g dP_F(·|D), we have P_G(E_k|D) < λP_F(E_k|D) and P_G(E_ℓ|D) > λP_F(E_ℓ|D), and consequently,

$$\frac{P_G(E_k|D)}{P_F(E_k|D)} < \lambda < \frac{P_G(E_\ell|D)}{P_F(E_\ell|D)}. \tag{29}$$

However, since F ≽SSD G, it follows that P_G(E_k|D)/P_F(E_k|D) ≥ P_G(E_ℓ|D)/P_F(E_ℓ|D), which contradicts Eq. (29).
A.9 Proof of Proposition 5
Assume that one of the priors F or G is degenerate. To establish the “only if” part,
consider the following two cases. (a) F is degenerate at 1. Then for any set of histories
H t with positive probability, F |H t is still degenerate at 1, so F |H t FOSD G|H t holds
trivially. (b) F is not degenerate at 1. Then it cannot happen that G is degenerate
at 1, since this would violate F �SSD G (note that F �SSD G implies F FOSD G).
Similarly, it cannot be true that F is degenerate at 0 while G is not degenerate at 0.
Hence, we only need to consider the case that G is degenerate at 0. But this implies
that G|H t is also degenerate at 0, for any H t with positive probability. So F |H t FOSD
G|H t holds trivially.
To establish the “if” part, consider again the following two cases. (a) F is degenerate
at 1. Then the transformation ϕ that satisfies F (p) = ϕ(G(p)), p ∈ [0, p], is ϕ(y) ={0, if y < 1,
1, if y = 1,where y is in the range of G. Clearly ϕ is a convex function, hence
F �SSD G. (b) F is not degenerate at 1. Then one cannot have that G is degenerate
at 1, or that F is degenerate at 0 while G is not degenerate at 0, since then one can
easily find H t with positive probability that violates F |H t FOSD G|H t. So we focus
on the case in which G is degenerate ar 0. Then, for any p1 > p2 > p3, we have
G(p1) − G(p3) = 0 and G(p1) − G(p2) = 0, so Eq. (1) holds trivially and F �SSD G.
This completes the proof of the proposition.
25 Let A and B be two sets of real numbers. We write A > B if x ∈ A and y ∈ B imply x > y.