F071604: Bayesian Statistics
Luo Shan
Department of Mathematics
Shanghai Jiao Tong University
February 28th, 2014
Luo Shan (SJTU) F071604: Bayesian Statistics February 28th, 2014 1 / 36
Course Information
Consulting Time and Venue: Fridays (2:00-5:00pm), Math Building 2304
Text Book: Berger, James O. Statistical Decision Theory and Bayesian Analysis. Springer, 1985.
Reference Books:
1: Ghosh, J. K., Delampady, M., & Samanta, T. (Eds.). An Introduction to Bayesian Analysis: Theory and Methods. Springer, 2006.
2: Bayesian Statistics (Chinese-language text). China Statistics Press, 2012.
3: Ming-Hui Chen, Dipak K. Dey, Peter Müller, Dong-Chu Sun, Ke-Ying Ye. Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger. Springer, 2010.
Evaluation:
1: Mid-term paper and presentation (30%, ReferencePapersList.pdf).
2: Final-term examination (70%, one A4 size help sheet is allowed).
Chapter 1: Basic Concepts
1 Introduction
2 Basic Elements
3 Expected Loss, Decision Rules, and Risk
   Bayesian Expected Loss
   Frequentist Risk
4 Randomized Decision Rules
5 Decision Principles
   The Conditional Bayes Decision Principle
   Frequentist Decision Principles
6 Sufficient Statistics
7 Convexity
Introduction
Statistical decision theory: the theory of making decisions in the presence of statistical knowledge, which sheds light on some of the uncertainties (denoted by θ).
Classical statistics: uses sample information to make inferences about θ.
Decision theory: combines the sample information with other relevant aspects of the problem in order to make the decision.
Two typical types of relevant nonsample information:
A knowledge of the possible consequences of the decisions (the loss for various values of θ: the loss function).
Information arising from sources other than the statistical investigation, generally from past experience (prior information).
Bayesian analysis: a statistical approach which formally seeks to utilize prior information.
Basic Elements
Notations and Terminologies
θ: the unknown quantity, the state of nature (the parameter)
Θ: the set of all possible states of nature (the parameter space)
a: actions (decisions)
A: the set of all possible actions
L(θ, a): the loss function, where θ is the true state of nature and a is the action taken (assumed to satisfy L(θ, a) ≥ −K > −∞)
Notations and Terminologies (cont.)
X = (X1, X2, · · · , Xn): the outcome, where the Xi are independent observations from a common distribution
x: a particular realization of X
H: the set of possible outcomes (the sample space, e.g., Rn)
Pθ(A), or Pθ(X ∈ A): the probability of the event A (A ⊂ H) when θ is the true state of nature.
For example, X will be assumed to be either a continuous or a discrete random variable with density f(x|θ). Let FX(x|θ) be the cumulative distribution function (c.d.f.); then Pθ(A) = ∫_A dFX(x|θ), and the expectation of a function h(x), Eθ[h(X)], is defined to be ∫_H h(x) dFX(x|θ).
Prior Information
It is useful and natural to state prior beliefs in terms of probabilities of various possible values of θ being true.
Let π(θ) represent a prior density of θ (continuous or discrete). If A ⊂ Θ, then

P(θ ∈ A) = ∫_A dF^π(θ) = ∫_A π(θ)dθ  (continuous)
                        = Σ_{θ∈A} π(θ)  (discrete)
An Example:
Example 1
A drug company needs to decide whether or not to market a drug, how much to market, what price to charge, etc. Two main factors affecting its decision are the proportion of people for which the drug will prove effective (θ1), and the proportion of the market the drug will capture (θ2).
Assume it is desired to estimate θ2. Clearly, Θ = [0, 1] and A = [0, 1].
An Example (cont.):
The loss function might be

L(θ2, a) = θ2 − a     if θ2 − a ≥ 0
           2(a − θ2)  if θ2 − a ≤ 0

(overestimation is penalized twice as heavily as underestimation).
The sample information about θ2 is obtained from a sample survey. For example, assume n people are interviewed, and the number X who would buy the drug is observed. It might be reasonable to assume that X is binomial,

f(x|θ2) = C(n, x) θ2^x (1 − θ2)^(n−x).

There could well be considerable prior information about θ2, arising from previous introductions of new similar drugs into the market. Assume the prior information can be modeled by giving θ2 a U(0.1, 0.2) prior density, i.e., letting

π(θ2) = 10 I_(0.1,0.2)(θ2).
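As a quick numeric companion, a minimal Python sketch (function names are my own, not from the text) encoding the three ingredients of the example: the loss, the binomial sampling density, and the uniform prior.

```python
import math

def loss(theta2, a):
    # L(theta2, a): underestimation costs theta2 - a; overestimation
    # is penalized twice as heavily, at 2(a - theta2).
    return theta2 - a if theta2 >= a else 2 * (a - theta2)

def likelihood(x, n, theta2):
    # f(x | theta2): binomial probability of x buyers among n interviewed.
    return math.comb(n, x) * theta2**x * (1 - theta2) ** (n - x)

def prior(theta2):
    # pi(theta2) = 10 * I_(0.1,0.2)(theta2): the U(0.1, 0.2) density.
    return 10.0 if 0.1 < theta2 < 0.2 else 0.0
```

As a sanity check, the likelihood sums to one over x = 0, ..., n, and the prior is 10 exactly on (0.1, 0.2).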
It is important to note that:
The loss function and prior information could be very vague or even nonunique.
The goal in statistical inference is to provide a "summary" of the statistical evidence which a wide variety of future "users" of this evidence can easily incorporate into their own decision-making processes.
The reasons for involving losses and prior information in inference:
Reports from statistical inferences should (ideally) be constructed so that they can be easily utilized in individual decision making.
The investigator may very well possess such information; he will often be well informed about the uses to which his inferences are likely to be put, and may have considerable prior knowledge about the situation. It is then almost imperative that he present such information in his analysis.
The choice of an inference can be viewed as a decision problem.
To summarize: decision theory can be useful even when such incorporation is proscribed, because many standard inference criteria can be formally reproduced as decision-theoretic criteria with respect to certain formal loss functions.
Expected Loss, Decision Rules, and Risk: Bayesian Expected Loss
Definition 2 (Bayesian Expected Loss)
If π*(θ) is the believed probability distribution of θ at the time of decision making, the Bayesian expected loss of an action a is

ρ(π*, a) = E^{π*}[L(θ, a)] = ∫_Θ L(θ, a) dF^{π*}(θ).
Example 3 (Example 1 cont.)
Assume no data is obtained, so that the believed distribution of θ2 is simply π(θ2) = 10 I_(0.1,0.2)(θ2). Then

ρ(π, a) = ∫_0^1 L(θ2, a) π(θ2) dθ2
        = ∫_0^a 2(a − θ2) · 10 I_(0.1,0.2)(θ2) dθ2 + ∫_a^1 (θ2 − a) · 10 I_(0.1,0.2)(θ2) dθ2
        = 0.15 − a           if a ≤ 0.1
          15a^2 − 4a + 0.3   if 0.1 ≤ a ≤ 0.2
          2a − 0.3           if a ≥ 0.2.
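The piecewise formula can be checked numerically; the sketch below (names are illustrative) compares a midpoint-rule integral of L·π against the closed form.

```python
def rho_numeric(a, m=20_000):
    # Midpoint-rule approximation of the integral of L(theta2, a) * pi(theta2)
    # over (0.1, 0.2), where pi = 10 on that interval and 0 elsewhere.
    h = 0.1 / m
    total = 0.0
    for i in range(m):
        t = 0.1 + (i + 0.5) * h
        L = (t - a) if t >= a else 2 * (a - t)
        total += L * 10.0 * h
    return total

def rho_closed(a):
    # The piecewise closed form derived in Example 3.
    if a <= 0.1:
        return 0.15 - a
    if a <= 0.2:
        return 15 * a**2 - 4 * a + 0.3
    return 2 * a - 0.3
```

The two agree to within the quadrature error on all three pieces of the formula.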
Expected Loss, Decision Rules, and Risk: Frequentist Risk
Nonrandomized decision rule
Definition 4 (Nonrandomized Decision Rule)
A (nonrandomized) decision rule δ(x) is a function from H into A. If X = x is the observed value of the sample information, then δ(x) is the action that will be taken. (For a no-data problem, a decision rule is simply an action.) Two decision rules, δ1 and δ2, are considered equivalent if Pθ(δ1(X) = δ2(X)) = 1 for all θ.
Example 5 (Example 1 cont.)
For the situation of Example 1, δ(x) = x/n is the standard decision rule (estimator) for estimating θ2.
Definition 6 (Risk Function for Nonrandomized Decision Rules)
The risk function of a decision rule δ(x) is defined by

R(θ, δ) = Eθ[L(θ, δ(X))] = ∫_H L(θ, δ(x)) dFX(x|θ).

For a no-data problem, R(θ, δ) ≡ L(θ, δ).
To a frequentist, it is desirable to use a decision rule δ which has small R(θ, δ). However, whereas the Bayesian expected loss of an action was a single number, the risk is a function on Θ; since θ is unknown, we have a problem in saying what "small" means.
Definition 7
A decision rule δ1 is R-better than a decision rule δ2 if R(θ, δ1) ≤ R(θ, δ2) for all θ ∈ Θ, with strict inequality for some θ. A rule δ1 is R-equivalent to δ2 if R(θ, δ1) = R(θ, δ2) for all θ.
Definition 8 (Admissibility)
A decision rule δ is admissible if there exists no R-better decision rule. A decision rule δ is inadmissible if there does exist an R-better decision rule.
Example 9
Assume X is N(θ, 1) and that it is desired to estimate θ under the loss L(θ, a) = (θ − a)^2. (This loss is called squared-error loss.) Consider the decision rules δc(x) = cx. Clearly the risk function is

R(θ, δc) = Eθ[L(θ, δc(X))] = Eθ[(θ − cX)^2] = c^2 + (1 − c)^2 θ^2.

Hence δ1 is R-better than δc for c > 1, so the rules δc are inadmissible for c > 1.
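A Monte Carlo check of this risk formula (a sketch with illustrative names, not from the text):

```python
import random

def risk_mc(theta, c, n=200_000, seed=0):
    # Estimate E_theta[(theta - c X)^2] with X ~ N(theta, 1) by simulation.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(theta, 1.0)
        total += (theta - c * x) ** 2
    return total / n

def risk_exact(theta, c):
    # R(theta, delta_c) = c^2 + (1 - c)^2 * theta^2.
    return c**2 + (1 - c) ** 2 * theta**2
```

Note that for c > 1 the exact risk exceeds 1 = R(θ, δ1) at every θ, which is the inadmissibility claim.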
Definition 10 (Bayes Risk)
The Bayes risk of a decision rule δ, with respect to a prior distribution π on Θ, is defined as r(π, δ) = E^π[R(θ, δ)].
Example 11 (Example 9 cont.)
In Example 9, suppose that π(θ) is a N(0, τ^2) density. Then, for the decision rule δc,

r(π, δc) = E^π[R(θ, δc)] = E^π[c^2 + (1 − c)^2 θ^2] = c^2 + (1 − c)^2 τ^2.
Randomized Decision Rules
Example 12 (Matching Pennies)
Persons A and B are to simultaneously uncover a penny (each can secretly turn the penny to heads or tails).
If the two coins match, A wins $1 from B; otherwise, B wins $1 from A.
Two available actions for A: a1 (choose heads) and a2 (choose tails).
Two possible states of nature: θ1 (B's coin is a head) and θ2 (B's coin is a tail).
Both a1 and a2 are admissible actions (Why?).
If the game is to be played a number of times, a way of preventing ultimate defeat is to choose a1 and a2 with probabilities p and 1 − p, respectively.
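The mixed strategy above can be sketched numerically (my own encoding: loss −1 when A wins $1, +1 when A loses $1):

```python
# Player A's loss table: keyed by (B's coin, A's choice).
LOSS = {('H', 'H'): -1, ('H', 'T'): 1, ('T', 'H'): 1, ('T', 'T'): -1}

def expected_loss(theta, p):
    # Randomized action: a1 (heads) with probability p, a2 (tails) with 1 - p.
    return p * LOSS[(theta, 'H')] + (1 - p) * LOSS[(theta, 'T')]
```

At p = 1/2 the expected loss is 0 whatever B's coin shows, so B cannot exploit A in repeated play.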
Definition 13 (Randomized Decision Rule)
A randomized decision rule δ*(x, ·) is, for each x, a probability distribution on A, with the interpretation that if x is observed, δ*(x, A) is the probability that an action in A (a subset of A) will be chosen. In no-data problems, a randomized decision rule, also called a randomized action and denoted δ*(·), is a probability distribution on A.
Remark: Nonrandomized decision rules will be considered a special case of randomized rules, in that they correspond to the randomized rules which, for each x, choose a specific action with probability 1. If δ(x) is a nonrandomized decision rule, let <δ> denote the equivalent randomized rule, given by

<δ>(x, A) = I_A(δ(x)) = 1 if δ(x) ∈ A
                        0 if δ(x) ∉ A.
Example 14
In Example 12, the randomized action is defined by δ*(a1) = p, δ*(a2) = 1 − p. It can equivalently be denoted δ* = p<a1> + (1 − p)<a2>.
Definition 15 (Risk Function for Randomized Decision Rules)
The loss function L(θ, δ*(x, ·)) of the randomized rule δ* is defined to be

L(θ, δ*(x, ·)) = E^{δ*(x,·)}[L(θ, a)],

where the expectation is taken over a (a random variable with distribution δ*(x, ·)). The risk function of δ* is R(θ, δ*) = Eθ[L(θ, δ*(X, ·))].
Definition 16
Let D* be the set of all randomized decision rules δ* for which R(θ, δ*) < +∞ for all θ. A decision rule will be said to be admissible if there exists no R-better randomized decision rule in D*.
Decision Principles: The Conditional Bayes Decision Principle
Theorem 17 (The Conditional Bayes Principle)
Choose an action a ∈ A which minimizes ρ(π*, a) (Definition 2), assuming the minimum is attained. Such an action will be called a Bayes action and will be denoted a^{π*}.
Example 18 (Example 3 cont.)
In Example 3, ρ(π, a) attains its minimum 1/30 at a = 2/15. Hence the Bayes estimate of θ2 is 2/15, assuming no data is available.
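One way to confirm this numerically is a grid search over the closed-form ρ(π, a) from Example 3 (a sketch; names are illustrative):

```python
def rho(a):
    # Closed-form Bayesian expected loss from Example 3.
    if a <= 0.1:
        return 0.15 - a
    if a <= 0.2:
        return 15 * a**2 - 4 * a + 0.3
    return 2 * a - 0.3

# Grid search over [0, 1] for the Bayes action.
grid = [i / 100_000 for i in range(100_001)]
a_bayes = min(grid, key=rho)
```

The minimizer lands (up to grid resolution) at a = 2/15 with ρ = 1/30, matching the slide.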
Decision Principles: Frequentist Decision Principles
Theorem 19 (The Bayes Risk Principle)
A decision rule δ1 is preferred to a rule δ2 if r(π, δ1) < r(π, δ2). A decision rule which minimizes r(π, δ) is optimal; it is called a Bayes rule and will be denoted δ^π. The quantity r(π) = r(π, δ^π) is then called the Bayes risk for π.
Example 20 (Example 11 cont.)
In Example 11, r(π, δc) is minimized when c = c0 := τ^2/(1 + τ^2); hence δ_{c0} is the Bayes rule (or Bayes estimator).
Remark: In a no-data problem, the Bayes risk principle gives the same answer as the conditional Bayes decision principle.
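The minimizing c can be checked against the closed form r(π, δc) = c^2 + (1 − c)^2 τ^2 from Example 11 (sketch; names are illustrative):

```python
def bayes_risk(c, tau2):
    # r(pi, delta_c) = c^2 + (1 - c)^2 * tau^2 (Example 11).
    return c**2 + (1 - c) ** 2 * tau2

def argmin_c(tau2, steps=100_000):
    # Grid search over c in [0, 1] for the Bayes rule's coefficient.
    return min((i / steps for i in range(steps + 1)),
               key=lambda c: bayes_risk(c, tau2))
```

For each τ^2 the grid minimizer agrees with τ^2/(1 + τ^2) up to the grid spacing.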
Theorem 21 (The Minimax Principle)
A decision rule δ*_1 is preferred to a rule δ*_2 if sup_θ R(θ, δ*_1) < sup_θ R(θ, δ*_2). A rule δ*_M is a minimax decision rule if it minimizes sup_θ R(θ, δ*) among all randomized rules in D*, i.e.,

sup_θ R(θ, δ*_M) = inf_{δ*∈D*} sup_θ R(θ, δ*)   (the minimax value).

Remark: the analogous notions are the minimax action (a minimax decision rule in a no-data problem), the minimax nonrandomized rule, and the minimax nonrandomized action.
Example 22 (Example 9 cont.)
In Example 9, for the decision rules δc,

sup_θ R(θ, δc) = 1 if c = 1
                 ∞ if c ≠ 1.

The minimax rule is δ1.
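The divergence of the supremum for c ≠ 1 can be watched directly by evaluating R(θ, δc) = c^2 + (1 − c)^2 θ^2 over ever-larger θ (sketch; names are illustrative):

```python
def risk(theta, c):
    # R(theta, delta_c) from Example 9.
    return c**2 + (1 - c) ** 2 * theta**2

def sup_risk(c, theta_grid):
    # Supremum of the risk over a finite grid of theta values.
    return max(risk(t, c) for t in theta_grid)
```

For c = 1 the risk is constant at 1; for any c ≠ 1 the supremum grows without bound as the grid widens.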
Theorem 23 (The Invariance Principle)
If two decision problems have the same formal structure (e.g., X and Y = g(X)), then the same decision rule should be used in each problem; i.e., if δ and δ* are the decision rules used in the X and Y problems, then g(δ(x)) = δ*(y). If a decision problem is invariant under a group φ of transformations, a (nonrandomized) decision rule δ(x) is invariant under φ if δ(g(x)) = g(δ(x)) for all x ∈ H and g ∈ φ.
Remark: For comparisons of the above principles, please refer to Section 1.6.
Sufficient Statistics
Definition 24 (Sufficient Statistics)
Let X be a random variable whose distribution depends on the unknown parameter θ but is otherwise known. A function T of X is said to be a sufficient statistic for θ if the conditional distribution of X, given T(X) = t, is independent of θ (with probability one).
Definition 25
If T(X) is a statistic with range F (i.e., F = {T(x) : x ∈ H}), the partition of H induced by T is the collection of all sets of the form Ht = {x ∈ H : T(x) = t} for t ∈ F.
Definition 26
A sufficient partition of H is a partition induced by a sufficient statistic T .
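Sufficiency can be made concrete for an i.i.d. Bernoulli(θ) sample, where T(X) = ΣXi is sufficient: the conditional distribution of X given T = t is uniform over the C(n, t) arrangements, whatever θ is. A small enumeration sketch (names are mine):

```python
from itertools import product

def conditional_dist(theta, n, t):
    # P(X = x | T = t) for an i.i.d. Bernoulli(theta) sample with T = sum(X):
    # enumerate all 0/1 sequences with sum t and normalize their probabilities.
    probs = {}
    for x in product((0, 1), repeat=n):
        if sum(x) == t:
            probs[x] = theta**t * (1 - theta) ** (n - t)
    total = sum(probs.values())
    return {x: p / total for x, p in probs.items()}
```

The result is the same for every θ (uniform, 1/C(n, t) each), which is exactly the definition of sufficiency.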
Theorem 27
Assume that T is a sufficient statistic for θ, and let δ*_0(x, ·) be any randomized rule in D*. Then (subject to measurability conditions) there exists a randomized rule δ*_1(t, ·), depending only on T(x), which is R-equivalent to δ*_0.
Convexity
Definition 28
A set Ω ⊂ R^m is convex if for any two points x and y in Ω, the point αx + (1 − α)y is in Ω for 0 ≤ α ≤ 1 (i.e., the line segment joining x and y is a subset of Ω).
Definition 29
If x1, x2, · · · is a sequence of points in R^m, and 0 ≤ αi ≤ 1 are numbers such that Σ_{i=1}^∞ αi = 1, then Σ_{i=1}^∞ αi xi (provided it is finite) is called a convex combination of the xi. The convex hull of a set Ω is the set of all points which are convex combinations of points in Ω.
Definition 30
A real-valued function g(x) defined on a convex set Ω is convex if g(αx + (1 − α)y) ≤ αg(x) + (1 − α)g(y) for all x, y ∈ Ω and 0 < α < 1. If the inequality is strict for x ≠ y, then g is strictly convex. If g(αx + (1 − α)y) ≥ αg(x) + (1 − α)g(y), then g is concave; if that inequality is strict for x ≠ y, then g is strictly concave.
Lemma 31
Let g(x) be a function defined on an open convex subset Ω of R^m for which all second-order partial derivatives g^(i,j)(x) = ∂²g(x)/∂xi∂xj exist and are finite. Then g is convex if and only if the matrix G = (g^(i,j)(x)) is nonnegative definite for x ∈ Ω. Likewise, g is concave if −G is nonnegative definite. If G is positive (negative) definite, then g is strictly convex (strictly concave).
Lemma 32
Let X be an m-variate random vector such that E(|X|) < ∞ and P(X ∈ Ω) = 1, where Ω is a convex subset of R^m. Then E(X) ∈ Ω.
Theorem 33 (Jensen's Inequality)
Let g(x) be a convex real-valued function defined on a convex subset Ω of R^m, and let X be an m-variate random vector for which E(|X|) < ∞. Suppose also that P(X ∈ Ω) = 1. Then g(E(X)) ≤ E[g(X)], with strict inequality if g is strictly convex and X is not concentrated at a point.
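A quick empirical illustration for m = 1 (sketch; names are illustrative): for convex g, the sample average of g(X) dominates g applied to the sample average.

```python
import random

def jensen_gap(g, xs):
    # Empirical E[g(X)] - g(E[X]) on a sample xs; nonnegative when g is convex.
    mean = sum(xs) / len(xs)
    return sum(g(x) for x in xs) / len(xs) - g(mean)

rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
```

For g(x) = x^2 the gap is exactly the sample variance, and for a linear g the gap vanishes, matching the equality case.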
Theorem 34
Assume that A is a convex subset of R^m, and that for each θ ∈ Θ the loss function L(θ, a) is a convex function of a. Let δ* be a randomized decision rule in D* for which E^{δ*(x,·)}[|a|] < ∞ for all x ∈ H. Then (subject to measurability conditions) the nonrandomized rule δ(x) = E^{δ*(x,·)}[a] satisfies L(θ, δ(x)) ≤ L(θ, δ*(x, ·)) for all x and θ.
Theorem 35 (Rao-Blackwell Theorem)
Assume that A is a convex subset of R^m and that L(θ, a) is a convex function of a for each θ ∈ Θ. Suppose also that T is a sufficient statistic for θ, and that δ0(x) is a nonrandomized decision rule in D. Then the decision rule based on T(x) = t, defined by δ1(t) = E^{X|t}[δ0(X)], is R-equivalent to or R-better than δ0, provided the expectation exists.
Proof.
By the definition of a sufficient statistic, the expectation above does not depend on θ, so that δ1 is an obtainable decision rule. By Jensen's inequality, L(θ, δ1(t)) = L(θ, E^{X|t}[δ0(X)]) ≤ E^{X|t}[L(θ, δ0(X))]. Hence

R(θ, δ1) = Eθ[L(θ, δ1(T))]
         ≤ Eθ[E^{X|T}[L(θ, δ0(X))]]
         = Eθ[L(θ, δ0(X))] = R(θ, δ0).
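The theorem can be watched in action for Bernoulli data with squared-error loss: start from the crude rule δ0(X) = X1, Rao-Blackwellize via T = ΣXi to get δ1(t) = E[X1 | T = t] = t/n, and compare simulated risks (sketch; names are illustrative).

```python
import random

def simulate_risks(theta, n=10, reps=50_000, seed=0):
    # Monte Carlo risks under squared-error loss of
    #   delta0(X) = X1       (uses one observation),
    #   delta1(T) = T / n    (its Rao-Blackwellization via T = sum(X)).
    rng = random.Random(seed)
    r0 = r1 = 0.0
    for _ in range(reps):
        x = [1 if rng.random() < theta else 0 for _ in range(n)]
        r0 += (x[0] - theta) ** 2
        r1 += (sum(x) / n - theta) ** 2
    return r0 / reps, r1 / reps
```

The true risks are θ(1 − θ) and θ(1 − θ)/n, so conditioning on the sufficient statistic improves the risk by a factor of n here.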