F071604: Bayesian Statistics
Luo Shan
Department of Mathematics
Shanghai Jiao Tong University
February 28th, 2014
Luo Shan (SJTU) F071604: Bayesian Statistics February 28th, 2014 1 / 36
Course Information
Consulting Time and Venue: Fridays (2:00-5:00pm), Math Building 2304
Text Book: Berger, James O. Statistical Decision Theory and Bayesian Analysis. Springer, 1985.
Reference Books:
1: Ghosh, J. K., Delampady, M., & Samanta, T. (Eds.). An Introduction to Bayesian Analysis: Theory and Methods. Springer, 2006.
2: Bayesian Statistics (Chinese-language text). China Statistics Press, 2012.
3: Ming-Hui Chen, Dipak K. Dey, Peter Müller, Dong-Chu Sun, Ke-Ying Ye. Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger. Springer, 2010.
Evaluation:
1: Mid-term paper and presentation (30%, ReferencePapersList.pdf).
2: Final-term examination (70%, one A4 size help sheet is allowed).
Chapter 1: Basic Concepts
1 Introduction
2 Basic Elements
3 Expected Loss, Decision Rules, and Risk
   Bayesian Expected Loss
   Frequentist Risk
4 Randomized Decision Rules
5 Decision Principles
   The Conditional Bayes Decision Principle
   Frequentist Decision Principles
6 Sufficient Statistics
7 Convexity
Introduction
Statistical decision theory: the theory of making decisions in the presence of statistical knowledge, which sheds light on some of the uncertainties (denoted by θ).
Classical statistics: uses sample information to make inferences about θ.
Decision theory: combines the sample information with other relevant aspects of the problem in order to make the decision.
Two typical types of relevant nonsample information:
A knowledge of the possible consequences of the decisions (the loss for various values of θ: the loss function).
Information arising from sources other than the statistical investigation, generally from past experience (prior information).
Bayesian analysis: a statistical approach which formally seeks to utilize prior information.
Basic Elements
Notations and Terminologies
θ: the unknown quantity, the state of nature (the parameter)
Θ: the set of all possible states of nature (the parameter space)
a: actions (decisions)
A: the set of all possible actions
L(θ, a): the loss function, where θ is the true state of nature and a is the action taken (assumed to satisfy L(θ, a) ≥ −K > −∞)
Notations and Terminologies (cont.)
X = (X1, X2, · · · , Xn): the outcome, where the Xi are independent observations from a common distribution
x: a particular realization of X
H: the set of possible outcomes (the sample space, e.g., Rn)
Pθ(A), or Pθ(X ∈ A): the probability of the event A (A ⊂ H) when θ is the true state of nature.
For example, X will be assumed to be either a continuous or a discrete random variable with density f(x|θ). Let FX(x|θ) be the cumulative distribution function (c.d.f.); then Pθ(A) = ∫_A dFX(x|θ), and the expectation of a function h(x), Eθ[h(X)], is defined to be ∫_H h(x) dFX(x|θ).
Prior Information
It is useful and natural to state prior beliefs in terms of probabilities of various possible values of θ being true.
Let π(θ) represent a prior density of θ (continuous or discrete). If A ⊂ Θ, then

P(θ ∈ A) = ∫_A dF^π(θ) = ∫_A π(θ)dθ  (continuous)
                        = Σ_{θ∈A} π(θ)  (discrete)
An Example:
Example 1
A drug company needs to decide whether or not to market a drug, how much to market, what price to charge, etc. Two main factors affecting its decision are the proportion of people for which the drug will prove effective (θ1), and the proportion of the market the drug will capture (θ2).
Assume it is desired to estimate θ2. Clearly, Θ = [0, 1] and A = [0, 1].
An Example (cont.):
The loss function might be

L(θ2, a) = θ2 − a     if θ2 − a ≥ 0
           2(a − θ2)  if θ2 − a ≤ 0

(overestimation is penalized twice as heavily as underestimation).
The sample information about θ2 is obtained from a sample survey. For example, assume n people are interviewed, and the number X who would buy the drug is observed. It might be reasonable to assume that X is binomial,

f(x|θ2) = C(n, x) θ2^x (1 − θ2)^(n−x).

There could well be considerable prior information about θ2, arising from previous introductions of new similar drugs into the market. Assume the prior information can be modeled by giving θ2 a U(0.1, 0.2) prior density, i.e., letting

π(θ2) = 10 I_(0.1,0.2)(θ2).
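As a quick numeric companion, a minimal Python sketch (function names are my own, not from the text) encoding the three ingredients of the example: the loss, the binomial sampling density, and the uniform prior.

```python
import math

def loss(theta2, a):
    # L(theta2, a): underestimation costs theta2 - a; overestimation
    # is penalized twice as heavily, at 2(a - theta2).
    return theta2 - a if theta2 >= a else 2 * (a - theta2)

def likelihood(x, n, theta2):
    # f(x | theta2): binomial probability of x buyers among n interviewed.
    return math.comb(n, x) * theta2**x * (1 - theta2) ** (n - x)

def prior(theta2):
    # pi(theta2) = 10 * I_(0.1,0.2)(theta2): the U(0.1, 0.2) density.
    return 10.0 if 0.1 < theta2 < 0.2 else 0.0
```

As a sanity check, the likelihood sums to one over x = 0, ..., n, and the prior is 10 exactly on (0.1, 0.2).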
It is important to note that:
The loss function and prior information could be very vague or even nonunique.
The goal in statistical inference is to provide a "summary" of the statistical evidence which a wide variety of future "users" of this evidence can easily incorporate into their own decision-making processes.
The reasons for involving losses and prior information in inference:
Reports from statistical inferences should (ideally) be constructed so that they can be easily utilized in individual decision making.
The investigator may very well possess such information; he will often be well informed about the uses to which his inferences are likely to be put, and may have considerable prior knowledge about the situation. It is then almost imperative that he present such information in his analysis.
The choice of an inference can be viewed as a decision problem.
To summarize: decision theory can be useful even when such incorporation is proscribed, because many standard inference criteria can be formally reproduced as decision-theoretic criteria with respect to certain formal loss functions.
Expected Loss, Decision Rules, and Risk: Bayesian Expected Loss
Definition 2 (Bayesian Expected Loss)
If π*(θ) is the believed probability distribution of θ at the time of decision making, the Bayesian expected loss of an action a is

ρ(π*, a) = E^{π*}[L(θ, a)] = ∫_Θ L(θ, a) dF^{π*}(θ).
Example 3 (Example 1 cont.)
Assume no data is obtained, so that the believed distribution of θ2 is simply π(θ2) = 10 I_(0.1,0.2)(θ2). Then

ρ(π, a) = ∫_0^1 L(θ2, a) π(θ2) dθ2
        = ∫_0^a 2(a − θ2) · 10 I_(0.1,0.2)(θ2) dθ2 + ∫_a^1 (θ2 − a) · 10 I_(0.1,0.2)(θ2) dθ2
        = 0.15 − a           if a ≤ 0.1
          15a^2 − 4a + 0.3   if 0.1 ≤ a ≤ 0.2
          2a − 0.3           if a ≥ 0.2.
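The piecewise formula can be checked numerically; the sketch below (names are illustrative) compares a midpoint-rule integral of L·π against the closed form.

```python
def rho_numeric(a, m=20_000):
    # Midpoint-rule approximation of the integral of L(theta2, a) * pi(theta2)
    # over (0.1, 0.2), where pi = 10 on that interval and 0 elsewhere.
    h = 0.1 / m
    total = 0.0
    for i in range(m):
        t = 0.1 + (i + 0.5) * h
        L = (t - a) if t >= a else 2 * (a - t)
        total += L * 10.0 * h
    return total

def rho_closed(a):
    # The piecewise closed form derived in Example 3.
    if a <= 0.1:
        return 0.15 - a
    if a <= 0.2:
        return 15 * a**2 - 4 * a + 0.3
    return 2 * a - 0.3
```

The two agree to within the quadrature error on all three pieces of the formula.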
Expected Loss, Decision Rules, and Risk: Frequentist Risk
Nonrandomized decision rule
Definition 4 (Nonrandomized Decision Rule)
A (nonrandomized) decision rule δ(x) is a function from H into A. If X = x is the observed value of the sample information, then δ(x) is the action that will be taken. (For a no-data problem, a decision rule is simply an action.) Two decision rules, δ1 and δ2, are considered equivalent if Pθ(δ1(X) = δ2(X)) = 1 for all θ.
Example 5 (Example 1 cont.)
For the situation of Example 1, δ(x) = x/n is the standard decision rule (estimator) for estimating θ2.
Definition 6 (Risk Function for Nonrandomized Decision Rules)
The risk function of a decision rule δ(x) is defined by

R(θ, δ) = Eθ[L(θ, δ(X))] = ∫_H L(θ, δ(x)) dFX(x|θ).

For a no-data problem, R(θ, δ) ≡ L(θ, δ).
To a frequentist, it is desirable to use a decision rule δ which has small R(θ, δ). However, whereas the Bayesian expected loss of an action was a single number, the risk is a function on Θ; since θ is unknown, we have a problem in saying what "small" means.
Definition 7
A decision rule δ1 is R-better than a decision rule δ2 if R(θ, δ1) ≤ R(θ, δ2) for all θ ∈ Θ, with strict inequality for some θ. A rule δ1 is R-equivalent to δ2 if R(θ, δ1) = R(θ, δ2) for all θ.
Definition 8 (Admissibility)
A decision rule δ is admissible if there exists no R-better decision rule. A decision rule δ is inadmissible if there does exist an R-better decision rule.
Example 9
Assume X is N(θ, 1) and that it is desired to estimate θ under the loss L(θ, a) = (θ − a)^2. (This loss is called squared-error loss.) Consider the decision rules δc(x) = cx. Clearly the risk function is

R(θ, δc) = Eθ[L(θ, δc(X))] = Eθ[(θ − cX)^2] = c^2 + (1 − c)^2 θ^2.

Hence δ1 is R-better than δc for c > 1, so the rules δc are inadmissible for c > 1.
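A Monte Carlo check of this risk formula (a sketch with illustrative names, not from the text):

```python
import random

def risk_mc(theta, c, n=200_000, seed=0):
    # Estimate E_theta[(theta - c X)^2] with X ~ N(theta, 1) by simulation.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(theta, 1.0)
        total += (theta - c * x) ** 2
    return total / n

def risk_exact(theta, c):
    # R(theta, delta_c) = c^2 + (1 - c)^2 * theta^2.
    return c**2 + (1 - c) ** 2 * theta**2
```

Note that for c > 1 the exact risk exceeds 1 = R(θ, δ1) at every θ, which is the inadmissibility claim.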
Definition 10 (Bayes Risk)
The Bayes risk of a decision rule δ, with respect to a prior distribution π on Θ, is defined as r(π, δ) = E^π[R(θ, δ)].
Example 11 (Example 9 cont.)
In Example 9, suppose that π(θ) is a N(0, τ^2) density. Then, for the decision rule δc,

r(π, δc) = E^π[R(θ, δc)] = E^π[c^2 + (1 − c)^2 θ^2] = c^2 + (1 − c)^2 τ^2.
Randomized Decision Rules
Example 12 (Matching Pennies)
Persons A and B are to simultaneously uncover a penny (each can secretly turn the penny to heads or tails).
If the two coins match, A wins $1 from B; otherwise, B wins $1 from A.
Two available actions for A: a1 (choose heads) and a2 (choose tails).
Two possible states of nature: θ1 (B's coin is a head) and θ2 (B's coin is a tail).
Both a1 and a2 are admissible actions (Why?).
If the game is to be played a number of times, a way of preventing ultimate defeat is to choose a1 and a2 with probabilities p and 1 − p, respectively.
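The mixed strategy above can be sketched numerically (my own encoding: loss −1 when A wins $1, +1 when A loses $1):

```python
# Player A's loss table: keyed by (B's coin, A's choice).
LOSS = {('H', 'H'): -1, ('H', 'T'): 1, ('T', 'H'): 1, ('T', 'T'): -1}

def expected_loss(theta, p):
    # Randomized action: a1 (heads) with probability p, a2 (tails) with 1 - p.
    return p * LOSS[(theta, 'H')] + (1 - p) * LOSS[(theta, 'T')]
```

At p = 1/2 the expected loss is 0 whatever B's coin shows, so B cannot exploit A in repeated play.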
Definition 13 (Randomized Decision Rule)
A randomized decision rule δ*(x, ·) is, for each x, a probability distribution on A, with the interpretation that if x is observed, δ*(x, A) is the probability that an action in A (a subset of A) will be chosen. In no-data problems, a randomized decision rule, also called a randomized action and denoted δ*(·), is a probability distribution on A.
Remark: Nonrandomized decision rules will be considered a special case of randomized rules, in that they correspond to the randomized rules which, for each x, choose a specific action with probability 1. If δ(x) is a nonrandomized decision rule, let <δ> denote the equivalent randomized rule, given by

<δ>(x, A) = I_A(δ(x)) = 1 if δ(x) ∈ A
                        0 if δ(x) ∉ A.
Example 14
In Example 12, the randomized action is defined by δ*(a1) = p, δ*(a2) = 1 − p. It can equivalently be denoted δ* = p<a1> + (1 − p)<a2>.
Definition 15 (Risk Function for Randomized Decision Rules)
The loss function L(θ, δ*(x, ·)) of the randomized rule δ* is defined to be

L(θ, δ*(x, ·)) = E^{δ*(x,·)}[L(θ, a)],

where the expectation is taken over a (a random variable with distribution δ*(x, ·)). The risk function of δ* is R(θ, δ*) = Eθ[L(θ, δ*(X, ·))].
Definition 16
Let D* be the set of all randomized decision rules δ* for which R(θ, δ*) < +∞ for all θ. A decision rule will be said to be admissible if there exists no R-better randomized decision rule in D*.
Decision Principles: The Conditional Bayes Decision Principle
Theorem 17 (The Conditional Bayes Principle)
Choose an action a ∈ A which minimizes ρ(π*, a) (Definition 2), assuming the minimum is attained. Such an action will be called a Bayes action and will be denoted a^{π*}.
Example 18 (Example 3 cont.)
In Example 3, ρ(π, a) attains its minimum 1/30 at a = 2/15. Hence the Bayes estimate of θ2 is 2/15, assuming no data is available.
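One way to confirm this numerically is a grid search over the closed-form ρ(π, a) from Example 3 (a sketch; names are illustrative):

```python
def rho(a):
    # Closed-form Bayesian expected loss from Example 3.
    if a <= 0.1:
        return 0.15 - a
    if a <= 0.2:
        return 15 * a**2 - 4 * a + 0.3
    return 2 * a - 0.3

# Grid search over [0, 1] for the Bayes action.
grid = [i / 100_000 for i in range(100_001)]
a_bayes = min(grid, key=rho)
```

The minimizer lands (up to grid resolution) at a = 2/15 with ρ = 1/30, matching the slide.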
Decision Principles: Frequentist Decision Principles
Theorem 19 (The Bayes Risk Principle)
A decision rule δ1 is preferred to a rule δ2 if r(π, δ1) < r(π, δ2). A decision rule which minimizes r(π, δ) is optimal; it is called a Bayes rule and will be denoted δ^π. The quantity r(π) = r(π, δ^π) is then called the Bayes risk for π.
Example 20 (Example 11 cont.)
In Example 11, r(π, δc) is minimized when c = c0 := τ^2/(1 + τ^2); hence δ_{c0} is the Bayes rule (or Bayes estimator).
Remark: In a no-data problem, the Bayes risk principle gives the same answer as the conditional Bayes decision principle.
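The minimizing c can be checked against the closed form r(π, δc) = c^2 + (1 − c)^2 τ^2 from Example 11 (sketch; names are illustrative):

```python
def bayes_risk(c, tau2):
    # r(pi, delta_c) = c^2 + (1 - c)^2 * tau^2 (Example 11).
    return c**2 + (1 - c) ** 2 * tau2

def argmin_c(tau2, steps=100_000):
    # Grid search over c in [0, 1] for the Bayes rule's coefficient.
    return min((i / steps for i in range(steps + 1)),
               key=lambda c: bayes_risk(c, tau2))
```

For each τ^2 the grid minimizer agrees with τ^2/(1 + τ^2) up to the grid spacing.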
Theorem 21 (The Minimax Principle)
A decision rule δ*_1 is preferred to a rule δ*_2 if sup_θ R(θ, δ*_1) < sup_θ R(θ, δ*_2). A rule δ*_M is a minimax decision rule if it minimizes sup_θ R(θ, δ*) among all randomized rules in D*, i.e.,

sup_θ R(θ, δ*_M) = inf_{δ*∈D*} sup_θ R(θ, δ*)   (the minimax value).

Remark: the analogous notions are the minimax action (a minimax decision rule in a no-data problem), the minimax nonrandomized rule, and the minimax nonrandomized action.
Example 22 (Example 9 cont.)
In Example 9, for the decision rules δc,

sup_θ R(θ, δc) = 1 if c = 1
                 ∞ if c ≠ 1.

The minimax rule is δ1.
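The divergence of the supremum for c ≠ 1 can be watched directly by evaluating R(θ, δc) = c^2 + (1 − c)^2 θ^2 over ever-larger θ (sketch; names are illustrative):

```python
def risk(theta, c):
    # R(theta, delta_c) from Example 9.
    return c**2 + (1 - c) ** 2 * theta**2

def sup_risk(c, theta_grid):
    # Supremum of the risk over a finite grid of theta values.
    return max(risk(t, c) for t in theta_grid)
```

For c = 1 the risk is constant at 1; for any c ≠ 1 the supremum grows without bound as the grid widens.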
Theorem 23 (The Invariance Principle)
If two decision problems have the same formal structure (e.g., X and Y = g(X)), then the same decision rule should be used in each problem; i.e., if δ and δ* are the decision rules used in the X and Y problems, then g(δ(x)) = δ*(y). If a decision problem is invariant under a group φ of transformations, a (nonrandomized) decision rule δ(x) is invariant under φ if δ(g(x)) = g(δ(x)) for all x ∈ H and g ∈ φ.
Remark: For comparisons of the above principles, please refer to Section 1.6.
Sufficient Statistics
Definition 24 (Sufficient Statistics)
Let X be a random variable whose distribution depends on the unknown parameter θ but is otherwise known. A function T of X is said to be a sufficient statistic for θ if the conditional distribution of X, given T(X) = t, is independent of θ (with probability one).
Definition 25
If T(X) is a statistic with range F (i.e., F = {T(x) : x ∈ H}), the partition of H induced by T is the collection of all sets of the form Ht = {x ∈ H : T(x) = t} for t ∈ F.
Definition 26
A sufficient partition of H is a partition induced by a sufficient statistic T .
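Sufficiency can be made concrete for an i.i.d. Bernoulli(θ) sample, where T(X) = ΣXi is sufficient: the conditional distribution of X given T = t is uniform over the C(n, t) arrangements, whatever θ is. A small enumeration sketch (names are mine):

```python
from itertools import product

def conditional_dist(theta, n, t):
    # P(X = x | T = t) for an i.i.d. Bernoulli(theta) sample with T = sum(X):
    # enumerate all 0/1 sequences with sum t and normalize their probabilities.
    probs = {}
    for x in product((0, 1), repeat=n):
        if sum(x) == t:
            probs[x] = theta**t * (1 - theta) ** (n - t)
    total = sum(probs.values())
    return {x: p / total for x, p in probs.items()}
```

The result is the same for every θ (uniform, 1/C(n, t) each), which is exactly the definition of sufficiency.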
Theorem 27
Assume that T is a sufficient statistic for θ, and let δ*_0(x, ·) be any randomized rule in D*. Then (subject to measurability conditions) there exists a randomized rule δ*_1(t, ·), depending only on T(x), which is R-equivalent to δ*_0.
Convexity
Definition 28
A set Ω ⊂ R^m is convex if for any two points x and y in Ω, the point αx + (1 − α)y is in Ω for 0 ≤ α ≤ 1 (i.e., the line segment joining x and y is a subset of Ω).
Definition 29
If x1, x2, · · · is a sequence of points in R^m, and 0 ≤ αi ≤ 1 are numbers such that Σ_{i=1}^∞ αi = 1, then Σ_{i=1}^∞ αi xi (provided it is finite) is called a convex combination of the xi. The convex hull of a set Ω is the set of all points which are convex combinations of points in Ω.
Definition 30
A real-valued function g(x) defined on a convex set Ω is convex if g(αx + (1 − α)y) ≤ αg(x) + (1 − α)g(y) for all x, y ∈ Ω and 0 < α < 1. If the inequality is strict for x ≠ y, then g is strictly convex. If g(αx + (1 − α)y) ≥ αg(x) + (1 − α)g(y), then g is concave; if that inequality is strict for x ≠ y, then g is strictly concave.
Lemma 31
Let g(x) be a function defined on an open convex subset Ω of R^m for which all second-order partial derivatives g^(i,j)(x) = ∂²g(x)/∂xi∂xj exist and are finite. Then g is convex if and only if the matrix G = (g^(i,j)(x)) is nonnegative definite for x ∈ Ω. Likewise, g is concave if −G is nonnegative definite. If G is positive (negative) definite, then g is strictly convex (strictly concave).
Lemma 32
Let X be an m-variate random vector such that E(|X|) < ∞ and P(X ∈ Ω) = 1, where Ω is a convex subset of R^m. Then E(X) ∈ Ω.
Theorem 33 (Jensen's Inequality)
Let g(x) be a convex real-valued function defined on a convex subset Ω of R^m, and let X be an m-variate random vector for which E(|X|) < ∞. Suppose also that P(X ∈ Ω) = 1. Then g(E(X)) ≤ E[g(X)], with strict inequality if g is strictly convex and X is not concentrated at a point.
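A quick empirical illustration for m = 1 (sketch; names are illustrative): for convex g, the sample average of g(X) dominates g applied to the sample average.

```python
import random

def jensen_gap(g, xs):
    # Empirical E[g(X)] - g(E[X]) on a sample xs; nonnegative when g is convex.
    mean = sum(xs) / len(xs)
    return sum(g(x) for x in xs) / len(xs) - g(mean)

rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
```

For g(x) = x^2 the gap is exactly the sample variance, and for a linear g the gap vanishes, matching the equality case.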
Theorem 34
Assume that A is a convex subset of R^m, and that for each θ ∈ Θ the loss function L(θ, a) is a convex function of a. Let δ* be a randomized decision rule in D* for which E^{δ*(x,·)}[|a|] < ∞ for all x ∈ H. Then (subject to measurability conditions) the nonrandomized rule δ(x) = E^{δ*(x,·)}[a] satisfies L(θ, δ(x)) ≤ L(θ, δ*(x, ·)) for all x and θ.
Theorem 35 (Rao-Blackwell Theorem)
Assume that A is a convex subset of R^m and that L(θ, a) is a convex function of a for each θ ∈ Θ. Suppose also that T is a sufficient statistic for θ, and that δ0(x) is a nonrandomized decision rule in D. Then the decision rule based on T(x) = t, defined by δ1(t) = E^{X|t}[δ0(X)], is R-equivalent to or R-better than δ0, provided the expectation exists.
Proof.
By the definition of a sufficient statistic, the expectation above does not depend on θ, so that δ1 is an obtainable decision rule. By Jensen's inequality, L(θ, δ1(t)) = L(θ, E^{X|t}[δ0(X)]) ≤ E^{X|t}[L(θ, δ0(X))]. Hence

R(θ, δ1) = Eθ[L(θ, δ1(T))]
         ≤ Eθ[E^{X|T}[L(θ, δ0(X))]]
         = Eθ[L(θ, δ0(X))] = R(θ, δ0).
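The theorem can be watched in action for Bernoulli data with squared-error loss: start from the crude rule δ0(X) = X1, Rao-Blackwellize via T = ΣXi to get δ1(t) = E[X1 | T = t] = t/n, and compare simulated risks (sketch; names are illustrative).

```python
import random

def simulate_risks(theta, n=10, reps=50_000, seed=0):
    # Monte Carlo risks under squared-error loss of
    #   delta0(X) = X1       (uses one observation),
    #   delta1(T) = T / n    (its Rao-Blackwellization via T = sum(X)).
    rng = random.Random(seed)
    r0 = r1 = 0.0
    for _ in range(reps):
        x = [1 if rng.random() < theta else 0 for _ in range(n)]
        r0 += (x[0] - theta) ** 2
        r1 += (sum(x) / n - theta) ** 2
    return r0 / reps, r1 / reps
```

The true risks are θ(1 − θ) and θ(1 − θ)/n, so conditioning on the sufficient statistic improves the risk by a factor of n here.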