9 Bayesian inference
Thomas Bayes (1702 - 1761)
9.1 Subjective probability
This is probability regarded as degree of belief.
A subjective probability of an event A is assessed as p if you are prepared to stake £pM to win £M and equally prepared to accept a stake of £pM to win £M.
In other words ...
... the bet is fair and you are assumed to behave rationally.
9.1.1 Kolmogorov’s axioms
How does subjective probability fit in with the fundamental axioms?
Let A be the set of all subsets of a countable sample space Ω. Then
(i) P (A) ≥ 0 for every A ∈ A;

(ii) P (Ω) = 1;
(iii) If {Aλ : λ ∈ Λ} is a countable set of mutually exclusive events belonging to A, then

P (⋃_{λ∈Λ} Aλ) = ∑_{λ∈Λ} P (Aλ).
Obviously the subjective interpretation has no difficulty in conforming with (i) and (ii). (iii) is slightly less obvious.
Suppose we have 2 events A and B such that A ∩ B = ∅. Consider a stake of £pA M to win £M if A occurs and a stake £pB M to win £M if B occurs. The total stake for bets on A or B occurring is £pA M + £pB M to win £M if A or B occurs. Thus we have £(pA + pB)M to win £M and so
P (A ∪ B) = P (A) + P (B)
9.1.2 Conditional probability
Define pB, pAB, pA|B such that

£pB M is the fair stake for £M if B occurs;
£pAB M is the fair stake for £M if A and B occur;
£pA|B M is the fair stake for £M if A occurs given B has occurred; otherwise the bet is off.
Consider the three alternative outcomes: B does not occur (so the conditional bet is off), B occurs but A does not, and both A and B occur. If G1, G2, G3 are the corresponding gains, then

B̄ :     −pA|B · 0 · M1 − pB M2 − pAB M3 = G1
B ∩ Ā : −pA|B M1 + (1 − pB)M2 − pAB M3 = G2
A ∩ B : (1 − pA|B)M1 + (1 − pB)M2 + (1 − pAB)M3 = G3
The principle of rationality does not allow the possibility of setting the stakes to obtain a sure profit. If the determinant of the coefficient matrix were non-zero, the stakes M1, M2, M3 could be chosen to make all three gains positive, so we require

|    0          −pB        −pAB    |
|  −pA|B      1 − pB      −pAB    |  =  0
| 1 − pA|B    1 − pB     1 − pAB  |

⇒ pAB − pB pA|B = 0

or

P (A | B) = P (A ∩ B) / P (B).
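As a numerical illustration (not part of the original notes), the coherence condition can be checked by evaluating the determinant for concrete stake prices; the prices below are hypothetical.

```python
# Check numerically that the 3x3 gain matrix above is singular exactly
# when p_AB = p_B * p_A|B (coherent prices).  All prices are illustrative.

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def gain_matrix(p_b, p_ab, p_a_given_b):
    """Coefficient matrix of the gains G1, G2, G3 in the stakes M1, M2, M3."""
    return [
        [0.0,              -p_b,     -p_ab],   # B does not occur (conditional bet off)
        [-p_a_given_b,  1 - p_b,     -p_ab],   # B occurs, A does not
        [1 - p_a_given_b, 1 - p_b, 1 - p_ab],  # A and B both occur
    ]

# Coherent prices: p_AB = p_B * p_A|B, so the determinant vanishes
coherent = det3(gain_matrix(0.5, 0.2, 0.4))
# Incoherent prices: non-zero determinant, so a sure profit could be engineered
incoherent = det3(gain_matrix(0.5, 0.3, 0.4))

print(coherent)     # ~0.0
print(incoherent)   # = p_AB - p_B * p_A|B = 0.1
```

The determinant always equals pAB − pB pA|B, which is how the product rule drops out of the coherence requirement.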
9.2 Estimation formulated as a decision problem
A The action space, which is the set of all possible actions a ∈ A available.
Θ The parameter space, consisting of all possible “states of nature”, only one of which occurs or will occur (this “true” state being unknown).
L The loss function, having domain Θ × A (the set of all possible consequences (θ, a)) and codomain R.
RX The sample space, containing all possible realisations x ∈ RX of a random variable X belonging to the family {f(x; θ) : θ ∈ Θ}.
D The decision space, consisting of all possible decisions δ ∈ D, each decision function having domain RX and codomain A.
9.2.1 The basic idea
• The true state of the world is unknown when the action is chosen.
• The loss which results from each of the possible consequences (θ, a) is known.
• Data are collected to provide information about the unknown θ.
• A choice of action is taken for each possible observation of X.
Decision theory is concerned with the criteria for making a choice.
The form of the loss function is crucial: it depends upon the nature of the problem.
Example
You are a fashion buyer and must stock up with trendy dresses. The true number you will sell is θ (unknown).
You stock up with a dresses. If you overstock you must sell off at your end-of-season sale and lose £A per dress. If you understock you lose the profit of £B per dress. Thus
L(θ, a) = A(a − θ),  a ≥ θ,
L(θ, a) = B(θ − a),  a ≤ θ.
Special forms of loss function
If A = B, you have modular loss,
L(θ, a) = A |θ − a|
To suit comparison with classical inference, quadratic loss is often used.
L(θ, a) = C(θ − a)²
9.2.2 The risk function
This is a measure of the quality of a decision function. Suppose we choose a decision δ(x) (i.e. choose a = δ(x)).
The function R with domain Θ × D and codomain R defined by

R(θ, δ) = ∫_{RX} L(θ, δ(x)) f(x; θ) dx

or

R(θ, δ) = ∑_{x∈RX} L(θ, δ(x)) p(x; θ)

is called the risk function. We may write this as

R(θ, δ) = E [L(θ, δ(X))].
Example
Suppose X is a random sample from N(θ, θ) and suppose we assume quadratic loss.
Consider

δ1(X) = X̄,   δ2(X) = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄)².

For δ1,

R(θ, δ1) = E[(θ − X̄)²].

We know that E(X̄) = θ, so

R(θ, δ1) = V(X̄) = θ/n.

For δ2,

R(θ, δ2) = E[(θ − δ2(X))²].

But E[δ2(X)] = θ, so that

R(θ, δ2) = V[δ2(X)]
         = (θ²/(n − 1)²) V[∑_{i=1}^n (Xi − X̄)²/θ]
         = (θ²/(n − 1)²) · 2(n − 1),   since ∑(Xi − X̄)²/θ ∼ χ²(n − 1),
         = 2θ²/(n − 1).
[Figure: the risk functions R(θ, δ1) = θ/n and R(θ, δ2) = 2θ²/(n − 1) plotted against θ; the two curves cross at θ = (n − 1)/(2n), where both risks equal (n − 1)/(2n²).]
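The two risk formulas above can be checked by simulation. The sketch below (illustrative θ and sample size, not from the notes) estimates both risks by Monte Carlo under quadratic loss:

```python
import random

# Monte Carlo check of the risks of delta1 = sample mean and
# delta2 = sample variance for N(theta, theta) data, quadratic loss:
#   R(theta, delta1) = theta/n,   R(theta, delta2) = 2*theta^2/(n-1).

def risks(theta, n, reps=100_000, seed=1):
    rng = random.Random(seed)
    sd = theta ** 0.5                       # variance of each Xi is theta
    loss1 = loss2 = 0.0
    for _ in range(reps):
        x = [rng.gauss(theta, sd) for _ in range(n)]
        xbar = sum(x) / n
        s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
        loss1 += (theta - xbar) ** 2
        loss2 += (theta - s2) ** 2
    return loss1 / reps, loss2 / reps

theta, n = 2.0, 10
r1, r2 = risks(theta, n)
print(r1, theta / n)                # both ~0.2
print(r2, 2 * theta**2 / (n - 1))   # both ~0.889
```

For θ = 2 and n = 10 the quadratic risk of δ2 is much larger, consistent with the crossing point θ = (n − 1)/(2n) being far below 2.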
Without knowing θ, how do we choose between δ1 and δ2?
Two sensible criteria
1. Take precautions against the worst.
2. Take into account any further information you may have.
Criterion 1 leads to minimax. Choose the decision leading to the smallest maximum risk, i.e. choose δ* ∈ D such that

sup_{θ∈Θ} R(θ, δ*) = inf_{δ∈D} sup_{θ∈Θ} R(θ, δ).
Minimax rules have disadvantages, the main one being that they treat Nature as an adversary. Minimax guards against the worst choice. The value of θ should not be regarded as a value deliberately chosen by Nature to make the risk large.
Fortunately in many situations they turn out to be reasonable in practice, ifnot in concept.
Criterion 2 involves introducing beliefs about θ into the analysis. If π(θ) represents our beliefs about θ, then define the Bayes risk r(π, δ) as

r(π, δ) = ∫_Θ R(θ, δ) π(θ) dθ   if θ is continuous,
r(π, δ) = ∑_{θ∈Θ} R(θ, δ) π(θ)   if θ is discrete.

Now choose δ* such that

r(π, δ*) = inf_{δ∈D} r(π, δ).
Bayes Rule
The Bayes rule with respect to a prior π is the decision rule δ* that minimises the Bayes risk among all possible decision rules.
[Figure: Bayes’ tomb]
9.2.3 Decision terminology
δ* is called a Bayes decision function.
δ*(X) is a Bayes estimator.
δ*(x) is a Bayes estimate.
If a decision function δ′ is such that ∃ δ′′ satisfying

R(θ, δ′′) ≤ R(θ, δ′) ∀θ ∈ Θ

and

R(θ, δ′′) < R(θ, δ′) for some θ ∈ Θ,

then δ′ is dominated by δ′′ and δ′ is said to be inadmissible.
All other decisions are admissible.
9.3 Prior and posterior distributions
The prior distribution represents our initial belief in the parameter θ. This means that θ is regarded as a random variable with p.d.f. π(θ).
From Bayes’ Theorem,

f(x | θ) π(θ) = π(θ | x) f(x)

or

π(θ | x) = f(x | θ) π(θ) / ∫_Θ f(x | θ) π(θ) dθ.

π(θ | x) is called the posterior p.d.f. It represents our updated belief in θ, having taken account of the data. We write the expression in the form

π(θ | x) ∝ f(x | θ) π(θ)

or

Posterior ∝ Likelihood × Prior.
Example
Suppose X1, X2, . . . , Xn is a random sample from a Poisson distribution, Poi(θ), where θ has a prior distribution

π(θ) = θ^(α−1) e^(−θ) / Γ(α),  θ ≥ 0, α > 0.

The posterior distribution is given by

π(θ | x) ∝ (θ^(∑xi) e^(−nθ) / ∏ xi!) · (θ^(α−1) e^(−θ) / Γ(α))

which simplifies to

π(θ | x) ∝ θ^(∑xi+α−1) e^(−(n+1)θ),  θ ≥ 0.

This is recognisable as a Γ-distribution and, by comparison with π(θ), we can write

π(θ | x) = (n + 1)^(∑xi+α) θ^(∑xi+α−1) e^(−(n+1)θ) / Γ(∑xi + α),  θ ≥ 0,

which represents our posterior belief in θ.
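The Poisson-Gamma update above is easy to express in code. A minimal sketch (the counts and α are hypothetical, not from the notes): the prior is Gamma(α, rate 1) and the posterior is Gamma(∑xi + α, rate n + 1), which we verify against Monte Carlo draws.

```python
import random

# Conjugate update: Gamma(alpha, rate 1) prior + Poisson data
# -> Gamma(sum(x) + alpha, rate n + 1) posterior.

def poisson_gamma_posterior(x, alpha):
    """Return (shape, rate) of the posterior Gamma distribution."""
    return sum(x) + alpha, len(x) + 1

x = [3, 1, 4, 2, 2]        # hypothetical Poisson counts
alpha = 2.0
shape, rate = poisson_gamma_posterior(x, alpha)
print(shape, rate)          # 14.0, 6

# Check the closed-form posterior mean shape/rate by sampling
rng = random.Random(0)
draws = [rng.gammavariate(shape, 1 / rate) for _ in range(100_000)]
post_mean = sum(draws) / len(draws)
print(post_mean, shape / rate)   # both ~2.33
```

Note that `random.gammavariate` takes a scale (1/rate) as its second argument.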
9.4 Decisions
We have already defined the Bayes risk to be

r(π, δ) = ∫_Θ R(θ, δ) π(θ) dθ

where

R(θ, δ) = ∫_{RX} L(θ, δ(x)) f(x | θ) dx

so that

r(π, δ) = ∫_Θ ∫_{RX} L(θ, δ(x)) f(x | θ) dx π(θ) dθ
        = ∫_{RX} ∫_Θ L(θ, δ(x)) π(θ | x) dθ f(x) dx.

For any given x, minimising the Bayes risk is equivalent to minimising

∫_Θ L(θ, δ(x)) π(θ | x) dθ,

i.e. we minimise the posterior expected loss.
Example
Take the posterior distribution in the previous example and, for no special reason, assume quadratic loss

L(θ, δ(x)) = [θ − δ(x)]².

Then we minimise

∫_Θ [θ − δ(x)]² π(θ | x) dθ

by differentiating with respect to δ to obtain

−2 ∫_Θ [θ − δ(x)] π(θ | x) dθ = 0

⇒ δ(x) = ∫_Θ θ π(θ | x) dθ.
This is the mean of the posterior distribution.

δ(x) = ∫_0^∞ (n + 1)^(∑xi+α) θ^(∑xi+α) e^(−(n+1)θ) / Γ(∑xi + α) dθ

     = (n + 1)^(∑xi+α) Γ(∑xi + α + 1) / [Γ(∑xi + α) (n + 1)^(∑xi+α+1)]

     = (∑xi + α) / (n + 1).
Theorem: Suppose that δπ is a decision rule that is Bayes with respect to some prior distribution π. If the risk function of δπ satisfies

R(θ, δπ) ≤ r(π, δπ)  for all θ ∈ Θ,

then δπ is a minimax decision rule.

Proof: Suppose δπ is not minimax. Then there is a decision rule δ′ such that

sup_{θ∈Θ} R(θ, δ′) < sup_{θ∈Θ} R(θ, δπ).

For this decision rule we have, since, if f(x) ≤ k, then E(f(X)) ≤ k,

r(π, δ′) ≤ sup_{θ∈Θ} R(θ, δ′) < sup_{θ∈Θ} R(θ, δπ) ≤ r(π, δπ),

contradicting the statement that δπ is Bayes with respect to π. Hence δπ is minimax.
Example
Let X1, . . . ,Xn be a random sample from a Bernoulli distribution B(1, p).Let Y =
∑Xi and consider estimating p using the loss function
L(p, a) =(p− a)2
p(1 − p),
a loss which penalises mistakes more if p is near 0 or 1. The m.l.e. is p̂ = Y/n and

R(p, p̂) = E[(p − p̂)² / (p(1 − p))] = (1/(p(1 − p))) · (p(1 − p)/n) = 1/n.
The risk is constant. [NB A decision rule with constant risk is called an equalizer rule.] By the theorem above, a constant-risk rule that is Bayes with respect to some prior is minimax.
We now show that p̂ is the Bayes estimator with respect to the U(0, 1) prior, which will imply that p̂ is minimax.
π(p | y) ∝ p^y (1 − p)^(n−y)

so we want the value of a which minimises

∫_0^1 ((p − a)² / (p(1 − p))) p^y (1 − p)^(n−y) dp.

Differentiation with respect to a yields

a = ∫_0^1 p^y (1 − p)^(n−y−1) dp / ∫_0^1 p^(y−1) (1 − p)^(n−y−1) dp.
Now a β-distribution with parameters α, β has p.d.f.

f(x) = Γ(α + β) x^(α−1) (1 − x)^(β−1) / (Γ(α)Γ(β)),  x ∈ [0, 1],

so

a = (Γ(y + 1)Γ(n − y) / Γ(n + 1)) · (Γ(n) / (Γ(y)Γ(n − y)))

and, using the identity Γ(α + 1) = αΓ(α),

a = y/n,  i.e. p̂ = Y/n.
9.5 Bayes estimates
(i) The mean of π(θ | x) is the Bayes estimate with respect to quadratic loss.
Proof
Choose δ(x) to minimise

∫_Θ [θ − δ(x)]² π(θ | x) dθ

by differentiating with respect to δ and equating to zero. Then

δ(x) = ∫_Θ θ π(θ | x) dθ

since ∫_Θ π(θ | x) dθ = 1.
(ii) The median of π(θ | x) is the Bayes estimate with respect to modular loss (i.e. absolute value loss).
Proof
Choose δ(x) to minimise

∫_Θ |θ − δ(x)| π(θ | x) dθ = ∫_{−∞}^{δ(x)} (δ(x) − θ) π(θ | x) dθ + ∫_{δ(x)}^∞ (θ − δ(x)) π(θ | x) dθ.

Differentiate with respect to δ and equate to zero:

∫_{−∞}^{δ(x)} π(θ | x) dθ = ∫_{δ(x)}^∞ π(θ | x) dθ

⇒ 2 ∫_{−∞}^{δ(x)} π(θ | x) dθ = ∫_{−∞}^∞ π(θ | x) dθ = 1

so that

∫_{−∞}^{δ(x)} π(θ | x) dθ = 1/2,

and δ(x) is the median of π(θ | x).
(iii) The mode of π(θ | x) is the Bayes estimate with respect to zero-one loss (i.e. loss of zero for a correct estimate and loss of one for any incorrect estimate).
Proof
The loss function has the form

L(θ, δ(x)) = 1 if δ(x) ≠ θ,  0 if δ(x) = θ.

This is only a useful idea for a discrete distribution. Writing δ(x) = θ*,

∑_{θ∈Θ} L(θ, δ(x)) π(θ | x) = ∑_{θ∈Θ\{θ*}} π(θ | x) = 1 − π(θ* | x).
Minimisation means maximising π(θ∗ | x), so δ(x) is taken to be the‘most likely’ value − the mode.
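The three results can be illustrated on a toy discrete posterior (the probabilities below are made up): numerically minimising each expected loss recovers the mean, median, and mode respectively.

```python
# Toy posterior pi(theta | x) on a small support; numbers are hypothetical.
support = [0, 1, 2, 3, 4]
post = [0.05, 0.40, 0.25, 0.20, 0.10]   # sums to 1; mode at theta = 1

def exp_loss(a, loss):
    """Posterior expected loss of action a."""
    return sum(p * loss(t, a) for t, p in zip(support, post))

actions = [i / 100 for i in range(0, 401)]   # fine grid of candidate actions

quad = min(actions, key=lambda a: exp_loss(a, lambda t, b: (t - b) ** 2))
absd = min(actions, key=lambda a: exp_loss(a, lambda t, b: abs(t - b)))
mode = support[post.index(max(post))]        # minimises zero-one loss directly

mean = sum(t * p for t, p in zip(support, post))
print(quad, mean)   # quadratic loss -> posterior mean 1.9
print(absd)         # absolute loss  -> posterior median 2.0
print(mode)         # zero-one loss  -> posterior mode 1
```

The three Bayes actions differ here (1.9, 2.0, 1), which is a useful reminder that the choice of loss function matters.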
Remember that π(θ | x) represents our most up-to-date belief in θ and is all that is needed for inference.
9.6 Credible intervals
Definition
The interval (θ1, θ2) is a 100(1 − α)% credible interval for θ if

∫_{θ1}^{θ2} π(θ | x) dθ = 1 − α,  0 ≤ α ≤ 1.
[Figure: a posterior density π(θ | x) with two different 100(1 − α)% credible intervals (θ1, θ2) and (θ′1, θ′2) marked on the θ-axis.]
Both (θ1, θ2) and (θ′1, θ′2) are 100(1− α)% credible intervals.
Definition
The interval (θ1, θ2) is a 100(1− α)% highest posterior density interval if
(i) (θ1, θ2) is a 100(1 − α)% credible interval;
(ii) ∀θ′ ∈ (θ1, θ2) and θ′′ ∉ (θ1, θ2),

π(θ′ | x) ≥ π(θ′′ | x).
Obviously this defines the shortest possible 100(1− α)% credible interval.
[Figure: a 100(1 − α)% highest posterior density interval (θ1, θ2); the area under π(θ | x) between θ1 and θ2 is 1 − α.]
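Both kinds of interval can be approximated from posterior draws. A Monte Carlo sketch, assuming (hypothetically) a Gamma posterior with shape 14 and rate 6: sorted draws give an equal-tailed credible interval, and the shortest window covering 95% of the draws approximates the HPD interval.

```python
import random

# Equal-tailed and shortest (HPD-style) 95% intervals from sorted posterior
# draws, avoiding any analytic quantile function.  Posterior is illustrative.

rng = random.Random(0)
shape, rate = 14.0, 6.0     # e.g. a Gamma posterior for theta
draws = sorted(rng.gammavariate(shape, 1 / rate) for _ in range(100_000))

n, alpha = len(draws), 0.05
lo = draws[int(n * alpha / 2)]
hi = draws[int(n * (1 - alpha / 2))]
print("equal-tailed:", lo, hi)

# Shortest interval containing a fraction 1 - alpha of the draws
k = int(n * (1 - alpha))
i = min(range(n - k), key=lambda j: draws[j + k] - draws[j])
print("HPD:", draws[i], draws[i + k])
```

For a right-skewed posterior like this Gamma, the HPD interval sits slightly to the left of the equal-tailed one and is never longer.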
9.7 Features of Bayesian models
(i) If there is a sufficient statistic, say T, the posterior distribution will depend upon the data only through T.
(ii) There is a natural parametric family of priors such that the posterior distributions also belong to this family. Such priors are called conjugate priors.
Example: Binomial likelihood, beta prior
A binomial likelihood has the form

p(x | θ) = (n choose x) θ^x (1 − θ)^(n−x),  x = 0, 1, . . . , n,

and a beta prior is

π(θ) = θ^(a−1) (1 − θ)^(b−1) / B(a, b),  θ ∈ [0, 1],

where

B(a, b) = Γ(a)Γ(b) / Γ(a + b).
Posterior ∝ Likelihood × Prior,

so that

π(θ | x) ∝ θ^(x+a−1) (1 − θ)^(n−x+b−1)

and comparison with the prior leads to

π(θ | x) = θ^(x+a−1) (1 − θ)^(n−x+b−1) / B(x + a, n − x + b),  θ ∈ [0, 1],

which is also a beta distribution.
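The beta-binomial update above in code, with illustrative numbers (a Beta(2, 2) prior and 7 successes in 10 trials, neither from the notes):

```python
import random

# Conjugate update: Beta(a, b) prior + x successes in n binomial trials
# -> Beta(x + a, n - x + b) posterior.

def beta_binomial_update(a, b, x, n):
    return x + a, n - x + b

a, b = 2.0, 2.0
x, n = 7, 10
a_post, b_post = beta_binomial_update(a, b, x, n)
print(a_post, b_post)        # 9.0, 5.0

# Posterior mean a_post/(a_post + b_post) checked by sampling
rng = random.Random(0)
draws = [rng.betavariate(a_post, b_post) for _ in range(100_000)]
post_mean = sum(draws) / len(draws)
print(post_mean, a_post / (a_post + b_post))   # both ~0.643
```

The prior acts like a pseudo-sample of a successes and b failures, which is why conjugacy reduces to simple counting.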
9.8 Hypothesis testing
The Bayesian only has a concept of a hypothesis test in certain very specific situations. This is because of the alien idea of having to regard a parameter as a fixed point. However the concept has some validity in an example such as the following.
Example:
Water samples are taken from a reservoir and checked for bacterial content. These provide information about θ, the proportion of infected samples. If it is believed that bacterial density is such that more than 1% of samples are infected, there is a legal obligation to change the chlorination procedure.
Bayesian procedure is
(i) use the data to obtain the posterior distribution of θ;
(ii) calculate P (θ > 0.01).
EC rules are formulated in statistical terms and, in such cases, allow for a 5% significance level. This could be interpreted in a Bayesian sense as allowing for a wrong decision with probability no more than 0.05, so order increased chlorination if P (θ > 0.01) > 0.05.
Where for some reason (e.g. a legal reason) a fixed parameter value is of interest because it defines a threshold, then a Bayesian concept of a test is possible. It is usually performed either by calculating a credible bound or a credible interval and looking at it in relation to that threshold value or values, or by calculating posterior probabilities based on the threshold or thresholds.
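Steps (i) and (ii) of the reservoir procedure can be sketched in code. Assuming, purely for illustration, a Beta(1, 1) prior and 2 infected samples out of 400, the posterior is Beta(3, 399), and P(θ > 0.01) is estimated by posterior sampling:

```python
import random

# Posterior for the infection proportion theta under a uniform Beta(1, 1)
# prior with x infected samples out of n.  The data are hypothetical.
x, n = 2, 400
a_post, b_post = x + 1, n - x + 1      # Beta(3, 399) posterior

rng = random.Random(0)
draws = [rng.betavariate(a_post, b_post) for _ in range(100_000)]
p_exceed = sum(d > 0.01 for d in draws) / len(draws)
print(p_exceed)

# Decision rule from the text: order increased chlorination
# if P(theta > 0.01) > 0.05
print("increase chlorination:", p_exceed > 0.05)
```

Even though the posterior mean (about 0.0075) is below the 1% threshold, the posterior still puts appreciable probability above it, so the rule triggers.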
9.9 Inferences for normal distributions
9.9.1 Unknown mean, known variance
Consider data from N(θ, σ²) and a prior θ ∼ N(τ, κ²).
Posterior ∝ Likelihood × Prior,

so, since

f(x | θ, σ²) = (2πσ²)^(−n/2) exp(−(1/(2σ²)) ∑(xi − θ)²)

and

π(θ) = (1/√(2πκ²)) exp(−(1/(2κ²))(θ − τ)²),  θ ∈ R,

π(θ | x, σ², τ, κ²) ∝ exp(−(n/(2σ²))(θ² − 2θx̄) − (1/(2κ²))(θ² − 2θτ)).
But

(n/(2σ²))(θ² − 2θx̄) + (1/(2κ²))(θ² − 2θτ)

= (1/2)(n/σ² + 1/κ²)θ² − (nx̄/σ² + τ/κ²)θ + garbage

= (1/2)(n/σ² + 1/κ²) (θ − (nx̄/σ² + τ/κ²)/(n/σ² + 1/κ²))² + garbage,

where “garbage” collects terms not involving θ, so the posterior p.d.f. is proportional to

exp[ −(1/2)(n/σ² + 1/κ²) (θ − (nx̄/σ² + τ/κ²)/(n/σ² + 1/κ²))² ].
This means that

θ | x, σ², τ, κ² ∼ N( (nx̄/σ² + τ/κ²)/(n/σ² + 1/κ²),  (n/σ² + 1/κ²)^(−1) ).
The form is a weighted average. The prior mean τ has weight 1/κ²: this is 1/(prior variance). The observed sample mean x̄ has weight n/σ²: this is 1/(variance of the sample mean).
This is reasonable because

lim_{n→∞} (nx̄/σ² + τ/κ²)/(n/σ² + 1/κ²) = x̄

and

lim_{κ²→∞} (nx̄/σ² + τ/κ²)/(n/σ² + 1/κ²) = x̄.
The posterior mean tends to x̄ as the amount of data swamps the prior or as prior beliefs become very vague.
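The posterior-mean formula above is a precision-weighted average and is short to code. All numbers below are illustrative:

```python
# Normal data with known variance sigma2, prior theta ~ N(tau, kappa2):
# posterior mean is a weighted average with weights n/sigma2 and 1/kappa2.

def normal_posterior(xbar, n, sigma2, tau, kappa2):
    w_data, w_prior = n / sigma2, 1 / kappa2
    mean = (w_data * xbar + w_prior * tau) / (w_data + w_prior)
    var = 1 / (w_data + w_prior)
    return mean, var

# Prior N(0, 10^2); 25 observations with xbar = 3 and known sigma^2 = 4
mean, var = normal_posterior(xbar=3.0, n=25, sigma2=4.0, tau=0.0, kappa2=100.0)
print(mean, var)    # mean just below 3, small posterior variance

# As n grows, the data swamp the prior and the posterior mean tends to xbar
big_mean, _ = normal_posterior(3.0, 10_000, 4.0, 0.0, 100.0)
print(big_mean)     # ~3.0
```

Letting kappa2 grow instead of n produces the same limit, matching the two limits displayed above.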
9.9.2 Unknown variance, known mean
We need to assign a prior p.d.f. π(σ²). The Γ family of p.d.f.’s provides a flexible set of shapes over [0, ∞). We take

τ = 1/σ² ∼ Γ(ν/2, νλ/2)

where λ and ν are chosen to provide suitable location and shape.
In other words, the prior p.d.f. of τ = 1/σ² is chosen to be of the form

π(τ) = ((νλ)^(ν/2) τ^(ν/2−1) / (2^(ν/2) Γ(ν/2))) exp(−νλτ/2),  τ ≥ 0,

or

π(σ²) = ((νλ)^(ν/2) (σ²)^(−ν/2−1) / (2^(ν/2) Γ(ν/2))) exp(−νλ/(2σ²)).
τ = 1/σ² is called the precision. The posterior p.d.f. is given by

π(σ² | x, θ) ∝ f(x | θ, σ²) π(σ²),

and the right-hand side, as a function of σ², is proportional to

(σ²)^(−n/2) exp(−(1/(2σ²)) ∑(xi − θ)²) × (σ²)^(−ν/2−1) exp(−νλ/(2σ²)).
Thus π(σ² | x, θ) is proportional to

(σ²)^(−(ν+n)/2−1) exp(−(1/(2σ²)) [νλ + ∑(xi − θ)²]).

This is the same as the prior except that ν is replaced by ν + n and λ is replaced by

(νλ + ∑(xi − θ)²) / (ν + n).
Consider the random variable

W = (νλ + ∑(xi − θ)²) / σ².

Clearly

fW(w) = C w^((ν+n)/2−1) e^(−w/2),  w ≥ 0.

This is a χ²-distribution with ν + n degrees of freedom. In other words,

(νλ + ∑(xi − θ)²) / σ² ∼ χ²(ν + n).
Again this agrees with intuition. As n → ∞ we approach the classical estimate, and also as ν → 0, which corresponds to vague prior knowledge.
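A sketch of this update in code, under hypothetical assumptions (known mean θ = 5, true σ² = 4, arbitrary prior hyperparameters): draw the precision τ from its posterior Γ((ν + n)/2, rate (νλ + ∑(xi − θ)²)/2) and invert to get σ²:

```python
import random

# Known-mean normal model with a Gamma prior on the precision tau = 1/sigma^2.
# Posterior: tau | x ~ Gamma((nu + n)/2, rate (nu*lambda + SS)/2),
# where SS = sum (x_i - theta)^2.  All numbers are illustrative.

rng = random.Random(0)

theta = 5.0                    # known mean
nu, lam = 4.0, 2.0             # prior hyperparameters
x = [rng.gauss(theta, 2.0) for _ in range(50)]   # true sigma^2 = 4

ss = sum((xi - theta) ** 2 for xi in x)
shape_post = (nu + len(x)) / 2
rate_post = (nu * lam + ss) / 2

# Draw sigma^2 from its posterior via the precision
draws = [1 / rng.gammavariate(shape_post, 1 / rate_post) for _ in range(100_000)]
post_mean = sum(draws) / len(draws)
print(post_mean)   # close to the true variance 4 with this much data
```

With n = 50 the data dominate the weak prior, so the posterior mean of σ² sits near the sample quantity ss/n.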
9.9.3 Mean and variance unknown
We now have to assign a joint prior p.d.f. π(θ, σ²) to obtain a joint posterior p.d.f.

π(θ, σ² | x) ∝ f(x | θ, σ²) π(θ, σ²).

We shall look at the special case of a non-informative prior. From theoretical considerations beyond the scope of this course, a non-informative prior may be approximated by

π(θ, σ²) ∝ 1/σ².
Note that this is not to be thought of as an expression of prior beliefs but rather as an approximate prior which allows the posterior to be dominated by the data.

π(θ, σ² | x) ∝ (σ²)^(−n/2−1) exp(−(1/(2σ²)) ∑(xi − θ)²).

The right-hand side may be written in the form

(σ²)^(−n/2−1) exp(−(1/(2σ²)) [∑(xi − x̄)² + n(x̄ − θ)²]).

We can use this form to integrate out, in turn, θ and σ².
Using

∫_{−∞}^∞ exp(−(n/(2σ²))(θ − x̄)²) dθ = √(2πσ²/n)

we find

π(σ² | x) ∝ (σ²)^(−(n−1)/2−1) exp(−(1/(2σ²)) ∑(xi − x̄)²)

and, putting W = ∑(xi − x̄)²/σ²,

fW(w) = C w^((n−1)/2−1) e^(−w/2),  w ≥ 0,

so we see that

∑(xi − x̄)² / σ² ∼ χ²(n − 1).
To integrate out σ², make the substitution σ² = τ^(−1). Then π(θ | x) is proportional to

∫_0^∞ τ^(n/2−1) exp(−(τ/2) [∑(xi − x̄)² + n(x̄ − θ)²]) dτ.

From the definition of the Γ-function,

∫_0^∞ y^(α−1) e^(−βy) dy = Γ(α)/β^α,

and we see, since only the terms in θ are relevant,

π(θ | x) ∝ (∑(xi − x̄)² + n(x̄ − θ)²)^(−n/2).
or

π(θ | x) ∝ (1 + t²/(n − 1))^(−n/2)

where

t = √n (θ − x̄) / √(∑(xi − x̄)²/(n − 1)).

This is the kernel of Student’s t-distribution, so t | x ∼ t(n − 1).