Statistical Decision Theory
Bayes’ theorem:
For discrete events
$$\Pr(B_j \mid A) = \frac{\Pr(A \mid B_j)\Pr(B_j)}{\Pr(A)} = \frac{\Pr(A \mid B_j)\Pr(B_j)}{\sum_{i=1}^{M}\Pr(A \mid B_i)\Pr(B_i)}$$

with B1, …, BM mutually exclusive events such that B1 ∪ … ∪ BM = S (the sample space).
For probability density functions

$$f_{Y\mid X}(y \mid x) = \frac{f_{X\mid Y}(x \mid y)\,f_Y(y)}{f_X(x)} = \frac{f_{X\mid Y}(x \mid y)\,f_Y(y)}{\int f_{X\mid Y}(x \mid z)\,f_Y(z)\,dz}$$
The Bayesian “philosophy”
The classical approach (frequentist’s view):
The random sample X = (X1, …, Xn) is assumed to come from a distribution with a probability density function f(x; θ), where θ is an unknown but fixed parameter.
The sample is investigated from its random variable properties relating to f(x; θ). The uncertainty about θ is assessed solely on the basis of the sample properties.
The Bayesian approach:
The random sample X = (X1, …, Xn) is assumed to come from a distribution with a probability density function f(x; θ), where the uncertainty about θ is modelled with a probability distribution (i.e. a p.d.f.), called the prior distribution.
The obtained values of the sample, i.e. x = (x1, …, xn), are used to update the information from the prior distribution to a posterior distribution for θ.
Main differences:
In the classical approach, θ is fixed, while in the Bayesian approach θ is a random variable.
In the classical approach focus is on the sampling distribution of X, while in the Bayesian approach focus is on the variation of θ.
Bayesian: “What we observe is fixed, what we do not observe is random.”
Frequentist: “What we observe is random, what we do not observe is fixed.”
Concepts of the Bayesian framework
Prior density: p(θ)
Likelihood: L(θ; x) “as before”
Posterior density: q(θ | x) = q(θ; x) (the book uses the second notation)
Relation through Bayes’ theorem:
$$q(\theta \mid \mathbf{x}) = \frac{L(\theta;\mathbf{x})\,p(\theta)}{\int L(\theta;\mathbf{x})\,p(\theta)\,d\theta} = \frac{f(\mathbf{x};\theta)\,p(\theta)}{\int f(\mathbf{x};\lambda)\,p(\lambda)\,d\lambda}$$
The textbook writes

$$q(\theta;\mathbf{x}) = \frac{L(\theta;\mathbf{x})\,p(\theta)}{h(\mathbf{x})}, \qquad h(\mathbf{x}) = f(\mathbf{x})$$

Still, the posterior is referred to as the distribution of θ conditional on x.
Decision-theoretic elements
1. One of a number of actions should be decided on.
2. State of nature: A number of states possible. Usually represented by θ.
3. For each state of nature the relative desirabilities of the different actions possible can be quantified
4. Prior information for the different states of nature may be available: Prior distribution of θ.
5. Data may be available. Usually represented by x. Can be used to update the knowledge about the relative desirabilities of the different actions.
In mathematical notation for this course:
True state of nature: θ. Uncertainty described by the prior p(θ)
Data: x, observation of X, whose p.d.f. depends on θ
(data is thus assumed to be available)
Decision procedure: δ
Action: δ(x). The decision procedure becomes an action when applied to given data x.
Loss function: LS(θ, δ(x)) measures the loss from taking action δ(x) when θ holds.
Risk function:

$$R(\theta,\delta) = E_{\mathbf{X}}\big[L_S(\theta,\delta(\mathbf{X}))\big] = \int L_S(\theta,\delta(\mathbf{x}))\,f(\mathbf{x};\theta)\,d\mathbf{x}$$
Note that the risk function is the expected loss with respect to the simultaneous distribution of X1, … , Xn
Note also that the risk function is for the decision procedure, and not for the particular action
Admissible procedures:
A procedure δ1 is inadmissible if there exists another procedure δ2 such that R(θ, δ1) ≥ R(θ, δ2) for all values of θ, with strict inequality for at least one θ.
A procedure which is not inadmissible (i.e. no other procedure with a lower risk function for any θ can be found) is said to be admissible.
Minimax procedure:
A procedure δ* is a minimax procedure if

$$R(\theta,\delta^*) = \min_{\delta}\max_{\theta} R(\theta,\delta)$$

i.e. θ is chosen to be the “worst” possible value, and under that value the procedure that gives the lowest possible risk is chosen.
The minimax procedure uses no prior information about θ; thus it is not a Bayesian procedure.
Example
Suppose you are about to make a decision on whether you should buy or rent a new TV.
δ1 = “Buy the TV”  δ2 = “Rent the TV”
Now, assume θ is the mean time until the TV breaks down for the first time.
Let θ assume three possible values: 6, 12 and 24 months.
The cost of the TV is $500 if you buy it and $30 per month if you rent it
If the TV breaks down after 12 months you will have to replace it, at the same cost as before, if you bought it. If you rented it, you will get a new TV at no cost, provided you continue your contract.
Let X be the time in months until the TV breaks down and assume this variable is exponentially distributed with mean θ.
A loss function for an ownership of maximum 24 months may be defined as
LS(θ, δ1(X)) = 500 + 500·H(X − 12) and LS(θ, δ2(X)) = 30·24 = 720, where H is the Heaviside step function (H(u) = 1 if u > 0, and 0 otherwise).
Then

$$R(\theta,\delta_1) = E\big[500 + 500\,H(X-12)\big] = 500 + 500\int_{12}^{\infty}\frac{1}{\theta}e^{-x/\theta}\,dx = 500 + 500\,e^{-12/\theta}$$

$$R(\theta,\delta_2) = 720$$

Now compare the risks for the three possible values of θ (table below). Clearly the risk for the first procedure increases with θ, while the risk for the second is constant. In searching for the minimax procedure we therefore focus on the largest possible value of θ, where δ2 has the smallest risk.
⇒ δ2 is the minimax procedure
θ     R(θ, δ1)   R(θ, δ2)
6     568        720
12    684        720
24    803        720
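A short sketch, using nothing beyond the example’s own numbers, that reproduces the risks in this table:

```python
# Risks in the TV example: R(theta, delta_1) = 500 + 500*Pr(X > 12), X ~ Exp(mean theta);
# R(theta, delta_2) = 30 * 24 = 720 regardless of theta.
import math

def risk_buy(theta):
    return 500 + 500 * math.exp(-12 / theta)

def risk_rent(theta):
    return 720

for theta in (6, 12, 24):
    print(theta, round(risk_buy(theta)), risk_rent(theta))
# Worst case over theta: 803 for delta_1, 720 for delta_2 -> delta_2 is minimax.
```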
Bayes procedure
Bayes risk:
Uses the prior distribution of the unknown parameter θ:

$$B(\delta) = \int R(\theta,\delta)\,p(\theta)\,d\theta$$

A Bayes procedure is a procedure that minimizes the Bayes risk:

$$\delta_B = \arg\min_{\delta}\int R(\theta,\delta)\,p(\theta)\,d\theta$$
Example cont.
Assume the three possible values of θ (6, 12 and 24) have the prior probabilities 0.2, 0.3 and 0.5.
Then

$$B(\delta_1) = 0.2\,(500+500e^{-2}) + 0.3\,(500+500e^{-1}) + 0.5\,(500+500e^{-1/2}) \approx 720.3$$

$$B(\delta_2) = (0.2 + 0.3 + 0.5)\cdot 720 = 720 \quad (\text{does not depend on } \theta)$$

Thus the Bayes risk is minimized by δ2, and therefore δ2 is the Bayes procedure.
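Continuing the sketch above, the Bayes risks follow by weighting the risks with the prior probabilities:

```python
# Bayes risks with priors 0.2, 0.3, 0.5 on theta = 6, 12, 24 (risk_buy/risk_rent as above).
priors = {6: 0.2, 12: 0.3, 24: 0.5}
print(sum(p * risk_buy(t) for t, p in priors.items()))    # B(delta_1) ~ 720.3
print(sum(p * risk_rent(t) for t, p in priors.items()))   # B(delta_2) = 720
```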
Decision theory applied to point estimation
The action is a particular point estimator θ̂ = θ̂(x)
State of nature is the true value of θ
The loss function is a measure of how good (desirable) the estimator is as an estimator of θ:

$$L_S = L_S(\theta,\hat\theta)$$

Prior information is quantified by the prior distribution (p.d.f.) p(θ)
Data is the random sample x from a distribution with p.d.f. f(x; θ)
Three simple loss functions
Zero-one loss:

$$L_S(\theta,\hat\theta) = \begin{cases} 0, & |\hat\theta-\theta| \le b \\ a, & |\hat\theta-\theta| > b \end{cases} \qquad a, b > 0$$

Absolute error loss:

$$L_S(\theta,\hat\theta) = a\,|\hat\theta-\theta|, \qquad a > 0$$

Quadratic (error) loss:

$$L_S(\theta,\hat\theta) = a\,(\hat\theta-\theta)^2, \qquad a > 0$$
Minimax estimators:
Find the value of θ that maximizes the expected loss with respect to the sample values, i.e. that maximizes

$$E_{\mathbf{X}}\big[L_S(\theta,\hat\theta(\mathbf{X}))\big]$$

over the set of estimators θ̂(X). Then, the particular estimator that minimizes the risk for that value of θ is the minimax estimator.
Not so easy to find!
Bayes estimators
A Bayes estimator is the estimator that minimizes the Bayes risk

$$\int R(\theta,\hat\theta)\,p(\theta)\,d\theta = \int\!\!\int L_S(\theta,\hat\theta(\mathbf{x}))\,f(\mathbf{x};\theta)\,d\mathbf{x}\;p(\theta)\,d\theta = \int\left[\int L_S(\theta,\hat\theta(\mathbf{x}))\,q(\theta \mid \mathbf{x})\,d\theta\right] h(\mathbf{x})\,d\mathbf{x}$$

For any given value of x, what has to be minimized is therefore

$$\int L_S(\theta,\hat\theta(\mathbf{x}))\,q(\theta \mid \mathbf{x})\,d\theta$$
The Bayes philosophy is that data (x) should be considered to be given, and the minimization is therefore carried out conditionally on x.
Now, minimization with respect to the different loss functions will result in measures of location in the posterior distribution of θ.
Zero-one loss: θ̂ is the posterior mode for θ given x
Absolute error loss: θ̂ is the posterior median for θ given x
Quadratic loss: θ̂ is the posterior mean for θ given x
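These three location summaries are easy to read off numerically from a grid posterior; a sketch follows, where the Beta-shaped posterior is an assumed example, not from the source:

```python
# Mode, median and mean of a posterior on a grid: the Bayes estimators under
# zero-one, absolute error and quadratic loss respectively.
import numpy as np

theta = np.linspace(0.001, 0.999, 999)
step = theta[1] - theta[0]
post = theta**7 * (1 - theta)**3              # unnormalized Beta(8, 4)-shaped posterior
post /= post.sum() * step

mode = theta[np.argmax(post)]                                  # zero-one loss
median = theta[np.searchsorted(np.cumsum(post) * step, 0.5)]   # absolute error loss
mean = np.sum(theta * post) * step                             # quadratic loss
print(mode, median, mean)
```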
About prior distributions
Conjugate prior distributions
Example: Assume the parameter of interest is π, the proportion of some property of interest in the population (i.e. the probability for this property to occur).
A reasonable prior density for π is the Beta density:
$$p(\pi;\alpha,\beta) = \frac{1}{B(\alpha,\beta)}\,\pi^{\alpha-1}(1-\pi)^{\beta-1}, \qquad 0 \le \pi \le 1$$

where α > 0 and β > 0 are two (constant) parameters and

$$B(\alpha,\beta) = \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx$$

is the so-called Beta function.
[Figure: Beta densities on (0, 1) for various (α, β): Beta(1,1), Beta(5,5), Beta(1,5), Beta(5,1), Beta(2,5), Beta(5,2), Beta(0.5,0.5), Beta(0.3,0.7), Beta(0.7,0.3)]
Now, assume a sample of size n from the population in which y of the values possess the property of interest.
The likelihood becomes

$$L(\pi;y) = \binom{n}{y}\,\pi^{y}(1-\pi)^{n-y}$$

and the posterior density becomes

$$q(\pi \mid y) = \frac{L(\pi;y)\,p(\pi)}{\int_0^1 L(\pi;y)\,p(\pi)\,d\pi} = \frac{\binom{n}{y}\pi^{y}(1-\pi)^{n-y}\,\frac{1}{B(\alpha,\beta)}\pi^{\alpha-1}(1-\pi)^{\beta-1}}{\int_0^1 \binom{n}{y}x^{y}(1-x)^{n-y}\,\frac{1}{B(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1}\,dx} = \frac{\pi^{y+\alpha-1}(1-\pi)^{n-y+\beta-1}}{B(y+\alpha,\,n-y+\beta)}$$
Thus, the posterior density is also a Beta density, with parameters y + α and n − y + β.
Prior distributions that combined with the likelihood gives a posterior in the same distributional family are named conjugate priors.
(Note that by a distributional family we mean distributions that go under a common name: Normal distribution, Binomial distribution, Poisson distribution etc. )
A conjugate prior always goes together with a particular likelihood to produce the posterior.
We sometimes refer to a conjugate pair of distributions meaning
(prior distribution, sample distribution = likelihood)
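For the (Beta, binomial) pair derived above, the update is a one-liner; the numbers in this sketch are assumed for illustration:

```python
# Conjugate Beta-binomial update: prior Beta(alpha, beta) and y successes out of n
# give posterior Beta(y + alpha, n - y + beta).
from scipy import stats

alpha, beta, n, y = 2.0, 5.0, 20, 8        # assumed illustrative values
posterior = stats.beta(y + alpha, n - y + beta)
print(posterior.mean())                    # posterior mean of pi
print(posterior.interval(0.95))            # a central 95% credible interval
```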
In particular, if the sample distribution, i.e. f(x; θ), belongs to the k-parameter exponential family of distributions:

$$f(x;\boldsymbol\theta) = \exp\left\{\sum_{j=1}^{k} A_j(\boldsymbol\theta)B_j(x) + C(x) + D(\boldsymbol\theta)\right\}$$

we may put

$$p(\boldsymbol\theta;\nu_1,\dots,\nu_{k+1}) = \exp\left\{\sum_{j=1}^{k}\nu_j A_j(\boldsymbol\theta) + \nu_{k+1}D(\boldsymbol\theta) + K(\nu_1,\dots,\nu_{k+1})\right\}$$

where ν1, …, νk+1 are parameters of this prior distribution and K(ν1, …, νk+1) is a (normalizing) function of ν1, …, νk+1 only.
Then

$$q(\boldsymbol\theta \mid \mathbf{x}) \propto L(\boldsymbol\theta;\mathbf{x})\,p(\boldsymbol\theta) = \exp\left\{\sum_{j=1}^{k} A_j(\boldsymbol\theta)\left[\nu_j + \sum_{i=1}^{n} B_j(x_i)\right] + (\nu_{k+1}+n)\,D(\boldsymbol\theta) + \sum_{i=1}^{n} C(x_i) + K\right\}$$

$$\propto \exp\left\{\sum_{j=1}^{k} A_j(\boldsymbol\theta)\left(\nu_j + \sum_{i=1}^{n} B_j(x_i)\right) + (\nu_{k+1}+n)\,D(\boldsymbol\theta)\right\}$$

i.e. the posterior distribution is of the same form as the prior distribution, but with parameters

$$\nu_1 + \sum_{i=1}^{n} B_1(x_i),\;\dots,\;\nu_k + \sum_{i=1}^{n} B_k(x_i),\;\nu_{k+1} + n$$

instead of ν1, …, νk, νk+1.
Some common cases (conjugate prior, sample distribution, posterior):
• Beta, Binomial, Beta: π ~ Beta(α, β), X ~ Bin(n, π) ⟹ π | x ~ Beta(x + α, n − x + β)
• Normal, Normal (σ² known), Normal: θ ~ N(μ, τ²), Xi ~ N(θ, σ²) ⟹ θ | x ~ N((σ²μ + nτ²x̄)/(σ² + nτ²), σ²τ²/(σ² + nτ²))
• Gamma, Poisson, Gamma: λ ~ Gamma(α, β), Xi ~ Po(λ) ⟹ λ | x ~ Gamma(Σxi + α, n + β)
• Pareto, Uniform, Pareto: p(θ; α, β), Xi ~ U(0, θ) ⟹ q(θ; x) ∝ θ^−(n+α+1) for θ ≥ max(x(n), β)
Example
Assume we have a sample x = (x1, …, xn) from U(0, θ) and that a prior density for θ is the Pareto density

$$p(\theta) = \theta^{-2}, \qquad \theta \ge 1 \quad (\text{i.e. } \alpha = 1,\ \beta = 1)$$

What is the Bayes estimator of θ under quadratic loss?
The Bayes estimator is the posterior mean.
The posterior distribution is also Pareto:

$$q(\theta;\mathbf{x}) = (n+1)\,m^{n+1}\,\theta^{-(n+2)}, \qquad \theta \ge m = \max\big(x_{(n)},\,1\big)$$

where x(n) = max(x1, …, xn).
$$E(\theta \mid \mathbf{x}) = \int_m^{\infty} \theta\,(n+1)\,m^{n+1}\,\theta^{-(n+2)}\,d\theta = (n+1)\,m^{n+1}\int_m^{\infty}\theta^{-(n+1)}\,d\theta = (n+1)\,m^{n+1}\cdot\frac{m^{-n}}{n} = \frac{n+1}{n}\,m$$

Compare with the maximum likelihood estimator θ̂ML = x(n) = max{xi}.
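A quick numerical check of this example (the simulated data and seed are assumed):

```python
# Pareto/uniform example: compare the ML estimator x_(n) with the Bayes estimator
# (n+1)/n * max(x_(n), 1) under quadratic loss.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 4.0, size=10)     # sample from U(0, theta) with theta = 4 (assumed)
n, m = len(x), max(x.max(), 1.0)
print(m, (n + 1) / n * m)            # theta_ML vs theta_B
```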
Non-informative priors (uninformative)
A prior distribution that gives no more information about θ than possibly the parameter space is called a non-informative or uninformative prior.
Example: Beta(1,1) for an unknown proportion π simply says that the parameter can be any value between 0 and 1 (which coincides with its definition).
A non-informative prior is characterized by the property that all values in the parameter space are equally likely.
Proper non-informative priors:
The prior is a true density or mass function
Improper non-informative priors:
The prior is a constant value over ℝᵏ
Example: N(μ, σ²) with σ² → ∞ for the mean of a normal population (i.e. a constant density over (−∞, ∞))
[Figures: examples of non-informative prior distributions]
Decision theory applied to hypothesis testing
Test of H0: θ = θ0 vs. H1: θ = θ1
Decision procedure: δC = “Use a test with critical region C”
Action: δC(x) = “Reject H0 if x ∈ C, otherwise accept H0”
Loss function:
            H0 true   H1 true
Accept H0   0         b
Reject H0   a         0
Risk function:

$$R(\theta,\delta_C) = E_{\mathbf{X}}\big[L_S(\theta,\delta_C(\mathbf{X}))\big]$$

$$R(\theta_0,\delta_C) = a\cdot\Pr(\mathbf{X}\in C \mid \theta_0) = a\alpha \quad (\text{loss when rejecting a true } H_0,\ \text{times its probability})$$

$$R(\theta_1,\delta_C) = b\cdot\Pr(\mathbf{X}\notin C \mid \theta_1) = b\beta \quad (\text{loss when accepting } H_0 \text{ when } \theta_1 \text{ holds, times its probability})$$

where α is the size of the test and 1 − β is its power.
Assume a prior setting p0 = Pr(H0 is true) = Pr(θ = θ0) and p1 = Pr(H1 is true) = Pr(θ = θ1).
The prior expected risk becomes

$$E\big[R(\theta,\delta_C)\big] = a\alpha p_0 + b\beta p_1$$
Bayes test:

$$\delta_B = \arg\min_{\delta_C} E\big[R(\theta,\delta_C)\big] = \arg\min_{C}\,(a\alpha p_0 + b\beta p_1)$$

Minimax test:

$$\delta^* = \arg\min_{\delta_C}\max_{\theta} R(\theta,\delta_C) = \arg\min_{C}\max(a\alpha,\,b\beta)$$

Lemma 6.1: Bayes tests and most powerful tests (Neyman-Pearson lemma) are equivalent in that
every most powerful test is a Bayes test for some values of p0 and p1, and every Bayes test is a most powerful test with critical region

$$C = \left\{\mathbf{x} : \frac{L(\theta_1;\mathbf{x})}{L(\theta_0;\mathbf{x})} \ge \frac{a\,p_0}{b\,p_1}\right\}$$
Example:
Assume x = (x1, x2) is a random sample from Exp(θ), i.e.

$$f(x;\theta) = \frac{1}{\theta}\,e^{-x/\theta}, \qquad x > 0,\ \theta > 0$$

We would like to test H0: θ = 1 vs. H1: θ = 2 with a Bayes test with losses a = 2 and b = 1 and with prior probabilities p0 and p1.
$$\frac{L(2;\mathbf{x})}{L(1;\mathbf{x})} = \frac{\frac{1}{2}e^{-x_1/2}\cdot\frac{1}{2}e^{-x_2/2}}{e^{-x_1}\cdot e^{-x_2}} = \frac{e^{-(x_1+x_2)/2}}{4\,e^{-(x_1+x_2)}} = \frac{e^{(x_1+x_2)/2}}{4}$$

The Bayes test therefore rejects H0 when

$$\frac{e^{(x_1+x_2)/2}}{4} \ge \frac{a\,p_0}{b\,p_1} = \frac{2p_0}{p_1} \iff x_1 + x_2 \ge 2\ln\frac{8p_0}{p_1}$$
Now, with t = 2 ln(8p0/p1) and θ = 1,

$$\Pr(X_1+X_2 > t) = \int\!\!\int_{x_1+x_2>t} e^{-x_1}e^{-x_2}\,dx_1\,dx_2 = \int_0^{t} e^{-x_1}\int_{t-x_1}^{\infty} e^{-x_2}\,dx_2\,dx_1 + \int_t^{\infty} e^{-x_1}\,dx_1 = t\,e^{-t} + e^{-t} = (1+t)\,e^{-t}$$

so the size of the test is

$$\alpha = \left(1 + 2\ln\frac{8p_0}{p_1}\right)e^{-2\ln(8p_0/p_1)} = \left(1 + 2\ln\frac{8p_0}{p_1}\right)\left(\frac{p_1}{8p_0}\right)^{2}$$
A fixed size α gives conditions on p0 and p1, and a certain choice will give a minimized β.
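A sketch of this Bayes test; the threshold and size formulas are the ones derived above, and the equal prior probabilities are an assumed illustration:

```python
# Bayes test for Exp(1) vs Exp(2) with a = 2, b = 1: reject H0 when
# x1 + x2 >= t = 2*ln(8*p0/p1); size alpha = (1 + t)*exp(-t) since X1 + X2 ~ Gamma(2, 1).
import math

def threshold(p0, p1):
    return 2 * math.log(8 * p0 / p1)

def size(t):
    return (1 + t) * math.exp(-t)

t = threshold(0.5, 0.5)      # equal prior probabilities (assumed)
print(t, size(t))            # t ~ 4.16, alpha ~ 0.08
```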
Sequential probability ratio test (SPRT)
Suppose that we consider the sampling to be the observation of values in a “stream” x1, x2, … , i.e. we do not consider a sample with fixed size.
We would like to test H0: θ = θ0 vs. H1: θ = θ1
After n observations have been taken we have xn = (x1, … , xn ) , and we put
$$LR(n) = \frac{L(\theta_1;\mathbf{x}_n)}{L(\theta_0;\mathbf{x}_n)}$$

as the current test statistic.
The frequentist approach:
Specify two numbers K1 and K2, not depending on n, such that 0 < K1 < K2 < ∞.
Then
If LR(n) ≤ K1: stop sampling, accept H0
If LR(n) ≥ K2: stop sampling, reject H0
If K1 < LR(n) < K2: take another observation
Usual choice of K1 and K2 (Property 6.3):
If the size α and the power 1 − β are pre-specified, put

$$K_1 = \frac{\beta}{1-\alpha}, \qquad K_2 = \frac{1-\beta}{\alpha}$$

This gives approximate true size α and approximate true power 1 − β.
The Bayesian approach:
The structure is the same, but the choices of K1 and K2 are different.
Let c be the cost of taking one observation, and let, as before, a and b be the loss values for taking the wrong decisions, and p0 and p1 the prior probabilities of H0 and H1 respectively.
Then the Bayesian choices of K1 and K2 are

$$(K_1,K_2) = \arg\min_{(k_1,k_2)} B(k_1,k_2)$$

where

$$B(k_1,k_2) = p_0\left[a\,\frac{1-k_1}{k_2-k_1} + \frac{c}{\mu_0}\cdot\frac{(1-k_1)\ln k_2 + (k_2-1)\ln k_1}{k_2-k_1}\right] + p_1\left[b\,\frac{k_1(k_2-1)}{k_2-k_1} + \frac{c}{\mu_1}\cdot\frac{k_2(1-k_1)\ln k_2 + k_1(k_2-1)\ln k_1}{k_2-k_1}\right]$$

with

$$\mu_0 = E\!\left[\ln\frac{f(X;\theta_1)}{f(X;\theta_0)}\,\middle|\,H_0\right] \quad\text{and}\quad \mu_1 = E\!\left[\ln\frac{f(X;\theta_1)}{f(X;\theta_0)}\,\middle|\,H_1\right]$$
Bayesian inference
“Very much lies in the posterior distribution”
Bayesian definition of sufficiency:
A statistic T(x1, …, xn) is sufficient for θ if the posterior distribution of θ given the sample observations x1, …, xn is the same as the posterior distribution of θ given T, i.e.

$$q(\theta \mid \mathbf{x}) = q(\theta \mid T(\mathbf{x}))$$

The Bayesian definition is equivalent to the original definition (Theorem 7.1).
Credible regions
Let Θ be the parameter space for θ and let Sx be a subset of Θ. If

$$\int_{S_{\mathbf{x}}} q(\theta \mid \mathbf{x})\,d\theta \ge 1-\alpha$$

then Sx is a 1 − α credible region for θ.
For a scalar θ we refer to it as a credible interval.
The important difference compared to confidence regions:
A credible region is a fixed region (conditional on the sample) to which the random variable θ belongs with probability 1 − α.
A confidence region is a random region that covers the fixed θ with probability 1 − α.
Highest posterior density (HPD) regions
If q(θ1 | x) ≥ q(θ2 | x) for all θ1 ∈ Sx and all θ2 ∉ Sx, then Sx is called a 1 − α highest posterior density (HPD) credible region.
Equal-tailed credible intervals
An equal-tailed 1 − α credible interval is a 1 − α credible interval (a, b) such that

$$\Pr(\theta < a \mid \mathbf{x}) = \Pr(\theta > b \mid \mathbf{x}) = \alpha/2$$
Example:
In a consignment of 5000 pills suspected to contain the drug Ecstasy, a sample of 10 pills is taken for chemical analysis. Let θ be the unknown proportion of Ecstasy pills in the consignment and assume we have a high prior belief that this proportion is 100%.
Such a high prior belief can be modelled with a Beta density

$$p(\theta) = \frac{1}{B(\alpha,\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}$$

where β is set to 1 and α is set to a fairly high value, say 20, i.e. Beta(20, 1).
[Figure: the Beta(20, 1) prior density on (0, 1)]
Now, suppose after chemical analysis of the 10 pills, all of them showed to contain Ecstasy.
The posterior density for θ is Beta(α + x, β + n − x) (conjugate prior with binomial sample distribution, as the population is large):
Beta (20 + 10, 1 + 10 – 10) = Beta (30, 1)
Then a lower-tail 99% credible interval (a, 1) for θ satisfies

$$\int_a^1 \frac{1}{B(30,1)}\,x^{29}\,dx = 0.99 \iff 1 - a^{30} = 0.99 \iff a = 0.01^{1/30} \approx 0.858$$

Thus with 99% certainty we can state that at least 85.8% of the pills in the consignment consist of Ecstasy.
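The same bound can be obtained directly from the posterior’s quantile function; a sketch:

```python
# Lower 99% credible bound for theta | x ~ Beta(30, 1): a is the 1% posterior quantile,
# equivalently a = 0.01**(1/30) from the closed-form CDF a**30.
from scipy import stats

print(stats.beta(30, 1).ppf(0.01))   # ~0.858
print(0.01 ** (1 / 30))              # same value, from the closed form
```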
Comments:
We have used the binomial distribution for the sample. More correct would be to use the hypergeometric distribution, but the binomial is a good approximation.
For a smaller consignment (i.e. population) we can benefit from using the result that the posterior for the number of pills containing Ecstasy in the rest of the consignment, after removing the sample, is beta-binomial.
This would, however, give similar results.
If a sample consists of 100% of one kind, how would a confidence interval for θ be obtained?
Bayesian hypothesis testing
The issue is to test H0 vs. H1
Without specifying the hypotheses further, we seek to judge which of the two hypotheses, conditional on the sample, is the more probable.
Note the difference compared to the classical approach: There we seek to reject H0
in favour of H1 and never the opposite.
The aim of Bayesian hypothesis testing is to determine the posterior “odds”

$$Q^* = \frac{\Pr(H_0 \mid \mathbf{x})}{\Pr(H_1 \mid \mathbf{x})}$$

With Q* > 1 we then say that, conditional on the sample, H0 is Q* times more probable than H1; with Q* < 1 the expression is reversed.
Some review of probability theory:
For two events A and B from a random experiment we have

$$\frac{\Pr(A \mid B)}{\Pr(\bar A \mid B)} = \frac{\Pr(B \mid A)\Pr(A)/\Pr(B)}{\Pr(B \mid \bar A)\Pr(\bar A)/\Pr(B)} = \frac{\Pr(B \mid A)}{\Pr(B \mid \bar A)}\cdot\frac{\Pr(A)}{\Pr(\bar A)}$$

i.e.

$$\text{Odds}(A \mid B) = \frac{\Pr(B \mid A)}{\Pr(B \mid \bar A)}\cdot\text{Odds}(A)$$

This result is usually referred to as Bayes’ theorem on odds form. Actually, Ā can be replaced by any other event C, but when Ā is used the relation can be written on the odds form above.
Now, it is possible to replace A with H0, Ā with H1 and B with x (the sample):

$$Q^* = \frac{\Pr(H_0 \mid \mathbf{x})}{\Pr(H_1 \mid \mathbf{x})} = \frac{f(\mathbf{x} \mid H_0)}{f(\mathbf{x} \mid H_1)}\cdot\frac{\Pr(H_0)}{\Pr(H_1)} = \frac{f(\mathbf{x} \mid H_0)}{f(\mathbf{x} \mid H_1)}\cdot Q$$

where Q is the prior “odds”.
Note that

$$\frac{f(\mathbf{x} \mid H_0)}{f(\mathbf{x} \mid H_1)} \text{ can be written } \frac{L(H_0;\mathbf{x})}{L(H_1;\mathbf{x})}$$

where L(H; x) is the likelihood of H. The concept of likelihood is not restricted to parameters.
Note also that f(x | H0) need not have the same functional form as f(x | H1).

$$B = \frac{f(\mathbf{x} \mid H_0)}{f(\mathbf{x} \mid H_1)}$$

will be referred to as the Bayes factor, but it is sometimes called the likelihood ratio.
To make the comparison with classical hypothesis testing more transparent the posterior odds may be transformed to posterior probabilities for each of the two hypotheses:
$$\Pr(H_0 \mid \mathbf{x}) = \frac{Q^*}{1+Q^*}, \qquad \Pr(H_1 \mid \mathbf{x}) = \frac{1}{1+Q^*}$$

Now, if the posterior probability of H1 is 0.95, this would be a result that could be compared with “H0 is rejected at 5% level of significance”.
However, the two approaches cannot be made equal.
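A sketch of this transformation; the Bayes factor value is an assumed example:

```python
# From Bayes factor B and prior odds Q to posterior odds Q* and probabilities.
def posterior_probs(bayes_factor, prior_odds=1.0):
    q_star = bayes_factor * prior_odds               # Q* = B * Q
    return q_star / (1 + q_star), 1 / (1 + q_star)   # Pr(H0 | x), Pr(H1 | x)

print(posterior_probs(19.0))   # Pr(H1 | x) = 0.05: comparable to, but not the same as,
                               # "H0 rejected at the 5% level"
```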
Another example from forensic science
Assume a crime has been conducted where a blood stain was left at the crime scene. A suspect is identified and a saliva sample is taken from this person. The DNA profiles are compared between the saliva sample and the blood stain and they appear to match (i.e. they are equal).
Put H0: “The blood stain comes from the suspect”
H1: “The blood stain comes from another person than the suspect”
The Bayes factor becomes

$$B = \frac{\Pr(\text{Matching DNA profiles} \mid H_0)}{\Pr(\text{Matching DNA profiles} \mid H_1)}$$

Now, if laboratory mistakes can be discarded,
Pr(Matching DNA profiles | H0) = 1
How about the probability in the denominator of B?
This probability relates to the commonness of the current profile among people in general.
Today’s DNA analysis is such that if a full profile is obtained (i.e. if there are no missing DNA markers), the probability is very low, about 1 in 10 million.
⇒ B becomes very large
If the suspect was caught on grounds not related to the DNA analysis (as it should be), the prior odds Q = Pr(H0) / Pr(H1) is probably greater than 1.
If we calculate with Q = 1, then Q* = B and is thus very large.
The DNA analysis very strongly supports H0 to be true.
Hypotheses expressed in terms of a parameter
For sake of simplicity we express the parameter as a scalar, but the results also apply to multidimensional parameters
Case 1: H0: θ = θ0  H1: θ = θ1
Example: Let x be an observation from Bin(n, θ) and test H0: θ = θ0 vs. H1: θ = θ1.
With Pr(H0) = p0 and Pr(H1) = p1,

$$B = \frac{f(\mathbf{x};\theta_0)}{f(\mathbf{x};\theta_1)}$$

For the binomial example:

$$B = \frac{\binom{n}{x}\theta_0^{\,x}(1-\theta_0)^{n-x}}{\binom{n}{x}\theta_1^{\,x}(1-\theta_1)^{n-x}} = \frac{\theta_0^{\,x}(1-\theta_0)^{n-x}}{\theta_1^{\,x}(1-\theta_1)^{n-x}}$$

$$Q^* = \frac{p_0}{p_1}\cdot\frac{\theta_0^{\,x}(1-\theta_0)^{n-x}}{\theta_1^{\,x}(1-\theta_1)^{n-x}} = \frac{p_0}{p_1}\,B$$
Case 2: H0: θ ∈ Θ0  H1: θ ∈ Θ − Θ0 (= Θ \ Θ0)
Note that the textbook uses the somewhat redundant notation p0(θ | H0) and p1(θ | H1) for the prior(s). Then

$$\Pr(H_0) = \int_{\Theta_0} p(\theta)\,d\theta, \qquad \Pr(H_1) = \int_{\Theta-\Theta_0} p(\theta)\,d\theta$$

$$B = \frac{\int_{\Theta_0} f(\mathbf{x};\theta)\,p_0(\theta \mid H_0)\,d\theta}{\int_{\Theta-\Theta_0} f(\mathbf{x};\theta)\,p_1(\theta \mid H_1)\,d\theta}$$
Case 3: H0: θ = θ0  H1: θ ≠ θ0
It can be shown (Theorem 7.3) that in this case

$$B = \frac{f(\mathbf{x};\theta_0)}{\int_{\theta\neq\theta_0} f(\mathbf{x};\theta)\,p_1(\theta \mid H_1)\,d\theta} = \lim_{\theta\to\theta_0}\frac{q(\theta \mid \mathbf{x})}{p(\theta)}$$

(As θ ≠ θ0 defines the region for H1, the conditioning on H1 in the textbook seems redundant.)
Nuisance parameters and predictive distributions
If the parameters involved are two, θ and λ, and the parameter of interest is θ, then λ is referred to as a nuisance parameter.
The marginal posterior density for θ is then obtained by integrating out λ from the joint posterior density:

$$q(\theta \mid \mathbf{x}) = \int_{\Lambda} q(\theta,\lambda \mid \mathbf{x})\,d\lambda$$

where Λ is the parameter space for λ.
Predictive distributions
Let xn = (x1, …, xn) be a random sample from a distribution with p.d.f. f(x; θ).
Suppose we will take a new observation xn + 1 and would like to make so-called predictive inference about it. In practice this means that we would like to express the uncertainty about it in terms of a prediction interval.
The marginal p.d.f. of Xn+1 is the same as that of each variable Xi in the sample, i.e. f(x; θ).
However, if we want to make use of the sample, we should rather study the simultaneous density of Xn+1 and θ | xn:
Xn+1 and Xn are independent by definition
Xn+1 and θ | xn are also independent, as the latter is conditional on xn
The simultaneous density of Xn+1 and θ | xn is

$$f(x_{n+1};\theta)\cdot q(\theta \mid \mathbf{x}_n)$$

Now, treating (temporarily) θ as a nuisance parameter, the posterior predictive distribution (density) for Xn+1 given xn is

$$g(x_{n+1} \mid \mathbf{x}_n) = \int_{\Theta} f(x_{n+1};\theta)\,q(\theta \mid \mathbf{x}_n)\,d\theta$$

g can be used to
• find a point prediction of Xn+1 as
– the mean of g, with quadratic loss, i.e. ∫ t·g(t | xn) dt over the real line
– the median of g, with absolute error loss
– the mode of g, with zero-one loss
• compute a 1 − α prediction interval for Xn+1 by solving for c and d

$$\int_c^d g(t \mid \mathbf{x}_n)\,dt = 1-\alpha \qquad (\text{usually with equal tails})$$
Example
The number of calls to a telephone central during an hour can usually be shown to follow a Poisson distribution with mean λ. Assume that a prior for λ is a Gamma(a, b) distribution, i.e. the prior density is

$$p(\lambda) = \frac{b^{a}}{\Gamma(a)}\,\lambda^{a-1}e^{-b\lambda}, \qquad \lambda > 0$$

Now assume that we have observed x1, x2 and x3 calls during each of the three previous hours and we wish to make predictive inference about the number of calls x4 during the current hour.
The posterior distribution is also Gamma:

$$q(\lambda \mid x_1,x_2,x_3) \propto \lambda^{x_1+x_2+x_3}e^{-3\lambda}\cdot\lambda^{a-1}e^{-b\lambda} = \lambda^{a+t-1}e^{-(b+3)\lambda}, \qquad t = x_1+x_2+x_3$$

i.e. λ | x1, x2, x3 ~ Gamma(a + t, b + 3).
Now,

$$g(x_4 \mid x_1,x_2,x_3) = \int_0^{\infty}\frac{\lambda^{x_4}e^{-\lambda}}{x_4!}\cdot\frac{(b+3)^{a+t}}{\Gamma(a+t)}\,\lambda^{a+t-1}e^{-(b+3)\lambda}\,d\lambda$$

$$= \frac{(b+3)^{a+t}}{x_4!\,\Gamma(a+t)}\int_0^{\infty}\lambda^{x_4+a+t-1}e^{-(b+4)\lambda}\,d\lambda = \frac{(b+3)^{a+t}}{x_4!\,\Gamma(a+t)}\cdot\frac{\Gamma(a+t+x_4)}{(b+4)^{a+t+x_4}}$$

$$= \frac{\Gamma(a+t+x_4)}{x_4!\,\Gamma(a+t)}\left(\frac{b+3}{b+4}\right)^{a+t}\left(\frac{1}{b+4}\right)^{x_4}$$

(the remaining integral is solved by recognizing the kernel of a Gamma density), and e.g. a point prediction under quadratic loss of the number of calls is obtained by the mean of this (negative binomial) distribution:

$$\sum_{x_4=0}^{\infty} x_4\,g(x_4 \mid x_1,x_2,x_3) = \frac{a+t}{b+3}$$
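Since the predictive is a negative binomial with r = a + t and success probability p = (b+3)/(b+4), it can be handled with standard routines; a sketch with assumed values of a, b and the data:

```python
# Poisson-Gamma predictive for x4: negative binomial with r = a + t, p = (b+3)/(b+4).
from scipy import stats

a, b = 2.0, 1.0                    # assumed prior parameters
t = 5 + 3 + 6                      # t = x1 + x2 + x3 (assumed counts)
pred = stats.nbinom(a + t, (b + 3) / (b + 4))
print(pred.mean())                 # point prediction under quadratic loss: (a+t)/(b+3)
print(pred.ppf(0.025), pred.ppf(0.975))   # an equal-tailed 95% prediction interval
```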
Empirical Bayes
“Something between the Bayesian and the frequentist approach”
General idea:
Use some of the available data (sample) to estimate the prior for θ.
Use the rest of the available data to make inference about θ.
Parametric set-up:
Let data-point i be represented by the bivariate random variable (Xi, θi), i = 1, 2, …, k
Let f(x; θi) be the p.d.f. for Xi and p(θ) be the marginal density of θi
f is assumed to be known apart from θ, and p is assumed to be unknown.
θ1, …, θk are unobservable quantities.
With Empirical Bayes we try to make inference about θk:
Use x1, …, xk−1 to estimate p(θ)
Then make inference about θk through its posterior distribution conditional on xk
the usual way:
• point estimation under certain choices of loss functions
• credible regions
• hypothesis testing
$$q(\theta_k \mid x_k) = \frac{f(x_k;\theta_k)\,\hat p(\theta_k)}{\int f(x_k;\theta)\,\hat p(\theta)\,d\theta}$$
How to estimate p(θ)?
The marginal density for Xi is

$$f(x;\boldsymbol\psi) = \int f(x;\theta)\,p(\theta;\boldsymbol\psi)\,d\theta$$

where p = p(θ; ψ) and ψ is a (multidimensional) parameter defining the prior, assuming the functional form of p is known.
Method of moments:
Put up the equations

$$E(X) = \int x\,f(x;\boldsymbol\psi)\,dx = E\big[E(X \mid \theta)\big]$$

$$\operatorname{Var}(X) = E\big[\operatorname{Var}(X \mid \theta)\big] + \operatorname{Var}\big[E(X \mid \theta)\big]$$

from which an estimate of ψ can be deduced.
Maximum-Likelihood:
ψ is estimated as

$$\hat{\boldsymbol\psi} = \arg\max_{\boldsymbol\psi} L(\boldsymbol\psi;x_1,\dots,x_{k-1}) = \arg\max_{\boldsymbol\psi}\prod_{i=1}^{k-1}\int f(x_i;\theta)\,p(\theta;\boldsymbol\psi)\,d\theta$$

Apparently, the textbook is not correct here, as the estimation of p should be based on the first k − 1 observations and the kth should be excluded from that stage.
See further the textbook for simplifications of the MLE procedure.
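A parametric empirical-Bayes sketch for a Poisson sample distribution with a Gamma(a, b) prior, whose marginal is negative binomial; all data values and starting points are assumed for illustration:

```python
# Empirical Bayes, Poisson/Gamma pair: fit psi = (a, b) by ML on x1..x_{k-1}
# (marginal: nbinom(a, b/(b+1))), then use the posterior Gamma(a + x_k, b + 1) for theta_k.
import numpy as np
from scipy import stats, optimize

x = np.array([3, 5, 1, 4, 2, 6, 3, 0, 2, 4])   # assumed data; last value is x_k
head, xk = x[:-1], x[-1]

def neg_marginal_loglik(psi):
    a, b = psi
    return -stats.nbinom.logpmf(head, a, b / (b + 1)).sum()

a_hat, b_hat = optimize.minimize(neg_marginal_loglik, x0=[1.0, 1.0],
                                 bounds=[(1e-6, None), (1e-6, None)]).x
posterior_k = stats.gamma(a_hat + xk, scale=1 / (b_hat + 1))
print(posterior_k.mean())    # point estimate of theta_k under quadratic loss
```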