statistical decision theory bayes’ theorem: for discrete events for probability density functions

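Statistical Decision Theory

Bayes’ theorem:

For discrete events

$$\Pr(B_i \mid A) = \frac{\Pr(A \mid B_i)\,\Pr(B_i)}{\Pr(A)} = \frac{\Pr(A \mid B_i)\,\Pr(B_i)}{\sum_{j=1}^{M} \Pr(A \mid B_j)\,\Pr(B_j)}$$

where B_1, … , B_M are mutually exclusive events with B_1 ∪ … ∪ B_M = S (the sample space).

For probability density functions

$$f_{Y \mid X}(y \mid x) = \frac{f_{X \mid Y}(x \mid y)\, f_Y(y)}{f_X(x)} = \frac{f_{X \mid Y}(x \mid y)\, f_Y(y)}{\int f_{X \mid Y}(x \mid z)\, f_Y(z)\, dz}$$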
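As a small numerical illustration of the discrete form of Bayes’ theorem, the sketch below computes Pr(B_i | A) from assumed values of Pr(B_i) and Pr(A | B_i). The numbers and the function name `posterior_probs` are invented for the illustration and do not come from the slides.

```python
def posterior_probs(prior, likelihood):
    """Return Pr(B_i | A) for mutually exclusive, exhaustive events B_1..B_M.

    prior[i]      = Pr(B_i)      (assumed to sum to 1)
    likelihood[i] = Pr(A | B_i)
    """
    joint = [p * l for p, l in zip(prior, likelihood)]
    pr_a = sum(joint)  # law of total probability gives Pr(A)
    return [j / pr_a for j in joint]

# Hypothetical numbers: three events with assumed probabilities.
prior = [0.5, 0.3, 0.2]
likelihood = [0.1, 0.4, 0.8]
post = posterior_probs(prior, likelihood)
print([round(p, 4) for p in post])
```

The denominator is the same for every B_i, so the posterior probabilities are the products Pr(A | B_i) Pr(B_i) rescaled to sum to one.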

The Bayesian “philosophy”

The classical approach (frequentist’s view):

The random sample X = (X1, … , Xn ) is assumed to come from a distribution with a probability density function f (x; θ), where θ is an unknown but fixed parameter.

The sample is investigated through its properties as a random variable governed by f (x; θ). The uncertainty about θ is assessed solely on the basis of the sample properties.

The Bayesian approach:

The random sample X = (X1, … , Xn ) is assumed to come from a distribution with a probability density function f (x; θ), where the uncertainty about θ is modelled with a probability distribution (i.e. a p.d.f.), called the prior distribution.

The obtained values of the sample, i.e. x = (x1, … , xn ), are used to update the information from the prior distribution to a posterior distribution for θ.

Main differences:

In the classical approach, θ is fixed, while in the Bayesian approach θ is a random variable.

In the classical approach the focus is on the sampling distribution of X, while in the Bayesian approach the focus is on the variation of θ.

Bayesian: “What we observe is fixed, what we do not observe is random.”

Frequentist: “What we observe is random, what we do not observe is fixed.”

Concepts of the Bayesian framework

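Prior density: p(θ)

Likelihood: L(θ | x ) “as before”

Posterior density: q(θ | x )

Relation through Bayes’ theorem:

$$q(\theta \mid \mathbf{x}) = \frac{f_{\mathbf{X}}(\mathbf{x}; \theta)\, p(\theta)}{\int f_{\mathbf{X}}(\mathbf{x}; \lambda)\, p(\lambda)\, d\lambda} = \frac{L(\theta \mid \mathbf{x})\, p(\theta)}{\int L(\lambda \mid \mathbf{x})\, p(\lambda)\, d\lambda}$$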
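The update from prior to posterior can be sketched numerically when θ takes only a few values, so that the integral in the denominator becomes a sum. The exponential sampling model, the candidate values, the prior weights and the data below are all assumptions chosen for illustration.

```python
import math

def exp_likelihood(theta, xs):
    """L(theta | x) for an i.i.d. sample from an exponential distribution with mean theta."""
    return math.prod(math.exp(-x / theta) / theta for x in xs)

def discrete_posterior(thetas, prior, xs):
    """q(theta | x) = L(theta | x) p(theta) / sum over lambda of L(lambda | x) p(lambda)."""
    weights = [exp_likelihood(t, xs) * p for t, p in zip(thetas, prior)]
    total = sum(weights)
    return [w / total for w in weights]

thetas = [6.0, 12.0, 24.0]   # assumed candidate parameter values
prior = [0.2, 0.3, 0.5]      # assumed prior probabilities
xs = [3.0, 8.0, 5.0]         # hypothetical observations

post = discrete_posterior(thetas, prior, xs)
for t, p, q in zip(thetas, prior, post):
    print(f"theta={t:4.0f}  prior={p:.2f}  posterior={q:.3f}")
```

With these small observations the posterior shifts weight towards the smallest mean, which is the kind of updating the relation above describes.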

Decision-theoretic elements

1. One of a number of actions should be decided on.

2. State of nature: A number of states are possible. Usually represented by θ.

3. For each state of nature the relative desirability of each of the different possible actions can be quantified.

4. Prior information for the different states of nature may be available: Prior distribution of θ.

5. Data may be available. Usually represented by x. Can be used to update the knowledge about the relative desirability of (each of) the different actions.

In mathematical notation for this course:

True state of nature: θ. Uncertainty described by the prior p (θ)

Data: x, an observation of X, whose p.d.f. depends on θ (data is thus assumed to be available)

Decision procedure: δ

Action: δ(x). The decision procedure becomes an action when applied to given data x

Loss function: L_S (θ, δ(x) ) measures the loss from taking action δ(x) when θ holds

Risk function:

$$R(\theta, \delta) = \int L_S\big(\theta, \delta(\mathbf{x})\big)\, L(\theta \mid \mathbf{x})\, d\mathbf{x} = E_{\mathbf{X}}\big[ L_S\big(\theta, \delta(\mathbf{X})\big) \big]$$

Note that the risk function is the expected loss with respect to the joint distribution of X1, … , Xn

Note also that the risk function is for the decision procedure, and not for the particular action

Minimax procedure:

A procedure δ* is a minimax procedure if

$$\max_{\theta} R(\theta, \delta^{*}) = \min_{\delta} \max_{\theta} R(\theta, \delta)$$

i.e. θ is chosen to be the “worst” possible value, and under that value the procedure that gives the lowest possible risk is chosen

The minimax procedure uses no prior information about θ, thus it is not a Bayesian procedure.

Example

Suppose you are about to make a decision on whether you should buy or rent a new TV.

δ1 = “Buy the TV”, δ2 = “Rent the TV”

Now, assume θ is the mean time until the TV breaks down for the first time

Let θ assume three possible values: 6, 12 and 24 months

The cost of the TV is $500 if you buy it and $30 per month if you rent it

If the TV breaks down after 12 months and you bought it, you will have to replace it at the same price you paid for it. If you rented it, you get a new TV at no cost provided you continue your contract.

Let X be the time in months until the TV breaks down and assume this variable is exponentially distributed with mean θ

A loss function for an ownership of at most 24 months may be defined as

L_S (θ, δ1(X ) ) = 500 + 500 · H (X – 12) and

L_S (θ, δ2(X ) ) = 30 · 24 = 720

where H is the unit step function, H(u) = 1 for u > 0 and 0 otherwise.

Then

$$R(\theta, \delta_1) = E_X\big[500 + 500\, H(X - 12)\big] = 500 + 500 \int_{12}^{\infty} \theta^{-1} e^{-x/\theta}\, dx = 500\left(1 + e^{-12/\theta}\right)$$

$$R(\theta, \delta_2) = 720$$

Now compare the risks for the three possible values of θ

θ  | R(θ, δ1) | R(θ, δ2)
6  | 568      | 720
12 | 684      | 720
24 | 803      | 720

Clearly the risk for the first procedure increases with θ while the risk for the second is constant. In searching for the minimax procedure we therefore focus on the largest possible value of θ, where δ2 has the smallest risk

δ2 is the minimax procedure
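The risk table and the minimax choice can be checked with a few lines of code. This is a minimal sketch that hard-codes the closed-form risks 500(1 + e^(-12/θ)) and 720 from the example; the function and variable names are made up for the illustration.

```python
import math

def risk_buy(theta):
    """R(theta, delta_1) = E[500 + 500*H(X - 12)] = 500*(1 + exp(-12/theta)) for X ~ Exp(mean theta)."""
    return 500.0 * (1.0 + math.exp(-12.0 / theta))

def risk_rent(theta):
    """R(theta, delta_2) = 30 * 24, which does not depend on theta."""
    return 30.0 * 24.0

thetas = [6, 12, 24]
risks = {
    "buy": [risk_buy(t) for t in thetas],
    "rent": [risk_rent(t) for t in thetas],
}

# Minimax: pick the procedure whose worst-case (maximum over theta) risk is smallest.
worst = {name: max(values) for name, values in risks.items()}
minimax = min(worst, key=worst.get)

for t, rb, rr in zip(thetas, risks["buy"], risks["rent"]):
    print(f"theta={t:2d}  buy={rb:.0f}  rent={rr:.0f}")
print("minimax procedure:", minimax)
```

Only the maximum over θ matters for the minimax choice, so the comparison reduces to the worst-case risk of each procedure.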

Bayes procedure

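Bayes risk:

$$B(\delta) = \int R(\theta, \delta)\, p(\theta)\, d\theta$$

Uses the prior distribution of the unknown parameter

A Bayes procedure is a procedure that minimizes the Bayes risk

$$\delta_B = \arg\min_{\delta} \int R(\theta, \delta)\, p(\theta)\, d\theta$$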

Example cont.

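Assume the three possible values of θ (6, 12 and 24) have the prior probabilities 0.2, 0.3 and 0.5.

Then

$$B(\delta_1) = 500\left[\left(1 + e^{-12/6}\right) \cdot 0.2 + \left(1 + e^{-12/12}\right) \cdot 0.3 + \left(1 + e^{-12/24}\right) \cdot 0.5\right] \approx 720.35$$

$$B(\delta_2) = 720 \quad \text{(does not depend on } \theta\text{)}$$

Thus the Bayes risk is minimized, by a narrow margin, by δ2 and therefore δ2 is the Bayes procedure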
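For a discrete prior the Bayes risk is a weighted sum of the risks, which is easy to evaluate directly. The sketch below assumes the risk functions R(θ, δ1) = 500(1 + e^(-12/θ)) and R(θ, δ2) = 720 of the TV example and the prior probabilities 0.2, 0.3 and 0.5; the helper name `bayes_risk` is hypothetical.

```python
import math

thetas = [6, 12, 24]
prior = [0.2, 0.3, 0.5]

def risk_buy(theta):
    """R(theta, delta_1) for the TV example."""
    return 500.0 * (1.0 + math.exp(-12.0 / theta))

def risk_rent(theta):
    """R(theta, delta_2) for the TV example (constant in theta)."""
    return 720.0

def bayes_risk(risk, thetas, prior):
    """B(delta) = sum over theta of R(theta, delta) * p(theta) for a discrete prior."""
    return sum(risk(t) * p for t, p in zip(thetas, prior))

b_buy = bayes_risk(risk_buy, thetas, prior)
b_rent = bayes_risk(risk_rent, thetas, prior)
print(f"B(delta_1) = {b_buy:.2f}")
print(f"B(delta_2) = {b_rent:.2f}")
```

Because the two values are close, the comparison is sensitive to the prior: shifting a little probability from θ = 24 to a smaller θ is enough to change which procedure has the lower Bayes risk.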

Decision theory applied on point estimation

The action is a particular point estimator θ̂

State of nature is the true value of θ

The loss function is a measure of how good (desirable) the estimator is of θ:

$$L_S = L_S(\theta, \hat{\theta})$$

Prior information is quantified by the prior distribution (p.d.f.) p(θ)

Data is the random sample x from a distribution with p.d.f. f (x ; θ)

Three simple loss functions

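Zero-one loss:

$$L_S(\theta, \hat{\theta}) = \begin{cases} 0, & |\hat{\theta} - \theta| < b \\ a, & |\hat{\theta} - \theta| \ge b \end{cases} \qquad a, b > 0$$

Absolute error loss:

$$L_S(\theta, \hat{\theta}) = a\,|\hat{\theta} - \theta|, \qquad a > 0$$

Quadratic (error) loss:

$$L_S(\theta, \hat{\theta}) = a\,(\hat{\theta} - \theta)^2, \qquad a > 0$$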

Minimax estimators:

Find the value of θ that maximizes the expected loss with respect to the sample values, i.e. that maximizes

$$E_{\mathbf{X}}\big[ L_S\big(\theta, \hat{\theta}(\mathbf{X})\big) \big]$$

over the set of estimators θ̂(X)

Then, the particular estimator that minimizes the risk for that value of θ is the minimax estimator

Not so easy to find!

Bayes estimators

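A Bayes estimator is the estimator that minimizes

$$\begin{aligned}
\int R(\theta, \hat{\theta})\, p(\theta)\, d\theta
&= \int \left( \int L_S(\theta, \hat{\theta})\, L(\theta \mid \mathbf{x})\, d\mathbf{x} \right) p(\theta)\, d\theta \\
&= \iint L_S(\theta, \hat{\theta})\, L(\theta \mid \mathbf{x})\, p(\theta)\, d\theta\, d\mathbf{x} \\
&= \iint L_S(\theta, \hat{\theta})\, q(\theta \mid \mathbf{x})\, f_{\mathbf{X}}(\mathbf{x})\, d\theta\, d\mathbf{x} \\
&= \int f_{\mathbf{X}}(\mathbf{x}) \left( \int L_S(\theta, \hat{\theta})\, q(\theta \mid \mathbf{x})\, d\theta \right) d\mathbf{x}
\end{aligned}$$

For any given value of x what has to be minimized is

$$\int L_S(\theta, \hat{\theta})\, q(\theta \mid \mathbf{x})\, d\theta$$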

The Bayes philosophy is that the data x should be considered given, so the minimization is carried out for the observed x rather than averaged over possible samples.

Now minimization with respect to different loss functions will result in measures of location in the posterior distribution of θ.

Zero-one loss: θ̂ is the posterior mode for θ given x

Absolute error loss: θ̂ is the posterior median for θ given x

Quadratic loss: θ̂ is the posterior mean for θ given x
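The link between loss functions and posterior location measures can be checked numerically by brute force. The sketch below uses an assumed discrete posterior on a grid (proportional to θ²(1 − θ)⁶, chosen only because it is skewed) and searches the grid for the estimate with the smallest posterior expected loss; all names and numbers are illustrative.

```python
def posterior_expected_loss(estimate, thetas, post, loss):
    """Sum over theta of loss(theta, estimate) * q(theta | x) for a discrete posterior."""
    return sum(loss(t, estimate) * q for t, q in zip(thetas, post))

def bayes_estimate(thetas, post, loss):
    """Grid value with the smallest posterior expected loss."""
    return min(thetas, key=lambda est: posterior_expected_loss(est, thetas, post, loss))

# Hypothetical skewed posterior on 101 grid points in [0, 1].
thetas = [i / 100 for i in range(101)]
weights = [t**2 * (1 - t)**6 for t in thetas]
total = sum(weights)
post = [w / total for w in weights]

def zero_one(theta, est):
    return 0.0 if abs(est - theta) < 0.005 else 1.0   # a = 1, b = 0.005 (half the grid step)

def absolute(theta, est):
    return abs(est - theta)

def quadratic(theta, est):
    return (est - theta) ** 2

post_mean = sum(t * q for t, q in zip(thetas, post))
post_mode = thetas[max(range(len(post)), key=post.__getitem__)]

est_zero_one = bayes_estimate(thetas, post, zero_one)
est_absolute = bayes_estimate(thetas, post, absolute)
est_quadratic = bayes_estimate(thetas, post, quadratic)

print("zero-one  loss:", est_zero_one, "| posterior mode:", post_mode)
print("absolute  loss:", est_absolute, "(grid approximation of the posterior median)")
print("quadratic loss:", est_quadratic, "| posterior mean:", round(post_mean, 3))
```

For the zero-one loss the correspondence with the mode relies on a small tolerance b; here b is set below the grid spacing so that only an exact hit counts as zero loss.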

About prior distributions

Conjugate prior distributions

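Example: Assume the parameter of interest is θ, the proportion of some property of interest in the population (i.e. the probability for this property to occur)

A reasonable prior density for θ is the Beta density:

$$p(\theta; \alpha, \beta) = \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}, \qquad 0 \le \theta \le 1$$

where α > 0 and β > 0 are two (constant) parameters and

$$B(\alpha, \beta) = \int_0^1 x^{\alpha - 1} (1 - x)^{\beta - 1}\, dx$$

is the so-called Beta function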

[Figure: Beta densities on 0 ≤ θ ≤ 1 for Beta(1,1), Beta(5,5), Beta(1,5), Beta(5,1), Beta(2,5), Beta(5,2), Beta(0.5,0.5), Beta(0.3,0.7) and Beta(0.7,0.3)]

Now, assume a sample of size n from the population in which y of the values possess the property of interest.

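The likelihood becomes

$$L(\theta; y) = \binom{n}{y}\, \theta^{y} (1 - \theta)^{n - y}$$

so that

$$\begin{aligned}
q(\theta \mid y) &= \frac{L(\theta; y)\, p(\theta)}{\int_0^1 L(x; y)\, p(x)\, dx} \\
&= \frac{\binom{n}{y}\, \theta^{y} (1 - \theta)^{n - y} \cdot \dfrac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}}{\int_0^1 \binom{n}{y}\, x^{y} (1 - x)^{n - y} \cdot \dfrac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}\, dx} \\
&= \frac{\theta^{y + \alpha - 1} (1 - \theta)^{n - y + \beta - 1}}{\int_0^1 x^{y + \alpha - 1} (1 - x)^{n - y + \beta - 1}\, dx} \\
&= \frac{\theta^{y + \alpha - 1} (1 - \theta)^{n - y + \beta - 1}}{B(y + \alpha,\, n - y + \beta)}
\end{aligned}$$

Thus, the posterior density is also a Beta density with parameters y + α and n – y + β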
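Since the posterior is again a Beta density, updating amounts to adding counts to the two parameters. A minimal sketch, with an assumed Beta(2, 5) prior and invented data (12 of 20 sampled units have the property):

```python
def beta_binomial_update(alpha, beta, n, y):
    """Posterior Beta parameters (y + alpha, n - y + beta) after y successes in n trials."""
    return alpha + y, beta + n - y

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical prior and data.
alpha0, beta0 = 2.0, 5.0
n, y = 20, 12

alpha1, beta1 = beta_binomial_update(alpha0, beta0, n, y)
print(f"posterior: Beta({alpha1:g}, {beta1:g})")
print("prior mean     :", round(beta_mean(alpha0, beta0), 3))
print("sample fraction:", y / n)
print("posterior mean :", round(beta_mean(alpha1, beta1), 3))
```

The posterior mean lies between the prior mean and the sample fraction y/n, and moves towards y/n as n grows.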

Prior distributions that, combined with the likelihood, give a posterior in the same distributional family are called conjugate priors.

(Note that by a distributional family we mean distributions that go under a common name: Normal distribution, Binomial distribution, Poisson distribution etc.)

A conjugate prior always goes together with a particular likelihood to produce the posterior.

We sometimes refer to a conjugate pair of distributions meaning

(prior distribution, sample distribution = likelihood)

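In particular, if the sample distribution, i.e. f (x; θ), belongs to the k-parameter exponential family (class) of distributions:

$$f(x; \theta) = \exp\left( \sum_{j=1}^{k} A_j(\theta)\, B_j(x) + C(x) + D(\theta) \right)$$

we may put

$$p(\theta) = p(\theta; \alpha_1, \dots, \alpha_{k+1}) = \exp\left( \sum_{j=1}^{k} A_j(\theta)\, \alpha_j + \alpha_{k+1} D(\theta) + K(\alpha_1, \dots, \alpha_{k+1}) \right)$$

where α_1 , … , α_(k+1) are parameters of this prior distribution and K( · ) is a function of α_1 , … , α_(k+1) only.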

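Then

$$\begin{aligned}
q(\theta \mid \mathbf{x}) &\propto L(\theta \mid \mathbf{x})\, p(\theta) \\
&= \exp\left( \sum_{j=1}^{k} A_j(\theta) \sum_{i=1}^{n} B_j(x_i) + \sum_{i=1}^{n} C(x_i) + n D(\theta) \right) \exp\left( \sum_{j=1}^{k} A_j(\theta)\, \alpha_j + \alpha_{k+1} D(\theta) + K(\alpha_1, \dots, \alpha_{k+1}) \right) \\
&= \exp\left( \sum_{i=1}^{n} C(x_i) + K(\alpha_1, \dots, \alpha_{k+1}) \right) \exp\left( \sum_{j=1}^{k} A_j(\theta) \left( \sum_{i=1}^{n} B_j(x_i) + \alpha_j \right) + (n + \alpha_{k+1})\, D(\theta) \right) \\
&\propto \exp\left( \sum_{j=1}^{k} A_j(\theta) \left( \sum_{i=1}^{n} B_j(x_i) + \alpha_j \right) + (n + \alpha_{k+1})\, D(\theta) \right)
\end{aligned}$$

i.e. the posterior distribution is of the same form as the prior distribution but with parameters

$$\alpha_1 + \sum_{i=1}^{n} B_1(x_i),\ \dots,\ \alpha_k + \sum_{i=1}^{n} B_k(x_i),\ \alpha_{k+1} + n$$

instead of

$$\alpha_1, \dots, \alpha_k, \alpha_{k+1}$$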

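Some common cases (within or outside the exponential family):

Conjugate prior | Sample distribution | Posterior
Beta: θ ~ Beta(α, β) | Binomial: X ~ Bin(n, θ) | Beta: θ | x ~ Beta(α + x, β + n – x)
Normal: θ ~ N(φ, τ²) | Normal, known σ²: X_i ~ N(θ, σ²) | Normal: θ | x ~ N( (σ²φ + nτ²x̄) / (σ² + nτ²), σ²τ² / (σ² + nτ²) )
Gamma: θ ~ Gamma(α, λ), λ a rate parameter | Poisson: X_i ~ Po(θ) | Gamma: θ | x ~ Gamma(α + Σ x_i, λ + n)
Pareto: p(θ; α, β) | Uniform: X_i ~ U(0, θ) | Pareto: q(θ; x) = p(θ; α + n, max(β, x_(n)))

Here x̄ is the sample mean and x_(n) the largest observation.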
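Two of the conjugate updates can be written as one-line parameter maps. The sketch below assumes a normal prior written as N(φ, τ²) and a Gamma prior parameterized by shape and rate; these symbols, the function names and the data are assumptions made for the illustration.

```python
def normal_normal_update(prior_mean, prior_var, sigma2, xs):
    """Posterior mean and variance of theta for X_i ~ N(theta, sigma2), theta ~ N(prior_mean, prior_var)."""
    n = len(xs)
    xbar = sum(xs) / n
    denom = sigma2 + n * prior_var
    post_mean = (sigma2 * prior_mean + n * prior_var * xbar) / denom
    post_var = sigma2 * prior_var / denom
    return post_mean, post_var

def gamma_poisson_update(shape, rate, xs):
    """Posterior (shape, rate) of theta for X_i ~ Po(theta), theta ~ Gamma(shape, rate)."""
    return shape + sum(xs), rate + len(xs)

# Hypothetical prior settings and data.
post_mean, post_var = normal_normal_update(
    prior_mean=0.0, prior_var=4.0, sigma2=1.0, xs=[1.2, 0.8, 1.5, 1.1]
)
shape, rate = gamma_poisson_update(shape=2.0, rate=1.0, xs=[3, 1, 4, 2, 2])

print(f"Normal posterior: mean={post_mean:.4f}, variance={post_var:.4f}")
print(f"Gamma posterior : shape={shape:g}, rate={rate:g}")
```

In the normal case the posterior mean is a weighted average of the prior mean and the sample mean, and the posterior variance is always smaller than the prior variance.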

Example

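Assume we have a sample x = (x1, … , xn ) from U (0, θ) and that a prior density for θ is the Pareto density

$$p(\theta) = (\alpha - 1)\, \beta^{\alpha - 1}\, \theta^{-\alpha}; \qquad \theta > \beta,\ \alpha > 1,\ \beta > 0$$

What is the Bayes estimator of θ under quadratic loss?

The Bayes estimator is the posterior mean.

The posterior distribution is also Pareto with

$$q(\theta \mid \mathbf{x}) = (n + \alpha - 1)\, \big(\max(\beta, x_{(n)})\big)^{n + \alpha - 1}\, \theta^{-(n + \alpha)}; \qquad \theta > \max(\beta, x_{(n)})$$

Writing m = max(β, x_(n)),

$$\begin{aligned}
E(\theta \mid \mathbf{x}) &= \int_{m}^{\infty} \theta\, (n + \alpha - 1)\, m^{n + \alpha - 1}\, \theta^{-(n + \alpha)}\, d\theta \\
&= (n + \alpha - 1)\, m^{n + \alpha - 1} \int_{m}^{\infty} \theta^{-(n + \alpha - 1)}\, d\theta \\
&= (n + \alpha - 1)\, m^{n + \alpha - 1} \left[ -\frac{\theta^{-(n + \alpha - 2)}}{n + \alpha - 2} \right]_{m}^{\infty} \\
&= (n + \alpha - 1)\, m^{n + \alpha - 1} \left( 0 + \frac{m^{-(n + \alpha - 2)}}{n + \alpha - 2} \right) \\
&= \frac{n + \alpha - 1}{n + \alpha - 2}\, \max(\beta, x_{(n)}) = \hat{\theta}_B
\end{aligned}$$

Compare with the maximum likelihood estimator

$$\hat{\theta}_{ML} = x_{(n)}$$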
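The closed-form posterior mean can be sanity-checked by numerical integration. The sketch below assumes the Pareto prior is parameterized as p(θ) = (α − 1) β^(α−1) θ^(−α) for θ > β, so that the posterior is proportional to θ^(−(n+α)) on θ > max(β, x_(n)); the data, prior parameters and function names are invented.

```python
def bayes_estimate_uniform(xs, alpha, beta):
    """Closed-form posterior mean (n + alpha - 1)/(n + alpha - 2) * max(beta, x_(n))."""
    n = len(xs)
    m = max(beta, max(xs))
    return (n + alpha - 1) / (n + alpha - 2) * m

def posterior_mean_numeric(xs, alpha, beta, upper_factor=50.0, steps=100_000):
    """Midpoint-rule approximation of E(theta | x) using the unnormalised posterior theta^-(n+alpha).

    The integral is truncated at upper_factor * m; the neglected tail is tiny
    for the parameter values used below.
    """
    n = len(xs)
    m = max(beta, max(xs))
    h = (upper_factor - 1.0) * m / steps
    num = den = 0.0
    for i in range(steps):
        theta = m + (i + 0.5) * h
        kernel = theta ** (-(n + alpha))
        num += theta * kernel * h
        den += kernel * h
    return num / den

# Hypothetical sample and prior parameters.
xs = [2.1, 4.7, 3.3, 0.9, 4.2]
alpha, beta = 3.0, 1.0

closed = bayes_estimate_uniform(xs, alpha, beta)
numeric = posterior_mean_numeric(xs, alpha, beta)
print(f"Bayes estimate (closed form): {closed:.4f}")
print(f"Bayes estimate (numerical)  : {numeric:.4f}")
print(f"ML estimate x_(n)           : {max(xs):.4f}")
```

The Bayes estimate is always somewhat larger than the largest observation, which reflects that x_(n) can never exceed θ.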

Non-informative priors (uninformative)

A prior distribution that gives no more information about θ than possibly the parameter space is called a non-informative or uninformative prior.

Example: Beta(1,1) for an unknown proportion simply says that the parameter can be any value between 0 and 1 (which coincides with its definition)

A non-informative prior is characterized by the property that all values in the parameter space are equally likely.

Proper non-informative priors:

The prior is a true density or mass function

Improper non-informative priors:

The prior is a constant value over R^k

Example: N (μ, ∞) for the mean of a normal population


Decision theory applied on hypothesis testing

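Test of H0: θ = θ0 vs. H1: θ = θ1

Decision procedure: δ_C = Use a test with critical region C

Action: δ_C (x) = “Reject H0 if x ∈ C , otherwise accept H0 ”

Loss function:

          | H0 true | H1 true
Accept H0 | 0       | b
Reject H0 | a       | 0

Risk function

$$R(\delta_C; \theta) = E_{\mathbf{X}}\big[ L_S\big(\theta, \delta_C(\mathbf{X})\big) \big]$$

$$R(\delta_C; \theta_0) = (\text{loss when rejecting a true } H_0) \cdot \Pr(\mathbf{X} \in C \mid \theta = \theta_0) = a\,\alpha$$

$$R(\delta_C; \theta_1) = (\text{loss when accepting } H_0 \text{ when } H_1 \text{ is true}) \cdot \Pr(\mathbf{X} \notin C \mid \theta = \theta_1) = b\,\beta$$

where α and β are the probabilities of a type I and a type II error.

Assume a prior setting p0 = Pr (H0 is true) = Pr (θ = θ0) and p1 = Pr (H1 is true) = Pr (θ = θ1)

The prior expected risk becomes

$$E_{\theta}\big[ R(\delta_C; \theta) \big] = a\,\alpha\, p_0 + b\,\beta\, p_1$$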

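Bayes test:

$$\delta_B = \arg\min_{\delta_C} E_{\theta}\big[ R(\delta_C; \theta) \big] = \arg\min_{\delta_C} \left( a\,\alpha\, p_0 + b\,\beta\, p_1 \right)$$

Minimax test:

$$\delta^{*} = \arg\min_{\delta_C} \max_{\theta} R(\delta_C; \theta) = \arg\min_{\delta_C} \max\left( a\,\alpha,\ b\,\beta \right)$$

Lemma: Bayes tests and most powerful tests (Neyman-Pearson lemma) are equivalent in that

every most powerful test is a Bayes test for some values of p0 and p1 and every Bayes test is a most powerful test with critical region

$$\frac{L(\theta_0; \mathbf{x})}{L(\theta_1; \mathbf{x})} < \frac{p_1\, b}{p_0\, a}$$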

Example:

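Assume x = (x1, x2 ) is a random sample from Exp(θ), i.e.

$$f(x \mid \theta) = \theta^{-1} e^{-\theta^{-1} x}; \qquad x > 0,\ \theta > 0$$

We would like to test H0: θ = 1 vs. H1: θ = 2 with a Bayes test with losses a = 2 and b = 1 and with prior probabilities p0 and p1

The likelihood ratio is

$$\frac{L(\theta_0 \mid \mathbf{x})}{L(\theta_1 \mid \mathbf{x})} = \frac{e^{-x_1}\, e^{-x_2}}{2^{-1} e^{-x_1/2} \cdot 2^{-1} e^{-x_2/2}} = 4\, e^{-(x_1 + x_2)/2}$$

and the Bayes test rejects H0 when

$$4\, e^{-(x_1 + x_2)/2} < \frac{p_1\, b}{p_0\, a} = \frac{p_1}{2 p_0} \iff x_1 + x_2 > 2 \ln\frac{8 p_0}{p_1}$$

Now, for t > 0,

$$\begin{aligned}
P(X_1 + X_2 > t \mid \theta) &= \int_0^{t} \theta^{-1} e^{-x_1/\theta} \left( \int_{t - x_1}^{\infty} \theta^{-1} e^{-x_2/\theta}\, dx_2 \right) dx_1 + \int_{t}^{\infty} \theta^{-1} e^{-x_1/\theta}\, dx_1 \\
&= \int_0^{t} \theta^{-1} e^{-x_1/\theta}\, e^{-(t - x_1)/\theta}\, dx_1 + e^{-t/\theta} \\
&= \theta^{-1} t\, e^{-t/\theta} + e^{-t/\theta} = \left(1 + \theta^{-1} t\right) e^{-\theta^{-1} t}
\end{aligned}$$

With t = 2 ln(8 p0 / p1), assuming 8 p0 > p1 so that t > 0,

$$\alpha = P(X_1 + X_2 > t \mid \theta = 1) = \left(1 + 2 \ln\frac{8 p_0}{p_1}\right) \left(\frac{p_1}{8 p_0}\right)^{2}$$

$$\beta = P(X_1 + X_2 \le t \mid \theta = 2) = 1 - \left(1 + \ln\frac{8 p_0}{p_1}\right) \frac{p_1}{8 p_0}$$

A fixed size α gives conditions on p0 and p1, and a certain choice will give a minimized β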
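How the error probabilities depend on the prior can be tabulated directly. The sketch below works from the setup of the example (two exponential observations with mean θ, H0: θ = 1, H1: θ = 2, a = 2, b = 1, p1 = 1 − p0) and rejects H0 when a p0 L(θ0 | x) < b p1 L(θ1 | x), which for this model is x1 + x2 > 2 ln(4 a p0 / (b p1)); the function names are hypothetical.

```python
import math

def threshold(p0, a=2.0, b=1.0):
    """Rejection threshold t for x1 + x2; H0 is rejected when x1 + x2 > t."""
    p1 = 1.0 - p0
    # L(theta0|x)/L(theta1|x) = 4*exp(-(x1+x2)/2) < b*p1/(a*p0)
    # A non-positive value means H0 is rejected for every sample.
    return max(0.0, 2.0 * math.log(4.0 * a * p0 / (b * p1)))

def tail_prob(t, theta):
    """P(X1 + X2 > t) for two i.i.d. exponential variables with mean theta."""
    return (1.0 + t / theta) * math.exp(-t / theta)

def error_probs(p0):
    """Return (t, alpha, beta) of the Bayes test for prior probability p0 of H0."""
    t = threshold(p0)
    alpha = tail_prob(t, 1.0)        # reject H0 although theta = 1
    beta = 1.0 - tail_prob(t, 2.0)   # accept H0 although theta = 2
    return t, alpha, beta

for p0 in (0.2, 0.5, 0.8):
    t, alpha, beta = error_probs(p0)
    print(f"p0={p0:.1f}  t={t:.3f}  alpha={alpha:.4f}  beta={beta:.4f}")
```

A larger p0 raises the threshold, which lowers α and raises β, so fixing the size α pins down the prior odds that the Bayes test implicitly uses.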

Utility

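As an alternative to a loss function we can define a utility function

Each decision procedure has a consequence, and the decision maker can associate each consequence with a measure of desirability with respect to the true state of nature. This measure is called utility, and a utility function describes the utilities for different procedures and true states of nature:

$$U(\delta, \theta)$$

The expected utility of a procedure is obtained by integrating the utility function with the probability distribution g of θ (prior or posterior to obtained data):

$$\bar{U}(\delta) = \int U(\delta, \theta)\, g(\theta)\, d\theta$$

If we consider the case where data (x) should be taken into account, the procedure is evaluated as an action and the distribution of θ is a posterior density (or a posterior probability mass function)

$$\bar{U}\big(\delta(\mathbf{x})\big) = \int U\big(\delta(\mathbf{x}), \theta\big)\, q(\theta \mid \mathbf{x})\, d\theta$$

The loss function can be defined from the utility function as

$$L_S\big(\theta, \delta(\mathbf{x})\big) = \max_{d \in \mathcal{D}} U\big(d(\mathbf{x}), \theta\big) - U\big(\delta(\mathbf{x}), \theta\big)$$

where 𝒟 is the set of all possible decision procedures

Hence maximizing the (posterior) expected utility

$$\max_{d} \bar{U}\big(d(\mathbf{x})\big) = \max_{d} \int U\big(d(\mathbf{x}), \theta\big)\, q(\theta \mid \mathbf{x})\, d\theta$$

is equivalent to minimizing the Bayes risk

$$\min_{d} B(d) = \min_{d} \int R(\theta, d)\, p(\theta)\, d\theta, \qquad \text{i.e. for given } \mathbf{x}: \ \min_{d} \int L_S\big(\theta, d(\mathbf{x})\big)\, q(\theta \mid \mathbf{x})\, d\theta$$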
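The equivalence can be illustrated with a small discrete example in which the loss is built from the utility as the shortfall relative to the best procedure in each state. The utilities, the posterior probabilities and all names below are invented for the illustration.

```python
# Hypothetical utilities U(d, theta) for two procedures and three states of nature.
utility = {
    "d1": {"s1": 10.0, "s2": 4.0, "s3": -2.0},
    "d2": {"s1": 6.0, "s2": 5.0, "s3": 3.0},
}
posterior = {"s1": 0.5, "s2": 0.3, "s3": 0.2}   # assumed q(theta | x)

def expected(values, dist):
    """Expectation of a state-indexed quantity under a discrete distribution."""
    return sum(values[s] * dist[s] for s in dist)

# Loss defined from utility: L(theta, d) = max over d' of U(d', theta) - U(d, theta).
best = {s: max(u[s] for u in utility.values()) for s in posterior}
loss = {d: {s: best[s] - u[s] for s in posterior} for d, u in utility.items()}

exp_utility = {d: expected(u, posterior) for d, u in utility.items()}
exp_loss = {d: expected(l, posterior) for d, l in loss.items()}

best_by_utility = max(exp_utility, key=exp_utility.get)
best_by_loss = min(exp_loss, key=exp_loss.get)

print("expected utility:", exp_utility)
print("expected loss   :", exp_loss)
print("chosen procedure:", best_by_utility, "/", best_by_loss)
```

The expected loss of each procedure equals a constant (the expected best attainable utility) minus its expected utility, which is why the two criteria always select the same procedure.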
