Bayesian Decision Theory (SVCL lecture transcript)
Bayesian decision theory

recall that we have
• Y – state of the world
• X – observations
• g(x) – decision function
• L[g(x), y] – loss of predicting y with g(x)

the expected value of the loss is called the risk

    Risk = E_{X,Y}[ L(X, Y) ]

• which can be written as

    Risk = \sum_{i=1}^{M} \int P_{X,Y}(x, i) \, L[g(x), i] \, dx
Bayesian decision theory

• from this

    Risk = \sum_{i=1}^{M} \int P_{X,Y}(x, i) \, L[g(x), i] \, dx

• by the chain rule

    Risk = \int P_X(x) \sum_{i=1}^{M} P_{Y|X}(i|x) \, L[g(x), i] \, dx

• which can be written as

    Risk = \int P_X(x) R(x) \, dx = E_X[ R(x) ]

• where

    R(x) = \sum_{i=1}^{M} P_{Y|X}(i|x) \, L[g(x), i]

• is the conditional risk, given the observation x
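The conditional risk R(x) is just a posterior-weighted average of the losses. A minimal sketch, using made-up posterior values and a 0-1 loss matrix (all numbers are hypothetical):

```python
def conditional_risk(posteriors, loss, g):
    """R(x) = sum_i P_{Y|X}(i|x) * L[g, i] for a fixed decision g."""
    return sum(p * loss[g][i] for i, p in enumerate(posteriors))

# hypothetical posterior at some observation x, and the 0-1 loss
posteriors = [0.7, 0.3]    # P(Y=0|x), P(Y=1|x)
loss = [[0, 1], [1, 0]]    # loss[g][y] = L[g, y]

r0 = conditional_risk(posteriors, loss, 0)   # risk of deciding 0: 0.3
r1 = conditional_risk(posteriors, loss, 1)   # risk of deciding 1: 0.7
```

With the 0-1 loss, the conditional risk of a decision is simply the posterior mass on the other class.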
Bayesian decision theory

• since, by definition, L[g(x), y] ≥ 0 for all x, y
• it follows that

    R(x) = \sum_{i=1}^{M} P_{Y|X}(i|x) \, L[g(x), i] \ge 0, \quad \forall x

• hence

    Risk = E_X[ R(x) ]

is minimum if we minimize R(x) at all x, i.e., if we pick the decision function

    g^*(x) = \arg\min_{g(x)} \sum_{i=1}^{M} P_{Y|X}(i|x) \, L[g(x), i]
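Because the risk is minimized pointwise, the optimal decision at each x is found by comparing the conditional risks of all possible decisions. A minimal sketch with hypothetical numbers:

```python
def bayes_decision(posteriors, loss):
    """g*(x) = argmin_g sum_i P_{Y|X}(i|x) * L[g, i], minimized at each x."""
    n_actions = len(loss)
    risks = [sum(p * loss[g][i] for i, p in enumerate(posteriors))
             for g in range(n_actions)]
    return min(range(n_actions), key=risks.__getitem__)

# hypothetical posteriors and the 0-1 loss
g_low  = bayes_decision([0.7, 0.3], [[0, 1], [1, 0]])   # picks class 0
g_high = bayes_decision([0.2, 0.8], [[0, 1], [1, 0]])   # picks class 1
```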
Bayesian decision theory

this is the Bayes decision rule

    g^*(x) = \arg\min_{g(x)} \sum_{i=1}^{M} P_{Y|X}(i|x) \, L[g(x), i]

(figure: conditional risk curves for g(x) = 0 and g(x) = 1 – which decision rule has smaller conditional risk?)

• the associated risk

    R^* = \sum_{i=1}^{M} \int P_{X,Y}(x, i) \, L[g^*(x), i] \, dx

• or

    R^* = \int P_X(x) \sum_{i=1}^{M} P_{Y|X}(i|x) \, L[g^*(x), i] \, dx

• is the Bayes risk, and cannot be beaten
Example

let's consider a binary classification problem, g^*(x) \in \{0, 1\}

• for which the conditional risk is

    R(x) = \sum_{i=1}^{M} P_{Y|X}(i|x) \, L[g(x), i]
         = P_{Y|X}(0|x) \, L[g(x), 0] + P_{Y|X}(1|x) \, L[g(x), 1]

• we have two options

    g(x) = 0 \Rightarrow R_0(x) = P_{Y|X}(0|x) \, L[0,0] + P_{Y|X}(1|x) \, L[0,1]
    g(x) = 1 \Rightarrow R_1(x) = P_{Y|X}(0|x) \, L[1,0] + P_{Y|X}(1|x) \, L[1,1]

• and should pick the one of smaller conditional risk
Example

• i.e. pick g(x) = 0 if R_0(x) < R_1(x), and g(x) = 1 otherwise
• this can be written as: pick 0 if

    P_{Y|X}(0|x) \, L[0,0] + P_{Y|X}(1|x) \, L[0,1] < P_{Y|X}(0|x) \, L[1,0] + P_{Y|X}(1|x) \, L[1,1]

• or

    P_{Y|X}(1|x) \left\{ L[0,1] - L[1,1] \right\} < P_{Y|X}(0|x) \left\{ L[1,0] - L[0,0] \right\}

• usually there is no loss associated with the correct decision

    L[0,0] = L[1,1] = 0

• and this is the same as: pick 0 if

    P_{Y|X}(0|x) \, L[1,0] > P_{Y|X}(1|x) \, L[0,1]
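The loss-weighted comparison above shows how an asymmetric loss can flip the decision even when class 0 is more probable. A small sketch with hypothetical numbers:

```python
def pick_zero(p0, p1, L10, L01):
    """Pick 0 when P(0|x) * L[1,0] > P(1|x) * L[0,1] (with L[0,0] = L[1,1] = 0)."""
    return p0 * L10 > p1 * L01

# class 0 is more probable (0.6 vs 0.4)
balanced   = pick_zero(0.6, 0.4, L10=1, L01=1)    # True: pick the likelier class
asymmetric = pick_zero(0.6, 0.4, L10=1, L01=10)   # False: missing class 1 costs 10x more
```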
Example

• or, "pick 0" if

    \frac{P_{Y|X}(0|x)}{P_{Y|X}(1|x)} > \frac{L[0,1]}{L[1,0]}

• and applying Bayes rule

    \frac{P_{X|Y}(x|0) \, P_Y(0)}{P_{X|Y}(x|1) \, P_Y(1)} > \frac{L[0,1]}{L[1,0]}

• which is equivalent to "pick 0" if

    \frac{P_{X|Y}(x|0)}{P_{X|Y}(x|1)} > T^* = \frac{L[0,1] \, P_Y(1)}{L[1,0] \, P_Y(0)}

• i.e. we pick 0 when the probability of X given that Y=0, divided by that given Y=1, is greater than a threshold
• the optimal threshold T* depends on the costs of the two types of error and on the probabilities of the two classes
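This likelihood-ratio test is easy to sketch; the likelihood values, losses, and priors below are hypothetical:

```python
def pick_zero_lrt(px0, px1, L01, L10, p0, p1):
    """Pick 0 when P(x|0)/P(x|1) > T* = (L[0,1] * P_Y(1)) / (L[1,0] * P_Y(0))."""
    t_star = (L01 * p1) / (L10 * p0)
    return px0 / px1 > t_star

# likelihood ratio is 0.5 / 0.2 = 2.5
equal_priors  = pick_zero_lrt(0.5, 0.2, L01=1, L10=1, p0=0.5, p1=0.5)   # T* = 1  -> pick 0
skewed_priors = pick_zero_lrt(0.5, 0.2, L01=1, L10=1, p0=0.1, p1=0.9)   # T* = 9  -> pick 1
```

Raising the prior on class 1 raises the threshold the evidence for class 0 must clear.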
Example

let's consider the "0-1" loss

    L[g(x), y] = \begin{cases} 1, & g(x) \ne y \\ 0, & g(x) = y \end{cases}

• in this case the optimal decision function is

    g^*(x) = \arg\min_{g(x)} \sum_{i=1}^{M} P_{Y|X}(i|x) \, L[g(x), i]
           = \arg\min_{g(x)} \sum_{i \ne g(x)} P_{Y|X}(i|x)
           = \arg\min_{g(x)} \left[ 1 - P_{Y|X}(g(x)|x) \right]
           = \arg\max_{g(x)} P_{Y|X}(g(x)|x)
           = \arg\max_i P_{Y|X}(i|x)
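The collapse from the general risk minimization to the argmax of the posterior can be checked numerically; the three-class posterior below is hypothetical:

```python
def bayes_rule_01(posteriors):
    """argmin_g sum_{i != g} P(i|x): the general rule under the 0-1 loss."""
    m = len(posteriors)
    return min(range(m),
               key=lambda g: sum(p for i, p in enumerate(posteriors) if i != g))

def map_rule(posteriors):
    """argmax_i P(i|x): the MAP rule."""
    return max(range(len(posteriors)), key=posteriors.__getitem__)

post = [0.2, 0.5, 0.3]   # hypothetical posterior over three classes
g1, g2 = bayes_rule_01(post), map_rule(post)   # both pick class 1
```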
Example

for the "0-1" loss the optimal decision rule is the maximum a-posteriori probability (MAP) rule

    g^*(x) = \arg\max_i P_{Y|X}(i|x)

what is the associated risk?

    R^* = \int P_X(x) \sum_{i=1}^{M} P_{Y|X}(i|x) \, L[g^*(x), i] \, dx
        = \int P_X(x) \sum_{i \ne g^*(x)} P_{Y|X}(i|x) \, dx
        = \int P_X(x) \, P_{Y|X}(y \ne g^*(x) \,|\, x) \, dx
        = \int P_{X,Y}(y \ne g^*(x), x) \, dx
Example

but

    R^* = \int P_{X,Y}(y \ne g^*(x), x) \, dx = P(y \ne g^*(x))

• is really just the probability of error of the decision rule g*(x)
• note that the same result would hold for any g(x), i.e. R would be the probability of error of g(x)
• this implies the following, for the "0-1" loss:
• the Bayes decision rule is the MAP rule

    g^*(x) = \arg\max_i P_{Y|X}(i|x)

• the risk is the probability of error of this rule (the Bayes error)
• there is no other decision function with lower error
MAP rule

usually the MAP rule can be written in a simple form, given a probabilistic model for X and Y

consider the two-class problem, i.e. Y = 0 or Y = 1
• the BDR is

    i^*(x) = \arg\max_i P_{Y|X}(i|x)
           = \begin{cases} 0, & \text{if } P_{Y|X}(0|x) \ge P_{Y|X}(1|x) \\ 1, & \text{if } P_{Y|X}(0|x) < P_{Y|X}(1|x) \end{cases}

• pick 0 when P_{Y|X}(0|x) \ge P_{Y|X}(1|x), and 1 otherwise
• using Bayes rule

    P_{Y|X}(0|x) \ge P_{Y|X}(1|x) \;\Leftrightarrow\; \frac{P_{X|Y}(x|0) \, P_Y(0)}{P_X(x)} \ge \frac{P_{X|Y}(x|1) \, P_Y(1)}{P_X(x)}
MAP rule

• noting that P_X(x) is a non-negative quantity, this is the same as
• pick "0" when

    P_{X|Y}(x|0) \, P_Y(0) \ge P_{X|Y}(x|1) \, P_Y(1)

• by using the same reasoning, this can be easily generalized to

    i^*(x) = \arg\max_i P_{X|Y}(x|i) \, P_Y(i)

• note that:
• many class-conditional distributions are exponential (e.g. the Gaussian)
• this product can be tricky to compute (e.g. the tail probabilities are quite small)
• we can take advantage of the fact that we only care about the order of the terms on the right-hand side
The log trick

this is the log trick
• which is to take logs
• note that the log is a monotonically increasing function

    a > b \;\Leftrightarrow\; \log a > \log b

• from which

    i^*(x) = \arg\max_i P_{X|Y}(x|i) \, P_Y(i)
           = \arg\max_i \log \left( P_{X|Y}(x|i) \, P_Y(i) \right)
           = \arg\max_i \left[ \log P_{X|Y}(x|i) + \log P_Y(i) \right]

• the order is preserved
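The point of the trick shows up exactly when the likelihoods are tiny tail probabilities; a sketch with hypothetical, deliberately small values:

```python
import math

# hypothetical, very small likelihoods of the kind the slide warns about
likelihoods = [1e-120, 3e-118, 2e-119]
priors = [0.2, 0.5, 0.3]

direct = [l * p for l, p in zip(likelihoods, priors)]
logged = [math.log(l) + math.log(p) for l, p in zip(likelihoods, priors)]

best_direct = max(range(3), key=direct.__getitem__)
best_logged = max(range(3), key=logged.__getitem__)   # same winner: order preserved
```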
MAP rule

in summary
• for the zero/one loss, the following three decision rules are optimal and equivalent

• 1) i^*(x) = \arg\max_i P_{Y|X}(i|x)
• 2) i^*(x) = \arg\max_i \left[ P_{X|Y}(x|i) \, P_Y(i) \right]
• 3) i^*(x) = \arg\max_i \left[ \log P_{X|Y}(x|i) + \log P_Y(i) \right]

• 1) is usually hard to use, 3) is frequently easier than 2)
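The equivalence of the three rules is easy to verify on a toy model; the likelihoods and priors below are hypothetical:

```python
import math

# hypothetical model, evaluated at one observed x
PxY = {0: 0.5, 1: 0.1}   # class-conditional likelihoods P(x|i)
PY  = {0: 0.3, 1: 0.7}   # class priors
Px  = sum(PxY[i] * PY[i] for i in (0, 1))   # evidence, needed for rule 1 only

rule1 = max((0, 1), key=lambda i: PxY[i] * PY[i] / Px)                  # posterior
rule2 = max((0, 1), key=lambda i: PxY[i] * PY[i])                       # joint
rule3 = max((0, 1), key=lambda i: math.log(PxY[i]) + math.log(PY[i]))   # log joint
```

All three pick the same class, since dividing by P_X(x) and taking logs both preserve the order.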
Example

the Bayes decision rule is usually highly intuitive

example: communications
• a bit Y is transmitted by a source, corrupted by noise, and received as X by a decoder

    Y → channel → X

• Q: what should the optimal decoder do to recover Y?
Example

intuitively, it appears that the decoder should just threshold X
• pick a threshold T
• decision rule

    Y = \begin{cases} 0, & \text{if } x < T \\ 1, & \text{if } x > T \end{cases}

• what is the threshold value?
• let's solve the problem with the BDR
Example

we need
• class probabilities:
• in the absence of any other info, let's say

    P_Y(0) = P_Y(1) = 1/2

• class-conditional densities:
• noise results from thermal processes: electrons moving around and bumping into each other
• a lot of independent events that add up
• by the central limit theorem it appears reasonable to assume that the noise is Gaussian

we denote a Gaussian random variable of mean µ and variance σ² by X ~ N(µ, σ²)
Example

the Gaussian probability density function is

    P_X(x) = G(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

since the noise is Gaussian, and assuming it is just added to the signal, we have

    X = Y + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)

• in both cases, X corresponds to a constant (Y) plus zero-mean Gaussian noise
• this simply adds Y to the mean of the Gaussian
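A sketch of the density, and of the "additive noise shifts the mean" claim (the values of x and σ are hypothetical):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """G(x, mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# density of X under Y = 1 equals the zero-mean noise density shifted by 1
x, sigma = 0.8, 0.5
p_given_1   = gaussian_pdf(x, 1.0, sigma)
p_shifted_0 = gaussian_pdf(x - 1.0, 0.0, sigma)
```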
Example

in summary

    P_{X|Y}(x|0) = G(x, 0, \sigma)
    P_{X|Y}(x|1) = G(x, 1, \sigma)
    P_Y(0) = P_Y(1) = 1/2

• or, graphically:

(figure: two Gaussian class-conditional densities of equal width, centered at 0 and 1)
Example

to compute the BDR, we recall that

    i^*(x) = \arg\max_i \left[ \log P_{X|Y}(x|i) + \log P_Y(i) \right]

and note that
• terms which are constant (as a function of i) can be dropped, since we are just looking for the i that maximizes the function
• since this is the case for the class probabilities, P_Y(0) = P_Y(1) = 1/2, we have

    i^*(x) = \arg\max_i \log P_{X|Y}(x|i)
BDR

this is intuitive
• we pick the class that "best explains" (gives higher probability to) the observation
• in this case, we can solve visually

(figure: pick 0 to the left of the crossing point of the two densities, pick 1 to the right)

• but the mathematical solution is equally simple
BDR

let's consider the more general case

    P_{X|Y}(x|0) = G(x, \mu_0, \sigma), \quad P_{X|Y}(x|1) = G(x, \mu_1, \sigma)

• for which

    i^*(x) = \arg\max_i \log P_{X|Y}(x|i)
           = \arg\max_i \log \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x-\mu_i)^2}{2\sigma^2}} \right]
           = \arg\max_i \left[ -\log\left(\sqrt{2\pi}\,\sigma\right) - \frac{(x-\mu_i)^2}{2\sigma^2} \right]
           = \arg\min_i (x - \mu_i)^2
BDR

• or

    i^*(x) = \arg\min_i \frac{(x - \mu_i)^2}{2\sigma^2}
           = \arg\min_i \left( x^2 - 2x\mu_i + \mu_i^2 \right)
           = \arg\min_i \left( -2x\mu_i + \mu_i^2 \right)

• the optimal decision is, therefore
• pick 0 if

    -2x\mu_0 + \mu_0^2 < -2x\mu_1 + \mu_1^2

• or, pick 0 if (assuming \mu_1 > \mu_0)

    2x(\mu_1 - \mu_0) < \mu_1^2 - \mu_0^2

    x < \frac{\mu_0 + \mu_1}{2}
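The algebra says "pick the nearest mean", which is the same as thresholding at the midpoint. A sketch checking the two rules agree on a grid (means are hypothetical):

```python
mu0, mu1 = 0.0, 1.0   # hypothetical class means

def nearest_mean(x):
    """argmin_i (x - mu_i)^2: the BDR derived above."""
    return 0 if (x - mu0) ** 2 < (x - mu1) ** 2 else 1

def midpoint_rule(x):
    """pick 0 iff x < (mu0 + mu1) / 2."""
    return 0 if x < (mu0 + mu1) / 2 else 1

agree = all(nearest_mean(x) == midpoint_rule(x)
            for x in [i / 10 for i in range(-20, 31)])
```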
BDR

for a problem with Gaussian classes, equal variances and equal class probabilities
• the optimal decision boundary is the threshold at the mid-point between the two means

(figure: pick 0 below the midpoint of µ0 and µ1, pick 1 above)
BDR

back to our signal decoding problem
• in this case T = 0.5
• decision rule

    Y = \begin{cases} 0, & \text{if } x < 0.5 \\ 1, & \text{if } x > 0.5 \end{cases}

• this is, once again, intuitive
• we place the threshold midway between the two noise-free signal values
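For this decoder the Bayes error has a closed form: with equal priors and means 0 and 1, an error occurs when the noise pushes x across the midpoint, so P(error) = Q(0.5/σ), where Q is the Gaussian tail probability. A sketch with a hypothetical noise level:

```python
import math

sigma = 0.5   # assumed noise standard deviation (hypothetical)

# Q(z) = 0.5 * erfc(z / sqrt(2)) is the standard Gaussian tail probability
bayes_error = 0.5 * math.erfc((0.5 / sigma) / math.sqrt(2))
```

For σ = 0.5 this gives Q(1) ≈ 0.159, i.e. roughly one bit in six is decoded wrongly at this noise level.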
BDR

what is the point of going through all the math?
• now we know that the intuitive threshold is actually optimal, and in which sense it is optimal (minimum probability of error)
• the Bayesian solution keeps us honest
• it forces us to make all our assumptions explicit
• assumptions we have made:
• uniform class probabilities: P_Y(0) = P_Y(1) = 1/2
• Gaussianity: P_{X|Y}(x|i) = G(x, \mu_i, \sigma_i)
• the variance is the same under the two states: \sigma_i = \sigma, \forall i
• noise is additive: X = Y + \varepsilon
• even for a trivial problem, we have made lots of assumptions
BDR

what if the class probabilities are not the same?
• e.g. coding scheme 7 = 11111110
• in this case P_Y(1) >> P_Y(0)
• how does this change the optimal decision rule?

    i^*(x) = \arg\max_i \left[ \log P_{X|Y}(x|i) + \log P_Y(i) \right]
           = \arg\max_i \left[ \log \left( \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x-\mu_i)^2}{2\sigma^2}} \right) + \log P_Y(i) \right]
           = \arg\max_i \left[ -\log\left(\sqrt{2\pi}\,\sigma\right) - \frac{(x-\mu_i)^2}{2\sigma^2} + \log P_Y(i) \right]
           = \arg\min_i \left[ (x - \mu_i)^2 - 2\sigma^2 \log P_Y(i) \right]
BDR

• or

    i^*(x) = \arg\min_i \left[ (x - \mu_i)^2 - 2\sigma^2 \log P_Y(i) \right]
           = \arg\min_i \left[ x^2 - 2x\mu_i + \mu_i^2 - 2\sigma^2 \log P_Y(i) \right]
           = \arg\min_i \left[ -2x\mu_i + \mu_i^2 - 2\sigma^2 \log P_Y(i) \right]

• the optimal decision is, therefore
• pick 0 if

    -2x\mu_0 + \mu_0^2 - 2\sigma^2 \log P_Y(0) < -2x\mu_1 + \mu_1^2 - 2\sigma^2 \log P_Y(1)

• or, pick 0 if (assuming \mu_1 > \mu_0)

    2x(\mu_1 - \mu_0) < \mu_1^2 - \mu_0^2 + 2\sigma^2 \log \frac{P_Y(0)}{P_Y(1)}

    x < \frac{\mu_0 + \mu_1}{2} + \frac{\sigma^2}{\mu_1 - \mu_0} \log \frac{P_Y(0)}{P_Y(1)}
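The prior-corrected threshold is easy to compute; a sketch with hypothetical means, noise level, and priors:

```python
import math

def shifted_threshold(mu0, mu1, sigma, p0, p1):
    """x* = (mu0 + mu1)/2 + sigma^2/(mu1 - mu0) * log(P_Y(0)/P_Y(1))."""
    return (mu0 + mu1) / 2 + sigma ** 2 / (mu1 - mu0) * math.log(p0 / p1)

equal  = shifted_threshold(0.0, 1.0, 0.5, 0.5, 0.5)   # 0.5: equal priors cancel
skewed = shifted_threshold(0.0, 1.0, 0.5, 0.9, 0.1)   # > 0.5: favors the common class 0
```

When P_Y(0) > P_Y(1) the threshold moves up, making the decoder more willing to say 0, exactly as the next slide argues.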
BDR

what is the role of the prior for class probabilities?

    x < \frac{\mu_0 + \mu_1}{2} + \frac{\sigma^2}{\mu_1 - \mu_0} \log \frac{P_Y(0)}{P_Y(1)}

• the prior moves the threshold up or down, in an intuitive way
• P_Y(0) > P_Y(1): the threshold increases
• since 0 has higher probability, we care more about errors on the 0 side
• by using a higher threshold we make it more likely to pick 0
• if P_Y(0) = 1, all we care about is Y = 0; the threshold becomes infinite, and we never say 1
• how relevant is the prior?
• it is weighed by

    \frac{\sigma^2}{\mu_1 - \mu_0} = \frac{1}{(\mu_1 - \mu_0)/\sigma^2}
BDR

how relevant is the prior?
• it is weighed by the inverse of the normalized distance between the means

    \frac{\mu_1 - \mu_0}{\sigma^2} \quad \text{(distance between the means, in units of variance)}

• if the classes are very far apart, the prior makes no difference
• this is the easy situation: the observations are very clear, and Bayes says "forget the prior knowledge"
• if the classes are exactly equal (same mean) the prior gets infinite weight
• in this case the observations do not say anything about the class, and Bayes says "forget about the data, just use the knowledge that you started with"
• even if that means "always say 0" or "always say 1"