Material 3: Bayesian Decision Theory
Bayesian Decision Theory
Team teaching
Lecture note for Stat 231: Pattern Recognition and Machine Learning
Lecture 2: Bayesian Decision Theory
1. Diagram and formulation
2. Bayes rule for inference
3. Bayesian decision
4. Discriminant functions and space partition
5. Advanced issues
Diagram of pattern classification: the procedure of pattern recognition and decision making
[Diagram: subjects → observables X → features x → inner belief w → action α]
X --- all the observables, using existing sensors and instruments.
x --- a set of features selected from components of X, or linear/non-linear functions of X.
w --- our inner belief/perception about the subject class.
α --- the action that we take for x.

We denote the three spaces by

$$x \in \Omega_d, \qquad w \in \Omega_c, \qquad \alpha \in \Omega_\alpha,$$

where $x = (x_1, x_2, \ldots, x_d)$ is a vector and $w$ is the index of the class, $\Omega_c = \{w_1, w_2, \ldots, w_k\}$.
Examples
Ex 1: Fish classification
X = I is the image of the fish; x = (brightness, length, fin #, ...).
w is our belief of what the fish type is: Ωc = {“sea bass”, “salmon”, “trout”, ...}.
α is a decision for the fish type; in this case Ωα = Ωc = {“sea bass”, “salmon”, “trout”, ...}.
Ex 2: Medical diagnosis
X = all the available medical tests and imaging scans that a doctor can order for a patient; x = (blood pressure, glucose level, cough, x-ray, ...).
w is an illness type: Ωc = {“Flu”, “cold”, “TB”, “pneumonia”, “lung cancer”, ...}.
α is a decision for treatment: Ωα = {“Tylenol”, “Hospitalize”, ...}.
Tasks
[Diagram: subjects → (control sensors) → observables X → (selecting informative features) → features x → (statistical inference) → inner belief w → (risk/cost minimization) → decision α]
In Bayesian decision theory, we are concerned with the last three steps in the diagram (statistical inference, risk/cost minimization, and the decision), assuming that the observables are given and the features are selected.
Bayes Decision
It is decision making when all the underlying probability distributions are known; it is optimal given that the distributions are known. For two classes ω1 and ω2, the prior probabilities for an unknown new observation are:
P(ω1): the new observation belongs to class 1
P(ω2): the new observation belongs to class 2
P(ω1) + P(ω2) = 1
This reflects our prior knowledge, and it gives our decision rule when no feature of the new object is available: classify as class 1 if P(ω1) > P(ω2).
Bayesian Decision Theory
[Diagram: features x → (statistical inference) → inner belief p(w|x) → (risk/cost minimization) → decision α(x)]
Two probability tables: (a) the prior p(w); (b) the likelihood p(x|w).
A risk/cost function (a two-way table): λ(α | w).
The belief on the class w is computed by the Bayes rule, and the risk is computed from the posterior and the cost:

$$p(w \mid x) = \frac{p(x \mid w)\, p(w)}{p(x)}$$

$$R(\alpha_i \mid x) = \sum_{j=1}^{k} \lambda(\alpha_i \mid w_j)\, p(w_j \mid x)$$
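As a concrete illustration (a minimal numeric sketch, not from the lecture; all values below are made up), the prior, the likelihood at one observed x, and the cost table can be stored as small arrays, and the posterior and conditional risk computed directly:

```python
import numpy as np

# Illustrative tables: prior p(w), likelihood p(x|w) at one observed x,
# and cost lambda(alpha_i | w_j) with rows = actions, columns = true classes.
prior = np.array([0.6, 0.4])
likelihood = np.array([0.30, 0.10])
cost = np.array([[0.0, 1.0],
                 [2.0, 0.0]])

# Bayes rule: p(w|x) = p(x|w) p(w) / p(x); p(x) is the normalizing sum.
joint = likelihood * prior
posterior = joint / joint.sum()

# Conditional risk: R(alpha_i|x) = sum_j lambda(alpha_i|w_j) p(w_j|x).
risk = cost @ posterior
print(posterior, risk)   # posterior ~[0.818, 0.182]; risk ~[0.182, 1.636]
```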
Bayes Decision
We observe features on each object. P(x | ω1) and P(x | ω2) are the class-conditional densities. The Bayes rule then gives the posterior:

$$P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\, P(\omega_i)}{P(x)}$$
Decision Rule
A decision is made to minimize the average cost/risk

$$R = \int R(\alpha(x) \mid x)\, p(x)\, dx,$$

and it is minimized when our decision is made to minimize the cost/risk for each instance x:

$$\alpha(x) = \arg\min_{\alpha \in \Omega_\alpha} R(\alpha \mid x) = \arg\min_{\alpha \in \Omega_\alpha} \sum_{j=1}^{k} \lambda(\alpha \mid w_j)\, p(w_j \mid x)$$

A decision rule is a mapping function from the feature space to the set of actions, $\alpha(x): \Omega_d \to \Omega_\alpha$. We will show that randomized decisions won't be optimal.
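Continuing the sketch above (again my own illustration, not the notes' code), the decision rule is just the argmin of the conditional risk:

```python
import numpy as np

def bayes_decision(likelihood, prior, cost):
    """alpha(x) = argmin_i R(alpha_i | x) for one feature vector x.

    likelihood[j] = p(x|w_j) at x, prior[j] = p(w_j),
    cost[i, j] = lambda(alpha_i | w_j).
    """
    joint = likelihood * prior
    posterior = joint / joint.sum()   # p(w|x) by the Bayes rule
    risk = cost @ posterior           # R(alpha_i|x) for every action i
    return int(np.argmin(risk))       # deterministic: one action per x
```

This also hints at why randomization cannot help: the risk of a randomized rule is a convex combination of the R(α|x) values, so it can never fall below the minimum that the deterministic argmin rule already attains.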
Bayesian error
In a special case, like fish classification, where the action is a classification, we assume a 0/1 loss:

$$\lambda(\alpha_i \mid w_j) = 0 \;\; \text{if } i = j, \qquad \lambda(\alpha_i \mid w_j) = 1 \;\; \text{if } i \neq j$$

The risk for classifying x to class αi is then

$$R(\alpha_i \mid x) = \sum_{j \neq i} p(w_j \mid x) = 1 - p(w_i \mid x)$$

The optimal decision is to choose the class that has maximum posterior probability:

$$\alpha(x) = \arg\min_{\alpha \in \Omega_\alpha} \big(1 - p(\alpha \mid x)\big) = \arg\max_{\alpha \in \Omega_\alpha} p(\alpha \mid x)$$

The total risk for a decision rule, in this case, is called the Bayesian error:

$$R = p(\text{error}) = \int p(\text{error} \mid x)\, p(x)\, dx = \int \big(1 - p(\alpha(x) \mid x)\big)\, p(x)\, dx$$
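Under the 0/1 loss the optimal rule is thus the MAP class, and the Bayes error is the overlap of the class joints. A small numeric sketch (assumed one-dimensional Gaussian class densities with illustrative parameters, not from the lecture):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two assumed class densities with equal priors (illustrative parameters).
xs = np.linspace(-8.0, 8.0, 20001)
joint1 = normal_pdf(xs, -1.0, 1.0) * 0.5   # p(x|w_1) P(w_1)
joint2 = normal_pdf(xs, +1.0, 1.0) * 0.5   # p(x|w_2) P(w_2)

# p(error|x) p(x) is the smaller of the two joints; integrate on the grid.
dx = xs[1] - xs[0]
print(np.minimum(joint1, joint2).sum() * dx)   # ~0.1587 for these parameters
```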
Partition of feature space
The decision rule $\alpha(x): \Omega_d \to \Omega_\alpha$ is a partition/coloring of the feature space into k subspaces:

$$\Omega = \bigcup_{i=1}^{k} \Omega_i, \qquad \Omega_i \cap \Omega_j = \varnothing \;\; \text{for } i \neq j$$

[Figure: a feature space partitioned into regions labeled 1-5]
An example of fish classification
Decision/classification boundaries
Closed-form solutions
In two-class classification, k = 2, with p(x|w) being normal densities, the decision boundaries can be computed in closed form.
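To see where the closed form comes from (a standard derivation sketch, not reproduced from the slides): take the log of the ratio of the two class joints and decide $w_1$ where $g(x) > 0$, with

$$g(x) = \ln\frac{p(x \mid w_1)\, P(w_1)}{p(x \mid w_2)\, P(w_2)} = -\tfrac{1}{2}(x-\mu_1)^T \Sigma_1^{-1}(x-\mu_1) + \tfrac{1}{2}(x-\mu_2)^T \Sigma_2^{-1}(x-\mu_2) - \tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|} + \ln\frac{P(w_1)}{P(w_2)}.$$

The boundary $g(x) = 0$ is quadratic in $x$ in general, and linear when $\Sigma_1 = \Sigma_2$.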
Exercise
• Consider the example of the sea bass / salmon classifier, with these two possible actions: α1: decide the input is sea bass; α2: decide the input is salmon. The priors for sea bass and salmon are 2/3 and 1/3, respectively.
• The cost of classifying a fish as salmon when it truly is sea bass is $2, and the cost of classifying a fish as sea bass when it truly is salmon is $1.
• Find the decision for the input X = 13, where the likelihoods are P(X|ω1) = 0.28 and P(X|ω2) = 0.17; a worked sketch follows below.
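A worked sketch of the exercise (assuming zero cost for correct decisions, since the slide gives only the two error costs):

```python
import numpy as np

prior = np.array([2/3, 1/3])          # P(w_1) sea bass, P(w_2) salmon
likelihood = np.array([0.28, 0.17])   # P(X=13|w_1), P(X=13|w_2)
cost = np.array([[0.0, 1.0],          # alpha_1 "sea bass": $1 if truly salmon
                 [2.0, 0.0]])         # alpha_2 "salmon":   $2 if truly sea bass

posterior = likelihood * prior
posterior /= posterior.sum()          # ~[0.767, 0.233]
risk = cost @ posterior               # R(alpha_1|x) ~0.233, R(alpha_2|x) ~1.534
print(["sea bass", "salmon"][int(np.argmin(risk))])   # -> sea bass
```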
Nominal features
Training data: three binary input dimensions, rash (R), temperature (T), and dizzy (D), and a binary output class (C).
Training phase:
We can summarise the training data in frequency tables and estimate the likelihoods as relative frequencies.
Testing phase:
• Classify the test data by computing the posterior probabilities: if P(C = 1 | X) > 0.5, classify as C = 1; else classify as C = 0 (since it is a two-class problem). A sketch of the procedure follows.
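The slide's frequency tables did not survive extraction, so the sketch below invents a tiny binary training set (hypothetical rows, labeled as such) purely to show the mechanics: estimate the likelihood of a feature pattern as its relative frequency within each class, then threshold the posterior at 0.5.

```python
import numpy as np

# Hypothetical training rows (the original table is not reproduced here).
# Columns: rash R, temperature T, dizzy D, class C -- all binary.
data = np.array([
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
])
X, C = data[:, :3], data[:, 3]

def posterior_c1(x):
    """P(C=1 | X=x) with likelihoods estimated as relative frequencies."""
    joint = {}
    for c in (0, 1):
        rows = X[C == c]
        like = np.mean(np.all(rows == x, axis=1))  # rel. freq. of pattern x in class c
        joint[c] = like * np.mean(C == c)          # likelihood * prior
    total = joint[0] + joint[1]
    return joint[1] / total if total > 0 else 0.5  # unseen pattern: no evidence

x_test = np.array([1, 1, 0])
print("C=1" if posterior_c1(x_test) > 0.5 else "C=0")   # -> C=1 here
```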
Univariate Normal Distribution (Continuous)
Training phase:
We compute the mean and standard deviation of the heights of both trolls and smurfs. Then we compute the class-conditional probability of observing a height given either a smurf or a troll, modeling each as a normal density. Finally, we compute the prior probability of observing a troll or a smurf, regardless of height, as the relative frequency of each class in the training data.
Testing phase:
In general, we have the most important parts done. However, let's take a look at Bayes rule in this context:

$$P(j \mid x) = \frac{p(x \mid j)\, P(j)}{p(x)}$$

We notice that in order to figure out the final probability, we need the marginal probability p(x), which is the base probability of a height. Basically, p(x) is there to make sure that if we summed up all the P(j|x) they would equal one. This is simply

$$p(x) = \sum_j p(x \mid j)\, P(j),$$

which for two classes would be:

$$p(x) = p(x \mid \text{smurf})\, P(\text{smurf}) + p(x \mid \text{troll})\, P(\text{troll})$$
Finally, we have everything we need to determine whether a creature is most likely a troll or a smurf based upon its height. Thus, we can now ask: if we observe a creature that is 2" tall, how likely is it to be a smurf? Next we ask: what is the likelihood that we have a troll, given the observation of 2"? The end result is that if

$$P(\text{smurf} \mid x = 2") > P(\text{troll} \mid x = 2"),$$

then we can decide that we are most likely observing a smurf.
Exercise
Height Creature
2.70” Smurf
2.52” Smurf
2.57” Smurf
2.22” Smurf
3.16” Troll
3.58” Troll
3.16” Troll
If we test a creature that is 2.90" tall, is it more likely to be a Smurf or a Troll?
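One way to answer, as a sketch (a normal density per class; priors as the class frequencies 4/7 and 3/7; the population standard deviation here, though a sample estimate gives the same ordering):

```python
import numpy as np

smurf = np.array([2.70, 2.52, 2.57, 2.22])   # heights from the table
troll = np.array([3.16, 3.58, 3.16])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = 2.90
n = len(smurf) + len(troll)
joint = {                                    # p(x|class) * P(class)
    "Smurf": normal_pdf(x, smurf.mean(), smurf.std()) * len(smurf) / n,
    "Troll": normal_pdf(x, troll.mean(), troll.std()) * len(troll) / n,
}
px = sum(joint.values())                     # marginal p(x)
for name, j in joint.items():
    print(f"P({name} | {x}) = {j / px:.3f}")
```

With these numbers the two posteriors come out close to even, tipping slightly toward Troll; 2.90" sits between the two class means, so the decision is sensitive to the spread estimates.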
The Multivariate Normal Distribution (Continuous)
In the multivariate case we need a mean vector and a variance-covariance matrix. In two dimensions, the so-called bivariate normal, we are concerned with two means, $\mu = (\mu_1, \mu_2)$, and a $2 \times 2$ variance-covariance matrix $\Sigma$.
Curvature  Diameter  Quality Control Result
2.95       6.63      Passed
2.53       7.79      Passed
3.57       5.65      Passed
3.16       5.47      Passed
2.58       4.46      Not passed
2.16       6.22      Not passed
3.27       3.52      Not passed

• As a consultant to the factory, you get the task of setting up the criteria for automatic quality control. The manager of the factory also wants to test your criteria on a new type of chip ring about which even the human experts argue with each other. The new chip ring has curvature 2.81 and diameter 5.46.
• Can you solve this problem by employing a Bayes classifier?
• X = features (or independent variables) of all data. Each row (denoted by $x_k$) represents one object; each column stands for one feature.
• Y = group of the object (or dependent variable) of all data. Each row represents one object, and it has only one column.
Training phase
$$x = \begin{bmatrix} 2.95 & 6.63 \\ 2.53 & 7.79 \\ 3.57 & 5.65 \\ 3.16 & 5.47 \\ 2.58 & 4.46 \\ 2.16 & 6.22 \\ 3.27 & 3.52 \end{bmatrix}, \qquad y = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 2 \\ 2 \\ 2 \end{bmatrix}$$
• $x_k$ = data of row k; for example, $x_3 = \begin{bmatrix} 3.57 & 5.65 \end{bmatrix}$.
• g = number of groups in y; in our example, g = 2.
• $x_i$ = features data for group i. Each row represents one object; each column stands for one feature. We separate x into several groups based on the number of categories in y:

$$x_1 = \begin{bmatrix} 2.95 & 6.63 \\ 2.53 & 7.79 \\ 3.57 & 5.65 \\ 3.16 & 5.47 \end{bmatrix}, \qquad x_2 = \begin{bmatrix} 2.58 & 4.46 \\ 2.16 & 6.22 \\ 3.27 & 3.52 \end{bmatrix}$$
• $\mu_i$ = mean of the features in group i, which is the average of $x_i$: $\mu_1 = \begin{bmatrix} 3.05 & 6.38 \end{bmatrix}$, $\mu_2 = \begin{bmatrix} 2.67 & 4.73 \end{bmatrix}$.
• $\mu$ = global mean vector, that is, the mean of the whole data set. In this example, $\mu = \begin{bmatrix} 2.88 & 5.676 \end{bmatrix}$.
• $x_i^0$ = mean-corrected data, that is, the features data for group i, $x_i$, minus the global mean vector $\mu$:

$$x_1^0 = \begin{bmatrix} 0.060 & 0.951 \\ -0.357 & 2.109 \\ 0.679 & -0.025 \\ 0.269 & -0.209 \end{bmatrix}, \qquad x_2^0 = \begin{bmatrix} -0.305 & -1.218 \\ -0.732 & 0.547 \\ 0.386 & -2.155 \end{bmatrix}$$
• Covariance matrix of group i:

$$\Sigma_i = \frac{(x_i^0)^T x_i^0}{n_i}, \qquad \Sigma_1 = \begin{bmatrix} 0.166 & -0.192 \\ -0.192 & 1.349 \end{bmatrix}, \qquad \Sigma_2 = \begin{bmatrix} 0.259 & -0.286 \\ -0.286 & 2.142 \end{bmatrix}$$
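As a quick check (my own sketch, not part of the slides), the two matrices can be reproduced in a few lines; note that the slides center each group at the global mean, not at the group mean:

```python
import numpy as np

x1 = np.array([[2.95, 6.63], [2.53, 7.79], [3.57, 5.65], [3.16, 5.47]])  # Passed
x2 = np.array([[2.58, 4.46], [2.16, 6.22], [3.27, 3.52]])                # Not passed
mu = np.vstack([x1, x2]).mean(axis=0)     # global mean ~[2.89, 5.68]

for x in (x1, x2):
    x0 = x - mu                           # mean-corrected data x_i^0
    print(x0.T @ x0 / len(x))             # Sigma_i, matching the above up to rounding
```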
Finally, we compute the class-conditional probability of observing curvature = 2.81 and diameter = 5.46 given either a passed or a not-passed chip:

$$p(2.81, 5.46 \mid \text{Passed}) = (2\pi)^{-2/2} \left| \begin{matrix} 0.166 & -0.192 \\ -0.192 & 1.349 \end{matrix} \right|^{-1/2} \exp\!\left( -\tfrac{1}{2} \left( \begin{bmatrix} 2.81 \\ 5.46 \end{bmatrix} - \begin{bmatrix} 3.05 \\ 6.38 \end{bmatrix} \right)^{T} \begin{bmatrix} 0.166 & -0.192 \\ -0.192 & 1.349 \end{bmatrix}^{-1} \left( \begin{bmatrix} 2.81 \\ 5.46 \end{bmatrix} - \begin{bmatrix} 3.05 \\ 6.38 \end{bmatrix} \right) \right)$$

$$p(2.81, 5.46 \mid \text{Not\_passed}) = (2\pi)^{-2/2} \left| \begin{matrix} 0.259 & -0.286 \\ -0.286 & 2.142 \end{matrix} \right|^{-1/2} \exp\!\left( -\tfrac{1}{2} \left( \begin{bmatrix} 2.81 \\ 5.46 \end{bmatrix} - \begin{bmatrix} 2.67 \\ 4.73 \end{bmatrix} \right)^{T} \begin{bmatrix} 0.259 & -0.286 \\ -0.286 & 2.142 \end{bmatrix}^{-1} \left( \begin{bmatrix} 2.81 \\ 5.46 \end{bmatrix} - \begin{bmatrix} 2.67 \\ 4.73 \end{bmatrix} \right) \right)$$
• P = prior probability vector (each row represents the prior probability of one group). If we do not know the prior probabilities, we simply assume they are equal to the sample size of each group divided by the total number of samples:
• p(Passed) = 4/7
• p(Not_passed) = 3/7
Testing phase
In general, we have the most important parts done. However, let's take a look at Bayes rule in this context. Basically, p(x) is there to make sure that if we summed up all the P(j|x) they would equal one. This is simply

$$p(x) = \sum_j p(x \mid j)\, P(j) = p(2.81, 5.46 \mid \text{Passed})\, p(\text{Passed}) + p(2.81, 5.46 \mid \text{Not\_passed})\, p(\text{Not\_passed})$$
Finally, we have everything we need to determine whether a chip ring most likely passed or not, based upon its curvature and diameter. Thus, we can now ask: if we observe curvature = 2.81 and diameter = 5.46, how likely is it that the chip passed?

$$p(\text{Passed} \mid 2.81, 5.46) = \frac{p(2.81, 5.46 \mid \text{Passed})\, p(\text{Passed})}{p(2.81, 5.46)}$$

Next we ask: how likely is it that the chip did not pass, given the same observation?

$$p(\text{Not\_passed} \mid 2.81, 5.46) = \frac{p(2.81, 5.46 \mid \text{Not\_passed})\, p(\text{Not\_passed})}{p(2.81, 5.46)}$$

The end result is that if

$$p(\text{Not\_passed} \mid 2.81, 5.46) > p(\text{Passed} \mid 2.81, 5.46),$$

then the chip ring is most likely classified as Not passed.
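Putting the whole chip-ring example together as a sketch (it simply evaluates both class joints at (2.81, 5.46), normalizes, and prints the resulting decision):

```python
import numpy as np

x1 = np.array([[2.95, 6.63], [2.53, 7.79], [3.57, 5.65], [3.16, 5.47]])  # Passed
x2 = np.array([[2.58, 4.46], [2.16, 6.22], [3.27, 3.52]])                # Not passed
mu = np.vstack([x1, x2]).mean(axis=0)         # global mean, as the slides use

def mvn_pdf(x, mean, cov):
    """Bivariate normal density evaluated at x."""
    diff = x - mean
    norm = (2 * np.pi) ** (-len(x) / 2) * np.linalg.det(cov) ** -0.5
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

x = np.array([2.81, 5.46])
joint = {}
for name, g in (("Passed", x1), ("Not_passed", x2)):
    cov = (g - mu).T @ (g - mu) / len(g)      # global-mean-centered, per the slides
    joint[name] = mvn_pdf(x, g.mean(axis=0), cov) * len(g) / 7
px = sum(joint.values())
for name, j in joint.items():
    print(f"p({name} | 2.81, 5.46) = {j / px:.3f}")
print("decision:", max(joint, key=joint.get))
```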