Material 3: Bayesian Decision Theory
Bayesian Decision Theory
Team teaching
Lecture note for Stat 231: Pattern Recognition and Machine Learning
Lecture 2: Bayesian Decision Theory
1. Diagram and formulation
2. Bayes rule for inference
3. Bayesian decision
4. Discriminant functions and space partition
5. Advanced issues
Diagram of pattern classification: the procedure of pattern recognition and decision making
[Diagram: subjects → observables X → features x → inner belief w → action α]
X --- all the observables, using existing sensors and instruments.
x --- a set of features selected from components of X, or linear/non-linear functions of X.
w --- our inner belief/perception about the subject class.
α --- the action that we take for x.

We denote the three spaces by

$$x \in \Omega_d, \qquad w \in \Omega_c, \qquad \alpha \in \Omega_\alpha,$$

where $x = (x_1, x_2, \ldots, x_d)$ is a vector and $w$ is the index of the class, $\Omega_c = \{w_1, w_2, \ldots, w_k\}$.
Examples
Ex 1: Fish classification
X = I is the image of the fish; x = (brightness, length, fin #, ...).
w is our belief of what the fish type is: Ωc = {“sea bass”, “salmon”, “trout”, ...}.
α is a decision for the fish type; in this case Ωα = Ωc = {“sea bass”, “salmon”, “trout”, ...}.
Ex 2: Medical diagnosis
X = all the available medical tests and imaging scans that a doctor can order for a patient; x = (blood pressure, glucose level, cough, x-ray, ...).
w is an illness type: Ωc = {“Flu”, “cold”, “TB”, “pneumonia”, “lung cancer”, ...}.
α is a decision for treatment: Ωα = {“Tylenol”, “Hospitalize”, ...}.
Tasks
[Diagram: subjects → (control sensors) → observables X → (selecting informative features) → features x → (statistical inference) → inner belief w → (risk/cost minimization) → decision α]
In Bayesian decision theory, we are concerned with the last three steps in the diagram (statistical inference, risk/cost minimization, and the decision), assuming that the observables are given and the features are selected.
Bayes Decision
It is decision making when all the underlying probability distributions are known; it is optimal given that the distributions are known. For two classes ω1 and ω2, the prior probabilities for an unknown new observation are:
P(ω1): the new observation belongs to class 1
P(ω2): the new observation belongs to class 2
P(ω1) + P(ω2) = 1
This reflects our prior knowledge, and it gives our decision rule when no feature of the new object is available: classify as class 1 if P(ω1) > P(ω2).
Bayesian Decision Theory
[Diagram: features x → (statistical inference) → inner belief p(w|x) → (risk/cost minimization) → decision α(x)]
Two probability tables: (a) the prior p(w); (b) the likelihood p(x|w).
A risk/cost function (a two-way table): λ(α | w).
The belief on the class w is computed by the Bayes rule, and the risk is computed from the posterior and the cost:

$$p(w \mid x) = \frac{p(x \mid w)\, p(w)}{p(x)}$$

$$R(\alpha_i \mid x) = \sum_{j=1}^{k} \lambda(\alpha_i \mid w_j)\, p(w_j \mid x)$$
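As a concrete illustration (a minimal numeric sketch, not from the lecture; all values below are made up), the prior, the likelihood at one observed x, and the cost table can be stored as small arrays, and the posterior and conditional risk computed directly:

```python
import numpy as np

# Illustrative tables: prior p(w), likelihood p(x|w) at one observed x,
# and cost lambda(alpha_i | w_j) with rows = actions, columns = true classes.
prior = np.array([0.6, 0.4])
likelihood = np.array([0.30, 0.10])
cost = np.array([[0.0, 1.0],
                 [2.0, 0.0]])

# Bayes rule: p(w|x) = p(x|w) p(w) / p(x); p(x) is the normalizing sum.
joint = likelihood * prior
posterior = joint / joint.sum()

# Conditional risk: R(alpha_i|x) = sum_j lambda(alpha_i|w_j) p(w_j|x).
risk = cost @ posterior
print(posterior, risk)   # posterior ~[0.818, 0.182]; risk ~[0.182, 1.636]
```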
Bayes Decision
We observe features on each object. P(x | ω1) and P(x | ω2) are the class-conditional densities. The Bayes rule then gives the posterior:

$$P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\, P(\omega_i)}{P(x)}$$
Decision Rule
A decision is made to minimize the average cost/risk

$$R = \int R(\alpha(x) \mid x)\, p(x)\, dx,$$

and it is minimized when our decision is made to minimize the cost/risk for each instance x:

$$\alpha(x) = \arg\min_{\alpha \in \Omega_\alpha} R(\alpha \mid x) = \arg\min_{\alpha \in \Omega_\alpha} \sum_{j=1}^{k} \lambda(\alpha \mid w_j)\, p(w_j \mid x)$$

A decision rule is a mapping function from the feature space to the set of actions, $\alpha(x): \Omega_d \to \Omega_\alpha$. We will show that randomized decisions won't be optimal.
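Continuing the sketch above (again my own illustration, not the notes' code), the decision rule is just the argmin of the conditional risk:

```python
import numpy as np

def bayes_decision(likelihood, prior, cost):
    """alpha(x) = argmin_i R(alpha_i | x) for one feature vector x.

    likelihood[j] = p(x|w_j) at x, prior[j] = p(w_j),
    cost[i, j] = lambda(alpha_i | w_j).
    """
    joint = likelihood * prior
    posterior = joint / joint.sum()   # p(w|x) by the Bayes rule
    risk = cost @ posterior           # R(alpha_i|x) for every action i
    return int(np.argmin(risk))       # deterministic: one action per x
```

This also hints at why randomization cannot help: the risk of a randomized rule is a convex combination of the R(α|x) values, so it can never fall below the minimum that the deterministic argmin rule already attains.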
Bayesian error
In a special case, like fish classification, where the action is a classification, we assume a 0/1 loss:

$$\lambda(\alpha_i \mid w_j) = 0 \;\; \text{if } i = j, \qquad \lambda(\alpha_i \mid w_j) = 1 \;\; \text{if } i \neq j$$

The risk for classifying x to class αi is then

$$R(\alpha_i \mid x) = \sum_{j \neq i} p(w_j \mid x) = 1 - p(w_i \mid x)$$

The optimal decision is to choose the class that has maximum posterior probability:

$$\alpha(x) = \arg\min_{\alpha \in \Omega_\alpha} \big(1 - p(\alpha \mid x)\big) = \arg\max_{\alpha \in \Omega_\alpha} p(\alpha \mid x)$$

The total risk for a decision rule, in this case, is called the Bayesian error:

$$R = p(\text{error}) = \int p(\text{error} \mid x)\, p(x)\, dx = \int \big(1 - p(\alpha(x) \mid x)\big)\, p(x)\, dx$$
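Under the 0/1 loss the optimal rule is thus the MAP class, and the Bayes error is the overlap of the class joints. A small numeric sketch (assumed one-dimensional Gaussian class densities with illustrative parameters, not from the lecture):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two assumed class densities with equal priors (illustrative parameters).
xs = np.linspace(-8.0, 8.0, 20001)
joint1 = normal_pdf(xs, -1.0, 1.0) * 0.5   # p(x|w_1) P(w_1)
joint2 = normal_pdf(xs, +1.0, 1.0) * 0.5   # p(x|w_2) P(w_2)

# p(error|x) p(x) is the smaller of the two joints; integrate on the grid.
dx = xs[1] - xs[0]
print(np.minimum(joint1, joint2).sum() * dx)   # ~0.1587 for these parameters
```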
Partition of feature space
The decision rule $\alpha(x): \Omega_d \to \Omega_\alpha$ is a partition/coloring of the feature space into k subspaces:

$$\Omega = \bigcup_{i=1}^{k} \Omega_i, \qquad \Omega_i \cap \Omega_j = \varnothing \;\; \text{for } i \neq j$$

[Figure: a feature space partitioned into regions labeled 1-5]
An example of fish classification
Decision/classification boundaries
Closed-form solutions
In two-class classification, k = 2, with p(x|w) being normal densities, the decision boundaries can be computed in closed form.
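To see where the closed form comes from (a standard derivation sketch, not reproduced from the slides): take the log of the ratio of the two class joints and decide $w_1$ where $g(x) > 0$, with

$$g(x) = \ln\frac{p(x \mid w_1)\, P(w_1)}{p(x \mid w_2)\, P(w_2)} = -\tfrac{1}{2}(x-\mu_1)^T \Sigma_1^{-1}(x-\mu_1) + \tfrac{1}{2}(x-\mu_2)^T \Sigma_2^{-1}(x-\mu_2) - \tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|} + \ln\frac{P(w_1)}{P(w_2)}.$$

The boundary $g(x) = 0$ is quadratic in $x$ in general, and linear when $\Sigma_1 = \Sigma_2$.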
Exercise
• Consider the example of the sea bass / salmon classifier, with these two possible actions: α1: decide the input is sea bass; α2: decide the input is salmon. The priors for sea bass and salmon are 2/3 and 1/3, respectively.
• The cost of classifying a fish as salmon when it truly is sea bass is $2, and the cost of classifying a fish as sea bass when it truly is salmon is $1.
• Find the decision for the input X = 13, where the likelihoods are P(X|ω1) = 0.28 and P(X|ω2) = 0.17; a worked sketch follows below.
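A worked sketch of the exercise (assuming zero cost for correct decisions, since the slide gives only the two error costs):

```python
import numpy as np

prior = np.array([2/3, 1/3])          # P(w_1) sea bass, P(w_2) salmon
likelihood = np.array([0.28, 0.17])   # P(X=13|w_1), P(X=13|w_2)
cost = np.array([[0.0, 1.0],          # alpha_1 "sea bass": $1 if truly salmon
                 [2.0, 0.0]])         # alpha_2 "salmon":   $2 if truly sea bass

posterior = likelihood * prior
posterior /= posterior.sum()          # ~[0.767, 0.233]
risk = cost @ posterior               # R(alpha_1|x) ~0.233, R(alpha_2|x) ~1.534
print(["sea bass", "salmon"][int(np.argmin(risk))])   # -> sea bass
```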
Nominal features
Training data: three binary input dimensions, rash (R), temperature (T), and dizzy (D), and a binary output class (C).
Training phase:
We can summarise the training data in frequency tables and estimate the likelihoods as relative frequencies.
Testing phase:
• Classify the test data by computing the posterior probabilities: if P(C = 1 | X) > 0.5, classify as C = 1; else classify as C = 0 (since it is a two-class problem). A sketch of the procedure follows.
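The slide's frequency tables did not survive extraction, so the sketch below invents a tiny binary training set (hypothetical rows, labeled as such) purely to show the mechanics: estimate the likelihood of a feature pattern as its relative frequency within each class, then threshold the posterior at 0.5.

```python
import numpy as np

# Hypothetical training rows (the original table is not reproduced here).
# Columns: rash R, temperature T, dizzy D, class C -- all binary.
data = np.array([
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
])
X, C = data[:, :3], data[:, 3]

def posterior_c1(x):
    """P(C=1 | X=x) with likelihoods estimated as relative frequencies."""
    joint = {}
    for c in (0, 1):
        rows = X[C == c]
        like = np.mean(np.all(rows == x, axis=1))  # rel. freq. of pattern x in class c
        joint[c] = like * np.mean(C == c)          # likelihood * prior
    total = joint[0] + joint[1]
    return joint[1] / total if total > 0 else 0.5  # unseen pattern: no evidence

x_test = np.array([1, 1, 0])
print("C=1" if posterior_c1(x_test) > 0.5 else "C=0")   # -> C=1 here
```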
Univariate Normal Distribution (Continuous)
Training phase:
We compute the mean and standard deviation of the heights of both trolls and smurfs. Then we compute the class-conditional probability of observing a height given either a smurf or a troll, modeling each as a normal density. Finally, we compute the prior probability of observing a troll or a smurf, regardless of height, as the relative frequency of each class in the training data.
Testing phase:
In general, we have the most important parts done. However, let's take a look at Bayes rule in this context:

$$P(j \mid x) = \frac{p(x \mid j)\, P(j)}{p(x)}$$

We notice that in order to figure out the final probability, we need the marginal probability p(x), which is the base probability of a height. Basically, p(x) is there to make sure that if we summed up all the P(j|x) they would equal one. This is simply

$$p(x) = \sum_j p(x \mid j)\, P(j),$$

which for two classes would be:

$$p(x) = p(x \mid \text{smurf})\, P(\text{smurf}) + p(x \mid \text{troll})\, P(\text{troll})$$
Finally, we have everything we need to determine whether a creature is most likely a troll or a smurf based upon its height. Thus, we can now ask: if we observe a creature that is 2" tall, how likely is it to be a smurf? Next we ask: what is the likelihood that we have a troll, given the observation of 2"? The end result is that if

$$P(\text{smurf} \mid x = 2") > P(\text{troll} \mid x = 2"),$$

then we can decide that we are most likely observing a smurf.
Exercise
Height Creature
2.70” Smurf
2.52” Smurf
2.57” Smurf
2.22” Smurf
3.16” Troll
3.58” Troll
3.16” Troll
If we test a creature that is 2.90" tall, is it more likely to be a Smurf or a Troll?
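One way to answer, as a sketch (a normal density per class; priors as the class frequencies 4/7 and 3/7; the population standard deviation here, though a sample estimate gives the same ordering):

```python
import numpy as np

smurf = np.array([2.70, 2.52, 2.57, 2.22])   # heights from the table
troll = np.array([3.16, 3.58, 3.16])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = 2.90
n = len(smurf) + len(troll)
joint = {                                    # p(x|class) * P(class)
    "Smurf": normal_pdf(x, smurf.mean(), smurf.std()) * len(smurf) / n,
    "Troll": normal_pdf(x, troll.mean(), troll.std()) * len(troll) / n,
}
px = sum(joint.values())                     # marginal p(x)
for name, j in joint.items():
    print(f"P({name} | {x}) = {j / px:.3f}")
```

With these numbers the two posteriors come out close to even, tipping slightly toward Troll; 2.90" sits between the two class means, so the decision is sensitive to the spread estimates.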
The Multivariate Normal Distribution (Continuous)
In the multivariate case we need a mean vector and a variance-covariance matrix. In two dimensions, the so-called bivariate normal, we are concerned with two means, $\mu = (\mu_1, \mu_2)$, and a $2 \times 2$ variance-covariance matrix $\Sigma$.
Curvature  Diameter  Quality Control Result
2.95       6.63      Passed
2.53       7.79      Passed
3.57       5.65      Passed
3.16       5.47      Passed
2.58       4.46      Not passed
2.16       6.22      Not passed
3.27       3.52      Not passed

• As a consultant to the factory, you get the task of setting up the criteria for automatic quality control. The manager of the factory also wants to test your criteria on a new type of chip ring about which even the human experts argue with each other. The new chip ring has curvature 2.81 and diameter 5.46.
• Can you solve this problem by employing a Bayes classifier?
• X = features (or independent variables) of all data. Each row (denoted by $x_k$) represents one object; each column stands for one feature.
• Y = group of the object (or dependent variable) of all data. Each row represents one object, and it has only one column.
Training phase
$$x = \begin{bmatrix} 2.95 & 6.63 \\ 2.53 & 7.79 \\ 3.57 & 5.65 \\ 3.16 & 5.47 \\ 2.58 & 4.46 \\ 2.16 & 6.22 \\ 3.27 & 3.52 \end{bmatrix}, \qquad y = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 2 \\ 2 \\ 2 \end{bmatrix}$$
• $x_k$ = data of row k; for example, $x_3 = \begin{bmatrix} 3.57 & 5.65 \end{bmatrix}$.
• g = number of groups in y; in our example, g = 2.
• $x_i$ = features data for group i. Each row represents one object; each column stands for one feature. We separate x into several groups based on the number of categories in y:

$$x_1 = \begin{bmatrix} 2.95 & 6.63 \\ 2.53 & 7.79 \\ 3.57 & 5.65 \\ 3.16 & 5.47 \end{bmatrix}, \qquad x_2 = \begin{bmatrix} 2.58 & 4.46 \\ 2.16 & 6.22 \\ 3.27 & 3.52 \end{bmatrix}$$
• $\mu_i$ = mean of the features in group i, which is the average of $x_i$: $\mu_1 = \begin{bmatrix} 3.05 & 6.38 \end{bmatrix}$, $\mu_2 = \begin{bmatrix} 2.67 & 4.73 \end{bmatrix}$.
• $\mu$ = global mean vector, that is, the mean of the whole data set. In this example, $\mu = \begin{bmatrix} 2.88 & 5.676 \end{bmatrix}$.
• $x_i^0$ = mean-corrected data, that is, the features data for group i, $x_i$, minus the global mean vector $\mu$:

$$x_1^0 = \begin{bmatrix} 0.060 & 0.951 \\ -0.357 & 2.109 \\ 0.679 & -0.025 \\ 0.269 & -0.209 \end{bmatrix}, \qquad x_2^0 = \begin{bmatrix} -0.305 & -1.218 \\ -0.732 & 0.547 \\ 0.386 & -2.155 \end{bmatrix}$$
• Covariance matrix of group i:

$$\Sigma_i = \frac{(x_i^0)^T x_i^0}{n_i}, \qquad \Sigma_1 = \begin{bmatrix} 0.166 & -0.192 \\ -0.192 & 1.349 \end{bmatrix}, \qquad \Sigma_2 = \begin{bmatrix} 0.259 & -0.286 \\ -0.286 & 2.142 \end{bmatrix}$$
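As a quick check (my own sketch, not part of the slides), the two matrices can be reproduced in a few lines; note that the slides center each group at the global mean, not at the group mean:

```python
import numpy as np

x1 = np.array([[2.95, 6.63], [2.53, 7.79], [3.57, 5.65], [3.16, 5.47]])  # Passed
x2 = np.array([[2.58, 4.46], [2.16, 6.22], [3.27, 3.52]])                # Not passed
mu = np.vstack([x1, x2]).mean(axis=0)     # global mean ~[2.89, 5.68]

for x in (x1, x2):
    x0 = x - mu                           # mean-corrected data x_i^0
    print(x0.T @ x0 / len(x))             # Sigma_i, matching the above up to rounding
```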
Finally, we compute the class-conditional probability of observing curvature = 2.81 and diameter = 5.46 given either a passed or a not-passed chip:

$$p(2.81, 5.46 \mid \text{Passed}) = (2\pi)^{-2/2} \left| \begin{matrix} 0.166 & -0.192 \\ -0.192 & 1.349 \end{matrix} \right|^{-1/2} \exp\!\left( -\tfrac{1}{2} \left( \begin{bmatrix} 2.81 \\ 5.46 \end{bmatrix} - \begin{bmatrix} 3.05 \\ 6.38 \end{bmatrix} \right)^{T} \begin{bmatrix} 0.166 & -0.192 \\ -0.192 & 1.349 \end{bmatrix}^{-1} \left( \begin{bmatrix} 2.81 \\ 5.46 \end{bmatrix} - \begin{bmatrix} 3.05 \\ 6.38 \end{bmatrix} \right) \right)$$

$$p(2.81, 5.46 \mid \text{Not\_passed}) = (2\pi)^{-2/2} \left| \begin{matrix} 0.259 & -0.286 \\ -0.286 & 2.142 \end{matrix} \right|^{-1/2} \exp\!\left( -\tfrac{1}{2} \left( \begin{bmatrix} 2.81 \\ 5.46 \end{bmatrix} - \begin{bmatrix} 2.67 \\ 4.73 \end{bmatrix} \right)^{T} \begin{bmatrix} 0.259 & -0.286 \\ -0.286 & 2.142 \end{bmatrix}^{-1} \left( \begin{bmatrix} 2.81 \\ 5.46 \end{bmatrix} - \begin{bmatrix} 2.67 \\ 4.73 \end{bmatrix} \right) \right)$$
• P = prior probability vector (each row represents the prior probability of one group). If we do not know the prior probabilities, we simply assume they are equal to the sample size of each group divided by the total number of samples:
• p(Passed) = 4/7
• p(Not_passed) = 3/7
Testing phase
In general, we have the most important parts done. However, let's take a look at Bayes rule in this context. Basically, p(x) is there to make sure that if we summed up all the P(j|x) they would equal one. This is simply

$$p(x) = \sum_j p(x \mid j)\, P(j) = p(2.81, 5.46 \mid \text{Passed})\, p(\text{Passed}) + p(2.81, 5.46 \mid \text{Not\_passed})\, p(\text{Not\_passed})$$
Finally, we have everything we need to determine whether a chip ring most likely passed or not, based upon its curvature and diameter. Thus, we can now ask: if we observe curvature = 2.81 and diameter = 5.46, how likely is it that the chip passed?

$$p(\text{Passed} \mid 2.81, 5.46) = \frac{p(2.81, 5.46 \mid \text{Passed})\, p(\text{Passed})}{p(2.81, 5.46)}$$

Next we ask: how likely is it that the chip did not pass, given the same observation?

$$p(\text{Not\_passed} \mid 2.81, 5.46) = \frac{p(2.81, 5.46 \mid \text{Not\_passed})\, p(\text{Not\_passed})}{p(2.81, 5.46)}$$

The end result is that if

$$p(\text{Not\_passed} \mid 2.81, 5.46) > p(\text{Passed} \mid 2.81, 5.46),$$

then the chip ring is most likely classified as Not passed.
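Putting the whole chip-ring example together as a sketch (it simply evaluates both class joints at (2.81, 5.46), normalizes, and prints the resulting decision):

```python
import numpy as np

x1 = np.array([[2.95, 6.63], [2.53, 7.79], [3.57, 5.65], [3.16, 5.47]])  # Passed
x2 = np.array([[2.58, 4.46], [2.16, 6.22], [3.27, 3.52]])                # Not passed
mu = np.vstack([x1, x2]).mean(axis=0)         # global mean, as the slides use

def mvn_pdf(x, mean, cov):
    """Bivariate normal density evaluated at x."""
    diff = x - mean
    norm = (2 * np.pi) ** (-len(x) / 2) * np.linalg.det(cov) ** -0.5
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

x = np.array([2.81, 5.46])
joint = {}
for name, g in (("Passed", x1), ("Not_passed", x2)):
    cov = (g - mu).T @ (g - mu) / len(g)      # global-mean-centered, per the slides
    joint[name] = mvn_pdf(x, g.mean(axis=0), cov) * len(g) / 7
px = sum(joint.values())
for name, j in joint.items():
    print(f"p({name} | 2.81, 5.46) = {j / px:.3f}")
print("decision:", max(joint, key=joint.get))
```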