PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule
METU Informatics Institute Min 720 Pattern Classification with Bio-Medical Applications
Statistical Approach to P.R.

• Dimension of the feature space: $d$; feature vector $X = [X_1, X_2, \dots, X_d]$.
• Set of different states of nature (categories): $\{\omega_1, \omega_2, \dots, \omega_c\}$; the task is to find the true category of $X$.
• Set of possible actions (decisions): $\{\alpha_1, \alpha_2, \dots, \alpha_a\}$. Here, a decision might include a 'reject option'.
• A discriminant function $g_i(X)$, $1 \le i \le c$, assigns $X$ to region $R_i$ where $g_i(X) > g_j(X)$ for all $j \ne i$; the regions are disjoint ($R_i \cap R_j = \emptyset$) and their union covers the feature space.
• Decision rule: decide $\omega_k$ if $g_k(X) > g_j(X)$ for all $j \ne k$.

(The slide's figure shows the feature space partitioned into regions $R_1, R_2, R_3$ with discriminants $g_1, g_2, g_3$.)
A Pattern Classifier

So our aim now will be to define these functions $g_1, g_2, \dots, g_c$ so as to minimize or optimize a criterion.

(The slide's figure shows a classifier block diagram: the input $X$ feeds $g_1(X), g_2(X), \dots, g_c(X)$, a maximum selector picks the largest, and the output is the decision $\omega_k$.)
Parametric Approach to Classification

• 'Bayes Decision Theory' is used for minimum-error / minimum-risk pattern classifier design.
• Here, it is assumed that if a sample $X$ is drawn from class $\omega_i$, it is a random variable represented with a multivariate probability density function, the 'class-conditional density function' $P(X \mid \omega_i)$.
• We also know the a-priori probability $P(\omega_i)$, $1 \le i \le c$ ($c$ is the number of classes).
• Then, we can talk about a decision rule that minimizes the probability of error.
• Suppose we have the observation $X$. This observation is going to change the a-priori assumption into the a-posteriori probability $P(\omega_i \mid X)$, which can be found by the Bayes Rule.
• $P(X)$ can be found by the Total Probability Rule: when the $\omega_i$'s are disjoint,

$$P(X) = \sum_{i=1}^{c} P(X \mid \omega_i)\,P(\omega_i)$$

• Bayes Rule:

$$P(\omega_i \mid X) = \frac{P(X \mid \omega_i)\,P(\omega_i)}{P(X)}$$

• Decision Rule: choose the category with the highest a-posteriori probability, calculated as above using the Bayes Rule.
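The two formulas above can be sketched in a few lines of code. This is a minimal illustration with made-up likelihood values, not part of the lecture; the priors 0.7/0.3 echo the diabetes example used later.

```python
import numpy as np

# Hypothetical numbers: two classes with priors 0.7 / 0.3 and
# assumed likelihood values P(X | w_i) for one observed sample X.
priors = np.array([0.7, 0.3])
likelihoods = np.array([0.2, 0.6])        # P(X | w_1), P(X | w_2)

# Total probability rule gives the evidence P(X).
evidence = np.sum(likelihoods * priors)

# Bayes rule gives the posteriors P(w_i | X).
posteriors = likelihoods * priors / evidence

print(posteriors)             # sums to 1
print(int(np.argmax(posteriors)))  # index of the chosen category
```

Here the weighted likelihoods are 0.14 and 0.18, so the second category wins even though its prior is smaller.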
• Take $g_i(X) = P(\omega_i \mid X)$. Then the decision boundary between regions $R_1$ and $R_2$ is where $g_1 = g_2$, with $g_1 > g_2$ in $R_1$ and $g_2 > g_1$ in $R_2$.
• In general, decision boundaries are where $g_i(X) = g_j(X)$, between regions $R_i$ and $R_j$.
• Single feature: the decision boundary is a point; 2 features: a curve; 3 features: a surface; more than 3: a hypersurface.
• Since $P(X)$ is common to all classes, instead of $g_i(X) = P(X \mid \omega_i)P(\omega_i)/P(X)$ we may use

$$g_i(X) = P(X \mid \omega_i)\,P(\omega_i)$$

• Sometimes it is easier to work with logarithms. Since the logarithmic function is monotonically increasing, the log form gives the same result:

$$g_i(X) = \log[P(X \mid \omega_i)P(\omega_i)] = \log P(X \mid \omega_i) + \log P(\omega_i)$$
2 Category Case:

Assign $X$ to $c_1$ if $P(\omega_1 \mid X) > P(\omega_2 \mid X)$, otherwise to $c_2$.

But this is the same as: $c_1$ if

$$P(X \mid \omega_1)\,P(\omega_1) > P(X \mid \omega_2)\,P(\omega_2)$$

By throwing away the $P(X)$'s, we end up with: $c_1$ if

$$\frac{P(X \mid \omega_1)}{P(X \mid \omega_2)} > \frac{P(\omega_2)}{P(\omega_1)} = k$$

where the quantity on the left is called the likelihood ratio.
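The likelihood-ratio test above can be written as a tiny helper. This is a sketch, not slide material; the function name and the sample numbers are made up.

```python
# Likelihood-ratio rule for the 2-category case:
# decide c1 iff P(X|c1)/P(X|c2) > P(c2)/P(c1) = k.
def choose_class(p_x_given_1, p_x_given_2, p1, p2):
    # Compare the likelihood ratio against the prior ratio threshold k.
    k = p2 / p1
    return 1 if p_x_given_1 / p_x_given_2 > k else 2

print(choose_class(0.5, 0.1, 0.5, 0.5))  # likelihood ratio 5 > 1 -> class 1
```

With equal priors the threshold is $k = 1$ and the rule reduces to picking the larger likelihood.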
• Example: a single-feature, 2-category problem with Gaussian densities: diagnosis of diabetes using sugar count $X$.
• $c_1$: state of being healthy, $P(c_1) = 0.7$
• $c_2$: state of being sick (diabetes), $P(c_2) = 0.3$

$$P(X \mid c_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-(X-m_1)^2 / 2\sigma_1^2} \qquad P(X \mid c_2) = \frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-(X-m_2)^2 / 2\sigma_2^2}$$

• The decision rule: decide $c_1$ if

$$0.7\,P(X \mid c_1) > 0.3\,P(X \mid c_2)$$

(The slide's figure shows the weighted densities $P(X \mid c_1)P(c_1)$ and $P(X \mid c_2)P(c_2)$ over $X$, with means $m_1$, $m_2$ and the decision boundary at their intersection point $d$.)
• Assume now: $\sigma_1 = \sigma_2 = 2$, $m_1 = 10$, $m_2 = 20$, and we measured $X = 17$.
• Assign the unknown sample $X$ to the correct category.
• Find the likelihood ratio for $X = 17$:

$$\frac{P(X \mid c_1)}{P(X \mid c_2)} = \frac{e^{-(17-10)^2/8}}{e^{-(17-20)^2/8}} = e^{-5} \approx 0.007$$

• Compare with:

$$\frac{P(c_2)}{P(c_1)} = \frac{0.3}{0.7} \approx 0.43$$

• Since $0.007 < 0.43$, assign $X$ to $c_2$.
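The arithmetic of this example can be checked directly. The sketch below uses the slide's numbers ($m_1 = 10$, $m_2 = 20$, $\sigma = 2$, priors 0.7/0.3, $X = 17$); the function names are ours.

```python
import math

m1, m2, sigma = 10.0, 20.0, 2.0   # means and common std dev from the slide
p1, p2 = 0.7, 0.3                 # priors P(c1), P(c2)
X = 17.0

def gauss(x, m, s):
    # Univariate Gaussian density with mean m and std dev s.
    return math.exp(-(x - m) ** 2 / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)

ratio = gauss(X, m1, sigma) / gauss(X, m2, sigma)   # equals e^{-5} here
threshold = p2 / p1                                 # = 3/7, about 0.43
decision = "c1" if ratio > threshold else "c2"
print(ratio, threshold, decision)   # ratio ~ 0.0067 < 0.43 -> c2
```

The normalization constants cancel in the ratio because the variances are equal, which is why only the exponents matter on the slide.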
• Example: a discrete problem. Consider a 2-feature, 3-category case where the class-conditional densities are uniform on squares:

$$P(X_1, X_2 \mid c_i) = \frac{1}{(b_i - a_i)^2} \quad \text{for } a_i \le X_1 \le b_i \text{ and } a_i \le X_2 \le b_i, \qquad 0 \text{ otherwise}$$

• with $(a_1, b_1) = (1, 3)$, $(a_2, b_2) = (0.5, 3.5)$, $(a_3, b_3) = (3, 4)$, and $P(c_1) = 0.4$, $P(c_2) = 0.4$, $P(c_3) = 0.2$.
• Find the decision boundaries and regions.
• Solution: compare $P(X_1, X_2 \mid c_i)\,P(c_i)$ wherever the squares overlap:

$$\frac{1}{4}(0.4) = 0.1, \qquad \frac{1}{9}(0.4) \approx 0.044, \qquad 1 \cdot (0.2) = 0.2$$

• So $R_1$ is the square $[1,3]^2$ (where $0.1 > 0.044$), $R_3$ is the square $[3,4]^2$ (where $0.2$ dominates), and $R_2$ is the remainder of $[0.5, 3.5]^2$.
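The region assignments can be reproduced numerically. This sketch assumes the interval values recovered from the slide fragments, $a = (1, 0.5, 3)$ and $b = (3, 3.5, 4)$ with priors 0.4/0.4/0.2; the helper names are ours.

```python
# Uniform-square class-conditional densities (values as reconstructed above).
a = [1.0, 0.5, 3.0]
b = [3.0, 3.5, 4.0]
priors = [0.4, 0.4, 0.2]

def density(x1, x2, i):
    # 1/(b_i - a_i)^2 on the square [a_i, b_i] x [a_i, b_i], else 0.
    if a[i] <= x1 <= b[i] and a[i] <= x2 <= b[i]:
        return 1.0 / (b[i] - a[i]) ** 2
    return 0.0

def decide(x1, x2):
    # Pick the category with the largest P(X1, X2 | c_i) P(c_i).
    scores = [density(x1, x2, i) * priors[i] for i in range(3)]
    return max(range(3), key=lambda i: scores[i]) if max(scores) > 0 else None

print(decide(2.0, 2.0))   # inside [1,3]^2: 0.1 vs 0.044 -> category 0 (c1)
print(decide(3.5, 3.5))   # inside [3,4]^2: 0.2 dominates -> category 2 (c3)
```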
• Remember now that for the 2-class case, we decide $c_1$ if

$$P(X \mid c_1)\,P(c_1) > P(X \mid c_2)\,P(c_2)$$

• or, in likelihood-ratio form,

$$\frac{P(X \mid c_1)}{P(X \mid c_2)} > \frac{P(c_2)}{P(c_1)} = k$$

• Error probabilities and a simple proof of minimum error.
• Consider again a 2-class, 1-d problem: $R_1$ lies on one side of the boundary and $R_2$ on the other, with $P(X \mid c_1)P(c_1)$ the larger weighted density in $R_1$ and $P(X \mid c_2)P(c_2)$ the larger in $R_2$.
• Let's show that the error is minimized if the decision boundary is $d$ (the intersection point, where $P(X \mid c_1)P(c_1) = P(X \mid c_2)P(c_2)$) rather than any arbitrary point.
• Then $P(E)$ (the probability of error) is minimum. We have

$$P(E) = P(X \in R_2, c_1) + P(X \in R_1, c_2) = P(X \in R_2 \mid c_1)\,P(c_1) + P(X \in R_1 \mid c_2)\,P(c_2)$$

$$= \left[\int_{R_2} P(X \mid c_1)\,dX\right] P(c_1) + \left[\int_{R_1} P(X \mid c_2)\,dX\right] P(c_2)$$

• It can very easily be seen that $P(E)$ is minimum if the boundary is at $d$: moving the boundary away from the intersection point adds area under the larger of the two weighted densities to one of the integrals above.
Minimum Risk Classification

The risk associated with an incorrect decision might be more important than the probability of error. So our decision criterion might be modified to minimize the average risk in making an incorrect decision.

We define a conditional risk (expected loss) for decision $\alpha_i$ when $X$ occurs as:

$$R(\alpha_i \mid X) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid X)$$

where $\lambda(\alpha_i \mid \omega_j)$ is defined as the conditional loss associated with decision $\alpha_i$ when the true class is $\omega_j$. It is assumed that $\lambda(\alpha_i \mid \omega_j)$ is known.

The decision rule: decide on $\alpha_i$ if $R(\alpha_i \mid X) < R(\alpha_j \mid X)$ for all $j \ne i$, $1 \le j \le c$.

The discriminant function here can be defined as: $g_i(X) = -R(\alpha_i \mid X)$.
• We can show that the minimum-error decision is a special case of the above rule where:

$$\lambda(\alpha_i \mid \omega_i) = 0, \qquad \lambda(\alpha_i \mid \omega_j) = 1 \ \text{ for } j \ne i$$

• then,

$$R(\alpha_i \mid X) = \sum_{j \ne i} P(\omega_j \mid X) = 1 - P(\omega_i \mid X)$$

• so the rule is: decide $\omega_i$ if $P(\omega_i \mid X) > P(\omega_j \mid X)$ for all $j \ne i$.
• For the 2-category case, the minimum-risk classifier becomes (writing $\lambda_{ij} = \lambda(\alpha_i \mid \omega_j)$):

$$R(\alpha_1 \mid X) = \lambda_{11}\,P(\omega_1 \mid X) + \lambda_{12}\,P(\omega_2 \mid X)$$

$$R(\alpha_2 \mid X) = \lambda_{21}\,P(\omega_1 \mid X) + \lambda_{22}\,P(\omega_2 \mid X)$$

• Decide $\omega_1$ if $R(\alpha_1 \mid X) < R(\alpha_2 \mid X)$, that is, if

$$(\lambda_{21} - \lambda_{11})\,P(X \mid \omega_1)\,P(\omega_1) > (\lambda_{12} - \lambda_{22})\,P(X \mid \omega_2)\,P(\omega_2)$$

• or equivalently if

$$\frac{P(X \mid \omega_1)}{P(X \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$

• Otherwise, decide $\omega_2$.
• This is the same as the likelihood rule if $\lambda_{12} - \lambda_{22} = \lambda_{21} - \lambda_{11}$, e.g. $\lambda_{12} = \lambda_{21} = 1$ and $\lambda_{11} = \lambda_{22} = 0$.
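The conditional-risk rule is a small matrix-vector product. The sketch below assumes a made-up loss matrix in which missing class $\omega_2$ is ten times as costly as a false alarm; the function name is ours.

```python
import numpy as np

# Loss matrix: lam[i][j] = loss for taking action a_i when the true
# class is w_j. Hypothetical values: missing w_2 costs 10, false alarm 1.
lam = np.array([[0.0, 10.0],
                [1.0,  0.0]])

def min_risk_action(posteriors):
    # R(a_i | X) = sum_j lam[i][j] * P(w_j | X); pick the smallest risk.
    risks = lam @ posteriors
    return int(np.argmin(risks))

print(min_risk_action(np.array([0.8, 0.2])))  # risks [2.0, 0.8] -> action 1
```

Note the asymmetry: even though $\omega_1$ has posterior 0.8, the heavy loss on missing $\omega_2$ tips the decision to action 1, which is exactly the point of minimum-risk classification.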
Discriminant Functions so far

For Minimum Error:

$$g_i(X) = \frac{P(X \mid \omega_i)\,P(\omega_i)}{P(X)} \quad \text{or} \quad g_i(X) = P(X \mid \omega_i)\,P(\omega_i) \quad \text{or} \quad g_i(X) = \log P(X \mid \omega_i) + \log P(\omega_i)$$

For Minimum Risk:

$$g_i(X) = -R(\alpha_i \mid X), \qquad \text{where } R(\alpha_i \mid X) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid X)$$
Bayes (Maximum Likelihood) Decision:

• Most general optimal solution
• Provides an upper limit (you cannot do better with any other rule)
• Useful in comparing with other classifiers
Special Cases of Discriminant Functions

Multivariate Gaussian (Normal) Density $N(M, \Sigma)$:

The general density form:

$$P(X) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(X-M)^T \Sigma^{-1} (X-M)}$$

Here $X$ is the feature vector of size $d$.

• $M$: $d$-element mean vector, $M = E(X) = [\mu_1, \mu_2, \dots, \mu_d]^T$
• $\Sigma$: $d \times d$ covariance matrix with elements

$$\sigma_{ij} = E[(X_i - \mu_i)(X_j - \mu_j)], \qquad \sigma_{ii} = \sigma_i^2 = E[(X_i - \mu_i)^2] \ \text{(variance of feature } X_i)$$

• $\Sigma$ is symmetric, and $\sigma_{ij} = 0$ when $X_i$ and $X_j$ are statistically independent.
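The density form above can be evaluated directly. This is a sketch (the function name is ours); at the mean of a 2-d standard normal the exponent vanishes, so the value is $1/2\pi$.

```python
import numpy as np

def mvn_pdf(X, M, Sigma):
    # (2*pi)^{-d/2} |Sigma|^{-1/2} exp(-1/2 (X-M)^T Sigma^{-1} (X-M))
    d = len(M)
    diff = X - M
    mahal = diff @ np.linalg.inv(Sigma) @ diff      # Mahalanobis distance (squared)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * mahal) / norm

X = np.array([0.0, 0.0])
M = np.array([0.0, 0.0])
Sigma = np.eye(2)
print(mvn_pdf(X, M, Sigma))   # 1/(2*pi), about 0.1592, at the mean
```

In practice a library routine (e.g. an existing multivariate-normal pdf) would be used; the explicit version just mirrors the slide's formula term by term.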
• $|\Sigma|$: determinant of $\Sigma$.
• General shape: hyperellipsoids on which the Mahalanobis distance

$$(X - M)^T \Sigma^{-1} (X - M)$$

is constant.
• 2-d problem: $X = [X_1, X_2]^T$,

$$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{bmatrix}$$

• If $\sigma_{12} = \sigma_{21} = 0$ (statistically independent features), then the major axes of the equal-density ellipsoids are parallel to the coordinate axes.
• If in addition $\sigma_1^2 = \sigma_2^2$, the equal-density curves are circular.
• In general, the equal-density curves are hyperellipsoids. Now

$$g_i(X) = \log_e P(X \mid \omega_i) + \log_e P(\omega_i)$$

is used for $g_i(X)$ because of its ease in manipulation. For $P(X \mid \omega_i) = N(M_i, \Sigma_i)$, dropping the constant $-(d/2)\log 2\pi$ common to all classes:

$$g_i(X) = -\tfrac{1}{2}(X - M_i)^T \Sigma_i^{-1} (X - M_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$

• $g_i(X)$ is a quadratic function of $X$, as will be shown.
Expanding, and noting that $M_i^T \Sigma_i^{-1} X = X^T \Sigma_i^{-1} M_i$ (a scalar), then:

$$g_i(X) = -\tfrac{1}{2} X^T \Sigma_i^{-1} X + M_i^T \Sigma_i^{-1} X - \tfrac{1}{2} M_i^T \Sigma_i^{-1} M_i - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$

With

$$W_i = -\tfrac{1}{2}\Sigma_i^{-1}, \qquad V_i = \Sigma_i^{-1} M_i, \qquad W_{i0} = -\tfrac{1}{2} M_i^T \Sigma_i^{-1} M_i - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$

this becomes

$$g_i(X) = X^T W_i X + V_i^T X + W_{i0}$$

On the decision boundary, $g_i(X) = g_j(X)$, that is:

$$X^T (W_i - W_j) X + (V_i - V_j)^T X + (W_{i0} - W_{j0}) = 0$$
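The definitions of $W_i$, $V_i$, $W_{i0}$ translate directly into code. This is a sketch with made-up means, identity covariances, and equal priors; the function name is ours.

```python
import numpy as np

def quad_discriminant(M, Sigma, prior):
    # Build g_i(X) = X^T W X + V^T X + W0 from the class parameters.
    Sinv = np.linalg.inv(Sigma)
    W = -0.5 * Sinv
    V = Sinv @ M
    W0 = -0.5 * M @ Sinv @ M - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return lambda X: X @ W @ X + V @ X + W0

# Two hypothetical classes with identity covariance and equal priors.
g1 = quad_discriminant(np.array([0.0, 0.0]), np.eye(2), 0.5)
g2 = quad_discriminant(np.array([4.0, 0.0]), np.eye(2), 0.5)

X = np.array([1.0, 0.0])
print("class 1" if g1(X) > g2(X) else "class 2")   # X nearer M1 -> class 1
```

With equal identity covariances the quadratic terms cancel on the boundary, which previews the linear special cases on the next slides.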
The decision boundary function is hyperquadratic in general. Example in 2-d:

$$X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad V = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}, \qquad W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}$$

Then the above boundary becomes:
$$w_{11} x_1^2 + w_{22} x_2^2 + 2 w_{12} x_1 x_2 + v_1 x_1 + v_2 x_2 + W_0 = 0$$

the general form of a hyperquadratic boundary in 2-d.

The special cases of the Gaussian: assume $\Sigma_i = \sigma^2 I$, where $I$ is the unit matrix, so

$$\Sigma_i = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}$$
With $\Sigma_i = \sigma^2 I$ we have $\Sigma_i^{-1} = \frac{1}{\sigma^2} I$, and $\log|\Sigma_i|$ is not a function of $X$ or $i$, so it can be removed:

$$g_i(X) = -\frac{(X - M_i)^T (X - M_i)}{2\sigma^2} + \log P(\omega_i) = -\frac{\|X - M_i\|^2}{2\sigma^2} + \log P(\omega_i)$$

where $\|X - M_i\| = d(X, M_i)$ is the Euclidean distance between $X$ and $M_i$.

Now assume in addition $P(\omega_i) = P(\omega_j)$; then $g_i(X)$ reduces to $-\|X - M_i\|^2$, and the decision boundary is linear!
Decision Rule: assign the unknown sample to the closest mean's category.

(The slide's figure shows two means $M_1$, $M_2$ with distances $d_1$, $d_2$ from the unknown sample; the boundary $d$ is the perpendicular bisector of the segment $M_1 M_2$, which will move toward the less probable category when $P(\omega_i) \ne P(\omega_j)$.)
Minimum Distance Classifier

• Classify an unknown sample $X$ to the category with the closest mean!
• Optimum when the densities are Gaussian with equal variance and equal a-priori probability.
• The boundary is piecewise linear in case of more than 2 categories.

(The slide's figure shows four means $M_1, \dots, M_4$ and the corresponding piecewise-linear regions $R_1, \dots, R_4$.)
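The minimum distance classifier fits in a few lines. This is a sketch with made-up means; the function name is ours.

```python
import numpy as np

# Hypothetical class means M1, M2, M3.
means = np.array([[0.0, 0.0],
                  [5.0, 0.0],
                  [0.0, 5.0]])

def classify(X):
    # Assign X to the category whose mean is closest in Euclidean distance.
    dists = np.linalg.norm(means - X, axis=1)
    return int(np.argmin(dists))

print(classify(np.array([1.0, 1.0])))   # closest to M1 -> 0
print(classify(np.array([4.0, 1.0])))   # closest to M2 -> 1
```

Each pairwise boundary is the perpendicular bisector between two means, so with three or more means the regions are the piecewise-linear cells of the slide's figure.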
• Another special case: $\Sigma_i = \Sigma$ (all covariance matrices are the same), so the samples of all classes fall in clusters of equal size and shape. It can be shown that

$$g_i(X) = -\tfrac{1}{2}(X - M_i)^T \Sigma^{-1} (X - M_i) + \log P(\omega_i)$$

• When $P(\omega_i) = P(\omega_j)$, this reduces to comparing

$$(X - M_i)^T \Sigma^{-1} (X - M_i)$$

which is called the Mahalanobis distance (of the unknown sample to $M_i$).
Then, if $P(\omega_i) = P(\omega_j)$, the decision rule is: decide $\omega_i$ if

(Mahalanobis distance of the unknown sample to $M_i$) < (Mahalanobis distance of the unknown sample to $M_j$)

If $P(\omega_i) \ne P(\omega_j)$, the boundary moves toward the less probable one.
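The shared-covariance rule can be sketched the same way, swapping Euclidean for Mahalanobis distance. The covariance matrix and means below are made up; the function name is ours.

```python
import numpy as np

# Hypothetical shared covariance and class means (equal priors assumed).
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sinv = np.linalg.inv(Sigma)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]

def mahalanobis_sq(X, M):
    # Squared Mahalanobis distance (X - M)^T Sigma^{-1} (X - M).
    diff = X - M
    return diff @ Sinv @ diff

X = np.array([1.0, 1.0])
d2 = [mahalanobis_sq(X, M) for M in means]
print(int(np.argmin(d2)))   # assign to the class with the smaller distance
```

Unlike the Euclidean rule, this metric stretches distances along low-variance directions, so a sample can be assigned to a mean that is farther away in the ordinary sense.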
Binary Random Variables

• Discrete features: features can take only discrete values; integrals are replaced by summations.
• Binary features: each $X_i$ is 0 or 1, with $X = [X_1, X_2, \dots, X_d]^T$.
• Assume the binary features are statistically independent, and write

$$p_i = P(X_i = 1 \mid \omega_1), \qquad q_i = P(X_i = 1 \mid \omega_2)$$
Binary Random Variables

Example: bit matrix for machine-printed characters. Here, each pixel may be taken as a feature $X_i$, with $X_i = 1$ when the pixel is dark. For the above problem we have $d$ features (one per pixel), and $p_i$ is the probability that $X_i = 1$ for letter A, B, ...

(The slide's figure shows bit matrices of the letters A and B, with one cell labeled 'a pixel'.)
For class $\omega_k$, the single-feature probability is

$$P(X_i = x_i \mid \omega_k) = p_{ki}^{x_i}\,(1 - p_{ki})^{1 - x_i}$$

defined for $x_i \in \{0, 1\}$ and undefined elsewhere, where $p_{ki} = P(X_i = 1 \mid \omega_k)$.

• If statistical independence of the features is assumed,

$$P(X \mid \omega_k) = \prod_{i=1}^{d} p_{ki}^{x_i}\,(1 - p_{ki})^{1 - x_i}$$

so

$$g_k(X) = \log P(X \mid \omega_k) + \log P(\omega_k) = \sum_{i=1}^{d} \left[x_i \log p_{ki} + (1 - x_i)\log(1 - p_{ki})\right] + \log P(\omega_k)$$

• Consider the 2-category problem; assume

$$P(X_i = x_i \mid \omega_1) = p_i^{x_i}(1 - p_i)^{1 - x_i}, \qquad P(X_i = x_i \mid \omega_2) = q_i^{x_i}(1 - q_i)^{1 - x_i}$$
Then the decision boundary $g(X) = g_1(X) - g_2(X) = 0$ is obtained from

$$g(X) = \sum_{i=1}^{d} \left[x_i \log \frac{p_i}{q_i} + (1 - x_i)\log \frac{1 - p_i}{1 - q_i}\right] + \log \frac{P(\omega_1)}{P(\omega_2)}$$

So decide category $\omega_1$ if $g(X) > 0$, else $\omega_2$, where

$$g(X) = \sum_{i=1}^{d} W_i x_i + W_0$$

The decision boundary is linear in $X$: a weighted sum of the inputs, where

$$W_i = \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}$$

and

$$W_0 = \sum_{i=1}^{d} \log \frac{1 - p_i}{1 - q_i} + \log \frac{P(\omega_1)}{P(\omega_2)}$$
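The linear classifier above can be sketched end to end. The per-feature probabilities $p_i$, $q_i$ and the equal priors below are made up; the weights follow the $W_i$, $W_0$ formulas just derived.

```python
import numpy as np

# Hypothetical per-feature probabilities for two categories.
p = np.array([0.8, 0.6, 0.7])    # P(X_i = 1 | w_1)
q = np.array([0.2, 0.5, 0.3])    # P(X_i = 1 | w_2)
P1, P2 = 0.5, 0.5                # equal priors

# Weights from the slide: W_i = log[p_i(1-q_i) / (q_i(1-p_i))],
# W_0 = sum_i log[(1-p_i)/(1-q_i)] + log(P1/P2).
W = np.log(p * (1 - q) / (q * (1 - p)))
W0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)

def decide(x):
    # Decide w_1 if the weighted sum g(X) = W.x + W0 is positive.
    return 1 if W @ x + W0 > 0 else 2

print(decide(np.array([1, 1, 1])))   # all features on favors w_1
print(decide(np.array([0, 0, 0])))   # all features off favors w_2
```

This is exactly a naive Bayes classifier over independent binary features, and the derivation shows why its decision boundary is a hyperplane.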