TRANSCRIPT
ENG 8801/9881 - Special Topics in Computer Engineering: Pattern Recognition
Memorial University of Newfoundland
Pattern Recognition
Lecture 18, July 13, 2006
Charles Robertson
http://www.engr.mun.ca/~charlesr
Office Hours: Tuesdays & Thursdays 8:30 - 9:30 PM
EN-3026
Dates and Times
• Presentations next week
• Assignment 5 due on Monday, July 24th
• Final Reports due on July 28th
Presentations - July 18th and 20th
• July 18th
• Liang Chen
• Chen Hao
• Chao Ying
• Shenqiu Zhang
• July 20th
• Zhang Chong
• Anjana Punchihewa
• Liang Zhang
• Yan Zhang
Recap
• Feature selection
• choose n of m measurements
• Evaluation:
• Feature interclass distance
• Selection:
• Feature ranking
• Incrementally best feature
• Successive additions/deletions
Feature Extraction
Given m measurements $\{x_1, \dots, x_m\}$, find $n < m$ functions $y_j = f_j(x_1, \dots, x_m)$, $j = 1..n$, which produce the n best features. Some applications warrant non-linear functions, including look-up tables. We will only consider the linear case:

$$y = A x$$
Find suitable criteria for A.
Usual approaches:
- Intraclass distance (for 1 class)
- Interclass distance (for k labelled classes)
- Representation error (clustering)
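To make the linear case concrete, here is a minimal NumPy sketch (random data and an arbitrary A, purely illustrative): the n extracted features are just the rows of A applied to the measurement vector.

```python
import numpy as np

# Illustrative shapes only: the rest of the lecture is about how to choose A.
m, n = 5, 2                         # m measurements in, n features out
rng = np.random.default_rng(0)
x = rng.standard_normal(m)          # one measurement vector
A = rng.standard_normal((n, m))     # an (arbitrary) n x m transformation
y = A @ x                           # the n extracted features
print(y.shape)                      # (2,)
```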
Single-class Feature Extraction
Choose A to minimize the intra-class distance:
$$J(A) = \sum_{i=1}^{N}\sum_{j=1}^{N}\left\|y_i - y_j\right\|^2 = \sum_{i=1}^{N}\sum_{j=1}^{N}(x_i - x_j)^T A^T A\,(x_i - x_j)$$

We could formally solve this optimization, but we have done the same calculation before in orthonormal whitening. We want the minimum variance subspace.
Minimum variance direction = minimum eigenvalue eigenvector.
Minimum variance plane = pair of minimum eigenvalue eigenvectors.
Min. variance plane: $\phi_1, \phi_2$, with $\lambda_1 \le \lambda_2 \le \dots \le \lambda_m$.

[3D diagram: axes $x_1, x_2, x_3$ showing the eigenvector directions $\phi_1, \phi_2, \phi_3$]

$$|S_y| = \left|A S A^T\right|$$
If columns of A are eigenvectors of S:
$$S A^T = A^T \Lambda, \qquad S\phi_i = \lambda_i \phi_i$$

$$\left|A S A^T\right| = \left|A A^T \Lambda\right|$$
so
$$S_y = \begin{pmatrix} \phi_1^T \\ \vdots \\ \phi_n^T \end{pmatrix}\begin{pmatrix} \phi_1 & \cdots & \phi_n \end{pmatrix}\begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{pmatrix}, \qquad |S_y| = \prod_{i=1}^{n}\lambda_i$$

since $\phi_i^T\phi_j = \delta_{ij}$.
So choosing the smallest eigenvalues provides the smallest scatter of the new features.
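A minimal NumPy sketch of this single-class recipe (the function name and data layout are my own, assuming the N samples are the rows of X): the rows of A are the n minimum-eigenvalue eigenvectors of the sample covariance S.

```python
import numpy as np

def min_variance_features(X, n):
    """Single-class extraction: project onto the n minimum-eigenvalue
    eigenvectors of the sample covariance S (minimum scatter)."""
    S = np.cov(X, rowvar=False)        # m x m sample covariance
    lam, Phi = np.linalg.eigh(S)       # ascending eigenvalues; S is symmetric,
                                       # so Phi's columns are orthonormal
    A = Phi[:, :n].T                   # n x m: smallest-variance directions
    Y = (X - X.mean(axis=0)) @ A.T     # N x n extracted features
    return Y, A, lam[:n]               # |S_y| is the product of lam[:n]
```

Because `eigh` returns an orthonormal basis, the identity $\phi_i^T\phi_j = \delta_{ij}$ used above holds exactly.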
K-classes feature extraction
We know that there are k classes, and we have labelled samples.
We want to maximize interclass distance to give the maximum
separation.
Fisher's Criterion:

$$J(A) = \frac{\left|A S_B A^T\right|}{\left|A S_W A^T\right|}$$

This is a multidimensional version of Fisher's linear discriminant. Recall the 1-D case:

$$J(w) = \frac{w^T S_B w}{w^T S_W w}, \qquad w = S_W^{-1}(m_1 - m_2)$$
Now we need to compute $\partial J(A)/\partial A$ and set it equal to 0.
We can use the following linear algebra result:
$$\frac{\partial}{\partial A}\left|A S A^T\right| = 2\left|A S A^T\right|\left(A S A^T\right)^{-1} A S$$
So...
$$\frac{\partial J(A)}{\partial A} = 2\,\frac{\left|A S_B A^T\right|}{\left|A S_W A^T\right|}\left(A S_B A^T\right)^{-1} A S_B - 2\,\frac{\left|A S_B A^T\right|}{\left|A S_W A^T\right|}\left(A S_W A^T\right)^{-1} A S_W = 0$$
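The matrix derivative result can be checked numerically. A small sketch with a random symmetric positive definite S and finite differences (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 2
A = rng.standard_normal((n, m))
B = rng.standard_normal((m, m))
S = B @ B.T + m * np.eye(m)             # symmetric positive definite

f = lambda A: np.linalg.det(A @ S @ A.T)

# Closed form: d/dA |A S A^T| = 2 |A S A^T| (A S A^T)^{-1} A S
M = A @ S @ A.T
grad = 2.0 * np.linalg.det(M) * np.linalg.solve(M, A @ S)

# Central-difference approximation, entry by entry
eps, fd = 1e-6, np.zeros_like(A)
for i in range(n):
    for j in range(m):
        E = np.zeros_like(A)
        E[i, j] = eps
        fd[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

print(np.allclose(grad, fd, rtol=1e-4))  # True
```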
Continuing...
$$0 = \left(A S_B A^T\right)^{-1} A S_B - \left(A S_W A^T\right)^{-1} A S_W$$

$$A S_B = \left(A S_B A^T\right)\left(A S_W A^T\right)^{-1} A S_W$$

Recall the 1-D case:

$$S_B w = \frac{w^T S_B w}{w^T S_W w}\, S_W w, \qquad \text{a scalar multiple:}\quad S_B w = \lambda S_W w$$

Here we have a system of such equations, and

$$A S_B = \Lambda A S_W \;\Rightarrow\; S_B A^T = S_W A^T \Lambda \;\Rightarrow\; S_W^{-1} S_B A^T = A^T \Lambda$$
Therefore the columns of $A^T$ are eigenvectors of $S_W^{-1} S_B$.
Notes:
$$J(A) = \frac{\left|A S_B A^T\right|}{\left|A S_W A^T\right|} = |\Lambda|$$

$$A = \begin{pmatrix} \phi_1^T \\ \vdots \\ \phi_n^T \end{pmatrix}, \qquad y_i = \phi_i^T x$$
Projection onto the $i$th eigenvector of $S_W^{-1} S_B$.
In fact, they are the maximum eigenvalue eigenvectors of $S_W^{-1} S_B$.
Notes:
$S_W^{-1} S_B$ is not generally symmetric. Thus the eigenvectors are not orthogonal!
Also, n must be less than the number of classes k for $\left|A S_B A^T\right| \neq 0$, since $\operatorname{rank}(S_B) \le k - 1$.
To get more than k-1 features, we can use

$$J(A) = \frac{\left|A S_T A^T\right|}{\left|A S_W A^T\right|} \;\Rightarrow\; S_T A^T = S_W A^T \Lambda \;\Rightarrow\; S_W^{-1} S_T A^T = A^T \Lambda$$

whose solutions are the maximum eigenvalue eigenvectors of $S_W^{-1} S_T$.
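A NumPy sketch of the k-class recipe (function name and data layout are my own; it assumes labelled samples as rows of X and an invertible S_W): build the scatter matrices and take the maximum-eigenvalue eigenvectors of S_W^{-1} S_B as the rows of A.

```python
import numpy as np

def fisher_features(X, labels, n):
    """k-class extraction via Fisher's criterion. The rows of A are the
    n maximum-eigenvalue eigenvectors of S_W^{-1} S_B; n should be at
    most k-1 (see the note above)."""
    m = X.shape[1]
    mean = X.mean(axis=0)
    Sw = np.zeros((m, m))                  # within-class scatter
    Sb = np.zeros((m, m))                  # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mean)[:, None]
        Sb += len(Xc) * (d @ d.T)
    # S_W^{-1} S_B is not symmetric, so use the general eigensolver;
    # the eigenvalues are real here but come back as complex dtype.
    lam, Phi = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(lam.real)[::-1]     # descending eigenvalues
    A = Phi[:, order[:n]].real.T           # n x m; rows are NOT orthogonal
    return X @ A.T, A
```

To get more than k-1 features as noted above, the same code applies with the total scatter $S_T = S_W + S_B$ in place of $S_B$.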
[Figure 3.6 from Duda, Hart & Stork, Pattern Classification (Wiley, 2001): three three-dimensional distributions are projected onto two-dimensional subspaces, described by normal vectors W1 and W2. Informally, multiple discriminant methods seek the optimum such subspace, that is, the one with the greatest separation of the projected distributions for a given total within-scatter matrix, here associated with W1.]
Clustering
What are appropriate criteria for extracting features for clusters?
- Uncorrelated features
- Maximize the variance of the features (assumes between-class scatter > within-class scatter)
- Representation error
Suppose:
$$y = Ax = \begin{pmatrix} \phi_1^T \\ \vdots \\ \phi_n^T \end{pmatrix} x$$
So $\hat{x} = y_1\phi_1 + y_2\phi_2 + \dots + y_n\phi_n$ is an approximation to x in n dimensions.
We'd like to minimize $E\left(|x - \hat{x}|^2\right) = E\left(\left|x - \sum_{i=1}^{n} y_i\phi_i\right|^2\right)$ over the $\phi_i$. If we require that all $\phi_i$ are orthogonal and assume we have shifted the origin to the mean of all samples ($x \leftarrow x - m$),
and we can write
$$x = \sum_{i=1}^{m} y_i \phi_i$$

if we use all m components of some orthogonal basis,
then the representation error is
$$e = x - \hat{x} = \sum_{i=n+1}^{m} y_i \phi_i$$

with expected squared representation error

$$E\left(|e|^2\right) = E\left(\sum_{i=n+1}^{m}\sum_{j=n+1}^{m} y_i y_j\, \phi_i^T \phi_j\right) = \sum_{i=n+1}^{m} E\left(y_i^2\right)$$
But $E\left(y_i^2\right)$ is the variance of feature $i$, since

$$E[y] = E[A(x - m)] = 0.$$
So we can minimize the error if we choose the n maximum variance directions as our set $\{\phi_1, \dots, \phi_n\}$.

The n maximum eigenvalue eigenvectors of the total sample covariance matrix produce a set of n
- uncorrelated features
- maximum variance features
- minimum representation error features.
These eigenvectors are called the Principal Components:
- they account for as much variance as possible
- they span the maximum scatter subspace.
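A NumPy sketch of this clustering recipe, i.e. principal component extraction (names and layout are illustrative assumptions): project onto the n maximum-eigenvalue eigenvectors of the total sample covariance and report the representation error.

```python
import numpy as np

def pca_features(X, n):
    """Project onto the n maximum-eigenvalue eigenvectors (principal
    components) of the total sample covariance, after shifting the
    origin to the mean of all samples."""
    mean = X.mean(axis=0)
    S = np.cov(X, rowvar=False)             # total sample covariance
    lam, Phi = np.linalg.eigh(S)            # ascending eigenvalues
    A = Phi[:, ::-1][:, :n].T               # n largest-variance directions
    Y = (X - mean) @ A.T                    # extracted features y = A(x - m)
    X_hat = Y @ A + mean                    # x_hat = sum_i y_i phi_i (+ mean)
    # Mean squared representation error: approximately the sum of the
    # m - n discarded (smallest) eigenvalues.
    err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    return Y, A, err
```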
Feature Extraction Summary
$$y = A x$$

                 Criterion               A
Single Class     MICD                    Eigenvectors of $S$ (minimum eigenvalue)
k Classes        Fisher's Criterion      Eigenvectors of $S_W^{-1} S_B$ (maximum eigenvalue)
Clustering       Representation Error    Eigenvectors of $S$ (maximum eigenvalue)

where y contains the extracted features and A is the linear transformation applied to x, the original measurements.