FEATURE EXTRACTION AND FUSION TECHNIQUES FOR PATCH-BASED FACE RECOGNITION
Berkay Topcu, Sabancı University, 2009
Outline
Introduction
Feature Extraction
  Dimensionality Reduction
  Normalization Methods
Patch-Based Face Recognition
  Patch-Based Methods
  Classification: Nearest Neighbor Classification
  Feature Fusion
  Decision Fusion
Experiments and Results
  Databases and Experiment Set-up
  Closed Set Identification
  Open Set Identification
  Verification
Conclusions and Future Work
Face Recognition
[Figure: face recognition pipeline. Face Image → Feature Extraction (Dimensionality Reduction and Normalization) → Classification (Feature / Decision Fusion) → Recognition (Closed Set Id. / Open Set Id. / Verification)]
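Dimensionality Reduction
Feature selection
Dimension reduction
Extract relevant structures and relationships
Projecting or mapping d-dimensional data into p dimensions, where p < d
Given d-dimensional data x, we want to find p-dimensional data f such that:
$x = [x_1, x_2, \dots, x_d]^T \in \mathbb{R}^{d \times 1}$, $f = [f_1, f_2, \dots, f_p]^T \in \mathbb{R}^{p \times 1}$
$W = [w_1, w_2, \dots, w_p]^T \in \mathbb{R}^{p \times d}$ (the rows of W are $w_1^T, \dots, w_p^T$)
$f = W x$
$\hat{x} = \sum_{i=1}^{p} f_i w_i$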
Discrete Cosine Transform (DCT)
Expresses data as a summation of cosine functions
Due to its strong energy compaction property, most of the signal information is concentrated in a few low-frequency components
$f(u, v) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} x(i, j) \cos\left[\frac{\pi}{N}\left(i + \frac{1}{2}\right) u\right] \cos\left[\frac{\pi}{M}\left(j + \frac{1}{2}\right) v\right]$
Zig-zag scan (order in which the coefficients of a 4x4 block are read):
 1  2  6  7
 3  5  8 13
 4  9 12 14
10 11 15 16
First basis: the average intensity
Second and third basis: the average horizontal and vertical intensity change
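As a rough illustration of the transform and the zig-zag scan on this slide, the following NumPy sketch computes an unnormalized 2-D DCT-II of a block and reads the coefficients in zig-zag order. The function names (`dct2`, `zigzag_indices`, `dct_features`) and the omission of scaling factors are assumptions made for this example, not details taken from the presentation.

```python
import numpy as np

def dct2(block):
    """Unnormalized 2-D DCT-II: sum of x(i, j) times two cosine terms."""
    n, m = block.shape
    # rows_basis[u, i] = cos(pi / N * (i + 1/2) * u)
    rows_basis = np.cos(np.pi / n * np.outer(np.arange(n), np.arange(n) + 0.5))
    cols_basis = np.cos(np.pi / m * np.outer(np.arange(m), np.arange(m) + 0.5))
    return rows_basis @ block @ cols_basis.T

def zigzag_indices(n):
    """(row, col) pairs of an n x n block in zig-zag order."""
    # Walk the anti-diagonals r + c = s; odd diagonals go downwards
    # (increasing row), even diagonals go upwards (increasing column).
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )

def dct_features(block, k):
    """First k DCT coefficients of a square block in zig-zag order."""
    coeffs = dct2(block)
    return np.array([coeffs[r, c] for r, c in zigzag_indices(block.shape[0])[:k]])
```

For example, `dct_features(block, 12)` would keep 12 low-frequency coefficients of one block; how many coefficients to keep per block is a design choice.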
Principal Component Analysis (PCA)
Maps data into a lower dimension while preserving most of its variance
Rows of W: eigenvectors that correspond to the p highest eigenvalues of the scatter matrix, S
$S = \sum_{k=1}^{n} (x_k - m)(x_k - m)^T$
Does not take class information into account, so there is no guarantee of discrimination.
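The PCA projection described here can be sketched in a few lines of NumPy. This is a minimal sketch, assuming one sample per row of `X`; the function name `pca_projection` is hypothetical, and subtracting the mean before projecting is an assumption of this example.

```python
import numpy as np

def pca_projection(X, p):
    """Return (W, m): rows of W are the top-p eigenvectors of the scatter matrix."""
    m = X.mean(axis=0)
    centered = X - m
    S = centered.T @ centered               # scatter matrix, d x d
    eigvals, eigvecs = np.linalg.eigh(S)    # S is symmetric; eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:p]   # indices of the p largest eigenvalues
    W = eigvecs[:, order].T                 # p x d, one eigenvector per row
    return W, m
```

A feature vector for a sample `x` would then be `W @ (x - m)`.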
Principal Component Analysis (PCA)
64 x 64 = 4096 pixels/dimensions → 192 dimensions
[Figure: first 16 principal components (eigenfaces)]
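Linear Discriminant Analysis (LDA)
Finds the linear combination of features that best separates two or more classes
The goal is to maximize between-class scatter while minimizing within-class scatter
$S_B = \sum_{i=1}^{N} p_i (m_i - \hat{m})(m_i - \hat{m})^T$, $S_W = \sum_{i=1}^{N} S_i$
where $p_i$ and $m_i$ are the prior and mean of class i, $\hat{m}$ is the overall mean and $S_i$ is the scatter matrix of class i
$W = \arg\max_W \frac{|W S_B W^T|}{|W S_W W^T|}$
Rows of W: eigenvectors that correspond to the p highest eigenvalues of $S_W^{-1} S_B$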
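The LDA solution can be sketched as an eigen-decomposition of the product of the inverse within-class scatter and the between-class scatter. This is a minimal sketch under assumptions: the name `lda_projection` is hypothetical, class priors are estimated from class frequencies, and a pseudo-inverse is used so the example still runs when the within-class scatter is singular.

```python
import numpy as np

def lda_projection(X, y, p):
    """Rows of the result are the top-p eigenvectors of pinv(S_W) @ S_B."""
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        prior = len(Xc) / len(X)            # assumption: empirical class prior
        mc = Xc.mean(axis=0)
        S_B += prior * np.outer(mc - overall_mean, mc - overall_mean)
        S_W += (Xc - mc).T @ (Xc - mc)      # scatter of class c
    # The product is not symmetric, so use the general eigen-solver.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1][:p]
    return eigvecs[:, order].real.T
```

With N classes the between-class scatter has rank at most N - 1, so at most N - 1 directions carry discriminative information.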
Deficiencies of PCA and LDA
PCA does not take class information into account
LDA faces computational difficulties with a large number of highly correlated features; scatter matrices might become singular
When there is little data for each class, scatter matrices are not estimated reliably, and there are also numerical problems related to the singularity of scatter matrices
Outlier classes dominate the eigenvalue decomposition, so the influence of already well-separated classes is overweighted
Distances between already separated classes are preserved, causing overlap of neighboring classes
Approximate Pairwise Accuracy Criterion (APAC)
N-class LDA can be decomposed into a sum of $\frac{1}{2} N (N - 1)$ two-class LDA problems
Contribution of each two-class LDA to the overall criterion is weighted
Rows of W: eigenvectors that correspond to the p highest eigenvalues of
$\sum_{i < j} p_i p_j \, \omega(\Delta_{ij}) \, S_W^{-1/2} S_{ij} S_W^{-1/2}$
where
$S_{ij} = (m_i - m_j)(m_i - m_j)^T$
$\Delta_{ij} = \sqrt{(m_i - m_j)^T S_W^{-1} (m_i - m_j)}$
$\omega(\Delta_{ij}) = \frac{1}{2 \Delta_{ij}^2} \operatorname{erf}\left(\frac{\Delta_{ij}}{2 \sqrt{2}}\right)$
erf: comes from the Bayes error of two normally distributed classes
Normalized PCA (NPCA)
PCA maximizes the sum of all squared pairwise distances between projected vectors
The idea is to maximize a weighted sum of squared pairwise distances, with the pairwise dissimilarities $d_{ij}$ as weights:
$\sum_{i < j} d_{ij} (\mathrm{dist}^p_{ij})^2$
where $\mathrm{dist}^p_{ij}$ is the distance between samples i and j in the p-dimensional projected space and
$d_{ij} = 0$ if i and j belong to the same class, $d_{ij} = 1 / \mathrm{dist}_{ij}$ if i and j are from different classes
Rows of W: generalized eigenvectors that correspond to the p highest eigenvalues of $(X^T L^d X, X^T X)$
where $L^d$ is a Laplacian matrix derived from the pairwise dissimilarities and X is the data matrix (one sample in each row)
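To make the Laplacian formulation of normalized PCA more concrete, here is a minimal NumPy sketch. It relies on the identity that, for a symmetric weight matrix with Laplacian L, the weighted sum of squared pairwise distances between the rows of X equals trace(X^T L X). The function names, the small `eps` guard against zero distances, the centering of the data and the use of a pseudo-inverse to solve the generalized eigenproblem are all assumptions of this example.

```python
import numpy as np

def graph_laplacian(weights):
    """L = diag(row sums) - weights, for a symmetric weight matrix."""
    return np.diag(weights.sum(axis=1)) - weights

def pairwise_dissimilarities(X, y, eps=1e-12):
    """d_ij = 0 for same-class pairs, 1 / dist_ij for different-class pairs."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    different = y[:, None] != y[None, :]
    d = np.zeros_like(dist)
    d[different] = 1.0 / (dist[different] + eps)  # eps avoids division by zero
    return d

def npca_projection(X, y, p):
    """Top-p generalized eigenvectors of (X^T L^d X, X^T X), one per row."""
    Xc = X - X.mean(axis=0)                       # assumption: centered data
    L_d = graph_laplacian(pairwise_dissimilarities(Xc, y))
    A = Xc.T @ L_d @ Xc
    B = Xc.T @ Xc
    # A v = lambda B v is solved here through pinv(B) @ A (assumption).
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(B) @ A)
    order = np.argsort(eigvals.real)[::-1][:p]
    return eigvecs[:, order].real.T
```

Replacing `X.T @ X` by a second Laplacian term built from pairwise similarities would give the normalized LDA variant of the same eigenproblem.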
Normalized LDA (NLDA)
Pairwise similarities are introduced
Aim is to induce “attraction” between elements of the same class and “repulsion” between elements of different classes, by maximizing
$\frac{\sum_{i < j} d_{ij} (\mathrm{dist}^p_{ij})^2}{\sum_{i < j} s_{ij} (\mathrm{dist}^p_{ij})^2}$
$d_{ij} = 0$ if i and j belong to the same class, $d_{ij} = 1 / \mathrm{dist}_{ij}$ if i and j are from different classes
$s_{ij} = 1 / \mathrm{dist}_{ij}$ if i and j belong to the same class, $s_{ij} = 0$ if i and j are from different classes
Rows of W: generalized eigenvectors that correspond to the p highest eigenvalues of $(X^T L^d X, X^T L^s X)$
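Nearest Neighbor Discriminant Analysis (NNDA)
Maximizes the distance between classes, while minimizing the expected distance among the samples of the same class.
$\hat{W} = \arg\max_W \sum_{n=1}^{N_{total}} w_n \left( \|\delta_n^E\|^2 - \|\delta_n^I\|^2 \right)$
Projected differences: $\delta_n^E = W \Delta_n^E$, $\delta_n^I = W \Delta_n^I$
Nonparametric extra-class and intra-class differences: $\Delta_n^E = x_n - x_n^E$, $\Delta_n^I = x_n - x_n^I$
($x_n^E$ and $x_n^I$: nearest neighbors of $x_n$ from the other classes and from its own class)
where $w_n$ is the sample weight defined as:
$w_n = \frac{\|\Delta_n^I\|^\alpha}{\|\Delta_n^I\|^\alpha + \|\Delta_n^E\|^\alpha}$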
Nearest Neighbor Discriminant Analysis (NNDA)
Rows of W: eigenvectors that correspond to the p highest eigenvalues of $S'_B - S'_W$
$S'_B = \sum_{n=1}^{N_{total}} w_n (\Delta_n^E)(\Delta_n^E)^T$, $S'_W = \sum_{n=1}^{N_{total}} w_n (\Delta_n^I)(\Delta_n^I)^T$
Extra-class and intra-class differences are calculated in the original space and then projected into the low-dimensional space, so they do not exactly agree with the differences in the projection space
Stepwise Dimensionality Reduction: in each step, distances are recalculated in the current dimensionality
Normalization Methods
Image Domain Mean and Variance Normalization: aims to extract similar visual feature vectors from each block across sessions of the same subject.
$\hat{x}_b = \frac{1}{\sigma_b} (x_b - \mu_b)$
Feature Normalization: aims to reduce inter-session variability and intra-class variance
1. Norm Division (ND): $\hat{f} = f / \|f\|$
2. Sample Variance Normalization (SVN): $\hat{f}_i = f_i / \sigma(f_i)$
3. Block Mean and Variance Normalization (BMVN): $\hat{f}_b = \frac{1}{\sigma_{f_b}} (f_b - \mu_{f_b})$
4. Feature Vector Mean and Variance Normalization (FMVN): $\hat{f} = \frac{1}{\sigma_f} (f - \mu_f)$
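The normalization steps listed on this slide can be sketched as small NumPy helpers. This is a hedged sketch: the function names are hypothetical, the mean and standard deviation are assumed to be taken over the elements of the vector (or block) being normalized, and `eps` is an added guard against division by zero.

```python
import numpy as np

def mean_variance_normalize(v, eps=1e-8):
    """Zero mean, unit variance over the elements of v.

    Can serve for image-domain normalization of a pixel block, for BMVN
    (v = features of one block) or for FMVN (v = the whole feature vector).
    """
    return (v - v.mean()) / (v.std() + eps)

def norm_division(f, eps=1e-8):
    """ND: divide the feature vector by its Euclidean norm."""
    return f / (np.linalg.norm(f) + eps)

def sample_variance_normalize(f, train_std, eps=1e-8):
    """SVN: divide each feature by its standard deviation over the training set."""
    return f / (train_std + eps)
```

Here `train_std` would be computed once, for example as `train_features.std(axis=0)`, and reused for every test sample.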
Patch-Based Face Recognition
Aim: to eliminate or lower the effects of illumination changes, occlusion and expression changes by analyzing face images locally
A detected face is divided into blocks of 16x16 or 8x8 pixels
Dimensionality reduction techniques are applied on each block separately
[Figure: face image divided into 16x16 blocks and into 8x8 blocks]
Patch-Based Face Recognition
Dimension Reduction: 64x64 = 4096 features → 192 features (16 blocks x 12 features or 64 blocks x 3 features)
Following feature extraction:
Feature Fusion: concatenate features from each block in order to create the visual feature vector of an image
Decision Fusion: classify each block separately and then combine the individual recognition results of each block
Originating point of this study: Global PCA vs. Patch-based PCA
Global PCA          83.45%
Block PCA (8x8)     83.78%
Block PCA (16x16)   83.78%
Classification Method: Nearest Neighbor Classifier
Why nearest neighbor classification?
Different distance metrics:
Lp-norm between d-dimensional training sample and test sample:
$L_p(f_{train}, f_{test}) = \left( \sum_{n=1}^{d} |f_{train,n} - f_{test,n}|^p \right)^{1/p}$
Cosine of the angle between d-dimensional training sample and test sample:
$\mathrm{COS}(f_{train}, f_{test}) = \frac{f_{train}^T f_{test}}{\|f_{train}\| \cdot \|f_{test}\|}$
In our experiments, we used the L2-norm, but we also tested some promising methods with the L1-norm and COS
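The distance metrics and the nearest neighbor rule can be sketched as follows. The function names are hypothetical, and the sketch assumes one training feature vector per row of `train_feats`.

```python
import numpy as np

def lp_distance(a, b, p=2):
    """Lp-norm of the difference between two feature vectors."""
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (larger means closer)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_neighbor(train_feats, train_labels, test_feat, p=2):
    """Label of the training sample with the smallest Lp distance to test_feat."""
    dists = (np.abs(train_feats - test_feat) ** p).sum(axis=1) ** (1.0 / p)
    return train_labels[np.argmin(dists)]
```

Note that the cosine measure is a similarity, so a nearest neighbor search based on it would pick the largest value rather than the smallest.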
Classification Method: Nearest Neighbor Classifier
Distance to class posterior probabilities: $p(C_i | x)$
Depends on the distance of x to the nearest training sample from each class:
$D = [D(1) \; D(2) \; \dots \; D(N)]$
$p(C_i | x) = \mathrm{sigm}\left( \log \frac{\sum_{j \neq i} D(j)}{D(i)} \right)$, $\mathrm{sigm}(x) = \frac{1}{1 + e^{-x}}$
After calculating the posterior probability for each class, they are normalized by dividing by their sum so that they sum up to 1
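One possible reading of the distance-to-posterior conversion is sketched below: for each class, the ratio between the summed nearest distances to all other classes and the nearest distance to that class is passed through log and a sigmoid, and the results are normalized to sum to 1. The exact form of the ratio is an assumption of this sketch, as are the function name `nn_posteriors`, the use of the Euclidean distance and the `eps` guard.

```python
import numpy as np

def nn_posteriors(train_feats, train_labels, x, eps=1e-12):
    """Approximate class posteriors from nearest-sample distances per class."""
    classes = np.unique(train_labels)
    dists = np.linalg.norm(train_feats - x, axis=1)
    # D[i]: distance from x to the nearest training sample of class i.
    D = np.array([dists[train_labels == c].min() for c in classes])
    # Assumed ratio: sum of D(j) over j != i, divided by D(i).
    ratio = (D.sum() - D) / (D + eps)
    posteriors = 1.0 / (1.0 + np.exp(-np.log(ratio + eps)))  # sigm(log(ratio))
    return classes, posteriors / posteriors.sum()
```

A class whose nearest sample is very close to `x` gets a large ratio and therefore a posterior close to the maximum.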
Feature Fusion
Defining an image in vector form as $x = [x_1^T \; x_2^T \; \dots \; x_B^T]^T$, where B is the number of blocks and $x_b$ denotes the vectorized b-th block of the image, we find a linear transformation matrix, $W_b$, such that
$f_b = W_b x_b$, $f = [f_1^T \; f_2^T \; \dots \; f_B^T]^T$
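Feature fusion as defined here amounts to splitting the image into blocks, projecting each block with its own matrix and concatenating the results. The sketch below assumes non-overlapping square blocks read in row-major order; `split_into_blocks` and `fuse_features` are hypothetical names, and `projections` stands for per-block matrices obtained from any of the dimensionality reduction methods.

```python
import numpy as np

def split_into_blocks(image, size):
    """Return an array with one vectorized size x size block per row."""
    h, w = image.shape
    blocks = image.reshape(h // size, size, w // size, size).swapaxes(1, 2)
    return blocks.reshape(-1, size * size)

def fuse_features(image, size, projections):
    """Concatenate f_b = W_b x_b over all blocks of the image."""
    blocks = split_into_blocks(image, size)
    return np.concatenate([W_b @ x_b for W_b, x_b in zip(projections, blocks)])
```

For a 64x64 image with 16x16 blocks and 12 features per block, the fused vector has 16 x 12 = 192 entries.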
Decision Fusion
Combining the decisions of each classifier trained on different blocks
Output of a classifier is class posterior probabilities: $\{ p(C_i | x_b) : 1 \le i \le N \}$
Fixed combiners
  Mean, maximum, minimum, median, sum, product of the set $\{ p(C_i | x_b) : 1 \le b \le B \}$
  Majority voting of the individual classifier decisions
Trainable combiners
  Use the output of the classifier as a feature set
  From class posterior probabilities of several classifiers, a new classifier is trained to provide an ultimate decision
$\hat{i} = \arg\max_i p(C_i | x)$, $p(C_i | x) = \mathrm{rule}\{ p(C_i | x_b) : 1 \le b \le B \}$
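The fixed combiners can be illustrated with a short sketch that takes the block posteriors as a B x N array (row b holds the posteriors of block b for all N classes). The names `combine` and `majority_vote` are hypothetical.

```python
import numpy as np

RULES = {
    "sum": np.sum, "mean": np.mean, "max": np.max,
    "min": np.min, "median": np.median, "product": np.prod,
}

def combine(posteriors, rule="sum"):
    """Apply a fixed rule over blocks and return the winning class index."""
    scores = RULES[rule](posteriors, axis=0)   # one score per class
    return int(np.argmax(scores))

def majority_vote(posteriors):
    """Each block votes for its most probable class; the most voted class wins."""
    votes = posteriors.argmax(axis=1)
    return int(np.bincount(votes, minlength=posteriors.shape[1]).argmax())
```

The rules can disagree: a single very confident block may decide the sum or product rule while being outvoted under majority voting.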
Trainable Combiners
Training data is separated into train data and validation data
Stacked generalization
Trainable Combiners
Resulting class posterior probabilities are concatenated into a vector as
$[p(C_1 | x_1), p(C_2 | x_1), \dots, p(C_{N-1} | x_B), p(C_N | x_B)]$
The length of this input feature vector of the combiner is $N \times B$
In the sum rule (fixed combiner), the posterior probabilities for one class from each classifier are summed.
A weighted summation of posterior probabilities can be performed:
$p(C_i | x) = \sum_{b=1}^{B} w_b \, p(C_i | x_b)$
Fixed combination method with trainable weights
How to assign weights?
Block Weights (Offline Weights)
Learned from training data and independent of test data
1. Equal Weights (EW): Contribution of each block is assumed to be the same: $w_b = 1 / B$
2. Score Weighting (SW): Depends on the posterior probability distribution of true and wrong labels on validation data
Positive examples: $PS = [p(C_i | x_1) \; p(C_i | x_2) \; \dots \; p(C_i | x_B)]$
Negative examples: $NS = [p(C_j | x_1) \; p(C_j | x_2) \; \dots \; p(C_j | x_B)]$
where $C_i$ is the true class and $j = 1, \dots, N$, $j \neq i$
Block Weights (Offline Weights)
2. (SW continued) LDA finds the linear combination of vectors such that these vectors are most separated in the projected space.
Project the 16-dimensional score vectors (one score per block) to 1 dimension and use the coefficients of this mapping as block weights:
$p(C_i | x) = \sum_{b=1}^{B} w_b \, p(C_i | x_b)$
Block Weights (Offline Weights)
3. Validation Accuracy Weighting (VAW): Depends on the individual recognition rates on validation data for each block:
$w_b = \frac{\mathrm{acc}(b)}{\sum_{k=1}^{B} \mathrm{acc}(k)}$
However, the most trusted blocks might not contain that much information in a test image due to partial occlusion
A weighting scheme that depends on the training dataset might therefore not be trustworthy; a more interactive scheme that is related to the test sample is believed to provide more accurate weight assignments
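The equal and validation-accuracy weights, together with the weighted sum rule, can be sketched as follows. The function names are hypothetical; `posteriors` is again a B x N array of block posteriors, and `accuracies` holds one validation recognition rate per block.

```python
import numpy as np

def equal_weights(num_blocks):
    """EW: every block contributes 1 / B."""
    return np.full(num_blocks, 1.0 / num_blocks)

def validation_accuracy_weights(accuracies):
    """VAW: block weight proportional to its validation accuracy."""
    acc = np.asarray(accuracies, dtype=float)
    return acc / acc.sum()

def weighted_sum(posteriors, weights):
    """p(C_i | x) = sum over blocks of w_b * p(C_i | x_b)."""
    return weights @ posteriors
```

Because the weights sum to 1 and each row of `posteriors` sums to 1, the combined scores also sum to 1.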
Confidence Weighting (Online Weighting)
Each test sample is treated separately, and individual block weights for each test sample are calculated according to its reliability or confidence
Confidence features are extracted from each block for each sample in the validation data and labeled as “correctly classified” or “misclassified”
Similarity, a measure of closeness of a feature to the mean feature:
$s_b = \frac{f_b^T \mu_b}{\|f_b\| \cdot \|\mu_b\|}$ (with $\mu_b$ the mean feature of block b)
Block selection
  Aims to discard blocks that are not helpful
  Blocks are sorted according to block similarity
  Selected blocks are weighted according to their confidence weights
  The remaining blocks are discarded (their weights are set to zero)
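Block selection can be sketched as a ranking step followed by zeroing the weights of the discarded blocks. In this sketch the name `select_blocks`, the `keep` parameter (number of blocks retained) and the final renormalization of the weights are assumptions; the similarity and confidence values are taken as given inputs.

```python
import numpy as np

def select_blocks(similarity, confidence, keep):
    """Keep the `keep` most similar blocks, weight them by confidence, zero the rest."""
    order = np.argsort(similarity)[::-1]       # most similar blocks first
    selected = order[:keep]
    weights = np.zeros(len(confidence), dtype=float)
    weights[selected] = np.asarray(confidence, dtype=float)[selected]
    return weights / weights.sum()             # assumption: weights sum to 1
```

The resulting weights can be plugged directly into a weighted sum of block posteriors.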
Experiments and Results - Databases
M2VTS database
37 subjects, 5 video shots (8 randomly selected frames from each video)
4 tapes for training, 1 tape for testing (includes variations such as different hairstyles, glasses, hats and scarves)
32 training images/subject, 8 test images/subject
1184 (32x37) training images, 296 (8x37) testing images
Experiments and Results - Databases
AR database
120 subjects, two sessions (13 images in each session)
First 7 images of each session: training
Remaining 6 images of each session: testing (include sunglasses and scarf, i.e. partial occlusion)
14 training images/subject, 12 test images/subject
1680 training images, 1440 test images
Closed Set Identification
Identifying an unknown face if the subject is known to be in the database
Experiments on the M2VTS database
Effect of image domain normalization (IDN):
        w/o IDN   with IDN
DCT     85.47%    87.84%
PCA     83.78%    87.50%
LDA     84.80%    84.46%
APAC    86.15%    86.15%
NPCA    83.78%    87.50%
NLDA    87.16%    83.11%
NNDA    85.47%    87.84%
Experiments on the M2VTS database
Feature Fusion
  LDA, APAC, NLDA provide higher recognition accuracies
  FMVN increases accuracies, other normalization methods are inconsistent
  16x16 blocks provide higher results than 8x8 blocks
  The highest accuracy is obtained by NLDA + FMVN: 93.45%
Decision Fusion
  DCT and NNDA provide the highest recognition accuracies
  Image normalization contributes positively (except DCT)
  All feature normalization methods are helpful
  Baseline is EW, and in most cases SW and VAW perform better
  The highest accuracies are DCT + ND (VAW): 97.30% and NNDA + SVN (SW): 96.96%
Experiments on the AR database
Feature Fusion
  Less data dependent transforms (DCT, PCA, NPCA and NNDA) perform well
  LDA, APAC and NLDA face problems when there is not enough training data
  Image domain normalization is not helpful, as train and test data have similar illumination conditions
  ND increases accuracies
  The highest recognition rate is NNDA + ND: 48.08%
Experiments on the AR database
Decision Fusion
  DCT, PCA and NNDA provide the highest recognition accuracies
  Image normalization is not helpful
  All feature normalization methods are helpful
  Baseline is EW, and in most cases SW and VAW perform better
  The highest accuracies are NNDA + ND (VAW): 85.97% and DCT + SVN (VAW): 84.65%
Single training data experiment
  To illustrate the effect of normalization methods, using DCT and EW
NN      42.36%
ND      44.03%
BMVN    43.82%
FMVN    45.14%
Confidence Weighting and Block Selection
The calculated weights are close to each other (almost the same as EW)
PCA without any normalization methods and EW : 65.49% (AR)
Different Distance Metrics
For some of the cases that provide the highest recognition rates:
              L2-norm   L1-norm   COS
M2VTS DCT     97.30%    96.62%    93.92%
M2VTS NNDA    96.96%    87.16%    91.55%
AR NNDA       85.90%    88.47%    89.10%
AR DCT        84.65%    85.53%    86.39%
Comparison with Other Techniques
CSU Face Identification Evaluation System
  PCA, PCA + LDA, Bayesian Intrapersonal/Extrapersonal Difference Classifier
  Lining up eye coordinates, masking face, histogram equalization, pixel normalization
Our implementation of illumination correction + global DCT/global PCA
Our highest accuracies: 97.30% for M2VTS and 89.10% for AR
                  M2VTS     AR
PCA Euclidean     86.48%    22.15%
PCA Mahalanobis   88.17%    42.56%
PCA + LDA         100.0%    21.94%
Bayesian MAP      91.89%    23.95%
Bayesian ML       92.56%    27.84%
                  M2VTS     AR
Global DCT        93.58%    47.54%
Global PCA        89.53%    48.46%
Open Set Identification
There is a rejection option
Determines if the unknown face belongs to the database
Finds the identity of the subject from the database
False Accept Rate (FAR) vs. False Reject Rate + False Classification Rate (FRR+FCR)
M2VTS database, CSI: 97.30% (DCT + ND), EER: 14.89%
Verification
Confirming or rejecting an unknown face’s claimed identity
FAR vs. FRR
M2VTS database, CSI: 97.30% (DCT + ND), EER: 5.74%
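The FAR/FRR trade-off and the equal error rate (EER) reported for verification can be illustrated with a small sketch. It assumes that a higher score means a better match and that a claim is accepted when the score reaches the threshold; the function names and the way the EER is approximated (the threshold where FAR and FRR are closest) are assumptions of this example, not the evaluation protocol of the study.

```python
import numpy as np

def far_frr(genuine, impostor, threshold):
    """FAR: accepted impostor claims; FRR: rejected genuine claims."""
    far = float(np.mean(np.asarray(impostor) >= threshold))
    frr = float(np.mean(np.asarray(genuine) < threshold))
    return far, frr

def equal_error_rate(genuine, impostor):
    """Approximate EER at the candidate threshold where FAR and FRR are closest."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    rates = [far_frr(genuine, impostor, t) for t in thresholds]
    far, frr = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (far + frr) / 2.0
```

For open set identification, the same idea applies with FRR replaced by the sum of false reject and false classification rates.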
Conclusion and Future Work
Different dimensionality reduction and normalization techniques for feature fusion and decision fusion methods
Dimensionality reduction methods can be categorized as: DCT, PCA, NPCA, NNDA (less data dependent transforms) and LDA, APAC, NLDA (data dependent transforms)
Patch-based face recognition is superior to global approaches
Decision fusion provides higher recognition results
Contributions:
  Recently proposed dimensionality reduction techniques are applied to patch-based face recognition
  Image level and feature level normalization methods are introduced
  Use of decision fusion techniques for patch-based face recognition is introduced, and the weights in the “weighted sum rule” are estimated using a novel method
Conclusion and Future Work
Future Work
  Moving block centers so that each block corresponds to the same location on the face for all images of all subjects
  Using color information in addition to gray scale intensity values
  More accurate distance to posterior probability conversion for nearest neighbor classification
Thank you ...