
FEATURE EXTRACTION AND FUSION TECHNIQUES FOR PATCH-BASED FACE RECOGNITION

Berkay Topcu, Sabancı University, 2009

Outline

Introduction
Feature Extraction
  Dimensionality Reduction
  Normalization Methods
Patch-Based Face Recognition
  Patch-Based Methods
  Classification: Nearest Neighbor Classification
  Feature Fusion
  Decision Fusion
Experiments and Results
  Databases and Experiment Set-up
  Closed Set Identification
  Open Set Identification
  Verification
Conclusions and Future Work

Face Recognition

[Figure: processing pipeline. Face Image → Feature Extraction (Dimensionality Reduction and Normalization) → Classification (Feature / Decision Fusion) → Recognition (Closed Set Id. / Open Set Id. / Verification)]

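Dimensionality Reduction

Feature selection
Dimension reduction: extract relevant structures and relationships
Projecting or mapping d-dimensional data into p dimensions, where p < d
Given d-dimensional data $\mathbf{x}$, we want to find p-dimensional data $\mathbf{f}$ such that:

$$\mathbf{x} = [x_1, x_2, \ldots, x_d]^T \in \mathbb{R}^{d \times 1}, \qquad \mathbf{f} = [f_1, f_2, \ldots, f_p]^T \in \mathbb{R}^{p \times 1}$$

$$W = \begin{bmatrix} \mathbf{w}_1^T \\ \mathbf{w}_2^T \\ \vdots \\ \mathbf{w}_p^T \end{bmatrix}_{p \times d}, \qquad \mathbf{f} = W\mathbf{x}, \qquad \hat{\mathbf{x}} = \sum_{i=1}^{p} f_i \mathbf{w}_i$$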

Discrete Cosine Transform (DCT)

Expresses data as a summation of cosine functions
Due to its strong energy compaction property, most of the signal information is concentrated in a few low-frequency components

$$f(u,v) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} x(i,j)\, \cos\left[\frac{\pi}{N}\left(i + \frac{1}{2}\right)u\right] \cos\left[\frac{\pi}{M}\left(j + \frac{1}{2}\right)v\right]$$

Zig-zag scan
  First basis: the average intensity
  Second and third basis: the average horizontal and vertical intensity change

Zig-zag scan order for a 4x4 block:

 1  2  6  7
 3  5  8 13
 4  9 12 14
10 11 15 16
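The transform and the zig-zag scan on this slide can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in the study: the function names `dct2`, `zigzag_order` and `dct_features` are hypothetical, and the transform is left unscaled (no orthonormalization factors), which is an assumption.

```python
import numpy as np

def dct2(block):
    """Unscaled 2-D DCT: f(u,v) = sum_ij x(i,j) cos[pi/N (i+1/2) u] cos[pi/M (j+1/2) v]."""
    n, m = block.shape
    cos_n = np.cos(np.pi / n * np.outer(np.arange(n), np.arange(n) + 0.5))  # indexed [u, i]
    cos_m = np.cos(np.pi / m * np.outer(np.arange(m), np.arange(m) + 0.5))  # indexed [v, j]
    return cos_n @ block @ cos_m.T

def zigzag_order(n, m):
    """Cell coordinates in zig-zag order: walk the anti-diagonals, alternating direction."""
    cells = [(i, j) for i in range(n) for j in range(m)]
    return sorted(cells, key=lambda c: (c[0] + c[1], c[0] if (c[0] + c[1]) % 2 else -c[0]))

def dct_features(block, p):
    """Keep the first p DCT coefficients in zig-zag order."""
    coeffs = dct2(block)
    return np.array([coeffs[i, j] for i, j in zigzag_order(*block.shape)[:p]])

# A constant block has all its energy in the first (average intensity) coefficient.
block = np.full((8, 8), 100.0)
features = dct_features(block, 3)
```

Keeping only the leading zig-zag coefficients is what turns the energy compaction property into a dimensionality reduction step.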

Principal Component Analysis (PCA)

Maps data into a lower dimension while preserving most of its variance
Rows of $W$: eigenvectors that correspond to the p highest eigenvalues of the scatter matrix $S$,

$$S = \sum_{k=1}^{n} (\mathbf{x}_k - \mathbf{m})(\mathbf{x}_k - \mathbf{m})^T$$

where $\mathbf{m}$ is the sample mean
Does not take class information into account, so there is no guarantee of discrimination

Principal Component Analysis (PCA)

64 x 64 = 4096 pixels/dimensions → 192 dimensions

First 16 principal components (eigenfaces)
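A minimal NumPy sketch of the PCA projection described above. The function name `pca` and the toy data are hypothetical, and subtracting the mean before projecting is an assumption (the slides only state f = Wx).

```python
import numpy as np

def pca(X, p):
    """Return W (p x d), whose rows are the top-p eigenvectors of the scatter matrix, and the mean."""
    m = X.mean(axis=0)
    Xc = X - m
    S = Xc.T @ Xc                         # scatter matrix: sum_k (x_k - m)(x_k - m)^T
    vals, vecs = np.linalg.eigh(S)        # eigh: S is symmetric
    order = np.argsort(vals)[::-1][:p]    # indices of the p highest eigenvalues
    return vecs[:, order].T, m

# Toy data: 200 samples in 5 dimensions with decreasing variance per axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
W, m = pca(X, 2)
F = (X - m) @ W.T                         # projected data, one row per sample
```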

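Linear Discriminant Analysis (LDA)

Finds the linear combination of features that best separates two or more classes
The goal is to maximize between-class scatter while minimizing within-class scatter:

$$\hat{W} = \arg\max_{W} \frac{|W S_B W^T|}{|W S_W W^T|}$$

$$S_B = \sum_{i=1}^{N} p_i (\mathbf{m}_i - \hat{\mathbf{m}})(\mathbf{m}_i - \hat{\mathbf{m}})^T, \qquad S_W = \sum_{i=1}^{N} S_i$$

where $\mathbf{m}_i$ is the mean of class $i$, $\hat{\mathbf{m}}$ the overall mean, $p_i$ the prior of class $i$ and $S_i$ the scatter matrix of class $i$
Rows of $W$: eigenvectors that correspond to the p highest eigenvalues of $S_W^{-1} S_B$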
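The LDA projection can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the study's code: the function name `lda` and the toy data are hypothetical, class priors are estimated as relative class frequencies, and $S_W$ is assumed to be invertible.

```python
import numpy as np

def lda(X, labels, p):
    """Rows of the returned W are the top-p eigenvectors of S_W^{-1} S_B."""
    d = X.shape[1]
    m_hat = X.mean(axis=0)
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        m_c = Xc.mean(axis=0)
        p_c = len(Xc) / len(X)                          # class prior (assumed: relative frequency)
        S_B += p_c * np.outer(m_c - m_hat, m_c - m_hat)
        S_W += (Xc - m_c).T @ (Xc - m_c)                # class scatter S_i
    vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(vals.real)[::-1][:p]
    return vecs[:, order].real.T

# Two classes that differ along axis 1, with a large shared variance along axis 0.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
X = np.column_stack([rng.normal(scale=5.0, size=100),
                     3.0 * labels + rng.normal(scale=0.5, size=100)])
W = lda(X, labels, 1)
```

On this toy data PCA would pick axis 0 (largest variance), whereas LDA picks the direction that separates the classes.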

Deficiencies of PCA and LDA

PCA does not take class information into account
LDA faces computational difficulties with a large number of highly correlated features; scatter matrices might become singular
When there is little data for each class, scatter matrices are not estimated reliably, and there are also numerical problems related to the singularity of scatter matrices
Outlier classes dominate the eigenvalue decomposition, so the influence of already well-separated classes is overweighted
  Distances between already separated classes are preserved, causing overlap of neighboring classes

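Approximate Pairwise Accuracy Criterion (APAC)

N-class LDA can be decomposed into a sum of $\frac{1}{2}N(N-1)$ two-class LDA problems
Contribution of each two-class LDA to the overall criterion is weighted
Rows of $W$: eigenvectors that correspond to the p highest eigenvalues of

$$\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} p_i\, p_j\, w(\Delta_{ij})\, S_w^{-1/2} S_{ij} S_w^{-1/2}$$

$$S_{ij} = (\mathbf{m}_i - \mathbf{m}_j)(\mathbf{m}_i - \mathbf{m}_j)^T, \qquad \Delta_{ij} = \sqrt{(\mathbf{m}_i - \mathbf{m}_j)^T S_w^{-1} (\mathbf{m}_i - \mathbf{m}_j)}$$

$$w(\Delta_{ij}) = \frac{1}{2\Delta_{ij}^2}\, \mathrm{erf}\left(\frac{\Delta_{ij}}{2\sqrt{2}}\right)$$

erf: comes from the Bayes error of two normally distributed classes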
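To illustrate how the pairwise weighting behaves, the sketch below evaluates the weighting function for a close and a distant class pair. The function names `pairwise_delta` and `apac_weight` are hypothetical, and the identity within-class scatter is chosen only to keep the example small.

```python
import math
import numpy as np

def pairwise_delta(m_i, m_j, S_w):
    """Distance between two class means, measured with the inverse within-class scatter."""
    diff = m_i - m_j
    return math.sqrt(diff @ np.linalg.solve(S_w, diff))

def apac_weight(delta):
    """w(delta) = erf(delta / (2 sqrt 2)) / (2 delta^2)."""
    return math.erf(delta / (2.0 * math.sqrt(2.0))) / (2.0 * delta ** 2)

S_w = np.eye(2)  # assumed within-class scatter for this toy example
close = pairwise_delta(np.array([0.0, 0.0]), np.array([1.0, 0.0]), S_w)
far = pairwise_delta(np.array([0.0, 0.0]), np.array([4.0, 0.0]), S_w)
w_close, w_far = apac_weight(close), apac_weight(far)
```

Class pairs that are already far apart receive a much smaller weight, which counteracts the dominance of outlier classes.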

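Normalized PCA (NPCA)

PCA maximizes the sum of all squared pairwise distances between projected vectors
The idea is to maximize a weighted sum of squared pairwise distances, with the pairwise dissimilarities $d_{ij}$ as weights:

$$\sum_{i<j} d_{ij} \left(\mathrm{dist}^p_{ij}\right)^2$$

$$d_{ij} = \begin{cases} 0, & \text{if } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ belong to the same class} \\ \dfrac{1}{\mathrm{dist}_{ij}}, & \text{if } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are from different classes} \end{cases}$$

where $\mathrm{dist}_{ij}$ is the distance between samples $i$ and $j$ in the original space and $\mathrm{dist}^p_{ij}$ the distance after projection
Rows of $W$: generalized eigenvectors that correspond to the p highest eigenvalues of $(X^T L^d X,\; X^T X)$
where $L^d$ is a Laplacian matrix derived from the pairwise dissimilarities and $X$ is the data matrix (one sample in each row)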
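A possible NumPy sketch of the NPCA computation. The names `dissimilarity_weights`, `laplacian` and `npca` are hypothetical; Euclidean distance is assumed for dist, the Laplacian is built as (diagonal of row sums) minus the weight matrix, and $X^T X$ is assumed to be positive definite so that the generalized problem can be solved through a Cholesky factor.

```python
import numpy as np

def dissimilarity_weights(X, labels):
    """d_ij = 1 / dist_ij for samples of different classes, 0 otherwise."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    different = labels[:, None] != labels[None, :]
    D = np.zeros_like(dist)
    D[different] = 1.0 / dist[different]
    return D

def laplacian(weights):
    """Graph Laplacian, so that X^T L X = sum_{i<j} w_ij (x_i - x_j)(x_i - x_j)^T."""
    return np.diag(weights.sum(axis=1)) - weights

def npca(X, labels, p):
    """Top-p generalized eigenvectors of (X^T L^d X, X^T X) as the rows of W."""
    A = X.T @ laplacian(dissimilarity_weights(X, labels)) @ X
    B = X.T @ X
    C_inv = np.linalg.inv(np.linalg.cholesky(B))       # B = C C^T
    vals, vecs = np.linalg.eigh(C_inv @ A @ C_inv.T)   # reduced to a standard symmetric problem
    order = np.argsort(vals)[::-1][:p]
    return (C_inv.T @ vecs[:, order]).T, vals[order]

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 10)
X = rng.normal(size=(30, 4)) + labels[:, None]
W, eigvals = npca(X, labels, 2)
```

Replacing $X^T X$ in `npca` by $X^T L^s X$, with a similarity Laplacian built the same way, would give the NLDA variant, provided that matrix is also positive definite.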

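Normalized LDA (NLDA)

Pairwise similarities are introduced
The aim is to induce "attraction" between elements of the same class and "repulsion" between elements of different classes, by maximizing

$$\frac{\sum_{i<j} d_{ij} \left(\mathrm{dist}^p_{ij}\right)^2}{\sum_{i<j} s_{ij} \left(\mathrm{dist}^p_{ij}\right)^2}$$

$$d_{ij} = \begin{cases} 0, & \text{if } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ belong to the same class} \\ \dfrac{1}{\mathrm{dist}_{ij}}, & \text{if } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are from different classes} \end{cases}$$

$$s_{ij} = \begin{cases} \dfrac{1}{\mathrm{dist}_{ij}}, & \text{if } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ belong to the same class} \\ 0, & \text{if } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are from different classes} \end{cases}$$

Rows of $W$: generalized eigenvectors that correspond to the p highest eigenvalues of $(X^T L^d X,\; X^T L^s X)$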

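Nearest Neighbor Discriminant Analysis (NNDA)

Maximizes the distance between classes, while minimizing the expected distance among the samples of the same class:

$$\hat{W} = \arg\max_{W} \sum_{n=1}^{N_{total}} w_n \left( \|\Theta_n^E\|^2 - \|\Theta_n^I\|^2 \right)$$

Projected differences: $\Theta_n^E = W \Delta_n^E$, $\Theta_n^I = W \Delta_n^I$
Nonparametric extra-class and intra-class differences: $\Delta^E = \mathbf{x} - \mathbf{x}^E$, $\Delta^I = \mathbf{x} - \mathbf{x}^I$, where $\mathbf{x}^E$ and $\mathbf{x}^I$ are the nearest neighbors of $\mathbf{x}$ from the other classes and from its own class
where $w_n$ is the sample weight, defined as:

$$w_n = \frac{\|\Delta_n^I\|^{\alpha}}{\|\Delta_n^I\|^{\alpha} + \|\Delta_n^E\|^{\alpha}}$$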


Nearest Neighbor Discriminant Analysis (NNDA)

Rows of $W$: eigenvectors that correspond to the p highest eigenvalues of $S'_B - S'_W$

$$S'_B = \sum_{n=1}^{N_{total}} w_n\, \Delta_n^E (\Delta_n^E)^T, \qquad S'_W = \sum_{n=1}^{N_{total}} w_n\, \Delta_n^I (\Delta_n^I)^T$$

Extra-class and intra-class differences are calculated in the original space and then projected into the low-dimensional space, so they do not exactly agree with the differences in the projection space
Stepwise dimensionality reduction: in each step, the differences are recalculated in the current dimensionality
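A single (non-stepwise) NNDA projection could look like the sketch below. It is only an illustration under assumptions: the function name `nnda` and the exponent parameter `alpha` are hypothetical, Euclidean nearest neighbors are used, and every class is assumed to contain at least two distinct samples.

```python
import numpy as np

def nnda(X, labels, p, alpha=1.0):
    """Rows of W: top-p eigenvectors of S'_B - S'_W built from nearest-neighbor differences."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                     # a sample is not its own neighbor
    same = labels[:, None] == labels[None, :]
    nearest_intra = np.where(same, dist, np.inf).argmin(axis=1)
    nearest_extra = np.where(~same, dist, np.inf).argmin(axis=1)
    delta_I = X - X[nearest_intra]                     # intra-class differences
    delta_E = X - X[nearest_extra]                     # extra-class differences
    norm_I = np.linalg.norm(delta_I, axis=1) ** alpha
    norm_E = np.linalg.norm(delta_E, axis=1) ** alpha
    w = norm_I / (norm_I + norm_E)                     # sample weights w_n
    S_B = (delta_E * w[:, None]).T @ delta_E
    S_W = (delta_I * w[:, None]).T @ delta_I
    vals, vecs = np.linalg.eigh(S_B - S_W)
    order = np.argsort(vals)[::-1][:p]
    return vecs[:, order].T

# Two classes separated along axis 0; axis 1 carries only noise.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
X = np.column_stack([10.0 * labels + rng.normal(scale=0.1, size=40),
                     rng.normal(size=40)])
W = nnda(X, labels, 1)
```

The stepwise variant mentioned above would call such a routine repeatedly, lowering the dimensionality a little at each step and recomputing the neighbors in the current space.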

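Normalization Methods

Image Domain Mean and Variance Normalization: aims to extract similar visual feature vectors from each block across sessions of the same subject

$$\hat{\mathbf{x}}_b = \frac{1}{\sigma_b} (\mathbf{x}_b - \mu_b)$$

Feature Normalization: aims to reduce inter-session variability and intra-class variance

1. Norm Division (ND): $\hat{\mathbf{f}} = \mathbf{f} / \|\mathbf{f}\|$

2. Sample Variance Normalization (SVN): $\hat{f}_i = f_i / \sigma(f_i)$

3. Block Mean and Variance Normalization (BMVN): $\hat{\mathbf{f}}_b = \frac{1}{\sigma_{f_b}} (\mathbf{f}_b - \mu_{f_b})$

4. Feature Vector Mean and Variance Normalization (FMVN): $\hat{\mathbf{f}} = \frac{1}{\sigma_f} (\mathbf{f} - \mu_f)$

Here $\mu$ and $\sigma$ denote the mean and standard deviation of the subscripted quantity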
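A small sketch of two of these normalizations. It assumes that the mean and standard deviation are computed over the elements of the vector being normalized; the slides do not spell this out, and the function names are hypothetical.

```python
import numpy as np

def mean_variance_normalize(v):
    """Subtract the mean and divide by the standard deviation of the vector's own elements."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def norm_division(f):
    """Norm Division (ND): scale a feature vector to unit Euclidean norm."""
    f = np.asarray(f, dtype=float)
    return f / np.linalg.norm(f)

# A 16x16 block (vectorized) and the same block under a gain/offset illumination change.
rng = np.random.default_rng(0)
block = rng.uniform(0, 255, size=256)
brighter = 1.5 * block + 20.0
normalized = mean_variance_normalize(block)
normalized_brighter = mean_variance_normalize(brighter)
unit = norm_division(block)
```

Under this assumption, mean and variance normalization makes a block invariant to a positive gain and an offset in intensity, which is the kind of illumination change the image domain normalization targets.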

Patch-Based Face Recognition

Face images are analyzed locally in order to eliminate or reduce the effects of illumination changes, occlusion and expression changes
A detected face is divided into blocks of 16x16 or 8x8 pixels
Dimensionality reduction techniques are applied to each block separately

[Figure: a face image divided into 16x16 blocks and into 8x8 blocks]
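Dividing a face image into non-overlapping blocks can be sketched with a reshape. The function name `split_into_blocks` is hypothetical, and the image size is assumed to be a multiple of the block size.

```python
import numpy as np

def split_into_blocks(image, size):
    """Split an image into non-overlapping size x size blocks, one vectorized block per row."""
    h, w = image.shape
    blocks = image.reshape(h // size, size, w // size, size).swapaxes(1, 2)
    return blocks.reshape(-1, size * size)

image = np.arange(64 * 64, dtype=float).reshape(64, 64)  # stand-in for a 64x64 face image
blocks16 = split_into_blocks(image, 16)   # 16 blocks of 256 pixels
blocks8 = split_into_blocks(image, 8)     # 64 blocks of 64 pixels
```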

Patch-Based Face Recognition

Dimension reduction: 64x64 = 4096 features → 192 features (16 blocks x 12 features or 64 blocks x 3 features)

Following feature extraction
  Feature Fusion: concatenate the features from each block in order to create the visual feature vector of an image
  Decision Fusion: classify each block separately and then combine the individual recognition results of each block

Originating point of this study: Global PCA vs. patch-based PCA

Method              Accuracy
Global PCA          83.45%
Block PCA (8x8)     83.78%
Block PCA (16x16)   83.78%

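Classification Method: Nearest Neighbor Classifier

Why nearest neighbor classification?
Different distance metrics:
  Lp-norm between a d-dimensional training sample $\mathbf{f}_{train}$ and test sample $\mathbf{f}_{test}$

$$L_p(\mathbf{f}_{train}, \mathbf{f}_{test}) = \left( \sum_{n=1}^{d} |f_{train,n} - f_{test,n}|^p \right)^{1/p}$$

  Cosine of the angle between a d-dimensional training sample and test sample

$$\mathrm{COS}(\mathbf{f}_{train}, \mathbf{f}_{test}) = \frac{\mathbf{f}_{train}^T \mathbf{f}_{test}}{\|\mathbf{f}_{train}\| \cdot \|\mathbf{f}_{test}\|}$$

In our experiments we used the L2-norm, but we also evaluated some of the most promising methods with the L1-norm and COS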
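A minimal nearest neighbor classifier with interchangeable distance functions. The function names are hypothetical, and using 1 minus the cosine as a distance (so that smaller means closer) is an assumption made for this sketch.

```python
import numpy as np

def lp_distance(a, b, p=2):
    """Lp-norm of the difference between two feature vectors."""
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)

def cos_distance(a, b):
    """1 - cosine of the angle between two feature vectors (assumed distance form)."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest_neighbor(train, labels, test, dist=lp_distance):
    """Label of the training sample closest to the test sample."""
    distances = [dist(f, test) for f in train]
    return labels[int(np.argmin(distances))]

train = np.array([[1.0, 0.0], [3.0, 4.0], [0.0, 5.0]])
labels = np.array([0, 1, 2])
test = np.array([2.0, 3.0])
pred_l2 = nearest_neighbor(train, labels, test)
pred_l1 = nearest_neighbor(train, labels, test, dist=lambda a, b: lp_distance(a, b, p=1))
pred_cos = nearest_neighbor(train, labels, test, dist=cos_distance)
```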

Classification Method: Nearest Neighbor Classifier

Distance to class posterior probabilities: $p(C_i \mid \mathbf{x})$
Depends on the distance of $\mathbf{x}$ to the nearest training sample from each class

$$D = [\,D(1)\;\; D(2)\;\; \cdots\;\; D(N)\,]$$

$$p(C_i \mid \mathbf{x}) = \mathrm{sigm}\left( \log \frac{\sum_{j \neq i} D(j)}{D(i)} \right), \qquad \mathrm{sigm}(x) = \frac{1}{1 + e^{-x}}$$

After the posterior probability is calculated for each class, the values are normalized by dividing by their sum so that they sum to 1
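The conversion from distances to normalized posteriors can be sketched as below. The function names are hypothetical, the sum over the other classes inside the logarithm follows the reading of the formula given here, and all distances are assumed to be strictly positive.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def distances_to_posteriors(D):
    """Map per-class nearest-sample distances D(1..N) to posteriors that sum to 1."""
    D = np.asarray(D, dtype=float)
    others = D.sum() - D                  # sum over j != i of D(j)
    raw = sigm(np.log(others / D))        # equals 1 - D(i) / sum_j D(j)
    return raw / raw.sum()                # normalize so the posteriors sum to 1

posteriors = distances_to_posteriors([1.0, 2.0, 3.0])
```

For the distances 1, 2 and 3 this gives 5/12, 4/12 and 3/12, so the nearest class receives the highest posterior.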

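Feature Fusion

Defining an image in vector form as

$$\mathbf{x} = [\,\mathbf{x}_1^T\;\; \mathbf{x}_2^T\;\; \cdots\;\; \mathbf{x}_B^T\,]^T$$

where $B$ is the number of blocks and $\mathbf{x}_b$ denotes the vectorized b-th block of the image, we find a linear transformation matrix $W_b$ for each block such that

$$\mathbf{f}_b = W_b \mathbf{x}_b, \qquad \mathbf{f} = [\,\mathbf{f}_1^T\;\; \mathbf{f}_2^T\;\; \cdots\;\; \mathbf{f}_B^T\,]^T$$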
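Feature fusion then amounts to projecting each block with its own matrix and concatenating the results. In the sketch below the function name `feature_fusion` is hypothetical and the random matrices merely stand in for trained transforms $W_b$.

```python
import numpy as np

def feature_fusion(blocks, transforms):
    """Concatenate f_b = W_b x_b over all blocks into one feature vector."""
    return np.concatenate([W_b @ x_b for W_b, x_b in zip(transforms, blocks)])

rng = np.random.default_rng(0)
blocks = rng.normal(size=(16, 256))           # 16 vectorized 16x16 blocks
transforms = rng.normal(size=(16, 12, 256))   # placeholder W_b, 12 features per block
f = feature_fusion(blocks, transforms)        # 16 x 12 = 192 features
```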

Decision Fusion

Combining the decisions of the classifiers trained on different blocks
Output of a classifier is the set of class posterior probabilities $\{p(C_i \mid \mathbf{x}_b) : 1 \le i \le N\}$
Fixed combiners
  Mean, maximum, minimum, median, sum, product of the set $\{p(C_i \mid \mathbf{x}_b) : 1 \le b \le B\}$
  Majority voting of the individual classifier decisions
Trainable combiners
  Use the output of the classifiers as a feature set
  From the class posterior probabilities of several classifiers, a new classifier is trained to provide the final decision

$$\hat{i} = \arg\max_{i}\; p(C_i \mid \mathbf{x}), \qquad p(C_i \mid \mathbf{x}) = \mathrm{rule}\,\{p(C_i \mid \mathbf{x}_b) : 1 \le b \le B\}$$
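The fixed combiners can be sketched on a small matrix of posteriors with one row per block classifier and one column per class. The function names and the example values are hypothetical.

```python
import numpy as np

RULES = {"sum": np.sum, "mean": np.mean, "max": np.max,
         "min": np.min, "median": np.median, "product": np.prod}

def combine(P, rule="sum"):
    """Apply a fixed rule over the block axis and return the winning class index."""
    return int(np.argmax(RULES[rule](P, axis=0)))

def majority_vote(P):
    """Each block votes for its own most probable class."""
    votes = P.argmax(axis=1)
    return int(np.bincount(votes, minlength=P.shape[1]).argmax())

# B = 3 block classifiers, N = 3 classes.
P = np.array([[0.6, 0.4, 0.0],
              [0.1, 0.2, 0.7],
              [0.5, 0.3, 0.2]])
decisions = {rule: combine(P, rule) for rule in RULES}
vote = majority_vote(P)
```

On this example the rules disagree (sum and product pick class 0, maximum picks class 2, minimum picks class 1), which shows why the choice of combiner matters.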

Trainable Combiners

Training data is separated into training data and validation data
Stacked generalization

Trainable Combiners

Resulting class posterior probabilities are concatenated into a vector as

$$[\,p(C_1 \mid \mathbf{x}_1),\; p(C_2 \mid \mathbf{x}_1),\; \ldots,\; p(C_{N-1} \mid \mathbf{x}_B),\; p(C_N \mid \mathbf{x}_B)\,]$$

The length of this input feature vector of the combiner is $N \times B$
In the sum rule (fixed combiner)
  The posterior probabilities for one class from each classifier are summed
  A weighted summation of the posterior probabilities can be performed instead

$$p(C_i \mid \mathbf{x}) = \sum_{b=1}^{B} w_b\, p(C_i \mid \mathbf{x}_b)$$

Fixed combination method with trainable weights
How to assign weights?
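A sketch of the stacked input vector and of the weighted sum rule, again with one row per block and one column per class. The function names, the example posteriors and the second weight vector are hypothetical.

```python
import numpy as np

def stacked_vector(P):
    """Concatenate all posteriors, block by block, into one vector of length N * B."""
    return P.reshape(-1)

def weighted_sum_rule(P, w):
    """p(C_i | x) = sum_b w_b p(C_i | x_b) for every class i."""
    return w @ P

P = np.array([[0.6, 0.4, 0.0],
              [0.1, 0.2, 0.7],
              [0.5, 0.3, 0.2]])
equal = np.full(3, 1.0 / 3.0)                 # equal weights, w_b = 1/B
trusted_block = np.array([0.1, 0.8, 0.1])     # hypothetical weights favoring block 2
fused_equal = weighted_sum_rule(P, equal)
fused_trusted = weighted_sum_rule(P, trusted_block)
```

With equal weights class 0 wins; when the second block is trusted most, the decision moves to class 2.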

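Block Weights (Offline Weights)

Learned from training data and independent of test data

1. Equal Weights (EW): the contribution of each block is assumed to be the same, $w_b = 1/B$

2. Score Weighting (SW): depends on the posterior probability distributions of true and wrong labels on validation data

  Positive examples: $\mathrm{PS} = [\,p(C_i \mid \mathbf{x}_1)\;\; p(C_i \mid \mathbf{x}_2)\;\; \cdots\;\; p(C_i \mid \mathbf{x}_B)\,]$
  Negative examples: $\mathrm{NS} = [\,p(C_j \mid \mathbf{x}_1)\;\; p(C_j \mid \mathbf{x}_2)\;\; \cdots\;\; p(C_j \mid \mathbf{x}_B)\,]$
  where $C_i$ is the true class of the validation sample and $j = 1{:}N,\; j \neq i$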


2. (SW continued) LDA finds the linear combination under which the positive and negative score vectors are most separated in the projected space

The 16-dimensional score vectors (one score per block) are projected to 1 dimension, and the coefficients of this mapping are used as the block weights $w_b$ in

$$p(C_i \mid \mathbf{x}) = \sum_{b=1}^{B} w_b\, p(C_i \mid \mathbf{x}_b)$$


3. Validation Accuracy Weighting (VAW): depends on the individual recognition rate of each block on validation data

$$w_b = \frac{\mathrm{acc}(b)}{\sum_{k=1}^{B} \mathrm{acc}(k)}$$

However, the most trusted blocks might not contain much information in a given test image, for example due to partial occlusion
A weighting scheme that depends only on the training dataset might therefore not be trustworthy; a more interactive scheme that takes the test sample into account is believed to provide more accurate weight assignments
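Validation accuracy weights can be computed directly from the per-block predictions on the validation set. The function name and the toy predictions below are hypothetical.

```python
import numpy as np

def validation_accuracy_weights(block_predictions, true_labels):
    """w_b = acc(b) / sum_k acc(k), with acc(b) the validation accuracy of block b."""
    acc = (block_predictions == true_labels[None, :]).mean(axis=1)
    return acc / acc.sum()

# Rows: B = 3 block classifiers; columns: 4 validation samples.
block_predictions = np.array([[0, 1, 1, 0],
                              [0, 1, 0, 0],
                              [1, 1, 1, 1]])
true_labels = np.array([0, 1, 1, 0])
weights = validation_accuracy_weights(block_predictions, true_labels)
```

Here the block accuracies are 1.0, 0.75 and 0.5, so the weights become 4/9, 3/9 and 2/9.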

Confidence Weighting (Online Weighting)

Each test sample is treated separately, and individual block weights for each test sample are calculated according to the reliability or confidence of each block
Confidence features are extracted from each block for each sample in the validation data and labeled as "correctly classified" or "misclassified"
Similarity, a measure of the closeness of a feature to the mean feature:

$$s_b = (\mathbf{f}^b)^T \cdot \bar{\mathbf{f}}^b$$

where $\bar{\mathbf{f}}^b$ denotes the mean feature vector of block $b$

Block selection
  Aims to discard blocks that are not helpful
  Blocks are sorted according to block similarity
  Selected blocks are weighted according to their confidence weights
  The remaining blocks are discarded (their weights are set to zero)

Experiments and Results - Databases

M2VTS database
  37 subjects, 5 video shots (8 randomly selected frames from each video)
  4 tapes for training, 1 tape for testing (includes variations such as different hairstyles, glasses, hats and scarves)
  32 training images/subject, 8 test images/subject
  1184 (32x37) training images, 296 (8x37) testing images

Experiments and Results - Databases

AR database
  120 subjects, two sessions (13 images in each session)
  First 7 images of each session: training
  Remaining 6 images of each session: testing (include sunglasses and scarf, i.e. partial occlusion)
  14 training images/subject, 12 test images/subject
  1680 training images, 1440 test images

Closed Set Identification

Identifying an unknown face if the subject is known to be in the database

Experiments on the M2VTS database

Effect of image domain normalization (IDN):

Method   w/o IDN   with IDN
DCT      85.47%    87.84%
PCA      83.78%    87.50%
LDA      84.80%    84.46%
APAC     86.15%    86.15%
NPCA     83.78%    87.50%
NLDA     87.16%    83.11%
NNDA     85.47%    87.84%

Experiments on the M2VTS database

Feature Fusion
  LDA, APAC and NLDA provide higher recognition accuracies
  FMVN increases accuracies; other normalization methods are inconsistent
  16x16 blocks provide higher accuracies than 8x8 blocks
  The highest accuracy is obtained by NLDA + FMVN: 93.45%

Decision Fusion
  DCT and NNDA provide the highest recognition accuracies
  Image normalization contributes positively (except for DCT)
  All feature normalization methods are helpful
  The baseline is EW; in most cases SW and VAW perform better
  The highest accuracies are
    DCT + ND (VAW): 97.30%
    NNDA + SVN (SW): 96.96%

Experiments on the AR database

Feature Fusion
  Less data dependent transforms (DCT, PCA, NPCA and NNDA) perform well
  LDA, APAC and NLDA face problems when there is not enough training data
  Image domain normalization is not helpful, as train and test data have similar illumination conditions
  ND increases accuracies
  The highest recognition rate is NNDA + ND: 48.08%

Experiments on the AR database

Decision Fusion
  DCT, PCA and NNDA provide the highest recognition accuracies
  Image normalization is not helpful
  All feature normalization methods are helpful
  The baseline is EW; in most cases SW and VAW perform better
  The highest accuracies are
    NNDA + ND (VAW): 85.97%
    DCT + SVN (VAW): 84.65%

Single training data experiment
  To illustrate the effect of the normalization methods, using DCT and EW

Normalization   Accuracy
NN              42.36%
ND              44.03%
BMVN            43.82%
FMVN            45.14%

Confidence Weighting and Block Selection

The weights calculated are close to each other (almost same as EW)

PCA without any normalization methods and EW : 65.49% (AR)

Different Distance Metrics

For some of the cases that provide the highest recognition rates:

Database / Method   L2-norm   L1-norm   COS
M2VTS DCT           97.30%    96.62%    93.92%
M2VTS NNDA          96.96%    87.16%    91.55%
AR NNDA             85.90%    88.47%    89.10%
AR DCT              84.65%    85.53%    86.39%

Comparison with Other Techniques

CSU Face Identification Evaluation System
  PCA, PCA + LDA, Bayesian Intrapersonal/Extrapersonal Difference Classifier
  Lining up eye coordinates, masking face, histogram equalization, pixel normalization

Method            M2VTS     AR
PCA Euclidean     86.48%    22.15%
PCA Mahalanobis   88.17%    42.56%
PCA + LDA         100.0%    21.94%
Bayesian MAP      91.89%    23.95%
Bayesian ML       92.56%    27.84%

Our implementation of illumination correction + global DCT / global PCA

Method       M2VTS     AR
Global DCT   93.58%    47.54%
Global PCA   89.53%    48.46%

Our highest accuracies: 97.30% for M2VTS and 89.10% for AR

Open Set Identification

There is a rejection option
  Determines if the unknown face belongs to the database
  Finds the identity of the subject from the database
False Accept Rate (FAR) vs. False Reject Rate + False Classification Rate (FRR+FCR)

M2VTS database, CSI: 97.30% (DCT + ND), EER: 14.89%

Verification

Confirming or rejecting an unknown face's claimed identity
FAR vs. FRR

M2VTS database, CSI: 97.30% (DCT + ND), EER: 5.74%
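For the verification setting, FAR, FRR and an approximate equal error rate can be computed from genuine and impostor scores as sketched below. The function names and the toy scores are hypothetical; a claim is assumed to be accepted when its score reaches the threshold, and the EER is approximated at the observed score where FAR and FRR are closest.

```python
import numpy as np

def far_frr(genuine, impostor, threshold):
    """FAR: fraction of impostor claims accepted; FRR: fraction of genuine claims rejected."""
    far = np.mean(np.asarray(impostor) >= threshold)
    frr = np.mean(np.asarray(genuine) < threshold)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Scan the observed scores as thresholds and return the rate where FAR and FRR meet."""
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    rates = [far_frr(genuine, impostor, t) for t in thresholds]
    far, frr = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (far + frr) / 2.0

genuine = [0.9, 0.8, 0.7, 0.4]     # scores of true identity claims (toy values)
impostor = [0.6, 0.3, 0.2, 0.1]    # scores of false identity claims (toy values)
eer = equal_error_rate(genuine, impostor)
```

In the open set case the same scan would compare FAR against FRR + FCR instead of FRR alone.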

Conclusion and Future Work

Different dimensionality reduction and normalization techniques were evaluated for feature fusion and decision fusion methods
Dimensionality reduction methods can be categorized as DCT, PCA, NPCA, NNDA (less data dependent transforms) and LDA, APAC, NLDA (data dependent transforms)
Patch-based face recognition is superior to global approaches
Decision fusion provides higher recognition results than feature fusion
Contributions:
  Recently proposed dimensionality reduction techniques are applied to patch-based face recognition
  Image level and feature level normalization methods are introduced
  Use of decision fusion techniques for patch-based face recognition is introduced, and the weights in the "weighted sum rule" are estimated using a novel method

Conclusion and Future Work

Future Work
  Moving block centers so that each block corresponds to the same location on the face for all images of all subjects
  Using color information in addition to gray-scale intensity values
  More accurate distance-to-posterior-probability conversion for nearest neighbor classification

Thank you ...
