Learning Individual-Specific Dictionaries with Fused Multiple Features for Face Recognition

Shu Kong, Student Member, IEEE, and Donghui Wang, Member, IEEE

Abstract— Recent research places increasing emphasis on exploiting multiple features to improve classification performance. One popular scheme is to extend the sparse representation-based classification framework with various regularizations. These methods sparsely encode the query image over the training set under different constraints and achieve very encouraging performance in various applications, especially in face recognition (FR). However, they concentrate only on how to collaboratively encode the query, and ignore the latent relationships among the multiple features that could further improve classification accuracy. It is reasonable to expect that the low-level features of facial images, such as edges and the smoothed/low-frequency image, can be fused through such relationships into a more compact and more discriminative representation for better FR performance. Focusing on this, we propose a unified FR framework that takes advantage of this latent relationship and makes full use of the fused features. Our method realizes the following tasks: (1) learning a specific dictionary for each individual that captures the most distinctive features; (2) learning a common pattern pool that provides the less-discriminative, shared patterns of all individuals, such as illuminations and poses; (3) simultaneously learning a fusion matrix that merges the features into a more discriminative and more compact representation. We perform a series of experiments on publicly available databases to evaluate our method, and the experimental results demonstrate the effectiveness of the proposed approach.

I. INTRODUCTION

With the recent endeavors of computer vision researchers, an increasing number of feature types have been designed to describe various aspects of visual characteristics. In practice, multiple visual data for one individual can also be derived from heterogeneous sensors, e.g. visible light cameras, infrared cameras and laser range finders. Facing these multiple information cues, researchers have begun to study multiple information sources jointly for object recognition tasks, and very good classification performance has been reported [5]. Even though it is widely believed that classification performance can benefit from jointly studying multiple features, effective processing and analysis methods are still badly needed.

On another track, researchers employ the dictionary learning (DL) model for classification tasks in a supervised fashion [10], [13]. By exploiting the label information of the objects, DL-based methods, which form a particular class of sparse coding models, learn a classification-oriented dictionary mainly in two ways [12]: either (1) directly making the dictionary discriminative, e.g. by learning a class-specific sub-dictionary for each class, or (2) making the sparse coefficients discriminative so as to propagate the discrimination power to the dictionary.

This work is supported by the 973 Program (Project No. 2010CB327904) and the Natural Science Foundation of China (No. 61071218).

Shu Kong and Donghui Wang are with the Department of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China. {aimerykong, dhwang}@zju.edu.cn

Although these DL-based classification methods achieve very promising or even state-of-the-art performance on many public datasets for various applications, especially for face recognition (FR), they only work on a single feature type, e.g. the original gray-level facial image, rather than multiple informative features. Thus, they cannot easily exploit multiple features of the same physical object and their possible semantic correlations for better performance.

To address the aforementioned problems, i.e. how to take advantage of multi-feature information and how to employ the powerful sparse coding framework for classification, researchers have proposed several methods [26], [27], [24]. Yuan and Yan propose a multi-task joint sparse representation-based classification method (MTJSRC), which, with the help of multiple features, treats recognition as a multi-task problem in which each feature type is one task [26]. They assume that the sparse coefficients of all the features share the same sparsity pattern, and identify the query image according to the reconstruction error accumulated over all the feature types. However, this assumption is too strict and does not hold in practice. Therefore, Zhang et al. propose a joint dynamic sparse representation classification method (JDSRC)¹ [27] to address this problem. They argue that the same sparsity pattern is shared among the coefficients at class level, but not necessarily at atom level. Yang et al. also address this problem by proposing a relaxed collaborative representation method (RCR), which assumes the sparse codes with respect to all the feature types should be similar in appearance [24]. All three methods elaborately consider the sparsity pattern among the coefficients of different feature types, and achieve very encouraging performance on FR. However, these methods merely use the training data as a pre-defined dictionary, which can become very large as the number of training samples increases, making recognition computationally expensive. Moreover, directly dealing with all the available features can be too redundant for classification, so the recognition stage becomes very time-consuming. From the classification perspective, these multi-feature cues have semantic connections with each other, and if we estimate their semantic relationship, we can learn a more compact and more discriminative dictionary for better classification performance.

¹Strictly speaking, the joint dynamic sparse coding method deals with object recognition through multiple observations, but the framework can be extended to recognition with multi-modal features, as stated by the authors in [27].


Moreover, a multitude of low-level features have the potential to enhance recognition accuracy, such as the edges of the face, the smoothed image and the low-frequency component of the face image, but no methods have been proposed to explore them. Concentrating on the above issues, in this paper we propose a unified framework that explores the relationships among these features for better FR performance. Our method realizes the following goals:

1) automatically fusing the features into a more compact and more discriminative representation of the facial images;

2) learning a compact and discriminative individual-specific dictionary for each person to capture his or her most distinctive features;

3) learning a common pattern pool which provides the shared and essential patterns for reconstructing the face images of all persons.

The rest of this paper is organized as follows. In Section II, we briefly review several approaches that motivate ours. The proposed method is elaborated in Section III, followed by experimental evaluation in Section IV. Finally, we conclude the paper in Section V with discussions.

II. RELATED WORK

Sparse representation-based classification (SRC) [21] is a far-reaching method built on sparse coding theory. It achieves very encouraging performance on face recognition, with robustness to illumination changes and occlusions. Suppose there are C classes of individual faces, and let D = [X_1, …, X_c, …, X_C] ∈ R^{p×N} be the original training set, where X_c ∈ R^{p×N_c} consists of all the N_c column-vector-represented training samples from class c and N = Σ_{c=1}^{C} N_c. SRC treats D as a pre-defined dictionary. Given a query facial image x ∈ R^p, it identifies x through a two-stage procedure: (1) sparsely encode x over D via the ℓ₁-norm minimization a = argmin_a ∥x − Da∥²₂ + λ∥a∥₁, where λ is a scalar balancing the reconstruction error against the sparsity level; (2) assign x to class c such that c = argmin_i ∥x − X_i δ_i(a)∥²₂, where δ_i(·) is a vector indicator function that extracts the elements corresponding to the ith training set X_i.
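To make the two-stage procedure concrete, here is a minimal sketch of SRC in Python (our own illustration, not the authors' code), using scikit-learn's Lasso as the ℓ₁ solver; note that Lasso scales the squared error by 1/(2p), so its alpha is not numerically identical to the λ above:

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(D, class_index, x, lam=0.01):
    """Two-stage SRC: (1) l1-encode x over D, (2) pick the class whose
    training columns best reconstruct x from its own coefficients.

    D           : (p, N) matrix of all training samples (the dictionary).
    class_index : list of index arrays; class_index[c] selects class c's columns.
    x           : (p,) query image.
    """
    # Stage 1: sparse coding, roughly min_a ||x - D a||_2^2 + lam * ||a||_1.
    coder = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    coder.fit(D, x)
    a = coder.coef_
    # Stage 2: class-wise reconstruction residuals, i.e. ||x - X_i delta_i(a)||_2.
    residuals = [np.linalg.norm(x - D[:, idx] @ a[idx]) for idx in class_index]
    return int(np.argmin(residuals))
```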

However, since SRC directly uses the original training set as the dictionary, this pre-defined dictionary incorporates too much redundancy as well as noisy and trivial information that can degrade performance. Additionally, when the training data grow in number, the sparse coding computation becomes a major bottleneck. To circumvent this problem, Yang et al. propose a Metaface learning method that learns a class-specific dictionary D_c = [d_1^c, …, d_{K_c}^c] ∈ R^{p×K_c} for each individual [23]:

D_c = argmin_{D_c, A_c} ∥X_c − D_c A_c∥²_F + λ∥A_c∥₁,  s.t. ∥d_j^c∥₂ = 1, ∀j = 1, …, K_c,   (1)

where ∥A_c∥₁ is defined as the sum of the ℓ₁-norms of all the columns of A_c = [a_1^c, …, a_{N_c}^c] ∈ R^{K_c×N_c}, i.e. ∥A_c∥₁ = Σ_{j=1}^{N_c} ∥a_j^c∥₁. This method concatenates all the sub-dictionaries into an overall dictionary D = [D_1, …, D_C] for classification, the same as in the second stage of SRC.

As pointed out in the literature [13], [17], the learned class-specific dictionaries usually share some common patterns, which do not help or even degrade the discrimination of the facial images but are essential for representing them. This observation applies well to FR, as facial images usually share common patterns such as illuminations and poses. Considering this, the recently proposed DL-COPAR method by Kong and Wang distinguishes the class-specific features from the common patterns [13]. It learns C class-specific dictionaries D_c, one for each of the C individuals, and a common pattern pool D_{C+1} ∈ R^{p×K_{C+1}}. The two parts constitute the overall dictionary D = [D_1, …, D_c, …, D_C, D_{C+1}] ∈ R^{p×K}, in which K = Σ_{c=1}^{C+1} K_c.

The existing DL-based methods (including DL-COPAR) can be summarized, at the dictionary learning stage, by the following general framework (omitting the constraint on dictionary atoms):

min_{D,A} { f ≡ C(X, D, A) + λφ(A) + ηQ(D) }.   (2)

In Metaface learning, C(X, D, A) = Σ_{c=1}^{C} ∥X_c − D_c A_c∥²_F, φ(A) = Σ_{c=1}^{C} ∥A_c∥₁, and η is set to zero. For DL-COPAR, the three terms are explained as follows:

• C(·) measures the reconstruction error of the training data under the dictionary. It ensures that the dictionary has the power to represent any face image, and that the images from a specific class (say class c) can be well approximated by the collaborative effort of the cth sub-dictionary D_c and the common pattern pool D_{C+1}. Define a selection operator R_c = [r_1^c, …, r_j^c, …, r_{K_c}^c] ∈ R^{K×K_c}, in which r_j^c ∈ R^K is the unit vector whose single nonzero entry, 1, lies at the jth position of the block belonging to D_c, i.e. at index Σ_{m=1}^{c−1} K_m + j. Therefore we have [D_c, D_{C+1}] = D[R_c, R_{C+1}], and R_c^T A_c selects the part of the coefficients corresponding to the cth sub-dictionary, and similarly for the common pattern pool. Denoting Q_c = [R_c, R_{C+1}], C(·) is defined as:

C(X, D, A) ≡ Σ_{c=1}^{C} { ∥X_c − D A_c∥²_F + ∥X_c − D Q_c Q_c^T A_c∥²_F },   (3)

• Denoting Q_{/c} = [R_1, …, R_{c−1}, R_{c+1}, …, R_C], the term φ(A) is defined as:

φ(A) ≡ Σ_{c=1}^{C} { ∥Q_{/c}^T A_c∥²_F + λ∥A_c∥₁ }.

It can be seen that φ(A) pushes the coefficients A to be sparse via the ℓ₁-norm penalty and regularizes A to guarantee that the irrelevant sub-dictionaries D_i (i ≠ c and i ≠ C+1) do not contribute to the representation of the images from class c.

• Q(D) works on the dictionary to drive the different sub-dictionaries to be as incoherent as possible (a small numerical sketch follows this list):

Q(D) ≡ Σ_{c=1}^{C+1} Σ_{j=1, j≠c}^{C+1} ∥D_c^T D_j∥²_F.
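As that sketch (ours, not part of DL-COPAR's released code), the incoherence penalty can be evaluated directly from a list of sub-dictionaries, with the common pattern pool stored as the last entry:

```python
import numpy as np

def incoherence_penalty(subdicts):
    """Q(D) = sum over pairs c != j of ||D_c^T D_j||_F^2, where subdicts
    holds the class-specific sub-dictionaries plus the common pattern pool
    as its last entry. Small values mean nearly mutually incoherent parts."""
    total = 0.0
    for c, Dc in enumerate(subdicts):
        for j, Dj in enumerate(subdicts):
            if c != j:
                total += np.linalg.norm(Dc.T @ Dj, 'fro') ** 2
    return total
```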

DL-COPAR optimizes the objective function alternately over its variables, and adopts an SRC-like scheme [21] for classification. In this paper we adopt a similar framework to incorporate and fuse multiple features for FR, as DL-COPAR produces especially promising performance in FR.

III. OUR METHODOLOGY

In this section, we first introduce how to incorporate multiple features under the dictionary learning framework. Then, we unify the fusion process into the framework. Thirdly, we provide a simple and effective initialization process for the proposed method. Finally, we explain our classification scheme.

A. Exploiting multiple features to learn the dictionary

In recent years, several approaches [24], [26], [27] have been proposed to exploit multiple features via sparse coding techniques to boost classification performance. Without doubt, these coding techniques improve FR performance to some extent by carefully considering the structured sparsity patterns of the coefficients, but they merely use the original training set² as a pre-defined dictionary. As noted in Section II, this pre-defined dictionary has two drawbacks: (1) it incorporates too much redundancy as well as noisy and trivial information, and (2) it makes the sparse coding process much more time-consuming when the training set grows large. Therefore, learning a compact dictionary for each individual is a preferable alternative.

Suppose we have K types of features³; each facial image is then represented by a matrix, or second-order tensor [11], [20]. Specifically, x_i^k denotes the kth feature of the ith facial image X_i; if this image is from the cth person, we write i ∈ I_c. Under the framework of Eq. 2, we can easily extend the first term (Eq. 3) of DL-COPAR to exploit multiple features:

C(X, D, A) ≡ Σ_{c=1}^{C} Σ_{i∈I_c} Σ_{k=1}^{K} { ∥x_i^k − D^k a_i^k∥²_F + ∥x_i^k − D^k Q_c Q_c^T a_i^k∥²_F },   (4)


²Here the so-called training set no longer consists merely of the original facial images, but of the features generated from the images.

³In this paper, we restrict our study to the assumption that all K features have the same length p. Even though features of different lengths can be brought to the same dimension by techniques such as PCA, we do not exploit them, since the tensor representation of same-length features preserves the spatial correspondence information [22].

Fig. 1. The K features x^1, …, x^K of a query datum X ∈ R^{p×K} are approximated by K dictionaries (D ∈ R^{p×D×K}, one slice D^k per feature type) with K sparse coefficient vectors a^1, …, a^K.

Similarly, the penalty term over the coefficients and the incoherence term over the sub-dictionaries can be extended respectively:

φ*(A) ≡ Σ_{c=1}^{C} Σ_{k=1}^{K} { ∥Q_{/c}^T A_c^k∥²_F + λ∥A_c^k∥₁ },

Q(D) ≡ Σ_{c=1}^{C+1} Σ_{j=1, j≠c}^{C+1} Σ_{k=1}^{K} ∥(D_c^k)^T D_j^k∥²_F.   (5)

By substituting the terms defined above into Eq. 2, we derive a unified formulation that incorporates multiple features to learn multiple dictionaries, one for each feature type. However, the feature types remain independent of each other; the formulation can be cast as running DL-COPAR on each of the K features alone. This means the latent recognition-oriented relationships among the K overall dictionaries, which could further improve performance, are not explored. In the next subsection, we elaborate how to explore these recognition-oriented relationships.

B. Fusing multiple features

Suppose we have already learned the K dictionaries, one per feature type, D^k ∈ R^{p×D} for k = 1, …, K, and arrange them into a tensor D ∈ R^{p×D×K}, as illustrated in Fig. 1. It is intuitive to assume that each feature-specific dictionary has its own discrimination power, and that all the feature-specific dictionaries can be connected with each other through some relationship that can yield better classification accuracy. Our goal is therefore to exploit this relationship among the dictionaries for better FR performance and lower computational burden. In our work, we assume there is a semantic core dictionary⁴ B ∈ R^{p×D×M} (M < K or M ≪ K) which can be linearly extended to all K dictionaries. In other words, as illustrated in Fig. 2, there is a transformation matrix W ∈ R^{K×M} such that D = B ×₃ W. Here B ×₃ W means the transformation matrix W is applied to the third mode of the tensor B (please refer to [11], [20] for details on tensor calculation, omitted due to limited space). With the core dictionary B and the transformation W, we rewrite Eq. 4 to derive a new term with the across-feature relationship explained by B and W:

⁴In this paper, we only consider the semantic meaning across the K dictionaries, and do not over-explore the semantic meaning at the atom level of each dictionary.

C(X, B, W, A) ≡ Σ_{c=1}^{C} Σ_{i∈I_c} Σ_{k=1}^{K} { ∥x_i^k − D^k a_i^k∥²_F + ∥x_i^k − D^k Q_c Q_c^T a_i^k∥²_F },
s.t. D = B ×₃ W, where D^k is the kth slice of D along the third mode.   (6)
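For readers unfamiliar with the mode-3 product, the constraint D = B ×₃ W reduces to a single einsum in numpy; the following sketch assumes our own array layout, with the third mode as the last axis:

```python
import numpy as np

def mode3_product(B, W):
    """Compute D = B x_3 W: each slice D[:, :, k] is the linear combination
    sum_m W[k, m] * B[:, :, m] of the core-dictionary slices.

    B : (p, D, M) core dictionary tensor.
    W : (K, M) transformation/fusion matrix.
    Returns D : (p, D, K), the K feature-specific dictionaries."""
    return np.einsum('pdm,km->pdk', B, W)
```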

From Eq. 6, we can see that the dictionaries of these features are connected by the core dictionary B through the relationship W. It should be noted, however, that if we aim to predict a query image via the K dictionaries, we still have to deal with all K feature types. That is to say, B ∈ R^{p×D×M} and W bring no computational benefit, and they do not necessarily capture the discriminative relationship needed for better classification. For this reason, we turn to an alternative.

Given a facial image with K features represented by X_i ∈ R^{p×K}, we let the fusion matrix W act on X_i rather than on the K dictionaries D^k (or B), resulting in a compact representation Y_i ∈ R^{p×M} of this image: Y_i = X_i W. Put another way, the multiple features are fused by the transformation matrix W. Therefore, we arrive at the final term of our objective function:

C*(X, B, W, A) ≡ Σ_{c=1}^{C} Σ_{i∈I_c} Σ_{m=1}^{M} { ∥X_i w_m − B_m a_i^m∥²_F + ∥X_i w_m − B_m Q_c Q_c^T a_i^m∥²_F },   (7)

where w_m is the mth column of W ∈ R^{K×M} and B_m ∈ R^{p×D} is the mth slice along the third mode of the core dictionary B ∈ R^{p×D×M}, corresponding to the mth fused feature. As M < K or even M ≪ K, the sparse coding involved in classification becomes more computationally efficient, and the semantic dictionary B can be fully exploited. Accordingly, the regularization on the core dictionary B is redefined as:

Q*(B) ≡ Σ_{c=1}^{C+1} Σ_{j=1, j≠c}^{C+1} Σ_{m=1}^{M} ∥(B_c^m)^T B_j^m∥²_F.   (8)

Now the questions are how to initialize the fusion matrix W and how to drive it toward better FR performance. Next, we introduce a simple and effective initialization method together with a regularization term on W.

C. Initialization and regularization

Since the goal of the fusion matrix W is to improve FR performance, one intuition is that the fused data should be as separable as possible at the class level and as concentrated as possible within each class/individual. To achieve this, maximizing the Fisher criterion [7] is a good start. The Fisher criterion measures the ratio of the between-class distances of the data to the within-class distances. Denote the mean of the cth individual by X̄_c = (1/N_c) Σ_{i∈I_c} X_i and, similarly, the global mean by X̄ = (1/N) Σ_{i=1}^{N} X_i. Then we can simply initialize the fusion matrix W as:

W = argmax_W [ Σ_{c=1}^{C} N_c ∥(X̄_c − X̄)W∥²_F ] / [ Σ_{c=1}^{C} Σ_{i∈I_c} ∥(X_i − X̄_c)W∥²_F ]
  = argmax_W tr{ W^T [ Σ_{c=1}^{C} N_c (X̄_c − X̄)^T (X̄_c − X̄) ] W } / tr{ W^T [ Σ_{c=1}^{C} Σ_{i∈I_c} (X_i − X̄_c)^T (X_i − X̄_c) ] W }.   (9)

Let S_b = Σ_{c=1}^{C} N_c (X̄_c − X̄)^T (X̄_c − X̄) and S_w = Σ_{c=1}^{C} Σ_{i∈I_c} (X_i − X̄_c)^T (X_i − X̄_c). Solving Eq. 9 for W is then equivalent to computing the generalized eigenvectors [7] of S_b w = λ S_w w for the M largest eigenvalues λ ≠ 0. Note that Eq. 9 is not the same as classical linear discriminant analysis (LDA) [7]; rather, it is a special case of two-dimensional LDA [25] or multilinear discriminant analysis [22] that only deals with the relationship of the multi-feature information along the feature-type mode. Therefore, the dimension M of the fused data does not have to equal C − 1, and S_b and S_w generally have full rank in practice.
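A minimal sketch of this initialization with numpy/scipy follows (our own illustration; the data layout, an N×p×K array with integer labels, is an assumption):

```python
import numpy as np
from scipy.linalg import eigh

def init_fusion_matrix(X, labels, M):
    """Initialize W (K x M) by the Fisher-style criterion of Eq. 9: build
    the K x K scatter matrices along the feature-type mode and keep the
    generalized eigenvectors of (Sb, Sw) with the largest eigenvalues.

    X      : (N, p, K) array; X[i] is the p x K multi-feature image.
    labels : (N,) integer class labels.
    """
    K = X.shape[2]
    global_mean = X.mean(axis=0)                     # (p, K)
    Sb = np.zeros((K, K))
    Sw = np.zeros((K, K))
    for c in np.unique(labels):
        Xc = X[labels == c]                          # (Nc, p, K)
        class_mean = Xc.mean(axis=0)
        diff = class_mean - global_mean
        Sb += len(Xc) * diff.T @ diff                # between-class scatter
        for Xi in Xc - class_mean:
            Sw += Xi.T @ Xi                          # within-class scatter
    # Generalized eigenproblem Sb w = lambda Sw w (Sw is assumed positive
    # definite, which generally holds in practice; see the text above).
    # eigh returns ascending eigenvalues, so take the last M eigenvectors.
    eigvals, eigvecs = eigh(Sb, Sw)
    return eigvecs[:, -M:]
```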

To preserve this discrimination power of W throughout the optimization process, we turn to the maximum margin criterion (MMC) [15] to derive a regularization term on W:

ψ*(W) ≡ tr( W^T (S_w − S_b) W ).   (10)

Intuitively, the above MMC-based term pushes the features fused by W to be as discriminative as possible. Meanwhile, it effectively circumvents the small-sample-size problem. Note that W is not required to be column-wise orthogonal. We now arrive at our final objective function by adding this term:

f ≡ C*(X, B, W, A) + λφ*(A) + ηQ*(B) + ωψ*(W),   (11)

where C*(X, B, W, A), φ*(A) and Q*(B) are defined in Eq. 7, Eq. 5 and Eq. 8, respectively, and λ, η and ω are constant scalars balancing the contribution of each term. After the initialization of W, we have the fused representations of all the facial images; we then use K-SVD [1] to initialize the M dictionaries of the core dictionary B and the corresponding coefficients A. This objective function is easy to optimize alternately over each parameter, and it converges since the update of each parameter is convex when the others are fixed.
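Given the scatter matrices from the initialization step, the MMC regularizer of Eq. 10 is a one-liner; the sketch below is again our own illustration:

```python
import numpy as np

def mmc_penalty(W, Sb, Sw):
    """psi*(W) = tr(W^T (Sw - Sb) W) from Eq. 10. Adding omega * psi*(W)
    to the objective f of Eq. 11 penalizes fusion matrices whose fused
    features have large within-class and small between-class scatter."""
    return float(np.trace(W.T @ (Sw - Sb) @ W))
```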

D. Classification scheme

Fig. 2. The multi-modal dictionary D ∈ R^{p×D×K} can be constructed from the semantic core dictionary B ∈ R^{p×D×M} (K > M or K ≫ M) and a transformation W ∈ R^{K×M}.

After the core dictionary B and the fusion matrix W are learned, we can use them to recognize a query facial image. Recently developed methods, e.g. MTJSRC [26], JDSRC [27] and RCR [24], adopt a sparse representation-based classification (SRC) scheme with different regularizations on the coefficients. In detail, given a query image with K features X ∈ R^{p×K}, we first fuse the features with the learned W ∈ R^{K×M} into Y = XW, and then encode Y over the core dictionary B:

A = argmin_A Σ_{m=1}^{M} ∥y_m − B_m a_m∥²_F + λφ(A),

where φ(A) is a regularization on the coefficients A = [a_1, …, a_M] (different from that used in the DL process). It is φ(A) that distinguishes these methods from one another. Concretely, MTJSRC utilizes an ℓ₂,₁-norm penalty over the coefficients [26], JDSRC uses group sparsity [27], whereas RCR adopts an overall-similarity term [24]. Note that the order of atoms across the learned dictionaries is crucial for MTJSRC and RCR, as their sparsity patterns rely on the atom order across the dictionaries. Therefore, only the group-sparsity regularization of JDSRC can be directly used in our framework.

In this paper, we also employ a group-level sparse regularizer under the SRC scheme, but a restricted one that allows only one class-specific sub-dictionary together with the common pattern pool to represent the query image. This constraint enables us to adopt a local sparse coding method for classification [13], i.e. using one individual sub-dictionary and the common pool to reconstruct the facial image:

1) Calculate the reconstruction error over the core dictionary B w.r.t. the cth individual, for c = 1, …, C:

e_c = min_{α^m} Σ_{m=1}^{M} { ∥y_m − B_m Q_c Q_c^T α^m∥²₂ + λ_α ∥α^m∥²₂ };

2) Identify the query image with the person producing the smallest reconstruction error: label(X) = argmin_c e_c.

This classification scheme is just a least-squares problem [28] with Tikhonov regularization; thus it is much easier and more efficient to implement than solving the ad hoc sparse coding problems of [26], [27], [24].
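The two steps above amount to one ridge regression per class. The following sketch is our own illustration, with B stored as a list of M slices and per-class column-index arrays playing the role of the selectors Q_c:

```python
import numpy as np

def classify(B_slices, class_cols, common_cols, Y, lam_alpha=0.01):
    """Label a query by the smallest Tikhonov-regularized reconstruction
    error, using one individual-specific sub-dictionary plus the common
    pattern pool at a time (the restricted group-level scheme above).

    B_slices    : list of M arrays, each (p, D), the slices of B.
    class_cols  : list over classes of column-index arrays into D.
    common_cols : column indices of the common pattern pool.
    Y           : (p, M) fused query, Y = X W; y_m is Y[:, m].
    """
    errors = []
    for cols in class_cols:
        idx = np.concatenate([cols, common_cols])   # atoms of [B_c, B_{C+1}]
        e_c = 0.0
        for m, Bm in enumerate(B_slices):
            S = Bm[:, idx]
            # Ridge solution of min_a ||y_m - S a||_2^2 + lam_alpha ||a||_2^2.
            a = np.linalg.solve(S.T @ S + lam_alpha * np.eye(S.shape[1]),
                                S.T @ Y[:, m])
            e_c += np.sum((Y[:, m] - S @ a) ** 2)
        errors.append(e_c)
    return int(np.argmin(errors))
```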

IV. EXPERIMENT

In this section, we evaluate our method through a series of experiments on three publicly available datasets⁵:

• Extended YaleB [14] contains 2,414 frontal face images of 38 persons under different illuminations, about 68 images per individual. The cropped images are resized to 32×32-pixel resolution for the experiments.

⁵In this paper, we only use 2D face image databases for the experiments, rather than 2.5D or 3D images [9], as all the compared approaches are run on 2D databases in the literature.

TABLE I
SUMMARY OF DICTIONARY SIZES FOR EACH DATABASE.

Dataset          D_class   D_common   D
Extended YaleB   10        10         390
CMU-PIE          12        14         830
AR               6         8          764

• CMU-PIE [18] is a challenging database due to its varied illuminations and expressions. In our experiments, we use a subset that contains five near-frontal poses (C05, C07, C09, C27, C29) of all 68 individuals under different illuminations and expressions [3]; therefore, there are 170 images per person in total.

• AR [16] contains two sessions of data from 70 male and 56 female subjects, each with 26 images normalized to 50×40-pixel resolution. We use the samples from Session 1 for training and those from Session 2 for testing.

For the first two datasets, we randomly select half of the images (about 32 images per person in Extended YaleB and 85 in CMU-PIE) for training and use the rest for testing. We start this section with a discussion of the multiple features and the parameters.

A. Features and parameters

As our method fuses multiple features for FR, we use K = 10 low-level features in this paper, as illustrated by Fig. 3 (a sketch of how such features might be extracted follows the list):

1) The original gray-scale facial image.
2) The corresponding histogram-equalized image.
3) The low-frequency Fourier feature [19] of the original image, with cut-off frequency 8.
4) The low-frequency Fourier feature of the histogram-equalized image, with cut-off frequency 8.
5) The Sobel edge image [8] of the original image (threshold = 15), with a 3×3 mean filter.
6) The Sobel edge image of the histogram-equalized image (threshold = 15), with a 3×3 mean filter.
7) The Canny edge image [4] of the original image (low threshold = 30 and high threshold = 76), with a 3×3 mean filter.
8) The Canny edge image of the low-frequency Fourier feature (same thresholds as above), with a 3×3 mean filter.
9) The Canny edge image of the original image (low threshold = 20 and high threshold = 51), with a 3×3 mean filter.
10) The Sobel edge image of the original image (threshold = 20), with a 3×3 mean filter.
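As an illustration of how such low-level features might be generated, here is a sketch using OpenCV and numpy; it is our own reconstruction of the pipeline, and details such as the Sobel magnitude thresholding and the Fourier masking are assumptions rather than the authors' exact implementation:

```python
import cv2
import numpy as np

def low_freq_fourier(img, cutoff=8):
    """Keep only the Fourier coefficients within `cutoff` of the DC term."""
    F = np.fft.fftshift(np.fft.fft2(img.astype(float)))
    r, c = F.shape[0] // 2, F.shape[1] // 2
    mask = np.zeros(F.shape)
    mask[r - cutoff:r + cutoff + 1, c - cutoff:c + cutoff + 1] = 1.0
    out = np.abs(np.fft.ifft2(np.fft.ifftshift(F * mask)))
    return np.clip(out, 0, 255).astype(np.uint8)

def sobel_edges(img, thresh):
    """Thresholded Sobel gradient magnitude, then a 3x3 mean filter."""
    gx = cv2.Sobel(img, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(img, cv2.CV_64F, 0, 1)
    edges = (cv2.magnitude(gx, gy) > thresh).astype(np.uint8) * 255
    return cv2.blur(edges, (3, 3))

def canny_edges(img, lo, hi):
    """Canny edge map, then a 3x3 mean filter."""
    return cv2.blur(cv2.Canny(img, lo, hi), (3, 3))

def extract_features(img):
    """Stack the ten features of one uint8 gray-scale face into a p x 10 matrix."""
    eq = cv2.equalizeHist(img)                           # feature 2
    lf = low_freq_fourier(img)                           # feature 3
    feats = [img, eq, lf, low_freq_fourier(eq),          # features 1-4
             sobel_edges(img, 15), sobel_edges(eq, 15),  # features 5-6
             canny_edges(img, 30, 76),                   # feature 7
             canny_edges(lf, 30, 76),                    # feature 8
             canny_edges(img, 20, 51),                   # feature 9
             sobel_edges(img, 20)]                       # feature 10
    return np.stack([f.ravel().astype(float) for f in feats], axis=1)
```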

Fig. 3. Ten features of five persons randomly selected from the Extended YaleB database [14]. The left panel shows the features of four individuals, one per row; the right panel displays the features of one person under four different illumination conditions.

TABLE II
COMPARISON OF FACE RECOGNITION ACCURACY (%) ON THE THREE DATASETS. TWO SETTINGS ARE REPORTED: RANDOMLY SELECTED K = 5 FEATURES, AND ALL K = 10 FEATURES. NOTE THAT WE FIX THE NUMBER OF FUSED FEATURES AT M = 3, i.e. OUR METHOD FUSES THE K FEATURES INTO M NEW FEATURES FOR THE REPRESENTATION; THE OTHER METHODS INVOLVE NO FUSION PROCESS.

Method   Extended YaleB (K=5 / K=10)    CMU-PIE (K=5 / K=10)           AR (K=5 / K=10)
H-SRC    97.11±0.32 / 97.40±0.27        94.39±0.40 / 95.44±0.34        90.19±0.30 / 91.46±0.44
S-SRC    94.20±0.23 / 94.85±0.34        92.04±0.33 / 92.61±0.31        87.54±0.63 / 89.11±0.46
MTJSRC   98.33±0.40 / 98.56±0.31        95.61±0.41 / 96.92±0.24        93.91±0.41 / 94.61±0.31
JDSRC    98.50±0.26 / 99.03±0.27        96.46±0.39 / 97.51±0.20        94.46±0.39 / 94.51±0.29
RCR      98.12±0.30 / 98.47±0.41        95.83±0.38 / 97.24±0.30        94.53±0.32 / 95.24±0.30
Ours     98.78±0.28 / 99.63±0.24        97.65±0.41 / 98.33±0.23        95.76±0.41 / 96.92±0.43

In our experiments, we do not differentiate among the features, but assign them equal weights⁶.

Our objective function, Eq. 11, has three parameters balancing the influence of each term when learning the dictionary. We set λ so that approximately 10% of the coefficient elements are nonzero. η controls the incoherence of the individual-specific dictionaries, and we set it to 1, the same as in [13]. A larger ω pushes the fused features to be more mutually discriminative; we therefore set it to a relatively large scalar, ω = 10, in our experiments.

As for the core dictionary B ∈ R^{p×D×M}, we vary D across datasets, as summarized in Table I, in which D_class and D_common denote the numbers of atoms constituting each individual-specific dictionary and the common pattern pool, respectively (D = #individuals × D_class + D_common).

B. Face recognition

To verify the effectiveness of our method, we compare it with five closely related and state-of-the-art methods: holistic SRC (H-SRC) [21], separate SRC (S-SRC) [27], MTJSRC [26], JDSRC [27] and RCR [24].

⁶The weight assigned to each feature can yield different FR performance, as discussed for both JDSRC [27] and RCR [24]. In this paper, we simply set the weight to 1 for all features.

H-SRC and S-SRC act as baseline methods: H-SRC concatenates all the features into a single large vector, while S-SRC, an intuitive extension of SRC, encodes each feature separately and then sums the reconstruction errors of all cues for FR using the classification scheme of SRC [21]. Note that MTJSRC only uses two types of features in [26], and it performs better on the ten features used in this paper than reported in [26]; JDSRC manually selects face regions in [27] for multi-region FR, yet we run it on the global features to report results; RCR also considers sophisticated segmentation of face images for multi-region FR, but we run it for multi-feature FR to report its performance. Moreover, the above three approaches have parameters (cf. Eq. 2) controlling the sparsity level of the coefficients, and we tune them to obtain the best performance of each method. None of the compared algorithms involves a fusion process; they directly encode the query over the training set (the derived features).

Instead of using all ten features, we also randomly select K ≤ 10 features for comparison. In detail, we randomly select a feature subset 10 times, and each subset is run 5 times; thus 50 runs are performed for each method on each database. Note that only our method fuses the features; the others merely use all the selected features without any fusion process. We report the averaged accuracy with the standard deviation.

Direct comparisons of FR accuracy are listed in Table II. Two settings are included: (1) K = 5 features randomly selected out of the ten, and (2) all K = 10 features. Note that we set the number of fused features to M = 3, i.e. our method fuses the K features into M = 3 new features to represent each facial image. From this table, we can easily see that our method consistently outperforms all the others. It is worth noting that our method fuses the K features into a smaller set of M = 3 new ones, while the others use all K features for recognition. When K = 10, i.e. all ten features are adopted for FR, our approach fuses them into a more compact representation with only M = 3 new features and produces better FR performance. This demonstrates that the proposed method uses a representation roughly three times smaller than the original multi-feature representation while achieving the highest face recognition accuracy.

To make a clearer comparison, we plot FR accuracy versus the number of adopted features in Fig. 4. From these plots, we can see that our proposed method always achieves a higher FR rate than the others; on the AR database in particular, ours outperforms the state-of-the-art approaches by a large margin. For all the methods, it appears that more features mean better FR performance. This is an issue worth studying, and we defer a brief discussion of it to Section V.

In Fig. 5, we display the three fused features of the five individuals previously shown in Fig. 3. The first row illustrates the original facial images, and the following three rows correspond to the three fused features. From this figure, especially from the second and third rows, we can see that the fused features are robust to illumination changes. Even though shadows are cast at various illumination angles, they are alleviated in the fused features, which provides resilience to illumination changes. Additionally, from the second row, i.e. the first fused feature, we can see that the trivial pixels of the face, which carry no discriminative power, are smoothed out, while the important patterns, e.g. the eyes and the nose, are highlighted. We attribute this to the pixel-level spatial correspondence of the tensor-represented features, which intuitively explains why the fused features produce better FR performance.

C. Face recognition under different numbers of fused features

In this subsection, we further examine the effect of the fused feature number M on the recognition rate using the three publicly available datasets. Different from the previous subsection, where M was fixed at 3, we now vary M over the range {1, 2, 3, 4, 5, 6, 7} using all K = 10 features; the plots in Fig. 6 give the performance of our method on the three datasets. For each dataset, we randomly select about 50% of the facial images per person for training and use the rest for testing (this differs from the AR protocol of the previous subsection). All the experiments are repeated 10 times, and the averaged accuracy with the standard deviation is plotted as error bars in Fig. 6.

Fig. 4. Recognition accuracy vs. the number of features K used for fusion on the three datasets: (a) Extended YaleB, (b) CMU-PIE, (c) AR.

From Fig. 6, we can see that as M increases within a range, the FR rates generally improve as well.

But once M reaches a certain threshold, the FR performance saturates. This is easy to explain: a larger M means more discriminative fused features are generated. As the regularization term and the initialization of the fusion matrix W are motivated by the Fisher criterion [7] and MMC [15], this phenomenon resembles Fisher LDA, where classification becomes more accurate as the reduced dimension increases within a reasonable range.

Fig. 5. The three fused features of different persons. The top row shows the original facial images of the five people previously illustrated in Fig. 3. The left part shows the fused features of four individuals, one per column; the right part displays those of the same individual under four different illumination conditions.

Fig. 6. Recognition accuracy vs. fused feature number M. The M new features are fused from all ten features.

V. CONCLUSION

In this paper, we briefly reviewed several recently developed methods that either improve the dictionary learning framework or jointly consider multiple features to achieve better FR performance, and analyzed some of their drawbacks. To circumvent these drawbacks, we proposed a unified method that fuses various features into a more compact and more discriminative representation for better FR. Our approach also learns an individual-specific dictionary for each person and a common feature pool for all the subjects. The common feature pool consists of the patterns shared by all individuals, thus making the learned overall dictionary more compact and more discriminative. For recognition, the proposed method encodes the query (its multiple features) over the dictionary sparsely at the group level, but we restrict the coding process so that only one individual-specific dictionary and the common feature pool contribute to the approximation. Therefore, we merely use a least-squares fitting criterion for the coding. This coding scheme is effective owing to the help of the common feature pool. Through experimental validation, we demonstrated that the proposed method achieves very decent FR performance in comparison with other closely related and state-of-the-art approaches.

However, compared with MTJSRC, JDSRC and RCR, one limitation of our method is that it only fuses global features of the same length. There is no doubt that, besides the global features, local and statistical ones [6], [2] could further improve performance; but such features can vary in dimension, making the fusion process nontrivial and complicated. Therefore, as future work, we will consider how to extend our method to fuse heterogeneous features of various dimensionalities. Moreover, in this paper, the proposed method blindly fuses all the available features for better FR. Undoubtedly, some of them may help little or even compromise FR performance. How to automatically select the most informative features to fuse is therefore another direction for future research.

REFERENCES

[1] M. Aharon, M. Elad, and A. M. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Processing, 54(11):4311–4322, 2006.
[2] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Trans. Pattern Anal. Mach. Intelligence, 28(12):2037–2041, 2006.
[3] D. Cai, X. He, Y. Hu, J. Han, and T. Huang. Learning a spatially smooth subspace for face recognition. In CVPR, 2007.
[4] J. Canny. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intelligence, 8(6):679–698, 1986.
[5] L. Cao, J. Luo, F. Liang, and T. S. Huang. Heterogeneous feature machines for visual recognition. In ICCV, 2009.
[6] Z. Cui, S. Shan, X. Chen, and L. Zhang. Sparsely encoded local descriptor for face recognition. In FG, 2011.
[7] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, California, USA, 1990.
[8] R. C. Gonzalez and R. E. Woods. Digital Image Processing. Prentice Hall, Upper Saddle River, New Jersey, USA, third edition, 2008.
[9] S. Gupta, M. K. Markey, and A. C. Bovik. Anthropometric 3D face recognition. International Journal of Computer Vision, 90(3):331–349, 2010.
[10] Z. Jiang, Z. Lin, and L. S. Davis. Learning a discriminative dictionary for sparse coding via label consistent K-SVD. In CVPR, 2011.
[11] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, September 2009.
[12] S. Kong and D. Wang. A brief summary of dictionary learning based approach for classification. CoRR abs/1205.6544, 2012.
[13] S. Kong and D. Wang. A dictionary learning approach for classification: separating the particularity and the commonality. In ECCV, 2012.
[14] K. Lee, J. Ho, and D. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans. Pattern Anal. Mach. Intelligence, 27(5):684–698, 2005.
[15] H. Li, T. Jiang, and K. Zhang. Efficient and robust feature extraction by maximum margin criterion. In NIPS, 2003.
[16] A. Martinez and R. Benavente. The AR face database. CVC Technical Report #24, 1998.
[17] I. Ramirez, P. Sprechmann, and G. Sapiro. Classification and clustering via dictionary learning with structured incoherence and shared features. In CVPR, 2010.
[18] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression database. IEEE Trans. Pattern Anal. Mach. Intelligence, 25(12):1615–1618, 2003.
[19] Y. Su, S. Shan, X. Chen, and W. Gao. Hierarchical ensemble of global and local classifiers for face recognition. In ICCV, 2007.
[20] D. Wang and S. Kong. Feature selection from high-order tensorial data via sparse decomposition. Pattern Recognition Letters, 33(13):1695–1702, 2012.
[21] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intelligence, 2009.
[22] S. Yan, D. Xu, Q. Yang, L. Zhang, and X. Tang. Multilinear discriminant analysis for face recognition. IEEE Trans. Image Processing, 2007.
[23] M. Yang, L. Zhang, J. Yang, and D. Zhang. Metaface learning for sparse representation based face recognition. In ICIP, 2010.
[24] M. Yang, L. Zhang, D. Zhang, and S. Wang. Relaxed collaborative representation for pattern classification. In CVPR, 2012.
[25] J. Ye, R. Janardan, and Q. Li. Two-dimensional linear discriminant analysis. In NIPS, 2004.
[26] X.-T. Yuan and S. Yan. Visual classification with multi-task joint sparse representation. In CVPR, 2010.
[27] H. Zhang, N. M. Nasrabadi, Y. Zhang, and T. S. Huang. Multi-observation visual recognition via joint dynamic sparse representation. In ICCV, 2011.
[28] L. Zhang, M. Yang, and X. Feng. Sparse representation or collaborative representation: Which helps face recognition? In ICCV, 2011.