
SUBMITTED TO IEEE TRANSACTIONS ON IMAGE PROCESSING, OCT 2016

Fast Low-rank Shared Dictionary Learning for Image Classification

Tiep Huu Vu, Student Member, IEEE, Vishal Monga, Senior Member, IEEE

Abstract— Despite the fact that different objects possess distinct class-specific features, they also usually share common patterns. This observation has been exploited partially in a recently proposed dictionary learning framework by separating the particularity and the commonality (COPAR). Inspired by this, we propose a novel method to explicitly and simultaneously learn a set of common patterns as well as class-specific features for classification with more intuitive constraints. Our dictionary learning framework is hence characterized by both a shared dictionary and particular (class-specific) dictionaries. For the shared dictionary, we enforce a low-rank constraint, i.e. claim that its spanning subspace should have low dimension and the coefficients corresponding to this dictionary should be similar. For the particular dictionaries, we impose on them the well-known constraints stated in the Fisher discrimination dictionary learning (FDDL). Further, we develop new fast and accurate algorithms to solve the subproblems in the learning step, accelerating its convergence. The said algorithms could also be applied to FDDL and its extensions. The efficiencies of these algorithms are theoretically and experimentally verified by comparing their complexities and running time with those of other well-known dictionary learning methods. Experimental results on widely used image datasets establish the advantages of our method over state-of-the-art dictionary learning methods.

Index terms—sparse coding, dictionary learning, low-rank models, shared features, object classification.

I. INTRODUCTION

Sparse representations have emerged as a powerful tool for a range of signal processing applications. Applications include compressed sensing [1], signal denoising, sparse signal recovery [2], image inpainting [3], image segmentation [4], and, more recently, signal classification. In such representations, most signals can be expressed as a linear combination of a few bases taken from a "dictionary". Based on this theory, a sparse representation-based classifier (SRC) [5] was initially developed for robust face recognition. Thereafter, SRC was adapted to numerous signal/image classification problems, including medical image classification [6]–[8], hyperspectral image classification [9]–[11], synthetic aperture radar (SAR) image classification [12], recaptured image recognition [13], video anomaly detection [14], and several others [15]–[20].

The authors are with the School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA 16802, USA (e-mail: [email protected]).

This work has been supported partially by the Office of Naval Research (ONR) under Grant 0401531 UP719Z0 and an NSF CAREER award to V.M.

Figure 1: Ideal structure of the coefficient matrix in SRC. Ideally, different classes lie in non-overlapping subspaces, so that $\mathbf{Y} = [\mathbf{Y}_1, \ldots, \mathbf{Y}_c, \ldots, \mathbf{Y}_C] \approx [\mathbf{D}_1, \ldots, \mathbf{D}_c, \ldots, \mathbf{D}_C]\,\mathbf{X}$ with $\mathbf{X}$ block diagonal.

The central idea in SRC is to represent a test sample (e.g. a face) as a linear combination of samples from the available training set. Sparsity manifests because most of the non-zero coefficients correspond to bases whose class memberships are the same as that of the test sample. Therefore, in the ideal case, each object is expected to lie in its own class subspace and all class subspaces are non-overlapping. Concretely, given $C$ classes and a dictionary $\mathbf{D} = [\mathbf{D}_1, \ldots, \mathbf{D}_C]$ with $\mathbf{D}_c$ comprising training samples from class $c$, $c = 1, \ldots, C$, a new sample $\mathbf{y}$ from class $c$ can be represented as $\mathbf{y} \approx \mathbf{D}_c\mathbf{x}^c$. Therefore, if we express $\mathbf{y}$ using the dictionary $\mathbf{D}$, i.e. $\mathbf{y} \approx \mathbf{D}\mathbf{x} = \mathbf{D}_1\mathbf{x}^1 + \cdots + \mathbf{D}_c\mathbf{x}^c + \cdots + \mathbf{D}_C\mathbf{x}^C$, then most of the active elements of $\mathbf{x}$ should be located in $\mathbf{x}^c$; hence, the coefficient vector $\mathbf{x}$ is expected to be sparse. In matrix form, let $\mathbf{Y} = [\mathbf{Y}_1, \ldots, \mathbf{Y}_c, \ldots, \mathbf{Y}_C]$ be the set of all samples, where $\mathbf{Y}_c$ comprises those in class $c$; the coefficient matrix $\mathbf{X}$ would then be sparse. In the ideal case, $\mathbf{X}$ is block diagonal (see Figure 1).

It has been shown that learning a dictionary from the training samples, instead of using all of them as a dictionary, can further enhance the performance of SRC. Most existing classification-oriented dictionary learning methods try to learn discriminative class-specific dictionaries by either imposing block-diagonal constraints on X or encouraging incoherence between class-specific dictionaries. Based on the K-SVD [3] model for general sparse representations, Discriminative K-SVD (D-KSVD) [21] and Label-Consistent K-SVD (LC-KSVD) [22], [23] learn discriminative dictionaries by encouraging a projection of the sparse codes X to be close to a sparse matrix with all non-zeros being one while satisfying a block-diagonal structure as in Figure 1. Vu et al. [6], [7] with DFDL and Yang et al. [24], [25] with FDDL apply Fisher-based ideas on dictionaries and sparse coefficients, respectively. Recently, Li et al. [26] with D²L²R² combined the Fisher-based idea with a low-rank constraint on each sub-dictionary. They claim that such a model reduces the negative effect of noise contained in training samples.


A. Closely Related work and Motivation

The assumption made by most discriminative dictionary learning methods, i.e. non-overlapping subspaces, is unrealistic in practice. Often, objects from different classes share some common features, e.g. the background in scene classification. This problem has been partially addressed by recent efforts, namely DLSI [27], COPAR [28] and CSDL [29]. However, DLSI does not explicitly learn shared features, since they remain hidden in the sub-dictionaries. COPAR and CSDL explicitly learn a shared dictionary D0 but suffer from the following drawbacks. First, we contend that the subspace spanned by the columns of the shared dictionary must have low rank; otherwise, class-specific features may also get represented by the shared dictionary. In the worst case, the span of the shared dictionary may include all classes, greatly diminishing classification ability. Second, the coefficients (in each column of the sparse coefficient matrix) corresponding to the shared dictionary should be similar. This implies that features are shared between training samples from different classes via the "shared dictionary". In this paper, we develop a new low-rank shared dictionary learning framework (LRSDL) which satisfies the aforementioned properties. Our framework is basically a generalized version of the well-known FDDL [24], [25] with the additional capability of capturing shared features, resulting in better performance. We also show that the practical merits of enforcing these constraints are significant.

The typical strategy for optimizing general dictionary learning problems is to alternately solve their subproblems, in which the sparse coefficients X are found while the dictionary D is fixed, or vice versa. In discriminative dictionary learning models, both the X and D matrices furthermore comprise several small class-specific blocks constrained by complicated structures, usually resulting in high computational complexity. Traditionally, X and D are solved block by block until convergence. In particular, each block Xc (or Dc in the dictionary update) is solved by again fixing all other blocks Xi, i ≠ c (or Di, i ≠ c). Although this greedy process leads to a simple algorithm, it not only produces inaccurate solutions but also requires a huge amount of computation. In this paper, we aim to mitigate these drawbacks by proposing efficient and accurate algorithms which allow us to directly solve X and D in two fundamental discriminative dictionary learning methods: FDDL [25] and DLSI [27]. These algorithms can also be applied to speed up our proposed LRSDL, COPAR [28], D²L²R² [26] and other related work.

B. Contributions

The main contributions of this paper are as follows:

1) A new low-rank shared dictionary learning framework (LRSDL)¹ for automatically extracting both discriminative and shared bases in several widely used image datasets is presented to enhance the classification performance of dictionary learning methods. Our framework simultaneously learns one dictionary per class to extract discriminative features as well as the shared features that all classes contain. For the shared part, we impose two intuitive constraints. First, the shared dictionary must have a low-rank structure; otherwise, the shared dictionary may also expand to contain discriminative features. Second, we contend that the sparse coefficients corresponding to the shared dictionary should be almost similar, i.e. the contributions of the shared dictionary to the reconstruction of every signal should be close together. We will experimentally show that both of these constraints are crucial for the shared dictionary.

¹The preliminary version of this work was presented at the IEEE International Conference on Image Processing, 2016 [30].

2) New accurate and efficient algorithms for selected existing and proposed dictionary learning methods. We present three effective algorithms for dictionary learning: i) sparse coefficient update in FDDL [25] using FISTA [31]. We address the main challenge in this algorithm – how to calculate the gradient of a complicated function effectively – by introducing a new simple function M(•) on block matrices and a lemma to support the result. ii) Dictionary update in FDDL [25] by a simple ODL [32] procedure using M(•) and another lemma. Because it is an extension of FDDL, the proposed LRSDL also benefits from the aforementioned efficient procedures. iii) Dictionary update in DLSI [27] by a simple ADMM [33] procedure which requires only one matrix inversion instead of several matrix inversions as originally proposed in [27]. We subsequently show that the proposed algorithms have both performance and computational benefits.

3) Complexity analysis. We derive the computational complexity of numerous dictionary learning methods in terms of the approximate number of operations (multiplications) needed. We also report the complexities and experimental running times of the aforementioned efficient algorithms and their original counterparts.

4) Reproducibility. Numerous sparse coding and dictionary learning algorithms in the manuscript are reproducible via a user-friendly toolbox. The toolbox includes implementations of SRC [5], ODL [32], LC-KSVD [23]², efficient DLSI [27], efficient COPAR [28], efficient FDDL [25], D²L²R² [26] and the proposed LRSDL. The toolbox (a MATLAB version and a Python version) is provided³ in the hope that it will be used in future research and comparisons by peer researchers.

The remainder of this paper is organized as follows. Section II presents our proposed dictionary learning framework, the efficient algorithms for its subproblems and one efficient procedure for updating dictionaries in DLSI and COPAR. The complexity analysis of several well-known dictionary learning methods is included in Section III. In Section IV, we show classification accuracies of LRSDL on widely used datasets in comparison with existing methods in the literature to reveal the merits of the proposed LRSDL. Section V concludes the paper.

²Source code for LC-KSVD is directly taken from the paper at: http://www.umiacs.umd.edu/~zhuolin/projectlcksvd.html.

³The toolbox can be downloaded at: http://signal.ee.psu.edu/lrsdl.html.

II. DISCRIMINATIVE DICTIONARY LEARNING FRAMEWORK

A. Notation

In addition to the notation stated in the Introduction, let $\mathbf{D}_0$ be the shared dictionary and $\mathbf{I}$ the identity matrix with dimension inferred from context. For $c = 1,\ldots,C$ and $i = 0,1,\ldots,C$, suppose that $\mathbf{Y}_c \in \mathbb{R}^{d\times n_c}$ and $\mathbf{Y} \in \mathbb{R}^{d\times N}$ with $N = \sum_{c=1}^{C} n_c$; $\mathbf{D}_i \in \mathbb{R}^{d\times k_i}$, $\mathbf{D} \in \mathbb{R}^{d\times K}$ with $K = \sum_{c=1}^{C} k_c$; and $\mathbf{X} \in \mathbb{R}^{K\times N}$. Denote by $\mathbf{X}^i$ the sparse coefficient of $\mathbf{Y}$ on $\mathbf{D}_i$, by $\mathbf{X}_c \in \mathbb{R}^{K\times n_c}$ the sparse coefficient of $\mathbf{Y}_c$ on $\mathbf{D}$, and by $\mathbf{X}^i_c$ the sparse coefficient of $\mathbf{Y}_c$ on $\mathbf{D}_i$. Let $\bar{\mathbf{D}} = [\mathbf{D},\ \mathbf{D}_0]$ be the total dictionary, $\bar{\mathbf{X}} = [\mathbf{X}^T, (\mathbf{X}^0)^T]^T$ and $\bar{\mathbf{X}}_c = [(\mathbf{X}_c)^T, (\mathbf{X}^0_c)^T]^T$. For every dictionary learning problem, we implicitly constrain each basis to have Euclidean norm no greater than 1. These variables are visualized in Figure 2a).

Let $\mathbf{m}, \mathbf{m}_0$, and $\mathbf{m}_c$ be the means of the columns of $\mathbf{X}, \mathbf{X}^0$, and $\mathbf{X}_c$, respectively. Let $\mathbf{M}_c = [\mathbf{m}_c,\ldots,\mathbf{m}_c] \in \mathbb{R}^{K\times n_c}$, $\mathbf{M}_0 = [\mathbf{m}_0,\ldots,\mathbf{m}_0] \in \mathbb{R}^{k_0\times N}$, and $\mathbf{M} = [\mathbf{m},\ldots,\mathbf{m}]$ be the corresponding mean matrices. The number of columns of $\mathbf{M}$ depends on context; e.g., by writing $\mathbf{M}_c - \mathbf{M}$, we mean that $\mathbf{M}$ and $\mathbf{M}_c$ have the same number of columns $n_c$. The mean vectors are illustrated in Figure 2c).

Given a function $f(\mathbf{A},\mathbf{B})$ with $\mathbf{A}$ and $\mathbf{B}$ being two sets of variables, define $f_{\mathbf{A}}(\mathbf{B}) = f(\mathbf{A},\mathbf{B})$ as a function of $\mathbf{B}$ when the set of variables $\mathbf{A}$ is fixed. Greek letters ($\lambda, \lambda_1, \lambda_2, \eta$) represent positive regularization parameters. Given a block matrix $\mathbf{A}$, define the function $\mathcal{M}(\mathbf{A})$ as follows:

\[
\mathbf{A} = \begin{bmatrix} \mathbf{A}_{11} & \dots & \mathbf{A}_{1C} \\ \mathbf{A}_{21} & \dots & \mathbf{A}_{2C} \\ \vdots & \ddots & \vdots \\ \mathbf{A}_{C1} & \dots & \mathbf{A}_{CC} \end{bmatrix} \;\mapsto\; \mathcal{M}(\mathbf{A}) = \mathbf{A} + \begin{bmatrix} \mathbf{A}_{11} & \dots & \mathbf{0} \\ \mathbf{0} & \dots & \mathbf{0} \\ \vdots & \ddots & \vdots \\ \mathbf{0} & \dots & \mathbf{A}_{CC} \end{bmatrix}. \qquad (1)
\]

That is, M(A) doubles the diagonal blocks of A. The row and column partitions of A are inferred from context. M(A) is a computationally inexpensive function of A and will be widely used in our LRSDL algorithm and the toolbox.
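To make the operator concrete, the following minimal numpy sketch (our own illustration, not the toolbox implementation; the block-size arguments are a convention introduced here) doubles the diagonal blocks of a block matrix:

```python
import numpy as np

def M(A, row_sizes, col_sizes):
    """Double the diagonal blocks of a block matrix A, as in Eq. (1)."""
    out = A.copy()
    r = c = 0
    for dr, dc in zip(row_sizes, col_sizes):
        out[r:r + dr, c:c + dc] += A[r:r + dr, c:c + dc]  # add A_cc once more
        r += dr
        c += dc
    return out

# Example: C = 3 classes, k = 2 bases per class, n = 4 samples per class.
C, k, n, d = 3, 2, 4, 10
D = np.random.randn(d, C * k)
Y = np.random.randn(d, C * n)
MDtD = M(D.T @ D, [k] * C, [k] * C)   # M(D^T D), used in the gradient (12)
MDtY = M(D.T @ Y, [k] * C, [n] * C)   # M(D^T Y), used in the gradient (12)
```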

We also recall here the FISTA algorithm [31] for solving the family of problems:

\[
\mathbf{X} = \arg\min_{\mathbf{X}} h(\mathbf{X}) + \lambda\|\mathbf{X}\|_1, \qquad (2)
\]

where $h(\mathbf{X})$ is convex and continuously differentiable with Lipschitz continuous gradient. FISTA is an iterative method which requires calculating the gradient of $h(\mathbf{X})$ at each iteration. In this paper, we will focus on calculating the gradient of $h$.
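For reference, a generic FISTA sketch for problems of form (2) is shown below. This is an illustrative re-implementation (not the toolbox code), assuming the caller supplies the gradient of h and a Lipschitz constant of that gradient:

```python
import numpy as np

def soft_threshold(V, tau):
    """Proximal operator of tau * ||.||_1 (elementwise soft thresholding)."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def fista(grad_h, L, lam, X0, num_iters=100):
    """Minimal FISTA sketch [31] for min_X h(X) + lam * ||X||_1."""
    X_old = X0.copy()
    Z = X0.copy()          # extrapolated point
    t_old = 1.0
    for _ in range(num_iters):
        X_new = soft_threshold(Z - grad_h(Z) / L, lam / L)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_old ** 2))
        Z = X_new + ((t_old - 1.0) / t_new) * (X_new - X_old)
        X_old, t_old = X_new, t_new
    return X_old

# Example: plain sparse coding min_X 0.5*||Y - D X||_F^2 + lam*||X||_1 (as in ODL-X).
d, k, n, lam = 20, 15, 30, 0.1
D, Y = np.random.randn(d, k), np.random.randn(d, n)
DtD, DtY = D.T @ D, D.T @ Y
L = np.linalg.norm(DtD, 2)                      # Lipschitz constant of the gradient
X = fista(lambda X: DtD @ X - DtY, L, lam, np.zeros((k, n)))
```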

B. Closely related work: Fisher discrimination dictionary learning (FDDL)

Figure 2: LRSDL idea (brown items: shared; red, green, blue items: class-specific). a) Notation. b) The discriminative fidelity constraint: a class-c sample is mostly represented by $\mathbf{D}_0$ and $\mathbf{D}_c$, i.e. $\|\mathbf{Y}_c - \mathbf{D}_c\mathbf{X}_c^c - \mathbf{D}_0\mathbf{X}_c^0\|_F^2$ and $\|\mathbf{D}_j\mathbf{X}_c^j\|_F^2$ ($j\neq c$, $j\neq 0$) are small; the shared dictionary $\mathbf{D}_0$ is modeled as a low-rank matrix. c) The Fisher-based discriminative coefficient constraint: $\|\mathbf{X}_c - \mathbf{M}_c\|_F^2$ (intra-class) small, $\|\mathbf{M}_c - \mathbf{M}\|_F^2$ (inter-class) large, and $\|\mathbf{X}^0 - \mathbf{M}_0\|_F^2$ small.

FDDL [24] has been widely used as a technique for jointly learning a structured dictionary and discriminative sparse coefficients. Specifically, the discriminative dictionary $\mathbf{D}$ and the sparse coefficient matrix $\mathbf{X}$ are learned by minimizing the following cost function:

\[
J_{\mathbf{Y}}(\mathbf{D},\mathbf{X}) = \frac{1}{2} f_{\mathbf{Y}}(\mathbf{D},\mathbf{X}) + \lambda_1\|\mathbf{X}\|_1 + \frac{\lambda_2}{2} g(\mathbf{X}), \qquad (3)
\]

where $f_{\mathbf{Y}}(\mathbf{D},\mathbf{X}) = \sum_{c=1}^{C} r_{\mathbf{Y}_c}(\mathbf{D},\mathbf{X}_c)$ is the discriminative fidelity with

\[
r_{\mathbf{Y}_c}(\mathbf{D},\mathbf{X}_c) = \|\mathbf{Y}_c - \mathbf{D}\mathbf{X}_c\|_F^2 + \|\mathbf{Y}_c - \mathbf{D}_c\mathbf{X}_c^c\|_F^2 + \sum_{j\neq c}\|\mathbf{D}_j\mathbf{X}_c^j\|_F^2,
\]

$g(\mathbf{X}) = \sum_{c=1}^{C}\big(\|\mathbf{X}_c - \mathbf{M}_c\|_F^2 - \|\mathbf{M}_c - \mathbf{M}\|_F^2\big) + \|\mathbf{X}\|_F^2$ is the Fisher-based discriminative coefficient term, and the $\ell_1$-norm encourages sparsity of the coefficients.

C. Proposed Low-rank shared dictionary learning (LRSDL)

In the presence of the shared dictionary, it is expected that $\mathbf{Y}_c$ can be well represented by the collaboration of the particular dictionary $\mathbf{D}_c$ and the shared dictionary $\mathbf{D}_0$. Concretely, the discriminative fidelity term $f_{\mathbf{Y}}(\mathbf{D},\mathbf{X})$ in (3) is extended to $\bar f_{\mathbf{Y}}(\bar{\mathbf{D}},\bar{\mathbf{X}}) = \sum_{c=1}^{C}\bar r_{\mathbf{Y}_c}(\bar{\mathbf{D}},\bar{\mathbf{X}}_c)$ with $\bar r_{\mathbf{Y}_c}(\bar{\mathbf{D}},\bar{\mathbf{X}}_c)$ defined as:

\[
\bar r_{\mathbf{Y}_c}(\bar{\mathbf{D}},\bar{\mathbf{X}}_c) = \|\mathbf{Y}_c - \bar{\mathbf{D}}\bar{\mathbf{X}}_c\|_F^2 + \|\mathbf{Y}_c - \mathbf{D}_c\mathbf{X}_c^c - \mathbf{D}_0\mathbf{X}_c^0\|_F^2 + \sum_{j=1, j\neq c}^{C}\|\mathbf{D}_j\mathbf{X}_c^j\|_F^2.
\]

Note that since $\bar r_{\mathbf{Y}_c}(\bar{\mathbf{D}},\bar{\mathbf{X}}_c) = r_{\bar{\mathbf{Y}}_c}(\mathbf{D},\mathbf{X}_c)$ with $\bar{\mathbf{Y}}_c = \mathbf{Y}_c - \mathbf{D}_0\mathbf{X}_c^0$ (see Figure 2b)), we have:

\[
\bar f_{\mathbf{Y}}(\bar{\mathbf{D}},\bar{\mathbf{X}}) = f_{\bar{\mathbf{Y}}}(\mathbf{D},\mathbf{X}), \qquad (4)
\]

with $\bar{\mathbf{Y}} = \mathbf{Y} - \mathbf{D}_0\mathbf{X}^0$.

The Fisher-based discriminative coefficient term $g(\mathbf{X})$ is extended to $\bar g(\bar{\mathbf{X}})$ defined as:

\[
\bar g(\bar{\mathbf{X}}) = g(\mathbf{X}) + \|\mathbf{X}^0 - \mathbf{M}_0\|_F^2, \qquad (5)
\]

where the term $\|\mathbf{X}^0 - \mathbf{M}_0\|_F^2$ forces the coefficients of all training samples represented via the shared dictionary to be similar (see Figure 2c)).

For the shared dictionary, as stated in the Introduction, we constrain $\mathrm{rank}(\mathbf{D}_0)$ to be small by using the nuclear norm $\|\mathbf{D}_0\|_*$, its convex relaxation [34]. Finally, the cost function $\bar J_{\mathbf{Y}}(\bar{\mathbf{D}},\bar{\mathbf{X}})$ of our proposed LRSDL is:

\[
\bar J_{\mathbf{Y}}(\bar{\mathbf{D}},\bar{\mathbf{X}}) = \frac{1}{2}\bar f_{\mathbf{Y}}(\bar{\mathbf{D}},\bar{\mathbf{X}}) + \lambda_1\|\bar{\mathbf{X}}\|_1 + \frac{\lambda_2}{2}\bar g(\bar{\mathbf{X}}) + \eta\|\mathbf{D}_0\|_*. \qquad (6)
\]

By minimizing this objective function, we can jointly find the appropriate dictionaries as we desire. Notice that if there is no shared dictionary $\mathbf{D}_0$ (by setting $k_0 = 0$), then $\bar{\mathbf{D}}, \bar{\mathbf{X}}$ become $\mathbf{D}, \mathbf{X}$, respectively, $\bar J_{\mathbf{Y}}(\bar{\mathbf{D}},\bar{\mathbf{X}})$ becomes $J_{\mathbf{Y}}(\mathbf{D},\mathbf{X})$, and our LRSDL reduces to FDDL.

Classification scheme:

After the learning process, we obtain the total dictionary $\bar{\mathbf{D}}$ and the mean vectors $\mathbf{m}_c, \mathbf{m}_0$. For a new test sample $\mathbf{y}$, we first find its coefficient vector $\bar{\mathbf{x}} = [\mathbf{x}^T, (\mathbf{x}^0)^T]^T$ with a sparsity constraint on $\bar{\mathbf{x}}$, while further encouraging $\mathbf{x}^0$ to be close to $\mathbf{m}_0$:

\[
\bar{\mathbf{x}} = \arg\min_{\bar{\mathbf{x}}} \frac{1}{2}\|\mathbf{y} - \bar{\mathbf{D}}\bar{\mathbf{x}}\|_2^2 + \frac{\lambda_2}{2}\|\mathbf{x}^0 - \mathbf{m}_0\|_2^2 + \lambda_1\|\bar{\mathbf{x}}\|_1. \qquad (7)
\]

Using $\bar{\mathbf{x}}$ as calculated above, we extract the contribution of the shared dictionary to obtain $\bar{\mathbf{y}} = \mathbf{y} - \mathbf{D}_0\mathbf{x}^0$. The identity of $\mathbf{y}$ is determined by:

\[
\arg\min_{1\le c\le C}\big( w\|\bar{\mathbf{y}} - \mathbf{D}_c\mathbf{x}^c\|_2^2 + (1-w)\|\mathbf{x} - \mathbf{m}_c\|_2^2 \big), \qquad (8)
\]

where $w \in [0,1]$ is a preset weight balancing the contribution of the two terms.
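A minimal sketch of this classification rule is given below, assuming the learned quantities (total dictionary, shared dictionary, class dictionaries and means) are available. The function and variable names are ours, and a plain ISTA loop stands in for the $\ell_1$ solver used in the toolbox:

```python
import numpy as np

def soft(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lrsdl_classify(y, D_bar, D0, class_dicts, means, m0, lam1, lam2, w,
                   num_iters=200):
    """Sketch of the LRSDL classification rule (7)-(8)."""
    K = D_bar.shape[1] - D0.shape[1]          # size of the class-specific part
    x_bar = np.zeros(D_bar.shape[1])
    L = np.linalg.norm(D_bar.T @ D_bar, 2) + lam2
    for _ in range(num_iters):                 # ISTA steps on problem (7)
        grad = D_bar.T @ (D_bar @ x_bar - y)
        grad[K:] += lam2 * (x_bar[K:] - m0)    # pull x^0 towards m0
        x_bar = soft(x_bar - grad / L, lam1 / L)
    x, x0 = x_bar[:K], x_bar[K:]
    y_bar = y - D0 @ x0                        # remove the shared contribution
    scores, start = [], 0
    for Dc, mc in zip(class_dicts, means):     # evaluate rule (8) per class
        kc = Dc.shape[1]
        xc = x[start:start + kc]
        scores.append(w * np.sum((y_bar - Dc @ xc) ** 2)
                      + (1 - w) * np.sum((x - mc) ** 2))
        start += kc
    return int(np.argmin(scores))

# Toy usage with random data: C = 3 classes, k = 4 bases each, k0 = 2 shared bases.
d, C, k, k0 = 30, 3, 4, 2
Dc_list = [np.linalg.qr(np.random.randn(d, k))[0] for _ in range(C)]
D0 = np.linalg.qr(np.random.randn(d, k0))[0]
D_bar = np.hstack(Dc_list + [D0])
means = [np.zeros(C * k) for _ in range(C)]
label = lrsdl_classify(np.random.randn(d), D_bar, D0, Dc_list, means,
                       np.zeros(k0), lam1=0.1, lam2=0.1, w=0.5)
```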

D. Efficient solutions for optimization problems

Before diving into minimizing the LRSDL objective function in (6), we first present efficient algorithms for minimizing the FDDL objective function in (3).

1) Efficient FDDL dictionary update (E-FDDL-D): Recall that in [24], the dictionary update step is divided into C subproblems, each of which updates one class-specific dictionary Dc while keeping the others fixed. This process is repeated until convergence. This approach is not only highly time consuming but also inaccurate; we will see this in a small example presented in Section IV-B. We refer to this original FDDL dictionary update as O-FDDL-D.

We propose here an efficient algorithm for updating the dictionary, called E-FDDL-D, in which the total dictionary D is optimized while X is fixed, significantly reducing the computational cost.

Concretely, when we fix X in equation (3), the problem of solving for D becomes:

\[
\mathbf{D} = \arg\min_{\mathbf{D}} f_{\mathbf{Y},\mathbf{X}}(\mathbf{D}). \qquad (9)
\]

Therefore, D can be solved using the following lemma.

Lemma 1: The optimization problem (9) is equivalent to:

\[
\mathbf{D} = \arg\min_{\mathbf{D}}\{-2\,\mathrm{trace}(\mathbf{E}\mathbf{D}^T) + \mathrm{trace}(\mathbf{F}\mathbf{D}^T\mathbf{D})\}, \qquad (10)
\]

where $\mathbf{E} = \mathbf{Y}\mathcal{M}(\mathbf{X}^T)$ and $\mathbf{F} = \mathcal{M}(\mathbf{X}\mathbf{X}^T)$.

Proof: See Appendix A.

Problem (10) can be solved effectively by the Online Dictionary Learning (ODL) method [32].

2) Efficient FDDL sparse coefficient update (E-FDDL-X): When D is fixed, X is found by solving:

\[
\mathbf{X} = \arg\min_{\mathbf{X}} h(\mathbf{X}) + \lambda_1\|\mathbf{X}\|_1, \qquad (11)
\]

where $h(\mathbf{X}) = \frac{1}{2} f_{\mathbf{Y},\mathbf{D}}(\mathbf{X}) + \frac{\lambda_2}{2} g(\mathbf{X})$. Problem (11) has the form of equation (2) and can hence be solved by FISTA [31]. We need to calculate the gradients of $f(\bullet)$ and $g(\bullet)$ with respect to X.

Lemma 2: The gradient of h(X) in (11) is given by

\[
\frac{\partial \frac{1}{2} f_{\mathbf{Y},\mathbf{D}}(\mathbf{X})}{\partial\mathbf{X}} = \mathcal{M}(\mathbf{D}^T\mathbf{D})\mathbf{X} - \mathcal{M}(\mathbf{D}^T\mathbf{Y}), \qquad (12)
\]

\[
\frac{\partial \frac{1}{2} g(\mathbf{X})}{\partial\mathbf{X}} = 2\mathbf{X} + \mathbf{M} - 2\widehat{\mathbf{M}}, \quad \text{with } \widehat{\mathbf{M}} = [\mathbf{M}_1, \mathbf{M}_2, \ldots, \mathbf{M}_C]. \qquad (13)
\]

Then we obtain:

\[
\frac{\partial h(\mathbf{X})}{\partial\mathbf{X}} = \big(\mathcal{M}(\mathbf{D}^T\mathbf{D}) + 2\lambda_2\mathbf{I}\big)\mathbf{X} - \mathcal{M}(\mathbf{D}^T\mathbf{Y}) + \lambda_2(\mathbf{M} - 2\widehat{\mathbf{M}}). \qquad (14)
\]

Proof: See Appendix B.
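To illustrate how (14) is evaluated inside FISTA, the following numpy sketch (our own illustration, assuming equal class sizes n and equal sub-dictionary sizes k) computes the gradient of h:

```python
import numpy as np

def double_diag_blocks(A, row_sizes, col_sizes):
    """The M(.) operator of Eq. (1): add each diagonal block once more."""
    out, r, c = A.copy(), 0, 0
    for dr, dc in zip(row_sizes, col_sizes):
        out[r:r + dr, c:c + dc] += A[r:r + dr, c:c + dc]
        r, c = r + dr, c + dc
    return out

def grad_h_fddl(X, D, Y, lam2, C, k, n):
    """Gradient (14) of the smooth part of the E-FDDL-X problem (11)."""
    MDtD = double_diag_blocks(D.T @ D, [k] * C, [k] * C)
    MDtY = double_diag_blocks(D.T @ Y, [k] * C, [n] * C)
    m = X.mean(axis=1, keepdims=True)                 # global mean of columns
    M_glob = np.repeat(m, C * n, axis=1)              # M = [m, ..., m]
    M_hat = np.hstack([np.repeat(X[:, c*n:(c+1)*n].mean(axis=1, keepdims=True),
                                 n, axis=1) for c in range(C)])   # [M_1, ..., M_C]
    return (MDtD + 2 * lam2 * np.eye(C * k)) @ X - MDtY + lam2 * (M_glob - 2 * M_hat)

# Example shapes: C classes, k bases per class, n samples per class.
C, k, n, d, lam2 = 3, 4, 5, 20, 0.01
D, Y = np.random.randn(d, C * k), np.random.randn(d, C * n)
g = grad_h_fddl(np.zeros((C * k, C * n)), D, Y, lam2, C, k, n)  # feed to FISTA
```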

Since the proposed LRSDL is an extension of FDDL, we can also extend the two algorithms above to optimize the LRSDL cost function, as follows.

3) LRSDL dictionary update (LRSDL-D): Returning to our proposed LRSDL problem, we need to find $\bar{\mathbf{D}} = [\mathbf{D}, \mathbf{D}_0]$ when $\bar{\mathbf{X}}$ is fixed. We propose a method to solve for D and D0 separately.

For updating D, recall the observation that $\bar f_{\mathbf{Y}}(\bar{\mathbf{D}},\bar{\mathbf{X}}) = f_{\bar{\mathbf{Y}}}(\mathbf{D},\mathbf{X})$ with $\bar{\mathbf{Y}} \triangleq \mathbf{Y} - \mathbf{D}_0\mathbf{X}^0$ (see equation (4)). Combining this with the E-FDDL-D procedure presented in Section II-D1, we have:

\[
\mathbf{D} = \arg\min_{\mathbf{D}}\{-2\,\mathrm{trace}(\mathbf{E}\mathbf{D}^T) + \mathrm{trace}(\mathbf{F}\mathbf{D}^T\mathbf{D})\}, \qquad (15)
\]

with $\mathbf{E} = \bar{\mathbf{Y}}\mathcal{M}(\mathbf{X}^T)$ and $\mathbf{F} = \mathcal{M}(\mathbf{X}\mathbf{X}^T)$.

For updating D0, we use the following lemma.

Lemma 3: When $\mathbf{D}, \mathbf{X}$ in (6) are fixed,

\[
\bar J_{\mathbf{Y},\mathbf{D},\mathbf{X}}(\mathbf{D}_0,\mathbf{X}^0) = \|\mathbf{V} - \mathbf{D}_0\mathbf{X}^0\|_F^2 + \frac{\lambda_2}{2}\|\mathbf{X}^0 - \mathbf{M}_0\|_F^2 + \eta\|\mathbf{D}_0\|_* + \lambda_1\|\mathbf{X}^0\|_1 + \mathrm{constant}, \qquad (16)
\]

where $\mathbf{V} = \mathbf{Y} - \frac{1}{2}\mathbf{D}\mathcal{M}(\mathbf{X})$.

Proof: See Appendix C.

Based on Lemma 3, D0 can be updated by solving:

\[
\mathbf{D}_0 = \arg\min_{\mathbf{D}_0} \mathrm{trace}(\mathbf{F}\mathbf{D}_0^T\mathbf{D}_0) - 2\,\mathrm{trace}(\mathbf{E}\mathbf{D}_0^T) + \eta\|\mathbf{D}_0\|_*, \quad \text{where } \mathbf{E} = \mathbf{V}(\mathbf{X}^0)^T,\ \mathbf{F} = \mathbf{X}^0(\mathbf{X}^0)^T, \qquad (17)
\]

using the ADMM [33] method and the singular value thresholding algorithm [35]. The ADMM procedure is as follows. First, we choose a positive ρ and initialize $\mathbf{Z} = \mathbf{U} = \mathbf{D}_0$; then we alternately solve each of the following subproblems until convergence:

\[
\mathbf{D}_0 = \arg\min_{\mathbf{D}_0} -2\,\mathrm{trace}(\widehat{\mathbf{E}}\mathbf{D}_0^T) + \mathrm{trace}(\widehat{\mathbf{F}}\mathbf{D}_0^T\mathbf{D}_0), \qquad (18)
\]
\[
\text{with } \widehat{\mathbf{E}} = \mathbf{E} + \frac{\rho}{2}(\mathbf{Z} - \mathbf{U}),\quad \widehat{\mathbf{F}} = \mathbf{F} + \frac{\rho}{2}\mathbf{I}, \qquad (19)
\]
\[
\mathbf{Z} = \mathcal{D}_{\eta/\rho}(\mathbf{D}_0 + \mathbf{U}), \qquad (20)
\]
\[
\mathbf{U} = \mathbf{U} + \mathbf{D}_0 - \mathbf{Z}, \qquad (21)
\]

where $\mathcal{D}$ is the shrinkage thresholding operator [35]. The optimization problem (18) can be solved by ODL [32]. Note that (19) and (21) are computationally inexpensive.
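The singular value thresholding operator and the ADMM loop above can be sketched as follows. This is our own simplified illustration: for brevity the quadratic step (18) is solved in closed form as $\widehat{\mathbf{E}}\widehat{\mathbf{F}}^{-1}$, without the unit-norm column constraint that ODL enforces in the paper:

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding D_tau(A) [35]: shrink singular values by tau."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def update_D0(E, F, eta, rho=1.0, num_iters=50):
    """Sketch of the ADMM loop (18)-(21) for problem (17)."""
    k0 = F.shape[0]
    D0 = E @ np.linalg.inv(F + 1e-8 * np.eye(k0))      # rough initialization
    Z, U = D0.copy(), np.zeros_like(D0)
    for _ in range(num_iters):
        E_hat = E + 0.5 * rho * (Z - U)                # (19)
        F_hat = F + 0.5 * rho * np.eye(k0)             # (19)
        D0 = E_hat @ np.linalg.inv(F_hat)              # (18), simplified (no ODL)
        Z = svt(D0 + U, eta / rho)                     # (20)
        U = U + D0 - Z                                 # (21)
    return Z                                           # low-rank iterate

# Example: V in R^{d x N}, X0 in R^{k0 x N} as in (17).
d, N, k0 = 40, 60, 5
V, X0 = np.random.randn(d, N), np.random.randn(k0, N)
D0 = update_D0(V @ X0.T, X0 @ X0.T, eta=0.5)
```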

4) LRSDL sparse coefficients update (LRSDL-X): In our preliminary work [30], we proposed a method for effectively solving $\mathbf{X}$ and $\mathbf{X}^0$ alternately; here we combine both problems into one and find $\bar{\mathbf{X}}$ by solving the following optimization problem:

\[
\bar{\mathbf{X}} = \arg\min_{\bar{\mathbf{X}}} \bar h(\bar{\mathbf{X}}) + \lambda_1\|\bar{\mathbf{X}}\|_1, \qquad (22)
\]

where $\bar h(\bar{\mathbf{X}}) = \frac{1}{2}\bar f_{\mathbf{Y},\bar{\mathbf{D}}}(\bar{\mathbf{X}}) + \frac{\lambda_2}{2}\bar g(\bar{\mathbf{X}})$. We again solve this problem using FISTA [31] with the gradient of $\bar h(\bar{\mathbf{X}})$:

\[
\frac{\partial \bar h(\bar{\mathbf{X}})}{\partial\bar{\mathbf{X}}} = \begin{bmatrix} \dfrac{\partial \bar h_{\mathbf{X}^0}(\mathbf{X})}{\partial\mathbf{X}} \\[2mm] \dfrac{\partial \bar h_{\mathbf{X}}(\mathbf{X}^0)}{\partial\mathbf{X}^0} \end{bmatrix}. \qquad (23)
\]

For the upper term, by combining the observation

\[
\bar h_{\mathbf{X}^0}(\mathbf{X}) = \frac{1}{2}\bar f_{\mathbf{Y},\bar{\mathbf{D}},\mathbf{X}^0}(\mathbf{X}) + \frac{\lambda_2}{2}\bar g_{\mathbf{X}^0}(\mathbf{X}) = \frac{1}{2} f_{\bar{\mathbf{Y}},\mathbf{D}}(\mathbf{X}) + \frac{\lambda_2}{2} g(\mathbf{X}) + \mathrm{constant} \qquad (24)
\]

with equation (14), we obtain:

\[
\frac{\partial \bar h_{\mathbf{X}^0}(\mathbf{X})}{\partial\mathbf{X}} = \big(\mathcal{M}(\mathbf{D}^T\mathbf{D}) + 2\lambda_2\mathbf{I}\big)\mathbf{X} - \mathcal{M}(\mathbf{D}^T\bar{\mathbf{Y}}) + \lambda_2(\mathbf{M} - 2\widehat{\mathbf{M}}). \qquad (25)
\]

For the lower term, by using Lemma 3, we have:

\[
\bar h_{\mathbf{X}}(\mathbf{X}^0) = \|\mathbf{V} - \mathbf{D}_0\mathbf{X}^0\|_F^2 + \frac{\lambda_2}{2}\|\mathbf{X}^0 - \mathbf{M}_0\|_F^2 + \mathrm{constant}, \qquad (26)
\]
\[
\Rightarrow \frac{\partial \bar h_{\mathbf{X}}(\mathbf{X}^0)}{\partial\mathbf{X}^0} = 2\mathbf{D}_0^T\mathbf{D}_0\mathbf{X}^0 - 2\mathbf{D}_0^T\mathbf{V} + \lambda_2(\mathbf{X}^0 - \mathbf{M}_0) = (2\mathbf{D}_0^T\mathbf{D}_0 + \lambda_2\mathbf{I})\mathbf{X}^0 - 2\mathbf{D}_0^T\mathbf{V} - \lambda_2\mathbf{M}_0. \qquad (27)
\]

By combining these two terms, we can calculate (23).

E. Efficient solutions for other dictionary learning methods

We also propose here another efficient algorithm for updating the dictionary in two other well-known dictionary learning methods: DLSI [27] and COPAR [28]. The cost function $J_1(\mathbf{D},\mathbf{X})$ in DLSI is defined as:

\[
J_1(\mathbf{D},\mathbf{X}) = \sum_{c=1}^{C}\Big( \|\mathbf{Y}_c - \mathbf{D}_c\mathbf{X}^c\|_F^2 + \lambda\|\mathbf{X}^c\|_1 + \frac{\eta}{2}\sum_{j=1,j\neq c}^{C}\|\mathbf{D}_j^T\mathbf{D}_c\|_F^2 \Big). \qquad (28)
\]

Each class-specific dictionary $\mathbf{D}_c$ is updated by fixing the others and solving:

\[
\mathbf{D}_c = \arg\min_{\mathbf{D}_c} \|\mathbf{Y}_c - \mathbf{D}_c\mathbf{X}^c\|_F^2 + \eta\|\mathbf{A}\mathbf{D}_c\|_F^2, \qquad (29)
\]

with $\mathbf{A} = [\mathbf{D}_1, \ldots, \mathbf{D}_{c-1}, \mathbf{D}_{c+1}, \ldots, \mathbf{D}_C]^T$.

The original solution for this problem, which will be referred to as O-DLSI-D, updates each column $\mathbf{d}_{c,j}$ of $\mathbf{D}_c$ one by one based on the procedure:

\[
\mathbf{u} = (\|\mathbf{x}_c^j\|_2^2\mathbf{I} + \eta\mathbf{A}^T\mathbf{A})^{-1}\Big(\mathbf{Y}_c - \sum_{i\neq j}\mathbf{d}_{c,i}\mathbf{x}_c^i\Big)(\mathbf{x}_c^j)^T, \qquad (30)
\]
\[
\mathbf{d}_{c,j} = \mathbf{u}/\|\mathbf{u}\|_2, \qquad (31)
\]

where $\mathbf{d}_{c,i}$ is the $i$-th column of $\mathbf{D}_c$ and $\mathbf{x}_c^j$ is the $j$-th row of $\mathbf{X}^c$. This algorithm is computationally expensive since it requires one matrix inversion for each of the $k_c$ columns of $\mathbf{D}_c$. We propose an ADMM [33] procedure to update $\mathbf{D}_c$ which requires only one matrix inversion, referred to as E-DLSI-D. First, by letting $\mathbf{E} = \mathbf{Y}_c(\mathbf{X}^c)^T$ and $\mathbf{F} = \mathbf{X}^c(\mathbf{X}^c)^T$, we rewrite (29) in the more general form:

\[
\mathbf{D}_c = \arg\min_{\mathbf{D}_c} \mathrm{trace}(\mathbf{F}\mathbf{D}_c^T\mathbf{D}_c) - 2\,\mathrm{trace}(\mathbf{E}\mathbf{D}_c^T) + \eta\|\mathbf{A}\mathbf{D}_c\|_F^2. \qquad (32)
\]

To solve this problem, we first choose a ρ and set $\mathbf{Z} = \mathbf{U} = \mathbf{D}_c$; then we alternately solve each of the following subproblems until convergence:

\[
\mathbf{D}_c = \arg\min_{\mathbf{D}_c} -2\,\mathrm{trace}(\widehat{\mathbf{E}}\mathbf{D}_c^T) + \mathrm{trace}(\widehat{\mathbf{F}}\mathbf{D}_c^T\mathbf{D}_c), \qquad (33)
\]
\[
\text{with } \widehat{\mathbf{E}} = \mathbf{E} + \frac{\rho}{2}(\mathbf{Z} - \mathbf{U}),\quad \widehat{\mathbf{F}} = \mathbf{F} + \frac{\rho}{2}\mathbf{I}, \qquad (34)
\]
\[
\mathbf{Z} = (2\eta\mathbf{A}^T\mathbf{A} + \rho\mathbf{I})^{-1}(\mathbf{D}_c + \mathbf{U}), \qquad (35)
\]
\[
\mathbf{U} = \mathbf{U} + \mathbf{D}_c - \mathbf{Z}. \qquad (36)
\]

This efficient algorithm requires only one matrix inversion. Later in this paper, we show both theoretically and experimentally that E-DLSI-D is much more efficient than O-DLSI-D [27]. Note that this algorithm can also benefit the two subproblems of updating the common dictionary and the particular dictionaries in COPAR [28].
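A minimal sketch of E-DLSI-D is shown below (our own illustrative re-implementation, not the toolbox code). It assumes A stacks the other class dictionaries as in (29), and for brevity solves subproblem (33) in closed form as $\widehat{\mathbf{E}}\widehat{\mathbf{F}}^{-1}$ rather than with the norm-constrained ODL step used in the paper; the large d × d inverse in (35) is computed only once:

```python
import numpy as np

def e_dlsi_d(Yc, Xc, A, eta, rho=1.0, num_iters=30):
    """Sketch of E-DLSI-D, Eqs. (32)-(36)."""
    E, F = Yc @ Xc.T, Xc @ Xc.T                         # E = Yc Xc^T, F = Xc Xc^T
    d, kc = Yc.shape[0], Xc.shape[0]
    Dc = np.random.randn(d, kc)
    Dc /= np.maximum(np.linalg.norm(Dc, axis=0), 1.0)   # init with bounded columns
    Z, U = Dc.copy(), np.zeros_like(Dc)
    inv_term = np.linalg.inv(2.0 * eta * A.T @ A + rho * np.eye(d))  # one inversion
    for _ in range(num_iters):
        E_hat = E + 0.5 * rho * (Z - U)                 # (34)
        F_hat = F + 0.5 * rho * np.eye(kc)              # (34)
        Dc = E_hat @ np.linalg.inv(F_hat)               # (33), simplified
        Z = inv_term @ (Dc + U)                         # (35), as written above
        U = U + Dc - Z                                  # (36)
    return Dc

# Toy usage: class-c data Yc, codes Xc, and 10 atoms from other classes in A.
d, kc, n = 30, 5, 40
Dc = e_dlsi_d(np.random.randn(d, n), np.random.randn(kc, n),
              np.random.randn(10, d), eta=0.1)
```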

III. COMPLEXITY ANALYSIS

We compare the computational complexity of the efficient algorithms and their corresponding original algorithms. We also evaluate the total complexity of the proposed LRSDL and competing dictionary learning methods: DLSI [27], COPAR [28] and FDDL [24]. The complexity of each algorithm is estimated as the (approximate) number of multiplications required for one iteration (sparse code update and dictionary update). For simplicity, we assume: i) the number of training samples and the number of dictionary bases in each class (and the shared class) are the same, i.e. nc = n, ki = k; ii) the number of bases in each dictionary is comparable to the number of training samples per class and much less than the signal dimension, i.e. k ≈ n ≪ d; iii) each iterative algorithm requires q iterations to converge. For consistency, we have changed notations in those methods by denoting Y as the training samples and X as the sparse codes.

In the following analysis, we use the facts that: i) if $\mathbf{A} \in \mathbb{R}^{m\times n}$, $\mathbf{B} \in \mathbb{R}^{n\times p}$, then the matrix multiplication $\mathbf{A}\mathbf{B}$ has complexity $mnp$; ii) if $\mathbf{A} \in \mathbb{R}^{n\times n}$ is nonsingular, then the matrix inversion $\mathbf{A}^{-1}$ has complexity $n^3$; iii) the singular value decomposition of a matrix $\mathbf{A} \in \mathbb{R}^{p\times q}$, $p > q$, is assumed to have complexity $O(pq^2)$.

A. Online Dictionary Learning (ODL)

We start with the well-known Online Dictionary Learning method [32], whose cost function is:

\[
J(\mathbf{D},\mathbf{X}) = \frac{1}{2}\|\mathbf{Y} - \mathbf{D}\mathbf{X}\|_F^2 + \lambda\|\mathbf{X}\|_1, \qquad (37)
\]

where $\mathbf{Y} \in \mathbb{R}^{d\times n}$, $\mathbf{D} \in \mathbb{R}^{d\times k}$, $\mathbf{X} \in \mathbb{R}^{k\times n}$. Most dictionary learning methods find their solutions by alternately solving for one variable while fixing the others. There are two subproblems:

1) Update X (ODL-X): When the dictionary D is fixed, the sparse coefficient X is updated by solving the problem:

\[
\mathbf{X} = \arg\min_{\mathbf{X}} \frac{1}{2}\|\mathbf{Y} - \mathbf{D}\mathbf{X}\|_F^2 + \lambda\|\mathbf{X}\|_1 \qquad (38)
\]

using FISTA [31]. In each of q iterations, the dominant computation is $\mathbf{D}^T\mathbf{D}\mathbf{X} - \mathbf{D}^T\mathbf{Y}$, where $\mathbf{D}^T\mathbf{D}$ and $\mathbf{D}^T\mathbf{Y}$ are precomputed with complexities $k^2d$ and $kdn$, respectively. The matrix multiplication $(\mathbf{D}^T\mathbf{D})\mathbf{X}$ has complexity $k^2n$. The total complexity of ODL-X is therefore:

\[
k^2d + kdn + qk^2n = k(kd + dn + qkn). \qquad (39)
\]

2) Update D (ODL-D): After finding X, the dictionary D is updated by:

\[
\mathbf{D} = \arg\min_{\mathbf{D}} -2\,\mathrm{trace}(\mathbf{E}\mathbf{D}^T) + \mathrm{trace}(\mathbf{F}\mathbf{D}^T\mathbf{D}), \qquad (40)
\]

subject to $\|\mathbf{d}_i\|_2 \le 1$, with $\mathbf{E} = \mathbf{Y}\mathbf{X}^T$ and $\mathbf{F} = \mathbf{X}\mathbf{X}^T$. Each column of D is updated while fixing all the others:

\[
\mathbf{u} \leftarrow \frac{1}{\mathbf{F}_{ii}}(\mathbf{e}_i - \mathbf{D}\mathbf{f}_i) + \mathbf{d}_i; \qquad \mathbf{d}_i \leftarrow \frac{\mathbf{u}}{\max(1, \|\mathbf{u}\|_2)},
\]

where $\mathbf{d}_i, \mathbf{e}_i, \mathbf{f}_i$ are the $i$-th columns of $\mathbf{D}, \mathbf{E}, \mathbf{F}$ and $\mathbf{F}_{ii}$ is the $i$-th element on the diagonal of $\mathbf{F}$. The dominant computation is $\mathbf{D}\mathbf{f}_i$, which requires $dk$ operations. Since D has k columns and the algorithm requires q iterations, the complexity of ODL-D is $qdk^2$.
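A minimal sketch of this column-wise ODL dictionary update (our own illustration of the procedure above) is:

```python
import numpy as np

def odl_update_D(D, E, F, num_iters=50):
    """ODL dictionary update [32] for problem (40):
    min_D -2 tr(E D^T) + tr(F D^T D)  s.t.  ||d_i||_2 <= 1,
    updating one column at a time."""
    D = D.copy()
    for _ in range(num_iters):
        for i in range(D.shape[1]):
            if F[i, i] < 1e-10:          # unused atom, leave it unchanged
                continue
            u = (E[:, i] - D @ F[:, i]) / F[i, i] + D[:, i]
            D[:, i] = u / max(1.0, np.linalg.norm(u))
    return D

# Example: E = Y X^T, F = X X^T as in (40).
d, k, n = 20, 8, 50
Y, X = np.random.randn(d, n), np.random.randn(k, n)
D = odl_update_D(np.random.randn(d, k), Y @ X.T, X @ X.T)
```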

B. Dictionary learning with structured incoherence (DLSI)

DLSI [27] proposed a method to encourage independence between bases of different classes by minimizing the coherence between cross-class bases. The cost function $J_1(\mathbf{D},\mathbf{X})$ of DLSI is defined in (28).

1) Update X (DLSI-X): In each iteration, the algorithm solves C subproblems:

\[
\mathbf{X}^c = \arg\min_{\mathbf{X}^c} \|\mathbf{Y}_c - \mathbf{D}_c\mathbf{X}^c\|_F^2 + \lambda\|\mathbf{X}^c\|_1, \qquad (41)
\]

with $\mathbf{Y}_c \in \mathbb{R}^{d\times n}$, $\mathbf{D}_c \in \mathbb{R}^{d\times k}$, and $\mathbf{X}^c \in \mathbb{R}^{k\times n}$. Based on (39), the complexity of updating X (C subproblems) is:

\[
Ck(kd + dn + qkn). \qquad (42)
\]

2) Original update D (O-DLSI-D): For updating D, each sub-dictionary $\mathbf{D}_c$ is solved via (29). The main steps of the algorithm are stated in (30) and (31). The dominant computation is the matrix inversion, which has complexity $d^3$; the matrix-vector multiplications and vector normalization can be neglected here. Since $\mathbf{D}_c$ has k columns and the algorithm requires q iterations, the complexity of the O-DLSI-D algorithm is $Cqkd^3$.

3) Efficient update D (E-DLSI-D): The main steps of the proposed algorithm are presented in equations (33)–(36), where (34) and (36) require much less computation than (33) and (35). The total (estimated) complexity of the efficient $\mathbf{D}_c$ update is the sum of two terms: i) q runs (q iterations) of ODL-D in (33), and ii) one matrix inversion ($d^3$) plus q matrix multiplications in (35). Finally, the complexity of E-DLSI-D is:

\[
C(q^2dk^2 + d^3 + qd^2k) = Cd^3 + Cqdk(qk + d). \qquad (43)
\]

Total complexities of O-DLSI (the combination of DLSI-X and O-DLSI-D) and E-DLSI (the combination of DLSI-X and E-DLSI-D) are summarized in Table II.

C. Separating the particularity and the commonality dictionary learning (COPAR)

1) Cost function: COPAR [28] is another dictionary learning method that also considers a shared dictionary (but without the low-rank constraint). Using the same notation as in LRSDL, the cost function of COPAR can be written in the form:

\[
\frac{1}{2} f_1(\mathbf{Y},\bar{\mathbf{D}},\bar{\mathbf{X}}) + \lambda\|\bar{\mathbf{X}}\|_1 + \eta\sum_{c=0}^{C}\sum_{i=0,i\neq c}^{C}\|\mathbf{D}_i^T\mathbf{D}_c\|_F^2,
\]

where $f_1(\mathbf{Y},\bar{\mathbf{D}},\bar{\mathbf{X}}) = \sum_{c=1}^{C} r_1(\mathbf{Y}_c,\bar{\mathbf{D}},\bar{\mathbf{X}}_c)$ and $r_1(\mathbf{Y}_c,\bar{\mathbf{D}},\bar{\mathbf{X}}_c)$ is defined as:

\[
\|\mathbf{Y}_c - \bar{\mathbf{D}}\bar{\mathbf{X}}_c\|_F^2 + \|\mathbf{Y}_c - \mathbf{D}_0\mathbf{X}_c^0 - \mathbf{D}_c\mathbf{X}_c^c\|_F^2 + \sum_{j=1,j\neq c}^{C}\|\mathbf{X}_c^j\|_F^2.
\]

2) Update X (COPAR-X): In the sparse coefficient update step, COPAR [28] solves for $\bar{\mathbf{X}}_c$ one by one via an $\ell_1$-norm regularized problem:

\[
\widehat{\mathbf{X}} = \arg\min_{\widehat{\mathbf{X}}} \|\widehat{\mathbf{Y}} - \widehat{\mathbf{D}}\widehat{\mathbf{X}}\|_F^2 + \lambda\|\widehat{\mathbf{X}}\|_1,
\]

where $\widehat{\mathbf{Y}} \in \mathbb{R}^{\hat d\times n}$, $\widehat{\mathbf{D}} \in \mathbb{R}^{\hat d\times \hat k}$, $\widehat{\mathbf{X}} \in \mathbb{R}^{\hat k\times n}$, $\hat d = 2d + (C-1)k$ and $\hat k = (C+1)k$ (details can be found in Section 3.1 of [28]). Following the results in Section III-A1 and supposing that $C \gg 1$, $q \gg 1$, $n \approx k \ll d$, the complexity of COPAR-X is:

\[
C\hat k(\hat k\hat d + \hat d n + q\hat k n) \approx C^3k^2(2d + Ck + qn).
\]

3) Update D (COPAR-D): The COPAR dictionary update requires solving $(C+1)$ problems of the form (32). While O-COPAR-D uses the same method as O-DLSI-D (see equations (30)–(31)), the proposed E-COPAR-D takes advantage of E-DLSI-D presented in Section II-E. Therefore, the total complexity of O-COPAR-D is roughly $Cqkd^3$, while the total complexity of E-COPAR-D is roughly $C(q^2dk^2 + d^3 + qd^2k)$. Here we have assumed $C + 1 \approx C$ for large C.

D. Fisher discrimination dictionary learning (FDDL)

1) Original update X (O-FDDL-X): Based on results reported in DFDL [7], the complexity of O-FDDL-X is roughly $C^2kn(d + qCk) + C^3dk^2 = C^2k(dn + qCkn + Cdk)$.

2) Efficient update X (E-FDDL-X): Based on Section II-D2, the complexity of E-FDDL-X comes mainly from equation (14). Recall that the function M(•) requires little computation. The computation of $\mathbf{M}$ and $\widehat{\mathbf{M}}$ can also be neglected, since each requires the calculation of only one distinct column, all other columns being identical. The total complexity of E-FDDL-X is then roughly:

\[
\underbrace{(Ck)d(Ck)}_{\mathcal{M}(\mathbf{D}^T\mathbf{D}) + 2\lambda_2\mathbf{I}} + \underbrace{(Ck)d(Cn)}_{\mathcal{M}(\mathbf{D}^T\mathbf{Y})} + q\underbrace{(Ck)(Ck)(Cn)}_{(\mathcal{M}(\mathbf{D}^T\mathbf{D}) + 2\lambda_2\mathbf{I})\mathbf{X}} = C^2k(dk + dn + qCnk). \qquad (44)
\]

3) Original update D (O-FDDL-D): The original dictionary update in FDDL is divided into C subproblems. In each subproblem, one dictionary $\mathbf{D}_c$ is solved while all others are fixed, via:

\[
\mathbf{D}_c = \arg\min_{\mathbf{D}_c} \|\widehat{\mathbf{Y}} - \mathbf{D}_c\mathbf{X}^c\|_F^2 + \|\mathbf{Y}_c - \mathbf{D}_c\mathbf{X}_c^c\|_F^2 + \sum_{i\neq c}\|\mathbf{D}_c\mathbf{X}_i^c\|_F^2 = \arg\min_{\mathbf{D}_c}\underbrace{-2\,\mathrm{trace}(\mathbf{E}\mathbf{D}_c^T) + \mathrm{trace}(\mathbf{F}\mathbf{D}_c^T\mathbf{D}_c)}_{\text{complexity: } qdk^2}, \qquad (45)
\]

where:

\[
\widehat{\mathbf{Y}} = \mathbf{Y} - \sum_{i\neq c}\mathbf{D}_i\mathbf{X}^i \quad \text{(complexity: } (C-1)dkCn),
\]
\[
\mathbf{E} = \widehat{\mathbf{Y}}(\mathbf{X}^c)^T + \mathbf{Y}_c(\mathbf{X}_c^c)^T \quad \text{(complexity: } d(Cn)k + dnk),
\]
\[
\mathbf{F} = 2\mathbf{X}^c(\mathbf{X}^c)^T \quad \text{(complexity: } k(Cn)k).
\]

When $d \gg k$ and $C \gg 1$, the complexity of updating $\mathbf{D}_c$ is:

\[
qdk^2 + (C^2 + 1)dkn + Ck^2n \approx qdk^2 + C^2dkn. \qquad (46)
\]

Then, the complexity of O-FDDL-D is $Cdk(qk + C^2n)$.

4) Efficient update D (E-FDDL-D): Based on Lemma 1, the complexity of E-FDDL-D is:

\[
\underbrace{d(Cn)(Ck)}_{\mathbf{Y}\mathcal{M}(\mathbf{X})^T} + \underbrace{(Ck)(Cn)(Ck)}_{\mathcal{M}(\mathbf{X}\mathbf{X}^T)} + \underbrace{qd(Ck)^2}_{\text{ODL in (10)}} = Cdk(Cn + Cqk) + C^3k^2n. \qquad (47)
\]

Total complexities of O-FDDL and E-FDDL are summarized in Table II.

E. LRSDL

1) Update X, X0 (LRSDL-X): From (23), (25) and (27), in each iteration of updating $\bar{\mathbf{X}}$ we need to compute:

\[
(\mathcal{M}(\mathbf{D}^T\mathbf{D}) + 2\lambda_2\mathbf{I})\mathbf{X} - \mathcal{M}(\mathbf{D}^T\mathbf{Y}) + \lambda_2(\mathbf{M} - 2\widehat{\mathbf{M}}) - \mathcal{M}(\mathbf{D}^T\mathbf{D}_0\mathbf{X}^0), \quad\text{and}
\]
\[
(2\mathbf{D}_0^T\mathbf{D}_0 + \lambda_2\mathbf{I})\mathbf{X}^0 - 2\mathbf{D}_0^T\mathbf{Y} + \mathbf{D}_0^T\mathbf{D}\mathcal{M}(\mathbf{X}) - \lambda_2\mathbf{M}_0.
\]

Therefore, the complexity of LRSDL-X is:

\[
\underbrace{(Ck)d(Ck)}_{\mathbf{D}^T\mathbf{D}} + \underbrace{(Ck)d(Cn)}_{\mathbf{D}^T\mathbf{Y}} + \underbrace{(Ck)dk}_{\mathbf{D}^T\mathbf{D}_0} + \underbrace{kdk}_{\mathbf{D}_0^T\mathbf{D}_0} + \underbrace{kd(Cn)}_{\mathbf{D}_0^T\mathbf{Y}} + q\Big(\underbrace{(Ck)^2(Cn)}_{(\mathcal{M}(\mathbf{D}^T\mathbf{D}) + 2\lambda_2\mathbf{I})\mathbf{X}} + \underbrace{(Ck)k(Cn)}_{\mathcal{M}(\mathbf{D}^T\mathbf{D}_0\mathbf{X}^0)} + \underbrace{k^2Cn}_{(2\mathbf{D}_0^T\mathbf{D}_0 + \lambda_2\mathbf{I})\mathbf{X}^0} + \underbrace{k(Ck)(Cn)}_{\mathbf{D}_0^T\mathbf{D}\mathcal{M}(\mathbf{X})}\Big)
\]
\[
\approx C^2k(dk + dn) + Cdk^2 + qCk^2n(C^2 + 2C + 1) \approx C^2k(dk + dn + qCkn), \qquad (48)
\]

which is similar to the complexity of E-FDDL-X. Recall that we have assumed the number of classes $C \gg 1$.

2) Update D (LRSDL-D): Compared to E-FDDL-D, LRSDL-D requires one more computation, namely $\bar{\mathbf{Y}} = \mathbf{Y} - \mathbf{D}_0\mathbf{X}^0$ (see Section II-D3). The complexity of LRSDL-D is then:

\[
\underbrace{Cdk(Cn + Cqk) + C^3k^2n}_{\text{E-FDDL-D}} + \underbrace{dk(Cn)}_{\mathbf{D}_0\mathbf{X}^0} \approx Cdk(Cn + Cqk) + C^3k^2n, \qquad (49)
\]

which is similar to the complexity of E-FDDL-D.

3) Update D0 (LRSDL-D0): The LRSDL-D0 algorithm is presented in Section II-D3, with the main computation coming from (17), (18) and (20). The shrinkage thresholding operator in (20) requires one SVD and two matrix multiplications. The total complexity of LRSDL-D0 is:

\[
\underbrace{d(Ck)(Cn)}_{\mathbf{V} = \mathbf{Y} - \frac{1}{2}\mathbf{D}\mathcal{M}(\mathbf{X})} + \underbrace{d(Cn)k}_{\mathbf{E}\text{ in }(17)} + \underbrace{k(Cn)k}_{\mathbf{F}\text{ in }(17)} + \underbrace{qdk^2}_{(18)} + \underbrace{O(dk^2) + 2dk^2}_{(20)} \approx C^2dkn + qdk^2 + O(dk^2) = C^2dkn + (q + q_2)dk^2, \text{ for some } q_2. \qquad (50)
\]

By combining (48), (49) and (50), we obtain the total complexity of LRSDL, which is specified in the last row of Table II.

Table I: Complexity analysis for proposed efficient algorithms and their original versions

  Method      Complexity                     Plugging numbers
  O-DLSI-D    Cqkd^3                         6.25 × 10^12
  E-DLSI-D    Cd^3 + Cqdk(qk + k)            2.52 × 10^10
  O-FDDL-X    C^2 k(dn + qCkn + Cdk)         1.51 × 10^11
  E-FDDL-X    C^2 k(dn + qCnk + dk)          1.01 × 10^11
  O-FDDL-D    Cdk(qk + C^2 n)                10^11
  E-FDDL-D    Cdk(Cn + Cqk) + C^3 k^2 n      2.8 × 10^10

Table II: Complexity analysis for different dictionary learning methods

  Method     Complexity                                              Plugging numbers
  O-DLSI     Ck(kd + dn + qkn) + Cqkd^3                              6.25 × 10^12
  E-DLSI     Ck(kd + dn + qkn) + Cd^3 + Cqdk(qk + d)                 3.75 × 10^10
  O-FDDL     C^2 dk(n + Ck + Cn) + Ck^2 q(d + C^2 n)                 2.51 × 10^11
  E-FDDL     C^2 k((q+1)k(d + Cn) + 2dn)                             1.29 × 10^11
  O-COPAR    C^3 k^2 (2d + Ck + qn) + Cqkd^3                         6.55 × 10^12
  E-COPAR    C^3 k^2 (2d + Ck + qn) + Cd^3 + Cqdk(qk + d)            3.38 × 10^11
  LRSDL      C^2 k((q+1)k(d + Cn) + 2dn) + C^2 dkn + (q + q2)dk^2    1.3 × 10^11

F. Summary

Table I and Table II show the final complexity analysis of each proposed efficient algorithm and its original counterpart. Table II compares LRSDL to other state-of-the-art methods. We pick a typical set of parameters with 100 classes, 20 training samples per class, 10 bases per sub-dictionary and shared dictionary, data dimension 500 and 50 iterations for each iterative method. Concretely, C = 100, n = 20, k = 10, q = 50, d = 500. We also assume that in (50), q2 = 50. Table I shows that all three proposed efficient algorithms require less computation than their original versions, with the most significant improvement being the speed-up of DLSI-D. Table II demonstrates an interesting fact: LRSDL is the least expensive computationally when compared with other original dictionary learning algorithms, and only E-FDDL has lower complexity, which is to be expected since the FDDL cost function is a special case of the LRSDL cost function. COPAR is found to be the most expensive computationally.
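As a quick sanity check on the plugged-in numbers, the closed-form expressions can be evaluated directly; a minimal sketch (our own, using two representative rows of Table I) is:

```python
# Plug the typical parameters into two complexity expressions from Table I.
C, n, k, q, d = 100, 20, 10, 50, 500

o_dlsi_d = C * q * k * d**3                                     # O-DLSI-D: C q k d^3
e_fddl_d = C * d * k * (C * n + C * q * k) + C**3 * k**2 * n    # E-FDDL-D

print(f"O-DLSI-D ~ {o_dlsi_d:.2e} multiplications")  # ~6.25e12, as in Table I
print(f"E-FDDL-D ~ {e_fddl_d:.2e} multiplications")  # ~2.80e10, as in Table I
```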

Figure 3: Examples from five datasets: a) Extended YaleB, b) AR face, c) AR gender (males, females), d) Oxford Flower (bluebell, fritillary, sunflower, daisy, dandelion), e) Caltech 101 (laptop, chair, motorbike, dragonfly, airplane).

IV. EXPERIMENTAL RESULTS

A. Comparing methods and datasets

We present the experimental results of applying these methods to five diverse datasets: the Extended YaleB face dataset [36], the AR face dataset [37], the AR gender dataset, the Oxford Flower dataset [38], and one multi-class object category dataset – the Caltech 101 [39]. Example images from these datasets are shown in Figure 3. We compare our results with those using SRC [5] and other state-of-the-art dictionary learning methods: LC-KSVD [23], DLSI [27], FDDL [25], COPAR [28], and D²L²R² [26]. Regularization parameters in all methods are chosen using cross-validation [40].

For the two face datasets, the feature descriptors are random faces, which are made by projecting a face image onto a random vector using a random projection matrix. As in [21], the dimension of a random-face feature in Extended YaleB is d = 504, while the dimension in AR face is d = 540. Samples of these two datasets are shown in Figures 3a) and b).

For the AR gender dataset, we first choose a non-occluded subset (14 images per person) from the AR face dataset, which consists of 50 males and 50 females, to conduct the gender classification experiment. Training images are taken from the first 25 males and 25 females, while test images comprise all samples from the remaining 25 males and 25 females. PCA was used to reduce the dimension of each image to 300. Samples of this dataset are shown in Figure 3c).

The Oxford Flower dataset is a collection of images of flowers drawn from 17 species with 80 images per class, totaling 1360 images. For feature extraction, based on the impressive results presented in [25], we choose the Frequent Local Histogram feature extractor [41] to obtain feature vectors of dimension 10,000. The test set consists of 20 images per class; the remaining 60 images per class are used for training. Samples of this dataset are shown in Figure 3d).

For the Caltech 101 dataset, we use a dense SIFT (DSIFT) descriptor. The DSIFT descriptor is extracted from 25 × 25 patches densely sampled on a grid with a step of 8 pixels. We then extract the sparse coding spatial pyramid matching (ScSPM) feature [42], which is the concatenation of vectors pooled from words of the extracted DSIFT descriptor. The dimension of words is 1024 and the max pooling technique is used with pooling grids of 1×1, 2×2, and 4×4. With this setup, the dimension of the ScSPM feature is 21,504; this is followed by dimension reduction to d = 3000 using PCA. Samples of this dataset are shown in Figure 3e).

Figure 4: Original and efficient FDDL convergence rate comparison (cost function and running time in seconds vs. iteration for O-FDDL, O-FDDL-X + E-FDDL-D, E-FDDL-X + O-FDDL-D, and E-FDDL).

Figure 5: DLSI convergence rate comparison (cost function and running time in seconds vs. iteration for O-DLSI and E-DLSI).

B. Validation of efficient algorithms

To evaluate the improvement of the three efficient algorithms proposed in Section II, we apply these efficient algorithms and their original versions to training samples from the AR face dataset and verify their convergence speed. In this example, the number of classes is C = 100, the random-face feature dimension is d = 300, the number of training samples per class is nc = n = 7, and the number of atoms in each particular dictionary is kc = 7.

1) E-FDDL-D and E-FDDL-X: Figure 4 shows the cost functions and running time after each of 100 iterations of four different versions of FDDL: the original FDDL (O-FDDL), the combination of O-FDDL-X and E-FDDL-D, the combination of E-FDDL-X and O-FDDL-D, and the efficient FDDL (E-FDDL). The first observation is that O-FDDL converges quickly to a suboptimal solution, which is far from the best cost obtained by E-FDDL. In addition, while O-FDDL requires more than 12,000 seconds (around 3 hours and 20 minutes) to run 100 iterations, it takes E-FDDL only half an hour to do the same task.

Figure 6: COPAR convergence rate comparison (cost function and running time in seconds vs. iteration for O-COPAR and E-COPAR).

2) E-DLSI-D and E-COPAR-D: Figures 5 and 6 compare the convergence rates of the DLSI and COPAR algorithms. As we can see, while the cost function value improves slightly, the running time of the efficient algorithms is reduced significantly. Based on the benefits in both cost function value and computation, in the rest of this paper we use the efficient optimization algorithms instead of the original versions for obtaining classification results.

C. Visualization of learned shared bases

To demonstrate the behavior of dictionary learning methods on a dataset in the presence of shared features, we create a toy example in Figure 7. This is a classification problem with 4 classes whose basic class-specific elements and shared elements are visualized in Figure 7a). Each basis element is a 20 × 20-pixel patch. From these elements, we generate 1000 samples per class by linearly combining class-specific elements and shared elements, followed by added noise; 200 samples per class are used for training and the remaining 800 samples per class are used for testing. Samples of each class are shown in Figure 7b).
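A minimal sketch of how such a toy set can be generated is shown below. This is our own illustration: random patches stand in for the structured patterns shown in Figure 7a), and the numbers of elements and the noise level are assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)
C, dim, n_per_class = 4, 20 * 20, 1000          # 4 classes, 20x20-pixel elements

# Hypothetical basic elements: a few class-specific patterns per class plus
# a few shared patterns common to all classes.
class_elems = [rng.standard_normal((dim, 3)) for _ in range(C)]
shared_elems = rng.standard_normal((dim, 2))

samples, labels = [], []
for c in range(C):
    B = np.hstack([class_elems[c], shared_elems])          # class + shared bases
    coeff = rng.uniform(0.5, 1.5, size=(B.shape[1], n_per_class))
    noise = 0.1 * rng.standard_normal((dim, n_per_class))
    samples.append(B @ coeff + noise)                       # linear combination + noise
    labels.append(np.full(n_per_class, c))

Y = np.hstack(samples)                 # 400-dimensional samples, 4000 in total
labels = np.concatenate(labels)
# Split: 200 training / 800 test samples per class, as described above.
train_idx = np.concatenate([np.where(labels == c)[0][:200] for c in range(C)])
test_idx = np.setdiff1d(np.arange(Y.shape[1]), train_idx)
```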

Figure 7c) shows sample learned bases using DLSI [27], where shared features are still hidden in the class-specific bases. In the LC-KSVD bases (Figures 7e) and f)), the shared features (the square in the middle of a patch) are found but are classified as bases of class 1 or class 2, diminishing classification accuracy since most test samples are classified as class 1 or 2. The same phenomenon happens in the FDDL bases (Figure 7g)).

The best classification results are obtained by the two shared dictionary learning methods (COPAR in Figure 7d) and the proposed LRSDL in Figure 7h)), where the shared bases are extracted and gathered in the shared dictionary. However, in COPAR, shared features still appear in the class-specific dictionaries and the shared dictionary also includes features of classes 1 and 2. In LRSDL, class-specific elements and shared elements are nearly perfectly decomposed into the appropriate sub-dictionaries. The reason behind this is the low-rank constraint on the shared dictionary of LRSDL. Thanks to this constraint, LRSDL produces perfect results on this simulated data.

Figure 7: Visualization of learned bases of different dictionary learning methods on the simulated data. a) Basic elements (classes 1–4 and shared). b) Samples. c) DLSI bases (accuracy: 95.15%). d) COPAR bases (accuracy: 99.25%). e) LC-KSVD1 bases (accuracy: 45.15%). f) LC-KSVD2 bases (accuracy: 48.15%). g) FDDL bases (accuracy: 97.25%). h) LRSDL bases (accuracy: 100%).

Figure 8: Dependence of overall accuracy (%) on the size of the shared dictionary (k0), for COPAR and LRSDL with η = 0, η = 0.01, and η = 0.1.

D. Effect of the shared dictionary sizes on overall accuracy

We perform an experiment to study the effect of the shared dictionary size on the overall classification results of two dictionary learning methods, COPAR [28] and LRSDL, on the AR gender dataset. In this experiment, 40 images of each class are used for training. The number of shared dictionary bases varies from 10 to 80. In LRSDL, because there is a regularization parameter η attached to the low-rank term (see equation (6)), we further consider three values of η: η = 0, i.e. no low-rank constraint, and η = 0.01 and η = 0.1 for two different degrees of emphasis. Results are shown in Figure 8.

We observe that the performance of COPAR depends heavily on the choice of k0, and its results worsen as the size of the shared dictionary increases. The reason is that when k0 is large, COPAR tends to absorb class-specific features into the shared dictionary. This trend is not associated with LRSDL, even when the low-rank constraint is ignored (η = 0), because LRSDL has another constraint (‖X0 − M0‖²F small) which forces the coefficients corresponding to the shared dictionary to be similar. Additionally, when we increase η, the overall classification accuracy of LRSDL also improves. These observations confirm that our two proposed constraints on the shared dictionary are important, and that LRSDL exhibits robustness to parameter choices.

E. Overall Classification Accuracy

Table III shows the overall classification results of various methods on all presented datasets. It is evident that in most cases, the two dictionary learning methods with shared features (COPAR [28] and our proposed LRSDL) outperform the others, with all five highest values achieved by the proposed LRSDL.

F. Performance vs. size of training set

Real-world classification tasks often have to contend with a lack of large training sets. To understand the training dependence of the various techniques, we present a comparison of overall classification accuracy as a function of training set size for the different methods. In Figure 9, overall classification accuracies are reported for all aforementioned datasets under various scenarios. It is readily apparent that LRSDL exhibits the most graceful decline as training is reduced. In addition, LRSDL also shows high performance even with little training data on the AR face and AR gender datasets.

Figure 9: Overall classification accuracy (%) as a function of training set size per class on the Extended YaleB, AR face, AR gender, Oxford Flower, and Caltech 101 datasets, for SRC, LC-KSVD1, LC-KSVD2, DLSI, FDDL, D²L²R², COPAR and LRSDL.

Table III: Results (%) on different datasets.

  Method          Ext. YaleB   AR face   AR gender   Oxford Flower   Caltech 101
  SRC [5]            97.96       97.33      92.57        75.79          72.15
  LC-KSVD1 [23]      97.09       97.78      88.42        91.47          73.40
  LC-KSVD2 [23]      97.80       97.70      90.14        92.00          73.60
  DLSI [27]          96.50       96.67      94.57        85.29          70.67
  FDDL [25]          97.52       97.00      93.70        91.17          73.83
  D²L²R² [26]        96.70       95.33      93.71        83.23          75.26
  COPAR [28]         98.19       98.50      95.14        85.29          76.05
  LRSDL              98.76       98.83      95.42        92.58          76.70

V. DISCUSSION AND CONCLUSION

In this paper, our primary contribution is the development of a discriminative dictionary learning framework via the introduction of a shared dictionary with two crucial constraints. First, the shared dictionary is constrained to be low-rank. Second, the sparse coefficients corresponding to the shared dictionary obey a similarity constraint. In conjunction with the discriminative model proposed in [24], [25], this leads to a more flexible model in which shared features are excluded before classification. An important benefit of this model is the robustness of the framework to the size (k0) and the regularization parameter (η) of the shared dictionary. In comparison with state-of-the-art algorithms developed specifically for these tasks, our LRSDL approach offers better classification performance on average.

In Sections II-D and II-E, we discussed efficient algorithms for FDDL [25] and DLSI [27], and then flexibly applied them to more sophisticated models. Thereafter, in Sections III and IV-B, we showed both theoretically and experimentally that the proposed algorithms significantly improve the cost function values and running times of different dictionary learning algorithms. The complexity analysis also shows that the proposed LRSDL requires less computation than competing models.

As proposed, the LRSDL model learns a dictionary shared by every class. In some practical problems, a feature may belong to more than one but not all classes. Very recently, researchers have begun to address this issue [43], [44]. In future work, we will investigate the design of hierarchical models for extracting common features among classes.

APPENDIX

A. Proof of Lemma 1

Let $\mathbf{w}_c \in \{0,1\}^K$ be a binary vector whose $j$-th element is one if and only if the $j$-th column of $\mathbf{D}$ belongs to $\mathbf{D}_c$, and let $\mathbf{W}_c = \mathrm{diag}(\mathbf{w}_c)$. We observe that $\mathbf{D}_c\mathbf{X}_i^c = \mathbf{D}\mathbf{W}_c\mathbf{X}_i$. We can rewrite $f_{\mathbf{Y},\mathbf{X}}(\mathbf{D})$ as:

\begin{align*}
f_{\mathbf{Y},\mathbf{X}}(\mathbf{D}) &= \|\mathbf{Y} - \mathbf{D}\mathbf{X}\|_F^2 + \sum_{c=1}^{C}\Big(\|\mathbf{Y}_c - \mathbf{D}_c\mathbf{X}_c^c\|_F^2 + \sum_{j\neq c}\|\mathbf{D}_j\mathbf{X}_c^j\|_F^2\Big)\\
&= \|\mathbf{Y} - \mathbf{D}\mathbf{X}\|_F^2 + \sum_{c=1}^{C}\Big(\|\mathbf{Y}_c - \mathbf{D}\mathbf{W}_c\mathbf{X}_c\|_F^2 + \sum_{j\neq c}\|\mathbf{D}\mathbf{W}_j\mathbf{X}_c\|_F^2\Big)\\
&= \mathrm{trace}\Big(\Big(\mathbf{X}\mathbf{X}^T + \sum_{c=1}^{C}\sum_{j=1}^{C}\mathbf{W}_j\mathbf{X}_c\mathbf{X}_c^T\mathbf{W}_j^T\Big)\mathbf{D}^T\mathbf{D}\Big) - 2\,\mathrm{trace}\Big(\Big(\mathbf{Y}\mathbf{X}^T + \sum_{c=1}^{C}\mathbf{Y}_c\mathbf{X}_c^T\mathbf{W}_c\Big)\mathbf{D}^T\Big) + \mathrm{constant}\\
&= -2\,\mathrm{trace}(\mathbf{E}\mathbf{D}^T) + \mathrm{trace}(\mathbf{F}\mathbf{D}^T\mathbf{D}) + \mathrm{constant},
\end{align*}

where we have defined:

\begin{align*}
\mathbf{E} &= \mathbf{Y}\mathbf{X}^T + \sum_{c=1}^{C}\mathbf{Y}_c\mathbf{X}_c^T\mathbf{W}_c = \mathbf{Y}\mathbf{X}^T + \big[\mathbf{Y}_1(\mathbf{X}_1^1)^T, \ldots, \mathbf{Y}_C(\mathbf{X}_C^C)^T\big]\\
&= \mathbf{Y}\left(\mathbf{X}^T + \begin{bmatrix}(\mathbf{X}_1^1)^T & \dots & \mathbf{0}\\ \mathbf{0} & \dots & \mathbf{0}\\ \vdots & \ddots & \vdots\\ \mathbf{0} & \dots & (\mathbf{X}_C^C)^T\end{bmatrix}\right) = \mathbf{Y}\mathcal{M}(\mathbf{X})^T,\\
\mathbf{F} &= \mathbf{X}\mathbf{X}^T + \sum_{c=1}^{C}\sum_{j=1}^{C}\mathbf{W}_j\mathbf{X}_c\mathbf{X}_c^T\mathbf{W}_j^T = \mathbf{X}\mathbf{X}^T + \sum_{j=1}^{C}\mathbf{W}_j\Big(\sum_{c=1}^{C}\mathbf{X}_c\mathbf{X}_c^T\Big)\mathbf{W}_j^T = \mathbf{X}\mathbf{X}^T + \sum_{j=1}^{C}\mathbf{W}_j\mathbf{X}\mathbf{X}^T\mathbf{W}_j^T.
\end{align*}

Let

\[
\mathbf{X}\mathbf{X}^T = \mathbf{A} = \begin{bmatrix}\mathbf{A}_{11} & \dots & \mathbf{A}_{1j} & \dots & \mathbf{A}_{1C}\\ \vdots & & \vdots & & \vdots\\ \mathbf{A}_{j1} & \dots & \mathbf{A}_{jj} & \dots & \mathbf{A}_{jC}\\ \vdots & & \vdots & & \vdots\\ \mathbf{A}_{C1} & \dots & \mathbf{A}_{Cj} & \dots & \mathbf{A}_{CC}\end{bmatrix}.
\]

From the definition of $\mathbf{W}_j$, we observe that left-multiplying a matrix by $\mathbf{W}_j$ forces that matrix to be zero everywhere except its $j$-th block row. Similarly, right-multiplying a matrix by $\mathbf{W}_j^T = \mathbf{W}_j$ keeps only its $j$-th block column. Combining these two observations, we obtain:

\[
\mathbf{W}_j\mathbf{A}\mathbf{W}_j^T = \begin{bmatrix}\mathbf{0} & \dots & \mathbf{0} & \dots & \mathbf{0}\\ \vdots & & \vdots & & \vdots\\ \mathbf{0} & \dots & \mathbf{A}_{jj} & \dots & \mathbf{0}\\ \vdots & & \vdots & & \vdots\\ \mathbf{0} & \dots & \mathbf{0} & \dots & \mathbf{0}\end{bmatrix}.
\]

Then:

\[
\mathbf{F} = \mathbf{X}\mathbf{X}^T + \sum_{j=1}^{C}\mathbf{W}_j\mathbf{X}\mathbf{X}^T\mathbf{W}_j^T = \mathbf{A} + \begin{bmatrix}\mathbf{A}_{11} & \dots & \mathbf{0}\\ \vdots & \ddots & \vdots\\ \mathbf{0} & \dots & \mathbf{A}_{CC}\end{bmatrix} = \mathcal{M}(\mathbf{A}) = \mathcal{M}(\mathbf{X}\mathbf{X}^T).
\]

Lemma 1 has been proved. ∎

B. Proof of Lemma 2

We need to prove two parts. For the gradient of f, we first rewrite:

\[
f(\mathbf{Y},\mathbf{D},\mathbf{X}) = \sum_{c=1}^{C} r(\mathbf{Y}_c,\mathbf{D},\mathbf{X}_c) = \left\| \underbrace{\begin{bmatrix}\mathbf{Y}_1 & \mathbf{Y}_2 & \dots & \mathbf{Y}_C\\ \mathbf{Y}_1 & \mathbf{0} & \dots & \mathbf{0}\\ \mathbf{0} & \mathbf{Y}_2 & \dots & \mathbf{0}\\ \vdots & \vdots & \ddots & \vdots\\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{Y}_C\end{bmatrix}}_{\widehat{\mathbf{Y}}} - \underbrace{\begin{bmatrix}\mathbf{D}_1 & \mathbf{D}_2 & \dots & \mathbf{D}_C\\ \mathbf{D}_1 & \mathbf{0} & \dots & \mathbf{0}\\ \mathbf{0} & \mathbf{D}_2 & \dots & \mathbf{0}\\ \vdots & \vdots & \ddots & \vdots\\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{D}_C\end{bmatrix}}_{\widehat{\mathbf{D}}}\mathbf{X} \right\|_F^2 = \|\widehat{\mathbf{Y}} - \widehat{\mathbf{D}}\mathbf{X}\|_F^2.
\]

Then we obtain:

\[
\frac{\partial \frac{1}{2} f_{\mathbf{Y},\mathbf{D}}(\mathbf{X})}{\partial\mathbf{X}} = \widehat{\mathbf{D}}^T\widehat{\mathbf{D}}\mathbf{X} - \widehat{\mathbf{D}}^T\widehat{\mathbf{Y}} = \mathcal{M}(\mathbf{D}^T\mathbf{D})\mathbf{X} - \mathcal{M}(\mathbf{D}^T\mathbf{Y}).
\]

For the gradient of g, let $\mathbf{E}_p^q$ denote the all-one matrix in $\mathbb{R}^{p\times q}$. It is easy to verify that:

\[
(\mathbf{E}_p^q)^T = \mathbf{E}_q^p,\quad \mathbf{M}_c = \mathbf{m}_c\mathbf{E}_1^{n_c} = \frac{1}{n_c}\mathbf{X}_c\mathbf{E}_{n_c}^{n_c},\quad \mathbf{E}_p^q\mathbf{E}_q^r = q\mathbf{E}_p^r,\quad \Big(\mathbf{I} - \frac{1}{p}\mathbf{E}_p^p\Big)\Big(\mathbf{I} - \frac{1}{p}\mathbf{E}_p^p\Big)^T = \mathbf{I} - \frac{1}{p}\mathbf{E}_p^p.
\]

We have:

\[
\mathbf{X}_c - \mathbf{M}_c = \mathbf{X}_c - \frac{1}{n_c}\mathbf{X}_c\mathbf{E}_{n_c}^{n_c} = \mathbf{X}_c\Big(\mathbf{I} - \frac{1}{n_c}\mathbf{E}_{n_c}^{n_c}\Big),
\]
\[
\Rightarrow \frac{\partial}{\partial\mathbf{X}_c}\frac{1}{2}\|\mathbf{X}_c - \mathbf{M}_c\|_F^2 = \mathbf{X}_c\Big(\mathbf{I} - \frac{1}{n_c}\mathbf{E}_{n_c}^{n_c}\Big)\Big(\mathbf{I} - \frac{1}{n_c}\mathbf{E}_{n_c}^{n_c}\Big)^T = \mathbf{X}_c\Big(\mathbf{I} - \frac{1}{n_c}\mathbf{E}_{n_c}^{n_c}\Big) = \mathbf{X}_c - \mathbf{M}_c.
\]

Therefore we obtain:

\[
\frac{\partial \frac{1}{2}\sum_{c=1}^{C}\|\mathbf{X}_c - \mathbf{M}_c\|_F^2}{\partial\mathbf{X}} = [\mathbf{X}_1,\ldots,\mathbf{X}_C] - \underbrace{[\mathbf{M}_1,\ldots,\mathbf{M}_C]}_{\widehat{\mathbf{M}}} = \mathbf{X} - \widehat{\mathbf{M}}. \qquad (51)
\]

For $\mathbf{M}_c - \mathbf{M}$, we first write it in two ways:

\begin{align*}
\mathbf{M}_c - \mathbf{M} &= \frac{1}{n_c}\mathbf{X}_c\mathbf{E}_{n_c}^{n_c} - \frac{1}{N}\mathbf{X}\mathbf{E}_{N}^{n_c} = \frac{1}{n_c}\mathbf{X}_c\mathbf{E}_{n_c}^{n_c} - \frac{1}{N}\sum_{j=1}^{C}\mathbf{X}_j\mathbf{E}_{n_j}^{n_c}\\
&= \frac{N - n_c}{Nn_c}\mathbf{X}_c\mathbf{E}_{n_c}^{n_c} - \frac{1}{N}\sum_{j\neq c}\mathbf{X}_j\mathbf{E}_{n_j}^{n_c} \qquad (52)\\
&= \frac{1}{n_c}\mathbf{X}_c\mathbf{E}_{n_c}^{n_c} - \frac{1}{N}\mathbf{X}_l\mathbf{E}_{n_l}^{n_c} - \frac{1}{N}\sum_{j\neq l}\mathbf{X}_j\mathbf{E}_{n_j}^{n_c} \quad (l\neq c). \qquad (53)
\end{align*}

Then we infer:

\begin{align*}
(52) &\Rightarrow \frac{\partial}{\partial\mathbf{X}_c}\frac{1}{2}\|\mathbf{M}_c - \mathbf{M}\|_F^2 = \Big(\frac{1}{n_c} - \frac{1}{N}\Big)(\mathbf{M}_c - \mathbf{M})\mathbf{E}_{n_c}^{n_c} = (\mathbf{M}_c - \mathbf{M}) + \frac{1}{N}(\mathbf{M} - \mathbf{M}_c)\mathbf{E}_{n_c}^{n_c},\\
(53) &\Rightarrow \frac{\partial}{\partial\mathbf{X}_l}\frac{1}{2}\|\mathbf{M}_c - \mathbf{M}\|_F^2 = \frac{1}{N}(\mathbf{M} - \mathbf{M}_c)\mathbf{E}_{n_c}^{n_l} \quad (l\neq c).
\end{align*}

\[
\Rightarrow \frac{\partial}{\partial\mathbf{X}_l}\frac{1}{2}\sum_{c=1}^{C}\|\mathbf{M}_c - \mathbf{M}\|_F^2 = \mathbf{M}_l - \mathbf{M} + \frac{1}{N}\sum_{c=1}^{C}(\mathbf{M} - \mathbf{M}_c)\mathbf{E}_{n_c}^{n_l}.
\]

Now we prove that $\sum_{c=1}^{C}(\mathbf{M} - \mathbf{M}_c)\mathbf{E}_{n_c}^{n_l} = \mathbf{0}$. Indeed,

\[
\sum_{c=1}^{C}(\mathbf{M} - \mathbf{M}_c)\mathbf{E}_{n_c}^{n_l} = \sum_{c=1}^{C}(\mathbf{m}\mathbf{E}_1^{n_c} - \mathbf{m}_c\mathbf{E}_1^{n_c})\mathbf{E}_{n_c}^{n_l} = \sum_{c=1}^{C}(\mathbf{m} - \mathbf{m}_c)\mathbf{E}_1^{n_c}\mathbf{E}_{n_c}^{n_l} = \sum_{c=1}^{C}n_c(\mathbf{m} - \mathbf{m}_c)\mathbf{E}_1^{n_l} = \Big(\sum_{c=1}^{C}n_c\mathbf{m} - \sum_{c=1}^{C}n_c\mathbf{m}_c\Big)\mathbf{E}_1^{n_l} = \mathbf{0},
\]

since $\mathbf{m} = \dfrac{\sum_{c=1}^{C}n_c\mathbf{m}_c}{\sum_{c=1}^{C}n_c}$. Then we have:

\[
\frac{\partial}{\partial\mathbf{X}}\frac{1}{2}\sum_{c=1}^{C}\|\mathbf{M}_c - \mathbf{M}\|_F^2 = [\mathbf{M}_1,\ldots,\mathbf{M}_C] - \mathbf{M} = \widehat{\mathbf{M}} - \mathbf{M}. \qquad (54)
\]

Combining (51), (54) and $\dfrac{\partial\frac{1}{2}\|\mathbf{X}\|_F^2}{\partial\mathbf{X}} = \mathbf{X}$, we have:

\[
\frac{\partial\frac{1}{2}g(\mathbf{X})}{\partial\mathbf{X}} = 2\mathbf{X} + \mathbf{M} - 2\widehat{\mathbf{M}}.
\]

Lemma 2 has been proved. ∎

C. Proof of Lemma 3

When $\mathbf{Y}, \mathbf{D}, \mathbf{X}$ are fixed, we have:

\[
\bar J_{\mathbf{Y},\mathbf{D},\mathbf{X}}(\mathbf{D}_0,\mathbf{X}^0) = \frac{1}{2}\|\mathbf{Y} - \mathbf{D}_0\mathbf{X}^0 - \mathbf{D}\mathbf{X}\|_F^2 + \sum_{c=1}^{C}\frac{1}{2}\|\mathbf{Y}_c - \mathbf{D}_0\mathbf{X}_c^0 - \mathbf{D}_c\mathbf{X}_c^c\|_F^2 + \frac{\lambda_2}{2}\|\mathbf{X}^0 - \mathbf{M}_0\|_F^2 + \lambda_1\|\mathbf{X}^0\|_1 + \eta\|\mathbf{D}_0\|_* + \mathrm{constant}. \qquad (55)
\]

Let $\widehat{\mathbf{Y}} = \mathbf{Y} - \mathbf{D}\mathbf{X}$, $\widetilde{\mathbf{Y}}_c = \mathbf{Y}_c - \mathbf{D}_c\mathbf{X}_c^c$ and $\widetilde{\mathbf{Y}} = [\widetilde{\mathbf{Y}}_1, \widetilde{\mathbf{Y}}_2, \ldots, \widetilde{\mathbf{Y}}_C]$. We can rewrite (55) as:

\begin{align*}
\bar J_{\mathbf{Y},\mathbf{D},\mathbf{X}}(\mathbf{D}_0,\mathbf{X}^0) &= \frac{1}{2}\|\widehat{\mathbf{Y}} - \mathbf{D}_0\mathbf{X}^0\|_F^2 + \frac{1}{2}\|\widetilde{\mathbf{Y}} - \mathbf{D}_0\mathbf{X}^0\|_F^2 + \frac{\lambda_2}{2}\|\mathbf{X}^0 - \mathbf{M}_0\|_F^2 + \lambda_1\|\mathbf{X}^0\|_1 + \eta\|\mathbf{D}_0\|_* + \mathrm{constant}_1\\
&= \Big\|\frac{\widehat{\mathbf{Y}} + \widetilde{\mathbf{Y}}}{2} - \mathbf{D}_0\mathbf{X}^0\Big\|_F^2 + \frac{\lambda_2}{2}\|\mathbf{X}^0 - \mathbf{M}_0\|_F^2 + \lambda_1\|\mathbf{X}^0\|_1 + \eta\|\mathbf{D}_0\|_* + \mathrm{constant}_2.
\end{align*}

We observe that:

\[
\widehat{\mathbf{Y}} + \widetilde{\mathbf{Y}} = 2\mathbf{Y} - \mathbf{D}\mathbf{X} - [\mathbf{D}_1\mathbf{X}_1^1, \ldots, \mathbf{D}_C\mathbf{X}_C^C] = 2\mathbf{Y} - \mathbf{D}\mathcal{M}(\mathbf{X}).
\]

Now, by letting $\mathbf{V} = \dfrac{\widehat{\mathbf{Y}} + \widetilde{\mathbf{Y}}}{2}$, Lemma 3 has been proved. ∎

REFERENCES

[1] D. L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.

[2] H. S. Mousavi, V. Monga, and T. D. Tran, "Iterative convex refinement for sparse recovery," IEEE Signal Processing Letters, vol. 22, no. 11, pp. 1903–1907, 2015.

[3] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.

[4] M. W. Spratling, "Image segmentation using a sparse coding model of cortical area V1," IEEE Trans. on Image Processing, vol. 22, no. 4, pp. 1631–1643, 2013.

[5] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. on Pattern Analysis and Machine Int., vol. 31, no. 2, pp. 210–227, Feb. 2009.

[6] T. H. Vu, H. S. Mousavi, V. Monga, U. Rao, and G. Rao, "DFDL: Discriminative feature-oriented dictionary learning for histopathological image classification," Proc. IEEE International Symposium on Biomedical Imaging, pp. 990–994, 2015.

[7] T. H. Vu, H. S. Mousavi, V. Monga, U. Rao, and G. Rao, "Histopathological image classification using discriminative feature-oriented dictionary learning," IEEE Transactions on Medical Imaging, vol. 35, no. 3, pp. 738–751, March 2016.

[8] U. Srinivas, H. S. Mousavi, V. Monga, A. Hattel, and B. Jayarao, "Simultaneous sparsity model for histopathological image representation and classification," IEEE Transactions on Medical Imaging, vol. 33, no. 5, pp. 1163–1179, May 2014.

[9] X. Sun, N. M. Nasrabadi, and T. D. Tran, "Task-driven dictionary learning for hyperspectral image classification with structured sparsity constraints," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 8, pp. 4457–4471, 2015.

[10] X. Sun, Q. Qu, N. M. Nasrabadi, and T. D. Tran, "Structured priors for sparse-representation-based hyperspectral image classification," IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 7, pp. 1235–1239, 2014.

[11] Y. Chen, N. M. Nasrabadi, and T. D. Tran, "Hyperspectral image classification via kernel sparse representation," IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 1, pp. 217–231, 2013.

[12] H. Zhang, N. M. Nasrabadi, Y. Zhang, and T. S. Huang, "Multi-view automatic target recognition using joint sparse representation," IEEE Trans. on Aerospace and Electronic Systems, vol. 48, no. 3, pp. 2481–2497, 2012.

[13] T. Thongkamwitoon, H. Muammar, and P.-L. Dragotti, "An image recapture detection algorithm based on learning dictionaries of edge profiles," IEEE Transactions on Information Forensics and Security, vol. 10, no. 5, pp. 953–968, 2015.

[14] X. Mo, V. Monga, R. Bala, and Z. Fan, "Adaptive sparse representations for video anomaly detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 4, pp. 631–645, 2014.

[15] H. S. Mousavi, U. Srinivas, V. Monga, Y. Suo, M. Dao, and T. Tran, "Multi-task image classification via collaborative, hierarchical spike-and-slab priors," in Proc. IEEE Conf. on Image Processing, 2014, pp. 4236–4240.

[16] U. Srinivas, Y. Suo, M. Dao, V. Monga, and T. D. Tran, "Structured sparse priors for image classification," IEEE Transactions on Image Processing, vol. 24, no. 6, pp. 1763–1776, 2015.

[17] H. Zhang, Y. Zhang, N. M. Nasrabadi, and T. S. Huang, "Joint-structured-sparsity-based classification for multiple-measurement transient acoustic signals," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 6, pp. 1586–1598, 2012.

[18] M. Dao, Y. Suo, S. P. Chin, and T. D. Tran, "Structured sparse representation with low-rank interference," in 2014 48th Asilomar Conf. on Signals, Systems and Computers. IEEE, 2014, pp. 106–110.

[19] M. Dao, N. H. Nguyen, N. M. Nasrabadi, and T. D. Tran, "Collaborative multi-sensor classification via sparsity-based representation," IEEE Transactions on Signal Processing, vol. 64, no. 9, pp. 2400–2415, 2016.

[20] H. Van Nguyen, V. M. Patel, N. M. Nasrabadi, and R. Chellappa, "Design of non-linear kernel dictionaries for object recognition," IEEE Transactions on Image Processing, vol. 22, no. 12, pp. 5123–5135, 2013.

[21] Q. Zhang and B. Li, "Discriminative K-SVD for dictionary learning in face recognition," in Proc. IEEE Conf. Computer Vision Pattern Recognition. IEEE, 2010, pp. 2691–2698.

[22] Z. Jiang, Z. Lin, and L. S. Davis, "Learning a discriminative dictionary for sparse coding via label consistent K-SVD," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2011, pp. 1697–1704.

[23] Z. Jiang, Z. Lin, and L. Davis, "Label consistent K-SVD: Learning a discriminative dictionary for recognition," IEEE Trans. on Pattern Analysis and Machine Int., vol. 35, no. 11, pp. 2651–2664, 2013.

[24] M. Yang, L. Zhang, X. Feng, and D. Zhang, "Fisher discrimination dictionary learning for sparse representation," in Proc. IEEE International Conference on Computer Vision, Nov. 2011, pp. 543–550.

[25] M. Yang, L. Zhang, X. Feng, and D. Zhang, "Sparse representation based Fisher discrimination dictionary learning for image classification," Int. Journal of Computer Vision, vol. 109, no. 3, pp. 209–232, 2014.

[26] L. Li, S. Li, and Y. Fu, "Learning low-rank and discriminative dictionary for image classification," Image and Vision Computing, vol. 32, no. 10, pp. 814–823, 2014.

[27] I. Ramirez, P. Sprechmann, and G. Sapiro, "Classification and clustering via dictionary learning with structured incoherence and shared features," in IEEE Conf. on Comp. Vision and Pattern Recog. IEEE, 2010, pp. 3501–3508.

[28] S. Kong and D. Wang, "A dictionary learning approach for classification: separating the particularity and the commonality," in Computer Vision–ECCV 2012. Springer, 2012, pp. 186–199.

[29] S. Gao, I. W.-H. Tsang, and Y. Ma, "Learning category-specific dictionary and shared dictionary for fine-grained image categorization," IEEE Trans. on Image Processing, vol. 23, no. 2, pp. 623–634, 2014.

[30] T. H. Vu and V. Monga, "Learning a low-rank shared dictionary for object classification," IEEE International Conference on Image Processing, pp. 4428–4432, 2016.

[31] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.

[32] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," The Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010.

[33] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.

[34] B. Recht, M. Fazel, and P. A. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Review, vol. 52, no. 3, pp. 471–501, 2010.

[35] J.-F. Cai, E. J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.

[36] A. Georghiades, P. Belhumeur, and D. Kriegman, "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE Trans. on Pattern Analysis and Machine Int., vol. 23, no. 6, pp. 643–660, 2001.

[37] A. Martinez and R. Benavente, "The AR face database," CVC Technical Report, vol. 24, 1998.

[38] M.-E. Nilsback and A. Zisserman, "A visual vocabulary for flower classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2006, pp. 1447–1454.

[39] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," Computer Vision and Image Understanding, vol. 106, no. 1, 2007.

[40] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection." Morgan Kaufmann, 1995.

[41] B. Fernando, E. Fromont, and T. Tuytelaars, "Effective use of frequent itemset mining for image classification," in Computer Vision–ECCV 2012. Springer, 2012, pp. 214–227.

[42] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in IEEE Conf. on Comp. Vision and Pattern Recog. IEEE, 2009, pp. 1794–1801.

[43] M. Yang, D. Dai, L. Shen, and L. Gool, "Latent dictionary learning for sparse representation based classification," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2014, pp. 4138–4145.

[44] J. Yoon, J. Choi, and C. D. Yoo, "A hierarchical-structured dictionary learning for image classification," in Proc. IEEE Conf. on Image Processing, 2014, pp. 155–159.