arXiv:1803.09470v1 [cs.CV] 26 Mar 2018



Image Set Classification for Low Resolution Surveillance

Uzair Nadeem, Syed Afaq Ali Shah, Mohammed Bennamoun, Roberto Togneri and Ferdous Sohel

Abstract—This paper proposes a novel image set classification technique based on the concept of linear regression. Unlike most other approaches, the proposed technique does not involve any training or feature extraction. The gallery image sets are represented as subspaces in a high dimensional space. Class specific gallery subspaces are used to estimate regression models for each image of the test image set. Images of the test set are then projected on the gallery subspaces. Residuals, calculated using the Euclidean distance between the original and the projected test images, are used as the distance metric. Three different strategies are devised to decide on the final class of the test image set. We performed extensive evaluations of the proposed technique under the challenges of low resolution, noise and limited gallery data for the tasks of surveillance, video based face recognition and object recognition. Experiments show that the proposed technique achieves better classification accuracy and faster execution times under these challenging testing conditions.

Index Terms—Image Set Classification, Linear Regression Classification, Surveillance, Video based Face Recognition, Object Recognition.


1 INTRODUCTION

In recent years, due to the wide availability of mobile and handheld cameras, image set classification has emerged as a promising research area in computer vision. The problem of image set classification is defined as the task of recognition from multiple images [20]. Unlike single image based recognition, the test set in image set classification consists of a number of images belonging to the same class. The task is to assign a collective label to all the images in the test set. Similarly, the training or gallery data is also composed of one or more image sets for each class, and each image set consists of different images of the same class.

Image set classification can provide solutions to many of the shortcomings of single image based recognition methods. The extra information (poses and views) contained in an image set can be used to overcome a wide range of appearance variations such as non-rigid deformations, viewpoint changes, occlusions, different backgrounds and illumination variations. It can also help to make better decisions in the presence of noise, distortions and blur in images. These attributes make image set classification a strong candidate for many real life applications, e.g., surveillance, video based face recognition, object recognition and person re-identification in a network of security cameras [17].

• Uzair Nadeem, Syed Afaq Ali Shah and Mohammed Bennamoun are with the School of Computer Science and Software Engineering, and Roberto Togneri is with the School of Electrical, Electronics and Computer Engineering, The University of Western Australia, 35 Stirling Highway, 6009, Crawley, WA, Australia. E-mail: [email protected]

• Ferdous Sohel is with the School of Engineering and Information Technology, Murdoch University, 90 South St, 6150, Murdoch WA, Australia.

The image set classification techniques proposed in the literature can broadly be classified into three categories. (i) The first category, known as parametric methods, uses statistical distributions to model an image set [1]. These techniques calculate the closeness between the distributions of the training and test image sets. However, these techniques fail to produce good results in the absence of a strong statistical relationship between the training and the test data. (ii) Contrary to parametric methods, non-parametric techniques do not model image sets with statistical distributions. Rather, they represent image sets as linear or nonlinear subspaces [26], [37], [40], [41]. (iii) The third category of techniques uses deep learning to classify image sets [16], [17], [24].

Most current image set classification techniques have been evaluated on video based face recognition. However, the results cannot be generalized to real life surveillance scenarios, because the images obtained by surveillance cameras are usually of low resolution and contrast, and contain blur and noise. Moreover, in surveillance settings, the backgrounds in training and testing are usually different. The training set usually consists of high resolution mug shots taken in indoor studio environments, while the test set contains noisy images captured in highly uncontrolled environments. Another challenge is that enough data is normally not available for testing and training; e.g., even if there is a video of a person to be identified, captured by a surveillance camera, there may be very few frames in which the face of that person is recognizable. In addition to humans, it may also be required to recognize various objects in a surveillance scenario.

The current state-of-the-art techniques for image set classification are learning based and require a lot of training data to achieve good results. However, in practical scenarios (e.g., identifying a person captured in a surveillance video from a few mug shots), very little training data is usually available per class. Moreover, the current techniques are not able to cope effectively with challenges such as low resolution images. Image set classification techniques usually require hand-crafted features and parameter tuning to achieve good results. Computational time is another constraint, as decisions for surveillance and security are required in real time. Moreover, it is difficult to add new classes to methods that require training, as this requires re-training on all of the data. An ideal image set classification technique should be able to distinguish image sets with limited gallery data and should be capable of coping with the challenges of low resolution, noise and blur. The technique should also work in real time with minimal or no training (or parameter tuning), and with the capability to easily add new classes.

This paper proposes a novel non-parametric approach for image set classification. The proposed technique is based on the concept of image projections on subspaces using Linear Regression Classification (LRC) [25]. LRC uses the principle that images of the same class lie on a linear subspace [2], [3]. In our proposed technique, we use downsampled, raw grayscale images of the gallery set of each class to form a subspace in a high dimensional space. Our technique does not involve any training. At test time, each test image is represented as a linear combination of the images in each gallery image set. Regression model parameters are estimated for each image of the test image set using a least squares based solution. Due to the least squares solution, the proposed technique can be susceptible to the problem of a singular matrix, or singularity. This occurs when the matrix is rank deficient, i.e., the rank of the matrix is less than the number of rows. We use perturbation to overcome this problem. The estimated regression model is used to calculate the projection of the test image on the gallery subspace. The residual obtained for the test image projection is used as the distance metric. Three different strategies, majority voting, nearest neighbour classification and weighted voting, are evaluated to decide on the class of the test image set. The block diagram of the proposed technique is shown in Figure 1. We performed a rigorous evaluation of the performance of the proposed technique in real life surveillance scenarios on the SCface dataset [10], FR SURV dataset [29] and Choke Point dataset [39]. To the best of our knowledge, this is the first evaluation of image set classification on real life surveillance datasets. We also evaluated our algorithm for the task of video based face recognition on three popular image set classification datasets: the CMU Motion of Body (CMU MoBo) dataset [11], YouTube Celebrity (YTC) dataset [19] and UCSD/Honda dataset [22]. We used the Washington RGB-D dataset [21] and ETH-80 dataset [23] for experiments on object recognition. We also provide comparisons with similar image set classification algorithms as well as other popular image set methods.

The main contribution of this paper is a novel image set classification technique based on the principle of LRC, which produces superior classification performance, especially under the constraints of low resolution and limited gallery data. Our technique does not require any training and can easily be generalized across different datasets. A preliminary version of this work will appear in [32]. This paper extends [32] by providing extensive experiments to demonstrate the capability of the proposed technique to perform effectively using a limited number of low resolution images. Three different strategies are also devised for the estimation of the class of the test image set. In addition to video based face recognition, we also evaluated the proposed technique for the tasks of object classification and real life surveillance scenarios.

The rest of this paper is organized as follows. Section 2 presents an overview of the related work. The proposed technique is discussed in Section 3. Section 4 reports the experimental conditions and a detailed evaluation of our results, compared to the state-of-the-art approaches. The comparison of the computational time of the proposed technique with other methods is presented in Section 5. The paper is concluded in Section 6.

Fig. 1. A block diagram of the proposed technique. Each gallery image set is converted to a matrix and is treated as a subspace. Each test image is projected on the gallery subspaces. The residuals, obtained from the difference between the original and the projected images, are used to decide on the class of the test image set.

2 RELATED WORK

Image set classification methods can be classified into three broad categories: parametric, non-parametric and deep learning based methods, depending on how they represent image sets. Parametric methods represent image sets with a statistical distribution model [1] and then use different approaches, such as KL-divergence, to calculate the similarity and differences between the estimated distribution models of the gallery and test image sets. Such methods have the drawback that they assume a strong statistical relationship between the image sets of the same class and a weak statistical connection between image sets of different classes. However, this assumption often does not hold.

On the other hand, non-parametric methods model image sets by exemplar images or as geometric surfaces. Different works have used a wide variety of metrics to calculate the set-to-set similarity. Wang et al. [38] use the Euclidean distance between the mean images of sets as the similarity metric. Cevikalp and Triggs [4] adaptively learn samples from the image sets and use an affine hull or convex hull to model them. The distance between image sets is called the Convex Hull Image Set Distance (CHISD) in the case of the convex hull model, while in the case of the affine hull model, it is termed the Affine Hull Image Set Distance (AHISD). Hu et al. [18] calculate the mean image of the image set and an affine hull model to estimate the Sparse Approximated Nearest Points (SANP). The calculated nearest points determine the distance between the training and the test image sets. Chen et al. [6] estimate different poses in the training set and then iteratively train separate sparse models for them. Chen et al. [7] use a non-linear kernel to improve the performance of [6]. The non-availability of all the poses in all videos hampers the final classification accuracy. Some non-parametric methods, e.g., [13], [36], [37], [38], represent the whole image set with only one point on a geometric surface. The image set can also be represented either by a combination of subspaces or on a complex non-linear manifold. In the case of subspace representation, the similarity metric usually depends on the smallest angles between any vector in one subspace and any other vector in the other subspace. The sum of the cosines of the smallest angles is then used as the similarity measure between the two image sets. In the case of manifold representation of image sets, different distance metrics are used, e.g., the geodesic distance [15], [34], the projection kernel metric on the Grassmann manifold [12], and the log-map distance metric on the Lie group of a Riemannian manifold [14]. Different learning techniques based on discriminant analysis are commonly used to compare image sets on the manifold surface, such as Discriminative Canonical Correlations (DCC) [20], Manifold Discriminant Analysis (MDA) [36], Graph Embedding Discriminant Analysis (GEDA) [13] and Covariance Discriminative Learning (CDL) [37].

Chen [5] assumes a virtual image in a high dimensional space and uses the distance of the training and test image sets from the virtual image as a metric for classification. Feng et al. [8] extend the work of [5] by using the distance between the test set and unrelated training sets, in addition to the distance between the test image set and the related training set. However, these methods can only work for very small image sets due to the limitation that the dimension of the feature vectors should be much larger than the sum of the number of images in the gallery and the test sets. Lu et al. [24] use deep learning to learn non-linear models and then apply discriminative and class specific information for classification. Hayat et al. [16], [17] propose the Adaptive Deep Network Template (ADNT), a deep learning based method. They use autoencoders to learn class specific non-linear models for the training sets. Instead of random initialization, a Gaussian Restricted Boltzmann Machine (GRBM) is used to initialize the weights of the autoencoders. At test time, the class specific autoencoder models are used to reconstruct each image of the test set, and the reconstruction error is then used as the classification metric. ADNT has been demonstrated to achieve good classification accuracies, but it requires the fine tuning of several parameters and hand-crafted LBP features to achieve good performance. Moreover, deep learning based techniques are computationally expensive and require a large amount of training data.

Our technique estimates the linear regression model parameters for each image in the test set and then reconstructs the test images from the gallery subspaces. The proposed technique can classify image sets in real time and is much faster than the learning based techniques at both training and test times (Section 5). Our technique can cope well with the challenges of low resolution, noise and limited gallery data, and does not have any constraints on the number of images in the test set. Moreover, it does not involve any feature extraction or training and can produce state-of-the-art results using only the raw images.

3 PROPOSED TECHNIQUE

In this paper, we adopt the following notation: bold capital letters denote MATRICES, bold lower case letters denote vectors, and normal letters denote scalars. We use [i] to denote the set of integers from 1 to i, for a given integer i.

Consider the problem of classifying the unknown class of a test image set as one of Y classes. Let the gallery image set of each class y ∈ [Y] contain N_ζ images. Each image in the gallery image sets is downsampled from the initial resolution of a×b to the resolution of c×d and converted to grayscale, to be represented as G_y^i ∈ R^(c×d), ∀ y ∈ [Y] and ∀ i ∈ [N_ζ]. We concatenate all the columns of each image to obtain a feature vector such that G_y^i ∈ R^(c×d) → ς_y^i ∈ R^(τ×1), where τ = cd. We do not use any feature extraction technique and use raw grayscale pixels as the features. Next, an image set matrix ζ_y is formed for each class y by horizontally concatenating the image vectors of class y:

    ζ_y = [ς_y^1 ς_y^2 ς_y^3 … ς_y^(N_ζ)] ∈ R^(τ×N_ζ),  ∀ y ∈ [Y]    (1)

This follows the principle that patterns from the same class form a linear subspace [2]. Each vector ς_y^i, ∀ i ∈ [N_ζ], of the matrix ζ_y spans a subspace of R^(τ×1). Thus, each class y is represented by a vector subspace ζ_y, called the regressor for class y.
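As a concrete illustration, the column-stacking of Equation (1) can be sketched in NumPy. This is our own minimal sketch, not the authors' code: the function name `build_regressor` and the toy data are hypothetical, and the images are assumed to be already resized to c×d grayscale arrays.

```python
import numpy as np

def build_regressor(images):
    """Stack images column-wise into the class regressor ζ_y ∈ R^(τ×N_ζ).

    `images` is a list of N_ζ grayscale arrays of shape (c, d), already
    downsampled; each is flattened by column concatenation (Fortran
    order), giving feature vectors of length τ = c·d.
    """
    cols = [img.reshape(-1, order="F").astype(np.float64) for img in images]
    return np.column_stack(cols)  # shape (c*d, N_ζ)

# Toy gallery: five random 10×10 "images" of one class.
rng = np.random.default_rng(0)
gallery = [rng.integers(0, 256, size=(10, 10)) for _ in range(5)]
zeta_y = build_regressor(gallery)
print(zeta_y.shape)  # (100, 5)
```

Column-major flattening matches the "concatenate all the columns" step in the text; any consistent flattening order works as long as gallery and test images use the same one.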

Let ϕ be the unknown class of the test image set, which contains N_P images. In the same way as the gallery images, the images of the test image set are downsampled to the resolution of c×d and converted to grayscale, to be represented as T_ϕ^j ∈ R^(c×d), where ϕ is the unknown class of the test image set and j ∈ [N_P]. Each downsampled image is converted to a vector through column concatenation such that T_ϕ^j ∈ R^(c×d) → ρ_ϕ^j ∈ R^(τ×1), where τ = cd. A matrix P_ϕ is formed for the test image set by the horizontal concatenation of the image vectors ρ_ϕ^j, ∀ j ∈ [N_P]:

    P_ϕ = [ρ_ϕ^1 ρ_ϕ^2 ρ_ϕ^3 … ρ_ϕ^(N_P)] ∈ R^(τ×N_P)    (2)

If P_ϕ belongs to the y-th class, then its vectors should span the same subspace as the image vectors of the gallery set ζ_y. In other words, the image vectors of the test set P_ϕ should be representable as linear combinations of the gallery image vectors of the same class. Let θ_y^j ∈ R^(N_ζ×1) be a vector of parameters. Then the image vectors of the test set can be represented as:

    ρ_ϕ^j = ζ_y θ_y^j,  ∀ j ∈ [N_P], ∀ y ∈ [Y]    (3)

The number of features must be greater than or equal to the number of images in each gallery set, i.e., τ ≥ N_ζ, in order to calculate a unique solution for Equation (3) by the least squares method [9], [30], [31]. However, even if the condition τ ≥ N_ζ holds, it is possible for one or more rows of the regressor ζ_y to be linearly dependent on one another, which renders the regressor matrix ζ_y singular. In this case, the regressor ζ_y is called a rank deficient matrix, since the rank of ζ_y is less than the number of rows. In this scenario, it is not possible to use the least squares solution to estimate a unique parameter vector θ_y^j. The singularity of the regressor ζ_y can be removed by perturbation [37]. In the case of a singular matrix, a small perturbation matrix ε with uniform random values in the range −0.5 ≤ ε ≤ +0.5 is added to the regressor ζ_y:

    ζ̃_y = ζ_y + ε,  ε ∈ R^(τ×N_ζ),  ∀ ε ∈ ε : |ε| ≤ 0.5    (4)

The regressor ζ_y is perturbed while its values are still in the range of 0 to 255; therefore, the maximum possible change in the value of any pixel is 0.5.


Datasets →          SCface                  FR SURV                 Choke Point
Methods ↓         10×10  15×15  20×20     10×10  15×15  20×20     10×10  15×15  20×20
AHISD [4]         30.76  40.76  40        31.37  33.33  35.29     52.27  54.54  55.08
RNP [40]          29.23  36.92  36.15     27.45  33.33  35.29     46.84  48.82  49.27
SSDML [41]         9.23  15.38  13.85     11.76   7.84   7.84     53.82  47.73  48.34
DLRC [5]          36.92  39.23  38.46      5.88   9.8    3.92     41.79  45.35  45.40
DRM [17]           2.31   6.15   4.62      5.88   9.8    9.8      32.54  50.00  55.53
PLRC [8]          43.84  46.15  45.38     37.25  35.29  35.29     53.85  56.19  56.83
Ours (MV)         50.77  49.23  49.23     45.10  49.02  45.10     55.11  54.95  56.34
Ours (NN)         49.23  48.46  48.46     39.22  49.02  47.06     55.99  56.68  56.98
Ours (WV)         51.54  50.77  50.77     41.18  51     47.06     56.42  56.95  57.22

TABLE 1
Average percentage classification accuracies for the task of surveillance on the SCface [10], FR SURV [29] and Choke Point [39] datasets.

The vector of parameters θ_y^j can now be estimated for the test image vector ρ_ϕ^j and the regressor ζ̃_y by the following equation:

    θ_y^j = (ζ̃_y′ ζ̃_y)⁻¹ ζ̃_y′ ρ_ϕ^j,  ∀ j ∈ [N_P], ∀ y ∈ [Y]    (5)

where ζ̃_y′ is the transpose of ζ̃_y. Next, the estimated vector of parameters θ_y^j and the regressor ζ̃_y are used to calculate the projection of the image vector ρ_ϕ^j on the y-th subspace. The projection ρ̂_y^j of the image vector ρ_ϕ^j on the y-th subspace is calculated by the following equations:

    ρ̂_y^j = ζ̃_y θ_y^j,  ∀ j ∈ [N_P], ∀ y ∈ [Y]    (6)

    ρ̂_y^j = ζ̃_y (ζ̃_y′ ζ̃_y)⁻¹ ζ̃_y′ ρ_ϕ^j    (7)

The projection ρ̂_y^j can also be termed the reconstructed image vector for ρ_ϕ^j. Equations (6) and (7) are useful to estimate the image projections in online mode, i.e., when all the test data is not available simultaneously. However, for an offline classification scenario, the matrix implementation of the above problem can be used to decrease the processing time. Let Θ_y ∈ R^(N_ζ×N_P) be a matrix of parameters. Then the test image set matrix P_ϕ can be represented as the combination of Θ_y and the gallery set matrix ζ̃_y:

    P_ϕ = ζ̃_y Θ_y,  ∀ y ∈ [Y]    (8)

Θ_y can be calculated by least squares estimation using the following equation:

    Θ_y = (ζ̃_y′ ζ̃_y)⁻¹ ζ̃_y′ P_ϕ,  ∀ y ∈ [Y]    (9)

The projections of the test image vectors on the y-th subspace can be calculated as follows:

    P̂_y = ζ̃_y Θ_y,  ∀ y ∈ [Y]    (10)

    P̂_y = ζ̃_y (ζ̃_y′ ζ̃_y)⁻¹ ζ̃_y′ P_ϕ    (11)

where P̂_y ∈ R^(τ×N_P) is the matrix of projections of the image vectors of the test image set P_ϕ on the y-th subspace. The difference between each test image vector ρ_ϕ^j and its projection ρ̂_y^j, called the residual, is calculated using the Euclidean distance:

    r_y^j = ‖ρ_ϕ^j − ρ̂_y^j‖₂,  ∀ y ∈ [Y], ∀ j ∈ [N_P]    (12)

Once the residuals are calculated, we propose three strategies to reach the final decision for the unknown class ϕ of the test set.
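For a single class y, Equations (9)-(12) are an ordinary least squares projection followed by a column-wise Euclidean norm. A minimal NumPy sketch with random stand-in data (sizes and variable names are our own) is:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, n_g, n_p = 100, 5, 8                      # features, gallery size, test size
zeta = rng.uniform(0, 255, size=(tau, n_g))    # regressor ζ̃_y (one class)
P = rng.uniform(0, 255, size=(tau, n_p))       # test image set matrix P_ϕ

# Eq. (9): least squares parameters for all test images at once
# (np.linalg.solve on the normal equations avoids an explicit inverse).
Theta = np.linalg.solve(zeta.T @ zeta, zeta.T @ P)
P_hat = zeta @ Theta                           # Eqs. (10)/(11): projections
residuals = np.linalg.norm(P - P_hat, axis=0)  # Eq. (12), one per test image
print(residuals.shape)  # (8,)
```

A sanity check on the geometry: a test vector that already lies in the gallery subspace projects onto itself, so its residual is (numerically) zero.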

3.1 Majority Voting (MV)

In majority voting, each image j of the test image set casts an equally weighted vote ϑ_j for the class y whose regressor ζ_y results in the minimum residual for that test image vector:

    ϑ_j = arg min_y (r_y^j),  ∀ j ∈ [N_P]    (13)

The votes are then counted for each class, and the class with the maximum number of votes is declared as the unknown class of the test image set:

    ϕ = mode([ϑ_1 ϑ_2 ϑ_3 … ϑ_(N_P)])    (14)

In the case of a tie between two or more classes, we calculate the mean of the residuals for the competing classes, and the class with the minimum mean residual is declared as the output class of the test image set. Suppose classes 1, 2 and 3 have the same number of votes; then:

    r_y(mean) = mean_j (r_y^j),  ∀ y ∈ [1, 2, 3]    (15)

    ϕ = arg min_y (r_y(mean)),  ∀ y ∈ [1, 2, 3]    (16)
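Given the residual matrix with one row per class and one column per test image, majority voting with the mean-residual tie-break (Equations (13)-(16)) can be sketched as follows; the function and variable names are ours.

```python
import numpy as np

def majority_vote(R):
    """R[y, j]: residual of test image j against class y (Eq. 12).
    Each image votes for its minimum-residual class (Eqs. 13/14);
    ties are broken by the smaller mean residual (Eqs. 15/16)."""
    votes = R.argmin(axis=0)                        # one vote per test image
    counts = np.bincount(votes, minlength=R.shape[0])
    tied = np.flatnonzero(counts == counts.max())
    if len(tied) == 1:
        return int(tied[0])
    return int(tied[R[tied].mean(axis=1).argmin()])

R = np.array([[1.0, 5.0, 1.0],   # class 0 wins images 0 and 2
              [2.0, 1.0, 3.0],   # class 1 wins image 1
              [9.0, 9.0, 9.0]])
print(majority_vote(R))  # 0
```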


No. of Images →      20-20                   40-40                   All-All
Methods ↓         10×10  15×15  20×20     10×10  15×15  20×20     10×10  15×15  20×20
AHISD [4]         91.02  92.82  92.82     92.82  95.12  94.61     93.84  98.71  99.23
RNP [40]          94.35  95.12  95.12     95.64  97.43  96.41     95.64  99.48  99.48
SSDML [41]        80.51  83.08  83.85     84.10  85.64  85.64     89.74  90.77  91.54
DLRC [5]          92.30  92.82  92.82     90.51  93.07  92.82     94.3   96.9   97.1
DRM [17]          31.28  26.92  28.46     81.79  83.33  79.49     100    99.74  100
PLRC [8]          92.82  93.07  93.84     94.35  95.12  95.38     99.48  100    100
Ours (MV)         90.51  91.80  92.05     94.36  95.13  96.15     100    100    100
Ours (NN)         95.64  96.15  95.90     96.41  97.95  97.69     100    100    100
Ours (WV)         95.90  96.15  95.90     97.18  97.95  98.72     100    100    100

TABLE 2
Average percentage classification accuracies for the task of video based face recognition on the UCSD/Honda (Honda) [22] dataset for different numbers of gallery and test set images.

3.2 Nearest Neighbour Classification (NN)

In nearest neighbour classification, we first find, for each class y in [Y], the minimum residual r_y(min) over all the test image vectors:

    r_y(min) = min_j (r_y^j),  ∀ y ∈ [Y]    (17)

Then, the class with the minimum value of r_y(min) is declared as the output class of the test image set:

    ϕ = arg min_y (r_y(min)),  ∀ y ∈ [Y]    (18)
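Equations (17)/(18) reduce to locating the single smallest entry of the residual matrix; a short NumPy sketch (names are ours):

```python
import numpy as np

def nearest_neighbour(R):
    """Eqs. (17)/(18): the class owning the globally smallest residual
    over all test images is declared the output class."""
    return int(np.unravel_index(R.argmin(), R.shape)[0])

R = np.array([[4.0, 3.0],
              [0.5, 6.0]])     # the global minimum 0.5 belongs to class 1
print(nearest_neighbour(R))    # 1
```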

3.3 Weighted Voting (WV)

In weighted voting, each image j ∈ [N_P] of the test image set casts a vote ω_y^j for each class y in the gallery. We tested different types of weights, including the Euclidean distance, the exponential of the Euclidean distance, the inverse of the Euclidean distance and the square of the inverse of the Euclidean distance. Empirically, we found that the best performance was achieved when the exponential of the Euclidean distance was used to weight the votes. Hence, the following equation defines the weight of the vote ω_y^j of each image j:

    ω_y^j = e^(−β r_y^j),  ∀ y ∈ [Y], ∀ j ∈ [N_P]    (19)

where β is a constant. The weights are accumulated for each gallery class y ∈ [Y] from each image of the test set:

    ϖ_y = Σ_(j=1)^(N_P) ω_y^j,  ∀ y ∈ [Y]    (20)

The final decision is ruled in favour of the class y with the maximum accumulated weight from all the images ρ_ϕ^j of the test image set P_ϕ:

    ϕ = arg max_y (ϖ_y),  ∀ y ∈ [Y]    (21)
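Equations (19)-(21) can be sketched as below. The value of β is a tuning constant that the text does not fix; β = 0.01 here is an arbitrary stand-in, and the function name is ours.

```python
import numpy as np

def weighted_vote(R, beta=0.01):
    """Every test image votes for every class with weight e^(−β·r)
    (Eq. 19); the weights are accumulated per class (Eq. 20) and the
    largest accumulated weight wins (Eq. 21)."""
    W = np.exp(-beta * R)
    return int(W.sum(axis=1).argmax())

R = np.array([[1.0, 2.0, 1.5],   # small residuals → large weights
              [3.0, 0.5, 4.0]])
print(weighted_vote(R))  # 0
```

Unlike majority voting, every image contributes to every class, so a single very confident image cannot be outvoted by many weakly confident ones.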

3.4 Fast Linear Image Reconstruction

If the testing is done offline (i.e., all test data is available), then the processing time can be substantially reduced by using Equations (8), (9), (10) and (11) instead of Equations (3), (5), (6) and (7). The computational efficiency of the proposed technique can be further improved by using the Moore-Penrose pseudoinverse [27], [33] to estimate the inverse of the regressor matrix ζ̃_y at the time of gallery formation. This step significantly reduces the computations required at test time. Table 7 shows that a many-fold gain in computational efficiency was achieved by the fast linear image reconstruction technique for the YouTube Celebrity dataset [19]. Let ζ̃_y^inv be the pseudoinverse of the regressor ζ̃_y, calculated at the time of gallery formation. Equation (8) can then be solved at test time as:

    Θ_y = ζ̃_y^inv P_ϕ    (22)

    P̂_y = ζ̃_y (ζ̃_y^inv P_ϕ)    (23)
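The speed-up comes from precomputing the pseudoinverse once per class at gallery formation; at test time, Equations (22)/(23) are then just two matrix products. A sketch with random stand-in data (sizes are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
zeta = rng.uniform(0, 255, size=(100, 5))   # regressor ζ̃_y
zeta_inv = np.linalg.pinv(zeta)             # Moore-Penrose pseudoinverse,
                                            # computed once at gallery time
P = rng.uniform(0, 255, size=(100, 8))      # test image set matrix P_ϕ

Theta = zeta_inv @ P                        # Eq. (22)
P_hat = zeta @ Theta                        # Eq. (23)

# Identical (up to numerical error) to the normal-equations solution:
P_hat_ls = zeta @ np.linalg.solve(zeta.T @ zeta, zeta.T @ P)
print(np.allclose(P_hat, P_hat_ls))  # True
```

For a full-column-rank regressor, `pinv(ζ̃) = (ζ̃′ζ̃)⁻¹ζ̃′`, so Equations (22)/(23) give the same projections as Equations (9)-(11) at a fraction of the test-time cost.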

4 EXPERIMENTS AND ANALYSIS

We rigorously evaluated the proposed technique for the tasks of surveillance, video based face recognition and object recognition. We evaluated our technique on three surveillance databases, namely the SCface [10], FR SURV [29] and Choke Point [39] datasets. For video based face recognition, we used the publicly available and challenging Honda/UCSD [22], CMU Motion of Body (CMU MoBo) [11] and YouTube Celebrity (YTC) [19] datasets. The Washington RGB-D Object [21] and ETH-80 [23] datasets were used for the task of object recognition.


We compared our technique with several prominent image set classification methods: the Affine Hull based Image Set Distance (AHISD) [4], Regularized Nearest Points (RNP) [40], Set to Set Distance Metric Learning (SSDML) [41], the Dual Linear Regression Classifier (DLRC) [5], Deep Reconstruction Models (DRM) [17] and the Pairwise Linear Regression Classifier (PLRC) [8]. We followed the standard experimental protocols wherever applicable. For the compared methods, we used the implementations provided by the respective authors, with the parameter settings recommended in the respective papers. Since [5] and [8] can only work with small image sets, we selected the number of training and testing images for these techniques based on the best accuracy on each of the datasets.

4.1 Surveillance

4.1.1 SCface Dataset

The SCface dataset [10] contains static images of human faces. The main purpose of the SCface dataset is to test face recognition algorithms in real world scenarios. This dataset presents the challenges of different training and testing environments, uncontrolled illumination, low resolution, limited gallery data, head pose orientation and a large number of classes. Five commercially available video surveillance cameras of different qualities and various resolutions were used to capture the images of 130 individuals in uncontrolled indoor environments. The cameras of different qualities present a real world surveillance scenario for face recognition. Due to the large number of individuals, there is a very low probability of results being obtained by pure chance. Images were taken under uncontrolled illumination conditions, with outdoor light coming from a window as the only source of illumination. Two of the surveillance cameras record both visible spectrum and IR night vision images, which are also included in the dataset. Images were captured at various distances for each individual. Another challenge in the dataset is the head pose: the camera is above the subject's head to mimic a commercial surveillance system, while for the gallery images the camera is in front of the individual. For the gallery set, individuals were photographed with a high quality digital photographer's camera at close range in controlled conditions (standard indoor lighting, adequate use of flash to avoid shadows, and high resolution images) to mimic the conditions used by law enforcement

Fig. 2. Faces detected by the Viola and Jones face detection algorithm of four individuals from the challenging SCface dataset [10]. The first three columns show the gallery images while the remaining columns show the test images. The large difference between the quality of the gallery and test images is evident.

agencies. Nine different views were captured for each individual, from -90 degrees to +90 degrees at intervals of 22.5 degrees. Figure 2 clearly presents the difference in quality between the gallery images and the test images of four individuals in the SCface dataset.

We used the images taken in controlled conditions for the gallery sets, while all other images were used to form the test image sets. The faces in the images were detected using the Viola and Jones face detection algorithm [35]. The images were converted to grayscale and resized to 10×10, 15×15 and 20×20. Table 1 shows the classification accuracies of our technique compared to the other state-of-the-art techniques. Our technique achieved the best classification accuracy at all resolutions. This can be attributed to the fact that there is not sufficient data available for training, so the techniques which depend heavily on data availability, e.g., [17], fail to produce good results.
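The preprocessing used throughout these experiments (a detected face crop converted to grayscale, aggressively downsampled and vectorized) can be sketched as below. This is a minimal illustration under stated assumptions: the luma weights and the nearest-neighbour resize are our own illustrative choices, and in the actual experiments the crops come from the Viola and Jones detector.

```python
import numpy as np

def to_grayscale(img):
    """Convert an H x W x 3 RGB face crop to grayscale.
    The ITU-R BT.601 luma weights are an illustrative choice."""
    return img[..., :3] @ np.array([0.299, 0.587, 0.114])

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour downsampling to out_h x out_w
    (a stand-in for whatever resizer the authors used)."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[np.ix_(rows, cols)]

def preprocess(face_crop, res=10):
    """Grayscale, downsample and flatten to a feature vector."""
    g = resize_nearest(to_grayscale(face_crop), res, res)
    return g.reshape(-1)  # 100-dimensional vector for a 10x10 image

# Example: a synthetic 48x48 "face crop"
crop = np.random.rand(48, 48, 3)
x = preprocess(crop, res=10)
print(x.shape)  # (100,)
```

At a 10×10 resolution, each image contributes a 100-dimensional pixel vector, which is why gallery sets must stay well below 100 images for the regression step.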

4.1.2 FR SURV Dataset

The FR SURV dataset [29] contains images of 51 individuals. The main purpose of this dataset is to test face recognition algorithms in long distance surveillance scenarios. It presents the challenges of uncontrolled illumination, low resolution, blur, noise, limited training data, head pose orientation and a large number of classes. Unlike most other datasets, the gallery and test images were captured in different conditions. The gallery set contains twenty high resolution images of each individual, captured from a close distance in controlled indoor conditions. The images contain slight head movements and the camera is placed parallel to the head position. There is a significant difference between the resolutions of the gallery and the test images. The test images were captured in outdoor conditions from a distance of


No. of Images→               20-20                      All-All
Methods↓ \ Resolutions→   10×10   15×15   20×20      10×10   15×15   20×20
AHISD [4]                 62.77   71.66   76.80      93.89   96.25   96.81
RNP [40]                  66.25   72.50   77.63      92.36   95.69   97.36
SSDML [41]                73.47   73.06   73.47      96.39   96.67   96.25
DLRC [5]                  80.69   82.08   80.00      96.20   97.20   97.70
DRM [17]                  26.94   40.14   38.06      90.00   90.56   91.81
PLRC [8]                  79.72   80.41   79.16      96.38   96.38   96.94
Ours (MV)                 73.06   74.31   74.17      90.97   90.70   91.53
Ours (NN)                 80.97   83.06   82.92      98.61   99.31   99.03
Ours (WV)                 80.56   83.06   82.92      98.61   99.31   99.03

TABLE 3
Average percentage classification accuracies for the task of video based face recognition on the CMU MoBo [11] dataset for different sizes of gallery and test image sets.

Fig. 3. Random images of four individuals from the challenging FR SURV dataset [29]. The first five columns show the gallery images while the remaining columns show the test images. The difference in the quality of the gallery and the test images can clearly be observed.

50m - 100m. The camera was placed at a height of around 20m - 25m. Due to the use of a commercial surveillance camera and the large distance between the camera and the individual, the test images are of very low quality, which makes it difficult to identify the individual. The difference between the gallery and test images of the FR SURV dataset in Figure 3 shows the challenges of the dataset.

We used the indoor images for the gallery sets while the outdoor images were used to form the test image sets. The Viola and Jones face detection algorithm [35] was used to detect faces in the images. The images were converted to grayscale and resized to 10×10, 15×15 and 20×20. The classification accuracies of our technique compared to the other image set classification techniques are shown in Table 1. Although the results of all techniques are low on this dataset, our technique achieved the best performance at all resolutions, with a margin of more than 15% over the second best technique. The techniques which require large training data fail to produce any considerable results on this dataset.

4.1.3 Choke Point Dataset

The Choke Point dataset [39] consists of videos of 29 individuals captured using surveillance cameras. Three cameras at different positions simultaneously record the entry or exit of a person from 3 viewpoints. The main purpose of this dataset is to identify or verify persons under real-world surveillance conditions. The cameras were placed above the doors to capture individuals walking through the door in a natural manner. This dataset presents the challenges of variations in illumination, background, pose and sharpness. The 48 video sequences of the dataset contain 64,204 face images, captured at a frame rate of 30 fps with an image resolution of 800×600 pixels. In all sequences, only one individual is present in the image at a time. The sequences have different backgrounds depending on the door and on whether the individual is entering or leaving the room. Overall, 25 individuals appear in 16 video sequences while 4 appear in 8 video sequences.

We used the Viola and Jones face detection algorithm [35] to detect faces in the images. The images were converted to grayscale and resized to 10×10, 15×15 and 20×20. As recommended by the authors of the dataset for the evaluation protocol [39], we used only the images from the camera which captured the most frontal faces in each video sequence. This results in 16 image sets for 25 individuals and 8 image sets for 4 individuals. We randomly selected two image sets of each individual for the gallery set, while the rest of the image sets were used as test image sets. The experiment was repeated ten times with random selections of gallery and test image sets to improve the generalization of the results. We achieved average classification accuracies of 71.23%, 73.85% and 72.75% at the 10×10, 15×15 and 20×20 resolutions, respectively. In order to test the performance of image set classification methods under the challenges of limited gallery and test data, we reduced the size of the image sets to 20 by retaining the first 20 images of each image set. Table 1 shows the classification accuracies on the Choke Point dataset. Our technique achieved the best classification accuracies compared to the other image set techniques.
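The evaluation protocol used above (random gallery/test splits repeated several times, accuracies averaged) can be sketched as below. The nearest-class-mean classifier is only a placeholder standing in for the paper's regression-based decision, and the toy data, set counts and trial count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_mean_classify(gallery_sets, test_set):
    """Placeholder classifier: assign the test set to the class whose
    pooled gallery images have the mean closest to the test-set mean."""
    t = test_set.mean(axis=0)
    dists = {c: np.linalg.norm(np.vstack(sets).mean(axis=0) - t)
             for c, sets in gallery_sets.items()}
    return min(dists, key=dists.get)

def one_trial(image_sets, n_gallery=2):
    """image_sets: {class: [set_1, set_2, ...]}, each set an (n, d) array.
    Randomly pick n_gallery sets per class for the gallery; test on the rest."""
    gallery, tests = {}, []
    for c, sets in image_sets.items():
        idx = rng.permutation(len(sets))
        gallery[c] = [sets[i] for i in idx[:n_gallery]]
        tests += [(c, sets[i]) for i in idx[n_gallery:]]
    correct = sum(nearest_mean_classify(gallery, s) == c for c, s in tests)
    return correct / len(tests)

# Toy data: 3 classes, 4 image sets each, 20 images of 100 features
data = {c: [rng.normal(c, 1.0, (20, 100)) for _ in range(4)] for c in range(3)}
acc = np.mean([one_trial(data) for _ in range(10)])  # average over 10 trials
print(round(float(acc), 2))
```

Averaging over repeated random splits, as done for Choke Point, reduces the variance introduced by any single gallery/test partition.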

4.2 Video based Face Recognition

For video based face recognition, we used the most popular image set classification datasets. For this section, in addition to the experiments which use limited gallery and test data, we also carried out extensive experiments using the full length image sets, in order to demonstrate the suitability of our technique for both scenarios. In the experiments using full length image sets, we reduced the gallery data for our technique whenever the number of images in a gallery set was larger than the number of pixels in the downsampled images. This ensures that the number of features is always larger than the number of images for our technique (see Section 3 for details).
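The gallery-size guard described above can be enforced with a small subsampling step: gallery frames are randomly discarded until the number of images stays below the number of pixels (features) in the downsampled representation, so that each class-specific least-squares system remains overdetermined. A minimal sketch follows; the 0.5 margin factor is our own assumption, not a value from the paper.

```python
import numpy as np

def cap_gallery(images, resolution=10, margin=0.5, seed=0):
    """images: (n, d) matrix of vectorized gallery images.
    Randomly keep at most margin * resolution**2 images so that the
    number of images stays well below the number of pixels, as the
    regression step requires. The margin of 0.5 is illustrative."""
    limit = int(margin * resolution * resolution)
    if images.shape[0] <= limit:
        return images
    rng = np.random.default_rng(seed)
    keep = rng.choice(images.shape[0], size=limit, replace=False)
    return images[keep]

gallery = np.random.rand(300, 100)          # 300 frames of 10x10 images
capped = cap_gallery(gallery, resolution=10)
print(capped.shape)  # (50, 100)
```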

4.2.1 UCSD/Honda Dataset

The UCSD/Honda dataset [22] was developed to evaluate face tracking and recognition algorithms. The dataset consists of 59 videos of 20 individuals. The resolution of each video sequence is 640×480. Of the 20 individuals, all but one have at least two videos in the dataset. All videos contain significant head rotations and pose variations, and some of the video sequences also contain partial occlusions in some frames. We followed the standard experimental protocol used in [17], [18], [22], [36], [38]. The face in each video frame was automatically detected using the Viola and Jones face detection algorithm [35]. The detected faces were downsampled to the resolutions of 10×10, 15×15 and 20×20 and converted to grayscale. Histogram equalization was applied to increase the contrast of the images. For the gallery image sets, we randomly selected one video of each individual. The remaining 39 videos were used as the test image sets. For our technique, we randomly selected a small number of frames to keep the number of gallery images considerably smaller than the number of pixels (refer to Section 3). To improve the consistency of the scores, we repeated the experiment 10 times with different random selections of gallery images, gallery image sets and test image sets. Our technique achieved 100% classification accuracy on

all the resolutions, while using a significantly smaller number of gallery images. We also carried out experiments for less gallery and test data by retaining the first 20 and the first 40 images of each image set. Our technique achieved the best classification performance in all of the testing protocols. Table 2 summarizes the average identification rates of our technique compared to other image set classification techniques on the Honda dataset.

4.2.2 CMU MoBo Dataset

The original purpose of the CMU Motion of Body (CMU MoBo) dataset was to advance biometric research on human gait analysis [11]. It contains videos of 25 individuals walking on a treadmill, captured from six different viewpoints. Except for the last one, all subjects in the dataset have four videos, each following a different walking pattern: slow walk, fast walk, inclined walk and holding a ball while walking. Figure 4 shows random images of four individuals in the CMU MoBo dataset, detected by the Viola and Jones face detection algorithm [35].

Following the same procedure as previous works [5], [8], [17], we used the videos of the 24 individuals for which all four walking patterns are available. Only the videos from the camera in front of the individual were used for image set classification. The frames of each video were considered as an image set. We followed the standard protocol of [4], [17], [18] and [38], and randomly selected the video of one walking pattern of each individual for the gallery image set. The remaining videos were considered as test image sets. As mentioned in Section 3, the number of images should be less than or equal to the number of pixels in the downsampled images; in practice, it should be considerably less. For our technique, we randomly selected a small number of frames from each gallery video to form the gallery image sets. The face in each frame was automatically detected using the Viola and Jones face detection algorithm [35]. The images were resampled to the resolutions of 10×10, 15×15 and 20×20 and converted to grayscale. Histogram equalization was applied to increase the contrast of the images. Unlike most other techniques, we used raw images for our technique and did not extract any features, such as the LBP features used in [4], [5], [8], [17]. The experiments were repeated 10 times with different random selections of the gallery and the test sets. We also used different random selections of the


No. of Images→               60-20                      All-All
Methods↓ \ Resolutions→   10×10   15×15   20×20      10×10   15×15   20×20
AHISD [4]                 59.71   59.36   58.58      43.90   55.88   54.18
RNP [40]                  58.22   57.65   56.59      65.46   65.74   65.46
SSDML [41]                60.50   61.42   60.14      63.26   65.25   64.40
DLRC [5]                  57.02   61.13   60.42      60.14   64.40   63.90
DRM [17]                  39.93   46.52   51.91      49.36   57.66   61.06
PLRC [8]                  60.19   59.27   59.48      62.08   62.51   62.59
Ours (MV)                 57.38   58.23   56.95      60.71   60.64   59.29
Ours (NN)                 60.78   61.49   61.35      66.45   66.81   66.46
Ours (WV)                 59.65   60.71   60.78      65.25   66.24   65.89

TABLE 4
Average percentage classification accuracies for the task of video based face recognition on the YouTube Celebrity (YTC) [19] dataset for different sizes of gallery and test image sets.

Fig. 4. Face images of four individuals from the CMU MoBo dataset [11], detected by the Viola and Jones face detection algorithm.

gallery images in each round to make our testing environment more challenging. Our technique achieved the best performance on this dataset, with greater than 99% average classification accuracy. We then reduced the number of images in each image set to the first 20. Our technique achieved the best classification accuracy in both experimental protocols. Table 3 provides the average accuracy of our technique along with a comparison with the other methods.

4.2.3 YouTube Celebrity Dataset

The YouTube Celebrity (YTC) dataset [19] contains cropped clips from videos of 47 celebrities and politicians. In total, the dataset contains 1910 video clips. The videos were downloaded from YouTube and are of very low quality. Due to their high compression rates, the video clips are very noisy and have a low resolution, which makes this a very challenging dataset. The Viola and Jones face detection algorithm [35] failed to detect any face in a large number of frames. Therefore, the Incremental Learning Tracker [28] was used to track the faces in the video clips. To initialize the tracker, we used the

initialization parameters provided by the authors of [19] (see footnote 1). This approach, however, produced many tracking errors due to the low image resolution and noise. Although the cropped face region was not uniform across frames, we decided to use the automatically tracked faces without any refinement to make our experimental settings more challenging.

We followed the standard protocol, which was also followed in [17], [18], [36], [37], [38], for our experiments. The dataset was divided into five folds with minimal overlap between the folds. Each fold consists of 9 video clips per individual, making a total of 423 video clips. Three videos of each individual were randomly selected to form the gallery set, while the remaining six were used as six separate test image sets. All the tracked face images were resampled to the resolutions of 10×10, 15×15 and 20×20 and converted to grayscale. We applied histogram equalization to enhance the image contrast. Figure 5 shows histogram equalized images of four celebrities in the YouTube Celebrity dataset. For the gallery formation in our technique, we randomly selected a small number of images from each of the three gallery videos per individual per fold, in order to keep the number of gallery images less than the number of pixels (features). Our technique achieved the highest classification accuracy among all the methods, while using significantly less gallery data than the other methods. We also carried out experiments for less gallery and test data by selecting the first 20 images from each image set. If a video clip had fewer than 20 frames, all of its frames were used for gallery formation. In this way, each gallery set had a maximum of 60 images while each test set had a maximum of 20 images. Our technique

1. http://seqam.rutgers.edu/site/index.php?option=comcontent&view=article&id=64&Itemid=80


Datasets→              Washington RGB-D Object           ETH-80
Methods↓ \ Resolutions→   10×10   15×15   20×20      10×10   15×15   20×20
AHISD [4]                 53.78   57.32   57.37      73.25   85.25   87.50
RNP [40]                  52.72   55.60   57.17      74.00   74.25   79.00
SSDML [41]                44.39   46.31   43.33      73.25   69.00   67.75
DLRC [5]                  50.91   56.76   57.77      76.20   85.20   87.20
DRM [17]                  34.85   51.92   61.72      88.75   90.25   88.75
PLRC [8]                  55.61   58.13   59.17      80.93   86.47   87.44
Ours (MV)                 61.06   63.74   62.58      89.75   94.75   96.00
Ours (NN)                 53.13   56.31   53.94      74.25   81.00   81.00
Ours (WV)                 61.52   64.80   63.33      90.50   95.25   95.75

TABLE 5
Average percentage classification accuracies for the task of object recognition on the Washington RGB-D Object [21] and ETH-80 [23] datasets.

Fig. 5. Random histogram equalized, grayscale images of four celebrities from the YouTube Celebrity dataset [19].

achieved the best classification accuracies at all resolutions. Table 4 summarizes the average accuracies of the different techniques on the YouTube Celebrity dataset.

4.3 Object Recognition

4.3.1 Washington RGB-D Object Dataset

The Washington RGB-D Object dataset [21] is a large dataset of common household objects. It contains 300 objects in 51 categories, recorded at a resolution of 640×480 pixels. The objects were placed on a turntable and video sequences were captured at 30 fps for a complete rotation. A camera mounted at 3 different heights was used to capture 3 video sequences for each object, so that each object is viewed from different angles.

We used the RGB images of the dataset in our experiments. Using the leave-one-out protocol defined by the authors of the dataset for object class recognition [21], we achieved average classification accuracies of 83.33%, 83.73% and 84.51% at the 10×10, 15×15 and 20×20 resolutions, respectively. To evaluate the proposed technique for less gallery and test data, 20 images were randomly selected

from the three video sequences of each instance to form a single image set. Two image sets of each class were randomly selected to form the gallery set, while the remaining sets were used as test sets. The images were converted to grayscale and resized to 10×10, 15×15 and 20×20 resolutions. Table 5 shows the classification accuracies of the image set classification techniques on this dataset. Our technique achieved the best classification accuracies at all resolutions.

4.3.2 ETH-80 Dataset

The ETH-80 dataset [23] contains eight object categories: apples, pears, tomatoes, cups, cars, horses, cows and dogs. There are ten different image sets in each object category, and each image set contains images of an object taken from 41 different view angles in 3D. We used the cropped images provided by the dataset, which contain the object without any border area. Figure 6 shows a few images of four of the classes from the ETH-80 dataset.

The images were resized to the resolutions of 10×10, 15×15 and 20×20. We used grayscale raw images for our technique. Similar to [17], [20], [36] and [37], five image sets of each object category were randomly selected to form the gallery set, while the other five were considered as independent test image sets. In order to improve the generalization of the results, the experiments were repeated 10 times with different random selections of the gallery and test sets. Table 5 summarizes the results of our technique compared to the other methods. We achieved superior classification accuracy compared to all the other techniques at all resolutions.

5 COMPUTATIONAL TIME ANALYSIS

The fast approach of the proposed technique requires the least computational time compared to


No. of Images→               60-20                            All-All
Methods↓ \ Resolutions→   10×10     15×15     20×20      10×10     15×15     20×20
AHISD [4]                 NR        NR        NR         NR        NR        NR
RNP [40]                  NR        NR        NR         NR        NR        NR
SSDML [41]                9.74      18.89     21.61      101.69    193.37    278.63
DLRC [5]                  NR        NR        NR         NR        NR        NR
DRM* [17]                 2194.28   2194.36   2194.40    4149.87   4157.23   4164.63
PLRC [8]                  NR        NR        NR         NR        NR        NR
Ours                      NR        NR        NR         NR        NR        NR
Ours (Fast)               NR        NR        NR         NR        NR        NR

TABLE 6
Training time in seconds per fold required by the different methods for the task of video based face recognition on the YouTube Celebrity (YTC) [19] dataset for different sizes of training and test image sets. NR indicates that the method does not require training. *Includes the time required to calculate LBP features.

Fig. 6. Random images of four of the classes from the ETH-80 dataset [23]. The resemblance between the images of apple (first row) and tomato (second row), and between the images of horse (third row) and cow (fourth row), shows the challenge of inter-class similarity.

the other image set techniques. In contrast to the significant training times of [17] and [41], the proposed technique does not require any training (Table 6). Our technique is also the fastest at test time, especially for less gallery and test data. Our approach is more than 1.5 times faster than RNP [40], the second fastest method in the table, for the experiments which use the full sized image sets of the YouTube Celebrity dataset [19]. Table 7 shows the testing times required per fold on the YouTube Celebrity dataset for the various methods, using a Core i7 3.40GHz CPU with 24 GB RAM. Although our technique reconstructs each image of the test image set from all the gallery image sets, due to the efficient matrix representation (refer to Sections 3 and 3.4), we achieved a timing efficiency superior to the other methods.
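The reconstruction step being timed here can be sketched in matrix form: each class gallery is a pixels × images matrix whose pseudoinverse is computed once, so an entire test set is projected onto a class subspace with a single matrix product, and the Euclidean residuals follow directly. In the sketch below, the three decision strategies (MV, NN and WV in the tables) are our plain readings of their names (majority vote, smallest single residual, inverse-residual weighted vote); the paper's exact formulations are given in Section 3, which is outside this excerpt.

```python
import numpy as np

def residual_matrix(galleries, test_set):
    """galleries: list of (d, n_c) class gallery matrices (columns are images).
    test_set: (d, m) matrix of vectorized test images. Returns an
    (n_classes, m) matrix of Euclidean residuals between each test image
    and its projection onto each class subspace. The projector is built
    once per class, so all m images are projected in one matrix product."""
    R = np.empty((len(galleries), test_set.shape[1]))
    for c, X in enumerate(galleries):
        P = X @ np.linalg.pinv(X)                # projector onto span(X)
        R[c] = np.linalg.norm(test_set - P @ test_set, axis=0)
    return R

def classify(R):
    """Assumed readings of the three strategies over the residual matrix."""
    per_image = R.argmin(axis=0)                          # best class per image
    mv = int(np.bincount(per_image).argmax())             # majority vote
    nn = int(np.unravel_index(R.argmin(), R.shape)[0])    # smallest residual
    wv = int((1.0 / (R + 1e-12)).sum(axis=1).argmax())    # weighted vote
    return mv, nn, wv

# Toy check: test images lie (up to noise) in the subspace of class 1
rng = np.random.default_rng(1)
d = 100                                         # 10x10 pixels
galleries = [rng.normal(size=(d, 15)) for _ in range(3)]
test_set = galleries[1] @ rng.normal(size=(15, 20)) + 0.01 * rng.normal(size=(d, 20))
print(classify(residual_matrix(galleries, test_set)))  # (1, 1, 1)
```

Precomputing the projector per class is what makes the per-image cost a single matrix-vector product, consistent with the low testing times reported in Table 7.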

6 CONCLUSION AND FUTURE DIRECTION

In this paper, we have proposed a novel image set classification technique, which exploits linear regression for efficient classification. The images

from the test image set are projected on linear subspaces defined by downsampled gallery image sets. The class of the test image set is decided using the residuals of the projected test images. Extensive experimental evaluations for the tasks of surveillance, video based face recognition and object recognition demonstrate the superior performance of our technique under the challenges of low resolution, noise and limited training data, which are commonly encountered in surveillance settings. Unlike other techniques, the proposed technique performs consistently across the different tasks of surveillance, video based face recognition and object recognition. The technique requires less training data than other image set classification methods and works effectively at very low resolutions.

Our proposed technique can easily be scaled from processing one frame at a time (for live video acquisition) to processing all of the test data at once (for faster performance). It can also be quickly and easily deployed, as new classes can be added by simply arranging their images in a matrix, and no training or feature extraction is required. These advantages make our technique ideal for practical applications such as surveillance. For future research, we plan to incorporate non-linear regression in our technique to make it robust to outliers and to further improve the classification accuracy.

ACKNOWLEDGMENTS

This work is partially funded by an SIRF Scholarship from The University of Western Australia (UWA) and Australian Research Council (ARC) grant DP150100294.

Portions of the research in this paper use the SCface database [10] of facial images. Credit is hereby


No. of Images→               60-20                        All-All
Methods↓ \ Resolutions→   10×10    15×15    20×20      10×10    15×15    20×20
AHISD [4]                 22.78    30.09    33.01      63.39    233.37   568.34
RNP [40]                  8.16     14.20    16.07      21.35    31.83    43.65
SSDML [41]                17.26    31.75    32.06      183.36   371.79   498.54
DLRC [5]                  6.62     12.40    23.57      303.55   347.55   431.23
DRM* [17]                 89.15    89.46    89.69      381.82   386.65   391.97
PLRC [8]                  27.91    61.92    104.62     203.44   248.00   299.58
Ours                      4.75     9.71     14.27      42.33    99.62    193.62
Ours (Fast)               0.52     1.28     1.89       10.27    19.30    28.00

TABLE 7
Testing time in seconds per fold required by the different methods for the task of video based face recognition on the YouTube Celebrity (YTC) [19] dataset for different sizes of training and test image sets. The top-2 testing times for each setting are shown in bold in the original table. *Includes the time required to calculate LBP features.

given to the University of Zagreb, Faculty of Electrical Engineering and Computing, for providing the database of facial images.

Portions of the research in this paper use the Choke Point dataset [39], which is sponsored by NICTA. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy, as well as the Australian Research Council through the ICT Centre of Excellence program.

REFERENCES

[1] O. Arandjelovic, G. Shakhnarovich, J. Fisher, R. Cipolla, and T. Darrell. Face recognition with image sets using manifold density divergence. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 581–588. IEEE, 2005.

[2] R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(2):218–233, 2003.

[3] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720, 1997.

[4] H. Cevikalp and B. Triggs. Face recognition based on image sets. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2567–2573. IEEE, 2010.

[5] L. Chen. Dual linear regression based classification for face cluster recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2673–2680. IEEE, 2014.

[6] Y.-C. Chen, V. M. Patel, P. J. Phillips, and R. Chellappa. Dictionary-based face recognition from video. In European Conference on Computer Vision, pages 766–779. Springer, 2012.

[7] Y.-C. Chen, V. M. Patel, S. Shekhar, R. Chellappa, and P. J. Phillips. Video-based face recognition via joint sparse representation. In 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, pages 1–8. IEEE, 2013.

[8] Q. Feng, Y. Zhou, and R. Lan. Pairwise linear regression classification for image set retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4865–4872, 2016.

[9] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, Springer, Berlin, 2001.

[10] M. Grgic, K. Delac, and S. Grgic. SCface - surveillance cameras face database. Multimedia Tools and Applications, 51(3):863–879, 2011.

[11] R. Gross and J. Shi. The CMU Motion of Body (MoBo) database. 2001.

[12] J. Hamm and D. D. Lee. Grassmann discriminant analysis: a unifying view on subspace-based learning. In Proceedings of the 25th International Conference on Machine Learning, pages 376–383. ACM, 2008.

[13] M. T. Harandi, C. Sanderson, S. Shirazi, and B. C. Lovell. Graph embedding discriminant analysis on Grassmannian manifolds for improved image set matching. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2705–2712. IEEE, 2011.

[14] M. T. Harandi, C. Sanderson, A. Wiliem, and B. C. Lovell. Kernel analysis over Riemannian manifolds for visual recognition of actions, pedestrians and textures. In IEEE Workshop on Applications of Computer Vision (WACV), pages 433–439. IEEE, 2012.

[15] M. Hayat and M. Bennamoun. An automatic framework for textured 3D video-based facial expression recognition. IEEE Transactions on Affective Computing, 5(3):301–313, 2014.

[16] M. Hayat, M. Bennamoun, and S. An. Learning non-linear reconstruction models for image set classification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1907–1914, 2014.

[17] M. Hayat, M. Bennamoun, and S. An. Deep reconstruction models for image set classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(4):713–727, 2015.

[18] Y. Hu, A. S. Mian, and R. Owens. Face recognition using sparse approximated nearest points between image sets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(10):1992–2004, 2012.

[19] M. Kim, S. Kumar, V. Pavlovic, and H. Rowley. Face tracking and recognition with visual constraints in real-world videos. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.

[20] T.-K. Kim, J. Kittler, and R. Cipolla. Discriminative learning and recognition of image set classes using canonical correlations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1005–1018, 2007.

[21] K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view RGB-D object dataset. In IEEE International Conference on Robotics and Automation (ICRA), pages 1817–1824. IEEE, 2011.


[22] K.-C. Lee, J. Ho, M.-H. Yang, and D. Kriegman. Video-based face recognition using probabilistic appearance manifolds. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages I-313. IEEE, 2003.

[23] B. Leibe and B. Schiele. Analyzing appearance and contour based methods for object categorization. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages II-409. IEEE, 2003.

[24] J. Lu, G. Wang, W. Deng, P. Moulin, and J. Zhou. Multi-manifold deep metric learning for image set classification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1137–1145, 2015.

[25] I. Naseem, R. Togneri, and M. Bennamoun. Linear regression for face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(11):2106–2112, 2010.

[26] E. G. Ortiz, A. Wright, and M. Shah. Face recognition in movie trailers via mean sequence sparse representation-based classification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3531–3538, 2013.

[27] R. Penrose. On best approximate solutions of linear matrix equations. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 52, pages 17–19. Cambridge University Press, 1956.

[28] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1-3):125–141, 2008.

[29] S. Rudrani and S. Das. Face recognition on low quality surveillance images, by compensating degradation. In International Conference on Image Analysis and Recognition (ICIAR 2011, Part II), Burnaby, BC, Canada, Lecture Notes in Computer Science (LNCS), 6754/2011:212–221, June 2011.

[30] T. P. Ryan. Modern Regression Methods, volume 655. John Wiley & Sons, 2008.

[31] G. A. Seber and A. J. Lee. Linear Regression Analysis, volume 936. John Wiley & Sons, 2012.

[32] S. A. A. Shah, U. Nadeem, M. Bennamoun, F. Sohel, and R. Togneri. Efficient image set classification using linear regression based image classification. In IEEE Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 2017.

[33] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis, volume 12. Springer Science & Business Media, 2013.

[34] P. Turaga, A. Veeraraghavan, A. Srivastava, and R. Chellappa. Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11):2273–2286, 2011.

[35] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

[36] R. Wang and X. Chen. Manifold discriminant analysis. In IEEE Conference on Computer Vision and Pattern Recognition, pages 429–436. IEEE, 2009.

[37] R. Wang, H. Guo, L. S. Davis, and Q. Dai. Covariance discriminative learning: A natural and efficient approach to image set classification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2496–2503. IEEE, 2012.

[38] R. Wang, S. Shan, X. Chen, and W. Gao. Manifold-manifold distance with application to face recognition based on image set. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.

[39] Y. Wong, S. Chen, S. Mau, C. Sanderson, and B. C. Lovell. Patch-based probabilistic image quality assessment for face selection and improved video-based face recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 74–81. IEEE, 2011.

[40] M. Yang, P. Zhu, L. Van Gool, and L. Zhang. Face recognition based on regularized nearest points between image sets. In 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, pages 1–7. IEEE, 2013.

[41] P. Zhu, L. Zhang, W. Zuo, and D. Zhang. From point to set: Extend the learning of distance metrics. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2664–2671, 2013.

Uzair Nadeem received his Bachelor of Engineering degree from the National University of Science and Technology (NUST) in 2014. He is currently a PhD candidate at The University of Western Australia (UWA), sponsored by the Scholarship for International Research Fees (SIRF). His research interests include deep learning, computer vision, machine learning, image processing and pattern recognition.

Syed Afaq Ali Shah obtained his PhD from the University of Western Australia in the area of computer vision and machine learning. He is currently working as a research associate in the School of Computer Science and Software Engineering, The University of Western Australia, Crawley, Australia. He has been awarded the Start Something Prize for Research Impact through Enterprise for a 3D facial analysis project funded by the Australian Research Council. His research interests include deep learning, 3D object/face recognition, 3D modelling, and image processing.

Mohammed Bennamoun received the MSc degree in control theory from Queen's University, Canada, and the PhD degree in computer vision from QUT in Brisbane, Australia. He is currently a Winthrop Professor at the University of Western Australia, Australia. He is the co-author of the book Object Recognition: Fundamentals and Case Studies (Springer-Verlag, 2001). He has published more than 300 journal and conference publications. His areas of interest include control theory, robotics, object recognition, artificial neural networks, and computer vision. He has served as a guest editor for several special issues in international journals, such as the International Journal of Pattern Recognition and Artificial Intelligence. He was selected to give tutorials at the European Conference on Computer Vision (ECCV 2002), the International Conference on Acoustics, Speech and Signal Processing (2003), Interspeech (2014), and CVPR (2016).


Roberto Togneri (M'89–SM'04) received the Ph.D. degree in 1989 from the University of Western Australia. He joined the School of Electrical, Electronic and Computer Engineering at The University of Western Australia in 1988, where he is currently an Associate Professor. He leads the Signal Processing and Recognition Lab, and his research activities in signal processing and pattern recognition include: feature extraction and enhancement of audio signals, statistical and neural network models for speech and speaker recognition, audio-visual recognition and biometrics, and related aspects of language modelling and understanding. He has published over 150 refereed journal and conference papers in the areas of signal processing and recognition, has been the chief investigator on three Australian Research Council Discovery Project research grants since 2009, and was an Associate Editor for IEEE Signal Processing Magazine Lecture Notes and IEEE Transactions on Speech, Audio and Language Processing from 2012 to 2016.

Ferdous Sohel received the PhD degree from Monash University, Australia in 2009. He is currently a Senior Lecturer at Murdoch University, Australia. Prior to joining Murdoch University, he was a Research Assistant Professor at the School of Computer Science and Software Engineering, The University of Western Australia from January 2008 to mid-2015. His research interests include computer vision, multimodal biometrics, scene understanding, and robotics. He is a recipient of the prestigious Discovery Early Career Researcher Award (DECRA) funded by the Australian Research Council. He is also a recipient of the Early Career Investigators award (UWA) and the best PhD thesis medal from Monash University.