PERSON IDENTIFICATION USING FACE AND SPEECH BIOMETRICS
by
Imran Naseem
A Thesis Presented to the Graduate Research School
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
IN
Electrical, Electronic and Computer Engineering
UNIVERSITY OF WESTERN AUSTRALIA
Crawley, WA 6009, Australia
June 2010
Acknowledgements
In the name of Allah, the Most Gracious and the Most Merciful
All praise and glory goes to Almighty Allah (Subhanahu Wa Ta’ala) who gave me the
courage and patience to carry out this work. Peace and blessings of Allah be upon His
last Prophet Muhammad (peace be upon him). First and foremost gratitude is due to
the esteemed university, the University of Western Australia, and to its learned faculty
members for imparting quality knowledge. My deep appreciation and heartfelt gratitude
goes to my thesis supervisors Dr. Roberto Togneri and Prof. Mohammed Bennamoun for
their constant support and the numerous moments of attention they devoted throughout
the course of this research work. Working with them in a friendly and motivating envi-
ronment was really a joyful experience of my life. I am thankful to Roberto Togneri, in
particular, for having faith in my efforts and ideas, and for giving me liberty in choosing
my research topics. He also efficiently managed and upgraded the Signal and Information
Processing (SIP) lab which made my experiment work a lot easier. Thanks are also due
to Prof. Mohammed Bennamoun, for allowing me to use the lab facility in his school.
I would like to acknowledge my mentor, Dr. Muhammad Hafiz Afzal, for providing
unconditional support, love and guidance from far away. I wish I could be in his company
again. Acknowledgement is due to my friends Dr. Nazim Khan, Ghazi Abu Rumman,
Salim, Bandar, Abdul Rahman, Shafiq, Hisham, Adnan Azam and many others all of
whom I would not be able to itemize. I owe thanks to my housemates and friends Khalid,
Abdul Rahman, Tahir, Umair, Shahnawaz and Asim Aqeel for their help, motivation
and pivotal support. They made my work and stay at UWA very pleasant and joyful.
My heartfelt thanks to my old friends Hashim Raza Khan, Khawar Saeed, Mudassir
Masood, Imran Azam, Faisal Zaheer, Mazhar Azim, Sajid Anwar, Moinuddin, Saad Azhar,
Arshad Raza and Aiman Rashid. I wish we could get together some time.
Last, but not the least, I thank my family: my respected father, Muhammad Naseem
Siddiqui, and my loving mother, Rashida Gulnaz, for educating me, for unconditional
love, support and encouragement to pursue my interests, even when the interests went
beyond boundaries of language, field and geography. My wife Javeria, with whom I have
recently started a new era of my life full of love, devotion, care and understanding. My
dearest sisters Sadia, Zuvia and Moniza have always been a great moral support for me;
I love them all and wish them a prosperous future. My brother Arsalan Naseem, for
taking care of family while I am overseas. My grandfather, Muhammad Jameel Siddiqui
(late), was a light for me in dark times and gloomy circumstances. I enjoyed each and
every moment spent in his company, his memories are still a source of joy for me. His
demise was a great setback for the whole family, but then everyone has to go; nobody
stays in this mortal world forever. May Allah grant him paradise and forgiveness. With all
humbleness, I pray to Allah Subhanahu Wa Ta’ala for the prosperity and well-being of the
whole human race, irrespective of religion, caste, creed and ethnicity. May Allah show us
all the right path which will lead us to success in this world and the hereafter, Ameen.
Contents
Acknowledgements iii
Abstract xv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Face as a Biometric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Voice as a Biometric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Aims and Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Sparse Representation for Visual Biometric Recognition 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Compressive Sensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Sparse Representation Classification . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Sparse Representation Classification for Recognition from Still Face Images 16
2.4.1 Yale Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.2 AT&T Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.3 AR Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Sparse Representation Classification for Video-based Face Recognition . . . 21
2.5.1 Scale Invariant Feature Transform (SIFT) for Face Recognition . . . 21
2.5.2 Experimental Results and Discussion . . . . . . . . . . . . . . . . . . 23
2.6 Sparse Representation Classification for Ear Biometric . . . . . . . . . . . . 27
2.6.1 Experiments and Discussion . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.8 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3 Linear Regression for Face Identification 35
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Linear Regression for Face Recognition . . . . . . . . . . . . . . . . . . . . 37
3.2.1 Linear Regression Classification (LRC) Algorithm . . . . . . . . . . 37
3.2.2 Modular Approach for the LRC Algorithm . . . . . . . . . . . . . . 39
3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 AT&T Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 Yale Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.3 Georgia Tech (GT) Database . . . . . . . . . . . . . . . . . . . . . . 46
3.3.4 FERET Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.5 Extended Yale B Database . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.6 AR Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.5 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4 Robust Regression for Face Recognition 67
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 The Problem of Robust Estimation . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Robust Linear Regression Classification (RLRC) for Robust Face Recognition 73
4.4 Case Study: Face recognition in Presence of Severe Illumination Variations 74
4.4.1 Yale Face Database B . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4.2 CMU-PIE Face Database . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4.3 AR Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4.4 FERET Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.5 Case Study: Face Recognition in Presence of Random Pixel Noise . . . . . 84
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.7 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5 Speaker Identification using Sparse Representation 103
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2 Sparse Representation for Speaker Identification . . . . . . . . . . . . . . . 106
5.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.5 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6 Speaker Identification using Linear Regression 111
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.2 Linear Regression Classification (LRC) Algorithm . . . . . . . . . . . . . . . 114
6.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.4 Conclusion and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . 119
6.5 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7 Conclusions and Future Directions 123
7.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Bibliography 127
List of Figures
2.1 A typical subject from the Yale database with various poses and variations. 16
2.2 Recognition accuracy for the (a) Yale and (b) AT&T databases with
respect to feature dimension. . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 A typical subject from the AT&T database . . . . . . . . . . . . . . . . . . 18
2.4 Gesture variations in the AR database; note the changing position of the
head with different poses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 A typical localized face from the VidTIMIT database with extracted SIFTs. 22
2.6 A sample video sequence from the VidTIMIT database. . . . . . . . . . . . 24
2.7 (a) Rank profiles and (b) ROC curves for the SIFT, SRC and the combina-
tion of the two classifiers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.8 Variation in performance with respect to bias in fusion. . . . . . . . . . . . 27
2.9 A typical subject from the UND database illustrating different pose and
illumination variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.10 A typical cropped ear (a) and its compressed form in the feature space (b). 28
2.11 (a) Rank profile for the UND database. (b) ROC curves for the UND
Database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.12 (a) Rank profile for the FEUD database. (b) ROC curves for the FEUD
Database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.13 A typical subject from the FEUD database . . . . . . . . . . . . . . . . . . 31
3.1 A typical subject from the AT&T database . . . . . . . . . . . . . . . . . . 42
3.2 (a) Test image from subject 1. (b) Residuals using a randomly selected
false subspace. (c) Residuals using subspace 1 . . . . . . . . . . . . . . . . 43
3.3 (a) Recognition accuracy for the AT&T database with respect to feature
dimension using the LRC algorithm. (b) Cross-validation with 20 random
selections of gallery and probe images. . . . . . . . . . . . . . . . . . . . . . 44
3.4 A typical subject from the Yale database with various poses and variations. 46
3.5 Yale database: Recognition accuracy with respect to feature dimension for
a randomly selected experiment. . . . . . . . . . . . . . . . . . . . . . . . . 47
3.6 Samples of a typical subject from the GT database. . . . . . . . . . . . . . 48
3.7 A typical subject from the FERET database, fa and fb representing frontal
shots with gesture variations while ql and qr correspond to pose variations. 50
3.8 Starting from top, each row illustrates samples from subsets 1, 2, 3, 4 and
5 respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.9 Recognition accuracy with varying feature dimension for EP1, EP2, EP3
and EP4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.10 Gesture variations in the AR database; note the changing position of the
head with different poses. The first and second rows correspond to two different
sessions, incorporating neutral, happy, angry and screaming expressions respectively. 55
3.11 Examples of contiguous occlusion in the AR database. . . . . . . . . . . . . 59
3.12 The recognition accuracy versus feature dimension for scarf occlusion using
the LRC approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.13 A sample image indicating eyes and mouth locations for the purpose of
manual alignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.14 Samples of cropped and aligned faces from the AR database. . . . . . . . . 61
3.15 Case studies for Modular LRC approach for the problem of scarf occlusion. 62
3.16 (a) Distance measures dj(n) for the four partitions; note that non-face com-
ponents make decisions with low evidence. (b) Recognition accuracies for
all blocks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1 Yale Face Database B: Starting from top, each row represents typical im-
ages from subsets 3, 4 and 5 respectively. Note that subset 5 (third row)
characterizes the worst illumination variations. . . . . . . . . . . . . . . . . 75
4.2 The 21 different illumination variations for a typical subject from the CMU
PIE database. These images were captured without any ambient lighting
thereby demonstrating more severe luminance alterations . . . . . . . . . . 77
4.3 Performance curves for the CMU-PIE database under EP 2. . . . . . . . . . 80
4.4 Various luminance variations for a typical subject of the AR database, the
two rows represent two different sessions. . . . . . . . . . . . . . . . . . . . 81
4.5 ROC curves for the FERET database. . . . . . . . . . . . . . . . . . . . . . 84
4.6 First row illustrates some gallery images from subset 1 and 2 while second
row shows some probes from subset 3. . . . . . . . . . . . . . . . . . . . . . 85
4.7 Probe images corrupted with (a) 20% (b) 40% (c) 60% and (d) 80% dead
pixels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.8 Recognition accuracy of various approaches for a range of dead pixel noise
density. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.9 Probes with (a) 20% (b) 40% (c) 70% and (d) 90% salt and pepper noise
density. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.10 Recognition accuracy curves in the presence of varying density of salt and
pepper noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.11 Probe images corrupted with (a) 4 (b) 6 (c) 8 and (d) 10 variance speckle
noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.12 Dead-pixel noise: First row elaborates rank-recognition profiles while the
second row shows the Receiver Operating Characteristics (ROC). From left
to right columns indicate 20%, 40%, 60% and 80% noise density respectively. 91
4.13 Salt and pepper noise: First row represents rank recognition curves while
the second row shows Receiver Operating Characteristics (ROC). From left
to right columns indicate 50%, 60% 70% and 80% noise densities respectively. 92
4.14 Speckle noise: First row represents rank recognition curves while second
row shows receiver operating characteristics. From left to right columns
indicate noise densities with variances 2, 4, 6, and 8 respectively. . . . . . . 94
4.15 Gaussian noise: First row represents rank recognition curves while second
row shows Receiver Operating Characteristics (ROC). From left to right
columns indicate noise densities with variances 0.5, 0.7, 0.8, and 0.9 respec-
tively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.16 Probe images corrupted with (a) 0.2 (b) 0.4 (c) 0.6 and (d) 0.8 variance
zero-mean Gaussian noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.17 Recognition accuracy of various approaches in the presence of speckle noise
for different variances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.18 Recognition accuracy of various approaches in the presence of Gaussian
noise for different variances. . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.1 Experiment Set 1: Recognition accuracy of various approaches with respect
to number of mixtures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.2 Experiment Set 2: Recognition accuracy of various approaches with respect
to number of mixtures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
List of Tables
2.1 Results for Yale database using the leave-one-out method. . . . . . . . . . . 17
2.2 Results for two experiment sets using the AT&T database. . . . . . . . . . 18
2.3 Recognition Results for Gesture Variations under Experiment Set 1 . . . . . 20
2.4 Recognition Results for Gesture Variations under Experiment Set 2 . . . . . 21
2.5 Summary of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1 Results for EP1 and EP2 using the AT&T database. . . . . . . . . . . . . . 45
3.2 Results for Yale database using the leave-one-out method. . . . . . . . . . . 46
3.3 Results for the Georgia Tech. database. . . . . . . . . . . . . . . . . . . . . 49
3.4 Results for the FERET database. . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5 Results for the Extended Yale B database. . . . . . . . . . . . . . . . . . . . 52
3.6 Recognition Results for Gesture Variations Using the LRC Approach . . . . 55
3.7 Recognition Results for Gesture Variations under EP3 . . . . . . . . . . . . 57
3.8 Recognition Results for Gesture Variations under EP4 . . . . . . . . . . . . 58
3.9 Recognition Results for Occlusion . . . . . . . . . . . . . . . . . . . . . . . . 59
3.10 Comparison of the DEF with the Sum Rule for Three Case Studies . . . . . 64
4.1 Outline of Robust Linear Regression Classification (RLRC) Algorithm . . . 72
4.2 Details of the subsets for Yale Face Database B with respect to light source
directions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 Recognition Results for Yale Face Database B . . . . . . . . . . . . . . . . . 76
4.4 Performance comparison with state-of-the-art algorithms characterizing train-
ing images captured from near frontal lighting. All results are as reported
in [1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.5 Performance comparison with state-of-the-art algorithms characterizing train-
ing images with severe lighting conditions. All results are as reported in
[1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.6 Results for the AR database under EP 1. . . . . . . . . . . . . . . . . . . . 81
4.7 Results for the AR database under EP 2. . . . . . . . . . . . . . . . . . . . 82
4.8 Results for the AR database under EP 3. . . . . . . . . . . . . . . . . . . . 82
4.9 Verification Results for dead-pixel noise . . . . . . . . . . . . . . . . . . . . 89
4.10 Verification Results for salt and pepper noise . . . . . . . . . . . . . . . . . 89
4.11 Verification Results for Speckle Noise . . . . . . . . . . . . . . . . . . . . . . 96
4.12 Verification Results for Gaussian Noise . . . . . . . . . . . . . . . . . . . . . 96
5.1 Experimental Results for the TIMIT database . . . . . . . . . . . . . . . . . 108
6.1 Experiment Set 1: Recognition accuracy for various approaches with respect
to different number of mixtures. . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.2 Experiment Set 2: Recognition accuracy for various approaches with respect
to different number of mixtures. . . . . . . . . . . . . . . . . . . . . . . . . . 116
Abstract
Increasing security threats have recently highlighted the importance of efficient authenti-
cation systems. Although face and speech biometrics have shown good performance, there
are key robustness issues which challenge the reliability of these systems. For instance,
illumination, expression, pose and occlusion remain open challenges in the paradigm of
face recognition, and addressing them calls for novel algorithms. This dissertation first
investigates the recently proposed face recognition algorithm, Sparse Representation
Classification (SRC), with respect to these key robustness issues.
Since local features such as eyes, ears and lips have shown better performance compared
to their global counterparts, the thesis successfully extends the SRC approach to the
problem of ear recognition. In the paradigm of face recognition, three novel algorithms
named Linear Regression Classification (LRC), Modular LRC and Robust Linear Regres-
sion Classification (RLRC) are proposed to address various issues of severe expression
variations, adverse luminance variations, contiguous occlusion and random pixel corrup-
tion. Extensive experiments have been conducted on standard databases and excellent
results have been reported. In particular, using the Modular LRC approach we achieve
the best result ever reported for the challenging scarf occlusion problem. Addressing the
problem of luminance, the proposed RLRC algorithm is able to achieve 100% recognition
accuracy on the most adversely distorted Subset 5 of the Yale Face Database B, outperforming
contemporary illumination-invariant face recognition algorithms. In the paradigm of
speaker recognition, we propose two novel algorithms based on sparse representation and
linear regression. These algorithms are tested on subsets of the TIMIT database, achieving
performance competitive with state-of-the-art approaches. The dissertation is presented
as a compilation of publications.
Chapter 1
Introduction
1.1 Motivation
With the advent of technology, electronic devices have become an integral part of our
day-to-day life. Their applications range from everyday home appliances to high-performance
scientific instruments in modern laboratories. In many cases, access to these facilities must
be restricted to a limited group of people.
For example, you will not allow a stranger to access your office computer. Similarly, it
will be a serious problem if your bank account information is hacked and can be accessed
through an ATM. Some kind of “key” is therefore needed so that only the authorized
person(s), and not an intruder, can access these facilities. The earliest, and still most
common, technique is the password: the user is required to type in his or her password.
If the password is correct the user is allowed access to the facility, otherwise the system
rejects the user. There are several problems associated with the use of passwords [2].
Passwords can be forgotten, and it is difficult to memorize different passwords for different
applications; users therefore tend to reuse one password across various services, which
greatly increases the risk. With recent technological developments, passwords can also be
“stolen” or hacked. To counter these difficulties, smart-cards were introduced in
conjunction with passwords: the system grants access only to the person in possession of
the card. Again, this poses several problems, as the card can be lost, stolen, damaged or
even duplicated. To avoid these troubles, systems based on biometrics started
to gain more popularity and acceptance [2].
Biometrics refers to the capture of a person’s physical or behavioral characteristics, or
traits, which can be used for authentication or identification. Biometrics helps us avoid
the problems discussed above: in a biometrics-based environment, the subject is an
identity in himself. It is, however, important to point out that with the recent emergence
of multimedia technology, traditional biometric systems have also become vulnerable. A
biometric system can be fooled by presenting very fine quality recorded biometric samples
when the person concerned is not actually present [3]. This problem, referred to as
“liveness detection”, is in fact an emerging research area in the field of biometrics.
However, even with the issue of liveness detection, biometrics arguably remains
the best choice for secure authentication. Several biometric systems exist today:
• Fingerprints
• Voice
• Face
• Hand Geometry
• Eyes Geometry
• Iris
• Ear Geometry
Systems based on the above-mentioned traits have been shown to be very efficient.
One must be mindful, however, that each biometric system comes with some intrinsic
weaknesses, and a system suited to one application may not fit well for
another. For example, fingerprints are arguably the most developed of all the biometrics
and achieve very good recognition accuracy but require a high level of cooperation from
the user. This makes fingerprints non-user-friendly and unsuitable for many important
applications, such as surveillance. Similarly, iris patterns have proven to be excellent
features for human recognition, but the process of getting a good iris image is complex,
expensive and intrusive. Of the above-mentioned biometrics, face and speech are the two
most natural choices as they are user-friendly, do not require physical contact and are less
intrusive.
1.2 Face as a Biometric
The importance of face recognition is highlighted by the wide deployment of video
surveillance systems. Surveillance cameras can be used to monitor abnormal
activities in sensitive areas. The face recognition problem can be defined as the process
of identifying an individual from his/her face image. This face image can be captured
by a camera or can be extracted from a video. Face recognition is a challenging task in
pattern recognition, mainly because the image of a face is prone to change with a number
of factors such as noise, illumination, viewpoint, age, facial expression and occlusion. In
the past 40 years, we have witnessed major developments in the
field of face recognition, driven mainly by the need for such systems in various commercial
security applications. In spite of this advancement, recognition systems still face limitations
due to the above-mentioned robustness issues. In particular, contiguous occlusion and
illumination variation are considered the most challenging problems in the paradigm of
face recognition [3].
1.3 Voice as a Biometric
Automatic speaker recognition systems identify people from their utterances. Depending
on the nature of the application, speaker identification or speaker verification systems
can be designed to operate in either text-dependent or text-independent mode. In
text-dependent mode, the user is required to utter a specific password, while text-independent
ASR (Automatic Speaker Recognition) imposes no such constraint. Success in both cases
depends on modeling the speech characteristics which distinguish one user from another.
Text-dependent mode is used in applications where the user is willing to cooperate by
memorizing a phrase or password to be spoken.
Research in the field of speaker recognition traces back to the early 1960s, when Lawrence
Kersta at Bell Labs [4] made the first major step in speaker verification by computer,
introducing the term “voiceprint” for a spectrogram. Since then, there has been a tremendous amount
of research in the area and speech has emerged as a mature biometric trait.
1.4 Aims and Scope
This research focused on developing novel and robust recognition algorithms for face and
speech biometrics. Recent developments in the theory of compressive sensing [5] have found
numerous applications in various fields of signal processing, and the concept of sparse
representation has recently been applied to the problem of face recognition [6]. Sparse
Representation Classification (SRC), presented in [6], has shown some interesting results. The feature
extraction stage has been a debatable topic within the face recognition community. A
number of approaches incorporating complicated computations have been presented. In
[6], however, it was shown that with the choice of an appropriate classifier, merely
downsampled images are sufficient at the feature extraction stage to yield good results. Essentially,
SRC was the starting point for this research: we successfully extended this intriguing
approach to other challenging problems of view-based biometric recognition systems.
These experiments were quite encouraging and led to the development of a novel face
recognition algorithm, called Linear Regression Classification (LRC). The proposed LRC
algorithm showed excellent results on standard face databases. We further extended the
approach to a patch-based technique called Modular LRC to achieve excellent results for
the challenging problem of contiguous occlusion. Noting that LRC formulates the problem
of face recognition as a task of linear regression, we further proposed to use robust statis-
tics to develop a Robust Linear Regression Classification (RLRC) algorithm to tackle the
challenging issues of illumination variations and random pixel corruption.
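The core idea underlying LRC, that the gallery images of each class span a class-specific subspace and a probe is assigned to the class with the smallest least-squares reconstruction residual, can be sketched as follows. This is a minimal illustration in the spirit of the algorithm, not the thesis implementation; the function name and data layout are assumptions.

```python
import numpy as np

def lrc_predict(class_galleries, y):
    """Classify probe vector y by minimum reconstruction residual.

    class_galleries: list of (d, n_i) arrays; each column is a
    downsampled, vectorized gallery image of one class.
    y: (d,) probe vector.
    Returns the index of the class whose subspace best reconstructs y.
    """
    residuals = []
    for X in class_galleries:
        # Least-squares projection of y onto the column span of X
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals.append(np.linalg.norm(y - X @ beta))
    return int(np.argmin(residuals))
```

A probe lying close to one class's subspace yields a near-zero residual for that class and a large residual for the others, so the argmin recovers the correct identity.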
Traditionally, the problem of speaker identification is tackled using Gaussian Mixture
Model (GMM)-based probabilistic approaches. Recently, however, the concept of the
GMM mean supervector has enabled the representation of a speaker as a point in a
high-dimensional space, i.e. the speaker space [7]. Essentially, with this approach, a
variable-length utterance can be represented as a fixed-length feature vector in the feature
space, which was not possible before. Consequently, the problem of speaker recognition
can be tackled as a general problem of pattern recognition. In [7], the Support Vector
Machine (SVM) was successfully used to yield results competitive with state-of-the-art
probabilistic approaches. SVM calculations, however, tend to be computationally
expensive; in particular, for a one-against-all SVM architecture with a large
number of Gaussian mixtures, the simulations are not possible on a standard machine. It
is therefore imperative to further explore current pattern classification algorithms for the
problem of speaker identification. With this understanding, we extended the SRC and LRC
classification algorithms to the problem of speaker identification, achieving performance
competitive with state-of-the-art approaches, including the SVM approach. The proposed
algorithm remains one of the simplest of the current state-of-the-art classification approaches.
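As a rough illustration of how a variable-length utterance becomes a fixed-length point in the speaker space, the sketch below forms a mean supervector by relevance-MAP adaptation of the means of a diagonal-covariance universal background model (UBM). This is a standard construction assumed for illustration; the function name, data layout and relevance factor r are not taken from [7].

```python
import numpy as np

def gmm_mean_supervector(frames, ubm_means, ubm_covs, weights, r=16.0):
    """Map a variable-length utterance to a fixed-length supervector.

    frames: (T, d) array of feature vectors (e.g. MFCCs), any T.
    ubm_means, ubm_covs: (M, d) arrays of diagonal-Gaussian parameters.
    weights: (M,) UBM mixture weights.
    Returns the (M*d,) concatenation of the MAP-adapted means.
    """
    # Per-frame, per-component log-likelihoods under diagonal Gaussians
    diff = frames[:, None, :] - ubm_means[None, :, :]              # (T, M, d)
    log_g = -0.5 * np.sum(diff**2 / ubm_covs
                          + np.log(2 * np.pi * ubm_covs), axis=2)  # (T, M)
    log_p = np.log(weights) + log_g
    # Responsibilities via a numerically stable softmax over components
    post = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    n = post.sum(axis=0)                                           # soft counts (M,)
    ex = (post.T @ frames) / np.maximum(n, 1e-10)[:, None]         # first-order stats
    alpha = (n / (n + r))[:, None]                                 # adaptation weight
    adapted = alpha * ex + (1 - alpha) * ubm_means                 # MAP mean update
    return adapted.ravel()                                         # (M*d,) supervector
```

However long the utterance, the output has fixed dimension M*d, so standard vector-space classifiers (SVM, SRC, LRC) can then be applied directly.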
1.5 Thesis Structure
The research presented in this thesis has either been published, accepted for publication,
or is under review in prestigious journals and conferences; the thesis is therefore presented
as a compilation of these publications. It is worth pointing out that, for the sake of consistency
and flow, some chapters consist of more than one publication. Nevertheless each chapter
is self-contained and does not require linkage with any other chapter. Since the thesis
is presented as a compilation of publications, there is an inevitable overlap between the
chapters when describing the general problem statements. The organization of the thesis
is as follows:
Chapter 2 is an extensive evaluation of the recently introduced SRC algorithm for view-
based biometrics. It begins with the evaluation of robustness of the SRC algorithm for
two major issues. In particular we address the issues of severe expression variations and
moderate illumination variations. We further investigate the performance of the SRC
algorithm on the video-based face recognition problem. Considering that, in the paradigm
of face recognition, local features (such as the eyes, ears and lips) have shown good results
compared to their global counterparts, we extend the SRC approach to the problem of
ear recognition. Publications incorporated in the chapter are also reported at the end of
the chapter.
Chapter 3 presents the Linear Regression Classification (LRC) algorithm which is a
novel approach for the problem of face recognition. The difficult problem of occluded faces
is also successfully addressed using the modular LRC approach. Publications constituting
the chapter are also reported at the end.
Chapter 4 presents the Robust Linear Regression Classification (RLRC) algorithm, which
is a novel approach to address two major robustness issues, namely (1) severe illumination
variations and (2) random pixel noise. The proposed algorithm has shown superior
performance compared to state-of-the-art robust approaches. Publications constituting
the chapter are also reported at the end.
Chapter 5 presents a novel extension of the SRC algorithm to the problem of speaker
identification. Experiments have been conducted and results are compared to state-
of-the-art approaches including the recently proposed SVM classification. The publication
arising from the chapter is indicated at the end.
Chapter 6 presents the novel LRC algorithm for the problem of speaker identification.
Extensive experiments have shown good performance for the proposed approach. The
publication arising from the chapter is indicated at the end.
Chapter 7 concludes the dissertation with a summary of the contributions and sug-
gested future directions of the research.
1.6 Contributions
1. Evaluation of the robustness of the Sparse Representation Classification (SRC) algorithm
against (1) slight-to-moderate illumination variations and (2) severe expression variations.
2. Extension of the SRC algorithm for the problem of ear recognition.
3. Extension of the SRC algorithm for video-based face recognition.
4. Development of a novel face recognition algorithm called Linear Regression Clas-
sification (LRC) demonstrating excellent results compared to the benchmark ap-
proaches.
5. Development of Modular LRC approach to tackle the difficult problem of contiguous
occlusion using the novel Distance based Evidence Fusion (DEF) algorithm. The
proposed algorithm achieved the best results ever reported for the scarf occlusion
problem.
6. Development of the Robust Linear Regression Classification (RLRC) algorithm for
the challenging issues of (1) illumination variations and (2) random pixel corruption.
The proposed algorithm achieved excellent results compared to state-of-the-art robust
approaches.
7. Development of a novel algorithm in the paradigm of speaker identification using
the concept of sparse representation.
8. Development of a novel linear regression based speaker recognition algorithm achieving
results comparable to state-of-the-art approaches.
1.7 Publications
Key publications arising from the thesis are the following:
1. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Linear Regression
for Face Recognition”, in press, IEEE Transactions on Pattern Analysis and Machine
Intelligence (IEEE TPAMI).
2. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Robust Regression
for Face Recognition”, first revision submitted to IEEE Transactions on Pattern
Analysis and Machine Intelligence (IEEE TPAMI).
3. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Face Identification
using Linear Regression”, International Conference on Image Processing (ICIP'09),
Cairo, Egypt.
4. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Sparse Representation
for View-Based Face Recognition”, accepted as a book chapter in Advances
in Face Image Analysis: Techniques and Technologies, Ed. Y.-J. Zhang, IGI Global
Publishing.
5. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Sparse Representation
for Speaker Identification”, accepted in IAPR International Conference on
Pattern Recognition (ICPR'10).
6. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Robust Regression
for Face Recognition”, accepted in IAPR International Conference on Pattern
Recognition (ICPR'10).
7. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Sparse Representation
for Video-Based Face Recognition”, book chapter in Advances in Biometrics
(Lecture Notes in Computer Science, LNCS series), Springer Berlin/Heidelberg,
Volume 5558/2009, pages 219-228, ISSN 0302-9743 (Print), 1611-3349 (Online),
ISBN 978-3-642-01792-6.
8. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Sparse Representation
for Video-Based Face Recognition”, International Conference on Biometrics
(ICB'09), Alghero, Italy.
9. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Sparse Representation
for Ear Biometrics”, book chapter in Advances in Visual Computing (Lecture
Notes in Computer Science, LNCS series), Springer Berlin/Heidelberg, Volume
5359/2008, pages 336-345, ISSN 0302-9743 (Print), 1611-3349 (Online), ISBN 978-
3-540-89645-6.
10. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Sparse Representation
for Ear Biometrics”, International Symposium on Visual Computing (ISVC),
December 1-3, 2008, Las Vegas, Nevada, USA.
11. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Linear Regression
for Speaker Identification”, submitted to InterSpeech 2010.
Chapter 2
Sparse Representation for Visual
Biometric Recognition1
2.1 Introduction
With ever increasing security threats, the need for invulnerable authentication systems
is becoming more acute. Traditional means of securing a facility essentially depend
on strategies corresponding to “what you have” or “what you know”, for example smart
cards, keys and passwords. These systems, however, can easily be fooled. Passwords,
for example, are difficult to remember, and people therefore tend to reuse the same
password across multiple facilities, making it more susceptible to hacking. Similarly,
cards and keys can easily be stolen or forged. A more inherent approach is therefore to
adopt strategies corresponding to “what you are” or “what you exhibit”, i.e. biometrics.
Although the issue of “liveness” has recently been highlighted due to advances in digital
media technology, biometrics arguably remain the best choice.
Among the other available biometrics, such as speech, iris, fingerprints, hand geometry
and gait, face seems to be the most natural choice [2]. It is nonintrusive, requires a mini-
mum of user cooperation and is cheap to implement. The importance of face recognition
is highlighted for widely used video surveillance systems where we typically have facial
images of suspects.
1Parts of this chapter have been published in the International Symposium on Visual Computing (ISVC'08) and the International Conference on Biometrics (ICB'09). The research has also been accepted for publication as a book chapter in the upcoming book “Advances in Face Image Analysis: Techniques and Technologies”.
With the additional temporal dimension, video sequences are much
more informative than still images. As a result the person identification task is facili-
tated due to specific attributes of each subject such as head rotation and pose variation
along the temporal dimension. Additionally more efficient face representations such as
super resolution images can be derived from video sequences for further enhancement of
the overall system. These motivations have urged researchers to look into the develop-
ment of face recognition systems that can utilize the spatiotemporal information in video
sequences. It is therefore becoming imperative to evaluate present state-of-the-art face
recognition algorithms for video-based applications. A face recognition system works in
three modes: 1) Face Identification/Recognition 2) Face Verification and 3) The Watch
List approach. Face identification/recognition is a 1:N matching problem where a Closed
Universe model is used. Therefore each probe image is implicitly assumed to be from one
of the registered users. Face verification on the other hand, is defined as a 1:1 matching
problem and requires the confirmation of the identity claimed by a user. The watch list
approach as proposed in Face Recognition Vendor Test (FRVT 2002) [8], assumes an Open
Universe model where a probe face image may or may not correspond to the registered
users. Similarity scores are computed against each subject in the gallery and an alarm is
raised if the score exceeds a given threshold [3].
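The watch-list decision rule described above can be sketched as follows (an illustrative fragment only; the gallery scores, subject identifiers and threshold are hypothetical, not from the thesis experiments):

```python
def watch_list_check(scores, threshold):
    """Open-universe watch-list: compare gallery similarity scores to a
    threshold and raise an alarm only if the best score exceeds it."""
    best_id = max(scores, key=scores.get)
    if scores[best_id] >= threshold:
        return best_id          # alarm: probe matches a registered subject
    return None                 # probe is treated as an unknown individual

# Hypothetical similarity scores of one probe against a three-subject gallery
scores = {"subject_01": 0.42, "subject_02": 0.91, "subject_03": 0.15}
```

With a threshold of 0.8 the probe raises an alarm for the second subject; raising the threshold above the best score suppresses the alarm, which is the open-universe behaviour that distinguishes the watch list from closed-universe identification.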
The ear is also an important biometric, gaining popularity primarily due to its immunity
to the aging factor [9]. With increasing age, faces sag, speech becomes heavier,
fingerprints wear out and the style of walking deteriorates [9]. Ears, however, tend to maintain
their shape and appear to be the most time-invariant biometric. Furthermore, they are also
pose invariant, as they do not change shape with changes in gesture. These qualities
have urged researchers to consider ears for the purpose of person identification. Historically
speaking, Iannarelli [10] first provided enough experimental evidence in 1989 to draw the
attention of researchers to the problem of ear recognition. However, it was not until
the last decade that the computer vision community started evaluating ear recognition
systems with some appreciable results [11]. Various studies in the area include neural
network approaches [12] and Principal Component Analysis (PCA) variants [13, 14].
The PCA approaches, however, rely heavily on normalization processes for any reliable
results [15].
Appearance-based face recognition systems, either employing the whole face or local
landmark features such as eyes, lips and ears, critically depend on manifold learning methods.
A gray-scale face image of order a × b can be represented as an ab-dimensional vector
in the original image space. However, any attempt at recognition in such a high-dimensional
space is vulnerable to a variety of issues, often referred to as the curse of dimensionality.
Typically in pattern recognition problems it is believed that high-dimensional data vectors
are redundant measurements of an underlying source. The objective of manifold learning
is therefore to uncover this “underlying source” by a suitable transformation of high-
dimensional measurements to low-dimensional data vectors. View-based face recognition
methods are no exception to this rule. Therefore, at the feature extraction stage, images
are transformed to low-dimensional vectors in a face space. The main objective is to find
a basis for this transformation which can distinctively represent faces in the
face space. Linear transformation from the image space to the feature space is perhaps the
most traditional way of dimensionality-reduction, also called “Linear Subspace Analysis”.
A number of approaches have been reported in the literature, including Principal Component
Analysis (PCA) [16], [13], Linear Discriminant Analysis (LDA) [17] and Independent
Component Analysis (ICA) [18], [19]. These approaches have been classified into two
categories, namely reconstructive and discriminative methods. Reconstructive approaches
(such as PCA and ICA) are reported to be robust to problems related to contaminated
pixels, whereas discriminative approaches (such as LDA) are known to yield better results
in clean conditions [20]. Nevertheless, the choice of the manifold learning method for a
given problem of face recognition has been a hot topic of research in the face recognition
literature. These debates have recently been challenged by a new concept of “Sparse Rep-
resentation Classification (SRC)” [6]. It has been shown that unorthodox features such
as downsampled images and random projections can serve equally well. As a result the
choice of the feature space may no longer be so critical [6]. What really matters is the
dimensionality of the feature space and the design of the classifier. The key factor to the
success of sparse representation classification is the recent development of “Compressive
Sensing” theory [5].
Recent developments in the theory of compressive sensing [5] have found numerous applications
in various fields of signal processing. Recently, this concept of sparse representation
has been used for the problem of face recognition [6]. The reported results are encouraging
enough to extend the algorithm to other biometrics and to further evaluate its performance
on harder face recognition problems under the most challenging practical constraints, such
as video-based face recognition, occlusion and varying ambient illumination. The main
objective of this chapter is therefore twofold: (1) to extend the Sparse Representation
Classification (SRC) method to the emerging ear biometric; (2) to evaluate the SRC
algorithm on more challenging and realistic face recognition issues such as
gesture variations, illumination variations and video-based applications.
The rest of the chapter is organized as follows: Section 2.2 briefly covers the problem of
compressive sensing followed by description of Sparse Representation Classification (SRC)
in Section 2.3. Section 2.4 consists of extensive experiments on still face images followed
by evaluations on video-based face recognition problem in Section 2.5. Evaluations for ear
biometric recognition are presented in Section 2.6. The chapter is concluded in Section
2.7 followed by a list of publications arising from this research in Section 2.8.
2.2 Compressive Sensing
Most signals of practical interest are compressible in nature. For example, audio
signals are compressible in a localized Fourier domain, and digital images are compressible in
the Discrete Cosine Transform (DCT) and wavelet domains. This concept of compressibility
gives rise to the notion of transform coding, so that subsequent processing of the information is
computationally efficient. It simply means that a signal, when transformed into a specific
domain, becomes sparse in nature and can be approximated efficiently by, say, K
large coefficients, ignoring all the small values. However, the initial data acquisition is
typically performed in accordance with the Nyquist sampling theorem, which states that
a signal can only be safely recovered from its samples if they are
drawn at a sampling rate of at least twice the maximum frequency of the signal.
Consequently, the data acquisition stage can be an overhead, since a huge number of acquired
samples will have to be further compressed for any subsequent realistic processing.
As a result, a legitimate question is whether there is an efficient way of acquiring data
that removes the Nyquist overhead yet still allows safe recovery of the signal. The new area of
compressive sensing answers this question. Let us formulate the problem in the following
manner [21].
Let g be a signal vector of order N × 1. Any signal in R^N can be represented in terms
of an N × N orthonormal basis matrix Ψ and an N × 1 vector of weighting coefficients s such
that:
g = Ψs (2.1)
It has to be noted here that g and s are essentially two different representations of the
same signal. g is the signal expressed in the time domain while s is the signal represented
in Ψ domain. Note that the transformation of the signal g in Ψ basis makes it K-sparse.
This means that ideally s has only K non-zero entries.
Now the aim of compressive sensing is to measure a low dimensional vector y of order
M × 1 (M < N) such that the original information g can be safely retrieved from y. It
means that we are looking for a transformation Φ such that
y_(M×1) = Φ_(M×N) g_(N×1)    (2.2)

Substituting equation 2.1 into equation 2.2:

y_(M×1) = Φ_(M×N) Ψ_(N×N) s_(N×1) = Θ_(M×N) s_(N×1)    (2.3)
In equation 2.3 the main aim is to design a stable measurement matrix Φ which ensures
that there is no information loss in the compressible signal due to the dimensionality
reduction from R^N to R^M [21]. Leaving the issue of the measurement matrix aside for a moment, we
would like to emphasize that, given the measurement vector y in equation 2.3, the problem
is still ill-posed, as we are looking for N unknowns with a system of M equations. However,
the issue is easily resolved due to the K-sparse nature of s, which means that essentially
there will only be K non-zero entries in s, and hence we will be looking to find K unknowns
from a system of M equations, where K ≤ M.
It has been shown that for equation 2.3 to hold, Θ must satisfy the Restricted Isometry
Property (RIP) [22]. Alternatively, it has been discussed in [5, 22] that the stability of
the measurement matrix Φ can be ensured if it is incoherent with the sparsifying basis
Ψ. In the framework of compressive sensing this discussion boils down to the selection of
Φ as a random matrix. This means that if, for instance, we select Φ as a Gaussian random
matrix, such that the entries of Φ are independent and identically distributed
(iid), then Θ will satisfy the RIP with high probability [5, 22, 23].
Once the RIP property of Θ is satisfied in equation 2.3, the recovery of vector s is
merely a problem of using a suitable reconstruction algorithm. In the compressive sensing
literature [5, 22] it has been shown that s can be recovered with a high probability using
the l1 optimization given that
M ≥ c K log(N/K)    (2.4)

where c is a small constant.
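This recovery pipeline can be illustrated with a toy numerical sketch (not the thesis implementation): a K-sparse signal is measured with an iid Gaussian Φ and reconstructed greedily using Orthogonal Matching Pursuit, a standard alternative to l1 optimization; the dimensions N, M and K below are arbitrary, and the identity is used as the sparsifying basis Ψ for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 256, 80, 5                    # ambient dim, measurements, sparsity

# K-sparse signal s (sparsifying basis Psi taken as the identity)
s = np.zeros(N)
support = rng.choice(N, K, replace=False)
s[support] = rng.uniform(1.0, 2.0, K) * rng.choice([-1.0, 1.0], K)

# iid Gaussian measurement matrix Phi with unit-norm columns (RIP w.h.p.)
Phi = rng.normal(size=(M, N))
Phi /= np.linalg.norm(Phi, axis=0)

y = Phi @ s                             # M compressive measurements, M << N

# Orthogonal Matching Pursuit: greedily pick the column most correlated
# with the residual, then re-fit the coefficients on the selected support.
residual, sel = y.copy(), []
for _ in range(2 * K):                  # a few spare iterations for safety
    if np.linalg.norm(residual) < 1e-10:
        break
    sel.append(int(np.argmax(np.abs(Phi.T @ residual))))
    coef, *_ = np.linalg.lstsq(Phi[:, sel], y, rcond=None)
    residual = y - Phi[:, sel] @ coef

s_hat = np.zeros(N)
s_hat[sel] = coef
```

With these dimensions the greedy recovery locates the true support with overwhelming probability, illustrating that on the order of K log(N/K) random measurements suffice to reconstruct the signal exactly.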
2.3 Sparse Representation Classification
We now discuss the basic framework of the face recognition system in the context of sparse
representation [6]. Let us assume that we have k distinct classes and ni images available
for training from the ith class. Each training sample is a gray scale image of order a × b.
The image is downsampled to an order w × h and is converted into a 1-D vector vi,j by
concatenating the columns of the downsampled image such that vi,j ∈ R^m (m = wh).
Here i is the index of the class, i = 1, 2, . . . , k and j is the index of the training sample,
j = 1, 2, . . . , ni. All the training data from the ith class is placed in a matrix Ai such that
Ai = [vi,1, vi,2, . . . , vi,ni] ∈ R^(m×ni). As stated in [6], when the training samples from
the ith class are sufficient, the test sample y from the same class will approximately lie in
the linear span of the columns of Ai:
y = αi,1 vi,1 + αi,2 vi,2 + · · · + αi,ni vi,ni    (2.5)
where αi,j are real scalar quantities. Now we develop a dictionary matrix A for all k
classes by concatenating Ai, i = 1, 2, . . . , k as follows:
A = [A1, A2, . . . , Ak] ∈ R^(m×ni k)    (2.6)
Now a test pattern y can be represented as a linear combination of all n training
samples (n = ni × k):
y = Ax (2.7)
where x is an unknown vector of coefficients. From equation 2.7 it is relatively
straightforward to note that only those entries of x that are non-zero correspond to the
class of y [6]. This means that if we can solve equation 2.7 for x, we can determine
the class of the test pattern y. Recent research in compressive sensing and sparse
representation [22, 5, 24, 25, 26] has shown that the sparsity of the solution of
equation 2.7 enables us to solve the problem using l1-norm minimization:
(l1):  x1 = arg min ‖x‖1  subject to  Ax = y    (2.8)
Once we have estimated x1, ideally it should have nonzero entries corresponding to
the class of y, and deciding the class of y is then a simple matter of locating the indices of
the non-zero entries in x1. However, due to noise and modeling limitations, x1 is commonly
corrupted by some small nonzero entries belonging to different classes. To resolve this
problem we define an operator δi for each class i, such that δi(x1) gives us a vector in R^n
whose only nonzero entries are from the ith class. This process is repeated k times, once for
each class. Now for a given class i we can approximate yi = A δi(x1) and assign the test
pattern to the class with the minimum residual between y and yi:
min_i  ri(y) = ‖y − A δi(x1)‖2    (2.9)
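A minimal sketch of the SRC decision rule of equations 2.8 and 2.9 (illustrative only: the gallery is synthetic, and iterative soft-thresholding (ISTA) is used as a simple self-contained l1-regularized stand-in for the solvers employed in [6]):

```python
import numpy as np

def ista(A, y, lam=0.01, n_iter=500):
    """Iterative soft-thresholding: approximately minimize
    0.5*||Ax - y||^2 + lam*||x||_1, a relaxation of equation 2.8."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2          # 1 / Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = x - step * A.T @ (A @ x - y)            # gradient step
        x = np.sign(g) * np.maximum(np.abs(g) - lam * step, 0.0)
    return x

def src_classify(A, labels, y, lam=0.01):
    """Assign y to the class whose coefficients give the smallest residual."""
    x1 = ista(A, y, lam)
    classes = np.unique(labels)
    # delta_i of equation 2.9: keep only the coefficients of class i
    residuals = [np.linalg.norm(y - A @ np.where(labels == c, x1, 0.0))
                 for c in classes]
    return classes[int(np.argmin(residuals))]

# Toy gallery: 3 classes, 5 noisy training vectors each, unit-norm columns
rng = np.random.default_rng(1)
protos = rng.normal(size=(3, 30))
cols, labels = [], []
for c in range(3):
    for _ in range(5):
        v = protos[c] + 0.05 * rng.normal(size=30)
        cols.append(v / np.linalg.norm(v))
        labels.append(c)
A, labels = np.column_stack(cols), np.array(labels)

probe = protos[1] + 0.05 * rng.normal(size=30)
probe /= np.linalg.norm(probe)
```

For this probe, drawn from the second class, the class-1 residual is far smaller than the others, so the minimum-residual rule recovers the correct identity.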
2.4 Sparse Representation Classification for Recognition from
Still Face Images
2.4.1 Yale Database
The Yale database, maintained at Yale University, consists of 165 grayscale images of
15 individuals [27]. Images of each subject reflect gesture variations, incorporating
normal, happy, sad, sleepy, surprised and wink expressions. Luminance variation is also
addressed by including images with the lighting source in the central, right and left directions.
A couple of images with and without spectacles are also included. Figure 2.1 represents 11
different images from a single subject. Experiments are conducted on the original database
without any preprocessing stages of face cropping and/or normalization. Each 320 × 243
grayscale image is downsampled to an order of 25× 25 to get a 625-d feature vector. The
experiments are conducted using the leave-one-out approach as reported quite regularly
in the literature [28], [29], [30]. A comprehensive comparison of various approaches is
provided in Table 2.1. Note that the error rates reported in [30] have been converted to
recognition rates. The SRC approach substantially outperformed all reported techniques,
showing an improvement of 5.48% over the best contestant, i.e. the Fisherfaces approach.
Note that the SRC approach leads the traditional PCA and ICA approaches by a margin
of 22.58% and 26.66% respectively. The choice of feature-space dimension is elaborated
in Figure 2.2(a), where the dimensionality curve is shown for a randomly selected leave-one-out
experiment. Classification accuracy becomes fairly constant in an approximately 600-D
feature space.
Figure 2.1: A typical subject from the Yale database with various poses and variations.
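The leave-one-out protocol used in these experiments can be sketched generically as follows (an illustrative fragment; `classify` stands for any classifier, such as SRC, and the 1-nearest-neighbour toy below is only for demonstration):

```python
def leave_one_out_accuracy(samples, labels, classify):
    """Hold out each sample once, train on the rest, report mean accuracy."""
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        correct += int(classify(train, train_labels, samples[i]) == labels[i])
    return correct / len(samples)

def nn_classify(train, train_labels, probe):
    """Hypothetical 1-nearest-neighbour classifier on scalar features."""
    d = [abs(t - probe) for t in train]
    return train_labels[d.index(min(d))]

acc = leave_one_out_accuracy([0.0, 0.1, 1.0, 1.1], [0, 0, 1, 1], nn_classify)
```

Every sample serves exactly once as the probe, so the reported accuracy uses all 165 Yale images for testing while keeping training and test sets disjoint in each round.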
Figure 2.2: Recognition accuracy for the (a) Yale and (b) AT&T databases with respect to feature dimension.
Table 2.1: Results for the Yale database using the leave-one-out method.

  Approach                      Recognition Rate
  ICA [28]                      71.52%
  Kernel Eigenfaces [28]        72.73%
  Edge map [30]                 73.94%
  Eigenfaces [30]               75.60%
  Correlation [30]              76.10%
  Linear subspace [30]          78.40%
  2DPCA [28]                    84.24%
  Eigenface w/o 1st 3 [30]      84.70%
  LEM [30]                      85.45%
  Fisherfaces [30]              92.70%
  SRC                           98.18%
2.4.2 AT&T Database
The AT&T database is maintained at AT&T Laboratories Cambridge. Ten
different images of one of the 40 subjects in the database are shown in Figure 2.3.
The database incorporates facial gestures such as smiling or non-smiling, open or closed
eyes, and alterations such as with or without glasses. It also characterizes a maximum of
20◦ rotation of the face, with scale variations of about 10%.
The choice of dimensionality for the AT&T database is elaborated in Figure 2.2(b), which
reflects that the recognition rate becomes fairly constant above a 40-dimensional feature
space. Therefore each 112 × 92 grayscale image is downsampled to an order of 7 × 6 and is
transformed to a 42-dimensional feature vector by column concatenation.
Figure 2.3: A typical subject from the AT&T database
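The downsample-and-stack feature extraction used throughout (e.g. a 112 × 92 AT&T image reduced to 7 × 6, giving a 42-D vector) can be sketched as follows; the nearest-neighbour sampling grid is an assumption, since the thesis does not specify the interpolation method:

```python
import numpy as np

def downsample_feature(img, h=7, w=6):
    """Downsample a grayscale image to h x w on a nearest-neighbour grid,
    then concatenate its columns into a wh-dimensional feature vector."""
    rows = np.linspace(0, img.shape[0] - 1, h).astype(int)
    cols = np.linspace(0, img.shape[1] - 1, w).astype(int)
    small = img[np.ix_(rows, cols)]        # h x w downsampled image
    return small.flatten(order="F")        # column concatenation -> (w*h,)

img = np.arange(112 * 92, dtype=float).reshape(112, 92)   # stand-in image
feature = downsample_feature(img)
```

The same routine, with different target sizes, yields the 625-D Yale, 80-D AR and 100-D VidTIMIT feature vectors used in the other experiments of this chapter.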
To provide comparative values for the SRC approach we follow two evaluation strategies
as proposed in the literature [28], [29], [31]. The first evaluation strategy takes the first
five images of each individual as the training set, while the last five are designated as probes.
A second set of experiments was conducted using the “leave-one-out” approach. A detailed
comparison of the results for the two experimental setups is summarized in Table 2.2; all
results are as reported in [28]. For the first set of experiments the SRC algorithm achieves
a comparable recognition accuracy of 93% in a 42-D feature space; the best results are
reported for the 2DPCA approach, which are 3% better than the SRC method. For the
second set of experiments the SRC approach attains a high recognition success rate of 97.5%
in a 42-D feature space; it outperforms the ICA approach by approximately 3.7% and is
fairly comparable to the Fisherfaces, Eigenfaces, Kernel Eigenfaces and 2DPCA approaches.
Table 2.2: Results for the two experiment sets using the AT&T database.

  Experiment Set 1          Recognition Rate
  Fisherfaces               94.50%
  ICA                       85.00%
  Kernel Eigenfaces         94.00%
  2DPCA                     96.00%
  SRC                       93.00%

  Experiment Set 2          Recognition Rate
  Fisherfaces               98.50%
  ICA                       93.80%
  Eigenfaces                97.50%
  Kernel Eigenfaces         98.00%
  2DPCA                     98.30%
  SRC                       97.50%
2.4.3 AR Database
The AR database consists of more than 4,000 color images of 126 subjects (70 men and 56
women) [32]. The database characterizes divergence from ideal conditions by incorporating
various facial expressions (neutral, smile, anger and scream), luminance alterations (left
light on, right light on and all side lights on) and occlusion modes (sunglass and scarf). Due
to the large number of subjects and the substantial amount of variations, the AR database
is much more challenging compared to the AT&T and Yale databases. It has been used
by researchers as a test-bed to evaluate and benchmark face recognition algorithms. In
this research we address the problem of varying facial expressions, see Figure 2.4. We
evaluate the AR database under two experimental setups as proposed in the literature; for all
experiments the 576 × 768 image frames are downsampled to an order of 8 × 10, constituting
an 80-D feature space.
Figure 2.4: Gesture variations in the AR database; note the changing position of the head with different poses.
For the first set of experiments we follow the setup as designed in [30]. A subset of AR
database consisting of 112 individuals is randomly selected. The system is trained using
only one image per subject which characterizes neutral expression (Figure 2.4(a)), therefore
we have 112 gallery images. The system is tested on the remaining three expressions
shown in Figure 2.4 (b), (c) and (d) altogether making 336 probe images. Table 2.3 shows
a thorough comparison of the SRC approach and the results reported in [30]. EM and
LEM stand for Edge Map and Line Edge Map respectively, while all the other approaches
are variants of Principal Component Analysis (PCA) [30].
The SRC approach achieves a good overall recognition accuracy of 89.58%, which
outperforms the best reported result of 75.67% (the 112-eigenvectors approach) by a
margin of 13.91%. For the cases of the smile and anger expressions we obtained 93.75%
Table 2.3: Recognition Results for Gesture Variations under Experiment Set 1

  Approach                      Smile     Anger     Scream    Overall
  20-eigenvectors               87.85%    78.57%    34.82%    67.08%
  60-eigenvectors               94.64%    84.82%    41.96%    73.80%
  112-eigenvectors              93.97%    87.50%    45.54%    75.67%
  112-eigenvectors w/o 1st 3    82.04%    73.21%    32.14%    62.46%
  EM                            52.68%    81.25%    20.54%    51.49%
  LEM                           78.57%    92.86%    31.25%    67.56%
  SRC                           93.75%    91.07%    83.93%    89.58%
and 91.07% respectively, which are quite comparable to the best contestants, i.e. 94.64%
(60-eigenvectors) and 92.86% (LEM). For the screaming expression, the SRC approach
clearly beats all the reported approaches, attaining a recognition accuracy
of 83.93%.
In the second set of experiments we compare the proposed approach with two state-of-the-art
algorithms: Bayesian Eigenfaces (MIT) [33] and FaceIt (Visionics). The Bayesian
Eigenfaces approach was reported to be one of the best in the 1996 FERET test [34],
whereas the FaceIt algorithm (based on Local Feature Analysis [35]) is claimed to be one
of the most successful commercial face recognition systems [36]. A new subset of the AR
database is generated by randomly selecting 116 individuals. The system is trained using
the neutral expression of the first session (Figure 2.4 (a)) and therefore we have 116 gallery
images. The system is validated for all other expressions of the same session (Figures 2.4
(b), (c) and (d)), making altogether 348 probe images. A comprehensive comparison of
the SRC approach with these two state-of-the-art algorithms is presented in Table 2.4; all
results are as reported in [36]. For mild variations due to the smile and anger expressions
the SRC approach yields quite competitive recognition accuracies of 92.24% and 91.38% in
comparison to the FaceIt and MIT approaches. For the severe case of the screaming expression
the SRC leads the FaceIt and MIT approaches by margins of 5.62% and 42.62% respectively.
Table 2.4: Recognition Results for Gesture Variations under Experiment Set 2

  Approach    Smile     Anger     Scream    Overall
  FaceIt      96.00%    93.00%    78.00%    89.00%
  MIT         94.00%    72.00%    41.00%    60.00%
  SRC         92.24%    91.38%    83.62%    89.08%
2.5 Sparse Representation Classification for Video-based Face
Recognition
In this section we evaluate the SRC algorithm for the problem of video-based face recognition.
For the purpose of comparative analysis, experiments were also conducted using
the Scale Invariant Feature Transform (SIFT) [37]. We now describe the basic architecture
of SIFT-based classification, followed by extensive experiments on the VidTIMIT
database [38], [39].
2.5.1 Scale Invariant Feature Transform (SIFT) for Face Recognition
The Scale Invariant Feature Transform (SIFT) was proposed in 1999 for the extraction of
distinctive features from images [40]. The idea, initially proposed for the more generic object
recognition task, was later successfully applied to the problem of face recognition [37].
The interesting characteristics of scale/rotation invariance and locality in both the spatial and
frequency domains have made the SIFT-based approach a standard technique in
the paradigm of view-based face recognition. The first step in the derivation of the SIFT
features is the identification of potential pixels of interest, called “keypoints”, in the face
image. An efficient way of achieving this is to make use of the scale-space extrema of the
Difference-of-Gaussian (DoG) function convolved with the face image [40]. These potential
keypoints are further refined based on contrast, localization along edges
and the ratio of principal curvatures. Orientation(s) are then assigned to each
keypoint based on local image gradient direction(s). A gradient orientation histogram is
formed using the neighboring pixels of each keypoint. Contributions from neighbors are
weighted by their magnitudes and by a circular Gaussian window. Peaks in the histogram
represent the dominant directions and are used to align the histogram for rotation invariance.
4 × 4 pixel neighborhoods are used to extract eight-bin histograms, resulting in
128-dimensional SIFT features. For illumination robustness, the vectors are normalized
to unit length, thresholded to a ceiling of 0.2 and finally renormalized to unit length. Figure 2.5
shows a typical face from the VidTIMIT database [38, 39] with extracted SIFT features.
Figure 2.5: A typical localized face from the VidTIMIT database with extracted SIFTs.
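The illumination normalization of the descriptors described above (normalize, clip at 0.2, renormalize) can be sketched as:

```python
import numpy as np

def normalize_descriptor(d, cap=0.2):
    """Unit-normalize a SIFT descriptor, clip large bins to `cap` to suppress
    the influence of strong gradient magnitudes, then renormalize."""
    d = np.asarray(d, dtype=float)
    d = d / np.linalg.norm(d)
    d = np.minimum(d, cap)
    return d / np.linalg.norm(d)

# A stand-in 128-bin descriptor; real descriptors come from the SIFT pipeline
desc = normalize_descriptor(np.random.default_rng(0).uniform(0, 1, 128))
```

The clipping step is what gives the descriptor its robustness to non-linear illumination changes, which mostly affect gradient magnitudes rather than orientations.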
During validation, a SIFT feature vector fq from the query video is matched with the
feature vectors from the gallery:

e = arccos(fq fg^T)    (2.10)

where fg corresponds to a SIFT vector from a training video sequence. All SIFT
vectors from the query frame are matched with all SIFT features from a training frame
using equation 2.10. Pairs of features with the minimum error e are considered matches.
Note that if more than one SIFT vector from a given query frame happens to be the best
match to the same SIFT vector from the gallery (i.e. a many-to-one match scenario), the one
with the minimum error e is chosen. Further false matches were reduced by matching the
SIFT vectors from only nearby regions of the two images.
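The matching step of equation 2.10, including the many-to-one resolution, can be sketched as follows (a simplified illustration on toy unit vectors; a real system would also apply the spatial-proximity constraint mentioned above):

```python
import numpy as np

def match_sift(query, gallery):
    """Angular matching of two sets of unit-norm SIFT vectors.  Each query
    vector is paired with its minimum-error gallery vector; many-to-one
    collisions are resolved by keeping the pair with the smallest error."""
    best = {}                               # gallery index -> (error, query index)
    for qi, fq in enumerate(query):
        errs = np.arccos(np.clip(gallery @ fq, -1.0, 1.0))
        gi = int(np.argmin(errs))
        e = float(errs[gi])
        if gi not in best or e < best[gi][0]:
            best[gi] = (e, qi)
    return [(qi, gi, e) for gi, (e, qi) in best.items()]

# Toy data: both query vectors are closest to gallery[0] (a collision),
# so only the lower-error pair survives
gallery = np.eye(3)
query = np.array([[0.98, 0.199, 0.0], [0.9, 0.436, 0.0]])
query /= np.linalg.norm(query, axis=1, keepdims=True)
matches = match_sift(query, gallery)
```

In the toy example above the second query vector is discarded, exactly the many-to-one behaviour described in the text.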
In principle, different image pairs yield different numbers of matches. This
information is harnessed as an additional similarity measure between
the two faces. The final similarity score between two frames is computed by normalizing
the average error e between their matching pairs of SIFT features and the total number
of matches z onto the scale [0, 1], and then using a weighted sum rule.
e′ = (e − min(e)) / max(e − min(e))    (2.11)

z′ = (z − min(z)) / max(z − min(z))    (2.12)

s = (1/2) (βe e′ + βz (1 − z′))    (2.13)
where βe and βz are the weights of the normalized average error e′ and the normalized
number of matches z′ respectively. Note that e′ is a distance (dissimilarity) measure
while z′ is a similarity score; therefore, in equation 2.13, z′ is subtracted from 1 for a
homogeneous fusion. Consequently, s is a distance measure.
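Equations 2.11-2.13 amount to min-max normalization followed by a weighted sum; a minimal sketch (the raw errors, match counts and weights below are hypothetical):

```python
import numpy as np

def minmax(v):
    """Scale a vector of raw scores onto [0, 1] (equations 2.11 and 2.12)."""
    v = np.asarray(v, dtype=float)
    shifted = v - v.min()
    return shifted / shifted.max()

def fused_distance(e, z, beta_e=1.0, beta_z=1.0):
    """Weighted fusion of normalized error e' and match count z' (eq. 2.13).
    z' is a similarity, so it is flipped to 1 - z' before summing."""
    e_n, z_n = minmax(e), minmax(z)
    return 0.5 * (beta_e * e_n + beta_z * (1.0 - z_n))

# Hypothetical per-subject average errors and match counts for one query frame
e = [0.30, 0.10, 0.50]          # lower error  -> more similar
z = [12, 40, 5]                 # more matches -> more similar
s = fused_distance(e, z)
```

The second gallery subject has both the lowest error and the most matches, so it receives the smallest fused distance s and would be declared the identity of the query frame.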
2.5.2 Experimental Results and Discussion
The problem of temporal face recognition using the SRC and SIFT feature based face
recognition algorithms was evaluated on the VidTIMIT database [38], [39]. VidTIMIT is
a multimodal database consisting of video sequences and corresponding audio files from
43 distinct subjects. The video section of the database characterizes 10 different video
files from each subject. Each video file is a sequence of 512 × 384 JPEG images. Two
video sequences were used for training while the remaining eight were used for validation.
Due to the high correlation between consecutive frames, training and testing were carried
out on alternate frames. Off-line batch learning mode [41] was used for these experiments
and therefore probe frames did not add any information to the system.
Face localization is the first step in any face recognition system. Fully automatic face
localization was carried out using a Haar-like feature based face detection algorithm [42]
during off-line training and on-line recognition sessions. For the SIFT based face recogni-
tion, each detected face in a video frame was scale-normalized to 150× 150 and histogram
Figure 2.6: A sample video sequence from the VidTIMIT database.
equalized before the extraction of the SIFT features. We achieved an identification rate of
93.83%. Verification experiments were also conducted for a more comprehensive compar-
ison between the two approaches. An Equal Error Rate (EER) of 1.8% was achieved for
the SIFT based verification. Verification rate at 0.01 False Accept Rate (FAR) was found
to be 97.32%.
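The preprocessing chain described above (scale normalization to 150 × 150 followed by histogram equalization) can be sketched as below; the strided resize and the 256-bin equalization are simplifying assumptions, not the exact implementation used.

```python
import numpy as np

def preprocess_face(face, size=(150, 150)):
    """Scale-normalize a detected grayscale face (uint8) and histogram
    equalize it before SIFT extraction. A plain strided subsampling stands
    in for a proper image resize here."""
    rows = np.linspace(0, face.shape[0] - 1, size[0]).astype(int)
    cols = np.linspace(0, face.shape[1] - 1, size[1]).astype(int)
    img = face[np.ix_(rows, cols)]

    # Histogram equalization: remap each gray level through the normalized CDF.
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum() / img.size
    return (cdf[img] * 255).astype(np.uint8)
```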
For the SRC classifier, each detected face in a frame is downsampled to order 10 ×
10. Column concatenation is carried out to generate a 100-dimensional feature vector as
discussed in Section 2.3. Off-line batch learning is carried out on alternate frames using
two video sequences as discussed above. Unorthodox downsampled images in combination
with the SRC classifier yielded a quite comparable recognition accuracy of 94.45%. The EER
dropped to 1.3% with a verification accuracy of 98.23% at 0.01 FAR. The rank profile and
ROC (Receiver Operating Characteristics) curves are shown in Figure 2.7 (a) and 2.7 (b)
respectively.
We further investigated the complementary nature of the two classifiers by fusing them
at the score level. The weighted sum rule is used, which is perhaps the major work-horse
in the field of combining classifiers [43]. Both classifiers were equally weighted and a high
recognition accuracy of 97.73% was achieved which outperforms the SIFT based classi-
fier and the SRC classifier by a margin of 3.90% and 3.28% respectively. Verification
experiments also produced superior results with an EER of 0.3% which is better than
the SIFT and the SRC based classification by 1.5% and 1.0% respectively. An excellent
Figure 2.7: (a) Rank profiles and (b) ROC curves for the SIFT, SRC and the combination of the two classifiers.
Table 2.5: Summary of results
Evaluation Attributes SIFT SRC Fusion
Recognition Accuracy 93.83% 94.45% 97.73%
Equal Error Rate 1.80% 1.30% 0.30%
Verification rate at 0.01 FAR 97.32% 98.23% 99.90%
verification rate of 99.90% at an FAR of 0.01 is reported. Fusion of the two classifiers
substantially improved the rank profile as well, achieving 100% results at rank-5. A detailed
comparison of the results is provided in Table 2.5.
The presented results certainly reflect a comparable performance index for the SRC
classifier relative to state-of-the-art SIFT based recognition. Extensive experiments based
on identification, verification and rank-recognition evaluations consistently reflect better
results for the SRC approach. Moreover, the complementary information exhibited by the
SRC method increased the verification success of the combined system to 99.9% for the
standard 0.01 FAR criterion. Figure 2.8 shows variation in the recognition accuracy with
the change in the normalized weight of the SRC classifier at the fusion stage. The highest
recognition is achieved approximately when both classifiers are equally weighted, i.e. when
no prior information about the participating experts is incorporated in the fusion.
Apart from these appreciable results it was found that the l1-norm minimization using
a large dictionary matrix made the iterative convergence lengthy and slow. To provide a
comparative value we performed computational analysis for a randomly selected identifi-
cation trial. The time required by the SRC algorithm for classifying a single frame on a
typical 2.66 GHz machine with 2 GB memory was found to be 297.46 seconds (approx-
imately 5 minutes). This duration is approximately 5 times greater than the processing
time of the SIFT algorithm for the same frame which was found to be 58.18 seconds (ap-
proximately 1 minute). Typically a video sequence consists of hundreds of frames which
would suggest a rather prolonged span for the evaluation of the whole video sequence.
Noteworthy is the fact that experiments were conducted using an offline learning mode
[41]. The probe frames did not contribute to the dictionary information. Critically speak-
ing, the spatiotemporal information in video sequences is best harnessed using smart online
Figure 2.8: Variation in performance with respect to bias in fusion.
[44] and hybrid [45] learning modes. These interactive learning algorithms add useful in-
formation along the temporal dimension and therefore enhance the overall performance.
However, in the context of SRC classification, this would suggest an even larger dictionary
matrix and consequently a lengthier evaluation.
2.6 Sparse Representation Classification for Ear Biometric
2.6.1 Experiments and Discussion
We did extensive experiments to validate the SRC algorithm using subsets of the UND
database [46, 13] and the FEUD [47] database. The subset of the UND database consists
of 32 subjects with six profile images each. Other subjects of the UND database have
fewer images and are therefore inadequate for our ear recognition system. Six different
images of a typical subject from the UND database are shown in Figure 2.9. The subjects
were photographed under varying lighting conditions and with the head rotations of -90
and -75 degrees, observed from the top in a clockwise direction. From each image the ear
portion is manually cropped and is shown in Figure 2.10. At the feature extraction stage,
the ear intensity image is downsampled to an order of 30 × 30 as shown in Figure 2.10.
Figure 2.9: A typical subject from the UND database illustrating different pose and illumination variations.
The columns of the downsampled ear image are concatenated to form a 900-D feature
vector. The features extracted from all the training images are used to develop the gallery,
i.e. the dictionary matrix A.
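The gallery construction just described can be sketched as follows; a strided subsampling stands in for the unspecified downsampling method, and the function name is ours:

```python
import numpy as np

def build_dictionary(images, size=(30, 30)):
    """Build the dictionary matrix A: downsample each grayscale ear image to
    `size` and stack the column-concatenated results as the columns of A."""
    cols = []
    for img in images:
        r = np.linspace(0, img.shape[0] - 1, size[0]).astype(int)
        c = np.linspace(0, img.shape[1] - 1, size[1]).astype(int)
        small = img[np.ix_(r, c)].astype(float)
        cols.append(small.flatten(order="F"))  # column concatenation -> 900-D
    return np.column_stack(cols)               # A is q x (total training images)
```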
Figure 2.10: A typical cropped ear (a) and its compressed form in the feature space (b).
We evaluated the proposed system under two evaluation protocols, Tr3V3 and Tr4V2.
Tr3V3 corresponds to three training and three testing images per subject while Tr4V2
corresponds to four training and two testing images per person. The proposed algorithm
gave a recognition rate of 91.67% and 96.88% for Tr3V3 and Tr4V2 respectively. The rank
profile of the system is shown in Figure 2.11(a). The ROC curves are shown in Figure
2.11(b) with an Equal Error Rate (EER) of approximately 0.05 and 0.03 for Tr3V3 and
Tr4V2 respectively.
The FEUD ear database consists of 56 subjects with five images each, Figure 2.13
depicts a typical subject of the FEUD database. The ear portions are manually cropped
and a 625-D feature space is formed by downsampling the images to an order 25× 25. We
define two evaluation protocols as Tr3V2 and Tr4V1. Tr3V2 corresponds to three training
and two testing images while Tr4V1 corresponds to four training and one testing image
Figure 2.11: (a) Rank profile for the UND database. (b) ROC curves for the UND database.
Figure 2.12: (a) Rank profile for the FEUD database. (b) ROC curves for the FEUD database.
per person. We obtained high recognition rates of 95.54% and 98.21% for Tr3V2 and
Tr4V1 respectively with the rank profile of the validation shown in Figure 2.12(a). The
ROC evaluation of the system gives an EER of 0.02 and 0.01 (approximately) for Tr3V2
and Tr4V1 respectively as shown in Figure 2.12(b).
Figure 2.13: A typical subject from the FEUD database
2.7 Conclusion
Sparse representation classification has recently emerged as the latest paradigm in the
research of appearance-based face recognition. A comprehensive evaluation of the SRC
algorithm provides a comparable index with traditional, state of the art approaches. It has
also been found robust for the problem of varying facial expressions. For the video-based
face recognition, an identification rate of 94.45% is achieved on the VidTIMIT database
which is quite comparable to 93.83% accuracy using state-of-the-art SIFT features based
algorithm. Verification experiments were also conducted and the SRC approach exhibited
an EER of 1.30% which is 0.5% better than the SIFT method. The SRC classifier was
found to nicely complement the SIFT based method, the fusion of the two methods using
the weighted sum rule consistently produced superior results for identification, verification
and rank-recognition experiments. However, since SRC requires an iterative convergence
using an l1-norm minimization, the approach was found computationally expensive com-
pared to the SIFT based recognition. Typically SRC required approximately 5 minutes
for processing a single recognition trial, which is 5 times greater than the time required
by the SIFT based approach. To the best of our knowledge, this is the first evaluation of
the SRC algorithm on a video database. From the experiments presented in the chapter,
it is quite safe to maintain that additional work is required before the SRC approach is
declared as a standard approach for video-based applications. Computational expense
is arguably an inherent issue with video processing giving rise to the emerging area of
“Video Abstraction”. Efficient algorithms have been proposed to cluster video sequences
along the temporal dimension (for example [48], among others). These clusters are then
portrayed by cluster-representative frame(s)/features resulting in a substantial decrease of
complexity. Given the good performance of the SRC algorithm presented in this research,
the evaluation of the method using state-of-the-art video abstraction methods will be the
subject of our future research. For the problem of user authentication using ear biometric,
the proposed system is evaluated using standard ear databases yielding high recognition
accuracies for various evaluation protocols. In particular, experiments were conducted on
the UND [46, 13] and the FEUD [47] databases, which incorporate session variability
along with different head rotations and lighting conditions. The proposed system does
not assume any prior normalization of the ear region and is found to be robust under
varying lighting and head rotations, yielding a high recognition rate of the order of 98%. The
interesting outcomes of the research such as a high recognition accuracy with a low train-
ing overhead, robustness to practical constraints and independence from normalization
overheads are certainly encouraging enough to extend the compressive sensing approach
to other biometrics.
2.8 Publications
1. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, Sparse Representa-
tion for View-Based Face Recognition, Accepted as a book chapter in book Advances
in Face Image Analysis: Techniques and Technologies ED. Y.-J. Zhang, IGI Global
Publishing.
2. Imran Naseem, Roberto Togneri and Mohammed Bennamoun,“Sparse Represen-
tation for Video-Based Face Recognition”, book chapter in Advances in Biometrics
(Lecture Notes in Computer Science, LNCS series), Springer Berlin / Heidelberg.
Volume 5558/2009, pages 219-228. ISSN 0302-9743 (Print) 1611-3349 (Online), ISBN
978-3-642-01792-6.
3. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Sparse Represen-
tation for Ear Biometrics”, book chapter in Advances in Visual Computing (Lecture
Notes in Computer Science, LNCS series), Springer Berlin / Heidelberg. Volume
5359/2008, pages 336-345. ISSN 0302-9743 (Print) 1611-3349 (Online), ISBN 978-
3-540-89645-6.
Chapter 3
Linear Regression for Face
Identification 1
3.1 Introduction
Face recognition systems are known to be critically dependent on manifold learning meth-
ods. A gray-scale face image of an order a × b can be represented as an ab-dimensional
vector in the original image space. However, any attempt at recognition in such a high-
dimensional space is vulnerable to a variety of issues often referred to as the curse of
dimensionality. Therefore, at the feature extraction stage, images are transformed to low-
dimensional vectors in the face space. The main objective is to find such a basis func-
tion for this transformation, which could distinguishably represent faces in the face space.
A number of approaches have been reported in the literature such as Principal Compo-
nent Analysis (PCA) [16], [13], Linear Discriminant Analysis (LDA) [17] and Independent
Component Analysis (ICA) [18], [19]. Primarily these approaches are classified into two
categories, i.e. reconstructive and discriminative methods. Reconstructive approaches (such
as PCA and ICA) are reported to be robust for the problem of contaminated pixels [49],
whereas discriminative approaches (such as LDA) are known to yield better results in
1Parts from the chapter have been accepted/published in IEEE Transactions on Pattern Analysis andMachine Intelligence (TPAMI) and IEEE International Conference in Image Processing (ICIP’09). Generalproblem statements and literature review in the Introduction section of the chapter are included for thesake of completeness and to make the chapter self-contained. Since the thesis is presented as a compilationof independent publications, the repetition of the general statements between the chapters is thereforeinevitable.
clean conditions [20]. Apart from these traditional approaches, it has been shown recently
that unorthodox features such as downsampled images and random projections can serve
equally well. In fact the choice of the feature space may no longer be so critical [6]. What
really matters is the dimensionality of feature space and the design of the classifier.
In this chapter we propose a fairly simple but efficient linear regression based classifi-
cation (LRC) for the problem of face identification. Samples from a specific object class
are known to lie on a linear subspace [17], [50]. We use this concept to develop class spe-
cific models of the registered users simply using the downsampled gallery images, thereby
defining the task of face recognition as a problem of linear regression. Least squares es-
timation is used to estimate the vectors of parameters for a given probe against all class
models. Finally the decision is ruled in favor of the class with the most precise estimation.
The proposed classifier can be categorized as a Nearest Subspace (NS) approach.
An important relevant work is presented in [6] where downsampled images from all
classes are used to develop a dictionary matrix during the training session. Each probe
image is represented as a linear combination of all gallery images thereby resulting in an
ill-conditioned inverse problem. With the latest research in compressive sensing and sparse
representation, sparsity of the vector of coefficients is harnessed to solve the ill-conditioned
problem using the l1-norm minimization. In [51] the concept of Locally Linear Regression
(LLR) is introduced specifically to tackle the problem of pose. The main thrust of the
research is to indicate an approximate linear mapping between a nonfrontal face image and
its frontal counterpart; the estimation of the linear mapping is further formulated as a
prediction problem with a regression-based solution. For the case of severe pose variations, the
nonfrontal image is sampled to obtain many overlapped local segments. Linear regression
is applied to each small patch to predict the corresponding virtual frontal patch; the LLR
approach has shown good results in the presence of coarse alignment. In [52] a two-step
approach has been adopted fusing the concept of wavelet decomposition and discriminant
analysis to design a sophisticated feature extraction stage. These discriminant features
are used to develop feature planes (for Nearest Feature Plane - NFP classifier) and feature
spaces (for Nearest Feature Space - NFS classifier). The query image is projected onto
the subspaces and the decision is ruled in favor of the subspace with the minimum distance.
However, the proposed LRC approach, for the first time, uses simply the downsampled
images in combination with the linear regression classification to achieve superior results
compared to the benchmark techniques.
Further for the problem of severe contiguous occlusion, a modular representation of
images is expected to solve the problem [53]. Based on this concept we propose an efficient
Modular LRC Approach. The proposed approach segments a given occluded image and
reaches individual decisions for each block. These intermediate decisions are combined
using a novel Distance based Evidence Fusion (DEF) algorithm to reach the final decision.
The proposed DEF algorithm uses the distance metrics of the intermediate decisions to
decide about the “goodness” of a partition. There are two major advantages of using the
DEF approach. Firstly, the non-face partitions are rejected dynamically, therefore they do
not take part in the final decision making. Secondly the overall recognition performance is
better than the best individual result of the combining partitions due to efficient decision
fusion of the face segments.
The rest of the chapter is organized as follows: In Section 3.2 the proposed LRC and
Modular LRC algorithms are described. This is followed by extensive experiments using
standard databases under a variety of evaluation protocols in Section 3.3. The chapter
concludes in Section 3.4 followed by a list of publications arising from this chapter in
Section 3.5.
3.2 Linear Regression for Face Recognition
3.2.1 Linear Regression Classification (LRC) Algorithm
Let there be N distinguished classes with pi training images from the ith class, i =
1, 2, . . . , N . Each grayscale training image is of an order a × b and is represented as
ui^(m) ∈ R^(a×b), i = 1, 2, . . . , N and m = 1, 2, . . . , pi. Each gallery image is
downsampled to an order c × d and transformed to a vector through column concatenation
such that ui^(m) ∈ R^(a×b) → wi^(m) ∈ R^(q×1), where q = cd, cd ≪ ab. Each image
vector is normalized so that the maximum pixel value is 1. Using the concept that patterns
from the same class lie on a linear subspace [50], we develop a class-specific model Xi by
stacking the q-dimensional image vectors,

Xi = [wi^(1) wi^(2) . . . wi^(pi)] ∈ R^(q×pi) ,   i = 1, 2, . . . , N    (3.1)

Each vector wi^(m), m = 1, 2, . . . , pi, spans a subspace of R^q, also called the column space
of Xi. Therefore at the training level each class i is represented by a vector subspace, Xi,
which is also called the regressor or predictor for class i. Let z be an unlabeled test image
and our problem is to classify z as one of the classes i = 1, 2, . . . , N . We transform and
normalize the grayscale image z to an image vector y ∈ R^(q×1) as discussed for the gallery.
If y belongs to the ith class it should be represented as a linear combination of the training
images from the same class (lying in the same subspace) i.e.
y = Xiβi , i = 1, 2, . . . , N (3.2)
where βi ∈ R^(pi×1) is the vector of parameters. Given that q ≥ pi, the system of equations
in equation 3.2 is well-conditioned and βi can be estimated using least squares estimation
[54], [55], [56].
βi = (Xi^T Xi)^(−1) Xi^T y    (3.3)
The estimated vector of parameters, βi, along with the predictors Xi are used to predict
the response vector for each class i:
yi = Xi βi ,   i = 1, 2, . . . , N    (3.4)

yi = Xi (Xi^T Xi)^(−1) Xi^T y

yi = H y
where the predicted vector yi ∈ R^(q×1) is the projection of y onto the ith subspace. In
other words, yi is the closest vector, in the ith subspace, to the observation vector y in the
Euclidean sense [57]. H is called a hat matrix since it maps y into yi. We now calculate
the distance measure between the predicted response vector yi, i = 1, 2, . . . , N and the
Algorithm: Linear Regression Classification (LRC)
Inputs: Class models Xi ∈ R^(q×pi), i = 1, 2, . . . , N and a test image vector y ∈ R^(q×1).
Output: Class of y
1. βi ∈ R^(pi×1) is evaluated against each class model: βi = (Xi^T Xi)^(−1) Xi^T y, i = 1, 2, . . . , N
2. yi is computed for each βi: yi = Xi βi, i = 1, 2, . . . , N
3. The distance between the original and predicted response variables is calculated: di(y) = ‖y − yi‖2
4. The decision is made in favor of the class with the minimum distance di(y)
original response vector y,
di(y) = ‖y − yi‖2 , i = 1, 2, . . . , N (3.5)
and rule in favor of the class with minimum distance i.e.
min_i di(y) ,   i = 1, 2, . . . , N    (3.6)
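The four steps above can be sketched directly in code. This is a minimal sketch: the function name is ours, and `numpy.linalg.lstsq` is used in place of the explicit inverse of Equation 3.3 for numerical stability; the result is the same least-squares estimate when Xi has full column rank.

```python
import numpy as np

def lrc_classify(class_models, y):
    """Linear Regression Classification (Equations 3.3-3.6).

    class_models: list of q x p_i matrices Xi whose columns are the
    downsampled, column-concatenated gallery images of class i.
    y: q-dimensional probe vector.
    Returns (index of the predicted class, distances d_i(y))."""
    distances = []
    for X in class_models:
        # Step 1: least-squares parameters beta_i = (X^T X)^{-1} X^T y.
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        # Step 2: predicted response, i.e. the projection of y onto the subspace.
        y_hat = X @ beta
        # Step 3: distance between original and predicted responses.
        distances.append(float(np.linalg.norm(y - y_hat)))
    # Step 4: decide in favor of the class with the minimum distance.
    return int(np.argmin(distances)), distances
```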
3.2.2 Modular Approach for the LRC Algorithm
The problem of identifying partially occluded faces could be efficiently dealt with using
the modular representation approach [53]. Contiguous occlusion can safely be assumed
local in nature, in the sense that it corrupts only a portion of conterminous pixels of the
image, the amount of contamination being unknown. In the modular approach we utilize
the neighborhood property of the contaminated pixels by dividing the face image into a
number of sub-images. Each sub-image is now processed individually and a final decision
is made by fusing information from all the sub-images. A commonly reported technique
for decision fusion is majority voting [53]. However, a major pitfall with majority voting
is that it equally treats noisy and clean partitions. For instance if three out of four
partitions of an image are corrupted, majority voting is likely to be erroneous no matter
how significant the clean partition may be in the context of facial features. The task
becomes even more complicated by the fact that the distribution of occlusion over a face
image is never known a priori and therefore, along with face and non-face sub-images, we
are likely to have face portions corrupted with occlusion. Some sophisticated approaches
have been developed to filter out the potentially contaminated image pixels (for example
[58]). In this section we make use of the specific nature of distance classification to develop
a fairly simple but efficient fusion strategy which implicitly de-emphasizes corrupted sub-
images, significantly improving the overall classification accuracy. We propose to use the
distance metric as evidence of our belief in the “goodness” of intermediate decisions
taken on the sub-images; the approach is called “Distance based Evidence Fusion” (DEF).
To formulate the concept let us suppose that each training image is segmented in
M partitions and each partitioned image is designated vn, n = 1, 2, . . . , M . The nth
partition of each of the pi training images from the ith class is subsampled and transformed
to a vector as discussed in Section 3.2.1 to develop a class-specific and partition-specific
subspace Ui^(n):

Ui^(n) = [wi^(1)(n) wi^(2)(n) . . . wi^(pi)(n)] ,   i = 1, 2, . . . , N    (3.7)
Each class is now represented by M subspaces and altogether we have M ×N subspace
models. Now a given probe image is partitioned into M segments accordingly. Each
partition is transformed to an image vector y(n); n = 1, 2, . . . , M . Given that i is the true
class for the given probe image, y(n) is expected to lie on the nth subspace of the ith class
Ui^(n) and should satisfy:

y^(n) = Ui^(n) βi^(n)    (3.8)
The vector of parameters and the response vectors are estimated as discussed in Section
3.2.1
βi^(n) = [ (Ui^(n))^T Ui^(n) ]^(−1) (Ui^(n))^T y^(n)    (3.9)

yi^(n) = Ui^(n) βi^(n) ,   i = 1, 2, . . . , N    (3.10)
The distance measure between the estimated and the original response vector is com-
puted:
di(y^(n)) = ‖y^(n) − yi^(n)‖2 ,   i = 1, 2, . . . , N    (3.11)
Now for the nth partition an intermediate decision j^(n) is reached with a corresponding
minimum distance calculated as:
dj^(n) = min_i di(y^(n)) ,   i = 1, 2, . . . , N    (3.12)
Therefore, we now have M decisions j^(n) with M corresponding distances dj^(n), and we
decide in favor of the class with the minimum distance.
Decision = arg min_j dj^(n) ,   n = 1, 2, . . . , M    (3.13)
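The DEF procedure of Equations 3.8–3.13 can be sketched as below; the data layout (one subspace matrix per partition and class) and the function name are our assumptions:

```python
import numpy as np

def def_classify(partition_models, probe_partitions):
    """Distance based Evidence Fusion over M partitions.

    partition_models[n][i]: matrix U_i^(n) for partition n and class i.
    probe_partitions[n]: probe vector y^(n) for partition n.
    Each partition casts an intermediate LRC decision j^(n) with distance
    d_{j^(n)}; the final decision is the intermediate decision with the
    smallest distance, so heavily occluded partitions are ignored implicitly."""
    decisions, min_dists = [], []
    for models, y in zip(partition_models, probe_partitions):
        d = []
        for U in models:
            beta, *_ = np.linalg.lstsq(U, y, rcond=None)   # Equation 3.9
            d.append(float(np.linalg.norm(y - U @ beta)))  # Equations 3.10-3.11
        j = int(np.argmin(d))        # intermediate decision j^(n), Equation 3.12
        decisions.append(j)
        min_dists.append(d[j])
    # Equation 3.13: the intermediate decision with the minimum distance wins.
    return decisions[int(np.argmin(min_dists))]
```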
3.3 Experimental Results
Extensive experiments were carried out to illustrate the efficacy of the proposed ap-
proach. Essentially three standard databases i.e. the AT&T [59], Yale [27] and the AR
[32] databases have been addressed. These databases incorporate several deviations from
the ideal conditions including pose, illumination, occlusion and gesture alterations. Sev-
eral standard evaluation protocols, reported in the face recognition literature, have been
adopted and a comprehensive comparison of the proposed approach with the state-of-the-art
techniques has been presented.
3.3.1 AT&T Database
We attack the problem of face recognition by first addressing the AT&T database [59].
The AT&T database is maintained at AT&T Laboratories Cambridge. Ten
different images from one of the 40 subjects from the database are shown in Figure 3.1.
The database incorporates facial gestures such as smiling or non-smiling, open or closed
eyes and alterations like glasses or without glasses. It also characterizes a maximum of 20◦
rotation of the face with some scale variations of about 10%. Half of the database is used
as gallery while the other half is used for validation. For the purpose of elaboration we
compared the residuals (y − yi) generated using the true subspace model with those using
a false subspace model under the framework of the proposed LRC approach. Figure 3.2(a)
shows a test image of subject 1. The residuals using a false subspace model (X10), shown
in Figure 3.2(b), are substantially greater than the residuals using the true subspace
model (X1), shown in Figure 3.2(c); note that the residuals are shown for a 25-dimensional
feature space. Large residuals using X10 reflect an imprecise prediction of the response
vector, whereas small residuals using X1 testify to a precise estimation and lead to a correct
classification.
Figure 3.1: A typical subject from the AT&T database
The choice of dimensionality for the AT&T database is illustrated in Figure 3.3(a), which
reflects that the recognition rate becomes fairly constant above a 40-dimensional feature
space. Therefore each 112 × 92 grayscale image is downsampled to an order 7 × 6 and is
transformed to a 42-dimensional feature vector by column concatenation. To make the re-
sults independent of a particular choice of a training data set we made 20 different random
selections of the gallery and probe images and found an average recognition accuracy of
96.8%, the worst and the best results for these random selections being 93.5% and 99.5%
respectively. The results for these random experiments are summarized in Figure 3.3(b).
To provide a comparative value for our approach we follow two evaluation protocols as
proposed in the literature [28], [29], [31]. Evaluation Protocol 1 (EP1) takes the first five
images of each individual as a training set, while the last five are designated as probes. For
Evaluation Protocol 2 (EP2) the “leave-one-out” strategy is adopted. A detailed compar-
ison of the results for the two evaluation protocols is summarized in Table 3.1; all results
are as reported in [28]. For EP1 the LRC algorithm achieves a comparable recognition
accuracy of 93.5% in a 50-D feature space; the best results are reported for the 2DPCA
approach, which are 2.5% better than the LRC method. Also for EP2 the LRC approach
attains a high recognition success of 98.75% in a 50-D feature space; it outperforms the
Figure 3.2: (a) Test image from subject 1. (b) Residuals using a randomly selected false subspace. (c) Residuals using subspace 1.
Figure 3.3: (a) Recognition accuracy for the AT&T database with respect to feature dimension using the LRC algorithm. (b) Cross-validation with 20 random selections of gallery and probe images.
ICA approach by 5% (approximately) and is fairly comparable to Fisherfaces, Eigenfaces,
Kernel Eigenfaces and 2DPCA approaches.
Table 3.1: Results for EP1 and EP2 using the AT&T database.

Evaluation Protocol   Approach            Recognition Rate
EP1                   Fisherfaces         94.50%
EP1                   ICA                 85.00%
EP1                   Kernel Eigenfaces   94.00%
EP1                   2DPCA               96.00%
EP1                   LRC                 93.50%
EP2                   Fisherfaces         98.50%
EP2                   ICA                 93.80%
EP2                   Eigenfaces          97.50%
EP2                   Kernel Eigenfaces   98.00%
EP2                   2DPCA               98.30%
EP2                   LRC                 98.75%
3.3.2 Yale Database
The Yale database, maintained at Yale University, consists of 165 grayscale images from
15 individuals [27]. Images from each subject reflect gesture variations incorporating
normal, happy, sad, sleepy, surprised, and wink expressions. Luminance variation is also
addressed by including images with lighting source from central, right and left directions.
A couple of images with and without spectacles are also included. Figure 3.4 represents 11
different images from a single subject. Experiments are conducted on the original database
without any preprocessing stages of face cropping and/or normalization. Each 320 × 243
grayscale image is downsampled to an order of 25 × 25 to get a 625-D feature vector. The
experiments are conducted using the leave-one-out approach as reported quite regularly
in the literature [28], [29], [30]. A comprehensive comparison of various approaches is
provided in Table 3.2. Note that the error rates have been transformed to recognition rates
for [30]. Apart from the Fisherfaces method the LRC approach substantially outperforms
all reported techniques showing an improvement of 7.31% over the best contestant i.e. the
LEM (Line Edge Map) approach. Note that the proposed approach leads the traditional
PCA and ICA approaches by a margin of 17.16% and 21.24% respectively. The choice
of feature space dimension is elaborated in Figure 3.5; the dimensionality curve is shown
for a randomly selected leave-one-out experiment. Classification accuracy becomes fairly
constant in an approximately 600-D feature space.
Figure 3.4: A typical subject from the Yale database with various poses and variations.
Table 3.2: Results for the Yale database using the leave-one-out method.

Approach                     Recognition Rate
ICA [28]                     71.52%
Kernel Eigenfaces [28]       72.73%
Edge map [30]                73.94%
Eigenfaces [30]              75.60%
Correlation [30]             76.10%
Linear subspace [30]         78.40%
2DPCA [28]                   84.24%
Eigenfaces w/o 1st 3 [30]    84.70%
LEM [30]                     85.45%
Fisherfaces [30]             92.70%
LRC                          92.76%
3.3.3 Georgia Tech (GT) Database
The Georgia Tech (GT) database consists of 50 subjects with 15 images per subject [60].
It is characterized by several variations such as pose, expression, cluttered background
and illumination (see Figure 3.6). Images were downsampled to an order of 15 × 15 to
constitute a 225-D feature space. The first 8 images of each subject were used for training
while the remaining 7 served as probes
Figure 3.5: Yale database: Recognition accuracy with respect to feature dimension for a randomly selected experiment.
Figure 3.6: Samples of a typical subject from the GT database.
[61]. All experiments were conducted on the original database without any cropping or
normalization. Table 3.3 shows a detailed comparison of the LRC with a variety of
approaches; all results are as reported in [61], with recognition error rates converted to
recognition success rates. Since the results in [61] are shown for a large range of feature
dimensions, for the sake of fair comparison we have picked the best reported results. The
proposed LRC algorithm outperforms the traditional PCAM and PCAE approaches by
margins of 12% and 18.57% respectively, achieving a high recognition accuracy of 92.57%.
It is also fairly comparable to all other methods, including the latest ERE approaches.
Table 3.3: Results for the Georgia Tech. database.
Method PCAM PCAE BML DSL NLDA
Recognition Rate 80.57% 74.00% 87.43% 90.57% 88.86%
Method FLDA UFS ERE Sb ERE St LRC
Recognition Rate 90.71% 90.86% 92.86% 93.14% 92.57%
3.3.4 FERET Database
Evaluation Protocol 1 (EP1)
The FERET database is arguably one of the largest publicly available databases [62].
Following [61], [63] we construct a subset of the database consisting of 128 subjects with
at least 4 images per subject; as in [61], we use 4 images per subject. Figure 3.7 shows
images of a typical subject from the FERET database. It has to be noted that in [61] the
database consists of 256 subjects: 128 subjects (i.e., 512 images) are used to develop the
face space while the remaining 128 subjects are used for the face recognition trials. The
proposed LRC approach uses the gallery images of each person to form a linear subspace
and therefore does not require any additional development of a face space. It does, however,
require multiple gallery images for a reliable construction of the linear subspaces. Using a
single gallery image for each person is not sufficient in the context of linear regression, as
this corresponds to only a single regressor (or predictor) observation, leading to erroneous
least-squares estimates.
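To make the degeneracy concrete: with a single gallery image $\mathbf{x}_i$ for class $i$, the least-squares fit collapses to a rank-one projection (the notation here follows the standard regression setup rather than any specific equation in this chapter):

```latex
\hat{\beta}_i = \frac{\mathbf{x}_i^{T}\mathbf{y}}{\mathbf{x}_i^{T}\mathbf{x}_i},
\qquad
\hat{\mathbf{y}}_i = \mathbf{x}_i\hat{\beta}_i
                   = \frac{\mathbf{x}_i\mathbf{x}_i^{T}}{\mathbf{x}_i^{T}\mathbf{x}_i}\,\mathbf{y}
```

The residual $\|\mathbf{y}-\hat{\mathbf{y}}_i\|$ then depends only on the angle between the probe and the single gallery vector, so no class-specific subspace structure remains to discriminate between identities.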
Cross-validation experiments for the LRC were conducted in a 42-D feature space; for each
recognition trial 3 images per person were used for training while the system was tested
(fa) (fb) (ql) (qr)
Figure 3.7: A typical subject from the FERET database; fa and fb represent frontal shots with gesture variations while ql and qr correspond to pose variations.
for the fourth one. The results are shown in Table 3.4. The frontal images fa and fb
incorporate gesture variations with small pose, scale and rotation changes, whereas ql and
qr correspond to major pose variations (see [62] for details). The proposed LRC approach
copes well with the problem of facial expressions in the presence of small pose variations,
achieving high recognition rates of 91.41% and 94.53% for fa and fb respectively. It
outperforms the benchmark PCA and ICA I algorithms by margins of 17.19% and 17.97%
for fa, and 21.09% and 23.44% for fb, respectively. The LRC approach shows degraded
recognition rates of 78.13% and 84.38% for the severe pose variations of ql and qr
respectively; however, even with such major posture changes it remains substantially
superior to the PCA and ICA I approaches. Overall we achieve a recognition accuracy of
87.11%, which compares favorably with the 83.00% recognition achieved by ERE [61] using
single gallery images.
Table 3.4: Results for the FERET database.
Experiment Method fa fb ql qr Overall
EP1  PCA    74.22%  73.44%  65.63%  72.66%  71.48%
EP1  ICA I  73.44%  71.09%  65.63%  68.15%  69.57%
EP1  LRC    91.41%  94.53%  78.13%  84.38%  87.11%
EP2  PCA    80.00%  78.75%  67.50%  71.75%  74.50%
EP2  ICA I  77.50%  77.25%  68.50%  70.25%  73.37%
EP2  LRC    93.25%  93.50%  75.25%  76.00%  84.50%
Evaluation Protocol 2 (EP2)
In this experimental setup we validate the consistency of the proposed approach with
a large number of subjects. We now have a subset of the FERET database consisting of
400 randomly selected persons. Cross-validation experiments were conducted as discussed
above, and the results are reported in Table 3.4. The proposed LRC approach gives
consistent results on the larger database as well, achieving high recognition rates of
93.25% and 93.50% for fa and fb respectively. For the severe pose variations of ql and qr
we note a slight degradation in performance, as expected. The overall performance is
nevertheless comparable, with an average recognition success of 84.50%. For all
case studies the proposed LRC approach is found to be superior to the benchmark PCA
and ICA I approaches.
3.3.5 Extended Yale B Database
Extensive experiments were carried out using the Extended Yale B database [64], [65].
The database consists of 2,414 frontal-face images of 38 subjects under various lighting
conditions. The database was divided into 5 subsets; subset 1, consisting of 266 images (7
images per subject) under nominal lighting conditions, was used as the gallery while all
others were used for validation (see Figure 3.8). Subsets 2 and 3, each consisting of 12
images per subject, characterize slight-to-moderate luminance variations, while subset 4
(14 images per person) and subset 5 (19 images per person) depict severe light variations.
All experiments for the LRC approach were conducted with images downsampled to an
order of 20 × 20; the results are shown in Table 3.5. The proposed LRC approach shows
excellent performance for moderate light variations, yielding 100% recognition accuracy
for subsets 2 and 3. The recognition success however falls to 83.27% and 33.61% for
subsets 4 and 5 respectively. The proposed LRC approach has shown better tolerance to
considerable illumination variations than the benchmark reconstructive approaches,
comprehensively outperforming PCA and ICA I for all case studies. The proposed
algorithm, however, could not withstand severe luminance alterations.
Figure 3.8: Starting from the top, each row illustrates samples from subsets 1, 2, 3, 4 and 5 respectively.
Table 3.5: Results for the Extended Yale B database.
Approach Subset 2 Subset 3 Subset 4 Subset 5
PCA 98.46% 80.04% 15.79% 24.38%
ICA I 98.03% 80.70% 15.98% 22.02%
LRC 100% 100% 83.27% 33.61%
3.3.6 AR Database
The AR database consists of more than 4,000 color images of 126 subjects (70 men and 56
women) [32]. The database characterizes divergence from ideal conditions by incorporating
various facial expressions (neutral, smile, anger and scream), luminance alterations (left
light on, right light on and all side lights on) and occlusion modes (sunglasses and scarf).
It also contains adverse scenarios of occlusion combined with luminance variation
(sunglasses with left light on, sunglasses with right light on, scarf with left light on and
scarf with right light on). To take care of session variability the pictures were taken in two
sessions separated by two weeks, and no restrictions regarding wear, make-up, hair style
etc. were imposed on the participants. Due to the large number of subjects and the
substantial amount of variation, the AR database is much more challenging than the
AT&T and Yale databases, and it has been used by researchers as a test-bed to evaluate
and benchmark face recognition algorithms. In this research we address two fundamental
challenges of face recognition, i.e., facial expression variations and contiguous occlusion.
Gesture Variations
Facial expressions are defined as variations in the appearance of the face induced by
internal emotions or social communication [66]. Analysis of these expressions is an
emerging research area in the behavioral sciences [67], [68]. In the context of face
identification, the problem of varying facial expressions refers to the development of face
recognition systems which are robust to these changes. The task becomes more challenging
due to the natural variations in head orientation that accompany changes in facial
expression, as depicted in Figure 3.10. Most face detection and orientation normalization
algorithms make use of facial features such as the eyes, nose and mouth. It has to be noted
that for adverse gesture variations such as "scream" the eyes of the subject naturally close
(see Figure 3.10 (d) and (h)). Consequently, under such severe conditions the eyes cannot
be automatically detected and face normalization is likely to be erroneous. Hence there are
two possible configurations for a realistic evaluation of the robustness of a given face
recognition algorithm: 1) implementing an automatic face localization and normalization
module before the actual face recognition module; or 2) evaluating the algorithm on the
original face image frame rather than a manually localized and aligned face. With this
understanding, we validate the proposed LRC algorithm for the problem of gesture
variations on the original, uncropped and unnormalized AR database. We design four
evaluation strategies, i.e., Evaluation Protocol 1 (EP1), Evaluation Protocol 2 (EP2),
Evaluation Protocol 3 (EP3) and Evaluation Protocol 4 (EP4). For all of these evaluation
protocols the 576 × 768 image frames are downsampled to an order of 10 × 10, constituting
a 100-D feature space. The choice of dimensionality is elaborated in Figure 3.9; the
recognition accuracy of the proposed approach becomes fairly constant after a
dimensionality index of 50.
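The downsampling step can be done in several ways; since the thesis does not specify the interpolation method, the sketch below simply samples a uniform grid of pixels (a nearest-neighbour assumption) to turn a 576 × 768 frame into a 100-D feature vector:

```python
import numpy as np

def to_feature_vector(image, size=(10, 10)):
    """Downsample a 2-D grayscale image to `size` and flatten it.

    Nearest-neighbour sampling on a uniform grid is an assumption here;
    any standard image-resize routine would serve the same purpose.
    """
    rows = np.linspace(0, image.shape[0] - 1, size[0]).astype(int)
    cols = np.linspace(0, image.shape[1] - 1, size[1]).astype(int)
    # Select the grid of pixels and return it as a flat float vector.
    return image[np.ix_(rows, cols)].astype(float).ravel()
```

The same routine with `size=(25, 25)` reproduces the 625-D vectors used for the Yale experiments.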
Figure 3.9: Recognition accuracy with varying feature dimension for EP1, EP2, EP3 and EP4.
Evaluation Protocol 1

Out of the 126 subjects of the AR database, a subset is generated by randomly selecting
100 individuals (50 males and 50 females). The database characterizes four facial
expressions: neutral, smile, anger and scream. EP1 is based on a leave-one-out strategy,
i.e., each time the system is trained using images of 3 different expressions (600 images in
all) while testing is conducted using the left-out expression (200 images) [58]. The LRC
algorithm achieves a high recognition accuracy for all facial expressions; the results for a
100-D feature space are reported in Table 3.6, with an overall average recognition of
98.88%. For the case of screaming the proposed approach achieves 99.5%, which
outperforms the results in [58] by 12.5%; noteworthy is the fact that the results in [58] are
reported on a subset consisting of only 50 individuals.
54
(a) (b) (c) (d)
(e) (f) (g) (h)
Figure 3.10: Gesture variations in the AR database; note the changing position of the head with different poses. The first and second rows correspond to 2 different sessions, incorporating neutral, happy, angry and screaming expressions respectively.
Table 3.6: Recognition Results for Gesture Variations Using the LRC Approach
Evaluation Protocol Gestures Recognition Accuracy
EP1: Neutral 99.00%; Smile 98.50%; Anger 98.50%; Scream 99.50%; Overall 98.88%
EP2: Smile 98.00%; Anger 95.00%; Scream 95.00%; Overall 96.00%
Evaluation Protocol 2
Under EP2 we design a typical experimental setup by training the system only on the
neutral expression (Figures 3.10 (a) and (e)) while testing on the smile, anger and scream
expressions (Figures 3.10 (b), (c), (d), (f), (g) and (h)). The gallery therefore consists of
200 images and we have 600 probe images. The results for a 100-D feature space are
reported in Table 3.6. The LRC algorithm achieves an overall accuracy of 96%, with 98%
recognition for smile and 95% for the scream and anger expressions.
Evaluation Protocol 3
Under EP3 we follow the experimental setup designed in [30]. We now have a subset of
the AR database consisting of 112 individuals. The system is trained using only one image
per subject, characterizing the neutral expression (Figure 3.10 (a)); we therefore have 112
gallery images. The system is tested on the remaining three expressions shown in Figures
3.10 (b), (c) and (d), altogether making 336 probe images. Table 3.7 shows a thorough
comparison of the LRC approach with the results reported in [30]. EM and LEM stand
for Edge Map and Line Edge Map respectively, while all other approaches are variants of
Principal Component Analysis (PCA) [30]. Note that in [30] the choice of session for the
AR database is not explicitly mentioned; to remove the ambiguity we therefore evaluated
the LRC algorithm for both sessions.
For the first session, the proposed LRC approach achieves a high overall recognition
accuracy of 92.86%, which outperforms the best reported result of 75.67% (the
112-eigenvectors approach) by a margin of 17.19%. For the smile and anger expressions
we obtain 93.75% and 95.54% respectively, which are quite comparable to the best
contestants, i.e., 94.64% (60-eigenvectors) and 92.86% (LEM). However, for the most
adverse case of the screaming expression, the proposed LRC approach outperforms all the
reported approaches by a wide margin, attaining a recognition accuracy of 89.29% and
maintaining a difference of 43.75% from the best competitor (112-eigenvectors). The main
achievement of the proposed method is its consistently excellent performance for gentle
(smile and anger) as well as severe (scream) facial expressions.

For the second session, Figure 3.10 (e) is used for training while validation is conducted
on Figures 3.10 (f), (g) and (h). The consistency of the proposed approach is even more
pronounced for this experimental setup, as the LRC now achieves a high recognition rate
of 97.32% for all three expressions, outperforming the best reported results for smile,
anger and scream by 2.68%, 4.46% and 51.78% respectively. In the overall sense the LRC
approach is now better than the best competitor by a margin of 21.65%.
Table 3.7: Recognition Results for Gesture Variations under EP3

Approach                    Smile    Anger    Scream   Overall
20-eigenvectors             87.85%   78.57%   34.82%   67.08%
60-eigenvectors             94.64%   84.82%   41.96%   73.80%
112-eigenvectors            93.97%   87.50%   45.54%   75.67%
112-eigenvectors w/o 1st 3  82.04%   73.21%   32.14%   62.46%
EM                          52.68%   81.25%   20.54%   51.49%
LEM                         78.57%   92.86%   31.25%   67.56%
LRC (Session 1)             93.75%   95.54%   89.29%   92.86%
LRC (Session 2)             97.32%   97.32%   97.32%   97.32%

Evaluation Protocol 4

Under EP4 we compare the proposed approach with two state-of-the-art algorithms:
Bayesian Eigenfaces (MIT) [33] and FaceIt (Visionics). The Bayesian Eigenfaces approach
was reported to be among the best in the 1996 FERET test [34], whereas the FaceIt
algorithm (based on Local Feature Analysis [35]) is claimed to be one of the most successful
commercial face recognition systems [36]. A new subset of the AR database is generated by
randomly selecting 116 individuals. The system is trained using the neutral expression of
the first session (Figure 3.10 (a)), giving 116 gallery images, and is validated on all other
expressions of the same session (Figures 3.10 (b), (c) and (d)), making altogether 348
probe images. A comprehensive comparison of the LRC approach with these two
state-of-the-art algorithms is presented in Table 3.8; all results are as reported in [36]. For
mild variations due to the smile and anger expressions the LRC approach yields
competitive recognition accuracies of 93.97% and 95.69% in comparison to the FaceIt and
MIT approaches. The efficacy of the proposed approach is highlighted for the severe case
of the screaming expression, where the LRC comprehensively outperforms the FaceIt and
MIT approaches by margins of 11.66% and 48.66% respectively. The consistent
performance of the LRC approach yields an excellent overall identification rate of 93.10%,
better than either the FaceIt or the MIT approach.
Table 3.8: Recognition Results for Gesture Variations under EP4

Approach   Smile    Anger    Scream   Overall
FaceIt     96.00%   93.00%   78.00%   89.00%
MIT        94.00%   72.00%   41.00%   60.00%
LRC        93.97%   95.69%   89.66%   93.10%

Contiguous Occlusion

The problem of face identification in the presence of contiguous occlusion is arguably one
of the most challenging paradigms in robust face recognition. Commonly worn items such
as caps, sunglasses and scarves obstruct facial features, causing recognition errors.
Moreover, in the presence of occlusion the problems of automatic face localization and
normalization discussed in the previous section are magnified further. Experiments on
manually cropped and aligned databases therefore make an implicit assumption of an
evenly cropped and well-aligned face which is not available in practice.
The AR database contains two modes of contiguous occlusion, i.e., images with sunglasses
and images with a scarf. Figure 3.11 reflects these two scenarios for two different sessions.
A subset of the AR database consisting of 100 randomly selected individuals (50 men and
50 women) is used for the empirical evaluation. The system is trained using Figures 3.10
(a)-(h) for each subject, thereby generating a gallery of 800 images. Probes consist of
Figures 3.11 (a) and (b) for sunglass occlusion and Figures 3.11 (c) and (d) for scarf
occlusion. The proposed approach is evaluated on the original database without any
manual cropping and/or normalization.
For the case of sunglass occlusion the proposed LRC approach achieves a high recognition
accuracy of 96% in a 100-D feature space. Table 3.9 presents a detailed comparison of the
LRC approach with the approaches reported in [6]: Principal Component Analysis (PCA),
Independent Component Analysis architecture I (ICA I), Local Non-negative Matrix
Factorization (LNMF), least-squares projection onto the subspace spanned by all face
images, and Sparse Representation-based Classification (see [6] for details). NN and NS
correspond to Nearest Neighbors and Nearest Subspace classification respectively. The
LRC algorithm comprehensively outperforms the best competitor (SRC) by a margin of
9%. To the best of our knowledge the LRC approach achieves the best result reported for
the case of sunglass occlusion; note that in [6] a comparable recognition rate of 97.5% was
achieved only with a subsequent image partitioning approach.
(a) (b) (c) (d)

Figure 3.11: Examples of contiguous occlusion in the AR database.

Table 3.9: Recognition Results for Occlusion

Approach   Sunglass   Scarf
PCA+NN     70.00%     12.00%
ICA I+NN   53.50%     15.00%
LNMF+NN    33.50%     24.00%
l2+NS      64.50%     12.50%
SRC        87.00%     59.50%
LRC        96.00%     26.00%

For the case of severe scarf occlusion the proposed approach gives a recognition accuracy
of 26% in a 3600-D feature space. Figure 3.12 shows the performance of the system with
respect to an increasing dimensionality of the feature space. Although the LRC algorithm
outperforms the classical PCA and ICA I approaches by margins of 14% and 11%
respectively, it lags the SRC approach by a margin of 33.5%.
We now demonstrate the efficacy of the proposed Modular LRC approach under severe
occlusion conditions. As a preprocessing step the AR database is normalized, both in
scale and orientation, generating a cropped and aligned subset of images consisting of 100
subjects. Images are manually aligned using the eye and mouth locations, as shown in
Figure 3.13 [69], and each image is cropped to an order of 292 × 240; some images from
the normalized database are shown in Figure 3.14.

We follow the evaluation protocol as discussed in Section 3.3.6: all images are partitioned
into 4 blocks, as shown in Figure 3.15 (a). The blocks are numbered in ascending order
from left to right, starting from the top, and the LRC algorithm for each sub-image uses
a 100-D feature space as discussed in the previous section. Figure 3.16 (a) elaborates the
Figure 3.12: Recognition accuracy versus feature dimension for scarf occlusion using the LRC approach.
efficacy of the proposed approach for a random probe image. In the proposed approach we
use the distance measures dj(n) as evidence of our belief in a sub-image. The key factor to
note in Figure 3.16 (a) is that the corrupted sub-images (i.e., blocks 3 and 4 in Figure
3.15 (a)) reach a decision with a low belief, i.e., high distance measures dj(n). In the final
decision making these corrupted blocks are therefore rejected, giving a high recognition
accuracy of 95%. The superiority of the proposed approach is more pronounced when
considering the individual recognition rates of the sub-images in Figure 3.16 (b). Blocks 1
and 2 yield high classification accuracies of 94% and 90% respectively, whereas blocks 3
and 4 give 1% each. Note that the effect of the proposed approach is twofold: first, it
automatically de-emphasizes the non-face partitions; second, the efficient and dynamic
fusion harnesses the complementary information of the face sub-images to yield an overall
recognition accuracy of 95%, which is better than the best of the participating face
partitions.
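The block-wise decision and its distance-based fusion can be sketched as follows; this is an illustrative reading of the evidence-based fusion described above (take the decision of the block with the smallest residual), not the thesis implementation.

```python
import numpy as np

def modular_lrc(gallery_blocks, probe_blocks):
    """Modular LRC with distance-based evidence fusion (DEF) -- a sketch.

    gallery_blocks: one dict per block, mapping label -> (q x p) matrix
                    whose columns are vectorised gallery sub-images.
    probe_blocks:   one length-q vector per block of the probe image.

    Each block makes an intermediate LRC decision together with the
    residual ("distance") of that decision; the final label is taken
    from the block whose decision carries the smallest residual, so
    occluded blocks (which decide with large residuals, i.e. low
    evidence) are effectively rejected.
    """
    decisions = []
    for gallery, y in zip(gallery_blocks, probe_blocks):
        best = (np.inf, None)
        for label, X in gallery.items():
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            d = np.linalg.norm(y - X @ beta)
            if d < best[0]:
                best = (d, label)
        decisions.append(best)
    # DEF: trust the most confident (minimum-distance) block decision.
    return min(decisions, key=lambda t: t[0])[1]
```

An occluded block fits no class subspace well, so its residual is large and it never supplies the final label; no explicit occlusion detector is needed.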
Figure 3.13: A sample image indicating eyes and mouth locations for the purpose of manual alignment.

(a) (b) (c) (d)

Figure 3.14: Samples of cropped and aligned faces from the AR database.

(a) (b) (c)

Figure 3.15: Case studies for the Modular LRC approach for the problem of scarf occlusion.

Note that in Figure 3.15 (a) the partitioning is such that the uncorrupted sub-images
(blocks 1 and 2) correspond to undistorted and complete eyes, which are arguably among
the most discriminative facial features. One could therefore argue that the high
classification accuracy is due to this coincidence and that the approach might not work
well otherwise. To remove this ambiguity we partitioned the images into six and eight
blocks, as shown in Figure 3.15 (b) and (c) respectively. Blocks are numbered left to right,
starting from the top. For Figure 3.15 (b), partitions 1 and 2 give high recognition
accuracies of 92.5% and 90% respectively, while the remaining blocks, i.e., 3, 4, 5 and 6,
yield 8%, 7%, 0% and 1% recognition respectively. Interestingly, although the best block
gives 92.5%, which is 1.5% less than the best block of Figure 3.15 (a), the overall
classification accuracy comes out to be 95.5%.
Similarly, in Figure 3.15 (c), blocks 1, 2, 3 and 4 give classification accuracies of 88.5%,
84.5%, 80% and 77.5% respectively, while the corrupted blocks 5, 6, 7 and 8 produce 3%,
0.5%, 1% and 1%. The proposed evidence-based algorithm yields a high classification
accuracy of 95%. A key observation is that the best individual result, 88.5%, lags the best
individual result of Figure 3.15 (a) by 5.5%; nevertheless, the proposed combination of
blocks yields a comparable overall recognition rate. Interestingly, the eyebrow regions
(blocks 1 and 2) in Figure 3.15 (c) have been found to be the most useful.
To the best of our knowledge, the recognition accuracy of 95.5% achieved by the presented
approach is the best result reported for the case of scarf occlusion; the previous best was
93.5%, achieved by the partitioned SRC approach in [6], while a 93% classification
accuracy is reported for 50 subjects in [58]. Finally, we compare the proposed DEF
approach with the weighted sum rule, which is perhaps the major workhorse in the field
of combining classifiers [43]. The comparison for the three case studies in Figure 3.15 is
presented in Table 3.10. Note that without any prior knowledge of the goodness of a
specific partition, we used equal weights for all sub-images of a given partitioned image.
The DEF approach comprehensively outperforms the sum rule for the three case studies
Figure 3.16: (a) Distance measures dj(n) for the four partitions; note that the non-face components reach decisions with low evidence. (b) Recognition accuracies for all blocks.
showing improvements of 38.5%, 33.5% and 20% respectively. The performance of the
sum rule improves as the proportion of pure face regions increases, demonstrating a strong
dependency on the way the partitioning is performed. The proposed DEF approach, on
the other hand, shows consistent performance even for the worst-case partitioning, i.e.,
Figure 3.15 (a), which consists of an equal number of face and non-face partitions.
Table 3.10: Comparison of the DEF with the Sum Rule for Three Case Studies
Case Studies Sum Rule DEF
4-partitions 56.50% 95.00%
6-partitions 62.00% 95.50%
8-partitions 75.00% 95.00%
3.4 Conclusion
In this chapter a novel nearest subspace classification algorithm is proposed which
formulates the face identification task as a problem of linear regression. The proposed
LRC algorithm is extensively evaluated on standard databases using a variety of
evaluation protocols reported in the face recognition literature. Specifically, the challenges
of varying facial expressions and contiguous occlusion are addressed. A comprehensive
comparative analysis with state-of-the-art algorithms clearly reflects the potency of the
proposed approach. The LRC approach reveals a number of interesting outcomes. Apart
from the Modular LRC approach for face identification in the presence of disguise, the
LRC approach yields high recognition accuracies without requiring any preprocessing
steps of face localization and/or normalization. We argue that in the presence of non-ideal
conditions such as occlusion, illumination changes and severe gestures, a cropped and
aligned face is generally not available. A consistently reliable performance on unprocessed
standard databases therefore makes the LRC algorithm appropriate for realistic scenarios.
For the case of varying gestures, the LRC approach has been shown to cope well with the
most severe screaming expression, where state-of-the-art techniques lag behind, indicating
its consistency across mild and severe changes. For the problem of face recognition in the
presence of disguise, the Modular LRC algorithm, using an efficient evidential fusion
strategy, yields the best results reported in the literature. The simple architecture of the
proposed approach makes it computationally efficient and therefore a suitable candidate
for video-based face recognition applications. Other future directions include robustness
issues related to illumination, random pixel corruption and pose variations.
3.5 Publications
1. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, "Linear Regression
for Face Recognition", in press, IEEE Transactions on Pattern Analysis and Machine
Intelligence (IEEE TPAMI).

2. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, "Face Identification
using Linear Regression", in International Conference on Image Processing (ICIP'09),
Cairo, Egypt.
Chapter 4
Robust Regression for Face
Recognition1
4.1 Introduction
In general, face recognition systems critically depend on manifold learning methods. A
gray-scale face image of order a × b can be represented as an ab dimensional vector in
the original image space. Typically, in pattern recognition problems, it is believed that
high-dimensional data vectors are redundant measurements of an underlying source. The
objective of manifold learning is therefore to uncover this “underlying source” by a suit-
able transformation of high-dimensional measurements to low-dimensional data vectors.
Therefore, at the feature extraction stage, images are transformed to low dimensional
vectors in a face space. The main objective is to find a basis function for this transfor-
mation, which could distinguishably represent faces in the face space. In the presence of
noise, however it is supposed to be an extremely challenging task [3], [70]. It follows from
coding theory that iterative measurements are more likely to safely recover information
in the presence of noise [71], therefore working in a low dimensional feature space main-
taining the aspect of robustness is in fact an ardent problem in object recognition. A
1 A part of the chapter has been accepted for publication in the International Conference on Pattern Recognition (ICPR'10). The initial submission to the IEEE TPAMI received revisions in December 2009; the chapter has been duly revised and resubmitted in April 2010. General problem statements and the literature review in the Introduction section of the chapter are included for the sake of completeness and to make the chapter self-contained. Since the thesis is presented as a compilation of publications, some repetition of general statements between chapters is inevitable.
number of approaches have been reported in the literature for dimensionality reduction.
In the context of robustness, these approaches have been broadly classified into two
categories, namely generative/reconstructive and discriminative methods [72].
Reconstructive approaches (such as PCA [13], ICA [31] and NMF [73], [74]) are reported
to be robust to missing and contaminated pixels; these methods essentially exploit the
redundancy in the visual data to produce representations with sufficient reconstructive
properties. Formally, given an input x and label y, generative classifiers learn a model of
the joint probability p(x, y) and classify using the posterior p(y|x), which is determined
via Bayes' rule. Discriminative approaches (such as LDA [17]), on the other hand, are
known to yield better results in "clean" conditions [20] owing to their flexible decision
boundaries. The optimal decision boundaries are determined using the posterior p(y|x)
directly from the data [72] and are consequently more sensitive to outliers. Apart from
these traditional approaches, it has recently been shown that unorthodox features such as
downsampled images and random projections can serve equally well; in fact, the choice of
the feature space may no longer be so critical [75], [76], [6]. What really matters is the
dimensionality of the feature space and the design of the classifier.
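For concreteness, a toy generative classifier in this spirit, using an isotropic Gaussian class-conditional density and a class prior, can be sketched as follows (a sketch under these simplifying assumptions, not a model used in this thesis):

```python
import numpy as np

def fit_gaussian_generative(X, y):
    """Model p(x | y) as an isotropic Gaussian per class plus a prior p(y)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0),               # class mean
                     Xc.var(axis=0).mean() + 1e-6,  # isotropic variance
                     len(Xc) / len(X))              # prior p(y = c)
    return params

def predict(params, x):
    """Classify via the posterior p(y | x) proportional to p(x | y) p(y),
    i.e. Bayes' rule, comparing log-posteriors up to a shared constant."""
    scores = {c: -0.5 * np.sum((x - mu) ** 2) / var
                 - 0.5 * x.size * np.log(var)
                 + np.log(prior)
              for c, (mu, var, prior) in params.items()}
    return max(scores, key=scores.get)
```

A discriminative method would instead fit the decision boundary (the posterior) directly, which is what makes it more flexible in clean conditions but more sensitive to outliers.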
In the paradigm of face recognition, illumination variation is supposed to be a major
robustness issue [3]. Several approaches have been proposed in the literature to tackle
the problem. A sophisticated approach in [64] models images of a subject with a fixed
pose but different illumination conditions as a convex cone in the space of images. Con-
sequently a small number of training images of each face taken under various lighting
conditions are used for the reconstruction of shape and albedo of the face. Although the
approach has demonstrated some good results, in practice the computation of an exact
illumination cone for a given subject is quite expensive and tedious due to a large num-
ber of extreme rays. Studies have shown that the facial images under varying luminance
conditions can be modeled as low-dimensional linear subspaces [65]. Basis images for this
purpose may be obtained using a 3D model under diffuse lighting based on spherical har-
monics. Arrangement of physical lighting can be harnessed so as to obtain images which
can directly be used as basis vectors for low-dimensional linear space. Another line of ac-
tion is to normalize/compensate the illumination effect by some kind of preprocessing such
68
as histogram equalization, gamma correction and logarithm transform. However these el-
ementary global processing techniques are not of much help in the presence of nonuniform
illumination variations [77]. Moreover, some recent approaches such as the Line Edge Map
(LEM) [30] and Face-ARG matching [78] have shown good tolerance under adverse lumi-
nance alterations. The use of geometrical/structural information of the face region justifies
the implicit robustness of these approaches.
Apart from the illumination problem, it has also been shown in the literature that
traditional face recognition approaches do not cope well in the presence of severe random
noise [6], [79], [80], [81], [82]. Most of the approaches in the literature that are robust to
random pixel noise are variants of neural network classification. An important work is presented
in [80] incorporating a robust kernel approach in the presence of severe noise. Encouraging
results have been shown for two important problems of the additive noise (salt and pep-
per) and the multiplicative noise (speckle) compared to the traditional SVM approaches.
Similarly in [82] it has been shown that a neural network classifier outperforms traditional
PCA [13], 2DPCA [28], LDA [17] and Laplacianfaces [83] approaches for the case of severe
additive Gaussian noise. Apart from these neural network approaches, recently sparse
representation classification (SRC) has been presented [6], [81]. In the presence of noise
(modeled as a uniform random variable) the approach has been shown to outperform
the traditional approaches of PCA, ICA I, LNMF and L2+NS; however, other important
noise models such as speckle and salt and pepper noise are not addressed.
In this chapter we propose a robust classification algorithm for the problem of face
recognition in the presence of random pixel distortion. Samples from a specific object
class are known to lie on a linear subspace [50], [17]. In our previous work [75], [76]
we proposed to develop class specific models of the registered users thereby defining the
task of face recognition as a problem of linear regression. In the work presented here, we
extend our investigations to the problem of noise contaminated probes, where the inverse
problem is solved using a novel application of the robust linear Huber estimation [84], [85]
and the class label is decided based on the subspace with the most precise estimation.
The proposed approach, although simple in architecture, has demonstrated promising
results for two critical robustness issues of severe illumination variations and random pixel
noise.
The rest of the chapter is organized as follows: The fundamental problem of robust
estimation is discussed in Section 4.2 followed by the face recognition problem formulation
in Section 4.3. Section 4.4 demonstrates the efficacy of the proposed approach for the
problem of severely varying illumination followed by the experiments for random pixel
corruption in Section 4.5. The chapter finally concludes in Section 4.6, followed by a list of
publications in Section 4.7.
4.2 The Problem of Robust Estimation
Consider a linear model
y = Xβ + e (4.1)
where the dependent or response variable y ∈ R^(q×1), the regressor or predictor variable
X ∈ R^(q×p), the vector of parameters β ∈ R^(p×1) and the error term e ∈ R^(q×1). The problem of
robust estimation is to estimate the vector of parameters β so as to minimize the residual
r = y − ŷ, ŷ = Xβ (4.2)

ŷ being the predicted response variable. In classical statistics the error term e is
conventionally taken as a zero mean Gaussian noise [86]. A traditional method to optimize
the regression is to minimize the least squares (LS) problem
arg min_β ∑_{j=1}^{q} rj²(β) (4.3)
where rj(β) is the jth component of the residual vector r. However in the presence of
outliers, least squares estimation is inefficient and can be biased. Although it has been
claimed that classical statistical methods are robust, they are only robust in the sense of
type I error. Type I error corresponds to the rejection of null hypothesis when it is in
fact true. It is straightforward to note that the type I error rate for classical approaches in
the presence of outliers tends to be lower than the nominal value. This is often referred to
as conservatism of classical statistics. However due to contaminated data, type II error
increases drastically. Type II error is the error when the null hypothesis is not rejected
when it is in fact false. This drawback is often referred to as inadmissibility of the classical
approaches. Additionally, classical statistical methods are known to perform well with the
homoskedastic data model. In many real scenarios however, this assumption is not true
and heteroskedasticity is indispensable, thereby emphasizing the need of robust estimation.
Several approaches to robust estimation have been proposed such as R-estimators
and L-estimators. However M -estimators have shown superiority due to their generality
and high breakdown point [84], [86]. Primarily M -estimators are based on minimizing a
function of residuals
β̂ = arg min_{β ∈ R^p} F(β) ≡ ∑_{j=1}^{q} ρ(rj(β)) (4.4)
where ρ(r) is a symmetric function with a unique minimum at zero [84], [85]
ρ(r) = r²/(2γ)    for |r| ≤ γ
       |r| − γ/2  for |r| > γ (4.5)
γ being a tuning constant called the Huber threshold. Many algorithms have been de-
veloped for calculating the Huber M -estimate in Equation 4.4, some of the most efficient
are based on Newton’s method [87]. M -estimators have been found to be robust and sta-
tistically efficient compared to classical methods [88], [89], [57]. Although robust methods,
in general, are superior to their classical counterparts, they have rarely been addressed
in applied fields [58], [86]. Several reasons have been discussed in [86] for this paradox,
computational expense related to the robust methods has been a major hindrance [88].
However, with recent developments in computational power, this reason has become in-
significant. The reluctance to use robust regression methods may also be attributed to
the belief of many statisticians that classical methods are robust.
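For concreteness, the Huber M-estimate of Equation 4.4 can be computed by iteratively reweighted least squares (IRLS). The sketch below, in Python with NumPy, is illustrative only: the function name, iteration budget and default threshold are assumptions, not details from the thesis.

```python
import numpy as np

def huber_fit(X, y, gamma=1.0, n_iter=100, tol=1e-10):
    """Huber M-estimate of beta in y = X @ beta + e via IRLS (a sketch;
    gamma is the Huber threshold of Equation 4.5)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # ordinary LS start
    for _ in range(n_iter):
        r = y - X @ beta
        a = np.abs(r)
        # Huber weights: constant inside the threshold, gamma/|r| outside,
        # so large residuals (outliers) are progressively downweighted
        w = np.where(a <= gamma, 1.0, gamma / np.maximum(a, 1e-12))
        XtW = X.T * w                                  # weight each observation
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```

With a modest fraction of gross outliers in y, this estimate stays close to the true parameters where ordinary least squares drifts, illustrating the robustness argued above.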
Table 4.1: Outline of the Robust Linear Regression Classification (RLRC) Algorithm

Algorithm: Robust Linear Regression Classification (RLRC)
Inputs: Class models Xi ∈ R^(q×pi), i = 1, 2, . . . , N and a test image vector y ∈ R^(q×1).
Output: Class of y

1. β̂i ∈ R^(pi×1) is evaluated against each class model: β̂i = arg min_{βi ∈ R^(pi)} { F(βi) ≡ ∑_{j=1}^{q} ρ(rj(βi)) }, i = 1, 2, . . . , N
2. ŷi is computed for each β̂i: ŷi = Xiβ̂i, i = 1, 2, . . . , N
3. The distance between the original and predicted response variables is calculated: di(y) = ‖y − ŷi‖₂, i = 1, 2, . . . , N
4. The decision is made in favor of the class with the minimum distance di(y).
4.3 Robust Linear Regression Classification (RLRC) for Robust Face Recognition
Consider N distinct classes with pi training images from the ith class, i = 1, 2, . . . , N.
Each grayscale training image is of order a × b and is represented as ui^(m) ∈ R^(a×b),
i = 1, 2, . . . , N, m = 1, 2, . . . , pi. Each gallery image is downsampled to an order c × d
and transformed to a vector through column concatenation such that ui^(m) ∈ R^(a×b) →
wi^(m) ∈ R^(q×1), where q = cd and cd ≪ ab. Each image vector is normalized so that the
maximum pixel value is 1. Using the concept that patterns
from the same class lie on a linear subspace [50], we develop a class specific model Xi by
stacking the q-dimensional image vectors,
Xi = [wi^(1) wi^(2) . . . wi^(pi)] ∈ R^(q×pi), i = 1, 2, . . . , N (4.6)

Each vector wi^(m), m = 1, 2, . . . , pi, spans a subspace of R^q, also called the column space
of Xi. Therefore at the training level each class i is represented by a vector subspace, Xi,
which is also called the regressor or predictor for class i. Let z be an unlabeled test image
and our problem is to classify z as one of the classes i = 1, 2, . . . , N . We transform and
normalize the grayscale image z to an image vector y ∈ R^(q×1) as discussed for the gallery.
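The gallery construction described above (downsample each a × b image to c × d, column-concatenate, normalize, and stack into Xi) can be sketched as follows. The block-averaging downsampler and the function name are illustrative assumptions; any standard resize would serve.

```python
import numpy as np

def build_class_model(images, out_shape=(10, 10)):
    """Stack downsampled, normalized image vectors into a class model Xi
    of order q x pi with q = c*d (a sketch)."""
    c, d = out_shape
    cols = []
    for img in images:
        a, b = img.shape
        # block-average downsample to c x d (assumes a and b divide evenly)
        small = img.reshape(c, a // c, d, b // d).mean(axis=(1, 3))
        v = small.reshape(-1, order='F')   # column concatenation
        cols.append(v / v.max())           # maximum pixel value becomes 1
    return np.column_stack(cols)           # q x pi
```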
If y belongs to the ith class it should be represented as a linear combination of the training
images from the same class (lying in the same subspace) i.e.
y = Xiβi + e , i = 1, 2, . . . , N (4.7)
where βi ∈ R^(pi×1). From the perspective of face recognition the training of the system
corresponds to the development of the explanatory variable (Xi) which is normally done in
a controlled environment, therefore the explanatory variable can safely be regarded as noise
free. The issue of robustness comes into play when a given test pattern is contaminated
with noise which may arise due to luminance, malfunctioning of the sensor, channel noise
etc. Given that q ≥ pi, the system of equations in Equation 4.7 is well-conditioned and βi
is estimated using robust Huber estimation as discussed in Section 4.2 [85]
β̂i = arg min_{βi ∈ R^(pi)} F(βi) ≡ ∑_{j=1}^{q} ρ(rj(βi)), i = 1, 2, . . . , N (4.8)
where rj(βi) is the jth component of the residual
r(βi) = y − Xiβi , i = 1, 2, . . . , N (4.9)
The estimated vector of parameters, β̂i, along with the predictors Xi are used to predict
the response vector for each class i:

ŷi = Xiβ̂i, i = 1, 2, . . . , N (4.10)
We now calculate the distance measure between the predicted response vector ŷi, i =
1, 2, . . . , N and the original response vector y,

di(y) = ‖y − ŷi‖₂ , i = 1, 2, . . . , N (4.11)
and rule in favor of the class with minimum distance i.e.
min_i di(y), i = 1, 2, . . . , N (4.12)
The proposed RLRC algorithm is outlined in Table 4.1.
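The four steps of Table 4.1 can be combined into a short end-to-end sketch. The inner IRLS solver mirrors the Huber estimation of Section 4.2; the function names and iteration counts are assumptions for illustration, not code from the thesis.

```python
import numpy as np

def rlrc_classify(class_models, y, gamma=0.5, n_iter=30):
    """RLRC sketch: Huber-fit y against each class subspace Xi and return
    the index of the class with the smallest reconstruction distance."""
    def huber_beta(X, y):
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        for _ in range(n_iter):
            r = y - X @ beta
            a = np.abs(r)
            w = np.where(a <= gamma, 1.0, gamma / np.maximum(a, 1e-12))
            XtW = X.T * w
            beta = np.linalg.solve(XtW @ X, XtW @ y)
        return beta
    dists = []
    for X in class_models:                        # one model Xi per class
        b = huber_beta(X, y)                      # step 1: robust estimate
        dists.append(np.linalg.norm(y - X @ b))   # steps 2-3: predict, distance
    return int(np.argmin(dists))                  # step 4: minimum-distance rule
```

Even when a handful of probe pixels are grossly corrupted, the Huber fit downweights them, so the correct subspace still yields the smallest residual.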
4.4 Case Study: Face Recognition in the Presence of Severe Illumination Variations
The proposed RLRC algorithm is extensively evaluated on various databases incorporating
several modes of luminance variations. In particular we address three standard databases
namely Yale Face Database B [64], CMU-PIE database [90] and AR database [32]. For all
experiments images are histogram equalized and transformed to logarithm domain.
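The preprocessing just mentioned can be sketched in a few lines; this numpy-only version (function name assumed) equalizes an 8-bit image via its cumulative histogram and then applies a logarithm transform:

```python
import numpy as np

def preprocess(img):
    """Histogram equalization followed by a logarithm transform for an
    8-bit grayscale image (an illustrative sketch)."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum() / img.size            # cumulative distribution
    eq = (cdf[img] * 255).astype(np.uint8)    # histogram equalization
    return np.log1p(eq.astype(np.float64))    # logarithm domain
```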
Figure 4.1: Yale Face Database B: Starting from the top, each row represents typical images from subsets 3, 4 and 5 respectively. Note that subset 5 (third row) characterizes the worst illumination variations.
Table 4.2: Details of the subsets for Yale Face Database B with respect to light source directions.
Subsets 1 2 3 4 5
Lighting angle (degrees) 0−12 13−25 26−50 51−77 >77
Number of images 70 120 120 140 190
4.4.1 Yale Face Database B
Yale face database B consists of 10 individuals with 9 poses incorporating 64 different
illumination alterations for each pose [64]. The database has been used by researchers as
a test-bed for the evaluation of robust face recognition algorithms. Since we are concerned
with the illumination tolerant face recognition problem, only the frontal images of the
subjects are considered. The images are divided into 5 subsets with respect to the angle
between the light source direction and the camera axis (see Figure 4.1); refer to Table 4.2.
Interested readers may also refer to [64] for further details of the database. All images are
downsampled to an order of 50 × 50.
We follow the evaluation protocol as reported in [64], [77], [91], [92], [93], [94], [95].
Training is conducted using subset 1 and the system is validated on the remaining sub-
sets. A detailed comparison of the results with some of the latest approaches is shown in
Table 4.3; all results are as reported in [77].

Table 4.3: Recognition Results for Yale Face Database B
Methods Subset 3 Subset 4 Subset 5
No Normalization [77] 89.20% 48.60% 22.60%
Histogram Equalization [77] 90.80% 45.80% 58.90%
Linear Subspace [64] 100.00% 85.00% N/A
Cones-attached [64] 100.00% 91.40% N/A
Cones-cast [64] 100.00% 100.00% N/A
Gradient Angle [91] 100.00% 98.60% N/A
Harmonic Images [92] 99.70% 96.90% N/A
Illumination Ratio Images [93] 96.70% 81.40% N/A
Quotient Illumination Relighting [94] 100.00% 90.60% 82.50%
9PL [95] 100.00% 97.20% N/A
Method in [77] 100.00% 99.82% 98.29%
RLRC 100.00% 100.00% 100.00%

Note that the error rates have been converted to
the recognition success rates. Since subset 3 incorporates moderate luminance variations,
most of the state-of-the-art algorithms report error-free recognition, as shown in Table 4.3.
For subset 4 with more adverse illumination variations, the proposed algorithm achieves
100% recognition which is either better than or comparable to all the results reported in
the literature. In particular the proposed approach outperforms the Cones-attached, Illu-
mination Ratio Images and Quotient Illumination Relighting methods by 8.60%, 18.60%
and 9.40% respectively. It is also found to be fairly comparable to the latest Cones-cast
and Gradient Angle approaches. Subset 5 represents the worst case scenario with angle
between the light source direction and camera axis being greater than 77◦. The pro-
posed RLRC algorithm consistently achieves 100% recognition for these severe alterations,
comparing favorably with all reported results in the literature and beating the Quotient
Illumination Relighting method by more than 17%. Noteworthy is the fact that results for
this subset are not available in the literature for most of the contemporary approaches.
4.4.2 CMU-PIE Face Database
Evaluation Protocol 1 (EP 1)
Extensive experiments were conducted on the CMU-PIE database [90]. We follow the
evaluation protocol proposed in [1], randomly selecting a subset of the database consisting of
65 subjects with 21 illumination variations per subject; all images are resized to an order of
Figure 4.2: The 21 different illumination variations for a typical subject from the CMU-PIE database. These images were captured without any ambient lighting, thereby demonstrating more severe luminance alterations.
Table 4.4: Performance comparison with state-of-the-art algorithms characterizing training images captured from near frontal lighting. All results are as reported in [1].
Training Images IPCA 3D Linear Subspace Fisherfaces MACE Filters Corefaces RLRC
5,6,7,8,9,10,11,18,19,20 97.60% 97.30% 97.30% 100.00% 100.00% 100.00%
5,6,7,8,9,10 91.40% 97.10% 89.30% 99.90% 100.00% 99.41%
5,7,9,10 72.40% 93.20% 71.40% 99.90% 99.90% 99.85%
7,10,19 36.10% 50.90% 73.30% 99.10% 99.90% 99.93%
8,9,10 78.00% 97.80% 82.10% 99.90% 99.90% 99.41%
18,19,20 91.00% 98.40% 94.20% 99.90% 100.00% 100.00%
Table 4.5: Performance comparison with state-of-the-art algorithms characterizing training images with severe lighting conditions. All results are as reported in [1].
Training Images IPCA 3D Linear Subspace Fisherfaces MACE Filters Corefaces RLRC
3,7,16 95.90% 99.90% 99.90% 100.00% 100.00% 100.00%
1,10,16 90.70% 99.90% 99.90% 100.00% 100.00% 100.00%
2,7,16 88.57% 99.85% 100.00% 100.00% 100.00% 100.00%
4,7,13 91.40% 98.90% 99.10% 100.00% 100.00% 100.00%
3,10,16 91.70% 100.00% 99.90% 100.00% 100.00% 100.00%
3,16 44.30% N/A 49.90% 99.90% 99.90% 99.93%
50 × 50. Figure 4.2 represents the 21 different alterations for a typical subject, each image
being labeled accordingly. We follow two experimental setups as proposed in [1]. In the
first set of experiments the system is trained using images with near frontal lighting and
validation is conducted across the whole database. A detailed comparison of the perfor-
mance with the state-of-the-art approaches is depicted in Table 4.4. The proposed RLRC
algorithm is found to be highly comparable with the latest approaches of MACE filters
and Corefaces; it also comprehensively outperforms the IPCA, 3D linear subspace and
Fisherfaces methods for various case studies of training sessions. For instance, with
training images labeled 7, 10 and 19, the proposed RLRC algorithm achieves 99.93% recog-
nition which is 63.83%, 49.03% and 26.63% better than IPCA, 3D linear subspace and
Fisherfaces methods respectively.
For the second set of experiments, training is conducted on images captured under
extreme lighting conditions; the system is again validated across the whole database. The
proposed RLRC algorithm is found to be comparable with the latest approaches as shown
in Table 4.5. The only erroneous recognition trial was for the case with the training images
labeled 3 and 16. The error may be attributed to the fact that the system was trained
using only 2 images, which provides only a couple of regressor or predictor observations
for each class in the context of the RLRC algorithm. Apart from insufficient information,
it has to be noted that images 3 and 16 (Figure 4.2) have adverse luminance conditions.
Evaluation Protocol 2 (EP 2)
Under Evaluation Protocol 2 (EP 2) we follow the leave-one-out strategy on the 68 subjects
of the CMU-PIE database, as proposed in a recent work on the generalized quotient image
[96]. A detailed comparison with the best results in [96] is shown in Figure 4.3. The
proposed RLRC approach consistently attained high recognition accuracy for all leave-
one-out experiments. In particular, apart from one recognition trial we attained an error-
free performance index with 100% recognition accuracy. Only one error was reported for
the seventh leave-one-out experiment where we achieved a recognition rate of 98.53%. It
is appropriate to point out that the performance curve for the S-QI method in Figure 4.3 is an
approximation to the curve shown in [96].
[Plot: recognition accuracy (%) versus leave-one-out experiment index, with curves for S-QI and RLRC.]
Figure 4.3: Performance curves for the CMU-PIE database under EP 2.
4.4.3 AR Database
The AR face database contains over 4000 color images taken in two sessions separated by
two weeks [32]. The database characterizes various deviations from the ideal conditions
including facial expressions, luminance conditions and occlusion modes. In particular,
there are three lighting modes with left light on, right light on and both lights on. Figure
4.4 represents these variations for the two sessions.
Evaluation Protocol 1 (EP 1)
We follow the evaluation protocol proposed in [97]: a subset of the database consisting
of 118 randomly selected individuals is used. Training is performed on images with
nominal lighting conditions (Figure 4.4 (a) and (e)) while validation is conducted on
images with adverse ambient lighting (Figure 4.4 (b), (c), (d), (f), (g) and (h)). Therefore
altogether we have 236 (118 × 2) gallery images and 708 (118 × 6) probes.
All images are downsampled to an order of 180 × 180. The results are detailed in
Figure 4.4: Various luminance variations for a typical subject of the AR database; the two rows represent two different sessions.
Table 4.6: Results for the AR database under EP 1.
Method Recognition Accuracy
LPP 65.25%
DLPP 96.89%
RLRC 95.76%
Table 4.6. The proposed RLRC algorithm outperforms the Locality Preserving Projections
(LPP) method by a margin of 30.51% and is quite comparable to the Discriminant Locality
Preserving Projections (DLPP) method. All results are as reported in [97].
Evaluation Protocol 2 (EP 2)
Under EP 2 we follow the experimental setup proposed in [98]. We now have a subset
of 121 subjects; training is done on Figure 4.4 (a) and the system is validated on the adverse
luminance variations of the same session, i.e., Figures 4.4 (b), (c) and (d). Therefore we
have 121 gallery images and 363 (121 × 3) probes. The results are tabulated in Table
4.7; all results are as reported in [98]. The proposed RLRC algorithm achieves a high
Table 4.7: Results for the AR database under EP 2.
Method Recognition Accuracy
PCA 25.90%
PCA+HE 37.70%
PCA+BHE 71.30%
PCA+2D Face Model 81.80%
RLRC 94.49%
Table 4.8: Results for the AR database under EP 3.
Method Left-Light Right-Light Both-Lights
1-NN [78] 22.20% 17.80% 3.70%
PCA [78] 7.40% 7.40% 2.20%
LEM [30] 92.90% 91.10% 74.10%
Face-ARG [78] 98.50% 96.30% 91.10%
RLRC 96.30% 94.07% 94.07%
recognition accuracy of 94.49% outperforming the latest 2D face model approach by a
margin of 12.69%.
Evaluation Protocol 3 (EP 3)
Under Evaluation Protocol 3 (EP 3) we follow the experimental setup proposed in the
recent works of Face-ARG matching [78] and Line Edge Map (LEM) [30]. These
approaches use the geometric quantities and structural information of a human face and
have therefore been shown to be robust to severe illumination variations. We select a subset of
the AR database consisting of 135 subjects. The system is trained using Figure 4.4 (a) while
Figures 4.4 (b), (c) and (d) serve as probes; altogether we have 135 gallery images and
405 (135 × 3) probes. The results are tabulated in Table 4.8; noteworthy is the fact that
the results in [30] are reported for 112 subjects.
The proposed RLRC approach shows a consistent performance across all illumination
modes of the AR database. For the cases of “left light on” and “right light on”, recognition
accuracies of 96.30% and 94.07% are achieved which are fairly comparable to the latest
LEM and Face-ARG approaches as shown in Table 4.8. For the most challenging problem
of illumination with “both lights on” the proposed RLRC approach attains 94.07% recog-
nition which is favorably comparable with the Face-ARG approach and outperforms the
LEM approach by a margin of approximately 20%. The conventional methods of PCA and
1-NN reported in [78] are not considered further, as they lag far behind these latest approaches.
4.4.4 FERET Database
The FERET database is arguably one of the largest publicly available databases, with two
versions [62]: the gray FERET database and the color FERET database. The database addresses
several challenging issues such as expression variations, pose alterations and aging factor
etc. For the case of varying illumination there is only one evaluation protocol, recognized
as “fafc”, within the framework of the gray FERET database. The methodology utilizes only
one gallery image for each of the 1196 subjects; the gallery size is therefore 1196. The
probe set consists of 194 images; refer to [62] for further details on the FERET evaluation
methodology. It is worth noting that recognizing a person from a single gallery image is
itself an independent, challenging problem within the paradigm of face recognition [99] and as such
is not the focus of the presented research. However, to evaluate the efficacy of the proposed
algorithm with a single gallery image per subject, we conducted extensive experiments
as shown in Figure 4.5. The proposed RLRC algorithm outperformed 13 of the 14 reported
algorithms and lagged behind only the algorithm tagged as USC MAR 97 in [62]. Figure
4.5 illustrates the receiver operating characteristics for the three best reported algorithms (in
the sense of recognition accuracy). The proposed RLRC algorithm achieves a verification
accuracy of 70.10% at 0.001 FAR, which lags 9.28% behind the best result reported
for USC MAR 97; the proposed RLRC algorithm, however, comprehensively outperforms
the other 13 algorithms, beating UMD MAR 97 and EF HIST DEV M12 by margins of
approximately 30% and 50% (at 0.001 FAR) respectively. It has to be noted that for higher
values of FAR the proposed RLRC algorithm is better than the best reported method.
In particular the RLRC algorithm achieves better performance from 0.017 FAR onwards
with a good equal error rate of approximately 0.03, which is better than the 0.05 reported
for the USC MAR 97 method [62].
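As a generic illustration of how an equal error rate such as the figures above can be computed from genuine and impostor similarity scores (a sketch, not the FERET evaluation code; the function name is assumed):

```python
import numpy as np

def eer_from_scores(genuine, impostor):
    """Approximate equal error rate: sweep a threshold over all observed
    scores and return (FAR + FRR)/2 where the two rates are closest."""
    best_gap, best_eer = np.inf, None
    for t in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)   # impostors wrongly accepted
        frr = np.mean(genuine < t)     # genuine users wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2.0
    return best_eer
```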
[Plot: verification rate versus false accept rate on a logarithmic scale, with curves for RLRC, USC MAR 97, UMD MAR 97 and EF HIST DEV M12.]
Figure 4.5: ROC curves for the FERET database.
4.5 Case Study: Face Recognition in the Presence of Random Pixel Noise
Extensive experiments were carried out using the Extended Yale B database [64], [65].
The database consists of 2,414 frontal-face images of 38 subjects under various lighting
conditions. Subsets 1 and 2, consisting of 719 images under normal-to-moderate lighting
conditions, were used as the gallery. Subset 3, consisting of 456 images under severe luminance
alterations, was designated as probes. Sample gallery and probe images are shown in
Figure 4.6. The choice of training and testing images is specifically made to isolate the effect of
noise.
The proposed approach was validated for a range of exemplary noise models specific to
Figure 4.6: The first row illustrates some gallery images from subsets 1 and 2 while the second row shows some probes from subset 3.
image data. For all experiments the location of noisy pixels is unknown to the algorithm.
Figure 4.7 reflects the probe images corrupted with various degrees of dead pixel noise.
Figure 4.7: Probe images corrupted with (a) 20% (b) 40% (c) 60% and (d) 80% dead pixels.
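The dead-pixel corruption shown in Figure 4.7 can be simulated by zeroing a random fraction of pixels of an image scaled to [0, 1]; a minimal sketch (function name assumed):

```python
import numpy as np

def add_dead_pixels(img, fraction, rng=None):
    """Set a random fraction of pixels to zero to mimic dead-pixel noise;
    the corrupted locations are unknown to the classifier."""
    rng = np.random.default_rng(0) if rng is None else rng
    out = img.copy()
    out[rng.random(img.shape) < fraction] = 0.0
    return out
```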
The proposed robust linear regression classification approach comprehensively outper-
forms the benchmark reconstructive algorithms of PCA [13] and ICA I [31], demonstrating a high
breakdown point as shown in Figure 4.8. Even for the worst case scenario of 90% corrupted
pixels, the proposed approach achieves 94.52% recognition accuracy outperforming PCA
and ICA I by 49.13% and 75.44% respectively. In the presence of outliers the L1-norm
computation is reported to be more efficient compared to the usual Euclidean distance
measure [88], [89]. Also, in the literature, median filtering has been shown to improve image
understanding in the presence of noise [100]. Therefore, for an appropriate indication of
the performance index of the proposed RLRC algorithm, we also conducted experiments
[Plot: recognition accuracy (%) versus dead-pixel noise percentage, with curves for PCA+NN, ICA I+NN, RLRC, M-PCA, PCA+L1 and M-RLRC.]
Figure 4.8: Recognition accuracy of various approaches for a range of dead-pixel noise densities.
using median filtering and L1-norm calculations for the PCA. Note that median preprocessing
is indicated by the letter “M” before the corresponding approach. The performance
curves are shown in Figure 4.8; the proposed RLRC algorithm consistently outperforms
these robust variants of the benchmark approaches.
In particular, for the case of 90% corruption we note a major performance difference
with the RLRC algorithm outperforming the L1-norm and M-PCA methods by 89.04%
and 91.67% respectively. Noteworthy is the fact that standard preprocessing and robust
calculations are of no use in such severe noise conditions.
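The median preprocessing behind the “M”-prefixed variants can be illustrated with a minimal 3 × 3 median filter; the thesis does not state the window size, so 3 × 3 is an assumption in this numpy-only sketch:

```python
import numpy as np

def median3x3(img):
    """Naive 3 x 3 median filter with edge replication (a sketch)."""
    p = np.pad(img, 1, mode='edge')
    h, w = img.shape
    # stack the nine shifted views of the padded image, take the median
    views = [p[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.median(np.stack(views), axis=0)
```

An isolated impulse (a single corrupted pixel) is removed entirely, which is why this step helps the benchmark approaches under impulsive noise.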
For a comprehensive comparison with the benchmark generative approaches, extensive
verification experiments were conducted; the similarity scores were normalized to the scale
[0, 1]. Results are detailed in Figure 4.12 and Table 4.9. Rank recognition profiles for various
degrees of dead-pixel contamination show an excellent performance index for the proposed
RLRC approach. With the increasing noise intensity the proposed approach shows good
tolerance as compared to PCA and ICA I.
Specifically, with 80% noise density the proposed RLRC approach achieves an excellent
rank-1 recognition accuracy of 99.78%, comprehensively beating the PCA and ICA I rank-1
recognition results by 28.29% and 48.68% respectively. The RLRC approach achieves an
excellent equal error rate (EER) of 0.40% as compared to 5.80% and 11.02% for PCA and
ICA I respectively. The verification rate of 99.78% at a typical 0.01 FAR, as indicated in Table
4.9, also comprehensively outperforms the benchmark approaches.
In the next set of experiments we contaminate the probe images with data drop-out
and snow noise simultaneously, commonly referred to as salt and pepper noise [101].
This fat-tailed noise, also called impulsive noise or spike noise [101],
can be caused by analog-to-digital converter errors and bit errors in transmission [102],
[103]. Figure 4.9 reflects probes distorted with various degrees of salt and pepper noise. In
the overall sense, the proposed RLRC approach is favorably comparable with the bench-
mark reconstructive approaches as depicted in Figure 4.10. At a noise density of 70%
for instance, the RLRC algorithm gives 9.21% and 18.64% better recognition accuracy
compared to PCA and ICA I respectively. However under high noise densities of 80%
and 90%, PCA seems to be better than either ICA I or RLRC. It should be noted that
under such severe conditions of salt and pepper noise, although PCA gives a better com-
parative performance index, the recognition accuracy achieved (e.g. 48.25% at 80%
noise density) is itself far from satisfactory. For salt and pepper noise we note that the
median preprocessing results in a significant improvement both for PCA and RLRC ap-
proaches. For instance at 70% noise density M-PCA achieves 93.86% recognition which is
22.15% better than simple PCA. Similarly the M-RLRC shows an improvement of almost
20% compared to simple RLRC achieving a maximum recognition of 100%. The L1-norm
calculation is of little benefit as M-RLRC consistently performs better than all competing
approaches.
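A minimal simulation of the salt and pepper model used above (function name assumed): each affected pixel of a [0, 1] image is driven to black or white with equal probability.

```python
import numpy as np

def add_salt_and_pepper(img, density, rng=None):
    """Corrupt a [0, 1] image with salt and pepper noise: pixels selected
    at the given density are set to 0 (pepper) or 1 (salt)."""
    rng = np.random.default_rng(0) if rng is None else rng
    out = img.copy()
    mask = rng.random(img.shape) < density
    out[mask] = rng.integers(0, 2, size=int(mask.sum())).astype(img.dtype)
    return out
```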
Verification experiments were also conducted for salt and pepper noise; results are
shown in Figure 4.13 and Table 4.10. For noise densities up to 70%, the proposed RLRC
algorithm shows a better performance index, both in recognition and verification, compared
to PCA and ICA I. For instance at a 60% contamination level RLRC achieves a high
Figure 4.9: Probes with (a) 20% (b) 40% (c) 70% and (d) 90% salt and pepper noise density.
[Plot: recognition accuracy (%) versus salt and pepper noise density, with curves for PCA+NN, ICA I+NN, RLRC, M-PCA, PCA+L1 and M-RLRC.]
Figure 4.10: Recognition accuracy curves in the presence of varying densities of salt and pepper noise.
Table 4.9: Verification results for dead-pixel noise

Noise Density   PCA (EER / Verif.)    ICA I (EER / Verif.)   RLRC (EER / Verif.)
20%             0.70% / 99.78%        0.60% / 100%           0.20% / 100%
40%             1.00% / 97.81%        1.50% / 96.93%         0.20% / 100%
60%             1.80% / 94.30%        3.70% / 92.11%         0.20% / 100%
80%             5.80% / 75.44%        11.02% / 51.54%        0.40% / 99.78%
Table 4.10: Verification results for salt and pepper noise

Noise Density   PCA (EER / Verif.)    ICA I (EER / Verif.)   RLRC (EER / Verif.)
50%             2.12% / 95.61%        2.16% / 93.86%         0.21% / 99.78%
60%             4.10% / 87.28%        4.25% / 83.99%         2.18% / 97.37%
70%             6.36% / 75.22%        8.83% / 62.28%         5.48% / 85.75%
80%             15.13% / 53.07%       17.24% / 42.54%        19.30% / 39.91%
Figure 4.11: Probe images corrupted with speckle noise of variance (a) 4, (b) 6, (c) 8 and (d) 10.
rank-1 recognition accuracy of 94.52%, outperforming PCA and ICA I by a difference
of 12.50% and 16.45% respectively (refer to Figure 4.13 (b)). A low EER of 2.18% for
the proposed RLRC approach is also favorably comparable to the 4.10% and 4.25% of PCA
and ICA I respectively. Note the major performance difference of the receiver operating
characteristics in Figure 4.13 (f). An excellent verification rate of 97.37% at the standard
0.01 FAR comprehensively outperforms the benchmark approaches (refer to Table 4.10).
However, for a noise density greater than 70% the verification results for PCA are better
than either the RLRC or ICA I approaches. For instance, at 80% noise contamination, PCA
achieves a verification rate of 53.07% at a typical 0.01 FAR which is better than ICA
I and RLRC by 12.53% and 13.16% respectively; the EER performance of PCA is also
superior to both approaches, as shown in Table 4.10. The superior performance
of PCA at severe noise densities is, however, undermined by the fact that it is unable to reach
satisfactory performance in the absolute sense, as a 53.07% success rate is not reliable by
any standard. For low to moderate salt and pepper noise, the proposed RLRC remains
the best choice.
Speckle noise is regarded as a major interference in digital imaging and therefore forms
another important robustness issue. The proposed approach was extensively evaluated by
adding varying multiplicative speckle noise to probes as shown in Figure 4.11. Speckle
noise is efficiently modeled as a zero mean uniform random variable.
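The speckle model just described can be simulated as follows; the MATLAB imnoise-style convention out = img·(1 + n), with n zero-mean uniform at the requested variance, is an assumption about the exact setup.

```python
import numpy as np

def add_speckle(img, variance, rng=None):
    """Multiplicative speckle: out = img * (1 + n), where n is zero-mean
    uniform noise with the requested variance (a sketch)."""
    rng = np.random.default_rng(0) if rng is None else rng
    half = np.sqrt(3.0 * variance)   # Uniform(-h, h) has variance h**2 / 3
    n = rng.uniform(-half, half, size=img.shape)
    return img * (1.0 + n)
```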
The proposed RLRC approach showed a good performance index as shown in Figure
4.17, consistently achieving a high recognition accuracy for a wide range of error variance.
The effect of speckle noise with a variance of 6 is shown in Figure 4.11 (b); the image
Figure 4.12: Dead-pixel noise: the first row elaborates rank-recognition profiles while the second row shows the Receiver Operating Characteristics (ROC). From left to right, columns indicate 20%, 40%, 60% and 80% noise density respectively.
Figure 4.13: Salt and pepper noise: the first row represents rank recognition curves while the second row shows the Receiver Operating Characteristics (ROC). From left to right, columns indicate 50%, 60%, 70% and 80% noise densities respectively.
is badly distorted and traditional reconstructive approaches fail to produce competitive
results. The proposed RLRC approach attains a high recognition accuracy of 91.67%,
outperforming the PCA and ICA I approaches by 16.67% and 23.69% respectively. Noteworthy
is the tolerance and consistency of the proposed approach for highly corrupted data. The
best recognition results are reported for the RLRC approach with median filtering (M-RLRC,
indicated by the red dashed line in Figure 4.17). For instance, for the worst case of
speckle noise with variance 10, the M-RLRC approach achieves a high recognition success
of 97.37%, comprehensively outperforming all competing approaches; the best competitor
is the RLRC without any preprocessing, achieving 86.40% (solid blue line in Figure
4.17). The median filtering variants again show a significant improvement of approximately 12%
compared to the unprocessed computations. Note that the L1-norm robust calculations
are not of much help in such adverse conditions.
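The median-filtering preprocessing behind the M- variants can be sketched with a simple 3×3 filter. This is an illustrative implementation only; the exact filter size and border handling used in the experiments may differ.

```python
import numpy as np

def median_filter3(img):
    """3x3 median filter with edge replication.

    Each output pixel is the median of its 3x3 neighborhood, which
    suppresses isolated impulse-like outliers while preserving edges.
    """
    h, w = img.shape
    padded = np.pad(img, 1, mode='edge')
    # nine shifted views, one per position in the 3x3 neighborhood
    stack = np.stack([padded[r:r + h, c:c + w]
                      for r in range(3) for c in range(3)])
    return np.median(stack, axis=0)
```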
Results for the verification experiments are shown in Figure 4.14 and Table 4.11. In
particular, in the presence of speckle noise with variance 8, the RLRC approach achieves a
high rank-1 recognition accuracy of 89.04%, outperforming PCA and ICA I by margins
of 19.52% and 22.15% respectively. At a typical 0.01 FAR the proposed RLRC reaches a
verification rate of 94.74% with an EER of only 3.07%, substantially outperforming both
contesting approaches (refer to Figures 4.14 (d) and (h)).
Due to the fact that all imaging systems acquire images by counting photons, the
detector noise (modeled as additive white Gaussian noise) is always an important case-
study in the context of robustness [104]. The probes were distorted by adding zero-mean
Gaussian noise with a wide range of error variance as shown in Figure 4.16. Since classical
statistical methods are known to be efficient in the presence of Gaussian noise, we also
conducted experiments by solving Equation 4.7 using the least squares (LS) approach.
To harness redundant measurements in the presence of noise, all experiments for LS and
RLRC were conducted in the original image space. Results shown in Figure 4.18 reflect the
superiority of the proposed method. The RLRC approach consistently outperformed all
other approaches for a wide range of error variance. In particular, with an error variance of
0.8 the RLRC approach beats PCA, ICA I and LS methods by margins of 7.89%, 16.88%
and 48.48% respectively. Even with a severe additive noise of 0.9 variance a reasonable
Figure 4.14: Speckle noise: the first row represents rank recognition curves while the second row shows the Receiver Operating Characteristics (ROC). From left to right, columns indicate noise densities with variances 2, 4, 6, and 8 respectively.
Figure 4.15: Gaussian noise: the first row represents rank recognition curves while the second row shows the Receiver Operating Characteristics (ROC). From left to right, columns indicate noise densities with variances 0.5, 0.7, 0.8, and 0.9 respectively.
Table 4.11: Verification Results for Speckle Noise
Noise Variance=2 Noise Variance=4 Noise Variance=6 Noise Variance=8
Approach EER Verif. Approach EER Verif. Approach EER Verif. Approach EER Verif.
PCA 3.20% 87.94% PCA 4.09% 81.80% PCA 5.71% 79.82% PCA 6.62% 73.90%
ICA I 4.60% 86.84% ICA I 5.57% 77.63% ICA I 6.36% 70.61% ICA I 8.27% 67.76%
RLRC 0.52% 99.34% RLRC 1.00% 98.90% RLRC 2.61% 96.27% RLRC 3.07% 94.74%
Table 4.12: Verification Results for Gaussian Noise
Noise Variance=0.5 Noise Variance=0.7 Noise Variance=0.8 Noise Variance=0.9
Approach EER Verif. Approach EER Verif. Approach EER Verif. Approach EER Verif.
PCA 2.19% 95.61% PCA 3.28% 92.11% PCA 2.23% 93.20% PCA 4.05% 86.18%
ICA I 2.06% 94.30% ICA I 3.50% 89.47% ICA I 4.16% 86.62% ICA I 4.17% 86.40%
LS 7.80% 78.73% LS 13.49% 56.36% LS 15.88% 50.00% LS 18.20% 46.93%
RLRC 0.21% 99.78% RLRC 0.43% 99.78% RLRC 1.30% 98.68% RLRC 1.63% 98.68%
Figure 4.16: Probe images corrupted with (a) 0.2, (b) 0.4, (c) 0.6 and (d) 0.8 variance zero-mean Gaussian noise.
93.64% recognition accuracy was achieved. The LS approach showed an interesting behavior:
for low-variance noise the performance of LS is largely comparable to the RLRC
approach. However, at low SNR the LS method substantially lags the robust linear
regression classification. The best recognition results are obtained for the RLRC with
median filtering (M-RLRC); for instance, with 0.8 error variance a recognition accuracy
of 99.78% is reported, which is 3.73% better than the plain RLRC approach. Median filtering
also substantially improved the performance of PCA; however, the two top performance
curves are obtained for the M-RLRC and RLRC methods.
Verification results for various SNR case-studies of AWGN are shown in Figure 4.15
and Table 4.12. In particular, for the worst-case scenario of 0.9 variance Gaussian noise,
the proposed RLRC approach achieves a high verification rate of 98.68% at 0.01 FAR,
comprehensively outperforming the PCA, ICA I and LS methods by 12.50%, 12.28% and
51.75% respectively (see Figure 4.15 (h)). The huge performance difference of more than
50% compared to the LS approach signifies the importance of robust regression for this
particular case-study of face recognition. In terms of the EER, the proposed approach also
attains an excellent figure of 1.63% while the other approaches substantially lag behind
(refer to Table 4.12).
4.6 Conclusion
In this chapter we present a novel robust face recognition algorithm based on the robust
Huber estimation approach. For the first time, the problem of robust face
recognition has been formulated as a robust Huber estimation task.

Figure 4.17: Recognition accuracy of various approaches in the presence of speckle noise for different variances.

Figure 4.18: Recognition accuracy of various approaches in the presence of Gaussian noise for different variances.

The proposed Robust Linear Regression Classification (RLRC) algorithm has been evaluated for two case
studies, i.e. severe illumination variation and random pixel corruption. For the case of
illumination-invariant face recognition, we have demonstrated results on three standard
databases incorporating adverse luminance alterations. A comprehensive comparison with
the state-of-art robust approaches indicates a comparable performance index for the
proposed RLRC approach. We demonstrate, for the first time, an error-free recognition for
the most challenging Subset 5 of the Yale face database B. In addition, we report a
comparable evaluation for the RLRC algorithm on the CMU-PIE and AR databases under
the standard evaluation protocols reported in the literature.
In addition, the problem of random pixel corruption is also addressed. The proposed
RLRC approach has shown good results for various noise models, comprehensively outperforming
the reconstructive benchmark approaches. In particular, the proposed approach
attains a high verification rate of 99.78% at 0.01 FAR for the important case-study of
probes contaminated with 80% dead-pixel noise. This performance is appreciable considering
that the benchmark approaches are unable to provide satisfactory results under such
severe noisy conditions. Similarly, in the presence of severe AWGN the proposed RLRC
approach beats the traditional generative approaches by a margin of around 12%. It
has also been experimentally shown that the classical LS approach is extremely inefficient
in the presence of severe AWGN and lags the proposed RLRC algorithm by more than
50%. For a fair comparison, the robust variants of the base systems are also evaluated, and
a comprehensive comparison demonstrates the efficacy of the proposed approach.
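The gap between LS and robust regression summarized above can be illustrated with a generic Huber M-estimator computed by iteratively reweighted least squares. This is a textbook sketch under standard choices (tuning constant 1.345, MAD scale estimate), not the RLRC implementation evaluated in this chapter.

```python
import numpy as np

def huber_irls(A, y, delta=1.345, iters=30):
    """Huber M-estimate of x in y ~ Ax via iteratively reweighted LS.

    delta is the usual Huber tuning constant (1.345 gives about 95%
    efficiency under Gaussian noise); the residual scale is estimated
    robustly with the median absolute deviation (MAD).
    """
    x = np.linalg.lstsq(A, y, rcond=None)[0]          # LS starting point
    for _ in range(iters):
        r = y - A @ x
        s = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12
        u = np.abs(r) / s                             # scaled residuals
        w = np.where(u <= delta, 1.0, delta / np.maximum(u, 1e-12))
        sw = np.sqrt(w)                               # weighted LS refit
        x = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)[0]
    return x
```

With a few grossly corrupted observations, the reweighting drives the outliers' influence toward zero, whereas plain LS lets them pull the estimate away, which is the behavior observed in the AWGN and dead-pixel experiments.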
Apart from the good performance index of the proposed approach, there are several
interesting outcomes of the presented research. In the paradigm of view-based face recognition,
the choice of features for a given case-study has been a debatable topic. Recent
research has, however, shown the competency of unorthodox features such as downsampled
images and random projections, indicating a divergence from the conventional ideology
[75], [76], [6]. The proposed RLRC approach in fact conforms to this emerging belief. It
has been shown that, with an appropriate choice of classifier, the original image space can
produce good results compared to the traditional subspace approaches. The good results for
randomly distributed noisy pixels are encouraging enough to extend the proposed algorithm
to the problem of contiguous occlusion, where contaminated pixels are known to
have a connected neighborhood.
4.7 Publications
1. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Robust Regression
for Face Recognition”, first revision submitted to IEEE Transactions on Pattern
Analysis and Machine Intelligence (IEEE TPAMI).

2. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Robust Regression
for Face Recognition”, accepted in IAPR International Conference on Pattern
Recognition, ICPR’10.
Chapter 5
Speaker Identification using
Sparse Representation1
5.1 Introduction
Human speech is a natural way of recognizing a person and therefore Automatic Speaker
Recognition (ASR) systems have been widely deployed for secured authentication. Conventional
speaker recognition algorithms make use of acoustic features to develop probabilistic
speaker models and utilize an adequate statistical distance metric for the classification
purpose. Gaussian Mixture Models (GMM) have typically been used to develop
a probabilistic model for each speaker in a given database [105]. The large-scale acceptance
of the GMMs as the standard in the ASR can be credited to a number of factors such
as the high accuracy, the ability to scale training algorithms for large data sets and the
probabilistic framework. A speech signal is naturally characterized by continuous changes
in the spectral domain, consequently a number of Gaussian components (typically of the
order of 64) are necessary to model the speaker-dependent features over the length of an
utterance. The collection of these Gaussian components results in the complete Gaussian
mixture model.
A more efficient approach is to develop a Universal Background Model (UBM) using
utterances from a set of speakers and adapt this universal model with respect to a par-
1The chapter has been accepted for publication in the International Conference on Pattern Recognition (ICPR’10).
ticular speaker using Maximum-A-Posteriori (MAP) adaptation [106]. This state-of-art
approach is commonly referred to as the GMM-UBM. There are several benefits in using
this approach that have accounted for significant performance improvements in the GMM-
based classification. For instance, when training data is not available for the adaptation
of components in the UBM, the speaker values revert to those in the UBM to provide a
more robust speaker model. In contrast, when ample training data is available for a given
GMM component, the values approach those of the ML estimate.
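The adaptation behavior just described (reverting to the UBM with no data, approaching the ML estimate with ample data) follows directly from relevance-MAP adaptation of the component means, sketched below. Variable names and the relevance factor r = 16 are illustrative assumptions, not details taken from this thesis' configuration.

```python
import numpy as np

def map_adapt_means(ubm_means, counts, first_order, r=16.0):
    """Relevance-MAP adaptation of UBM component means.

    counts[c] is the soft frame count for component c and
    first_order[c] the posterior-weighted sum of frames, both computed
    from the speaker's data.  alpha -> 0 with no data (the mean stays
    at the UBM value) and alpha -> 1 with ample data (the mean
    approaches the ML estimate first_order / counts).
    """
    n = np.asarray(counts, dtype=float)[:, None]
    f = np.asarray(first_order, dtype=float)
    alpha = n / (n + r)                       # per-component adaptation weight
    ml = f / np.maximum(n, 1e-10)             # ML mean estimate
    return alpha * ml + (1.0 - alpha) * np.asarray(ubm_means)
```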
Recently an intriguing variation in the GMM-UBM approach has enabled represen-
tation of a speaker as a point in a high dimensional space i.e. the speaker space [7].
The main idea is to concatenate the means of the GMM components to form a so-called
GMM mean supervector [7]. In this way, a variable-length utterance can be represented
as a fixed-length feature vector in the feature space and therefore the problem of speaker
recognition can be tackled as a general problem of pattern recognition. One technique
that has received significant focus in pattern recognition literature is the Support Vector
Machine (SVM). The discriminative nature of the SVM has been successfully applied to
a variety of pattern recognition tasks. An SVM is basically a two-class classifier that fits
a separating hyperplane between the two classes (assuming linear separability). In re-
cent years, the SVM-based classification has become a major focus in the task of speaker
identification and verification [7].
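The supervector construction described above amounts to stacking the adapted component means into one fixed-length vector; a minimal sketch, leaving out the variance and weight normalizations often applied in practice:

```python
import numpy as np

def gmm_mean_supervector(adapted_means):
    """Stack the C adapted component means (a (C, d) array) into one
    fixed-length vector of dimension C*d.  The utterance length no
    longer matters: every utterance maps to a point in R^(C*d)."""
    return np.asarray(adapted_means).reshape(-1)
```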
Typically, in pattern recognition problems, it is believed that high-dimensional data
vectors are redundant measurements of an underlying source. The objective of manifold
learning is therefore to uncover this “underlying source” by a suitable transformation of
high-dimensional measurements to low-dimensional data vectors. The main objective is
to find a basis function for this transformation, which could distinguishably represent
patterns in the feature space. A number of approaches have been reported in the litera-
ture for dimensionality reduction. These approaches have been broadly classified in two
categories namely generative/reconstructive and discriminative methods. Reconstructive
approaches (such as Principal Component Analysis or the PCA [13]) are reported to be
robust for noisy data; these methods essentially exploit the redundancy in the original data
to produce representations with a sufficient reconstruction property. Formally, given an input x and label y, the generative classifiers learn a model of the joint probability p(x, y)
and classify using p(y|x), which is determined using the Bayes’ rule. The discriminative
approaches (such as Linear Discriminant Analysis or the LDA [17]), on the other hand,
are known to yield better results in “clean” conditions [20] owing to the flexible decision
boundaries. The optimal decision boundaries are determined using the posterior p(y|x)
directly from the data and are consequently more sensitive to outliers. In the speaker
recognition community there is a growing interest for the exploration of these manifold
learning methods. The PCA, for instance, has shown some good results in this regard and
is usually referred to as the Eigenvoice approach [107], [108]. Working along the same lines, the
concept of Fishervoice, based on the LDA approach, has recently been proposed to address
the problem of semi-supervised speaker clustering [109].
In this chapter we present a novel speaker identification algorithm in the context of
sparse representation [21]. We propose to utilize the concept of the GMM mean supervec-
tor to develop an overcomplete dictionary using training utterances from all the speakers.
The fixed-length GMM mean supervector of a given test utterance from an unknown
speaker is represented as a linear combination of this overcomplete dictionary. This repre-
sentation is naturally sparse since the test utterance corresponds to only a small fraction of
the whole training database. Using this sparsity, we propose to solve the inverse problem
using the l1-norm minimization as it is shown to be the sparsest solution [24]. The vector
of coefficients thus obtained will have non-zero entries corresponding to the class of the
test utterance. The proposed algorithm is evaluated on a subset of the widely available
TIMIT speech corpus [110]. Comparative analysis with the state-of-art speaker recogni-
tion algorithms yields a fairly comparable performance index for the proposed algorithm.
To the best of our knowledge, it is for the first time that sparse representation classification
is used for the problem of speaker identification.
The rest of the chapter is organized as follows: the basic framework of the proposed
algorithm is presented in Section 5.2, followed by the experimental evaluation in Section 5.3.
The chapter is concluded in Section 5.4.
5.2 Sparse Representation for Speaker Identification
Sparse or parsimonious representation of signals is regarded as a major research area in
the paradigm of statistical signal processing. Most of the signals of practical interest are
compressible in nature. For example, audio signals are compressible in the localized Fourier
domain and digital images are compressible in the Discrete Cosine Transform (DCT) and
wavelet domains. Recent research in the area of compressive sampling has shown that if
the optimal representation of a signal is sufficiently sparse when linearly represented with
respect to an overcomplete dictionary (also referred to as a measurement matrix), it can be
efficiently computed using convex optimization [111, 22, 25, 26, 5, 24].
The main objective of the compressive sensing theory is to achieve computational effi-
ciency for information processing using the parsimonious representation of signals. From
this perspective, the compressive sensing theory basically tries to avoid the Shannon-
Nyquist bound by sampling at a much lower rate and still safely recovering the original
information [111]. Although the compressive sensing paradigm is not intended for classification
purposes, the sparse representation of a signal with respect to a basis remains
implicitly discriminative in nature: it selects only those basis vectors which most compactly
represent the signal and rejects the others [6].
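To see this sparsity-seeking behavior concretely without a full l1 solver, a greedy method such as Orthogonal Matching Pursuit recovers a sparse coefficient vector over a dictionary. Note that this chapter itself uses l1-norm minimization; OMP appears here only as a compact, dependency-free stand-in for sparse recovery.

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal Matching Pursuit: greedily pick the k dictionary
    columns most correlated with the residual, refitting by least
    squares after each pick, and return the sparse coefficient vector."""
    residual = y.astype(float).copy()
    support = []
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))    # best matching atom
        support.append(j)
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coeffs
    x = np.zeros(A.shape[1])
    x[support] = coeffs
    return x
```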
We exploit this discriminative nature of sparse representation to propose a novel
speaker identification algorithm. The proposed algorithm incorporates the GMM mean
supervector kernel approach [7] to represent the utterances as feature vectors of a fixed
dimension.
We now present the basic framework of the proposed speaker identification algorithm.
Let us assume that we have k distinct classes and ni utterances are available for training
from the ith class. Each variable-length training utterance is mapped to a fixed-dimension
feature vector using the GMM mean supervector kernel [7]. Let the resultant feature vector
be designated as vi,j such that vi,j ∈ Rm. Here i is the index of the class, i = 1, 2, . . . , k
and j is the index of the training utterance, j = 1, 2, . . . , ni. All this training data from
the ith class is placed in a matrix Ai such that Ai = [vi,1, vi,2, . . . , vi,ni] ∈ Rm×ni.
Let y ∈ Rm be the GMM mean supervector for a test utterance from the ith speaker. A
fundamental concept in pattern recognition indicates that patterns from the same class lie
on a linear subspace [50]; therefore, if y belongs to the ith class and the training samples
from the ith class are sufficient, y will approximately lie in the linear span of the columns
of Ai:
y = αi,1 vi,1 + αi,2 vi,2 + · · · + αi,ni vi,ni (5.1)
where αi,j are real scalar quantities. Since identity i of the test sample y is unknown we
develop a global dictionary matrix A for all k classes by concatenating Ai, i = 1, 2, . . . , k
as follows:
A = [A1, A2, . . . , Ak] ∈ Rm×(ni·k) (5.2)
The test pattern y can now be represented as a linear combination of all n training
samples (n = ni × k):
y = Ax (5.3)
where
x = [0, · · · , 0, αi,1, αi,2, · · · , αi,ni , 0, · · · , 0]T ∈ Rn (5.4)
is an unknown vector of coefficients. From Equation 5.3 and our earlier discussion, it is
straightforward to note that only those entries of x that are non-zero correspond to
the class of y [6]. This means that if we are able to solve Equation 5.3 for x we can actually
find the class of the test pattern y. Recent research in compressive sensing and sparse
representation [22, 25, 26, 5, 24] has shown that the sparsity of the solution of Equation
5.3 enables us to solve the problem using the l1-norm minimization:
(l1) : x1 = arg min ‖x‖1 subject to Ax = y (5.5)
Once we have estimated x1, ideally it should have nonzero entries corresponding only to
the class of y, and deciding the class of y is then a simple matter of locating the indices of the
non-zero entries in x1. However, due to noise and modeling limitations, x1 is commonly
corrupted by some small nonzero entries belonging to different classes. To resolve this
problem we define an operator δi for each class i so that δi(x1) gives us a vector in Rn
whose only nonzero entries are from the ith class. This process is repeated k times, once for
each class. Now for a given class i we can approximate yi = Aδi(x1) and assign the test
pattern to the class with the minimum residual between y and yi:
min_i ri(y) = ‖y − Aδi(x1)‖2 (5.6)
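Given a sparse coefficient vector x1 returned by any l1 solver, the class-wise masking and residual rule of Equations 5.3 to 5.6 can be sketched as follows; the l1 solver itself is assumed available and is not shown.

```python
import numpy as np

def src_classify(A, y, x1, labels):
    """Class-wise residual rule for sparse representation classification.

    labels[j] is the class of the j-th column of A; delta_i keeps only
    the coefficients of class i, and y is assigned to the class whose
    kept coefficients reconstruct it with the smallest l2 residual.
    """
    labels = np.asarray(labels)
    best_class, best_residual = None, np.inf
    for i in np.unique(labels):
        xi = np.where(labels == i, x1, 0.0)           # delta_i(x1)
        residual = np.linalg.norm(y - A @ xi)         # r_i(y)
        if residual < best_residual:
            best_class, best_residual = i, residual
    return best_class
```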
5.3 Experimental Evaluation
The TIMIT corpus is a collection of phonetically balanced sentences sampled at 16 kHz (8
kHz bandwidth), consisting of 10 utterances from 630 speakers across 8 dialect regions in
the USA [110]. Extensive experiments were conducted on a randomly selected subset of
the TIMIT database consisting of 114 speakers. For our experiments we used 8 utterances
per speaker for training (5 SX and 3 SI sentences) while 2 utterances (2 SA sentences)
constituted the testing set. Refer to [110], [105] for further details.
Table 5.1: Experimental Results for the TIMIT database
Approach Recognition Accuracy
GMM 92.98%
GMM-UBM 96.93%
GMM-SVM 97.80%
Sparse Representation 98.24%
At the feature extraction stage, GMM mean supervector approach [7] (consisting of 64
mixtures) is used to generate fixed-length feature vectors from variable length utterances.
In all experiments a pre-emphasis filter with coefficient 0.97 was applied to the sampled
waveform, and features were extracted from 25 ms frames generated every 10 ms; all
frames were windowed using the Hamming window function. Comparative analysis is
performed using three state-of-the-art approaches i.e. the GMM [105], the GMM-UBM
[106] and the GMM-SVM [7] speaker identification algorithms. For the implementation
of the GMM and GMM-UBM systems the Hidden Markov Model ToolKit (HTK version
3.4.1) [112], was configured to model a single-state HMM with the standard MLLR (Maxi-
mum Likelihood Linear Regression) and MAP (Maximum-A-Posteriori) adaptation scripts
to adapt the UBM accordingly for the GMM-UBM models and GMM-SVM supervectors.
For the GMM-SVM the SVM-KM toolbox [113] was used to implement the one-against-all
SVM classifier.
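The front-end described above (0.97 pre-emphasis, 25 ms Hamming-windowed frames every 10 ms) can be sketched as follows. This is an illustrative reconstruction that stops at the framing stage; an actual system would go on to compute cepstral features from each frame.

```python
import numpy as np

def preemphasis_frames(signal, fs=16000, coeff=0.97,
                       frame_ms=25, hop_ms=10):
    """Pre-emphasize a waveform and cut it into overlapping
    Hamming-windowed frames (frame_ms long, one every hop_ms)."""
    x = np.append(signal[0], signal[1:] - coeff * signal[:-1])
    flen = int(fs * frame_ms / 1000)                  # 400 samples at 16 kHz
    hop = int(fs * hop_ms / 1000)                     # 160 samples at 16 kHz
    n = 1 + max(0, (len(x) - flen) // hop)
    window = np.hamming(flen)
    return np.stack([x[i * hop: i * hop + flen] * window
                     for i in range(n)])
```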
Results are shown in Table 5.1. The proposed sparse representation identification
algorithm achieves 98.24% recognition accuracy, which is better than all the contesting
approaches. The conventional GMM [105] approach, for instance, attains 92.98% recognition,
lagging the proposed approach by 5.26%. The state-of-art GMM-UBM
[106] approach yields a comparable identification success of 96.93%. The recently proposed
GMM-SVM system [7] also attains a good performance with 97.80% recognition.
5.4 Conclusion
With recent developments in the paradigm of speaker recognition, variable-length
utterances can be represented as fixed-length features in a high-dimensional feature space.
The task of speaker identification can therefore now be viewed as a traditional pattern
classification problem. Motivated by these studies, we propose a novel speaker
identification algorithm based on sparse representation. Noting that a given test utterance
from a particular speaker corresponds to only a fraction of the whole training database,
we proposed to develop an overcomplete dictionary of all training utterances. A given
test utterance is thus represented as a linear combination of all training utterances,
giving rise to a naturally sparse representation. The inverse problem is solved using
l1-minimization (as it yields the sparsest solution). Consequently the vector of coefficients is
also sparse, with non-zero entries corresponding to the class of the unknown speaker. The
proposed algorithm is evaluated on the standard TIMIT database and comparative
analysis is performed with state-of-art speaker identification approaches. The proposed
sparse representation classification algorithm has shown a good performance index and
compares favorably with all approaches.
Although the initial investigations of the proposed algorithm are quite promising, the
TIMIT database characterizes an ideal acquisition environment and does not depict
key robustness issues (e.g. reverberant noise and session variability). The good performance
under clean conditions is encouraging enough to extend the proposed approach to
robust speaker recognition on more challenging databases.
5.5 Publications
Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Sparse Representation
for Speaker Identification”, accepted in IAPR International Conference on Pattern
Recognition, ICPR’10.
Chapter 6
Speaker Identification using Linear
Regression1
6.1 Introduction
With increasing security concerns, automatic person identification has emerged as an
active research area over the last two decades. Human speech is a natural way of recognizing
a person and therefore Automatic Speaker Recognition systems have been widely de-
ployed for secured authentication. Conventional speaker recognition algorithms make use
of acoustic features to develop probabilistic speaker models and utilize an adequate statistical
distance metric for the classification purpose. Gaussian Mixture Models (GMM) have
typically been used to develop a probabilistic model for each speaker in a given database
[105]. The large-scale acceptance of the GMMs as the standard in the paradigm of speaker
identification can be credited to a number of factors such as the high accuracy, the ability
to scale training algorithms for large data sets and the probabilistic framework. A speech
signal is naturally characterized by continuous changes in the spectral domain, conse-
quently a number of Gaussian components are necessary to model the speaker-dependent
features over the length of an utterance. The collection of these Gaussian components
results in the complete Gaussian mixture model.
1The chapter is under review for prospective publication in InterSpeech 2010. General problem statements and literature review in the Introduction section of the chapter are included for the sake of completeness and to make the chapter self-contained. Since the thesis is presented as a compilation of independent publications, the repetition of the general statements between the chapters is therefore inevitable.
A more efficient approach is to develop a Universal Background Model (UBM) using
utterances from a set of speakers and adapt this universal model with respect to a par-
ticular speaker using Maximum-A-Posteriori (MAP) adaptation [106]. This state-of-art
approach is commonly referred to as the GMM-UBM. There are several benefits in using
this approach that have accounted for significant performance improvements in the GMM-
based classification. For instance, when training data is not available for the adaptation
of components in the UBM, the speaker values revert to those in the UBM to provide a
more robust speaker model. In contrast, when ample training data is available for a given
GMM component, the values approach those of the ML estimate.
Recently an intriguing variation in the GMM-UBM approach has enabled represen-
tation of a speaker as a point in a high dimensional space i.e. the speaker space [7].
The main idea is to concatenate the means of the GMM components to form a so-called
GMM mean supervector [7]. In this way, a variable-length utterance can be represented
as a fixed-length feature vector in the feature space and therefore the problem of speaker
recognition can be tackled as a general problem of pattern recognition. One technique
that has received significant focus in pattern recognition literature is the Support Vector
Machine (SVM). The discriminative nature of the SVM has been successfully applied to
a variety of pattern recognition tasks. An SVM is basically a two-class classifier that fits
a separating hyperplane between the two classes (assuming linear separability). In re-
cent years, the SVM-based classification has become a major focus in the task of speaker
identification and verification [7].
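The fixed-length property that makes this possible comes from simple concatenation of the M adapted component means; a minimal sketch (the dimensions M = 64 and d = 39 are illustrative):

```python
import numpy as np

def gmm_mean_supervector(adapted_means):
    """Stack the M adapted component means (each d-dimensional) into one
    supervector of fixed dimension q = M * d."""
    return np.asarray(adapted_means).reshape(-1)

# A 2 s and a 10 s utterance adapt the same M component means, so both
# map to supervectors of identical dimension, regardless of duration.
M, d = 64, 39
sv = gmm_mean_supervector(np.zeros((M, d)))
assert sv.shape == (M * d,)   # fixed length, here 2496
```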
Typically, in pattern recognition problems, it is believed that high-dimensional data
vectors are redundant measurements of an underlying source. The objective of manifold
learning is therefore to uncover this “underlying source” by a suitable transformation of
high-dimensional measurements to low-dimensional data vectors. The main objective is
to find a basis for this transformation that can distinctively represent patterns in the
feature space. A number of approaches have been reported in the literature for dimensionality
reduction. These approaches have been broadly classified into two categories, namely
generative/reconstructive and discriminative methods. Reconstructive
approaches (such as Principal Component Analysis, or PCA [13]) are reported to be
robust to noisy data; these methods essentially exploit the redundancy in the original data
to produce representations with a sufficient reconstruction property. Formally, given an
input x and label y, generative classifiers learn a model of the joint probability p(x, y)
and classify using p(y|x), which is determined using Bayes' rule. The discriminative
approaches (such as Linear Discriminant Analysis or the LDA [17]), on the other hand,
are known to yield better results in “clean” conditions [20] owing to the flexible decision
boundaries. The optimal decision boundaries are determined using the posterior p(y|x)
directly from the data and are consequently more sensitive to outliers. In the speaker
recognition community there is a growing interest in the exploration of these manifold
learning methods. PCA, for instance, has shown good results in this regard
and is usually referred to as the Eigenvoice approach [107], [108]. Working along the same lines,
the concept of Fishervoice, based on the LDA approach, has recently been proposed to
address the problem of semi-supervised speaker clustering [109]. An important relevant
work is presented in [114] where an overcomplete dictionary matrix is developed using
training utterances from all speakers. Exploiting the intrinsic sparsity of the representation,
each test utterance is represented as a linear combination of all training utterances (the
columns of the dictionary matrix). The inverse problem is solved using l1 optimization,
which also yields the sparsest solution. Consequently the vector of coefficients is sparse, with its non-zero
entries corresponding to the class of the unknown speaker.
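As an illustrative sketch of this idea (not the exact system of [114]): the l1 step can be posed as the standard basis-pursuit linear program over a split x = x+ − x−, with the class decided by the smallest class-restricted reconstruction residual. The function name and the use of scipy's `linprog` are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def src_classify(A, labels, y):
    """Sparse-representation classification sketch.

    A      : dictionary, one training supervector per column
    labels : class label of each column
    y      : test supervector

    Solves min ||x||_1 s.t. A x = y, then keeps only the coefficients of
    each class in turn and picks the class with the smallest residual.
    """
    n = A.shape[1]
    # Basis pursuit as a linear program: x = xp - xm with xp, xm >= 0.
    res = linprog(c=np.ones(2 * n),
                  A_eq=np.hstack([A, -A]), b_eq=y,
                  bounds=[(0, None)] * (2 * n), method="highs")
    x = res.x[:n] - res.x[n:]

    best_class, best_err = None, np.inf
    for c in sorted(set(labels)):
        keep = np.array([lab == c for lab in labels])
        err = np.linalg.norm(y - A @ np.where(keep, x, 0.0))
        if err < best_err:
            best_class, best_err = c, err
    return best_class
```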
In the paradigm of pattern recognition it is well known that, in general, samples from
the same object class lie on a linear subspace [17], [50]. In this research we utilize this concept
to present a novel speaker identification algorithm. Essentially the idea of GMM mean
supervector is used to develop class-specific subspaces using training utterances from each
speaker. The fixed-length GMM mean supervector of a given test utterance from an
unknown speaker is represented against each class model, thereby defining the task of
speaker identification as a problem of linear regression. Least squares estimation is used
to estimate the vectors of parameters for a given test utterance against all speaker models.
Finally the decision is ruled in favor of the class with the most precise estimation. The
proposed classifier can be categorized as a Nearest Subspace (NS) approach. The proposed
algorithm is evaluated on a subset of the widely available TIMIT speech corpus [110].
Comparative analysis with state-of-the-art speaker recognition algorithms yields a fairly
comparable performance index for the proposed algorithm.
The rest of the chapter is organized as follows: Section 6.2 presents the proposed
algorithm, followed by closed-set speaker identification experiments in Section 6.3. The
chapter is concluded in Section 6.4.
6.2 Linear Regression Classification (LRC) Algorithm
Let there be N distinct classes, with pi training utterances from the ith class, i = 1, 2, . . . , N.
Each variable-length training utterance is mapped to a fixed-dimension feature vector using
the GMM mean supervector kernel [7]. Let the resultant feature vector be designated
wi(m) ∈ Rq×1, q being the length of the feature vector and m = 1, 2, . . . , pi. Using the
concept that patterns from the same class lie on a linear subspace [50], we develop a
class-specific model Xi by stacking the feature vectors,

Xi = [wi(1) wi(2) . . . wi(pi)] ∈ Rq×pi, i = 1, 2, . . . , N    (6.1)
Each vector wi(m), m = 1, 2, . . . , pi, spans a subspace of Rq, also called the column space
of Xi. Therefore at the training level each class i is represented by a vector subspace,
Xi, which is also called the regressor or predictor for class i. Let z be an unlabeled
test utterance and our problem is to classify z as one of the classes i = 1, 2, . . . , N . We
transform the utterance z to a feature vector y ∈ Rq×1 as discussed for training. If y
belongs to the ith class it should be represented as a linear combination of the training
utterances from the same class (lying in the same subspace) i.e.
y = Xiβi , i = 1, 2, . . . , N (6.2)
where βi ∈ Rpi×1 is the vector of parameters. Given that q ≥ pi, the system of equations
in Equation 6.2 is overdetermined and βi can be estimated using least squares estimation
[54], [55], [56].
βi = (XiᵀXi)⁻¹Xiᵀy    (6.3)
Algorithm: Linear Regression Classification (LRC)
Inputs: Class models Xi ∈ Rq×pi, i = 1, 2, . . . , N and a test utterance feature vector y ∈ Rq×1.
Output: Class of y
1. βi ∈ Rpi×1 is evaluated against each class model: βi = (XiᵀXi)⁻¹Xiᵀy, i = 1, 2, . . . , N
2. yi is computed for each βi: yi = Xiβi, i = 1, 2, . . . , N
3. Distance calculation: di(y) = ‖y − yi‖2, i = 1, 2, . . . , N
4. Decision is made in favor of the class with the minimum distance di(y)
The estimated vector of parameters, βi, along with the predictors Xi are used to predict
the response vector for each class i:
yi = Xiβi, i = 1, 2, . . . , N    (6.4)
yi = Xi(XiᵀXi)⁻¹Xiᵀy
yi = Hy

where the predicted vector yi ∈ Rq×1 is the projection of y onto the ith subspace. In
other words yi is the closest vector, in the ith subspace, to the observation vector y in the
Euclidean sense [57]. H is called a hat matrix since it maps y into yi. We now calculate
the distance measure between the predicted response vector yi, i = 1, 2, . . . , N and the
original response vector y,
di(y) = ‖y − yi‖2 , i = 1, 2, . . . , N (6.5)
and rule in favor of the class with minimum distance i.e.
min_i di(y), i = 1, 2, . . . , N    (6.6)
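The four steps above translate almost line for line into numpy. This sketch uses numpy's least-squares solver instead of forming (XiᵀXi)⁻¹ explicitly, a standard and numerically safer equivalent:

```python
import numpy as np

def lrc_classify(models, y):
    """Linear Regression Classification.

    models : list of class matrices X_i (q x p_i, one training
             supervector per column)
    y      : q-dimensional test supervector
    Returns the index of the class whose subspace gives the smallest
    reconstruction residual ||y - y_i||.
    """
    distances = []
    for X in models:
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # step 1: beta_i
        y_hat = X @ beta                              # step 2: projection onto span(X_i)
        distances.append(np.linalg.norm(y - y_hat))   # step 3: Euclidean distance
    return int(np.argmin(distances))                  # step 4: nearest subspace
```

A test vector lying in (or near) the span of one class's training supervectors yields a near-zero residual for that class and a large residual elsewhere.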
6.3 Experimental Results
The TIMIT corpus is a collection of phonetically balanced sentences sampled at 16 kHz (8
kHz bandwidth), consisting of 10 utterances from 630 speakers across 8 dialect regions in
the USA [110]. Two sets of experiments were conducted on a randomly selected subset of
the TIMIT database consisting of 200 speakers. For Experiment Set 1 we used 8 utterances
Table 6.1: Experiment Set 1: Recognition accuracy for various approaches with respect to different numbers of mixtures.
Approach 16 MIX 32 MIX 64 MIX 128 MIX 256 MIX
GMM 89.75% 94.25% 92.25% 87.50% 73.50%
GMM-UBM 89.75% 95.50% 96.50% 97% 98.50%
GMM-SVM 92.25% 95.75% NA NA NA
LRC 89.75% 96.00% 96.00% 96.25% 95.50%
Table 6.2: Experiment Set 2: Recognition accuracy for various approaches with respect to different numbers of mixtures.
Approach 16 MIX 32 MIX 64 MIX 128 MIX 256 MIX
GMM 66.50% 60.50% 51.25% 32% 14.75%
GMM-UBM 71.75% 75.25% 74.50% 74% 77.50%
GMM-SVM 70.75% 83.50% 82.50% NA NA
LRC 70.75% 83.00% 78.25% 66.00% 45.50%
per speaker for training (5 SX and 3 SI sentences) while 2 utterances (2 SA sentences)
constituted the testing set. Refer to [110], [105] for further details.
At the feature extraction stage, the GMM mean supervector approach [7] is used to generate
fixed-length feature vectors from variable-length utterances. In all experiments a
pre-emphasis filter with coefficient 0.97 was applied to the sampled waveform; features
were extracted from 25 ms frames generated every 10 ms, and all frames were windowed
using the Hamming window function. Essentially 13-dimensional MFCC features
were concatenated with 13-dimensional delta features and 13-dimensional acceleration fea-
tures thereby generating a 39-dimensional feature vector [112]. Comparative analysis is
performed using three state-of-the-art approaches i.e. the GMM [105], the GMM-UBM
[106] and the GMM-SVM [7] speaker identification algorithms. For the implementation
of the GMM and GMM-UBM systems, the Hidden Markov Model Toolkit (HTK version
3.4.1) [112] was configured to model a single-state HMM, with the standard MLLR (Maximum
Likelihood Linear Regression) and MAP (Maximum-A-Posteriori) adaptation scripts
used to adapt the UBM accordingly for the GMM-UBM models and GMM-SVM supervectors.
For the GMM-SVM the SVM-KM toolbox [113] was used to implement the one-against-all
SVM classifier.
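The front-end framing described above (pre-emphasis with coefficient 0.97, 25 ms Hamming frames advanced every 10 ms, static + delta + acceleration stacking) can be sketched in numpy. The MFCC/DCT stage itself is left to HTK; the `np.gradient` delta below is an illustrative stand-in for HTK's regression-window formula:

```python
import numpy as np

def preemphasize(signal, k=0.97):
    """y[t] = x[t] - k * x[t-1], the pre-emphasis filter used above."""
    return np.append(signal[0], signal[1:] - k * signal[:-1])

def frame_signal(signal, sr, frame_ms=25, hop_ms=10):
    """Slice into Hamming-windowed frames: 25 ms long, advanced every 10 ms."""
    flen, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - flen) // hop)
    frames = np.stack([signal[i * hop:i * hop + flen] for i in range(n_frames)])
    return frames * np.hamming(flen)

def add_deltas(static):
    """Append delta and acceleration rows: 13 static -> 39 per frame."""
    d1 = np.gradient(static, axis=1)   # delta (rate of change over time)
    d2 = np.gradient(d1, axis=1)       # acceleration
    return np.vstack([static, d1, d2])
```

At 16 kHz a 25 ms frame is 400 samples and a 10 ms hop is 160 samples, so a one-second utterance yields 98 frames; stacking deltas turns 13 static coefficients per frame into the 39-dimensional vectors used in the experiments.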
For a comprehensive comparison, experiments were conducted with different numbers
of Gaussian mixtures; results are shown in Table 6.1 and Figure 6.1. The proposed linear
[Figure: recognition accuracy (70–100%) versus number of mixtures for the GMM, GMM-UBM, GMM-SVM and LRC systems.]
Figure 6.1: Experiment Set 1: Recognition accuracy of various approaches with respect to the number of mixtures.
regression classification algorithm demonstrates a comparable performance index across all
experiments. For the case of 32 mixtures, for instance, it achieves 96.00% recognition accuracy,
which is better than all the contesting approaches. The conventional GMM [105] approach,
for instance, attains 94.25% recognition, lagging the proposed approach by 1.75%.
The state-of-the-art GMM-UBM [106] approach yields a comparable identification
success of 95.50%. The recently proposed GMM-SVM system [7] also attains a good
performance with 95.75% recognition. The best recognition accuracy of 98.50% is reported for
the GMM-UBM approach with 256 Gaussian mixtures. It should be noted that as the
one-against-all SVM does not scale well with the vector size (i.e. number of mixtures)
results were not reported where the SVM failed to run.
It is interesting to note in Figure 6.1 the reliability of the standard speaker recognition
approaches with the change in the number of Gaussian mixtures. The primitive GMM system
is the most affected approach, while the proposed linear regression classification method
shows a consistent performance with respect to the varying number of Gaussian mixtures.
Experiment Set 2 was targeted at evaluating the reliability of the proposed nearest
subspace classification algorithm with fewer training utterances. We therefore
selected 3 SI utterances for training, and 2 SA sentences were used to validate the system.
Comprehensive results are shown in Table 6.2 and Figure 6.2. The conventional GMM
system was unable to cope with the smaller number of training utterances and yielded a maximum
recognition accuracy of 66.50% with 32 Gaussian mixtures. The state-of-the-art GMM-UBM
system achieved a maximum performance index of 77.50% utilizing 256 mixtures. The proposed
linear regression classification algorithm attained 83% accuracy with 32 mixtures,
outperforming the GMM and the GMM-UBM systems by 16.50% and 5.50% respectively.
The proposed system achieved a competitive performance index compared to the latest
GMM-SVM approach. Noteworthy is the fact that computations with a high number of mixtures
were not possible for the sophisticated one-against-all SVM approach. The proposed
LRC system is, however, much simpler in architecture, yet achieves comparable results.
[Figure: recognition accuracy (10–90%) versus number of mixtures for the GMM, GMM-UBM, GMM-SVM and LRC systems.]
Figure 6.2: Experiment Set 2: Recognition accuracy of various approaches with respect to the number of mixtures.
6.4 Conclusion and Future Directions
With recent developments in the paradigm of speaker recognition, variable-length utterances
can be represented as fixed-length features in a high-dimensional feature space. The
task of speaker identification can now therefore be viewed as a traditional pattern classification
problem. Motivated by these studies, we propose a novel speaker identification
algorithm based on linear regression. Noting that samples from a particular class lie on
a linear subspace, we proposed to develop class-specific models using training utterances
from each class. A given test utterance is thus represented against each class model and
therefore the pattern recognition task is formulated as a problem of linear regression. The
inverse problem is solved using least squares estimation, and the decision is ruled in favor
of the class with the minimum reconstruction error. The proposed algorithm is evaluated on
the standard TIMIT database, and a comparative analysis is performed with state-of-the-art
speaker identification approaches.
Although the initial investigations of the proposed algorithm are quite promising, the
TIMIT database characterizes an ideal acquisition environment and does not depict
key robustness issues (e.g. reverberant noise and session variability). The proposed framework
is not expected to cope with noisy conditions, since in the presence of outliers,
least squares estimation is inefficient and can be biased [89]. Although it has been claimed
that classical statistical methods are robust, they are only robust in the sense of the type
I error. A type I error corresponds to the rejection of the null hypothesis when it is in fact
true. It is straightforward to note that the type I error rate of classical approaches in the
presence of outliers tends to be lower than the nominal value. This is often referred to
as the conservatism of classical statistics. However, due to contaminated data, the type II error
increases drastically. A type II error occurs when the null hypothesis is not rejected
when it is in fact false. This drawback is often referred to as the inadmissibility of the classical
approaches. Additionally, classical statistical methods are known to perform well under the
homoskedastic data model. In many real scenarios, however, this assumption does not hold
and heteroskedasticity must be accounted for, thereby emphasizing the need for robust estimation
[86]. Moreover, for realistic applications the assumption of Gaussian noise is not always
true, so in general LS estimation lacks robustness.
Although robust methods are, in general, superior to their classical counterparts, they
have rarely been adopted in applied fields [86]. Several reasons for this paradox are discussed in
[86]; the computational expense of robust methods has been a
major hindrance [88]. However, with recent increases in computational power, this
objection has become insignificant. Therefore, the extension of the proposed work will be along
the lines of utilizing iterative robust estimation algorithms to solve the inverse problem in
Equation 6.2.
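One such iterative scheme is iteratively reweighted least squares with the Huber weight function. The sketch below is illustrative, not the thesis implementation; the tuning constant 1.345 and the MAD scale estimate are common but assumed defaults:

```python
import numpy as np

def huber_irls(X, y, delta=1.345, iters=20):
    """Robust estimate of beta in y = X beta via iteratively reweighted
    least squares with Huber weights: full weight for small residuals,
    heavily downweighted gross outliers."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # ordinary LS start
    for _ in range(iters):
        r = y - X @ beta
        # Robust scale: median absolute deviation, normalized for Gaussians.
        s = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12
        u = np.abs(r) / s
        w = np.where(u <= delta, 1.0, delta / u)       # Huber weight function
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
    return beta
```

On data with a minority of gross outliers, the reweighting drives the outliers' influence toward zero, so the estimate tracks the inlier model far more closely than plain least squares.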
6.5 Publications
Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Linear Regression for
Speaker Identification”, Submitted to InterSpeech 2010
Chapter 7
Conclusions and Future Directions
In this chapter we summarize the key contributions of the research. Important future
directions identified by the research are also presented.
7.1 Contributions
Original contributions of the presented research are the following:
• Evaluation of the Sparse Representation Classification (SRC) Concept:
Feature extraction methodology, in the context of face recognition, has been a hot
research area for the past two decades. Sophisticated features with complex com-
putations have been successfully used to tackle various robustness issues. These
studies have recently been challenged by a new concept of SRC. It has been shown
for the first time that the correct design of the classifier induces independence in
the feature extraction module and even randomly selected features, such as down-
sampled images and random projections, can yield competitive results compared to
orthodox feature extraction methodologies. With the successful implementation of
the SRC for the view-based face recognition problem, it became imperative to evaluate
the methodology for (1) other view-based biometrics and (2) harder problems in the
paradigm of face recognition.
With this understanding we successfully extended the SRC approach for the prob-
lem of ear recognition. Investigations yielded an agreeable performance index for
the SRC approach on several ear databases. The SRC was found to be robust to
light variations and head rotations. We also evaluated the SRC approach for two
challenging issues of face recognition i.e. mild-to-normal illumination variations and
severe expression variations. The SRC system has been shown to be tolerant to illumination
variations and has produced a performance index comparable to
the benchmark approaches. It has also yielded good results for moderate expression
variations, however for severe expression variations (such as anger and scream) there
is a tendency of performance degradation. The results are however comparable to
most of the benchmark approaches.
Due to the large-scale deployment of video surveillance systems, it is becoming imperative
to evaluate face recognition algorithms on video sequences. The large amount
of data available for training and testing in a sequence poses many problems such as
over-fitting, natural head rotations, degradation due to computational complexity
etc. With this understanding we extended the evaluations of the SRC approach to
video sequences. The SRC approach attained a good performance index compared
to SIFT (Scale Invariant Feature Transform) achieving a verification rate of 98.23%
at 0.01 FAR for the VidTIMIT database. The complex design of the SRC classifier,
owing to the iterative l1-optimization, indicated a lag in computational
performance. A randomly selected recognition trial of the SRC classifier is reported to
be approximately 5 times slower than the swift SIFT approach.
• The Novel Linear Regression Classification (LRC) Algorithm for Face
Recognition: A novel face recognition algorithm based on the concept of linear
regression has been presented. We showed for the first time that simply the down-
sampled images in combination with a linear regression classification approach can
produce excellent results for various problems of face recognition. Extensive experi-
ments, incorporating standard databases, were conducted to show the efficacy of the
proposed approach. In particular we showed that for the cases of severe expression
variations, where standard approaches fail to produce satisfactory results, the proposed
LRC algorithm attained an excellent performance index. We also introduced
a novel concept of Distance based Evidence Fusion (DEF) to develop a novel mod-
ular approach, called Modular LRC, to address the difficult problem of contiguous
occlusion. In particular, we attained the best results ever reported for the difficult
case of scarf occlusion using the Modular LRC approach. It has to be noted that
amongst the contemporary databases, the scarf occlusion mode of the AR database
is arguably the only available database incorporating naturally occluded images.
• The Novel Robust Linear Regression Classification (RLRC) Algorithm
for Robust Face Recognition: Extending the concept of the proposed LRC al-
gorithm, and noting that the LRC approach actually formulates the pattern recog-
nition problem as a task of linear regression, we proposed to use robust estimation
to tackle the difficult problem of random pixel corruption. We showed phenomenal
results for severe illumination variations and random pixel image noise. In partic-
ular, we achieved 100% recognition for the most difficult Subset 5 of the Yale Face
Database B which has never been reported in the contemporary literature. Excellent
results have also been demonstrated for standard image-noise models compared to
benchmark generative approaches.
• The Novel Implementation of the SRC for the problem of Speaker Iden-
tification: In the paradigm of speaker recognition, probabilistic modeling has been
the major workhorse. Recently, however, an intriguing extension of the GMM-UBM
has made it possible to represent an utterance in a high-dimensional feature space
called the “Speaker Space”. The problem of speaker recognition can therefore be
viewed as a general task of pattern recognition. We therefore implemented the concept
of SRC classification for this problem. Experiments were conducted using a
subset of the TIMIT database, and the proposed framework produced a competitive
performance index compared to state-of-the-art speaker recognition approaches.
• The novel Linear Regression Classification for the Problem of Speaker
Identification: Working along the same lines we proposed a novel implementation
of the Linear Regression Classification (LRC) algorithm for the problem of speaker
identification. To the best of our knowledge, this is the first time that the nearest
subspace classification concept has been introduced in the context of speaker
recognition. The proposed framework is the simplest of the present state-of-the-art
approaches. It primarily uses the concept of supervectors to develop class-specific
speaker models. The test utterance is presented against each speaker model and
therefore the otherwise probabilistic task of speaker identification boils down to a
problem of linear regression. The proposed algorithm is evaluated using the TIMIT
database and has shown a good performance index compared to the benchmark
approaches.
7.2 Future Directions
The presented research has opened a number of future directions in the fields of speaker
and face recognition. The simple architecture of the proposed LRC algorithm makes it
quite tempting for other biometric applications. Computationally complex video-based
face recognition could be a straightforward extension. It will also be interesting to study
the behavior of simple downsampled images, in conjunction with the LRC classifier, for
other view-based biometrics such as iris, lips, hand geometry, body gait, etc. Based on
the concept of linear regression, 3D biometrics can also be tackled, introducing a new
concept of classification for 3D faces, ears, etc.
Robust regression has shown some good results for randomly spread noise in a face
image. Given that under contiguous occlusion the corrupted pixels are known to have a
connected neighborhood, the proposed RLRC approach can be extended to the problem of
contiguous occlusion. Essentially, we made use of robust Huber estimation to solve the
inverse problem in the context of RLRC; this approach can further be evaluated and developed
using other efficient robust estimation approaches presented in the robust statistics
literature. The LRC algorithm has also shown good results for the problem of speaker identification.
However, robustness in the context of LRC is an open research area. It will
be very interesting to note if the robust statistical methods are able to tackle the issues
related to speech noise. In other words, the use of RLRC for robust speaker recognition
will be a promising research area.
Bibliography
[1] M Savvides, B. V. K Vijay Kumar, and P. K Khosla. Corefaces - Robust Shift
Invariant PCA based Correlation Filter for Illumination Tolerant Face Recognition.
In IEEE Conf. on Computer Vision and Pattern Recognition, 2004.
[2] A Jain, A Ross, and S Prabhakar. An Introduction to Biometric Recognition. IEEE
Transactions on Circuits and Systems for Video Technology, 14(1):4–20, Jan 2004.
[3] A. F Abate, M Nappi, D Riccio, and G Sabatino. 2D and 3D Face Recognition: A
Survey. Pattern Recognition Letters, 28(2007):1885–1906, 2007.
[4] M. Pawlewski and J. Jones. Speaker verification: Part 1. Biometric Technology
Today, 14(6):9–11, June 2006.
[5] D. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306,
April 2006.
[6] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust Face Recognition
via Sparse Representation. IEEE Trans. PAMI, 31(2):210–227, Feb 2009.
[7] W Campbell, D Sturim, and D Reynolds. Support vector machines using GMM
supervectors for speaker verification. IEEE Signal Processing Letters, 13(5):308–
311, 2006.
[8] P Phillips, P Grother, R Micheals, D Blackburn, E Tabassi, and M Bone. Face
Recognition Vendor Test 2002: Evaluation Report. 2002.
[9] L. Collins. Earmarked (biometrics). IEE Review, 51(11):38–40, Nov. 2005.
[10] A. Iannarelli. Ear identification. Paramount Publishing Company, Freemont, Cali-
fornia, 1989.
[11] D. J. Hurley, B. Arbab-Zavar, and M. S. Nixon. Handbook of Biometrics, chapter
The ear as a biometric. 2007.
[12] B. Moreno and A. Sanchez. On the use of outer ear images for personal identifi-
cation in security applications. In Proc. IEEE 33rd Annual Intl. Conf. on Security
Technology, pages 469–476, 1999.
[13] M Turk and A Pentland. Eigenfaces for Recognition. Journal of Cognitive Neuroscience,
3(1):71–86, 1991.
[14] K. Iwano, T. Hirose, E. Kamibayashi, and S. Furui. Audio-visual person authen-
tication using speech and ear images. In Proc. of Workshop on Multimodal User
Authentication, pages 85–90, 2003.
[15] B. Arbab-Zavar, M. S. Nixon, and D. J. Hurley. On model-based analysis of ear
biometrics. In IEEE Intl. Conf. on Biometrics: Theory, Applications and Systems,
September 2007.
[16] I. T Jolliffe. Principal Component Analysis. Springer, New York, 1986.
[17] V Belhumeur, J Hespanha, and D Kriegman. Eigenfaces vs Fisherfaces: Recognition
Using Class Specific Linear Projection. IEEE Tran. PAMI, 17(7):711–720, July 1997.
[18] P Comon. Independent Component Analysis - A New Concept ? Signal Processing,
36:287–314, 1994.
[19] M Bartlett, H Lades, and T Sejnowski. Independent Component Representations
for Face Recognition. In Proc. of the SPIE: Conference on Human Vision and
Electronic Imaging III, 3299:528–539, 1998.
[20] R. O Duda, P. E Hart, and D. G Stork. Pattern Classification. John Wiley & Sons,
Inc., 2000.
[21] R. Baraniuk. Compressive sensing. IEEE Signal Processing Magazine, 24, 2007.
[22] E. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal
reconstruction from highly incomplete frequency information. IEEE Trans. Inform.
Theory, 52(2):489–509, 2006.
[23] R. Baraniuk, M. Davenport, R. DeVore, and M. B. Wakin. The Johnson-Lindenstrauss
lemma meets compressed sensing. dsp.rice.edu/cs/jlcs-v03.pdf, 2006.
[24] D. Donoho. For most large underdetermined systems of linear equations the minimal
l1-norm solution is also the sparsest solution. Comm. on Pure and Applied Math,
59(6):797–829, 2006.
[25] E. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and
inaccurate measurements. Comm. on Pure and Applied Math, 59(8):1207–1223,
2006.
[26] E. Candes and T. Tao. Near-optimal signal recovery from random projections:
Universal encoding strategies? IEEE Trans. Inform. Theory, 52(12):5406–5425, 2006.
[27] Yale Univ. Face Database. http://cvc.yale.edu/projects/yalefaces, 2002.
[28] J Yang, D Zhang, A. F Frangi, and J Yang. Two-dimensional PCA: A New Approach
to Appearance-based Face Representation and Recognition. IEEE Trans. PAMI,
26(1):131–137, January 2004.
[29] M. H Yang. Kernel Eigenfaces vs Kernel Fisherfaces: Face Recognition using Kernel
Methods. Proc. Fifth IEEE Int'l Conf. Automatic Face and Gesture Recognition
(FGR'02), pages 215–220, May 2002.
[30] Y Gao and M. K. H Leung. Face Recognition using Line Edge Map. IEEE Trans.
PAMI, 24(6):764–779, June 2002.
[31] P. C Yuen and J. H Lai. Face Representation using Independent Component Anal-
ysis. Pattern Recognition, 35(6):1247–1257, 2002.
[32] A Martinez and R Benavente. The AR Face Database. Technical Report 24, CVC,
June 1998.
[33] B Moghaddam and A. P Pentland. Probabilistic Visual Learning for Object Repre-
sentation. IEEE Trans. on PAMI, 19(7):696–710, 1997.
[34] P. J Phillips, H Wechsler, J. S Huang, and P. J Rauss. The FERET Database and
Evaluation Procedure for Face-recognition Algorithms. Image and Vision Comput-
ing, 16(5):295–306, 1998.
[35] P Penev and J Atick. Local Feature Analysis: A General Statistical Theory for
Object Representation. Network: Computation in Neural Systems, 7(3):477–500,
1996.
[36] R Gross, J Shi, and J Cohn. Quo Vadis Face Recognition? In Third Workshop on
Empirical Evaluation Methods in Computer Vision, 2001.
[37] M Bicego, A Lagorio, E Grosso, and M Tistarelli. On the use of SIFT features for
face authentication. CVPRW, 2006.
[38] C. Sanderson and K. K. Paliwal. Identity verification using speech and face infor-
mation. Digital Signal Processing, 14(5):449–480, 2004.
[39] C Sanderson. Biometric person recognition: Face, speech and fusion. VDM-Verlag,
2008.
[40] D Lowe. Object recognition from local scale-invariant features. Intl. Conf. on Com-
puter Vision, pages 1150–1157, 1999.
[41] K Lee, J Ho, M Yang, and D Kriegman. Visual tracking and recognition using
probabilistic appearance manifolds. CVIU, 99(3):303–331, 2005.
[42] P Viola and M Jones. Robust real-time face detection. International Journal of
Computer Vision, 57(2):137–154, 2004.
[43] J Kittler, M Hatef, R. P. W Duin, and J Matas. On combining classifiers. IEEE
Trans. on Pattern Analysis and Machine Intelligence, 20(3):226–238, 1998.
[44] L Liu, Y Wang, and T Tan. Online appearance model. CVPR, pages 1–7, 2007.
[45] K Lee and D Kriegman. Online probabilistic appearance manifolds for video-based
recognition and tracking. CVPR, 1:852–859, 2005.
[46] P. J. Flynn, K. W. Bowyer, and P. J. Phillips. Assessment of time dependency
in face recognition: An initial study. Audio- and Video-Based Biometric Person
Authentication, pages 44–51, 2003.
[47] L. Lu, X. Zhang, Y. Zhao, and Y. Jia. Ear recognition based on statistical shape
model. In International Conference on Innovative Computing Information and Con-
trol (ICICIC-06), 2006.
[48] A. B Chan and N Vasconcelos. Modeling, clustering, and segmenting video with
mixtures of dynamic textures. IEEE Trans. PAMI, 30:909–926, May 2008.
[49] A Leonardis and H Bischof. Robust Recognition using Eigenimages. Computer
Vision and Image Understanding, 78(1):99–118, 2000.
[50] R Barsi and D Jacobs. Lambertian Reflection and Linear Subspaces. IEEE Trans.
PAMI, 25(3):218–233, 2003.
[51] X Chai, S Shan, X Chen, and W Gao. Locally Linear Regression for Pose-Invariant
Face Recognition. IEEE Trans. PAMI, 16(7):1716–1725, July 2007.
[52] J Chien and C Wu. Discriminant Waveletfaces and Nearest Feature Classifiers for
Face Recognition. IEEE Trans. PAMI, 24(12):1644–1649, Dec 2002.
[53] A Pentland, B Moghaddam, and T Starner. View-based and Modular Eigenspaces
for Face Recognition. Proc. of IEEE Conf. on Computer Vision and Pattern Recog-
nition, 1994.
[54] T Hastie, R Tibshirani, and J Friedman. The Elements of Statistical Learning; Data
Mining, Inference and Prediction. Springer Series in Statistics. Springer, 2001.
[55] G. A. F Seber. Linear Regression Analysis. Wiley, 2003.
[56] T. P Ryan. Modern Regression Methods. Wiley, 1997.
[57] R. G Staudte and S. J Sheather. Robust Estimation and Testing. Wiley, 1990.
[58] S Fidler, D Skocaj, and A Leonardis. Combining Reconstructive and Discriminative
Subspace Methods for Robust Classification and Regression by Subsampling. IEEE
Trans. PAMI, 28(3):337–350, March 2006.
[59] F Samaria and A Harter. Parameterisation of a Stochastic Model for Human Face
Identification. Proc. Second IEEE Workshop Applications of Computer Vision, Dec.
1994.
[60] Georgia Tech. Face Database. http://www.anefian.com/face reco.htm, 2007.
[61] Xudong Jiang, Bappaditya Mandal, and Alex Kot. Eigenfeature Regularization and
Extraction in Face Recognition. IEEE Trans. PAMI, 30(3):383–394, March 2008.
[62] P. J Phillips, H Moon, S Rizvi, and P Rauss. The FERET Evaluation Methodology
for Face Recognition Algorithms. IEEE Trans. PAMI, 22(10):1090–1104, Oct 2000.
[63] J Lu, K. N Plataniotis, A. N Venetsanopoulos, and S. Z Li. Ensemble-Based Discrim-
inant Learning with Boosting for Face Recognition. IEEE Trans. Neural Networks,
17(1):166–178, Jan 2006.
[64] A Georghiades, P Belhumeur, and D Kriegman. From few to Many: Illumination
Cone Models for Face Recognition under Variable Lighting and Pose. IEEE Trans.
PAMI, 23(6):643–660, 2001.
[65] K. C Lee, J Ho, and D Kriegman. Acquiring Linear Subspaces for Face Recognition
under Variable Lighting. IEEE Trans. PAMI, 27(5):684–698, 2005.
[66] S. Z Li and A. K Jain, editors. Handbook of Face Recognition. Springer, 2005.
[67] P Ekman. The Argument and Evidence About Universals in Facial Expressions of
Emotions, pages 143–164. Wiley, 1989.
[68] K Scherer and P Ekman. Handbook of Methods in Nonverbal Behavior Research.
Cambridge University Press, Cambridge, UK, 1982.
[69] L Zhang and G. W Cottrell. When Holistic Processing is Not Enough: Local Fea-
tures Save the Day. In Proc. of the Twenty-sixth Annual Cognitive Science Society
Conference, 2004.
[70] W Zhao, R Chellappa, P Phillips, and A Rosenfeld. Face Recognition: A Literature
Survey. ACM Computing Surveys, 35(4):399–458, Dec 2003.
[71] F MacWilliams and N Sloane. The Theory of Error-Correcting Codes. North Holland,
1981.
[72] P. M Roth and M Winter. Survey of Appearance-based Methods for Object Recog-
nition. Technical report, Inst. for Computer Graphics and Vision, Graz University of
Technology, Austria, January 2008.
[73] Daniel D Lee and H. S Seung. Learning the Parts of Objects by Non-negative
Matrix Factorization. Nature, 401:788–791, 1999.
[74] Daniel D Lee and H. S Seung. Algorithms for Non-negative Matrix Factorization.
Advances in Neural Information Processing Systems, pages 556–562, 2001.
[75] Imran Naseem, Roberto Togneri, and Mohammed Bennamoun. Linear Regression
for Face Recognition. IEEE Trans. on PAMI (in press), 2009.
[76] Imran Naseem, Roberto Togneri, and Mohammed Bennamoun. Face Identification
using Linear Regression. IEEE ICIP, 2009.
[77] Weilong Chen, Meng Joo Er, and Shiqian Wu. Illumination Compensation and
Normalization for Robust Face Recognition Using Discrete Cosine Transform in
Logarithm Domain. IEEE Trans. on Systems, Man and Cybernetics, 36(2):458–464,
2006.
[78] Bo-Gun Park, Kyoung-Mu Lee, and Sang-Uk Lee. Face Recognition using Face-ARG
matching. IEEE Trans. on PAMI, 27(12):1982–1988, 2005.
[79] A. L. M Levada, D. C Correa, D. H. P Salvadeo, J. H Saito, and N. D. A
Mascarenhas. Novel Approaches for Face Recognition: Template-Matching using Dynamic
Time Warping and LSTM Neural Network Supervised Classification. Intl. Conf. on
Systems, Signals and Image Processing, pages 241–244, 2008.
[80] Chia-Te Liao and Shang-Hong Lai. A Novel Robust Kernel for Appearance-based
Learning. Intl. Conf. on Pattern Recognition, pages 1–4, Dec. 2008.
[81] John Wright, Yi Ma, Julien Mairal, Guillermo Sapiro, Thomas Huang, and
Shuicheng Yan. Sparse Representation for Computer Vision and Pattern Recog-
nition. IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2009.
[82] Xiaoyin Xu and Majid Ahmadi. A Human Face Recognition System Using Neural
Classifiers. Intl. Conf. on Computer Graphics, Imaging and Visualization, 2007.
[83] X He, S Yan, Y Hu, P Niyogi, and H-J Zhang. Face Recognition using Laplacianfaces.
IEEE Trans. PAMI, 27(3):328–340, March 2005.
[84] P. J Huber. Robust Statistics. New York: John Wiley, 1981.
[85] H. B Nielsen. Computing a Minimizer of a Piecewise Quadratic - Implementation.
Technical report, Informatics and Mathematical Modelling, Technical University of
Denmark, DTU, Sep. 1998.
[86] F. R Hampel, E. M Ronchetti, P. J Rousseeuw, and W. A Stahel. Robust Statistics:
The Approach Based on Influence Functions. John Wiley & Sons, 1986, 2005.
[87] K Madsen and H. B Nielsen. Finite Algorithms for Robust Linear Regression. BIT
Computer Science and Numerical Mathematics, 30(4):682–699, 1990.
[88] R. A Maronna, R. D Martin, and V. J Yohai. Robust Statistics: Theory and Methods.
Wiley, 2006.
[89] P. J Rousseeuw and A. M Leroy. Robust Regression and Outlier Detection. Wiley,
2003.
[90] T Sim, S Baker, and M Bsat. The CMU Pose, Illumination and Expression (PIE)
Database of Human Faces. Technical Report CMU-RI-TR-01-02, Robotics Insti-
tute, Carnegie Mellon University, January 2001.
[91] H. F Chen, P. N Belhumeur, and D. J Kriegman. In Search of Illumination Invariants.
In IEEE Conf. Computer Vision and Pattern Recognition, volume 1, pages 13–15,
2000.
[92] L Zhang and D Samaras. Face Recognition under Variable Lighting using Har-
monic Image Exemplars. In IEEE Conf. Computer Vision and Pattern Recognition,
volume 1, pages 19–25, 2003.
[93] J Zhao, Y Su, D Wang, and S Luo. Illumination Ratio Images: Synthesizing and
Recognition with Varying Illumination. Pattern Recognition Letters, 24:2703–2710,
2003.
[94] S Shan, W Gao, B Cao, and D Zhao. Illumination Normalization for Robust Face
Recognition against Varying Lighting Conditions. In IEEE Workshop on AMFG,
pages 157–164, 2003.
[95] K. C Lee, J Ho, and D. J Kriegman. Acquiring Linear subspaces for Face Recognition
under Variable Lighting. IEEE Trans. PAMI, 27(5):684–698, May 2005.
[96] H Wang, Stan Z. Li, and Y Wang. Generalized Quotient Image. In IEEE Conf. on
Computer Vision and Pattern Recognition, 2004.
[97] Weiwei Yu, Xiaolong Teng, and Chongqing Liu. Face Recognition using Discriminant
Locality Preserving Projections. Image Vision Computing, 24:239–248, 2006.
[98] Xudong Xie and Kin-Man Lam. Face Recognition under Varying Illumination based
on a 2D Face Shape Model. Pattern Recognition, 38, 2005.
[99] A Martinez. Recognizing Imprecisely Localized, Partially Occluded, and Expression
Variant Faces from a Single Sample per Class. IEEE Trans. PAMI, 24(6):748–763,
June 2002.
[100] Wenyi Zhao and Rama Chellappa, editors. Face Processing: Advanced Modelling
and Methods. Academic Press, 2006.
[101] R. C Gonzalez and R. E Woods. Digital Image Processing. Pearson Prentice Hall,
2007.
[102] Linda G Shapiro and George C Stockman. Computer Vision. Prentice Hall, 2001.
[103] Charles Boncelet. Image Noise Models. Chapter in Handbook of Image and Video
Processing, 2005.
[104] Junichi Nakamura. Image Sensors and Signal Processing for Digital Still Cameras.
CRC Press, 2005.
[105] D. A Reynolds. Speaker identification and verification using Gaussian mixture
speaker models. Speech Communication, 17(1-2):91–108, August 1995.
[106] D. A Reynolds, T. F Quatieri, and R. B Dunn. Speaker verification using adapted
Gaussian mixture models. Digital Signal Processing, 10(1-3), 2000.
[107] R Kuhn, J-C Junqua, P Nguyen, and N Niedzielski. Rapid speaker adaptation in
Eigenvoice space. IEEE Trans. on Speech and Audio Processing, 8(6):695–706, Nov
2000.
[108] R Kuhn, P Nguyen, J-C Junqua, L Goldwasser, N Niedzielski, S Fincke, K Field,
and M Contolini. Eigenvoices for speaker adaptation. ICSLP, pages 1771–1774,
1998.
[109] S. M Chu, H Tang, and T. S Huang. Fishervoice and semi-supervised speaker
clustering. ICASSP, pages 4089–4092, 2009.
[110] J Garofolo, L Lamel, W Fisher, J Fiscus, D Pallett, and N Dahlgren. DARPA
TIMIT: Acoustic-Phonetic Continuous Speech Corpus CD-ROM. LDC catalog number
LDC93S1, 1993.
[111] E Candes. Compressive Sampling. In International Congress of Mathematicians,
2006.
[112] S Young, D Kershaw, J Odell, D Ollason, V Valtchev, and P Woodland. Hidden
Markov Model Toolkit (HTK) Version 3.4 User Guide. 2002.
[113] S Canu, Y Grandvalet, V Guigue, and A Rakotomamonjy. SVM and kernel meth-
ods Matlab toolbox. Perception Systemes et Information, INSA de Rouen, Rouen,
France, 2005.
[114] Imran Naseem, Roberto Togneri, and Mohammed Bennamoun. Sparse Represen-
tation for Speaker Identification. International Conference on Pattern Recognition,
2010.