FACE RECOGNITION FROM SURVEILLANCE-QUALITY VIDEO
A Dissertation
Submitted to the Graduate School
of the University of Notre Dame
in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
by
Deborah Thomas
Kevin W. Bowyer, Co-Director
Patrick J. Flynn, Co-Director
Graduate Program in Computer Science and Engineering
Notre Dame, Indiana
July 2010
© Copyright by Deborah Thomas
2010
All Rights Reserved
FACE RECOGNITION FROM SURVEILLANCE-QUALITY VIDEO
Abstract
by
Deborah Thomas
In this dissertation, we develop techniques for face recognition from surveillance-
quality video. We handle two specific problems that are characteristic of such
video, namely uncontrolled face pose changes and poor illumination. We conduct
a study that compares face recognition performance using two different types of
probe data and acquiring data in two different conditions. We describe approaches
to evaluate the face detections found in the video sequence to reduce the probe
images to those that contain true detections. We also augment the gallery set us-
ing synthetic poses generated using 3D morphable models. We show that we can
exploit temporal continuity of video data to improve the reliability of the matching
scores across probe frames. Reflected images are used to handle variable illumi-
nation conditions to improve recognition over the original images. While there
remains room for improvement in the area of face recognition from poor-quality
video, we have shown some techniques that help performance significantly.
CONTENTS

FIGURES

TABLES

CHAPTER 1: INTRODUCTION
1.1 Description of surveillance-quality video
1.2 Overview of our work
1.3 Organization of the dissertation

CHAPTER 2: PREVIOUS WORK
2.1 Current evaluations
2.2 Pose handling
2.3 Illumination handling
2.4 Other issues
2.5 How this dissertation relates to prior work

CHAPTER 3: EXPERIMENTAL SETUP
3.1 Sensors
3.1.1 Nikon D80
3.1.2 Surveillance camera installed by NDSP
3.1.3 Sony IPELA camera
3.1.4 Sony HDR Camcorder
3.2 Dataset
3.2.1 NDSP dataset
3.2.2 IPELA dataset
3.2.3 Comparison dataset
3.3 Software
3.3.1 FaceGen Modeller 3.2
3.3.2 IdentityEXPLORER
3.3.3 Neurotechnologija
3.3.4 PittPatt
3.3.5 CSU's preprocessing and PCA software
3.4 Performance metrics
3.4.1 Rank one recognition rate
3.4.2 Equal error rate
3.5 Conclusions

CHAPTER 4: A STUDY: COMPARING RECOGNITION PERFORMANCE WHEN USING POOR QUALITY DATA
4.1 NDSP dataset: Baseline performance
4.1.1 Experiments
4.1.2 Results
4.2 Comparison dataset
4.2.1 Experiments
4.2.2 Results
4.3 Conclusions

CHAPTER 5: HANDLING POSE VARIATION IN SURVEILLANCE DATA
5.1 Pose handling: Enhanced gallery for multiple poses
5.2 Score-level fusion for improved recognition
5.2.1 Description of fusion techniques
5.3 Experiments
5.4 Results
5.4.1 NDSP dataset
5.4.2 IPELA dataset
5.5 Conclusions

CHAPTER 6: HANDLING VARIABLE ILLUMINATION IN SURVEILLANCE DATA
6.1 Acquisition setup
6.2 Reflecting images to handle uneven illumination
6.2.1 Averaging images
6.3 Comparison approaches
6.4 Experiments
6.4.1 Test dataset
6.4.2 Face detection
6.4.3 Experiments
6.5 Results
6.6 Conclusions

CHAPTER 7: OTHER EXPERIMENTS
7.1 Face detection evaluation
7.1.1 Background subtraction
7.1.2 Approach to pick good frames: Gestalt clusters
7.1.3 Results: Comparing performance on entire dataset and datasets pruned using background subtraction and gestalt clustering
7.2 Distance metrics and number of eigenvectors dropped
7.2.1 Experiments
7.2.2 Results

CHAPTER 8: CONCLUSIONS

APPENDIX A: GLOSSARY

APPENDIX B: POSE RESULTS

APPENDIX C: ILLUMINATION RESULTS

BIBLIOGRAPHY
FIGURES

1.1 Example showing the problem of variable illumination
1.2 Example showing the variable pose in two frames of a video clip
1.3 Example showing the low resolution of the face in the frame, when the subject is too far from the camera
1.4 Example showing the face to be out of view of the camera
3.1 Camera to capture gallery data: Nikon D80
3.2 Surveillance camera: NDSP camera
3.3 Surveillance camera: Sony IPELA camera
3.4 High-definition camcorder: Sony HDR-HC7
3.5 Gallery image acquisition setup
3.6 Example frames from the NDSP camera
3.7 Example frames from the IPELA camera
3.8 Example frames from IPELA camcorder for the Comparison dataset
3.9 Example frames from the Sony HDR-HC7 camcorder
3.10 FaceGen Modeller 3.2 Interface
3.11 Example of CMC curve
3.12 Example of ROC curve
4.1 Baseline performance for the NDSP dataset
4.2 Detections on surveillance video data acquired indoors
4.3 Detections on surveillance video data acquired outdoors
4.4 Detections on high-definition video data acquired indoors
4.5 Detections on high-definition video data acquired outdoors
4.6 Results: ROC curve comparing performance when using high-definition and surveillance data (Indoor video)
4.7 Results: ROC curve comparing performance when using high-definition and surveillance data (Outdoor video)
4.8 Results: CMC curve comparing performance when using high-definition and surveillance data (Indoor video)
4.9 Results: CMC curve comparing performance when using high-definition and surveillance data (Outdoor video)
5.1 Frames showing the variable pose seen in a video clip (the black dots mark the detected eye locations)
5.2 Synthetic gallery poses
5.3 Change in rank matrix for a new incoming image
5.4 Results: Comparing rank one recognition rates when adding poses of increasing degrees of off-angle poses
5.5 Results: Comparing rank one recognition rates when using frontal, +/-6 degree and +/-24 degree poses
5.6 Results: Comparing rank one recognition rate when using fusion techniques to improve recognition
5.7 Examples of poorly performing images
6.1 Setup to acquire probe data and resulting illumination variation on the face
6.2 Comparison of gallery and probe images
6.3 Reflection algorithm
6.4 Example images: original image, reflected left and reflected right
6.5 Average intensity of each column
6.6 Reflection algorithm
6.7 Example images: original image and averaged image
6.8 Example images: original image and quotient image
7.1 Example eye detections
7.2 Structuring element used for erosion and dilation
7.3 Example subject: Ground truth and Viisage locations
7.4 Results: Rank one recognition rates when using the entire dataset
7.5 Results: Rank one recognition rates when using the dataset after background subtraction
7.6 Results: Rank one recognition rates when using the dataset after background subtraction and gestalt clustering
B.1 CMC curves: Comparing fusion techniques using a single frame
B.2 ROC curves: Comparing fusion techniques using a single frame
B.3 CMC curves: Comparing approaches exploiting temporal continuity, using rank-based fusion
B.4 ROC curves: Comparing fusion techniques exploiting temporal continuity, using rank-based fusion
B.5 CMC curves: Comparing fusion techniques exploiting temporal continuity, using score-based fusion
B.6 ROC curves: Comparing fusion techniques exploiting temporal continuity, using score-based fusion
C.1 CMC curves: Comparing illumination approaches using a single frame
C.2 ROC curves: Comparing illumination approaches using a single frame
C.3 CMC curves: Comparing illumination approaches exploiting temporal continuity
C.4 ROC curves: Comparing illumination approaches exploiting temporal continuity
TABLES

2.1 PREVIOUS WORK
3.1 FEATURES OF CAMERAS USED
3.2 SUMMARY OF DATASETS
4.1 COMPARISON DATASET RESULTS: DETECTIONS IN VIDEO USING PITTPATT
4.2 COMPARISON DATASET RESULTS: COMPARISON OF RECOGNITION RESULTS ACROSS CAMERAS USING PITTPATT
5.1 RESULTS: COMPARISON OF RECOGNITION PERFORMANCE USING FUSION
5.2 RESULTS: COMPARISON OF RECOGNITION PERFORMANCE USING FUSION ON THE IPELA DATASET
6.1 RESULTS: COMPARING RESULTS FOR DIFFERENT ILLUMINATION TECHNIQUES
7.1 COMPARING DATASET SIZE OF ALL IMAGES TO BACKGROUND SUBTRACTION APPROACH AND GESTALT CLUSTERING APPROACH
7.2 RESULTS: PERFORMANCE WHEN VARYING DISTANCE METRICS AND NUMBER OF EIGENVECTORS DROPPED
CHAPTER 1
INTRODUCTION
Face recognition from video is an important area of biometrics research today.
Most of the existing work focuses on recognition from video where the images are
of high resolution, containing faces in a frontal pose, and where the lighting conditions are optimal. However, face recognition from video surveillance has become
an increasingly important goal as more and more video surveillance cameras are
installed in public places. For example, the Metropolitan Police Department has
installed 14 pan-tilt-zoom (PTZ) cameras around the Washington, D.C. area
[12]. Also, there are 2,397 cameras installed in Manhattan [30]. Face recognition using such video is a very challenging problem because of its low resolution, its poor lighting conditions, and the presence of uncontrolled movement. In
this dissertation, we focus on recognition in the presence of uncontrolled pose and
lighting in probe data.
1.1 Description of surveillance-quality video
We describe surveillance-quality video based on four different features. The
four characteristics are (1) variable illumination, (2) variable pose of the subjects
in the video, (3) the low resolution of the faces in the video and (4) obstructions
of the faces in the video.
Figure 1.1. Example showing the problem of variable illumination
First, such video is affected by variable illumination. Oftentimes, surveillance cameras are pointed toward doorways where the sun is streaming in, or the
camera may be in a poorly-lit location. This can change the intensity of the image,
even causing different parts of the image to be illuminated differently, which can
cause problems for the recognition system. In Figure 1.1, we show an example
frame of such video affected by variable illumination.
The second feature of surveillance video is the variable pose of the subject in
the video. The subject is often not looking at the camera and the camera may
be mounted to the ceiling. Therefore, the subject may not be in a frontal pose in
the video. While a lot of work has been done using images where the subject is
looking directly at the camera, there is a need to explore recognition when the
subject is not looking at the camera. In Figure 1.2, we show two such examples.

Figure 1.2. Example showing the variable pose in two frames of a video clip

Another surveillance video characteristic is the low resolution of the face. Usually, such video is of low resolution and covers a large scene. Furthermore, the camera may be located far from the subject. Hence, the subject's face may be small, causing the number of pixels on the subject's face to be low, making it difficult for robust face recognition. In Figure 1.3, we show an image where the subject is too far from the camera for reliable face recognition.
The last feature of surveillance-quality video is obstruction to the human face.
A perpetrator may be aware of the presence of a camera and try to cover their face
to prevent it from being captured. Hats, glasses and makeup can
also be used to change the appearance of the face to cause problems for recognition
systems. Sometimes, the positioning of the camera may cause the face to be out
of view of the camera frame as seen in Figure 1.4.
Figure 1.3. Example showing the low resolution of the face in the frame, when the subject is too far from the camera

Figure 1.4. Example showing the face to be out of view of the camera

1.2 Overview of our work

In this dissertation, we focus on variable pose and illumination. One theme that we exploit throughout this dissertation is temporal continuity in the surveillance video. One feature of video data that still images lack is the temporal continuity between the frames of the data. The identity of the subject will not
change in an instant, so the multiple frames available can be used for recogni-
tion. The matching scores between a pair of probe and gallery subjects can be
made more robust by using decisions about a previous frame for the current one.
First, we compare recognition performance when using surveillance video to per-
formance when using high-resolution video in our probe dataset. We also devise a
technique to evaluate the face detections to prune the dataset to true detections,
to improve recognition performance. We use a multi-gallery approach to make the
recognition system more robust to variable pose in the data. We generate these
poses using synthetic morphable models. We then create reflected images in order
to mitigate the effects of variable illumination.
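As a rough illustration of the reflection idea (this is a simplified sketch, not the exact procedure of Chapter 6: it assumes a grayscale face crop that is roughly centered and symmetric about its vertical midline, and ignores alignment details), a "reflected left" image can be built entirely from the left half of the face mirrored onto the right, and a "reflected right" image from the right half, so that a recognizer can work from whichever side is better lit:

```python
import numpy as np

def reflected_versions(face):
    """Given a grayscale face image (H x W array) roughly centered on its
    vertical midline, build two synthetic images: one using only the left
    half mirrored onto the right, and one using only the right half
    mirrored onto the left. For odd widths the center column is dropped."""
    h, w = face.shape
    half = w // 2
    left = face[:, :half]            # left half of the face
    right = face[:, w - half:]       # right half, same width as the left

    reflected_left = np.hstack([left, left[:, ::-1]])     # left + its mirror
    reflected_right = np.hstack([right[:, ::-1], right])  # mirror + right
    return reflected_left, reflected_right
```

A simple policy is then to keep the version built from the brighter half; the experiments in Chapter 6 compare recognition using reflected images against the original images.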
By combining these techniques we show that we can handle some of the issues
of variable pose and illumination in surveillance data and improve recognition over
baseline performance.
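The temporal-continuity principle, that the subject's identity cannot change from one frame to the next, can be illustrated with a simple running accumulation of match scores. This is only a sketch of the general idea with made-up gallery identifiers; the rank-based and score-based fusion techniques actually evaluated are described in Chapter 5:

```python
def fuse_scores_over_time(frame_scores):
    """frame_scores: a list with one dict per probe frame, each mapping a
    gallery subject id to that frame's similarity score. Returns the
    average score per gallery subject, so the decision for a clip draws
    on every frame rather than on any single, possibly bad, frame."""
    totals = {}
    for scores in frame_scores:
        for gallery_id, score in scores.items():
            totals[gallery_id] = totals.get(gallery_id, 0.0) + score
    n_frames = len(frame_scores)
    return {gid: total / n_frames for gid, total in totals.items()}
```

The clip is then labeled with the gallery subject whose fused score is highest, which damps the effect of individual frames where detection or matching failed.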
1.3 Organization of the dissertation
The rest of the dissertation is organized as follows: Chapter 2 describes previ-
ous work done in the area. In Chapter 3, we describe the sensors and dataset and
the software used in our experiments. We study the effect of poor quality video on
recognition in Chapter 4. Chapters 5 through 7 describe the work we have done
in this dissertation. Finally, we end with our conclusions in Chapter 8.
CHAPTER 2
PREVIOUS WORK
In this chapter, we describe previous work that looks at face recognition from
unconstrained video. We first describe three studies that explore face recognition
from video. We look at two problems, namely uncontrolled pose and poor lighting
conditions. We then describe different approaches that have been used to handle
both of these problems.
2.1 Current evaluations
Three different studies that address face recognition from video are: the FRVT
2002 Evaluation, FRVT 2006 Evaluation and the Foto-Fahndung report. The
FRVT 2002 report describes the use of three-dimensional morphable models with
video and documents the benefits of using them for face recognition. FRVT 2006
reports on face recognition performance under controlled and uncontrolled light-
ing. The Foto-Fahndung report describes the face recognition performance of
three different pieces of software, when the data comes from video acquired by
a camera looking at an escalator in a German train station. We describe these
studies in further detail below.
In the FRVT 2002 Evaluation Report [32], face recognition experiments are
conducted in three new areas (three-dimensional morphable models, normalization
and face recognition from video). The first experiment compares face recognition
performance when using still images in the probe set to using 100 frames from a
video sequence, while the subject is talking with varied expression. The video is
similar to that of a mugshot with the added component of change in expression.
The gallery is a set of still images. Among all the participants in FRVT 2002,
except for DreamMIRH and VisionSphere, recognition performance is better when
using a still image rather than when using a video sequence. They observe that
if the subject were walking toward the camera, there would be a change in size
and orientation of the face that would be a further challenge to the system. In
this work, we focus on uncontrolled video, where data is captured using a surveillance camera in uncontrolled lighting conditions; hence, performance is expected to be poor. They also conclude that 3D morphable models provide only slight
improvement over 2D images.
The FRVT 2006 Evaluation Report [34] compared face recognition when using 2D and 3D data. It also explored face recognition when using controlled and uncontrolled lighting. When using 3D data, the algorithms were able
to meet the FRGC [33] goal of an improvement of an order of magnitude over
FRVT 2002. To test the effect of lighting, the gallery data was captured in a
controlled environment, whereas the probe data was captured in an uncontrolled
lighting environment (either indoors or outdoors). Cognitec, Neven Vision, SAIT
and Viisage outperformed the best FRGC results achieved, with SAIT having a
false reject rate between 0.103 and 0.130 at a false accept rate of 0.001. The per-
formance of FRVT participants when using uncontrolled probe data matches that
of the FRVT participants of 2002 when using controlled data. However, they also
show that illumination condition does have a huge effect on performance.
The Foto-Fahndung report [9] evaluates performance of three recognition sys-
tems when the data comes from a surveillance system in a German railway station.
They report recognition performance in four distinct conditions based on light-
ing and movement of the subjects and show that while face recognition systems
can be used in search scenarios, environmental conditions such as lighting and
quick movements influence performance greatly. They conclude that it is possible
to recognize people from video, provided the external conditions are right, espe-
cially lighting. They also state that high recognition performance can be achieved
indoors, where the light does not change much. However, drastic changes in
lighting conditions affect performance greatly. They state that "High recognition performance can be expected in indoor areas which have non-varying light conditions. Varying light conditions (darkness, black light, direct sunlight) cause a sharp decrease in recognition performance. A successful utilization of biometric face recognition systems in outdoor areas does not seem to be very promising for search purposes at the moment" [9]. They suggest the use of 3D face recognition
technology as a way to improve performance.
2.2 Pose handling
Zhou and Chellappa [51] state that researchers have handled rotation problems
in three ways: (1) using multiple images per person when they are available, (2) using multiple training images but only one database image per subject when running recognition, and (3) using a single image per subject where no training is required.
Zhou et al. [54] apply a condensation approach to solve numerically the prob-
lem of face recognition from video. They point out that most surveillance video
is of poor quality and low image resolution and has large illumination and pose
variations. They believe that the posterior probability of the identity of a sub-
ject varies over time. They use a condensation algorithm that determines the
transformation of the kinematics in the sequence and the identity simultaneously,
incorporating two conditions into their model: (1) motion in a short time interval
depends on the previous interval, along with noise that is time-invariant and (2)
the identity of the subject in a sequence does not change over time. When they use
a gallery of 12 still images and 12 video sequences as probes, they achieve 100%
rank one recognition rate. However, the small size of the dataset may contribute
to the high accuracy.
In a later work, they extend this approach to apply to scenarios where the
illumination of probe videos is different from that of the gallery [21], which is also
made up of video clips. Each subject is represented as a set of exemplars from
a video sequence. They use a probabilistic approach to determine the set of the
images that minimizes the expected distance to a set of exemplar clusters and
assume that in a given clip, the identity of the subject does not change; Bayesian probabilities are then used over time to determine the identity of the faces in the frames.
A set of four clips of 24 subjects each walking on a treadmill is used for testing.
The background is plain and each clip is 300 frames long. They achieve 100%
rank one recognition rate on all four combinations of clips as probe and gallery.
Chellappa et al. build on this in [52]. They incorporate temporal information
in face recognition. They create a model that consists of a state equation, an
identity equation (containing information about the temporal change of the iden-
tity) and an observation equation. Using a set of four video clips with 25 subjects
walking on a treadmill, (from the MoBo [16] database), they train their model
on one or two clips per subject and use the remaining for testing. They are able
to achieve close to 100% rank one recognition rate overall. They expand on this
work [53] to incorporate both changes in pose within a video sequence and the
illumination change between a gallery and probe. They combine their likelihood
probability between frames over time which improves performance overall. In a
set of 30 subjects, where the gallery set consists of still images, they achieved 93%
rank one recognition rate.
Park and Jain [31] use a view synthesis strategy for face recognition from
surveillance video, where the poses are mainly non-frontal and the size of the faces
is small. They use frontal pose images for their gallery, whereas the probe data
contains variable pose. They propose a factorization method that develops 3D
face models from 2D images using Structure from Motion (SfM). They select a
video frame in which the pose of the face is the closest to a frontal pose, as a
texture model for the 3D face reconstruction. They then use a gradient descent
method to iteratively fit the 3D shape to the 72 feature points on the 2D image.
On a set of 197 subjects, they are able to demonstrate a 40% increase in rank one
recognition performance (from 30% to 70%).
Blanz and Vetter [10] describe a method to fit an image to a 3D morphable
model to handle pose changes for face recognition. Using a single image of a per-
son, they automatically estimate 3D shape, texture and illumination. They use
intrinsic characteristics of the face that are independent of the external conditions
to represent each face. In order to create the 3D morphable model, they use a
database of 3D laser scans that contains 200 subjects from a range of demograph-
ics. They build a dense point-to-point correspondence between the face model and
a new face using optical flow. Each face is fit to the 3D shape using seven facial
feature points (tip of nose, corners of eyes, etc.). They try to minimize the sum of
squared differences over all color channels from all pixels in the test image to all
pixels in the synthetic reconstruction. On a set of 68 subjects of the PIE database
[40], they achieve 95% rank one recognition rate when using the side view gallery.
Using the FERET set, with 194 subjects, they achieve 96% rank one recognition
when using the frontal images as gallery and the remaining images as probes.
Huang et al. [48] use 3D morphable models to handle pose and illumination
changes in face video. They create 3D face models based on three training images
per subject and then render 2D synthetic images to be used for face recognition.
They apply a component-based approach for face detection that uses 14 independent component classifiers. The faces are rotated from 0 to 34 degrees in increments of 2 degrees, using two different illuminations. At each instance, an image is saved. Out of
the 14 components detected, nine are used for face recognition. The recognition
system consists of second degree polynomial Support Vector Machine classifiers.
When they use 200 images of six different subjects, they get a true accept rate of
90% at a false accept rate of 10%.
Beymer [8] uses a template based approach to represent subjects in the gallery
when there are pose changes in the data. He first applies a pose estimator based
on the features of the face (eyes and mouth). Then, using the nose and the eyes,
the recognition system applies a transform to the input image to align the three
feature points with a training image. When using 930 images for training the
detector and 520 images for testing, the features are correctly detected 99.6% of
the time. For recognition, a feature-level set of systems is used for each eye, nose
and mouth. The probe images are compared only to those gallery images closest
to its pose. Then he uses a sum of correlations of the best matching eye, nose and
mouth templates to determine the best match. On the set of 62 subjects, when
using 10 images per subject with an inter-ocular distance of about 60 pixels in the
images, the rank one recognition rate is 98.39%. However, this is a relatively large
inter-ocular distance for good face recognition and not usually typical of faces in
surveillance quality video.
Arandjelovic and Cipolla [3] deal with face movement and observe that most
strategies use the temporal information of the frames to determine identity. They
propose a strategy that uses Resistor Average Distance (RAD), which is a measure
of dissimilarity between two disjoint probabilities. They claim that PCA does not
capture true modes of variation well and hence a Kernel PCA is used to map the
data to a high-dimensional space. Then, PCA can be applied to find the true
variations in the data. For recognition, the RAD between the distributions of
sets of gallery and probe points is used as a measure of distance. They test their
approach on two databases. One database contains 35 subjects and the other
contains 60 subjects. In both datasets, the illumination conditions are the same
for training and testing. They achieve around 98% rank one recognition rate on
the larger dataset.
Thomas et al. [43] use synthetic poses and score-level fusion to improve recog-
nition when there is variable pose in the data. They show that recognition can
be improved by exploiting temporal continuity. The gallery dataset consists of
one high-quality still image per subject. Using the approach in [10] to generate
synthetic poses, the gallery set is enhanced with multiple images per subject. A
dataset of 57 subjects is used, which contains subjects walking around a corner in
a hallway. When they use the original gallery images and treat each probe image
as a single frame with no temporal continuity, they achieve a rank one recognition rate of 6%. However, by adding synthetic poses and exploiting temporal
continuity, they improved rank one recognition performance to 21%.
2.3 Illumination handling
Zhou et al. [55] separate the strategies to handle changes in illumination into
three categories. The first set of approaches is called subspace methods. These
approaches are most commonly used in recognition problems. Some common ex-
amples of this class of approaches are PCA [44] and LDA [50]. However, the
disadvantage of such techniques is that they are tuned to the illumination condi-
tions that they are trained on. When the gallery set consists of still images taken
indoors under controlled lighting conditions and the probe set is of surveillance
quality video acquired under uncontrolled lighting conditions, recognition perfor-
mance is poor. The second set of approaches is reflectance model methods. A
Lambertian reflectance model is used to model lighting. The disadvantage of this
approach is that it is less effective when the subjects in the testing
set are not encountered in the training set. The third set of approaches uses 3D
models for representation. These models are robust to illumination effects. How-
ever, they require a sensor that can capture such data or the data needs to be
built based on 2D images.
Adini et al. [2] describe four image representations that can be used to han-
dle illumination changes. They divide the approaches to handle illumination into
three categories: (1) gray-level information to extract a three-dimensional shape
of the object; (2) a stored model that is relatively insensitive to changes in illumi-
nation; and (3) a set of images of the same object under different illuminations.
The third approach may not be realistic given the experiment and the setup.
Furthermore, one may not be able to fully capture all the possible variations in
the data. While it has been shown theoretically that a function invariant to il-
lumination does not exist, there are representations that are more robust than
others [2]. The four representations they consider are (1) the original gray-level
image, (2) the edge map of the image, (3) the image filtered with 2D Gabor-like
filters and (4) the second-order derivative of the gray level image [2]. Some edges
of the image can be insensitive to illumination changes whereas others are not. However,
an edge map is useful in that it is a compact representation of the original im-
age. Derivatives of the gray level image are useful because while ambient light
will affect the gray level image, under certain conditions it does not affect the
derivatives. In order to make the images more robust, they divide the face into
two subparts by creating subregions of the eye area and the lower part of the
face. They show that in highly variable lighting, the error rate is 100% on raw
gray level images, where there are changes in illumination direction. Performance
improves when using the filtered images. They also show that even though the
filtered images do not resemble the original face, they encode information to im-
prove recognition. However, they conclude that no one representation is sufficient
to overcome variations in illumination. While some are robust to changes along
the horizontal axis, others are more robust along the vertical axis. Hence, the
different approaches need to be combined to exploit the benefits of each of them.
Zhao and Chellappa [51] use 3D models to handle the problems of illumina-
tion in face recognition. They create synthesized images acquired under different
lighting and viewing conditions. They develop a 2D prototype image from a 2D
image acquired under variable lighting using a generic 3D model, rather than a
full 3D approach that uses accurate 3D information. For the generic 3D model,
a laser-scanned range map is used. They use a Lambertian model to estimate the albedo value, which they determine using a self-ratio image, the illumination ratio of two differently aligned images. Using a 3D generic model, they
bypass the 2D to 3D step, since the pose is fixed in their dataset. When they test
their approach using the Yale database, on a set of 15 subjects with 4 images each
they obtain a 100% rank one recognition rate, an improvement of about 25 percentage points over using the original images (about 75% rank one recognition rate).
Wei and Lai [47] describe a robust technique for face recognition under vary-
ing lighting conditions. They use a relative image gradient feature to represent
the image, which is the image gradient function of the original intensity image,
where each pixel is scaled by the maximum intensity of its neighbors. They use
a normalized correlation of the gradient maps of the probe and gallery images to
determine how well the images match. On the CMU-PIE face database [40], which
contains 22 images under varying illuminations of 68 individuals, they obtain an
equal error rate of 1.47% and show that their approach outperforms recognition
when using the original intensity images.
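Following the description above, a relative gradient divides the gradient magnitude at each pixel by the maximum intensity in that pixel's neighborhood, so the feature is less sensitive to the overall lighting level. The numpy sketch below is a loose interpretation of that description (3x3 neighborhood, small constant to avoid division by zero), not Wei and Lai's exact formulation.

```python
import numpy as np

def relative_gradient(image, c=1.0):
    """Gradient magnitude scaled by the maximum intensity in a 3x3
    neighborhood, plus a small constant c to avoid division by zero."""
    img = np.asarray(image, dtype=float)
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    # local 3x3 maximum via shifted copies (pure numpy, no scipy)
    padded = np.pad(img, 1, mode='edge')
    local_max = np.max(
        [padded[i:i + img.shape[0], j:j + img.shape[1]]
         for i in range(3) for j in range(3)], axis=0)
    return mag / (local_max + c)

# a uniformly lit flat patch produces a zero relative gradient
flat = np.full((5, 5), 7.0)
rel_flat = relative_gradient(flat)
```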
Price and Gee [36] also propose a PCA-based approach to address three issues
that could cause problems in face recognition, namely illumination, expression and
decoration (specifically, glasses and facial hair). They use an LDA-based approach
to handle changes in illumination and expression. They note that subregions of
the face are less sensitive to expression and decoration than the full face. So they
break the face into modular subregions: the full face, the region of the eyes and
the nose and then just the eyes. For each region, they independently determine
the distance from that region to each of the corresponding images in the database.
Hence, they have a parallel system of observations, one for each region mentioned
above. They then use a combination of results as their matching score to determine
the best match. They use a database of 106 subjects with varied illumination,
expression and decoration, where 400 still images are used for training and 276
for testing. When they combine the results from the three observers, using PCA
and LDA, they achieve a rank one recognition rate of 94.2%.
Hiremath and Prabhakar [18] use interval-type discriminating features to gen-
erate illuminant invariant images. They create symbolic faces for each subject
in each illumination type based on the maximum and minimum value found at
each pixel for a given dataset. While this is an appearance-based approach, it
does not suffer the same drawbacks as other approaches because it uses interval
type features. Therefore, it is insensitive to the particular illumination conditions
in which the data is captured within the range of illuminations in the training
data. They then use Factorial Discriminant Analysis to find a suitable subspace
with optimal separation between the face classes. They test their approach using
the CMU PIE [40] database and get a 0% error rate. This approach is advanta-
geous in that it does not require a probability distribution of the image gradient.
Furthermore, it does not use any complex modeling of reflection components or
assume a Lambertian model. However, it is limited by the range of illuminations
found in the training data. Therefore, it may not be applicable in cases where
there is a difference in the illuminations between the gallery and probe sets.
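The symbolic-face idea of per-pixel intervals can be sketched directly in numpy: each subject is represented by the minimum and maximum value observed at every pixel across that subject's training images. The `interval_contains` score below is a hypothetical matcher added only for illustration; Hiremath and Prabhakar instead apply Factorial Discriminant Analysis to the interval features.

```python
import numpy as np

def symbolic_face(images):
    """Interval-type representation: per-pixel [min, max] over one
    subject's images under the illuminations seen in training."""
    stack = np.stack([np.asarray(im, dtype=float) for im in images])
    return stack.min(axis=0), stack.max(axis=0)

def interval_contains(symbolic, probe):
    """Fraction of probe pixels falling inside the subject's interval --
    a simple hypothetical match score for illustration only."""
    lo, hi = symbolic
    probe = np.asarray(probe, dtype=float)
    return float(np.mean((probe >= lo) & (probe <= hi)))

# two training images of the same 'subject' under different lighting
imgs = [np.full((2, 2), 10.0), np.full((2, 2), 30.0)]
lo, hi = symbolic_face(imgs)
score_in = interval_contains((lo, hi), np.full((2, 2), 20.0))
score_out = interval_contains((lo, hi), np.full((2, 2), 50.0))
```

A probe whose illumination lies within the training range scores highly, which mirrors the limitation noted above: illuminations outside the training range are not covered.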
Belhumeur et al. [6] use LDA to produce well-separated classes for robustness
to lighting direction and facial expression, and compare their approach to using
eigenfaces (PCA) for recognition. They conclude that LDA performs the best
when there are variations in lighting or even simultaneous changes in lighting
and expression. They also state that "in the [PCA] method, removing the first
three principal components results in better performance under variable lighting
conditions" [6]. Their experiments use the Harvard database [17] to test variation
in lighting. The Harvard database contains 330 images from 5 subjects (66 images
each). The images are divided into five subsets based on the direction of the light
source (0, 30, 45, 60, 75 degrees). The Yale database consists of 16 subjects with
10 images each taken on the same day but with variation in expression, eyewear
and lighting. They use a nearest neighbor classifier for matching, though the
measure used to determine distance was not specified. The variation in expression
and lighting is tested using a leave-one-out error estimation strategy on all 16
subjects. They train the space on nine of the images and then test it using the image left out, achieving a 0.6% recognition error rate using LDA and a 19.4% recognition error rate using PCA with the first three dimensions dropped. They
do mention that the databases are small and more experimentation using larger
databases is needed.
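The "drop the leading components" variant of PCA mentioned above is easy to express: compute the principal directions of the centered training data and simply skip the first few before projecting. This is a minimal numpy sketch of that idea, not Belhumeur et al.'s full pipeline (which also includes the nearest-neighbor classifier and the LDA alternative).

```python
import numpy as np

def pca_drop_leading(X, n_components, n_drop=3):
    """PCA that discards the first n_drop principal components, which are
    reported to capture illumination variation rather than identity.
    X is (n_samples, n_pixels); returns the projection and the basis."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[n_drop:n_drop + n_components]  # skip leading components
    return Xc @ basis.T, basis

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))  # stand-in for 20 face images of 50 pixels
proj, basis = pca_drop_leading(X, n_components=5)
```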
Arandjelovic and Cipolla [5] handle variation in illumination and pose using
clustering and Gamma intensity correction. They create three clusters per sub-
ject corresponding to different poses and use locations of pupils and nostrils to
distinguish between the three clusters. Illumination is handled using Gamma in-
tensity correction. Here, the pixels in each image are transformed so as to match
a canonically illuminated image. Pose and illumination are combined by per-
forming PCA on variations of each person's images under different illuminations
from a given person's mean image and using simple Euclidean distance as their
distance measure. In order to match subjects to a novel image, they use the ratio
of the probability that the three clusters belong to the same subject over the probability that they belong to a different subject. Their dataset consists of 20 subjects
for training and 40 others for testing, where each subject has 20-100 images in
random motion. They achieve 95% rank one recognition rate using this approach.
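Gamma intensity correction maps an input image onto a canonically illuminated one by finding the best exponent for the (normalized) pixel values. The sketch below does this by brute-force search over a grid of gamma values; the cited work's exact optimization may differ, so treat this as an illustrative stand-in.

```python
import numpy as np

def gamma_correct_to_canonical(image, canonical,
                               gammas=np.linspace(0.2, 5.0, 97)):
    """Search for the gamma exponent that best maps the normalized input
    image onto the canonically illuminated image, then apply it.
    Pixel values are assumed normalized to (0, 1]."""
    img = np.clip(np.asarray(image, dtype=float), 1e-6, 1.0)
    can = np.clip(np.asarray(canonical, dtype=float), 1e-6, 1.0)
    errors = [np.mean((img ** g - can) ** 2) for g in gammas]
    best = float(gammas[int(np.argmin(errors))])
    return img ** best, best

# synthetic check: an image that is the canonical one raised to 1/2
# should be recovered by gamma = 2
rng = np.random.default_rng(1)
canonical = rng.uniform(0.1, 0.9, (8, 8))
dark = canonical ** 0.5
corrected, gamma = gamma_correct_to_canonical(dark, canonical)
```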
Arandjelovic and Cipolla [4] evaluate strategies to achieve illumination in-
variance when there are large and unpredictable illumination changes. In these
situations, the difference between two images of the same subject under different
illuminations is larger than that of two images under the same illumination but
of different subjects. Hence, they focus on ways to represent the subject's face
and put more emphasis on the classification stage. They show that both the high
pass filter and the self quotient image operations on the original intensity image
show recognition improvement over the raw grayscale representation of the images,
when the imaging conditions between the gallery and probe set are very different.
However, they also note that while they improve recognition in the difficult cases,
they actually reduce performance in the easy cases. They conclude that the Laplacian of Gaussian representation of the image as described in [2] and a quotient
image representation perform better than using the raw image. They demonstrate
a rank one recognition rate improvement from about 75% using the raw images,
to 85% using the Laplacian of Gaussian representation, and to about 90% using quotient images. Since we are dealing with conditions which change drastically and
where the conditions for gallery and probe data differ, we use these approaches to
improve recognition in this work.
Gross and Brajovic [15] use an illuminance-reflectance model to generate im-
ages that are robust to illumination changes. Their model makes two assumptions: first, that "human vision is mostly sensitive to scene reflectance and mostly insensitive to illumination conditions" and, second, that human vision responds to "local changes in contrast rather than to global brightness levels" [15]. Since they focus on pre-
processing the images based on the intensity, there is no training required. They
test their approach using the Yale database, which contains 10 subjects acquired
under 576 lighting conditions. When using PCA for recognition, they improve
the rank one recognition rate from 60% to 93%, when using reflectance images in-
stead of the original intensity images.
Wang et al. [46] expand on the approach in [15] and use self-quotient images
to handle illumination variation for face recognition. The Lambertian model
of an image can be separated into two parts, the intrinsic and extrinsic part. If
one can estimate the extrinsic part based on the lighting, it can be factored out
of the image to retain the intrinsic part for face recognition. The image is found
by using a smoothing kernel and dividing the image pixels by the smoothed result. Let F be the smoothing filter and I the original image; then the self-quotient image Q is defined as Q = I / F[I]. They demonstrate their approach on the Yale and PIE datasets
and show improvement over using the intensity images for recognition, from about
50% to about 95% rank one recognition rate.
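The self-quotient image Q = I / F[I] can be sketched in a few lines of numpy. Here F is a Gaussian blur built from separable 1D convolutions, with a small epsilon to avoid division by zero; the smoothing filter and its parameters are a plausible choice, not necessarily those of [46].

```python
import numpy as np

def self_quotient_image(image, sigma=3.0, eps=1e-6):
    """Self-quotient image Q = I / F[I], where F is a Gaussian smoothing
    filter applied as two separable 1D convolutions."""
    img = np.asarray(image, dtype=float)
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()  # normalize so flat regions map to Q ~ 1
    pad = np.pad(img, radius, mode='edge')
    # smooth rows, then columns; 'valid' trims back to the original size
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, k, mode='valid'), 1, pad)
    smooth = np.apply_along_axis(
        lambda c: np.convolve(c, k, mode='valid'), 0, rows)
    return img / (smooth + eps)

# a uniformly lit patch yields Q ~ 1 everywhere: the lighting component
# divides out and only local structure remains
flat = np.full((10, 10), 100.0)
q = self_quotient_image(flat)
```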
Nishiyama et al. [25] show that self-quotient images [46] are insufficient to han-
dle partial cast shadows or partial specular reflection. They handle this weakness
by using an appearance-based quotient image. They use photometric linearization
to transform the image into the diffuse reflection. A linearized image is defined as
a linear combination of three basis images. In order to generate the basis images
to find the diffuse image, different images from other subjects are used. They
acquired images under fixed pose with a moving light source. The reflectance
image is then factored out using the estimated diffuse image. They compare their algorithm to the self-quotient image and the quotient image on the Yale B database and show that they achieve a rank one recognition rate of 96%, whereas self-quotient images achieve 87% and Support Retinex images [37] achieve 93%.
2.4 Other issues
Howell and Buxton [19] propose a strategy for face recognition when using
low-resolution video. Their goal is to capture similarities of the face over a wide
range of conditions and solve the problem for just a small group (less than 100)
of subjects. The environment is unconstrained in that there are no restrictions on
movement. They use the temporal information of the frames linked by movement
information to match the frames. This allows them to make the assumption
that between two consecutive frames, the identity of the subject will not change
instantly. They use a two-layer, hybrid learning network with a supervised and
unsupervised layer and adjust weights using the Widrow-Hoff delta learning rule.
The network is trained to include the variation that they want their system to
tolerate. From a set of 400 images of 40 people, using 5 images per subject, and
discarding frames that do not include a face, they are able to achieve 95% rank
one recognition rate.
Lee et al. [22] discuss an approach to handle low resolution video using support
vector data description (SVDD). They project the input images as feature vectors
on the spherical boundary of the feature space and conduct face recognition using
correlation on the images normalized based on the inter-ocular distance. They use
the Asian Face database for their experiments and different resolutions, ranging
from 16 x 16 pixels to 128 x 128 pixels and achieve a rank one recognition rate of
92% when using the lowest resolution images.
Lin et al. [23] describe an approach to handle face recognition from low-resolution video like that found in surveillance. They use optical flow for registration to handle issues of "non-planarity, non-rigidity, self-occlusion and illumination and reflectance variation" [23]. For each image in the sequence, they interpolate
between the rows and columns to obtain an image that is twice the size of the
original image. They then compute optical flow between the current frame and the two previous and two next frames, and register the four adjacent images using displacements estimated by the optical flow. Then they compute the mean
using the registered images and the reference images. The final step is to apply a
deblurring Wiener deconvolution filter to the super-resolved image. They test
their approach on the CUAVE database, which contains 36 subjects. When they
reduce the images to 13x18 pixels, their approach (approximately 15% FRR at
1% FAR) performs slightly better than bilinear interpolated images and far out-
performs nearest neighbor interpolation. They expand on this work in [24] and
compare their approach to a hallucination approach (assumes a frontal view of
face and works well when faces are aligned exactly). They conclude that while
there is some improvement gained over using the lower-resolution images,
a fully automated recognition system is currently impractical, given the perfor-
mance. Hence, they relax their constraint to a rank ten match and can achieve
87.3% rank ten recognition rate on the XM2VTS dataset, which contains 295 subjects.
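The core of the pipeline above is: interpolate each frame up, register neighboring frames, and average them before deblurring. The numpy sketch below uses pixel replication for the interpolation step and assumes integer displacements are already known (in the cited work they are estimated by optical flow, and a Wiener deconvolution follows); both simplifications are ours.

```python
import numpy as np

def upsample2x(img):
    """Double the image size by pixel replication -- a crude stand-in for
    the row/column interpolation step of the cited pipeline."""
    img = np.asarray(img, dtype=float)
    return np.kron(img, np.ones((2, 2)))

def fuse_registered_frames(frames, shifts):
    """Average a reference frame with its neighbors, each registered by a
    known (dy, dx) integer displacement. Real systems would estimate
    these displacements with optical flow."""
    acc = np.zeros_like(np.asarray(frames[0], dtype=float))
    for frame, (dy, dx) in zip(frames, shifts):
        aligned = np.roll(np.roll(np.asarray(frame, dtype=float),
                                  dy, axis=0), dx, axis=1)
        acc += aligned
    return acc / len(frames)

# synthetic check: a frame shifted up by one row, when registered back,
# averages to the original content
rng = np.random.default_rng(2)
base = rng.uniform(size=(6, 6))
frames = [base, np.roll(base, -1, axis=0)]
fused = fuse_registered_frames(frames, shifts=[(0, 0), (1, 0)])
up = upsample2x(base)
```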
In Table 2.1, we summarize the different approaches along with their assump-
tions, dataset size and performance. We divide up the works based on the problem
they are trying to solve: (1) variable pose, (2) variable illumination and (3) other
problems, such as low resolution on the face. Performance is reported in rank one
recognition rate, unless otherwise specified. Some of the results are reported in
terms of equal error rate (or EER). Also, the results must be viewed in light of
the difficulty of the dataset (data features) and dataset size.
TABLE 2.1

PREVIOUS WORK

Authors: Title | Basic idea | Data features | Dataset size | Performance

APPROACHES TO HANDLE VARIABLE POSE

Zhou: [54] | Posterior probability over time | Constrained video | 12 | 100%
Blanz and Vetter: [10] | 3D morphable model, point-to-point correspondence for similarity | Variable pose | 194 | 96%
Weyrauch, et al: [48] | 3D models based on 2D training images, render synthetic poses for recognition | 2 different illuminations, faces rotated | 6 | 90%
Park and Jain: [31] | View synthesis strategies, 3D face models using SfM | Nonfrontal faces | 193 | 70%
Beymer: [8] | Template-based approach, feature-level system combined using sum of correlations | Pose changes | 62 | 98.39%
Arandjelovic, Cipolla: [3] | Use Kernel PCA to reduce the dimensionality of images to nearly linear; apply RAD to calculate distances between two sets of images | Face movement; no illumination change | 36 | About 97%

APPROACHES TO HANDLE VARIABLE ILLUMINATION

Krueger, Zhou: [21] | Exemplar clusters to represent subjects, Bayesian probabilities over time to evaluate identity | Subject on treadmill, frontal video | 24 | 100%
Zhou, Chellappa: [52] | State of identity equation, temporal continuity | Subject on treadmill, frontal video | 25 | About 100%
Zhou, Chellappa: [53] | Likelihood probability between frames over time | Subject on treadmill, frontal video | 30 | 93%
Zhao, Chellappa: [51] | Synthetic images acquired under different lighting, use a Lambertian model to handle the albedo | Varied illumination | 15 | 100%
Wei and Lai: [47] | Modified image intensity function, normalized correlation for matching | Varying illuminations | 68 | 1.47% EER
Price and Gee: [36] | LDA-based approach using subregions of the face | Varied illumination, expression, decoration | 106 | 94.2%
Hiremath, Prabhakar: [18] | Symbolic interval type features to represent face classes, Factorial Discriminant Analysis to reduce dimensions | Different illuminations, still images | 68 | 0% EER
Belhumeur et al.: [6] | LDA for recognition, trained on multiple samples per subject with varying illumination | Variable lighting | 5 | 0.6% EER
Arandjelovic and Cipolla: [4] | Use image filters since the illumination is unpredictable | Variable lighting | 100 | 73.6%
Arandjelovic, Cipolla: [5] | Create Gaussian clusters corresponding to various poses; apply Gamma intensity correction; Euclidean distances to determine difference | Varied illumination | 60 | 95%

APPROACHES TO HANDLE LOW-RESOLUTION VIDEO

Howell, Buxton: [19] | Exploit temporal information of frames, two-layer hybrid learning network, adjusted using the Widrow-Hoff learning rule | Unconstrained environment and movement | 40 | 95%
Lee, et al: [22] | Support vector data description, correlation for recognition | Low resolution | — | 92%
Lin et al.: [23] | Use optical flow for registration, create single SR frame from 5 registered frames, use PCA with MahCosine distance for recognition | Subject talking | 36 | 15% FRR at 1% FAR
2.5 How this dissertation relates to prior work
In this dissertation, we focus on the variations in pose and uncontrolled light-
ing.
To handle the variations in pose, we use a multi-gallery approach to repre-
sent all the poses in the dataset. We create synthetic poses to represent those
that may be present in our probe set. We then use score-level fusion. This ap-
proach requires no training and thus is useful for datasets where the poses in the
probe and gallery sets differ. On a dataset of 57 subjects, we achieve a rank one
recognition rate of 21%, compared to the 6% rank one recognition rate achieved using the baseline approach. The baseline approach used for comparison
is described in Section 4.1. This work is described in Chapter 5.
To handle the lighting conditions, we use an appearance-based model, which
does not require any training data or knowledge of the model. We create reflected
images by using one half of the face and reflecting it over the other half. We
then use score-level fusion to combine the two sets of results. We demonstrate
an improvement relative to the self-quotient image and quotient image approach,
which assumes a Lambertian model. On a dataset of 26 subjects, we show a rank
one recognition rate of 49.88% and an equal error rate of 18.27%, whereas baseline performance using the original images is a 38.62% rank one recognition rate and
19.27% equal error rate. Here, baseline performance is the performance achieved
when using the original images as obtained from the surveillance cameras, with
no preprocessing. This work is described in Chapter 6.
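The reflected-image idea described above can be sketched in a few lines: take each half of the face and mirror it across the vertical midline, producing two images that are each lit the way one half was lit. This sketch assumes the face crop is already roughly centered; the alignment and fusion details of Chapter 6 are not reproduced here.

```python
import numpy as np

def reflected_images(face):
    """Create two reflected images from one face image: each output takes
    one half and mirrors it over the other half across the vertical
    midline, so each is uniformly lit like that half."""
    face = np.asarray(face, dtype=float)
    h, w = face.shape
    half = w // 2
    left, right = face[:, :half], face[:, w - half:]
    left_reflected = np.hstack([left, left[:, ::-1]])
    right_reflected = np.hstack([right[:, ::-1], right])
    return left_reflected, right_reflected

# tiny example face with an even width
face = np.arange(12, dtype=float).reshape(3, 4)
left_img, right_img = reflected_images(face)
```

Each reflected image is perfectly symmetric about the midline, which is what removes the one-sided lighting; matching then proceeds on both images and the scores are fused.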
CHAPTER 3
EXPERIMENTAL SETUP
In this chapter, we describe the sensors, data sets and software used in our
experiments. We acquire three different datasets for our experiments. We label
them the NDSP, IPELA and Comparison datasets. The first dataset is used to
show baseline performance and is used in our pose and face detection experiments.
The IPELA dataset is used for our reflection experiments to handle pose and
illumination variation. Finally, the Comparison dataset is used to compare face
recognition performance when using high-quality data acquired on a camcorder
and when using data acquired on a surveillance camera.
The rest of the chapter is organized as follows: Section 3.1 describes the differ-
ent sensors we use in our experiments. We then describe our datasets in Section
3.2. Finally, the software we use is described in Section 3.3.
3.1 Sensors
We capture data using four different sensors. The first camera is a Nikon D80,
used to acquire the gallery data used in the experiments. The second camera is
a PTZ camera installed by the Notre Dame Security Police. The third camera is
a Sony IPELA camera with PTZ capability. The fourth camera is a Sony HDR
camcorder used to capture data as a comparison to the surveillance-quality data.
We describe each in detail below.
Figure 3.1. Camera to capture gallery data: Nikon D80
3.1.1 Nikon D80
The gallery data is acquired on a Nikon D80 [28]. It is a digital single-lens
reflex (SLR) camera. The resolution of the images is 3872 x 2592 pixels. The
camera is shown in Figure 3.1.
3.1.2 Surveillance camera installed by NDSP
The probe data is acquired using a surveillance video camera with PTZ (pan,
tilt, zoom) capability. The camera is part of the NDSP security system and is
attached to the ceiling on the first floor of Fitzpatrick Hall, as seen in Figure
3.2. The resolution of this camera is 640 x 480 pixels. The data is captured in
interlaced mode.

Figure 3.2. Surveillance camera: NDSP camera
3.1.3 Sony IPELA camera
We also acquire data on a Sony SNC-RZ25N surveillance camera [41]. The resolution of this camera is 640 x 480 pixels and the data
is captured in interlaced mode. In Figure 3.3, we show an image of such a camera.
3.1.4 Sony HDR Camcorder
For our comparison dataset, we also acquire high quality data on a Sony HDR
camcorder [42]. The video was captured at a frame rate of 29.97 frames per second
in interlaced mode. In Figure 3.4, we show an image of this camcorder.
In Table 3.1, we compare all the cameras used in this dissertation.
Figure 3.3. Surveillance camera: Sony IPELA camera
Figure 3.4. High-definition camcorder: Sony HDR-HC7
TABLE 3.1
FEATURES OF CAMERAS USED
Features | Nikon D80 | NDSP camera | IPELA | Sony HD
Model | Nikon D80 | Not available | Sony SNC-RZ25N | Sony HDR-HC7
Resolution | 2592x3872 | 640x480 | 640x480 | 1920x1080
Image size | 3,732 kb | 40 kb | 52 kb | 466 kb
Interlaced | No | Yes | Yes | Yes
3.2 Dataset
We describe three datasets. They are named NDSP, IPELA and Comparison
dataset based on the camera used to acquire them and the experiments for which
they are used.
3.2.1 NDSP dataset
We use two kinds of sensors to acquire data for this dataset. The gallery
data containing high quality still images is acquired using the Nikon D80 camera.
The subject is sitting about two meters from the camera in a controlled well-lit
environment, in front of a gray background. The inter-ocular distance is about
230 pixels, ranging from 135 to 698 pixels. In Figure 3.5, we show
the set up and two of the images acquired for the gallery.
The probe data is acquired using the NDSP surveillance camera, located on
the first floor of Fitzpatrick Hall. The video consists of a subject entering through
(a) Acquisition setup for gallery data
(b) Example gallery images
Figure 3.5. Gallery image acquisition setup
a glass door, walking around the corner until they are out of the camera's view.
Each video sequence consists of between 50 and 150 frames. In Figure 3.6 we
show 10 frames acquired from this camera. We see that the illumination is highly
uneven due to the glare of the sun on the subject. The inter-ocular distance is
about 40 pixels on average. The pan, tilt and zoom are not changed during data
acquisition but could vary from day to day since the camera is part of a campus
security system. There are 57 subjects in this dataset. The time lapse between
the probe and gallery data varies from two weeks to about six months.
3.2.2 IPELA dataset
The gallery images are acquired using the Nikon D80, in a well-lit room under
controlled conditions. The subject is sitting about two meters from the camera in
front of a gray background, with a neutral expression, as seen in Figure 3.5. Since
these images are acquired indoors, the illumination is controlled. The inter-ocular
distance is about 300 pixels.
The probe data is acquired from the IPELA camera. The zoom on this camera
is set so that the inter-ocular distance of the subject is about 50 pixels, starting at
about 30 pixels when the subject enters the scene and is farthest from the camera
to about 115 pixels when the subject is closest to the camera. It is mounted on
a tripod set at a height of about five and a half feet. The camera position is not
changed during a day of capture, but may vary slightly from day to day. Each
clip consists of the subject walking around a corner until they are out of the view
of the camera. Therefore, we capture data of the subject in a variety of poses and
face sizes. Each video sequence is made up of 100 to 200 frames. In Figure 3.7,
we show 10 example frames acquired from one subject from one of the clips. The
illumination is also uncontrolled. This dataset consists of 104 subjects. The time lapse between the probe and gallery data is between two weeks and six months.

Figure 3.6: Example frames from the NDSP camera

Figure 3.7: Example frames from the IPELA camera
Splitting up the dataset: In order to test our approach when using the
surveillance data, we use four-fold cross validation. We split up the dataset into
four disjoint subsets, where each set contains 26 subjects. The sets are subject-
disjoint. For our experiments, we train the space on three subsets and test on
the remaining subset. We use the average of the four scores as our measure of
performance of the different approaches.
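The subject-disjoint four-fold split described above can be sketched directly: shuffle the subject identifiers once, cut them into four equal disjoint folds, train on three folds and test on the held-out one, and average the four results. The shuffling seed and fold construction below are our illustrative choices.

```python
import numpy as np

def subject_disjoint_folds(subject_ids, n_folds=4, seed=0):
    """Split subjects into n_folds disjoint subsets for cross validation.
    Each experiment trains on n_folds - 1 subsets and tests on the
    remaining one; the reported figure is the average over folds."""
    rng = np.random.default_rng(seed)
    ids = np.array(sorted(set(subject_ids)))
    rng.shuffle(ids)
    return [list(fold) for fold in np.array_split(ids, n_folds)]

# 104 subjects -> four disjoint folds of 26 subjects each
folds = subject_disjoint_folds(range(104))
```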
3.2.3 Comparison dataset
For each of the subjects in this dataset, we have one high quality still im-
age, one high-quality video sequence and one video clip acquired from the IPELA
surveillance camera. The gallery data is acquired in a well-lit room under con-
trolled conditions. The subject is sitting about 2 meters from the camera against
a gray background. We show an example image in Figure 3.5.
The IPELA camera and HD Sony camcorder are set up to acquire data in
the same setting. The zoom on the IPELA camera is set so that the inter-ocular
distance of the subject is about 40 pixels on average. It is mounted on a tripod
set at a height of about five and one half feet. The camera position is not changed
during a day of capture, but may vary slightly from day to day. In each clip, the
subject walks toward the left, picks up an object and then walks towards the right
of the frame. Therefore, we capture data of the subject in a variety of poses and
face sizes. We acquired data on three consecutive days. Each video sequence was
made of between 100 and 300 frames. We show examples of each of these images
in Figures 3.8 to 3.9.
The Sony HDR camcorder is also mounted on a tripod set at a height of about
five and a half feet and adjusted according to the height of the subject. This
is captured simultaneously with the surveillance video, and thus consists of the
subject walking to the left, picking up an object and then walking towards the
right of the frame. The interocular distance of this dataset is about 45 pixels with
a range of about 15 pixels to 110 pixels. In Figure 3.9, we show 10 example frames
acquired from one subject from one of the clips.
Figure 3.8: Example frames from the IPELA camera for the Comparison dataset

Figure 3.9: Example frames from the Sony HDR-HC7 camcorder
This dataset contains 176 subjects. Out of the 176 subjects, 78 are acquired
indoors in a hallway on the first floor of Fitzpatrick Hall. One half of the face
is partially lit by the sun. The remaining 98 subjects are acquired outdoors
in uncontrolled lighting conditions. We separate out these datasets to compare
recognition performance when using data acquired indoors rather than outdoors.
The probe and gallery data in this dataset are acquired on the same day. This
dataset partly overlaps with the data of the Multi - Biometric Grand Challenge
(or MBGC) dataset [29], but also includes surveillance data that is not part of
the MBGC dataset.
In Table 3.2, we summarize the details of the datasets we use in this disserta-
tion.
TABLE 3.2
SUMMARY OF DATASETS
Features | NDSP Dataset | IPELA Dataset | Comparison Dataset (IPELA) | Comparison Dataset (Sony HD)
Gallery data source | Nikon D80 | Nikon D80 | Nikon D80 | Nikon D80
Probe data source | NDSP-installed surveillance camera | Sony IPELA camera | Sony IPELA camera | Sony HD camcorder
Number of subjects | 57 | 104 | 176 | 176
Number of images per gallery subject | 1 | 1 | 1 | 1
Number of frames per probe subject | 50-150 frames | 100-300 frames | 100-300 frames | 300-450 frames
Acquisition environment of probe data | Fitzpatrick hallway | Fitzpatrick hallway | Indoor and outdoor | Indoor and outdoor
Activity | Subject enters through a glass door and walks around a corner | Subject walks around a corner and down a hallway and out of view of the camera | Subject picks up an object and walks out of camera view | Subject picks up an object and walks out of camera view
Time lapse between probe and gallery data | 2 weeks to 6 months | 2 weeks to 6 months | Same day | Same day
3.3 Software
We use a variety of software for our work: FaceGen Modeller 3.2, Viisage
IdentityEXPLORER, Neurotechnologija VeriLook, PittPatt and CSU's PCA code.
They are described in further detail below:
3.3.1 FaceGen Modeller 3.2
For each gallery image, we create a 3D model using the Nikon image as input
and then rotate the model to get different poses. In order to create the models, we
use the FaceGen Modeller 3.2 Free Version manufactured by Singular Inversions
[26]. The software is based on the work by Vetter et al. [10].
This modeler creates a 3D model from a still frontal image using the notion of
an average 3D face. It is trained on a set of subjects from demographic groups
varying in age, gender and ethnicity. It requires eleven points to be marked on
the face: the centers of the eyes, the edges of the nose, the corners of the mouth,
the chin, the points at which the jaw line visually meets the edge of the face
and the points at which the ears touch the face. Once the 3D model is rendered
using the still image,
different parameters, such as the gauntness of the cheeks and the jaw line, can be
tweaked to represent the particular subject in the 2D image more accurately. The
synthetic 3D face can then be rotated to get different views of the face. A
screenshot of the software is shown in Figure 3.10.
3.3.2 IdentityEXPLORER
Viisage manufactures an SDK for multi-biometric technology, called
IdentityEXPLORER. It provides packages for both face and fingerprint recognition.
It is based on Viisage's Flexible Template Matching technology and a new set of
Figure 3.10. FaceGen Modeller 3.2 Interface
powerful multiple biometric recognition algorithms, incorporating a unique
combination of biometric tools [45]. We use it for detection and recognition:
1. Detection: It gives the centers of the eyes and the mouth, with an associated
confidence measure in the face localization, ranging from 0.0 to 100.0.
2. Recognition: It takes two images and gives a matching score between the
faces in the two images. The scores range from 0.0 to 100.0, where a higher
score implies a better match.
3.3.3 Neurotechnologija
Neurotechnology [27] manufactures an SDK for face and fingerprint biometrics.
The face recognition package is called Neurotechnologija Verilook. It includes face
detection and face recognition capability. The face detection gives the eye and
mouth locations, and the recognition component gives the matching score between
the faces in two images.
3.3.4 PittPatt
PittPatt manufactures a face detection and recognition package [35] that we
use in our comparison experiments. The face detection component is robust to
illumination and pose changes in the data and to a variety of demographics. Along
with its detection capability it can determine the pose of the face. It is able to
capture small faces, such as faces with an inter-ocular distance of eight pixels. The
face recognition component is also robust to a variety of poses and expressions by
using statistical learning techniques. By combining face detection and tracking,
PittPatt can also be used to recognize humans across video sequences.
3.3.5 CSU's preprocessing and PCA software
In order to form a template image of the face that is found in the image, we
use CSU's preprocessing code [13]. We create images that are 65x75 pixels in
size, based on the eye locations found by Viisage, because the subject's face in
the surveillance video has an average inter-ocular distance of about 40 pixels.
In the normalization stage, the images are first centered, based on eye locations,
and the mean image of the set is subtracted from each image in the set.
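The normalization steps above can be sketched in a few lines. This is a simplified illustration of eye-based centering and mean subtraction, not the actual CSU code; the placement geometry, border handling and function names are our own assumptions.

```python
import numpy as np

def normalize_face(gray, left_eye, right_eye, out_w=65, out_h=75):
    """Center and crop a face to a fixed template based on eye locations.

    `gray` is a 2D grayscale array; eye coordinates are (x, y) tuples.
    The crop is clamped at the image border (a simplification).
    """
    # Place the midpoint between the eyes at a fixed position in the output.
    cx = (left_eye[0] + right_eye[0]) / 2.0
    cy = (left_eye[1] + right_eye[1]) / 2.0
    x0 = int(round(cx - out_w / 2.0))
    y0 = int(round(cy - out_h / 3.0))  # eyes sit in the upper third
    crop = np.zeros((out_h, out_w), dtype=float)
    src = gray[max(y0, 0):y0 + out_h, max(x0, 0):x0 + out_w]
    crop[:src.shape[0], :src.shape[1]] = src
    return crop

def subtract_mean(images):
    """Subtract the mean image of the set from each image in the set."""
    stack = np.stack(images).astype(float)
    return stack - stack.mean(axis=0)
```

Mean subtraction centers the data so that the subsequent PCA captures variation between faces rather than the average face itself.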
The CSU software also includes an implementation of Principal Component
Analysis for face recognition [44]. We use this software when using reflectance
images as input for recognition to handle illumination effects [15]. The basic PCA
algorithm is described by Turk and Pentland [44]. The process consists of two
parts, the offline training phase and the online recognition phase.
In the offline phase, the eigenspace is created. Each image is unraveled into
a vector and each vector becomes a column in an MxN matrix A, where N is the
number of images and M is the number of pixels. The covariance matrix Q is
then defined as the outer product Q = A A^T.
The next step is to calculate the eigenvalues and eigenvectors of the matrix Q
and then keep the k eigenvectors with the k largest eigenvalues (which correspond
to the dimensions of highest variation). This defines a k-dimensional eigenspace
into which new images can be projected. In the recognition phase, the normalized
images are projected along their eigenvectors into the k-dimensional face space
and the projected gallery image closest to a projected probe image is the best
match.
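The two phases above can be sketched compactly. This is a minimal eigenfaces-style illustration, assuming the images are already normalized, unraveled into columns and mean-subtracted; it uses a thin SVD in place of an explicit eigendecomposition of Q, which yields the same leading eigenvectors without forming the large MxM matrix.

```python
import numpy as np

def train_eigenspace(gallery, k):
    """Offline phase: build a k-dimensional eigenspace from gallery images.

    `gallery` is an M x N matrix whose N columns are unraveled,
    mean-subtracted images of M pixels each. The left singular vectors
    of A are the eigenvectors of Q = A A^T.
    """
    U, _, _ = np.linalg.svd(gallery, full_matrices=False)
    return U[:, :k]                # M x k basis of the face space

def recognize(eigvecs, gallery, probe):
    """Online phase: project and return the index of the closest gallery image."""
    g_proj = eigvecs.T @ gallery   # k x N gallery coordinates
    p_proj = eigvecs.T @ probe     # k-vector for the probe
    dists = np.linalg.norm(g_proj - p_proj[:, None], axis=0)
    return int(np.argmin(dists))
```

The SVD route matters in practice because M (pixels) is usually far larger than N (images), so Q itself would be too large to form explicitly.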
3.4 Performance metrics
In this dissertation, we use two different performance metrics to evaluate
recognition performance: the rank one recognition rate and the equal error rate.
They are shown graphically using cumulative match characteristic (CMC) curves
and receiver operating characteristic (ROC) curves, respectively. We describe
each metric in further detail below.
3.4.1 Rank one recognition rate
When a probe image is matched against a set of gallery images, the gallery
image that has the highest matching score to that probe image is considered its
rank one match. The rank one recognition rate is then defined as the fraction of
probe images whose rank one match is the true match.
A CMC curve plots the change in recognition rate as the rank of acceptance is
increased. The x-axis ranges from 1 through M, where M is the number of unique
gallery subjects, and the y-axis ranges from 0 to 100%. In Figure 3.11, we show
an example of such a curve. In this example, there are 26 subjects in the gallery
set.
Assume that there are n images in the probe set and m images in the gallery set.
Let p be the number of probe images for which the rank one match is the true
match. The rank one recognition rate R is then defined as in Equation 3.1.

R = (p / n) × 100    (3.1)
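As a concrete illustration, the rank one recognition rate can be computed directly from an all-pairs score matrix. This is a minimal sketch of the metric under the assumption of a similarity matrix where higher scores mean better matches; the function name and label arrays are our own, not part of any toolkit used in this work.

```python
import numpy as np

def rank_one_rate(scores, probe_labels, gallery_labels):
    """Rank one recognition rate R = (p / n) * 100, as a percentage.

    `scores[i, j]` is the matching score between probe image i and
    gallery image j, where a higher score means a better match.
    """
    scores = np.asarray(scores, dtype=float)
    best = np.argmax(scores, axis=1)  # rank one gallery match per probe
    p = sum(1 for i, j in enumerate(best)
            if gallery_labels[j] == probe_labels[i])
    return 100.0 * p / scores.shape[0]
```

For example, with three probes of which only one is matched correctly at rank one, R is about 33.3%.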
3.4.2 Equal error rate
Another metric is the equal error rate of the receiver operating characteristic
(ROC) curve. An ROC curve plots the false accept rate against the true accept
rate. The
Figure 3.11. Example of CMC curve
Figure 3.12. Example of ROC curve
rate at which the false accept rate equals the false reject rate (one minus the
true accept rate) is called the equal error rate.
At each point on an ROC curve, the threshold for accepting a match as true is
varied, tracing out the trade-off between the two rates. In Figure 3.12, we show
an example of such a curve.
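The equal error rate can be read off numerically without plotting the full curve. The sketch below is our own threshold sweep, assuming raw lists of genuine (same-subject) and impostor (different-subject) scores; it is not how any of the commercial SDKs described earlier report the value.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Approximate the EER from genuine and impostor match scores.

    Sweeps the acceptance threshold over all observed scores and returns
    the operating point where the false accept rate and false reject
    rate are closest. Higher scores are assumed to mean better matches.
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best = (1.0, 0.0)
    for t in thresholds:
        far = np.mean(impostor >= t)   # impostors wrongly accepted
        frr = np.mean(genuine < t)     # genuine pairs wrongly rejected
        if abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return (best[0] + best[1]) / 2.0
```

When the genuine and impostor score distributions are fully separated, the EER is zero; the more they overlap, the closer it climbs toward 0.5.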
3.5 Conclusions
In this chapter, we discussed the sensors and datasets used in our experiments.
We also described the software we used to support our work. Finally, we closed
with a discussion about the metrics used to evaluate performance.
CHAPTER 4
A STUDY: COMPARING RECOGNITION PERFORMANCE WHEN USING
POOR QUALITY DATA
The sensor used to capture data used for face recognition can affect recognition
performance. Low quality cameras are often used for surveillance, which can
result in poor recognition because of the poor video quality and low resolution
on the face. In this chapter, we conduct two sets of experiments. The first
set of experiments demonstrates baseline performance using the NDSP dataset.
This dataset is captured indoors, where the sunlight streaming through the doors
affects the illumination of the scene. Then we show recognition experiments using
the Comparison dataset, where video data is acquired from two different sources:
a high-quality camcorder and a surveillance camera. We also capture data both
indoors and outdoors to compare performance when acquiring data in different
acquisition settings. We then compare recognition performance when using each
of these two sources of video data as our probe set and show that performance
falls drastically when we use poor quality video and when we move from indoor
to outdoor settings.
The rest of the chapter is organized as follows: First, we describe baseline
performance for the NDSP dataset in Section 4.1. Then, Section 4.2 describes
the experiments we run to compare performance and in Sections 4.2.2 and 4.3, we
describe our results and conclusions.
4.1 NDSP dataset: Baseline performance
We first define baseline performance using the NDSP dataset to show the
difficulty of this dataset. While there is significant research done in the area of
face recognition using high-quality video where the subject is looking directly
at the camera, research using poor-quality data with off-angle poses is also
needed. So we define baseline performance for this dissertation to show that it is
a challenging problem.
4.1.1 Experiments
For each subject in the NDSP probe set, we compare each frame of their
probe video clip to the set of gallery images of the same subject. We describe
how we generate the multiple gallery images per subject in Section 5.1. For each
subject, we predetermine the best single probe video frame to use for that person.
We do this by picking the frame that gives us the highest matching score to
this corresponding set of gallery images. This gives us a new image set of 57
images (one image per subject), where each image represents the highest possible
matching score of that subject to the gallery images of the same subject. We use
this "oracle" set of probe video frames as our probe set. This is an optimistic
baseline, in that a recognition system would not necessarily be able to find the best
frame in each probe video clip. We then run recognition using this set of images
as probes and report the rank one recognition rate as our baseline performance.
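The oracle selection described above amounts to a max-over-frames, max-over-gallery search. The sketch below illustrates the idea; `match_score`, `clips` and `gallery` are placeholder names of our own, and the actual matcher in our experiments is the commercial SDK, not this function.

```python
def oracle_probe_frames(match_score, clips, gallery):
    """Pick, for each subject, the probe frame with the highest matching
    score to any of that subject's gallery images (an optimistic oracle).

    `clips[s]` is the list of video frames for subject s, `gallery[s]`
    that subject's gallery images, and `match_score(a, b)` a matcher
    returning a similarity score (higher is better).
    """
    oracle = {}
    for subject, frames in clips.items():
        oracle[subject] = max(
            frames,
            key=lambda f: max(match_score(f, g) for g in gallery[subject]))
    return oracle
```

Because the selection peeks at the gallery scores, the resulting recognition rate is an upper bound that a deployed system, which cannot know the true match in advance, would not necessarily reach.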
4.1.2 Results
In Figure 4.1, we show the rank one recognition rates using this set of 57
images, when the images in the gallery set correspond to an off-angle degree from
Figure 4.1. Baseline performance for the NDSP dataset
the frontal position of 0°, ±6°, ±12° and ±18° in the yaw angle. The face is
also rotated to ±6° in pitch angle.
We see that performance steadily increases as we increase the range of poses
available in the gallery set. We determine the best frame per subject based on its
matching score to all 17 poses. This explains why performance peaks when we use
all 17 poses, since we use all 17 poses to pick the frames that make up the oracle
probe set. This shows that this is a challenging dataset, where performance is
poor even when we pick out the probe frame with the best matching score to its
gallery image. Secondly, we demonstrate that using a variety of poses increases
recognition performance.
We show that performance increases as we increase the number of poses, up to
the full set of 17 poses. So the question arises as to whether performance would
continue to increase if we were to increase the off-angle of the poses and add
more images to our gallery. We generated additional synthetic poses, but the face
detection system was unable to handle poses that were greater than 18° from a
frontal position. So those poses were not used by the recognition system and,
even if they were, they would not have been useful for recognition. Furthermore,
if the video contained images of the subject in a strictly frontal position, the
additional poses would not be useful for recognition. However, as we showed in
[43], multiple images can be used to improve recognition even in instances where
the subject is in a frontal pose.
4.2 Comparison dataset
For comparison, we run recognition experiments using the Comparison dataset
described in Section 3.2. This set contains high-quality still images as gallery
data and two sets of probe data, one acquired on a high-quality camcorder and the
other on a surveillance camera. This dataset also contains data acquired indoors
and outdoors. This shows how the change in lighting can also affect recognition
performance.
4.2.1 Experiments
For our experiments, we use PittPatt's detector and recognition system. Once
we have detected all the faces in the probe and gallery data, we create a single
gallery of all the gallery images, with minimal clustering to ensure that each
image is considered a unique subject. Then, for each video sequence, we create a
gallery of its frames and cluster it so that all frames correspond to the same
subject. We then run recognition of each set of videos against the gallery of
high-quality still images.
We report results using rank one recognition rate and equal error rate.
Since we cluster the video frames to correspond to one subject, distances are
reported between one sequence and a gallery image. So results are reported per
video sequence, rather than per frame. Our experiments are grouped into four
categories, depending on the sensor used and the acquisition condition in which
the data is acquired.
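The per-sequence reporting above can be illustrated as a simple reduction over per-frame distances. The aggregation rule shown (minimum or mean) is our own sketch for illustration; PittPatt's internal sequence matching is proprietary and may differ.

```python
import numpy as np

def sequence_distance(frame_distances, reduce="min"):
    """Collapse per-frame distances into one distance per video sequence.

    `frame_distances` holds the distances from each clustered frame of a
    sequence to a single gallery image. Taking the minimum treats the
    best-matching frame as representative; the mean is a smoother option.
    """
    d = np.asarray(frame_distances, dtype=float)
    return float(d.min() if reduce == "min" else d.mean())
```

Reporting one distance per sequence rather than per frame keeps a few well-matched frames from being drowned out by the many poor-quality frames in a surveillance clip.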
4.2.2 Results
In this section, we describe the detection and recognition results for the
recognition experiments described in Section 4.2.1.
Detection results: In Table 4.1 we show the results of the face detection
and how many faces were detected in the video sequences. The number of faces
detected in the outdoor video is far smaller than in the indoor video, and it
decreases further as we move from high-quality camcorder video to surveillance
video. With the high-quality video indoors, detection is about 50%, and it falls
to less than 5% when we move outdoors using a surveillance camera. So we see
that both the type of camera and the acquisition condition affect face detection
performance.
In Figures 4.2 through 4.5, we show an example frame from each acquisition
and camera. We also show some of the thumbnails created after we run detection
on the surveillance and high-definition video (both indoor and outdoor video). We
TABLE 4.1
COMPARISON DATASET RESULTS: DETECTIONS IN VIDEO
USING PITTPATT
Performance Indoor video Outdoor video
metric High-resolution video
Surveillance video
High-resolution