
Face Recognition: A Literature Survey

W. ZHAO

Sarnoff Corporation

R. CHELLAPPA

University of Maryland

P. J. PHILLIPS

National Institute of Standards and Technology

AND

A. ROSENFELD

University of Maryland

As one of the most successful applications of image analysis and understanding, face recognition has recently received significant attention, especially during the past several years. At least two reasons account for this trend: the first is the wide range of commercial and law enforcement applications, and the second is the availability of feasible technologies after 30 years of research. Even though current machine recognition systems have reached a certain level of maturity, their success is limited by the conditions imposed by many real applications. For example, recognition of face images acquired in an outdoor environment with changes in illumination and/or pose remains a largely unsolved problem. In other words, current systems are still far away from the capability of the human perception system.

This paper provides an up-to-date critical survey of still- and video-based face recognition research. There are two underlying motivations for us to write this survey paper: the first is to provide an up-to-date review of the existing literature, and the second is to offer some insights into the studies of machine recognition of faces. To provide a comprehensive survey, we not only categorize existing recognition techniques but also present detailed descriptions of representative methods within each category. In addition, relevant topics such as psychophysical studies, system evaluation, and issues of illumination and pose variation are covered.

Categories and Subject Descriptors: I.5.4 [Pattern Recognition]: Applications

General Terms: Algorithms

Additional Key Words and Phrases: Face recognition, person identification

An earlier version of this paper appeared as "Face Recognition: A Literature Survey," Technical Report CAR-TR-948, Center for Automation Research, University of Maryland, College Park, MD, 2000.

Authors' addresses: W. Zhao, Vision Technologies Lab, Sarnoff Corporation, Princeton, NJ 08543-5300; email: [email protected]; R. Chellappa and A. Rosenfeld, Center for Automation Research, University of Maryland, College Park, MD 20742-3275; email: {rama,ar}@cfar.umd.edu; P. J. Phillips, National Institute of Standards and Technology, Gaithersburg, MD 20899; email: [email protected].

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

© 2003 ACM 0360-0300/03/1200-0399 $5.00

ACM Computing Surveys, Vol. 35, No. 4, December 2003, pp. 399–458.


1. INTRODUCTION

As one of the most successful applications of image analysis and understanding, face recognition has recently received significant attention, especially during the past few years. This is evidenced by the emergence of face recognition conferences such as the International Conference on Audio- and Video-Based Authentication (AVBPA) since 1997 and the International Conference on Automatic Face and Gesture Recognition (AFGR) since 1995, systematic empirical evaluations of face recognition techniques (FRT), including the FERET [Phillips et al. 1998b, 2000; Rizvi et al. 1998], FRVT 2000 [Blackburn et al. 2001], FRVT 2002 [Phillips et al. 2003], and XM2VTS [Messer et al. 1999] protocols, and many commercially available systems (Table II). There are at least two reasons for this trend: the first is the wide range of commercial and law enforcement applications, and the second is the availability of feasible technologies after 30 years of research. In addition, the problem of machine recognition of human faces continues to attract researchers from disciplines such as image processing, pattern recognition, neural networks, computer vision, computer graphics, and psychology.

The strong need for user-friendly systems that can secure our assets and protect our privacy without losing our identity in a sea of numbers is obvious. At present, one needs a PIN to get cash from an ATM, a password for a computer, a dozen others to access the internet, and so on. Although very reliable methods of biometric personal identification exist, for example, fingerprint analysis and retinal or iris scans, these methods rely on the cooperation of the participants, whereas a personal identification system based on analysis of frontal or profile images of the face is often effective without the participant's cooperation or knowledge. Some of the advantages/disadvantages of different biometrics are described in Phillips et al. [1998]. Table I lists some of the applications of face recognition.

Table I. Typical Applications of Face Recognition

Areas                              Specific applications
Entertainment                      Video game, virtual reality, training programs;
                                   human-robot interaction, human-computer interaction
Smart cards                        Drivers' licenses, entitlement programs;
                                   immigration, national ID, passports, voter registration;
                                   welfare fraud
Information security               TV parental control, personal device logon, desktop logon;
                                   application security, database security, file encryption;
                                   intranet security, internet access, medical records;
                                   secure trading terminals
Law enforcement and surveillance   Advanced video surveillance, CCTV control;
                                   portal control, postevent analysis;
                                   shoplifting, suspect tracking and investigation

Commercial and law enforcement applications of FRT range from static, controlled-format photographs to uncontrolled video images, posing a wide range of technical challenges and requiring an equally wide range of techniques from image processing, analysis, understanding, and pattern recognition. One can broadly classify FRT systems into two groups depending on whether they make use of static images or of video. Within these groups, significant differences exist, depending on the specific application. The differences are in terms of image quality, amount of background clutter (posing challenges to segmentation algorithms), variability of the images of a particular individual that must be recognized, availability of a well-defined recognition or matching criterion, and the nature, type, and amount of input from a user. A list of some commercial systems is given in Table II.

Table II. Available Commercial Face Recognition Systems (Some of these Web sites may have changed or been removed.) [The identification of any company, commercial product, or trade name does not imply endorsement or recommendation by the National Institute of Standards and Technology or any of the authors or their institutions.]

Commercial products          Websites
FaceIt from Visionics        http://www.FaceIt.com
Viisage Technology           http://www.viisage.com
FaceVACS from Plettac        http://www.plettac-electronics.com
FaceKey Corp.                http://www.facekey.com
Cognitec Systems             http://www.cognitec-systems.de
Keyware Technologies         http://www.keywareusa.com/
Passfaces from ID-arts       http://www.id-arts.com/
ImageWare Software           http://www.iwsinc.com/
Eyematic Interfaces Inc.     http://www.eyematic.com/
BioID sensor fusion          http://www.bioid.com
Visionsphere Technologies    http://www.visionspheretech.com/menu.htm
Biometric Systems, Inc.      http://www.biometrica.com/
FaceSnap Recorder            http://www.facesnap.de/htdocs/english/index2.html
SpotIt for face composite    http://spotit.itc.it/SpotIt.html

Fig. 1. Configuration of a generic face recognition system.

A general statement of the problem of machine recognition of faces can be formulated as follows: given still or video images of a scene, identify or verify one or more persons in the scene using a stored database of faces. Available collateral information such as race, age, gender, facial expression, or speech may be used in narrowing the search (enhancing recognition). The solution to the problem involves segmentation of faces (face detection) from cluttered scenes, feature extraction from the face regions, and recognition or verification (Figure 1). In identification problems, the input to the system is an unknown face, and the system reports back the determined identity from a database of known individuals, whereas in verification problems, the system needs to confirm or reject the claimed identity of the input face.
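The two modes can be made concrete in a few lines. The sketch below is illustrative only: the feature vectors, the gallery structure, and the threshold are hypothetical placeholders for whatever representation a particular system uses.

    import numpy as np

    def identify(probe, gallery):
        """Identification (1:N): report the closest identity in a stored database.

        gallery: dict mapping identity -> enrolled feature vector (np.ndarray).
        """
        return min(gallery, key=lambda name: np.linalg.norm(probe - gallery[name]))

    def verify(probe, enrolled, threshold):
        """Verification (1:1): confirm or reject a claimed identity."""
        return np.linalg.norm(probe - enrolled) <= threshold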

Face perception is an important part of the human perceptual system and is a routine task for humans, while building a similar computational system is still an ongoing research area. The earliest work on face recognition can be traced back at least to the 1950s in psychology [Bruner and Tagiuri 1954] and to the 1960s in the engineering literature [Bledsoe 1964]. Some of the earliest studies include work on facial expression of emotions by Darwin [1972] (see also Ekman [1998]) and on facial profile-based biometrics by Galton [1888]. But research on automatic machine recognition of faces really started in the 1970s [Kelly 1970] and after the seminal work of Kanade [1973]. Over the past 30 years extensive research has been conducted by psychophysicists, neuroscientists, and engineers on various aspects of face recognition by humans and machines. Psychophysicists and neuroscientists have been concerned with issues such as whether face perception is a dedicated process (this issue is still being debated in the psychology community [Biederman and Kalocsai 1998; Ellis 1986; Gauthier et al. 1999; Gauthier and Logothetis 2000]) and whether it is done holistically or by local feature analysis.

Many of the hypotheses and theories put forward by researchers in these disciplines have been based on rather small sets of images. Nevertheless, many of the findings have important consequences for engineers who design algorithms and systems for machine recognition of human faces. Section 2 will present a concise review of these findings.

Barring a few exceptions that use range data [Gordon 1991], the face recognition problem has been formulated as recognizing three-dimensional (3D) objects from two-dimensional (2D) images [1]. Earlier approaches treated it as a 2D pattern recognition problem. As a result, during the early and mid-1970s, typical pattern classification techniques, which use measured attributes of features (e.g., the distances between important points) in faces or face profiles, were used [Bledsoe 1964; Kanade 1973; Kelly 1970]. During the 1980s, work on face recognition remained largely dormant. Since the early 1990s, research interest in FRT has grown significantly. One can attribute this to several reasons: an increase in interest in commercial opportunities; the availability of real-time hardware; and the increasing importance of surveillance-related applications.

Over the past 15 years, research has focused on how to make face recognition systems fully automatic by tackling problems such as localization of a face in a given image or video clip and extraction of features such as eyes, mouth, etc. Meanwhile, significant advances have been made in the design of classifiers for successful face recognition. Among appearance-based holistic approaches, eigenfaces [Kirby and Sirovich 1990; Turk and Pentland 1991] and Fisherfaces [Belhumeur et al. 1997; Etemad and Chellappa 1997; Zhao et al. 1998] have proved to be effective in experiments with large databases. Feature-based graph matching approaches [Wiskott et al. 1997] have also been quite successful. Compared to holistic approaches, feature-based methods are less sensitive to variations in illumination and viewpoint and to inaccuracy in face localization. However, the feature extraction techniques needed for this type of approach are still not reliable or accurate enough [Cox et al. 1996]. For example, most eye localization techniques assume some geometric and textural models and do not work if the eye is closed. Section 3 will present a review of still-image-based face recognition.

[1] There have been recent advances on 3D face recognition in situations where range data acquired through structured light can be matched reliably [Bronstein et al. 2003].

During the past 5 to 8 years, much research has been concentrated on video-based face recognition. The still-image problem has several inherent advantages and disadvantages. For applications such as drivers' licenses, due to the controlled nature of the image acquisition process, the segmentation problem is rather easy. However, if only a static picture of an airport scene is available, automatic location and segmentation of a face could pose serious challenges to any segmentation algorithm. On the other hand, if a video sequence is available, segmentation of a moving person can be more easily accomplished using motion as a cue. But the small size and low image quality of faces captured from video can significantly increase the difficulty of recognition. Video-based face recognition is reviewed in Section 4.

As we propose new algorithms and build more systems, measuring the performance of new systems and of existing systems becomes very important. Systematic data collection and evaluation of face recognition systems is reviewed in Section 5.

Recognizing a 3D object from its 2D images poses many challenges. The illumination and pose problems are two prominent issues for appearance- or image-based approaches. Many approaches have been proposed to handle these issues, with the majority of them exploiting domain knowledge. Details of these approaches are discussed in Section 6.

In 1995, a review paper [Chellappa et al. 1995] gave a thorough survey of FRT at that time. (An earlier survey [Samal and Iyengar 1992] appeared in 1992.) At that time, video-based face recognition was still in a nascent stage. During the past 8 years, face recognition has received increased attention and has advanced technically. Many commercial systems for still face recognition are now available. Recently, significant research efforts have been focused on video-based face modeling/tracking, recognition, and system integration. New datasets have been created, and evaluations of recognition techniques using these databases have been carried out. It is not an overstatement to say that face recognition has become one of the most active applications of pattern recognition, image analysis, and understanding.

In this paper we provide a critical review of current developments in face recognition. The paper is organized as follows: in Section 2 we briefly review issues that are relevant from a psychophysical point of view. Section 3 provides a detailed review of recent developments in face recognition techniques using still images. In Section 4, face recognition techniques based on video are reviewed. Data collection and performance evaluation of face recognition algorithms are addressed in Section 5, with descriptions of representative protocols. In Section 6 we discuss two important problems in face recognition that can be mathematically studied, lack of robustness to illumination and pose variations, and we review proposed methods of overcoming these limitations. Finally, a summary and conclusions are presented in Section 7.

2. PSYCHOPHYSICS/NEUROSCIENCE ISSUES RELEVANT TO FACE RECOGNITION

Human recognition processes utilize a broad spectrum of stimuli, obtained from many, if not all, of the senses (visual, auditory, olfactory, tactile, etc.). In many situations, contextual knowledge is also applied; for example, surroundings play an important role in recognizing faces in relation to where they are supposed to be located. It is futile to even attempt to develop a system using existing technology which will mimic the remarkable face recognition ability of humans. However, the human brain has its limitations in the total number of persons that it can accurately "remember." A key advantage of a computer system is its capacity to handle large numbers of face images. In most applications the images are available only in the form of single or multiple views of 2D intensity data, so that the inputs to computer face recognition algorithms are visual only. For this reason, the literature reviewed in this section is restricted to studies of human visual perception of faces.

Many studies in psychology and neuroscience have direct relevance to engineers interested in designing algorithms or systems for machine recognition of faces. For example, findings in psychology [Bruce 1988; Shepherd et al. 1981] about the relative importance of different facial features have been noted in the engineering literature [Etemad and Chellappa 1997]. On the other hand, machine systems provide tools for conducting studies in psychology and neuroscience [Hancock et al. 1998; Kalocsai et al. 1998]. For example, a possible engineering explanation of the bottom-lighting effects studied in Johnston et al. [1992] is as follows: when the actual lighting direction is opposite to the usually assumed direction, a shape-from-shading algorithm recovers incorrect structural information and hence makes recognition of faces harder.

A detailed review of relevant studies in psychophysics and neuroscience is beyond the scope of this paper. We only summarize findings that are potentially relevant to the design of face recognition systems. For details the reader is referred to the papers cited below. Issues that are of potential interest to designers are [2]:

[2] Readers should be aware of the existence of diverse opinions on some of these issues. The opinions given here do not necessarily represent our views.

—Is face recognition a dedicated process? [Biederman and Kalocsai 1998; Ellis 1986; Gauthier et al. 1999; Gauthier and Logothetis 2000]: It is traditionally believed that face recognition is a dedicated process different from other object recognition tasks. Evidence for the existence of a dedicated face processing system comes from several sources [Ellis 1986]. (a) Faces are more easily remembered by humans than other objects when presented in an upright orientation. (b) Prosopagnosia patients are unable to recognize previously familiar faces, but usually have no other profound agnosia. They recognize people by their voices, hair color, dress, etc. It should be noted that prosopagnosia patients recognize whether a given object is a face or not, but then have difficulty in identifying the face. Seven differences between face recognition and object recognition can be summarized [Biederman and Kalocsai 1998] based on empirical evidence: (1) configural effects (related to the choice of different types of machine recognition systems), (2) expertise, (3) differences verbalizable, (4) sensitivity to contrast polarity and illumination direction (related to the illumination problem in machine recognition systems), (5) metric variation, (6) rotation in depth (related to the pose variation problem in machine recognition systems), and (7) rotation in plane/inverted face. Contrary to the traditionally held belief, some recent findings in human neuropsychology and neuroimaging suggest that face recognition may not be unique. According to Gauthier and Logothetis [2000], recent neuroimaging studies in humans indicate that level of categorization and expertise interact to produce the specification for faces in the middle fusiform gyrus [3]. Hence it is possible that the encoding scheme used for faces may also be employed for other classes with similar properties. (On recognition of familiar vs. unfamiliar faces, see Section 7.)

[3] The fusiform gyrus or occipitotemporal gyrus, located on the ventromedial surface of the temporal and occipital lobes, is thought to be critical for face recognition.

—Is face perception the result of holistic or feature analysis? [Bruce 1988; Bruce et al. 1998]: Both holistic and feature information are crucial for the perception and recognition of faces. Studies suggest the possibility of global descriptions serving as a front end for finer, feature-based perception. If dominant features are present, holistic descriptions may not be used. For example, in face recall studies, humans quickly focus on odd features such as big ears, a crooked nose, a staring eye, etc. One of the strongest pieces of evidence to support the view that face recognition involves more configural/holistic processing than other object recognition has been the face inversion effect, in which an inverted face is much harder to recognize than a normal face (first demonstrated in Yin [1969]). An excellent example is given in Bartlett and Searcy [1993] using the "Thatcher illusion" [Thompson 1980]. In this illusion, the eyes and mouth of an expressing face are excised and inverted; the result looks grotesque in an upright face, but when shown inverted, the face looks fairly normal in appearance, and the inversion of the internal features is not readily noticed.

—Ranking of significance of facial features [Bruce 1988; Shepherd et al. 1981]: Hair, face outline, eyes, and mouth (not necessarily in this order) have been determined to be important for perceiving and remembering faces [Shepherd et al. 1981]. Several studies have shown that the nose plays an insignificant role; this may be due to the fact that almost all of these studies have been done using frontal images. In face recognition using profiles (which may be important in mugshot matching applications, where profiles can be extracted from side views), a distinctive nose shape could be more important than the eyes or mouth [Bruce 1988]. Another outcome of some studies is that both external and internal features are important in the recognition of previously presented but otherwise unfamiliar faces, but internal features are more dominant in the recognition of familiar faces. It has also been found that the upper part of the face is more useful for face recognition than the lower part [Shepherd et al. 1981]. The role of aesthetic attributes such as beauty, attractiveness, and/or pleasantness has also been studied, with the conclusion that the more attractive the faces are, the better is their recognition rate; the least attractive faces come next, followed by the midrange faces, in terms of ease of being recognized.

—Caricatures [Brennan 1985; Bruce 1988; Perkins 1975]: A caricature can be formally defined [Perkins 1975] as "a symbol that exaggerates measurements relative to any measure which varies from one person to another." Thus the length of a nose is a measure that varies from person to person, and could be useful as a symbol in caricaturing someone, but not the number of ears. A standard caricature algorithm [Brennan 1985] can be applied to different qualities of image data (line drawings and photographs). Caricatures of line drawings do not contain as much information as photographs, but they manage to capture the important characteristics of a face; experiments based on nonordinary faces comparing the usefulness of line-drawing caricatures and unexaggerated line drawings decidedly favor the former [Bruce 1988].

—Distinctiveness [Bruce et al. 1994]: Studies show that distinctive faces are better retained in memory and are recognized better and faster than typical faces. However, if a decision has to be made as to whether an object is a face or not, it takes longer to recognize an atypical face than a typical face. This may be explained by different mechanisms being used for detection and for identification.

—The role of spatial frequency analysis [Ginsburg 1978; Harmon 1973; Sergent 1986]: Earlier studies [Ginsburg 1978; Harmon 1973] concluded that information in low spatial frequency bands plays a dominant role in face recognition. Recent studies [Sergent 1986] have shown that, depending on the specific recognition task, the low-, band-pass, and high-frequency components may play different roles. For example, gender classification can be successfully accomplished using low-frequency components only, while identification requires the use of high-frequency components [Sergent 1986]. Low-frequency components contribute to the global description, while high-frequency components contribute to the finer details needed in identification.

—Viewpoint-invariant recognition? [Biederman 1987; Hill et al. 1997; Tarr and Bulthoff 1995]: Much work in visual object recognition (e.g., Biederman [1987]) has been cast within the theoretical framework introduced in Marr [1982], in which different views of objects are analyzed in a way that allows access to (largely) viewpoint-invariant descriptions. Recently, there has been some debate about whether object recognition is viewpoint-invariant or not [Tarr and Bulthoff 1995]. Some experiments suggest that memory for faces is highly viewpoint-dependent. Generalization even from one profile viewpoint to another is poor, though generalization from one three-quarter view to the other is very good [Hill et al. 1997].

—Effect of lighting change [Bruce et al. 1998; Hill and Bruce 1996; Johnston et al. 1992]: It has long been informally observed that photographic negatives of faces are difficult to recognize. However, relatively little work has explored why it is so difficult to recognize negative images of faces. In Johnston et al. [1992], experiments were conducted to explore whether difficulties with negative images and inverted images of faces arise because each of these manipulations reverses the apparent direction of lighting, rendering a top-lit image of a face apparently lit from below. It was demonstrated in Johnston et al. [1992] that bottom lighting does indeed make it harder to identify familiar faces. In Hill and Bruce [1996], the importance of top lighting for face recognition was demonstrated using a different task: matching surface images of faces to determine whether they were identical.

—Movement and face recognition [O'Toole et al. 2002; Bruce et al. 1998; Knight and Johnston 1997]: A recent study [Knight and Johnston 1997] showed that famous faces are easier to recognize when shown in moving sequences than in still photographs. This observation has been extended to show that movement helps in the recognition of familiar faces shown under a range of different types of degradations (negated, inverted, or thresholded) [Bruce et al. 1998]. Even more interesting is the observation that there seems to be a benefit due to movement even if the information content is equated in the moving and static comparison conditions. However, experiments with unfamiliar faces suggest no additional benefit from viewing animated rather than static sequences.

—Facial expressions [Bruce 1988]: Based on neurophysiological studies, it seems that analysis of facial expressions is accomplished in parallel to face recognition. Some prosopagnosic patients, who have difficulties in identifying familiar faces, nevertheless seem to recognize emotional expressions. Patients who suffer from "organic brain syndrome" perform poorly at expression analysis but perform face recognition quite well [4]. Similarly, separation of face recognition and "focused visual processing" tasks (e.g., looking for someone with a thick mustache) has been claimed.

[4] From a machine recognition point of view, dramatic facial expressions may affect face recognition performance if only one photograph is available.

3. FACE RECOGNITION FROM STILL IMAGES

As illustrated in Figure 1, the problem of automatic face recognition involves three key steps/subtasks: (1) detection and rough normalization of faces, (2) feature extraction and accurate normalization of faces, and (3) identification and/or verification. Sometimes, different subtasks are not totally separated. For example, the facial features (eyes, nose, mouth) used for face recognition are often used in face detection. Face detection and feature extraction can be achieved simultaneously, as indicated in Figure 1. Depending on the nature of the application, for example, the sizes of the training and testing databases, clutter and variability of the background, noise, occlusion, and speed requirements, some of the subtasks can be very challenging.
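As a rough sketch of how the three subtasks compose, the following fragment uses hypothetical detector, extractor, and matcher callables standing in for concrete components; it is not drawn from any of the cited systems.

    def recognize(image, detector, extractor, matcher):
        """Generic face recognition pipeline following Figure 1 (illustrative)."""
        results = []
        for face in detector(image):           # (1) detection and rough normalization
            features = extractor(face)         # (2) feature extraction, accurate normalization
            results.append(matcher(features))  # (3) identification or verification
        return results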

Though fully automatic face recognition systems must perform all three subtasks, research on each subtask is critical. This is not only because the techniques used for the individual subtasks need to be improved, but also because they are critical in many different applications (Figure 1). For example, face detection is needed to initialize face tracking, and extraction of facial features is needed for recognizing human emotion, which is in turn essential in human-computer interaction (HCI) systems. Isolating the subtasks makes it easier to assess and advance the state of the art of the component techniques. Earlier face detection techniques could only handle single or a few well-separated frontal faces in images with simple backgrounds, while state-of-the-art algorithms can detect faces and their poses in cluttered backgrounds [Gu et al. 2001; Heisele et al. 2001; Schneiderman and Kanade 2000; Viola and Jones 2001]. Extensive research on the subtasks has been carried out, and relevant surveys have appeared on, for example, the subtask of face detection [Hjelmas and Low 2001; Yang et al. 2002].

In this section we survey the state of the art of face recognition in the engineering literature. For the sake of completeness, in Section 3.1 we provide a highlighted summary of research on face segmentation/detection and feature extraction. Section 3.2 contains detailed reviews of recent work on intensity-image-based face recognition and categorizes methods of recognition from intensity images. Section 3.3 summarizes the status of face recognition and discusses open research issues.

3.1. Key Steps Prior to Recognition: Face Detection and Feature Extraction

The first step in any automatic face recognition system is the detection of faces in images. Here we only provide a summary of this topic and highlight a few very recent methods. After a face has been detected, the task of feature extraction is to obtain features that are fed into a face classification system. Depending on the type of classification system, features can be local features such as lines or fiducial points, or facial features such as eyes, nose, and mouth. Face detection may also employ features, in which case features are extracted simultaneously with face detection. Feature extraction is also a key to animation and recognition of facial expressions.

Without considering feature locations, face detection is declared successful if the presence and rough location of a face have been correctly identified. However, without accurate face and feature locations, noticeable degradation in recognition performance is observed [Martinez 2002; Zhao 1999]. The close relationship between feature extraction and face recognition motivates us to review a few feature extraction methods that are used in the recognition approaches to be reviewed in Section 3.2. Hence, this section also serves as an introduction to the next section.

3.1.1. Segmentation/Detection: Summary. Up to the mid-1990s, most work on segmentation was focused on single-face segmentation from a simple or complex background. These approaches included using a whole-face template, a deformable feature-based template, skin color, and a neural network.

Significant advances have been made in recent years in achieving automatic face detection under various conditions. Compared to feature-based methods and template-matching methods, appearance- or image-based methods [Rowley et al. 1998; Sung and Poggio 1997] that train machine systems on large numbers of samples have achieved the best results. This may not be surprising, since face objects are complicated, very similar to each other, and different from nonface objects. Through extensive training, computers can become quite good at detecting faces.

More recently, detection of faces under rotation in depth has been studied. One approach is based on training on multiple-view samples [Gu et al. 2001; Schneiderman and Kanade 2000]. Compared to invariant-feature-based methods [Wiskott et al. 1997], multiview-based methods of face detection and recognition seem to be able to achieve better results when the angle of out-of-plane rotation is large (35°). In the psychology community, a similar debate exists on whether face recognition is viewpoint-invariant or not. Studies in both disciplines seem to support the idea that for small angles, face perception is view-independent, while for large angles, it is view-dependent.

In a detection problem, two statistics are important: true positives (also referred to as the detection rate) and false positives (reported detections in nonface regions). An ideal system would have a very high true positive rate and a very low false positive rate. In practice, these two requirements are conflicting. Treating face detection as a two-class classification problem helps to reduce false positives dramatically [Rowley et al. 1998; Sung and Poggio 1997] while maintaining true positives. This is achieved by retraining systems with false-positive samples that are generated by previously trained systems.
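This retraining loop can be sketched as follows; the fit/detect interface here is an assumed stand-in for a generic two-class classifier, not the actual interface of the cited systems.

    def bootstrap_negatives(classifier, faces, nonface_images, rounds=3):
        """Hard-negative mining: retrain on false positives from nonface images."""
        negatives = []  # in practice, seeded with random nonface patches
        for _ in range(rounds):
            classifier.fit(positives=faces, negatives=negatives)
            for image in nonface_images:
                # Every window the current classifier accepts in a nonface
                # image is, by construction, a false positive.
                negatives.extend(classifier.detect(image))
        return classifier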

3.1.2. Feature Extraction: Summary and Methods

3.1.2.1. Summary. The importance of facial features for face recognition cannot be overstated. Many face recognition systems need facial features in addition to the holistic face, as suggested by studies in psychology. It is well known that even holistic matching methods, for example, eigenfaces [Turk and Pentland 1991] and Fisherfaces [Belhumeur et al. 1997], need accurate locations of key facial features such as eyes, nose, and mouth to normalize the detected face [Martinez 2002; Yang et al. 2002].

Three types of feature extraction methods can be distinguished: (1) generic methods based on edges, lines, and curves; (2) feature-template-based methods that are used to detect facial features such as eyes; and (3) structural matching methods that take into consideration geometrical constraints on the features. Early approaches focused on individual features; for example, a template-based approach was described in Hallinan [1991] to detect and recognize the human eye in a frontal face. These methods have difficulty when the appearances of the features change significantly, for example, closed eyes, eyes with glasses, or an open mouth. To detect the features more reliably, recent approaches have used structural matching methods, for example, the Active Shape Model [Cootes et al. 1995]. Compared to earlier methods, these recent statistical methods are much more robust in terms of handling variations in image intensity and feature shape.

An even more challenging situation for feature extraction is feature "restoration," which tries to recover features that are invisible due to large variations in head pose. The best solution here might be to hallucinate the missing features either by using the bilateral symmetry of the face or by using learned information. For example, a view-based statistical method claims to be able to handle even profile views in which many local features are invisible [Cootes et al. 2000].

3.1.2.2. Methods. A template-based approach to detecting the eyes and mouth in real images was presented in Yuille et al. [1992]. This method is based on matching a predefined parameterized template to an image that contains a face region. Two templates are used for matching the eyes and mouth, respectively. An energy function is defined that links edges, peaks, and valleys in the image intensity to the corresponding properties in the template, and this energy function is minimized by iteratively changing the parameters of the template to fit the image. Compared to this model, which is manually designed, the statistical shape model (Active Shape Model, ASM) proposed in Cootes et al. [1995] offers more flexibility and robustness. The advantages of using the so-called analysis-through-synthesis approach come from the fact that the solution is constrained by a flexible statistical model. To account for texture variation, the ASM model has been expanded to statistical appearance models, including a Flexible Appearance Model (FAM) [Lanitis et al. 1995] and an Active Appearance Model (AAM) [Cootes et al. 2001]. In Cootes et al. [2001], the proposed AAM combined a model of shape variation (i.e., ASM) with a model of the appearance variation of shape-normalized (shape-free) textures. A training set of 400 images of faces, each manually labeled with 68 landmark points, and approximately 10,000 intensity values sampled from facial regions were used. The shape model (mean shape, orthogonal mapping matrix P_s, and projection vector b_s) is generated by representing each set of landmarks as a vector and applying principal-component analysis (PCA) to the data. Then, after each sample image is warped so that its landmarks match the mean shape, texture information can be sampled from this shape-free face patch. Applying PCA to this data leads to a shape-free texture model (mean texture, P_g, and b_g). To explore the correlation between the shape and texture variations, a third PCA is applied to the concatenated vectors (b_s and b_g) to obtain a combined model in which one vector c of appearance parameters controls both the shape and texture of the model. To match a given image and the model, an optimal vector of parameters (displacement parameters between the face region and the model, parameters for linear intensity adjustment, and the appearance parameters c) is searched for by minimizing the difference between the synthetic image and the given one. After matching, a best-fitting model is constructed that gives the locations of all the facial features and can be used to reconstruct the original images. Figure 2 illustrates the optimization/search procedure for fitting the model to the image. To speed up the search procedure, an efficient method is proposed that exploits the similarities among optimizations. This allows the direct method to find and apply directions of rapid convergence which are learned off-line.
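The training side of the combined model can be read as three successive PCAs. The sketch below is a schematic rendering of that procedure, not the published implementation; the array shapes, the retained dimensions, and the shape/texture weighting ws are assumed details.

    import numpy as np

    def pca(data, n_keep):
        """Mean and top n_keep principal axes of row-stacked samples."""
        mean = data.mean(axis=0)
        _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
        return mean, vt[:n_keep]

    def build_appearance_model(shapes, textures, ws=1.0):
        """shapes: (N, 136) landmark vectors; textures: (N, ~10000) shape-free samples."""
        s_mean, Ps = pca(shapes, 50)      # shape model (mean shape, basis Ps)
        g_mean, Pg = pca(textures, 100)   # shape-free texture model (mean texture, Pg)
        bs = (shapes - s_mean) @ Ps.T     # shape parameters b_s
        bg = (textures - g_mean) @ Pg.T   # texture parameters b_g
        # Third PCA on the concatenated parameters couples shape and texture,
        # so a single appearance vector c controls both.
        c_mean, Q = pca(np.hstack([ws * bs, bg]), 80)
        return (s_mean, Ps), (g_mean, Pg), (c_mean, Q)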


Fig. 2. Multiresolution search from a displaced position using a face model. (Courtesy of T. Cootes, K. Walker, and C. Taylor.)

3.2. Recognition from Intensity Images

Many methods of face recognition have been proposed during the past 30 years. Face recognition is such a challenging yet interesting problem that it has attracted researchers from different backgrounds: psychology, pattern recognition, neural networks, computer vision, and computer graphics. It is due to this fact that the literature on face recognition is vast and diverse. Often, a single system involves techniques motivated by different principles. The use of a mixture of techniques makes it difficult to classify these systems based purely on what types of techniques they use for feature representation or classification. To have a clear and high-level categorization, we instead follow a guideline suggested by the psychological study of how humans use holistic and local features. Specifically, we have the following categorization:

(1) Holistic matching methods. These methods use the whole face region as the raw input to a recognition system. One of the most widely used representations of the face region is eigenpictures [Kirby and Sirovich 1990; Sirovich and Kirby 1987], which are based on principal-component analysis.

(2) Feature-based (structural) matching methods. Typically, in these methods, local features such as the eyes, nose, and mouth are first extracted, and their locations and local statistics (geometric and/or appearance) are fed into a structural classifier.

(3) Hybrid methods. Just as the human perception system uses both local features and the whole face region to recognize a face, a machine recognition system should use both. One can argue that these methods could potentially offer the best of the two types of methods.

Within each of these categories, further classification is possible (Table III). Using principal-component analysis (PCA), many face recognition techniques have been developed: eigenfaces [Turk and Pentland 1991], which use a nearest-neighbor classifier; feature-line-based methods, which replace the point-to-point distance with the distance between a point and the feature line linking two stored sample points [Li and Lu 1999]; Fisherfaces [Belhumeur et al. 1997; Liu and Wechsler 2001; Swets and Weng 1996b; Zhao et al. 1998], which use linear/Fisher discriminant analysis (FLD/LDA) [Fisher 1938]; Bayesian methods, which use a probabilistic distance metric [Moghaddam and Pentland 1997]; and SVM methods, which use a support vector machine as the classifier [Phillips 1998]. Utilizing higher-order statistics, independent-component analysis (ICA) is argued to have more representative power than PCA, and hence may provide better recognition performance than PCA [Bartlett et al. 1998].


Table III. Categorization of Still Face Recognition Techniques

Approach                        Representative work

Holistic methods
  Principal-component analysis (PCA)
    Eigenfaces                  Direct application of PCA [Craw and Cameron 1996; Kirby and Sirovich 1990; Turk and Pentland 1991]
    Probabilistic eigenfaces    Two-class problem with prob. measure [Moghaddam and Pentland 1997]
    Fisherfaces/subspace LDA    FLD on eigenspace [Belhumeur et al. 1997; Swets and Weng 1996b; Zhao et al. 1998]
    SVM                         Two-class problem based on SVM [Phillips 1998]
    Evolution pursuit           Enhanced GA learning [Liu and Wechsler 2000a]
    Feature lines               Point-to-line distance based [Li and Lu 1999]
    ICA                         ICA-based feature analysis [Bartlett et al. 1998]
  Other representations
    LDA/FLD                     LDA/FLD on raw images [Etemad and Chellappa 1997]
    PDBNN                       Probabilistic decision-based NN [Lin et al. 1997]

Feature-based methods
    Pure geometry methods       Earlier methods [Kanade 1973; Kelly 1970]; recent methods [Cox et al. 1996; Manjunath et al. 1992]
    Dynamic link architecture   Graph matching methods [Okada et al. 1998; Wiskott et al. 1997]
    Hidden Markov model         HMM methods [Nefian and Hayes 1998; Samaria 1994; Samaria and Young 1994]
    Convolutional neural network  SOM-learning-based CNN methods [Lawrence et al. 1997]

Hybrid methods
    Modular eigenfaces          Eigenfaces and eigenmodules [Pentland et al. 1994]
    Hybrid LFA                  Local feature method [Penev and Atick 1996]
    Shape-normalized            Flexible appearance models [Lanitis et al. 1995]
    Component-based             Face region and components [Huang et al. 2003]

Being able to offer potentially greater generalization through learning, neural networks/learning methods have also been applied to face recognition. One example is the Probabilistic Decision-Based Neural Network (PDBNN) method [Lin et al. 1997], and another is the evolution pursuit (EP) method [Liu and Wechsler 2000a].

Most earlier methods belong to the category of structural matching methods, using the width of the head, the distances between the eyes and from the eyes to the mouth, etc. [Kelly 1970], or the distances and angles between eye corners, mouth extrema, nostrils, and chin top [Kanade 1973]. More recently, a mixture-distance based approach using manually extracted distances was reported [Cox et al. 1996]. Without finding the exact locations of facial features, Hidden Markov Model (HMM) based methods use strips of pixels that cover the forehead, eyes, nose, mouth, and chin [Nefian and Hayes 1998; Samaria 1994; Samaria and Young 1994]. Nefian and Hayes [1998] reported better performance than Samaria [1994] by using the KL projection coefficients instead of the strips of raw pixels. One of the most successful systems in this category is the graph matching system [Okada et al. 1998; Wiskott et al. 1997], which is based on the Dynamic Link Architecture (DLA) [Buhmann et al. 1990; Lades et al. 1993]. Using an unsupervised learning method based on a self-organizing map (SOM), a system based on a convolutional neural network (CNN) has been developed [Lawrence et al. 1997].

In the hybrid method category, we will briefly review the modular eigenface method [Pentland et al. 1994], a hybrid representation based on PCA and local feature analysis (LFA) [Penev and Atick 1996], a flexible appearance model-based method [Lanitis et al. 1995], and a recent development [Huang et al. 2003] along this direction. In Pentland et al. [1994], the use of hybrid features by combining eigenfaces and other eigenmodules is explored: eigeneyes, eigenmouth, and eigennose. Though experiments show only slight improvements over holistic eigenfaces or eigenmodules based on structural matching, we believe that these types of methods are important and deserve further investigation. Perhaps many relevant problems need to be solved before fruitful results can be expected, for example, how to optimally arbitrate the use of holistic and local features.

Fig. 3. Electronically modified images which were correctly identified.

Many types of systems have been successfully applied to the task of face recognition, but they all have some advantages and disadvantages. Appropriate schemes should be chosen based on the specific requirements of a given task. Most of the systems reviewed here focus on the subtask of recognition, but others also include automatic face detection and feature extraction, making them fully automatic systems [Lin et al. 1997; Moghaddam and Pentland 1997; Wiskott et al. 1997].

3.2.1. Holistic Approaches

3.2.1.1. Principal-Component Analysis. Starting from the successful low-dimensional reconstruction of faces using KL or PCA projections [Kirby and Sirovich 1990; Sirovich and Kirby 1987], eigenpictures have been one of the major driving forces behind face representation, detection, and recognition. It is well known that there exist significant statistical redundancies in natural images [Ruderman 1994]. For a limited class of objects such as face images that are normalized with respect to scale, translation, and rotation, the redundancy is even greater [Penev and Atick 1996; Zhao 1999]. One of the best global compact representations is KL/PCA, which decorrelates the outputs. More specifically, sample vectors x can be expressed as linear combinations of the orthogonal basis Φ_i:

x = Σ_{i=1}^{n} a_i Φ_i ≈ Σ_{i=1}^{m} a_i Φ_i (typically m ≪ n),

obtained by solving the eigenproblem

C Φ = Φ Λ, (1)

where C is the covariance matrix for input x.
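A standard way to obtain the leading eigenpictures is an SVD of the mean-centered data matrix, which yields the eigenvectors of C in Eq. (1) without forming the covariance matrix explicitly. A minimal numpy sketch:

    import numpy as np

    def eigenpictures(faces, m):
        """faces: (N, d) matrix of vectorized, normalized face images.

        Returns the mean face and the top-m orthonormal basis vectors Phi_i
        (the right singular vectors of the centered data solve C Phi = Phi Lambda).
        """
        mean = faces.mean(axis=0)
        _, _, vt = np.linalg.svd(faces - mean, full_matrices=False)
        return mean, vt[:m]

    def expand(x, mean, phi):
        """Coefficients a_i in x ~ mean + sum_i a_i Phi_i."""
        return phi @ (x - mean)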

An advantage of using such representations is their reduced sensitivity to noise. Some of this noise may be due to small occlusions, as long as the topological structure does not change. For example, good performance under blurring, partial occlusion, and changes in background has been demonstrated in many eigenpicture-based systems, as illustrated in Figure 3. This should not come as a surprise, since the PCA-reconstructed images are much better than the original distorted images in terms of their global appearance (Figure 4).

For better approximation of face images outside the training set, using an extended training set that adds mirror-imaged faces was shown to achieve lower approximation error [Kirby and Sirovich 1990]. With such an extended training set, the eigenpictures are either symmetric or antisymmetric, with the leading eigenpictures typically being symmetric.
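A sketch of the mirroring step, assuming the faces arrive as an (N, h, w) array of roughly aligned images:

    import numpy as np

    def mirror_extend(faces):
        """Double the training set with left-right mirrored faces.

        With this symmetric set, the leading eigenpictures come out
        symmetric or antisymmetric, as noted above.
        """
        return np.concatenate([faces, faces[:, :, ::-1]], axis=0)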


Fig. 4. Reconstructed images using 300 PCA projection coefficients for electronically modified images (Figure 3). (From Zhao [1999].)

The first really successful demonstration of machine recognition of faces was made in Turk and Pentland [1991] using eigenpictures (also known as eigenfaces) for face detection and identification. Given the eigenfaces, every face in the database can be represented as a vector of weights; the weights are obtained by projecting the image onto eigenface components by a simple inner-product operation. When a new test image whose identification is required is given, the new image is also represented by its vector of weights. The identification of the test image is done by locating the image in the database whose weights are the closest to the weights of the test image. By using the observation that the projections of a face image and a nonface image are usually different, a method of detecting the presence of a face in a given image is obtained. The method was demonstrated using a database of 2500 face images of 16 subjects, in all combinations of three head orientations, three head sizes, and three lighting conditions.
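Both uses of the weight vectors, nearest-neighbor identification and face/nonface screening by reconstruction error, fit in a few lines. In this sketch, mean and phi are an eigenface model as above, and the rejection threshold is an assumed tuning parameter:

    import numpy as np

    def weights(x, mean, phi):
        return phi @ (x - mean)

    def identify_nn(test_face, database, mean, phi):
        """Nearest-neighbor identification in eigenface weight space."""
        w = weights(test_face, mean, phi)
        return min(database, key=lambda name: np.linalg.norm(w - database[name]))

    def looks_like_face(x, mean, phi, threshold):
        """Distance from face space: faces reconstruct well, nonfaces do not."""
        w = weights(x, mean, phi)
        residual = np.linalg.norm((x - mean) - phi.T @ w)
        return residual < threshold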

Using a probabilistic measure of similarity, instead of the simple Euclidean distance used with eigenfaces [Turk and Pentland 1991], the standard eigenface approach was extended [Moghaddam and Pentland 1997] to a Bayesian approach. Practically, the major drawback of a Bayesian method is the need to estimate probability distributions in a high-dimensional space from very limited numbers of training samples per class. To avoid this problem, a much simpler two-class problem was created from the multiclass problem by using a similarity measure based on a Bayesian analysis of image differences. Two mutually exclusive classes were defined: Ω_I, representing intrapersonal variations between multiple images of the same individual, and Ω_E, representing extrapersonal variations due to differences in identity. Assuming that both classes are Gaussian-distributed, likelihood functions P(Δ | Ω_I) and P(Δ | Ω_E) were estimated for a given intensity difference Δ = I_1 − I_2. Given these likelihood functions and using the MAP rule, two face images are determined to belong to the same individual if P(Δ | Ω_I) > P(Δ | Ω_E). A large performance improvement of this probabilistic matching technique over standard nearest-neighbor eigenspace matching was reported using large face datasets including the FERET database [Phillips et al. 2000]. In Moghaddam and Pentland [1997], an efficient technique of probability density estimation was proposed by decomposing the input space into two mutually exclusive subspaces: the principal subspace F and its orthogonal subspace F̄ (a similar idea was explored in Sung and Poggio [1997]). Covariances only in the principal subspace are estimated for use in the Mahalanobis distance [Fukunaga 1989]. Experimental results have been reported using different subspace dimensionalities M_I and M_E for Ω_I and Ω_E. For example, M_I = 10 and M_E = 30 were used for internal tests, while M_I = M_E = 125 were used for the FERET test. In Figure 5, the so-called dual eigenfaces separately trained on samples from Ω_I and Ω_E are plotted along with the standard eigenfaces. While the extrapersonal eigenfaces appear more similar to the standard eigenfaces than the intrapersonal ones, the intrapersonal eigenfaces represent subtle variations due mostly to expression and lighting, suggesting that they are more critical for identification [Moghaddam and Pentland 1997].

Fig. 5. Comparison of "dual" eigenfaces and standard eigenfaces: (a) intrapersonal, (b) extrapersonal, (c) standard [Moghaddam and Pentland 1997]. (Courtesy of B. Moghaddam and A. Pentland.)
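Under the Gaussian assumption, the MAP rule reduces to comparing two log-likelihoods of the image difference. The sketch below uses full-covariance Gaussians for clarity, whereas the cited work estimates the densities more efficiently in subspaces:

    import numpy as np

    def gaussian_loglik(delta, mean, cov):
        """Log N(delta; mean, cov), up to the shared constant term."""
        d = delta - mean
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (d @ np.linalg.solve(cov, d) + logdet)

    def same_individual(i1, i2, intra, extra):
        """MAP rule: same person iff P(Delta | Omega_I) > P(Delta | Omega_E).

        intra/extra: (mean, cov) pairs fitted to intrapersonal and
        extrapersonal difference samples, respectively.
        """
        delta = i1 - i2
        return gaussian_loglik(delta, *intra) > gaussian_loglik(delta, *extra)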

Face recognition systems using LDA/FLD have also been very successful [Belhumeur et al. 1997; Etemad and Chellappa 1997; Swets and Weng 1996b; Zhao et al. 1998; Zhao et al. 1999]. LDA training is carried out via scatter matrix analysis [Fukunaga 1989]. For an M-class problem, the within- and between-class scatter matrices S_w, S_b are computed as follows:

S_w = Σ_{i=1}^{M} Pr(ω_i) C_i,
S_b = Σ_{i=1}^{M} Pr(ω_i) (m_i − m_0)(m_i − m_0)^T, (2)

where Pr(ωi) is the prior class probability,and is usually replaced by 1/M in practicewith the assumption of equal priors. HereSw is the within-class satter matrix, show-ing the average scatter5 Ci of the sam-ple vectors x of different classes ωi around

5These are also conditional covariance matrices; the

Fig. 6. Different projection bases constructed froma set of 444 individuals, where the set is augumentedvia adding noise and mirroring. The first row showsthe first five pure LDA basis images W ; the secondrow shows the first five subspace LDA basis imagesW�; the average face and first four eigenfaces � areshown on the third row [Zhao et al. 1998].

their respective means m_i: C_i = E[(x(ω) − m_i)(x(ω) − m_i)^T | ω = ω_i]. Similarly, S_b is the between-class scatter matrix, representing the scatter of the conditional mean vectors m_i around the overall mean vector m_0. A commonly used measure for quantifying discriminatory power is the ratio of the determinant of the between-class scatter matrix of the projected samples to the determinant of the within-class scatter matrix: J(T) = |T^T S_b T| / |T^T S_w T|. The optimal projection matrix W which maximizes J(T) can be obtained by solving a generalized eigenvalue problem:

S_b W = S_w W Λ_W.    (3)
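Equations (2) and (3) translate directly into code. The sketch below is ours, not a published implementation: it assumes vectorized images in the rows of X, equal class priors Pr(ω_i) = 1/M, and a small ridge added to S_w (anticipating the S_w + δI regularization discussed later in this section).

    import numpy as np
    from scipy.linalg import eigh

    def lda_basis(X, labels, n_components, delta=1e-6):
        classes = np.unique(labels)
        M, d = len(classes), X.shape[1]
        m0 = X.mean(axis=0)                      # overall mean m_0
        Sw = np.zeros((d, d))
        Sb = np.zeros((d, d))
        for c in classes:
            Xc = X[labels == c]
            mi = Xc.mean(axis=0)                 # class mean m_i
            Sw += np.cov(Xc, rowvar=False, bias=True) / M
            dm = (mi - m0)[:, None]
            Sb += (dm @ dm.T) / M
        # Generalized eigenproblem S_b W = S_w W Lambda_W; the small ridge
        # keeps S_w positive definite when it is close to singular.
        evals, evecs = eigh(Sb, Sw + delta * np.eye(d))
        order = np.argsort(evals)[::-1]
        return evecs[:, order[:n_components]]    # columns form the basis W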

It is helpful to make comparisons among the so-called (linear) projection algorithms. Here we illustrate the comparison between eigenfaces and Fisherfaces. Similar comparisons can be made for other methods, for example, ICA projection methods. In all these projection algorithms, classification is performed by (1) projecting the input x into a subspace via a projection/basis matrix Proj^6:

^6 Proj is Φ for eigenfaces, W for Fisherfaces with pure LDA projection, and WΦ for Fisherfaces with sequential PCA and LDA projections; these three bases are shown for visual comparison in Figure 6.


z = Proj x; (4)

(2) comparing the projection coefficient vector z of the input to all the prestored projection vectors of labeled classes to determine the input class label. The vector comparison varies in different implementations and can influence the system's performance dramatically [Moon and Phillips 2001]. For example, PCA algorithms can use either the angle or the Euclidean distance (weighted or unweighted) between two projection vectors. For LDA algorithms, the distance can be unweighted or weighted.
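The two-step procedure, and its sensitivity to the choice of metric, can be made concrete with a small sketch (illustrative only; here `gallery` is an assumed dictionary holding the prestored projection vectors of the labeled classes):

    import numpy as np

    def classify(x, Proj, gallery, metric="euclidean"):
        # Step (1): project the input; step (2): nearest stored vector.
        z = Proj @ x
        def dist(a, b):
            if metric == "angle":        # angle-based comparison
                return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
            return np.linalg.norm(a - b) # (optionally weighted) Euclidean
        return min(gallery, key=lambda label: dist(z, gallery[label]))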

In Swets and Weng [1996b], discriminant analysis of eigenfeatures is applied in an image retrieval system to determine not only class (human face vs. nonface objects) but also individuals within the face class. Using tree-structure learning, the eigenspace and LDA projections are recursively applied to smaller and smaller sets of samples. Such recursive partitioning is carried out for every node until the samples assigned to the node belong to a single class. Experiments on this approach were reported in Swets and Weng [1996]. A set of 800 images was used for training; the training set came from 42 classes, of which human faces belong to a single class. Within the single face class, 356 individuals were included and distinguished. Testing results on images not in the training set were 91% for 78 face images and 87% for 38 nonface images based on the top choice.

A comparative performance analysis was carried out in Belhumeur et al. [1997]. Four methods were compared in this paper: (1) a correlation-based method, (2) a variant of the linear subspace method suggested in Shashua [1994], (3) an eigenface method [Turk and Pentland 1991], and (4) a Fisherface method which uses subspace projection prior to LDA projection to avoid the possible singularity in S_w, as in Swets and Weng [1996b]. Experiments were performed on a database of 500 images created by Hallinan [1994] and a


database of 176 images created at Yale. The results of the experiments showed that the Fisherface method performed significantly better than the other three methods. However, no claim was made about the relative performance of these algorithms on larger databases.

To improve the performance of LDA-based systems, a regularized subspace LDA system that unifies PCA and LDA was proposed in Zhao [1999] and Zhao et al. [1998]. Good generalization ability of this system was demonstrated by experiments that carried out testing on new classes/individuals without retraining the PCA bases Φ, and sometimes the LDA bases W. While the reason for not retraining PCA is obvious, it is interesting to test the adaptive capability of the system by fixing the LDA bases when images from new classes are added.^7 The fixed PCA subspace of dimensionality 300 was trained from a large number of samples. An augmented set of 4056 mostly frontal-view images, constructed from the original 1078 FERET images of 444 individuals by adding noise and mirroring, was used in Zhao et al. [1998]. At least one of the following three characteristics separates this system from other LDA-based systems: (1) the unique selection of the universal face subspace dimension, (2) the use of a weighted distance measure, and (3) a regularized procedure that modifies the within-class scatter matrix S_w. The authors selected the dimensionality of the universal face subspace based on the characteristics of the eigenvectors (face-like or not) instead of the eigenvalues [Zhao et al. 1998], as is commonly done. Later it was concluded in Penev and Sirovich [2000] that the global face subspace dimensionality is on the order of 400 for large databases of 5,000 images. A weighted distance metric in the projection space z was used to improve performance [Zhao 1999].^8 Finally, the LDA

^7 This makes sense because the final classification is carried out in the projection space z by comparison with prestored projection vectors.
^8 Weighted metrics have also been used in the pure LDA approach [Etemad and Chellappa 1997] and the so-called enhanced FLD (EFM) approach [Liu and Wechsler 2000b].


Fig. 7. Two architectures for performing ICA on images. Left: architecture for finding statistically independent basis images. Performing source separation on the face images produces independent images in the rows of U. Right: architecture for finding a factorial code. Performing source separation on the pixels produces a factorial code in the columns of the output matrix U [Bartlett et al. 1998]. (Courtesy of M. Bartlett, H. Lades, and T. Sejnowski.)

training was regularized by modifying the S_w matrix to S_w + δI, where δ is a relatively small positive number. Doing this solves a numerical problem when S_w is close to being singular. In the extreme case where only one sample per class is available, this regularization transforms the LDA problem into a standard PCA problem, with S_b being the covariance matrix C. Applying this approach, without retraining the LDA basis, to a testing/probe set of 46 individuals, of which 24 were trained and 22 were not trained (a total of 115 images including 19 untrained images of nonfrontal views), the authors reported the following performance based on a front-view-only gallery database of 738 images: 85.2% for all images and 95.1% for frontal views.

An evolutionary pursuit (EP)-based adaptive representation and its application to face recognition were presented in Liu and Wechsler [2000a]. In analogy to projection pursuit methods, EP seeks to learn an optimal basis for the dual purpose of data compression and pattern classification. In order to increase the generalization ability of EP, a balance is sought between minimizing the empirical risk encountered during training and narrowing the confidence interval for reducing the guaranteed risk during future testing on unseen data [Vapnik 1995]. Toward that end, EP implements strategies characteristic of genetic algorithms (GAs) for searching the


space of possible solutions to determine the optimal basis. EP starts by projecting the original data into a lower-dimensional whitened PCA space. Directed random rotations of the basis vectors in this space are then searched by GAs, where evolution is driven by a fitness function defined in terms of performance accuracy (empirical risk) and class separation (confidence interval). The feasibility of this method has been demonstrated for face recognition, where the large number of possible bases requires a greedy search algorithm. The particular face recognition task involved 1107 FERET frontal face images of 369 subjects; there were three frontal images for each subject, two for training and the remaining one for testing. The authors reported improved face recognition performance as compared to eigenfaces [Turk and Pentland 1991], and better generalization capability than Fisherfaces [Belhumeur et al. 1997].

Based on the argument that for tasks such as face recognition much of the important information is contained in high-order statistics, it has been proposed [Bartlett et al. 1998] to use ICA to extract features for face recognition. Independent-component analysis is a generalization of principal-component analysis that decorrelates the high-order moments of the input in addition to the second-order moments. Two architectures have been proposed for face recognition (Figure 7): the first is used to find a set of statistically independent source images


Fig. 8. Comparison of basis images using two architectures for performing ICA: (a) 25 independent components of Architecture I, (b) 25 independent components of Architecture II [Bartlett et al. 1998]. (Courtesy of M. Bartlett, H. Lades, and T. Sejnowski.)

that can be viewed as independent image features for a given set of training images [Bell and Sejnowski 1995], and the second is used to find image filters that produce statistically independent outputs (a factorial code method) [Bell and Sejnowski 1997]. In both architectures, PCA is used first to reduce the dimensionality of the original image size (60 × 50). ICA is performed on the first 200 eigenvectors in the first architecture, and is carried out on the first 200 PCA projection coefficients in the second architecture. The authors reported performance improvements of both architectures over eigenfaces in the following scenario: a FERET subset consisting of 425 individuals was used; all the frontal views (one per class) were used for training and the remaining (up to three) frontal views for testing. Basis images of the two architectures are shown in Figure 8 along with the corresponding eigenfaces.
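The two architectures differ only in which axis of the data is treated as the random variable to be separated. A rough sketch follows, using scikit-learn's FastICA as a stand-in for the infomax separation of Bell and Sejnowski [1995] and random data in place of FERET images; all dimensions and parameter values are illustrative.

    import numpy as np
    from sklearn.decomposition import PCA, FastICA

    rng = np.random.default_rng(0)
    X = rng.random((425, 60 * 50))        # rows: vectorized face images

    pca = PCA(n_components=200).fit(X)
    ica = FastICA(n_components=25, whiten="unit-variance", max_iter=500)

    # Architecture I: the sources are independent basis *images*, so each
    # pixel is treated as an observation of the 200 eigenvector channels.
    basis_images = ica.fit_transform(pca.components_.T).T    # (25, 3000)

    # Architecture II: the sources are independent coefficients per face
    # (a factorial code); samples are images, channels are PCA projections.
    factorial_code = ica.fit_transform(pca.transform(X))     # (425, 25)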

3.2.1.2. Other Representations. In addition to the popular PCA representation and its derivatives such as ICA and EP, other features have also been used, such as raw intensities and edges.

A fully automatic face detection/recognition system based on a neural network was reported in Lin et al. [1997]. The proposed system is based on a probabilistic decision-based neural network (PDBNN, an extension of the DBNN [Kung and Taur 1995]), which consists of three modules: a face detector, an eye localizer, and a face recognizer. Unlike in most methods, the facial regions contain the eyebrows, eyes, and nose, but not the mouth.^9 The rationale for using only the upper face is to build a robust system that excludes the influence of facial variations due to expressions that cause motion around the mouth. To improve robustness, the segmented facial region images are first processed to produce two features at a reduced resolution of 14 × 10: normalized intensity features and edge features, both in the range [0, 1]. These features are fed into two PDBNNs and the final recognition result is the fusion of the outputs of these two PDBNNs. A unique characteristic of PDBNNs and DBNNs is their modular structure. That is, for each class/person

^9 Such a representation was also used in Kirby and Sirovich [1990].


Fig. 9. Structure of the PDBNN face recognizer. Each class subnet is designed to recognize one person. All the network weightings are in probabilistic format [Lin et al. 1997]. (Courtesy of S. Lin, S. Kung, and L. Lin.)

to be recognized, the PDBNN/DBNN devotes one of its subnets to the representation of that particular person, as illustrated in Figure 9. Such a one-class-in-one-network (OCON) structure has certain advantages over the all-classes-in-one-network (ACON) structure that is adopted by the conventional multilayer perceptron (MLP). In the ACON structure, all classes are lumped into one supernetwork, so large numbers of hidden units are needed and convergence is slow. On the other hand, the OCON structure consists of subnets with small numbers of hidden units; hence it not only converges faster but also has better generalization capability. Compared to most multiclass recognition systems that use a discrimination function between

any two classes, the PDBNN has a lower false acceptance/rejection rate because it uses the full density description for each class. In addition, this architecture is beneficial for hardware implementations such as distributed computing. However, it is not clear how to accurately estimate the full density functions for the classes when there are only limited numbers of samples. Further, the system could have problems when the number of classes grows very large.

3.2.2. Feature-Based Structural Matching Approaches. Many methods in the structural matching category have been proposed, including many early methods based on the geometry of local features [Kanade 1973;


Fig. 10. The bunch graph representation of faces used in elastic graph matching [Wiskott et al. 1997]. (Courtesy of L. Wiskott, J.-M. Fellous, and C. von der Malsburg.)

Kelly 1970], as well as 1D [Samaria and Young 1994] and pseudo-2D [Samaria 1994] HMM methods. One of the most successful of these systems is the Elastic Bunch Graph Matching (EBGM) system [Okada et al. 1998; Wiskott et al. 1997], which is based on DLAs (dynamic link architectures) [Buhmann et al. 1990; Lades et al. 1993]. Wavelets, especially Gabor wavelets, play a building-block role for facial representation in these graph matching methods. A typical local feature representation consists of wavelet coefficients for different scales and rotations based on fixed wavelet bases (called jets in Okada et al. [1998]). These locally estimated wavelet coefficients are robust to illumination change, translation, distortion, rotation, and scaling.

The basic 2D Gabor function and its Fourier transform are

g(x, y; u_0, v_0) = exp(−[x²/(2σ_x²) + y²/(2σ_y²)] + 2πi[u_0 x + v_0 y]),
G(u, v) = exp(−2π²(σ_x²(u − u_0)² + σ_y²(v − v_0)²)),    (5)

where σ_x and σ_y represent the spatial widths of the Gaussian and (u_0, v_0) is the frequency of the complex sinusoid.
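A direct transcription of the spatial-domain kernel in Equation (5), together with the construction of a jet at one image point, might look as follows. The scale and orientation sampling here is illustrative, not that of the cited systems.

    import numpy as np

    def gabor_kernel(sigma_x, sigma_y, u0, v0, half=8):
        # Spatial-domain kernel of equation (5) on a (2*half+1)^2 grid.
        y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
        envelope = np.exp(-(x**2 / (2 * sigma_x**2) + y**2 / (2 * sigma_y**2)))
        return envelope * np.exp(2j * np.pi * (u0 * x + v0 * y))

    def jet(image, cx, cy, n_scales=5, n_orients=8, half=8):
        # A "jet": Gabor magnitudes at (cx, cy) over scales and orientations.
        patch = image[cy - half:cy + half + 1, cx - half:cx + half + 1]
        coeffs = []
        for s in range(n_scales):
            freq = 0.25 / (2 ** (s / 2.0))       # radial frequency per scale
            for k in range(n_orients):
                theta = np.pi * k / n_orients
                g = gabor_kernel(half / 3.0, half / 3.0,
                                 freq * np.cos(theta), freq * np.sin(theta))
                coeffs.append(abs(np.sum(patch * g)))
        return np.asarray(coeffs)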

DLAs attempt to solve some of the conceptual problems of conventional artificial neural networks, the most prominent of these being the representation of syntactical relationships in neural networks. DLAs use synaptic plasticity and are able to form sets of neurons grouped into structured graphs while maintaining the advantages of neural systems. Both

Buhmann et al. [1990] and Lades et al. [1993] used Gabor-based wavelets (Figure 10(a)) as the features. As described in Lades et al. [1993], the DLA's basic mechanism, in addition to the connection parameter T_ij between two neurons (i, j), is a dynamic variable J_ij. Only the J-variables play the role of synaptic weights for signal transmission. The T-parameters merely act to constrain the J-variables, for example, 0 ≤ J_ij ≤ T_ij. The T-parameters can be changed slowly by long-term synaptic plasticity. The weights J_ij are subject to rapid modification and are controlled by the signal correlations between neurons i and j. Negative signal correlations lead to a decrease, and positive signal correlations lead to an increase, in J_ij. In the absence of any correlation, J_ij slowly returns to a resting state, a fixed fraction of T_ij. Each stored image is formed by picking a rectangular grid of points as graph nodes. The grid is appropriately positioned over the image and is stored with each grid point's locally determined jet (Figure 10(a)), and serves to represent the pattern classes. Recognition of a new image takes place by transforming the image into the grid of jets and matching all stored model graphs to the image. Conformation of the DLA is done by establishing and dynamically modifying links between vertices in the model domain.

The DLA architecture was recently extended to Elastic Bunch Graph Matching [Wiskott et al. 1997] (Figure 10). This is similar to the graph described above, but instead of attaching only a single jet to each node, the authors attached a set


of jets (called the bunch graph representation, Figure 10(b)), each derived from a different face image. To handle the pose variation problem, the pose of the face is first determined using prior class information [Kruger et al. 1997], and the “jet” transformations under pose variation are learned [Maurer and Malsburg 1996a]. Systems based on the EBGM approach have been applied to face detection and extraction, pose estimation, gender classification, sketch-image-based recognition, and general object recognition. The success of the EBGM system may be due to its resemblance to the human visual system [Biederman and Kalocsai 1998].
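The core of the matching step is a jet similarity function and a per-node maximum over the bunch. The following is a minimal sketch under simplifying assumptions: magnitude-only jets, and no phase-based displacement estimation of the kind used by Wiskott et al. [1997].

    import numpy as np

    def jet_similarity(j1, j2):
        # Normalized dot product of jet magnitudes.
        return (j1 @ j2) / (np.linalg.norm(j1) * np.linalg.norm(j2) + 1e-12)

    def bunch_graph_similarity(image_jets, bunch):
        # For each node, take the best match over the example jets in the
        # bunch; the graph similarity is the average over all nodes.
        return float(np.mean([max(jet_similarity(j, b) for b in node_bunch)
                              for j, node_bunch in zip(image_jets, bunch)]))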

3.2.3. Hybrid Approaches. Hybrid approaches use both holistic and local features. For example, the modular eigenfaces approach [Pentland et al. 1994] uses both global eigenfaces and local eigenfeatures.

In Pentland et al. [1994], the capabilities of the earlier system [Turk and Pentland 1991] were extended in several directions. In mugshot applications, usually a frontal and a side view of a person are available; in some other applications, more than two views may be appropriate. One can take two approaches to handling images from multiple views. The first approach pools all the images and constructs a set of eigenfaces that represent all the images from all the views. The other approach uses separate eigenspaces for different views, so that the collection of images taken from each view has its own eigenspace. The second approach, known as view-based eigenspaces, performs better.

The concept of eigenfaces can be extended to eigenfeatures, such as eigeneyes, eigenmouths, etc. Using a limited set of images (45 persons, two views per person, with different facial expressions such as neutral vs. smiling), recognition performance as a function of the number of eigenvectors was measured for eigenfaces only and for the combined representation. For lower-order spaces, the eigenfeatures performed better than

Fig. 11. Comparison of matching: (a) test views, (b) eigenface matches, (c) eigenfeature matches [Pentland et al. 1994].

the eigenfaces [Pentland et al. 1994]; when the combined set was used, only marginal improvement was obtained. These experiments support the claim that feature-based mechanisms may be useful when gross variations are present in the input images (Figure 11).

It has been argued that practical systems should use a hybrid of PCA and LFA (Appendix B in Penev and Atick [1996]). Such a view has long been held in the psychology community [Bruce 1988]. It seems to be better to estimate eigenmodes/eigenfaces that have large eigenvalues (and so are more robust against noise), while for estimating higher-order eigenmodes it is better to use LFA. To support this point, it was argued in Penev and Atick [1996] that the leading eigenpictures are global, integrating, or smoothing filters that are efficient in suppressing noise, while the higher-order modes are ripply or differentiating filters that are likely to amplify noise.

LFA is an interesting biologically inspired feature analysis method [Penev and Atick 1996]. Its biological motivation comes from the fact that, though a huge array of receptors (more than six million cones) exists in the human retina, only a


Fig. 12. LFA kernels K(x_i, y) at different grids x_i [Penev and Atick 1996].

small fraction of them are active, corresponding to natural objects/signals that are statistically redundant [Ruderman 1994]. From the activity of these sparsely distributed receptors, the brain has to discover where and what objects are in the field of view and recover their attributes. Consequently, one expects to represent the natural objects/signals in a subspace of lower dimensionality by finding a suitable parameterization. For a limited class of objects such as faces, which are correctly aligned and scaled, this suggests that even lower dimensionality can be expected [Penev and Atick 1996]. One good example is the successful use of the truncated PCA expansion to approximate frontal face images in a linear subspace [Kirby and Sirovich 1990; Sirovich and Kirby 1987].

Going a step further, the whole face region stimulates a full 2D array of receptors, each of which corresponds to a location in the face, but some of these receptors may be inactive. To explore this redundancy, LFA is used to extract topographic local features from the global PCA modes. Unlike the PCA kernels Φ_i, which contain no topographic information (their supports extend over the entire grid of images), the LFA kernels K(x_i, y) at selected grids x_i (Figure 12) have local support.^10

^10 These kernels (Figure 12), indexed by grids x_i, are similar to the ICA kernels in the first ICA system architecture [Bartlett et al. 1998; Bell and Sejnowski 1995].

The search for the best topographic set of sparsely distributed grids {x_o}, based on reconstruction error, is called sparsification and is described in Penev and Atick [1996]. Two interesting points are demonstrated in this paper: (1) using the same number of kernels, the perceptual reconstruction quality of LFA based on the optimal set of grids is better than that of PCA; for a particular input, the mean square error was 227 for PCA and 184 for LFA; (2) keeping the second PCA eigenmode in the LFA reconstruction reduces the mean square error to 152, suggesting the hybrid use of PCA and LFA. No results on recognition performance based on LFA were reported. LFA is claimed to be used in Visionics's commercial system FaceIt (Table II).

A flexible appearance model based method for automatic face recognition was presented in Lanitis et al. [1995]. To identify a face, both shape and gray-level information are modeled and used. The shape model is an ASM; ASMs are statistical models of the shapes of objects which iteratively deform to fit an example of the shape in a new image. The statistical shape model is trained on example images using PCA, where the variables are the coordinates of the shape model points. For the purpose of classification, the shape variations due to interclass variation are separated from those due to within-class variations (such as small variations in 3D orientation and facial expression) using discriminant analysis. Based on the average shape of the


Fig. 13. The face recognition scheme based on the flexible appearance model [Lanitis et al. 1995]. (Courtesy of A. Lanitis, C. Taylor, and T. Cootes.)

shape model, a global shape-free gray-level model can be constructed, again using PCA.^11 To further enhance the robustness of the system against changes in local appearance such as occlusions, local gray-level models are also built on the shape model points. Simple local profiles perpendicular to the shape boundary are used. Finally, for an input image, all three types of information, including the extracted shape parameters, shape-free image parameters, and local profiles, are used to compute a Mahalanobis distance for classification, as illustrated in Figure 13. Based on training on 10 and testing on 13 images for each of 30 individuals, the classification rate was 92% for the 10 normal testing images and 48% for the three difficult images.
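The final classification step can be sketched as follows. This is a simplified illustration, not the cited system: the three parameter sets are simply concatenated and compared to per-class means under a single pooled inverse covariance, which is our simplification.

    import numpy as np

    def mahalanobis(x, mean, cov_inv):
        d = x - mean
        return float(d @ cov_inv @ d)

    def identify(shape_p, texture_p, profile_p, class_means, cov_inv):
        # Concatenate the three parameter sets and pick the nearest class
        # mean under the Mahalanobis distance.
        x = np.concatenate([shape_p, texture_p, profile_p])
        return min(class_means,
                   key=lambda c: mahalanobis(x, class_means[c], cov_inv))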

The last method [Huang et al. 2003] that we review in this category is based on recent advances in component-based detection/recognition [Heisele et al. 2001] and 3D morphable models [Blanz and Vetter 1999]. The basic idea of component-based methods [Heisele et al. 2001] is to decompose a face into a set of facial components, such as the mouth and eyes, that are interconnected

^11 Recall that in Craw and Cameron [1996] and Moghaddam and Pentland [1997] these shape-free images are used as the inputs to the classifier.

by a flexible geometrical model. (Notice how this method is similar to the EBGM system [Okada et al. 1998; Wiskott et al. 1997], except that gray-scale components are used instead of Gabor wavelets.) The motivation for using components is that changes in head pose mainly lead to changes in the positions of facial components, which can be accounted for by the flexibility of the geometric model. However, a major drawback of the system is that it needs a large number of training images taken from different viewpoints and under different lighting conditions. To overcome this problem, the 3D morphable face model [Blanz and Vetter 1999] is applied to generate arbitrary synthetic images under varying pose and illumination. Only three face images (frontal, semiprofile, profile) of a person are needed to compute the 3D face model. Once the 3D model is constructed, synthetic images of size 58 × 58 are generated for training both the detector and the classifier. Specifically, the faces were rotated in depth from 0° to 34° in 2° increments and rendered with two illumination models (the first model consists of ambient light alone and the second includes ambient light and a rotating point light source) at each pose. Fourteen facial components were used for face detection, but only nine components


that were not strongly overlapped and contained gray-scale structures were used for classification. In addition, the face region was added to the nine components to form a single feature vector (a hybrid method), which was then used to train an SVM classifier [Vapnik 1995]. Training on three images and testing on 200 images per subject led to the following recognition rates on a set of six subjects: 90% for the hybrid method and roughly 10% for the global method that used the face region only; the false positive rate was 10%.

3.3. Summary and Discussion

Face recognition based on still images or captured frames in a video stream can be viewed as 2D image matching and recognition; range images are not available in most commercial/law enforcement applications. Face recognition based on other sensing modalities, such as sketches and infrared images, is also possible. Even though this is an oversimplification of the actual recognition problem of 3D objects based on 2D images, we have focused on this 2D problem, and we will address two important issues about 2D recognition of 3D face objects in Section 6. Significant progress has been achieved on various aspects of face recognition: segmentation, feature extraction, and recognition of faces in intensity images. Recently, progress has also been made on constructing fully automatic systems that integrate all these techniques.

3.3.1. Status of Face Recognition. After more than 30 years of research and development, basic 2D face recognition has reached a mature level and many commercial systems are available (Table II) for various applications (Table I).

Early research on face recognition was primarily focused on the feasibility question, that is: is machine recognition of faces possible? Experiments were usually carried out using datasets consisting of as few as 10 images. Significant advances were made during the mid-1990s, with many methods proposed and tested on datasets consisting of as many as 100

images. More recently, practical methods have emerged that aim at more realistic applications. In the recent comprehensive FERET evaluations [Phillips et al. 2000; Phillips et al. 1998b; Rizvi et al. 1998], aimed at evaluating different systems using the same large database containing thousands of images, the systems described in Moghaddam and Pentland [1997], Swets and Weng [1996b], Turk and Pentland [1991], Wiskott et al. [1997], and Zhao et al. [1998], as well as others, were evaluated. The EBGM system [Wiskott et al. 1997], the subspace LDA system [Zhao et al. 1998], and the probabilistic eigenface system [Moghaddam and Pentland 1997] were judged to be among the top three, with each method showing different levels of performance on different subsets of sequestered images. A brief summary of the FERET evaluations will be presented in Section 5. Recently, more extensive evaluations using commercial systems and thousands of images have been performed in the FRVT 2000 [Blackburn et al. 2001] and FRVT 2002 [Phillips et al. 2003] tests.

3.3.2. Lessons, Facts, and Highlights. During the development of face recognition systems, many lessons have been learned which may provide some guidance in the development of new methods and systems.

—Advances in face recognition have come from considering various aspects of this specialized perception problem. Earlier methods treated face recognition as a standard pattern recognition problem; later methods focused more on the representation aspect, after realizing its uniqueness (using domain knowledge); more recent methods have been concerned with both representation and recognition, so that a robust system with good generalization capability can be built. Face recognition continues to adopt state-of-the-art techniques from learning, computer vision, and pattern recognition. For example, distribution modeling using mixtures of Gaussians, and SVM learning methods, have been used in face detection/recognition.


—Among all face detection/recognition methods, appearance/image-based approaches seem to have dominated up to now. The main reason is the strong prior that all face images belong to a face class. An important example is the use of PCA for the representation of holistic features. To overcome sensitivity to geometric change, local appearance-based approaches, 3D-enhanced approaches, and hybrid approaches can be used. The most recent advances toward fast 3D data acquisition and accurate 3D recognition are likely to influence future developments.^12

—The methodological difference between face detection and face recognition may not be as great as it appears to be. We have observed that the multiclass face recognition problem can be converted into a two-class “detection” problem by using image differences [Moghaddam and Pentland 1997], and the face detection problem can be converted into a multiclass “recognition” problem by using additional nonface clusters of negative samples [Sung and Poggio 1997].

—It is well known that for face detection, the image size can be quite small. But what about face recognition? Clearly the image size cannot be too small for methods that depend heavily on accurate feature localization, such as graph matching methods [Okada et al. 1998]. However, it has been demonstrated that the image size can be very small for holistic face recognition: 12 × 11 for the subspace LDA system [Zhao et al. 1999], 14 × 10 for the PDBNN system [Lin et al. 1997], and 18 × 24 for human perception [Bachmann 1991]. Some authors have argued that there exists a universal face subspace of fixed dimension; hence for holistic recognition, image size does not matter as long as it exceeds the subspace dimensionality [Zhao et al. 1999]. This claim has been supported by limited experiments using normalized face images of different sizes, for

^12 Early work using range images was reported in Gordon [1991].

example, from 12 × 11 to 48 × 42, to obtain different face subspaces [Zhao 1999]. Indeed, slightly better performance was observed when smaller images were used. One reason is that the signal-to-noise ratio improves with the decrease in image size.

—Accurate feature location is critical for good recognition performance. This is true even for holistic matching methods, since accurate location of key facial features such as the eyes is required to normalize the detected face [Yang et al. 2002; Zhao 1999]. This was also verified in Lin et al. [1997], where the use of smaller images led to slightly better performance due to increased tolerance to location errors. In Martinez [2002], a systematic study of this issue was presented.

—Regarding the debate in the psychology community about whether face recognition is a dedicated process, the recent success of machine systems that are trained on large numbers of samples seems to confirm recent findings suggesting that human recognition of faces may not be unique/dedicated, but instead needs extensive training.

—When comparing different systems, we should pay close attention to implementation details. Different implementations of a PCA-based face recognition algorithm were compared in Moon and Phillips [2001]. One class of variations examined was the use of seven different distance metrics in the nearest-neighbor classifier, which was found to be the most critical element. This raises the question of what is more important in algorithm performance, the representation or the specifics of the implementation. Implementation details often determine the performance of a system. For example, input images are normalized only with respect to translation, in-plane rotation, and scale in Belhumeur et al. [1997], Swets and Weng [1996b], Turk and Pentland [1991], and Zhao et al. [1998], whereas in Moghaddam and Pentland [1997] the normalization also includes masking and affine warping to align the


shape. In Craw and Cameron [1996], manually selected points are used to warp the input images to the mean shape, yielding shape-free images. Because of this difference, PCA was a good classifier in Moghaddam and Pentland [1997] for the shape-free representations, but it may not be as good for the simply normalized representations. Recently, systematic comparisons and independent reevaluations of existing methods have been published [Beveridge et al. 2001]. This is beneficial to the research community. However, since the methods need to be reimplemented, and not all the details in the original implementations can be taken into account, it is difficult to carry out absolutely fair comparisons.

—Over 30 years of research has provided us with a vast number of methods and systems. Recognizing the fact that each method has its advantages and disadvantages, we should select methods and systems appropriate to the application. For example, local feature based methods cannot be applied when the input image contains a small face region, say 15 × 15. Another issue is when to use PCA and when to use LDA in building a system. Apparently, when the number of training samples per class is large, LDA is the best choice. On the other hand, if only one or two samples are available per class (a degenerate case for LDA), PCA is a better choice. For a more detailed comparison of PCA versus LDA, see Beveridge et al. [2001] and Martinez and Kak [2001]. One way to unify PCA and LDA is to use regularized subspace LDA [Zhao et al. 1999].

3.3.3. Open Research Issues. Though machine recognition of faces from still images has achieved a certain level of success, its performance is still far from that of human perception. Specifically, we can list the following open issues:

—Hybrid face recognition systems that use both holistic and local features resemble the human perceptual system. While the holistic approach provides a

quick recognition method, the discriminant information that it provides may not be rich enough to handle very large databases. This insufficiency can be compensated for by local feature methods. However, many questions need to be answered before we can build such a combined system. One important question is how to arbitrate the use of holistic and local features. As a first step, a simple, naive engineering approach would be to weight the features. But how to determine whether and how to use the features remains an open problem.

—The challenge of developing face detection techniques that report not only the presence of a face but also the accurate locations of facial features under large pose and illumination variations still remains. Without accurate localization of important features, accurate and robust face recognition cannot be achieved.

—How to model face variation in realistic settings, for example, outdoor environments and natural aging, is still challenging.

4. FACE RECOGNITION FROM IMAGE SEQUENCES

A typical video-based face recognition system automatically detects face regions, extracts features from the video, and recognizes facial identity if a face is present. In surveillance, information security, and access control applications, face recognition and identification from a video sequence is an important problem. Face recognition based on video is preferable to using still images, since, as demonstrated in Bruce et al. [1998] and Knight and Johnston [1997], motion helps in the recognition of (familiar) faces when the images are negated, inverted, or thresholded. It was also demonstrated that humans can recognize animated faces better than randomly rearranged images from the same set. Though recognition of faces from video sequences is a direct extension of still-image-based recognition, in our opinion, true video-based face recognition techniques that coherently use both spatial and temporal information started only a few years ago


and still need further investigation. Significant challenges for video-based recognition still exist; we list several of them here.

(1) The quality of video is low. Usually, video acquisition occurs outdoors (or indoors but with bad conditions for video capture) and the subjects are not cooperative; hence there may be large illumination and pose variations in the face images. In addition, partial occlusion and disguise are possible.

(2) Face images are small. Again, due to the acquisition conditions, the face image sizes are smaller (sometimes much smaller) than the assumed sizes in most still-image-based face recognition systems. For example, the valid face region can be as small as 15 × 15 pixels,^13 whereas the face image sizes used in feature-based still-image-based systems can be as large as 128 × 128. Small-size images not only make the recognition task more difficult, but also affect the accuracy of face segmentation, as well as the accurate detection of the fiducial points/landmarks that are often needed in recognition methods.

(3) The characteristics of faces/human body parts. During the past 8 years, research on human action/behavior recognition from video has been very active and fruitful. Generic description of human behavior not particular to an individual is an interesting and useful concept. One of the main reasons for the feasibility of generic descriptions of human behavior is that the intraclass variation of human bodies, and in particular faces, is much smaller than the difference between the objects inside and outside the class. For the same reason, recognition of individuals within the class is difficult. For example, detecting and localizing faces is typically much easier than recognizing a specific face.

^13 Notice that this is totally different from the situation where we have images with large face regions but the final face region fed into a classifier is 15 × 15.

Before we examine existing video-based face recognition algorithms, we briefly review three closely related techniques: face segmentation and pose estimation, face tracking, and face modeling. These techniques are critical for the realization of the full potential of video-based face recognition.

4.1. Basic Techniques of Video-Based Face Recognition

In Chellappa et al. [1995], four computer vision areas were mentioned as being important for video-based face recognition: segmentation of moving objects (humans) from a video sequence; structure estimation; 3D models for faces; and nonrigid motion analysis. For example, in Jebara et al. [1998] a face modeling system which estimates facial features and texture from a video stream was described. This system utilizes all four techniques: segmentation of the face based on skin color to initiate tracking; use of a 3D face model based on laser-scanned range data to normalize the image (by facial feature alignment and texture mapping to generate a frontal view) and construction of an eigensubspace for 3D heads; use of structure from motion (SfM) at each feature point to provide depth information; and nonrigid motion analysis of the facial features based on simple 2D SSD (sum of squared differences) tracking constrained by a global 3D model. Based on the current development of video-based face recognition, we think it is better to review three specific face-related techniques instead of the above four general areas. The three video-based face-related techniques are: face segmentation and pose estimation, face tracking, and face modeling.

4.1.1. Face Segmentation and Pose Estimation. Early attempts [Turk and Pentland 1991] at segmenting moving faces from an image sequence used simple pixel-based change detection procedures based on difference images. These techniques may run into difficulties when multiple moving objects and occlusion are present. More sophisticated methods use estimated flow


fields for segmenting humans in motion [Shio and Sklansky 1991]. More recent methods [Choudhury et al. 1999; McKenna and Gong 1998] have used motion and/or color information to speed up the process of searching for possible face regions. After candidate face regions are located, still-image-based face detection techniques can be applied to locate the faces [Yang et al. 2002]. Given a face region, important facial features can be located. The locations of feature points can be used for pose estimation, which is important for synthesizing a virtual frontal view [Choudhury et al. 1999]. Newly developed segmentation methods locate the face and estimate its pose simultaneously without extracting features [Gu et al. 2001; Li et al. 2001b]. This is achieved by learning multiview face examples which are labeled with manually determined pose angles.
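The simplest of these segmentation cues, pixel-based change detection from difference images, takes only a few lines. This is a toy sketch; the threshold value is illustrative and real systems add morphological cleanup and connected-component analysis.

    import numpy as np

    def moving_region(prev_frame, frame, thresh=25):
        # Threshold the absolute frame difference and return the bounding
        # box of the changed pixels (None if nothing moved).
        diff = np.abs(frame.astype(int) - prev_frame.astype(int))
        ys, xs = np.nonzero(diff > thresh)
        if xs.size == 0:
            return None
        return xs.min(), ys.min(), xs.max(), ys.max()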

4.1.2. Face and Feature Tracking. After faces are located, the faces and their features can be tracked. Face tracking and feature tracking are critical for reconstructing a face model (depth) through SfM, and feature tracking is essential for facial expression recognition and gaze recognition. Tracking also plays a key role in spatiotemporal-based recognition methods [Li and Chellappa 2001; Li et al. 2001a] which directly use the tracking information.

In its most general form, tracking is essentially motion estimation. However, general motion estimation has fundamental limitations such as the aperture problem. For images like faces, some regions are too smooth to estimate flow accurately, and sometimes the change in local appearance is too large to give reliable flow. Fortunately, these problems are alleviated thanks to face modeling, which exploits domain knowledge. In general, tracking and modeling are dual processes: tracking is constrained by a generic 3D model or a learned statistical model under deformation, and individual models are refined through tracking. Face tracking can be roughly divided into three categories:

(1) head tracking, which involves tracking the motion of a rigid object that is performing rotations and translations; (2) facial feature tracking, which involves tracking nonrigid deformations that are limited by the anatomy of the head, that is, articulated motion due to speech or facial expressions and deformable motion due to muscle contractions and relaxations; and (3) complete tracking, which involves tracking both the head and the facial features.

Early efforts focused on the first two problems: head tracking [Azarbayejani et al. 1993] and facial feature tracking [Terzopoulos and Waters 1993; Yuille and Hallinan 1992]. In Azarbayejani et al. [1993], an approach to head tracking using points with high Hessian values was proposed. Several such points on the head are tracked and the 3D motion parameters of the head are recovered by solving an overconstrained set of motion equations. Facial feature tracking methods may make use of the feature boundary or the feature region. Feature boundary tracking attempts to track and accurately delineate the shape of the facial feature, for example, to track the contours of the lips and mouth [Terzopoulos and Waters 1993]. Feature region tracking addresses the simpler problem of tracking a region such as a bounding box that surrounds the facial feature [Black et al. 1995].

In Black et al. [1995], a tracking system based on local parameterized models is used to recognize facial expressions. The models include a planar model for the head, local affine models for the eyes, and local affine models and curvature for the mouth and eyebrows. A face tracking system was used in Maurer and Malsburg [1996b] to estimate the pose of the face. This system used a graph representation with about 20–40 nodes/landmarks to model the face. Knowledge about faces is used to find the landmarks in the first frame. Two tracking systems described in Jebara et al. [1998] and Strom et al. [1999] model faces completely with texture and geometry. Both systems use generic 3D models and SfM to recover the face structure. Jebara et al. [1998] relied on fixed feature points (eyes, nose tip),


Table IV. Categorization of Video-Based Face Recognition Techniques

Approach                  Representative work
Still-image methods       Basic methods [Turk and Pentland 1991; Lin et al. 1997; Moghaddam and Pentland 1997; Okada et al. 1998; Penev and Atick 1996; Wechsler et al. 1997; Wiskott et al. 1997]
                          Tracking-enhanced [Edwards et al. 1998; McKenna and Gong 1997, 1998; Steffens et al. 1998]
Multimodal methods        Video- and audio-based [Bigun et al. 1998; Choudhury et al. 1999]
Spatiotemporal methods    Feature trajectory-based [Li and Chellappa 2001; Li et al. 2001a]
                          Video-to-video methods [Zhou et al. 2003]

while Strom et al. [1999] tracked only points with high Hessian values. Also, Jebara et al. [1998] tracked 2D features in 3D by deforming them, while Strom et al. [1999] relied on direct comparison of a 3D model to the image. Methods have been proposed in Black et al. [1998] and Hager and Belhumeur [1998] to solve the varying appearance (both geometry and photometry) problem in tracking. Some of the newest model-based tracking methods calculate the 3D motions and deformations directly from image intensities [Brand and Bhotika 2001], thus eliminating the information-lossy intermediate representations.
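As one concrete example of the low-level machinery involved, the SSD (sum of squared differences) matching mentioned earlier for feature-region tracking can be sketched as an exhaustive local search. This is a toy version with illustrative parameters; real trackers add coarse-to-fine search, motion prediction, and template updating.

    import numpy as np

    def ssd_track(frame, template, prev_xy, search=10):
        # Exhaustive search in a (2*search+1)^2 window around the previous
        # location for the patch minimizing the sum of squared differences.
        h, w = template.shape
        px, py = prev_xy
        best_score, best_xy = np.inf, prev_xy
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                x, y = px + dx, py + dy
                if x < 0 or y < 0:
                    continue                         # outside the frame
                patch = frame[y:y + h, x:x + w]
                if patch.shape != template.shape:
                    continue                         # outside the frame
                score = np.sum((patch.astype(float) - template) ** 2)
                if score < best_score:
                    best_score, best_xy = score, (x, y)
        return best_xy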

4.1.3. Face Modeling. Modeling of faces includes 3D shape modeling and texture modeling. For large texture variations due to changes in illumination, we will address the illumination problem in Section 6. Here we focus on 3D shape modeling. 3D models of faces have been employed in the graphics, animation, and model-based image compression literature. More complicated models are used in applications such as forensic face reconstruction from partial information.

In computer vision, one of the most widely used methods of estimating 3D shape from a video sequence is SfM, which estimates the 3D depths of interesting points. The unconstrained SfM problem has been approached in two ways. In the differential approach, one computes some type of flow field (optical, image, or normal) and uses it to estimate the depths of visible points. The difficulty in this approach is reliable computation of the flow field. In the discrete approach, a set of features such as points, edges, corners, lines, or contours is tracked over a sequence

of frames, and the depths of these features are computed. To overcome the difficulty of feature tracking, bundle adjustment [Triggs et al. 2000] can be used to obtain better and more robust results.

Recently, multiview-based 2D methods have gained popularity. In Li et al. [2001b], a model consisted of a sparse 3D shape model learned from 2D images labeled with pose and landmarks, a shape-and-pose-free texture model, and an affine geometrical model. An alternative approach is to use 3D models such as the deformable model of DeCarlo and Metaxas [2000] or the linear 3D object class model of Blanz and Vetter [1999]. (In Blanz and Vetter [1999] a morphable 3D face model consisting of shape and texture was directly matched to single/multiple input images; as a consequence, head orientation, illumination conditions, and other parameters could be free variables subject to optimization.) In Blanz and Vetter [1999], real-time 3D modeling and tracking of faces was described; a generic 3D head model was aligned to match frontal views of the face in a video sequence.

4.2. Video-Based Face Recognition

Historically, video face recognition originated from still-image-based techniques (Table IV). That is, the system automatically detects and segments the face from the video, and then applies still-image face recognition techniques. Many methods reviewed in Section 3 belong to this category: eigenfaces [Turk and Pentland 1991], probabilistic eigenfaces [Moghaddam and Pentland 1997], the EBGM method [Okada et al. 1998; Wiskott et al. 1997], and the PDBNN method [Lin et al. 1997]. An improvement over these methods is to apply tracking; this can help


in recognition, in that a virtual frontal view can be synthesized via pose and depth estimation from video. Due to the abundance of frames in a video, another way to improve the recognition rate is the use of “voting” based on the recognition results from each frame. The voting can be deterministic, but probabilistic voting is better in general [Gong et al. 2000; McKenna and Gong 1998]. One drawback of such voting schemes is the expense of computing the deterministic/probabilistic results for each frame.
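A probabilistic voting scheme of this kind reduces, in its simplest form, to accumulating per-frame log-likelihoods. The sketch below is a toy illustration; how the per-frame scores P(frame | identity) are produced is left to the underlying still-image matcher.

    import numpy as np

    def identify_sequence(frame_scores):
        # frame_scores: one dict per frame mapping identity -> P(frame | id).
        totals = {}
        for scores in frame_scores:
            for identity, p in scores.items():
                totals[identity] = totals.get(identity, 0.0) + np.log(p + 1e-12)
        return max(totals, key=totals.get)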

The next phase of video-based face recognition will be the use of multimodal cues. Since humans routinely use multiple cues to recognize identities, it is expected that a multimodal system will do better than systems based on faces only. More importantly, using multimodal cues offers a comprehensive solution to the task of identification that might not be achievable by using face images alone. For example, in a totally noncooperative environment, such as a robbery, the face of the robber is typically covered, and the only way to perform faceless identification might be to analyze body motion characteristics [Klasen and Li 1998]. Excluding fingerprints, face and voice are the most frequently used cues for identification. They have been used in many multimodal systems [Bigun et al. 1998; Choudhury et al. 1999]. Since 1997, a dedicated conference focused on video- and audio-based person authentication has been held every other year.

More recently, a third phase of video face recognition has started. These methods [Li and Chellappa 2001; Li et al. 2001a] coherently exploit both spatial information (in each frame) and temporal information (such as the trajectories of facial features). A big difference between these methods and the probabilistic voting methods [McKenna and Gong 1998] is the use of representations in a joint temporal and spatial space for identification.

We first review systems that apply still-image-based recognition to selected frames, and then multimodal systems. Finally, we review systems that use spatial and temporal information simultaneously.

In Wechsler et al. [1997], a fully automatic person authentication system was described which included video break, face detection, and authentication modules. Video skimming was used to reduce the number of frames to be processed. The video break module, corresponding to key-frame detection based on object motion, consisted of two units. The first unit implemented a simple optical flow method; it was used when the image SNR level was low. When the SNR level was high, simple pairwise frame differencing was used to detect the moving object. The face detection module consisted of three units: face localization using analysis of projections along the x- and y-axes; face region labeling using a decision tree learned from positive and negative examples taken from 12 images, each consisting of 2759 windows of size 8 × 8; and face normalization based on the numbers of face region labels. The normalized face images were then used for authentication, using an RBF network. This system was tested on three image sequences; the first was taken indoors with one subject present, the second was taken outdoors with two subjects, and the third was taken outdoors with one subject under stormy conditions. Perfect results were reported on all three sequences, as verified against a database of 20 still face images.

An access control system based on person authentication was described in McKenna and Gong [1997]. The system combined two complementary visual cues: motion and facial appearance. In order to reliably detect significant motion, spatiotemporal zero crossings computed from six consecutive frames were used. These motions were grouped into moving objects using a clustering algorithm, and Kalman filters were employed to track the grouped objects. An appearance-based face detection scheme using RBF networks (similar to that discussed in Rowley et al. [1998]) was used to confirm the presence of a person. The face detection scheme was “bootstrapped” using motion and object detection to provide an approximate head region. Face tracking based on the RBF network was used to provide feedback to the motion clustering process to help deal


Fig. 14. Varying the most significant identity parameters (top) and manipulating residual variation without affecting identity (bottom) [Edwards et al. 1998].

with occlusions. Good tracking results were demonstrated. In McKenna and Gong [1998], this work was extended to person authentication using PCA or LDA. The authors argued that recognition based on selected frames is not adequate since important information is discarded. Instead, they proposed a probabilistic voting scheme; that is, face identification was carried out continuously. Though they gave examples demonstrating improved performance in identifying 8 or 15 people by using sequences, no performance statistics were reported.

An appearance model based method for video tracking and enhanced identification was proposed in Edwards et al. [1998]. The appearance model is a combination of the active shape model (ASM) [Cootes et al. 1995] and the shape-free texture model obtained after warping the face into a mean shape. Unlike Lanitis et al. [1995], which used the two models separately, the authors used a combined set of parameters for both models. The main contribution was the decomposition of the combined model parameters into an identity subspace and an orthogonal residual subspace using linear discriminant analysis. (See Figure 14 for an illustration of separating identity and residue.) The residual subspace would ideally contain intraperson variations caused by pose, lighting, and expression. In addition, they pointed out that optimal separation of identity and residue is class-specific. For example, the appearance change of a person's nose depends on its length, which is a person-specific quantity. To learn this

class-specific information, a sequence of images of the same class was used. Specifically, a linear mapping was assumed to capture the relation between the class-specific correction to the identity subspace and the intraperson variation in the residual subspace. Examples of face tracking and visual enhancement were demonstrated, but no recognition experiments were reported. Though this method is believed to enhance tracking and make it robust against appearance change, it is not clear how efficient it is to learn the class-specific information from a video sequence that does not present much residual variation.

In Steffens et al. [1998], a system called PersonSpotter was described. This system is able to capture, track, and recognize a person walking toward or passing a stereo CCD camera. It has several modules, including a head tracker, preselector, landmark finder, and identifier. The head tracker determines the image regions that are changing due to object motion based on simple image differences. A stereo algorithm then determines the stereo disparities of these moving pixels. The disparity values are used to compute histograms for image regions. Regions within a certain disparity interval are selected and referred to as silhouettes. Two types of detectors, skin-color based and convex-region based, are applied to these silhouette images. The outputs of these detectors are clustered to form regions of interest which usually correspond to heads. To track a head robustly, temporal continuity is exploited

in the form of the thresholds used to initiate, track, and delete an object.

To find the face region in an image, the preselector uses a generic sparse graph consisting of 16 nodes learned from eight example face images. The landmark finder uses a dense graph consisting of 48 nodes learned from 25 example images to find landmarks such as the eyes and the nose tip. Finally, an elastic graph matching scheme is employed to identify the face. A recognition rate of about 90% was achieved; the size of the database is not known.

A multimodal person recognition system was described in Choudhury et al. [1999]. This system consists of a face recognition module, a speaker identification module, and a classifier fusion module. It has the following characteristics: (1) the face recognition module can detect and compensate for pose variations, and the speaker identification module can detect and compensate for changes in the auditory background; (2) the most reliable video frames and audio clips are selected for recognition; (3) 3D information about the head obtained through SfM is used to detect the presence of an actual person as opposed to an image of that person.

Two key parts of the face recognition module are face detection/tracking and eigenface recognition. The face is detected using skin color information, using a learned model of a mixture of Gaussians. The facial features are then located using symmetry transforms and image intensity gradients. Correlation-based methods are used to track the feature points. The locations of these feature points are used to estimate the pose of the face. This pose estimate and a 3D head model are used to warp the detected face image into a frontal view. For recognition, the feature locations are refined and the face is normalized with eyes and mouth in fixed locations. Images from the face tracker are used to train a frontal eigenspace, and the leading 35 eigenvectors are retained. Face recognition is then performed using a probabilistic eigenface approach where the projection coefficients of all images of

each person are modeled as a Gaussian distribution.
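As a rough sketch of this probabilistic eigenface step (our simplification, assuming diagonal covariances; the function names are ours, not the authors'), one might fit a Gaussian per person to the projection coefficients and identify a probe by maximum likelihood:

```python
import numpy as np

def fit_person_models(coeffs_by_person):
    """Fit a diagonal Gaussian to each person's eigenspace projection
    coefficients; maps person id -> (n_images, m) coefficient array."""
    models = {}
    for person, c in coeffs_by_person.items():
        models[person] = (c.mean(axis=0), c.var(axis=0) + 1e-6)
    return models

def log_likelihood(coeff, mean, var):
    # Diagonal-Gaussian log-density, dropping the constant term.
    return -0.5 * float(np.sum((coeff - mean) ** 2 / var + np.log(var)))

def identify(coeff, models):
    """Return the person whose Gaussian assigns the probe's
    coefficients the highest likelihood."""
    return max(models, key=lambda p: log_likelihood(coeff, *models[p]))
```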

Finally, the face and speaker recognition modules are combined using a Bayes net. The system was tested in an ATM scenario, a controlled environment. An ATM session begins when the subject enters the camera's field of view and the system detects his/her face. The system then greets the user and begins the banking transaction, which involves a series of questions by the system and answers by the user. Data for 26 people were collected; the normalized face images were 40 × 80 pixels and the audio was sampled at 16 kHz. These experiments on small databases in well-controlled environments showed that the combination of audio and video improved performance, and that 100% recognition and verification were achieved when the image/audio clips with the highest confidence scores were used.

In Li and Chellappa [2001], a face verification system based on tracking facial features was presented. The basic idea of this approach is to exploit the temporal information available in a video sequence to improve face recognition. First, the feature points defined by Gabor attributes on a regular 2D grid are tracked. Then, the trajectories of these tracked feature points are exploited to identify the person present in a short video sequence. The proposed tracking-for-verification scheme differs from a pure tracking scheme in that one template face from a database of known persons is selected for tracking. For each template with a specific personal ID, tracking can be performed and trajectories can be obtained. Based on the characteristics of these trajectories, identification can be carried out. According to the authors, the trajectories of the same person are more coherent than those of different persons, as illustrated in Figure 15. Such characteristics can also be observed in the posterior probabilities over time by assuming different classes. In other words, the posterior probabilities for the true hypothesis tend to be higher than those for false hypotheses. This in turn can be used for identification.

Fig. 15. Corresponding feature points obtained from 20 frames: (a) result of matching the same person to a video, (b) result of matching a different person to the video, (c) trajectories of (a), (d) trajectories of (b) [Li and Chellappa 2001].

Testing results on a small database of 19 individuals suggested that performance compares favorably with a frame-to-frame matching and voting scheme, especially in the case of large lighting changes. The testing result is based on comparison with alternative hypotheses.

Some details about the tracking algorithm are as follows [Li and Chellappa 2001]. The motion of facial feature points is modeled as a global two-dimensional (2D) affine transformation (accounting for head motion) plus a local deformation (accounting for residual motion that is due to inaccuracies in the 2D affine modeling and other factors such as facial expression). The tracking problem was formulated as a Bayesian inference problem, and sequential importance sampling (SIS) [Liu and Chen 1998] (one form of SIS is called Condensation [Isard and Blake 1996] in the computer vision literature) was proposed as an empirical solution to the inference problem. Since SIS has difficulty in high-dimensional spaces, a reparameterization that captures essentially only the difference was used to facilitate the computation.
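A generic SIS (particle-filter) step conveys the flavor of such a tracker. The sketch below is ours: the random-walk motion model and resampling rule are simplifications, and it omits the authors' reparameterization; `likelihood` stands for an image-matching term supplied by the caller:

```python
import numpy as np

rng = np.random.default_rng(0)

def sis_step(particles, weights, motion_std, likelihood):
    """One SIS step over affine-parameter particles (shape (N, 6)):
    propagate through a random-walk motion model, reweight by an
    image likelihood, and resample when the weights degenerate."""
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    weights = weights * np.array([likelihood(p) for p in particles])
    weights = weights / weights.sum()
    # Resample if the effective sample size falls below N / 2.
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(weights):
        idx = rng.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights
```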

While most face recognition algorithms take still images as probe inputs, a video-based face recognition approach that takes video sequences as inputs has recently been developed [Zhou et al. 2003]. Since the detected face might be moving in the video sequence, one has to deal with uncertainty in tracking as well as in recognition. Rather than resolving these two uncertainties separately, Zhou et al. [2003] performed simultaneous tracking and recognition of human faces from a video sequence.

In still-to-video face recognition, where the gallery consists of still images, a time series state space model is proposed to fuse temporal information in a probe video; the model simultaneously characterizes the kinematics and identity using a motion vector and an identity variable, respectively. The joint posterior distribution of the motion vector and the identity variable is estimated at each time instant and then propagated to the next time instant. Marginalization over the motion vector yields a robust estimate of the posterior distribution of the identity variable, and marginalization over the identity variable yields a robust estimate of the posterior distribution of the motion vector, so that tracking and recognition are handled simultaneously. A computationally efficient sequential importance sampling (SIS) algorithm is used to estimate the posterior distribution. Empirical results demonstrate that, due to the propagation of the identity variable over time, degeneracy in the posterior probability of the identity variable is achieved, giving improved recognition. The gallery is generalized to videos in order to realize video-to-video face recognition. An exemplar-based learning strategy is employed to automatically select video representatives from the gallery, serving as mixture centers in an updated likelihood measure. The SIS algorithm is used to approximate the posterior distribution of the motion vector, the identity variable, and the exemplar index. The marginal distribution of the identity variable produces the recognition result.
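In schematic form (our paraphrase, not the paper's exact notation, with θ_t the motion vector, n the identity variable, and z_{1:t} the frames observed so far), the propagation and marginalization steps read:

```latex
\begin{align*}
p(n, \theta_t \mid z_{1:t}) &\propto p(z_t \mid n, \theta_t)
  \int p(\theta_t \mid \theta_{t-1})\,
       p(n, \theta_{t-1} \mid z_{1:t-1})\, d\theta_{t-1},\\
p(n \mid z_{1:t}) &= \int p(n, \theta_t \mid z_{1:t})\, d\theta_t
  \quad \text{(recognition by marginalizing out the motion).}
\end{align*}
```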

Fig. 16. Identity surface [Li et al. 2001a]. (Courtesy of Y. Li, S. Gong, and H. Liddell.)

The model formulation is very general and allows a variety of image representations and transformations. Experimental results using images/videos collected at UMD, NIST/USF, and CMU with pose/illumination variations have illustrated the effectiveness of this approach in both still-to-video and video-to-video scenarios with appropriate model choices.

In Li et al. [2001a], a multiview-based face recognition system was proposed to recognize faces from videos with large pose variations. To address the challenging pose issue, the concept of an identity surface that captures joint spatial and temporal information was used. An identity surface is a hypersurface formed by projecting all the images of one individual onto the discriminating feature space parameterized on head pose (Figure 16).¹⁴

To characterize the head pose, two angles, yaw and tilt, are used as basis coordinates in the feature space. As plotted in Figure 16, the other basis coordinates represent discriminating feature patterns of faces; this will be discussed later. Based on recovered pose information, a trajectory of the input feature pattern can be constructed. The trajectories of features from known subjects arranged in the same temporal order can be synthesized on their respective identity surfaces. To recognize a face across views over time, the trajectory for the input face is matched to the trajectories synthesized for the known subjects. This approach can be thought of as a generalized version of face recognition based on single images taken at different poses.

¹⁴ Notice that this view-based idea has already been explored, for example, in Pentland et al. [1994].

Experimental results using twelve training sequences, each containing one subject, and new testing sequences of these subjects were reported. Recognition rates were 100% and 93.9%, using 10 and 2 KDA (kernel discriminant analysis) vectors, respectively.

Other techniques have also been used to construct the discriminating basis in the identity surface: kernel discriminant analysis (KDA) [Mika et al. 1999] was used to compute a nonlinear discriminating basis, and a dynamic face model was used to extract a shape-and-pose-free facial texture pattern. The multiview dynamic face model [Li et al. 2001b] consists of a sparse Point Distribution Model (PDM) [Cootes et al. 1995], a shape-and-pose-free texture model, and an affine geometrical model. The 3D shape vector of a face is estimated from a set of 2D face images in different views using landmark points. Then a face image fitted by the shape model is warped to the mean shape in a frontal view, yielding a shape-and-pose-free texture pattern.¹⁵ When part of a face is invisible in an image due to rotation in depth, the facial texture is recovered from the visible side of the face using the bilateral symmetry of faces. To obtain a low-dimensional statistical model, PCA was applied to the 3D shape patterns and shape-and-pose-free texture patterns separately. To further suppress within-class variations, the shape-and-pose-free texture patterns were further projected into a KDA feature space. Finally, the identity surface can be approximated and constructed from discrete samples at fixed poses using a piecewise planar model.

4.3. Summary

The availability of video/image sequences gives video-based face recognition a distinct advantage over still-image-based face recognition: the abundance of temporal information. However, the typically low-quality images in video present a significant challenge: the loss of spatial information.

¹⁵ Notice that this procedure is very similar to AAM [Cootes et al. 2001].

The key to building a successful video-based system is to use temporal information to compensate for the lost spatial information. For example, a high-resolution frame can in principle be reconstructed from a sequence of low-resolution video frames and used for recognition. A further step is to use the image sequence to reconstruct the 3D shape of the tracked face object via SfM and thus enhance face recognition performance. Finally, a comprehensive approach is to use spatial and temporal information simultaneously for face recognition. This is also supported by related psychological studies.

However, many issues remain for existing systems:

—SfM is a common technique used in computer vision for recovering 3D information from video sequences. However, a major obstacle exists to applying this technique in face recognition: the accuracy of 3D shape recovery. Face images contain smooth, textureless regions and are often acquired under varying illumination,¹⁶ resulting in significant difficulties in accurate recovery of 3D information. The accuracy issue may not be very important for face detection, but it is for face recognition, which must differentiate the 3D shapes of similar objects. One possible solution is the complementary use of shape-from-shading, which can utilize the illumination information. A recent paper on using flow-based SfM techniques for face modeling is Roy Chowdhury and Chellappa [2003].

—Up to now, the databases used in many systems have been very small, say 20 subjects. This is partially due to the tremendous amount of storage space needed for video sequences. Fortunately, relatively large video databases exist, for example, the XM2VTS database [Messer et al. 1999], the BANCA database [Bailly-Bailliere et al. 2003], and the addition of video into the FERET and FRVT 2002 databases.

¹⁶ Stereo is less sensitive to illumination change but still has difficulty in handling textureless regions.

However, large-scale systematic evaluations are still lacking.

—Although we argue that it is best to use both temporal and spatial information for face recognition, existing spatiotemporal methods have not yet shown their full potential. We believe that these types of methods deserve further investigation.

During the past 8 years, recognition of human behavior has been actively studied: facial expression recognition, hand gesture recognition, activity recognition, etc. As pointed out earlier, descriptions of human behavior are useful and are easier to obtain than recognition of faces. Often they provide complementary information for face recognition or additional cues useful for identification. In principle, both gender classification and facial expression recognition can assist in the classification of identity. For recent reviews on facial expression recognition, see Donato et al. [1999] and Pantic and Rothkrantz [2000]. We also believe that analysis of body movements such as gait or hand gestures can help in person recognition.

5. EVALUATION OF FACE RECOGNITION SYSTEMS

Given the numerous theories and techniques that are applicable to face recognition, it is clear that evaluation and benchmarking of these algorithms is crucial. Previous work on the evaluation of OCR and fingerprint classification systems provided insights into how the evaluation of algorithms and systems can be performed efficiently. One of the most important facts learned in these evaluations is that large sets of test images are essential for adequate evaluation. It is also extremely important that the samples be statistically as similar as possible to the images that arise in the application being considered. Scoring should be done in a way that reflects the costs of errors in recognition. Reject-error behavior should be studied, not just forced recognition.

In planning an evaluation, it is important to keep in mind that the operation of a pattern recognition system is statistical, with measurable distributions of success and failure.

These distributions are very application-dependent, and no theory seems to exist that can predict them for new applications. This strongly suggests that an evaluation should be based as closely as possible on a specific application.

During the past 5 years, several large, publicly available face databases have been collected and corresponding testing protocols have been designed. The series of FERET evaluations [Phillips et al. 2000b, 1998; Rizvi et al. 1998]¹⁷ attracted nine institutions and companies to participate. They were succeeded by the series of FRVT vendor tests. We describe here the most important face databases and their associated evaluation methods, including the XM2VTS and BANCA [Bailly-Bailliere et al. 2003] databases.

5.1. The FERET Protocol

Until recently, there did not exist a common FRT evaluation protocol that included large databases and standard evaluation methods. This made it difficult to assess the status of FRT for real applications, even though many existing systems reported almost perfect performance on small databases.

The first FERET evaluation test was administered in August 1994 [Phillips et al. 1998b]. This evaluation established a baseline for face recognition algorithms, and was designed to measure performance of algorithms that could automatically locate, normalize, and identify faces. This evaluation consisted of three tests, each with a different gallery and probe set. (A gallery is a set of known individuals, while a probe is a set of unknown faces presented for recognition.) The first test measured identification performance from a gallery of 316 individuals with one image per person; the second was a false-alarm test; and the third measured the effects of pose changes on performance.

¹⁷ http://www.itl.nist.gov/iad/humanid/feret/.

The second FERET evaluation was administered in March 1995; it consisted of a single test that measured identification performance from a gallery of 817 individuals, and included 463 duplicates in the probe set [Phillips et al. 1998b]. (A duplicate is a probe for which the corresponding gallery image was taken on a different day; there were only 60 duplicates in the Aug94 evaluation.) The third and last evaluation (Sep96) was administered in September 1996 and March 1997.

5.1.1. Database. Currently, the FERET database is the only large database that is generally available to researchers without charge. The images in the database were initially acquired with a 35-mm camera and then digitized.

The images were collected in 15 sessions between August 1993 and July 1996. Each session lasted 1 or 2 days, and the location and setup did not change during the session. Sets of 5 to 11 images of each individual were acquired under relatively unconstrained conditions; see Figure 17. They included two frontal views; in the first of these (fa) a neutral facial expression was requested and in the second (fb) a different facial expression was requested (these requests were not always honored). For 200 individuals, a third frontal view was taken using a different camera and different lighting; this is referred to as the fc image. The remaining images were nonfrontal and included right and left profiles, right and left quarter profiles, and right and left half profiles. The FERET database consists of 1564 sets of images (1199 original sets and 365 duplicate sets)—a total of 14,126 images. A development set of 503 sets of images was released to researchers; the remaining images were sequestered for independent evaluation. In late 2000 the entire FERET database was released along with the Sep96 evaluation protocols, evaluation scoring code, and baseline PCA algorithms.

5.1.2. Evaluation. For details of the three FERET evaluations, see Phillips et al. [2000, 1998b] and Rizvi et al. [1998].

Fig. 17. Images from the FERET dataset; these images are of size 384 × 256.

The results of the most recent FERET evaluation (Sep96) will be briefly reviewed here. Because the entire FERET data set has been released, the Sep96 protocol provides a good benchmark for the performance of new algorithms. For the Sep96 evaluation, there was a primary gallery consisting of one frontal image (fa) per person for 1196 individuals. This was the core gallery used to measure performance for the following four different probe sets:

—fb probes—gallery and probe images of an individual taken on the same day with the same lighting (1195 probes);

—fc probes—gallery and probe images of an individual taken on the same day with different lighting (194 probes);

—Dup I probes—gallery and probe images of an individual taken on different days—duplicate images (722 probes); and

—Dup II probes—gallery and probe images of an individual taken over a year apart (the gallery consisted of 894 images; 234 probes).

Performance was measured using two basic methods. The first measured identification performance, where the primary performance statistic is the percentage of probes that are correctly identified by the algorithm. The second measured verification performance, where the primary performance measure is the equal error rate between the probability of false alarm and the probability of correct verification. (A more complete method of reporting identification performance is a cumulative match characteristic; for verification performance it is a receiver operating characteristic (ROC).)
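A cumulative match characteristic is straightforward to compute from a score matrix. The sketch below is illustrative (the names are ours, and it assumes every probe has a mate in the gallery):

```python
import numpy as np

def cumulative_match(scores, probe_ids, gallery_ids, max_rank=10):
    """Cumulative match characteristic: the fraction of probes whose
    correct gallery entry appears among the top-k matches. `scores`
    is (n_probes, n_gallery), larger meaning more similar."""
    order = np.argsort(-scores, axis=1)            # best match first
    ranked = np.asarray(gallery_ids)[order]
    hits = ranked == np.asarray(probe_ids)[:, None]
    first_hit = hits.argmax(axis=1)                # rank of true mate
    return [float((first_hit < k).mean()) for k in range(1, max_rank + 1)]
```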

The Sep96 evaluation tested the following 10 algorithms:

—an algorithm from Excalibur Corporation (Carlsbad, CA) (Sept. 1996);

—two algorithms from MIT Media Laboratory (Sept. 1996) [Moghaddam et al. 1996; Turk and Pentland 1991];

—three linear discriminant analysis-based algorithms from Michigan State University [Swets and Weng 1996b] (Sept. 1996) and the University of Maryland [Etemad and Chellappa 1997; Zhao et al. 1998] (Sept. 1996 and March 1997);

—a gray-scale projection algorithm from Rutgers University [Wilder 1994] (Sept. 1996);

—an Elastic Graph Matching algorithm from the University of Southern California [Okada et al. 1998; Wiskott et al. 1997] (March 1997);

—a baseline PCA algorithm [Moon and Phillips 2001; Turk and Pentland 1991]; and

—a baseline normalized correlation matching algorithm.

Three of the algorithms performed very well: probabilistic eigenface from MIT [Moghaddam et al. 1996], subspace LDA from UMD [Zhao et al. 1998, 1999], and Elastic Graph Matching from USC [Wiskott et al. 1997].

A number of lessons were learned from the FERET evaluations. The first is that performance depends on the probe category and there is a difference between best and average algorithm performance.

Another lesson is that the scenario has an impact on performance.

For identification, on the fb and duplicate probes, the USC scores were 94% and 59%, and the UMD scores were 96% and 47%. However, for verification, the equal error rates were 2% and 14% for USC, and 1% and 12% for UMD.

5.1.3. Summary. The availability of the FERET database and evaluation technology has had a significant impact on progress in the development of face recognition algorithms. The series of tests has allowed advances in algorithm development to be quantified—for example, the performance improvements in the MIT algorithms between March 1995 and September 1996, and in the UMD algorithms between September 1996 and March 1997.

Another important contribution of the FERET evaluations is the identification of areas for future research. In general the test results revealed three major problem areas: recognizing duplicates, recognizing people under illumination variations, and recognizing them under pose variations.

5.1.4. FRVT 2000. The Sep96 FERET evaluation measured performance on prototype laboratory systems. After March 1997 there was rapid advancement in the development of commercial face recognition systems. This advancement represented both a maturing of face recognition technology and the development of the supporting system and infrastructure necessary to create commercial off-the-shelf (COTS) systems. By the beginning of 2000, COTS face recognition systems were readily available.

To assess the state of the art in COTS face recognition systems, the Face Recognition Vendor Test (FRVT) 2000¹⁸ was organized [Blackburn et al. 2001]. FRVT 2000 was a technology evaluation that used the Sep96 evaluation protocol, but was significantly more demanding than the Sep96 FERET evaluation.

Participation in FRVT 2000 was restricted to COTS systems, with companies from Australia, Germany, and the United States participating.

¹⁸ http://www.frvt.org.

The five companies evaluated were Banque-Tec International Pty. Ltd., C-VIS Computer Vision und Automation GmbH, Miros, Inc., Lau Technologies, and Visionics Corporation.

A greater variety of imagery was used in FRVT 2000 than in the FERET evaluations. FRVT 2000 reported results in eight general categories: compression, distance, expression, illumination, media, pose, resolution, and temporal. There was no common gallery across all eight categories; the sizes of the galleries and probe sets varied from category to category.

We briefly summarize the results of FRVT 2000. Full details can be found in Blackburn et al. [2001], and include identification and verification performance statistics. The media experiments showed that changes in media do not adversely affect performance. Images of a person were taken simultaneously on conventional film and on digital media. The compression experiments showed that compression does not adversely affect performance. Probe images compressed up to 40:1 did not reduce recognition rates. The compression algorithm was JPEG.

FRVT 2000 also examined the effect of pose angle on performance. The results show that pose does not significantly affect performance up to ±25°, but that performance is significantly affected when the pose angle reaches ±40°.

In the illumination category, two key effects were investigated. The first was lighting change indoors. This was equivalent to the fc probes in FERET. For the best system in this category, the indoor change of lighting did not significantly affect performance. A second experiment tested recognition with an indoor gallery and an outdoor probe set. Moving from indoor to outdoor lighting significantly affected performance, with the best system achieving an identification rate of only 0.55.

The temporal category is equivalent to the duplicate probes in FERET. To compare progress since FERET, dup I and dup II scores were reported. For FRVT 2000 the dup I identification rate was 0.63,

compared with 0.58 for FERET. The corresponding rates for dup II were 0.64 for FRVT 2000 and 0.52 for FERET. These results showed that there was algorithmic progress between the FERET and FRVT 2000 evaluations. FRVT 2000 showed that two common concerns, the effects of compression and recording media, do not affect performance. It also showed that future areas of interest continue to be duplicates, pose variations, and illumination variations generated when comparing indoor images with outdoor images.

5.1.5. FRVT 2002. The Face Recognition Vendor Test (FRVT) 2002 [Phillips et al. 2003]¹⁸ was a large-scale evaluation of automatic face recognition technology. The primary objective of FRVT 2002 was to provide performance measures for assessing the ability of automatic face recognition systems to meet real-world requirements. Ten participants were evaluated under the direct supervision of the FRVT 2002 organizers in July and August 2002.

The heart of FRVT 2002 was the high computational intensity test (HCInt). The HCInt consisted of 121,589 operational images of 37,437 people. The images came from the U.S. Department of State's Mexican nonimmigrant visa archive. From this data, real-world performance figures on a very large data set were computed. Performance statistics were computed for verification, identification, and watch list tasks.

FRVT 2002 results showed that normal changes in indoor lighting do not significantly affect performance of the top systems. Approximately the same performance results were obtained using two indoor data sets, with different lighting, in FRVT 2002. In both experiments, the best performer had a 90% verification rate at a false accept rate of 1%. On comparable experiments conducted 2 years earlier in FRVT 2000, the results of FRVT 2002 indicated that there has been a 50% reduction in error rates. For the best face recognition systems, the recognition rate for faces captured outdoors, at a false accept rate of 1%, was only 50%.

Thus, face recognition from outdoor imagery remains a research challenge area.

A very important question for real-world applications is the rate of decrease in performance as time increases between the acquisition of the database of images and new images presented to a system. FRVT 2002 found that for the top systems, performance degraded at approximately 5% per year.

One open question in face recognition is: how do database and watch list size affect performance? Because of the large number of people and images in the FRVT 2002 data set, FRVT 2002 reported the first large-scale results on this question. For the best system, the top-rank identification rate was 85% on a database of 800 people, 83% on a database of 1,600, and 73% on a database of 37,437. For every doubling of database size, performance decreases by two to three overall percentage points. More generally, identification performance decreases linearly in the logarithm of the database size.
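As a back-of-the-envelope check of this log-linear trend (our arithmetic on the three reported points, not a figure from the report):

```latex
% Rough log-linear fit: r(N) \approx r_0 - s \log_2(N / N_0).
\begin{align*}
s &= \frac{85 - 73}{\log_2(37{,}437 / 800)}
   \approx \frac{12}{5.55} \approx 2.2 \text{ points per doubling},\\
r(1{,}600) &\approx 85 - 2.2 \approx 82.8\,\%
   \quad \text{(consistent with the reported 83\%).}
\end{align*}
```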

Previous evaluations have reported face recognition performance as a function of imaging properties. For example, previous reports compared the differences in performance when using indoor versus outdoor images, or frontal versus nonfrontal images. FRVT 2002, for the first time, examined the effects of demographics on performance. Two major effects were found. First, recognition rates for males were higher than for females. For the top systems, identification rates for males were 6% to 9% points higher than those for females. For the best system, identification performance on males was 78% and on females it was 79%. Second, recognition rates for older people were higher than for younger people. For 18- to 22-year-olds the average identification rate for the top systems was 62%, and for 38- to 42-year-olds it was 74%. For every 10-year increase in age, performance increased on average by approximately 5% through age 63.

FRVT 2002 also looked at two new techniques. The first was the three-dimensional morphable models technique of Blanz and Vetter [1999].

Morphable models are a technique for improving recognition of nonfrontal images. FRVT 2002 found that Blanz and Vetter's technique significantly increased recognition performance. The second technique was recognition from video sequences. Using FRVT 2002 data, recognition performance using video sequences was the same as the performance using still images.

In summary, the key lessons learned in FRVT 2002 were: (1) given reasonably controlled indoor lighting, the current state of the art in face recognition is 90% verification at a 1% false accept rate; (2) face recognition in outdoor images is a research problem; (3) the use of morphable models can significantly improve nonfrontal face recognition; (4) identification performance decreases linearly in the logarithm of the size of the gallery; and (5) in face recognition applications, accommodations should be made for demographic information, since characteristics such as age and sex can significantly affect performance.

5.2. The XM2VTS Protocol

Multimodal methods¹⁹ are a very promising approach to user-friendly (hence acceptable), highly secure personal verification. Recognition and verification systems need training; the larger the training set, the better the performance achieved. The volume of data required for training a multimodal system based on analysis of video and audio signals is on the order of TBytes; technology that allows manipulation and effective use of such volumes of data has only recently become available in the form of digital video. The XM2VTS multimodal database [Messer et al. 1999] contains four recordings of 295 subjects taken over a period of 4 months. Each recording contains a speaking head shot and a rotating head shot. Available data from this database include high-quality color images, 32-kHz 16-bit sound files, video sequences, and a 3D model.

¹⁹ http://www.ee.surrey.ac.uk/Research/VSSP/xm2vtsdb/.

The XM2VTS database is an expansion of the earlier M2VTS database [Pigeon and Vandendorpe 1999]. The M2VTS project (Multimodal Verification for Teleservices and Security Applications), a European ACTS (Advanced Communications Technologies and Services) project, deals with access control by multimodal identification of human faces. The goal of the project was to improve recognition performance by combining the modalities of face and voice. The M2VTS database contained five shots of each of 37 subjects. During each shot, the subjects were asked to count from "0" to "9" in their native language (most of the subjects were French-speaking) and rotate their heads from 0° to −90°, back to 0°, and then to +90°. They were then asked to rotate their heads again with their glasses off, if they wore any. Three subsequences were extracted from these video sequences: voice sequences, motion sequences, and glasses-off motion sequences. The voice sequences can be used for speech verification, frontal view face recognition, and speech/lips correlation analysis. The other two sequences are intended for face recognition only.

It was found that the subjects were relatively difficult to recognize in the fifth shot because it varied significantly in face/voice/camera setup from the other shots. Several experiments have been conducted using the first four shots with the goals of investigating

—text-dependent speaker verification from speech,

—text-independent speaker verification from speech,

—facial feature extraction and tracking from moving images,

—verification from an overall frontal view,

—verification from lip shape,

—verification from depth information (obtained using structured light),

—verification from a profile, and

—synchronization of speech and lip movement.

5.2.1. Database. The XM2VTS database differs from the M2VTS database primarily in the number of subjects (295 rather than 37).

The M2VTS database contains five shots of each subject taken at sessions over a period of 3 months; the XM2VTS database contains eight shots of each subject taken at four sessions over a period of 4 months (so that each session contains two repetitions of the sequence). The XM2VTS database was acquired using a Sony VX1000E digital camcorder and a DHR1000UX digital VCR.

In the XM2VTS database, the first shot is a speaking head shot. Each subject, who wore a clip-on microphone, was asked to read three sentences that were written on a board positioned just below the camera. The subjects were asked to read the three simple sentences twice at their normal pace and to pause briefly at the end of each sentence.

The second shot is a rotating head sequence. Each subject was asked to rotate his/her head to the left, to the right, up, and down, and finally to return to the center. The subjects were told that a full profile was required and were asked to repeat the entire sequence twice. The same sequence was used in all four sessions.

An additional dataset containing a 3D model of each subject's head was acquired during each session using a high-precision stereo-based 3D camera developed by the Turing Institute.²⁰

5.2.2. Evaluation. The M2VTS Lausanne protocol was designed to evaluate the performance of vision- and speech-based person authentication systems on the XM2VTS database. This protocol was defined for the task of verification. The features of the observed person are compared with stored features corresponding to the claimed identity, and the system decides whether the identity claim is true or false on the basis of a similarity score. The subjects whose features are stored in the system's database are called clients, whereas persons claiming a false identity are called imposters.

²⁰ Turing Institute Web address: http://www.turing.gla.ac.uk/.

The database is divided into three parts: a training set, an evaluation set, and a test set. The training set is used to build client models. The evaluation set is used to compute client and imposter scores. On the basis of these scores, a threshold is chosen that determines whether a person is accepted or rejected. In multimodal classification, the evaluation set can also be used to optimally combine the outputs of several classifiers. The test set is selected to simulate a real authentication scenario. The 295 subjects were randomly divided into 200 clients, 25 evaluation imposters, and 70 test imposters. Two different evaluation configurations were used with different distributions of client training and client evaluation data. For more details, see Messer et al. [1999].
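The threshold-selection step on the evaluation set can be sketched as follows (our illustration of an equal-error-rate criterion; the protocol also defines other operating points, such as zero false acceptance):

```python
import numpy as np

def choose_threshold(client_scores, imposter_scores):
    """Scan candidate thresholds on the evaluation set and return
    the one where false acceptance and false rejection are closest
    (an equal-error-rate criterion)."""
    candidates = np.sort(np.concatenate([client_scores, imposter_scores]))
    best_t, best_gap = candidates[0], np.inf
    for t in candidates:
        fa = np.mean(imposter_scores >= t)     # imposters accepted
        fr = np.mean(client_scores < t)        # clients rejected
        if abs(fa - fr) < best_gap:
            best_t, best_gap = t, abs(fa - fr)
    return best_t
```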

In order to collect face verification results on this database using the Lausanne protocol, a contest was organized in conjunction with ICPR 2000 (the International Conference on Pattern Recognition). There were twelve algorithms from four participants in this contest [Matas et al. 2000]: an EBGM algorithm from IDIAP (Dalle Molle Institute for Perceptual Artificial Intelligence), a slightly modified EBGM algorithm from Aristotle University of Thessaloniki, an FND-based (Fractal Neighbor Distance) algorithm from the University of Sydney, and eight variants of LDA algorithms and one SVM algorithm from the University of Surrey. The performance measures of a verification system are the false acceptance rate (FA) and the false rejection rate (FR). Both FA and FR are influenced by the acceptance threshold. According to the Lausanne protocol, the threshold is set to satisfy certain performance levels on the evaluation set. The same threshold is applied to the test data and FA and FR on the test data are computed. The best results of FA and FR on the test data (FA/FR: 2.3%/2.5% and 1.2%/1.0% for evaluation configurations I and II, respectively) were obtained using an LDA algorithm with a non-Euclidean metric (University of Surrey) when the threshold was set so that FA was equal to FR on the evaluation result. This result seems to concur with the equal error rates reported in the FERET protocol.

In addition, FA and FR on the test data were reported when the threshold was set so that FA or FR was zero on the evaluation result. For more details on the results, see Matas et al. [2000].

5.2.3. Summary. The results of the M2VTS/XM2VTS projects can be used for a broad range of applications. In the telecommunication field, the results should have a direct impact on network services where security of information and access will become increasingly important. (Telephone fraud in the U.S. has been estimated to cost several billion dollars a year.)

6. TWO ISSUES IN FACE RECOGNITION: ILLUMINATION AND POSE VARIATION

In this section, we discuss two important issues that are related to face recognition. The best face recognition techniques reviewed in Section 3 were successful in terms of their recognition performance on large databases in well-controlled environments. However, face recognition in an uncontrolled environment is still very challenging. For example, the FERET evaluations and FRVTs revealed that there are at least two major challenges: the illumination variation problem and the pose variation problem. Though many existing systems build in some sort of performance invariance by applying preprocessing methods such as histogram equalization or pose learning, significant illumination or pose change can cause serious performance degradation. In addition, face images can be partially occluded, or the system may need to recognize a person from an image in the database that was acquired some time ago (referred to as the duplicate problem in the FERET tests).

These problems are unavoidable when face images are acquired in an uncontrolled, uncooperative environment, as in surveillance video clips. It is beyond the scope of this paper to discuss all these issues and possible solutions. In this section we discuss only two well-defined problems and review approaches to solving them.

Pros and cons of these approaches are pointed out so an appropriate approach can be applied to a specific task. The majority of the methods reviewed here are generative approaches that can synthesize virtual views under desired illumination and viewing conditions. Many of the reviewed methods have not yet been applied to the task of face recognition, at least not on large databases.²¹ This may be for several reasons; some methods may need many sample images per person, pixel-wise accurate alignment of images, or high-quality images for reconstruction; or they may be computationally too expensive to apply to recognition tasks that process thousands of images in near-real-time.

To facilitate discussion and analysis, we adopt a varying-albedo Lambertian reflectance model that relates the image I of an object to the object's surface gradient (p, q) [Horn and Brooks 1989]:

I = \rho \, \frac{1 + p P_s + q Q_s}{\sqrt{1 + p^2 + q^2}\,\sqrt{1 + P_s^2 + Q_s^2}},    (6)

where (p, q) and \rho are the partial derivatives (surface gradient) and varying albedo of the object, respectively, and (P_s, Q_s, -1) represents a single distant light source. The light source can also be represented by the illuminant slant and tilt angles; slant \alpha is the angle between the opposite lighting direction and the positive z-axis, and tilt \tau is the angle between the opposite lighting direction and the x-z plane. These angles are related to P_s and Q_s by P_s = \tan\alpha\cos\tau, Q_s = \tan\alpha\sin\tau. To simplify the notation, we replace the constant \sqrt{1 + P_s^2 + Q_s^2} by K. For easier analysis, we assume that frontal face objects are bilaterally symmetric about the vertical midlines of the faces.
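Equation (6) is easy to exercise numerically. Below is a small numpy sketch of ours; the clipping of negative shading values (attached shadows) is an addition not present in the equation:

```python
import numpy as np

def render_lambertian(albedo, p, q, Ps, Qs):
    """Render Equation (6): varying-albedo Lambertian surface with
    gradient maps (p, q) under a distant light (Ps, Qs, -1)."""
    K = np.sqrt(1.0 + Ps ** 2 + Qs ** 2)
    shading = (1.0 + p * Ps + q * Qs) / (np.sqrt(1.0 + p ** 2 + q ** 2) * K)
    # Clipping negative values (attached shadows) is our addition;
    # Equation (6) itself has no max(0, .) term.
    return albedo * np.clip(shading, 0.0, None)
```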

²¹ One exception is a recent report [Blanz and Vetter 2003] where faces were represented using 4448 images from the CMU-PIE database and 1940 images from the FERET database.

Fig. 18. In each row, the same face appears differently under different illuminations (from the Yale face database).

6.1. The Illumination Problem in Face Recognition

The illumination problem is illustrated in Figure 18, where the same face appears different due to a change in lighting. The changes induced by illumination are often larger than the differences between individuals, causing systems based on comparing images to misclassify input images. This was experimentally observed in Adini et al. [1997] using a dataset of 25 individuals.

In Zhao [1999], an analysis was carried out of how illumination variation changes the eigen-subspace projection coefficients of images under the assumption of a Lambertian surface. Consider the basic expression for the subspace decomposition of a face image I:

I \approx I_A + \sum_{i=1}^{m} a_i \Phi_i,

where I_A is the average image, \Phi_i are the eigenimages, and a_i are the projection coefficients. Assume that for a particular individual we have a prototype image I_p that is a normally lighted frontal view (P_s = 0, Q_s = 0 in Equation (6)) in the database, and we want to match it against a new image I of the same class under lighting (P_s, Q_s, -1). The corresponding subspace projection coefficient vectors a = [a_1, a_2, \ldots, a_m]^T (for I_p) and \bar{a} = [\bar{a}_1, \bar{a}_2, \ldots, \bar{a}_m]^T (for I) are computed as follows:

a_i = I_p \odot \Phi_i - I_A \odot \Phi_i,  \qquad  \bar{a}_i = I \odot \Phi_i - I_A \odot \Phi_i,    (7)

where \odot denotes the sum of all element-wise products of two matrices (vectors). If we divide the images and the eigenimages into two halves, for example, left and right, we have

a_i = I_p^L \odot \Phi_i^L + I_p^R \odot \Phi_i^R - I_A \odot \Phi_i,  \qquad  \bar{a}_i = I^L \odot \Phi_i^L + I^R \odot \Phi_i^R - I_A \odot \Phi_i.    (8)

Based on Equation (6) and the symmetry of eigenimages and face objects, we have

a_i = 2\, I_p^L[x, y] \odot \Phi_i^L[x, y] - I_A \odot \Phi_i,  \qquad  \bar{a}_i = \frac{2}{K}\left( I_p^L[x, y] + I_p^L[x, y]\, q^L[x, y]\, Q_s \right) \odot \Phi_i^L[x, y] - I_A \odot \Phi_i,    (9)

leading to the following relation:

\bar{a} = \frac{1}{K}\, a + \frac{Q_s}{K}\, \left[ f_1^a, f_2^a, \ldots, f_m^a \right]^T - \frac{K - 1}{K}\, a_A,    (10)

where f_i^a = 2\left( I_p^L[x, y]\, q^L[x, y] \right) \odot \Phi_i^L[x, y] and a_A is the projection coefficient vector of the average image I_A: [I_A \odot \Phi_1, \ldots, I_A \odot \Phi_m]^T. Now let us assume that the training set is extended to include mirror images as in Kirby and Sirovich [1990]. A similar analysis can be carried out, since in such a case the eigenimages are either symmetric (for most leading eigenimages) or antisymmetric.

In general, Equation (10) suggests that a significant illumination change can seriously degrade the performance of subspace-based methods.

Fig. 19. Changes of projection vectors due to class variation (a) and illumination change (b) are of the same order [Zhao 1999].

Figure 19 plots the projection coefficients for the same face under different illuminations (\alpha \in [0°, 40°], \tau \in [0°, 180°]) and compares them against the variations in the projection coefficient vectors due to pure differences in class.

In general, the illumination problem is quite difficult and has received considerable attention in the image understanding literature. In the case of face recognition, many approaches to this problem have been proposed that make use of the domain knowledge that all faces belong to one face class. These approaches can be divided into four types [Zhao 1999]: (1) heuristic methods, for example, discarding the leading principal components; (2) image comparison methods in which appropriate image representations and distance measures are used; (3) class-based methods using multiple images of the same face in a fixed pose but under different lighting conditions; and (4) model-based approaches in which 3D models are employed.

6.1.1. Heuristic Approaches. Many existing systems use heuristic methods to compensate for lighting changes. For example, in Moghaddam and Pentland [1997] simple contrast normalization was used to preprocess the detected faces, while in Sung and Poggio [1997] normalization in intensity was done by first subtracting a best-fit brightness plane and then applying histogram equalization.

In the face eigen-subspace domain, it was suggested, and later experimentally verified in Belhumeur et al. [1997], that by discarding the few most significant principal components, variations due to lighting can be reduced. The plot in Figure 19(b) also supports this observation. However, in order to maintain system performance for normally illuminated images, while improving performance for images acquired under changes in illumination, it must be assumed that the first three principal components capture only variations due to lighting. Other heuristic methods based on frontal-face symmetry have also been proposed [Zhao 1999].
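A minimal sketch of this heuristic (ours, using plain SVD; the numbers of dropped and kept components are illustrative):

```python
import numpy as np

def eigenface_features(images, n_drop=3, n_keep=40):
    """PCA on vectorized face images, discarding the first few
    eigenvectors, which tend to capture lighting variation."""
    X = images.reshape(len(images), -1).astype(np.float64)
    X -= X.mean(axis=0)
    # SVD of the data matrix gives the eigenimages as rows of Vt.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    basis = Vt[n_drop:n_drop + n_keep]     # skip leading components
    return X @ basis.T                     # projection coefficients
```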

6.1.2. Image Comparison Approaches. In Adini et al. [1997], approaches based on image comparison using different image representations and distance measures were evaluated. The image representations used were edge maps, derivatives of the gray level, images filtered with 2D Gabor-like functions, and a representation that combines a log function of the intensity with these representations. The distance measures used were point-wise distance, regional distance, affine-GL (gray level) distance, local affine-GL distance, and log point-wise distance.

For more details about these methods and about the evaluation database, see Adini et al. [1997]. It was concluded that none of these representations alone can overcome the image variations due to illumination.

A recently proposed image comparison method [Jacobs et al. 1998] used a new measure robust to illumination change. The rationale for developing such a method of directly comparing images is the potential difficulty of building a complete representation of an object's possible images, as suggested in Belhumeur and Kriegman [1997]. The authors argued that it is not clear whether it is possible to construct the complete representation using a small number of training images taken under uncontrolled viewing conditions and containing multiple light sources. It was shown that given two images of an object with unknown structure and albedo, there is always a large family of solutions. Even in the case of given light sources, only two out of three independent components of the Hessian of the surface can be determined. Instead, the authors argued that the ratio of two images of the same object is simpler than the ratio of images of different objects. Based on this observation, the complexity of the ratio of two aligned images was proposed as the similarity measure. More specifically, we have

\frac{I_1}{I_2} = \left( \frac{K_2}{K_1} \right) \frac{1 + p_I P_{s,1} + q_I Q_{s,1}}{1 + p_I P_{s,2} + q_I Q_{s,2}}    (11)

for images of the same object, and

\frac{I_1}{J_2} = \left( \frac{K_2}{K_1} \right) \left( \frac{\rho_I}{\rho_J} \right) \frac{1 + p_I P_{s,1} + q_I Q_{s,1}}{1 + p_J P_{s,2} + q_J Q_{s,2}} \times \sqrt{\frac{1 + p_J^2 + q_J^2}{1 + p_I^2 + q_I^2}}    (12)

for images of different objects. They chose the integral of the magnitude of the gradient of the function (the ratio image)

as the measure of complexity and proposed the following symmetric similarity measure:

d_G(I, J) = \int\!\!\int \min(I, J)\, \left\| \nabla\!\left( \frac{I}{J} \right) \right\| \left\| \nabla\!\left( \frac{J}{I} \right) \right\| dx\, dy.    (13)

They noticed the similarity between this measure and a measure that simply compares the edges. It is also clear that the measure is not strictly illumination-invariant, because it changes for a pair of images of the same object when the illumination changes. Experiments on face recognition showed improved performance over eigenfaces, but somewhat worse than the illumination cone-based method [Georghiades et al. 1998] on the same set of data.
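A discrete version of the measure in Equation (13) is a few lines of numpy. This is our sketch; the `eps` guard against division by zero is not part of the original measure:

```python
import numpy as np

def ratio_gradient_distance(I, J, eps=1e-6):
    """Discrete analogue of Equation (13): small when the ratio
    image I/J is smooth, i.e. when I and J plausibly show the same
    surface under different lighting."""
    gy1, gx1 = np.gradient(I / (J + eps))
    gy2, gx2 = np.gradient(J / (I + eps))
    g1 = np.hypot(gx1, gy1)
    g2 = np.hypot(gx2, gy2)
    return float(np.sum(np.minimum(I, J) * g1 * g2))
```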

6.1.3. Class-Based Approaches. Under the assumptions of Lambertian surfaces and no shadowing, a 3D linear illumination subspace for a person was constructed in Belhumeur and Kriegman [1997], Hallinan [1994], Murase and Nayar [1995], Riklin-Raviv and Shashua [1999], and Shashua [1994] for a fixed viewpoint, using three aligned faces/images acquired under different lighting conditions. Under ideal assumptions, recognition based on this subspace is illumination-invariant. More recently, an illumination cone has been proposed as an effective method of handling illumination variations, including shadowing and multiple light sources [Belhumeur and Kriegman 1997; Georghiades et al. 1998]. This method is an extension of the 3D linear subspace method [Hallinan 1994; Shashua 1994] and has the same drawback, requiring at least three aligned training images acquired under different lighting conditions per person. A more detailed review of this approach and its extension to handle the combined illumination and pose problem will be presented in Section 6.2.
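The 3D linear illumination subspace itself is easy to sketch: stack three aligned images as basis vectors and score a probe by its reconstruction error. This is our illustration of the idea under the stated Lambertian, shadow-free assumptions, not any author's implementation:

```python
import numpy as np

def illumination_subspace(I1, I2, I3):
    """Basis for a person's 3D linear illumination subspace: three
    aligned images taken under independent light sources."""
    return np.stack([I.ravel() for I in (I1, I2, I3)],
                    axis=1).astype(np.float64)

def distance_to_subspace(B, probe):
    """Reconstruction error of a probe image from the subspace;
    ideally illumination-invariant under the model assumptions."""
    x = probe.ravel().astype(np.float64)
    coef, *_ = np.linalg.lstsq(B, x, rcond=None)
    return float(np.linalg.norm(B @ coef - x))
```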

Fig. 20. Testing the invariance of the quotient image (Q-image) to varying illumination. (a) Original images of a novel face taken under five different illuminations. (b) The Q-images corresponding to the novel images, computed with respect to the bootstrap set of ten objects [Riklin-Raviv and Shashua 1999]. (Courtesy of T. Riklin-Raviv and A. Shashua.)

More recently, a method based on quotient images was introduced [Riklin-Raviv and Shashua 1999]. Like other class-based methods, this method assumes that the faces of different individuals have the same shape and different textures. Given two objects a and b, the quotient image Q is defined to be the ratio of their albedo functions \rho_a/\rho_b, and hence is illumination-invariant. Once Q is computed, the entire illumination space of object a can be generated by Q and a linear illumination subspace constructed from three images of object b. To make this basic idea work in practice, a training set (called the bootstrap set in the paper) is needed that consists of images of N objects under various lighting conditions, and the quotient image of a novel object y is defined relative to the average object of the bootstrap set. More specifically, the bootstrap set consists of 3N images taken under three fixed, linearly independent light sources s_1, s_2, and s_3 that are not known. Under this assumption, any light source s can be expressed as a linear combination of the s_i: s = x_1 s_1 + x_2 s_2 + x_3 s_3. The authors further defined the normalized albedo function \rho of the bootstrap set as the squared sum of the \rho_i, where \rho_i is the albedo function of object i. An interesting energy/cost function is defined that is quite different from the traditional bilinear form. Let A_1, A_2, \ldots, A_N be m \times 3 matrices whose columns are images of object i (from the bootstrap set) that contain the same m pixels; then the bilinear energy/cost

function [Freeman and Tenenbaum 2000] for an image y_s of object y under illumination s is

\left( y_s - \sum_{i=1}^{N} \alpha_i A_i x \right)^2,    (14)

which is a bilinear problem in the N unknowns \alpha_i and the 3 unknowns x. For comparison, the proposed energy function is

\sum_{i=1}^{N} \left( \alpha_i y_s - A_i x \right)^2.    (15)

This formulation of the energy function is a major reason why the quotient image method works better than "reconstruction" methods based on Equation (14), in terms of the smaller size of the bootstrap set and the weaker requirement for pixel-wise image alignment. As pointed out by the authors, another factor contributing to the success of using only a small bootstrap set is that the albedo functions occupy only a small subspace. Figure 20 demonstrates the invariance of the quotient image against change in illumination conditions; image synthesis results are shown in Figure 21.
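Equation (15) can be minimized by alternating least squares. The sketch below is our plain solver, not the closed-form solution derived in the paper, and the renormalization of alpha is our ad hoc way of avoiding the trivial all-zero solution:

```python
import numpy as np

def solve_quotient_energy(y, A_list, n_iter=20):
    """Alternating least squares for Equation (15):
    minimize sum_i || alpha_i * y - A_i @ x ||^2 over alpha and x,
    where y (m,) is the novel image and each A_i (m, 3) stacks the
    three bootstrap images of object i as columns."""
    x = np.ones(3)
    yy = float(y @ y)
    for _ in range(n_iter):
        # Optimal alpha_i for fixed x, then rescaled to sum to one;
        # the rescaling is our fix for the trivial zero solution.
        alpha = np.array([y @ (A @ x) for A in A_list]) / yy
        alpha /= alpha.sum()
        # Optimal x for fixed alpha (3x3 normal equations).
        M = sum(A.T @ A for A in A_list)
        b = sum(a * (A.T @ y) for a, A in zip(alpha, A_list))
        x = np.linalg.solve(M, b)
    return alpha, x
```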

6.1.4. Model-Based Approaches. In model-based approaches, a 3D face model is used to synthesize a virtual image from a given image under desired illumination conditions. When the 3D model is unknown, recovering the shape from the images accurately is difficult without using any priors.

Fig. 21. Image synthesis example. Original image (a) and its quotient image (b) from the N = 10 bootstrap set. The quotient image is generated relative to the average object of the bootstrap set, shown in (c), (d), and (e). Images (f) through (k) are synthetic images created from (b) and (c), (d), (e) [Riklin-Raviv and Shashua 1999]. (Courtesy of T. Riklin-Raviv and A. Shashua.)

Shape-from-shading (SFS) can be used if only one image is available; stereo or structure from motion can be used when multiple images of the same object are available.

Fortunately, for face recognition the differences in the 3D shapes of different face objects are not dramatic. This is especially true after the images are aligned and normalized. Recall that this assumption was used in the class-based methods reviewed above. Using a statistical representation of 3D heads, PCA was suggested as a tool for solving the parametric SFS problem [Atick et al. 1996]. An eigenhead approximation of a 3D head was obtained after training on about 300 laser-scanned range images of real human heads. The ill-posed SFS problem is thereby transformed into a parametric problem. The authors also demonstrated that such a representation helps to determine the light source. For a new face image, its 3D head can be approximated as a linear combination of eigenheads and then used to determine the light source. Using this complete 3D model, any virtual view of the face image can be generated. A major drawback of this approach is the assumption of constant albedo. This assumption does not hold for most real face images, even though it is the most common assumption used in SFS algorithms.

To address the issue of varying albedo, a direct 2D-to-2D approach was proposed based on the assumption that front-view faces are symmetric and making use of a generic 3D model [Zhao et al. 1999]. Recall that a prototype image Ip is a frontal view with Ps = 0, Qs = 0. Substituting this into Equation (6), we have

    Ip[x, y] = ρ / √(1 + p² + q²).    (16)

Comparing Equations (6) and (16), we obtain

    Ip[x, y] = K / ( 2(1 + qQs) ) · ( I[x, y] + I[−x, y] ).    (17)

This simple equation relates the prototype image Ip to I[x, y] + I[−x, y], which is already available. The two advantages of this approach are: (1) there is no need to recover the varying albedo ρ[x, y]; (2) there is no need to recover the full shape gradients (p, q); q can be approximated by a value derived from a generic 3D shape. As part of the proposed automatic method, a model-based light source identification method was also proposed to improve existing source-from-shading algorithms. Figure 22 shows some comparisons between rendered images obtained using this method and using a local SFS algorithm [Tsai and Shah 1994]. Using the Yale and Weizmann databases (Table V), significant performance improvements were reported when the prototype images were used in a subspace LDA system in place of the original input images [Zhao et al. 1999]. In these experiments, the gallery set contained about 500 images from various databases and the probe set contained 60 images from the Yale database and 96 images from the Weizmann database.

Fig. 22. Image rendering comparison. The original images are shown in the first column. The second column shows prototype images rendered using the local SFS algorithm [Tsai and Shah 1994]. Prototype images rendered using symmetric SFS are shown in the third column. Finally, the fourth column shows real images that are close to the prototype images [Zhao and Chellappa 2000].
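In code, Equation (17) amounts to a mirrored sum and a per-pixel gain; the generic shape gradient q and the estimated light parameters stand in for a full SFS solution. A minimal sketch under our own naming, assuming the light slant Qs and the constant K have already been estimated by a separate light-source identification step:

```python
import numpy as np

def prototype_image(I, Qs, q_generic, K=1.0):
    """Frontally lit prototype via Eq. (17):
    Ip = K / (2 (1 + q*Qs)) * (I[x, y] + I[-x, y]).

    I         : (h, w) frontal-pose face, aligned so that flipping the
                columns implements the mirror operation x -> -x
    Qs        : estimated y-slant of the light source
    q_generic : (h, w) y-gradient of a generic 3D face shape
    """
    I_mirror = I[:, ::-1]                               # I[-x, y]
    gain = K / np.maximum(2.0 * (1.0 + q_generic * Qs), 1e-6)
    return gain * (I + I_mirror)
```

Note that neither the varying albedo nor the full gradient field (p, q) appears anywhere in the computation, which is the point of the symmetry argument.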

Recently, a general method of approximating Lambertian reflectance using second-order spherical harmonics has been reported [Basri and Jacobs 2001]. Assuming Lambertian objects under distant, isotropic lighting, the authors were able to show that the set of all reflectance functions can be approximated using the surface spherical harmonic expansion. Specifically, they have proved that using a second-order (nine harmonics, i.e., a nine-dimensional (9D) space) approximation, the accuracy for any light function exceeds 97.97%. They then extended this analysis to image formation, which is a much more difficult problem due to possible occlusion, shape, and albedo variations. As indicated by the authors, the worst-case image approximation can be arbitrarily bad, but most cases are good. Using their method, an image can be decomposed into so-called harmonic images, which are produced when the object is illuminated by harmonic functions. The nine harmonic images of a face are plotted in Figure 23. An interesting comparison was made between the proposed method and the 3D linear illumination subspace methods [Hallinan 1994; Shashua 1994]; the 3D linear methods are just first-order harmonic approximations without the DC components.

Assuming precomputed object pose and known color albedo/texture, the authors reported an 86% correct recognition rate when applying this technique to the task of face recognition using a probe set of 10 people and a gallery set of 42 people.
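The nine harmonic images are simple pointwise functions of the albedo and the surface normal, so the 9D subspace for a gallery face can be built in a few lines. The sketch below omits the harmonics' constant normalization factors and uses the least-squares projection residual as the matching score; both are our simplifications of the published method.

```python
import numpy as np

def harmonic_images(albedo, normals):
    """First nine harmonic images in the spirit of Basri and Jacobs [2001].

    albedo  : (m,) per-pixel albedo
    normals : (m, 3) unit surface normals (nx, ny, nz) per pixel
    Returns an (m, 9) matrix whose columns span the 9D harmonic subspace
    (constant factors of the spherical harmonics omitted for clarity).
    """
    nx, ny, nz = normals[:, 0], normals[:, 1], normals[:, 2]
    H = np.stack([
        np.ones_like(nx),          # order 0 (the DC component)
        nx, ny, nz,                # order 1: the 3D linear subspace
        3 * nz**2 - 1,             # order 2
        nx * nz, ny * nz,
        nx**2 - ny**2, nx * ny,
    ], axis=1)
    return albedo[:, None] * H

def recognize(probe, harmonic_bases):
    """Pick the gallery face whose harmonic subspace best reconstructs
    the probe image (smallest least-squares projection residual)."""
    best, best_err = None, np.inf
    for label, B in harmonic_bases.items():
        coeffs, *_ = np.linalg.lstsq(B, probe, rcond=None)
        err = np.linalg.norm(B @ coeffs - probe)
        if err < best_err:
            best, best_err = label, err
    return best
```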

6.2. The Pose Problem in Face Recognition

It is not surprising that the performance of face recognition systems drops significantly when large pose variations are present in the input images. This difficulty was documented in the FERET and FRVT test reports [Blackburn et al. 2001; Phillips et al. 2002b, 2003], and was suggested as a major research issue. When illumination variation is also present, the task of face recognition becomes even more difficult. Here we focus on the out-of-plane rotation problem, since in-plane rotation is a pure 2D problem and can be solved much more easily.


Table V. Internet Resources for Research and Databases

Research pointers
  Face recognition homepage         www.cs.rug.nl/~peterkr/FACE/frhp.html
  Face detection homepage           home.t-online.de/home/Robert.Frischholz/face.htm
  Facial analysis homepage          mambo.ucsc.edu/psl/fanl.html
  Facial animation homepage         mambo.ucsc.edu/psl/fan.html

Face databases
  FERET database                    http://www.itl.nist.gov/iad/humanid/feret/
  XM2VTS database                   http://www.ee.surrey.ac.uk/Research/VSSP/xm2vtsdb/
  UT Dallas database                http://www.utdallas.edu/dept/bbs/FACULTY PAGES/otoole/database.htm
  Notre Dame database               http://www.nd.edu/~cvrl/HID-data.html
  MIT face databases                ftp://whitechapel.media.mit.edu/pub/images/
  Shimon Edelman's face database    ftp://ftp.wisdom.weizmann.ac.il/pub/FaceBase/
  CMU face detection database       www.ius.cs.cmu.edu/IUS/dylan usr0/har/faces/test/
  CMU PIE database                  www.ri.cmu.edu/projects/project 418.html
  Stirling face database            pics.psych.stir.ac.uk
  M2VTS multimodal database         www.tele.ucl.ac.be/M2VTS/
  Yale face database                cvc.yale.edu/projects/yalefaces/yalefaces.html
  Yale face database B              cvc.yale.edu/projects/yalefacesB/yalefacesB.html
  Harvard face database             hrl.harvard.edu/pub/faces
  Weizmann face database            www.wisdom.weizmann.ac.il/~yael/
  UMIST face database               images.ee.umist.ac.uk/danny/database.html
  Purdue University face database   rvl1.ecn.purdue.edu/~aleix/aleix face DB.html
  Olivetti face database            www.cam-orl.co.uk/facedatabase.html
  Oulu physics-based face database  www.ee.oulu.fi/research/imag/color/pbfd.html

Fig. 23. The first nine harmonic images of a face object (from left to right, top to bottom) [Basri and Jacobs 2001]. (Courtesy of R. Basri and D. Jacobs.)

Earlier methods focused on constructing invariant features [Wiskott et al. 1997] or synthesizing a prototypical view (frontal view) after a full model is extracted from the input image [Lanitis et al. 1995].22 Such methods work well for small rotation angles, but they fail when the angle is large, say 60°, causing some important features to be invisible. Most proposed methods are based on using large numbers of multiview samples. This seems to concur with the findings of the psychology community; face perception is believed to be view-independent for small angles, but view-dependent for large angles.

22 One exception is the multiview eigenfaces of Pentland et al. [1994].

To assess the pose problem more systematically, an attempt has been made to classify pose problems [Zhao 1999; Zhao and Chellappa 2000b]. The basic idea of this analysis is to use a varying-albedo reflectance model (Equation (6)) to synthesize new images in different poses from a real image, thus providing a tool for simulating the pose problem. More specifically, the 2D-to-2D image transformation under 3D pose change has been studied. The drawback of this analysis is the restriction of using a generic 3D model; no deformation of this 3D shape was carried out, though the authors suggested doing so.

Researchers have proposed various methods of handling the rotation problem. They can be divided into three classes [Zhao 1999]: (1) multiview image methods, when multiview database images of each person are available; (2) hybrid methods, when multiview training images are available during training but only one database image per person is available during recognition; and (3) single-image/shape-based methods, where no training is carried out. Akamatsu et al. [1992], Beymer [1993], Georghiades et al. [1999, 2001], and Ullman and Basri [1991] are examples of the first class and Beymer [1995], Beymer and Poggio [1995], Cootes et al. [2000], Maurer and Malsburg [1996a], Sali and Ullman [1998], and Vetter and Poggio [1997] of the second class. Up to now, the second type of approach has been the most popular. The third approach does not seem to have received much attention.

6.2.1. Multiview-Based Approaches. One of the earliest examples of the first class of approaches is the work of Beymer [1993], which used a template-based correlation matching scheme. In this work, pose estimation and face recognition were coupled in an iterative loop. For each hypothesized pose, the input image was aligned to database images corresponding to that pose. The alignment was first carried out via a 2D affine transformation based on three key feature points (eyes and nose), and optical flow was then used to refine the alignment of each template. After this step, the correlation scores of all pairs of matching templates were used for recognition. The main limitations of this method, and other methods belonging to this type of approach, are (1) many different views per person are needed in the database; (2) no lighting variations or facial expressions are allowed; and (3) the computational cost is high, since iterative searching is involved.

More recently, an illumination-cone-based [Belhumeur and Kriegman 1997] image synthesis method [Georghiades et al. 1999] has been proposed to handle both pose and illumination problems in face recognition. It handles illumination variation quite well, but not pose variation. To handle variations due to rotation, it needs to completely resolve the GBR (generalized-bas-relief) ambiguity and then reconstruct the Euclidean 3D shape. Without resolving this ambiguity, images from nonfrontal viewpoints synthesized from a GBR reconstruction will differ from a valid image by an affine warp of the image coordinates.23 To address the GBR ambiguity, the authors proposed exploiting face symmetry (to correct tilt) and the fact that the chin and the forehead are at about the same height (to correct slant), and requiring that the range of heights of the surface be about twice the distance between the eyes (to correct scale) [Georghiades et al. 2001]. They propose a pose- and illumination-invariant face recognition method based on building illumination cones at each pose for each person. Though conceptually this is a good idea, in practice it is too expensive to implement. The authors suggested many ways of speeding up the process, including first subsampling the illumination cone and then approximating the subsampled cone with an 11D linear subspace. Experiments on building illumination cones and on 3D shape reconstruction based on seven training images per class were reported. To visualize illumination-cone-based image synthesis, see Figure 24. Figure 25 demonstrates the effectiveness of image synthesis under variable pose and lighting after the GBR ambiguity is resolved. Almost perfect recognition results on ten individuals were reported using nine poses and 45 viewing conditions.

23 GBR is a 3D affine transformation with three parameters: scale, slant, and tilt. A weak-perspective imaging model is assumed.
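The core of the cone construction is the rank-3 basis B estimated from the fixed-pose training images; images of the same face under novel distant lighting are then max(Bs, 0), where the max generates attached shadows and makes the image set a cone rather than a linear subspace. A minimal sketch of just this step, using a plain SVD in place of the paper's robust estimation, with no cast shadows and no GBR resolution:

```python
import numpy as np

def lambertian_basis(images):
    """Rank-3 basis from k >= 3 fixed-pose images under varying light.
    images : (k, m) stacked, vectorized training images, I_j ~ B s_j."""
    U, S, _ = np.linalg.svd(images.T, full_matrices=False)
    return U[:, :3] * S[:3]          # (m, 3); compare the columns of B
                                     # shown in Figure 24(b)

def render(B, s):
    """Image of the face under a novel distant light direction s."""
    return np.maximum(B @ s, 0.0)    # max(., 0) yields attached shadows
```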

Fig. 24. The process of constructing the illumination cone. (a) The seven training images from Subset 1 (near frontal illumination) in frontal pose. (b) Images corresponding to the columns of B. (c) Reconstruction up to a GBR transformation. On the left, the surface was rendered with flat shading, that is, the albedo was assumed to be constant across the surface, while on the right the surface was texture-mapped using the first basis image of B shown in Figure 24(b). (d) Synthesized images from the illumination cone of the face under novel lighting conditions but fixed pose. Note the large variations in shading and shadowing as compared to the seven training images. (Courtesy of A. Georghiades, P. Belhumeur, and D. Kriegman.)

6.2.2. Hybrid Approaches. Numerous algorithms of the second type have been proposed. These methods, which make use of prior class information, are the most successful and practical methods up to now. We review several representative methods here: (1) a view-based eigenface method [Pentland et al. 1994], (2) a graph matching-based method [Wiskott et al. 1997], (3) a linear class-based method [Blanz and Vetter 1999; Vetter and Poggio 1997], (4) a vectorized image representation based method [Beymer 1995; Beymer and Poggio 1995], and (5) a view-based appearance model [Cootes et al. 2000]. Some of the reviewed methods are very closely related—for example, methods 3, 4, and 5. Despite their popularity, these methods have two common drawbacks: (1) they need many example images to cover the range of possible views; (2) the illumination problem is not explicitly addressed, though in principle it can be handled if images captured under the same pose but different illumination conditions are available.

The popular eigenface approach [Turk and Pentland 1991] to face recognition has been extended to a view-based eigenface method in order to achieve pose-invariant recognition [Pentland et al. 1994]. This method explicitly codes the pose information by constructing an individual eigenface for each pose. More recently, a unified framework called the bilinear model was proposed in Freeman and Tenenbaum [2000] that can handle either pure pose variation or pure class variation. (A bilinear example is given in Equation (14) for the illumination problem.)
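A small sketch of the view-based idea: one eigenspace per pose bin, with the pose of a test image chosen by the space that reconstructs it best (the "distance from face space" criterion), after which recognition proceeds within that space. The data layout and component count are illustrative assumptions on our part.

```python
import numpy as np

def build_view_spaces(images_by_pose, n_components=20):
    """One PCA ("eigenface") space per pose bin.
    images_by_pose maps a pose label to an (n_images, m) array of
    aligned, vectorized faces captured at that pose."""
    spaces = {}
    for pose, X in images_by_pose.items():
        mu = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        spaces[pose] = (mu, Vt[:n_components])   # mean + leading eigenfaces
    return spaces

def best_pose(x, spaces):
    """Pick the pose whose eigenspace reconstructs x with the smallest
    residual; recognition is then carried out in that space."""
    best, best_err = None, np.inf
    for pose, (mu, V) in spaces.items():
        c = V @ (x - mu)                         # project into the space
        err = np.linalg.norm((x - mu) - V.T @ c) # distance from face space
        if err < best_err:
            best, best_err = pose, err
    return best
```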

Fig. 25. Synthesized images under variable pose and lighting generated from the training images shown in Figure 24. (Courtesy of A. Georghiades, P. Belhumeur, and D. Kriegman.)

In Wiskott et al. [1997], a robust face recognition scheme based on EBGM was proposed. The authors assumed a planar surface patch at each feature point (landmark), and learned the transformations of "jets" under face rotation. Their results demonstrated substantial improvement in face recognition under rotation. Their method is also fully automatic, including face localization, landmark detection, and flexible graph matching. The drawback of this method is its requirement for accurate landmark localization, which is not an easy task, especially when illumination variations are present.
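At the heart of EBGM is the comparison of "jets", vectors of Gabor filter responses taken at a landmark. A minimal sketch of a magnitude-based jet and the normalized-dot-product similarity used to compare jets; the filter parameters below are arbitrary placeholders, not the tuned family of the cited system.

```python
import numpy as np

def gabor_kernel(ksize, sigma, theta, freq):
    """Complex Gabor kernel: Gaussian envelope times a complex carrier."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    carrier = np.exp(1j * 2 * np.pi * freq *
                     (x * np.cos(theta) + y * np.sin(theta)))
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * carrier

def jet(patch, kernels):
    """Jet at one landmark: response magnitudes on a ksize x ksize patch
    centered at the landmark."""
    return np.array([abs(np.sum(patch * k)) for k in kernels])

def jet_similarity(j1, j2):
    """Normalized dot product of jet magnitudes, as used in EBGM."""
    return (j1 @ j2) / (np.linalg.norm(j1) * np.linalg.norm(j2) + 1e-12)

# Example bank: 4 orientations x 2 spatial frequencies
kernels = [gabor_kernel(31, 6.0, t, f)
           for t in np.linspace(0, np.pi, 4, endpoint=False)
           for f in (0.05, 0.1)]
```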

The image synthesis method in Vetter and Poggio [1997] is based on the assumption of linear 3D object classes and the extension of linearity to images (both shape and texture) that are 2D projections of the 3D objects. It extends the linear shape model (which is very similar to the active shape model of Cootes et al. [1995]) from a representation based on feature points to full images of objects. To implement this method, a correspondence between images of the input object and a reference object is established using optical flow. Correspondences between the reference image and other example images having the same pose are also computed. Finally, the correspondence field for the input image is linearly decomposed into the correspondence fields for the examples. Compared to the parallel deformation scheme in Beymer and Poggio [1995], this method reduces the need to compute correspondences between images of different poses. On the other hand, parallel deformation was able to preserve some peculiarities of texture that are nonlinear and that could be "erased" by linear methods. This method was extended in Sali and Ullman [1998] to include an additive error term for better synthesis. In Blanz and Vetter [1999], a morphable 3D face model consisting of shape and texture was directly matched to single/multiple input images. As a consequence, head orientation, illumination conditions, and other parameters could be free variables subject to optimization.

Fig. 26. The best fit to a profile model is projected to the frontal model to predict new views [Cootes et al. 2000]. (Courtesy of T. Cootes, K. Walker, and C. Taylor.)

In Cootes et al. [2000], a view-based statistical method was proposed based on a small number of 2D statistical models (AAM). Unlike most existing methods that can handle only images with rotation angles up to, say, 45°, the authors argued that their method can handle even profile views in which many features are invisible. To deal with such large pose variations, they needed sample views at 90° (full profile), 45° (quasiprofile), and 0° (frontal view). A key element that is unique to this method is that for each pose, a different set of features is used. Given a single image of a new person, all the models are used to match the image, and estimation of the pose is achieved by choosing the best fit. To synthesize a new view from the input image, the relationship between models at different views is learned. More specifically, the following steps are needed: (1) removing the effects of orientation, (2) projecting into the identity subspace [Edwards et al. 1998], (3) projecting across into the subspace of the target model, and (4) adding the appropriate orientation. Figure 26 demonstrates the synthesis of a virtual view of a novel face using this method. Results of tracking a face across large pose variations and predicting novel views were reported on a limited dataset of about 15 short sequences.

Earlier work on multiview-based methods [Beymer 1993] was extended to explore the prior class information that is specific to a face class and can be learned from a set of prototypes [Beymer 1993, 1995]. The key idea of these methods is the vectorized representation of the images at each pose; this is similar to view-based AAM [Cootes et al. 2000]. A vectorized representation at each pose consists of both shape and texture, which are mapped into the standard/average reference shape. The reference shape is computed off-line by averaging shapes consisting of manually defined line segments surrounding the eyes, eyebrows, nose, mouth, and facial outline. The shape-free texture is represented either by the original geometrically normalized prototype images or by PCA bases constructed from these images. Given a new image, a vectorization procedure (similar to the iterative energy minimization procedure in AAM [Cootes et al. 2001]) is invoked that iterates between a shape step and a texture step. In the texture step, the input image is warped onto a previously computed alignment with the reference shape and then projected into the eigen-subspace. In the shape step, the PCA-reconstructed image is used to compute the alignment for the next iteration. In both methods [Beymer 1995; Beymer and Poggio 1995], an optical flow algorithm is used to compute a dense correspondence between the images. To synthesize a virtual view at pose θ2 of a novel image at pose θ1, the flow between these poses of the prototype images is computed and then warped to the novel image after the correspondence between the new image and the prototype image at pose θ1 is computed; using the warped flow, a virtual view can be generated by warping the novel image. Figure 27 illustrates a particular procedure adopted in Beymer and Poggio [1995]: the parallel deformation needed to compute the flow between the prototype image and the novel image. An obvious drawback of this approach is the difficulty of computing flow when the prototype image and novel image are dramatically different. To handle this issue, Beymer [1995] proposed first subsampling the estimated dense flow to locate local features (line segments) based on prior knowledge about both images, and then matching the local features. Feeding the virtual views into a simple recognizer based on templates of eyes, nose, and mouth, a recognition rate of 85% was reported on a test set of 620 images (62 people, 10 views per person) given one single real view. Apparently this method is not adequate, since it needs to synthesize all virtual views. A better strategy is to detect the pose of the novel face and synthesize only the prototype (say, the frontal) view.

Fig. 27. View synthesis by parallel deformation. First (A) the prototype flow is measured between the prototype image and the novel image at the same pose, then (B) the flow is mapped onto the novel face, and finally (C) the novel face is 2D-warped to the virtual view [Beymer and Poggio 1995].


6.2.3. Single-Image-Based Approaches. Finally, the third class of approaches includes low-level feature-based methods, invariant-feature-based methods, and 3D model-based methods. In Manjunath et al. [1992], a Gabor wavelet-based feature extraction method is proposed for face recognition which is robust to small-angle rotations. In 3D model-based methods, face shape is usually represented by either a polygonal model or a mesh model which simulates tissue. Due to its complexity and computational cost, no serious attempt to apply this approach to face recognition has been made, except for Gordon [1991], where 3D range data was available. In Zhao and Chellappa [2000b], a unified approach was proposed to solving both the pose and illumination problems. This method is a natural extension of the method proposed in Zhao and Chellappa [2000] to handle the illumination problem. Using a generic 3D model, they approximately solved the correspondence problem involved in a 3D rotation, and performed an input-to-prototype image computation. To address the varying albedo issue in the estimation of both pose and light source, the use of a self-ratio image was proposed. The self-ratio image rI[x, y] was defined as

    rI[x, y] = ( I[x, y] − I[−x, y] ) / ( I[x, y] + I[−x, y] ) = p[x, y] Ps / ( 1 + q[x, y] Qs ),    (18)

where I[x, y] is the original image and I[−x, y] is the mirrored image.

Using the self-ratio image, which is albedo-free, the authors formulated the following combined estimation problem for pose θ and light source (α, τ):

    (θ*, α*, τ*) = arg min_{θ, α, τ} [ rIm(α, τ) − rI(θ, α, τ) ]²,    (19)

where rI(θ, α, τ) is the self-ratio image for the virtual frontal view synthesized from the original rotated image IR via image warping and texture mapping, and rIm is the self-ratio image generated from the 3D face model. Improved recognition results based on subspace LDA [Zhao et al. 1999] were reported on a small database consisting of frontal and quasiprofile images of 115 novel objects (size 48 × 42). In these experiments, the frontal view images served as the gallery images and nonfrontal view images served as the probe images. Unfortunately, estimation of a single pose value for all the images was done manually. For many images, this estimate was not good, negating the performance improvement.
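The self-ratio image of Equation (18) is computed directly from the image and its mirror, with the albedo canceling out of the ratio. A sketch of just this quantity, assuming the face is aligned so that its symmetry axis is the central image column (our assumption for the mirroring step):

```python
import numpy as np

def self_ratio_image(I, eps=1e-6):
    """Eq. (18): rI = (I[x,y] - I[-x,y]) / (I[x,y] + I[-x,y]).
    The albedo cancels, leaving a pure function of shape and lighting."""
    I_m = I[:, ::-1]                        # mirrored image I[-x, y]
    return (I - I_m) / np.maximum(I + I_m, eps)
```

Equation (19) would then be minimized by searching over candidate poses θ and light parameters (α, τ), warping the input to a virtual frontal view for each candidate and scoring its self-ratio image against the model-generated rIm.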

7. SUMMARY AND CONCLUSIONS

In this paper we have presented an extensive survey of machine recognition of human faces and a brief review of related psychological studies. We have considered two types of face recognition tasks: one from still images and the other from video. We have categorized the methods used for each type, and discussed their characteristics and their pros and cons. In addition to a detailed review of representative work, we have provided summaries of current developments and of challenging issues. We have also identified two important issues in practical face recognition systems: the illumination problem and the pose problem. We have categorized proposed methods of solving these problems and discussed the pros and cons of these methods. To emphasize the importance of system evaluation, three sets of evaluations were described: FERET, FRVT, and XM2VTS.

Getting started in performing experiments in face recognition is very easy. The Colorado State University Evaluation of Face Recognition Algorithms Web site, http://www.cs.colostate.edu/evalfacerec/, has an archive of baseline face recognition algorithms. The baseline algorithms available are PCA, LDA, elastic bunch graph matching, and a Bayesian intrapersonal/extrapersonal image difference classifier. Source code and scripts for running the algorithms can be downloaded. The Web site includes scripts for running the FERET Sep96 evaluation protocol (the FERET data set needs to be obtained from the FERET Web site). The baseline algorithms and the FERET Sep96 protocol provide a framework for benchmarking new algorithms. The scripts can be modified to run different sets of images against the baseline. For on-line resources related to face recognition, such as research papers and databases, see Table V.

We give below a concise summary of our discussion, followed by our conclusions, in the same order as the topics have appeared in this paper:

—Machine recognition of faces has emerged as an active research area spanning disciplines such as image processing, pattern recognition, computer vision, and neural networks. There are numerous applications of FRT to commercial systems such as face verification-based ATM and access control, as well as law enforcement applications to video surveillance, etc. Due to its user-friendly nature, face recognition will remain a powerful tool in spite of the existence of very reliable methods of biometric personal identification such as fingerprint analysis and iris scans.

—Extensive research in psychophysics and the neurosciences on human recognition of faces is documented in the literature. We do not feel that machine recognition of faces should strictly follow what is known about human recognition of faces, but it is beneficial for engineers who design face recognition systems to be aware of the relevant findings. On the other hand, machine systems provide tools for conducting studies in psychology and neuroscience.

—Numerous methods have been proposed for face recognition based on image intensities [Chellappa et al. 1995]. Many of these methods have been successfully applied to the task of face recognition, but they have advantages and disadvantages. The choice of a method should be based on the specific requirements of a given task. For example, the EBGM-based method [Okada et al. 1998] has very good performance, but it requires a large image size, for example, 128 × 128, which severely restricts its possible application to video-based surveillance, where the image size of the face area is very small. On the other hand, the subspace LDA method [Zhao et al. 1999] works well for both large and small images, for example, 96 × 84 or 12 × 11.

—Recognition of faces from a video sequence (especially a surveillance video) is still one of the most challenging problems in face recognition because video is of low quality and the images are small. Often, the subjects of interest are not cooperative, for example, not looking into the camera. One particular difficulty in these applications is how to obtain good-quality gallery images. Nevertheless, video-based face recognition systems using multiple cues have demonstrated good results in relatively controlled environments.

—A crucial step in face recognition is the evaluation and benchmarking of algorithms. Some of the most important face databases and their associated evaluation methods have been reviewed: the FERET, FRVT, and XM2VTS protocols. The availability of these evaluations has had a significant impact on progress in the development of face recognition algorithms.

—Although many face recognition techniques have been proposed and have shown significant promise, robust face recognition is still difficult. There are at least three major challenges: illumination, pose, and recognition in outdoor imagery. A detailed review of methods proposed to solve these problems has been presented. Some basic problems remain to be solved; for example, pose discrimination is not difficult but accurate pose estimation is hard. In addition to these two problems, there are other even more difficult ones, such as recognition of a person from images acquired years apart.

—The impressive face recognition capability of the human perception system has one limitation: the number and types of faces that can be easily distinguished. Machines, on the other hand, can store and potentially recognize as many people as necessary. Is it really possible that a machine can be built that mimics the human perceptual system without its limitations on number and types?

To conclude our paper, we present a conjecture about face recognition based on psychological studies and lessons learned from designing algorithms. We conjecture that different mechanisms are involved in human recognition of familiar and unfamiliar faces. For example, it is possible that 3D head models are constructed, by extensive training, for familiar faces, but that for unfamiliar faces multiview 2D images are stored. This implies that we have full probability density functions for familiar faces, while for unfamiliar faces we only have discriminant functions.

REFERENCES

ADINI, Y., MOSES, Y., AND ULLMAN, S. 1997. Face recognition: The problem of compensating for changes in illumination direction. IEEE Trans. Patt. Anal. Mach. Intell. 19, 721–732.
AKAMATSU, S., SASAKI, T., FUKAMACHI, H., MASUI, N., AND SUENAGA, Y. 1992. An accurate and robust face identification scheme. In Proceedings, International Conference on Pattern Recognition. 217–220.
ATICK, J., GRIFFIN, P., AND REDLICH, N. 1996. Statistical approach to shape from shading: Reconstruction of three-dimensional face surfaces from single two-dimensional images. Neural Computat. 8, 1321–1340.
AZARBAYEJANI, A., STARNER, T., HOROWITZ, B., AND PENTLAND, A. 1993. Visually controlled graphics. IEEE Trans. Patt. Anal. Mach. Intell. 15, 602–604.
BACHMANN, T. 1991. Identification of spatially quantized tachistoscopic images of faces: How many pixels does it take to carry identity? European J. Cog. Psych. 3, 87–103.
BAILLY-BAILLIERE, E., BENGIO, S., BIMBOT, F., HAMOUZ, M., KITTLER, J., MARIETHOZ, J., MATAS, J., MESSER, K., POPOVICI, V., POREE, F., RUIZ, B., AND THIRAN, J. P. 2003. The BANCA database and evaluation protocol. In Proceedings of the International Conference on Audio- and Video-Based Biometric Person Authentication. 625–638.
BARTLETT, J. C. AND SEARCY, J. 1993. Inversion and configuration of faces. Cog. Psych. 25, 281–316.
BARTLETT, M. S., LADES, H. M., AND SEJNOWSKI, T. 1998. Independent component representation for face recognition. In Proceedings, SPIE Symposium on Electronic Imaging: Science and Technology. 528–539.
BASRI, R. AND JACOBS, D. W. 2001. Lambertian reflectances and linear subspaces. In Proceedings, International Conference on Computer Vision. Vol. II. 383–390.
BELHUMEUR, P. N., HESPANHA, J. P., AND KRIEGMAN, D. J. 1997. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Patt. Anal. Mach. Intell. 19, 711–720.
BELHUMEUR, P. N. AND KRIEGMAN, D. J. 1997. What is the set of images of an object under all possible lighting conditions? In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition. 52–58.
BELL, A. J. AND SEJNOWSKI, T. J. 1995. An information maximisation approach to blind separation and blind deconvolution. Neural Computat. 7, 1129–1159.
BELL, A. J. AND SEJNOWSKI, T. J. 1997. The independent components of natural scenes are edge filters. Vis. Res. 37, 3327–3338.
BEVERIDGE, J. R., SHE, K., DRAPER, B. A., AND GIVENS, G. H. 2001. A nonparametric statistical comparison of principal component and linear discriminant subspaces for face recognition. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition. (An updated version can be found online at http://www.cs.colostate.edu/evalfacerec/news.html.)
BEYMER, D. 1995. Vectorizing face images by interleaving shape and texture computations. MIT AI Lab memo 1537, Massachusetts Institute of Technology, Cambridge, MA.
BEYMER, D. J. 1993. Face recognition under varying pose. Tech. rep. 1461, MIT AI Lab, Massachusetts Institute of Technology, Cambridge, MA.
BEYMER, D. J. AND POGGIO, T. 1995. Face recognition from one example view. In Proceedings, International Conference on Computer Vision. 500–507.
BIEDERMAN, I. 1987. Recognition by components: A theory of human image understanding. Psych. Rev. 94, 115–147.
BIEDERMAN, I. AND KALOCSAI, P. 1998. Neural and psychophysical analysis of object and face recognition. In Face Recognition: From Theory to Applications, H. Wechsler, P. J. Phillips, V. Bruce, F. F. Soulie, and T. S. Huang, Eds. Springer-Verlag, Berlin, Germany, 3–25.
BIGUN, J., DUC, B., SMERALDI, F., FISCHER, S., AND MAKAROV, A. 1998. Multi-modal person authentication. In Face Recognition: From Theory to Applications, H. Wechsler, P. J. Phillips, V. Bruce, F. F. Soulie, and T. S. Huang, Eds. Springer-Verlag, Berlin, Germany, 26–50.
BLACK, M., FLEET, D., AND YACOOB, Y. 1998. A framework for modelling appearance change in image sequences. In Proceedings, International Conference on Computer Vision. 660–667.
BLACK, M. AND YACOOB, Y. 1995. Tracking and recognizing facial expressions in image sequences using local parametrized models of image motion. Tech. rep. CS-TR-3401, Center for Automation Research, University of Maryland, College Park, MD.

BLACKBURN, D., BONE, M., AND PHILLIPS, P. J. 2001. Face recognition vendor test 2000. Tech. rep. http://www.frvt.org.
BLANZ, V. AND VETTER, T. 1999. A morphable model for the synthesis of 3D faces. In Proceedings, SIGGRAPH'99. 187–194.
BLANZ, V. AND VETTER, T. 2003. Face recognition based on fitting a 3D morphable model. IEEE Trans. Patt. Anal. Mach. Intell. 25, 1063–1074.
BLEDSOE, W. W. 1964. The model method in facial recognition. Tech. rep. PRI:15, Panoramic Research Inc., Palo Alto, CA.
BRAND, M. AND BHOTIKA, R. 2001. Flexible flow for 3D nonrigid tracking and shape recovery. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition.
BRENNAN, S. E. 1985. The caricature generator. Leonardo 18, 170–178.
BRONSTEIN, A., BRONSTEIN, M., GORDON, E., AND KIMMEL, R. 2003. 3D face recognition using geometric invariants. In Proceedings, International Conference on Audio- and Video-Based Person Authentication.
BRUCE, V. 1988. Recognizing Faces. Lawrence Erlbaum Associates, London, U.K.
BRUCE, V., BURTON, M., AND DENCH, N. 1994. What's distinctive about a distinctive face? Quart. J. Exp. Psych. 47A, 119–141.
BRUCE, V., HANCOCK, P. J. B., AND BURTON, A. M. 1998. Human face perception and identification. In Face Recognition: From Theory to Applications, H. Wechsler, P. J. Phillips, V. Bruce, F. F. Soulie, and T. S. Huang, Eds. Springer-Verlag, Berlin, Germany, 51–72.
BRUNER, I. S. AND TAGIURI, R. 1954. The perception of people. In Handbook of Social Psychology, Vol. 2, G. Lindzey, Ed. Addison-Wesley, Reading, MA, 634–654.
BUHMANN, J., LADES, M., AND MALSBURG, C. V. D. 1990. Size and distortion invariant object recognition by hierarchical graph matching. In Proceedings, International Joint Conference on Neural Networks. 411–416.
CHELLAPPA, R., WILSON, C. L., AND SIROHEY, S. 1995. Human and machine recognition of faces: A survey. Proc. IEEE 83, 705–740.
CHOUDHURY, T., CLARKSON, B., JEBARA, T., AND PENTLAND, A. 1999. Multimodal person recognition using unconstrained audio and video. In Proceedings, International Conference on Audio- and Video-Based Person Authentication. 176–181.
COOTES, T., TAYLOR, C., COOPER, D., AND GRAHAM, J. 1995. Active shape models—their training and application. Comput. Vis. Image Understand. 61, 18–23.
COOTES, T., WALKER, K., AND TAYLOR, C. 2000. View-based active appearance models. In Proceedings, International Conference on Automatic Face and Gesture Recognition.
COOTES, T. F., EDWARDS, G. J., AND TAYLOR, C. J. 2001. Active appearance models. IEEE Trans. Patt. Anal. Mach. Intell. 23, 681–685.
COX, I. J., GHOSN, J., AND YIANILOS, P. N. 1996. Feature-based face recognition using mixture-distance. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition. 209–216.
CRAW, I. AND CAMERON, P. 1996. Face recognition by computer. In Proceedings, British Machine Vision Conference. 489–507.
DARWIN, C. 1872. The Expression of the Emotions in Man and Animals. John Murray, London, U.K.
DECARLO, D. AND METAXAS, D. 2000. Optical flow constraints on deformable models with applications to face tracking. Int. J. Comput. Vis. 38, 99–127.
DONATO, G., BARTLETT, M. S., HAGER, J. C., EKMAN, P., AND SEJNOWSKI, T. J. 1999. Classifying facial actions. IEEE Trans. Patt. Anal. Mach. Intell. 21, 974–989.
EDWARDS, G. J., TAYLOR, C. J., AND COOTES, T. F. 1998. Learning to identify and track faces in image sequences. In Proceedings, International Conference on Automatic Face and Gesture Recognition.
EKMAN, P., Ed. 1998. Charles Darwin's The Expression of the Emotions in Man and Animals, Third Edition, with Introduction, Afterwords and Commentaries by Paul Ekman. HarperCollins/Oxford University Press, New York, NY/London, U.K.
ELLIS, H. D. 1986. Introduction to aspects of face processing: Ten questions in need of answers. In Aspects of Face Processing, H. Ellis, M. Jeeves, F. Newcombe, and A. Young, Eds. Nijhoff, Dordrecht, The Netherlands, 3–13.
ETEMAD, K. AND CHELLAPPA, R. 1997. Discriminant analysis for recognition of human face images. J. Opt. Soc. Am. A 14, 1724–1733.
FISHER, R. A. 1938. The statistical utilization of multiple measurements. Ann. Eugen. 8, 376–386.
FREEMAN, W. T. AND TENENBAUM, J. B. 2000. Separating style and content with bilinear models. Neural Computat. 12, 1247–1283.
FUKUNAGA, K. 1989. Statistical Pattern Recognition. Academic Press, New York, NY.
GALTON, F. 1888. Personal identification and description. Nature (June 21), 173–188.
GAUTHIER, I., BEHRMANN, M., AND TARR, M. J. 1999. Can face recognition really be dissociated from object recognition? J. Cogn. Neurosci. 11, 349–370.
GAUTHIER, I. AND LOGOTHETIS, N. K. 2000. Is face recognition so unique after all? J. Cogn. Neuropsych. 17, 125–142.


GEORGHIADES, A. S., BELHUMEUR, P. N., AND KRIEGMAN, D. J. 1999. Illumination-based image synthesis: Creating novel images of human faces under differing pose and lighting. In Proceedings, Workshop on Multi-View Modeling and Analysis of Visual Scenes. 47–54.
GEORGHIADES, A. S., BELHUMEUR, P. N., AND KRIEGMAN, D. J. 2001. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Patt. Anal. Mach. Intell. 23, 643–660.
GEORGHIADES, A. S., KRIEGMAN, D. J., AND BELHUMEUR, P. N. 1998. Illumination cones for recognition under variable lighting: Faces. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition. 52–58.
GINSBURG, A. G. 1978. Visual information processing based on spatial filters constrained by biological data. AMRL Tech. rep. 78-129.
GONG, S., MCKENNA, S., AND PSARROU, A. 2000. Dynamic Vision: From Images to Face Recognition. World Scientific, Singapore.
GORDON, G. 1991. Face recognition based on depth maps and surface curvature. In SPIE Proceedings, Vol. 1570: Geometric Methods in Computer Vision. SPIE Press, Bellingham, WA, 234–247.
GU, L., LI, S. Z., AND ZHANG, H. J. 2001. Learning probabilistic distribution model for multiview face detection. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition.
HAGER, G. D. AND BELHUMEUR, P. N. 1998. Efficient region tracking with parametric models of geometry and illumination. IEEE Trans. Patt. Anal. Mach. Intell. 20, 1–15.
HALLINAN, P. W. 1991. Recognizing human eyes. In SPIE Proceedings, Vol. 1570: Geometric Methods in Computer Vision. 214–226.
HALLINAN, P. W. 1994. A low-dimensional representation of human faces for arbitrary lighting conditions. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition. 995–999.
HANCOCK, P., BRUCE, V., AND BURTON, M. 1998. A comparison of two computer-based face recognition systems with human perceptions of faces. Vis. Res. 38, 2277–2288.
HARMON, L. D. 1973. The recognition of faces. Sci. Am. 229, 71–82.
HEISELE, B., SERRE, T., PONTIL, M., AND POGGIO, T. 2001. Component-based face detection. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition.
HILL, H. AND BRUCE, V. 1996. Effects of lighting on matching facial surfaces. J. Exp. Psych.: Human Percept. Perform. 22, 986–1004.
HILL, H., SCHYNS, P. G., AND AKAMATSU, S. 1997. Information and viewpoint dependence in face recognition. Cognition 62, 201–222.
HJELMAS, E. AND LOW, B. K. 2001. Face detection: A survey. Comput. Vis. Image Understand. 83, 236–274.
HORN, B. K. P. AND BROOKS, M. J. 1989. Shape from Shading. MIT Press, Cambridge, MA.
HUANG, J., HEISELE, B., AND BLANZ, V. 2003. Component-based face recognition with 3D morphable models. In Proceedings, International Conference on Audio- and Video-Based Person Authentication.
ISARD, M. AND BLAKE, A. 1996. Contour tracking by stochastic propagation of conditional density. In Proceedings, European Conference on Computer Vision.
JACOBS, D. W., BELHUMEUR, P. N., AND BASRI, R. 1998. Comparing images under variable illumination. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition. 610–617.
JEBARA, T., RUSSEL, K., AND PENTLAND, A. 1998. Mixture of eigenfeatures for real-time structure from texture. Tech. rep. TR-440, MIT Media Lab, Massachusetts Institute of Technology, Cambridge, MA.
JOHNSTON, A., HILL, H., AND CARMAN, N. 1992. Recognizing faces: Effects of lighting direction, inversion and brightness reversal. Cognition 40, 1–19.
KALOCSAI, P. K., ZHAO, W., AND ELAGIN, E. 1998. Face similarity space as perceived by humans and artificial systems. In Proceedings, International Conference on Automatic Face and Gesture Recognition. 177–180.
KANADE, T. 1973. Computer Recognition of Human Faces. Birkhauser, Basel, Switzerland, and Stuttgart, Germany.
KELLY, M. D. 1970. Visual identification of people by computer. Tech. rep. AI-130, Stanford AI Project, Stanford, CA.
KIRBY, M. AND SIROVICH, L. 1990. Application of the Karhunen-Loeve procedure for the characterization of human faces. IEEE Trans. Patt. Anal. Mach. Intell. 12.
KLASEN, L. AND LI, H. 1998. Faceless identification. In Face Recognition: From Theory to Applications, H. Wechsler, P. J. Phillips, V. Bruce, F. F. Soulie, and T. S. Huang, Eds. Springer-Verlag, Berlin, Germany, 513–527.
KNIGHT, B. AND JOHNSTON, A. 1997. The role of movement in face recognition. Vis. Cog. 4, 265–274.
KRUGER, N., POTZSCH, M., AND MALSBURG, C. V. D. 1997. Determination of face position and pose with a learned representation based on labelled graphs. Image Vis. Comput. 15, 665–673.
KUNG, S. Y. AND TAUR, J. S. 1995. Decision-based neural networks with signal/image classification applications. IEEE Trans. Neural Netw. 6, 170–181.
LADES, M., VORBRUGGEN, J., BUHMANN, J., LANGE, J., MALSBURG, C. V. D., WURTZ, R., AND KONEN, W. 1993. Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. Comput. 42, 300–311.


LANITIS, A., TAYLOR, C. J., AND COOTES, T. F. 1995. Automatic face identification system using flexible appearance models. Image Vis. Comput. 13, 393–401.
LAWRENCE, S., GILES, C. L., TSOI, A. C., AND BACK, A. D. 1997. Face recognition: A convolutional neural-network approach. IEEE Trans. Neural Netw. 8, 98–113.
LI, B. AND CHELLAPPA, R. 2001. Face verification through tracking facial features. J. Opt. Soc. Am. 18.
LI, S. Z. AND LU, J. 1999. Face recognition using the nearest feature line method. IEEE Trans. Neural Netw. 10, 439–443.
LI, Y., GONG, S., AND LIDDELL, H. 2001a. Constructing facial identity surfaces in a nonlinear discriminating space. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition.
LI, Y., GONG, S., AND LIDDELL, H. 2001b. Modelling face dynamics across view and over time. In Proceedings, International Conference on Computer Vision.
LIN, S. H., KUNG, S. Y., AND LIN, L. J. 1997. Face recognition/detection by probabilistic decision-based neural network. IEEE Trans. Neural Netw. 8, 114–132.
LIU, C. AND WECHSLER, H. 2000a. Evolutionary pursuit and its application to face recognition. IEEE Trans. Patt. Anal. Mach. Intell. 22, 570–582.
LIU, C. AND WECHSLER, H. 2000b. Robust coding scheme for indexing and retrieval from large face databases. IEEE Trans. Image Process. 9, 132–137.
LIU, C. AND WECHSLER, H. 2001. A shape- and texture-based enhanced Fisher classifier for face recognition. IEEE Trans. Image Process. 10, 598–608.
LIU, J. AND CHEN, R. 1998. Sequential Monte Carlo methods for dynamic systems. J. Am. Stat. Assoc. 93, 1031–1041.
MANJUNATH, B. S., CHELLAPPA, R., AND MALSBURG, C. V. D. 1992. A feature based approach to face recognition. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition. 373–378.
MARR, D. 1982. Vision. W. H. Freeman, San Francisco, CA.
MARTINEZ, A. 2002. Recognizing imprecisely localized, partially occluded and expression variant faces from a single sample per class. IEEE Trans. Patt. Anal. Mach. Intell. 24, 748–763.
MARTINEZ, A. AND KAK, A. C. 2001. PCA versus LDA. IEEE Trans. Patt. Anal. Mach. Intell. 23, 228–233.
MAURER, T. AND MALSBURG, C. V. D. 1996a. Single-view based recognition of faces rotated in depth. In Proceedings, International Workshop on Automatic Face and Gesture Recognition. 176–181.
MAURER, T. AND MALSBURG, C. V. D. 1996b. Tracking and learning graphs and pose on image sequences of faces. In Proceedings, International Conference on Automatic Face and Gesture Recognition. 176–181.
MCKENNA, S. J. AND GONG, S. 1997. Non-intrusive person authentication for access control by visual tracking and face recognition. In Proceedings, International Conference on Audio- and Video-Based Person Authentication. 177–183.
MCKENNA, S. AND GONG, S. 1998. Recognising moving faces. In Face Recognition: From Theory to Applications, H. Wechsler, P. J. Phillips, V. Bruce, F. F. Soulie, and T. S. Huang, Eds. Springer-Verlag, Berlin, Germany, 578–588.
MATAS, J. ET AL. 2000. Comparison of face verification results on the XM2VTS database. In Proceedings, International Conference on Pattern Recognition. Vol. 4, 858–863.
MESSER, K., MATAS, J., KITTLER, J., LUETTIN, J., AND MAITRE, G. 1999. XM2VTSDB: The extended M2VTS database. In Proceedings, International Conference on Audio- and Video-Based Person Authentication. 72–77.
MIKA, S., RATSCH, G., WESTON, J., SCHOLKOPF, B., AND MULLER, K.-R. 1999. Fisher discriminant analysis with kernels. In Proceedings, IEEE Workshop on Neural Networks for Signal Processing.
MOGHADDAM, B., NASTAR, C., AND PENTLAND, A. 1996. A Bayesian similarity measure for direct image matching. In Proceedings, International Conference on Pattern Recognition.
MOGHADDAM, B. AND PENTLAND, A. 1997. Probabilistic visual learning for object representation. IEEE Trans. Patt. Anal. Mach. Intell. 19, 696–710.
MOON, H. AND PHILLIPS, P. J. 2001. Computational and performance aspects of PCA-based face recognition algorithms. Perception 30, 301–321.
MURASE, H. AND NAYAR, S. 1995. Visual learning and recognition of 3D objects from appearances. Int. J. Comput. Vis. 14, 5–25.
NEFIAN, A. V. AND HAYES III, M. H. 1998. Hidden Markov models for face recognition. In Proceedings, International Conference on Acoustics, Speech and Signal Processing. 2721–2724.
OKADA, K., STEFFANS, J., MAURER, T., HONG, H., ELAGIN, E., NEVEN, H., AND MALSBURG, C. V. D. 1998. The Bochum/USC face recognition system and how it fared in the FERET Phase III test. In Face Recognition: From Theory to Applications, H. Wechsler, P. J. Phillips, V. Bruce, F. F. Soulie, and T. S. Huang, Eds. Springer-Verlag, Berlin, Germany, 186–205.
O'TOOLE, A. J., ROARK, D., AND ABDI, H. 2002. Recognizing moving faces: A psychological and neural synthesis. Trends Cogn. Sci. 6, 261–266.
PANTIC, M. AND ROTHKRANTZ, L. J. M. 2000. Automatic analysis of facial expressions: The state of the art. IEEE Trans. Patt. Anal. Mach. Intell. 22, 1424–1446.


PENEV, P. AND SIROVICH, L. 2000. The global dimensionality of face space. In Proceedings, International Conference on Automatic Face and Gesture Recognition.
PENEV, P. AND ATICK, J. 1996. Local feature analysis: A general statistical theory for object representation. Netw.: Computat. Neural Syst. 7, 477–500.
PENTLAND, A., MOGHADDAM, B., AND STARNER, T. 1994. View-based and modular eigenspaces for face recognition. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition.
PERKINS, D. 1975. A definition of caricature and recognition. Stud. Anthro. Vis. Commun. 2, 1–24.
PHILLIPS, P. J., GROTHER, P. J., MICHEALS, R. J., BLACKBURN, D. M., TABASSI, E., AND BONE, J. M. 2003. Face recognition vendor test 2002: Evaluation report. NISTIR 6965. Available online at http://www.frvt.org.
PHILLIPS, P. J. 1998. Support vector machines applied to face recognition. Adv. Neural Inform. Process. Syst. 11, 803–809.
PHILLIPS, P. J., MCCABE, R. M., AND CHELLAPPA, R. 1998. Biometric image processing and recognition. In Proceedings, European Signal Processing Conference.
PHILLIPS, P. J., MOON, H., RIZVI, S., AND RAUSS, P. 2000. The FERET evaluation methodology for face-recognition algorithms. IEEE Trans. Patt. Anal. Mach. Intell. 22.
PHILLIPS, P. J., WECHSLER, H., HUANG, J., AND RAUSS, P. 1998b. The FERET database and evaluation procedure for face-recognition algorithms. Image Vis. Comput. 16, 295–306.
PIGEON, S. AND VANDENDORPE, L. 1999. The M2VTS multimodal face database (Release 1.00). In Proceedings, International Conference on Audio- and Video-Based Person Authentication. 403–409.
RIKLIN-RAVIV, T. AND SHASHUA, A. 1999. The quotient image: Class based re-rendering and recognition with varying illuminations. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition. 566–571.
RIZVI, S. A., PHILLIPS, P. J., AND MOON, H. 1998. A verification protocol and statistical performance analysis for face recognition algorithms. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition. 833–838.
ROWLEY, H. A., BALUJA, S., AND KANADE, T. 1998. Neural network based face detection. IEEE Trans. Patt. Anal. Mach. Intell. 20.
CHOUDHURY, A. K. R. AND CHELLAPPA, R. 2003. Face reconstruction from monocular video using uncertainty analysis and a generic model. Comput. Vis. Image Understand. 91, 188–213.
RUDERMAN, D. L. 1994. The statistics of natural images. Netw.: Comput. Neural Syst. 5, 598–605.
SALI, E. AND ULLMAN, S. 1998. Recognizing novel 3-D objects under new illumination and viewing position using a small number of example views or even a single view. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition. 153–161.
SAMAL, A. AND IYENGAR, P. 1992. Automatic recognition and analysis of human faces and facial expressions: A survey. Patt. Recog. 25, 65–77.
SAMARIA, F. 1994. Face recognition using hidden Markov models. Ph.D. dissertation. University of Cambridge, Cambridge, U.K.
SAMARIA, F. AND YOUNG, S. 1994. HMM based architecture for face identification. Image Vis. Comput. 12, 537–583.
SCHNEIDERMAN, H. AND KANADE, T. 2000. Probabilistic modelling of local appearance and spatial relationships for object recognition. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition. 746–751.
SERGENT, J. 1986. Microgenesis of face perception. In Aspects of Face Processing, H. D. Ellis, M. A. Jeeves, F. Newcombe, and A. Young, Eds. Nijhoff, Dordrecht, The Netherlands.
SHASHUA, A. 1994. Geometry and photometry in 3D visual recognition. Ph.D. dissertation. Massachusetts Institute of Technology, Cambridge, MA.
SHEPHERD, J. W., DAVIES, G. M., AND ELLIS, H. D. 1981. Studies of cue saliency. In Perceiving and Remembering Faces, G. M. Davies, H. D. Ellis, and J. W. Shepherd, Eds. Academic Press, London, U.K.
SHIO, A. AND SKLANSKY, J. 1991. Segmentation of people in motion. In Proceedings, IEEE Workshop on Visual Motion. 325–332.
SIROVICH, L. AND KIRBY, M. 1987. Low-dimensional procedure for the characterization of human faces. J. Opt. Soc. Am. 4, 519–524.
STEFFENS, J., ELAGIN, E., AND NEVEN, H. 1998. PersonSpotter—fast and robust system for human detection, tracking and recognition. In Proceedings, International Conference on Automatic Face and Gesture Recognition. 516–521.
STROM, J., JEBARA, T., BASU, S., AND PENTLAND, A. 1999. Real time tracking and modeling of faces: An EKF-based analysis by synthesis approach. Tech. rep. TR-506, MIT Media Lab, Massachusetts Institute of Technology, Cambridge, MA.
SUNG, K. AND POGGIO, T. 1997. Example-based learning for view-based human face detection. IEEE Trans. Patt. Anal. Mach. Intell. 20, 39–51.
SWETS, D. L. AND WENG, J. 1996b. Using discriminant eigenfeatures for image retrieval. IEEE Trans. Patt. Anal. Mach. Intell. 18, 831–836.
SWETS, D. L. AND WENG, J. 1996. Discriminant analysis and eigenspace partition tree for face and object recognition from views. In Proceedings, International Conference on Automatic Face and Gesture Recognition. 192–197.


TARR, M. J. AND BULTHOFF, H. H. 1995. Is human object recognition better described by geon structural descriptions or by multiple views—comment on Biederman and Gerhardstein (1993). J. Exp. Psych.: Hum. Percep. Perf. 21, 71–86.
TERZOPOULOS, D. AND WATERS, K. 1993. Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Trans. Patt. Anal. Mach. Intell. 15, 569–579.
THOMPSON, P. 1980. Margaret Thatcher—A new illusion. Perception 9, 483–484.
TSAI, P. S. AND SHAH, M. 1994. Shape from shading using linear approximation. Image Vis. Comput. 12, 487–498.
TRIGGS, B., MCLAUCHLAN, P., HARTLEY, R., AND FITZGIBBON, A. 2000. Bundle adjustment—a modern synthesis. In Vision Algorithms: Theory and Practice. Springer-Verlag, Berlin, Germany.
TURK, M. AND PENTLAND, A. 1991. Eigenfaces for recognition. J. Cogn. Neurosci. 3, 72–86.
ULLMAN, S. AND BASRI, R. 1991. Recognition by linear combinations of models. IEEE Trans. Patt. Anal. Mach. Intell. 13, 992–1006.
VAPNIK, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY.
VETTER, T. AND POGGIO, T. 1997. Linear object classes and image synthesis from a single example image. IEEE Trans. Patt. Anal. Mach. Intell. 19, 733–742.
VIOLA, P. AND JONES, M. 2001. Rapid object detection using a boosted cascade of simple features. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition.
WECHSLER, H., KAKKAD, V., HUANG, J., GUTTA, S., AND CHEN, V. 1997. Automatic video-based person authentication using the RBF network. In Proceedings, International Conference on Audio- and Video-Based Person Authentication. 85–92.
WILDER, J. 1994. Face recognition using transform coding of gray scale projection and the neural tree network. In Artificial Neural Networks with Applications in Speech and Vision, R. J. Mammone, Ed. Chapman Hall, New York, NY, 520–536.
WISKOTT, L., FELLOUS, J.-M., AND VON DER MALSBURG, C. 1997. Face recognition by elastic bunch graph matching. IEEE Trans. Patt. Anal. Mach. Intell. 19, 775–779.
YANG, M. H., KRIEGMAN, D., AND AHUJA, N. 2002. Detecting faces in images: A survey. IEEE Trans. Patt. Anal. Mach. Intell. 24, 34–58.
YIN, R. K. 1969. Looking at upside-down faces. J. Exp. Psych. 81, 141–151.
YUILLE, A. L., COHEN, D. S., AND HALLINAN, P. W. 1992. Feature extraction from faces using deformable templates. Int. J. Comput. Vis. 8, 99–112.
YUILLE, A. AND HALLINAN, P. 1992. Deformable templates. In Active Vision, A. Blake and A. Yuille, Eds., Cambridge, MA, 21–38.
ZHAO, W. 1999. Robust Image Based 3D Face Recognition. Ph.D. dissertation. University of Maryland, College Park, MD.
ZHAO, W. AND CHELLAPPA, R. 2000b. SFS-based view synthesis for robust face recognition. In Proceedings, International Conference on Automatic Face and Gesture Recognition.
ZHAO, W. AND CHELLAPPA, R. 2000. Illumination-insensitive face recognition using symmetric shape-from-shading. In Proceedings, Conference on Computer Vision and Pattern Recognition. 286–293.
ZHAO, W., CHELLAPPA, R., AND KRISHNASWAMY, A. 1998. Discriminant analysis of principal components for face recognition. In Proceedings, International Conference on Automatic Face and Gesture Recognition. 336–341.
ZHAO, W., CHELLAPPA, R., AND PHILLIPS, P. J. 1999. Subspace linear discriminant analysis for face recognition. Tech. rep. CAR-TR-914, Center for Automation Research, University of Maryland, College Park, MD.
ZHOU, S., KRUEGER, V., AND CHELLAPPA, R. 2003. Probabilistic recognition of human faces from video. Comput. Vis. Image Understand. 91, 214–245.

Received July 2002; accepted June 2003
