REAL-TIME HUMAN FACIAL BEHAVIOR UNDERSTANDING FOR HUMAN COMPUTER
INTERACTION
By
Zhiwei Zhu
A Thesis Submitted to the Graduate
Faculty of Rensselaer Polytechnic Institute
in Partial Fulfillment of the
Requirements for the Degree of
DOCTOR OF PHILOSOPHY
Major Subject: Department of Electrical, Computer and Systems Engineering
Approved by the Examining Committee:
Qiang Ji, Thesis Adviser
Badrinath Roysam, Member
John Wen, Member
Wayne Gray, Member
Rensselaer Polytechnic Institute
Troy, New York
December 2005 (For Graduation December 2005)
REAL-TIME HUMAN FACIAL BEHAVIOR UNDERSTANDING FOR HUMAN COMPUTER
INTERACTION
By
Zhiwei Zhu
An Abstract of a Thesis Submitted to the Graduate
Faculty of Rensselaer Polytechnic Institute
in Partial Fulfillment of the
Requirements for the Degree of
DOCTOR OF PHILOSOPHY
Major Subject: Department of Electrical, Computer and Systems Engineering
The original of the complete thesis is on file in the Rensselaer Polytechnic Institute Library
Examining Committee:
Qiang Ji, Thesis Adviser
Badrinath Roysam, Member
John Wen, Member
Wayne Gray, Member
Rensselaer Polytechnic Institute
Troy, New York
December 2005 (For Graduation December 2005)
© Copyright 2005
by
Zhiwei Zhu
All Rights Reserved
CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
ACKNOWLEDGEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Vision-Based Human Sensing . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Fundamental Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2. Real-Time Eye Detection and Tracking . . . . . . . . . . . . . . . . . . . . 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Eye Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Initial Eye Position Detection . . . . . . . . . . . . . . . . . . 16
2.2.2 Eye Verification Using Support Vector Machines . . . . . . . . 18
2.2.2.1 Support Vector Machines . . . . . . . . . . . . . . . 19
2.2.2.2 SVM Training . . . . . . . . . . . . . . . . . . . . . 20
2.2.2.3 Retraining Using Mis-labeled Data . . . . . . . . . . 22
2.2.2.4 Eye Detection with SVM . . . . . . . . . . . . . . . 23
2.3 Eye Tracking Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Eye (Pupil) Tracking with Kalman Filtering . . . . . . . . . . 24
2.3.2 Mean Shift Eye Tracking . . . . . . . . . . . . . . . . . . . . . 28
2.3.2.1 Similarity Measure . . . . . . . . . . . . . . . . . . . 29
2.3.2.2 Eye Appearance Model . . . . . . . . . . . . . . . . . 29
2.3.2.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.2.4 Mean Shift Tracking Parameters . . . . . . . . . . . 31
2.3.2.5 Experiments On Mean Shift Eye Tracking . . . . . . 33
2.4 Combining Kalman Filtering Tracking with Mean Shift Tracking . . . 34
2.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.5.1 Eye Tracking Under Significant Head Pose Changes . . . . . . 36
2.5.2 Eye Tracking Under Different Illuminations . . . . . . . . . . 37
2.5.3 Eye Tracking With Glasses . . . . . . . . . . . . . . . . . . . . 39
2.5.4 Eye Tracking With Multiple People . . . . . . . . . . . . . . . 40
2.5.5 Occlusion Handling . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5.6 Tracking Accuracy Validation . . . . . . . . . . . . . . . . . . 41
2.5.7 Processing Speed . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3. Eye Gaze Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.1 2D Mapping-Based Gaze Estimation Technique . . . . . . . . 45
3.2.2 Direct 3D Gaze Estimation Technique . . . . . . . . . . . . . 47
3.3 Direct 3D Gaze Estimation Technique . . . . . . . . . . . . . . . . . . 49
3.3.1 The Structure of Human Eyeball . . . . . . . . . . . . . . . . 49
3.3.2 Derivation of 3D Cornea Center . . . . . . . . . . . . . . . . . 50
3.3.2.1 The Structure of Cornea . . . . . . . . . . . . . . . . 50
3.3.3 Image Formation in the Convex Mirror . . . . . . . . . . . . . 51
3.3.3.1 Glint Formation In Cornea Reflection . . . . . . . . . 52
3.3.3.2 Curvature Center of the Cornea . . . . . . . . . . . . 53
3.3.4 Computation of 3D Gaze Direction . . . . . . . . . . . . . . . 55
3.3.4.1 Estimation of Optic Axis . . . . . . . . . . . . . . . 55
3.3.4.2 Compensation of the Angle Deviation between Visual Axis and Optic Axis . . . . . . . . . . . . . 56
3.4 2D Mapping-Based Gaze Estimation Technique . . . . . . . . . . . . 57
3.4.1 Classical PCCR Technique . . . . . . . . . . . . . . . . . . . . 57
3.4.2 Head Motion Effects on Pupil-glint Vector . . . . . . . . . . . 58
3.4.3 Dynamic Head Compensation Model . . . . . . . . . . . . . . 60
3.4.3.1 Approach Overview . . . . . . . . . . . . . . . . . . 60
3.4.3.2 Image Projection of Pupil-glint Vector . . . . . . . . 61
3.4.3.3 First Case: The cornea center and the pupil center lie on the camera’s X − Z plane . . . . . . . . . . . 63
3.4.3.4 Second Case: The cornea center and the pupil center do not lie on the camera’s X − Z plane . . . . . . . 64
3.4.3.5 Iterative Algorithm for Gaze Estimation . . . . . . . 66
3.5 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.5.1 System Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.5.2 Performance of 3D Gaze Tracking Technique . . . . . . . . . . 68
3.5.2.1 Gaze Estimation Accuracy . . . . . . . . . . . . . . . 68
3.5.2.2 Comparison with Other Methods . . . . . . . . . . . 70
3.5.3 Performance of 2D Mapping Based Gaze Tracking Technique . 70
3.5.3.1 Head Compensation Model Validation . . . . . . . . 70
3.5.3.2 Gaze Estimation Accuracy . . . . . . . . . . . . . . . 71
3.6 Comparison of Both Techniques . . . . . . . . . . . . . . . . . . . . . 73
3.6.1 Processing Speed . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4. Robust Face Tracking Using Case-Based Reasoning with Confidence . . . . 76
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 The Mathematical Framework . . . . . . . . . . . . . . . . . . . . . . 79
4.3.1 2D Visual Tracking . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.2 The Proposed Solution . . . . . . . . . . . . . . . . . . . . . . 80
4.4 The CBR Visual Tracking Algorithm . . . . . . . . . . . . . . . . . . 81
4.4.1 Case-Based Reasoning . . . . . . . . . . . . . . . . . . . . . . 81
4.4.2 Case Base Construction . . . . . . . . . . . . . . . . . . . . . 83
4.4.3 Case Retrieving . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.4.4 Case Adaption (Reusing) . . . . . . . . . . . . . . . . . . . . . 85
4.4.5 Case Revising and Retaining . . . . . . . . . . . . . . . . . . . 86
4.5 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.5.1 Drifting-Elimination Capability . . . . . . . . . . . . . . . . . 87
4.5.2 Confidence-Assessment Capability . . . . . . . . . . . . . . . . 89
4.5.3 Performance under Illumination Changes . . . . . . . . . . . . 90
4.5.4 Processing Speed . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5. Real-Time Facial Feature Tracking . . . . . . . . . . . . . . . . . . . . . . 93
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Facial Feature Representation . . . . . . . . . . . . . . . . . . . . . . 95
5.2.1 Pyramidal Gabor Wavelets . . . . . . . . . . . . . . . . . . . . 97
5.2.2 Fast Phase-Based Displacement Estimation . . . . . . . . . . . 98
5.3 Facial Feature Detection . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3.1 Facial Feature Approximation . . . . . . . . . . . . . . . . . . 100
5.3.2 Facial Feature Refinement . . . . . . . . . . . . . . . . . . . . 101
5.4 Facial Feature Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4.1 Facial Feature Prediction . . . . . . . . . . . . . . . . . . . . . 103
5.4.2 Facial Feature Measurement . . . . . . . . . . . . . . . . . . . 105
5.4.3 Facial Feature Correction . . . . . . . . . . . . . . . . . . . . 105
5.4.3.1 Facial Feature Refinement . . . . . . . . . . . . . . . 106
5.4.3.2 Imposing Geometry Constraints . . . . . . . . . . . . 106
5.5 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.5.1 Facial Feature Tracking Accuracy . . . . . . . . . . . . . . . . 112
5.5.2 Processing Speed . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.6 Comparison with IR-Based Eye Tracker . . . . . . . . . . . . . . . . . 114
5.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6. Nonrigid and Rigid Facial Motion Estimation . . . . . . . . . . . . . . . . 116
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.3 Pose and Expression Modelling . . . . . . . . . . . . . . . . . . . . . 119
6.3.1 3D Face Representation . . . . . . . . . . . . . . . . . . . . . 119
6.3.2 3D Deformable Face Model . . . . . . . . . . . . . . . . . . . 120
6.3.3 3D Motion Projection Model . . . . . . . . . . . . . . . . . . . 122
6.4 Normalized SVD for Pose and Expression Decomposition . . . . . . . 123
6.4.1 SVD Decomposition Method . . . . . . . . . . . . . . . . . . . 123
6.4.2 Condition of the Linear System . . . . . . . . . . . . . . . . . 124
6.4.3 Normalization SVD Technique . . . . . . . . . . . . . . . . . . 126
6.4.4 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.5 Nonlinear Decomposition Method . . . . . . . . . . . . . . . . . . . . 130
6.6 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.6.1 Performance on Synthetic Data . . . . . . . . . . . . . . . . . 132
6.6.2 Performance on Real Image Sequences . . . . . . . . . . . . . 134
6.6.2.1 Neutral Face Under Various Face Orientations . . . . 134
6.6.2.2 Frontal Face with Different Facial Expressions . . . . 136
6.6.2.3 Non-neutral Face Under Various Face Orientations . . 138
6.6.3 Processing Speed . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7. Facial Expression Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.2 Facial Expressions with AUs . . . . . . . . . . . . . . . . . . . . . . . 143
7.3 Coding AUs with Feature Movement Parameters . . . . . . . . . . . . 144
7.4 Modelling Spatial Dependency . . . . . . . . . . . . . . . . . . . . . . 145
7.5 Modelling Temporal Dynamics . . . . . . . . . . . . . . . . . . . . . . 147
7.6 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.6.1 Processing Speed . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
LIST OF TABLES
2.1 Experiment results using 3 kernel types with different parameters . . . . 22
2.2 Tracking statistics comparison for both trackers under different eye conditions (open, closed, occluded) on the first person . . . . . . . . . . 39
2.3 Tracking statistics comparison for both trackers under different eye conditions (open, closed, occluded) on the second person . . . . . . . . 40
3.1 The gaze estimation accuracy for the first subject. . . . . . . . . . . . . 69
3.2 The gaze estimation accuracy for seven subjects . . . . . . . . . . . . . 69
3.3 Comparison with other systems . . . . . . . . . . . . . . . . . . . . . . 70
3.4 Pupil-glint vector comparison at different eye locations . . . . . . . . . 71
3.5 Gaze estimation accuracy under different eye image resolutions . . . . . 73
6.1 The average RMSEs of the extracted facial deformation vectors for different image sequences . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.2 The average error of the extracted face pose angles for different image sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.3 The average RMSEs of the extracted facial deformation vectors for different image sequences . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.1 The association of six basic facial expressions with AUs . . . . . . . . . 144
7.2 The association between facial action units and facial feature movement parameters (FMPs) . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.3 Confusion statistics from the 700-frame sequence . . . . . . . . . . . . . 150
LIST OF FIGURES
2.1 The disappearance or weakness of the bright pupils due to (a) eye closure, (b) oblique face orientation, (c) eye glasses glare and (d) strong external illumination interference . . . . . . . . . . . . . . . . . . . . . 12
2.2 The combined eye tracking flowchart . . . . . . . . . . . . . . . . . . . . 15
2.3 The bright-pupil (a) and dark-pupil (b) images . . . . . . . . . . . . . . 15
2.4 Eye detection block diagram . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Background illumination interference removal: (a) the even-field images obtained under both ambient and IR light; (b) the odd-field images obtained under only ambient light; (c) the difference images resulting from subtracting (b) from (a) . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 The thresholded difference image marked with pupil candidates . . . . . 17
2.7 The thresholded difference image after removing some blobs based on their geometric properties (shape and size). The blobs marked with circles are selected for further consideration . . . . . . . . . . . . . . . 18
2.8 (a) The thresholded difference image superimposed with possible pupil candidates. (b) The dark image marked with possible eye candidates according to the positions of pupil candidates in (a) . . . . . . . . . . 21
2.9 The eye images in the positive training set . . . . . . . . . . . . . . . . 21
2.10 The non-eye images in the negative training set . . . . . . . . . . . . . . 22
2.11 The result images (a) and (b) marked with identified eyes. Compared with the images in Figure 2.8 (b), many false alarms have been removed . . 23
2.12 The combined eye tracking flowchart . . . . . . . . . . . . . . . . . . . . . 25
2.13 The eye images: (a)(b) left and right bright-pupil eyes; (c)(d) corresponding left and right dark-pupil eyes . . . . . . . . . . . . . . . . . . 29
2.14 (a) The image frame 13; (b) values of the Bhattacharyya coefficient corresponding to the marked region (40 × 40 pixels) around the left eye in frame 13. The mean shift algorithm converges from the initial location (∗) to the convergence point, which is a mode of the Bhattacharyya surface 32
2.15 The error distribution of tracking results: (a) error distribution vs. intensity quantization values and different window sizes; (b) error distribution vs. quantization levels only . . . . . . . . . . . . . . . . . . . . 33
2.16 Mean-shift tracking of both eyes with an initial search area of 40×40 pixels, as represented by the large black rectangle. The eyes marked with white rectangles in frame 1 are used as the eye model and the tracked eyes in the following frames are marked by the smaller black rectangles . . . 34
2.17 (a) Image of frame 135, with the initial eye position marked and the initial search area outlined by the large black rectangle. (b) Values of the Bhattacharyya coefficient corresponding to the marked region (40×40 pixels) around the left eye in (a). The mean shift algorithm cannot converge from the initial location (which is in the valley between two modes) to the correct mode of the surface. Instead, it is trapped in the valley . . . 35
2.18 Bright pupil based Kalman tracker fails to track eyes due to absence of bright pupils caused by either eye closure or oblique face orientations. The mean shift eye tracker, however, tracks eyes successfully as indicated by the black rectangles . . . . . . . . . . . . . . . . . . . . . . . . 35
2.19 An image sequence to demonstrate the drift-away problem of the mean-shift tracker as well as the correction of the problem by the integrated eye tracker. Frames (a-e) show the drift-away case of the mean-shift eye tracker; for the same image sequences, frames (A-E) show the improved results of the combined eye tracker. White rectangles show the eyes tracked by the Kalman tracker while the black rectangles show the eyes tracked by the mean shift tracker . . . . . . . . . . . . . . . . . . 36
2.20 Tracking results of the combined eye tracker for a person undergoing significant head movements . . . . . . . . . . . . . . . . . . . . . . . . 37
2.21 Tracking results of the combined eye tracker for four image sequences (a), (b), (c) and (d) under significant head movements . . . . . . . . . 38
2.22 Tracking results of the combined eye tracker for two image sequences (a) and (b) under significant illumination changes . . . . . . . . . . . . 39
2.23 Tracking results of the combined eye tracker for two image sequences (a) and (b) with persons wearing glasses . . . . . . . . . . . . . . . . . 40
2.24 Tracking results of the combined eye tracker for multiple persons. . . . . 41
2.25 Tracking results of the combined eye tracker for an image sequence involving multiple persons occluding each other’s eyes . . . . . . . . . . 42
2.26 The comparison between the automatically tracked eye positions and the manually located eye positions for 100 randomly selected consecutive frames: (a) x coordinate and (b) y coordinate . . . . . . . . . . . . 42
3.1 Eye images with corneal reflection (glint): (a) dark pupil image (b) bright pupil image. Glint is a small bright spot as indicated in (a) and (b) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 The structure of the eyeball (top view of the right eye) . . . . . . . . . 49
3.3 The reflection diagram of Purkinje images . . . . . . . . . . . . . . . . . 51
3.4 A ray diagram to locate the image of an object in a convex mirror . . . 51
3.5 The image formation of a point light source in the cornea when the cornea serves as a convex mirror . . . . . . . . . . . . . . . . . . . . . 53
3.6 The ray diagram of the virtual image of the IR light source in front of the cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.7 The ray diagram of two IR light sources in front of the cornea . . . . . 54
3.8 Pupil and glint image formations when eyes are located at different positions while gazing at the same screen point (side view) . . . . . . . 59
3.9 The pupil-glint vectors generated in the eye images when the eye is located at O1 and O2 in Figure 3.8 . . . . . . . . . . . . . . . . . . . . 59
3.10 Pupil and glint image formation when the eye is located at different positions in front of the camera . . . . . . . . . . . . . . . . . . . . . . 62
3.11 Pupil and glint image formation when the eye is located at different positions in front of the camera (top-down view) . . . . . . . . . . . . 63
3.12 Projection into camera’s X − Z plane . . . . . . . . . . . . . . . . . . . 65
3.13 The configuration of the gaze-tracking system . . . . . . . . . . . . . . 67
3.14 The pupil-glint vector transformation errors: (a) transformation error on the X component of the pupil-glint vector, (b) transformation error on the Y component of the pupil-glint vector . . . . . . . . . . . . . . 71
3.15 The plot of the estimated gaze points and the true gaze points, where “+” represents the estimated gaze point and “*” represents the actual gaze point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.1 The diagram of the proposed algorithm to improve the accuracy of the 2D object tracking model . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2 The Case-Based Cycle of the face tracking system. . . . . . . . . . . . 83
4.3 Comparison of the face tracking results with different techniques. The first row shows the tracking results by the incremental subspace learning technique and the tracked face is marked by a red rectangle; the second row represents the tracking results by our proposed technique and the tracked face is marked by a dark square. The images are frames 26, 126, 352, 415 and 508 from left to right . . . . . . . . . . . . . . . . . . . . 87
4.4 Comparisons of the tracked face position error: (a) between the proposed tracker and the two-frame tracker; (b) between the proposed tracker and the offline tracker; (c) between the proposed tracker and the incremental subspace learning tracker . . . . . . . . . . . . . . . . 88
4.5 (a) The similarities computed by the two-frame tracker; (b) The RMSE errors computed by the incremental subspace learning tracker; (c) The confidence scores computed by our proposed tracker . . . . . . . . . . 89
4.6 The face tracking results with significant facial expression changes, large head movements and occlusion. For each frame, the tracked face is marked by a dark square. The upper row displays the image frames 29, 193, 211, 236 from left to right, while the lower row displays the image frames 237, 238, 412 and 444 from left to right . . . . . . . . . . . . . 90
4.7 The estimated confidence measures . . . . . . . . . . . . . . . . . . . . 91
4.8 Face tracking results under significant changes in illumination and head movement. The tracked face is marked by a dark square in each image. From left to right, the selected image frames are 75, 246, 295, 444 in the first row, while the second row displays the image frames 662, 706, 745 and 898 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.1 The flowchart of the proposed facial feature detection and tracking algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2 An image pyramid with three levels: (a) the base level contains a 320×240 pixel image; (b) the first level contains a 160×120 pixel image; (c) the second level contains an 80×60 pixel image; (d) the third level contains a 40×30 pixel image . . . . . . . . . . . . . . . . . . . . . . . 97
5.3 A face mesh with facial features . . . . . . . . . . . . . . . . . . . . . . 100
5.4 Spatial geometry of the facial features in a frontal face region. Eyes are marked by the small white rectangles, the face region is marked with large white rectangles, and the facial features are marked with white circles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.5 (a) face mesh, (b) face image with approximated facial features, (c) face image with refined facial features . . . . . . . . . . . . . . . . . . . . . 101
5.6 The face images with detected facial features under different facial expressions: (a) disgust, (b) anger, (c) surprise and (d) happy . . . . . . 102
5.7 The flowchart of the proposed tracking algorithm . . . . . . . . . . . . . 103
5.8 (a) A frontal face image, and (b) its 3D face geometry, with the selected facial features marked as the white dots . . . . . . . . . . . . . . . . . 107
5.9 The randomly selected face images from different image sequences. . . . 112
5.10 The computed position errors of the automatically extracted facial features by the proposed facial feature tracker: (a) position errors in the X-direction for each facial feature; (b) position errors in the Y-direction for each facial feature . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.1 (a) The spatial geometry of the selected facial features marked by the dark dots; (b) The 3D face mesh with the selected facial features marked by the white dots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.2 Average errors of the estimated parameters by the SVD method and the proposed N-SVD method respectively as a function of Gaussian noise: (a) face pose error; (b) face scale factor error; (c) facial deformation error . . 134
6.3 Average errors of the estimated parameters by the proposed N-SVD method and nonlinear method respectively as a function of Gaussian noise: (a) face pose error; (b) face scale factor error; (c) facial deformation error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.4 The randomly selected face images from a set of different neutral face image sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.5 The calculated RMSEs of the estimated facial deformations . . . . . . . 136
6.6 (a) The three estimated face pose angles; (b) The estimated face scale factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.7 The randomly selected images from a frontal face image sequence . . . . 138
6.8 (a) The calculated RMSE of the estimated facial deformation vectors; (b) The average error of the estimated face pose angles . . . . . . . . . 139
6.9 The randomly selected images from three face image sequences with different facial expressions. (Top: happy; Middle: surprise; Bottom: disgust) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.10 (a) The calculated RMSE of the estimated facial deformation vectors; (b) The estimated three face pose angles; (c) The estimated face scale factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.1 The spatial geometry of the selected facial features marked by the dark dots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.2 The BN model of six basic facial expressions. In this model, “HAP” represents “Happy,” “ANG” represents “Anger,” “SAD” represents “Sad,” “DIS” represents “Disgust,” “FEA” represents “Fear,” and “SUR” represents “Surprise” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.3 The temporal links of DBN for modelling facial expression (two time slices are shown since the structure repeats by “unrolling” the two-slice BN). Node notations are given in Figure 7.2 . . . . . . . . . . . . . . . 148
7.4 Upper: a video sequence with 700 frames containing six basic facial expressions. It only shows 8 snapshots for illustration. Bottom: the output result shows probability distributions (emotional intensities) over six basic facial expressions resulting from sampling the sequence every 7 frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
ACKNOWLEDGEMENT
I would like to express my deepest thanks to my thesis advisor, Professor Qiang Ji,
for his expert guidance and the valuable suggestions he contributed to this dissertation.
Without his help, this dissertation would not have been possible. For the past five
and a half years, I have learned a lot from him and his energetic working style has
influenced me greatly. I’m really appreciative that I have met Professor Qiang Ji.
My sincere thanks also go to Professor Wayne Gray, who helped me enrich
my knowledge significantly while working with his team for the past few years. I
also thank the rest of my committee members: Professor Badrinath Roysam and
Professor John Wen. Their valuable feedback helped me to improve the dissertation
significantly.
I thank all the members in my group for their wonderful support, and my
special thanks go to Wenhui Liao for her time and outstanding advice on the
development of some ideas in my thesis throughout these years.
This dissertation could not have been accomplished without Shuwen Xia, my
girlfriend, who has always been with me no matter how difficult the situation was.
She has always given me warm encouragement and love.
Last but not least, I thank my parents and my sister very much for supporting
me through all these years.
ABSTRACT
To enhance the interaction between human and computer, a major task for the
Human Computer Interaction (HCI) community is to equip the computer with the
ability to recognize the user’s affective states, intentions and needs in real time and in
a non-intrusive manner. Using video cameras together with a set of computer vision
techniques to interpret and understand the human’s behaviors, vision-based human
sensing technology has the advantages of non-intrusiveness and naturalness. Since
the human face contains rich and powerful information about human behaviors, it
has been extensively studied. Typical facial behaviors characterizing human states
include eye gaze, head gestures and facial expression. This research focuses on
developing real time and non-intrusive computer vision techniques to understand
and recognize various facial behaviors.
Specifically, we have developed a range of computer vision techniques. First,
based on systematically combining the appearance model with the bright-pupil effect
of the eye, we develop a new real-time technique to robustly detect and track the
eyes under variable lightings and face orientations. Second, we introduce a new
gaze estimation method for robustly tracking eye-gaze under natural head movement
and with minimum personal calibration. Third, a robust visual tracking framework
is proposed to track the faces under significant changes in lighting, scale, facial
expression and face movement. Fourth, given the detected face, we develop a new
technique for detecting and tracking twenty-eight facial features under significant
facial expressions and various face orientations. Fifth, based on the set of tracked
facial features, a framework is proposed to recover the rigid and non-rigid facial
motions successfully from a monocular image sequence. Subsequently, from the
recovered non-rigid facial motions, a Dynamic Bayesian Network is utilized to model
and recognize the six basic facial expressions under natural head movement.
All of these techniques are extensively tested with numerous subjects under
various situations such as different lighting conditions, significant head movements,
wearing glasses, etc. Experimental study shows significant improvement of our
techniques over the existing techniques.
CHAPTER 1
Introduction
Today, the keyboard and mouse are the main devices in information exchanges
between human and computer. Interacting with keyboard-mouse-based computers,
however, can be a cumbersome experience because it requires the user to adapt to
the computer by learning how to use the keyboard and mouse. In our daily life, we
employ vision, hearing and touch as natural ways of interaction to communicate with
one another. Although we give off much information about ourselves, the computer’s
inability to recognize this complex information dramatically limits its ability to help
humans. If the computer could understand the visual and audio information from
the human, then it would be able to communicate with humans in natural ways. As a
result, rather than requiring the human to adapt to the computer, the computer can
adapt to the human intelligently, as if it were itself human, by understanding the user:
what the user's mood is, where the user is looking, what the user is doing, and how
the user performs. Therefore, equipping the computer with the ability to see and
sense the human will make the interaction between human
and computer easier, more efficient, more intuitive, and more flexible.
Thus, advanced interface technologies that support the human-like interaction
between human and computer need to be developed, which will help computers hear
(speech recognition), speak (speech synthesis), see (face tracking, eye tracking, human
body tracking), and sense (gaze estimation, face recognition, affect recognition).
However, developing such interface technologies is challenging.
Recently, using a video camera together with a set of computer vision
techniques, researchers on vision-based human sensing technology have been trying to
provide computers with the capability of perception, seeing and sensing the human.
In order to achieve this, numerous research topics have been explored for HCI, such
as hand gesture analysis, head gesture analysis, lip movement analysis, eye gaze
estimation, facial expression analysis, and other body movements analysis. But
most work is still not fast, robust and efficient enough to integrate into functioning
user-interfaces.
In the following section, we will discuss the current developments of those
vision-based human sensing techniques for HCI.
1.1 Vision-Based Human Sensing
By vision, we refer to the use of video cameras and a set of visual or graphical
techniques for representing and processing information. Non-vision-based human
sensing methods are fairly intrusive in that they require physical contact with the
user. For example, Picard et al. [91] have tried to recognize human affective states
based on four different physiological signals that measure the human's facial muscle
activity, heart activity, skin conductance and respiration. These physiological signals
are collected from a set of physiological sensors attached to different parts of the
human body, such as face, fingers, chest, etc. Although these sensors are designed
with minimal size, they are still very invasive to the human and deprive the
human of ease and naturalness when he/she interacts with the computer-controlled
environment. However, the vision-based human sensing technique does not require
any physical contact with the user; in fact it works when the user is physically
located at a distance from the sensors. For example, with the use of remotely
located video cameras together with computer vision techniques, the eye-gaze can
be estimated and used to control the computer remotely [114], which can even free
our hands to perform other tasks while we are using our gaze to control
the computer. Therefore, compared to non-vision-based human sensing
methods, vision-based human sensing has the advantage of unobtrusiveness and it
gives a sense of “naturalness” and comfort during the process of human-
computer interactions. Furthermore, the ever-decreasing price/performance ratio
of computing, coupled with recent decreases in video image acquisition cost, implies
that computer vision systems can be deployed in desktop and embedded systems
[87, 88, 89].
There are numerous areas of research and application where vision-based
human sensing techniques have been studied for HCI. For example, Maes et al. [69]
developed an “ALIVE” system that allows wireless full-body interaction between
a user and a 3D graphical world in a computer. A vision system is developed to
extract information about the user, such as the 3D location of the user and the
position of various body parts as well as simple hand gestures. Therefore, a user can
directly interact with the ALIVE space by gestures as if he were in the real, physical
space, such as playing with a virtual dog via head gestures. A fully automated and
interactive narrative play-space for children called the KidsRoom was demonstrated
by Bobick et al. [9]. Using a vision-based action recognition technique to recognize
what the children are doing, the KidsRoom will automatically react to the children’s
actions by providing entertaining feedback, and the children are aware that the room
is responsive.
A smart kiosk interface is presented by Waters et al. [115]. Equipped with
vision-based techniques, the kiosk can detect a user in front of it and
communicate with the user automatically as the user approaches it. However, some
important information about the user, such as the facial expression and gaze, has
not been extracted, which limits the kiosk's functions significantly. Various other
potential applications have also been demonstrated, such as a platform for
simultaneously tracking multiple people and recognizing their behaviors for high-
level interpretation [100]. Other systems include a gaze-assisted translator [46], an
eye interpretation engine [22], an intelligent mediator [99] and a driver’s fatigue
monitoring system [51].
In this thesis, we focus on facial behavior analysis and recognition with the
use of remote video cameras rather than having extra measurement devices such as
helmets and special sensors. The research is important because despite the recent
research and technological efforts, advanced human-computer interaction devices
still suffer from a number of problems. Two major problems with the current devices
are intrusiveness (wearing a helmet or attaching a set of sensors) and high cost
related to the need for special hardware. These problems, if not solved, are likely
to hamper the widespread use of the next generation computer interfaces. We
wish to develop a set of vision-based techniques to overcome these problems. We
have developed a set of close-range visual sensing technologies that can identify the
users’ eyelid movement, eye gaze, head movement and facial expressions accurately.
Equipped with these vision-based human sensing technologies along with a user
cognitive model, the computer will become aware of the user's intentions and mental
states. Hence, the computer will respond or react to the user’s actions intelligently,
which can enhance the interaction between human and computer significantly. This
thesis describes the necessary computer vision algorithms needed for such vision-
based human sensing technologies.
1.2 Fundamental Issues
As one of the most salient features of the human face, eyes play an important role
in interpreting and understanding a person’s desires, needs, and emotional states.
In particular, the eye-gaze, indicating where a person is looking, can reveal the
person’s focus of attention. Recently, Zhai et al. [125] proposed an approach named
MAGIC pointing, which utilizes eye gaze to place the cursor in the vicinity of
every new object the user looks at. Therefore, rather than controlling the movements
of the cursor by hand all the time, the user only needs to refine the cursor’s position
near the object. Hence, with the aid of the eye gaze, the amount of stress on the
hand can be reduced significantly. Also, such an interface can allow handicapped
people to control the systems via eye-gaze input. This will give them a way to
aid themselves independently. However, the design and implementation of such an
interface has been faced with several challenges. The most difficult challenge is the
eye tracking technology itself, which is not robust and accurate enough.
A good eye tracker is therefore a prerequisite of eye gaze monitoring. Robust
techniques for eye detection are of particular importance to eye-gaze tracking
systems. Information about the eyes can also be used to detect human faces, which
will be further analyzed to obtain the face pose information. Many eye tracking
methods rely on intrusive techniques such as measuring the electric potential of the
skin around the eyes or applying special contact lenses that facilitate eye tracking.
This causes serious problems of user acceptance. To alleviate these problems, we
have developed a non-intrusive eye tracker that can detect and track a user’s eyes
in real-time as soon as the face appears in the view of the camera. The eye tracker
is aided by the active IR lighting and leaves no markers on the user’s face. By
combining the conventional appearance-based object recognition method (Support
Vector Machines) and object tracking method (mean shift) with Kalman filtering
based on active IR illumination, our technique is able to benefit from the strengths of
different techniques and overcome their respective limitations. Experimental study
shows significant improvement of our technique over the existing techniques.
After the eyes are tracked successfully, the eye gaze is subsequently extracted
from the tracked eyes. Unlike most of the existing gaze tracking techniques, which
often assume a static head to work well and require a cumbersome
calibration process for each person, our gaze tracker can perform robust and accurate
gaze estimation with one-time calibration and under rather significant head
movement. When the head moves to a new position, the gaze mapping function at this
new position will be automatically updated by the proposed dynamic computational
head compensation model to accommodate the eye position changes. Our proposed
method will dramatically increase the usability of the eye gaze tracking technology,
and we believe that it is a significant step for the eye tracker to be accepted as a
natural computer input device.
Furthermore, head gestures – from the simplest actions of nodding or head-
shaking to the most complex head movements that can express our feelings and
reveal our cognitive states – are a kind of non-verbal interaction among people.
Besides the head gestures, the facial expression is another important cue that can
reveal our emotions and intentions directly. Vision-based head gesture and facial
expression recognition research has also been studied extensively. However, accurate
and fast human face detection and tracking is a crucial first step for them, which still
remains a very challenging task under changes in lighting, scale, facial expressions
and head movements. Therefore, in this thesis, we propose a robust visual tracking
framework based on Case-Based Reasoning with a confidence paradigm to track the
face, so that the face can be tracked robustly under significant changes in lighting,
scale, facial expression and face orientations.
Based on the tracked face images, a set of twenty-eight prominent facial
features are detected and tracked automatically. However, in reality, the image
appearance of the facial features varies significantly among different individuals. Even for
a specific person, the appearance of the facial features is easily affected by lighting
conditions, face orientations and facial expressions. Therefore, in order to
compensate for the image appearance changes during tracking, we developed a robust
tracking algorithm based on a shape-constrained correction mechanism so that the
facial features can be detected and tracked successfully under the above challenging
situations. Subsequently, the spatio-temporal relationships among the tracked facial
features can be utilized to recover the facial motions.
The face motion is the sum of rigid motion related to the face pose and non-rigid
motion related to the facial expression. Both motions are nonlinearly coupled in the
captured face image such that they cannot be easily recovered. In this thesis, a novel
technique is proposed to recover 3D rigid and non-rigid facial motions simultaneously
with the use of a set of tracked facial features from a monocular video sequence in
real time. First, the coupling between rigid and non-rigid motions in the image is
expressed analytically by a nonlinear model. Subsequently, techniques are proposed
to decompose the non-linear coupling between them so that the pose and expression
parameters can be recovered simultaneously. Experiments show that the proposed
method can recover the rigid and non-rigid facial motions very accurately.
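To make the coupling concrete, one common way of writing such a projection model is the following (an illustrative weak-perspective form with assumed notation; the specific parameterization used in this thesis is developed in Chapter 6):

\[
\mathbf{p}_i \;=\; s\,\mathbf{P}\,\mathbf{R}\Bigl(\bar{\mathbf{S}}_i + \sum_{k=1}^{K} q_k\,\mathbf{D}_{ik}\Bigr) + \mathbf{t},
\]

where \(\mathbf{p}_i\) is the 2D image location of the \(i\)-th facial feature, \(\bar{\mathbf{S}}_i\) its 3D location on the neutral face, \(\mathbf{D}_{ik}\) a facial deformation (expression) basis, \(q_k\) the non-rigid deformation coefficients, \(\mathbf{R}\), \(\mathbf{t}\) and \(s\) the rigid rotation, translation and scale, and \(\mathbf{P}\) the projection matrix. Because the rotation multiplies the deformation term inside the projection, pose and expression parameters appear as products in the image measurements, which is exactly the coupling that must be decomposed.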
Once the rigid and non-rigid facial motions are separated successfully from the
face images, the facial expressions are subsequently recognized from the recovered
non-rigid facial motions. A Dynamic Bayesian Network (DBN) is constructed to
model the facial expression based on the recovered non-rigid facial motions. In
the DBN model, the non-rigid facial motions are probabilistically combined with
Ekman’s Facial Action Coding System (FACS) to understand the facial expressions.
With the use of the DBN model, the spatial dependencies, uncertainties and temporal
behaviors of facial expressions can be addressed in a coherent and unified hierarchical
probabilistic framework. Hence, the facial expressions can be recognized robustly
over time. Furthermore, since the recovered non-rigid motions are independent of the
face pose, the facial expression can be recognized under arbitrary face orientations.
Through this research, we have developed an integrated prototype system
that tracks a person’s eyelid movement, eye gaze, head movement, face pose and
facial expression all in real time. The specific contributions of this research address
several fundamental issues associated with the development of real-time computer
vision algorithms for
1. real-time human eye detection and tracking under various lighting conditions
and face orientations.
2. real-time eye gaze estimation under natural head movements, with minimum
personal calibration.
3. real-time face tracking under changes in lighting, facial expression and face
orientation.
4. real-time facial feature detection and tracking under various face orientations
and significant facial expression changes.
5. real-time recovery of 3D rigid and nonrigid facial motions from an uncalibrated
monocular camera.
6. real-time facial expression analysis under natural head movements.
In addition, we also make theoretical contributions in several areas of computer
vision including object detection and tracking, motion analysis and estimation, and
pose estimation.
1.3 Structure of the Thesis
This thesis is arranged as follows. Chapter 2 presents a new real-time
eye tracking methodology that works under variable and realistic lighting conditions
and various face orientations. Chapter 3 describes techniques for real time eye gaze
tracking under natural head movement and with minimum personal calibration.
Chapter 4 proposes a robust visual tracking framework to track faces with significant
facial expressions under various face orientations and lighting conditions in real time.
Chapter 5 proposes a novel technique to detect and track twenty-eight prominent
facial features under different face orientations and various facial expressions in
real time. Chapter 6 proposes a framework for recovering 3D rigid and nonrigid
facial motions from a monocular image sequence obtained from an uncalibrated
camera. Subsequently, facial expression is recognized from the recovered nonrigid
facial motions successfully under natural head movements in Chapter 7. Finally, a
summary of this research and possible future research directions are discussed in
Chapter 8.
CHAPTER 2
Real-Time Eye Detection and Tracking
2.1 Introduction
As one of the salient features of the human face, human eyes play an important
role in face detection, face recognition and facial expression analysis. Robust
non-intrusive eye detection and tracking is a crucial step for vision-based man-
machine interaction technology to be widely accepted in common environments such
as homes and offices. Eye tracking has also found applications in other areas in-
cluding monitoring human vigilance [51], gaze-contingent smart graphics [53], and
assisting people with disabilities. The existing work in eye detection and tracking
can be classified into two categories: traditional image-based passive approaches
and active IR-based approaches. The former approaches detect eyes based on the
unique intensity distribution or shape of the eyes. The underlying assumption is
that the eyes appear different from the rest of the face both in intensity and shape;
eyes can be detected and tracked based on exploiting these differences. The active
IR-based approach, on the other hand, exploits the spectral (reflective) properties
of pupils under near IR illumination to produce the bright/dark pupil effect, and
accomplishes eye detection and tracking by detecting and tracking pupils.
Traditional methods can be broadly classified into three categories: template-
based methods [124, 119, 61, 127, 54, 26, 27, 83, 36], appearance-based methods
[90, 42, 41] and feature-based methods [57, 56, 59, 58, 106, 112, 93, 102, 103]. In the
template-based methods, a generic eye model, based on the eye shape, is designed
first. Template matching is then used to search the image for the eyes. Nixon
[83] proposed an approach for accurate measurement of eye spacing using Hough
transform. The eye is modelled by a circle for the iris and a “tailored” ellipse for the
sclera boundary. Their method, however, is time-consuming, needs a high contrast
eye image, and only works with frontal faces. Deformable templates are commonly
used [124, 119, 61]. First, an eye model is designed, which is allowed to translate,
rotate and deform to fit the best representation of the eye shape in the image. Next,
the eye position can be obtained through a recursive energy-minimization process.
While this method can detect eyes accurately, it requires that the eye model be
properly initialized near the eyes. Furthermore, it is computationally expensive,
and requires good image contrast for the method to converge correctly.
The appearance-based methods [90], [42], [41] detect eyes based on their
photometric appearance. These methods usually need to collect a large amount of training
data, representing the eyes of different subjects, under different face orientations,
and under different illumination conditions. These data are used to train a classifier,
such as a neural network or the SVM, and detection is achieved via classification.
In [90], Pentland et al. extended the eigenface technique to the description and
coding of facial features, yielding eigeneyes, eigennoses and eigenmouths. For eye
detection, they extracted appropriate eye templates for training and constructed a
principal component projective space called “Eigeneyes,” accomplishing eye
detection by comparing a query image with an eye image in the eigeneyes space. Huang
et al. [42] also employed the eigeneyes to perform initial eye-position detection.
Huang et al. [41] presented a method to represent the eye image using wavelets and to
perform eye detection using an RBF NN classifier. Reinders et al. [93] proposed
several improvements on the neural network-based eye detector. The trained
neural network eye detector can detect rotated or scaled eyes under different lighting
conditions, but it is trained only for the frontal face image.
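As a concrete illustration of this appearance-based classification strategy, the sketch below trains an eye/non-eye classifier on normalized grayscale patches (a minimal example using scikit-learn; the patch size, kernel and parameter values are placeholder assumptions, not the training setup described later in Section 2.2.2):

import numpy as np
from sklearn.svm import SVC


def to_feature(patch):
    """Flatten a grayscale patch into a zero-mean, unit-variance feature vector."""
    p = patch.astype(np.float32)
    return ((p - p.mean()) / (p.std() + 1e-6)).ravel()


def train_eye_classifier(eye_patches, non_eye_patches):
    """Train an eye/non-eye SVM from lists of equally sized grayscale patches."""
    X = np.array([to_feature(p) for p in list(eye_patches) + list(non_eye_patches)])
    y = np.array([1] * len(eye_patches) + [0] * len(non_eye_patches))
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")  # kernel/parameters are assumptions
    clf.fit(X, y)
    return clf


def is_eye(clf, patch):
    """Classify a candidate patch: 1 for eye, 0 for non-eye."""
    return int(clf.predict(to_feature(patch)[None, :])[0])

Candidate regions produced by any coarse detector can then be accepted or rejected by such a classifier, which is essentially the verification role the SVM plays in Section 2.2.2.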
A number of feature-based methods explore the characteristics (such as edge
and intensity of the iris, the color distributions of the sclera and the flesh) of the eyes
to identify some distinctive features around the eyes. Kawato et al. [56] proposed a
feature-based method for eye detection and tracking. Instead of detecting eyes, they
proposed to detect the point between two eyes. The authors believe the point is
more stable and easier to detect than the individual eyes. Eyes are subsequently
detected as two dark parts, symmetrically located on each side of the between-eye-
point. Feng et al. [26, 27] designed a new eye model consisting of six landmarks
(eye corner points). Their technique first locates the eye landmarks based on the
variance projection function (VPF) and the located landmarks are then employed
to guide the eye detection. Unfortunately, experiments show that their method
will fail if the eye is closed or partially occluded by hair or face orientation, and in
addition, their technique may mistake eyebrows for eyes. Tian et al. [106] proposed
a new method to track the eye and extract the eye parameters. The method requires
manually initializing the eye model in the first frame. The eye’s inner corner and
eyelids are tracked using a modified version of the Lucas-Kanade tracking algorithm
[68]. The edge and intensity of the iris are used to extract the shape information of
the eye. Their method, however, requires a high contrast image to detect and track
eye corners and to obtain a good edge image.
In summary, the traditional image-based eye tracking approaches detect and
track the eyes by exploiting eyes’ differences in appearance and shape from the rest
of the face. The special characteristics of the eye such as dark pupil, white sclera,
circular iris, eye corners, eye shape, etc. are utilized to distinguish the human eye
from other objects. But due to eye closure, eye occlusion, variability in scale and
location, different lighting conditions, and face orientations, these differences will
often diminish or even disappear. Wavelet filtering [82, 98] has been commonly used
in computer vision to reduce the illumination effect by removing subbands sensitive
to illumination changes; however, it only works under slight illumination variation,
and illumination variation for eye tracking applications could be significant. Hence,
the eye image will not look much different in appearance or shape from the rest of the
face, and the traditional image-based approaches cannot work very well, especially
for the faces with non-frontal orientations, under different illuminations, and for
different subjects.
Eye detection and tracking based on the active remote IR illumination is a
simple yet effective approach. It exploits the spectral (reflective) properties of the
pupil under near IR illumination. Numerous techniques [21, 19, 81, 80, 37, 51] have
been developed based on this principle, including some commercial eye trackers [2, 1].
They all rely on an active IR light source to produce the dark or bright pupil effects.
Ebisawa et al. [21] generate the bright/dark pupil images based on a differential
lighting scheme using two IR light sources (on and off camera axis). The eye can be
tracked effectively by tracking the bright pupils in the difference image resulting from
subtracting the dark pupil image from the bright pupil image. Later in [19], they
further improved their method by using pupil brightness stabilization to eliminate
glass reflection. Morimoto et al. [81] also utilize a differential lighting scheme to
generate the bright/dark pupil images, and pupil detection is done after thresholding
the difference image. A larger temporal support is used to reduce artifacts caused
mostly by head motion, and geometric constraints are used to group the pupils.
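To make the differential-lighting idea concrete, the following is a minimal sketch of difference-image pupil candidate detection (using OpenCV; the threshold and blob-size limits are illustrative assumptions, not the values used by the systems cited above or later in this chapter):

import cv2


def detect_pupil_candidates(bright_img, dark_img, thresh=40, min_area=5, max_area=300):
    """Return candidate pupil centers from a bright-pupil / dark-pupil image pair."""
    # Subtracting the dark-pupil image from the bright-pupil image suppresses the
    # common background and ambient illumination, leaving the pupils as bright blobs.
    diff = cv2.subtract(bright_img, dark_img)
    _, binary = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    # Connected components give candidate blobs; crude size filtering removes
    # small specks and large glare regions.
    n_labels, _, stats, centroids = cv2.connectedComponentsWithStats(binary)
    candidates = []
    for i in range(1, n_labels):  # label 0 is the background
        area = stats[i, cv2.CC_STAT_AREA]
        if min_area <= area <= max_area:
            candidates.append((float(centroids[i][0]), float(centroids[i][1])))
    return candidates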
Most of these methods require a distinctive bright/dark pupil effect to work
well. The success of such a system strongly depends on the brightness and size
of the pupils, which are often affected by several factors including eye closure, eye
occlusion due to face rotation, external illumination interferences, and the distances
of the subjects from the camera. Figure 2.1 summarizes different conditions under
which the pupils may not appear very bright or even disappear. These conditions
include eye closure as shown in Figure 2.1 (a), oblique face orientations as shown in
Figure 2.1 (b), presence of other bright objects (due to either eye glasses glares or
motion) as shown in Figure 2.1 (c), and external illumination interference as shown
in Figure 2.1 (d).
Figure 2.1: The disappearance or weakness of the bright pupils due to (a) eye closure, (b) oblique face orientation, (c) eye glasses glare and (d) strong external illumination interference
The absence of bright pupils, or weak pupil intensity, poses serious problems for
the existing eye tracking methods using IR in that they all require relatively stable
lighting conditions, users close to the camera, small out-of-plane face rotations, and
open and un-occluded eyes. These conditions impose serious restrictions on their
systems as well as on the user, and therefore limit their application scope.
Realistically, however, lighting can be variable in many application domains; the
natural movement of the head often involves out-of-plane rotation, and eye closures
due to blinking and winking are physiological necessities for humans. Furthermore,
thick eye glasses tend to disturb the infrared light so much that the pupils appear
very weak. It is therefore very important for the eye tracking system to be able to
robustly and accurately track eyes under these conditions as well.
To alleviate some of these problems, Ebisawa [19] proposed an image difference
method based on two light sources to perform pupil detection under various lighting
conditions. The background can be eliminated using the image difference method,
and the pupils can be easily detected by setting the threshold as low as possible in the
difference image. Ebisawa [19] also proposed an ad hoc algorithm for eliminating
glare on glasses, based on thresholding and morphological operations. However,
the automatic determination of the threshold and the structure element size for morphological operations is difficult, and the threshold value cannot be set arbitrarily low without degrading the efficiency of the algorithm. Also, eliminating the noise blobs according to their sizes alone is not sufficient.
Haro [37] proposed performing pupil tracking based on combining eye appear-
ance, the bright pupil effect, and motion characteristics so that pupils can be sepa-
rated from other equally bright objects in the scene. To do so, Haro [37] proposed
to verify the pupil blobs using a conventional appearance-based matching method
and the motion characteristics of the eyes. But their method cannot track closed or
occluded eyes, or eyes with weak pupil intensity due to interference from external
illuminations. Ji et al. [51] proposed a real time subtraction and a special filter
to eliminate the external light interferences, but their technique fails to track the
closed/occluded eyes. To handle the presence of other bright objects, their method
performs pupil verification based on the shape and size of pupil blobs to eliminate
spurious pupil blobs. Usually, however, spurious blobs have similar shape and size
to those of the pupil blobs and make it difficult to distinguish the real pupil blobs
from the noise blobs based on only shape and size.
In this chapter, we propose a real-time robust method for eye tracking
under variable lighting conditions and face orientations, based on combining the
appearance-based methods and the active IR illumination approach. Combining
the respective strengths of different complementary techniques and overcoming their
shortcomings, the proposed method uses an active infrared illumination to brighten
subjects’ faces to produce the bright pupil effect. The bright pupil effect and the
appearance of eyes are utilized simultaneously for eye detection and tracking. The
latest techniques in pattern classification (the SVM) and in object
tracking (the mean-shift) are employed for pupil detection and tracking based on
eye appearance. Some of the ideas presented in this chapter have been briefly
reported in [135] and [132].
In this chapter, we report our algorithm in detail. Our method consists of
two parts: eye detection and eye tracking. Eye detection is accomplished by si-
multaneously utilizing the bright/dark pupil effect under active IR illumination and
the eye appearance pattern under ambient illumination via the SVM classifier. Eye
tracking is composed of two major modules. The first module is a conventional
Kalman filtering tracker based on the bright pupil. The Kalman filtering tracker is
augmented with the SVM classifier [15, 40] to perform verification of the detected
eyes. If the Kalman filtering eye tracker fails due to either weak pupil intensity or
the absence of bright pupils, eye tracking based on the mean shift is activated [12]
to continue tracking the eyes. Eye tracking returns to the Kalman filtering tracker
as soon as the bright pupils reappear, since eye tracking using bright pupils is much
more robust than the mean shift tracker which, we find, tends to drift away. The
two trackers alternate, complementing each other and overcoming their limitations.
Figure 2.2 summarizes our eye tracking algorithm.
2.2 Eye Detection
To facilitate subsequent image processing, the person’s face is illuminated using
a near-infrared illuminator. The use of an infrared illuminator serves three purposes:
first, it minimizes the impact of different ambient light conditions, therefore ensur-
ing image quality under varying real-world conditions including poor illumination,
day, and night; second, it allows production of the bright/dark pupil effect, which
constitutes the foundation for the proposed eye detection and tracking algorithm;
third, since near infrared is barely visible to the user, it minimizes interference with
the user’s work. According to the original patent (from Hutchinson [43]), a bright
pupil can be obtained if the eyes are illuminated with a near infrared illuminator
Figure 2.2: The combined eye tracking flowchart
beaming light along the camera’s optical axis at a certain wavelength. At near
infrared wavelengths, pupils reflect almost all infrared light they receive along the
path back to the camera, producing the bright pupil effect, very much similar to
the red eye effect in photography. If illuminated off the camera’s optical axis, the
pupils appear dark since the reflected light will not enter the camera lens. This
produces the so-called dark pupil effects. Examples of bright and dark pupils are
given in Figure 2.3. Details about the construction of the IR illuminator and its
configuration may be found in [52].
Figure 2.3: The bright-pupil (a) and dark-pupil (b) images
Given the IR illuminated eye images, eye detection is accomplished via pupil
detection. Pupil detection is accomplished based on both the intensity of the pupils
(the bright and dark pupils) and the appearance of the eyes using the SVM classifier.
Specifically, pupil detection starts with preprocessing to remove external illumina-
tion interference, followed by searching the whole image for pupils based on pupil
intensity and eye appearance. Multiple pupils can be detected if there is more than
one person present, and the use of SVM avoids falsely identifying a bright region as
a pupil. Figure 2.4 gives an overview of the eye detection module.
Figure 2.4: Eye detection block diagram (interlaced IR image → video decoder → even/odd field images → image subtraction → adaptive thresholding → connected component analysis → geometric constraints → SVM eye verification → eyes)
2.2.1 Initial Eye Position Detection
The detection algorithm starts with preprocessing to minimize interference
from illumination sources other than the IR illuminator, including sunlight and
ambient light interference. A differential method is used to remove background
interference by subtracting the dark eye image (odd field) from the bright eye image
(even field), producing a difference image, with most of the background and external
illumination effects removed, as shown in Figure 2.5 (c). For real time eye tracking,
the image subtraction must be implemented efficiently in real time. To achieve this,
we developed circuitry to synchronize the outer ring of LEDs and inner ring of LEDs
with the even and odd fields of the interlaced image, respectively, so that they can
be turned on and off alternately. When the even field is being scanned, the inner
ring of LEDs is on and the outer ring of LEDs is off, and vice versa when the odd
field is scanned. The interlaced input image is subsequently de-interlaced via a video
decoder, producing the even and odd field images as shown in Figure 2.5 (a) and
(b). More on our image subtraction circuitry may be found in [52].
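To make this preprocessing pipeline concrete, below is a minimal illustrative sketch of the subtraction, automatic thresholding, and connected-component steps using OpenCV and NumPy. It is not the thesis implementation: the Otsu threshold and the size limits stand in for the histogram-based threshold and the blob-size filter described in this section.

```python
import cv2
import numpy as np

def pupil_candidates(bright_field, dark_field, min_area=4, max_area=400):
    """Subtract the dark-pupil field from the bright-pupil field, threshold the
    difference image, and return the centroids of the candidate blobs."""
    diff = cv2.subtract(bright_field, dark_field)        # removes most of the background
    # Otsu's method stands in for the histogram-based automatic threshold.
    _, binary = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    n, _, stats, centroids = cv2.connectedComponentsWithStats(binary)
    candidates = []
    for i in range(1, n):                                # label 0 is the background
        area = stats[i, cv2.CC_STAT_AREA]
        if min_area <= area <= max_area:                 # crude size filter
            candidates.append(tuple(centroids[i]))
    return candidates
```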
Figure 2.5: Background illumination interference removal: (a) the even-field images obtained under both ambient and IR light; (b) the odd-field images obtained under only ambient light; (c) the difference images resulting from subtracting (b) from (a)
The difference image is subsequently thresholded automatically based on its
histogram, producing a binary image. Connected component analysis is then applied
to the binary image to identify the binary blobs. Our task is then to find out which of
the blobs actually is the real pupil blob. Initially, we mark all the blobs as potential
candidates for pupils as shown in Figure 2.6.
Figure 2.6: The thresholded difference image marked with pupil candidates
2.2.2 Eye Verification Using Support Vector Machines
As shown in Figure 2.6, there are usually many potential candidates for pupils.
Typically, pupils are found among the binary blobs. However, it is usually not
possible to isolate the pupil blob only by picking the right threshold value, since
pupils are often small and not bright enough compared with other noise blobs.
Thus, we will have to make use of information other than intensity to correctly
identify them.
One initial way to distinguish the pupil blobs from other noise blobs is based
on their geometric shapes. Usually, the pupil is an ellipse-like blob and we can use
an ellipse fitting method [29] to extract the shape of each blob and use the shape and
size to remove some blobs from further consideration. It must be noted, however,
that due to scale change (distance from the camera) and to variability in individual
pupil size, size is not a reliable criterion. It is only used to remove very large or very
small blobs. The shape criterion, on the other hand, is scale-invariant. Nevertheless, shape alone is not sufficient since other non-pupil blobs with similar shape and size are often present, as shown in Figure 2.7, where we can see that several non-pupil blobs are still left. Because they are so similar in shape and size, we cannot distinguish the real pupil blobs from them, so we have to use other features.

Figure 2.7: The thresholded difference image after removing some blobs based on their geometric properties (shape and size). The blobs marked with circles are selected for further consideration
We have observed that the eye region surrounding the pupil has a unique intensity
distribution; it appears different from other parts of the face in the dark pupil
image as shown in Figure 2.3 (b). The appearance of an eye can therefore be utilized
to separate it from non-eyes. We map the locations of the remaining binary blobs
to the dark pupil images and then apply the SVM classifier [15, 40] to automatically
identify the binary blobs that correspond to eyes, as discussed below.
2.2.2.1 Support Vector Machines
SVM [15] is a two-class classification method that finds the optimal decision
hyper-plane based on the concept of structural risk minimization. Ever since its
introduction, SVM has become increasingly popular. The theory of SVM can be
briefly summarized as follows. For the case of two-class pattern recognition, the
task of predictive learning from examples can be formulated as follows. Given a set
of functions $f_\alpha$ and an input domain $\mathbb{R}^N$ of $N$ dimensions:

$$\{ f_\alpha : \alpha \in \Lambda \}, \qquad f_\alpha : \mathbb{R}^N \rightarrow \{-1, +1\}$$

($\Lambda$ is an index set) and a set of $l$ examples:

$$(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_l, y_l), \qquad x_i \in \mathbb{R}^N, \; y_i \in \{-1, +1\}$$

where $x_i$ is an input feature vector and $y_i$ represents the class, which has only two values, $-1$ and $+1$. Each $(x_i, y_i)$ is generated from an unknown probability distribution $p(x, y)$, and the goal is to find a particular function $f^*_\alpha$ which provides the smallest possible value for the risk:

$$R(\alpha) = \int \left| f_\alpha(x) - y \right| \, dp(x, y) \qquad (2.1)$$
Suppose that there is a separating hyper-plane that separates the positive
class from the negative class. The data characterizing the boundary between the
two classes are called the support vectors since they alone define the optimal hyper-
plane. First, a set (xi, yi) of labeled training data are collected as the input to the
SVM. Then, a trained SVM will be characterized by a set of Ns support vectors si,
coefficient weights αi for the support vectors, class labels yi of the support vectors,
and a constant term w0.
For the linearly separable case, the linear decision surface (the hyperplane) is
defined as
w · x + w0 = 0 (2.2)
where x is a point on the hyperplane, “·” denotes the dot product, w is the normal of the
hyperplane, and w0 is the distance to the hyperplane from the origin. Through the
use of training data, w can be estimated by
$$w = \sum_{i=1}^{N_s} \alpha_i y_i s_i \qquad (2.3)$$
Given w and w0, an input vector xi can be classified into one of the two classes,
depending on whether w · x + w0 is larger or smaller than 0.
Classes are often not linearly separable. In this case, SVM can be extended by
using a kernel K(., .), which performs a nonlinear mapping of the feature space to
a higher dimension, where classes are linearly separable. The most common SVM
kernels include Gaussian kernels, Radial-based kernels, and polynomial kernels. The
decision rule with a kernel can be expressed as
$$\sum_{i=1}^{N_s} \alpha_i y_i K(s_i, x) + w_0 = 0 \qquad (2.4)$$
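As an illustration of the decision rule in Eq. 2.4, the following sketch evaluates a trained SVM with a Gaussian kernel. The support vectors, weights, labels, bias, and the kernel width σ (3 is the value found best in Table 2.1) are assumed to come from an off-line training stage; this is a generic sketch, not the thesis code.

```python
import numpy as np

def gaussian_kernel(s, x, sigma):
    """K(s, x) = exp(-||s - x||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((np.asarray(s) - np.asarray(x)) ** 2) / (2.0 * sigma ** 2))

def svm_decide(x, support_vectors, alphas, labels, w0, sigma=3.0):
    """Evaluate sum_i alpha_i * y_i * K(s_i, x) + w0 (Eq. 2.4) and return
    +1 for the eye class or -1 for the non-eye class."""
    score = sum(a * y * gaussian_kernel(s, x, sigma)
                for s, a, y in zip(support_vectors, alphas, labels)) + w0
    return 1 if score >= 0 else -1
```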
2.2.2.2 SVM Training
To use SVM, training data are needed to obtain the optimal hyper-plane. An
eye image is represented as a vector I consisting of the original pixel values. For
this project, after obtaining the positions of pupil candidates using the methods
mentioned above, we obtain the sub-images from the dark image according to those
positions as shown in Figure 2.8.
Usually, the eyes are included in those cropped images of 20× 20 pixels. The
cropped image data are processed using histogram equalization and normalized to
a [0, 1] range before training. The eye training images are divided into two sets: a
positive set and a negative set. In the positive image set, we include eye images of
different gazes, different degrees of opening, different face poses, different subjects,
Figure 2.8: (a) The thresholded difference image superimposed with possible pupil candidates. (b) The dark image marked with possible eye candidates according to the positions of the pupil candidates in (a)
and with/without glasses. The non-eye images are placed in the negative image set.
Figures 2.9 and 2.10 contain examples of eye and non-eye images in the training
sets, respectively.
Figure 2.9: The eye images in the positive training set
After finishing the above step, we get a training set, which has 558 positive
images and 560 negative images. In order to obtain the best accuracy, we need
to identify the best parameters for the SVM. In Table 2.1, we list three different
SVM kernels with various parameter settings and each SVM was tested on 1757 eye
Figure 2.10: The non-eye images in the negative training set
Table 2.1: Experiment results using 3 kernel types with different parameters

Kernel Type   Deg   Sigma (σ)   # Support Vectors   Accuracy
Linear        -     -           376                 0.914058
Polynomial    2     -           334                 0.912351
Polynomial    3     -           358                 0.936255
Polynomial    4     -           336                 0.895845
Gaussian      -     1           1087                0.500285
Gaussian      -     2           712                 0.936255
Gaussian      -     3           511                 0.955037
Gaussian      -     4           432                 0.946500
Gaussian      -     5           403                 0.941377
candidate images obtained from different persons.
From the above table, we can see that the best accuracy we can achieve is
95.5037%, using a Gaussian kernel with a σ of 3.
2.2.2.3 Retraining Using Mis-labeled Data
Usually, supervised learning machines rely on only a limited number of labeled training examples and cannot reach very high learning accuracy. We therefore test on thousands of unlabeled samples, pick out the mis-classified ones, put them into the correct training sets, and retrain the classifier. After performing this procedure several times on unlabeled data obtained under different conditions, we can boost the accuracy of the learning machine at the cost of the extra time needed for retraining.
Specifically, we have eye data sets from ten people, which we obtained using
the same method. We choose the first person’s data set and label the eye images and
non-eye images manually, then we train the Gaussian SVM on this training set and
test Gaussian SVM on the second person’s data set. We check the second person’s
data one by one, pick up all the mis-labeled data, label them correctly and add
them into the training set. After finishing the above step, we retrain the SVM on
this increased training set and repeat the above step on the next person’s data set.
The whole process then repeats until the classification errors stabilize. Through the
retraining process, we can significantly boost the accuracy of the Gaussian SVM.
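The retraining procedure can be summarized by the following sketch; train_svm and classify are hypothetical helpers standing in for Gaussian-SVM training and classification, and each subject's data is assumed to have been checked and labeled manually before the mis-classified samples are folded back in.

```python
def retrain_on_subjects(initial_set, subject_sets, train_svm, classify):
    """Bootstrap the training set: after each subject, the mis-classified samples
    are added (with their correct labels) and the SVM is retrained."""
    training_set = list(initial_set)          # manually labeled data of the first person
    svm = train_svm(training_set)
    for subject in subject_sets:              # each subject: list of (image, true_label)
        mistakes = [(img, label) for img, label in subject
                    if classify(svm, img) != label]
        training_set.extend(mistakes)         # fold the corrected samples back in
        svm = train_svm(training_set)         # retrain on the enlarged set
    return svm
```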
2.2.2.4 Eye Detection with SVM
During eye detection, we crop the regions in the dark pupil image according to
the locations of pupil candidates in the difference image as shown in Figure 2.8 (b).
After some preprocessing on these eye candidate images, they will be provided to
the trained SVM for classification. The trained SVM will classify the input vector
I into eye class or non-eye class. Figure 2.11 shows that the SVM eye classifier
correctly identifies the real eye regions as marked.
Figure 2.11: The result images (a) and (b) marked with identified eyes. Compared with the images in Figure 2.8 (b), many false alarms have been removed
Pupil verification with SVM works reasonably well and can generalize to people
of the same race. However, for people from a race that is significantly different from
those in training images, the SVM may fail and need to be retrained. SVM can
work under different illumination conditions due to the intensity normalization for
the training images via histogram equalization.
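At detection time, each remaining pupil candidate is mapped to a 20 × 20 patch in the dark-pupil image, preprocessed, and passed to the classifier. A minimal sketch follows; svm_decide is a hypothetical one-argument wrapper around the trained classifier, not the thesis implementation.

```python
import cv2
import numpy as np

def detect_eyes(dark_image, candidates, svm_decide, patch=20):
    """Crop a patch around each pupil candidate, histogram-equalize it, scale it
    to [0, 1], and keep only the candidates that the SVM accepts as eyes."""
    eyes, half = [], patch // 2
    for cx, cy in candidates:
        x, y = int(round(cx)), int(round(cy))
        roi = dark_image[y - half:y + half, x - half:x + half]
        if roi.shape != (patch, patch):
            continue                                   # candidate too close to the border
        roi = cv2.equalizeHist(roi).astype(np.float32) / 255.0
        if svm_decide(roi.ravel()) == 1:               # +1 means "eye"
            eyes.append((x, y))
    return eyes
```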
2.3 Eye Tracking Algorithm
Given the detected eyes in the initial frames, the eyes in subsequent frames
can be tracked from frame to frame. Eye tracking can be done by performing pupil
detection in each frame. This brute force method, however, will significantly slow
down the speed of pupil tracking, making real time pupil tracking impossible since
it needs to search the entire image for each frame. This can be done more efficiently
by using the scheme of prediction and detection. Kalman filtering [8] provides
a mechanism to accomplish this. The Kalman pupil tracker, however, may fail if
pupils are not bright enough under the conditions mentioned previously. In addition,
rapid head movement may also cause the tracker to lose the eyes. This problem is
addressed by augmenting the Kalman tracker with the mean shift tracker.
Figure 2.12 summarizes our eye tracking scheme. Specifically, after locating
the eyes in the initial frames, Kalman filtering is activated to track bright pupils. If
it fails in a frame due to the disappearance of the bright pupils, eye tracking based on the mean shift will take over. Our eye tracker returns to bright pupil tracking as soon as the bright pupils appear again, since bright pupil tracking is much more robust and reliable. Pupil detection is reactivated if the mean shift tracking fails. These two trackers work together and complement each other, improving the robustness of the eye tracker significantly. The Kalman tracking, the mean shift tracking,
and their integration are briefly discussed below.
2.3.1 Eye (Pupil) Tracking with Kalman Filtering
A Kalman filter is a set of recursive algorithms that estimate the position and
uncertainty of moving targets in the next time frame, that is, where to look for
the targets, and how large a region should be searched in the next frame around
the predicted position in order to find the targets with a certain confidence. It
recursively conditions the current estimate on all of the past measurements and the
process is repeated with the previous a posteriori estimates used to project the new
a priori estimates. This recursive nature is one of the very appealing features of the
Kalman filter since it makes practical implementation much more feasible.
Our pupil tracking method based on Kalman filtering can be formalized as
Figure 2.12: The combined eye tracking flowchart
follows. The state of a pupil at each time instance (frame) t can be characterized by
its position and velocity. Let (ct, rt) represent the pupil pixel position (its centroid)
at time t and (ut, vt) be its velocity at time t in c and r directions respectively. The
state vector at time t can therefore be represented as $X_t = (c_t, r_t, u_t, v_t)^T$.
According to the theory of Kalman filtering [73], Xt+1, the state vector at the
next time frame t+1, linearly relates to current state Xt by the system model as
follows:
Xt+1 = ΦXt + Wt (2.5)
where Φ is the state transition matrix and Wt represents system perturbation. Wt
is normally distributed as p(Wt) ∼ N(0, Q), and Q represents the process noise
covariance.
We further assume that a fast feature extractor estimates Zt = (ct, rt), the
detected pupil position at time t. Therefore, the measurement model in the form
needed by the Kalman filter is
Zt = HXt + Mt (2.6)
where matrix H relates current state to current measurement and Mt represents
measurement uncertainty. Mt is normally distributed as p(Mt) ∼ N(0, R), and R
is the measurement noise covariance. For simplicity, since Zt only involves position,
H can be represented as
$$H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}$$
The feature detector (e.g., thresholding or correlation) searches the region as
determined by the projected pupil position and its uncertainty to find the feature
point at time t + 1. The detected point is then combined with the prediction
estimation to produce the final estimate.
Specifically, given the state model in equation 2.5 and measurement model in
equation 2.6, as well as some initial conditions, the state vector Xt+1, along with
its covariance matrix Σt+1, can be updated as follows. For subsequent discussion,
let us define a few more variables. Let $X^-_{t+1}$ be the estimated state at time t+1, resulting from using the system model only. It is often referred to as the a priori state estimate. $X_{t+1}$ differs from $X^-_{t+1}$ in that it is estimated using both the system model (equation 2.5) and the measurement model (equation 2.6). $X_{t+1}$ is usually referred to as the a posteriori state estimate. Let $\Sigma^-_{t+1}$ and $\Sigma_{t+1}$ be the covariance matrices for the state estimates $X^-_{t+1}$ and $X_{t+1}$ respectively. They characterize the uncertainties associated with the a priori and a posteriori state estimates. The goal of Kalman filtering is therefore to estimate $X_{t+1}$ and $\Sigma_{t+1}$ given $X_t$, $\Sigma_t$, $Z_t$, and the system and measurement models. The Kalman filtering algorithm for state prediction and updating is summarized below.
1. State prediction
Given current state Xt and its covariance matrix Σt, state prediction involves
two steps: state projection ($X^-_{t+1}$) and error covariance estimation ($\Sigma^-_{t+1}$), as summarized in Eq. 2.7 and Eq. 2.8.

$$X^-_{t+1} = \Phi X_t \qquad (2.7)$$

$$\Sigma^-_{t+1} = \Phi \Sigma_t \Phi^T + Q_t \qquad (2.8)$$

Given the estimate $X^-_{t+1}$ and its covariance matrix $\Sigma^-_{t+1}$, pupil detection is performed to detect the pupil around $X^-_{t+1}$, with the search area determined by $\Sigma^-_{t+1}$. In practice, to speed up the computation, the values of $\Sigma^-_{t+1}[0][0]$ and $\Sigma^-_{t+1}[1][1]$ are used to compute the search area size. Specifically, the search area size is chosen as $20 + 2\Sigma^-_{t+1}[0][0]$ pixels by $20 + 2\Sigma^-_{t+1}[1][1]$ pixels, where 20 by 20 pixels is the basic window size. The larger $\Sigma^-_{t+1}[0][0]$ and $\Sigma^-_{t+1}[1][1]$ are, the more uncertain the estimate is and the larger the search area becomes. The search area is therefore adaptively adjusted, so the pupil can be located quickly.
2. State updating
The detected pupil position is represented by Zt+1. Then, state updating can
be performed to derive the final state and its covariance matrix. The first
task during state updating is to compute the Kalman gain Kt+1. It is done as
follows:
$$K_{t+1} = \frac{\Sigma^-_{t+1} H^T}{H \Sigma^-_{t+1} H^T + R} \qquad (2.9)$$
The gain matrix K can be physically interpreted as a weighting factor to
determine the contribution of the measurement $Z_{t+1}$ and the prediction $H X^-_{t+1}$ to the a posteriori state estimate $X_{t+1}$. The next step is to generate the a posteriori state estimate $X_{t+1}$ by incorporating the measurement into equation 2.5. $X_{t+1}$ is computed as follows:

$$X_{t+1} = X^-_{t+1} + K_{t+1}\left(Z_{t+1} - H X^-_{t+1}\right) \qquad (2.10)$$

The final step is to obtain the a posteriori error covariance estimate. It is computed as follows:

$$\Sigma_{t+1} = \left(I - K_{t+1} H\right) \Sigma^-_{t+1} \qquad (2.11)$$
After each time and measurement update pair, the Kalman filtering recursively
conditions the current estimate on all of the past measurements and the process is
repeated with the previous a posteriori estimates used to project a new a priori
estimate.
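The prediction and update steps above reduce to a few lines of linear algebra. A minimal constant-velocity sketch is shown below; the noise covariances Q and R are assumed to be tuned empirically, and the 20-pixel basic search window logic is omitted.

```python
import numpy as np

def kalman_step(X, Sigma, Z, Q, R, dt=1.0):
    """One Kalman cycle for the state X = (c, r, u, v): predict with a
    constant-velocity model, then correct with the measured pupil position Z."""
    Phi = np.array([[1, 0, dt, 0],
                    [0, 1, 0, dt],
                    [0, 0, 1,  0],
                    [0, 0, 0,  1]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    # State prediction (Eqs. 2.7 and 2.8).
    X_pred = Phi @ X
    S_pred = Phi @ Sigma @ Phi.T + Q
    # Kalman gain and state update (Eqs. 2.9-2.11).
    K = S_pred @ H.T @ np.linalg.inv(H @ S_pred @ H.T + R)
    X_new = X_pred + K @ (Z - H @ X_pred)
    S_new = (np.eye(4) - K @ H) @ S_pred
    return X_new, S_new
```

In practice the measurement Z would be the pupil position returned by the detector searching the adaptively sized window around the predicted position.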
The Kalman filtering tracker works reasonably well under frontal face rotation
with the eye open. However, it will fail if the pupils are not bright due to either face
orientation or external illumination interferences. The Kalman filter also fails when a
sudden head movement occurs due to incorrect prediction because the assumption of
smooth head motion has been violated. In each case, Kalman filtering fails because
the Kalman filter detector cannot detect the pupils. We propose to use mean shift
tracking to augment the Kalman filtering tracking and overcome this limitation.
2.3.2 Mean Shift Eye Tracking
Due to the IR illumination, the eye region in the dark and bright pupil images
exhibits strong and unique visual patterns, such as the dark iris surrounded by the white sclera.
This unique pattern should be utilized to track eyes in case the bright pupils fail
to appear on the difference images. This is accomplished via the use of mean shift
tracking. Mean shift tracking is an appearance-based object tracking method. It
employs mean shift analysis to identify a target candidate region, which has the
most similar appearance to the target model in terms of intensity distribution.
2.3.2.1 Similarity Measure
The similarity of two distributions can be expressed by a metric based on the
Bhattacharyya coefficient as described in [12]. The derivation of the Bhattacharyya
coefficient from sample data involves the estimation of the target density q and the
candidate density p, for which we employ the histogram formulation. Therefore, the
discrete density $q = \{q_u\}_{u=1\ldots m}$ (with $\sum_{u=1}^{m} q_u = 1$) is estimated from the m-bin histogram of the target model, while $p(y) = \{p_u(y)\}_{u=1\ldots m}$ (with $\sum_{u=1}^{m} p_u = 1$) is estimated at a given location y from the m-bin histogram of the target candidate. Then, at location y, the sample estimate of the Bhattacharyya coefficient for the target density q and the candidate density p(y) is given by

$$\rho(y) \equiv \rho\left[p(y), q\right] = \sum_{u=1}^{m} \sqrt{p_u(y)\, q_u} \qquad (2.12)$$

The distance between the two distributions can be defined as

$$d(y) = \sqrt{1 - \rho\left[p(y), q\right]} \qquad (2.13)$$
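For reference, the Bhattacharyya coefficient of Eq. 2.12 and the distance of Eq. 2.13 can be computed from two normalized histograms as in the following generic sketch.

```python
import numpy as np

def bhattacharyya(p, q):
    """Return the similarity rho (Eq. 2.12) and the distance d (Eq. 2.13)
    between two normalized m-bin histograms."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    rho = np.sum(np.sqrt(p * q))
    return rho, np.sqrt(max(0.0, 1.0 - rho))   # clamp for numerical safety
```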
2.3.2.2 Eye Appearance Model
To reliably characterize the intensity distribution of eyes and non-eyes, the
intensity distribution is characterized by two images: even and odd field images,
resulting from de-interlacing the original input images. They are under different il-
luminations, with one producing bright pupils and the other producing dark pupils
as shown in Figure 2.13. The use of two channel images to characterize eye appear-
ance represents a new contribution and can therefore improve the accuracy of eye
detection.
Figure 2.13: The eye images: (a)(b) left and right bright-pupil eyes; (c)(d) corresponding left and right dark-pupil eyes
Thus, there are two different feature probability distributions of the eye target
corresponding to the dark-pupil and bright-pupil images, respectively. We use a 2D
joint histogram, which is derived from the grey level dark-pupil and bright-pupil
image spaces with m = l × l bins, to represent the feature probability distribution
of the eyes. Before calculating the histogram, we employ a convex and monotonic
decreasing kernel profile k to assign a smaller weight to the locations that are farther
from the center of the target. Let us denote by $\{x_i\}_{i=1\ldots n_h}$ the pixel locations of a target candidate that has $n_h$ pixels, centered at y in the current frame. The probability distribution of the intensity vector $I = (I_b, I_d)$, where $I_d$ and $I_b$ represent the intensities in the dark and bright images respectively, in the target candidate is given by

$$p_u(y) = \frac{\sum_{i=1}^{n_h} k\!\left(\left\|\frac{y - x_i}{h}\right\|^2\right) \delta\!\left[b(x_i) - u\right]}{\sum_{i=1}^{n_h} k\!\left(\left\|\frac{y - x_i}{h}\right\|^2\right)}, \qquad u = 1, 2, \ldots, m \qquad (2.14)$$

in which $b(x_i)$ is the index of the bin in the joint histogram of the intensity vector I at location $x_i$, h is the radius of the kernel profile, and $\delta$ is the Kronecker delta function. The eye model distribution q can be built in a similar fashion.
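A sketch of building the kernel-weighted joint histogram of Eq. 2.14 is given below. It assumes co-registered bright-pupil and dark-pupil patches of the candidate region and an Epanechnikov-style profile k(x) = 1 - x for x ≤ 1, which is one common choice rather than the specific profile used in the thesis.

```python
import numpy as np

def joint_histogram(bright_patch, dark_patch, center, h, bins=8):
    """Kernel-weighted 2D joint histogram over the (bright, dark) intensity pairs,
    normalized so that the bin values sum to one (Eq. 2.14)."""
    hist = np.zeros((bins, bins), dtype=float)
    scale = 256 // bins                                  # intensity quantization
    cy, cx = center
    rows, cols = bright_patch.shape
    for r in range(rows):
        for c in range(cols):
            d2 = ((r - cy) ** 2 + (c - cx) ** 2) / float(h * h)
            if d2 > 1.0:
                continue                                 # outside the kernel support
            weight = 1.0 - d2                            # Epanechnikov-style profile k
            hist[bright_patch[r, c] // scale, dark_patch[r, c] // scale] += weight
    total = hist.sum()
    return hist / total if total > 0 else hist
```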
2.3.2.3 Algorithm
After locating the eyes in the previous frame, we construct an eye model q
using Equation 2.14 based on the detected eyes in the previous frame. We then
predict the locations y0 of eyes in the current frame using the Kalman filter. Then
we treat y0 as the initial position and use the mean shift iterations to find the
most similar eye candidate to the eye target model in the current frame, using the
following algorithm:
1. Initialize the location of the target in the current frame with $y_0$, then compute the distribution $\{p_u(y_0)\}_{u=1\ldots m}$ using Equation 2.14 and evaluate the similarity measure (Bhattacharyya coefficient) between the model density q and the target candidate density p:

$$\rho\left[p(y_0), q\right] = \sum_{u=1}^{m} \sqrt{p_u(y_0)\, q_u} \qquad (2.15)$$

2. Derive the weights $\{w_i\}_{i=1\ldots n_h}$ according to

$$w_i = \sum_{u=1}^{m} \delta\!\left[b(x_i) - u\right] \sqrt{\frac{q_u}{p_u(y_0)}} \qquad (2.16)$$

3. Based on the mean shift vector, derive the new location of the eye target

$$y_1 = \frac{\sum_{i=1}^{n_h} x_i w_i\, g\!\left(\left\|\frac{y_0 - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n_h} w_i\, g\!\left(\left\|\frac{y_0 - x_i}{h}\right\|^2\right)} \qquad (2.17)$$

where $g(x) = -k'(x)$, then update $\{p_u(y_1)\}_{u=1\ldots m}$ and evaluate

$$\rho\left[p(y_1), q\right] = \sum_{u=1}^{m} \sqrt{p_u(y_1)\, q_u} \qquad (2.18)$$

4. While $\rho[p(y_1), q] < \rho[p(y_0), q]$, do $y_1 \leftarrow 0.5(y_0 + y_1)$. This step is necessary to prevent the mean shift tracker from moving to an incorrect location.

5. If $\|y_1 - y_0\| < \varepsilon$, stop, where $\varepsilon$ is the termination threshold. Otherwise, set $y_0 \leftarrow y_1$ and go to step 1.
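A compact sketch of the above iteration is given below; pixels, bin_of, and intensity_hist are assumed helper functions that return the candidate window pixels, the joint-histogram bin of a pixel, and the candidate distribution p(y), respectively, and g is the derivative-based profile from step 3.

```python
import numpy as np

def mean_shift_track(y0, q, pixels, bin_of, intensity_hist, g, h, eps=0.5, max_iter=20):
    """Follow steps 1-5 above: move the window center y0 toward the location whose
    joint histogram best matches the eye model q."""
    y0 = np.asarray(y0, dtype=float)
    for _ in range(max_iter):
        p0 = intensity_hist(y0)
        rho0 = np.sum(np.sqrt(p0 * q))                                   # Eq. 2.15
        xs = np.asarray(pixels(y0), dtype=float)
        w = np.array([np.sqrt(q[bin_of(x)] / max(p0[bin_of(x)], 1e-12))  # Eq. 2.16
                      for x in xs])
        gk = np.array([g(np.sum(((y0 - x) / h) ** 2)) for x in xs])
        y1 = (xs * (w * gk)[:, None]).sum(axis=0) / np.sum(w * gk)       # Eq. 2.17
        while np.sum(np.sqrt(intensity_hist(y1) * q)) < rho0:            # Eq. 2.18 vs. 2.15
            y1 = 0.5 * (y0 + y1)                                         # step 4: back off
            if np.linalg.norm(y1 - y0) < eps:
                break
        if np.linalg.norm(y1 - y0) < eps:                                # step 5: converged
            return y1
        y0 = y1
    return y0
```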
The new eye locations in the current frame can be obtained in a few iterations, as opposed to correlation-based approaches, which must perform an exhaustive
search around the previous eye location. Due to the simplicity of the calculations,
this method is much faster than correlation. Figure 2.14(b) plots the surface for
the Bhattacharyya coefficient of the large rectangle marked in Figure 2.14(a). The
mean shift algorithm exploits the gradient of the surface to climb from its initial
position to the closest peak that represents the maximum value of the similarity
measure.
2.3.2.4 Mean Shift Tracking Parameters
The mean shift algorithm is sensitive to the window size and the histogram
quantization value. In order to obtain the best performance of the mean shift tracker
Figure 2.14: (a) The image frame 13; (b) values of the Bhattacharyya coefficient corresponding to the marked region (40 × 40 pixels) around the left eye in frame 13. The mean shift algorithm converges from the initial location (∗) to the convergence point, which is a mode of the Bhattacharyya surface
for a specific task, we have to find the appropriate histogram quantization value and
the proper window size. We choose several image sequences and manually locate
the left eye positions in these frames. Then we run the mean shift eye tracker under
different window sizes and different histogram quantization values; we evaluate the
performance of the mean shift eye tracker under those conditions using the following
criterion:
$$\alpha_{error} = \sum_{i=1}^{N} \sqrt{\left(y_i(\mathrm{tracked}) - y'_i(\mathrm{manual})\right)^2} \Big/ N \qquad (2.19)$$

where N is the number of image frames and $y_i(\mathrm{tracked})$ is the left eye location tracked by the mean shift tracker in the image frame i; $y'_i(\mathrm{manual})$ is the left eye
location manually located by the person in the image frame i. We treat the manually
selected eye locations as the correct left eye locations.
The intensity histogram is scaled in the range of 0 to 255/(2q), where q is the
quantization value. The results are plotted in Fig. 2.15. From figure 2.15 (a) and
(b), we can determine the optimal quantization level to be 25 while the optimal
window size is 20*20 pixels. Figure 2.16 shows some tracking results with these
parameters.
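The parameter selection can be expressed as a simple grid search that scores each window-size/quantization pair with the positional error in the spirit of Eq. 2.19; run_tracker is a hypothetical function that re-runs the mean shift tracker with the given parameters and returns the tracked left-eye positions.

```python
import numpy as np

def select_parameters(window_sizes, quantizations, run_tracker, ground_truth):
    """Grid-search the mean shift window size and histogram quantization level,
    scoring each pair by the average positional error against manual labels."""
    gt = np.asarray(ground_truth, dtype=float)        # manually located eye positions
    best, best_err = None, float("inf")
    for w in window_sizes:
        for q in quantizations:
            tracked = np.asarray(run_tracker(window=w, quantization=q), dtype=float)
            err = np.mean(np.linalg.norm(tracked - gt, axis=1))
            if err < best_err:
                best, best_err = (w, q), err
    return best, best_err
```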
The mean-shift tracker, however, is sensitive to its initial placement. It may
Figure 2.15: The error distribution of the tracking results: (a) error distribution vs. intensity quantization values and different window sizes; (b) error distribution vs. quantization levels only
not converge, or may converge to a wrong local mode, if initially placed too far from the optimal location. It usually converges to the mode closest to its initial position. If the initial location is in the valley between two modes, the mean shift may not converge to either peak (local maximum), as shown in Figure 2.17. This demonstrates the sensitivity of the mean-shift tracker to the initial placement provided by the detector.
2.3.2.5 Experiments On Mean Shift Eye Tracking
In order to study the performance of the mean-shift tracker, we apply it to
sequences that contain images with weak or partially occluded pupils or no bright
pupils. We have noticed that when bright pupils disappear due to either eye closure
or face rotations as shown in Figure 2.18, the Kalman filter fails because there
are no bright pupil blobs in the difference images. However, the mean shift tracker
compensates for the failure of bright pupil tracker because it is an appearance-based
tracker that tracks the eyes according to the intensity statistical distributions of the
eye regions and does not need bright pupils. The black rectangles in Figure 2.18
represent the eye locations tracked by the mean shift tracker.
Figure 2.16: Mean-shift tracking of both eyes (frames 1, 13, 27, 46, 63, 88, 100, 130) with an initial search area of 40 × 40 pixels, represented by the large black rectangle. The eyes marked with white rectangles in frame 1 are used as the eye model, and the tracked eyes in the following frames are marked by the smaller black rectangles
2.4 Combining Kalman Filtering Tracking with Mean Shift Tracking
Mean shift tracking is fast and handles noise well, but it is easily distracted
by nearby similar targets such as the nearby region that appears similar to the
eyes. This is partially because of the histogram representation of the eye’s appear-
ance, which does not contain any information about the relative spatial relationships
among pixels; the distraction manifests primarily as errors in the calculated center
of the eyes. The mean shift tracker does not have the capability of self-correction
and the errors therefore tend to accumulate and propagate to subsequent frames
as tracking progresses, and eventually the tracker drifts away. Another factor that
could lead to errors with eye tracking based on mean shift is that the mean shift
tracker cannot continuously update its eye model despite the fact that the eyes look
significantly different under different face orientations and lighting conditions, as
demonstrated in Figures 2.19 (a-e). We can see that the mean shift eye tracker
cannot identify the correct eye location when the eyes appear significantly different
from the eye model due to face orientation change.
Figure 2.17: (a) Image of frame 135, with the initial eye position marked and the initial search area outlined by the large black rectangle. (b) Values of the Bhattacharyya coefficient corresponding to the marked region (40 × 40 pixels) around the left eye in (a). The mean shift algorithm cannot converge from the initial location (which is in the valley between two modes) to the correct mode of the surface; instead, it is trapped in the valley
Figure 2.18: The bright pupil based Kalman tracker fails to track the eyes due to the absence of bright pupils caused by either eye closure or oblique face orientations. The mean shift eye tracker, however, tracks the eyes successfully, as indicated by the black rectangles
To overcome these limitations with the mean shift tracker, we propose to
combine Kalman filter tracking with mean shift tracking to overcome their respective
limitations and to take advantage of their strengths. The two trackers are activated
alternately. The Kalman tracker is first initiated, assuming the presence of the
bright pupils. When the bright pupils appear weak or disappear, the mean shift
tracker is activated to take over the tracking. Mean shift tracking continues until the
reappearance of the bright pupils, when the Kalman tracker takes over. To prevent
the mean shift tracker from drifting away, the target eye model is continuously
Figure 2.19: An image sequence demonstrating the drift-away problem of the mean-shift tracker as well as the correction of the problem by the integrated eye tracker. Frames (a-e) show the drift-away case of the mean-shift eye tracker; for the same image sequence, frames (A-E) show the improved results of the combined eye tracker. White rectangles show the eyes tracked by the Kalman tracker while the black rectangles show the eyes tracked by the mean shift tracker
updated by the eyes successfully detected by the Kalman tracker.
Figures 2.19 (A-E) show the results of tracking the same sequence with the
integrated eye tracker. It is apparent that the integrated tracker can correct the
drift problem of the mean shift tracker. Specifically, in Figure 2.19, white rectangles
represent the eyes tracked by the Kalman tracker while the black rectangles represent
the eyes tracked by the mean shift tracker, which works for all the following figures
in this chapter.
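The alternation between the two trackers can be summarized by the per-frame control loop sketched below; detect_eyes, kalman_track, mean_shift_track, and build_model are hypothetical stand-ins for the detection, tracking, and model-update modules described in this chapter.

```python
def track_frame(state, frame, detect_eyes, kalman_track, mean_shift_track, build_model):
    """One frame of the combined tracker: prefer the bright-pupil Kalman tracker,
    fall back to mean shift when the pupils are weak, and re-detect when both fail."""
    if not state.get("initialized"):
        state["eyes"] = detect_eyes(frame)
        state["initialized"] = state["eyes"] is not None
        return state
    eyes = kalman_track(state, frame)
    if eyes is not None:
        state["eye_model"] = build_model(frame, eyes)   # keep the mean shift model fresh
    else:
        eyes = mean_shift_track(state, frame)           # bright pupils weak or absent
        if eyes is None:
            state["initialized"] = False                # trigger full re-detection
    state["eyes"] = eyes
    return state
```

This mirrors the flowcharts of Figures 2.2 and 2.12: the eye model used by the mean shift tracker is refreshed only from frames where the Kalman bright-pupil tracker succeeds, which is what prevents the drift shown in Figure 2.19.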
2.5 Experimental Results
In this section, we will present results from an extensive experiment we con-
ducted to validate the performance of our integrated eye tracker under different
conditions.
2.5.1 Eye Tracking Under Significant Head Pose Changes
Here, we show some qualitative and quantitative results to demonstrate the
performance of our tracker under different face orientations. Figure 2.20 shows the
tracking results on a typical face image sequence with a person undergoing significant
face pose changes. Additional results for different subjects under significant head
rotations are shown in Figure 2.21. We can see that under significant head pose
changes, the eyes will be either partially occluded or the appearance of eyes will
be significantly different from the eyes with frontal faces. But the two eye trackers
alternate reliably, detecting the eyes under different head orientations, with eyes
either open, closed or partially occluded.
Figure 2.20: Tracking results of the combined eye tracker for a person undergoing significant head movements
To confirm this finding quantitatively, we manually located the positions of the
eyes for two typical sequences and they serve as the ground-truth eye positions. The
tracked eye positions are then compared with the ground-truth data. The results
are summarized in Tables 2.2 and 2.3. From the tracking statistics in Tables 2.2 and
2.3, we can conclude that the integrated eye tracker is much more accurate than the
Kalman filter pupil tracker, especially for closed eyes and for eyes partially occluded
due to large face rotations. These results demonstrate that this combination of two
tracking techniques produces much better tracking results than using either of them
individually.
2.5.2 Eye Tracking Under Different Illuminations
In this experiment, we demonstrate the performance of our integrated tracker
under different illumination conditions by varying the light conditions during track-
Figure 2.21: Tracking results of the combined eye tracker for four image sequences (a), (b), (c) and (d) under significant head movements.
ing. The experiment included first turning off the ambient lights, followed by using
a mobile light source and positioning it close to the people to produce strong ex-
ternal light interference. The external mobile light produces significant shadows as
well as intensity saturation on the subject’s faces. Figure 2.22 visually shows the
sample tracking results for two individuals. Despite these somewhat extreme condi-
tions, our eye tracker managed to track the eyes correctly. Because of the use of IR,
the faces are still visible and eyes are tracked even under darkness. It is apparent
that illumination change does not adversely affect the performance of our technique.
This may be attributed to the simultaneous use of active IR sensing, image intensity
normalization for eye detection using SVM, and the dynamic eye model updating
for the mean shift tracker.
Table 2.2: Tracking statistics comparison for both trackers under different eye conditions (open, closed, occluded) on the first person

Image (600 frames)                 Bright pupil tracker   Combined tracker
Left eye (open), 452 frames        400/452                452/452
Left eye (closed), 66 frames       0/66                   66/66
Left eye (occluded), 82 frames     0/82                   82/82
Right eye (open), 425 frames       389/425                425/425
Right eye (closed), 66 frames      0/66                   66/66
Right eye (occluded), 109 frames   0/109                  109/109
Figure 2.22: Tracking results of the combined eye tracker for two image sequences (a) and (b) under significant illumination changes.
2.5.3 Eye Tracking With Glasses
Eye appearance changes significantly with glasses. Furthermore, the glare on
the glasses caused by light reflections presents significant challenges to eye tracking
with glasses. In Figure 2.23, we show the results of applying our eye tracker to
persons wearing glasses. We can see that our eye tracker can still detect and track
eyes robustly and accurately for people with glasses. However, our study shows that
when the head orientation is such that the glare completely occludes the pupils, our
Table 2.3: Tracking statistics comparison for both trackers under different eye conditions (open, closed, occluded) on the second person

Image Sequence 1 (600 frames)      Bright pupil tracker   Combined tracker
Left eye (open), 421 frames        300/421                410/421
Left eye (closed), 78 frames       0/78                   60/78
Left eye (occluded), 101 frames    0/101                  60/101
Right eye (open), 463 frames       336/463                453/463
Right eye (closed), 78 frames      0/78                   78/78
Right eye (occluded), 59 frames    0/59                   59/59
tracker will fail. This is a problem that we will tackle in the future.
Figure 2.23: Tracking results of the combined eye tracker for two image sequences (a) and (b) with persons wearing glasses.
2.5.4 Eye Tracking With Multiple People
Our eye tracker can track not only the eyes of one person but also the eyes of
multiple people simultaneously. Here, we show the results of applying our eye
tracker to simultaneously track multiple people’s eyes with different distances and
face orientations with respect to the camera. The result is presented in Figure 2.24.
This experiment demonstrates the versatility of our eye tracker.
Figure 2.24: Tracking results of the combined eye tracker for multiple persons.
2.5.5 Occlusion Handling
Eyes are often partially or completely occluded, either by the face due to oblique
face orientations, by hands, or by other objects. A good eye tracker should be
able to track eyes under partial occlusion and be able to detect complete occlusion
and re-detect the eyes after the complete occlusion is removed. In Figure 2.25, two
persons are moving in front of the camera, and one person’s eyes are occluded by
another’s head when they are crossing. As shown in Figure 2.25, when the rear
person moves from right to left, the head of the front person starts to occlude his
eyes, beginning with one and then two eyes getting completely occluded. As shown,
our tracker can still correctly track an eye even though it is partially occluded. When
both eyes are completely occluded, our tracker detects this situation. As soon as
the eyes reappear in the image, our eye tracker will capture the eyes one by one
immediately as shown in Figure 2.25. This experiment shows the robustness of our
method to occlusions.
2.5.6 Tracking Accuracy Validation
Experiments were conducted to quantitatively characterize the tracking accu-
racy of our proposed eye tracker. Specifically, we randomly selected an image se-
quence that contains 13,620 frames, and manually identified the eyes in each frame.
Figure 2.25: Tracking results of the combined eye tracker for an image sequence involving multiple persons occluding each other's eyes.
The manually labelled data serves as the ground truth and is compared with auto-
matically tracked results from our eye tracker. The study shows that our eye tracker
is quite accurate, with a false alarm rate of 0.05% and a mis-detection rate of 4.2%.
In addition, we studied the positional accuracy of the tracked eyes. The ground
truth is still obtained by manually locating the eyes in each frame. Figure 2.26
summarizes the comparison results. It shows that the automatically tracked eye
positions match very well with manually located eye positions, with RMS position
errors of 1.09 and 0.68 pixels along x and y coordinates, respectively.
Figure 2.26: The comparison between the automatically tracked eye positions and the manually located eye positions for 100 randomly selected consecutive frames: (a) x coordinate and (b) y coordinate.
2.5.7 Processing Speed
The proposed eye detection and tracking algorithm is implemented using C++
on a PC with a Xeon(TM) 2.80 GHz CPU and 1.00 GB of RAM. The resolution of the
captured images is 640 × 480 pixels, and the eye tracker runs at approximately
26 fps.
2.6 Chapter Summary
In this chapter, we present an integrated eye tracker to track eyes robustly
under various illuminations and face orientations. Our method performs well re-
gardless of whether the pupils are directly visible or not. This has been achieved
by combining an appearance-based pattern recognition method (SVM) and object
tracking (Mean Shift) with a bright-pupil eye tracker based on Kalman filtering.
Specifically, we take the following measures. First, the use of SVM for pupil
detection complements eye detection based on bright pupils from IR illumination,
allowing detection of eyes in the presence of other bright objects; second, two chan-
nels (dark-pupil and bright-pupil eye images) are used to characterize the statistical
distributions of the eye, based on which a mean shift eye tracker is developed. Third,
the eye model is continuously updated with the eyes most recently detected by
the Kalman tracker, to avoid error propagation with the mean shift tracker.
Finally, the experimental determination of the optimal window size and quantiza-
tion level for mean shift tracking further enhances the performance of our technique.
Experiments show that these enhancements have led to a significant improvement
in eye tracking robustness and accuracy over existing eye trackers, especially under
various conditions identified in Section 1. Furthermore, our integrated eye tracker
is demonstrated to be able to handle occlusion and people with glasses, and to
simultaneously track multiple people of different poses and scales.
The two important lessons we learn from this research are: 1) perform active
vision (e.g., active IR illumination) to produce quality input images and to simplify
subsequent image processing; and 2) combine different complementary techniques
to utilize their respective strengths and to overcome their limitations, leading to a
much more robust technique than using each technique individually.
CHAPTER 3
Eye Gaze Tracking
3.1 Introduction
Eye gaze is defined as the line of sight of a person. It represents a person’s focus
of attention. Eye gaze tracking has been an active research topic for many decades
because of its potential usages in various applications such as Human Computer
Interaction (HCI), Virtual Reality, Eye Disease Diagnosis, Human Behavior Studies,
etc. For example, when a user is looking at a computer screen, the user’s gaze
point at the screen can be estimated via the eye gaze tracker. Hence, the eye
gaze can serve as an advanced computer input [47], which is proven to be more
efficient than the traditional input devices such as a mouse pointer [126]. Also, a
gaze-contingent interactive graphic display application can be built [133], in which
the graphic display on the screen can be controlled interactively by the eye gaze.
Recently, eye gaze has also been widely used by cognitive scientists to study human
beings’ cognition [67], memory [70], etc.
Numerous techniques [133, 74, 45, 104, 131, 79, 6, 101, 113, 84, 76, 71] have
been proposed to estimate the eye gaze. Earlier eye gaze trackers are fairly intrusive
in that they require physical contacts with the user, such as placing a reflective
white dot directly onto the eye [74] or attaching a number of electrodes around the
eye [45]. In addition, most of these technologies also require the user’s head to be
motionless during eye tracking.
With the rapid technological advancements in both video cameras and mi-
crocomputers, gaze-tracking technology based on the digital video analysis of eye
movements has been widely explored. Since it does not require anything attached
to the user, video technology opens the most promising direction for building a non-
intrusive eye gaze tracker. Various techniques [20, 77, 18, 49, 123, 2, 133] have been
proposed to perform the eye gaze estimation based on eye images captured by video
cameras. However, most available remote eye gaze trackers have two characteristics
that prevent them from being widely used. First, they must often be calibrated
repeatedly for each individual; second, they have low tolerance for head movements
and require the user to hold the head uncomfortably still.
In this chapter, two different techniques are introduced to improve the existing
gaze-tracking techniques. First, a simple 3D gaze-tracking technique is proposed to
estimate the 3D direction of the gaze. Different from existing 3D techniques, the
proposed 3D gaze-tracking technique does not need to know any user-dependent
parameters about the eyeball. Hence, the 3D direction of the gaze can be estimated
in a way that allows easier implementation. Second, a novel 2D mapping-based
gaze estimation technique is introduced to allow free head movements and simplify
the calibration procedure. A dynamic head compensation model is proposed to
compensate for the head movements so that whenever the head moves to a new
3D position, the gaze mapping function at the new 3D position can be updated
automatically. Hence, accurate gaze information can still be estimated as the head
moves. Therefore, by using our proposed gaze-tracking techniques, a more robust,
accurate, comfortable and useful eye gaze-tracking system can be built.
3.2 Related Works
In general, most of the non-intrusive vision-based gaze-tracking techniques can
be classified into two groups: 2D mapping-based gaze estimation methods [133, 104, 131, 79] and direct 3D gaze estimation methods [6, 101, 113, 84, 76, 71]. In the following sections, each group is discussed briefly.
3.2.1 2D Mapping-Based Gaze Estimation Technique
For the 2D mapping-based gaze estimation method, the eye gaze is estimated
from a calibrated gaze mapping function by inputting a set of 2D eye movement
features extracted from eye images, without knowing the 3D direction of the gaze.
Usually, the extracted 2D eye movement features vary with the eye gaze so that the
relationship between them can be encoded by a gaze mapping function. In order
to obtain the gaze mapping function, an online calibration needs to be performed for
each person. Unfortunately, the extracted 2D eye movement features also vary
significantly with head position; thus, the calibrated gaze mapping function is very
sensitive to head motion [79]. Hence, the user has to keep his head unnaturally still
in order to achieve good performance.
The Pupil Center Cornea Reflection (PCCR) technique is the most commonly
used 2D mapping-based approach for eye gaze tracking. The angle of the visual axis
(or the location of the fixation point on the display surface) is calculated by tracking
the relative position of the pupil center and a speck of light reflected from the cornea,
technically known as the “glint” as shown in Figure 3.1 (a) and (b). The generation
of the glint will be discussed in more detail in Section 3.3.2.1. The accuracy of the
system can be further enhanced by illuminating the eyes with near-InfraRed (IR)
light, which produces the “bright pupil” effect as shown Figure 3.1 (b) and makes
the video image easier to process. Infrared light is harmless and invisible to the
user.
Figure 3.1: Eye images with corneal reflection (glint): (a) dark pupil image; (b) bright pupil image. The glint is a small bright spot as indicated in (a) and (b)
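To make the 2D mapping idea concrete, the sketch below fits a generic second-order polynomial mapping from the pupil-glint vector to screen coordinates using calibration data; this is a common PCCR formulation given for illustration only, not the specific mapping function developed later in this chapter.

```python
import numpy as np

def fit_gaze_mapping(pupil_glint_vectors, screen_points):
    """Least-squares fit of a second-order polynomial mapping from the pupil-glint
    vector (dx, dy) to the screen coordinates (sx, sy), using calibration data."""
    v = np.asarray(pupil_glint_vectors, dtype=float)
    dx, dy = v[:, 0], v[:, 1]
    A = np.column_stack([np.ones_like(dx), dx, dy, dx * dy, dx ** 2, dy ** 2])
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(screen_points, dtype=float), rcond=None)
    return coeffs                                       # shape (6, 2)

def map_gaze(coeffs, dx, dy):
    """Apply the calibrated mapping to a new pupil-glint vector."""
    features = np.array([1.0, dx, dy, dx * dy, dx ** 2, dy ** 2])
    return features @ coeffs                            # estimated (sx, sy) on the screen
```

Because the mapping is fitted for one head position, its accuracy decays as the head moves, which is exactly the limitation discussed next.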
Several systems [44, 48, 20, 78] have been built based on the PCCR technique.
Most of these systems show that if the user has the ability to keep his head fixed, or
to restrict head motion via the help of chin-rest or bite-bar, very high accuracy can
be achieved in the eye gaze tracking results. Specifically, the average error can be less than 1 degree of visual angle, which corresponds to less than 10 mm on the computer screen when the subject is sitting approximately 550 mm from the screen. But
as the head moves away from the original position where the user performed the eye
gaze calibration, the accuracy of these eye gaze-tracking systems drops dramatically;
for example, [79] reports detailed data showing how the calibration mapping function
decays as the head moves away from its original position. Jacob reports a similar
fact in [48]. Jacob attempted to solve the problem by giving the user the ability
to make local manual re-calibrations, which is burdensome for the user.
As these studies indicate, calibration is a significant problem in current remote eye
tracking systems.
Most of the commercially available eye gaze-tracking systems [1, 2, 3] are also
built on the PCCR technique, and most of them claim that they can tolerate small
head motion. For example, less than 2 square inches of head motion tolerance is
claimed for the eye gaze tracker from LC technologies [2], which is still working
to improve it. The ASL eye tracker [1] has the best claimed tolerance of head
movement, allowing approximately one square foot of head movement. It eliminates
the need for head restraint by combining a magnetic head tracker with a pan-tilt
camera. However, details about how it handles head motion are not publicly known.
Further, combining a magnetic head tracker with a pan-tilt camera is not only
complicated but also expensive for the regular user.
In summary, most of existing eye gaze systems based on the PCCR technique
share two common drawbacks: first, the user has to perform a calibration procedure to estimate the user-dependent parameters before using the gaze-tracking system; second, the user has to keep his head unnaturally still, with no significant head
movements allowed.
3.2.2 Direct 3D Gaze Estimation Technique
For the direct 3D gaze estimation technique, the 3D direction of the gaze is
estimated directly so that the gaze point can be obtained by simply intersecting it
with the scene. Therefore, how to estimate the 3D gaze direction of the eye precisely
is the key issue for most of these techniques. Several techniques [76, 84, 6, 101] have
been proposed to estimate the 3D direction of gaze directly from the eye images.
This method is not constrained by the head position, and it can be used to obtain
the gaze point on any object in the scene by simply intersecting it with the estimated
3D gaze line. Therefore, with the use of this method, the issues of gaze mapping
function calibration and head movement that plague the 2D mapping methods can
be solved nicely.
Morimoto et al. [76] proposed a technique to estimate the 3D gaze direction
of the eye with the use of a single calibrated camera and at least two light sources.
First, the radius of the eye cornea is measured in advance for each person, using
at least three light sources. A set of high order polynomial equations are derived
to compute the radius and center of the cornea, but their solutions are not unique.
Therefore, how to choose the correct one from the set of possible solutions is still an
issue. Furthermore, no working system has been built using the proposed technique.
Ohno et al. [84] proposed an approximation method to estimate the 3D eye
gaze. There are several limitations for this proposed method. First, the cornea
radius and the distance between the pupil and cornea center are fixed for all users
although they actually vary significantly from person to person. Second, the formu-
lation to obtain the cornea center is based on the assumption that the virtual image
of IR LED appears on the surface of the cornea. In fact, however – as shown in
Section 3.3.2.1 of this chapter – the virtual image of IR LED will not appear on the
surface of the cornea; instead, it will appear behind the cornea surface or inside the
cornea. Therefore, the calculated cornea center will be a very rough approximation.
Beymer et al. [6] proposed another system that can estimate the 3D gaze
direction based on a complicated 3D eyeball model with at least seven parameters.
First, the 3D eyeball model will be automatically individualized to a new user, which
is achieved by fitting the 3D eye model with a set of image features via a nonlinear
estimation technique. The image features used for fitting include only the glints of
the IR LEDs and the pupil edges. But as shown in Section 3.3.2.1 of this chapter,
the glints are the image projections of the virtual images of the IR LEDs created
by the cornea, and they are not on the surface of the eye cornea, but inside the
cornea. Also, the pupil edges are not on the surface of the 3D eye model, either.
Therefore, the radius of the cornea cannot be estimated based on the proposed
method. Further, when fitting such a complicated 3D model with only a few feature
points, the solution will be unstable and very sensitive to noise.
Shih et al. [101] proposed a novel method to estimate 3D gaze direction by
using multiple cameras and multiple light sources. In their method, although there is
no need to know the user-dependent parameters of the eye, there are several obvious
limitations for the current system. First, the light sources and the cameras cannot
be collinear, and a careful arrangement of them is required in order to achieve a
good performance; second, when the user is looking at points on the line connecting
the optical centers of the two cameras, the 3D gaze direction cannot be determined
uniquely.
Therefore, most of the existing 3D gaze-tracking techniques either require
knowledge of several user-dependent parameters about the eye [76, 84, 6], or cannot
work under certain circumstances [101]. But in reality, these user-dependent parameters
of the eyeball, such as the cornea radius and the distance between the pupil and the
cornea center, are very difficult to measure accurately due to their small size (normally
less than 10 mm). Therefore, the accuracy of these proposed eye gaze techniques will
decline dramatically if these parameters cannot be measured accurately.
3.3 Direct 3D Gaze Estimation Technique
3.3.1 The Structure of Human Eyeball
As shown in Figure 3.2, the eyeball is made up of the segments of two spheres
of different sizes, one placed in front of the other [86]. The anterior, the smaller
segment, is transparent and forms about one-sixth of the eyeball, and has a radius
of curvature of about 8 mm. The posterior, the larger segment, is opaque and forms
about five-sixths of the eyeball, and has a radius of about 12 mm.
Figure 3.2: The structure of the eyeball (top view of the right eye)
The anterior pole of the eye is the center of curvature of the transparent
segment or cornea. The posterior pole is the center of the posterior curvature of
the eyeball, and is located slightly temporal to the optical nerve. The optic axis is
defined as a line connecting these two poles, as shown in Figure 3.2. The fovea defines
the center of the retina, and is a small region with highest visual acuity and color
sensitivity. Since the fovea provides the sharpest and most detailed information, the
eyeball is continuously moving so that the light from the object of primary interest
will fall on this region. Thus, another major axis, the visual axis, is defined as
the projection of the foveal center into object space through the eye’s nodal point
Ocornea as shown in Figure 3.2. Therefore, it is the visual axis that determines a
person’s visual attention or direction of gaze, not the optic axis. Since the fovea is a
few degrees temporal to the posterior pole, the visual axis will deviate a few degrees
nasally from the optic axis. The angle formed by the intersection of the visual axis
and the optic axis at the nodal point is named the angle kappa. The angle kappa in
the two eyes should have the same magnitude [86], approximately 5°.
The pupillary axis of the eye is defined as the 3D line connecting the center of the
pupil P and the center of the cornea Ocornea. The pupillary axis is the best estimate
of the location of the eye’s optic axis; if extended through the eye, it should exit
very near the anatomical posterior pole. In Figure 3.2, the pupillary axis is shown
as the optic axis of the eyeball. Therefore, if we can obtain the 3D locations of
the pupil center and cornea center, then the optic axis of the eye can be estimated
easily.
3.3.2 Derivation of 3D Cornea Center
3.3.2.1 The Structure of Cornea
The anterior of the eyeball is composed of several layers, and each layer is made
of tissue with a slightly different refraction index. When light passes through the
eye, the boundary surface of each layer will act like a reflective surface. Therefore,
if a light source is placed in front of the eye, several reflections will occur on the
boundaries of the lens and cornea as shown in Figure 3.3. If these reflections are
captured by a camera, the generated images are called Purkinje images. The first
Purkinje image corresponds to the reflection from the external surface of the cornea
as shown in Figure 3.3, which will be captured as a very bright spot in the eye
image as shown in Figure 3.1 (a) and (b). This special bright dot is called the glint, and
it is the brightest and easiest reflection to detect and track. Detecting the other
Purkinje images requires special hardware; therefore, from now on, only the first
Purkinje image will be considered here.
Figure 3.3: The reflection diagram of Purkinje images
Since the external surface of the cornea functions like a convex mirror, in order
to understand the formation of the glint, the external surface of the cornea is further
modelled as a convex mirror with a radius R.
3.3.3 Image Formation in the Convex Mirror
First, a few key concepts will be introduced in order to study the image for-
mation by a spherical convex mirror [33].
Figure 3.4: A ray diagram to locate the image of an object in a convex mirror
As illustrated in Figure 3.4, the point V is the surface center of the mirror and
the normal of the mirror is called the principal axis. The mirror is assumed to be
rotationally symmetrical about its principal axis. This allows us to represent a three-
dimensional mirror in a two-dimensional diagram without loss of generality. The
point O, on the principal axis, which is equidistant from all points on the reflecting
surface of the mirror, is called the center of curvature. It is found experimentally
that rays striking a convex mirror parallel to its principal axis, and not too far away
from this axis, are reflected by the mirror such that they all pass through the same
point F on the principal axis. This point, which lies between the center of curvature
and the vertex, is called the focal point, or virtual focus, of the mirror.
The ray diagram of the image produced by a convex mirror always conforms
to the following two simple rules:
1. An incident ray which is parallel to the principal axis is reflected as if it came
from the virtual focus of the mirror.
2. An incident ray which is directed towards the center of curvature of the mirror
is reflected back along its own path (since it is normally incident on the mirror).
As shown in Figure 3.4, two rays are used to locate the image S′T′ of an object ST placed in front of the mirror. It can be seen that the image is virtual, upright, and diminished.
3.3.3.1 Glint Formation In Cornea Reflection
The eye cornea serves as a convex mirror during the process of glint formation.
Specifically, the focal point F, the center of curvature Ocornea and the principal
axis are shown in Figure 3.5. In our research, the IR LEDs are utilized as the light
sources. Therefore, when an IR LED is placed in front of the eye, the cornea will
produce a virtual image of the IR LED, which is located somewhere behind the
cornea surface, as shown in Figure 3.5.
According to the properties of the convex mirror, the IR light ray diagram of the cornea is shown in Figure 3.5. In the diagram, an image is the location in space from which the reflected light appears to diverge. Any observer from any position who is sighting along a line at the image location will view the IR light source as a result of the reflected light; each observer sees the image in the same location regardless of the observer's location. Thus, the task of determining the image location of the IR light source is to determine the location where the reflected light rays intersect. In Figure 3.5, several rays of light emanating from the IR light source are shown approaching the cornea and subsequently reflecting. Each ray is extended backwards to a point of intersection; this point of intersection of all extended reflected rays indicates the location of the virtual image of the IR light source.
Figure 3.5: The image formation of a point light source in the cornea when the cornea serves as a convex mirror
In our research, the cameras are the observers. Therefore, the virtual image
of the IR light source created by the cornea will be shown as a glint in the image
captured by the camera. If we place two cameras at different locations, each camera
will capture a glint corresponding to the same virtual image of the IR light source
in space as shown in Figure 3.6. Therefore, in theory, with the use of two cameras,
the 3D location of the virtual image of the IR light source in space can be recovered.
Figure 3.6: The ray diagram of the virtual image of the IR light source in front of the cameras
3.3.3.2 Curvature Center of the Cornea
According to the properties of the convex mirror, an incident ray that is directed towards the center of curvature of a mirror is reflected back along its own path (since it is normally incident on the mirror). Therefore, as shown in Figure 3.7, if the light ray L1P1 is shone directly towards the center of curvature of the cornea Ocornea, it will be reflected back along its own path. Also, the virtual image of the IR light source P1 will lie on this path. Therefore, as shown in Figure 3.7, the IR light source L1, its virtual image P1 and the curvature center of the cornea Ocornea will be co-linear.
Figure 3.7: The ray diagram of two IR light sources in front of the cornea
Further, if we place another IR light source at a different place L2 as shown in
Figure 3.7, then the IR light source L2, its virtual image P2 and the curvature
center of the cornea Ocornea will lie on another line L2P2Ocornea. Line L1P1Ocornea
and line L2P2Ocornea will intersect at the point Ocornea.
As discussed in Section 3.3.3.1, if two cameras are used, the 3D locations of
the virtual images P1 and P2 of the IR light sources can be obtained through 3D
reconstruction. Furthermore, the 3D location of the IR light sources L1 and L2 can
be obtained through the system calibration procedure discussed in Section 3.5.1.
Therefore, the 3D location of the curvature center of cornea Ocornea can be obtained
by intersecting the line L1P1 and L2P2 as follows:
Ocornea = L1 + k1(L1 − P1)
Ocornea = L2 + k2(L2 − P2)    (3.1)
Note that when more than two IR light sources are available, a set of equations
can be obtained, which can lead to a more robust estimation of the 3D location of
cornea center.
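As a concrete illustration, equation 3.1 can be solved as the point closest, in the least-squares sense, to all of the lines LiPi. The following is a minimal sketch, assuming Python with numpy; the function name and interface are hypothetical and not part of the actual system:

import numpy as np

def estimate_cornea_center(leds, virtual_images):
    """Least-squares intersection of the lines L_i + k (L_i - P_i) of Eq. 3.1,
    where L_i is an IR LED position and P_i its virtual image.
    leds, virtual_images: (n, 3) arrays of 3D points in the camera frame."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for L, P in zip(np.asarray(leds, float), np.asarray(virtual_images, float)):
        d = L - P
        d /= np.linalg.norm(d)              # unit direction of the line
        M = np.eye(3) - np.outer(d, d)      # projector orthogonal to the line
        A += M
        b += M @ L
    return np.linalg.solve(A, b)            # estimated cornea center O_cornea

With only two light sources this reduces to intersecting the two lines; with more light sources the same system simply becomes more over-determined, as noted above.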
3.3.4 Computation of 3D Gaze Direction
3.3.4.1 Estimation of Optic Axis
As discussed earlier, the pupillary axis is the best approximation of the optic
axis of the eye. Therefore, after the 3D location of the pupil center P is extracted,
the optic axis Vp of the eye can be estimated by connecting the 3D pupil center P
with cornea center Ocornea as follows:
Vp = Ocornea + k(P −Ocornea) (3.2)
Since the fovea is invisible from the captured eye images, the visual axis of the
eye cannot be estimated directly. Without knowing the visual axis of the eye, the
user’s fixation point in the 3D space still cannot be determined.
However, the deviation angle kappa between the visual axis and the optic axis
of the eye is constant for each person. Therefore, if the deviation angle kappa is
known, then the visual axis can be computed from the estimated optic axis easily.
In the following, a technique is proposed to estimate the deviation angle kappa
accurately.
3.3.4.2 Compensation of the Angle Deviation between Visual Axis and
Optic Axis
When a user is looking at a known point Ps on the screen, the 3D location
of the screen point Ps is known because the screen is calibrated. At the same
time, the 3D location of the cornea center Ocornea and the 3D location of the pupil
center P can be computed from the eye images via the proposed technique discussed
above. Therefore, the direction of the visual axis Vv and the direction of the optic axis Vp
can be computed as follows:
Vv = (Ps − Ocornea) / ‖Ps − Ocornea‖
Vp = (P − Ocornea) / ‖P − Ocornea‖    (3.3)
In addition, let us represent the relationship between the visual axis and the
optic axis as follows:
Vv = R Vp    (3.4)
where R is a 3 × 3 rotation matrix constructed from the deviation angles between the vectors Vv and Vp, i.e., the deviation angle kappa. Once the rotation matrix R is estimated, the 3D visual axis can be estimated from the extracted 3D optic axis. Therefore, instead of estimating the deviation angle kappa directly to know the relationship between the visual axis and the optic axis, it can be encoded through the rotation matrix R implicitly. In addition, the rotation matrix R can be estimated by a simple calibration as follows.
During the calibration, the user is asked to look at a set of k pre-defined points Psi (i = 1, · · · , k) on the screen. After the calibration is done, a set of k pairs of vectors Vv and Vp are obtained via equation 3.3. In addition, since the rotation matrix R is an orthonormal matrix, equation 3.4 can be represented as
Vp = R^T Vv    (3.5)
Therefore, according to equations 3.4 and 3.5, one pair of vectors Vv and Vp gives 6 linear equations, so that two screen points are enough to estimate the 3 × 3 rotation matrix R.
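For illustration, R could also be recovered from the calibration pairs by solving an orthogonal Procrustes problem with an SVD, which enforces the orthonormality of R directly; note that this is a substitute for the linear-equation formulation described above, not the thesis's exact procedure. A minimal sketch, assuming numpy and hypothetical names:

import numpy as np

def estimate_kappa_rotation(Vp_list, Vv_list):
    """Estimate R with Vv ≈ R Vp (Eq. 3.4) from k calibration pairs.
    Vp_list, Vv_list: (k, 3) arrays of unit optic-axis and visual-axis vectors."""
    Vp = np.asarray(Vp_list, float)
    Vv = np.asarray(Vv_list, float)
    U, _, Wt = np.linalg.svd(Vv.T @ Vp)              # cross-covariance of the pairs
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Wt)])   # keep det(R) = +1
    return U @ D @ Wt                                # R such that Vv ≈ R @ Vp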
Once the rotation matrix R is estimated, the visual axis of the eye can be estimated from the computed optic axis Vp through equation 3.4. Finally, an accurate point of regard of the user can be computed by intersecting the estimated 3D visual axis with any object in the scene.
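When the object of interest is the calibrated computer screen, this intersection reduces to a standard ray-plane intersection. A minimal sketch under that assumption (names are illustrative):

import numpy as np

def point_of_regard(O_cornea, Vv, plane_point, plane_normal):
    """Intersect the 3D visual axis with a calibrated planar object (the screen).
    O_cornea: 3D cornea center; Vv: unit visual-axis direction from Eq. 3.4;
    plane_point, plane_normal: any point on the screen plane and its normal."""
    denom = np.dot(plane_normal, Vv)
    if abs(denom) < 1e-9:
        return None                       # axis parallel to the screen plane
    k = np.dot(plane_normal, plane_point - O_cornea) / denom
    return O_cornea + k * Vv              # 3D gaze point on the screen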
3.4 2D Mapping-Based Gaze Estimation Technique
Most available remote eye gaze trackers are built on the PCCR technique.
If the user is able to keep his head fixed, or a chin-rest is used to restrict the head
motion, very high accuracy can be achieved in most eye gaze estimation results.
But as the head moves away from the original head position where
the user performed the eye gaze calibration, the accuracy of these gaze-tracking
systems drops significantly.
In the following sections, the head motion effect on the accuracy of the PCCR-
based gaze-tracking techniques is first analyzed. Subsequently, a solution is proposed
to compensate for the head movement effect so that the user can move his head freely
in front of the camera while the gaze information can still be accurately estimated.
3.4.1 Classical PCCR Technique
The PCCR-based technique consists of two major components: pupil-glint
vector extraction and gaze mapping function acquisition.
1. Pupil-glint Vector Extraction
Gaze estimation starts with pupil-glint vector extraction. After grabbing the
eye image from the camera, computer vision techniques [133, 134] are proposed to
extract the pupil center and the glint center robustly and accurately. The pupil
center and the glint center are connected to form a 2D pupil-glint vector v as shown
in Figure 3.9.
2. Specific Gaze Mapping Function Acquisition
After obtaining the pupil-glint vectors, a calibration procedure is proposed to
acquire a specific gaze mapping function that will map the extracted pupil-glint
vector to the user’s fixation point in the screen at the current head position. The
extracted pupil-glint vector v is represented as (vx, vy) and the screen gaze point Ss
is represented by (xgaze, ygaze) in the screen coordinate system. The specific gaze
mapping function Ss = f(v) can be modelled by the following nonlinear equations
[2]:
xgaze = a0 + a1 ∗ vx + a2 ∗ vy + a3 ∗ vx ∗ vy
ygaze = b0 + b1 ∗ vx + b2 ∗ vy + b3 ∗ vy^2    (3.6)
The coefficients a0, a1, a2, a3 and b0, b1, b2, b3 are estimated from a set of pairs
of pupil-glint vectors and the corresponding screen gaze points. These pairs are
collected in a calibration procedure. During the calibration, the user is required
to visually follow a shining dot as it displays at several predefined locations on the
computer screen. In addition, he must keep his head as still as possible.
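Since equation 3.6 is linear in the coefficients, the calibration data can be fitted by ordinary least squares. The sketch below illustrates one way to do this; it assumes numpy and uses hypothetical function names rather than the actual implementation:

import numpy as np

def fit_gaze_mapping(vectors, screen_points):
    """Fit the coefficients of Eq. 3.6 by linear least squares.
    vectors: (n, 2) pupil-glint vectors (vx, vy) from calibration.
    screen_points: (n, 2) corresponding screen gaze points (xgaze, ygaze)."""
    v = np.asarray(vectors, float)
    s = np.asarray(screen_points, float)
    vx, vy = v[:, 0], v[:, 1]
    Ax = np.column_stack([np.ones_like(vx), vx, vy, vx * vy])   # basis for xgaze
    Ay = np.column_stack([np.ones_like(vx), vx, vy, vy ** 2])   # basis for ygaze
    a, *_ = np.linalg.lstsq(Ax, s[:, 0], rcond=None)
    b, *_ = np.linalg.lstsq(Ay, s[:, 1], rcond=None)
    return a, b

def apply_gaze_mapping(a, b, vx, vy):
    """Evaluate Eq. 3.6 for a newly extracted pupil-glint vector."""
    xgaze = a[0] + a[1] * vx + a[2] * vy + a[3] * vx * vy
    ygaze = b[0] + b[1] * vx + b[2] * vy + b[3] * vy ** 2
    return xgaze, ygaze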
If the user does not move his head significantly after the gaze calibration, the
calibrated gaze mapping function can be used to accurately estimate the user’s gaze
point on the screen, based on the extracted pupil-glint vector. But when the user
moves his head away from the position where the gaze calibration is performed,
the calibrated gaze mapping function will fail to estimate the gaze point accurately
because of the pupil-glint vector changes caused by the head movements. In the
following section, head movement effects on the pupil-glint vector will be illustrated.
3.4.2 Head Motion Effects on Pupil-glint Vector
Figure 3.8 shows the ray diagram of the pupil-glint vector generation in the
image when an eye is located at two different 3D positions O1 and O2 in front of the
camera due to head movement. For simplicity, the eye is represented by a cornea,
the cornea is modelled as a convex mirror, and the IR light source used to generate
the glint is located at O, all of which are applicable to subsequent figures in this
chapter. Assume that the origin of the camera is located at O, p1 and p2 are the
pupil centers and g1 and g2 are the glint centers generated in the image. Further,
at both positions, the user is looking at the same point of the computer screen S.
According to the light ray diagram shown in Figure 3.8, the generated pupil-glint vectors g1p1 and g2p2 will be significantly different in the images, as shown in Figure 3.9. Two factors are responsible for this pupil-glint vector difference: first, the eyes are at different positions in front of the camera; second, in order to look at the same screen point, eyes at different positions rotate by different amounts.
Figure 3.8: Pupil and glint image formations when eyes are located at different positions while gazing at the same screen point (side view)
(a) eye image at location O1 (b) eye image at location O2
Figure 3.9: The pupil-glint vectors generated in the eye images when the eye is located at O1 and O2 in Figure 3.8
The eye will move as the head moves. Therefore, when the user is gazing
at a fixed point on the screen while moving his head in front of the camera, a set
of pupil-glint vectors in the image will be generated. These pupil-glint vectors are
significantly different from each other. If they are not corrected, inaccurate gaze points will
be estimated when they are input into a specific gaze mapping function calibrated
at a fixed head position.
Therefore, the head movement effects on these pupil-glint vectors must be
eliminated in order to utilize the specific gaze mapping function to estimate the
screen gaze points accurately. In the following section, a technique is proposed
to eliminate the head movement effects from these pupil-glint vectors. With this
technique, screen gaze points can be estimated accurately under natural
head movements.
3.4.3 Dynamic Head Compensation Model
3.4.3.1 Approach Overview
The first step of our technique is to find a specific gaze mapping function fO1
between the pupil-glint vector v1 and the screen coordinate S at a reference 3D
eye position O1. This is usually achieved via a gaze calibration procedure using
equations 3.6. The function fO1 can be expressed as follows:
S = fO1(v1)    (3.7)
Assume that when the eye moves to a new position O2 as the head moves, a
pupil-glint vector v2 will be created in the image while the user is looking at the
same screen point S. When O2 is significantly different from O1, v2 cannot be used
as the input of the gaze mapping function fO1 to estimate the screen gaze point, due to the changes of the pupil-glint vector caused by the head movement. If the changes of the pupil-glint vector v2 caused by the head movement can be eliminated, then a corrected pupil-glint vector v2′ will be obtained. Ideally, this corrected pupil-glint vector v2′ is the pupil-glint vector v1 that would be generated by the eye at the reference position O1 when gazing at the same screen point S. Therefore, this is equivalent to finding a head mapping function g between two different pupil-glint vectors at two different head positions when still gazing at the same screen point. This mapping function g can be written as follows:
v2′ = g(v2, O2, O1)    (3.8)
where v2′ is the equivalent measurement of v1 with respect to the initial reference head position O1. Therefore, the screen gaze point can be estimated accurately from the pupil-glint vector v2′ via the specific gaze mapping function fO1 as follows:
S = fO1(g(v2, O2, O1)) = F(v2, O2)    (3.9)
where the function F can be called a generalized gaze mapping function that
explicitly accounts for the head movement. It provides the gaze mapping function
dynamically for a new eye position O2.
With the use of the proposed technique, whenever the head moves, a gaze map-
ping function at each new 3D eye position can be updated automatically; therefore,
the issue of the head movement can be solved nicely.
3.4.3.2 Image Projection of Pupil-glint Vector
In this section, we show how to find the head mapping function g. Figure 3.10
shows the process of the pupil-glint vector formation in the image for an eye in front
of the camera. When the eye is located at two different positions O1 and O2 while
still gazing at the same screen point S, two different pupil-glint vectors g1p1 and g2p2 are generated in the image. Further, as shown in Figure 3.10, a plane A parallel to the image plane that goes through the point P1 will intersect the line O1O at G1 (note that G1 is not the actual virtual image of the IR light source). Another plane B parallel to the image plane that goes through the point P2 will intersect the line O2O at G2 (likewise, G2 is not the actual virtual image of the IR light source). Therefore, g1p1 is the projection of the vector G1P1 and g2p2 is the projection of the vector G2P2 in the image plane. Because plane A, plane B and the image plane are parallel, the vectors g1p1, g2p2, G1P1 and G2P2 can be represented as 2D vectors in the X−Y plane of the camera coordinate system.
Assume that in the camera coordinate system, the 3D pupil centers P1 and P2 are represented as (x1, y1, z1) and (x2, y2, z2), the glint centers g1 and g2 are represented as (xg1, yg1, −f) and (xg2, yg2, −f), where f is the focal length of the camera, and the screen gaze point S is represented by (xs, ys, zs).
Figure 3.10: Pupil and glint image formation when the eye is located at different positions in front of the camera
Via the pinhole camera
model, the image projection of the pupil-glint vectors can be expressed as follows:
g1p1 = (−f / z1) ∗ G1P1    (3.10)
g2p2 = (−f / z2) ∗ G2P2    (3.11)
Assume that the pupil-glint vectors g1p1 and g2p2 are represented as (vx1, vy1) and (vx2, vy2) respectively, and the vectors G1P1 and G2P2 are represented as (Vx1, Vy1) and (Vx2, Vy2) respectively. Therefore, the following equations can be derived by combining equations 3.10 and 3.11:
vx1 = (Vx1 / Vx2) ∗ (z2 / z1) ∗ vx2    (3.12)
vy1 = (Vy1 / Vy2) ∗ (z2 / z1) ∗ vy2    (3.13)
The above two equations describe how the pupil-glint vector changes as the
head moves in front of the camera. Also, based on the above equations, it is obvious
that each component of the pupil-glint vector can be mapped individually. There-
fore, equation 3.12 for the X component of the pupil-glint vector will be derived
first as follows.
3.4.3.3 First Case: The cornea center and the pupil center lie on the
camera’s X − Z plane:
Figure 3.11 shows the ray diagram of the pupil-glint vector formation when
the cornea center and pupil center of an eye happen to lie on the X − Z plane of
the camera coordinate system. Therefore, either the generated pupil-glint vectors p1g1 and p2g2 or the vectors P1G1 and P2G2 can be represented as one-dimensional vectors, specifically, p1g1 = vx1, p2g2 = vx2, P1G1 = Vx1 and P2G2 = Vx2.
Figure 3.11: Pupil and glint image formation when the eye is located at different positions in front of the camera (top-down view)
According to Figure 3.11, the vectors G1P1 and G2P2 can be represented as follows:
G1P1 = G1O1′ + O1′P1    (3.14)
G2P2 = G2O2′ + O2′P2    (3.15)
For simplicity, r1 is used to represent the length of O1P1, r2 is used to represent the length of O2P2, ∠G1P1O1 is represented as α1, ∠G2P2O2 is represented as α2, ∠P1G1O1 is represented as β1 and ∠P2G2O2 is represented as β2. According to the geometry shown in Figure 3.11, the vectors G1P1 and G2P2 can be further derived as follows:
G1P1 = −(r1 ∗ sin(α1)) / tan(β1) − r1 ∗ cos(α1)    (3.16)
G2P2 = −(r2 ∗ sin(α2)) / tan(β2) − r2 ∗ cos(α2)    (3.17)
As shown in Figure 3.11, line G1P1 and line G2P2 are parallel to the X axis of the camera. Therefore, tan(β1) and tan(β2) can be obtained from the triangles g1ocO and g2ocO individually as follows:
tan(β1) = f / ocg1    (3.18)
tan(β2) = f / ocg2    (3.19)
In the above equations, g1 and g2 are the glints in the image, and oc is the principal point of the camera. For simplicity, we choose xg1 to represent ocg1 and xg2 to represent ocg2. Therefore, after detecting the glints in the image, tan(β1) and tan(β2) can be obtained accurately.
Further, sin(α1), cos(α1), sin(α2) and cos(α2) can be obtained directly from the geometries of the triangles P1SS1′ and P2SS2′. Therefore, equations 3.16 and 3.17 can be rewritten as follows:
Vx1 = r1 ∗ (zs − z1) ∗ xg1 / (P1S ∗ f) + r1 ∗ (xs − x1) / P1S    (3.20)
Vx2 = r2 ∗ (zs − z2) ∗ xg2 / (P2S ∗ f) + r2 ∗ (xs − x2) / P2S    (3.21)
3.4.3.4 Second Case: The cornea center and the pupil center do not lie
on the camera’s X − Z plane:
In fact, the cornea center and the pupil center do not always lie on the camera’s
X − Z plane. However, we can obtain the ray diagram shown in Figure 3.11 by
projecting the ray diagram in Figure 3.10 into X −Z plane along the Y axis of the
camera’s coordinate system. Therefore, as shown in Figure 3.12, point P1 is the
projection of the pupil center Pc1, point O1 is the projection of the cornea center
Oc1, and point S is also the projection of the screen gaze point S′ in the X−Z plane. Starting from Oc1, a line Oc1P1′ parallel to line O1P1 intersects line Pc1P1 at P1′. Also starting from Pc1, a line Pc1O1′′ parallel to line P1S intersects line SS′ at O1′′.
Figure 3.12: Projection into camera’s X − Z plane
Because Oc1Pc1 represents the distance r between the pupil center and the cornea center, which will not change as the eyeball rotates, O1P1 can be derived as follows:
r1 = O1P1 = r ∗ P1S / √(P1S² + (y1 − ys)²)    (3.22)
Therefore, when the eye moves to a new location O2 as shown in Figure 3.11, O2P2 can be represented as follows:
r2 = O2P2 = r ∗ P2S / √(P2S² + (y2 − ys)²)    (3.23)
After substituting the formulations of r1 and r2 into equations 3.20 and 3.21, we can obtain Vx1/Vx2 as follows:
Vx1 / Vx2 = d ∗ [(zs − z1) ∗ xg1 + (xs − x1) ∗ f] / [(zs − z2) ∗ xg2 + (xs − x2) ∗ f]    (3.24)
where d is set as follows:
d = √((z2 − zs)² + (x2 − xs)² + (y2 − ys)²) / √((z1 − zs)² + (x1 − xs)² + (y1 − ys)²)
As a result, equations 3.12 and 3.13 can be finally rewritten as follows:
vx1 = d ∗ [(zs − z1) ∗ xg1 + (xs − x1) ∗ f] / [(zs − z2) ∗ xg2 + (xs − x2) ∗ f] ∗ (z2 / z1) ∗ vx2    (3.25)
vy1 = d ∗ [(zs − z1) ∗ yg1 + (ys − y1) ∗ f] / [(zs − z2) ∗ yg2 + (ys − y2) ∗ f] ∗ (z2 / z1) ∗ vy2    (3.26)
The above equations constitute the head mapping function g between the pupil-glint
vectors of the eyes at different positions in front of the camera, while gazing at the
same screen point.
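The head mapping function g of equations 3.25 and 3.26 can be written down directly once the glint coordinates, the 3D pupil centers, the focal length and an estimate of the gaze point are available. A minimal illustrative sketch (numpy assumed; names and the exact interface are hypothetical):

import numpy as np

def head_map(v2, g2, P2, g1, P1, S, f):
    """Map the pupil-glint vector v2 at eye position P2 back to the reference
    position P1, following Eqs. 3.25 and 3.26.
    v2: (vx2, vy2); g1, g2: glint image offsets (xg, yg) from the principal
    point at the reference and new positions; P1, P2: 3D pupil centers;
    S: current estimate of the 3D screen gaze point; f: focal length."""
    x1, y1, z1 = P1
    x2, y2, z2 = P2
    xs, ys, zs = S
    d = np.linalg.norm(np.subtract(P2, S)) / np.linalg.norm(np.subtract(P1, S))
    rx = ((zs - z1) * g1[0] + (xs - x1) * f) / ((zs - z2) * g2[0] + (xs - x2) * f)
    ry = ((zs - z1) * g1[1] + (ys - y1) * f) / ((zs - z2) * g2[1] + (ys - y2) * f)
    vx1 = d * rx * (z2 / z1) * v2[0]      # Eq. 3.25
    vy1 = d * ry * (z2 / z1) * v2[1]      # Eq. 3.26
    return np.array([vx1, vy1])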
3.4.3.5 Iterative Algorithm for Gaze Estimation
Equations 3.25 and 3.26 require the knowledge of the gaze point S = (xs, ys, zs) on the screen. However, the gaze point S is the one that needs to be estimated. As a result, the gaze point S is also a variable of the head mapping function g, which can be further expressed as follows:
v2′ = g(v2, P2, P1, S)    (3.27)
Assume that a specific gaze mapping function fP1 is known via the calibration procedure described in Section 3.4.1. Therefore, after integrating the head mapping function g into the specific gaze mapping function fP1 via equation 3.9, the generalized gaze mapping function F can be rewritten as follows:
S = F(v2, P2, S)    (3.28)
Given the extracted pupil-glint vector v2 from the eye image and the new
location P2 that the eye has moved to, equation 3.28 becomes a recursive function.
An iterative solution is proposed to solve it.
First, the screen center S0 is chosen as an initial gaze point, then a corrected
pupil-glint vector v2′ can be obtained from the detected pupil-glint vector v2 via the
head mapping function g. By inputting the corrected pupil-glint vector v2′ into the
specific gaze mapping function fP1, a new screen gaze point S′ can be estimated. S′ is further used to compute a new corrected pupil-glint vector v2′. The loop continues until the estimated screen gaze point S′ no longer changes. Usually, the whole iteration process converges in fewer than 5 iterations, which makes a real-time implementation possible.
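A sketch of this fixed-point iteration is given below. It assumes a head-mapping routine such as the head_map sketch above, a calibrated specific mapping gaze_map implementing equation 3.6, and a helper to_3d that converts a 2D screen point into its 3D position on the calibrated screen plane; all names are illustrative:

import numpy as np

def estimate_gaze_iteratively(v2, g2, P2, g1, P1, f, head_map, gaze_map, to_3d,
                              screen_center, max_iters=10, tol=1e-3):
    """Solve the recursive Eq. 3.28 by fixed-point iteration.
    Starts from the screen center and repeats head correction + mapping
    until the estimated screen gaze point stops changing."""
    S = screen_center                                   # initial gaze point S0
    for _ in range(max_iters):
        v2_corr = head_map(v2, g2, P2, g1, P1, to_3d(S), f)   # Eq. 3.27
        S_new = gaze_map(*v2_corr)                      # specific mapping f_P1
        if np.linalg.norm(np.subtract(S_new, S)) < tol:
            return S_new
        S = S_new
    return S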
3.5 Experiment Results
3.5.1 System Setup
Our system consists of two cameras mounted under the computer monitor, as
shown in Figure 3.13. An IR light illuminator is mounted at the center of the lens
of each camera, which will produce the corneal glint in the eye image.
Figure 3.13: The configuration of the gaze-tracking system
Before using the system, two calibration steps are performed. The first step is to obtain the parameters of the stereo camera system through camera calibration [130]. Once the stereo camera system is calibrated, given any point Pi in front of it, the 3D position (xi, yi, zi)^T of Pi can be reconstructed from the image points of Pi in both cameras. The second step is
be reconstructed from the image points of Pi in both cameras. The second step is
to obtain the 3D positions of the IR LEDs and the computer screen in the stereo
camera system. Since the IR LEDs and the computer screen are located outside
the field of view of the stereo camera system, they cannot be observed directly by the
stereo camera system. Therefore, similar to [6, 101], a planar mirror with a set of
fiducial markers attached to the mirror surface is utilized. With the help of the
planar mirror, the virtual images of the IR LEDs and the computer screen reflected
by the mirror can be observed by the stereo camera system. Thus, the 3D locations
of the IR LEDs and the computer screen can be calibrated after obtaining the 3D
locations of their virtual images.
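The 3D reconstruction step mentioned above can be illustrated with standard linear (DLT) triangulation from the two calibrated views; this is a generic sketch, not necessarily the exact routine used by the system:

import numpy as np

def triangulate(P_left, P_right, x_left, x_right):
    """Linear (DLT) triangulation of a 3D point from a calibrated stereo pair.
    P_left, P_right: 3x4 projection matrices; x_left, x_right: matching
    image points (u, v) in pixels."""
    A = []
    for P, (u, v) in ((P_left, x_left), (P_right, x_right)):
        A.append(u * P[2] - P[0])        # each view contributes two linear rows
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    return X[:3] / X[3]                  # homogeneous -> Euclidean 3D point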
In the following, some experiment results of both gaze-tracking techniques are
reported.
3.5.2 Performance of 3D Gaze Tracking Technique
3.5.2.1 Gaze Estimation Accuracy
Once the system is calibrated and the angle deviation between the visual axis
and the optic axis for a new user is obtained, his screen gaze point can be determined
by intersecting the estimated 3D visual axis of the eye with the computer screen.
In order to test the accuracy of the gaze-tracking system, seven users were involved
in the experiments; none of them wore glasses.
Personal calibration is needed for each user before using our gaze-tracking system in order to obtain the angle deviation between the visual axis and the optic axis. The calibration is very fast and lasts less than 5 seconds. Once the calibration is done, the user does not need to repeat it in order to use the system later.
During the experiments, a marker is displayed at nine fixed locations on the screen in random order, and the user is asked to gaze at the marker when it appears at each location. The experiment contains five 1-minute sessions. In each session, the user is required to deliberately position his head at a different location. Table 3.1 summarizes the computed gaze estimation accuracy for the first subject, where
the last column represents the average distance from the user to the camera during
each session. As shown in Table 3.1, the accuracy of the gaze tracker significantly
depends on the user’s distance to the camera. Normally, as the user moves closer
to the camera, the gaze accuracy will increase dramatically. This is because the
resolution of the eye image increases as the user moves closer to the camera.
Table 3.1: The gaze estimation accuracy for the first subject.
Session Horizontal accuracy Vertical accuracy Distance to the camera
1 5.02 mm (0.72°) 6.40 mm (0.92°) 280 mm
2 7.20 mm (0.92°) 9.63 mm (1.22°) 320 mm
3 9.74 mm (1.24°) 13.24 mm (1.68°) 370 mm
4 12.47 mm (1.37°) 17.30 mm (1.90°) 390 mm
5 19.60 mm (1.97°) 24.32 mm (2.45°) 440 mm
Table 3.2 summarizes the computed average gaze estimation accuracy for all
the seven subjects during the experiments. Specifically, in the experiment, the average covered head movement volume is around 200 mm in each of the X, Y and Z directions. In addition, the average angular gaze accuracy in the horizontal direction is 1.47° and the average angular gaze accuracy in the vertical direction is 1.87°
for all these seven users, which is acceptable for many Human Computer Interaction
(HCI) applications, allowing natural head movements.
Table 3.2: The gaze estimation accuracy for seven subjects
Subject Horizontal accuracy (degrees) Vertical accuracy (degrees)
1 1.24 1.63
2 1.28 1.70
3 1.33 1.74
4 1.39 1.79
5 1.43 1.87
6 1.66 2.05
7 1.97 2.32
3.5.2.2 Comparison with Other Methods
Table 3.3 shows the comparison of accuracy and allowable head movements
among several practically working gaze-tracking systems that allow natural head
movements. In addition, all of these systems were built recently and require only
a very simple personal calibration instead of a tedious gaze mapping function cal-
ibration. For simplicity, only the depth or Z direction movement is illustrated, as
shown in the second column of Table 3.3. We can see that our proposed technique
can provide a competitive gaze accuracy as well as a large head movement volume
without the help of a face tracking system. Therefore, it represents the state of the
art in 3D gaze-tracking research.
Table 3.3: Comparison with other systems
Methods   Head movement volume (Z)   Best accuracy   Features
[101]     < 70 mm                    0.4°            1 stereo camera, eye tracking
[6]       N/A, but > 70 mm           0.6°            2 stereo cameras, face tracking, eye tracking
Ours      around 200 mm              0.72°           1 stereo camera, eye tracking
[133]     around 500 mm              5°              single camera, eye tracking
3.5.3 Performance of 2D Mapping Based Gaze Tracking Technique
3.5.3.1 Head Compensation Model Validation
For the proposed 2D mapping-based gaze-tracking technique, equations
3.25 and 3.26 of the head mapping function g are validated first by the following
experiments.
A screen point Sc = (132.75,−226.00,−135.00) is chosen as the gaze point.
The user gazes at this point from twenty different locations in front of the camera;
at each location, the pupil-glint vector and the 3D pupil center are collected. The
3D pupil centers and the pupil-glint vectors of the first two samples P1,P2 are shown
in Table 3.4, where P1 serves as the reference position. The second column indicates
the original pupil-glint vectors, while the third column indicates the transformed
pupil-glint vectors by the head compensation model. The difference between the
transformed pupil-glint vector of P2 and the reference pupil-glint vector at P1 is
defined as the transformation error. Figure 3.14 illustrates the transformation errors
for all these twenty samples. It is observed that the average transformation error is
only around 1 pixel, which validates our proposed head compensation model.
Table 3.4: Pupil-glint vector comparison at different eye locations
3D pupil position (mm)      2D pupil-glint vector (pixel)   Transformed pupil-glint vector (pixel)
P1 (5.25, 15.56, 331.55)    (9.65, -16.62)                  (9.65, -16.62)
P2 (-8.13, 32.29, 361.63)   (7.17, -13.33)                  (8.75, -16.01)
Figure 3.14: The pupil-glint vector transformation errors: (a) transformation error on the X component of the pupil-glint vector, (b) transformation error on the Y component of the pupil-glint vector
3.5.3.2 Gaze Estimation Accuracy
In order to test the accuracy of the gaze-tracking system, several users were
asked to participate in the tests.
For the first user, the gaze mapping function calibration was performed when
the user was sitting approximately 330 mm from the camera. After the calibration,
the user was asked to stand up for a while. Then, the user was asked to sit approx-
imately 360 mm from the camera and follow a shining object that would be displayed at 12 different pre-specified positions across the screen. The user was asked to reposition
his head to a different position before the shining object moved to the next position.
Figure 3.15 displays the error between the estimated gaze points and the actual
gaze points. The average horizontal error is around 4.41 mm on the screen, which corresponds to around 0.51° angular accuracy. The average vertical error is around 6.62 mm on the screen, which corresponds to around 0.77° angular accuracy. Also, it shows that our proposed technique can handle head movements very well.
Figure 3.15: The plot of the estimated gaze points and the true gaze points, where "+" represents the estimated gaze point and "*" represents the actual gaze point
When the user moves his head away from the camera, the eye in the image
will become smaller. Due to the increased pixel measurement error caused by the
lower image resolution, the gaze accuracy of the eye gaze tracker will decrease as
the user moves away from the camera.
In this experiment, the effect of the distance to the camera on the gaze accuracy
of our system is analyzed. A new user was asked to perform the gaze calibration
when he was sitting around 360 mm from the camera. After the calibration, the user was positioned at five different locations with different distances to the camera, as listed in Table 3.5. At each location, the user was asked to follow a moving object that was displayed at 12 predefined positions across the screen. Table
3.5 lists the gaze estimation accuracy at these five different locations, which shows that as the user moves away from the camera, the gaze resolution decreases. But within the space allowed for head movement, approximately 200 × 200 × 300 mm (width × height × depth) at 450 mm from the camera, the average horizontal angular accuracy is around 1.16° and the average vertical angular accuracy is around 1.42°, which is acceptable for most Human Computer Interaction applications. Also, this space volume allowed for the head movement is large enough for a user to sit comfortably in front of the camera and communicate with the computer naturally.
Table 3.5: Gaze estimation accuracy under different eye image resolutions
Distance to the camera (mm)   Horizontal accuracy (degrees)   Vertical accuracy (degrees)
300.26   0.52   0.61
340.26   0.68   0.83
400.05   1.31   1.41
462.23   1.54   1.90
552.51   1.73   2.34
3.6 Comparison of Both Techniques
Two different gaze tracking techniques are discussed in this chapter. In this
section, we briefly summarize their differences. The 2D mapping-based gaze esti-
mation method does not require knowledge of the 3D direction of the eye gaze to
determine the gaze point; instead, it estimates the gaze point on the object directly
from a gaze-mapping function by inputting a set of features extracted from
the eye image. The gaze-mapping function is usually obtained through a calibration
procedure repeated for each person.
The calibrated gaze mapping function is very sensitive to head motion; con-
sequently, a complicated head-motion compensation model is proposed to eliminate
the effect of head motion on the gaze-mapping function. Thus the 2D mapping-
based method can work under natural head movement. Since the 2D mapping-based
method is proposed mainly to estimate the gaze points on a specific object, a new
gaze-mapping function calibration must be performed each time a new object
is presented.
In contrast, the 3D gaze estimation technique estimates the 3D direction of
the visual axis directly, and determines the gaze by intersecting the visual axis with
the object in the scene. Thus, it can be used to estimate the gaze point on any
object in the scene without the use of tedious gaze-mapping function calibration.
Furthermore, since this method is not constrained by head position, the complicated
head-motion compensation model can be avoided. But the 3D technique needs a
stereo camera system, and the accuracy of the 3D gaze estimation technique is
affected by the accuracy of the stereo camera system.
In terms of accuracy, the experiments indicate that the 2D mapping-based
gaze estimation technique is more accurate than the 3D gaze-tracking technique.
For example, for a user who is sitting approximately 340 mm from the camera,
the 2D mapping-based gaze estimation technique can achieve 0.68° accuracy in the horizontal direction and 0.83° accuracy in the vertical direction, whereas the direct 3D gaze estimation technique only achieves 1.14° accuracy in the horizontal direction and 1.58° accuracy in the vertical direction. Therefore, we can see that
the accuracy of the direct 3D gaze estimation technique can still be improved.
3.6.1 Processing Speed
Both gaze tracking techniques proposed in this chapter are implemented in C++ on a PC with a 2.80 GHz Xeon CPU and 1.00 GB of RAM. The image resolution of the cameras is 640 × 480 pixels, and the built gaze-tracking systems run comfortably at approximately 25 fps.
3.7 Chapter Summary
In this chapter, two different techniques are proposed to improve the existing
gaze-tracking techniques. First, a novel 2D mapping-based gaze estimation tech-
nique is proposed to allow free head movement and simplify the personal calibration
procedure. Therefore, the eye gaze can be estimated with high accuracy under nat-
ural head movement, with the personal calibration being minimized simultaneously.
Second, a simple method is proposed to estimate the 3D gaze direction of the user
without using any user-dependent parameters of the eyeball. Therefore, it is more
feasible to work on different individuals without tedious calibration. By the novel
techniques proposed in this chapter, the two common drawbacks of the existing eye
gaze trackers can be minimized or eliminated nicely so that the eye gaze of the user
can be estimated accurately under natural head movements, with minimum personal
calibration.
CHAPTER 4
Robust Face Tracking Using Case-Based Reasoning with
Confidence
4.1 Introduction
In reality, there are significant variations in the captured face images over time.
Image variations are usually caused by a number of factors such as external lighting
changes, occlusions, facial expressions, head movements, camera view changes, etc.
Due to the lack of an effective tracking framework to cope with the appearance
changes of the face being tracked, tracking faces robustly and accurately in video sequences remains a very challenging problem.
Numerous approaches have been proposed to track rigid or non-rigid objects
throughout video sequences. However, most of them still suffer from the well-known drifting issue, being incapable of assessing tracking failures or recovering from any possible tracking error. For example, the tracking techniques proposed in [7, 60] keep the object model or template fixed during tracking. Clearly, the template
cannot adapt to significant appearance changes under conditions such as occlusions,
or lighting changes, or more importantly the internal geometry deformations or
appearance changes of the non-rigid objects. Hence, they can only work well when
the appearance of the object does not vary significantly during tracking. Otherwise,
it will drift away and start to track the wrong object instead. An intuitive strategy is
to update the template of the object to account for appearance changes whenever the
object’s appearance varies. However, it is always a challenging task to automatically
provide an accurate object template for updating.
In this chapter, a robust visual object tracking framework based on Case-
Based Reasoning paradigm is proposed to provide an accurate 2D tracking model
for the object being tracked dynamically at each image frame. As a result, the drift-
ing issue that plagues most of the tracking techniques can be solved successfully.
Furthermore, under the CBR paradigm, since the tracked view is always adapted
from its most similar case or image view of the object in a case base, an accurate
similarity measure can always be obtained to characterize the confidence level
of the tracked region. Therefore, when it starts to track a wrong object, the con-
fidence level associated with the tracked region will become low so that the failure
situations can be detected in time. Based on the proposed CBR visual object track-
ing framework, a real-time face tracking system was built so that the face can be
tracked robustly under significant changes in lighting, scale, facial expression and
head movement.
4.2 Related Works
Numerous techniques have been proposed to solve the drifting issue during
tracking. In this section, only several representative techniques are discussed. In
[72], a template update algorithm is proposed to avoid drifting during tracking by
storing the first template throughout the tracking. However, it only works when
the appearance of the object being tracked does not change significantly from the
stored object template. As a result, when new parts of the object come into view
or the appearance of the object varies significantly during tracking, the proposed
updating strategy fails. In [118], the template at each frame is dynamically updated
by a re-registration technique during tracking. The registration technique matches
the tracking template with a set of collected key-frames and then eliminates any un-
registered pixels from the tracking template. However, the un-registered pixels may
represent the object appearance changes instead of the non-object pixels. Therefore,
once they are eliminated, the object appearance changes cannot be adapted, and it
will be more difficult to track the object in the subsequent frames.
In [39], an adaptive appearance model is proposed to track complex natural
objects based on a subspace learning technique. In this technique, updating the
model is equivalent to finding a subspace representation that can best approximate
a given set of observations from previous frames. However, during subspace updat-
ing, the tracked object image at a new frame is directly integrated into the subspace
learning without any error correction. Therefore, once a tracked object view con-
tains error, which usually consists of the background pixels or non-object pixels, it
may be integrated into the learnt subspace and errors accumulate throughout the
tracking. Finally, it may become a severe issue that leads the tracker to drift away
from the object being tracked once the learnt subspace cannot represent the ob-
ject accurately any more. Therefore, the key issue is how to eliminate or minimize
the error accumulated in each frame before learning. Once it can be eliminated
or minimized, the drifting issue can be solved. Unfortunately, in their subsequent
works [66, 63], efforts were still focused on the online learning strategy, ignoring the
potential errors associated with each tracked image. Therefore, it is believed that it
will still suffer from the drifting issue as demonstrated by experiments discussed in
Section 4.5 of this chapter. Very similar to these methods, the technique proposed in
[50] also could not eliminate the background pixels contained in the tracked object
views before updating the object appearance model. Therefore, drifting is still a
significant issue.
In [92, 75], a modified two-frame tracker is proposed, where multiple past
frames that are closest to the previous frame in pose space are used to refine the
pose information. Hence, each image must be annotated with the pose information.
An important assumption of this method is that pose information governs every-
thing about the image appearance changes of the object being tracked. The pose
information for each image frame must therefore be estimated accurately and the
face region must be correctly segmented. As a result, the tracking performance is
limited by the accuracy of the estimated pose information. Furthermore, it only
works with rigid objects without significant illumination changes or occlusions.
Very similar techniques [110] are proposed to reduce drifting by learning the online
and offline view information of the object. However, a 3D model of the object being
tracked must be built and the object is represented by a 3D mesh. In addition, the
internal camera parameters must be calibrated and fixed during tracking. In our
method, we propose to overcome these limitations so that the tracker works on any
object, whether rigid or non-rigid, and under illumination changes.
In addition, most of the above methods fail to identify failures and cannot
provide confidence levels that measure the goodness of each tracked region (the
probability that the tracked region contains the object). However, via the proposed
CBR visual tracking with confidence framework, each tracked region can be as-
sessed with a score to indicate the confidence level. Furthermore, in our constructed
case base for the proposed CBR framework, since each individual exemplar serves
as a unique case, the complicated image distribution modelling from the collected
training exemplars [107] can be avoided.
In summary, compared to the existing techniques, our proposed technique has
the following advantages: (1) avoiding the drifting issue during tracking; (2) no need for a 3D model; (3) not restricted to rigid objects, capable of tracking any object; (4) no need for a calibrated camera; (5) more accurate than the synthesized image patches used for matching; and (6) the ability to assess tracking failures with a confidence level.
4.3 The Mathematical Framework
4.3.1 2D Visual Tracking
Assume that an object O is moving in front of a video camera. At time t,
the object is captured as an image view I(Xt) at position Xt in the image frame
It. Then the task of 2D visual tracking is to search for the image view I(Xt) of the
object O in each image frame It.
The most straightforward way is to place the object appearance model I_M^t at all the possible positions in the image frame It and find the position that produces the best matching results. Assuming the located image view is I(X′t), the tracking error ∆I_0^t can be represented as the difference between the true image view I(Xt) and the located image view I(X′t) of the object:
∆I_0^t = I(Xt) − I(X′t)    (4.1)
Given the object model I_M^t, if we assume that its most similar image view can be successfully located in the image frame It, then apparently the tracking error ∆I_0^t is mostly caused by the inaccuracy of the utilized object model I_M^t.
If the tracking error ∆I_0^t exists consistently at each image frame during tracking, it will accumulate and eventually force the tracker to drift away from the object. As a result, this well-known drifting issue happens because of the lack of an accurate tracking model of the object for each image frame during tracking. Therefore, a key issue for a successful visual tracking technique is how to obtain an accurate 2D tracking model I_M^t of the object at each image frame It during tracking.
4.3.2 The Proposed Solution
In this section, an algorithm is proposed to maintain an accurate tracking
model for each image frame so that the tracking error ∆I0 can always be minimized
during tracking. Figure 4.1 illustrates the major steps of the proposed algorithm.
Figure 4.1: The diagram of the proposed algorithm to improve the accuracy of the 2D object tracking model
Assume that the image appearance varies gradually between two consecutive
image frames It and It+1, and the object image I(Xt) at the image frame It is
already located. As shown in Figure 4.1, the first step is to locate the object in
the image frame It+1 using the tracked 2D view I(Xt). Let I(X′t+1) be the located object view in image frame It+1. In practice, due to the time passing between two consecutive image frames, the generated image views of the object usually appear quite different because of camera view changes, illumination changes, or object deformations, etc. As a result, the located image I(X′t+1) is usually not accurate, because the image I(Xt) is no longer an accurate model for the image frame It+1 due to the appearance changes. However, the located view I(X′t+1) of the object usually contains information about the object appearance at the current time t + 1, partially or completely. Therefore, the located image view I(X′t+1) represents an important information source that can be utilized to infer the true view of the object in the image frame It+1.
The second step is to infer the object appearance model I_M^{t+1} from the located image view I(X′t+1). Intuitively, the object model I_M^{t+1} will best match the located image view I(X′t+1). In theory, if all the possible 2D image views of the object are available in advance, then the object model I_M^{t+1} can be found by matching with the located image view I(X′t+1). Therefore, the goal is to search the object view space IM for the specific view of the object that minimizes the error with respect to the located view I(X′t+1) as follows:
I_M^{t+1} = I_Mk∗ = arg min_k [I_Mk − I(X′t+1)]²    (4.2)
The obtained object model I_M^{t+1} will adapt to the appearance changes so that it can represent the appearance of the object in the image frame It+1 more closely. Therefore, the object model I_M^{t+1} can be utilized to locate the object in the image frame It+1.
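Equation 4.2 amounts to a nearest-neighbor search over the candidate views with a sum-of-squared-differences criterion. A minimal sketch, assuming same-sized image patches and numpy (names are illustrative):

import numpy as np

def retrieve_case(located_view, case_base):
    """Select the object model by minimizing Eq. 4.2 over a list of stored
    2D views (the case base). Returns the index and the best-matching view."""
    probe = np.asarray(located_view, float).ravel()
    errors = [np.sum((np.asarray(c, float).ravel() - probe) ** 2)
              for c in case_base]                 # sum-of-squared-differences
    k = int(np.argmin(errors))
    return k, case_base[k]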
In order to do this, a set of 2D image views of the object must be available
in advance. However, in practice, it is infeasible to enumerate all the possible 2D
views of the object. In order to solve this issue, a feasible solution is to collect only
a set of representative 2D views of the object and then any unseen image view can
be adapted from them with some image adaptation strategy.
Fortunately, the above proposed solution can be well-interpreted and imple-
mented in a Case-Based Reasoning paradigm. As shown in Figure 4.1, based on the
located image view I(X′t+1), an object model I_M^{t+1}′ is selected from a constructed case base. Then it will be used to locate the object in the image frame with a proposed adaptation mechanism. Via the proposed CBR visual tracking framework, a more accurate 2D tracking model I_M^t can be obtained at each image frame It. Therefore, the drifting issue that accompanies most visual tracking techniques can be alleviated.
4.4 The CBR Visual Tracking Algorithm
4.4.1 Case-Based Reasoning
Case-Based Reasoning (CBR) is a problem solving and learning approach that
has grown into a field of widespread interest in both academia and industry since
the 1990s [4]. It is based on the view that a significant portion of human cognition
and problem solving involves recalling entire prior experiences or cases, rather than
just pieces of knowledge at the granularity of a production rule [96]. Therefore, the
basic assumption of CBR is that individuals have numerous experiences that have
been indexed in their memory to be used in new situations. Therefore, when asked
to solve a problem, humans typically search their memory for past experiences that
can be reapplied in this new situation. In short, CBR is a model of reasoning that
incorporates problem solving, understanding, and learning with memory processes.
The processes that make up CBR can be seen as a reflection of a particular type of
human reasoning. In many situations, the problems that human beings encounter
are solved with a human equivalent of CBR. Detailed information about CBR can
be found in [62].
From a CBR perspective, the problem of visual object tracking in a video
sequence can be solved similarly by retrieving and adapting the previously seen
views of the object to a new view of the object at each image frame. Specifically,
how the Case-Based Reasoning paradigm is used to build a face tracking system is
demonstrated as follows.
Figure 4.2 illustrates a general CBR cycle for the built face tracking system.
As shown in Figure 4.2, the whole CBR face tracking system is made up of four
processes:
1. RETRIEVE the most similar face images from a case base composed of the
collected representative face images.
2. REUSE the information and knowledge in the retrieved face image to refine
the previously located face image by adapting to the face appearance changes.
3. REVISE the proposed solution, which is to evaluate the located face image
and give a confidence measure.
4. RETAIN a new solution, which is to add the new tracked face image into the
case base as a new case if necessary.
In the following sections, each process of the CBR face tracking framework
will be discussed in detail.
Figure 4.2: The Case-Based Cycle of the face tracking system.
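Read as pseudocode, these four processes can be chained per frame roughly as in the sketch below; the helper callables (locate, retrieve, adapt) are hypothetical placeholders for the tracking, retrieval, and adaptation steps detailed in the following sections, and the 0.5 confidence threshold is taken from the experiments in Section 4.5.2.

def cbr_track_frame(frame, prev_view, case_base, temp_db,
                    locate, retrieve, adapt, conf_threshold=0.5):
    """One pass of the CBR cycle for a single frame (sketch only)."""
    # Locate the face in the new frame using the previously tracked view.
    located_view, located_pos = locate(frame, prev_view)

    # RETRIEVE the case most similar to the located view (Gabor-space match).
    case_view = retrieve(located_view, case_base)

    # REUSE the retrieved case to refine the located view (two-step adaptation).
    final_view, final_pos, confidence = adapt(frame, located_view,
                                              located_pos, case_view)

    # REVISE: low-confidence views are set aside in a temporary database.
    if confidence < conf_threshold:
        temp_db.append(final_view)

    # RETAIN happens offline: temp_db is reviewed periodically and useful
    # views are promoted into case_base.
    return final_view, final_pos, confidence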
4.4.2 Case Base Construction
In a CBR system, the case base is the set of all the cases that are used by
the reasoner. It provides suggestions of solutions to problems for understanding or
assessing a situation. In the proposed CBR visual tracking framework, an appropri-
ate case base should consist of a set of views of the object as the prior knowledge or
historical cases for efficient tracking. In particular, each case needs to be representative,
and the cases should be evenly distributed in the 2D view space of the object.
To construct such a case base, initially, a set of representative 2D object views
is put into the case base. Then, the case base is enriched incrementally with a
training process. That is, during the training period, every time there is a tracking
failure due to the lack of similar cases in the case base, the corresponding new image
view is added into the case base. The training process aims to find a balance between
tracking failures due to the lack of similar cases and the size of the case base. After
the training process, the case base is updated in the retaining step, which will be
discussed later.
One significant advantage of building such a case base consisting of 2D image
views is to avoid constructing an explicit model for the object directly, which usually
requires complicated image distribution modelling or tedious hand-construction of
an object model. Instead, CBR provides a feasible way to represent such a complicated
object model by simply collecting all the representative 2D views of the object in a
case base. As a result, the object model I_M can simply be selected from a set of m
discrete 2D views {I^1_M, · · · , I^m_M}.
Once the face case base is constructed, the face in a video sequence can be
tracked efficiently as described below.
4.4.3 Case Retrieving
The proposed framework starts with searching for the most similar 2D view
I(X'_{t+1}) of the object in the current frame I_{t+1} to its tracked 2D view I(X_t) in the
previous frame I_t. Once the view I(X'_{t+1}) is located, the next step is to search
the case base for the case I^{t+1}_{M'} most similar to the located view I(X'_{t+1}). This
is done by comparing all the collected cases in the case base one by one and then
choosing the one that produces the maximum similarity with the view I(X'_{t+1}) in
the Gabor space.
Object Appearance Representation. Besides its raw pixel intensities, there are
many other representations of an image appearance that one could learn for robust
object matching, such as color statistics [12], multi-scale filter responses [50], etc.
Since Gabor wavelets can represent local features effectively, we apply a set of
multi-scale and multi-orientation Gabor wavelet filters to the image appearance I
of an object. Given a pixel p ∈ I, its Gabor response f(p) is represented by a
vector composed of the computed Gabor filter coefficients. Since f(p) captures the
information of an entire neighborhood of p, there is no need to compute the Gabor
responses for all the pixels in I. Therefore, the image appearance of an object
is represented by an ordered collection ω of the Gabor responses at n uniformly
sampled locations p_1, ..., p_n with a fixed spacing:

ω = (f(p_1), · · · , f(p_n))    (4.3)
Object Searching. Assume that a pixel p_t at frame t corresponds to the image
location p_{t+1} = ϕ(p_t, π) at frame t + 1, where ϕ(p_t, π) is a coordinate transformation
function with parameter vector π. Then, object searching is conducted to find the
parameter vector π that gives the optimal ω with respect to the similarity measured
in the Gabor space. This is equivalent to maximizing the sum of a set of local
similarity measures over ω as follows:

π* = arg max_π Σ_{i=1}^{n} S_i(f(p_i), f(ϕ(p_i, π)))    (4.4)

where S_i(f(p_i), f(ϕ(p_i, π))) is the similarity measure between the Gabor response
vectors f(p_i) and f(ϕ(p_i, π)). In addition, the final similarity S'_{t+1} of the tracked
image view I(X'_{t+1}) is characterized by the average of its local similarity measures,
S'_{t+1} = (1/n) Σ_{i=1}^{n} S_i.
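As a rough illustration of Equations 4.3 and 4.4, the sketch below scores candidate translations of the sampling grid by summing local similarities between Gabor response vectors. The gabor_response callable is a hypothetical stand-in for the multi-scale, multi-orientation filter bank, real-valued response vectors (e.g., coefficient magnitudes) are assumed, and only pure translation is searched, whereas ϕ(p, π) in the text may be a more general transformation.

import numpy as np

def local_similarity(a, b):
    """Local similarity S_i between two Gabor response vectors (cosine form)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def search_translation(gabor_response, grid, model_vectors, candidates):
    """Find the translation pi* maximizing the summed local similarities (Eq. 4.4).

    gabor_response(x, y) -> response vector at an image location (assumed given);
    grid: sampled locations p_1..p_n; model_vectors: their model responses;
    candidates: iterable of (dx, dy) translations to score.
    """
    best_pi, best_score = None, -np.inf
    for dx, dy in candidates:
        score = sum(local_similarity(f_model, gabor_response(px + dx, py + dy))
                    for (px, py), f_model in zip(grid, model_vectors))
        if score > best_score:
            best_pi, best_score = (dx, dy), score
    # The average local similarity serves as the final similarity S'_{t+1}.
    return best_pi, best_score / len(grid)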
4.4.4 Case Adaptation (Reusing)
After the view I^{t+1}_{M'} most similar to the tracked view I(X'_{t+1}) is retrieved from
the case base, a two-step procedure is proposed to adapt the selected view I^{t+1}_{M'}
to the image frame I_{t+1} to obtain the final face view.
In the first step, the selected image view I^{t+1}_{M'} is utilized as a tracking model to
perform a search starting at X'_{t+1} in the image frame I_{t+1} for a most similar image
view. It is assumed that an image view I(X''_{t+1}) is located at X''_{t+1} in the image with
a similarity measurement S''_{t+1}.
Subsequently, the second step is to combine the tracked image views I(X'_{t+1})
and I(X''_{t+1}) to obtain the final image view I(X_{t+1}) as follows:

X_{t+1} = [S'_{t+1} / (S'_{t+1} + S''_{t+1})] X'_{t+1} + [S''_{t+1} / (S'_{t+1} + S''_{t+1})] X''_{t+1}    (4.5)
The above formula shows that the tracked target view is the combined result
of the tracked view in the previous frame and the selected case view in the case
base. Intuitively, it can be seen as a minimization of the sum of two errors, the error
between the target view in the current frame and the tracked view in the previous
frame, as well as the error between the target view in the current frame and the
selected similar case view. In other words, the tracked target view possesses one
important property: it must be similar to the tracked view in the previous frame as
well as to the selected case view in the case base. Only when the tracked view satisfies
this property can the tracking succeed.
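A minimal sketch of the combination step in Equation 4.5; the two candidate positions are blended with weights proportional to their similarity scores, and the variable names and example numbers are illustrative.

import numpy as np

def combine_positions(x_prev_based, s_prev, x_case_based, s_case):
    """Blend the positions X'_{t+1} and X''_{t+1} as in Equation 4.5."""
    x1 = np.asarray(x_prev_based, dtype=float)   # located with the previous view
    x2 = np.asarray(x_case_based, dtype=float)   # located with the retrieved case
    total = s_prev + s_case
    return (s_prev / total) * x1 + (s_case / total) * x2

# Example: two nearby locations with similarities 0.8 and 0.6.
print(combine_positions((100, 52), 0.8, (104, 50), 0.6))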
4.4.5 Case Revising and Retaining
Case revising is a phase proposed to evaluate the tracking result and produce
a confidence measure. Once the final image view I(X_{t+1}) is obtained, another
search is conducted in the case base to find the most similar case, and a similarity
score S_{t+1} is derived from this search. In practice, if the similarity score is high,
tracking is usually successful; otherwise, it may fail. Therefore, the derived similarity
score S_{t+1} is simply utilized as a confidence measure to characterize the final image
view I(X_{t+1}). The system automatically reports the confidence measure and stores
those image views with a low confidence level in a temporary case database. This
temporary case database is retained for further use.
The retaining process enriches the case base by adding new representative
image views. To retain a new case, the image views in the temporary database have
to be reviewed periodically, so that only the useful image views are selected and
added into the case base. Using this process, it is expected that the case base will
include more and more representative 2D views of the object.
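The revise and retain decisions amount to a simple thresholding-and-review policy; the sketch below paraphrases that policy under stated assumptions (the 0.5 threshold is taken from Section 4.5.2, and the criterion for judging a view representative is left to a caller-supplied check).

def revise_and_retain(tracked_view, confidence, temp_db, case_base,
                      is_representative, conf_threshold=0.5):
    """REVISE: set low-confidence views aside; RETAIN: periodically promote
    reviewed, representative views into the case base (sketch only)."""
    if confidence < conf_threshold:
        temp_db.append(tracked_view)          # candidate for later review

    # Periodic offline review: keep only the views judged representative.
    keep, promote = [], []
    for view in temp_db:
        (promote if is_representative(view, case_base) else keep).append(view)
    case_base.extend(promote)
    temp_db[:] = keep
    return confidence >= conf_threshold       # True if tracking deemed successful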
4.5 Experiment Results
Based on the proposed CBR visual tracking framework, a real-time face track-
ing system was built. When a person appears in the view of the camera, the per-
son’s face is automatically localized via the proposed frontal face detector [111] and
tracked subsequently by our proposed face tracker. Specifically, in the built face
tracking system, there are a total of 120 face images in the currently constructed case
base, which will grow with the use of the tracker. These face images were collected
from different subjects under various face orientations, facial expressions, and illu-
mination changes. Although the current case base is small, experiments indicated
that it is good enough to track most people’s faces robustly. To test the perfor-
mance of the built face tracking system, a set of face video sequences with several
new subjects was collected.
Our experiments first focused on demonstrating two significant advantages of
the proposed CBR tracking framework: drifting-elimination and confidence-assessment
capabilities. Subsequently, experiments were conducted to demonstrate that the face
tracker works well with different individuals, under different external illuminations,
different facial expressions, and various face orientations.
4.5.1 Drifting-Elimination Capability
In this experiment, the self-correction or drifting-elimination capability of the
proposed tracking scheme is demonstrated. A face sequence was collected under
significant head movements and facial expressions as shown in Figure 4.3. It is 20
seconds long and was recorded at 30 fps at a 320×240 pixel gray-scale image
resolution.
Figure 4.3: Comparison of the face tracking results with different techniques. The first row shows the tracking results by the incremental subspace learning technique, and the tracked face is marked by a red rectangle; the second row shows the tracking results by our proposed technique, and the tracked face is marked by a dark square. The images are frames 26, 126, 352, 415 and 508 from left to right.
Three other popular tracking techniques were applied to track the face in
the video sequence as well. The first technique is the traditional two-frame-based
tracking method [72], which utilizes the previously tracked face image to update the
tracking model dynamically. The second technique is the offline tracking method,
which utilizes the selected face image most similar to the previously tracked face
view from the case base to update the tracking model dynamically. This method is
very similar to our proposed CBR tracking method, except that it does not adapt
the selected face image to the current image. The last one is the tracking technique
with incremental subspace learning [66], which dynamically updates the object
model by incrementally learning its eigen-basis from a set of previously tracked faces.
In order to measure tracking performance, the face center at each frame in
this video sequence was manually identified, and served as the ground truth. From
Figure 4.4 (a), it is obvious that the tracking error of the two-frame tracker accumu-
lates as the tracking continues. Eventually it drifts away from the face and tracks
the wrong object. In addition, as shown in Figure 4.4 (b), using only the offline face
information without any adaptation to the current view of the face, the offline tracking
technique is very unstable and inaccurate. Furthermore, Figure 4.4 (c) shows that
the tracking technique with incremental subspace learning also drifts and tracks
the wrong object eventually. One possible reason is that the tracked face view
contains errors or non-face image pixels, which are learned and accumulated in
the model throughout the video sequence and eventually lead to drifting. In contrast,
our proposed tracking method gradually eliminates the tracking error in each frame
during tracking, while still tracking the face robustly. Figure 4.3 clearly shows
that the proposed tracker outperforms the incremental subspace learning tracker.
Figure 4.4: Comparisons of the tracked face position error: (a) between the proposed tracker and the two-frame tracker; (b) between the proposed tracker and the offline tracker; (c) between the proposed tracker and the incremental subspace learning tracker.
4.5.2 Confidence-Assessment Capability
When both the two-frame tracker and the subspace learning tracker fail to
track the face correctly in the above sequence, they cannot automatically detect
failures and still continue tracking. Figure 4.5 (a) shows the similarities computed
by the two-frame tracker. It illustrates that even when the accumulated tracking error is
severe enough to start losing the face, the computed similarity remains high, falsely
indicating that the face is still being tracked successfully.
The incremental subspace learning tracking technique has the same problem.
As shown in Figure 4.5 (b), when the drifting becomes severe, the RMSE associated
with the tracked face view reported by the tracker actually decreases. On the other
hand, since our proposed method still tracks the face successfully, the computed
confidence level maintains high scores, as shown in Figure 4.5 (c). In practice, we
found that when the confidence scores are higher than 0.5, the tracking is usually
successful.
Figure 4.5: (a) The similarities computed by the two-frame tracker; (b) the RMSE errors computed by the incremental subspace learning tracker; (c) the confidence scores computed by our proposed tracker.
In addition, when the tracker fails to track the face, our proposed technique
can indicate the failures through confidence measurements. In order to demonstrate
this capability, another face image sequence was collected under significant face
rotations, facial expressions, and occlusions. It contains around 600 image frames
and corresponds to about 20 seconds of video.
Some selected image frames with tracked faces are illustrated in Figure 4.6. An
occlusion by the hand happens approximately from frame 205 to frame 238, going from
partial occlusion to full occlusion and then back to partial occlusion. Figure 4.7 plots
the computed confidence measurements for the tracked faces in this sequence. It
clearly shows that the computed confidence measurements can reflect the occlusion
exactly, with the lowest confidence score being estimated when the face is fully
occluded by the hand. On the other hand, as shown in Figure 4.7, the proposed
tracker provides a high confidence level when the face exhibits significant
facial expressions or large face orientations without occlusion.
Figure 4.6: The face tracking results with significant facial expression changes, large head movements and occlusion. For each frame, the tracked face is marked by a dark square. The upper row displays image frames 29, 193, 211, 236 from left to right, while the lower row displays image frames 237, 238, 412 and 444 from left to right.
4.5.3 Performance under Illumination Changes
Our proposed tracking technique can perform well not only under large head
movements and significant facial expressions, as illustrated above, but also under
significant external illumination changes. To demonstrate this, a face video sequence was
recorded in an environment with drastic lighting variations. As shown in Figure
4.8, the appearance of the person’s face changes drastically due to the external
lighting changes, which makes the tracking task extremely challenging. However,
our proposed tracking technique can still track the face very successfully.
Figure 4.7: The estimated confidence measures
Figure 4.8: Face tracking results under significant changes in illumination and head movement. The tracked face is marked by a dark square in each image. From left to right, the first row displays image frames 75, 246, 295, 444, while the second row displays image frames 662, 706, 745 and 898.
4.5.4 Processing Speed
The proposed CBR face tracking technique is implemented using C++ on a
PC with a Xeon (TM) 2.80GHz CPU and a 1.00GB RAM. The resolution of the
captured images is 320 × 240 pixels, and the built face tracking system runs at
approximately 26 fps comfortably.
4.6 Chapter Summary
In this chapter, a visual tracking framework based on the CBR paradigm with
a confidence level is introduced to track faces in a video sequence. Under the
proposed framework, the face can be tracked robustly under significant appearance
changes without drifting. In addition, a confidence measure can be derived
accurately for each tracked face view so that tracking failures can be assessed
successfully. Since the proposed algorithm does not involve any complicated probabilistic
appearance modelling or non-linear optimization, it is very simple and computationally
inexpensive. Most importantly, it provides an effective framework for solving the
drifting issue that has plagued the face tracking community for a very long time.
Finally, such a tracking framework can be easily generalized to track other objects
by simply updating its case base.
CHAPTER 5
Real-Time Facial Feature Tracking
5.1 Introduction
Facial features, such as eyes, eyebrows, nose and mouth, as well as their spatial
arrangement, are important for such facial interpretation tasks as face recognition
[117], facial expression analysis [128] and face animation [116]. Therefore, locating
these facial features in a face image accurately is crucial for these tasks to perform
well. Various techniques [10, 108, 94, 38, 117, 14] have been proposed to detect
and track facial features in face images. Generally, two types of information are
commonly utilized by these techniques. One is the image appearance of the facial
features, which is referred to as texture information; the other is the spatial
relationship among different facial features, which is referred to as shape information.
In [94], a neural network is trained individually as a feature detector for each
facial feature, e.g. as an eye detector. Facial features are then located by searching
the face image via the trained facial feature detectors. Similarly, instead of using
neural networks, Gabor wavelet networks are trained to locate the facial features in
[108]. Since the shape information of the facial features is not modelled explicitly
in either technique, they are prone to image noise. Therefore, in [10], a statistical
shape model is built to capture the spatial relationships among facial features, and
multi-scale and multi-orientation Gaussian derivative filters are employed to model
the texture of facial features. However, only the shape information is used when
comparing two possible feature point configurations, ignoring the local measurement
for each facial feature. Such a method may not be robust in the presence of
image noise. In [117], facial features are represented with Gabor jets and the
spatial distributions of facial features are captured implicitly with a graph structure.
Via the graph structure, only simple spatial information among the facial features
is imposed, and its variation is not modelled directly. Since most of these proposed
techniques assume frontal facial views, no significant facial expressions, or constant
illumination, good performance has been reported under those conditions.
However, in reality, the image appearance of the facial features varies signifi-
cantly among different individuals. Even for a specific person, the appearance of the
facial features is easily affected by the lighting conditions, face orientations and facial
expressions. Therefore, robust facial feature tracking still remains a very challenging
task, especially under variable illumination, face orientation and facial expression.
In this chapter, techniques are proposed to improve the robustness and accuracy
of the existing facial feature trackers such that facial features can be detected and
tracked under the above challenging situations.
In chapter 4, a CBR-based visual tracking framework is introduced to stably
detect and track the face under significant changes in lighting, facial expression and
face orientation. Therefore, the position and motion of the tracked face can provide
strong and reliable geometric constraints over the locations of other facial features.
In addition, Kalman filtering [13] is a useful means for object tracking. It can impose
a smooth constraint on the motion of each facial feature. For the trajectory of each
facial feature, this constraint removes random jumping due to the uncertainty in
the image. Therefore, in this chapter, the Kalman filtering and the face motion are
combined together to predict the position for each facial feature in a new image.
By doing so, we not only obtain a smooth trajectory for each feature, but also
capture rapid head motion. Given the predicted feature positions, the multi-scale and
multi-orientation Gabor wavelet matching method [64, 117, 129] is used to detect
each facial feature in the vicinity of the predicted locations. The robust matching in
the Gabor space provides an accurate and fast solution for tracking multiple facial
features simultaneously.
It is important for us not only to track each single facial feature, but also
to capture their spatial relationships. Gabor wavelet matching is used to identify
each facial feature in the tracking initialization; the Gabor wavelet coefficients are
updated at each frame to adaptively compensate for the facial feature appearance
changes. These updated coefficients are used as the template to match the facial
feature in the coming image frame. This updating approach works very well when
no occlusion or self-occlusion happens. However, with free head motion, in which
the head can turn from the frontal view to a side view or vice versa, self-occlusion
often causes the tracker to fail because a random or arbitrary profile is assigned
to the occluded feature. In this chapter, a shape-constrained correction mechanism
is developed to tackle this problem and to refine the tracking results.
Figure 5.1 shows the flowchart of the proposed facial feature detection and
tracking algorithm. After the face image is captured from the camera, the frontal
face is first located via the technique proposed in chapter 4. Based on the detected
frontal face region, a trained face mesh is employed to estimate a rough position for
each facial feature. Subsequently, a refinement technique based on Gabor wavelet
matching is proposed to search for an accurate position around the roughly estimated
position for each facial feature. Once the facial features are located successfully, a
correction-based tracking mechanism is activated to track them in the subsequent
image frames. In the following sections, each component will be described briefly.
Figure 5.1: The flowchart of the proposed facial feature detection and tracking algorithm
5.2 Facial Feature Representation
Gabor wavelets are biologically motivated convolution kernels in the shape of
plane waves restricted by a Gaussian envelope function [16]. So far, many research
groups have reported that with the use of Gabor wavelets, good performance can be
achieved in face recognition [117], facial expression recognition [129], facial feature
detection [117], etc.
In this chapter, a set of multi-scale and multi-orientation Gabor wavelets are
applied to each facial feature so that each facial feature can be represented by a
set of filter responses. Specifically, for an image patch I(~x) around a given pixel
~x = (x, y)^T, the utilized 2D Gabor wavelet kernels are expressed as follows:

Ψ_j(~x) = (‖~k_j‖² / σ²) exp(−‖~k_j‖² ‖~x‖² / (2σ²)) [exp(i ~k_j · ~x) − exp(−σ²/2)]    (5.1)

where σ = 2π, and the wave vector ~k_j is represented as

~k_j = (k_{jx}, k_{jy})^T = (k_ν cos φ_µ, k_ν sin φ_µ)^T    (5.2)

The wave vector ~k_j controls the orientation and the scale of the wavelets, with
index j = 6ν + µ ranging from 0 to 59. Specifically, the set of Gabor wavelets consists
of 10 spatial frequencies and 6 distinct orientations given by ν = 0, · · · , 9 and
µ = 0, · · · , 5, with k_ν = 2^{−ν/2} π and φ_µ = µπ/6.
Therefore, for each pixel ~x, a Gabor coefficient vector J(~x) is derived as the
set {J_j(~x)} of 60 convolution results with the above 60 Gabor kernels, with each
convolution defined as:

J_j(~x) = ∫ I(~x') Ψ_j(~x − ~x') d~x' = m_j e^{iφ_j},  j = 0, 1, ..., 59

where I(~x') is the image grey level distribution, and m_j and φ_j are the magnitude
and phase of the computed complex Gabor coefficient. This Gabor coefficient vector
J(~x) can be used to represent the image pixel ~x and its vicinity [64] efficiently.
In our proposed algorithm, the Gabor coefficient vector J(~x) is not only used to
detect each facial feature at the initial frame, but also used during tracking.
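For concreteness, a sketch of the 60-kernel bank of Equations 5.1 and 5.2 and of a Gabor jet J(~x) sampled at a single pixel is given below; the kernel window size and the direct (non-pyramidal) evaluation are illustrative simplifications rather than the thesis implementation.

import numpy as np

def gabor_kernel(nu, mu, size=33, sigma=2 * np.pi):
    """2D Gabor kernel Psi_j for frequency index nu and orientation mu (Eq. 5.1)."""
    k = 2.0 ** (-nu / 2.0) * np.pi                      # k_nu
    phi = mu * np.pi / 6.0                              # phi_mu
    kx, ky = k * np.cos(phi), k * np.sin(phi)           # wave vector (Eq. 5.2)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    k2, r2 = kx ** 2 + ky ** 2, x ** 2 + y ** 2
    envelope = (k2 / sigma ** 2) * np.exp(-k2 * r2 / (2 * sigma ** 2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2.0)
    return envelope * carrier

def gabor_jet(image, px, py, kernels):
    """Gabor coefficient vector J(x) = (m_j e^{i phi_j}) sampled at pixel (px, py)."""
    jet = []
    for ker in kernels:
        half = ker.shape[0] // 2
        patch = image[py - half:py + half + 1, px - half:px + half + 1]
        jet.append(np.sum(patch * ker))                  # filter response at (px, py)
    return np.array(jet)                                 # 60 complex coefficients

kernels = [gabor_kernel(nu, mu) for nu in range(10) for mu in range(6)]
img = np.random.default_rng(1).random((240, 320))
print(gabor_jet(img, 160, 120, kernels).shape)           # (60,)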
However, it is too expensive to compute all the kernel convolutions in real
time. In the following, a pyramidal representation is employed to speed up the
convolution computations dramatically.
5.2.1 Pyramidal Gabor Wavelets
Given a face image, an image pyramid can be generated to depict it at
decreasing resolutions. To create an image pyramid, each level I_L of the image
pyramid is generated from its next lower level I_{L−1} as follows:

I_L(x, y) = ¼ [I_{L−1}(2x, 2y) + I_{L−1}(2x + 1, 2y) + I_{L−1}(2x, 2y + 1) + I_{L−1}(2x + 1, 2y + 1)]    (5.3)

where L represents the pyramid level and I(x, y) represents the intensity of the
image pixel at the coordinates (x, y). In other words, the image I_{L+1} at level L + 1 is
created by shrinking the image I_L at level L by half. Specifically, the base level I_0 is
the image at the original resolution, which has the best resolution. In our proposed
algorithm, an image pyramid with three levels is generated to represent each face
image. Figure 5.2 shows a generated three-level image pyramid, whose base level
image is shown in Figure 5.2 (a).
Figure 5.2: An image pyramid with three levels: (a) the base level contains a 320×240 pixel image; (b) the first level contains a 160×120 pixel image; (c) the second level contains an 80×60 pixel image; (d) the third level contains a 40×30 pixel image
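A short sketch of the pyramid construction in Equation 5.3, assuming an even-sized gray-scale image; each level averages 2×2 pixel blocks of the level below.

import numpy as np

def build_pyramid(image, levels=3):
    """Return [I_0, I_1, ..., I_levels], where each level halves the resolution
    by averaging 2x2 pixel blocks (Eq. 5.3)."""
    pyramid = [image.astype(float)]
    for _ in range(levels):
        prev = pyramid[-1]
        h, w = prev.shape[0] // 2, prev.shape[1] // 2
        prev = prev[:2 * h, :2 * w]                      # crop to an even size
        down = 0.25 * (prev[0::2, 0::2] + prev[1::2, 0::2] +
                       prev[0::2, 1::2] + prev[1::2, 1::2])
        pyramid.append(down)
    return pyramid

levels = build_pyramid(np.zeros((240, 320)))
print([p.shape for p in levels])   # [(240, 320), (120, 160), (60, 80), (30, 40)]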
At the same time, the Gabor kernels with low frequencies need to be shrunk
correspondingly. This is done easily by re-formulating the wave vectors of the Gabor
kernels as follows:

~k_j = (k_ν cos φ_µ, k_ν sin φ_µ)^T    (5.4)

where k_ν = 2^{(−ν/2 + L)} π with L = quotient(ν, 3).

Therefore, the kernels with the frequency indices ν = 0, 1, 2 perform their
convolutions on the base level of the image pyramid. The kernels with the frequency
indices ν = 3, 4, 5 perform their convolutions on the first level of the image pyramid,
with their kernel sizes shrunk by half. The kernels with the frequency indices
ν = 6, 7, 8 perform their convolutions on the second level of the image pyramid,
with their kernel sizes shrunk by half twice. Finally, the kernel with the lowest
frequency performs its convolutions on the third level of the image pyramid, with
its kernel size shrunk by half three times. Via the proposed pyramidal Gabor
wavelet representation, the computation of the kernel convolutions becomes
much faster.
5.2.2 Fast Phase-Based Displacement Estimation
Given a facial feature in two consecutive image frames, a Gabor coefficient
vector J(~x) is extracted at position ~x in the first image, while another Gabor
coefficient vector J(~y) is extracted at a different position ~y = ~x + ~d in the subsequent
image frame, with the displacement ~d = (d_x, d_y)^T. Since the position ~x is in a small
vicinity of the position ~y, the phase shift between the Gabor coefficient vectors J(~x) and
J(~y) can approximately be compensated for by the term ~d · ~k_j. So the phase-sensitive
similarity function between these two Gabor coefficient vectors can be expressed as:

S(J(~x), J(~y)) = [Σ_j m_j m'_j cos(φ_j − φ'_j − ~d · ~k_j)] / sqrt(Σ_j m_j² Σ_j m'_j²)    (5.5)

where each component of the Gabor coefficient vector J(~x) is represented as J_j(~x) =
m_j e^{iφ_j}, and each component of the Gabor coefficient vector J(~y) is represented as
J_j(~y) = m'_j e^{iφ'_j}. To compute it, the displacement ~d must be estimated. This can be
done by maximizing the similarity in its second-order Taylor expansion:

S(J(~x), J(~y)) ≈ [Σ_j m_j m'_j (1 − 0.5 (φ_j − φ'_j − ~d · ~k_j)²)] / sqrt(Σ_j m_j² Σ_j m'_j²)    (5.6)
Therefore, by setting ∂S/∂d_x = 0 and ∂S/∂d_y = 0, the optimal displacement vector ~d
can be estimated as follows:

~d = 1 / (Γ_xx Γ_yy − Γ_xy Γ_yx) × [ Γ_yy  −Γ_yx ; −Γ_xy  Γ_xx ] (θ_x, θ_y)^T    (5.7)

if Γ_xx Γ_yy − Γ_xy Γ_yx ≠ 0, with

θ_x = Σ_j m_j m'_j k_{jx} (φ_j − φ'_j),
θ_y = Σ_j m_j m'_j k_{jy} (φ_j − φ'_j),
Γ_xx = Σ_j m_j m'_j k_{jx} k_{jx},
Γ_xy = Σ_j m_j m'_j k_{jx} k_{jy},
Γ_yx = Σ_j m_j m'_j k_{jx} k_{jy},
Γ_yy = Σ_j m_j m'_j k_{jy} k_{jy}
This similarity function can only determine displacements up to half the wavelength
of the highest frequency kernel, which would be a ±1 pixel area centered at
the predicted position for k_0 = π/2. This range can be increased by using lower
frequency kernels. Specifically, a four-level coarse-to-fine approach is used, which can
determine up to a ±22 pixel displacement with the use of the lowest frequency kernel.
Therefore, for each facial feature, only four displacement calculations are needed to
determine the optimal position, which dramatically speeds up the displacement
estimation and makes a real-time implementation of multi-feature tracking
possible.
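The closed-form displacement of Equation 5.7 can be written compactly as below; the jets are assumed to be complex Gabor coefficient vectors as defined in Section 5.2, the wave vectors must correspond to the kernels used to compute them, and wrapping the phase differences to (−π, π] is a practical detail not spelled out in the text.

import numpy as np

def estimate_displacement(jet_x, jet_y, wave_vectors):
    """Phase-based displacement d between two Gabor jets (Eqs. 5.5-5.7).

    jet_x, jet_y : complex arrays of Gabor coefficients at x and y = x + d.
    wave_vectors : array of shape (n, 2) holding (k_jx, k_jy) for each kernel.
    """
    m, mp = np.abs(jet_x), np.abs(jet_y)
    dphi = np.angle(jet_x) - np.angle(jet_y)             # phi_j - phi'_j
    dphi = (dphi + np.pi) % (2 * np.pi) - np.pi          # wrap to (-pi, pi]
    kx, ky = wave_vectors[:, 0], wave_vectors[:, 1]
    w = m * mp                                           # m_j * m'_j weights
    theta_x, theta_y = np.sum(w * kx * dphi), np.sum(w * ky * dphi)
    gxx, gxy = np.sum(w * kx * kx), np.sum(w * kx * ky)
    gyy = np.sum(w * ky * ky)
    det = gxx * gyy - gxy * gxy                          # Gamma_xy = Gamma_yx here
    if abs(det) < 1e-12:
        return np.zeros(2)
    dx = (gyy * theta_x - gxy * theta_y) / det           # Eq. 5.7
    dy = (-gxy * theta_x + gxx * theta_y) / det
    return np.array([dx, dy])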
5.3 Facial Feature Detection
Twenty-eight prominent facial features around eyes, eyebrows, nose and mouth
are selected for detection and tracking in the face images as shown in Figure 5.3.
Among them, some facial features move significantly under facial expressions, so
that the facial expressions in the face images can be revealed from their movements.
Figure 5.3: A face mesh with facial features

Given a face image sequence, our proposed algorithm starts with detecting
these twenty-eight facial features automatically in the initial image frames. In
essence, the proposed facial feature detection technique consists of two steps: facial
feature approximation and facial feature refinement. The first step provides an
approximate location for each facial feature based on a detected frontal face region,
and these locations are then fine-tuned in the second step.
5.3.1 Facial Feature Approximation
As shown in Figure 5.4, given a frontal face image, the selected twenty-eight
facial features are symmetrically located within the face region. In addition, certain
anthropometric proportions [17] must be satisfied among these facial features, so that
the positions of the facial features can be roughly estimated once the face region is
detected.
Therefore, a simple strategy is proposed to estimate a rough position for each
facial feature based on the located frontal face region in an image. First, a face mesh
F composed of these twenty-eight facial features is formed as shown in Figure 5.5
(a). Specifically, the face mesh F is learned from a set of collected frontal faces by
taking their mean face mesh. Since the set of collected faces covers a variety
of people from different races, the learnt face mesh F will work for most people.
Once the face mesh F is obtained, based on the size and position of the detected
frontal face region, it can be resized and imposed on the face image to obtain a rough
position for each facial feature as shown in Figure 5.5 (b). Since the deviation from
the actual position is usually small for each facial feature, its position can be further
refined by searching around its estimated position subsequently in the refinement
step. In the following, the searching procedure will be described briefly.

Figure 5.4: Spatial geometry of the facial features in a frontal face region. The eyes are marked by small white rectangles, the face region is marked by a large white rectangle, and the facial features are marked with white circles

Figure 5.5: (a) face mesh, (b) face image with approximated facial features, (c) face image with refined facial features
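The approximation step reduces to scaling and translating the mean face mesh into the detected face rectangle; a minimal sketch is shown below, assuming the mean mesh is stored in coordinates normalized to a unit square, with a toy three-point mesh standing in for the twenty-eight features.

import numpy as np

def place_mean_mesh(mean_mesh_unit, face_box):
    """Map a mean face mesh (feature points in [0,1]x[0,1]) into a detected
    face rectangle (x, y, w, h) to obtain rough feature positions."""
    x, y, w, h = face_box
    pts = np.asarray(mean_mesh_unit, dtype=float)
    return np.column_stack((x + pts[:, 0] * w, y + pts[:, 1] * h))

# Toy example with a 3-point "mesh" and a 100x120 face box located at (50, 40).
print(place_mean_mesh([(0.3, 0.4), (0.7, 0.4), (0.5, 0.8)], (50, 40, 100, 120)))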
5.3.2 Facial Feature Refinement
Given an approximated position ~x_e for a facial feature in the face image, a
Gabor coefficient vector J(~x_e) is first calculated. Then, the nearest-neighbor
approach is utilized to seek the most similar Gabor coefficient vector J' from a training
set for each facial feature by matching with J(~x_e) in the Gabor space. In order to
obtain an effective training set for each facial feature, a large number of face images
with different local properties were collected and the correct facial features were
marked in each image. These local properties include different individuals, different
lighting conditions, different face orientations and different facial expressions. Then
a set of Gabor coefficient vectors, each derived from a different face image at the
same facial feature, are stored for each facial feature. Since these face images are
collected under various conditions, a wide range of appearance variations for each
facial feature can be covered.
Once the most similar J' is obtained, it is utilized as a model to estimate a new
position ~x' for each facial feature, starting from the approximated position ~x_e, via
the fast phase-based displacement estimation technique proposed in Section 5.2.2.
The above procedure is repeated until the final estimated position ~x' converges or
a pre-defined number of iterations is exceeded. As shown in Figure 5.5 (c), the
facial features are located successfully via the proposed facial feature detection
technique. More face images with detected facial features under significant facial
expressions are illustrated in Figure 5.6.
Figure 5.6: Face images with detected facial features under different facial expressions: (a) disgust, (b) anger, (c) surprise and (d) happiness
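The detection refinement can be read as a nearest-neighbor retrieval followed by repeated phase-based displacement updates; the sketch below assumes the gabor_jet and estimate_displacement functions sketched earlier in this chapter, and the convergence tolerance and iteration cap are illustrative choices.

import numpy as np

def jet_similarity(a, b):
    """Magnitude-based similarity between two Gabor jets."""
    ma, mb = np.abs(a), np.abs(b)
    return float(np.sum(ma * mb) / (np.linalg.norm(ma) * np.linalg.norm(mb) + 1e-12))

def refine_feature(image, x0, y0, training_jets, kernels, wave_vectors,
                   gabor_jet, estimate_displacement, max_iter=10, tol=0.1):
    """Refine an approximate feature position (x0, y0) by matching against a
    training set of jets and iterating the phase-based displacement estimate."""
    x, y = float(x0), float(y0)
    jet_here = gabor_jet(image, int(round(x)), int(round(y)), kernels)
    # Nearest neighbor in the Gabor space: pick the most similar training jet.
    model = max(training_jets, key=lambda j: jet_similarity(j, jet_here))
    for _ in range(max_iter):
        jet_here = gabor_jet(image, int(round(x)), int(round(y)), kernels)
        dx, dy = estimate_displacement(model, jet_here, wave_vectors)
        x, y = x + dx, y + dy                  # move toward the model match
        if dx * dx + dy * dy < tol * tol:      # converged
            break
    return x, y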
5.4 Facial Feature Tracking
Once the facial features are located in the first image frame, they are tracked
in the following image frames. Facial feature tracking, especially tracking a set of
facial features simultaneously, is a crucial and difficult task. The face is a typical
nonrigid object. The appearance of the facial features and the spatial relationships
among them can vary significantly under changes in facial expression and face
orientation. In addition, the features are very difficult to track under rapid head
motion, and some of them may even disappear when the head turns to a profile
view. Therefore, tracking these facial features robustly is a tough issue. Nevertheless,
a novel facial feature tracking scheme is proposed in this chapter to track them
robustly under large head movements and significant facial expressions. The
flowchart of the proposed tracking scheme is illustrated in Figure 5.7. Specifically,
it is composed of three stages, namely facial feature prediction, facial feature mea-
surement, and facial feature correction. In the subsequent sections, each stage will
be described briefly.
Figure 5.7: The flowchart of the proposed tracking algorithm
5.4.1 Facial Feature Prediction
Kalman filtering is a well-known tracking method, and it has been successfully
used in many applications. In this chapter, it is used to predict the position of each
facial feature in a new image frame from its previous locations so that a smooth
constraint can be imposed on the motion of each facial feature. In addition, the
motion of the face tracked by our proposed face tracker can provide strong and
reliable information about the location and movement of each facial feature between
two consecutive frames. This information is especially useful to track the facial
features under significant head movements. Therefore, by combining the face motion
obtained by our proposed face tracker with the Kalman filtering, we can obtain an
accurate and robust prediction of the position for each facial feature in a new image
frame, even under rapid head movement.
Specifically, for each feature, its motion state at each time instance (frame) can
be characterized by its position and velocity. Let (x_t, y_t) represent its pixel position
and (u_t, v_t) its velocity at time t in the x and y directions. The state vector at time
t can therefore be represented as S_t = (x_t, y_t, u_t, v_t)^T. The system can therefore be
modelled as

S_{t+1} = Φ S_t + W_t    (5.8)

where Φ is the transition matrix and W_t represents the system perturbation. Given the
system model, the state vector at t + 1 can be predicted by

S⁻_{t+1} = Φ S_t    (5.9)

along with its covariance matrix Σ⁻_{t+1} to characterize its uncertainty.
The prediction based on Kalman filtering assumes smooth motion for each facial
feature. The prediction will be off significantly if the head undergoes a sudden
rapid movement. To deal with this issue, we propose to approximate the movement
of each facial feature by the face movement, since the face can be reliably
tracked in each frame. Let the predicted position for each facial feature at t + 1 based
on the face motion be S^p_{t+1}. It provides a different motion constraint to complement
the smoothness constraint from Kalman filtering when the motion is rapid. Combining
these two constraints yields the final predicted position for each facial feature:

S*_{t+1} = S⁻_{t+1} + Σ⁻_{t+1} (S⁻_{t+1} − S^p_{t+1})    (5.10)

The simultaneous use of Kalman filtering and face motion allows us to perform
accurate motion prediction for each facial feature under significant and rapid head
movements. We can then derive a new covariance matrix Σ*_{t+1} for S*_{t+1} using the
above equation to characterize its uncertainty.
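The prediction step can be sketched as follows, using a constant-velocity transition matrix for Equations 5.8 and 5.9 and reproducing the blending of Equation 5.10 as written; the unit time step, the noise covariance, and the example numbers are illustrative assumptions.

import numpy as np

# Constant-velocity transition matrix Phi for state S = (x, y, u, v), dt = 1 frame.
PHI = np.array([[1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0],
                [0, 0, 0, 1]], dtype=float)

def predict_feature(state, cov, process_noise, face_motion_pred):
    """Kalman prediction (Eqs. 5.8-5.9) blended with the face-motion-based
    prediction S^p_{t+1} as in Eq. 5.10."""
    s_minus = PHI @ state                                          # S-_{t+1}
    cov_minus = PHI @ cov @ PHI.T + process_noise                  # Sigma-_{t+1}
    s_star = s_minus + cov_minus @ (s_minus - face_motion_pred)    # Eq. 5.10
    return s_star, cov_minus

state = np.array([100.0, 50.0, 1.0, 0.0])      # position (100, 50), velocity (1, 0)
cov = np.eye(4) * 0.01
q = np.eye(4) * 0.001
face_pred = np.array([103.0, 50.0, 2.0, 0.0])  # prediction from the face tracker
print(predict_feature(state, cov, q, face_pred)[0])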
5.4.2 Facial Feature Measurement
Given the predicted state vector S*_{t+1} at time t + 1 and the predicted uncertainty
Σ*_{t+1} for each facial feature, a search area centered at each predicted position in
the image frame is defined. Usually the size of the search area is determined
by the covariance matrix Σ*_{t+1}, and a search within this area is used to detect the
optimal position. But when tracking a number of facial features simultaneously,
exhaustively searching the area for each facial feature is very time-consuming.
Instead, the search is carried out via the fast phase-based displacement estimation
technique described in Section 5.2.2. Hence, the displacement ~d for each facial
feature can be computed directly, so that its position can be estimated efficiently.
Since the proposed technique is very efficient, it is suitable for real-time application.
Once the facial features are located, the Gabor wavelet coefficient vector is
computed at the located position for each facial feature, which will be utilized as
the tracking model in the subsequent image frame. This is equivalent to updating the
tracking model dynamically in each image frame; hence, the appearance changes
of each facial feature can be adapted to. However, since this updating does not
have the ability to correct possible errors during tracking, errors may accumulate
over the image frames such that the tracker will eventually drift away.
Therefore, in the next stage, a correction step is proposed to eliminate the accumu-
lated errors for each facial feature in the image frames so that the drifting can be
avoided.
5.4.3 Facial Feature Correction
The proposed facial feature correction strategy consists of two components: re-
fining facial features and imposing shape constraints. In the facial feature refinement
component, the tracked position of each facial feature will be refined by matching
with a training set in the Gabor space so that the accumulated error due to the
appearance changes can be eliminated. However, the above procedure only works
when the tracked position does not deviate far from the actual position of
each facial feature; otherwise it may fail. Hence, a second component is subsequently
activated to impose shape constraints among the facial features and correct the
obviously geometry-violated ones that deviate far from their actual positions.
In the following, each component will be discussed briefly.
5.4.3.1 Facial Feature Refinement
First, for each facial feature, a set of Gabor wavelet coefficient vectors are
extracted offline from a large number of collected face images. The collected face
images aim to cover all the possible appearance variations of each facial feature under
various face orientations, facial expressions and illuminations. Therefore, the set of
Gabor wavelet coefficient vectors serves as a generic appearance model for each facial
feature under various situations.
Given a face image, for a specific facial feature such as the left eye corner, we
can always find a very similar one in the set of collected Gabor wavelet coefficient
vectors. Once the most similar Gabor wavelet coefficient vector is found in the
training set, it is used as a model to search for a new position for each facial
feature around the position obtained in the facial feature measurement stage. The
search is done via the fast phase-based displacement estimation technique. The above
procedure is repeated until the final estimated position converges or a pre-defined
number of iterations is exceeded. In this way, the appearance changes of each facial
feature can be adapted to successfully, so that the accumulated tracking error can be
eliminated in time during tracking.
5.4.3.2 Imposing Geometry Constraints
So far, during tracking, the geometrical relationship among the facial features
has not been considered. In order to correct those geometrically violated facial
features that deviate far away from their actual positions, the geometry constraint
among the detected facial features is imposed. However, the geometry variations
among all the twenty-eight facial features under changes in individuals, facial ex-
pressions and face orientations are too complicated to be modelled successfully.
Therefore, in the following, a simple but effective technique is proposed to handle
this issue. Basically, the proposed technique can be divided into two steps: face pose
estimation and geometry constraint imposition. The first step provides rough face
pose information so that the face pose effects can be eliminated from the tracked
2D facial features. Subsequently, the geometry constraints are imposed on the pose-
eliminated 2D facial features to correct the geometrically violated ones. In this way,
only the geometry variation under the frontal facial view needs to be learned, which
is feasible to model, so that the geometry constraint can be imposed easily. In
the following, each step will be discussed briefly.
1. Rough Face Pose Estimation
Based on the detected facial features, a technique is proposed to estimate
the face pose efficiently. Under facial expressions, some of the facial features,
such as the ones around the mouth and the eyebrows, will move significantly.
Therefore, these nonrigid facial features are not suitable for the face pose
estimation. In order to minimize the effect of the facial expressions, only a set
of rigid facial features that do not move significantly with facial expression is
selected to estimate the face pose. Specifically, six facial features are selected
as shown in Figure 5.8 (a), which include four eye corners and two points on
the nose.
Figure 5.8: (a) A frontal face image, and (b) its 3D face geometry, with the selected facial features marked as white dots
In order to estimate the face pose, the 3D shape model composed of these
six facial features has to be initialized. Currently, the 3D coordinates X_i =
(x_i, y_i, z_i)^T of the facial features in the 3D face shape model are first initialized from
a generic 3D face model as shown in Figure 5.8 (b). Because of the individual
differences from the generic face model for a new user, before using the system,
the user is asked to face the camera directly to obtain a frontal-view face
image for 3D face model adaptation. Based on the detected facial features
in the frontal face image, the x_i and y_i coordinates of each facial feature in
the 3D face shape model are adjusted automatically to the new user. In this
situation, the depth values of the facial features in the 3D face shape model
of the new user are not available. Therefore, the depth pattern of the generic
face model is used to approximate the z_i values. Our experiments show that
this method is effective and feasible in our real-time application.
Based on the personalized 3D face shape model and the six detected facial
features in a given face image, the face pose parameters α = (σ_pan, φ_tilt, κ_swing, λ)
can be estimated, where (σ_pan, φ_tilt, κ_swing) are the three Euler face angles
and λ is the scale. Since the traditional least-squares method [85] cannot handle
outliers successfully, a robust RANSAC (Random Sample Consensus) based
algorithm [28] is proposed to estimate the face pose instead. In the following,
the major steps involved in the RANSAC process of the face pose estimation
are discussed briefly.
(a) Form N triangles from the facial features
We randomly select three non-colinear facial features to form a triangle T_i.
For each triangle T_i, if one of the vertices is chosen as the origin
and one edge serves as the X axis, we can attach a local coordinate
system C_t to it. In this local coordinate system C_t, the three vertices of the
triangle are coplanar and their z coordinates are 0.
(b) Obtain the projection matrix P from each triangle
Under a weak-perspective projection model, for each triangle T_i, the relationship
between the row-column coordinate system and the local coordinate system C_t
can be expressed as follows:

(c − c_0, r − r_0)^T = P (x_t − x_{t0}, y_t − y_{t0}, z_t − z_{t0})^T    (5.11)

where (c, r) is the projection image of the 3D point (x_t, y_t, z_t) in
the local coordinate system C_t, and (c_0, r_0) is the projection image of the
reference point (x_{t0}, y_{t0}, z_{t0}). In addition, P is a 2 × 3 projection matrix
composed of the scalar λ and the first two rows of the rotation matrix R
as follows:

P = (1/λ) [ r_11  r_12  r_13 ; r_21  r_22  r_23 ]    (5.12)

Since the vertices of the triangle are coplanar in C_t, the projection model
(5.11) can be expressed as

(c − c_0, r − r_0)^T = M (x_t − x_{t0}, y_t − y_{t0})^T    (5.13)

where M is a 2 × 2 projection matrix represented as

M = (1/λ) [ r_11  r_12 ; r_21  r_22 ]    (5.14)

From the projection matrix M, two sets of face pose parameters can be
recovered after constraining the range of the Euler face pose angles to
(−π/2, π/2). Further, for each set of recovered face pose parameters, a
2 × 3 projection matrix P_i can be generated, but only one is correct.
(c) Calculate the projection deviation for each projection matrix
Based on a recovered projection matrix P_i, the facial features of the 3D
face shape model are projected into the image to obtain a set of projected
facial features. Then the deviation d_error between the projected facial
features and the detected ones in the image is calculated. If d_error is
bigger than a threshold value d, the projection matrix P_i is discarded;
otherwise, the projection matrix P_i is kept and its weight is
computed as ω_i = (d − d_error)².
(d) Average the final results
After checking all of the N triangles, we obtain a list of K 2 × 3 projection
matrices P_i, i = 1...K, and their corresponding weights ω_i, i = 1...K.
From each full projection matrix P_i, a set of face pose parameters α_i =
(σ_pan, φ_tilt, κ_swing, λ) is obtained uniquely. Then the final face pose
parameters α are obtained as follows:

α = (Σ_{i=1}^{K} α_i ω_i) / (Σ_{i=1}^{K} ω_i)    (5.15)
Via the proposed face pose estimation technique based on RANSAC, the face
pose can be estimated robustly under various facial expressions. Therefore,
given the estimated face pose information, the face pose effect can be elimi-
nated from the tracked twenty-eight 2D facial features. Since the set of pose-
eliminated 2D facial features can be treated as being captured under the frontal
face view, only a frontal face shape model needs to be learned to impose the
geometry constraint among the facial features.
2. Shape Constraints with Active Shape Model
First, the shape of a face is defined as a vector composed of these twenty-eight
facial features. Then, a frontal face shape model is built as follows. Initially,
a set of face shape samples are extracted from a large number of frontal faces
with various facial expressions. The set of collected face shape samples serves
as a training set consisting of N face shape samples Q_i, where each Q_i is a vector
composed of the coordinates of the facial features. Then, a mean face shape
vector Q_mean can be computed by

Q_mean = (1/N) Σ_{i=1}^{N} Q_i    (5.16)

Once the mean face shape vector Q_mean is obtained, the shape variations
∆Q_i can be obtained by subtracting the mean face shape Q_mean from each face
shape sample Q_i in the training set. From the set of shape variations ∆Q_i,
a set of k basis face shape vectors Q_j, 1 ≤ j ≤ k, is subsequently extracted
using PCA. Usually, the selected number k is much smaller than the dimension
of the face shape vector Q.
As a result, given a face shape vector Q_i extracted from a face image, the
global geometry constraint among the facial features can be imposed by

Q_i − Q_mean ≈ Φ b    (5.17)

with Φ = (Q_1, ..., Q_k), where b is a coefficient vector given by

b = Φ⁺ (Q_i − Q_mean)    (5.18)

where Φ⁺ is the pseudo-inverse of the matrix Φ, computed as Φ⁺ = (Φ^T Φ)^{−1} Φ^T.

After the coefficient vector b is obtained, the geometry-constrained face shape
vector Q'_i is represented as

Q'_i = Q_mean + Φ b    (5.19)

Via the proposed method, the obviously geometrically violated facial features
that deviate far from their actual positions can normally be corrected
efficiently (see the sketch after this list).
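A minimal sketch of the shape-constraint step of Equations 5.16 to 5.19, using PCA on a set of (pose-eliminated) frontal face shape vectors; the number of basis vectors k and the toy training data are illustrative assumptions.

import numpy as np

def learn_shape_model(shapes, k):
    """Learn the mean shape and k PCA basis vectors from training shape
    vectors (each a flattened vector of facial feature coordinates)."""
    Q = np.asarray(shapes, dtype=float)            # shape (N, D), D = 2 x points
    q_mean = Q.mean(axis=0)                        # Eq. 5.16
    # PCA on the shape variations Delta Q_i via SVD.
    _, _, vt = np.linalg.svd(Q - q_mean, full_matrices=False)
    return q_mean, vt[:k].T                        # basis Phi, shape (D, k)

def constrain_shape(q, q_mean, phi):
    """Project a tracked shape onto the learned subspace (Eqs. 5.17-5.19)."""
    b = np.linalg.pinv(phi) @ (q - q_mean)         # Eq. 5.18
    return q_mean + phi @ b                        # Eq. 5.19

# Toy example with 5-point shapes (10-D vectors) instead of 28 points.
rng = np.random.default_rng(2)
train = rng.normal(size=(50, 10))
q_mean, phi = learn_shape_model(train, k=3)
noisy = train[0] + rng.normal(scale=0.5, size=10)
print(constrain_shape(noisy, q_mean, phi).shape)   # (10,)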
5.5 Experiment Results
A real-time facial feature tracking system is built based on the proposed algorithm.
When a person sits in front of the camera, the system detects and tracks
the twenty-eight facial features automatically.
In order to test the performance of the proposed facial feature tracking algo-
rithm, a number of face image sequences were collected from different people. Figure
5.9 shows the facial feature tracking results on some typical face image sequences
with different facial expressions under various face orientations. Even when the
facial expression changes significantly or the face orientation is large, the
facial features can still be tracked successfully, as shown in Figure 5.9.
Figure 5.9: Randomly selected face images from different image sequences.
5.5.1 Facial Feature Tracking Accuracy
In order to evaluate the accuracy of the proposed facial feature tracking al-
gorithm, ten face image sequences with significant changes in facial expression and
face pose were collected. In each sequence, twenty-eight facial features were detected
and tracked automatically by the proposed facial feature tracker. In addition, the
positions of these facial features at each frame were also manually located for each
sequence. These manually located facial features serve as the ground truth during
the error computation.
Figure 5.10 illustrates the computed absolute position errors for the facial
features tracked by the proposed facial feature tracker. Specifically, Figure 5.10
(a) and (b) summarize the computed mean and standard deviation of the absolute
position errors for each facial feature in the X-direction and Y-direction respectively,
where the mean is represented by the middle value of each error bar and the standard
deviation is indicated by the half-length of the error bar. In the X-direction, the
average mean of the computed position errors over all twenty-eight facial features
is 1.66 pixels, with an average standard deviation of 0.89 pixels. In the Y-direction,
the average mean of the computed position errors over all twenty-eight facial
features is 1.85 pixels, with an average standard deviation of 0.71 pixels. It
appears that the computed mean and standard deviation of the position errors of
all the tracked facial features are less than 2 pixels. Since the original face image
resolution is 320 × 240 pixels, the position errors of the tracked facial features are
small enough for most of the applications, such as facial expression recognition or
facial animation.
Figure 5.10: The computed position errors of the facial features automatically extracted by the proposed facial feature tracker: (a) position errors in the X-direction for each facial feature; (b) position errors in the Y-direction for each facial feature
5.5.2 Processing Speed
The proposed facial feature detection and tracking technique is implemented
using C++ on a PC with a Xeon (TM) 2.80GHz CPU and a 1.00GB RAM. The
resolution of the captured images is 320 × 240 pixels, and the built facial feature
tracking system runs at approximately 26 fps comfortably.
5.6 Comparison with IR-Based Eye Tracker
The eyes (or eye centers), along with the other twenty-six prominent facial features,
can be detected and tracked via the technique described in this chapter.
Unlike the IR-based eye detection and tracking technique described in Chapter 2,
the technique proposed in this chapter works well under ambient lighting conditions,
without a special IR illuminator. Its hardware setup, therefore, is simple, with
sufficient accuracy for tasks like facial feature detection, face detection, and face
recognition.
In contrast, the IR-based eye tracker requires special IR illumination hardware
and an associated video decoder. However, it can work during the day and at night.
In addition, it produces sub-pixel accuracy, which is a prerequisite for a stable and
accurate gaze-tracking system.
5.7 Chapter Summary
In this chapter, we present a real-time approach to detect and track twenty-
eight facial features from the face images under significant changes in both facial
expression and face orientation. The improvements over the existing facial feature
detection and tracking algorithms result from: (1) combination of the Kalman fil-
tering with the face motion to constrain the facial feature locations; (2) the use of
pyramidal Gabor wavelets for efficient facial feature representation; (3) dynamic and
accurate model updating for each facial feature to eliminate any error accumulation;
and (4) imposing the global geometry constraints to eliminate any geometrical vio-
lations. With these combinations, the accuracy and robustness of the facial feature
tracker reach a practically acceptable level. Subsequently, the extracted spatio-
temporal relationships among the facial features can be used to perform the facial
motion estimation and facial expression classification described in the following chapters.
CHAPTER 6
Nonrigid and Rigid Facial Motion Estimation
6.1 Introduction
The motion of the face consists of two independent motions: rigid motion
and nonrigid motion. The rigid motion results from the global motion of the face
describing the rotation and translation of the head or face pose, while the nonrigid
motion results from the local motion of the face describing the contraction of facial
muscles or facial expression. When captured by the camera, both motions are mixed
together to form a 2D face motion in the image plane.
Successful recovery of face pose and facial expression from face images is very
important for many applications including face animation, facial expression analysis,
HCI and model-based video conferencing. For example, if the face pose and facial
expression can be recovered successfully from the face images, an MPEG-4 player
can be driven to animate a synthetic 3D face model directly by the video images
of a live performer [95]. Also, in the area of automatic facial expression analysis,
due to the inability to separate face pose from facial expression accurately, most
of the facial expression recognition systems developed so far require the subject to
face the camera directly without significant head movements [25, 105]. But once the
face pose is precisely estimated, its effect can be eliminated so that the estimated
nonrigid motion of facial expression will be independent of the face pose. Therefore,
the user can move his head freely in front of the camera while the facial expression
can still be recognized.
In this chapter, a novel technique is proposed to recover 3D face pose and
facial expression simultaneously from a set of twenty-eight facial features tracked by
the technique proposed in Chapter 5. Specifically, after explicitly modelling the
nonlinear coupling between face pose and facial expression in the image, a normalized
SVD (N-SVD) decomposition technique is proposed to analytically recover the pose
and expression parameters simultaneously. Subsequently, the solution obtained from
the N-SVD technique is further refined via a nonlinear technique by imposing the
orthonormality constraints on the pose parameters. Compared to the original SVD
technique proposed in [5], which is very sensitive to image noise and is numerically
unstable in practice, our proposed method can recover the face pose and facial
expression robustly and accurately.
6.2 Related Works
Numerous methods [95, 25, 30, 11, 121, 120] have been proposed to estimate
rigid and nonrigid facial motions from face images. Conventionally, the rigid and
nonrigid motions of the face are estimated separately [95, 122], and usually, the
nonrigid motion is subsequently estimated after the rigid motion is recovered. For
example, a two-stage method [95] is proposed. First, the rigid 3D head motion
is estimated from two successive frames. Then, the nonrigid motion of the face
is recovered by eliminating the estimated 3D head motion. However, in the stage
of estimating the head motion, the face is assumed to be a rigid object, ignoring
the facial expression changes between these two views. Therefore, the head motion
cannot be accurately estimated under facial expression changes. In fact, most
face pose estimation algorithms [30, 11, 121, 120] are designed to deal with a
rigid face only, ignoring facial expressions. Hence, in the presence of facial expressions,
the face pose estimated by these techniques is only an approximation of the true
pose, and it becomes quite inaccurate when the facial expression is significant.
On the other hand, without accurate face pose information,
the recovered nonrigid motions of the face will not be accurate either.
Since the 3D face motion is the sum of rigid and nonrigid motions, when pro-
jected into the image plane of the camera, both motions will be nonlinearly coupled
in the projected face motion. Therefore, any approach that tries to recover one
motion independently by ignoring the other will not solve this problem accurately.
But if the image projection of both motions can be explicitly modelled, then the
face pose and facial expression can be recovered simultaneously and accurately from
the motion projection model.
A group of different techniques [65, 23, 5, 35] have been proposed to estimate
the rigid and nonrigid facial motions simultaneously. For each technique, a different
2D facial motion model is developed to integrate the face pose and facial expression
parameters together. Then, based on the built 2D facial motion model, the face
pose and facial expression can be extracted simultaneously.
Li et al. [65] proposed a method to simultaneously extract the face pose and
facial expression from two successive views with the use of the image brightness
constancy equation. Their method employs a 3D face model that can describe facial
expressions by a set of action units (AU). For this method, the 3D motion between
successive frames must be small, since the motion of the 3D model is modelled as a
linear approximation of the face pose motion and facial expression motion. Similar to
[65], another 3D face model that can represent the facial expression by a set of facial
animation parameters (FAPs) is utilized to extract the pose and expression from
two successive views based on the image brightness constancy equation [23]. With
the use of the image brightness constancy equation, the component of the motion
field in the direction orthogonal to the spatial image gradient is not constrained.
Therefore, for both methods, the recovered facial motion is only an approximation
of the true motion, whose error will be large if the true motion deviates far from the
direction of the spatial image gradient. In addition, because both methods process
the whole image in order to estimate the motion, they may not be suitable for
real-time implementation due to their computational complexity.
In [34], a novel method is proposed to extract both pose and shape of the
face simultaneously from images based on a 3D model that can represent the facial
expression by a linear combination of rigid shape basis vectors. Different from
techniques in [65, 23], the 3D facial expression is learnt from a set of 3D real facial
expression data collected from a stereo camera system. However, there are two
significant issues that prevent this technique from working robustly in practice.
First, since the facial features are tracked in a slightly modified optical-flow-based
framework, the tracker cannot adapt to the appearance changes of each facial feature under
significant facial expression changes. It suffers from the well-known drifting issue
[72] during tracking. Second, since the recovery of pose and expression parameters
is integrated into the facial feature tracking in the image, the parameter searching
space becomes very complicated. In such a complicated parameter space, achieving
correct convergence and real-time implementation is very difficult.
In [5], the face pose and facial expression are recovered from a set of tracked
facial features. The facial expression is modelled as a linear combination of key-
expressions, while the facial motion is approximated by affine projection with par-
allax. Then, the coupling of both motions in the image plane is described by a
bilinear model. Subsequently, by decomposing the bilinear model via the Singular
Value Decomposition (SVD) method, the face pose and facial expression parameters
can be extracted directly. This method is elegant in theory; however, we will
show both in theory and in experiments that it is so numerically unstable in
the presence of image noise that it does not work at all in practice. Furthermore, for
this method, the selected facial features are tracked with the help of black makeup markers,
which is also impractical.
In order to overcome the shortcomings of the related work in [34, 5], a
robust technique is proposed to recover the rigid and nonrigid facial motions from
face images accurately, in real time, and without any makeup or markers. First, a normalized SVD
(N-SVD) technique is proposed to improve the original SVD technique proposed
in [5] so that it can work stably in the presence of image noise. As shown in
Section 6.5 of this chapter, however, the proposed N-SVD method is unable to
impose the orthonormality constraints on the face pose parameters. Therefore, we
further introduce a nonlinear technique to refine the solution of the N-SVD method
based on a criterion that has a favorable interpretation in terms of distance and
appropriate constraints. Experiments show that the proposed motion estimation
technique provides significant improvements in accuracy for estimated rigid and
nonrigid motions.
6.3 Pose and Expression Modelling
6.3.1 3D Face Representation
A neutral face is defined as a relaxed face without any contraction of the facial
muscles. With facial expression changes, the facial features will be moved and the
facial appearance will change subsequently as the facial muscles contract. Hence,
facial expression can be treated as deformations of a neutral face. If a set of facial
features that move significantly with facial expressions are selected for tracking
as shown in Figure 6.1 (a), then the facial expression can be characterized by the
movements (or displacements) of these selected facial features relative to the neutral
face.
Figure 6.1: (a) The spatial geometry of the selected facial features marked by the dark dots; (b) The 3D face mesh with the selected facial features marked by the white dots
A 3D face model is represented by a set of l facial features as shown in Figure
6.1 (b). A 3D object coordinate system is attached to the face, whose origin is at the
tip of the nose and whose Z-axis is perpendicular to the face plane. In the defined
face coordinate system, the 3D coordinates of each facial feature Xi are represented
as (xi, yi, zi)T . In addition, the tip of the nose is treated as the face center, whose
3D coordinates are (0, 0, 0)T .
6.3.2 3D Deformable Face Model
Given a neutral face model XN composed of l facial features, the 3D face with
deformation can be expressed as follows:
X = X^N + ∆X    (6.1)

where X is a vector composed of the l facial feature coordinates X_i (i = 1, ..., l) on the
face model, X^N is a vector composed of the l facial feature coordinates X^N_i (i = 1, ..., l)
on the neutral face model, and ∆X is the facial deformation vector under the facial
expressions, which consists of the relative movements ∆X_i (i = 1, ..., l) of the facial
features with respect to the neutral face.
The task of the nonrigid facial motion estimation is to recover ∆X of a face
from its 2D image. Without reduction, ∆X contains 3l variables, which are difficult
to estimate due to such a large dimension. In order to minimize the dimensionality of
the facial deformation vector ∆X, a compact representation with a reduced number
of parameters is statistically built from a set of collected facial deformation vectors
via the Principal Component Analysis (PCA) technique similar to [34]. Specifically,
a set of p (p ≪ 3l) facial deformation basis vectors ∆Q^k, k = 1, ..., p, is obtained
offline from a collected training set so that any facial deformation vector ∆X of the
3D face model X can be approximated by a linear combination of the basis vectors
as follows:

∆X ≈ Σ_{k=1}^{p} α_k ∆Q^k    (6.2)

where α_k (k = 1, ..., p) are the coefficients of the facial deformation basis vectors,
and ∆Q^k is represented as

∆Q^k = ( ∆Q^k_1  · · ·  ∆Q^k_l )^T    (6.3)
Therefore, a 3D deformable face model X with facial expression changes can be
represented by a set of reduced parameters as follows:
X ≈ X^N + Σ_{k=1}^{p} α_k ∆Q^k    (6.4)

In the rest of the chapter, the coefficients α_k will be called the facial expression parameters.
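As a concrete illustration of how such a basis can be obtained, the following sketch builds p deformation basis vectors from a set of training deformation vectors via PCA and then evaluates the linear combination of equation 6.2. The thesis system is implemented in C++; this is a minimal Python/NumPy sketch with illustrative variable names and data shapes, not the original code.

```python
import numpy as np

def build_deformation_basis(training_deformations, p):
    """PCA sketch: training_deformations is an (m, 3l) array whose rows are
    facial deformation vectors (feature displacements relative to the neutral
    face); returns a (p, 3l) array of basis vectors dQ^1..dQ^p."""
    D = np.asarray(training_deformations, dtype=float)
    D = D - D.mean(axis=0)                    # center the training deformations
    # Principal directions are the right singular vectors of the data matrix.
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    return Vt[:p]                             # top-p deformation basis vectors

def expand_deformation(alphas, basis):
    """Approximate a deformation as a linear combination of basis vectors
    (equation 6.2): dX ~= sum_k alpha_k dQ^k."""
    return np.asarray(alphas) @ basis

# Minimal usage with random stand-in data (100 training samples, l = 28 features).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(100, 3 * 28))
    basis = build_deformation_basis(train, p=4)
    dX = expand_deformation([0.5, -0.2, 0.1, 0.0], basis)
    print(basis.shape, dX.shape)              # (4, 84) (84,)
```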
6.3.3 3D Motion Projection Model
Under the weak-perspective projection assumption [109], given a facial feature
point in the 3D face model, the following equation can be obtained:
U_i = M X_i    (6.5)

where U_i = (u_i, v_i)^T is the relative 2D image coordinate of the feature point and
X_i = (x_i, y_i, z_i)^T is its corresponding relative 3D coordinate; both are obtained
by subtracting the face center in their respective coordinate systems. The
projection matrix M is a 2 × 3 matrix composed of the scalar λ and the first two
rows of the rotation matrix R, as follows:

M = [ m_11  m_12  m_13 ]  =  (1/λ) [ r_11  r_12  r_13 ]
    [ m_21  m_22  m_23 ]           [ r_21  r_22  r_23 ]    (6.6)

Therefore, once M is known, the rotation matrix R and the scalar λ of the 3D
face model can be recovered. In the rest of the chapter, the coefficients m_ij will be
referred to as the face pose parameters.
For each facial feature point, after integrating equation 6.4 into equation 6.5,
a motion projection model that analytically combines the effects of face pose and
facial expression in the 2D face image is derived as follows:
U_i = M [ X^N_i + Σ_{k=1}^{p} α_k ∆Q^k_i ] = M X^N_i + Σ_{k=1}^{p} α_k M ∆Q^k_i    (6.7)

From equation 6.7, it is clear that the 3D motion projection model is a nonlinear
function of the pose parameters m_ij and the expression parameters α_k. Furthermore, the
coupling term α_k M ∆Q^k_i indicates that the pose parameters and the expression
parameters interact directly to produce the facial motion in the image. For convenience,
the parameters involved in the motion projection equation 6.7 are collected into the vector
ξ = (m_11, m_12, m_13, m_21, m_22, m_23, α_1, ..., α_p)^T.
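To make the motion projection model of equation 6.7 concrete, the sketch below evaluates it for all l features given a parameter vector ξ. This is an illustrative Python/NumPy sketch (the thesis implementation is in C++), and the array shapes and names are assumptions made for the example.

```python
import numpy as np

def project_features(xi, XN, dQ):
    """Evaluate the motion projection model of equation 6.7.

    xi : parameter vector (m11..m23, alpha_1..alpha_p), length 6 + p
    XN : (l, 3) neutral-face feature coordinates (relative to the face center)
    dQ : (p, l, 3) facial deformation basis vectors
    returns U : (l, 2) projected relative image coordinates
    """
    M = np.asarray(xi[:6], dtype=float).reshape(2, 3)      # pose parameters
    alphas = np.asarray(xi[6:], dtype=float)               # expression parameters
    X = XN + np.tensordot(alphas, dQ, axes=1)              # deformed 3D features
    return X @ M.T                                         # U_i = M X_i

# Tiny usage example with synthetic numbers.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    l, p = 28, 4
    XN = rng.normal(scale=40.0, size=(l, 3))
    dQ = rng.normal(scale=0.1, size=(p, l, 3))
    xi = np.concatenate([[1, 0, 0, 0, 1, 0], [0.3, -0.1, 0.0, 0.2]])
    print(project_features(xi, XN, dQ).shape)              # (28, 2)
```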
6.4 Normalized SVD for Pose and Expression Decomposition
6.4.1 SVD Decomposition Method
A decomposition technique based on Singular Value Decomposition (SVD)
is introduced by Bascle et al. [5] to estimate the parameters from equation 6.7.
Specifically, it can be summarized by the following two steps:
1. Solution of a system of linear equations
After some appropriate arrangements, equation 6.7 can be written in matrix
format as:
U_i = Ω_i W    (6.8)

with

Ω_i = [ X^N_i     0      ∆Q^1_i     0     · · ·   ∆Q^p_i     0    ]
      [   0     X^N_i      0      ∆Q^1_i  · · ·     0      ∆Q^p_i ]

(here each block X^N_i or ∆Q^k_i denotes the corresponding 1 × 3 row, and 0 a 1 × 3 zero row)
and W = (A  α_1 A  · · ·  α_p A)^T, where A = (m_11 m_12 m_13 m_21 m_22 m_23). The
parameter vector W contains 6 + 6p unknowns. Given a set of l detected facial
features in a face image, where l ≥ 3 + 3p, a system of 2l linear equations
can be derived from equation 6.8 as follows:

P = Ω W    (6.9)

where P = (U_1 · · · U_l)^T and Ω = (Ω_1 · · · Ω_l)^T. The linear system 6.9 can
easily be solved by the least squares technique as W = Ω^+ P, where Ω^+ is the
pseudo-inverse of the matrix Ω, computed as Ω^+ = (Ω^T Ω)^{-1} Ω^T.
2. The singularity constraint
Once the parameter vector W is estimated, a matrix F can be constructed by
rearranging it:

F = ( A^T   α_1 A^T   · · ·   α_p A^T ) = ( m_11 m_12 m_13 m_21 m_22 m_23 )^T ( 1   α_1   · · ·   α_p )    (6.10)
The above equation clearly shows that the matrix F is the product of two vectors.
Hence, the matrix F is a singular matrix with rank 1. In practice,
however, the constructed matrix F generally has full rank (e.g., rank 5 when p = 4)
because of measurement inaccuracies or image noise. Thus, steps are needed to enforce this
singularity constraint on the constructed matrix F.

A corrected matrix F′ can be derived by minimizing the Frobenius norm ‖F −
F′‖ subject to the constraint rank(F′) = 1. A convenient way of doing this
is to use the SVD technique. In particular, let F = U D V^T be
the SVD of F, where U and V are two unitary matrices, and D is a diagonal
matrix D = diag(d_0, d_1, · · ·, d_p) satisfying d_0 ≥ d_1 ≥ · · · ≥ d_p. Then F′ is
chosen as F′ = U diag(d_0, 0, · · ·, 0) V^T. This way, the corrected matrix F′
is a singular matrix with rank 1. Subsequently, the parameters ξ can be
recovered directly from the SVD of the corrected matrix F′.
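The rank-1 correction and the subsequent parameter recovery can be expressed compactly with an off-the-shelf SVD routine. The sketch below is an illustrative reconstruction of this step (not the thesis code): it assumes F is the 6 × (p+1) matrix of equation 6.10 and exploits the normalization that the first entry of the vector (1, α_1, ..., α_p) equals one.

```python
import numpy as np

def enforce_rank1_and_recover(F):
    """Replace F by its closest rank-1 matrix (in Frobenius norm) and factor
    it as F' = A^T (1, alpha_1, ..., alpha_p), following equation 6.10."""
    U, d, Vt = np.linalg.svd(F, full_matrices=False)
    u1, v1 = U[:, 0], Vt[0]                       # dominant singular vectors
    F_rank1 = d[0] * np.outer(u1, v1)             # nearest rank-1 matrix F'
    A = d[0] * v1[0] * u1                         # column A^T, scaled so that
    alphas = v1[1:] / v1[0]                       # the first entry of b is 1
    return F_rank1, A, alphas
```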
6.4.2 Condition of the Linear System
Unfortunately, it turns out that the matrix Ω of the linear system P = ΩW
is ill-conditioned. As indicated in [32], the condition number of a general nonzero matrix
Ω, formally defined as κ(Ω) = ‖Ω^+‖ ‖Ω‖ with Ω^+ the pseudo-inverse of Ω, is a
quantitative indication of the sensitivity to perturbation of a linear system
involving Ω. Hence, the ill-conditioned matrix Ω makes the solution of the linear
system P = ΩW very sensitive to image noise. In practice, the pose and
expression parameters estimated by the original SVD method [5] are extremely susceptible
to image noise, so the SVD method cannot work effectively on real images.
Hence, we must analyze the condition of the linear system P = ΩW . For
simplicity, equation 6.8 can be split into the following two equations, after some
proper rearrangements:
u_i = Ω′_i W_u    (6.11)
v_i = Ω′_i W_v    (6.12)

with Ω′_i = ( X^N_i  ∆Q^1_i  · · ·  ∆Q^p_i ) and

W_u = ( A_1  α_1 A_1  · · ·  α_p A_1 )^T
W_v = ( A_2  α_1 A_2  · · ·  α_p A_2 )^T

where A_1 = (m_11 m_12 m_13) and A_2 = (m_21 m_22 m_23).

Then, given a set of l facial features in a face image, assuming l ≥ 3 + 3p, two
systems of linear equations can be derived:

P_u = Ω_f W_u    (6.13)
P_v = Ω_f W_v    (6.14)

with P_u = (u_1 · · · u_l)^T, P_v = (v_1 · · · v_l)^T, and

Ω_f = [ X^N_1  ∆Q^1_1  · · ·  ∆Q^p_1 ]
      [   ⋮       ⋮               ⋮   ]
      [ X^N_l  ∆Q^1_l  · · ·  ∆Q^p_l ]
Apparently, the matrix Ω_f is very unbalanced because the magnitudes of the entries
in the X^N columns are significantly larger than those in the other columns. In practice,
the entries of the X^N columns are roughly three orders of magnitude (about 10^3 times)
larger than the remaining ones. As a result, the condition number of the matrix Ω_f is very large,
usually larger than 4 × 10^3. Such a large condition number means that the matrix Ω_f
is close to singular, i.e., ill-conditioned. Therefore, the ill-conditioned matrix Ω_f will
cause the solutions of the linear systems 6.13 and 6.14 to be very unstable
and extremely sensitive to image noise.

In the following section, a simple but effective solution is proposed to improve
the condition of the matrix Ω_f so that the linear systems 6.13 and 6.14 can be
solved more stably.
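The effect of the imbalance is easy to verify numerically. The short sketch below compares the condition number of an unbalanced Ω_f against a column-rescaled version, anticipating the normalization of Section 6.4.3. The matrix entries are random stand-ins, and the per-block scale used here is the average absolute entry rather than the exact column-sum average of the thesis, so the numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
l, p = 28, 4
# Mimic the imbalance: the three X^N columns are roughly three orders of
# magnitude larger than the 3p deformation-basis columns (values illustrative).
Omega_f = np.hstack([rng.normal(scale=40.0, size=(l, 3)),
                     rng.normal(scale=0.05, size=(l, 3 * p))])
print("cond(Omega_f)      =", np.linalg.cond(Omega_f))

# Diagonal balancing: one scale per 3-column block, as in the matrix C.
scales = np.repeat([np.abs(Omega_f[:, 3*k:3*k+3]).mean() for k in range(p + 1)], 3)
Omega_f_bal = Omega_f / scales        # equivalent to Omega_f @ inv(diag(scales))
print("cond(Omega_f C^-1) =", np.linalg.cond(Omega_f_bal))
```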
6.4.3 Normalized SVD Technique
In this section, a new stable algorithm, named the normalized SVD (N-SVD) method,
is proposed to estimate the pose and expression parameters effectively. Since
the ill-conditioning of the matrix Ω_f is caused by its imbalance, we
propose a simple matrix transformation technique to balance Ω_f. Given
a transformation matrix C, a transformed matrix Ω′_f is obtained by multiplying Ω_f by the
inverse of C:

Ω′_f = Ω_f C^{-1}    (6.15)

In particular, the transformation matrix C is a diagonal matrix whose structure is

C = diag( c_0  c_0  c_0  c_1  c_1  c_1  · · ·  c_p  c_p  c_p )

where c_k (k = 0, ..., p) equals the average of the column sums of the corresponding
three columns of Ω_f. Through the matrix transformation of equation 6.15, the coordinates
of each point are equally scaled.
After the transformation, the transformed matrix Ω′_f is well-balanced. The
condition number of the transformed matrix Ω′_f is very small, around 10^1, indicating
that Ω′_f is well-conditioned. Therefore, for the linear systems
P_u = Ω′_f W′_u and P_v = Ω′_f W′_v, where W′_u = C W_u and W′_v = C W_v, the solutions W′_u
and W′_v are less sensitive to image noise.
Similar to the construction of the matrix F, a new matrix F_n can be constructed
from the estimated parameter vectors W′_u and W′_v:

F_n = ( c_0 A^T   c_1 α_1 A^T   · · ·   c_p α_p A^T ) = ( m_11 m_12 m_13 m_21 m_22 m_23 )^T ( c_0   c_1 α_1   · · ·   c_p α_p )    (6.16)

The preceding equation clearly shows that the constructed matrix F_n is also a singular
matrix with rank 1. Hence, this singularity constraint can be imposed on
the constructed matrix F_n via the SVD technique to obtain a corrected matrix F′_n.
Subsequently, the parameters can be recovered from F′_n.
Finally, the N-SVD method is summarized in the following five steps:

1. Normalization: Normalize the matrix Ω_f of the linear systems P_u = Ω_f W_u and P_v = Ω_f W_v by the transformation matrix C computed from Ω_f.

2. Linear solution: Solve the transformed linear systems P_u = Ω′_f W′_u and P_v = Ω′_f W′_v for W′_u and W′_v, and construct a matrix F_n from them in the same way as the matrix F.

3. Constraint enforcement: Replace F_n by its closest rank-1 (singular) matrix F′_n via the SVD technique.

4. De-normalization: Decompose the matrix F′_n back into the parameter vectors W′_u and W′_v, and then replace W′_u and W′_v by C^{-1} W′_u and C^{-1} W′_v, respectively.

5. Parameter recovery: Recover the face pose and expression parameters from the resulting vectors W′_u and W′_v.
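The five steps can be condensed into a short sketch, shown below. It is an illustrative Python/NumPy reconstruction under the notation of this section (the thesis system is written in C++); Ω_f, P_u and P_v are assumed to be assembled from the tracked features as in equations 6.13 and 6.14, and the per-block scales c_k are computed here as average magnitudes rather than column-sum averages.

```python
import numpy as np

def n_svd_decompose(Omega_f, Pu, Pv, p):
    """N-SVD sketch: recover the 2x3 pose matrix M and the expression
    coefficients alpha_1..alpha_p (illustrative, not the thesis C++ code)."""
    # 1. Normalization: one scale c_k per 3-column block of Omega_f.
    c = np.array([np.abs(Omega_f[:, 3*k:3*k+3]).mean() for k in range(p + 1)])
    Omega_bal = Omega_f / np.repeat(c, 3)            # Omega_f C^{-1}

    # 2. Linear solution of the two balanced systems (W'_u and W'_v).
    Wu = np.linalg.lstsq(Omega_bal, Pu, rcond=None)[0]
    Wv = np.linalg.lstsq(Omega_bal, Pv, rcond=None)[0]
    # F_n: column k equals c_k * alpha_k * (m11 m12 m13 m21 m22 m23)^T,
    # with alpha_0 = 1 by convention.
    Fn = np.vstack([Wu.reshape(p + 1, 3).T, Wv.reshape(p + 1, 3).T])

    # 3. Constraint enforcement: closest rank-1 matrix via SVD.
    U, d, Vt = np.linalg.svd(Fn, full_matrices=False)
    u1, v1 = U[:, 0], Vt[0]

    # 4./5. De-normalization and parameter recovery from the rank-1 factors.
    A = (d[0] * v1[0] / c[0]) * u1                   # (m11 m12 m13 m21 m22 m23)
    alphas = c[0] * v1[1:] / (v1[0] * c[1:])         # alpha_1 .. alpha_p
    return A.reshape(2, 3), alphas
```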
Via the proposed N-SVD technique, the condition of the linear systems is
improved significantly, so that the recovered pose and expression parameters are
much less sensitive to image noise.
6.4.4 Stability Analysis
For a system of linear equations P = ΩfW , each entry of the vector W will
contribute the same amount of perturbation in the vector P . Thus, under a per-
turbation in P , the entries of W corresponding to the columns with larger entries
in matrix Ωf will undergo a smaller perturbation. In other words, the entries of W
corresponding to the columns with smaller entries in matrix Ωf are more subject to
the perturbation in P . Therefore, for an unbalanced matrix Ωf , as shown in Section
6.4.2, the entries of W corresponding to the columns of ∆Xk in matrix Ωf are very
sensitive to image noise.
After achieving normalization by multiplying the matrix C−1, the entries in
each column of the matrix Ω′
f will have approximately the same average magnitude.
Thus, the matrix Ω′
f will be well-conditioned; solving the system of linear equations
P = Ω′
fW′ is equivalent to treating each column of the matrix Ω′
f equally so that
each entry in W ′ will have the same sensitivity to the image noise.
For example, via the motion modelling equation 6.7, a specific synthetic facial
deformation vector ∆X is generated from 2 basis facial deformation vectors and
a 3D neutral face mesh X^N by choosing a set of known face pose and expression
parameters ξ_0. In addition, in order to test its sensitivity to image noise,
some Gaussian noise is added to the generated synthetic data to obtain a noisy
facial deformation vector. The proposed technique is then applied to recover the
face pose and expression parameters ξ from this noisy vector. From the parameter
vector ξ_0, the two vectors W_u and W_v can be obtained and serve as the ground truth
for the perturbation calculation.
According to the proposed technique above, the matrix Ωf is first generated
as follows:
Ω_f = [ −31.96  −35.63  34.51  −0.0120  0.0071  −0.0350  0.2135  −0.1036  −0.0062 ]
      [  38.64  −34.31  34.51  −0.0007  0.0026   0.0158  0.2223   0.0496   0.0007 ]
      [    ⋮        ⋮       ⋮        ⋮        ⋮        ⋮        ⋮        ⋮        ⋮   ]
      [ −48.70  −53.32  33.33  −0.0034  0.0847   0.0129  0.2888  −0.1246   0.072  ]
      [  52.95  −54.02  33.33   0.0085  0.0766  −0.0606  0.2792   0.0997   0.0553 ]
which is very unbalanced since the entries of the first three columns are significantly
larger than those of the remaining columns. The computed condition number of the matrix
Ωf is 4484.4, which means that it is ill-conditioned. Similar to the original SVD
method, if Ωf is utilized directly to form two systems of linear equations (equations
6.13 and 6.14), then the computed perturbations of the estimated vectors Wu and
Wv are obtained as follows:
δW_u = (  0.001  −0.027  −0.020   1.590  −3.713   5.991  −0.331   0.625  −3.541 )^T
δW_v = ( −0.009  −0.012  −0.018  15.681  −1.585  −1.110   0.107  −0.352  −1.456 )^T
The preceding calculations show that the perturbations of the entries in W_u
and W_v corresponding to the columns of X^N in the matrix Ω_f are much smaller than
those of the other entries. Hence, in the constructed matrix F, each entry carries a
significantly different amount of perturbation. However, when taking its closest singular
matrix F′, all entries are treated equally, ignoring the large inequalities among the
perturbations associated with them. Ignoring these large inequalities makes the solution
of the original SVD decomposition technique inaccurate and unstable in the presence of
image noise. The perturbation δξ of the estimated parameter vector is computed as follows:

δξ = ( −0.5604  −0.2216  0.0407  −0.5767  −0.2449  0.0155  24.6269  7.4026 )^T
According to the normalization technique proposed in Section 6.4.3, a well-balanced
matrix Ω′_f is obtained. The condition number of the matrix Ω′_f is only 23.2,
which is significantly smaller than the original value of 4484.4. With the use of the
matrix Ω′_f, two new systems of linear equations can be formed, and the computed
perturbations of the estimated parameter vectors W′_u and W′_v are as follows:

δW′_u = (  1.19  −19.74  −8.90  −12.94   2.36  −5.53  14.57  −0.80   1.57 )^T
δW′_v = ( −6.69   −9.18  −3.66  −11.75  23.36  −2.36  −2.702  0.26  −0.88 )^T
It is obvious that the perturbations of the entries of W′_u and W′_v are of almost
the same magnitude. Hence, in the constructed matrix F_n, all the entries carry
approximately the same amount of perturbation. Similarly, when taking its closest
singular matrix F′_n, all entries are treated approximately equally. Therefore,
the computed perturbation δξ of the estimated parameter vector ξ is much smaller:

δξ = ( −0.0296  −0.0319  0.0407  −0.0171  0.0357  0.0155  −2.7909  −5.6009 )^T
Based on the preceding calculations, we see that the parameters can be estimated
more stably in the presence of image noise via the proposed N-SVD method.
6.5 Nonlinear Decomposition Method
The proposed N-SVD decomposition method will obtain a unique solution
for each face image. But if the solution is correct, then the recovered face pose
parameters mij must automatically satisfy the following constraints:
f_1(ξ) = m_11² + m_12² + m_13² − (m_21² + m_22² + m_23²) = 0    (6.17)
f_2(ξ) = m_11 m_21 + m_12 m_22 + m_13 m_23 = 0    (6.18)
The above constraints are derived from equation 6.6 by considering the face pose
rotation matrix R as an orthonormal matrix.
Apparently, however, both constraints are ignored by the proposed N-SVD
method, which cannot guarantee that the recovered matrix R is an orthonormal
matrix. Therefore, the above two constraints must be considered in order to guar-
antee that the recovered face pose rotation matrix R is orthonormal.
Due to the non-linearity of these constraints, a nonlinear optimization method
is utilized to recover the face pose and facial expression parameters simultaneously,
subject to the orthonormal constraints. From equation 6.7, the image projection
function for a facial feature can be re-written as follows:
u_i = A_1 X^N_i + Σ_{k=1}^{p} α_k A_1 ∆Q^k_i    (6.19)
v_i = A_2 X^N_i + Σ_{k=1}^{p} α_k A_2 ∆Q^k_i    (6.20)

Therefore, from the l facial features, we can build a positive error function f_e:

f_e(ξ) = Σ_{i=1}^{l} [ ∆u_i² + ∆v_i² ]    (6.21)
       = Σ_{i=1}^{l} [ (u_i − u_pi)² + (v_i − v_pi)² ]    (6.22)

where the image pixel errors ∆u_i and ∆v_i are the differences between the
tracked facial feature positions (u_i, v_i)^T and the predicted facial feature positions (u_pi, v_pi)^T,
computed by projecting the corresponding 3D facial feature points (x_i, y_i, z_i)^T via the derived
equations 6.19 and 6.20.
In the error function fe, there are 6+p unknowns, including 6 pose parameters
and p expression parameters. Therefore, the unknown parameter vector ξ∗ can be
estimated in terms of the following constrained minimization problem:
ξ∗ = arg minξ
fe,
subject to the constraints f1(ξ) and f2(ξ). (6.23)
The above constrained minimization is solved by the sequential quadratic pro-
gramming (SQP) method [97], which is one of the most effective methods for solving
the optimization problem with nonlinear constraints. At each step, a quadratic pro-
gramming (QP) subproblem is solved using an active set strategy [31].
The algorithm needs an initial value, which is provided by the proposed N-
SVD method. Therefore, given a face image, the first step for our algorithm is to
extract an initial value of ξ by the proposed N-SVD decomposition method; next,
the estimated value of ξ is further refined by the proposed nonlinear optimization
method. In this way, the nonlinear algorithm can efficiently converge to the optimal
value in less than 10 iterations. Experiments show that via the proposed estima-
tion algorithm, the face pose and facial expression parameters can be recovered
accurately and efficiently.
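A minimal sketch of this constrained refinement is given below using SciPy's SLSQP solver, which is a sequential quadratic programming method. The thesis uses its own SQP implementation inside the C++ system [97]; the code here only illustrates the structure of the problem, and the feature arrays, the deformation basis and the N-SVD initial value are assumed to be available (for example from the earlier sketches).

```python
import numpy as np
from scipy.optimize import minimize

def refine_pose_expression(xi0, U_obs, XN, dQ):
    """Refine the N-SVD estimate xi0 = (m11..m23, alpha_1..alpha_p) by
    minimizing the reprojection error f_e (equation 6.22) subject to the
    orthonormality constraints f_1 = f_2 = 0 (equations 6.17 and 6.18)."""
    def reprojection_error(xi):
        M = xi[:6].reshape(2, 3)
        X = XN + np.tensordot(xi[6:], dQ, axes=1)    # deformed 3D features
        return np.sum((U_obs - X @ M.T) ** 2)        # f_e(xi)

    def f1(xi):                                      # equal norms of the two rows of M
        m1, m2 = xi[:3], xi[3:6]
        return m1 @ m1 - m2 @ m2

    def f2(xi):                                      # orthogonality of the two rows of M
        return xi[:3] @ xi[3:6]

    res = minimize(reprojection_error, xi0, method="SLSQP",
                   constraints=[{"type": "eq", "fun": f1},
                                {"type": "eq", "fun": f2}],
                   options={"maxiter": 50})
    return res.x
```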
6.6 Experimental Results
Once the pose and expression parameters are recovered from the face images,
the facial expression will be independent of the face pose. In essence, if the decom-
position is successful, then both parameters will be estimated accurately; otherwise,
neither of them will be accurate. Therefore, the performance of the proposed mo-
tion decomposition technique can be evaluated on either the facial expression or face
pose parameters individually.
In the following, several experiments are conducted to test the validity of our
proposed motion decomposition algorithm. First, the performance is analyzed on a
set of synthetic data. Next, real face image sequences with various facial expressions
under natural head movements were collected and the performance of the proposed
techniques on them is subsequently analyzed. In these experiments, a Root-Mean-Square
Error (RMSE) measure is used to estimate the accuracy of the recovered
results. The RMSE of an estimated vector is defined as the square root of the
mean squared difference between the components of the estimated vector and those of the true
vector (see equations 6.25-6.27).
6.6.1 Performance on Synthetic Data
Synthetic data is generated from a set of basis facial deformation vectors and
a 3D neutral face mesh. The 3D neutral face mesh is obtained from a 3D generic
face model. Specifically, in this experiment, the number of basis facial deformation
vectors utilized is equal to 4. In addition, a set of face pose parameters M and
facial expression parameters αi are randomly sampled. Based on them, a sequence
that contains 300 frames of image points is generated. Finally, different levels of
Gaussian noise with a zero mean are added into the generated coordinates of image
points at each frame. For each noise level, the errors associated with the parameters
(pose and expression) recovered by the original SVD method, N-SVD method and
the nonlinear method are computed from the 300-frame sequence.
When computing the errors, the pose parameters M and expression parameters
αi originally employed to generate the synthetic data serve as the ground truth. In
addition, for the nonlinear decomposition method, the solution obtained from the
N-SVD method is employed as an initial estimate.
In order to express the parameter errors in a meaningful and measurable quan-
tity, the following error measurement technique is performed. First, from the face
pose parameters M , three face pose angles and the face scale factor are recovered.
Therefore, the pose parameter error is characterized by the face pose angle error
in degrees and the face scale factor error in percentage. Second, from the facial
expression parameters αk, the facial deformation vector ∆X is computed as follows:
∆X = Σ_{k=1}^{p} α_k ∆Q^k    (6.24)
Hence, the facial expression parameter error can be characterized by the RMSE of
the facial deformations in pixels.
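For reference, the sketch below shows one standard way to recover the face scale factor and three pose angles from an estimated 2 × 3 projection matrix M: the two rows are normalized, the third rotation row is completed by a cross product, and Euler angles are extracted from the resulting rotation matrix. This is an assumed, generic weak-perspective recovery with one particular Euler-angle convention, not necessarily the exact procedure used in the thesis.

```python
import numpy as np

def pose_from_projection(M):
    """Recover the scale factor and three pose angles (degrees) from the
    2x3 weak-perspective projection matrix M = s * [r1; r2]."""
    s = 0.5 * (np.linalg.norm(M[0]) + np.linalg.norm(M[1]))   # scale factor
    r1 = M[0] / np.linalg.norm(M[0])
    r2 = M[1] / np.linalg.norm(M[1])
    r3 = np.cross(r1, r2)                                     # complete the rotation
    R = np.vstack([r1, r2, r3])
    # Euler angles under one common convention (conventions vary).
    pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    yaw   = np.degrees(np.arcsin(np.clip(-R[2, 0], -1.0, 1.0)))
    roll  = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    return s, (pitch, yaw, roll)
```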
As discussed in Section 6.4, the original SVD decomposition method is very
sensitive to image noise; Figure 6.2 (a-c) illustrates its behavior on the synthesized
image frames in the presence of noise. Apparently, even small image noise produces
large errors in the estimated pose and expression parameters. For the proposed
N-SVD method, on the other hand, the stability of the estimated parameters under
image noise improves dramatically, as also shown in Figure 6.2 (a-c), and its accuracy
degrades gracefully with the image noise level.
In addition, Figure 6.3 illustrates the improvements of the estimated parame-
ters via the proposed nonlinear method over the synthesized 300 frames as a function
of the standard deviation of the Gaussian noise. It clearly shows that the nonlinear
method reduces the errors of the parameters estimated from the N-SVD method
significantly as the image noise level increases.
Figure 6.2: Average errors of the estimated parameters by the SVD method and the proposed N-SVD method, respectively, as a function of Gaussian noise: (a) face pose error; (b) face scale factor error; (c) facial deformation error
Figure 6.3: Average errors of the estimated parameters by the proposed N-SVD method and the nonlinear method, respectively, as a function of Gaussian noise: (a) face pose error; (b) face scale factor error; (c) facial deformation error
6.6.2 Performance on Real Image Sequences
6.6.2.1 Neutral Face Under Various Face Orientations
In this experiment, subjects move their faces freely in front of the camera
while keeping a neutral facial expression. Significant out-of-plane face rotations
are included, along with significant changes in the distance from the face to the camera.
Randomly selected face images with automatically tracked facial features from dif-
ferent image sequences are shown in Figure 6.4. Based on the tracked facial features,
the proposed motion decomposition method is subsequently applied to recover the
face pose and the facial expression for each face image.
Figure 6.4: The randomly selected face images from a set of different neutral face image sequences

Since these face image sequences contain only a neutral face, ideally, the extracted
facial deformation vector V is close to zero. Thus, the RMSE of an estimated
facial deformation vector V can be simplified as follows:

RMS(V) = √( ‖V‖² / n )    (6.25)
where n is the dimension of the vector. Therefore, the RMSE of the estimated
facial deformation vector can be used as a metric to evaluate the performance of
the proposed technique: the smaller the RMSE, the better the performance of the
proposed technique.
Figure 6.5 shows the calculated RMSEs of the estimated facial deformation
vectors with the proposed N-SVD method and the nonlinear method, respectively,
for the two image sequences shown in Figure 6.4. In these experiments, the output
of the N-SVD method is always refined by the nonlinear method. As shown in
Figure 6.5, the proposed nonlinear method works very well and improves the N-SVD
solution dramatically, so that the facial deformations can be estimated accurately.
On the other hand, if the pose effects on the facial features are ignored (i.e., all the
faces are treated as frontal faces), then the RMSEs of the calculated facial
deformation vectors are significantly larger, as also shown in Figure 6.5.
Figure 6.5: The calculated RMSEs of the estimated facial deformations

Figure 6.6 (a) also displays the three face pose angles estimated by the proposed
method, and the estimated face scale factor is displayed in Figure 6.6 (b).
The face scale factor characterizes the distance between the face and the camera.
Visually, both follow the movements of the face in the images very well.
6.6.2.2 Frontal Face with Different Facial Expressions
We also collected a set of face image sequences that contain significant facial
expressions from the frontal view. In each sequence, a person is changing his facial
expressions while facing the camera directly. Randomly selected face images with
automatically tracked facial features from these image sequences are shown in Figure
6.7.
Since the face in the image is frontal, the three pose angles of the face
must be approximately zero. In addition, the movements of the facial features in a
frontal face image can be extracted directly by subtracting their positions in the
neutral face image. Hence, for each face image, a facial deformation vector V′_k can be
obtained. Via the proposed motion decomposition method, a facial deformation
vector V_k can be estimated. Therefore, by choosing V′_k as the ground truth, the RMSE
of an estimated facial deformation vector V_k is defined as follows:

RMS(V_k) = √( ‖V_k − V′_k‖² / n )    (6.26)
Figure 6.6: (a) The three estimated face pose angles; (b) The estimated face scale factor
The calculated RMSE of an estimated facial deformation vector can be used
as a metric to evaluate the performance of the proposed motion decomposition tech-
nique: the smaller the RMSE, the better the performance of the proposed technique.
Figure 6.8 (a) illustrates the RMSEs of the estimated facial deformation vectors for
different face image sequences. They show that the proposed nonlinear
method improves the solution of the proposed N-SVD method significantly, so
that the average value of the computed RMSEs of the facial deformations is only
around 1 pixel, which is precise enough to discriminate most subtle facial expressions.
Table 6.1 quantitatively summarizes the calculated RMSEs of the estimated
facial deformation vectors for the image sequences used in these experiments.
Figure 6.7: The randomly selected images from a frontal face image sequence
Table 6.1: The average RMSEs of the extracted facial deformation vectors for different image sequences

Sequence Number    N-SVD Method (pixels)    Nonlinear Method (pixels)
1                  1.40                     0.75
2                  1.27                     0.92
3                  1.20                     0.89
Figure 6.8 (b) further illustrates the average error of the estimated face pose
angles for different face image sequences. The calculated average error of the
estimated face pose angles is summarized quantitatively in Table 6.2. The nonlinear
method improves the solution of the N-SVD method so that the average error of the
estimated face pose angles is approximately one degree.
Figure 6.8: (a) The calculated RMSE of the estimated facial deformation vectors; (b) The average error of the estimated face pose angles

Table 6.2: The average error of the extracted face pose angles for different image sequences

Sequence Number    N-SVD Method (degrees)    Nonlinear Method (degrees)
1                  1.51                      0.52
2                  1.71                      0.82
3                  3.00                      1.29

6.6.2.3 Non-neutral Face Under Various Face Orientations

We also collected a set of image sequences containing simultaneous face pose
and facial expression changes. In each sequence, a person rotates his head freely
in front of the camera (but always starting from the frontal view), while keeping the
facial expression unchanged. Randomly selected face images from these sequences are
shown in Figure 6.9.
For example, in the “happy” face image sequence, each image contains a face
with the same facial expression, while rotating in front of the camera. Therefore, the
facial deformation vectors in all face images are equivalent. Since frame 0 contains
a frontal face, its extracted facial deformation vector V0 can serve as the ground
truth. Specifically, V0 is obtained by directly subtracting the face in frame 0 from
its neutral face. Therefore, the RMSE of an estimated facial deformation vector V_k
is defined as follows:

RMS(V_k) = √( ‖V_k − V_0‖² / n )    (6.27)

Figure 6.9: The randomly selected images from three face image sequences with different facial expressions. (Top: happy; Middle: surprise; Bottom: disgust)
Figure 6.10 (a) shows the calculated RMSEs of the estimated facial deforma-
tion vectors for a “happy” face image sequence with the proposed technique. It
shows that the nonlinear technique performs much better than the N-SVD method
alone on the face images with simultaneous face pose and facial expression changes.
In addition, Figure 6.10 (b) and (c) display the three estimated face pose angles and
the estimated face scale factor, respectively. They visually follow the movements of
the face in the images very well.
Similar to the “happy” image sequence, the RMSEs of the estimated facial
deformation vectors from two other image sequences are illustrated in Figure 6.10.
Table 6.3 summarizes the calculated RMSEs of the estimated facial deformation
vectors for the image sequences used in these experiments. It clearly shows that the
proposed two-stage motion estimation technique achieves very good results, so that
most of the subtle facial expressions can still be discriminated via the recovered
facial deformations.

Figure 6.10: (a) The calculated RMSE of the estimated facial deformation vectors; (b) The three estimated face pose angles; (c) The estimated face scale factor
6.6.3 Processing Speed
The proposed facial motion decomposition technique is implemented using
C++ on a PC with a Xeon (TM) 2.80GHz CPU and 1.00 GB of RAM. The resolution
of the captured images is 320 × 240 pixels, and the built facial motion decomposition
system, integrated with the facial feature tracker, runs comfortably at approximately 20 fps.
Table 6.3: The average RMSEs of the extracted facial deformation vectors for different image sequences

Sequence Type    N-SVD Method (pixels)    Nonlinear Method (pixels)    Without Pose Elimination (pixels)
Happy            2.25                     0.88                         6.09
Surprise         2.01                     1.17                         5.31
Disgust          1.84                     1.01                         5.87
6.7 Chapter Summary
In this chapter, a novel technique is presented to simultaneously and accurately
recover the rigid and nonrigid facial motions from face images. The coupling
effects of both motions in the 2D face image are first analytically modelled by a
nonlinear motion projection function, and then decomposed by the proposed two-stage
motion decomposition technique in a very efficient way. Experiments show
that the proposed method can simultaneously recover the rigid and nonrigid motions
of the face very accurately.

The main contributions of this chapter are summarized as follows. First, the
face pose and the facial expression parameters are analytically integrated into a
unified formulation, which can be solved efficiently by a nonlinear optimization
method combined with the N-SVD technique. Second, a real-time system is built
so that the face pose and facial expression parameters can be extracted accurately
as soon as the user sits in front of the camera, without any makeup or markers.
CHAPTER 7
Facial Expression Recognition
7.1 Introduction
Via the technique proposed in Chapter 6, the rigid facial motion related to face
pose and the non-rigid facial motion related to facial expression can be separated
successfully from a face image. In addition, the recovered non-rigid facial motion
is composed of the movements of a set of prominent facial features relative to the
neutral face. Since these non-rigid facial feature movements are closely related to
the generation of facial expressions, the facial expressions can be recognized from
them intuitively. In this chapter, based on the non-rigid facial motions recovered
from the face images, a computational model is constructed to model and under-
stand the facial expressions with the use of Dynamic Bayesian Networks (DBN)
[128]. Although a facial expression recognition framework with the use of DBN has
been introduced [128], no working system has ever been built to recognize facial
expressions in practice. Therefore, in this chapter, efforts are focused on developing
a real-time working system based on DBN. Finally, a facial expression recognition
system is built so that the six basic facial expressions can be recognized successfully
under natural head movements in real time.
7.2 Facial Expressions with AUs
The Facial Action Coding System (FACS) developed by Ekman et al. [24] is
the most comprehensive method of coding facial expressions. With the use of Action
Units (AUs) defined in FACS, all possible visually distinguishable facial movements
can be coded successfully into either a single AU or a combination of different AUs.
In essence, a facial expression can be expressed as a combination of different AUs
uniquely. Specifically, for the six basic facial expressions, each facial expression is
characterized by a set of different AUs as shown in Table 7.1. For example, AU12
(lip corner puller) can be directly associated with an expression of “happy,” and
AU9 can be directly associated with an expression of “disgust.”
Table 7.1: The association of six basic facial expressions with AUs

Expressions    Associated AUs
Happy          6, 12, 25, 26
Anger          2, 4, 7, 17, 23, 24, 25, 26
Sadness        1, 4, 7, 15, 17
Disgust        9, 10, 17, 25, 26
Fear           1, 2, 4, 20, 25, 26
Surprise       1, 2, 5, 26, 27
7.3 Coding AUs with Feature Movement Parameters
A single AU describes a specific type of facial appearance change that occurs
with muscular contractions in a certain facial region. Usually, such appearance
changes are revealed directly by the movements of the facial features involved.
Therefore, the movements of facial features can be utilized to quantitatively measure
and code the AUs. The recovered non-rigid facial motion is composed of the movements
of the set of twenty-eight selected facial features shown in Figure 7.1; hence, it can be
used to code the AUs directly.
Figure 7.1: The spatial geometry of the selected facial features marked by the dark dots
Since an AU is related to several different facial features, Table 7.2 groups
the facial features into the AUs relevant to the six basic facial expressions. In addition,
a set of parameters named "Feature Movement Parameters" (FMPs), defined
and derived from the movements of the facial features, is shown in Table 7.2. In
total, 33 FMPs are defined. Each face image is normalized to the same scale
as the one in the neutral facial expression before the FMPs are extracted.
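To make the FMP definitions concrete, the sketch below derives a few of the parameters of Table 7.2 from tracked feature positions and their neutral-face counterparts. Only the Dx/Dy definitions follow the table; the dictionary-based feature representation, the key names and the example coordinates are illustrative assumptions, and the scale normalization to the neutral face is assumed to have been applied beforehand.

```python
import numpy as np

def displacement(tracked, neutral, name):
    """D(name) = tracked position minus neutral position of one feature;
    Dx and Dy below pick its x and y components (see Table 7.2)."""
    return np.asarray(tracked[name]) - np.asarray(neutral[name])

def extract_some_fmps(tracked, neutral):
    """Compute a few illustrative FMPs (feature keys follow Figure 7.1)."""
    Dy = lambda n: displacement(tracked, neutral, n)[1]
    Dx = lambda n: displacement(tracked, neutral, n)[0]
    return {
        "F1":  Dy("1.3"),                  # raise left inner eyebrow  (AU1)
        "F9":  Dx("1.3"),                  # squeeze left eyebrow      (AU4)
        "F22": Dy("4.1"),                  # raise left lip corner     (AU12)
        "F32": Dy("4.4") - Dy("4.5"),      # open jaw                  (AU25/26/27)
    }

# Usage with two hypothetical feature dictionaries {label: (x, y)}.
if __name__ == "__main__":
    neutral = {"1.3": (100, 80), "4.1": (90, 150), "4.4": (110, 150), "4.5": (110, 165)}
    tracked = {"1.3": (100, 76), "4.1": (92, 146), "4.4": (110, 148), "4.5": (110, 172)}
    print(extract_some_fmps(tracked, neutral))
```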
7.4 Modelling Spatial Dependency
Tables 7.1 and 7.2 deterministically characterize the relationships between
facial expressions and AUs as well as the relationships between FMPs and AUs. To
account for the uncertainties associated with facial feature movement measurements
and facial expressions, the otherwise deterministic relations are cast in a probabilistic
framework by a static BN model, as shown in Figure 7.2. The static BN model of
the facial expression consists of three layers: expression layer, facial AU layer, and
FMP layer.
Figure 7.2: The BN model of the six basic facial expressions. In this model, "HAP" represents "Happy," "ANG" represents "Anger," "SAD" represents "Sad," "DIS" represents "Disgust," "FEA" represents "Fear," and "SUR" represents "Surprise"
The expression layer consists of the hypothesis variable C, with six states
c_1, c_2, · · ·, c_6 representing the six basic expressions, and a set of attribute variables
denoted HAP, ANG, SAD, DIS, FEA and SUR corresponding to the six
basic facial expressions, as shown in Figure 7.2. The goal of this level of abstraction
is to find the probability distribution over the six facial expressions, i.e., the
probability of each class state c_i given the FMPs.
The AU layer is analogous to a linguistic description of the relationship between
AUs and facial expressions in Table 7.1. Each expression category, which is
actually an attribute node in the classification layer, consists of a set of AUs. These
AUs contribute visual cues to the understanding of the facial expression. The lowest
layer in the model is the sensory data layer containing the FMPs, as given in
Table 7.2. All the FMPs are observable, and they are connected to the corresponding
AUs. The value of an FMP is segmented into three ranges to differentiate the
intensity of an individual muscular action (e.g., low, middle, high). The ranges of
variation are determined by statistically analyzing the Cohn-Kanade facial expression
database [55].

Table 7.2: The association between facial action units and facial feature movement parameters (FMPs)

AU      Description            Name of FMP              Value                 FMP
AU1     Inner brow raiser      raise l i eyebrow        Dy(1.3)               F1
                               raise r i eyebrow        Dy(1.4)               F2
AU2     Outer brow raiser      raise l o eyebrow        Dy(1.1)               F3
                               raise r o eyebrow        Dy(1.6)               F4
AU4     Brow lowerer           lower l i eyebrow        Dy(1.3)               F5
                               lower r i eyebrow        Dy(1.4)               F6
                               lower l m eyebrow        Dy(1.2)               F7
                               lower r m eyebrow        Dy(1.5)               F8
                               squeeze l eyebrow        Dx(1.3)               F9
                               squeeze r eyebrow        Dx(1.4)               F10
AU5     Upper lid raiser       raise l t eyelid         Dy(2.2)               F11
                               raise r t eyelid         Dy(2.8)               F12
AU6     Cheek raiser           raise l eyecorner        Dy(2.1)               F13
                               raise r eyecorner        Dy(2.11)              F14
AU7     Lid tightener          close l eye              Dy(2.2)−Dy(2.3)       F15
                               close r eye              Dy(2.8)−Dy(2.9)       F16
AU9     Nose wrinkler          stretch l nose           Dy(3.2)               F17
                               stretch r nose           Dy(3.3)               F18
AU10    Upper lip raiser       raise t m lip            Dy(4.4)               F19
                               raise t l lip            Dy(4.2)               F20
                               raise t r lip            Dy(4.6)               F21
AU12    Lip corner puller      raise l c lip            Dy(4.1)               F22
                               raise r c lip            Dy(4.8)               F23
                               stretch l c lip          Dx(4.1)               F24
                               stretch r c lip          Dx(4.8)               F25
AU15    Lip corner depressor   lower l c lip            Dy(4.1)               F26
                               lower r c lip            Dy(4.8)               F27
AU17    Chin raiser            raise b m lip            Dy(4.5)               F28
AU20    Lip stretcher          stretch l c lip          Dx(4.1)               F24
                               stretch r c lip          Dx(4.8)               F25
                               raise b m lip            Dy(4.5)               F28
AU23    Lip tightener          tight l c lip            Dx(4.1)               F29
                               tight r c lip            Dx(4.8)               F30
AU24    Lip pressor            lower t m lip            Dy(4.4)               F31
                               raise b m lip            Dy(4.5)               F28
AU25    Lips part              open jaw (slight)        Dy(4.4)−Dy(4.5)       F32
                               lower b midlip (slight)  Dy(4.5)               F33
AU26    Jaw drop               open jaw (middle)        Dy(4.4)−Dy(4.5)       F32
                               lower b midlip (middle)  Dy(4.5)               F33
AU27    Mouth stretch          open jaw (large)         Dy(4.4)−Dy(4.5)       F32
                               lower b midlip (large)   Dy(4.5)               F33

Note: functions Dx and Dy extract the x and y components of a facial feature movement respectively (see Figure 7.1 for the facial feature definition). In the FMP names, "l" represents "left," "r" represents "right," "i" represents "inner," "o" represents "outer," "m" represents "middle," "t" represents "top," "b" represents "bottom," and "c" represents "corner."
Since the relationship between facial motion behaviors and facial expressions
is determined uniquely by human psychology, theoretically this alleviates the influence
of inter-personal variations on the facial expression model. Hence, the topology
of the BN facial expression model is invariant over time. Nevertheless, the model
needs to be parameterized by the conditional probabilities of the intermediate nodes.
The conditional probabilities of the AUs for a given facial expression are based on the
statistical results produced by a group of AU coders through visual inspection. The
parameters of the conditional probabilities in the FMP layer are estimated by Maximum
Likelihood Estimation from the FMPs extracted from 300 image sequences covering the
six basic expressions in the Cohn-Kanade facial expression database [55]. Each conditional
probability in the AU layer is known, which guarantees that the rest of the
parameters will converge to a local maximum on the likelihood surface.
7.5 Modelling Temporal Dynamics
Facial expression often reveals not only the nature of the deformation of the
face, but also the relative timing of facial actions and their temporal evolution.
A facial action occurs when muscular contraction begins and increases in in-
tensity. The apex of the process is indicated by the maximum excursion of the
muscle, while the offset is observed in the relaxation of muscular action. Modelling
such a temporal course of facial expressions allows us to better understand the facial
representation of human emotion at each stage of its development.
The static BN model of facial expression works with visual evidence and
beliefs at a single time instant, and it lacks the ability to capture the temporal
dependencies between consecutive occurrences of an expression in image sequences.
In contrast, a Dynamic Bayesian Network (DBN) can be utilized to achieve spatio-temporal
analysis and interpretation of facial expressions.
Our DBN model is made up of interconnected time slices of a static BN, and
the relationships between two neighboring time slices are linked by a first-order
HMM. The relative timing of facial actions during the emotional evolution process
is expressed by moving a time window in accordance with the frame motion of a video
sequence, so that the visual information at the previous time provides diagnostic
support for the current expression hypothesis.
Eventually, the values for the current hypothesis are inferred from the com-
bined information of current visual cues through causal dependencies in the current
time slice, as well as from the preceding evidence obtained from temporal dependen-
cies. Figure 7.3 shows the temporal dependencies derived by linking the top nodes
of the BN model given in Figure 7.2.
Figure 7.3: The temporal links of the DBN for modelling facial expression (two time slices are shown since the structure repeats by "unrolling" the two-slice BN). Node notations are given in Figure 7.2
The expression hypothesis obtained from the preceding time slice serves as a
priori information for the current hypothesis, and it is integrated with the current data to
produce a posteriori estimate of the current facial expression. More details can be found
in [128].
Therefore, after the DBN facial expression model is parameterized, given a
set of FMPs measured from a face image, the facial expression can be inferred
successfully from the DBN model via belief propagation over time.
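The temporal inference can be pictured as a simple forward (filtering) recursion over the expression hypothesis: the posterior from the previous time slice, propagated through the inter-slice transition probabilities, serves as the prior that is combined with the likelihood computed from the current slice's FMP evidence. The sketch below illustrates only this recursion with a hand-written transition matrix and hypothetical per-frame likelihoods; the actual system performs belief propagation in the full DBN of Figures 7.2 and 7.3, so the numbers and the simplified structure here are purely illustrative.

```python
import numpy as np

EXPRESSIONS = ["HAP", "ANG", "SAD", "DIS", "FEA", "SUR"]

def forward_update(prev_belief, likelihood, transition):
    """One time slice of first-order temporal filtering:
    belief_t is proportional to likelihood_t * (transition^T @ belief_{t-1})."""
    prior = transition.T @ prev_belief          # temporal prediction
    belief = likelihood * prior                 # combine with current evidence
    return belief / belief.sum()                # normalize to a distribution

if __name__ == "__main__":
    n = len(EXPRESSIONS)
    # Sticky transition model: expressions tend to persist between frames.
    T = np.full((n, n), 0.04) + np.eye(n) * 0.80
    T /= T.sum(axis=1, keepdims=True)
    belief = np.full(n, 1.0 / n)                # start from a uniform (neutral) state
    # Hypothetical per-frame likelihoods P(FMPs | expression) from the static BN.
    frame_likelihoods = [np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1]),
                         np.array([0.6, 0.1, 0.1, 0.1, 0.05, 0.05])]
    for lik in frame_likelihoods:
        belief = forward_update(belief, lik, T)
        print(EXPRESSIONS[int(np.argmax(belief))], np.round(belief, 3))
```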
7.6 Experimental Results
The output of the DBN model is the probability distribution over the six
basic facial expressions as a function of the image frame. If the six basic facial
expressions are equally likely, the face is in an absolutely neutral state.
Otherwise, the facial expression is identified as the one with the highest probability.
However, if multiple facial expressions show prominent probabilities,
this indicates that either the subject performs a blended emotion or the system
confuses the facial actions.
Figure 7.4: Upper: a video sequence with 700 frames containing the six basic facial expressions; only 8 snapshots are shown for illustration. Bottom: the output result shows the probability distributions (emotional intensities) over the six basic facial expressions resulting from sampling the sequence every 7 frames
Figure 7.4 illustrates the system output, which shows the temporal course of facial
expressions resulting from sampling a video sequence every 7 frames. The sequence
has 700 frames, containing the six basic facial expressions plus neutral states among
them. Though there is a certain amount of confusion (e.g., between
surprise and fear) due to ambiguity in appearance and errors in tracking,
the overall performance of modelling the dynamics of emotional expressions is good.
The inability of current facial expression recognition systems to correlate and reason
about facial temporal information over time is an impediment to providing a coherent
overview of the dynamic behavior of facial expressions in an image sequence.
The proposed approach enables an automated facial expression recognition system
not only to recognize the facial expressions, but also to model their temporal behavior,
so that the various stages of the development of a human emotion can be visually
analyzed and dynamically interpreted by machine.
To further quantitatively characterize the performance of our technique, we
compare the automatic facial expression recognition against manual recognition on
another 700-frame sequence. The manually labelled frames are compared
with the results from our automated system. Table 7.3 summarizes the confusion
statistics against the ground truth obtained by visual inspection. The results show
that the performance is quite good for this sequence. However, our essential
purpose is to perceive the temporal course and intensity of facial expressions from
an image sequence rather than to maximize the recognition accuracy for individual images.
Table 7.3: Confusion statistics from the 700-frame sequence

GROUND    RECOGNIZED FACIAL EXPRESSIONS
TRUTH     HAP    SUP    SAD    ANG    DIS    FEA    NEU    Tot.
HAP        96      0      0      0      0      0     10     106
SUP         0     78      0      4      0      0      8      90
SAD         0      0     92      0      0      8      5     105
ANG         0      0      0     80      0      0      6      86
DIS         3      7      0      0     64      0      3      77
FEA         0      9     13      0      0     73      8     103
NEU         0      0      0      0      0      0    133     133
Tot.       99     94    105     84     64     81    173     700

Note: NEU denotes neutral.
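From these confusion statistics, per-expression recognition rates are straightforward to derive. The small sketch below computes row-wise recall and overall accuracy from the numbers in Table 7.3; it is only a convenience calculation, not part of the thesis system.

```python
import numpy as np

LABELS = ["HAP", "SUP", "SAD", "ANG", "DIS", "FEA", "NEU"]
# Rows: ground truth; columns: recognized (numbers from Table 7.3).
CONF = np.array([[96, 0, 0, 0, 0, 0, 10],
                 [0, 78, 0, 4, 0, 0, 8],
                 [0, 0, 92, 0, 0, 8, 5],
                 [0, 0, 0, 80, 0, 0, 6],
                 [3, 7, 0, 0, 64, 0, 3],
                 [0, 9, 13, 0, 0, 73, 8],
                 [0, 0, 0, 0, 0, 0, 133]])

recall = CONF.diagonal() / CONF.sum(axis=1)      # per-class recognition rate
overall = CONF.diagonal().sum() / CONF.sum()     # overall accuracy
for name, r in zip(LABELS, recall):
    print(f"{name}: {r:.2%}")
print(f"overall: {overall:.2%}")
```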
7.6.1 Processing Speed
The proposed facial expression modelling is implemented using C++ on a PC
with a Xeon (TM) 2.80GHz CPU and 1.00 GB of RAM. The resolution of the captured
images is 320 × 240 pixels, and the built facial expression recognition system,
integrated with the facial feature tracker as well as the facial motion decomposition,
runs comfortably at approximately 20 fps.
7.7 Chapter Summary
In this chapter, based on the non-rigid facial motion recovered from face images,
a probabilistic model is constructed to model and understand the six basic
facial expressions with the use of Dynamic Bayesian Networks (DBN). A real-time
facial expression recognition system has been successfully built so that the six basic
facial expressions can be recognized under natural head movements in real time.
Compared to most existing facial expression recognition systems, our system
allows natural head movements, which is beyond the state of the art in real-time
facial expression recognition research.
CHAPTER 8
Conclusion
This thesis addresses the problems of real-time and non-intrusive human facial be-
havior understanding for Human Computer Interaction. Computer vision techniques
characterizing three typical facial behaviors, namely eye gaze, head gesture, and fa-
cial expression, are developed in this thesis. Several fundamental issues associated
with each computer vision technique are addressed. In addition, we also make the-
oretical contributions in several areas of computer vision including object detection
and tracking, motion analysis and estimation, and pose estimation.
Specifically, the main contributions of the thesis are summarized as follows:
1. We present a new real time eye detection and tracking methodology that
works under variable and realistic lighting conditions as well as various face
orientations.
2. From the detected eye images, we propose an improved eye gaze tracking algo-
rithm that allows natural head movement, with minimum personal calibration.
3. Via the proposed Case-Based Reasoning with a confidence paradigm, a ro-
bust visual tracking framework is proposed to track the faces under significant
changes in lighting, scale, facial expression and head movement.
4. With the use of robust feature representation via Gabor wavelets as well as
the global shape constraint among the facial features, twenty-eight prominent
facial features are detected and tracked simultaneously.
5. Based on the set of tracked facial features, a robust decomposition method is
presented to separate the rigid head motion from the non-rigid facial motion
in the 2D face images.
6. Based on the recovered non-rigid facial motions, a DBN model is utilized to
recognize six basic facial expressions under natural head movement.
Experiments were conducted to test these proposed algorithms with numerous sub-
jects under different lighting conditions. The algorithms were found to be robust,
reliable and accurate.
Based on our findings and experimental results, future research could focus on
improving each technique individually, as follows. For gaze tracking, high-resolution
cameras can be utilized to improve gaze estimation accuracy, and new techniques
are needed to further expand the volume of head movement for more natural inter-
actions between humans and computers. The facial feature tracking currently fails
when some of the tracked features are occluded under large head rotations; an
important future objective is therefore to improve its robustness for near-profile
face images. Finally, for the proposed CBR-based face tracking system, automatic
maintenance of the face case base requires further study.
Besides further improving each proposed technique, an important future re-
search goal is to build a prototype system that integrates all the component tech-
niques proposed in this research. Combined with a probabilistic user model, the
prototype system may be used to infer the user’s needs, intentions and affective
states for an effective human computer interaction.