
REAL-TIME HUMAN FACIAL BEHAVIOR UNDERSTANDING FOR HUMAN COMPUTER

INTERACTION

By

Zhiwei Zhu

A Thesis Submitted to the Graduate

Faculty of Rensselaer Polytechnic Institute

in Partial Fulfillment of the

Requirements for the Degree of

DOCTOR OF PHILOSOPHY

Major Subject: Department of Electrical, Computer and Systems Engineering

Approved by the Examining Committee:

Qiang Ji, Thesis Adviser

Badrinath Roysam, Member

John Wen, Member

Wayne Gray, Member

Rensselaer Polytechnic Institute
Troy, New York

December 2005
(For Graduation December 2005)


REAL-TIME HUMAN FACIAL BEHAVIOR UNDERSTANDING FOR HUMAN COMPUTER

INTERACTION

By

Zhiwei Zhu

An Abstract of a Thesis Submitted to the Graduate

Faculty of Rensselaer Polytechnic Institute

in Partial Fulfillment of the

Requirements for the Degree of

DOCTOR OF PHILOSOPHY

Major Subject: Department of Electrical, Computer and Systems Engineering

The original of the complete thesis is on file in the Rensselaer Polytechnic Institute Library

Examining Committee:

Qiang Ji, Thesis Adviser

Badrinath Roysam, Member

John Wen, Member

Wayne Gray, Member

Rensselaer Polytechnic Institute
Troy, New York

December 2005
(For Graduation December 2005)


© Copyright 2005

by

Zhiwei Zhu

All Rights Reserved


CONTENTS

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

ACKNOWLEDGEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Vision-Based Human Sensing . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Fundamental Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2. Real-Time Eye Detection and Tracking . . . . . . . . . . . . . . . . . . . . 9

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Eye Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.2.1 Initial Eye Position Detection . . . . . . . . . . . . . . . . . . 16

2.2.2 Eye Verification Using Support Vector Machines . . . . . . . . 18

2.2.2.1 Support Vector Machines . . . . . . . . . . . . . . . 19

2.2.2.2 SVM Training . . . . . . . . . . . . . . . . . . . . . 20

2.2.2.3 Retraining Using Mis-labeled Data . . . . . . . . . . 22

2.2.2.4 Eye Detection with SVM . . . . . . . . . . . . . . . 23

2.3 Eye Tracking Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.3.1 Eye (Pupil) Tracking with Kalman Filtering . . . . . . . . . . 24

2.3.2 Mean Shift Eye Tracking . . . . . . . . . . . . . . . . . . . . . 28

2.3.2.1 Similarity Measure . . . . . . . . . . . . . . . . . . . 29

2.3.2.2 Eye Appearance Model . . . . . . . . . . . . . . . . . 29

2.3.2.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 30

2.3.2.4 Mean Shift Tracking Parameters . . . . . . . . . . . 31

2.3.2.5 Experiments On Mean Shift Eye Tracking . . . . . . 33

2.4 Combining Kalman Filtering Tracking with Mean Shift Tracking . . . 34

2.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.5.1 Eye Tracking Under Significant Head Pose Changes . . . . . . 36

2.5.2 Eye Tracking Under Different Illuminations . . . . . . . . . . 37


2.5.3 Eye Tracking With Glasses . . . . . . . . . . . . . . . . . . . . 39

2.5.4 Eye Tracking With Multiple People . . . . . . . . . . . . . . . 40

2.5.5 Occlusion Handling . . . . . . . . . . . . . . . . . . . . . . . . 41

2.5.6 Tracking Accuracy Validation . . . . . . . . . . . . . . . . . . 41

2.5.7 Processing Speed . . . . . . . . . . . . . . . . . . . . . . . . . 43

2.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3. Eye Gaze Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.2.1 2D Mapping-Based Gaze Estimation Technique . . . . . . . . 45

3.2.2 Direct 3D Gaze Estimation Technique . . . . . . . . . . . . . 47

3.3 Direct 3D Gaze Estimation Technique . . . . . . . . . . . . . . . . . . 49

3.3.1 The Structure of Human Eyeball . . . . . . . . . . . . . . . . 49

3.3.2 Derivation of 3D Cornea Center . . . . . . . . . . . . . . . . . 50

3.3.2.1 The Structure of Cornea . . . . . . . . . . . . . . . . 50

3.3.3 Image Formation in the Convex Mirror . . . . . . . . . . . . . 51

3.3.3.1 Glint Formation In Cornea Reflection . . . . . . . . . 52

3.3.3.2 Curvature Center of the Cornea . . . . . . . . . . . . 53

3.3.4 Computation of 3D Gaze Direction . . . . . . . . . . . . . . . 55

3.3.4.1 Estimation of Optic Axis . . . . . . . . . . . . . . . 55

3.3.4.2 Compensation of the Angle Deviation between Visual Axis and Optic Axis . . . . . . . . . . . . 56

3.4 2D Mapping-Based Gaze Estimation Technique . . . . . . . . . . . . 57

3.4.1 Classical PCCR Technique . . . . . . . . . . . . . . . . . . . . 57

3.4.2 Head Motion Effects on Pupil-glint Vector . . . . . . . . . . . 58

3.4.3 Dynamic Head Compensation Model . . . . . . . . . . . . . . 60

3.4.3.1 Approach Overview . . . . . . . . . . . . . . . . . . 60

3.4.3.2 Image Projection of Pupil-glint Vector . . . . . . . . 61

3.4.3.3 First Case: The cornea center and the pupil center lie on the camera’s X − Z plane . . . . . . . . . 63

3.4.3.4 Second Case: The cornea center and the pupil center do not lie on the camera’s X − Z plane . . . . . 64

3.4.3.5 Iterative Algorithm for Gaze Estimation . . . . . . . 66

3.5 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

3.5.1 System Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 67


3.5.2 Performance of 3D Gaze Tracking Technique . . . . . . . . . . 68

3.5.2.1 Gaze Estimation Accuracy . . . . . . . . . . . . . . . 68

3.5.2.2 Comparison with Other Methods . . . . . . . . . . . 70

3.5.3 Performance of 2D Mapping Based Gaze Tracking Technique . 70

3.5.3.1 Head Compensation Model Validation . . . . . . . . 70

3.5.3.2 Gaze Estimation Accuracy . . . . . . . . . . . . . . . 71

3.6 Comparison of Both Techniques . . . . . . . . . . . . . . . . . . . . . 73

3.6.1 Processing Speed . . . . . . . . . . . . . . . . . . . . . . . . . 74

3.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

4. Robust Face Tracking Using Case-Based Reasoning with Confidence . . . . 76

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.3 The Mathematical Framework . . . . . . . . . . . . . . . . . . . . . . 79

4.3.1 2D Visual Tracking . . . . . . . . . . . . . . . . . . . . . . . . 79

4.3.2 The Proposed Solution . . . . . . . . . . . . . . . . . . . . . . 80

4.4 The CBR Visual Tracking Algorithm . . . . . . . . . . . . . . . . . . 81

4.4.1 Case-Based Reasoning . . . . . . . . . . . . . . . . . . . . . . 81

4.4.2 Case Base Construction . . . . . . . . . . . . . . . . . . . . . 83

4.4.3 Case Retrieving . . . . . . . . . . . . . . . . . . . . . . . . . . 84

4.4.4 Case Adaption (Reusing) . . . . . . . . . . . . . . . . . . . . . 85

4.4.5 Case Revising and Retaining . . . . . . . . . . . . . . . . . . . 86

4.5 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

4.5.1 Drifting-Elimination Capability . . . . . . . . . . . . . . . . . 87

4.5.2 Confidence-Assessment Capability . . . . . . . . . . . . . . . . 89

4.5.3 Performance under Illumination Changes . . . . . . . . . . . . 90

4.5.4 Processing Speed . . . . . . . . . . . . . . . . . . . . . . . . . 91

4.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

5. Real-Time Facial Feature Tracking . . . . . . . . . . . . . . . . . . . . . . 93

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

5.2 Facial Feature Representation . . . . . . . . . . . . . . . . . . . . . . 95

5.2.1 Pyramidal Gabor Wavelets . . . . . . . . . . . . . . . . . . . . 97

5.2.2 Fast Phase-Based Displacement Estimation . . . . . . . . . . . 98

5.3 Facial Feature Detection . . . . . . . . . . . . . . . . . . . . . . . . . 99


5.3.1 Facial Feature Approximation . . . . . . . . . . . . . . . . . . 100

5.3.2 Facial Feature Refinement . . . . . . . . . . . . . . . . . . . . 101

5.4 Facial Feature Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . 103

5.4.1 Facial Feature Prediction . . . . . . . . . . . . . . . . . . . . . 103

5.4.2 Facial Feature Measurement . . . . . . . . . . . . . . . . . . . 105

5.4.3 Facial Feature Correction . . . . . . . . . . . . . . . . . . . . 105

5.4.3.1 Facial Feature Refinement . . . . . . . . . . . . . . . 106

5.4.3.2 Imposing Geometry Constraints . . . . . . . . . . . . 106

5.5 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

5.5.1 Facial Feature Tracking Accuracy . . . . . . . . . . . . . . . . 112

5.5.2 Processing Speed . . . . . . . . . . . . . . . . . . . . . . . . . 114

5.6 Comparison with IR-Based Eye Tracker . . . . . . . . . . . . . . . . . 114

5.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

6. Nonrigid and Rigid Facial Motion Estimation . . . . . . . . . . . . . . . . 116

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

6.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

6.3 Pose and Expression Modelling . . . . . . . . . . . . . . . . . . . . . 119

6.3.1 3D Face Representation . . . . . . . . . . . . . . . . . . . . . 119

6.3.2 3D Deformable Face Model . . . . . . . . . . . . . . . . . . . 120

6.3.3 3D Motion Projection Model . . . . . . . . . . . . . . . . . . . 122

6.4 Normalized SVD for Pose and Expression Decomposition . . . . . . . 123

6.4.1 SVD Decomposition Method . . . . . . . . . . . . . . . . . . . 123

6.4.2 Condition of the Linear System . . . . . . . . . . . . . . . . . 124

6.4.3 Normalization SVD Technique . . . . . . . . . . . . . . . . . . 126

6.4.4 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 128

6.5 Nonlinear Decomposition Method . . . . . . . . . . . . . . . . . . . . 130

6.6 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

6.6.1 Performance on Synthetic Data . . . . . . . . . . . . . . . . . 132

6.6.2 Performance on Real Image Sequences . . . . . . . . . . . . . 134

6.6.2.1 Neutral Face Under Various Face Orientations . . . . 134

6.6.2.2 Frontal Face with Different Facial Expressions . . . . 136

6.6.2.3 Non-neutral Face Under Various Face orientations . . 138

6.6.3 Processing Speed . . . . . . . . . . . . . . . . . . . . . . . . . 141

6.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142


7. Facial Expression Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 143

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

7.2 Facial Expressions with AUs . . . . . . . . . . . . . . . . . . . . . . . 143

7.3 Coding AUs with Feature Movement Parameters . . . . . . . . . . . . 144

7.4 Modelling Spatial Dependency . . . . . . . . . . . . . . . . . . . . . . 145

7.5 Modelling Temporal Dynamics . . . . . . . . . . . . . . . . . . . . . . 147

7.6 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

7.6.1 Processing Speed . . . . . . . . . . . . . . . . . . . . . . . . . 151

7.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

8. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154


LIST OF TABLES

2.1 Experiment results using 3 kernel types with different parameters . . . . 22

2.2 Tracking statistics comparison for both trackers under different eye conditions (open, closed, occluded) on the first person . . . . . . . . 39

2.3 Tracking statistics comparison for both trackers under different eye conditions (open, closed, occluded) on the second person . . . . . . 40

3.1 The gaze estimation accuracy for the first subject. . . . . . . . . . . . . 69

3.2 The gaze estimation accuracy for seven subjects . . . . . . . . . . . . . 69

3.3 Comparison with other systems . . . . . . . . . . . . . . . . . . . . . . 70

3.4 Pupil-glint vector comparison at different eye locations . . . . . . . . . 71

3.5 Gaze estimation accuracy under different eye image resolutions . . . . . 73

6.1 The average RMSEs of the extracted facial deformation vectors for different image sequences . . . . . . . . . . . . . . . . . . . . . . . . 138

6.2 The average error of the extracted face pose angles for different image sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

6.3 The average RMSEs of the extracted facial deformation vectors for different image sequences . . . . . . . . . . . . . . . . . . . . . . . . 142

7.1 The association of six basic facial expressions with AUs . . . . . . . . . 144

7.2 The association between facial action units and facial feature movement parameters (FMPs) . . . . . . . . . . . . . . . . . . . . . . . . 146

7.3 Confusion statistics from the 700-frame sequence . . . . . . . . . . . . . 150


LIST OF FIGURES

2.1 The disappearance or weakness of the bright pupils due to (a) eye closure, (b) oblique face orientation, (c) eye glasses glare and (d) strong external illumination interference . . . . . . . . . . . . . . . . . . . 12

2.2 The combined eye tracking flowchart . . . . . . . . . . . . . . . . . . 15

2.3 The bright-pupil (a) and dark-pupil (b) images . . . . . . . . . . . . 15

2.4 Eye detection block diagram . . . . . . . . . . . . . . . . . . . . . . 16

2.5 Background illumination interference removal: (a) the even-field images obtained under both ambient and IR light; (b) the odd-field images obtained under only ambient light; (c) the difference images resulting from subtracting (b) from (a) . . . . . . . . . . . . . . . . . 17

2.6 The thresholded difference image marked with pupil candidates . . . 17

2.7 The thresholded difference image after removing some blobs based on their geometric properties (shape and size). The blobs marked with circles are selected for further consideration . . . . . . . . . . . . . . 18

2.8 (a) The thresholded difference image superimposed with possible pupil candidates. (b) The dark image marked with possible eye candidates according to the positions of pupil candidates in (a) . . . . . . . . . 21

2.9 The eye images in the positive training set . . . . . . . . . . . . . . . 21

2.10 The non-eye images in the negative training set . . . . . . . . . . . . 22

2.11 The result images (a) and (b) marked with identified eyes. Compared with the images in Figure 2.8 (b), many false alarms have been removed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.12 The combined eye tracking flowchart . . . . . . . . . . . . . . . . . . 25

2.13 The eye images: (a)(b) left and right bright-pupil eyes; (c)(d) corresponding left and right dark-pupil eyes . . . . . . . . . . . . . . . . . 29

2.14 (a) The image frame 13; (b) Values of the Bhattacharyya coefficient corresponding to the marked region (40 × 40 pixels) around the left eye in frame 13. The mean shift algorithm converges from the initial location (∗) to the convergence point, which is a mode of the Bhattacharyya surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.15 The error distribution of tracking results: (a) error distribution vs. intensity quantization values and different window sizes; (b) error distribution vs. quantization levels only . . . . . . . . . . . . . . . . . . 33


2.16 Mean-shift tracking of both eyes with an initial search area of 40 × 40 pixels, as represented by the large black rectangle. The eyes marked with white rectangles in frame 1 are used as the eye model and the tracked eyes in the following frames are marked by the smaller black rectangles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2.17 (a) Image of frame 135, with the initial eye position marked and the initial search area outlined by the large black rectangle. (b) Values of the Bhattacharyya coefficient corresponding to the marked region (40 × 40 pixels) around the left eye in (a). The mean shift algorithm cannot converge from the initial location (which is in the valley between two modes) to the correct mode of the surface. Instead, it is trapped in the valley . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

2.18 The bright-pupil-based Kalman tracker fails to track eyes due to the absence of bright pupils caused by either eye closure or oblique face orientations. The mean shift eye tracker, however, tracks the eyes successfully, as indicated by the black rectangles . . . . . . . . . . . 35

2.19 An image sequence to demonstrate the drift-away problem of the mean-shift tracker as well as the correction of the problem by the integrated eye tracker. Frames (a-e) show the drift-away case of the mean-shift eye tracker; for the same image sequence, frames (A-E) show the improved results of the combined eye tracker. White rectangles show the eyes tracked by the Kalman tracker while the black rectangles show the eyes tracked by the mean shift tracker . . . . . . 36

2.20 Tracking results of the combined eye tracker for a person undergoing significant head movements . . . . . . . . . . . . . . . . . . . . . . . 37

2.21 Tracking results of the combined eye tracker for four image sequences (a), (b), (c) and (d) under significant head movements . . . . . . . . 38

2.22 Tracking results of the combined eye tracker for two image sequences (a) and (b) under significant illumination changes . . . . . . . . . . . 39

2.23 Tracking results of the combined eye tracker for two image sequences (a) and (b) with persons wearing glasses . . . . . . . . . . . . . . . . 40

2.24 Tracking results of the combined eye tracker for multiple persons . . 41

2.25 Tracking results of the combined eye tracker for an image sequence involving multiple persons occluding each other’s eyes . . . . . . . . 42

2.26 The comparison between the automatically tracked eye positions and the manually located eye positions for 100 randomly selected consecutive frames: (a) x coordinate and (b) y coordinate . . . . . . . . . . 42


3.1 Eye images with corneal reflection (glint): (a) dark pupil image (b) bright pupil image. The glint is a small bright spot as indicated in (a) and (b) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.2 The structure of the eyeball (top view of the right eye) . . . . . . . . 49

3.3 The reflection diagram of Purkinje images . . . . . . . . . . . . . . . 51

3.4 A ray diagram to locate the image of an object in a convex mirror . . 51

3.5 The image formation of a point light source in the cornea when the cornea serves as a convex mirror . . . . . . . . . . . . . . . . . . . . 53

3.6 The ray diagram of the virtual image of the IR light source in front of the cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.7 The ray diagram of two IR light sources in front of the cornea . . . . 54

3.8 Pupil and glint image formation when the eyes are located at different positions while gazing at the same screen point (side view) . . . . . . 59

3.9 The pupil-glint vectors generated in the eye images when the eye is located at O1 and O2 in Figure 3.8 . . . . . . . . . . . . . . . . . . . 59

3.10 Pupil and glint image formation when the eye is located at different positions in front of the camera . . . . . . . . . . . . . . . . . . . . . 62

3.11 Pupil and glint image formation when the eye is located at different positions in front of the camera (top-down view) . . . . . . . . . . . 63

3.12 Projection into the camera’s X − Z plane . . . . . . . . . . . . . . . 65

3.13 The configuration of the gaze-tracking system . . . . . . . . . . . . . 67

3.14 The pupil-glint vector transformation errors: (a) transformation error on the X component of the pupil-glint vector, (b) transformation error on the Y component of the pupil-glint vector . . . . . . . . . . . . . 71

3.15 The plot of the estimated gaze points and the true gaze points, where “+” represents the estimated gaze point and “*” represents the actual gaze point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

4.1 The diagram of the proposed algorithm to improve the accuracy of the 2D object tracking model . . . . . . . . . . . . . . . . . . . . . . 80

4.2 The Case-Based Cycle of the face tracking system . . . . . . . . . . . 83


4.3 Comparison of the face tracking results with different techniques. The first row shows the tracking results of the incremental subspace learning technique, with the tracked face marked by a red rectangle; the second row shows the tracking results of our proposed technique, with the tracked face marked by a dark square. The images are frames 26, 126, 352, 415 and 508 from left to right . . . . . . . . . . . . . . . . 87

4.4 Comparisons of the tracked face position error: (a) between the proposed tracker and the two-frame tracker; (b) between the proposed tracker and the offline tracker; (c) between the proposed tracker and the incremental subspace learning tracker . . . . . . . . . . . . . . . 88

4.5 (a) The similarities computed by the two-frame tracker; (b) The RMSE errors computed by the incremental subspace learning tracker; (c) The confidence scores computed by our proposed tracker . . . . . . . . . 89

4.6 The face tracking results with significant facial expression changes, large head movements and occlusion. For each frame, the tracked face is marked by a dark square. The upper row displays image frames 29, 193, 211 and 236 from left to right, while the lower row displays image frames 237, 238, 412 and 444 from left to right . . . . . . . . . . . . 90

4.7 The estimated confidence measures . . . . . . . . . . . . . . . . . . . 91

4.8 Face tracking results under significant changes in illumination and head movement. The tracked face is marked by a dark square in each image. From left to right, the selected image frames are 75, 246, 295 and 444 in the first row, while the second row displays image frames 662, 706, 745 and 898 . . . . . . . . . . . . . . . . . . . . . . . . . . 91

5.1 The flowchart of the proposed facial feature detection and tracking algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.2 An image pyramid with three levels: (a) base level contains a 320 × 240 pixel image; (b) first level contains a 160 × 120 pixel image; (c) second level contains an 80 × 60 pixel image; (d) third level contains a 40 × 30 pixel image . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5.3 A face mesh with facial features . . . . . . . . . . . . . . . . . . . . . 100

5.4 Spatial geometry of the facial features in a frontal face region. Eyes are marked by the small white rectangles, the face region is marked with large white rectangles, and the facial features are marked with white circles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

5.5 (a) face mesh, (b) face image with approximated facial features, (c) face image with refined facial features . . . . . . . . . . . . . . . . . 101


5.6 The face images with detected facial features under different facial expressions: (a) disgust, (b) anger, (c) surprise and (d) happy . . . . 102

5.7 The flowchart of the proposed tracking algorithm . . . . . . . . . . . 103

5.8 (a) A frontal face image, and (b) its 3D face geometry, with the selected facial features marked as the white dots . . . . . . . . . . . . 107

5.9 The randomly selected face images from different image sequences . . 112

5.10 The computed position errors of the automatically extracted facial features by the proposed facial feature tracker: (a) position errors in the X-direction for each facial feature; (b) position errors in the Y-direction for each facial feature . . . . . . . . . . . . . . . . . . . . . 113

6.1 (a) The spatial geometry of the selected facial features marked by the dark dots; (b) The 3D face mesh with the selected facial features marked by the white dots . . . . . . . . . . . . . . . . . . . . . . . . 120

6.2 Average errors of the parameters estimated by the SVD method and the proposed N-SVD method, respectively, as a function of Gaussian noise: (a) face pose error; (b) face scale factor error; (c) facial deformation error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

6.3 Average errors of the parameters estimated by the proposed N-SVD method and the nonlinear method, respectively, as a function of Gaussian noise: (a) face pose error; (b) face scale factor error; (c) facial deformation error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

6.4 The randomly selected face images from a set of different neutral face image sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

6.5 The calculated RMSEs of the estimated facial deformations . . . . . 136

6.6 (a) The three estimated face pose angles; (b) The estimated face scale factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

6.7 The randomly selected images from a frontal face image sequence . . 138

6.8 (a) The calculated RMSE of the estimated facial deformation vectors; (b) The average error of the estimated face pose angles . . . . . . . . 139

6.9 The randomly selected images from three face image sequences with different facial expressions. (Top: happy; Middle: surprise; Bottom: disgust) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

6.10 (a) The calculated RMSE of the estimated facial deformation vectors; (b) The estimated three face pose angles; (c) The estimated face scale factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141


7.1 The spatial geometry of the selected facial features marked by the dark dots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

7.2 The BN model of six basic facial expressions. In this model, “HAP” represents “Happy,” “ANG” represents “Anger,” “SAD” represents “Sad,” “DIS” represents “Disgust,” “FEA” represents “Fear,” and “SUR” represents “Surprise” . . . . . . . . . . . . . . . . . . . . . . 145

7.3 The temporal links of the DBN for modelling facial expression (two time slices are shown since the structure repeats by “unrolling” the two-slice BN). Node notations are given in Figure 7.2 . . . . . . . . . 148

7.4 Upper: a video sequence with 700 frames containing the six basic facial expressions; only 8 snapshots are shown for illustration. Bottom: the output result shows the probability distributions (emotional intensities) over the six basic facial expressions, obtained by sampling the sequence every 7 frames . . . . . . . . . . . . . . . . . . . . . . . 149


ACKNOWLEDGEMENT

I would like to express my deepest thanks to my thesis advisor, Professor Qiang Ji, for his expert guidance and the valuable suggestions he contributed to this dissertation. Without his help, this dissertation would not have been possible. Over the past five and a half years I have learned a great deal from him, and his energetic working style has influenced me greatly. I am truly grateful to have met Professor Qiang Ji.

My sincere thanks also go to Professor Wayne Gray, who helped me enrich my knowledge significantly while I worked with his team over the past few years. I also thank the rest of my committee members, Professor Badrinath Roysam and Professor John Wen, whose valuable feedback helped me improve the dissertation significantly.

I thank all the members of my group for their wonderful support. My special thanks go to Wenhui Liao for her time and outstanding advice on the development of some of the ideas in this thesis throughout these years.

This dissertation could not have been accomplished without Shuwen Xia, my girlfriend, who has always stood by me no matter how difficult the situation, and who gives me warm encouragement and love in every circumstance.

Last but not least, I thank my parents and my sister very much for supporting me through all these years.


ABSTRACT

To enhance the interaction between human and computer, a major task for the Human Computer Interaction (HCI) community is to equip the computer with the ability to recognize the user's affective states, intentions and needs in real time and in a non-intrusive manner. Using video cameras together with a set of computer vision techniques to interpret and understand human behaviors, vision-based human sensing technology has the advantages of non-intrusiveness and naturalness. Since the human face contains rich and powerful information about human behaviors, it has been studied extensively. Typical facial behaviors characterizing human states include eye gaze, head gestures and facial expression. This research focuses on developing real-time and non-intrusive computer vision techniques to understand and recognize various facial behaviors.

Specifically, we have developed a range of computer vision techniques. First, by systematically combining an appearance model with the bright-pupil effect of the eye, we develop a new real-time technique to robustly detect and track the eyes under variable lighting and face orientations. Second, we introduce a new gaze estimation method for robustly tracking eye gaze under natural head movement and with minimum personal calibration. Third, a robust visual tracking framework is proposed to track faces under significant changes in lighting, scale, facial expression and face movement. Fourth, given the detected face, we develop a new technique for detecting and tracking twenty-eight facial features under significant facial expressions and various face orientations. Fifth, based on the set of tracked facial features, a framework is proposed to recover the rigid and non-rigid facial motions from a monocular image sequence. Subsequently, from the recovered non-rigid facial motions, a Dynamic Bayesian Network is utilized to model and recognize the six basic facial expressions under natural head movement.

All of these techniques have been tested extensively with numerous subjects under various situations such as different lighting conditions, significant head movements, and wearing glasses. Experimental study shows significant improvement of our techniques over the existing techniques.


CHAPTER 1

Introduction

Today, the keyboard and mouse are the main devices for information exchange between human and computer. Interacting with keyboard-and-mouse-based computers, however, can be a cumbersome experience, because it requires the user to adapt to the computer by learning how to use the keyboard and mouse. In our daily life, we employ vision, hearing and touch as natural ways of interaction to communicate with one another. Although we give off much information about ourselves, the computer's inability to recognize this complex information dramatically limits its ability to help humans. If the computer could understand the visual and audio information from the human, then it would be able to communicate with humans in natural ways. As a result, rather than requiring the human to adapt to the computer, the computer could adapt to the human intelligently, as if it were a human itself, by understanding the human: what the human's mood is, where the human is looking, what the human is doing, and how the human performs. Therefore, equipping the computer with the ability to see and sense the human will make the interaction between human and computer easier, more efficient, more intuitive, and more flexible.

Thus, advanced interface technologies that support human-like interaction between human and computer need to be developed; these will help computers hear (speech recognition), speak (speech synthesis), see (face tracking, eye tracking, human body tracking), and sense (gaze estimation, face recognition, affect recognition). However, developing such interface technologies is challenging.

Recently, using a video camera together with a set of computer vision techniques, researchers on vision-based human sensing technology have been trying to provide computers with the capability of perception: seeing and sensing the human. To achieve this, numerous research topics have been explored for HCI, such as hand gesture analysis, head gesture analysis, lip movement analysis, eye gaze estimation, facial expression analysis, and the analysis of other body movements. But most of this work is still not fast, robust and efficient enough to be integrated into functioning user interfaces.

In the following section, we will discuss the current developments of these vision-based human sensing techniques for HCI.

1.1 Vision-Based Human Sensing

By vision, we refer to the use of video cameras and a set of visual or graphical techniques for representing and processing information. Non-vision-based human sensing methods are fairly intrusive in that they require physical contact with the user. For example, Picard et al. [91] have tried to recognize human affective states based on four different physiological signals that measure facial muscle activity, heart activity, skin conductance and respiration. These physiological signals are collected from a set of physiological sensors attached to different parts of the human body, such as the face, fingers and chest. Although these sensors are designed to be as small as possible, they are still very invasive and deprive the user of ease and naturalness when he or she interacts with the computer-controlled environment. In contrast, vision-based human sensing does not require any physical contact with the user; in fact, it works when the user is physically located at a distance from the sensors. For example, with remotely located video cameras and computer vision techniques, the eye gaze can be estimated and used to control the computer remotely [114], which can even free our hands for other tasks while we use our gaze to control the computer. Therefore, compared to non-vision-based human sensing methods, vision-based human sensing has the advantage of unobtrusiveness, and it gives a sense of “naturalness” and comfort during human-computer interaction. Furthermore, the ever-decreasing price/performance ratio of computing, coupled with recent decreases in the cost of video image acquisition, implies that computer vision systems can be deployed in desktop and embedded systems [87, 88, 89].

There are numerous areas of research and application where vision-based human sensing techniques have been studied for HCI. For example, Maes et al. [69] developed the “ALIVE” system, which allows wireless full-body interaction between a user and a 3D graphical world in a computer. A vision system extracts information about the user, such as the 3D location of the user and the positions of various body parts, as well as simple hand gestures. A user can therefore interact directly with the ALIVE space by gestures, as if he were in the real, physical space, for example playing with a virtual dog via head gestures. A fully automated and interactive narrative play-space for children, called the KidsRoom, was demonstrated by Bobick et al. [9]. Using a vision-based action recognition technique to recognize what the children are doing, the KidsRoom automatically reacts to the children's actions by providing entertaining feedback, and the children are aware that the room is responsive.

A smart kiosk interface was presented by Waters et al. [115]. Equipped with vision-based techniques, the kiosk can detect a user in front of it and communicate with the user automatically as the user approaches. However, some important information about the user, such as facial expression and gaze, is not extracted, which limits the kiosk's functions significantly. Various other potential applications have also been demonstrated, such as a platform for simultaneously tracking multiple people and recognizing their behaviors for high-level interpretation [100]. Other systems include a gaze-assisted translator [46], an eye interpretation engine [22], an intelligent mediator [99] and a driver's fatigue monitoring system [51].

In this thesis, we focus on facial behavior analysis and recognition using remote video cameras rather than extra measurement devices such as helmets or special sensors. This research is important because, despite recent research and technological efforts, advanced human-computer interaction devices still suffer from a number of problems. Two major problems with the current devices are intrusiveness (wearing a helmet or attaching a set of sensors) and the expense of the special hardware they require. These problems, if not solved, are likely to hamper the widespread use of the next generation of computer interfaces. We wish to develop a set of vision-based techniques to overcome these problems. We have developed a set of close-range visual sensing technologies that can identify the user's eyelid movement, eye gaze, head movement and facial expressions accurately. Equipped with these vision-based human sensing technologies along with a user cognitive model, the computer will become aware of the user's intentions and mental states. Hence, the computer will respond or react to the user's actions intelligently, which can enhance the interaction between human and computer significantly. This thesis describes the computer vision algorithms needed for such vision-based human sensing technologies.

1.2 Fundamental Issues

As one of the most salient features of the human face, the eyes play an important role in interpreting and understanding a person's desires, needs, and emotional states. In particular, the eye gaze, indicating where a person is looking, can reveal the person's focus of attention. Recently, Zhai et al. [125] proposed an approach named MAGIC pointing, which utilizes eye gaze to place the cursor in the vicinity of every new object the user looks at. Rather than controlling the movements of the cursor by hand all the time, the user only needs to refine the cursor's position near the object. Hence, with the aid of eye gaze, the amount of stress on the hand can be reduced significantly. Such an interface can also allow handicapped people to control systems via eye-gaze input, giving them a way to help themselves independently. However, the design and implementation of such an interface faces several challenges. The most difficult challenge is the eye tracking technology itself, which is not yet robust and accurate enough.

A good eye tracker is therefore a prerequisite for eye gaze monitoring. Robust techniques for eye detection are of particular importance to eye-gaze tracking systems. Information about the eyes can also be used to detect human faces, which can be further analyzed to obtain face pose information. Many eye tracking methods rely on intrusive techniques, such as measuring the electric potential of the skin around the eyes or applying special contact lenses that facilitate eye tracking, which causes serious problems of user acceptance. To alleviate these problems, we have developed a non-intrusive eye tracker that can detect and track a user's eyes in real time as soon as the face appears in the view of the camera. The eye tracker is aided by active IR lighting and leaves no markers on the user's face. By combining a conventional appearance-based object recognition method (Support Vector Machines) and an object tracking method (mean shift) with Kalman filtering based on active IR illumination, our technique is able to benefit from the strengths of the different techniques and overcome their respective limitations. Experimental study shows significant improvement of our technique over the existing techniques.
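To make this combination concrete, the sketch below outlines how such a hybrid tracking loop might be organized. It is an illustrative Python sketch with hypothetical detector and tracker callables, not the actual implementation described in Chapter 2: Kalman filtering on the bright pupils drives tracking while the pupils are visible, the mean shift appearance tracker takes over when the bright-pupil cue is lost, and the SVM verifier rejects false positives.

    def track_eyes(frames, detect_pupils, kalman_track, mean_shift_track, svm_verify):
        """Yield one eye-location estimate per frame (None when the eyes are lost)."""
        location = None
        for frame in frames:
            if location is None:
                # (Re)initialize from the active-IR bright-pupil detector.
                location = detect_pupils(frame)
            else:
                prev = location
                # Prefer Kalman filtering on the bright pupils while they are visible.
                location = kalman_track(frame, prev)
                if location is None:
                    # Bright pupils lost (eye closure, glare, oblique pose):
                    # fall back to mean shift on the eye appearance model.
                    location = mean_shift_track(frame, prev)
            if location is not None and not svm_verify(frame, location):
                # Reject patches that do not look like an eye; force re-detection.
                location = None
            yield location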

After the eyes are tracked successfully, the eye gaze is subsequently extracted from the tracked eyes. Unlike most existing gaze tracking techniques, which often assume a static head and require a cumbersome calibration process for each person, our gaze tracker performs robust and accurate gaze estimation with one-time calibration and under rather significant head movement. When the head moves to a new position, the gaze mapping function at this new position is automatically updated by the proposed dynamic computational head compensation model to accommodate the eye position changes. Our proposed method dramatically increases the usability of the eye gaze tracking technology, and we believe it is a significant step toward the eye tracker being accepted as a natural computer input device.
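For background, the classical PCCR (pupil center corneal reflection) idea underlying mapping-based gaze estimation is to learn, from a short calibration session, a mapping from the measured pupil-glint vector to the gaze point on the screen. The sketch below fits an illustrative second-order polynomial mapping by least squares; it is a generic example rather than the exact mapping or the head compensation model developed in Chapter 3.

    import numpy as np

    def fit_pccr_mapping(pg_vectors, screen_points):
        """Fit (gx, gy) = f(dx, dy) from calibration data.

        pg_vectors    : (N, 2) pupil-glint vectors measured while the user
                        fixated N known calibration targets
        screen_points : (N, 2) screen coordinates of those targets
        Returns the (6, 2) coefficient matrix of the polynomial mapping.
        """
        dx, dy = pg_vectors[:, 0], pg_vectors[:, 1]
        # Design matrix of monomials up to second order.
        A = np.column_stack([np.ones_like(dx), dx, dy, dx * dy, dx**2, dy**2])
        coeffs, *_ = np.linalg.lstsq(A, screen_points, rcond=None)
        return coeffs

    def estimate_gaze(coeffs, pg_vector):
        dx, dy = pg_vector
        a = np.array([1.0, dx, dy, dx * dy, dx**2, dy**2])
        return a @ coeffs  # (gx, gy) in screen coordinates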

Furthermore, head gestures – from the simplest actions of nodding or head-shaking to the most complex head movements that express our feelings and reveal our cognitive states – are a kind of non-verbal interaction among people. Besides head gestures, facial expression is another important cue that can reveal our emotions and intentions directly. Vision-based head gesture and facial expression recognition have also been studied extensively. However, accurate and fast human face detection and tracking is a crucial first step for both, and it still remains a very challenging task under changes in lighting, scale, facial expression and head movement. Therefore, in this thesis, we propose a robust visual tracking framework based on Case-Based Reasoning with a confidence paradigm, so that the face can be tracked robustly under significant changes in lighting, scale, facial expression and face orientation.

Based on the tracked face images, a set of twenty-eight prominent facial features are detected and tracked automatically. In reality, however, the image appearance of the facial features varies significantly among different individuals. Even for a specific person, the appearance of the facial features is easily affected by lighting conditions, face orientations and facial expressions. Therefore, in order to compensate for the image appearance changes during tracking, we developed a robust tracking algorithm based on a shape-constrained correction mechanism so that the facial features can be detected and tracked successfully under these challenging situations. Subsequently, the spatio-temporal relationships among the tracked facial features can be utilized to recover the facial motions.

The facial motion is the sum of rigid motion related to the face pose and non-rigid motion related to the facial expression. The two motions are nonlinearly coupled in the captured face image, so they cannot be recovered easily. In this thesis, a novel technique is proposed to recover the 3D rigid and non-rigid facial motions simultaneously and in real time from a set of tracked facial features in a monocular video sequence. First, the coupling between the rigid and non-rigid motions in the image is expressed analytically by a nonlinear model. Subsequently, techniques are proposed to decompose the nonlinear coupling between them so that the pose and expression parameters can be recovered simultaneously. Experiments show that the proposed method can recover the rigid and non-rigid facial motions very accurately.
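To illustrate the kind of coupling involved, a commonly used formulation (given here only as a generic example under a weak-perspective camera, not necessarily the exact model derived in Chapter 6) projects a deformable 3D face model into the image as

    p_i = s \, P \, R \Big( \bar{S}_i + \sum_k q_k \Phi_{k,i} \Big) + t

where \bar{S}_i is the neutral 3D position of the i-th facial feature, the \Phi_{k,i} are deformation basis vectors weighted by the non-rigid expression coefficients q_k, R and t are the rigid head rotation and translation, s is a scale factor, and P is the 2 × 3 orthographic projection. The product of the rotation R with the deformation term is what couples the rigid and non-rigid parameters nonlinearly in the observed feature positions p_i.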

Once the rigid and non-rigid facial motions are separated successfully from the face images, the facial expressions are recognized from the recovered non-rigid facial motions. A Dynamic Bayesian Network (DBN) is constructed to model the facial expression based on the recovered non-rigid facial motions. In the DBN model, the non-rigid facial motions are probabilistically combined with Ekman's Facial Action Coding System (FACS) to understand the facial expressions. With the DBN model, the spatial dependencies, uncertainties and temporal behaviors of facial expressions can be addressed in a coherent and unified hierarchical probabilistic framework. Hence, the facial expressions can be recognized robustly over time. Furthermore, since the recovered non-rigid motions are independent of the face pose, the facial expressions can be recognized under arbitrary face orientations.

Through this research, we have developed an integrated prototype system that tracks a person's eyelid movement, eye gaze, head movement, face pose and facial expression all in real time. The specific contributions of this research address several fundamental issues associated with the development of real-time computer vision algorithms for

1. real-time human eye detection and tracking under various lighting conditions and face orientations.

2. real-time eye gaze estimation under natural head movements, with minimum personal calibration.

3. real-time face tracking under changes in lighting, facial expression and face orientation.

4. real-time facial feature detection and tracking under various face orientations and significant facial expression changes.

5. real-time recovery of 3D rigid and nonrigid facial motions from an uncalibrated monocular camera.

6. real-time facial expression analysis under natural head movements.

In addition, we also make theoretical contributions in several areas of computer vision including object detection and tracking, motion analysis and estimation, and pose estimation.

1.3 Structure of the Thesis

This thesis is arranged as follows. Chapter 2 presents a new real-time eye tracking methodology that works under variable, realistic lighting conditions and various face orientations. Chapter 3 describes techniques for real-time eye gaze tracking under natural head movement and with minimum personal calibration. Chapter 4 proposes a robust visual tracking framework to track faces with significant facial expressions under various face orientations and lighting conditions in real time. Chapter 5 proposes a novel technique to detect and track twenty-eight prominent facial features under different face orientations and various facial expressions in real time. Chapter 6 proposes a framework for recovering 3D rigid and nonrigid facial motions from a monocular image sequence obtained from an uncalibrated camera. Subsequently, Chapter 7 recognizes facial expressions from the recovered nonrigid facial motions under natural head movements. Finally, Chapter 8 summarizes this research and discusses possible future research directions.


CHAPTER 2

Real-Time Eye Detection and Tracking

2.1 Introduction

As one of the salient features of the human face, human eyes play an important role in face detection, face recognition and facial expression analysis. Robust non-intrusive eye detection and tracking is a crucial step for vision-based man-machine interaction technology to be widely accepted in common environments such as homes and offices. Eye tracking has also found applications in other areas, including monitoring human vigilance [51], gaze-contingent smart graphics [53], and assisting people with disabilities. The existing work in eye detection and tracking can be classified into two categories: traditional image-based passive approaches and active IR-based approaches. The former approaches detect eyes based on the unique intensity distribution or shape of the eyes. The underlying assumption is that the eyes appear different from the rest of the face in both intensity and shape; eyes can be detected and tracked by exploiting these differences. The active IR-based approach, on the other hand, exploits the spectral (reflective) properties of pupils under near-IR illumination to produce the bright/dark pupil effect, and accomplishes eye detection and tracking by detecting and tracking pupils.

Traditional methods can be broadly classified into three categories: template-based methods [124, 119, 61, 127, 54, 26, 27, 83, 36], appearance-based methods [90, 42, 41] and feature-based methods [57, 56, 59, 58, 106, 112, 93, 102, 103]. In the template-based methods, a generic eye model, based on the eye shape, is designed first. Template matching is then used to search the image for the eyes. Nixon [83] proposed an approach for accurate measurement of eye spacing using the Hough transform. The eye is modelled by a circle for the iris and a “tailored” ellipse for the sclera boundary. Their method, however, is time-consuming, needs a high-contrast eye image, and only works with frontal faces. Deformable templates are commonly used [124, 119, 61]. First, an eye model is designed, which is allowed to translate, rotate and deform to fit the best representation of the eye shape in the image. Next, the eye position can be obtained through a recursive energy-minimization process. While this method can detect eyes accurately, it requires that the eye model be properly initialized near the eyes. Furthermore, it is computationally expensive and requires good image contrast to converge correctly.
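For intuition, deformable-template fitting of this kind can be viewed as minimizing an energy function over the template parameters (position, orientation and shape), starting from an initial guess near the eye. The toy sketch below applies generic numerical gradient descent to a user-supplied energy function; it only illustrates the recursive energy-minimization idea and is not the deformable template method of [124, 119, 61].

    import numpy as np

    def fit_template(energy, params0, step=0.01, iters=200, eps=1e-4):
        """Minimize an energy E(params) by numerical gradient descent.

        energy  : callable mapping a parameter vector (e.g. template position,
                  orientation and shape parameters) to a scalar that is low
                  when the template matches the image well.
        params0 : initial guess, which must already be close to the true eye
                  for the minimization to converge to it.
        """
        params = np.asarray(params0, dtype=float)
        for _ in range(iters):
            grad = np.zeros_like(params)
            for j in range(params.size):
                d = np.zeros_like(params)
                d[j] = eps
                # Central-difference estimate of dE/dparams[j].
                grad[j] = (energy(params + d) - energy(params - d)) / (2 * eps)
            params -= step * grad
        return params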

The appearance-based methods [90, 42, 41] detect eyes based on their photometric appearance. These methods usually need to collect a large amount of training data representing the eyes of different subjects, under different face orientations and different illumination conditions. These data are used to train a classifier, such as a neural network or an SVM, and detection is achieved via classification. In [90], Pentland et al. extended the eigenface technique to the description and coding of facial features, yielding eigeneyes, eigennoses and eigenmouths. For eye detection, they extracted appropriate eye templates for training and constructed a principal component projective space called “Eigeneyes,” accomplishing eye detection by comparing a query image with an eye image in the eigeneyes space. Huang et al. [42] also employed eigeneyes to perform initial eye-position detection. Huang et al. [41] presented a method to represent the eye image using wavelets and to perform eye detection using an RBF neural network classifier. Reinders et al. [93] proposed several improvements to the neural network-based eye detector. The trained neural network eye detector can detect rotated or scaled eyes under different lighting conditions, but it is trained only for frontal face images.
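As an illustration of the eigeneyes idea, the sketch below builds a principal-component subspace from flattened training eye patches and scores a candidate patch by its distance from that subspace; it is a minimal NumPy sketch under assumed array shapes, not the implementation of Pentland et al. [90].

    import numpy as np

    def build_eigeneyes(train_patches, k=10):
        """train_patches: (N, h*w) array of flattened, aligned eye images."""
        mean = train_patches.mean(axis=0)
        centered = train_patches - mean
        # Principal components ("eigeneyes") of the training eyes via SVD.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return mean, vt[:k]

    def eye_distance(patch, mean, basis):
        """Distance of a candidate patch from the eigeneye subspace.

        A small value means the patch is well explained by the eigeneyes
        and is therefore likely to be an eye.
        """
        x = patch.ravel() - mean
        projection = basis.T @ (basis @ x)  # reconstruction within the subspace
        return np.linalg.norm(x - projection)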

A number of feature-based methods explore characteristics of the eyes (such as the edge and intensity of the iris, and the color distributions of the sclera and the flesh) to identify distinctive features around the eyes. Kawato et al. [56] proposed a feature-based method for eye detection and tracking. Instead of detecting the eyes themselves, they proposed to detect the point between the two eyes, which the authors believe is more stable and easier to detect than the individual eyes. The eyes are subsequently detected as two dark parts, symmetrically located on each side of the between-eye point. Feng et al. [26, 27] designed a new eye model consisting of six landmarks (eye corner points). Their technique first locates the eye landmarks based on the variance projection function (VPF), and the located landmarks are then employed to guide the eye detection. Unfortunately, experiments show that their method fails if the eye is closed or partially occluded by hair or face orientation; in addition, their technique may mistake eyebrows for eyes. Tian et al. [106] proposed a new method to track the eye and extract the eye parameters. The method requires manually initializing the eye model in the first frame. The eye's inner corner and eyelids are tracked using a modified version of the Lucas-Kanade tracking algorithm [68]. The edge and intensity of the iris are used to extract the shape information of the eye. Their method, however, requires a high-contrast image to detect and track the eye corners and to obtain a good edge image.

In summary, the traditional image-based eye tracking approaches detect and track the eyes by exploiting the eyes' differences in appearance and shape from the rest of the face. Special characteristics of the eye, such as the dark pupil, white sclera, circular iris, eye corners and eye shape, are utilized to distinguish the human eye from other objects. But due to eye closure, eye occlusion, variability in scale and location, different lighting conditions, and face orientations, these differences often diminish or even disappear. Wavelet filtering [82, 98] has been commonly used in computer vision to reduce the illumination effect by removing subbands sensitive to illumination changes; however, it only works under slight illumination variation, and the illumination variation in eye tracking applications can be significant. Hence, the eye image may not look much different in appearance or shape from the rest of the face, and the traditional image-based approaches cannot work very well, especially for faces with non-frontal orientations, under different illuminations, and for different subjects.

Eye detection and tracking based on active remote IR illumination is a simple yet effective approach. It exploits the spectral (reflective) properties of the pupil under near-IR illumination. Numerous techniques [21, 19, 81, 80, 37, 51] have been developed based on this principle, including some commercial eye trackers [2, 1]. They all rely on an active IR light source to produce the dark or bright pupil effect. Ebisawa et al. [21] generate the bright/dark pupil images based on a differential lighting scheme using two IR light sources (on and off the camera axis). The eye can be tracked effectively by tracking the bright pupils in the difference image resulting from subtracting the dark pupil image from the bright pupil image. Later, in [19], they further improved their method by using pupil brightness stabilization to eliminate glass reflections. Morimoto et al. [81] also utilize a differential lighting scheme to generate the bright/dark pupil images, and pupil detection is done after thresholding the difference image. A larger temporal support is used to reduce artifacts caused mostly by head motion, and geometric constraints are used to group the pupils.
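A minimal version of this differential-lighting pupil detection can be sketched as follows: subtract the dark-pupil image from the bright-pupil image, threshold the difference, and keep connected blobs of plausible pupil size. The code below is an illustrative OpenCV sketch operating on a hypothetical pair of grayscale frames; the threshold and size limits are made-up values, not those used by the systems cited above.

    import cv2

    def detect_pupil_candidates(bright_img, dark_img, thresh=40,
                                min_area=10, max_area=400):
        """Return centroids of pupil-like blobs from a bright/dark image pair.

        bright_img, dark_img : grayscale frames captured with the on-axis and
                               off-axis IR illuminators, respectively.
        """
        # Bright pupils survive the subtraction; the background and most
        # external illumination cancel out.
        diff = cv2.subtract(bright_img, dark_img)
        _, binary = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)

        # Connected components are the candidate pupil blobs.
        n, _, stats, centroids = cv2.connectedComponentsWithStats(binary)
        candidates = []
        for i in range(1, n):                 # label 0 is the background
            area = stats[i, cv2.CC_STAT_AREA]
            if min_area <= area <= max_area:  # crude size-based filtering
                candidates.append(tuple(centroids[i]))
        return candidates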

Most of these methods require a distinctive bright/dark pupil effect to work well. The success of such a system strongly depends on the brightness and size of the pupils, which are often affected by several factors including eye closure, eye occlusion due to face rotation, external illumination interference, and the distance of the subject from the camera. Figure 2.1 summarizes different conditions under which the pupils may not appear very bright or may even disappear. These conditions include eye closure as shown in Figure 2.1 (a), oblique face orientations as shown in Figure 2.1 (b), the presence of other bright objects (due to either eye glasses glare or motion) as shown in Figure 2.1 (c), and external illumination interference as shown in Figure 2.1 (d).


Figure 2.1: The disappearance or weakness of the bright pupils due to (a) eye closure, (b) oblique face orientation, (c) eye glasses glare and (d) strong external illumination interference

The absence of bright pupils, or weak pupil intensity, poses serious problems for

the existing eye tracking methods using IR in that they all require relatively stable

lighting conditions, users close to the camera, small out-of-plane face rotations, and

open and un-occluded eyes. These conditions impose serious restrictions on the part

of their systems as well as on the user, and therefore limit their application scope.

Realistically, however, lighting can be variable in many application domains; the

natural movement of the head often involves out-of-plane rotation, and eye closures

due to blinking and winking are physiological necessities for humans. Furthermore,


thick eye glasses tend to disturb the infrared light so much that the pupils appear

very weak. It is therefore very important for the eye tracking system to be able to

robustly and accurately track eyes under these conditions as well.

To alleviate some of these problems, Ebisawa [19] proposed an image difference

method based on two light sources to perform pupil detection under various lighting

conditions. The background can be eliminated using the image difference method,

and the pupils can be easily detected by setting the threshold as low as possible in the

difference image. Ebisawa [19] also proposed an ad hoc algorithm for eliminating

glare on glasses, based on thresholding and morphological operations. However,

the automatic determination of the threshold and the structure element size for

morphological operations is difficult, and the threshold value cannot be set as low

as possible considering the efficiency of the algorithm. Also, eliminating the noise

blobs just according to their sizes is not enough.

Haro [37] proposed performing pupil tracking based on combining eye appear-

ance, the bright pupil effect, and motion characteristics so that pupils can be sepa-

rated from other equally bright objects in the scene. To do so, Haro [37] proposed

to verify the pupil blobs using a conventional appearance-based matching method

and the motion characteristics of the eyes. But their method cannot track closed or

occluded eyes, or eyes with weak pupil intensity due to interference from external

illuminations. Ji et al. [51] proposed a real time subtraction and a special filter

to eliminate the external light interferences, but their technique fails to track the

closed/occluded eyes. To handle the presence of other bright objects, their method

performs pupil verification based on the shape and size of pupil blobs to eliminate

spurious pupil blobs. Usually, however, spurious blobs have a shape and size similar to those of the pupil blobs, making it difficult to distinguish the real pupil blobs

from the noise blobs based on only shape and size.

In this chapter, we propose a real-time robust method for eye tracking

under variable lighting conditions and face orientations, based on combining the

appearance-based methods and the active IR illumination approach. Combining

the respective strengths of different complementary techniques and overcoming their

shortcomings, the proposed method uses an active infrared illumination to brighten


subject’s faces to produce the bright pupil effect. The bright pupil effect and the

appearance of eyes are utilized simultaneously for eye detection and tracking. The

latest technologies in pattern classification (the SVM) and in object

tracking (the mean-shift) are employed for pupil detection and tracking based on

eye appearance. Some of the ideas presented in this chapter have been briefly

reported in [135] and [132].

In this chapter, we report our algorithm in detail. Our method consists of

two parts: eye detection and eye tracking. Eye detection is accomplished by si-

multaneously utilizing the bright/dark pupil effect under active IR illumination and

the eye appearance pattern under ambient illumination via the SVM classifier. Eye

tracking is composed of two major modules. The first module is a conventional

Kalman filtering tracker based on the bright pupil. The Kalman filtering tracker is

augmented with the SVM classifier [15, 40] to perform verification of the detected

eyes. If the Kalman filtering eye tracker fails due to either weak pupil intensity or

the absence of bright pupils, eye tracking based on the mean shift is activated [12]

to continue tracking the eyes. Eye tracking returns to the Kalman filtering tracker

as soon as the bright pupils reappear, since eye tracking using bright pupils is much

more robust than the mean shift tracker which, we find, tends to drift away. The

two trackers alternate, complementing each other and overcoming their limitations.

Figure 2.2 summarizes our eye tracking algorithm.

2.2 Eye Detection

To facilitate subsequent image processing, the person’s face is illuminated using

a near-infrared illuminator. The use of an infrared illuminator serves three purposes:

first, it minimizes the impact of different ambient light conditions, therefore ensur-

ing image quality under varying real-world conditions including poor illumination,

day, and night; second, it allows production of the bright/dark pupil effect, which

constitutes the foundation for the proposed eye detection and tracking algorithm;

third, since near infrared is barely visible to the user, it minimizes interference with

the user’s work. According to the original patent (from Hutchinson [43]), a bright

pupil can be obtained if the eyes are illuminated with a near infrared illuminator


Figure 2.2: The combined eye tracking flowchart

beaming light along the camera’s optical axis at a certain wavelength. At near

infrared wavelengths, pupils reflect almost all infrared light they receive along the

path back to the camera, producing the bright pupil effect, very much similar to

the red eye effect in photography. If illuminated off the camera’s optical axis, the

pupils appear dark since the reflected light will not enter the camera lens. This

produces the so-called dark pupil effect. Examples of bright and dark pupils are

given in Figure 2.3. Details about the construction of the IR illuminator and its

configuration may be found in [52].


Figure 2.3: The bright-pupil (a) and dark-pupil (b) images

Given the IR illuminated eye images, eye detection is accomplished via pupil


detection. Pupil detection is accomplished based on both the intensity of the pupils

(the bright and dark pupils) and the appearance of the eyes using the SVM classifier.

Specifically, pupil detection starts with preprocessing to remove external illumina-

tion interference, followed by searching the whole image for pupils based on pupil

intensity and eye appearance. Multiple pupils can be detected if there is more than

one person present, and the use of SVM avoids falsely identifying a bright region as

a pupil. Figure 2.4 gives an overview of the eye detection module.

Figure 2.4: Eye detection block diagram

2.2.1 Initial Eye Position Detection

The detection algorithm starts with preprocessing to minimize interference

from illumination sources other than the IR illuminator, including sunlight and

ambient light interference. A differential method is used to remove background

interference by subtracting the dark eye image (odd field) from the bright eye image

(even field), producing a difference image, with most of the background and external

illumination effects removed, as shown in Figure 2.5 (c). For real time eye tracking,

the image subtraction must be implemented efficiently in real time. To achieve this,

we developed circuitry to synchronize the inner and outer rings of LEDs with the even and odd fields of the interlaced image, respectively, so that they are turned on and off alternately: when the even field is being scanned, the inner ring of LEDs is on and the outer ring is off, and vice versa when the odd field is scanned. The interlaced input image is subsequently de-interlaced via a video

decoder, producing the even and odd field images as shown in Figure 2.5 (a) and


(b). More on our image subtraction circuitry may be found in [52].


Figure 2.5: Background illumination interference removal: (a) the even-field images obtained under both ambient and IR light; (b) the odd-field images obtained under only ambient light; (c) the difference images resulting from subtracting (b) from (a)

The difference image is subsequently thresholded automatically based on its

histogram, producing a binary image. Connected component analysis is then applied

to the binary image to identify the binary blobs. Our task is then to find out which of

the blobs actually is the real pupil blob. Initially, we mark all the blobs as potential

candidates for pupils as shown in Figure 2.6.

Figure 2.6: The thresholded difference image marked with pupil candidates
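For illustration only, the following sketch outlines this candidate-extraction step (image subtraction, histogram-based thresholding, and connected component analysis). It is not the original C++ implementation; it assumes 8-bit grayscale even/odd field images, uses OpenCV, and lets Otsu's method stand in for the histogram-based threshold selection.

```python
import cv2
import numpy as np

def pupil_candidates(even_field: np.ndarray, odd_field: np.ndarray):
    """Return centroids of candidate pupil blobs from one de-interlaced frame pair."""
    # Difference image: bright-pupil (even) minus dark-pupil (odd) removes most
    # of the background and ambient illumination.
    diff = cv2.subtract(even_field, odd_field)

    # Automatic threshold from the histogram (Otsu as a stand-in for the
    # histogram-based threshold selection described in the text).
    _, binary = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Connected component analysis; every surviving blob is a pupil candidate.
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
    candidates = []
    for i in range(1, n):  # label 0 is the background
        area = stats[i, cv2.CC_STAT_AREA]
        if 4 <= area <= 400:  # discard only very small or very large blobs
            candidates.append(tuple(centroids[i]))
    return candidates
```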


2.2.2 Eye Verification Using Support Vector Machines

As shown in Figure 2.6, there are usually many potential candidates for pupils.

Typically, pupils are found among the binary blobs. However, it is usually not

possible to isolate the pupil blob only by picking the right threshold value, since

pupils are often small and not bright enough compared with other noise blobs.

Thus, we will have to make use of information other than intensity to correctly

identify them.

One initial way to distinguish the pupil blobs from other noise blobs is based

on their geometric shapes. Usually, the pupil is an ellipse-like blob and we can use

an ellipse fitting method [29] to extract the shape of each blob and use the shape and

size to remove some blobs from further consideration. It must be noted, however,

that due to scale change (distance from the camera) and to variability in individual

pupil size, size is not a reliable criterion. It is only used to remove very large or very

small blobs. Shape criterion, on the other hand, is scale-invariant. Nevertheless,

shape alone is not sufficient since there are often present other non-pupil blobs with

similar shape and size, as shown in Figure 2.7, where we can see that there are still

Figure 2.7: The thresholded difference image after removing some blobs based on their geometric properties (shape and size). The blobs marked with circles are selected for further consideration

several non-pupil blobs left. Because they are so similar in shape and size, we can’t

distinguish the real pupil blobs from them, so we have to use other features.

We have observed that the eye region surrounding the pupil has a unique intensity distribution; it appears different from other parts of the face in the dark pupil


image as shown in Figure 2.3 (b). The appearance of an eye can therefore be utilized

to separate it from non-eyes. We map the locations of the remaining binary blobs

to the dark pupil images and then apply the SVM classifier [15, 40] to automatically

identify the binary blobs that correspond to eyes, as discussed below.

2.2.2.1 Support Vector Machines

SVM [15] is a two-class classification method that finds the optimal decision

hyper-plane based on the concept of structural risk minimization. Ever since its

introduction, SVM has become increasingly popular. The theory of SVM can be

briefly summarized as follows. For the case of two-class pattern recognition, the

task of predictive learning from examples can be formulated as follows. Given a set

of functions fα and an input domain R^N of N dimensions:

{fα : α ∈ Λ},  fα : R^N → {−1, +1}

(Λ is an index set) and a set of l examples:

(x1, y1), ..., (xi, yi), ..., (xl, yl),   xi ∈ R^N, yi ∈ {−1, +1}

where xi is an input feature vector and yi represents the class, which has only two values, −1 and +1. Each (xi, yi) is generated from an unknown probability distribution p(x, y), and the goal is to find a particular function f*α which provides the smallest possible value for the risk:

R(α) = ∫ |fα(x) − y| dp(x, y)    (2.1)

Suppose that there is a separating hyper-plane that separates the positive

class from the negative class. The data characterizing the boundary between the

two classes are called the support vectors since they alone define the optimal hyper-

plane. First, a set (xi, yi) of labeled training data are collected as the input to the

SVM. Then, a trained SVM will be characterized by a set of Ns support vectors si,

coefficient weights αi for the support vectors, class labels yi of the support vectors,


and a constant term w0.

For the linearly separable case, the linear decision surface (the hyperplane) is

defined as

w · x + w0 = 0 (2.2)

where x is a point on the hyperplane, “·” denotes the dot product, w is the normal of the

hyperplane, and w0 is the distance to the hyperplane from the origin. Through the

use of training data, w can be estimated by

w = ∑_{i=1}^{Ns} αi yi si    (2.3)

Given w and w0, an input vector xi can be classified into one of the two classes,

depending on whether w · x + w0 is larger or smaller than 0.

Classes are often not linearly separable. In this case, SVM can be extended by

using a kernel K(., .), which performs a nonlinear mapping of the feature space to

a higher dimension, where classes are linearly separable. The most common SVM

kernels include Gaussian kernels, Radial-based kernels, and polynomial kernels. The

decision rule with a kernel can be expressed as

∑_{i=1}^{Ns} αi yi K(si, x) + w0 = 0    (2.4)
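As a concrete illustration of the decision rule in Equation 2.4 (a sketch only; the support vectors si, weights αi, labels yi, and bias w0 are assumed to come from an already trained SVM), the classification of an input vector x with a Gaussian kernel can be written as:

```python
import numpy as np

def gaussian_kernel(s, x, sigma=3.0):
    # K(s, x) = exp(-||s - x||^2 / (2 * sigma^2))
    return np.exp(-np.sum((s - x) ** 2) / (2.0 * sigma ** 2))

def svm_decide(x, support_vectors, alphas, labels, w0, sigma=3.0):
    """Evaluate sum_i alpha_i * y_i * K(s_i, x) + w0 (Eq. 2.4) and return the class."""
    score = sum(a * y * gaussian_kernel(s, x, sigma)
                for a, y, s in zip(alphas, labels, support_vectors)) + w0
    return +1 if score >= 0 else -1
```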

2.2.2.2 SVM Training

To use SVM, training data are needed to obtain the optimal hyper-plane. An

eye image is represented as a vector I consisting of the original pixel values. For

this project, after obtaining the positions of pupil candidates using the methods

mentioned above, we obtain the sub-images from the dark image according to those

positions as shown in Figure 2.8.

Usually, the eyes are included in those cropped images of 20 × 20 pixels. The

cropped image data are processed using histogram equalization and normalized to

a [0, 1] range before training. The eye training images are divided into two sets: a

positive set and a negative set. In the positive image set, we include eye images of

different gazes, different degrees of opening, different face poses, different subjects,


(a) (b)

Figure 2.8: (a) The thresholded difference image superimposed with pos-sible pupil candidates. (b) The dark image marked with pos-sible eye candidates according to the positions of pupil can-didates in (a)

and with/without glasses. The non-eye images are placed in the negative image set.

Figures 2.9 and 2.10 contain examples of eye and non-eye images in the training

sets, respectively.

Figure 2.9: The eye images in the positive training set

After finishing the above step, we get a training set, which has 558 positive

images and 560 negative images. In order to obtain the best accuracy, we need

to identify the best parameters for the SVM. In Table 2.1, we list three different

SVM kernels with various parameter settings and each SVM was tested on 1757 eye


Figure 2.10: The non-eye images in the negative training set

Table 2.1: Experiment results using 3 kernel types with different parameters

Kernel Type    Degree   Sigma (σ)   # Support Vectors   Accuracy
Linear         -        -           376                 0.914058
Polynomial     2        -           334                 0.912351
Polynomial     3        -           358                 0.936255
Polynomial     4        -           336                 0.895845
Gaussian       -        1           1087                0.500285
Gaussian       -        2           712                 0.936255
Gaussian       -        3           511                 0.955037
Gaussian       -        4           432                 0.946500
Gaussian       -        5           403                 0.941377

candidate images obtained from different persons.

From the above table, we can see that the best accuracy we can achieve is

95.5037%, using a Gaussian kernel with a σ of 3.
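Purely as an illustration of the training procedure described above (20 × 20 crops, histogram equalization, normalization to [0, 1], and a Gaussian kernel with σ = 3), a present-day sketch using scikit-learn might look as follows; the original system was implemented in C++, so the library choice and helper names here are assumptions.

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def preprocess(patch_20x20: np.ndarray) -> np.ndarray:
    """Histogram-equalize a 20x20 8-bit eye candidate and scale it to [0, 1]."""
    eq = cv2.equalizeHist(patch_20x20)
    return eq.astype(np.float32).ravel() / 255.0

def train_eye_svm(eye_patches, non_eye_patches):
    """Train an eye / non-eye classifier with a Gaussian (RBF) kernel."""
    X = np.array([preprocess(p) for p in eye_patches + non_eye_patches])
    y = np.array([+1] * len(eye_patches) + [-1] * len(non_eye_patches))
    # gamma plays the role of 1 / (2 * sigma^2) for the Gaussian kernel.
    clf = SVC(kernel="rbf", gamma=1.0 / (2 * 3.0 ** 2))
    clf.fit(X, y)
    return clf
```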

2.2.2.3 Retraining Using Mis-labeled Data

Usually, supervised learning machines rely on only limited labeled training

examples, and cannot reach very high learning accuracy. So we have to test on

thousands of unlabeled data, pick up the mis-labeled data, then put them into the

correct training sets and retrain the classifier. After performing this procedure on

the unlabeled data obtained from different conditions several times, we can boost

the accuracy of the learning machine at the cost of extra time needed for retraining.

Specifically, we have eye data sets from ten people, which we obtained using

the same method. We choose the first person’s data set and label the eye images and


non-eye images manually, then we train the Gaussian SVM on this training set and

test the Gaussian SVM on the second person's data set. We check the second person's data one by one, collect all the misclassified examples, label them correctly and add

them into the training set. After finishing the above step, we retrain the SVM on

this increased training set and repeat the above step on the next person’s data set.

The whole process then repeats until the classification errors stabilize. Through the

retraining process, we can significantly boost the accuracy of the Gaussian SVM.
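A sketch of this retraining loop is given below; the per-person data sets, the manual relabeling step, and the helper names are hypothetical, and fit_svm stands for any routine that trains a classifier exposing a predict() method.

```python
def retrain_with_hard_examples(train_X, train_y, person_datasets, fit_svm):
    """Iteratively add misclassified examples from each new person and retrain.

    person_datasets: list of (features, true_labels) pairs, one per person,
                     where features are already preprocessed vectors.
    fit_svm: callable that trains and returns a classifier with .predict().
    """
    clf = fit_svm(train_X, train_y)
    for features, true_labels in person_datasets:
        pred = clf.predict(features)
        # Collect the examples the current SVM gets wrong ...
        wrong = [i for i, (p, t) in enumerate(zip(pred, true_labels)) if p != t]
        # ... add them with their correct labels and retrain on the enlarged set.
        train_X = list(train_X) + [features[i] for i in wrong]
        train_y = list(train_y) + [true_labels[i] for i in wrong]
        clf = fit_svm(train_X, train_y)
    return clf
```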

2.2.2.4 Eye Detection with SVM

During eye detection, we crop the regions in the dark pupil image according to

the locations of pupil candidates in the difference image as shown in Figure 2.8 (b).

After some preprocessing on these eye candidate images, they will be provided to

the trained SVM for classification. The trained SVM will classify the input vector

I into eye class or non-eye class. Figure 2.11 shows that the SVM eye classifier

correctly identifies the real eye regions as marked.


Figure 2.11: The result images (a) and (b) marked with identified eyes. Compared with the images in Figure 2.8 (b), many false alarms have been removed

Pupil verification with SVM works reasonably well and can generalize to people

of the same race. However, for people from a race that is significantly different from

those in training images, the SVM may fail and need to be retrained. SVM can

work under different illumination conditions due to the intensity normalization for

the training images via histogram equalization.


2.3 Eye Tracking Algorithm

Given the detected eyes in the initial frames, the eyes in subsequent frames

can be tracked from frame to frame. Eye tracking can be done by performing pupil

detection in each frame. This brute force method, however, will significantly slow

down the speed of pupil tracking, making real time pupil tracking impossible since

it needs to search the entire image for each frame. This can be done more efficiently

by using the scheme of prediction and detection. Kalman filtering [8] provides

a mechanism to accomplish this. The Kalman pupil tracker, however, may fail if

pupils are not bright enough under the conditions mentioned previously. In addition,

rapid head movement may also cause the tracker to lose the eyes. This problem is

addressed by augmenting the Kalman tracker with the mean shift tracker.

Figure 2.12 summarizes our eye tracking scheme. Specifically, after locating

the eyes in the initial frames, Kalman filtering is activated to track bright pupils. If

it fails in a frame due to disappearance of bright pupils, eye tracking based on the

mean shift will take over. Our eye tracker will return to bright pupil tracking as

soon as the bright pupils appear again, since bright pupil tracking is much more robust and reliable.

Pupil detection will be activated if the mean shift tracking fails. These two-stage eye trackers work together and complement each other, significantly improving the robustness of eye tracking. The Kalman tracking, the mean shift tracking,

and their integration are briefly discussed below.

2.3.1 Eye (Pupil) Tracking with Kalman Filtering

A Kalman filter is a set of recursive algorithms that estimate the position and

uncertainty of moving targets in the next time frame, that is, where to look for

the targets, and how large a region should be searched in the next frame around

the predicted position in order to find the targets with a certain confidence. It

recursively conditions current estimate on all of the past measurements and the

process is repeated with the previous a posteriori estimates used to project the new

a priori estimates. This recursive nature is one of the very appealing features of the

Kalman filter since it makes practical implementation much more feasible.

Our pupil tracking method based on Kalman filtering can be formalized as


Figure 2.12: The combined eye tracking flowchart

follows. The state of a pupil at each time instance (frame) t can be characterized by

its position and velocity. Let (ct, rt) represent the pupil pixel position (its centroid)

at time t and (ut, vt) be its velocity at time t in c and r directions respectively. The

state vector at time t can therefore be represented as Xt = (ct, rt, ut, vt)ᵀ.

According to the theory of Kalman filtering [73], Xt+1, the state vector at the

next time frame t+1, linearly relates to current state Xt by the system model as


follows:

Xt+1 = ΦXt + Wt (2.5)

where Φ is the state transition matrix and Wt represents system perturbation. Wt

is normally distributed as p(Wt) ∼ N(0, Q), and Q represents the process noise

covariance.

We further assume that a fast feature extractor estimates Zt = (ct, rt), the

detected pupil position at time t. Therefore, the measurement model in the form

needed by the Kalman filter is

Zt = HXt + Mt (2.6)

where matrix H relates current state to current measurement and Mt represents

measurement uncertainty. Mt is normally distributed as p(Mt) ∼ N(0, R), and R

is the measurement noise covariance. For simplicity, since Zt only involves position,

H can be represented as

H = | 1 0 0 0 |
    | 0 1 0 0 |

The feature detector (e.g., thresholding or correlation) searches the region as

determined by the projected pupil position and its uncertainty to find the feature

point at time t + 1. The detected point is then combined with the prediction

estimation to produce the final estimate.

Specifically, given the state model in equation 2.5 and measurement model in

equation 2.6, as well as some initial conditions, the state vector Xt+1, along with

its covariance matrix Σt+1, can be updated as follows. For subsequent discussion,

let us define a few more variables. Let X−t+1 be the estimated state at time t+1, resulting from using the system model only. It is often referred to as the a priori state estimate. Xt+1 differs from X−t+1 in that it is estimated using both the system model (equation 2.5) and the measurement model (equation 2.6). Xt+1 is usually referred to as the a posteriori state estimate. Let Σ−t+1 and Σt+1 be the covariance matrices for the state estimates X−t+1 and Xt+1 respectively. They characterize the


uncertainties associated with the a priori and a posteriori state estimates. The

goal of Kalman filtering is therefore to estimate Xt+1 and Σt+1 given Xt, Σt, Zt,

and the system and measurement models. The Kalman filtering algorithm for state

prediction and updating is summarized below.

1. State prediction

Given current state Xt and its covariance matrix Σt, state prediction involves

two steps: state projection (X−t+1) and error covariance estimation (Σ−t+1), as summarized in Eq. 2.7 and Eq. 2.8:

X−t+1 = ΦXt    (2.7)

Σ−t+1 = ΦΣtΦᵀ + Qt    (2.8)

Given the estimate X−t+1 and its covariance matrix Σ−t+1, pupil detection is performed to detect the pupil around X−t+1, with the search area determined by Σ−t+1. In practice, to speed up the computation, the values of Σ−t+1[0][0] and Σ−t+1[1][1] are used to compute the search area size. Specifically, the search area size is chosen as 20+2*Σ−t+1[0][0] pixels by 20+2*Σ−t+1[1][1] pixels, where 20 by 20 pixels is the basic window size. This means that the larger Σ−t+1[0][0] and Σ−t+1[1][1] are, the more uncertain the estimation is, and the larger the search area becomes. The search area is therefore adaptively adjusted, and the pupil can be located quickly.

2. State updating

The detected pupil position is represented by Zt+1. Then, state updating can

be performed to derive the final state and its covariance matrix. The first

task during state updating is to compute the Kalman gain Kt+1. It is done as

follows:

Kt+1 = Σ−t+1Hᵀ (HΣ−t+1Hᵀ + R)⁻¹    (2.9)


The gain matrix K can be physically interpreted as a weighting factor to

determine the contribution of measurement Zt+1 and prediction HX−t+1 to the a posteriori state estimate Xt+1. The next step is to generate the a posteriori state estimate Xt+1 by incorporating the measurement into equation 2.5. Xt+1 is computed as follows:

Xt+1 = X−t+1 + Kt+1(Zt+1 − HX−t+1)    (2.10)

The final step is to obtain the a posteriori error covariance estimate. It is computed as follows:

Σt+1 = (I − Kt+1H)Σ−t+1    (2.11)

After each time and measurement update pair, the Kalman filtering recursively

conditions the current estimate on all of the past measurements and the process is

repeated with the previous a posteriori estimates used to project a new a priori

estimate.
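For illustration, Equations 2.5-2.11 can be transcribed into the following predict/update sketch for the four-dimensional pupil state; the constant-velocity transition matrix and the noise covariances Q and R shown here are assumed values, not those of the original system.

```python
import numpy as np

dt = 1.0  # one frame
Phi = np.array([[1, 0, dt, 0],   # state transition for X = (c, r, u, v)
                [0, 1, 0, dt],
                [0, 0, 1, 0],
                [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],      # measurement picks out the position (c, r)
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 0.1   # process noise covariance (assumed value)
R = np.eye(2) * 2.0   # measurement noise covariance (assumed value)

def predict(X, Sigma):
    X_prior = Phi @ X                          # Eq. 2.7
    Sigma_prior = Phi @ Sigma @ Phi.T + Q      # Eq. 2.8
    return X_prior, Sigma_prior

def update(X_prior, Sigma_prior, Z):
    S = H @ Sigma_prior @ H.T + R
    K = Sigma_prior @ H.T @ np.linalg.inv(S)               # Eq. 2.9
    X_post = X_prior + K @ (Z - H @ X_prior)               # Eq. 2.10
    Sigma_post = (np.eye(4) - K @ H) @ Sigma_prior         # Eq. 2.11
    return X_post, Sigma_post
```

The diagonal entries of the predicted covariance returned by predict() are what determine the adaptive search window size described above.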

The Kalman filtering tracker works reasonably well under frontal face rotation

with the eye open. However, it will fail if the pupils are not bright due to either face

orientation or external illumination interferences. The Kalman filter also fails when a

sudden head movement occurs due to incorrect prediction because the assumption of

smooth head motion has been violated. In each case, Kalman filtering fails because

the Kalman filter detector cannot detect pupils. We propose to use mean shift tracking to augment Kalman filtering tracking and overcome this limitation.

2.3.2 Mean Shift Eye Tracking

Due to the IR illumination, the eye region in the dark and bright pupil images

exhibits strong and unique visual patterns such as the dark iris in the white part.

This unique pattern should be utilized to track eyes in case the bright pupils fail

to appear on the difference images. This is accomplished via the use of mean shift

tracking. Mean shift tracking is an appearance-based object tracking method. It

employs mean shift analysis to identify a target candidate region, which has the


most similar appearance to the target model in terms of intensity distribution.

2.3.2.1 Similarity Measure

The similarity of two distributions can be expressed by a metric based on the

Bhattacharyya coefficient as described in [12]. The derivation of the Bhattacharyya

coefficient from sample data involves the estimation of the target density q and the

candidate density p, for which we employ the histogram formulation. Therefore, the

discrete density q = {qu}u=1...m (with ∑_{u=1}^{m} qu = 1) is estimated from the m-bin histogram of the target model, while p(y) = {pu(y)}u=1...m (with ∑_{u=1}^{m} pu = 1) is estimated at a given location y from the m-bin histogram of the target candidate. Then at location y, the sample estimate of the Bhattacharyya coefficient for target density q and candidate density p(y) is given by

ρ(y) ≡ ρ[p(y), q] = ∑_{u=1}^{m} √(pu(y) qu)    (2.12)

The distance between two distributions can be defined as

d(y) = √(1 − ρ[p(y), q])    (2.13)
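A direct transcription of Equations 2.12 and 2.13 for two m-bin histograms that are already normalized to sum to one (illustrative sketch):

```python
import numpy as np

def bhattacharyya(p: np.ndarray, q: np.ndarray) -> float:
    """Sample estimate of the Bhattacharyya coefficient (Eq. 2.12)."""
    return float(np.sum(np.sqrt(p * q)))

def distribution_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Distance between two distributions (Eq. 2.13)."""
    return float(np.sqrt(1.0 - bhattacharyya(p, q)))
```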

2.3.2.2 Eye Appearance Model

To reliably characterize the intensity distribution of eyes and non-eyes, the

intensity distribution is characterized by two images: even and odd field images,

resulting from de-interlacing the original input images. They are under different il-

luminations, with one producing bright pupils and the other producing dark pupils

as shown in Figure 2.13. The use of two channel images to characterize eye appear-

ance represents a new contribution and can therefore improve the accuracy of eye

detection.


Figure 2.13: The eye images: (a), (b) left and right bright-pupil eyes; (c), (d) corresponding left and right dark-pupil eyes


Thus, there are two different feature probability distributions of the eye target

corresponding to the dark-pupil and bright-pupil images, respectively. We use a 2D joint histogram, which is derived from the grey level dark-pupil and bright-pupil image spaces with m = l × l bins, to represent the feature probability distribution of the eyes. Before calculating the histogram, we employ a convex and monotonically decreasing kernel profile k to assign a smaller weight to locations that are farther from the center of the target. Let us denote by {xi}i=1...nh the pixel locations of a target candidate that has nh pixels, centered at y in the current frame. The probability distribution of the intensity vector I = (Ib, Id), where Ib and Id represent the intensities in the bright and dark images respectively, in the target candidate is given by

pu(y) = ( ∑_{i=1}^{nh} k(‖(y − xi)/h‖²) δ[b(xi) − u] ) / ( ∑_{i=1}^{nh} k(‖(y − xi)/h‖²) ),   u = 1, 2, ..., m    (2.14)

in which b(xi) is the index of the bin in the joint histogram of the intensity vector

I at location xi, h is the radius of the kernel profile and δ is the Kronecker delta

function. The eye model distribution q can be built in a similar fashion.
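The following sketch builds such a kernel-weighted joint histogram for a candidate centered at y (illustrative only; an Epanechnikov-style profile k(x) = max(0, 1 − x) and l = 10 bins per channel are assumptions):

```python
import numpy as np

def joint_histogram(bright, dark, center, h, l=10):
    """Kernel-weighted 2D joint histogram of (bright, dark) intensities (Eq. 2.14).

    bright, dark: de-interlaced bright-pupil and dark-pupil grayscale images (uint8).
    center: (row, col) of the candidate center y;  h: kernel radius in pixels.
    """
    hist = np.zeros((l, l), dtype=float)
    cy, cx = int(round(center[0])), int(round(center[1]))
    for r in range(max(0, cy - h), min(bright.shape[0], cy + h + 1)):
        for c in range(max(0, cx - h), min(bright.shape[1], cx + h + 1)):
            d2 = ((r - center[0]) ** 2 + (c - center[1]) ** 2) / float(h * h)
            k = max(0.0, 1.0 - d2)            # convex, monotonically decreasing profile
            u = int(bright[r, c]) * l // 256  # bin index in the bright-pupil channel
            v = int(dark[r, c]) * l // 256    # bin index in the dark-pupil channel
            hist[u, v] += k
    total = hist.sum()
    return hist / total if total > 0 else hist
```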

2.3.2.3 Algorithm

After locating the eyes in the previous frame, we construct an eye model q

using Equation 2.14 based on the detected eyes in the previous frame. We then

predict the locations y0 of eyes in the current frame using the Kalman filter. Then

we treat y0 as the initial position and use the mean shift iterations to find the

most similar eye candidate to the eye target model in the current frame, using the

following algorithm:

1. Initialize the location of the target in the current frame with y0, then compute

the distribution {pu(y0)}u=1...m using Equation 2.14 and evaluate similarity measure (Bhattacharyya coefficient) between the model density q and target candidate density p:

ρ[p(y0), q] = ∑_{u=1}^{m} √(pu(y0) qu)    (2.15)


2. Derive the weights {wi}i=1...nh according to

wi = ∑_{u=1}^{m} δ[b(xi) − u] √(qu / pu(y0))    (2.16)

3. Based on the mean shift vector, derive the new location of the eye target

y1 = ( ∑_{i=1}^{nh} xi wi g(‖(y0 − xi)/h‖²) ) / ( ∑_{i=1}^{nh} wi g(‖(y0 − xi)/h‖²) )    (2.17)

where g(x) = −k′(x), and then update {pu(y1)}u=1...m and evaluate

ρ[p(y1), q] = ∑_{u=1}^{m} √(pu(y1) qu)    (2.18)

4. While ρ[p(y1), q] < ρ[p(y0), q]

Do y1 ← 0.5(y0 + y1)

This is necessary to avoid the mean shift tracker moving to an incorrect loca-

tion.

5. If ‖y1 − y0‖ < ε, stop, where ε is the termination threshold

Otherwise, set y0 ← y1 and go to step 1.

The new eye locations in the current frame can be achieved in a few itera-

tions, as opposed to correlation-based approaches, which must perform an exhaustive

search around the previous eye location. Due to the simplicity of the calculations,

this method is much faster than correlation. Figure 2.14(b) plots the surface for

the Bhattacharyya coefficient of the large rectangle marked in Figure 2.14(a). The

mean shift algorithm exploits the gradient of the surface to climb from its initial

position to the closest peak that represents the maximum value of the similarity

measure.
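A condensed sketch of the above iteration is given below; it reuses joint_histogram from the earlier sketch and assumes the Epanechnikov-style profile, for which g(x) = −k′(x) = 1 inside the kernel support, so Equation 2.17 reduces to a weighted average of pixel positions. The stopping thresholds are assumed values.

```python
import numpy as np

def mean_shift_eye(bright, dark, q_model, y0, h=10, l=10, eps=0.5, max_iter=20):
    """Locate the eye by mean shift over the joint (bright, dark) intensity histogram.

    q_model: target model histogram built from the previously tracked eye.
    y0: initial (row, col) center predicted by the Kalman filter.
    """
    def rho(p, q):                              # Bhattacharyya coefficient
        return float(np.sum(np.sqrt(p * q)))

    y0 = np.asarray(y0, dtype=float)
    for _ in range(max_iter):
        p0 = joint_histogram(bright, dark, y0, h, l)       # candidate distribution at y0
        num, den = np.zeros(2), 0.0
        r0, c0 = int(round(y0[0])), int(round(y0[1]))
        for r in range(max(0, r0 - h), min(bright.shape[0], r0 + h + 1)):
            for c in range(max(0, c0 - h), min(bright.shape[1], c0 + h + 1)):
                u = int(bright[r, c]) * l // 256
                v = int(dark[r, c]) * l // 256
                if p0[u, v] > 0:
                    w = np.sqrt(q_model[u, v] / p0[u, v])  # pixel weight (Eq. 2.16)
                    num += w * np.array([r, c], dtype=float)
                    den += w
        if den == 0:
            break
        y1 = num / den                                     # new center (Eq. 2.17, g = 1)
        # Step 4: halve the step while the similarity decreases.
        while (rho(joint_histogram(bright, dark, y1, h, l), q_model) < rho(p0, q_model)
               and np.linalg.norm(y1 - y0) >= eps):
            y1 = 0.5 * (y0 + y1)
        if np.linalg.norm(y1 - y0) < eps:                  # Step 5: convergence test
            return y1
        y0 = y1
    return y0
```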

2.3.2.4 Mean Shift Tracking Parameters

The mean shift algorithm is sensitive to the window size and the histogram

quantization value. In order to obtain the best performance of the mean shift tracker



Figure 2.14: (a) The image frame 13; (b) values of the Bhattacharyya coefficient corresponding to the marked region (40 × 40 pixels) around the left eye in frame 13. The mean shift algorithm converges from the initial location (∗) to the convergence point, which is a mode of the Bhattacharyya surface

for a specific task, we have to find the appropriate histogram quantization value and

the proper window size. We choose several image sequences and manually locate

the left eye positions in these frames. Then we run the mean shift eye tracker under

different window sizes and different histogram quantization values; we evaluate the

performance of the mean shift eye tracker under those conditions using the following

criterion:

αerror = ∑_{i=1}^{N} √((yi(tracked) − y′i(manual))²) / N    (2.19)

where N is the number of image frames and yi(tracked) is the left eye location tracked by the mean shift tracker in the image frame i; y′i(manual) is the left eye

location manually located by the person in the image frame i. We treat the manually

selected eye locations as the correct left eye locations.

The intensity histogram is scaled in the range of 0 to 255/(2q), where q is the

quantization value. The results are plotted in Fig. 2.15. From figure 2.15 (a) and

(b), we can determine the optimal quantization level to be 25 while the optimal

window size is 20*20 pixels. Figure 2.16 shows some tracking results with these

parameters.
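For illustration, the parameter selection described above can be sketched as a simple grid search; run_mean_shift_tracker is a hypothetical stand-in for the tracker, and the candidate window sizes and quantization values are assumptions.

```python
import numpy as np

def tracking_error(tracked, manual):
    """Average distance between tracked and manually located eye positions (Eq. 2.19)."""
    tracked, manual = np.asarray(tracked, float), np.asarray(manual, float)
    return float(np.mean(np.linalg.norm(tracked - manual, axis=1)))

def grid_search(frames, manual_positions, run_mean_shift_tracker):
    """Pick the window size and quantization value with the smallest error."""
    best = None
    for window in (10, 20, 30, 40):          # candidate window sizes (pixels)
        for q in (5, 15, 25, 35):            # candidate quantization values
            tracked = run_mean_shift_tracker(frames, window=window, quantization=q)
            err = tracking_error(tracked, manual_positions)
            if best is None or err < best[0]:
                best = (err, window, q)
    return best                              # (error, window size, quantization value)
```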

The mean-shift tracker, however, is sensitive to its initial placement. It may



Figure 2.15: The error distribution of tracking results: (a) error distribution vs. intensity quantization values and different window sizes; (b) error distribution vs. quantization levels only

not converge, or may converge to a local minimum if initially placed too far from

the optimal location. It usually converges to the mode, closest to its initial position.

If the initial location is in the valley between two modes, the mean shift may not

converge to any (local maxima) peaks as shown in Figure 2.17. This demonstrates

the sensitivity of the mean-shift tracker to initial placement of the detector.

2.3.2.5 Experiments On Mean Shift Eye Tracking

In order to study the performance of the mean-shift tracker, we apply it to

sequences that contain images with weak or partially occluded pupils or no bright

pupils. We have noticed that when bright pupils disappear due to either eye closure

or face rotations as shown in Figure 2.18, the Kalman filter fails because there

are no bright pupil blobs in the difference images. However, the mean shift tracker

compensates for the failure of bright pupil tracker because it is an appearance-based

tracker that tracks the eyes according to the intensity statistical distributions of the

eye regions and does not need bright pupils. The black rectangles in Figure 2.18

represent the eye locations tracked by the mean shift tracker.



Figure 2.16: Mean-shift tracking of both eyes with an initial search area of 40 × 40 pixels, as represented by the large black rectangle. The eyes marked with white rectangles in frame 1 are used as the eye model and the tracked eyes in the following frames are marked by the smaller black rectangles

2.4 Combining Kalman Filtering Tracking with Mean Shift

Tracking

Mean shift tracking is fast and handles noise well, but it is easily distracted by nearby regions that appear similar to the eyes. This is partially because of the histogram representation of the eye's appearance, which does not contain any information about the relative spatial relationships

among pixels; the distraction manifests primarily as errors in the calculated center

of the eyes. The mean shift tracker does not have the capability of self-correction

and the errors therefore tend to accumulate and propagate to subsequent frames

as tracking progresses, and eventually the tracker drifts away. Another factor that

could lead to errors with eye tracking based on mean shift is that the mean shift

tracker cannot continuously update its eye model despite the fact that the eyes look

significantly different under different face orientations and lighting conditions, as

demonstrated in Figures 2.19 (a-e). We can see that the mean shift eye tracker

cannot identify the correct eye location when the eyes appear significantly different

from the eye model due to face orientation change.



Figure 2.17: (a) Image of frame 135, with the initial eye position marked and the initial search area outlined by the large black rectangle. (b) Values of the Bhattacharyya coefficient corresponding to the marked region (40 × 40 pixels) around the left eye in (a). The mean shift algorithm cannot converge from the initial location (which is in the valley between two modes) to the correct mode of the surface. Instead, it is trapped in the valley


Figure 2.18: The bright pupil based Kalman tracker fails to track eyes due to the absence of bright pupils caused by either eye closure or oblique face orientations. The mean shift eye tracker, however, tracks the eyes successfully as indicated by the black rectangles

To overcome these limitations with the mean shift tracker, we propose to

combine Kalman filter tracking with mean shift tracking to overcome their respective

limitations and to take advantage of their strengths. The two trackers are activated

alternately. The Kalman tracker is first initiated, assuming the presence of the

bright pupils. When the bright pupils appear weak or disappear, the mean shift

tracker is activated to take over the tracking. Mean shift tracking continues until the

reappearance of the bright pupils, when the Kalman tracker takes over. To prevent

the mean shift tracker from drifting away, the target eye model is continuously



Figure 2.19: An image sequence demonstrating the drift-away problem of the mean-shift tracker as well as the correction of the problem by the integrated eye tracker. Frames (a-e) show the drift-away case of the mean-shift eye tracker; for the same image sequence, frames (A-E) show the improved results of the combined eye tracker. White rectangles show the eyes tracked by the Kalman tracker while the black rectangles show the eyes tracked by the mean shift tracker

updated by the eyes successfully detected by the Kalman tracker.
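The alternation logic described above can be summarized by the following control-flow sketch; detect_eyes_svm, kalman_track, mean_shift_track and build_eye_model are placeholders for the components described earlier in this chapter, not actual functions from the implementation.

```python
def track_sequence(frames, detect_eyes_svm, kalman_track, mean_shift_track, build_eye_model):
    """Alternate between bright-pupil Kalman tracking and mean shift tracking."""
    eyes, eye_model = None, None
    for frame in frames:
        if eyes is None:
            eyes = detect_eyes_svm(frame)              # (re)detection when tracking is lost
            if eyes is not None:
                eye_model = build_eye_model(frame, eyes)
            continue
        tracked = kalman_track(frame, eyes)
        if tracked is not None:                        # bright pupils visible: Kalman tracker
            eyes = tracked
            eye_model = build_eye_model(frame, eyes)   # keep the mean shift model fresh
        else:                                          # bright pupils weak or absent
            eyes = mean_shift_track(frame, eyes, eye_model)
            # a None result here triggers re-detection in the next frame
    return eyes
```

Refreshing the target model only from Kalman-verified detections is what prevents the mean shift stage from accumulating drift.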

Figures 2.19 (A-E) show the results of tracking the same sequence with the

integrated eye tracker. It is apparent that the integrated tracker can correct the

drift problem of the mean shift tracker. Specifically, in Figure 2.19, white rectangles

represent the eyes tracked by the Kalman tracker while the black rectangles represent

the eyes tracked by the mean shift tracker, which works for all the following figures

in this chapter.

2.5 Experimental Results

In this section, we will present results from an extensive experiment we con-

ducted to validate the performance of our integrated eye tracker under different

conditions.

2.5.1 Eye Tracking Under Significant Head Pose Changes

Here, we show some qualitative and quantitative results to demonstrate the

performance of our tracker under different face orientations. Figure 2.20 shows the


tracking results on a typical face image sequence with a person undergoing significant

face pose changes. Additional results for different subjects under significant head

rotations are shown in Figure 2.21. We can see that under significant head pose

changes, the eyes will be either partially occluded or the appearance of eyes will

be significantly different from the eyes with frontal faces. But the two eye trackers

alternate reliably, detecting the eyes under different head orientations, with eyes

either open, closed or partially occluded.

Figure 2.20: Tracking results of the combined eye tracker for a person undergoing significant head movements

To confirm this finding quantitatively, we manually located the positions of the

eyes for two typical sequences and they serve as the ground-truth eye positions. The

tracked eye positions are then compared with the ground-truth data. The results

are summarized in Tables 2.2 and 2.3. From the tracking statistics in Tables 2.2 and

2.3, we can conclude that the integrated eye tracker is much more accurate than the

Kalman filter pupil tracker, especially for closed eyes and for eyes partially occluded

due to large face rotations. These results demonstrate that this combination of two

tracking techniques produces much better tracking results than using either of them

individually.

2.5.2 Eye Tracking Under Different Illuminations

In this experiment, we demonstrate the performance of our integrated tracker

under different illumination conditions by varying the light conditions during track-



Figure 2.21: Tracking results of the combined eye tracker for four image sequences (a), (b), (c) and (d) under significant head movements.

ing. The experiment included first turning off the ambient lights, followed by using

a mobile light source and positioning it close to the people to produce strong ex-

ternal light interference. The external mobile light produces significant shadows as

well as intensity saturation on the subject’s faces. Figure 2.22 visually shows the

sample tracking results for two individuals. Despite these somewhat extreme condi-

tions, our eye tracker managed to track the eyes correctly. Because of the use of IR,

the faces are still visible and eyes are tracked even under darkness. It is apparent

that illumination change does not adversely affect the performance of our technique.

This may be attributed to the simultaneous use of active IR sensing, image intensity

normalization for eye detection using SVM, and the dynamic eye model updating

for the mean shift tracker.


Table 2.2: Tracking statistics comparison for both trackers under different eye conditions (open, closed, occluded) on the first person

Image (600 frames)                 Bright pupil tracker   Combined tracker
Left eye (open), 452 frames        400/452                452/452
Left eye (closed), 66 frames       0/66                   66/66
Left eye (occluded), 82 frames     0/82                   82/82
Right eye (open), 425 frames       389/425                425/425
Right eye (closed), 66 frames      0/66                   66/66
Right eye (occluded), 109 frames   0/109                  109/109


Figure 2.22: Tracking results of the combined eye tracker for two image sequences (a) and (b) under significant illumination changes.

2.5.3 Eye Tracking With Glasses

Eye appearance changes significantly with glasses. Furthermore, the glare on

the glasses caused by light reflections presents significant challenges to eye tracking

with glasses. In Figure 2.23, we show the results of applying our eye tracker to

persons wearing glasses. We can see that our eye tracker can still detect and track

eyes robustly and accurately for people with glasses. However, our study shows that

when the head orientation is such that the glare completely occludes the pupils, our


Table 2.3: Tracking statistics comparison for both trackers under different eye conditions (open, closed, occluded) on the second person

Image Sequence 1 (600 frames)      Bright pupil tracker   Combined tracker
Left eye (open), 421 frames        300/421                410/421
Left eye (closed), 78 frames       0/78                   60/78
Left eye (occluded), 101 frames    0/101                  60/101
Right eye (open), 463 frames       336/463                453/463
Right eye (closed), 78 frames      0/78                   78/78
Right eye (occluded), 59 frames    0/59                   59/59

tracker will fail. This is a problem that we will tackle in the future.


Figure 2.23: Tracking results of the combined eye tracker for two image sequences (a), (b) with persons wearing glasses.

2.5.4 Eye Tracking With Multiple People

Our eye tracker can track not only the eyes of one person but also

multiple people’s eyes simultaneously. Here, we show the results of applying our eye

tracker to simultaneously track multiple people’s eyes with different distances and


face orientations with respect to the camera. The result is presented in Figure 2.24.

This experiment demonstrates the versatility of our eye tracker.

Figure 2.24: Tracking results of the combined eye tracker for multiple persons.

2.5.5 Occlusion Handling

Eyes are often partially or completely occluded, either by the face due to oblique face orientations or by hands or other objects. A good eye tracker should be

able to track eyes under partial occlusion and be able to detect complete occlusion

and re-detect the eyes after the complete occlusion is removed. In Figure 2.25, two

persons are moving in front of the camera, and one person’s eyes are occluded by

another’s head when they are crossing. As shown in Figure 2.25, when the rear

person moves from right to left, the head of the front person starts to occlude his

eyes, beginning with one and then two eyes getting completely occluded. As shown,

our tracker can still correctly track an eye even though it is partially occluded. When

both eyes are completely occluded, our tracker detects this situation. As soon as

the eyes reappear in the image, our eye tracker will capture the eyes one by one

immediately as shown in Figure 2.25. This experiment shows the robustness of our

method to occlusions.

2.5.6 Tracking Accuracy Validation

Experiments were conducted to quantitatively characterize the tracking accu-

racy of our proposed eye tracker. Specifically, we randomly selected an image se-

quence that contains 13,620 frames, and manually identified the eyes in each frame.


Figure 2.25: Tracking results of the combined eye tracker for an image sequence involving multiple persons occluding each other's eyes.

The manually labelled data serves as the ground truth and is compared with auto-

matically tracked results from our eye tracker. The study shows that our eye tracker

is quite accurate, with a false alarm rate of 0.05% and a mis-detection rate of 4.2%.

In addition, we studied the positional accuracy of the tracked eyes. The ground

truth is still obtained by manually locating the eyes in each frame. Figure 2.26

summarizes the comparison results. It shows that the automatically tracked eye

positions match very well with manually located eye positions, with RMS position

errors of 1.09 and 0.68 pixels along x and y coordinates, respectively.


Figure 2.26: The comparison between the automatically tracked eye positions and the manually located eye positions for 100 randomly selected consecutive frames: (a) x coordinate and (b) y coordinate.


2.5.7 Processing Speed

The proposed eye detection and tracking algorithm is implemented using C++

on a PC with a 2.80 GHz Xeon (TM) CPU and 1.00 GB of RAM. The resolution of the

captured images is 640×480 pixels, and the built eye tracker runs at approximately

26 fps.

2.6 Chapter Summary

In this chapter, we present an integrated eye tracker to track eyes robustly

under various illuminations and face orientations. Our method performs well re-

gardless of whether the pupils are directly visible or not. This has been achieved

by combining an appearance-based pattern recognition method (SVM) and object

tracking (Mean Shift) with a bright-pupil eye tracker based on Kalman filtering.

Specifically, we take the following measures. First, the use of SVM for pupil

detection complements eye detection based on bright pupils from IR illumination,

allowing detection of eyes in the presence of other bright objects; second, two chan-

nels (dark-pupil and bright-pupil eye images) are used to characterize the statistical

distributions of the eye, based on which a mean shift eye tracker is developed. Third,

the eye model is continuously updated with the eyes most recently detected by the Kalman tracker, to avoid error propagation with the mean shift tracker.

Finally, the experimental determination of the optimal window size and quantiza-

tion level for mean shift tracking further enhances the performance of our technique.

Experiments show that these enhancements have led to a significant improvement

in eye tracking robustness and accuracy over existing eye trackers, especially under

various conditions identified in Section 1. Furthermore, our integrated eye tracker

is demonstrated to be able to handle occlusion and people with glasses, and to

simultaneously track multiple people of different poses and scales.

The two important lessons we learn from this research are: 1) perform active

vision (e.g., active IR illumination) to produce quality input images and to simplify

subsequent image processing; and 2) combine different complementary techniques

to utilize their respective strengths and to overcome their limitations, leading to a

much more robust technique than using each technique individually.


CHAPTER 3

Eye Gaze Tracking

3.1 Introduction

Eye gaze is defined as the line of sight of a person. It represents a person’s focus

of attention. Eye gaze tracking has been an active research topic for many decades

because of its potential usages in various applications such as Human Computer

Interaction (HCI), Virtual Reality, Eye Disease Diagnosis, Human Behavior Studies,

etc. For example, when a user is looking at a computer screen, the user’s gaze

point at the screen can be estimated via the eye gaze tracker. Hence, the eye

gaze can serve as an advanced computer input [47], which is proven to be more

efficient than the traditional input devices such as a mouse pointer [126]. Also, a

gaze-contingent interactive graphic display application can be built [133], in which

the graphic display on the screen can be controlled interactively by the eye gaze.

Recently, eye gaze has also been widely used by cognitive scientists to study human

beings’ cognition [67], memory [70], etc.

Numerous techniques [133, 74, 45, 104, 131, 79, 6, 101, 113, 84, 76, 71] have

been proposed to estimate the eye gaze. Earlier eye gaze trackers are fairly intrusive

in that they require physical contact with the user, such as placing a reflective

white dot directly onto the eye [74] or attaching a number of electrodes around the

eye [45]. In addition, most of these technologies also require the user’s head to be

motionless during eye tracking.

With the rapid technological advancements in both video cameras and mi-

crocomputers, gaze-tracking technology based on the digital video analysis of eye

movements has been widely explored. Since it does not require anything attached

to the user, video technology opens the most promising direction for building a non-

intrusive eye gaze tracker. Various techniques [20, 77, 18, 49, 123, 2, 133] have been

proposed to perform the eye gaze estimation based on eye images captured by video

cameras. However, most available remote eye gaze trackers have two characteristics

that prevent them from being widely used. First, they must often be calibrated


repeatedly for each individual; second, they have low tolerance for head movements

and require the user to hold the head uncomfortably still.

In this chapter, two different techniques are introduced to improve the existing

gaze-tracking techniques. First, a simple 3D gaze-tracking technique is proposed to

estimate the 3D direction of the gaze. Different from existing 3D techniques, the

proposed 3D gaze-tracking technique does not need to know any user-dependent

parameters about the eyeball. Hence, the 3D direction of the gaze can be estimated

in a way allowing more easy implementation. Second, a novel 2D mapping-based

gaze estimation technique is introduced to allow free head movements and simplify

the calibration procedure. A dynamic head compensation model is proposed to

compensate for the head movements so that whenever the head moves to a new

3D position, the gaze mapping function at the new 3D position can be updated

automatically. Hence, accurate gaze information can still be estimated as the head

moves. Therefore, by using our proposed gaze-tracking techniques, a more robust,

accurate, comfortable and useful eye gaze-tracking system can be built.

3.2 Related Works

In general, most of the non-intrusive vision-based gaze-tracking techniques can

be classified into two groups: 2D mapping-based gaze estimation method [133, 104,

131, 79] and direct 3D gaze estimation method [6, 101, 113, 84, 76, 71]. In the

following sections, each group is discussed briefly.

3.2.1 2D Mapping-Based Gaze Estimation Technique

For the 2D mapping-based gaze estimation method, the eye gaze is estimated

from a calibrated gaze mapping function by inputting a set of 2D eye movement

features extracted from eye images, without knowing the 3D direction of the gaze.

Usually, the extracted 2D eye movement features vary with the eye gaze so that the

relationship between them can be encoded by a gaze mapping function. In order

to obtain the gaze mapping function, an online calibration needs to be performed for

each person. Unfortunately, the extracted 2D eye movement features also vary

significantly with head position; thus, the calibrated gaze mapping function is very


sensitive to head motion [79]. Hence, the user has to keep his head unnaturally still

in order to achieve good performance.

The Pupil Center Cornea Reflection (PCCR) technique is the most commonly

used 2D mapping-based approach for eye gaze tracking. The angle of the visual axis

(or the location of the fixation point on the display surface) is calculated by tracking

the relative position of the pupil center and a speck of light reflected from the cornea,

technically known as the “glint” as shown in Figure 3.1 (a) and (b). The generation

of the glint will be discussed in more detail in Section 3.3.2.1. The accuracy of the

system can be further enhanced by illuminating the eyes with near-InfraRed (IR)

light, which produces the “bright pupil” effect as shown in Figure 3.1 (b) and makes

the video image easier to process. Infrared light is harmless and invisible to the

user.


Figure 3.1: Eye images with corneal reflection (glint): (a) dark pupil image; (b) bright pupil image. The glint is a small bright spot as indicated in (a) and (b)

Several systems [44, 48, 20, 78] have been built based on the PCCR technique.

Most of these systems show that if the user has the ability to keep his head fixed, or

to restrict head motion with the help of a chin-rest or bite-bar, very high accuracy can

be achieved in eye gaze tracking results. Specifically, the average error can be less

than 1 degree of visual angle, which corresponds to less than 10 mm on the computer screen

when the subject is sitting approximately 550 mm from the computer screen. But

as the head moves away from the original position where the user performed the eye

gaze calibration, the accuracy of these eye gaze-tracking systems drops dramatically;

for example, [79] reports detailed data showing how the calibration mapping function

decays as the head moves away from its original position. Jacob reports a similar


fact in [48]. Jacob attempted to solve the problem by giving the user the ability

to make local manual re-calibrations, which is quite burdensome for the user.

As these studies indicate, calibration is a significant problem in current remote eye

tracking systems.

Most of the commercially available eye gaze-tracking systems [1, 2, 3] are also

built on the PCCR technique, and most of them claim that they can tolerate small

head motion. For example, less than 2 square inches of head motion tolerance is

claimed for the eye gaze tracker from LC technologies [2], which is still working

to improve it. The ASL eye tracker [1] has the best claimed tolerance of head

movement, allowing approximately one square foot of head movement. It eliminates

the need for head restraint by combining a magnetic head tracker with a pan-tilt

camera. However, details about how it handles head motion are not publicly known.

Further, combining a magnetic head tracker with a pan-tilt camera is not only

complicated but also expensive for the regular user.

In summary, most existing eye gaze systems based on the PCCR technique

share two common drawbacks: first, the user has to perform certain experiments in

calibrating the user-dependent parameters before using the gaze-tracking system;

second, the user has to keep his head unnaturally still, with no significant head

movements allowed.

3.2.2 Direct 3D Gaze Estimation Technique

For the direct 3D gaze estimation technique, the 3D direction of the gaze is

estimated directly so that the gaze point can be obtained by simply intersecting it

with the scene. Therefore, how to estimate the 3D gaze direction of the eye precisely

is the key issue for most of these techniques. Several techniques [76, 84, 6, 101] have

been proposed to estimate the 3D direction of gaze directly from the eye images.

This method is not constrained by the head position, and it can be used to obtain

the gaze point on any object in the scene by simply intersecting it with the estimated

3D gaze line. Therefore, with the use of this method, the issues of gaze mapping

function calibration and head movement that plague the 2D mapping methods can

be solved nicely.


Morimoto et al. [76] proposed a technique to estimate the 3D gaze direction

of the eye with the use of a single calibrated camera and at least two light sources.

First, the radius of the eye cornea is measured in advance for each person, using

at least three light sources. A set of high order polynomial equations are derived

to compute the radius and center of the cornea, but their solutions are not unique.

Therefore, how to choose the correct one from the set of possible solutions is still an

issue. Furthermore, no working system has been built using the proposed technique.

Ohno et al. [84] proposed an approximation method to estimate the 3D eye

gaze. There are several limitations for this proposed method. First, the cornea

radius and the distance between the pupil and cornea center are fixed for all users

although they actually vary significantly from person to person. Second, the formu-

lation to obtain the cornea center is based on the assumption that the virtual image

of IR LED appears on the surface of the cornea. In fact, however – as shown in

Section 3.3.2.1 of this chapter – the virtual image of IR LED will not appear on the

surface of the cornea; instead, it will appear behind the cornea surface or inside the

cornea. Therefore, the calculated cornea center will be a very rough approximation.

Beymer et al. [6] proposed another system that can estimate the 3D gaze

direction based on a complicated 3D eyeball model with at least seven parameters.

First, the 3D eyeball model will be automatically individualized to a new user, which

is achieved by fitting the 3D eye model with a set of image features via a nonlinear

estimation technique. The image features used for fitting include only the glints of

the IR LEDs and the pupil edges. But as shown in Section 3.3.2.1 of this chapter,

the glints are the image projections of the virtual images of the IR LEDs created

by the cornea, and they are not on the surface of the eye cornea, but inside the

cornea. Also, the pupil edges are not on the surface of the 3D eye model, either.

Therefore, the radius of the cornea cannot be estimated based on the proposed

method. Further, when fitting such a complicated 3D model with only a few feature points,

the solution will be unstable and very sensitive to noise.

Shih et al. [101] proposed a novel method to estimate 3D gaze direction by

using multiple cameras and multiple light sources. In their method, although there is

no need to know the user-dependent parameters of the eye, there are several obvious


limitations for the current system. First, the light sources and the cameras cannot

be collinear, and a careful arrangement of them is required in order to achieve a

good performance; second, when the user is looking at points on the line connecting

the optical centers of the two cameras, the 3D gaze direction cannot be determined

uniquely.

Therefore, most of the existing 3D gaze-tracking techniques either require

knowledge of several user-dependent parameters about the eye [76, 84, 6], or cannot

work under certain circumstances [101]. But in reality, these user-dependent param-

eters of the eyeball, such as the cornea radius and the distance between the pupil

and the cornea center, etc., are very difficult to measure accurately due to the small

size of the eyeball (normally less than 10 mm). Therefore, the accuracy of these proposed eye gaze

techniques will decline dramatically if these parameters cannot be measured accurately.

3.3 Direct 3D Gaze Estimation Technique

3.3.1 The Structure of Human Eyeball

As shown in Figure 3.2, the eyeball is made up of the segments of two spheres

of different sizes, one placed in front of the other [86]. The anterior, the smaller

segment, is transparent and forms about one-sixth of the eyeball, and has a radius

of curvature of about 8 mm. The posterior, the larger segment, is opaque and forms

about five-sixths of the eyeball, and has a radius of about 12 mm.

Figure 3.2: The structure of the eyeball (top view of the right eye)

The anterior pole of the eye is the center of curvature of the transparent


segment or cornea. The posterior pole is the center of the posterior curvature of

the eyeball, and is located slightly temporal to the optical nerve. The optic axis is

defined as a line connecting these two poles, as shown in Figure 3.2. The fovea defines

the center of the retina, and is a small region with highest visual acuity and color

sensitivity. Since the fovea provides the sharpest and most detailed information, the

eyeball is continuously moving so that the light from the object of primary interest

will fall on this region. Thus, another major axis, the visual axis, is defined as

the projection of the foveal center into object space through the eye’s nodal point

Ocornea as shown in Figure 3.2. Therefore, it is the visual axis that determines a

person’s visual attention or direction of gaze, not the optic axis. Since the fovea is a

few degrees temporal to the posterior pole, the visual axis will deviate a few degrees

nasally from the optic axis. The angle formed by the intersection of the visual axis

and the optic axis at the nodal point is named the angle kappa. The angle kappa in

the two eyes should have the same magnitude [86], approximately 5°.

The pupillary axis of the eye is defined as the 3D line connecting the center of the

pupil P and the center of the cornea Ocornea. The pupillary axis is the best estimate

of the location of the eye’s optic axis; if extended through the eye, it should exit

very near the anatomical posterior pole. In Figure 3.2, the pupillary axis is shown

as the optic axis of the eyeball. Therefore, if we can obtain the 3D locations of

the pupil center and cornea center, then the optic axis of the eye can be estimated

easily.

3.3.2 Derivation of 3D Cornea Center

3.3.2.1 The Structure of Cornea

The anterior of the eyeball is composed of several layers, and each layer is made

of tissue with a slightly different refraction index. When light passes through the

eye, the boundary surface of each layer will act like a reflective surface. Therefore,

if a light source is placed in front of the eye, several reflections will occur on the

boundaries of the lens and cornea as shown in Figure 3.3. If these reflections are

captured by a camera, the generated images are called Purkinje images. The first

Purkinje image corresponds to the reflection from the external surface of the cornea


as shown in Figure 3.3, which will be captured as a very bright spot in the eye

image as shown in Figure 3.1 (a) and (b). This special bright dot is called the glint, and

it is the brightest and easiest reflection to detect and track. Detecting the other

Purkinje images requires special hardware; therefore, from now on, only the first

Purkinje image will be considered here.

Figure 3.3: The reflection diagram of Purkinje images

Since the external surface of the cornea functions like a convex mirror, in order

to understand the formation of the glint, the external surface of the cornea is further

modelled as a convex mirror with a radius R.

3.3.3 Image Formation in the Convex Mirror

First, a few key concepts will be introduced in order to study the image for-

mation by a spherical convex mirror [33].

Figure 3.4: A ray diagram to locate the image of an object in a convex mirror


As illustrated in Figure 3.4, the point V is the surface center of the mirror and

the normal of the mirror at V is called the principal axis. The mirror is assumed to be

rotationally symmetrical about its principal axis. This allows us to represent a three-

dimensional mirror in a two-dimensional diagram without loss of generality. The

point O, on the principal axis, which is equidistant from all points on the reflecting

surface of the mirror, is called the center of curvature. It is found experimentally

that rays striking a convex mirror parallel to its principal axis, and not too far away

from this axis, are reflected by the mirror such that they all pass through the same

point F on the principal axis. This point, which lies between the center of curvature

and the vertex, is called the focal point, or virtual focus, of the mirror.

The ray diagram of the image produced by a convex mirror always conforms

to the following two simple rules:

1. An incident ray which is parallel to the principal axis is reflected as if it came

from the virtual focus of the mirror.

2. An incident ray which is directed towards the center of curvature of the mirror

is reflected back along its own path (since it is normally incident on the mirror).

As shown in Figure 3.4, two rays are used to locate the image S′T′ of an object

ST placed in front of the mirror. It can be seen that the image is virtual, upright,

and diminished.
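For completeness, these two rules are consistent with the standard paraxial mirror relation (basic optics, not part of the thesis derivation): the focal length of a spherical mirror has magnitude R/2, so the virtual focus F lies midway between the vertex V and the center of curvature O, and with the usual sign convention (f = −R/2 < 0 for a convex mirror, object distance do > 0),

1/do + 1/di = 1/f

always gives di < 0, i.e., a virtual, upright and diminished image, as observed in Figure 3.4.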

3.3.3.1 Glint Formation In Cornea Reflection

The eye cornea serves as a convex mirror during the process of glint formation.

Specifically, the focus point F , the center of the curvature Ocornea and the principal

axis are shown in Figure 3.5. In our research, the IR LEDs are utilized as the light

sources. Therefore, when an IR LED is placed in front of the eye, the cornea will

produce a virtual image of the IR LED, which is located somewhere behind the

cornea surface, as shown in Figure 3.5.

According to the properties of the convex mirror, the IR light ray diagram of

the cornea is shown in Figure 3.5. In the diagram, an image is the location in space

where it appears that light diverges from. Any observer from any position who is


Figure 3.5: The image formation of a point light source in the cornea when the cornea serves as a convex mirror

sighting along a line at the image location will view the IR light source as a result

of the reflected light; each observer sees the image in the same location regardless of

the observer’s location. Thus, the task of determining the image location of the IR

light source is to determine the location where reflected light intersects. In Figure

3.5, several rays of light emanating from the IR light source are shown approaching

the cornea and subsequently reflecting. Each ray is extended backwards to a point

of intersection - this point of intersection of all extended reflected rays indicates the

location of the virtual image of IR light source.

In our research, the cameras are the observers. Therefore, the virtual image

of the IR light source created by the cornea will be shown as a glint in the image

captured by the camera. If we place two cameras at different locations, each camera

will capture a glint corresponding to the same virtual image of the IR light source

in space as shown in Figure 3.6. Therefore, in theory, with the use of two cameras,

the 3D location of the virtual image of the IR light source in space can be recovered.
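To make this step concrete, the following is a minimal C++ sketch (illustrative only, not the thesis implementation) of recovering a 3D point from two viewing rays, e.g. the rays obtained by back-projecting the two glints through their calibrated cameras into a common coordinate frame. Because of noise the two rays rarely intersect exactly, so the midpoint of their common perpendicular is returned as the least-squares "intersection".

struct Vec3 { double x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 scale(Vec3 a, double s) { return {a.x * s, a.y * s, a.z * s}; }
static double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Rays p = c1 + s*d1 and q = c2 + t*d2; returns the midpoint of the shortest
// segment connecting them.
Vec3 triangulateRays(Vec3 c1, Vec3 d1, Vec3 c2, Vec3 d2)
{
    Vec3 w = sub(c1, c2);
    double a = dot(d1, d1), b = dot(d1, d2), c = dot(d2, d2);
    double d = dot(d1, w),  e = dot(d2, w);
    double denom = a * c - b * b;        // close to 0 if the rays are nearly parallel
    double s = (b * e - c * d) / denom;
    double t = (a * e - b * d) / denom;
    Vec3 p = add(c1, scale(d1, s));      // closest point on the first ray
    Vec3 q = add(c2, scale(d2, t));      // closest point on the second ray
    return scale(add(p, q), 0.5);        // midpoint of the common perpendicular
}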

3.3.3.2 Curvature Center of the Cornea

According to the properties of the convex mirror, an incident ray that is di-

rected towards the center of curvature of a mirror is reflected back along its own

path (since it is normally incident on the mirror). Therefore, as shown in Figure

3.7, if the light ray L1P1 is shone directly towards the center of the curvature of the

cornea Ocornea, it will be reflected back along its own path. Also, the virtual image


Figure 3.6: The ray diagram of the virtual image of the IR light source in front of the cameras

of the IR light source P1 will lie on this path. Therefore, as shown in Figure 3.7, the

IR light source L1, its virtual image P1 and the curvature center of the cornea

Ocornea will be co-linear.

Figure 3.7: The ray diagram of two IR light sources in front of the cornea

Further, if we place another IR light source at a different place L2 as shown in

Figure 3.7, then the IR light source L2, its virtual image P2 and the curvature

center of the cornea Ocornea will lie on another line L2P2Ocornea. Line L1P1Ocornea

and line L2P2Ocornea will intersect at the point Ocornea.

As discussed in Section 3.3.3.1, if two cameras are used, the 3D locations of

the virtual images P1 and P2 of the IR light sources can be obtained through 3D

reconstruction. Furthermore, the 3D location of the IR light sources L1 and L2 can


be obtained through the system calibration procedure discussed in Section 3.5.1.

Therefore, the 3D location of the curvature center of cornea Ocornea can be obtained

by intersecting the line L1P1 and L2P2 as follows:

Ocornea = L1 + k1 (L1 − P1)

Ocornea = L2 + k2 (L2 − P2)    (3.1)

Note that when more than two IR light sources are available, a set of equations

can be obtained, which can lead to a more robust estimation of the 3D location of

the cornea center.
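A corresponding sketch of equation 3.1 is given below (illustrative only; it reuses the Vec3 helpers and triangulateRays from the previous sketch, and assumes L1, L2 are the calibrated IR LED positions and P1, P2 the reconstructed virtual images):

// The cornea center lies on both lines L1 + k1*(L1 - P1) and L2 + k2*(L2 - P2);
// with noisy measurements the lines are skew, so the midpoint of the shortest
// segment between them is taken as the estimate of Ocornea.
Vec3 estimateCorneaCenter(Vec3 L1, Vec3 P1, Vec3 L2, Vec3 P2)
{
    return triangulateRays(L1, sub(L1, P1), L2, sub(L2, P2));
}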

3.3.4 Computation of 3D Gaze Direction

3.3.4.1 Estimation of Optic Axis

As discussed earlier, the pupillary axis is the best approximation of the optic

axis of the eye. Therefore, after the 3D location of the pupil center P is extracted,

the optic axis Vp of the eye can be estimated by connecting the 3D pupil center P

with cornea center Ocornea as follows:

Vp = Ocornea + k(P −Ocornea) (3.2)

Since the fovea is invisible from the captured eye images, the visual axis of the

eye cannot be estimated directly. Without knowing the visual axis of the eye, the

user’s fixation point in the 3D space still cannot be determined.

However, the deviation angle kappa between the visual axis and the optic axis

of the eye is constant for each person. Therefore, if the deviation angle kappa is

known, then the visual axis can be computed from the estimated optic axis easily.

In the following, a technique is proposed to estimate the deviation angle kappa

accurately.


3.3.4.2 Compensation of the Angle Deviation between Visual Axis and

Optic Axis

When a user is looking at a known point Ps on the screen, the 3D location

of the screen point Ps is known since the screen is calibrated. At the same

time, the 3D location of the cornea center Ocornea and the 3D location of the pupil

center P can be computed from the eye images via the proposed technique discussed

above. Therefore, the direction of the visual axis Vv and the direction of the optic axis Vp can be computed as follows:

Vv = (Ps − Ocornea) / ‖Ps − Ocornea‖

Vp = (P − Ocornea) / ‖P − Ocornea‖    (3.3)

In addition, let’s represent the relationship between the visual axis and the

optic axis as follows:

Vv = R Vp    (3.4)

where R is a 3 × 3 rotation matrix constructed from the deviation angles between the vectors Vv and Vp, i.e., the deviation angle kappa. Once the rotation matrix

R is estimated, then the 3D visual axis can be estimated from the extracted 3D

optic axis. Therefore, instead of estimating the deviation angle kappa directly to

know the relationship between the visual axis and the optic axis, it can be encoded

through the rotation matrix R implicitly. In addition, the rotation matrix R can be

estimated by a simple calibration as follows.

During the calibration, the user is asked to look at a set of k pre-defined points Psi (i = 1, · · · , k) on the screen. After the calibration is done, a set of k pairs of vectors Vv and Vp are obtained via equation 3.3. In addition, since the rotation matrix R is an orthonormal matrix, equation 3.4 can be rewritten as

Vp = R^T Vv    (3.5)

Therefore, according to equations 3.4 and 3.5, one pair of vectors Vv and Vp can


give 6 linear equations so that two screen points are enough to estimate the 3 × 3

rotation matrix R.

Once the rotation matrix R is estimated, the visual axis of the eye can be esti-

mated from the computed optic axis Vp through equation 3.4. Finally, an accurate

point of regard of the user can be computed by intersecting the estimated 3D visual

axis with any object in the scene.
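One possible way to compute R from the calibration data is sketched below. This is an assumption on my part: the thesis solves the linear system given by equations 3.4 and 3.5 directly, whereas this sketch uses the SVD-based orthogonal Procrustes solution, which additionally guarantees that R is a proper rotation. It assumes the Eigen library; vp[i] and vv[i] are the unit optic-axis and visual-axis directions collected at the k calibration points.

#include <Eigen/Dense>
#include <vector>

Eigen::Matrix3d estimateKappaRotation(const std::vector<Eigen::Vector3d>& vp,
                                      const std::vector<Eigen::Vector3d>& vv)
{
    // Cross-covariance of the calibration pairs (Vp_i, Vv_i).
    Eigen::Matrix3d H = Eigen::Matrix3d::Zero();
    for (size_t i = 0; i < vp.size(); ++i)
        H += vp[i] * vv[i].transpose();

    Eigen::JacobiSVD<Eigen::Matrix3d> svd(H, Eigen::ComputeFullU | Eigen::ComputeFullV);
    Eigen::Matrix3d U = svd.matrixU();
    Eigen::Matrix3d V = svd.matrixV();

    // Guard against a reflection: force det(R) = +1.
    Eigen::Matrix3d D = Eigen::Matrix3d::Identity();
    D(2, 2) = (V * U.transpose()).determinant() > 0 ? 1.0 : -1.0;

    return V * D * U.transpose();          // R such that Vv ≈ R * Vp
}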

3.4 2D Mapping-Based Gaze Estimation Technique

Most available remote eye gaze trackers are built on the PCCR technique.

If the users have the ability to keep their heads fixed or use a chin-rest to

restrict the head motion, very high accuracy can be achieved in most eye gaze esti-

mation results. But as the head moves away from the original head position where

the user performed the eye gaze calibration, the accuracy of these gaze-tracking

systems drops significantly.

In the following sections, the head motion effect on the accuracy of the PCCR-

based gaze-tracking techniques is first analyzed. Subsequently, a solution is proposed

to compensate for the head movement effect so that the user can move his head freely

in front of the camera while the gaze information can still be accurately estimated.

3.4.1 Classical PCCR Technique

The PCCR-based technique consists of two major components: pupil-glint

vector extraction and gaze mapping function acquisition.

1. Pupil-glint Vector Extraction

Gaze estimation starts with pupil-glint vector extraction. After grabbing the

eye image from the camera, computer vision techniques [133, 134] are used to

extract the pupil center and the glint center robustly and accurately. The pupil

center and the glint center are connected to form a 2D pupil-glint vector v as shown

in Figure 3.9.

2. Specific Gaze Mapping Function Acquisition

After obtaining the pupil-glint vectors, a calibration procedure is proposed to


acquire a specific gaze mapping function that will map the extracted pupil-glint

vector to the user’s fixation point on the screen at the current head position. The

extracted pupil-glint vector v is represented as (vx, vy) and the screen gaze point Ss

is represented by (xgaze, ygaze) in the screen coordinate system. The specific gaze

mapping function Ss = f(v) can be modelled by the following nonlinear equations

[2]:

xgaze = a0 + a1 ∗ vx + a2 ∗ vy + a3 ∗ vx ∗ vy

ygaze = b0 + b1 ∗ vx + b2 ∗ vy + b3 ∗ vy²    (3.6)

The coefficients a0, a1, a2, a3 and b0, b1, b2, b3 are estimated from a set of pairs

of pupil-glint vectors and the corresponding screen gaze points. These pairs are

collected in a calibration procedure. During the calibration, the user is required

to visually follow a shining dot as it displays at several predefined locations on the

computer screen. In addition, he must keep his head as still as possible.
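Given the collected calibration pairs, the coefficients of equation 3.6 can be fitted by linear least squares. The sketch below is illustrative (function and type names are assumptions, not the thesis code) and uses the Eigen library:

#include <Eigen/Dense>
#include <vector>

struct GazeMap { Eigen::Vector4d a, b; };            // coefficients of equation 3.6

GazeMap calibrateGazeMapping(const std::vector<Eigen::Vector2d>& v,     // (vx, vy)
                             const std::vector<Eigen::Vector2d>& gaze)  // (xgaze, ygaze)
{
    const int n = static_cast<int>(v.size());        // needs at least 4 samples
    Eigen::MatrixXd A(n, 4), B(n, 4);
    Eigen::VectorXd x(n), y(n);
    for (int i = 0; i < n; ++i) {
        double vx = v[i].x(), vy = v[i].y();
        A.row(i) << 1.0, vx, vy, vx * vy;             // basis terms for xgaze
        B.row(i) << 1.0, vx, vy, vy * vy;             // basis terms for ygaze
        x(i) = gaze[i].x();
        y(i) = gaze[i].y();
    }
    GazeMap m;
    m.a = A.colPivHouseholderQr().solve(x);           // least-squares fit of a0..a3
    m.b = B.colPivHouseholderQr().solve(y);           // least-squares fit of b0..b3
    return m;
}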

If the user does not move his head significantly after the gaze calibration, the

calibrated gaze mapping function can be used to accurately estimate the user’s gaze

point on the screen, based on the extracted pupil-glint vector. But when the user

moves his head away from the position where the gaze calibration is performed,

the calibrated gaze mapping function will fail to estimate the gaze point accurately

because of the pupil-glint vector changes caused by the head movements. In the

following section, head movement effects on the pupil-glint vector will be illustrated.

3.4.2 Head Motion Effects on Pupil-glint Vector

Figure 3.8 shows the ray diagram of the pupil-glint vector generation in the

image when an eye is located at two different 3D positions O1 and O2 in front of the

camera due to head movement. For simplicity, the eye is represented by a cornea,

the cornea is modelled as a convex mirror, and the IR light source used to generate

the glint is located at O, all of which are applicable to subsequent figures in this

chapter. Assume that the origin of the camera is located at O, p1 and p2 are the

pupil centers and g1 and g2 are the glint centers generated in the image. Further,

at both positions, the user is looking at the same point of the computer screen S.


According to the light ray diagram shown in Figure 3.8, the generated pupil-glint

vectors −−→g1p1 and −−→g2p2 will be significantly different in the images, as shown in Figure

3.9. Two factors are responsible for this pupil-glint vector difference: first, the eyes

are at different positions in front of the camera; second, in order to look at the same

screen point, eyes at different positions rotate themselves differently.

Figure 3.8: Pupil and glint image formations when eyes are located at different positions while gazing at the same screen point (side view)

Figure 3.9: The pupil-glint vectors generated in the eye images when the eye is located at O1 and O2 in Figure 3.8: (a) eye image at location O1, (b) eye image at location O2

The eye will move as the head moves. Therefore, when the user is gazing

at a fixed point on the screen while moving his head in front of the camera, a set

of pupil-glint vectors in the image will be generated. These pupil-glint vectors are

significantly different from each other. If uncorrected, inaccurate gaze points will

be estimated after inputting them into a calibrated specific gaze mapping function

obtained at a fixed head position.


Therefore, the head movement effects on these pupil-glint vectors must be

eliminated in order to utilize the specific gaze mapping function to estimate the

screen gaze points accurately. In the following section, a technique is proposed

to eliminate the head movement effects from these pupil-glint vectors. With this

technique, screen gaze points can be estimated accurately under natural

head movements.

3.4.3 Dynamic Head Compensation Model

3.4.3.1 Approach Overview

The first step of our technique is to find a specific gaze mapping function fO1

between the pupil-glint vector v1 and the screen coordinate S at a reference 3D

eye position O1. This is usually achieved via a gaze calibration procedure using

equations 3.6. The function fO1 can be expressed as follows:

S = fO1(v1) (3.7)

Assume that when the eye moves to a new position O2 as the head moves, a

pupil-glint vector v2 will be created in the image while the user is looking at the

same screen point S. When O2 is significantly different from O1, v2 cannot be used

as the input of the gaze mapping function fO1 to estimate the screen gaze point due

to the changes of the pupil-glint vector caused by the head movement. If the changes

of the pupil-glint vector v2 caused by the head movement can be eliminated, then

a corrected pupil-glint vector v2′ will be obtained. Ideally, this corrected pupil-glint vector v2′ is the generated pupil-glint vector v1 of the eye at the reference position

O1 when gazing at the same screen point S. Therefore, this is equivalent to finding

a head mapping function g between two different pupil-glint vectors at two different

head positions when still gazing at the same screen point. This mapping function g

can be written as follows:

v2′ = g(v2, O2, O1)    (3.8)

where v2′ is the equivalent measurement of v1 with respect to the initial reference


head position O1. Therefore, the screen gaze point can be estimated accurately from

the pupil-glint vector v2′ via the specific gaze mapping function fO1 as follows:

S = fO1(g(v2, O2, O1)) = F (v2, O2) (3.9)

where the function F can be called a generalized gaze mapping function that

explicitly accounts for the head movement. It provides the gaze mapping function

dynamically for a new eye position O2.

With the use of the proposed technique, whenever the head moves, a gaze map-

ping function at each new 3D eye position can be updated automatically; therefore,

the issue of the head movement can be solved nicely.

3.4.3.2 Image Projection of Pupil-glint Vector

In this section, we show how to find the head mapping function g. Figure 3.10

shows the process of the pupil-glint vector formation in the image for an eye in front

of the camera. When the eye is located at two different positions O1 and O2 while

still gazing at the same screen point S, two different pupil-glint vectors −−→g1p1 and

−−→g2p2 are generated in the image. Further, as shown in Figure 3.10, a plane A parallel

to the image plane that goes through the point P1 will intersect the line O1O at

G1 (note that neither G1 nor G2 is the actual virtual image of the IR light source). Another plane B parallel to the image plane that goes through the point P2 will intersect the line O2O at G2. Therefore, −−→g1p1 is the projection of the vector −−−→G1P1 and −−→g2p2 is the projection of the vector −−−→G2P2 in the image plane. Because

plane A, plane B and the image plane are parallel, the vectors −−→g1p1, −−→g2p2,−−−→G1P1 and

−−−→G2P2 can be represented as 2D vectors in the X−Y plane of the camera coordinate

system.

Assume that in the camera coordinate system, the 3D pupil centers P1 and

P2 are represented as (x1, y1, z1) and (x2, y2, z2), the glint centers g1 and g2 are

represented as (xg1, yg1, −f) and (xg2, yg2, −f), where f is the focal length of the camera,

and the screen gaze point S is represented by (xs, ys, zs). Via the pinhole camera



Figure 3.10: Pupil and glint image formation when the eye is located at different positions in front of the camera

model, the image projection of the pupil-glint vectors can be expressed as follows:

−−→g1p1 = −(f / z1) ∗ −−−→G1P1    (3.10)

−−→g2p2 = −(f / z2) ∗ −−−→G2P2    (3.11)

Assume that the pupil-glint vectors −−→g1p1 and −−→g2p2 are represented as (vx1, vy1)

and (vx2, vy2) respectively, and the vectors −−−→G1P1 and −−−→G2P2 are represented as (Vx1, Vy1)

and (Vx2, Vy2) respectively. Therefore, the following equation can be derived by com-

bining the equations 3.10 and 3.11:

vx1 = (Vx1 / Vx2) ∗ (z2 / z1) ∗ vx2    (3.12)

vy1 = (Vy1 / Vy2) ∗ (z2 / z1) ∗ vy2    (3.13)

The above two equations describe how the pupil-glint vector changes as the


head moves in front of the camera. Also, based on the above equations, it is obvious

that each component of the pupil-glint vector can be mapped individually. There-

fore, equation 3.12 for the X component of the pupil-glint vector will be derived

first as follows.

3.4.3.3 First Case: The cornea center and the pupil center lie on the

camera’s X − Z plane:

Figure 3.11 shows the ray diagram of the pupil-glint vector formation when

the cornea center and pupil center of an eye happen to lie on the X − Z plane of

the camera coordinate system. Therefore, either the generated pupil-glint vectors

−−→p1g1 and −−→p2g2 or the vectors −−−→P1G1 and −−−→P2G2 can be represented as one-dimensional vectors, specifically, −−→p1g1 = vx1, −−→p2g2 = vx2, −−−→P1G1 = Vx1 and −−−→P2G2 = Vx2.

Figure 3.11: Pupil and glint image formation when the eye is located at different positions in front of the camera (top-down view)

According to Figure 3.11, the vectors −−−→G1P1 and −−−→G2P2 can be represented as follows:

−−−→G1P1 = −−−→G1O1′ + −−−→O1′P1    (3.14)

−−−→G2P2 = −−−→G2O2′ + −−−→O2′P2    (3.15)


For simplicity, r1 is used to represent the length of −−−→O1P1, r2 is used to represent the length of −−−→O2P2, ∠G1P1O1 is represented as α1, ∠G2P2O2 is represented as α2, ∠P1G1O1 is represented as β1 and ∠P2G2O2 is represented as β2. According to the geometry shown in Figure 3.11, the vectors −−−→G1P1 and −−−→G2P2 can be further derived as follows:

−−−→G1P1 = −(r1 ∗ sin(α1)) / tan(β1) − r1 ∗ cos(α1)    (3.16)

−−−→G2P2 = −(r2 ∗ sin(α2)) / tan(β2) − r2 ∗ cos(α2)    (3.17)

As shown in Figure 3.11, line G1P1 and line G2P2 are parallel to the X axis

of the camera. Therefore, tan(β1) and tan(β2) can be obtained from the triangles

g1ocO and g2ocO individually as follows:

tan(β1) = f / −−→ocg1    (3.18)

tan(β2) = f / −−→ocg2    (3.19)

In the above equation, g1 and g2 are the glints in the image, and oc is the

principal point of the camera. For simplicity, we choose xg1 to represent −−→ocg1 and

xg2 to represent −−→ocg2. Therefore, after detecting the glints in the image, tan(β1) and

tan(β2) can be obtained accurately.

Further, sin(α1), cos(α1), sin(α2) and cos(α2) can be obtained from the

geometries of the triangles P1SS′1 and P2SS′2 directly. Therefore, equations 3.16

and 3.17 can be derived as follows:

Vx1 = r1 ∗ ((zs − z1) ∗ xg1) / (P1S ∗ f) + r1 ∗ (xs − x1) / P1S    (3.20)

Vx2 = r2 ∗ ((zs − z2) ∗ xg2) / (P2S ∗ f) + r2 ∗ (xs − x2) / P2S    (3.21)

3.4.3.4 Second Case: The cornea center and the pupil center do not lie

on the camera’s X − Z plane:

In fact, the cornea center and the pupil center do not always lie on the camera’s

X − Z plane. However, we can obtain the ray diagram shown in Figure 3.11 by


projecting the ray diagram in Figure 3.10 into the X−Z plane along the Y axis of the

camera’s coordinate system. Therefore, as shown in Figure 3.12, point P1 is the

projection of the pupil center Pc1, point O1 is the projection of the cornea center

Oc1, and point S is also the projection of the screen gaze point S′ in the X−Z plane. Starting from Oc1, a line Oc1P1′ parallel to line O1P1 intersects line Pc1P1 at P1′. Also, starting from Pc1, a line Pc1O1′′ parallel to line P1S intersects line SS′ at O1′′.

Figure 3.12: Projection into the camera's X−Z plane

Because Oc1Pc1 represents the distance r between the pupil center and the cornea

center, which will not change as the eyeball rotates, O1P1 can be derived as follows:

r1 = O1P1 = r ∗ P1S / √(P1S² + (y1 − ys)²)    (3.22)

Therefore, when the eye moves to a new location O2 as shown in Figure 3.11,

O2P2 can be represented as follows:

r2 = O2P2 = r ∗ P2S / √(P2S² + (y2 − ys)²)    (3.23)


After substituting the formulations of r1 and r2 into equations 3.20 and 3.21, we can obtain Vx1/Vx2 as follows:

Vx1 / Vx2 = d ∗ [(zs − z1) ∗ xg1 + (xs − x1) ∗ f ] / [(zs − z2) ∗ xg2 + (xs − x2) ∗ f ]    (3.24)

where d is set as follows:

d = √((z2 − zs)² + (x2 − xs)² + (y2 − ys)²) / √((z1 − zs)² + (x1 − xs)² + (y1 − ys)²)

As a result, equations 3.12 and 3.13 can be finally obtained as follows:

vx1 = d ∗ ([(zs − z1) ∗ xg1 + (xs − x1) ∗ f ] / [(zs − z2) ∗ xg2 + (xs − x2) ∗ f ]) ∗ (z2 / z1) ∗ vx2    (3.25)

vy1 = d ∗ ([(zs − z1) ∗ yg1 + (ys − y1) ∗ f ] / [(zs − z2) ∗ yg2 + (ys − y2) ∗ f ]) ∗ (z2 / z1) ∗ vy2    (3.26)

The above equations constitute the head mapping function g between the pupil-glint

vectors of the eyes at different positions in front of the camera, while gazing at the

same screen point.
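For illustration, a compact C++ sketch of this head mapping function g is given below (hypothetical names, not the thesis code). It maps the pupil-glint vector (vx2, vy2) measured at the new 3D pupil position P2 back to its equivalent at the reference position P1, given the two glint image positions, the camera focal length f, and the current estimate of the 3D screen gaze point S.

#include <cmath>

struct Vec2 { double x, y; };
struct Vec3d { double x, y, z; };

// Head mapping g (equations 3.25 and 3.26): converts the pupil-glint vector v2
// measured at pupil position P2 into its equivalent at the reference position P1.
Vec2 headMapping(Vec2 v2, Vec3d P2, Vec2 g2,
                 Vec3d P1, Vec2 g1, Vec3d S, double f)
{
    auto dist = [](Vec3d a, Vec3d b) {
        return std::sqrt((a.x - b.x) * (a.x - b.x) +
                         (a.y - b.y) * (a.y - b.y) +
                         (a.z - b.z) * (a.z - b.z));
    };
    double d = dist(P2, S) / dist(P1, S);                   // scale factor d of eq. 3.24

    double numX = (S.z - P1.z) * g1.x + (S.x - P1.x) * f;   // numerator of eq. 3.25
    double denX = (S.z - P2.z) * g2.x + (S.x - P2.x) * f;
    double numY = (S.z - P1.z) * g1.y + (S.y - P1.y) * f;   // numerator of eq. 3.26
    double denY = (S.z - P2.z) * g2.y + (S.y - P2.y) * f;

    double depth = P2.z / P1.z;                             // z2 / z1
    return { d * (numX / denX) * depth * v2.x,
             d * (numY / denY) * depth * v2.y };
}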

3.4.3.5 Iterative Algorithm for Gaze Estimation

Equations 3.25 and 3.26 require the knowledge of gaze point S = (xs, ys, zs)

on the screen. However, the gaze point S is the one that needs to be estimated. As

a result, the gaze point S is also a variable of the head mapping function g, which

can be further expressed as follows:

v2′ = g(v2, P2, P1, S)    (3.27)

Assume that a specific gaze mapping function fP1 is known via the calibration

procedure described in Section 3.4.1. Therefore, after integrating the head mapping

function g into the specific gaze mapping function fP1 via equation 3.9, the


generalized gaze mapping function F can be rewritten as follows:

S = F (v2, P2, S) (3.28)

Given the extracted pupil-glint vector v2 from the eye image and the new

location P2 that the eye has moved to, equation 3.28 becomes a recursive function.

An iterative solution is proposed to solve it.

First, the screen center S0 is chosen as an initial gaze point, then a corrected

pupil-glint vector v2′ can be obtained from the detected pupil-glint vector v2 via the

head mapping function g. By inputting the corrected pupil-glint vector v2′ into the

specific gaze mapping function fP1, a new screen gaze point S ′ can be estimated. S ′

is further used to compute a new corrected pupil-glint vector v2′. The loop continues

until the estimated screen gaze point S ′ does not change any more. Usually, the

whole iteration process will converge in fewer than 5 iterations, which makes real-time implementation possible.
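The iteration can be sketched as follows (illustrative only; it reuses the Vec2/Vec3d types and the headMapping sketch given earlier, a[4] and b[4] are the calibrated coefficients of equation 3.6, and toScreen3D is an assumed helper that converts a 2D screen gaze point into its 3D position using the calibrated screen plane):

#include <cmath>
#include <functional>

Vec2 estimateGazeIteratively(Vec2 v2, Vec3d P2, Vec2 g2,
                             Vec3d P1, Vec2 g1, double f,
                             const double a[4], const double b[4],
                             Vec3d screenCenter3D,
                             const std::function<Vec3d(Vec2)>& toScreen3D)
{
    Vec3d S = screenCenter3D;                 // initial guess S0: the screen center
    Vec2 gaze{0.0, 0.0};
    for (int it = 0; it < 10; ++it) {         // usually converges in fewer than 5 steps
        Vec2 v = headMapping(v2, P2, g2, P1, g1, S, f);     // corrected vector v2'
        Vec2 next{ a[0] + a[1] * v.x + a[2] * v.y + a[3] * v.x * v.y,   // eq. 3.6
                   b[0] + b[1] * v.x + b[2] * v.y + b[3] * v.y * v.y };
        bool converged = std::fabs(next.x - gaze.x) < 1e-3 &&
                         std::fabs(next.y - gaze.y) < 1e-3;
        gaze = next;
        if (converged) break;                 // gaze point no longer changes
        S = toScreen3D(gaze);                 // 3D position of the current estimate
    }
    return gaze;
}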

3.5 Experiment Results

3.5.1 System Setup

Our system consists of two cameras mounted under the computer monitor, as

shown in Figure 3.13. An IR light illuminator is mounted at the center of the lens

of each camera, which will produce the corneal glint in the eye image.

Figure 3.13: The configuration of the gaze-tracking system


Before using the system, two steps are performed to calibrate the sys-

tem. The first step is to obtain the parameters of the stereo camera system, which

is obtained through camera calibration [130]. Once the stereo camera system is

calibrated, given any point Pi in front of it, the 3D position (xi yi zi)T of Pi can

be reconstructed from the image points of Pi in both cameras. The second step is

to obtain the 3D positions of the IR LEDs and the computer screen in the stereo

camera system. Since the IR LEDs and the computer screen are located outside

the field of view of the stereo camera system, they cannot be observed directly by the

stereo camera system. Therefore, similar to [6, 101], a planar mirror with a set of

fiducial markers attached to the mirror surface is utilized. With the help of the

planar mirror, the virtual images of the IR LEDs and the computer screen reflected

by the mirror can be observed by the stereo camera system. Thus, the 3D locations

of the IR LEDs and the computer screen can be calibrated after obtaining the 3D

locations of their virtual images.
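The geometry behind this step is the standard reflection of a point about a plane (stated here for completeness, not quoted from the thesis): if the mirror plane is recovered from the fiducial markers as a point m on the plane with unit normal n, then each reconstructed virtual point Pv corresponds to the real point

P = Pv − 2 ((Pv − m) · n) n

and applying this mapping to the reconstructed virtual images of the IR LEDs and of the screen corners yields their actual 3D positions in the stereo camera frame.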

In the following, some experimental results of both gaze-tracking techniques are

reported.

3.5.2 Performance of 3D Gaze Tracking Technique

3.5.2.1 Gaze Estimation Accuracy

Once the system is calibrated and the angle deviation between the visual axis

and the optic axis for a new user is obtained, his screen gaze point can be determined

by intersecting the estimated 3D visual axis of the eye with the computer screen.

In order to test the accuracy of the gaze-tracking system, seven users were involved

in the experiments, and none of them wore glasses.

Personal calibration is needed for each user before using our gaze-tracking sys-

tem in order to obtain the angle deviation of the visual axis and the optic axis. The

calibration is very fast and only lasts for less than 5 seconds. Once the calibration

is done, the user does not need to do the calibration any more if he wants to use

the system later.

During the experiments, a marker is displayed at nine fixed locations on the

screen in random order, and the user is asked to gaze at the marker when it appears at


each location. The experiment contains five 1-minute sessions. At each session,

the user is required to position his head at a different position purposely. Table

3.1 summarizes the computed gaze estimation accuracy for the first subject, where

the last column represents the average distance from the user to the camera during

each session. As shown in Table 3.1, the accuracy of the gaze tracker significantly

depends on the user’s distance to the camera. Normally, as the user moves closer

to the camera, the gaze accuracy will increase dramatically. This is because the

resolution of the eye image increases as the user moves closer to the camera.

Table 3.1: The gaze estimation accuracy for the first subject.

Session    Horizontal accuracy    Vertical accuracy     Distance to the camera
1          5.02 mm (0.72°)        6.40 mm (0.92°)       280 mm
2          7.20 mm (0.92°)        9.63 mm (1.22°)       320 mm
3          9.74 mm (1.24°)        13.24 mm (1.68°)      370 mm
4          12.47 mm (1.37°)       17.30 mm (1.90°)      390 mm
5          19.60 mm (1.97°)       24.32 mm (2.45°)      440 mm

Table 3.2 summarizes the computed average gaze estimation accuracy for all

the seven subjects during the experiments. Specifically, in the experiment, the aver-

age covered head movement volume is around 200mm in the X, Y and Z directions

respectively. In addition, the average angular gaze accuracy in the horizontal direc-

tion is 1.47° and the average angular gaze accuracy in the vertical direction is 1.87°

for all these seven users, which is acceptable for many Human Computer Interaction

(HCI) applications, allowing natural head movements.

Table 3.2: The gaze estimation accuracy for seven subjects

Subject    Horizontal accuracy (degrees)    Vertical accuracy (degrees)
1          1.24                             1.63
2          1.28                             1.70
3          1.33                             1.74
4          1.39                             1.79
5          1.43                             1.87
6          1.66                             2.05
7          1.97                             2.32


3.5.2.2 Comparison with Other Methods

Table 3.3 shows the comparison of accuracy and allowable head movements

among several practically working gaze-tracking systems that allow natural head

movements. In addition, all of these systems were built recently and require only

a very simple personal calibration instead of a tedious gaze mapping function cal-

ibration. For simplicity, only the depth or Z direction movement is illustrated, as

shown in the second column of Table 3.3. We can see that our proposed technique

can provide a competitive gaze accuracy as well as a large head movement volume

without the help of a face tracking system. Therefore, it represents the state of the

art in 3D gaze-tracking research.

Table 3.3: Comparison with other systems

Methods    Head movement volume (Z)    Best accuracy    Features
[101]      < 70 mm                     0.4°             1 stereo camera, eye tracking
[6]        N/A, but > 70 mm            0.6°             2 stereo cameras, face tracking, eye tracking
Ours       around 200 mm               0.72°            1 stereo camera, eye tracking
[133]      around 500 mm               5°               single camera, eye tracking

3.5.3 Performance of 2D Mapping Based Gaze Tracking Technique

3.5.3.1 Head Compensation Model Validation

For the proposed 2D mapping-based gaze-tracking technique, the Equations

3.25 and 3.26 of the head mapping function g are validated first by the following

experiments.

A screen point Sc = (132.75,−226.00,−135.00) is chosen as the gaze point.

The user gazes at this point from twenty different locations in front of the camera;

at each location, the pupil-glint vector and the 3D pupil center are collected. The

3D pupil centers and the pupil-glint vectors of the first two samples P1,P2 are shown

in Table 3.4, where P1 serves as the reference position. The second column indicates


the original pupil-glint vectors, while the third column indicates the transformed

pupil-glint vectors by the head compensation model. The difference between the

transformed pupil-glint vector of P2 and the reference pupil-glint vector at P1 is

defined as the transformation error. Figure 3.14 illustrates the transformation errors

for all these twenty samples. It is observed that the average transformation error is

only around 1 pixel, which validates our proposed head compensation model.

Table 3.4: Pupil-glint vector comparison at different eye locations

3D pupil position (mm)        2D pupil-glint vector (pixel)    Transformed pupil-glint vector (pixel)
P1 (5.25, 15.56, 331.55)      (9.65, -16.62)                   (9.65, -16.62)
P2 (-8.13, 32.29, 361.63)     (7.17, -13.33)                   (8.75, -16.01)

Figure 3.14: The pupil-glint vector transformation errors: (a) transformation error on the X component of the pupil-glint vector, (b) transformation error on the Y component of the pupil-glint vector

3.5.3.2 Gaze Estimation Accuracy

In order to test the accuracy of the gaze-tracking system, several users were

asked to participate in the tests.

For the first user, the gaze mapping function calibration was performed when

the user was sitting approximately 330 mm from the camera. After the calibration,

the user was asked to stand up for a while. Then, the user was asked to sit approx-

imately 360 mm from the camera and follow a shining object that was displayed at 12


different pre-specified positions across the screen. The user was asked to reposition

his head to a different position before the shining object moved to the next position.

Figure 3.15 displays the error between the estimated gaze points and the actual

gaze points. The average horizontal error is around 4.41 mm on the screen, which

corresponds to around 0.51° angular accuracy. The average vertical error is around

6.62 mm on the screen, which corresponds to around 0.77° angular accuracy. Also,

it shows that our proposed technique can handle head movements very well.

Figure 3.15: The plot of the estimated gaze points and the true gaze points, where “+” represents the estimated gaze point and “*” represents the actual gaze point

When the user moves his head away from the camera, the eye in the image

will become smaller. Due to the increased pixel measurement error caused by the

lower image resolution, the gaze accuracy of the eye gaze tracker will decrease as

the user moves away from the camera.

In this experiment, the effect of the distance to the camera on the gaze accuracy

of our system is analyzed. A new user was asked to perform the gaze calibration

when he was sitting around 360 mm from the camera. After the calibration, the

user was positioned at five different locations, which have different distances to the

camera as listed in Table 3.5. At each location, the user was asked to follow the


moving object that was displayed at 12 predefined positions across the screen. Table

3.5 lists the gaze estimation accuracy at these five different locations, which shows

that as the user moves away from the camera, the gaze resolution will decrease. But

within this space allowed for the head movement, approximately 200 × 200 × 300

mm (width×height×depth) at 450 mm from the camera, the average horizontal

angular accuracy is around 1.16° and the average vertical angular accuracy is around

1.42°, which is acceptable for most Human Computer Interaction applications. Also,

this space volume allowed for the head movement is large enough for a user to sit

comfortably in front of the camera and communicate with the computer naturally.

Table 3.5: Gaze estimation accuracy under different eye image resolutions

Distance to the camera (mm)    Horizontal accuracy (degrees)    Vertical accuracy (degrees)
300.26                         0.52                             0.61
340.26                         0.68                             0.83
400.05                         1.31                             1.41
462.23                         1.54                             1.90
552.51                         1.73                             2.34

3.6 Comparison of Both Techniques

Two different gaze tracking techniques are discussed in this chapter. In this

section, we briefly summarize their differences. The 2D mapping-based gaze esti-

mation method does not require knowledge of the 3D direction of the eye gaze to

determine the gaze point; instead, it directly estimates the gaze point on the object

from a gaze-mapping function by inputting a set of features extracted from

the eye image. The gaze-mapping function is usually obtained through a calibration

procedure repeated for each person.

The calibrated gaze mapping function is very sensitive to head motion; con-

sequently, a complicated head-motion compensation model is proposed to eliminate

the effect of head motion on the gaze-mapping function. Thus the 2D mapping-

based method can work under natural head movement. Since the 2D mapping-based

method is proposed mainly to estimate the gaze points on a specific object, a new


gaze-mapping function calibration must be performed each time when a new object

is presented.

In contrast, the 3D gaze estimation technique estimates the 3D direction of

the visual axis directly, and determines the gaze by intersecting the visual axis with

the object in the scene. Thus, it can be used to estimate the gaze point on any

object in the scene without the use of tedious gaze-mapping function calibration.

Furthermore, since this method is not constrained by head position, the complicated

head-motion compensation model can be avoided. But the 3D technique needs a

stereo camera system, and the accuracy of the 3D gaze estimation technique is

affected by the accuracy of the stereo camera system.

In terms of accuracy, the experiments indicate that the 2D mapping-based

gaze estimation technique is more accurate than the 3D gaze-tracking technique.

For example, for a user who is sitting approximately 340 mm from the camera,

the 2D mapping-based gaze estimation technique can achieve 0.68° accuracy in the

horizontal direction and 0.83° accuracy in the vertical direction, whereas the

direct 3D gaze estimation technique only achieves 1.14° accuracy in the horizontal

direction and 1.58° accuracy in the vertical direction. Therefore, we can see that

the accuracy of the direct 3D gaze estimation technique can still be improved.

3.6.1 Processing Speed

Both gaze-tracking techniques proposed in this chapter are implemented using

C++ on a PC with a Xeon (TM) 2.80 GHz CPU and 1.00 GB of RAM. The image

resolution of the cameras is 640 × 480 pixels, and the built gaze-tracking systems

can run at approximately 25 fps comfortably.

3.7 Chapter Summary

In this chapter, two different techniques are proposed to improve the existing

gaze-tracking techniques. First, a novel 2D mapping-based gaze estimation tech-

nique is proposed to allow free head movement and simplify the personal calibration

procedure. Therefore, the eye gaze can be estimated with high accuracy under nat-

ural head movement, with the personal calibration being minimized simultaneously.


Second, a simple method is proposed to estimate the 3D gaze direction of the user

without using any user-dependent parameters of the eyeball. Therefore, it is more

feasible to work on different individuals without tedious calibration. By the novel

techniques proposed in this chapter, the two common drawbacks of the existing eye

gaze trackers can be minimized or eliminated nicely so that the eye gaze of the user

can be estimated accurately under natural head movements, with minimum personal

calibration.


CHAPTER 4

Robust Face Tracking Using Case-Based Reasoning with

Confidence

4.1 Introduction

In reality, there are significant variations in the captured face images over time.

Image variations are usually caused by a number of factors such as external lighting

changes, occlusions, facial expressions, head movements, camera view changes, etc.

Due to the lack of an effective tracking framework to cope with the appearance

changes of the face being tracked, tracking faces robustly and accurately in

video sequences still remains a very challenging problem.

Numerous approaches have been proposed to track rigid or non-rigid objects

throughout the video sequences. However, most of them still suffer from the well-

known drifting issue, and are incapable of assessing tracking failures or recovering from

any possible tracking error. For example, the tracking techniques proposed in [7, 60]

keep the object model or template fixed during tracking. Clearly, the template

cannot adapt to significant appearance changes under conditions such as occlusions,

or lighting changes, or more importantly the internal geometry deformations or

appearance changes of the non-rigid objects. Hence, they can only work well when

the appearance of the object does not vary significantly during tracking. Otherwise,

the tracker will drift away and start to track the wrong object instead. An intuitive strategy is

to update the template of the object to account for appearance changes whenever the

object’s appearance varies. However, it is always a challenging task to automatically

provide an accurate object template for updating.

In this chapter, a robust visual object tracking framework based on the Case-

Based Reasoning (CBR) paradigm is proposed to provide an accurate 2D tracking model

for the object being tracked dynamically at each image frame. As a result, the drift-

ing issue that plagues most of the tracking techniques can be solved successfully.

Furthermore, under the CBR paradigm, since the tracked view is always adapted

from its most similar case or image view of the object in a case base, an accurate


similarity measurement can always be obtained to characterize the confidence level

of the tracked region. Therefore, when it starts to track a wrong object, the con-

fidence level associated with the tracked region will become low so that the failure

situations can be detected in time. Based on the proposed CBR visual object track-

ing framework, a real-time face tracking system was built so that the face can be

tracked robustly under significant changes in lighting, scale, facial expression and

head movement.

4.2 Related Works

Numerous techniques have been proposed to solve the drifting issue during

tracking. In this section, only several representative techniques are discussed. In

[72], a template update algorithm is proposed to avoid drifting during tracking by

storing the first template throughout the tracking. However, it only works when

the appearance of the object being tracked does not change significantly from the

stored object template. As a result, when new parts of the object come into view

or the appearance of the object varies significantly during tracking, the proposed

updating strategy fails. In [118], the template at each frame is dynamically updated

by a re-registration technique during tracking. The registration technique matches

the tracking template with a set of collected key-frames and then eliminates any un-

registered pixels from the tracking template. However, the un-registered pixels may

represent the object appearance changes instead of the non-object pixels. Therefore,

once they are eliminated, the object appearance changes cannot be adapted, and it

will be more difficult to track the object in the subsequent frames.

In [39], an adaptive appearance model is proposed to track complex natural

objects based on a subspace learning technique. In this technique, updating the

model is equivalent to finding a subspace representation that can best approximate

a given set of observations from previous frames. However, during subspace updat-

ing, the tracked object image at a new frame is directly integrated into the subspace

learning without any error correction. Therefore, once a tracked object view con-

tains error, which usually consists of the background pixels or non-object pixels, it

may be integrated into the learnt subspace and errors accumulate throughout the


tracking. Finally, it may become a severe issue that leads the tracker to drift away

from the object being tracked once the learnt subspace cannot represent the ob-

ject accurately any more. Therefore, the key issue is how to eliminate or minimize

the error accumulated in each frame before learning. Once it can be eliminated

or minimized, the drifting issue can be solved. Unfortunately, in their subsequent

works [66, 63], efforts were still focused on the online learning strategy, ignoring the

potential errors associated with each tracked image. Therefore, it is believed that it

will still suffer from the drifting issue as demonstrated by experiments discussed in

Section 4.5 of this chapter. Very similar to these methods, the technique proposed in

[50] also could not eliminate the background pixels contained in the tracked object

views before updating the object appearance model. Therefore, drifting is still a

significant issue.

In [92, 75], a modified two-frame tracker is proposed, where multiple past

frames that are closest to the previous frame in pose space are used to refine the

pose information. Hence, each image must be annotated with the pose information.

An important assumption of this method is that pose information governs every-

thing about the image appearance changes of the object being tracked. The pose

information for each image frame must therefore be estimated accurately and the

face region must be correctly segmented. As a result, the tracking performance is

limited by the accuracy of the estimated pose information. Furthermore, it only

works with rigid objects without significant illumination changes or occlusions.

Very similar techniques [110] are proposed to reduce drifting by learning the online

and offline view information of the object. However, a 3D model of the object being

tracked must be built and the object is represented by a 3D mesh. In addition, the

internal camera parameters must be calibrated and fixed during tracking. In our

method, we propose to overcome these limitations so that the tracker works on any

object, whether rigid or non-rigid, and under illumination changes.

In addition, most of the above methods fail to identify failures and cannot

provide confidence levels that measure the goodness of each tracked region (the

probability that the tracked region contains the object). However, via the proposed

CBR visual tracking with confidence framework, each tracked region can be as-


sessed with a score to indicate the confidence level. Furthermore, in our constructed

case base for the proposed CBR framework, since each individual exemplar serves

as a unique case, the complicated image distribution modelling from the collected

training exemplars [107] can be avoided.

In summary, compared to the existing techniques, our proposed technique has the following advantages: (1) it avoids the drifting issue during tracking; (2) it does not need a 3D model; (3) it is not restricted to rigid objects and is capable of tracking any object; (4) it does not need a calibrated camera; (5) matching is performed against real image views rather than synthesized image patches, which is more accurate; and (6) it is able to assess tracking failures with a confidence level.

4.3 The Mathematical Framework

4.3.1 2D Visual Tracking

Assume that an object $O$ is moving in front of a video camera. At time $t$, the object is captured as an image view $I(X_t)$ at position $X_t$ in the image frame $I_t$. The task of 2D visual tracking is then to search for the image view $I(X_t)$ of the object $O$ in each image frame $I_t$.

The most straightforward way is to place the object appearance model $I_M^t$ at all possible positions in the image frame $I_t$ and find the position that produces the best matching result. Assuming the located image view is $I(X'_t)$, the tracking error $\Delta I_0^t$ can be represented as the difference between the true image view $I(X_t)$ and the located image view $I(X'_t)$ of the object:

$$\Delta I_0^t = I(X_t) - I(X'_t) \qquad (4.1)$$

Given the object model $I_M^t$, if we assume that its most similar image view can be successfully located in the image frame $I_t$, then the tracking error $\Delta I_0^t$ is mostly caused by the inaccuracy of the utilized object model $I_M^t$.

If the tracking error $\Delta I_0^t$ persists at each image frame during tracking, it will accumulate and eventually force the tracker to drift away from the object. This well-known drifting issue arises because an accurate tracking model of the object is not available for each image frame during tracking. Therefore, a key issue for a successful visual tracking technique is how to obtain an accurate 2D tracking model $I_M^t$ of the object at each image frame $I_t$ during tracking.

4.3.2 The Proposed Solution

In this section, an algorithm is proposed to maintain an accurate tracking model for each image frame so that the tracking error $\Delta I_0$ can always be minimized during tracking. Figure 4.1 illustrates the major steps of the proposed algorithm.

Figure 4.1: The diagram of the proposed algorithm to improve the accuracy of the 2D object tracking model.

Assume that the image appearance varies gradually between two consecutive image frames $I_t$ and $I_{t+1}$, and that the object image $I(X_t)$ in the image frame $I_t$ has already been located. As shown in Figure 4.1, the first step is to locate the object in the image frame $I_{t+1}$ using the tracked 2D view $I(X_t)$. Let $I(X'_{t+1})$ be the located object view in image frame $I_{t+1}$. In practice, due to the time elapsed between two consecutive image frames, the generated image views of the object usually appear quite different because of camera view changes, illumination changes, object deformations, etc. As a result, the located image $I(X'_{t+1})$ is usually not accurate, because the image $I(X_t)$ is no longer an accurate model for the image frame $I_{t+1}$ due to the appearance changes. However, the located view $I(X'_{t+1})$ of the object usually contains partial or complete information about the object appearance at the current time $t+1$. Therefore, the located image view $I(X'_{t+1})$ represents an important information source that can be utilized to infer the true view of the object in the image frame $I_{t+1}$.

The second step is to infer the object appearance model $I_M^{t+1}$ from the located image view $I(X'_{t+1})$. Intuitively, the object model $I_M^{t+1}$ should best match the located image view $I(X'_{t+1})$. In theory, if all the possible 2D image views of the object were available in advance, then the object model $I_M^{t+1}$ could be found by matching against the located image view $I(X'_{t+1})$. Therefore, the goal is to search the object view space $I_M$ for the specific view of the object that minimizes the error with respect to the located view $I(X'_{t+1})$:

$$I_M^{t+1} = I_{M_{k^*}} = \arg\min_k \left[ I_{M_k} - I(X'_{t+1}) \right]^2 \qquad (4.2)$$

The obtained object model $I_M^{t+1}$ adapts to the appearance changes so that it represents the appearance of the object in the image frame $I_{t+1}$ more closely. Therefore, the object model $I_M^{t+1}$ can be utilized to locate the object in the image frame $I_{t+1}$.
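As an illustration only, the following Python sketch shows how the best-matching case could be selected from a small case base by minimizing the squared error of Eq. (4.2); the array names, the fixed patch size, and the random data are assumptions made for the sketch, not part of the thesis implementation.

```python
import numpy as np

def select_case(located_view, case_base):
    """Return the case view that best matches the located view (Eq. 4.2).

    located_view : 2D numpy array, the view I(X'_{t+1}) located in frame t+1
    case_base    : list of 2D numpy arrays, the representative views I_{M_k}
    """
    errors = [np.sum((case.astype(float) - located_view.astype(float)) ** 2)
              for case in case_base]
    best_k = int(np.argmin(errors))          # k* in Eq. (4.2)
    return case_base[best_k], best_k

# Hypothetical usage: 120 random 24x24 "face views" as the case base.
rng = np.random.default_rng(0)
case_base = [rng.random((24, 24)) for _ in range(120)]
located_view = case_base[7] + 0.05 * rng.random((24, 24))   # a noisy observation
model, k = select_case(located_view, case_base)
```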

To do this, a set of 2D image views of the object must be available in advance. In practice, however, it is infeasible to enumerate all possible 2D views of the object. A feasible solution is to collect only a set of representative 2D views of the object, from which any unseen image view can be adapted with some image adaptation strategy.

Fortunately, the above solution can be naturally interpreted and implemented in a Case-Based Reasoning paradigm. As shown in Figure 4.1, based on the located image view $I(X'_{t+1})$, an object model $I_{M'}^{t+1}$ is selected from a constructed case base. It is then used to locate the object in the image frame with a proposed adaptation mechanism. Via the proposed CBR visual tracking framework, a more accurate 2D tracking model $I_M^t$ can be obtained at each image frame $I_t$; therefore, the drifting issue that accompanies most visual tracking techniques can be alleviated.

4.4 The CBR Visual Tracking Algorithm

4.4.1 Case-Based Reasoning

Case-Based Reasoning (CBR) is a problem solving and learning approach that has grown into a field of widespread interest in both academia and industry since the 1990s [4]. It is based on the view that a significant portion of human cognition and problem solving involves recalling entire prior experiences, or cases, rather than just pieces of knowledge at the granularity of a production rule [96]. The basic assumption of CBR is that individuals have numerous experiences indexed in their memory that can be used in new situations. When asked to solve a problem, humans typically search their memory for past experiences that

can be reapplied in this new situation. In short, CBR is a model of reasoning that

incorporates problem solving, understanding, and learning with memory processes.

The processes that make up CBR can be seen as a reflection of a particular type of

human reasoning. In many situations, the problems that human beings encounter

are solved with a human equivalent of CBR. Detailed information about CBR can

be found in [62].

From a CBR perspective, the problem of visual object tracking in a video

sequence can be solved similarly by retrieving and adapting the previously seen

views of the object to a new view of the object at each image frame. Specifically,

how the Case-Based Reasoning paradigm can be used to build a face tracking system is demonstrated as follows.

Figure 4.2 illustrates the general CBR cycle of the built face tracking system. As shown in Figure 4.2, the whole CBR face tracking system is made up of four processes (a minimal sketch of this cycle in code follows the list):

1. RETRIEVE the most similar face images from a case base composed of the

collected representative face images.

2. REUSE the information and knowledge in the retrieved face image to refine

the previously located face image by adapting to the face appearance changes.

3. REVISE the proposed solution, which is to evaluate the located face image

and give a confidence measure.

4. RETAIN a new solution, which is to add the new tracked face image into the

case base as a new case if necessary.
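For concreteness, a minimal, illustrative Python sketch of this four-step cycle for one frame is given below. The helper callables `locate` and `adapt`, the stand-in similarity measure, and the 0.5 retention threshold are assumptions made for the sketch and stand in for the components described in the following sections.

```python
import numpy as np

def similarity(a, b):
    """Normalized cross-correlation as a stand-in similarity measure."""
    a, b = a.ravel() - a.mean(), b.ravel() - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def cbr_track_frame(locate, adapt, frame, prev_view, case_base, temp_db, thr=0.5):
    """One iteration of the CBR tracking cycle (RETRIEVE / REUSE / REVISE / RETAIN)."""
    probe = locate(frame, prev_view)                             # located view I(X'_{t+1})
    case = max(case_base, key=lambda c: similarity(c, probe))    # RETRIEVE the closest case
    tracked = adapt(frame, probe, case)                          # REUSE: adapt the case to the frame
    confidence = max(similarity(c, tracked) for c in case_base)  # REVISE: confidence score
    if confidence < thr:                                         # RETAIN: store weak views for review
        temp_db.append(tracked)
    return tracked, confidence
```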

In the following sections, each process of the CBR face tracking framework

will be discussed in detail.


Figure 4.2: The Case-Based Reasoning cycle of the face tracking system.

4.4.2 Case Base Construction

In a CBR system, the case base is the set of all the cases that are used by

the reasoner. It provides suggestions of solutions to problems for understanding or

assessing a situation. In the proposed CBR visual tracking framework, an appropri-

ate case base should consist of a set of views of the object as the prior knowledge or

historical cases for efficient tracking. Especially, each case needs to be representative

and evenly distributed in the 2D view space of the object.

To construct such a case base, a set of representative 2D object views is initially put into the case base. The case base is then enriched incrementally through a training process: during the training period, every time a tracking failure occurs due to the lack of similar cases in the case base, the corresponding new image view is added to the case base. The training process seeks a balance between tracking failures caused by missing cases and the size of the case base. After training, the case base continues to be updated in the retaining step, which will be discussed later.

One significant advantage of building such a case base of 2D image views is that it avoids constructing an explicit object model directly, which usually requires complicated image distribution modelling or tedious hand-construction of an object model. Instead, CBR provides a feasible way to represent such a complicated object model simply by collecting all the representative 2D views of the object in a case base. As a result, the object model $I_M$ can simply be selected from a set of $m$ discrete 2D views $I_2^1, \cdots, I_2^m$.

Once the face case base is constructed, the face in a video sequence can be

tracked efficiently as described below.

4.4.3 Case Retrieving

The proposed framework starts by searching the current frame $I_{t+1}$ for the 2D view $I(X'_{t+1})$ of the object that is most similar to its tracked 2D view $I(X_t)$ in the previous frame $I_t$. Once the view $I(X'_{t+1})$ is located, the next step is to search the case base for its most similar case $I_{M'}^{t+1}$. This is done by comparing all the collected cases in the case base one by one and choosing the one that produces the maximum similarity with the view $I(X'_{t+1})$ in the Gabor space.

Object Appearance Representation. Besides its raw pixel intensities, there are

many other representations of an image appearance that one could learn for robust

object matching, such as color statistics [12], multi-scale filter responses [50], etc.

Since Gabor Wavelets can represent local features effectively, we applied a set of

multi-scale and multi-orientation Gabor wavelet filters on the image appearance I

of an object. Given a pixel p ∈ I, its Gabor response f(p) is represented by a

vector composed of the computed Gabor filter coefficients. Since f(p) governs the

information of an entire neighborhood of p, there is no need to compute the Gabor

responses for all the pixels in I. Therefore, the image appearance of an object

is represented by an ordered collection ω of the Gabor responses at n uniformly

sampled locations p1, ..., pn with a spacing:

ω = f(p1), · · · , f(pn) (4.3)

Object Searching. Assume that a pixel $p_t$ at frame $t$ corresponds to the image location $p_{t+1} = \varphi(p_t, \pi)$ at frame $t+1$, where $\varphi(p_t, \pi)$ is a coordinate transformation function with parameter vector $\pi$. Object searching is then conducted to find the parameter vector $\pi$ that gives the optimal $\omega$ with respect to the similarity measured in the Gabor space. This is equivalent to maximizing the sum of a set of local similarity measures over $\omega$:

$$\pi^* = \arg\max_{\pi} \sum_{i=1}^{n} S_i\big(f(p_i), f(\varphi(p_i, \pi))\big) \qquad (4.4)$$

where $S_i\big(f(p_i), f(\varphi(p_i, \pi))\big)$ is the similarity measure between the Gabor response vectors $f(p_i)$ and $f(\varphi(p_i, \pi))$. In addition, the final similarity $S'_{t+1}$ of the tracked image view $I(X'_{t+1})$ is characterized by the average of its local similarity measures, $S'_{t+1} = \frac{1}{n}\sum_{i=1}^{n} S_i$.
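The sketch below illustrates Eq. (4.4) for the simplest case where the transformation $\varphi$ is a pure 2D translation and the local similarity is a normalized dot product between Gabor response vectors; the grid spacing, search radius, and the `gabor_response` callable are assumptions made for the illustration.

```python
import numpy as np

def search_translation(gabor_response, model_jets, grid, search_radius=10):
    """Find the 2D translation pi* maximizing the summed local similarity (Eq. 4.4).

    gabor_response : callable (x, y) -> 1D response vector f(p) in the new frame
    model_jets     : list of response vectors f(p_i) of the tracked view
    grid           : list of (x, y) sample locations p_i in the previous frame
    """
    def sim(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    best_pi, best_score = (0, 0), -np.inf
    for dx in range(-search_radius, search_radius + 1):
        for dy in range(-search_radius, search_radius + 1):
            score = sum(sim(f, gabor_response(x + dx, y + dy))
                        for f, (x, y) in zip(model_jets, grid))
            if score > best_score:
                best_pi, best_score = (dx, dy), score
    # The average local similarity serves as the final similarity S'_{t+1}.
    return best_pi, best_score / len(grid)
```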

4.4.4 Case Adaptation (Reusing)

After the case view $I_{M'}^{t+1}$ most similar to the tracked view $I(X'_{t+1})$ is retrieved from the case base, a two-step procedure is proposed to adapt it to the image frame $I_{t+1}$ and obtain the final face view.

In the first step, the selected image view $I_{M'}^{t+1}$ is utilized as a tracking model to search the image frame $I_{t+1}$, starting at $X'_{t+1}$, for its most similar image view. Assume that an image view $I(X''_{t+1})$ is located at $X''_{t+1}$ with a similarity measurement $S''_{t+1}$.

Subsequently, the second step combines the tracked image views $I(X'_{t+1})$ and $I(X''_{t+1})$ to obtain the final image view $I(X_{t+1})$ as follows:

$$X_{t+1} = \frac{S'_{t+1}}{S'_{t+1} + S''_{t+1}}\, X'_{t+1} + \frac{S''_{t+1}}{S'_{t+1} + S''_{t+1}}\, X''_{t+1} \qquad (4.5)$$

The above formula shows that the tracked target view is the combined result

of the tracked view in the previous frame and the selected case view in the case

base. Intuitively, it can be seen as a minimization of the sum of two errors, the error

between the target view in the current frame and the tracked view in the previous

frame, as well as the error between the target view in the current frame and the

selected similar case view. In other words, the tracked target view possesses one

important property: it must be similar to the tracked view in the previous frame as


well as the selected case view in the case base. Only when the tracked view satisfies this property can the tracking succeed.
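As an illustration of the adaptation step, the sketch below combines the two located positions using their similarity scores as in Eq. (4.5); the `search` callable standing in for the Gabor-space search of Section 4.4.3 is an assumption of the sketch.

```python
import numpy as np

def adapt_case(search, frame, x_prev_located, case_view):
    """Two-step case adaptation (Section 4.4.4).

    search : callable (frame, model, start) -> (position, similarity),
             a stand-in for the Gabor-space search of Section 4.4.3.
    """
    # Step 1: search again, using the retrieved case view as the tracking model.
    x1, s1 = x_prev_located                      # (X'_{t+1}, S'_{t+1}) from retrieval
    x2, s2 = search(frame, case_view, start=x1)  # (X''_{t+1}, S''_{t+1})
    # Step 2: similarity-weighted combination of the two locations (Eq. 4.5).
    w1, w2 = s1 / (s1 + s2), s2 / (s1 + s2)
    x_final = w1 * np.asarray(x1, float) + w2 * np.asarray(x2, float)
    return x_final
```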

4.4.5 Case Revising and Retaining

Case revising is the phase that evaluates the tracking result to produce a confidence measure. Once the final image view $I(X_{t+1})$ is obtained, another search is conducted in the case base to find its most similar case, and a similarity score $S_{t+1}$ is derived from this search. In practice, if the similarity score is high, tracking is usually successful; otherwise, it may have failed. Therefore, the derived similarity score $S_{t+1}$ is simply utilized as a confidence measure to characterize the final image view $I(X_{t+1})$. The system automatically reports the confidence measure and stores the image views with low confidence levels in a temporary case database, which is retained for further use.

The retaining process enriches the case base by adding new representative

image views. To retain a new case, the image views in the temporary database have

to be reviewed periodically, so that only the useful image views are selected and

added into the case base. Using this process, it is believed that the case base can

include more and more representative 2D views of the objects.
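One possible way to implement this periodic review is sketched below: a stored low-confidence view is promoted to a new case only if no existing case already matches it well. The 0.6 novelty threshold and the stand-in similarity function are assumptions of the sketch, not values from the thesis.

```python
import numpy as np

def similarity(a, b):
    """Normalized correlation between two equally sized image views."""
    a, b = a.ravel() - a.mean(), b.ravel() - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retain_new_cases(case_base, temp_db, novelty_thr=0.6):
    """Periodically review the temporary database and retain only novel views."""
    for view in temp_db:
        best = max(similarity(c, view) for c in case_base)
        if best < novelty_thr:          # not well represented yet: add as a new case
            case_base.append(view)
    temp_db.clear()                     # temporary database reviewed; start afresh
    return case_base
```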

4.5 Experiment Results

Based on the proposed CBR visual tracking framework, a real-time face track-

ing system was built. When a person appears in the view of the camera, the per-

son’s face is automatically localized via the proposed frontal face detector [111] and

tracked subsequently by our proposed face tracker. Specifically, in the built face

tracking system, there are totally 120 face images in the current constructed case

base, which will grow with the use of the tracker. These face images were collected

from different subjects under various face orientations, facial expressions, and illu-

mination changes. Although the current case base is small, experiments indicated

that it is good enough to track most people’s faces robustly. To test the perfor-

mance of the built face tracking system, a set of face video sequences with several


new subjects were collected.

Our experiments first focused on demonstrating two significant advantages of the proposed CBR tracking framework: its drifting-elimination and confidence-assessment

capabilities. Subsequently, experiments were conducted to demonstrate that the face

tracker works well with different individuals, under different external illuminations,

different facial expressions, and various face orientations.

4.5.1 Drifting-Elimination Capability

In this experiment, the self-correction or drifting-elimination capability of the

proposed tracking scheme is demonstrated. A face sequence was collected under

significant head movements and facial expressions as shown in Figure 4.3. It is 20

seconds long and was recorded at 30 fps at a gray-scale image resolution of 320×240 pixels.

Figure 4.3: Comparison of the face tracking results with different techniques. The first row shows the tracking results of the incremental subspace learning technique, where the tracked face is marked by a red rectangle; the second row shows the tracking results of our proposed technique, where the tracked face is marked by a dark square. The images are frames 26, 126, 352, 415 and 508 from left to right.

Three other popular tracking techniques were applied to track the face in

the video sequence as well. The first technique is the traditional two-frame-based

tracking method [72], which utilizes the previously tracked face image to update the

tracking model dynamically. The second technique is the offline tracking method,

which utilizes the selected face image most similar to the previously tracked face


view from the case base to update the tracking model dynamically. This method is

very similar to our proposed CBR tracking method, except that it does not adapt

the selected face image to the current image. The last one is the tracking technique

with incremental subspace learning [66], which dynamically updates the object model by incrementally learning its eigen-basis from a set of previously tracked faces.

In order to measure tracking performance, the face center at each frame in

this video sequence was manually identified, and served as the ground truth. From

Figure 4.4 (a), it is obvious that the tracking error of the two-frame tracker accumu-

lates as the tracking continues. Eventually it drifts away from the face and tracks

the wrong object. In addition, as shown in Figure 4.4 (b), by using only the offline face information without any adaptation to the current view of the face, the offline tracking technique is very unstable and inaccurate. Furthermore, Figure 4.4 (c) shows that

the tracking technique with the incremental subspace learning also drifts and tracks

the wrong object eventually. One possible reason is that the tracked face views contain errors or non-face image pixels, which are learned and accumulated in the model throughout the video sequence and eventually lead to drifting. However,

our proposed tracking method can eliminate the tracking error in each frame grad-

ually during tracking, while still tracking the face robustly. Figure 4.3 clearly shows

that the proposed tracker outperforms the incremental subspace learning tracker.

Figure 4.4: Comparisons of the tracked face position error: (a) between the proposed tracker and the two-frame tracker; (b) between the proposed tracker and the offline tracker; (c) between the proposed tracker and the incremental subspace learning tracker.


4.5.2 Confidence-Assessment Capability

When both the two-frame tracker and the subspace learning tracker fail to

track the face correctly in the above sequence, they cannot automatically detect

failures and simply continue tracking. Figure 4.5 (a) shows the similarities computed by the two-frame tracker. It illustrates that even when the accumulated tracking error is severe enough that the tracker starts losing the face, the computed similarity remains high, falsely indicating that the face is still being tracked successfully.

The incremental subspace learning tracking technique has the same problem.

As shown in Figure 4.5 (b), when the drifting becomes severe, the tracker continues

to show that the RMSE error associated with the tracked face view is decreasing

instead. On the other hand, since our proposed method still tracks the face successfully, the computed confidence level maintains high scores, as shown in Figure 4.5 (c). In practice, we found that when the confidence scores are

higher than 0.5, the tracking is usually successful.

Figure 4.5: (a) The similarities computed by the two-frame tracker; (b) the RMSE errors computed by the incremental subspace learning tracker; (c) the confidence scores computed by our proposed tracker.

In addition, when the tracker fails to track the face, our proposed technique

can indicate the failures through confidence measurements. In order to demonstrate

this capability, another face image sequence was collected under significant face

rotations, facial expressions, and occlusions. It contains around 600 image frames

and corresponds to about 20 seconds of video.

Some selected image frames with tracked faces are illustrated in Figure 4.6. An

occlusion by hand happens approximately from frame 205 to frame 238, starting from


partial occlusion to full occlusion and then back to partial occlusion. Figure 4.7 plots

the computed confidence measurements for the tracked faces in this sequence. It

clearly shows that the computed confidence measurements can reflect the occlusion

exactly, with the lowest confidence score being estimated when the face is fully

occluded by the hand. On the other hand, as shown in Figure 4.7, the proposed

tracker can provide high confidence level when the face is under either significant

facial expressions or large face orientations without occlusion.

Figure 4.6: The face tracking results with significant facial expression changes, large head movements and occlusion. For each frame, the tracked face is marked by a dark square. The upper row displays the image frames 29, 193, 211, 236 from left to right, while the lower row displays the image frames 237, 238, 412 and 444 from left to right.

4.5.3 Performance under Illumination Changes

Our proposed tracking technique can perform well not only under large head

movements and significant facial expressions, as illustrated above, but also under

significant external illumination changes. To demonstrate, a face video sequence was

recorded under an environment with drastic lighting variations. As shown in Figure

4.8, the appearance of the person’s face changes drastically due to the external

lighting changes, which makes the tracking task extremely challenging. However,

our proposed tracking technique can still track the face very successfully.


Figure 4.7: The estimated confidence measures.

Figure 4.8: Face tracking results under significant changes in illumination and head movement. The tracked face is marked by a dark square in each image. From left to right, the selected image frames are 75, 246, 295, 444 in the first row, while the second row displays the image frames 662, 706, 745 and 898.

4.5.4 Processing Speed

The proposed CBR face tracking technique is implemented in C++ on a PC with a Xeon (TM) 2.80 GHz CPU and 1.00 GB of RAM. The resolution of the captured images is 320 × 240 pixels, and the built face tracking system runs comfortably at approximately 26 fps.


4.6 Chapter Summary

In this chapter, a visual tracking framework based on the CBR paradigm with a confidence level is introduced to track faces in a video sequence. Under the proposed framework, the face can be tracked robustly under significant appearance changes without drifting. In addition, a confidence measurement can be derived accurately for each tracked face view so that failures can be assessed success-

fully. Since the proposed algorithm does not involve any complicated probabilistic

appearance modelling or non-linear optimization, it is very simple and computation-

ally inexpensive. Most importantly, it provides an effective framework for addressing the drifting issue that has long plagued the face tracking community.

Finally, such a tracking framework can be easily generalized to track other objects

by only updating its case base.

CHAPTER 5

Real-Time Facial Feature Tracking

5.1 Introduction

Facial features, such as eyes, eyebrows, nose and mouth, as well as their spatial

arrangement, are important for such facial interpretation tasks as face recognition

[117], facial expression analysis [128] and face animation [116]. Therefore, locating

these facial features in a face image accurately is crucial for these tasks to perform

well. Various techniques [10, 108, 94, 38, 117, 14] have been proposed to detect

and track facial features in face images. Generally, two types of information are

commonly utilized by these techniques. One is the image appearance of the facial features, which is referred to as texture information; the other is the spatial relationship among different facial features, which is referred to as shape information.

In [94], a neural network is trained individually as a feature detector for each

facial feature, e.g. as an eye detector. Facial features are then located by searching

the face image via the trained facial feature detectors. Similarly, instead of using

neural networks, Gabor wavelet networks are trained to locate the facial features in

[108]. Since the shape information of the facial features is not modelled explicitly

in both techniques, they are prone to image noise. Therefore, in [10], a statistical

shape model is built to capture the spatial relationships among facial features, and

multi-scale and multi-orientation Gaussian derivative filters are employed to model

the texture of facial features. However, only the shape information is used when

comparing two possible feature point configurations, ignoring the local measure-

ment for each facial feature. Such a method may not be robust in the presence of

image noise. In [117], facial features are represented with the Gabor Jets and the

spatial distributions of facial features are captured with a graph structure implic-

itly. Via the graph structure, only the simple spatial information among the facial

features is imposed, and its variation is not modelled directly. Since most of these techniques assume frontal facial views, no significant facial expressions, or constant illumination, good performance has been reported under those conditions.


However, in reality, the image appearance of the facial features varies signifi-

cantly among different individuals. Even for a specific person, the appearance of the

facial features is easily affected by the lighting conditions, face orientations and facial

expressions. Therefore, robust facial feature tracking still remains a very challenging

task, especially under variable illumination, face orientation and facial expression.

In this chapter, techniques are proposed to improve the robustness and accuracy

of the existing facial feature trackers such that facial features can be detected and

tracked under the above challenging situations.

In chapter 4, a CBR-based visual tracking framework is introduced to stably

detect and track the face under significant changes in lighting, facial expression and

face orientation. Therefore, the position and motion of the tracked face can provide

strong and reliable geometric constraints over the locations of other facial features.

In addition, Kalman filtering [13] is a useful means for object tracking. It can impose

a smooth constraint on the motion of each facial feature. For the trajectory of each

facial feature, this constraint removes random jumping due to the uncertainty in

the image. Therefore, in this chapter, the Kalman filtering and the face motion are

combined together to predict the position for each facial feature in a new image.

By doing so, we not only obtain a smooth trajectory for each feature, but also

catch rapid head motion. Given the predicted feature positions, the multi-scale and

multi-orientation Gabor wavelet matching method [64, 117, 129] is used to detect

each facial feature in the vicinity of the predicted locations. The robust matching in

the Gabor space provides an accurate and fast solution for tracking multiple facial

features simultaneously.

It is important for us not only to track each single facial feature, but also

to capture their spatial relationships. Gabor wavelet matching is used to identify

each facial feature in the tracking initialization; the Gabor wavelet coefficients are

updated at each frame to adaptively compensate for the facial feature appearance

changes. These updated coefficients are used as the template to match the facial

feature in the coming image frame. This updating approach works very well when

no occlusion or self-occlusion happens. However, under free head motion, in which the head can turn from the frontal view to a side view or vice versa, self-occlusion often causes the tracker to fail because a random or arbitrary profile is assigned to the occluded feature. In this chapter, a shape-constrained correction mechanism

is developed to tackle this problem and to refine the tracking results.

Figure 5.1 shows the flowchart of the proposed facial feature detection and

tracking algorithm. After the face image is captured from the camera, the frontal

face is first located via the technique proposed in chapter 4. Based on the detected

frontal face region, a trained face mesh is employed to estimate a rough position for

each facial feature. Subsequently, a refinement technique based on Gabor wavelet

matching is proposed to search for an accurate position around the roughly estimated

position for each facial feature. Once the facial features are located successfully, a

correction-based tracking mechanism is activated to track them in the subsequent

image frames. In the following sections, each component will be described briefly.

Figure 5.1: The flowchart of the proposed facial feature detection and tracking algorithm.

5.2 Facial Feature Representation

Gabor wavelets are biologically motivated convolution kernels in the shape of

plane waves restricted by a Gaussian envelope function [16]. So far, many research

groups have reported that with the use of Gabor wavelets, good performance can be


achieved in face recognition [117], facial expression recognition [129], facial feature

detection [117], etc.

In this chapter, a set of multi-scale and multi-orientation Gabor wavelets are

applied to each facial feature so that each facial feature can be represented by a

set of filter responses. Specifically, for an image patch $I(\vec{x})$ around a given pixel $\vec{x} = (x, y)^T$, the utilized 2D Gabor wavelet kernels are expressed as follows:

$$\Psi_j(\vec{x}) = \frac{\|\vec{k}_j\|^2}{\sigma^2} \exp\!\left(-\frac{\|\vec{k}_j\|^2 \|\vec{x}\|^2}{2\sigma^2}\right) \left[\exp\!\big(i\,\vec{k}_j \cdot \vec{x}\big) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right] \qquad (5.1)$$

where $\sigma = 2\pi$, and the wave vector $\vec{k}_j$ is represented as

$$\vec{k}_j = \begin{pmatrix} k_{jx} \\ k_{jy} \end{pmatrix} = \begin{pmatrix} k_\nu \cos\phi_\mu \\ k_\nu \sin\phi_\mu \end{pmatrix} \qquad (5.2)$$

The wave vector $\vec{k}_j$ controls the orientation and the scale of the wavelets, with index $j = 6\nu + \mu$ ranging from 0 to 59. Specifically, the set of Gabor wavelets consists of 10 spatial frequencies and 6 distinct orientations given by $\nu = 0, \cdots, 9$ and $\mu = 0, \cdots, 5$, with $k_\nu = 2^{-\frac{\nu}{2}}\pi$ and $\phi_\mu = \frac{\mu\pi}{6}$.

Therefore, for each pixel $\vec{x}$, a Gabor coefficient vector $J(\vec{x})$ is derived as the set $J_j(\vec{x})$ of 60 convolution results with the above 60 Gabor kernels, with each convolution defined as:

$$J_j(\vec{x}) = \int I(\vec{x}\,') \, \Psi_j(\vec{x} - \vec{x}\,') \, d\vec{x}\,' = m_j e^{i\phi_j}, \qquad j = 0, 1, \ldots, 59$$

where $I(\vec{x}\,')$ is the image grey level distribution, and $m_j$ and $\phi_j$ are the magnitude and phase of the computed complex Gabor coefficient. This Gabor coefficient vector $J(\vec{x})$ can be used to represent the image pixel $\vec{x}$ and its vicinity [64] efficiently. In our proposed algorithm, the Gabor coefficient vector $J(\vec{x})$ is not only used to detect each facial feature at the initial frame, but is also used during tracking.
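To make the kernel parameterization concrete, the following sketch builds the 60 Gabor kernels of Eqs. (5.1) and (5.2) on a small discrete grid; the 33×33 kernel support is an assumption chosen only for the example.

```python
import numpy as np

def gabor_kernels(num_freq=10, num_orient=6, size=33, sigma=2 * np.pi):
    """Build the multi-scale, multi-orientation Gabor kernels of Eqs. (5.1)-(5.2)."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    kernels = []
    for nu in range(num_freq):                     # spatial frequency index
        k = 2.0 ** (-nu / 2.0) * np.pi             # k_nu
        for mu in range(num_orient):               # orientation index
            phi = mu * np.pi / num_orient          # phi_mu
            kx, ky = k * np.cos(phi), k * np.sin(phi)
            envelope = (k * k / sigma**2) * np.exp(-(k * k) * (xs**2 + ys**2)
                                                   / (2 * sigma**2))
            wave = np.exp(1j * (kx * xs + ky * ys)) - np.exp(-sigma**2 / 2)
            kernels.append(envelope * wave)        # Psi_j, with j = 6*nu + mu
    return kernels                                 # 60 complex-valued kernels

kernels = gabor_kernels()
assert len(kernels) == 60
```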

However, it is too expensive to compute all the kernel convolutions in real

time. In the following, a pyramidal representation is employed to speed up the

convolution computations dramatically.


5.2.1 Pyramidal Gabor Wavelets

Given a face image, an image pyramid can be generated to depict it at decreasing resolutions. To create the image pyramid, each level $I_L$ of the pyramid is generated from its next lower level $I_{L-1}$ as follows:

$$I_L(x, y) = \tfrac{1}{4} I_{L-1}(2x, 2y) + \tfrac{1}{4} I_{L-1}(2x+1, 2y) + \tfrac{1}{4} I_{L-1}(2x, 2y+1) + \tfrac{1}{4} I_{L-1}(2x+1, 2y+1) \qquad (5.3)$$

where $L$ represents the pyramid level and $I(x, y)$ represents the intensity of the image pixel at the coordinates $(x, y)$. In other words, the image $I_{L+1}$ at level $L+1$ is created by shrinking the image $I_L$ at level $L$ by half. The base level $I_0$ is the image at the original resolution, which has the highest resolution. In our proposed algorithm, an image pyramid with three levels above the base is generated to represent each face image. Figure 5.2 shows a generated image pyramid, whose base level image is shown in Figure 5.2 (a).

Figure 5.2: An image pyramid with three levels above the base: (a) the base level contains a 320×240 pixel image; (b) the first level contains a 160×120 pixel image; (c) the second level contains an 80×60 pixel image; (d) the third level contains a 40×30 pixel image.
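A minimal sketch of the pyramid construction of Eq. (5.3) is given below; the random 320×240 test image is only a placeholder.

```python
import numpy as np

def build_pyramid(image, levels=3):
    """Build an image pyramid by repeated 2x2 averaging (Eq. 5.3)."""
    pyramid = [image.astype(float)]
    for _ in range(levels):
        prev = pyramid[-1]
        h, w = prev.shape[0] // 2, prev.shape[1] // 2
        nxt = 0.25 * (prev[0:2*h:2, 0:2*w:2] + prev[1:2*h:2, 0:2*w:2] +
                      prev[0:2*h:2, 1:2*w:2] + prev[1:2*h:2, 1:2*w:2])
        pyramid.append(nxt)
    return pyramid                       # [base, level 1, level 2, level 3]

pyr = build_pyramid(np.random.default_rng(0).random((240, 320)))
print([p.shape for p in pyr])            # (240, 320), (120, 160), (60, 80), (30, 40)
```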

At the same time, the Gabor kernels with low frequencies need to be shrunk correspondingly. This is done easily by re-formulating the wave vectors of the Gabor kernels as follows:

$$\vec{k}_j = \begin{pmatrix} k_\nu \cos\phi_\mu \\ k_\nu \sin\phi_\mu \end{pmatrix} \qquad (5.4)$$

where $k_\nu = 2^{(-\frac{\nu}{2} + L)}\pi$ with $L = \mathrm{quotient}(\nu, 3)$.

Therefore, the kernels with frequency indices $\nu = 0, 1, 2$ perform their convolutions on the base level of the image pyramid. The kernels with frequency indices $\nu = 3, 4, 5$ perform their convolutions on the first level of the image pyramid, shrinking their kernel size by half. The kernels with frequency indices $\nu = 6, 7, 8$ perform their convolutions on the second level of the image pyramid, shrinking their kernel size by half twice. Finally, the kernel with the lowest frequency performs its convolutions on the third level of the image pyramid, shrinking the kernel size by half three times. Via the proposed pyramidal Gabor wavelet representation, the computation of the kernel convolutions becomes much faster.

5.2.2 Fast Phase-Based Displacement Estimation

Given a facial feature in two consecutive image frames, a Gabor coefficient vector $J(\vec{x})$ is extracted at the position $\vec{x}$ in the first image, while another Gabor coefficient vector $J(\vec{y})$ is extracted at a different position $\vec{y} = \vec{x} + \vec{d}$ in the subsequent image frame, with the displacement $\vec{d} = (d_x, d_y)^T$. Since the position $\vec{x}$ is in a small vicinity of the position $\vec{y}$, the phase shift between the Gabor coefficient vectors $J(\vec{x})$ and $J(\vec{y})$ can be approximately compensated for by the term $\vec{d} \cdot \vec{k}_j$. So the phase-sensitive similarity function between these two Gabor coefficient vectors can be expressed as:

$$S(J(\vec{x}), J(\vec{y})) = \frac{\sum_j m_j m'_j \cos(\phi_j - \phi'_j - \vec{d} \cdot \vec{k}_j)}{\sqrt{\sum_j m_j^2 \, \sum_j m_j'^2}} \qquad (5.5)$$

where each component of the Gabor coefficient vector $J(\vec{x})$ is represented as $J_j(\vec{x}) = m_j e^{i\phi_j}$, and each component of the Gabor coefficient vector $J(\vec{y})$ is represented as $J_j(\vec{y}) = m'_j e^{i\phi'_j}$. To compute it, the displacement $\vec{d}$ must be estimated. This can be done by maximizing the similarity in its second-order Taylor expansion:

$$S(J(\vec{x}), J(\vec{y})) \approx \frac{\sum_j m_j m'_j \left[ 1 - 0.5\,(\phi_j - \phi'_j - \vec{d} \cdot \vec{k}_j)^2 \right]}{\sqrt{\sum_j m_j^2 \, \sum_j m_j'^2}} \qquad (5.6)$$

Therefore, by setting $\frac{\partial S}{\partial d_x} = 0$ and $\frac{\partial S}{\partial d_y} = 0$, the optimal displacement vector $\vec{d}$ can be estimated as follows:

$$\vec{d} = \frac{1}{\Gamma_{xx}\Gamma_{yy} - \Gamma_{xy}\Gamma_{yx}} \begin{pmatrix} \Gamma_{yy} & -\Gamma_{yx} \\ -\Gamma_{xy} & \Gamma_{xx} \end{pmatrix} \begin{pmatrix} \theta_x \\ \theta_y \end{pmatrix} \qquad (5.7)$$

if $\Gamma_{xx}\Gamma_{yy} - \Gamma_{xy}\Gamma_{yx} \neq 0$, with

$$\theta_x = \sum_j m_j m'_j k_{jx} (\phi_j - \phi'_j), \qquad \theta_y = \sum_j m_j m'_j k_{jy} (\phi_j - \phi'_j),$$
$$\Gamma_{xx} = \sum_j m_j m'_j k_{jx} k_{jx}, \qquad \Gamma_{xy} = \sum_j m_j m'_j k_{jx} k_{jy},$$
$$\Gamma_{yx} = \sum_j m_j m'_j k_{jx} k_{jy}, \qquad \Gamma_{yy} = \sum_j m_j m'_j k_{jy} k_{jy}.$$

This similarity function can only determine displacements up to half the wavelength of the highest frequency kernel, which corresponds to a ±1 pixel area centered at the predicted position for $k_0 = \pi/2$. This range can be increased by using lower frequency kernels. Specifically, a four-level coarse-to-fine approach is used, which can determine displacements of up to ±22 pixels with the lowest frequency kernel. Therefore, only four displacement calculations are needed to determine the optimal position of each facial feature, which dramatically speeds up the displacement estimation and makes a real-time implementation of multi-feature tracking possible.
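A direct transcription of Eq. (5.7) is sketched below; the assumption that the jets are given as arrays of magnitudes, phases, and wave vectors is a data-layout choice for the sketch, not the thesis code.

```python
import numpy as np

def phase_displacement(m, phi, m2, phi2, k):
    """Estimate the displacement d between two Gabor jets (Eq. 5.7).

    m, phi   : magnitudes and phases of the jet J(x)   (arrays of length 60)
    m2, phi2 : magnitudes and phases of the jet J(y)
    k        : wave vectors k_j, shape (60, 2)
    """
    w = m * m2                              # common weighting m_j * m'_j
    dphi = phi - phi2                       # phase differences
    theta = np.array([np.sum(w * k[:, 0] * dphi), np.sum(w * k[:, 1] * dphi)])
    Gxx = np.sum(w * k[:, 0] * k[:, 0])
    Gxy = np.sum(w * k[:, 0] * k[:, 1])     # equals Gamma_yx
    Gyy = np.sum(w * k[:, 1] * k[:, 1])
    det = Gxx * Gyy - Gxy * Gxy
    if abs(det) < 1e-12:                    # degenerate case: no displacement estimate
        return np.zeros(2)
    inv = np.array([[Gyy, -Gxy], [-Gxy, Gxx]]) / det
    return inv @ theta                      # (d_x, d_y)
```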

5.3 Facial Feature Detection

Twenty-eight prominent facial features around eyes, eyebrows, nose and mouth

are selected for detection and tracking in the face images as shown in Figure 5.3.

Among them, some facial features move significantly under facial expressions, so that the facial expressions in the face images can be revealed from their movements.

Given a face image sequence, our proposed algorithm starts with detecting


these twenty-eight facial features automatically in the initial image frames. In essence, the proposed facial feature detection technique consists of two steps: facial feature approximation and facial feature refinement. The first step provides an approximated location for each facial feature based on a detected frontal face region; these locations are then fine-tuned in the second step.

Figure 5.3: A face mesh with the facial features.

5.3.1 Facial Feature Approximation

As shown in Figure 5.4, given a frontal face image, the selected twenty-eight

facial features are symmetrically located within the face region. In addition, certain

anthropometric proportions [17] must be satisfied among these facial features so that

the positions of the facial features can be roughly estimated once the face region is

estimated.

Therefore, a simple strategy is proposed to estimate a rough position for each

facial feature based on the located frontal face region in an image. First, a face mesh

F composed of these twenty-eight facial features is formed, as shown in Figure 5.5 (a). Specifically, the face mesh F is learned from a set of collected frontal faces by taking their mean face mesh. Since the set of collected faces covers a variety of people from different races, the learnt face mesh F works for most people. Once the face mesh F is obtained, it can be resized according to the size and position of the detected frontal face region and imposed on the face image to obtain a rough position for each facial feature, as shown in Figure 5.5 (b). Since the deviation from the actual position is usually small for each facial feature, its position can be further refined by subsequently searching around the estimated position in the refinement step. In the following, the searching procedure is described briefly.

Figure 5.4: Spatial geometry of the facial features in a frontal face region. Eyes are marked by the small white rectangles, the face region is marked with a large white rectangle, and the facial features are marked with white circles.

Figure 5.5: (a) The face mesh, (b) the face image with approximated facial features, (c) the face image with refined facial features.
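For illustration, a rough initialization of the feature positions could be obtained as sketched below, by scaling and translating a mean mesh (stored in normalized face-box coordinates) to the detected face rectangle; the normalized-coordinate convention and the example values are assumptions of the sketch.

```python
import numpy as np

def approximate_features(mean_mesh, face_box):
    """Place a learned mean face mesh inside a detected face rectangle.

    mean_mesh : (28, 2) array of feature positions in normalized [0, 1] face-box
                coordinates (an assumed storage convention).
    face_box  : (x, y, w, h) of the detected frontal face region.
    """
    x, y, w, h = face_box
    return mean_mesh * np.array([w, h]) + np.array([x, y])

# Hypothetical usage with a random mesh and a 100x120 face box at (80, 60).
mesh = np.random.default_rng(1).random((28, 2))
rough_positions = approximate_features(mesh, (80, 60, 100, 120))
```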

5.3.2 Facial Feature Refinement

Given an approximated position $\vec{x}_e$ for a facial feature in the face image, a Gabor coefficient vector $J(\vec{x}_e)$ is first calculated. Then, the nearest neighbor approach is utilized to seek the most similar Gabor coefficient vector $J'$ from a training set for each facial feature by matching with $J(\vec{x}_e)$ in the Gabor space. In order to

obtain an effective training set for each facial feature, a large number of face images

with different local properties were collected and the correct facial features were

marked in each image. These local properties include different individuals, different


lighting conditions, different face orientations and different facial expressions. Then

a set of Gabor coefficient vectors, each derived from a different face image at the

same facial feature, are stored for each facial feature. Since these face images are

collected under various conditions, a wide range of appearance variations for each

facial feature can be covered.

Once the most similar $J'$ is obtained, it is utilized as a model to estimate a new position $\vec{x}\,'$ for each facial feature, starting from the approximated position $\vec{x}_e$, via the fast phase-based displacement estimation technique proposed in Section 5.2.2. The above procedure is repeated until the final estimated position $\vec{x}\,'$ converges or a pre-defined number of iterations is exceeded. As shown in Figure 5.5 (c), the facial features are located successfully via the proposed facial feature detection technique. More face images with detected facial features under significant facial expressions are shown in Figure 5.6.

Figure 5.6: Face images with detected facial features under different facial expressions: (a) disgust, (b) anger, (c) surprise and (d) happiness.
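A compact sketch of this detection-refinement loop is shown below; `compute_jet`, `nearest_jet`, `displacement`, and the iteration cap of 5 are placeholders standing in for the Gabor jet computation, the per-feature training set lookup, the phase-based estimator, and an assumed stopping rule.

```python
import numpy as np

def refine_feature(compute_jet, nearest_jet, displacement, x_init,
                   max_iter=5, tol=0.5):
    """Iteratively refine one facial feature position (Section 5.3.2).

    compute_jet  : callable (x) -> Gabor jet at position x in the current image
    nearest_jet  : callable (jet) -> most similar training jet J' for this feature
    displacement : callable (model_jet, jet) -> 2D displacement estimate (Eq. 5.7)
    """
    x = np.asarray(x_init, float)
    for _ in range(max_iter):
        jet = compute_jet(x)
        model = nearest_jet(jet)            # nearest neighbor in the Gabor space
        d = displacement(model, jet)        # phase-based displacement toward the model
        x = x + d
        if np.linalg.norm(d) < tol:         # converged
            break
    return x
```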


5.4 Facial Feature Tracking

Once the facial features are located in the initial image frames, they are tracked in the subsequent image frames. Facial feature tracking, especially tracking a set of facial features simultaneously, is a crucial and difficult task. The face is a typical nonrigid object: the appearance of the facial features and the spatial relationship among them can vary significantly under changes in facial expression and face orientation. In addition, the features are very difficult to track under rapid head motion, and some of them may even disappear when the head turns to a profile view. Therefore, tracking these facial features robustly is a challenging problem.

However, a novel facial feature tracking scheme is proposed in this chapter to track

them robustly under large head movements and significant facial expressions. The

flowchart of the proposed tracking scheme is illustrated in Figure 5.7. Specifically,

it is composed of three stages, namely facial feature prediction, facial feature mea-

surement, and facial feature correction. In the subsequent sections, each stage will

be described briefly.

Figure 5.7: The flowchart of the proposed tracking algorithm.

5.4.1 Facial Feature Prediction

Kalman filtering is a well-known tracking method that has been successfully used in many applications. In this chapter, it is used to predict the position of each facial feature in a new image frame from its previous locations, so that a smoothness constraint can be imposed on the motion of each facial feature. In addition, the

motion of the face tracked by our proposed face tracker can provide strong and

reliable information about the location and movement of each facial feature between

two consecutive frames. This information is especially useful to track the facial

features under significant head movements. Therefore, by combining the face motion

obtained by our proposed face tracker with the Kalman filtering, we can obtain an


accurate and robust prediction of the position for each facial feature in a new image

frame, even under rapid head movement.

Specifically, for each feature, its motion state at each time instance (frame) can be characterized by its position and velocity. Let $(x_t, y_t)$ represent its pixel position and $(u_t, v_t)$ its velocity at time $t$ in the x and y directions. The state vector at time $t$ can therefore be represented as $S_t = (x_t \; y_t \; u_t \; v_t)^T$. The system can then be modelled as

$$S_{t+1} = \Phi S_t + W_t \qquad (5.8)$$

where $\Phi$ is the transition matrix and $W_t$ represents the system perturbation. Given the system model, $S^-_{t+1}$, the state vector at $t+1$, can be predicted by

$$S^-_{t+1} = \Phi S_t + W_t \qquad (5.9)$$

along with its covariance matrix $\Sigma^-_{t+1}$ to characterize its uncertainty.

The prediction based on Kalman filtering assumes smooth motion for each facial feature, so the prediction will be off significantly if the head undergoes a sudden rapid movement. To deal with this issue, we propose to approximate the movement of each facial feature with the face movement, since the face can be reliably tracked in each frame. Let the predicted position of each facial feature at $t+1$ based on the face motion be $S^p_{t+1}$. It provides a different motion constraint that complements the smoothness constraint from Kalman filtering when the motion is rapid. Combining these two constraints yields the final predicted position for each facial feature:

$$S^*_{t+1} = S^-_{t+1} + \Sigma^-_{t+1}\,(S^-_{t+1} - S^p_{t+1}) \qquad (5.10)$$

The simultaneous use of Kalman filtering and face motion allows us to perform accurate motion prediction for each facial feature under significant and rapid head movements. We can then derive a new covariance matrix $\Sigma^*_{t+1}$ for $S^*_{t+1}$ using the above equation to characterize its uncertainty.
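A literal sketch of this prediction stage is given below: a constant-velocity Kalman prediction (Eqs. 5.8-5.9) followed by the combination of Eq. (5.10), transcribed as stated, with the face-motion prediction; the state layout, time step, and covariance inputs are assumptions of the sketch.

```python
import numpy as np

def predict_feature(S_t, P_t, Q, S_face_pred, dt=1.0):
    """Predict a feature state (x, y, u, v) and combine it with the face-motion
    prediction, following Eqs. (5.8)-(5.10)."""
    Phi = np.array([[1, 0, dt, 0],           # constant-velocity transition matrix
                    [0, 1, 0, dt],
                    [0, 0, 1, 0],
                    [0, 0, 0, 1]], float)
    S_minus = Phi @ S_t                      # Kalman prediction S^-_{t+1}
    P_minus = Phi @ P_t @ Phi.T + Q          # predicted covariance Sigma^-_{t+1}
    # Combine with the prediction S^p_{t+1} derived from the tracked face motion.
    S_star = S_minus + P_minus @ (S_minus - S_face_pred)   # Eq. (5.10) as stated
    return S_star, P_minus
```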


5.4.2 Facial Feature Measurement

Given the predicted state vector $S^*_{t+1}$ at time $t+1$ and the predicted uncertainty $\Sigma^*_{t+1}$ for each facial feature, a search area centered at each predicted position in the image frame is provided. Usually the size of the search area is determined by the covariance matrix $\Sigma^*_{t+1}$, and a search within this area is used to detect the optimal position. However, when tracking a number of facial features simultaneously, searching the area exhaustively for each facial feature is very time-consuming. Instead, the search is done automatically via the fast phase-based displacement estimation technique described in Section 5.2.2. Hence, the displacement $\vec{d}$ for each facial feature can be computed directly, so that its position can be estimated efficiently. Since the proposed technique is very efficient, it is suitable for real-time applications.

Once the facial features are located, the Gabor wavelet coefficient vector is

computed at the located position for each facial feature, which will be utilized as

the tracking model in the subsequent image frame. This is equivalent to updating the tracking model dynamically in each image frame; hence, the appearance changes of each facial feature across the image frames can be accommodated. However, since this scheme does

not have the ability to correct the possible errors during tracking, errors may be

accumulated over the image frames such that the tracker will drift away eventually.

Therefore, in the next stage, a correction step is proposed to eliminate the accumu-

lated errors for each facial feature in the image frames so that the drifting can be

avoided.

5.4.3 Facial Feature Correction

The proposed facial feature correction strategy consists of two components: re-

fining facial features and imposing shape constraint. In the facial feature refinement

component, the tracked position of each facial feature will be refined by matching

with a training set in the Gabor space so that the accumulated error due to the

appearance changes can be eliminated. However, the above procedure only works

when the tracked position does not deviate far away from the actual position for

each facial feature, otherwise it may fail. Hence, a second component is subsequently


activated to impose the shape constraints among the facial features and correct the obviously geometrically violated ones that deviate far from their actual positions.

In the following, each component will be discussed briefly.

5.4.3.1 Facial Feature Refinement

First, for each facial feature, a set of Gabor wavelet coefficient vectors are

extracted offline from a large number of collected face images. The collected face images are intended to cover all possible appearance variations of each facial feature under various face orientations, facial expressions and illuminations. Therefore, the set of

Gabor wavelet coefficient vectors serves as a generic appearance model for each facial

feature under various situations.

Given a face image, for a specific facial feature such as the left eye corner, we

can always find a very similar one from the set of collected Gabor wavelet coefficient

vectors. Once the most similar Gabor wavelet coefficient vector is found from the

training set, it will be used as a model to search for a new position for each facial

feature around the positions obtained in the facial feature measurement stage. The

searching is done via the fast phase-based displacement estimation technique. The above procedure is repeated until the final estimated position converges or a pre-defined number of iterations is exceeded. In this way, the appearance changes of each facial feature are successfully accommodated, so that the accumulated tracking error can be eliminated in time during tracking.

5.4.3.2 Imposing Geometry Constraints

So far, during tracking, the geometrical relationship among the facial features

has not been considered. In order to correct those geometrically violated facial

features that deviate far away from their actual positions, the geometry constraint

among the detected facial features is imposed. However, the geometry variations

among all the twenty-eight facial features under changes in individuals, facial ex-

pressions and face orientations are too complicated to be modelled successfully.

Therefore, in the following, a simple but effective technique is proposed to handle

this issue. Basically, the proposed technique can be divided into two steps: face pose estimation and geometry constraint imposition. The first step provides rough face pose information so that the face pose effects can be eliminated from the tracked 2D facial features. Subsequently, the geometry constraint is imposed on the pose-eliminated 2D facial features to correct the geometrically violated ones. In this way, only the geometry variation under the frontal facial view needs to be learned, which is feasible to model, so that the geometry constraint can be imposed easily. In the following, each step is discussed briefly.

1. Rough Face Pose Estimation

Based on the detected facial features, a technique is proposed to estimate

the face pose efficiently. Under facial expressions, some of the facial features,

such as the ones around the mouth and the eyebrows, will move significantly.

Therefore, these nonrigid facial features are not suitable for the face pose

estimation. In order to minimize the effect of the facial expressions, only a set

of rigid facial features that do not move significantly with facial expression is

selected to estimate the face pose. Specifically, six facial features are selected

as shown in Figure 5.8 (a), which include four eye corners and two points on

the nose.

Figure 5.8: (a) A frontal face image, and (b) its 3D face geometry, with the selected facial features marked as white dots.

In order to estimate the face pose, the 3D shape model composed of these

six facial features has to be initialized. Currently, the 3D coordinates $X_i = (x_i, y_i, z_i)^T$ of the facial features in the 3D face shape model are first initialized from a generic 3D face model, as shown in Figure 5.8 (b). Because a new user differs individually from the generic face model, before using the system the user is asked to face the camera directly so that a frontal-view face image can be obtained for 3D face model adaptation. Based on the facial features detected in this frontal face image, the $x_i$ and $y_i$ coordinates of each facial feature in the 3D face shape model are adjusted automatically to the new user. In this situation, the depth values of the facial features in the 3D face shape model of the new user are not available; therefore, the depth pattern of the generic face model is used to approximate the $z_i$ values. Our experiments show that this method is effective and feasible in our real-time application.

Based on the personalized 3D face shape model and these six detected facial features in a given face image, the face pose parameters $\alpha = (\sigma_{pan}, \phi_{tilt}, \kappa_{swing}, \lambda)$ can be estimated, where $(\sigma_{pan}, \phi_{tilt}, \kappa_{swing})$ are the three Euler face angles and $\lambda$ is the scale. Since the traditional least-squares method [85] cannot handle outliers successfully, a robust RANSAC (Random Sample Consensus) algorithm [28] is used to estimate the face pose instead. In the following, the major steps involved in the RANSAC-based face pose estimation are discussed briefly.

(a) Form N triangles from the facial features

We randomly select three non-colinear facial features to form a triangle $T_i$. For each triangle $T_i$, if one of the vertices is chosen as the origin and one edge serves as the X axis, we can attach a local coordinate system $C_t$ to it. In this local coordinate system $C_t$, the three vertices of the triangle are coplanar and their z coordinates are 0.

(b) Obtain the projection matrix P from each triangle

Under the weak-perspective projection model, for each triangle $T_i$, the relationship between the row-column coordinate system and the local coordinate system $C_t$ can be expressed as follows:

$$\begin{pmatrix} c - c_0 \\ r - r_0 \end{pmatrix} = P \begin{pmatrix} x_t - x_{t0} \\ y_t - y_{t0} \\ z_t - z_{t0} \end{pmatrix} \qquad (5.11)$$

where $(c, r)$ represents the projection image of the 3D point $(x_t, y_t, z_t)$ in the local coordinate system $C_t$, and $(c_0, r_0)$ is the projection image of the reference point $(x_{t0}, y_{t0}, z_{t0})$. In addition, $P$ is a $2 \times 3$ projection matrix composed of the scalar $\lambda$ and the first two rows of the rotation matrix $R$ as follows:

$$P = \frac{1}{\lambda} \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \end{pmatrix} \qquad (5.12)$$

Since the vertices of the triangle are coplanar in $C_t$, the projection model (5.11) can be expressed as

$$\begin{pmatrix} c - c_0 \\ r - r_0 \end{pmatrix} = M \begin{pmatrix} x_t - x_{t0} \\ y_t - y_{t0} \end{pmatrix} \qquad (5.13)$$

where $M$ is a $2 \times 2$ projection matrix represented as

$$M = \frac{1}{\lambda} \begin{pmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \end{pmatrix} \qquad (5.14)$$

From the projection matrix $M$, two sets of face pose parameters can be recovered after constraining the range of the Euler face pose angles to $[-\frac{\pi}{2}, \frac{\pi}{2}]$. Further, for each set of recovered face pose parameters, a $2 \times 3$ projection matrix $P_i$ can be generated, but only one is correct.

(c) Calculate the projection deviation for each projection matrix

Based on a recovered projection matrix $P_i$, the facial features of the 3D face shape model are projected into the image to obtain a set of projected facial features. Then the deviation $d_{error}$ between the projected facial features and the detected ones in the image is calculated. If $d_{error}$ is larger than a threshold value $d$, the projection matrix $P_i$ is discarded; otherwise, the projection matrix $P_i$ is kept and its weight is computed as $\omega_i = (d - d_{error})^2$.

(d) Average the final results

After checking all of the $N$ triangles, we obtain a list of $K$ $2 \times 3$ projection matrices $P_i$, $i = 1 \ldots K$, together with their corresponding weights $\omega_i$, $i = 1 \ldots K$. From each full projection matrix $P_i$, a set of face pose parameters $\alpha_i = (\sigma_{pan}, \phi_{tilt}, \kappa_{swing}, \lambda)$ is obtained uniquely. The final face pose parameters $\alpha$ are then obtained as follows:

$$\alpha = \frac{\sum_{i=1}^{K} \alpha_i \, \omega_i}{\sum_{i=1}^{K} \omega_i} \qquad (5.15)$$

Via the proposed face pose estimation technique based on RANSAC, the face

pose can be estimated robustly under various facial expressions. Therefore,

given the estimated face pose information, the face pose effect can be elimi-

nated from the tracked twenty-eight 2D facial features. Since the set of pose-

eliminated 2D facial features can be treated as being captured under the frontal

face view, only a frontal face shape model needs to be learned to impose the

geometry constraint among the facial features.

2. Shape Constraints with Active Shape Model

First, the shape of a face is defined as a vector composed of these twenty-eight

facial features. Then, a frontal face shape model is built as follows. Initially,

a set of face shape samples are extracted from a large number of frontal faces

with various facial expressions. The set of collected face shape samples serves

as a training set, consisting of N face shape samples Qi, where Qi is a vector

composed of the coordinates of the facial features. Then, a mean face shape vector Qmean can be computed by

Q_{mean} = \frac{1}{N} \sum_{i=1}^{N} Q_i \qquad (5.16)

Once the mean face shape vector Qmean is obtained, the shape variations ∆Qi can be obtained by subtracting the mean face shape Qmean from each face shape sample Qi in the training set. From the set of shape variations ∆Qi, a set of k basis face shape vectors Qj, 1 ≤ j ≤ k, is then extracted using the PCA technique. Usually, the selected number k is much smaller than the dimension of the face shape vector Q.

As a result, given a face shape vector Qi extracted from a face image, the

global geometry constraint among the facial features can be imposed by

Qi −Qmean ≈ Φb (5.17)

with Φ = (Q1, ..., Qk) and b is a coefficient vector given by

b = Φ+(Qi −Qmean) (5.18)

where Φ+ is the pseudo-inverse of the matrix Φ, computed as Φ⁺ = (ΦᵀΦ)⁻¹Φᵀ.

After the coefficient vector b is obtained, the geometry-constrained face shape

vector Q′i is represented as

Q'_i = Q_{mean} + \Phi b \qquad (5.19)

Via the proposed method, facial features that obviously violate the geometry constraint and deviate far from their actual positions can usually be corrected efficiently.
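As a rough illustration of equations 5.16 to 5.19, the following Python/numpy sketch builds the frontal shape model by PCA and projects a tracked shape back onto it. The variable names are illustrative and this is only a sketch, not the implementation used in this thesis.

    import numpy as np

    def build_shape_model(training_shapes, k):
        # training_shapes: (N, 56) stacked frontal face shape vectors (28 features x 2).
        Q_mean = training_shapes.mean(axis=0)                      # mean shape, equation 5.16
        variations = training_shapes - Q_mean                      # shape variations
        _, _, Vt = np.linalg.svd(variations, full_matrices=False)  # PCA via SVD
        Phi = Vt[:k].T                                             # (56, k) basis matrix
        return Q_mean, Phi

    def apply_shape_constraint(Q, Q_mean, Phi):
        # b = pinv(Phi) (Q - Q_mean), then Q' = Q_mean + Phi b (equations 5.18 and 5.19).
        b = np.linalg.pinv(Phi) @ (Q - Q_mean)
        return Q_mean + Phi @ b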


5.5 Experiment Results

A real-time facial feature tracking system is built based on the proposed al-

gorithm. When a person is sitting in front of the camera, it can detect and track

these twenty-eight facial features automatically.

In order to test the performance of the proposed facial feature tracking algo-

rithm, a number of face image sequences were collected from different people. Figure

5.9 shows the facial feature tracking results on some typical face image sequences

with different facial expressions under various face orientations. Even when the facial expression changes significantly or the face orientation is large, the facial features can still be tracked successfully, as shown in Figure 5.9.

Figure 5.9: The randomly selected face images from different image sequences.

5.5.1 Facial Feature Tracking Accuracy

In order to evaluate the accuracy of the proposed facial feature tracking al-

gorithm, ten face image sequences with significant changes in facial expression and

face pose were collected. In each sequence, twenty-eight facial features were detected

and tracked automatically by the proposed facial feature tracker. In addition, the


positions of these facial features at each frame were also manually located for each

sequence. These manually located facial features serve as the ground truth during

the error computation.

Figure 5.10 illustrates the computed absolute position errors for the facial

features tracked by the proposed facial feature tracker. Specifically, Figure 5.10

(a) and (b) summarize the computed mean and standard deviation of the absolute

position errors for each facial feature in the X-direction and Y-direction respectively,

where the mean is represented by the middle value of each error bar and the standard

deviation is indicated by the half length of the error bar. In the X-direction, the

average mean of the computed position errors for all the twenty-eight facial features

is 1.66 pixels, and its average standard deviation is 0.89 pixels. In the Y-direction, the average mean of the computed position errors for all the twenty-eight facial features is 1.85 pixels, and its average standard deviation is 0.71 pixels. Overall, the computed mean and standard deviation of the position errors of all the tracked facial features are less than 2 pixels. Since the original face image resolution is 320 × 240 pixels, the position errors of the tracked facial features are small enough for most applications, such as facial expression recognition or facial animation.

Figure 5.10: The computed position errors of the automatically extracted facial features by the proposed facial feature tracker: (a) position errors in the X-direction for each facial feature; (b) position errors in the Y-direction for each facial feature


5.5.2 Processing Speed

The proposed facial feature detection and tracking technique is implemented in C++ on a PC with a Xeon (TM) 2.80 GHz CPU and 1.00 GB of RAM. The resolution of the captured images is 320 × 240 pixels, and the built facial feature tracking system runs comfortably at approximately 26 frames per second.

5.6 Comparison with IR-Based Eye Tracker

The eyes (or eye centers), along with the other twenty-six prominent facial features, can be detected and tracked via the proposed technique, as described in this chapter.

Unlike the IR-based eye detection and tracking technique described in Chapter 2,

the technique proposed in this chapter works well under ambient lighting conditions,

without a special IR illuminator. Its hardware setup, therefore, is simple, with

sufficient accuracy for tasks like facial feature detection, face detection, and face

recognition.

In contrast, the IR-based eye tracker requires special IR illumination hardware

and an associated video decoder. However, it can work during the day and at night.

In addition, it produces sub-pixel accuracy, which is a prerequisite for a stable and

accurate gaze-tracking system.

5.7 Chapter Summary

In this chapter, we present a real-time approach to detect and track twenty-

eight facial features from the face images under significant changes in both facial

expression and face orientation. The improvements over the existing facial feature

detection and tracking algorithms result from: (1) combination of the Kalman fil-

tering with the face motion to constrain the facial feature locations; (2) the use of

pyramidal Gabor wavelets for efficient facial feature representation; (3) dynamic and

accurate model updating for each facial feature to eliminate any error accumulation;

and (4) imposing the global geometry constraints to eliminate any geometrical vio-

lations. With these combinations, the accuracy and robustness of the facial feature tracker reach a practically acceptable level. Subsequently, the extracted spatio-

temporal relationships among the facial features can be used to perform the facial


motion estimation or facial expression classification in the following chapters.


CHAPTER 6

Nonrigid and Rigid Facial Motion Estimation

6.1 Introduction

The motion of the face consists of two independent motions: rigid motion

and nonrigid motion. The rigid motion results from the global motion of the face

describing the rotation and translation of the head or face pose, while the nonrigid

motion results from the local motion of the face describing the contraction of facial

muscles or facial expression. When captured by the camera, both motions are mixed

together to form a 2D face motion in the image plane.

Successful recovery of face pose and facial expression from face images is very

important for many applications including face animation, facial expression analysis,

HCI and model-based video conferencing. For example, if the face pose and facial

expression can be recovered successfully from the face images, an MPEG-4 player

can be driven to animate a synthetic 3D face model directly by the video images

of a live performer [95]. Also, in the area of automatic facial expression analysis,

due to the inability to separate face pose from facial expression accurately, most

of the facial expression recognition systems developed so far require the subject to

face the camera directly without significant head movements [25, 105]. But once the

face pose is precisely estimated, its effect can be eliminated so that the estimated

nonrigid motion of facial expression will be independent of the face pose. Therefore,

the user can move his head freely in front of the camera while the facial expression

can still be recognized.

In this chapter, a novel technique is proposed to recover 3D face pose and

facial expression simultaneously from a set of twenty-eight facial features tracked by

the proposed technique in chapter 5. Specifically, after explicitly modelling the non-

linear coupling between face pose and facial expression in the image, a normalized

SVD (N-SVD) decomposition technique is proposed to analytically recover the pose

and expression parameters simultaneously. Subsequently, the solution obtained from

the N-SVD technique is further refined via a nonlinear technique by imposing the


orthonormality constraints on the pose parameters. Compared to the original SVD

technique proposed in [5], which is very sensitive to image noise and is numerically

unstable in practice, our proposed method can recover the face pose and facial

expression robustly and accurately.

6.2 Related Works

Numerous methods [95, 25, 30, 11, 121, 120] have been proposed to estimate

rigid and nonrigid facial motions from face images. Conventionally, the rigid and

nonrigid motions of the face are estimated separately [95, 122], and usually, the

nonrigid motion is subsequently estimated after the rigid motion is recovered. For

example, a two-stage method [95] is proposed. First, the rigid 3D head motion

is estimated from two successive frames. Then, the nonrigid motion of the face

is recovered by eliminating the estimated 3D head motion. However, in the stage

of estimating the head motion, the face is assumed to be a rigid object, ignoring

the facial expression changes between these two views. Therefore, the head motion

cannot be accurately estimated under the facial expression changes. In fact, most

face pose estimation algorithms [30, 11, 121, 120] are proposed to deal with the

rigid face only, ignoring the facial expressions. Hence, with facial expressions, the

estimated face pose by these techniques will only be an approximation of the true

pose. In fact, the estimated face pose will not be accurate at all when the facial

expression is significant. On the other hand, without accurate face pose information,

the recovered nonrigid motions of the face will not be accurate, either.

Since the 3D face motion is the sum of rigid and nonrigid motions, when pro-

jected into the image plane of the camera, both motions will be nonlinearly coupled

in the projected face motion. Therefore, any approach that tries to recover one

motion independently by ignoring the other will not solve this problem accurately.

But if the image projection of both motions can be explicitly modelled, then the

face pose and facial expression can be recovered simultaneously and accurately from

the motion projection model.

A group of different techniques [65, 23, 5, 35] have been proposed to estimate

the rigid and nonrigid facial motions simultaneously. For each technique, a different


2D facial motion model is developed to integrate the face pose and facial expression

parameters together. Then, based on the built 2D facial motion model, the face

pose and facial expression can be extracted simultaneously.

Li et al. [65] proposed a method to simultaneously extract the face pose and

facial expression from two successive views with the use of the image brightness

constancy equation. Their method employs a 3D face model that can describe facial

expressions by a set of action units (AU). For this method, the 3D motion between

successive frames must be small, since the motion of the 3D model is modelled as a

linear approximation of the face pose motion and facial expression motion. Similar to

[65], another 3D face model that can represent the facial expression by a set of facial

animation parameters (FAPs) is utilized to extract the pose and expression from

two successive views based on the image brightness constancy equation [23]. With

the use of the image brightness constancy equation, the component of the motion

field in the direction orthogonal to the spatial image gradient is not constrained.

Therefore, for both methods, the recovered facial motion is only an approximation

of the true motion, whose error will be large if the true motion deviates far from the

direction of the spatial image gradient. In addition, because both methods process

the whole image in order to estimate the motion, they may not be suitable for

real-time implementation due to their computational complexity.

In [34], a novel method is proposed to extract both pose and shape of the

face simultaneously from images based on a 3D model that can represent the facial

expression by a linear combination of rigid shape basis vectors. Different from

techniques in [65, 23], the 3D facial expression is learnt from a set of 3D real facial

expression data collected from a stereo camera system. However, there are two

significant issues that prevent this technique from working robustly in practice.

First, since the facial features are tracked in a simply modified optical flow-based

framework, it cannot adapt to the appearance changes of each facial feature under

significant facial expression changes. It suffers from the well-known drifting issue

[72] during tracking. Second, since the recovery of pose and expression parameters

is integrated into the facial feature tracking in the image, the parameter searching

space becomes very complicated. In such a complicated parameter space, achieving


correct convergence and real-time implementation is very difficult.

In [5], the face pose and facial expression are recovered from a set of tracked

facial features. The facial expression is modelled as a linear combination of key-

expressions, while the facial motion is approximated by affine projection with par-

allax. Then, the coupling of both motions in the image plane is described by a

bilinear model. Subsequently, by decomposing the bilinear model via the Singular

Value Decomposition (SVD) method, the face pose and facial expression parameters

can be extracted directly. This method is elegant in theory; however, we will show both in theory and in experiments that it is so numerically unstable in the presence of image noise that it does not work at all in practice. Furthermore, for this method, the selected facial features are tracked with the help of black make-up markers, which is also impractical.

In order to overcome the shortcomings of the related work in [34, 5], a robust technique is proposed to recover the rigid and nonrigid facial motions from the face images accurately in real time without any make-up. First, a normalized SVD

(N-SVD) technique is proposed to improve the original SVD technique proposed

in [5] so that it can work stably in the presence of image noise. As shown in

Section 6.5 of this chapter, however, the proposed N-SVD method is unable to

impose the orthonormality constraints on the face pose parameters. Therefore, we

further introduce a nonlinear technique to refine the solution of the N-SVD method

based on a criterion that has a favorable interpretation in terms of distance and

appropriate constraints. Experiments show that the proposed motion estimation

technique provides significant improvements in accuracy for estimated rigid and

nonrigid motions.

6.3 Pose and Expression Modelling

6.3.1 3D Face Representation

A neutral face is defined as a relaxed face without any contraction of the facial

muscles. With facial expression changes, the facial features will be moved and the

facial appearance will change subsequently as the facial muscles contract. Hence,

facial expression can be treated as deformations of a neutral face. If a set of facial


features that move significantly with facial expressions are selected for tracking

as shown in Figure 6.1 (a), then the facial expression can be characterized by the

movements (or displacements) of these selected facial features relative to the neutral

face.

Figure 6.1: (a) The spatial geometry of the selected facial features marked by the dark dots; (b) The 3D face mesh with the selected facial features marked by the white dots

A 3D face model is represented by a set of l facial features as shown in Figure

6.1 (b). A 3D object coordinate system is attached to the face, whose origin is at the

tip of the nose and whose Z-axis is perpendicular to the face plane. In the defined

face coordinate system, the 3D coordinates of each facial feature Xi are represented

as (xi, yi, zi)T . In addition, the tip of the nose is treated as the face center, whose

3D coordinates are (0, 0, 0)T .

6.3.2 3D Deformable Face Model

Given a neutral face model XN composed of l facial features, the 3D face with

deformation can be expressed as follows:

X = XN + ∆X (6.1)

where X is a vector composed of l facial feature coordinates Xi (i = 1, ..., l) on the

face model, XN is a vector composed of l facial feature coordinates XNi (i = 1, ..., l)

on the neutral face model, and ∆X is the facial deformation vector under the facial


expressions, which consists of the relative movements ∆Xi (i = 1, ..., l) of the facial

features to the neutral face.

The task of nonrigid facial motion estimation is to recover ∆X of a face from its 2D image. Without reduction, ∆X contains 3l variables, which are difficult to estimate due to its large dimensionality. In order to minimize the dimensionality of

the facial deformation vector ∆X, a compact representation with a reduced number

of parameters is statistically built from a set of collected facial deformation vectors

via the Principal Component Analysis (PCA) technique similar to [34]. Specifically,

a set of p (p ≪ 3l) facial deformation basis vectors ∆Qk, k = 1, ..., p, is obtained

offline from a collected training set so that any facial deformation vector ∆X of the

3D face model X can be approximated by a linear combination of the basis vectors

as follows:

\Delta X \approx \sum_{k=1}^{p} \alpha_k \, \Delta Q^k \qquad (6.2)

where αk (k = 1, ..., p) are the coefficients of the facial deformation basis vectors,

and ∆Qk is represented as

\Delta Q^k = \left( \Delta Q^k_1 \;\; \ldots \;\; \Delta Q^k_l \right)^T \qquad (6.3)

Therefore, a 3D deformable face model X with facial expression changes can be

represented by a set of reduced parameters as follows:

X \approx X^N + \sum_{k=1}^{p} \alpha_k \, \Delta Q^k \qquad (6.4)

In the rest of the chapter, the coefficients αk will be called facial expression param-

eters.


6.3.3 3D Motion Projection Model

Under the weak-perspective projection assumption [109], given a facial feature

point in the 3D face model, the following equation can be obtained:

Ui = MXi (6.5)

where Ui = (ui vi)T is the relative coordinate of the 2D image point and Xi =

(xi yi zi)T is its corresponding relative 3D coordinate point. They are produced

by subtracting the center of the face in their respective coordinate systems. The

projection matrix M is a 2 × 3 matrix composed of the scalar λ and the first two

rows of the rotation matrix R, as follows:

M = \begin{pmatrix} m_{11} & m_{12} & m_{13} \\ m_{21} & m_{22} & m_{23} \end{pmatrix} = \frac{1}{\lambda} \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \end{pmatrix} \qquad (6.6)

Therefore, once M is known, the rotation matrix R and the scalar λ of the 3D

face model can be recovered. In the rest of the chapter, the coefficients mij will be

referred to as face pose parameters.

For each facial feature point, after integrating equation 6.4 into equation 6.5,

a motion projection model that analytically combines the effects of face pose and

facial expression in the 2D face image is derived as follows:

U_i = M \left[ X^N_i + \sum_{k=1}^{p} \alpha_k \, \Delta Q^k_i \right] = M X^N_i + \sum_{k=1}^{p} \alpha_k \, M \Delta Q^k_i \qquad (6.7)

From equation 6.7, it is clear that the 3D motion projection model is a nonlinear function of the pose parameters mij and the expression parameters αk. Furthermore, the coupling term αkM∆Qki indicates that the pose parameters mij and the expression parameters αk directly interact to produce the facial motion in the image. For convenience, the parameters involved in the motion projection equation 6.7 are collected into the vector ξ = (m11, m12, m13, m21, m22, m23, α1, ..., αp)T.

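To make equation 6.7 concrete, the short Python/numpy fragment below evaluates the projection model for a given parameter vector ξ. The array names are illustrative and the fragment is only a sketch, not the implementation used in this thesis.

    import numpy as np

    def project_features(xi, X_neutral, delta_Q):
        # xi = (m11, ..., m23, alpha_1, ..., alpha_p); X_neutral: (l, 3) neutral features;
        # delta_Q: (p, l, 3) facial deformation basis vectors.
        M = xi[:6].reshape(2, 3)                               # pose parameters
        alpha = xi[6:]                                         # expression parameters
        X = X_neutral + np.tensordot(alpha, delta_Q, axes=1)   # deformed 3D model (eq. 6.4)
        return X @ M.T                                         # (l, 2) projected features U_i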

6.4 Normalized SVD for Pose and Expression Decomposi-

tion

6.4.1 SVD Decomposition Method

A decomposition technique based on Singular Value Decomposition (SVD)

is introduced by Bascle et al. [5] to estimate the parameters from equation 6.7.

Specifically, it can be summarized by the following two steps:

1. Solution of a system of linear equations

After some appropriate arrangements, equation 6.7 can be written in matrix

format as:

U_i = \Omega_i W \qquad (6.8)

with

\Omega_i = \begin{pmatrix} X^N_i & 0 & \Delta Q^1_i & 0 & \cdots & \Delta Q^p_i & 0 \\ 0 & X^N_i & 0 & \Delta Q^1_i & \cdots & 0 & \Delta Q^p_i \end{pmatrix}

and W = (A  α1A  · · ·  αpA)T, where A = (m11 m12 m13 m21 m22 m23). The

parameter vector W contains 6+6p unknowns. Given a set of l detected facial

features in a face image, where l ≥ 3 + 3p, a system with 2l linear equations

can be derived from equation 6.8 as follows:

P = ΩW (6.9)

where P = (U1 · · · Ul)T and Ω = (Ω1 · · · Ωl)T. The linear system 6.9 can be easily solved by the least squares technique as W = Ω+P, where Ω+ is the pseudo-inverse of the matrix Ω, computed as Ω⁺ = (ΩᵀΩ)⁻¹Ωᵀ.

2. The singularity constraint

Once the parameter vector W is estimated, a matrix F can be constructed by rearranging it:

F = \left( A^T \;\; \alpha_1 A^T \;\; \cdots \;\; \alpha_p A^T \right) = \begin{pmatrix} m_{11} \\ m_{12} \\ m_{13} \\ m_{21} \\ m_{22} \\ m_{23} \end{pmatrix} \left( 1 \;\; \alpha_1 \;\; \cdots \;\; \alpha_p \right) \qquad (6.10)

The above equation clearly shows that the matrix F is a multiplication result of

two vectors. Hence, the matrix F is a singular matrix with rank 1. In practice,

however, the constructed matrix F has a rank 5 because of the inaccuracies

of the measurement or image noise. Thus, steps are needed to enforce this

singularity constraint on the constructed matrix F .

A corrected matrix F ′ can be derived by minimizing the Frobenius norm ‖F − F ′‖ subject to the constraint that F ′ has rank 1. A convenient method of doing this

is to use the SVD decomposition technique. In particular, let F = UDV T be

the SVD of F , where U and V are two unitary matrices, and D is a diagonal

matrix D = diag(d0 d1 · · · dp) satisfying d0 ≥ d1 ≥ · · · ≥ dp. Then F ′ is

chosen as F ′ = Udiag(d0 0 · · · 0)V T . This way, the corrected matrix F ′

will be a singular matrix with rank 1. Subsequently, the parameters ξ can be

recovered directly from the SVD of the corrected matrix F ′.

6.4.2 Condition of the Linear System

Unfortunately, it turns out that the matrix Ω of the linear system P = ΩW

is ill-conditioned. As indicated in [32], the condition number of a general nonzero matrix Ω (formally defined as κ(Ω) = ‖Ω+‖‖Ω‖, where Ω+ is the pseudo-inverse of the matrix Ω) is a quantitative indication of the sensitivity to perturbation of a linear system involving Ω. Hence, the ill-conditioned matrix Ω makes the solution of the linear system P = ΩW very sensitive to image noise. In practice, the estimated pose and expression parameters by the proposed SVD method [5] are extremely susceptible

to image noise, so the SVD method cannot work effectively for real images.

Hence, we must analyze the condition of the linear system P = ΩW . For

simplicity, equation 6.8 can be split into the following two equations, after some

proper rearrangements:

u_i = \Omega'_i W_u \qquad (6.11)

v_i = \Omega'_i W_v \qquad (6.12)

with Ω′i = (XNi  ∆Q1i  · · ·  ∆Qpi) and

W_u = (A_1 \;\; \alpha_1 A_1 \;\; \cdots \;\; \alpha_p A_1)^T

W_v = (A_2 \;\; \alpha_1 A_2 \;\; \cdots \;\; \alpha_p A_2)^T

where A1 = (m11 m12 m13) and A2 = (m21 m22 m23).

Then, given a set of l facial features in a face image, assuming l ≥ 3 + 3p, two

systems of linear equations can be derived:

P_u = \Omega_f W_u \qquad (6.13)

P_v = \Omega_f W_v \qquad (6.14)

with Pu = (u1 · · · ul)T, Pv = (v1 · · · vl)T, and

\Omega_f = \begin{pmatrix} X^N_1 & \Delta Q^1_1 & \cdots & \Delta Q^p_1 \\ \vdots & \vdots & & \vdots \\ X^N_l & \Delta Q^1_l & \cdots & \Delta Q^p_l \end{pmatrix}

Apparently, the matrix Ωf is very unbalanced because the magnitudes of the entries in the first block of columns XN are significantly larger than those of the other columns. In practice, the difference in magnitude between the columns of XN and the remaining ones is around 10³. As a result, the condition number of the matrix Ωf is very large, usually larger than 4 × 10³. This large condition number means that the matrix Ωf is close to singular or ill-conditioned. Therefore, the ill-conditioned matrix Ωf will

cause the solutions of the linear equation systems 6.13 and 6.14 to be very unstable

and extremely sensitive to image noise.

In the following section, a simple but effective solution is proposed to improve

the condition of the matrix Ωf . As a result, the linear systems 6.13 and 6.14 can be

solved more stably.
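The imbalance can also be checked numerically. The fragment below is a small Python/numpy sketch (illustrative names, not the thesis code) for computing the condition number defined above and the per-block magnitudes of Ωf.

    import numpy as np

    # kappa(Omega) = ||Omega^+|| * ||Omega||; with the 2-norm this equals the ratio of
    # the largest to the smallest singular value, which numpy.linalg.cond returns by default.
    def condition_number(Omega):
        return np.linalg.cond(Omega)

    def block_magnitudes(Omega_f, p):
        # Average absolute entry of each 3-column block (X^N, dQ^1, ..., dQ^p); a spread
        # of several orders of magnitude between the blocks signals the imbalance above.
        return [np.abs(Omega_f[:, 3 * k:3 * k + 3]).mean() for k in range(p + 1)]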

6.4.3 Normalization SVD Technique

In this section, a new stable algorithm, named as normalized SVD method (N-

SVD), is proposed to effectively estimate the pose and expression parameters. Since

the near-singularity of the matrix Ωf is caused by the imbalance of the matrix Ωf , we propose a simple matrix transformation technique to balance the matrix Ωf . Given a transformation matrix C, a transformed matrix Ω′f is obtained by multiplying by the inverse of C:

\Omega'_f = \Omega_f C^{-1} \qquad (6.15)

In particular, the transformation matrix C is a diagonal matrix and its struc-

ture is illustrated as follows:

C = \mathrm{diag}\left( c_0 \;\; c_0 \;\; c_0 \;\; c_1 \;\; c_1 \;\; c_1 \;\; \cdots \;\; c_p \;\; c_p \;\; c_p \right)

where ck (k = 0, ..., p) equals the average of the column sums of the corresponding three columns of Ωf (the block XN for k = 0 and the block ∆Qk for k ≥ 1). Using the matrix transformation equation 6.15, the coordinates of each point are equally scaled.

After the transformation, the transformed matrix Ω′f is well-balanced. The condition number of the transformed matrix Ω′f is very small, around 10¹, indicating that the transformed matrix Ω′f is well-conditioned. Therefore, for the linear systems Pu = Ω′fW′u and Pv = Ω′fW′v, where W′u = CWu and W′v = CWv, the solutions W′u and W′v are less sensitive to image noise.

Similar to the construction of the matrix F , from the estimated parameter vectors W′u and W′v, a new matrix Fn can be constructed:

F_n = \left( c_0 A^T \;\; c_1 \alpha_1 A^T \;\; \cdots \;\; c_p \alpha_p A^T \right) = \begin{pmatrix} m_{11} \\ m_{12} \\ m_{13} \\ m_{21} \\ m_{22} \\ m_{23} \end{pmatrix} \left( c_0 \;\; c_1 \alpha_1 \;\; \cdots \;\; c_p \alpha_p \right) \qquad (6.16)

The preceding equation clearly shows that the constructed matrix Fn is also a singular matrix with rank 1. Hence, this singularity constraint can be imposed on the constructed matrix Fn via the SVD technique to obtain a corrected matrix F′n. Subsequently, the parameters can be recovered from F′n.

Finally, the N-SVD method is summarized as the following five steps:

1. Normalization: Normalizing the matrix Ωf of the linear systems Pu = ΩfWu and Pv = ΩfWv by a transformation matrix C computed from the matrix Ωf .

2. Linear solution: Constructing, similar to the construction of the matrix F , a matrix Fn from the vectors W′u and W′v obtained from the transformed linear systems Pu = Ω′fW′u and Pv = Ω′fW′v.

3. Constraint enforcement: Replacing Fn by its closest singular matrix F′n of rank 1 via the SVD technique.

4. De-normalization: Decomposing the matrix F′n back into parameter vectors W′u and W′v, and then replacing W′u and W′v by C⁻¹W′u and C⁻¹W′v, respectively.

5. Parameter recovery: Recovering the face pose and expression parameters from the resulting vectors.

Via the proposed N-SVD technique, the condition of the linear systems is improved significantly, so that the recovered pose and expression parameters are no longer sensitive to the image noise.
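For illustration, the five steps can be sketched compactly in Python/numpy, assuming Ωf, Pu and Pv have already been assembled from the tracked features and the deformation basis. The ck definition and the recovery step below are one plausible reading of the description above; this is a sketch, not the thesis code.

    import numpy as np

    def n_svd_decompose(Omega_f, P_u, P_v, p):
        # 1. Normalization: c_k taken as the average column sum of each 3-column block
        #    (X^N, dQ^1, ..., dQ^p); assumes these block sums are nonzero.
        c = np.repeat([Omega_f[:, 3 * k:3 * k + 3].sum(axis=0).mean()
                       for k in range(p + 1)], 3)
        Omega_bal = Omega_f / c                                   # Omega'_f = Omega_f C^-1
        # 2. Linear solution of the two balanced systems by least squares.
        Wu = np.linalg.lstsq(Omega_bal, P_u, rcond=None)[0]       # W'_u
        Wv = np.linalg.lstsq(Omega_bal, P_v, rcond=None)[0]       # W'_v
        # Rearrange W'_u and W'_v into the 6 x (p+1) matrix F_n of equation 6.16.
        Fn = np.vstack([Wu.reshape(p + 1, 3).T, Wv.reshape(p + 1, 3).T])
        # 3. Constraint enforcement: keep only the largest singular value (rank 1).
        U, d, Vt = np.linalg.svd(Fn)
        u, v, d0 = U[:, 0], Vt[0], d[0]
        # 4/5. De-normalization and parameter recovery, folded into the rank-1
        #      factorisation F'_n = (A^T)(c_0, c_1 a_1, ..., c_p a_p); the free scale is
        #      fixed by requiring the first entry of the right factor to equal c_0.
        c_blocks = c[::3]
        gamma = d0 * v[0] / c_blocks[0]
        A = gamma * u                                             # (m11, m12, m13, m21, m22, m23)
        alpha = (d0 / gamma) * v[1:] / c_blocks[1:]               # expression coefficients
        return A.reshape(2, 3), alpha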

6.4.4 Stability Analysis

For a system of linear equations P = ΩfW , each entry of the vector W will

contribute the same amount of perturbation in the vector P . Thus, under a per-

turbation in P , the entries of W corresponding to the columns with larger entries

in matrix Ωf will undergo a smaller perturbation. In other words, the entries of W

corresponding to the columns with smaller entries in matrix Ωf are more subject to

the perturbation in P . Therefore, for an unbalanced matrix Ωf , as shown in Section

6.4.2, the entries of W corresponding to the columns of the blocks ∆Qk in the matrix Ωf are very sensitive to image noise.

After achieving normalization by multiplying by the matrix C−1, the entries in each column of the matrix Ω′f will have approximately the same average magnitude. Thus, the matrix Ω′f will be well-conditioned; solving the system of linear equations P = Ω′fW′ is equivalent to treating each column of the matrix Ω′f equally, so that each entry in W′ will have the same sensitivity to the image noise.

For example, via the motion modelling equation 6.7, a specific synthetic facial

deformation vector ∆X is generated from 2 basis facial deformation vectors and a 3D neutral face mesh XN by choosing a set of known face pose and expression parameters ξ0. In addition, in order to test its sensitivity to image noise, some Gaussian noise is added to the generated synthetic facial deformation vector to obtain a noisy facial deformation vector. Then, the proposed technique is subsequently applied to recover the face pose and expression parameters ξ from the noisy facial deformation vector. From the parameter vector ξ0, two vectors

Wu and Wv can be obtained and serve as the ground truth for the perturbation

calculation.

According to the proposed technique above, the matrix Ωf is first generated

as follows:

\Omega_f = \begin{pmatrix}
-31.96 & -35.63 & 34.51 & -0.0120 & 0.0071 & -0.0350 & 0.2135 & -0.1036 & -0.0062 \\
 38.64 & -34.31 & 34.51 & -0.0007 & 0.0026 &  0.0158 & 0.2223 &  0.0496 &  0.0007 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
-48.70 & -53.32 & 33.33 & -0.0034 & 0.0847 &  0.0129 & 0.2888 & -0.1246 &  0.072 \\
 52.95 & -54.02 & 33.33 &  0.0085 & 0.0766 & -0.0606 & 0.2792 &  0.0997 &  0.0553
\end{pmatrix}

which is very unbalanced since the entries of the first three columns are significantly

larger than the rest of columns. The computed condition number of the matrix

Ωf is 4484.4, which means that it is ill-conditioned. Similar to the original SVD

method, if Ωf is utilized directly to form two systems of linear equations (equations

6.13 and 6.14), then the computed perturbations of the estimated vectors Wu and

Wv are obtained as follows:

δWu = ( 0.001  −0.027  −0.020  1.590  −3.713  5.991  −0.331  0.625  −3.541 )T

δWv = ( −0.009  −0.012  −0.018  15.681  −1.585  −1.110  0.107  −0.352  −1.456 )T

The preceding calculations show that the perturbations of the entries in Wu and Wv corresponding to the columns of XN in the matrix Ωf are much smaller than those of the other entries. Hence, the entries of the constructed matrix F carry significantly different amounts of perturbation. However, when taking its closest singular matrix F′, all entries are treated equally, ignoring the large inequalities of the perturbations associated with them. Therefore, neglecting these large inequalities among the perturbations makes the solution of the original SVD decomposition technique inaccurate and unstable in the presence of image noise. The perturbation δξ of the estimated parameters is computed as follows:

δξ = ( −0.5604  −0.2216  0.0407  −0.5767  −0.2449  0.0155  24.6269  7.4026 )T

According to the normalization technique proposed in Section 6.4.3, a well-balanced matrix Ω′f is obtained. The condition number of the matrix Ω′f is only 23.2, which is significantly smaller than the original value of 4484.4. With the use of the matrix Ω′f , two new systems of linear equations can be formed, and the computed perturbations of the estimated parameter vectors W′u and W′v are illustrated as follows:

δW′u = ( 1.19  −19.74  −8.90  −12.94  2.36  −5.53  14.57  −0.80  1.57 )T

δW′v = ( −6.69  −9.18  −3.66  −11.75  23.36  −2.36  −2.702  0.26  −0.88 )T

It is obvious that the perturbations of the entries of W′u and W′v are of almost the same magnitude. Hence, in the constructed matrix Fn, all the entries contain approximately the same amount of perturbation. Similarly, when taking its closest singular matrix F′n, all entries are also treated approximately equally. Therefore, the computed perturbation δξ of the estimated parameters ξ is much smaller:

δξ = ( −0.0296  −0.0319  0.0407  −0.0171  0.0357  0.0155  −2.7909  −5.6009 )T

Based on the preceding calculations, we see that the parameters can be estimated

more stably in the presence of image noise via the proposed N-SVD method.

6.5 Nonlinear Decomposition Method

The proposed N-SVD decomposition method will obtain a unique solution

for each face image. But if the solution is correct, then the recovered face pose

parameters mij must automatically satisfy the following constraints:

f_1(\xi) = m_{11}^2 + m_{12}^2 + m_{13}^2 - (m_{21}^2 + m_{22}^2 + m_{23}^2) = 0 \qquad (6.17)

f_2(\xi) = m_{11} m_{21} + m_{12} m_{22} + m_{13} m_{23} = 0 \qquad (6.18)

The above constraints are derived from equation 6.6 by considering the face pose

rotation matrix R as an orthonormal matrix.

Apparently, however, both constraints are ignored by the proposed N-SVD

method, which cannot guarantee that the recovered matrix R is an orthonormal

matrix. Therefore, the above two constraints must be considered in order to guar-

antee that the recovered face pose rotation matrix R is orthonormal.

Due to the non-linearity of these constraints, a nonlinear optimization method


is utilized to recover the face pose and facial expression parameters simultaneously,

subject to the orthonormal constraints. From equation 6.7, the image projection

function for a facial feature can be re-written as follows:

u_i = A_1 X^N_i + \sum_{k=1}^{p} \alpha_k A_1 \Delta Q^k_i \qquad (6.19)

v_i = A_2 X^N_i + \sum_{k=1}^{p} \alpha_k A_2 \Delta Q^k_i \qquad (6.20)

Therefore, from l facial features, we can build a positive error function fe:

f_e(\xi) = \sum_{i=1}^{l} \left[ \Delta u_i^2 + \Delta v_i^2 \right] \qquad (6.21)

= \sum_{i=1}^{l} \left[ (u_i - u_{pi})^2 + (v_i - v_{pi})^2 \right] \qquad (6.22)

where the image pixel errors ∆ui and ∆vi are the differences between the coordinates of the tracked facial features (ui, vi)T and those of the predicted facial features (upi, vpi)T computed by projecting their corresponding 3D facial feature points (xi, yi, zi)T via the derived equations 6.19 and 6.20.

In the error function fe, there are 6+p unknowns, including 6 pose parameters

and p expression parameters. Therefore, the unknown parameter vector ξ∗ can be

estimated in terms of the following constrained minimization problem:

\xi^* = \arg\min_{\xi} f_e(\xi), \quad \text{subject to the constraints } f_1(\xi) = 0 \text{ and } f_2(\xi) = 0 \qquad (6.23)

The above constrained minimization is solved by the sequential quadratic pro-

gramming (SQP) method [97], which is one of the most effective methods for solving

the optimization problem with nonlinear constraints. At each step, a quadratic pro-

gramming (QP) subproblem is solved using an active set strategy [31].

The algorithm needs an initial value, which is provided by the proposed N-

SVD method. Therefore, given a face image, the first step for our algorithm is to


extract an initial value of ξ by the proposed N-SVD decomposition method; next,

the estimated value of ξ is further refined by the proposed nonlinear optimization

method. In this way, the nonlinear algorithm can efficiently converge to the optimal

value in less than 10 iterations. Experiments show that via the proposed estima-

tion algorithm, the face pose and facial expression parameters can be recovered

accurately and efficiently.
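As an illustration of this two-stage scheme, the following Python sketch refines an N-SVD initial estimate with SciPy's SLSQP routine, a sequential quadratic programming implementation. It is only a sketch under the stated assumptions, with illustrative names, and is not the thesis code (which uses the SQP method of [97]).

    import numpy as np
    from scipy.optimize import minimize

    def refine_pose_expression(xi0, X_neutral, delta_Q, U_obs):
        # xi0: N-SVD initial estimate (m11..m23, a1..ap); X_neutral: (l, 3) neutral features;
        # delta_Q: (p, l, 3) deformation basis; U_obs: (l, 2) tracked features relative to
        # the face center.

        def f_e(xi):                                    # error function of equation 6.22
            M = xi[:6].reshape(2, 3)
            X = X_neutral + np.tensordot(xi[6:], delta_Q, axes=1)
            return np.sum((X @ M.T - U_obs) ** 2)

        constraints = [
            {'type': 'eq',                              # f1, equation 6.17
             'fun': lambda xi: np.sum(xi[:3] ** 2) - np.sum(xi[3:6] ** 2)},
            {'type': 'eq',                              # f2, equation 6.18
             'fun': lambda xi: np.dot(xi[:3], xi[3:6])},
        ]
        result = minimize(f_e, xi0, method='SLSQP', constraints=constraints)
        return result.x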

6.6 Experiment Results

Once the pose and expression parameters are recovered from the face images,

the facial expression will be independent of the face pose. In essence, if the decom-

position is successful, then both parameters will be estimated accurately; otherwise,

neither of them will be accurate. Therefore, the performance of the proposed mo-

tion decomposition technique can be evaluated on either the facial expression or face

pose parameters individually.

In the following, several experiments are conducted to test the validity of our

proposed motion decomposition algorithm. First, the performance is analyzed on a

set of synthetic data. Next, real face image sequences with various facial expressions

under natural head movements were collected and the performance of the proposed

techniques on them is subsequently analyzed. In these experiments, a Root-Mean-

Square error (RMSE) measure is used to estimate the accuracy of the recovered

results. The RMSE of an estimated vector is defined as the square root of the mean squared difference between the estimated vector and the true vector, i.e., the squared Euclidean norm of the difference divided by the vector dimension.
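In Python/numpy form, this error measure is simply the following helper; it is a restatement of the definition above, not thesis code. With the true vector set to zero it reduces to equation 6.25, and with a reference vector it gives equations 6.26 and 6.27.

    import numpy as np

    def rmse(estimated, true=0.0):
        # Square root of the mean squared component-wise difference.
        d = np.asarray(estimated) - np.asarray(true)
        return np.sqrt(np.sum(d ** 2) / d.size)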

6.6.1 Performance on Synthetic Data

Synthetic data is generated from a set of basis facial deformation vectors and

a 3D neutral face mesh. The 3D neutral face mesh is obtained from a 3D generic

face model. Specifically, in this experiment, the number of basis facial deformation

vectors utilized is equal to 4. In addition, a set of face pose parameters M and

facial expression parameters αi are randomly sampled. Based on them, a sequence

that contains 300 frames of image points is generated. Finally, different levels of


Gaussian noise with a zero mean are added into the generated coordinates of image

points at each frame. For each noise level, the errors associated with the parameters

(pose and expression) recovered by the original SVD method, N-SVD method and

the nonlinear method are computed from the 300-frame sequence.

When computing the errors, the pose parameters M and expression parameters

αi originally employed to generate the synthetic data serve as the ground truth. In

addition, for the nonlinear decomposition method, the solution obtained from the

N-SVD method is employed as an initial estimate.

In order to express the parameter errors in a meaningful and measurable quan-

tity, the following error measurement technique is performed. First, from the face

pose parameters M , three face pose angles and the face scale factor are recovered.

Therefore, the pose parameter error is characterized by the face pose angle error

in degrees and the face scale factor error in percentage. Second, from the facial

expression parameters αk, the facial deformation vector ∆X is computed as follows:

\Delta X = \sum_{k=1}^{p} \alpha_k \, \Delta Q^k \qquad (6.24)

Hence, the facial expression parameter error can be characterized by the RMSE of

the facial deformations in pixels.

As discussed in Section 6.4, the original SVD decomposition method is very

sensitive to image noise; Figure 6.2 (a-c) illustrates its behaviors on the synthesized

image frames in the presence of noise. Apparently, as shown in Figure 6.2 (a-c), small

image noise produces large errors on the estimated pose and expression parameters.

On the other hand, for the proposed N-SVD method, the stability of the estimated

parameters under the image noise improves dramatically as shown in Figure 6.2

(a-c), and its accuracy degrades gracefully with the image noise level.

In addition, Figure 6.3 illustrates the improvements of the estimated parame-

ters via the proposed nonlinear method over the synthesized 300 frames as a function

of the standard deviation of the Gaussian noise. It clearly shows that the nonlinear

method reduces the errors of the parameters estimated from the N-SVD method

significantly as the image noise level increases.

Figure 6.2: Average errors of the estimated parameters by the SVD method and the proposed N-SVD method respectively as a function of Gaussian noise: (a) face pose error; (b) face scale factor error; (c) facial deformation error

Figure 6.3: Average errors of the estimated parameters by the proposed N-SVD method and nonlinear method respectively as a function of Gaussian noise: (a) face pose error; (b) face scale factor error; (c) facial deformation error

6.6.2 Performance on Real Image Sequences

6.6.2.1 Neutral Face Under Various Face Orientations

In this experiment, subjects move their heads freely in front of the camera while keeping a neutral facial expression. Significant out-of-plane face rotations are included, along with significant changes in the distance from the face to the camera.

Randomly selected face images with automatically tracked facial features from dif-

ferent image sequences are shown in Figure 6.4. Based on the tracked facial features,

the proposed motion decomposition method is subsequently applied to recover the

face pose and the facial expression for each face image.

Figure 6.4: The randomly selected face images from a set of different neutral face image sequences

Since these face image sequences contain only a neutral face, ideally, the extracted facial deformation vector V is close to zero. Thus, the RMSE of an estimated

facial deformation vector V can be simplified as follows:

\mathrm{RMS}(V) = \sqrt{\frac{\|V\|^2}{n}} \qquad (6.25)

where n is the dimension of the vector. Therefore, the RMSE of the estimated

facial deformation vector can be used as a metric to evaluate the performance of

the proposed technique: the smaller the RMSE, the better the performance of the

proposed technique.

Figure 6.5 shows the calculated RMSEs of the estimated facial deformation

vectors with the proposed N-SVD method as well as the nonlinear method respec-

tively for the two different image sequences, shown in Figure 6.4, correspondingly.

In these experiments, the output of the N-SVD method is always refined by the non-

linear method. As shown in Figure 6.5, the proposed nonlinear method works very

well and improves the N-SVD solution dramatically so that the facial deformations

can be accurately estimated. On the other hand, if the pose effects on the facial

features are ignored (all the faces are treated as frontal faces), then the RMSEs of

the calculated facial deformation vectors are significantly large, as shown in Figure

6.5.

Figure 6.5: The calculated RMSEs of the estimated facial deformations

Figure 6.6 (a) also displays the three face pose angles estimated by the proposed method, and the estimated face scale factor is displayed in Figure 6.6 (b). The face scale factor characterizes the distance between the face and the camera. Visually, they follow the movements of the face in the images very well.

6.6.2.2 Frontal Face with Different Facial Expressions

We also collected a set of face image sequences that contain significant facial

expressions from the frontal view. In each sequence, a person is changing his facial

expressions while facing the camera directly. Randomly selected face images with

automatically tracked facial features from these image sequences are shown in Figure

6.7.

Since the face in the image is frontal, the three pose angles of the face in the image must be approximately zero. In addition, the movements of the facial features in a frontal face image can be extracted directly by subtracting them from their positions in the neutral face image. Hence, for each face image, a facial deformation vector V′k can be obtained. Via the proposed motion decomposition method, a facial deformation vector Vk can be estimated. Therefore, by choosing V′k as the ground truth, the RMSE of an estimated facial deformation vector Vk is defined as follows:

\mathrm{RMS}(V_k) = \sqrt{\frac{\|V_k - V'_k\|^2}{n}} \qquad (6.26)


Figure 6.6: (a) The three estimated face pose angles; (b) The estimated face scale factor

The calculated RMSE of an estimated facial deformation vector can be used

as a metric to evaluate the performance of the proposed motion decomposition tech-

nique: the smaller the RMSE, the better the performance of the proposed technique.

Figure 6.8 (a) illustrates the RMSEs of the estimated facial deformation vectors for

different face image sequences. These results show that the proposed nonlinear method can improve the solution of the proposed N-SVD method significantly, so that the average value of the computed RMSEs of the facial deformations is only around 1 pixel, which is precise enough to discriminate most subtle facial expressions. Table 6.1 quantitatively summarizes the calculated RMSEs of the estimated

facial deformation vectors for the image sequences used in these experiments.

Figure 6.7: The randomly selected images from a frontal face image sequence

Table 6.1: The average RMSEs of the extracted facial deformation vectors for different image sequences

Sequence Number    N-SVD Method (pixels)    Nonlinear Method (pixels)
1                  1.40                     0.75
2                  1.27                     0.92
3                  1.20                     0.89

Figure 6.8 (b) further illustrates the average error of the estimated face pose angles for different face image sequences. The calculated average error of the estimated face pose angles is summarized quantitatively in Table 6.2. The nonlinear method improves the solution of the N-SVD method so that the average error of the estimated face pose angles is approximately one degree.

6.6.2.3 Non-neutral Face Under Various Face Orientations

We also collected a set of image sequences containing simultaneous face pose

and facial expression changes. In each sequence, a person rotates his head freely

in front of the camera (but always starting from frontal view), while keeping the

Figure 6.8: (a) The calculated RMSE of the estimated facial deformation vectors; (b) The average error of the estimated face pose angles

Table 6.2: The average error of the extracted face pose angles for different image sequences

Sequence Number    N-SVD Method (degrees)    Nonlinear Method (degrees)
1                  1.51                      0.52
2                  1.71                      0.82
3                  3.00                      1.29

facial expression unchanged. Randomly selected face images for these sequences are

shown in Figure 6.9.

For example, in the “happy” face image sequence, each image contains a face

with the same facial expression, while rotating in front of the camera. Therefore, the

facial deformation vectors in all face images are equivalent. Since frame 0 contains

a frontal face, its extracted facial deformation vector V0 can serve as the ground

truth. Specifically, V0 is obtained by directly subtracting the face in frame 0 from

its neutral face. Therefore, the RMSE of an estimated facial deformation vector Vk

Figure 6.9: The randomly selected images from three face image sequences with different facial expressions. (Top: happy; Middle: surprise; Bottom: disgust)

is defined as follows:

\mathrm{RMS}(V_k) = \sqrt{\frac{\|V_k - V_0\|^2}{n}} \qquad (6.27)

Figure 6.10 (a) shows the calculated RMSEs of the estimated facial deforma-

tion vectors for a “happy” face image sequence with the proposed technique. It

shows that the nonlinear technique performs much better than the N-SVD method

alone on the face images with simultaneous face pose and facial expression changes.

In addition, Figure 6.10 (b) and (c) display the three estimated face pose angles and

the estimated face scale factor, respectively. They visually follow the movements of

the face in the images very well.

Similar to the “happy” image sequence, the RMSEs of the estimated facial

deformation vectors from two other image sequences are illustrated in Figure 6.10.

Table 6.3 summarizes the calculated RMSEs of the estimated facial deformation

vectors for the image sequences used in the experiments. It clearly shows that the proposed two-stage motion estimation technique achieves very good results, so that most of the subtle facial expressions can still be discriminated via the recovered facial deformations.

Figure 6.10: (a) The calculated RMSE of the estimated facial deformation vectors; (b) The three estimated face pose angles; (c) The estimated face scale factor

6.6.3 Processing Speed

The proposed facial motion decomposition technique is implemented in C++ on a PC with a Xeon (TM) 2.80 GHz CPU and 1.00 GB of RAM. The resolution of the captured images is 320 × 240 pixels, and the built facial motion decomposition system, integrated with the facial feature tracker, runs comfortably at approximately 20 frames per second.

Table 6.3: The average RMSEs of the extracted facial deformation vectors for different image sequences

Sequence Type    N-SVD Method (pixels)    Nonlinear Method (pixels)    Without Pose Elimination (pixels)
Happy            2.25                     0.88                         6.09
Surprise         2.01                     1.17                         5.31
Disgust          1.84                     1.01                         5.87

6.7 Chapter Summary

In this chapter, a novel technique is presented to simultaneously recover the

rigid and nonrigid facial motions from the face images accurately. The coupling

effects of both motions in the 2D face image are first analytically modelled into a nonlinear motion projection function, and then decomposed by the proposed two-

stage motion decomposition technique in a very efficient way. Experiments show

that the proposed method can simultaneously recover the rigid and nonrigid motion

of the face very accurately.

The main contributions of this chapter are summarized as follows. First, the face pose and the facial expression parameters are analytically integrated into a unified formulation, which can be solved efficiently by a nonlinear optimization method combined with the N-SVD technique. Second, a real-time system is built so that the face pose and facial expression parameters can be extracted accurately as soon as the user sits in front of the camera, without black make-up or markers.


CHAPTER 7

Facial Expression Recognition

7.1 Introduction

Via the technique proposed in chapter 6, the rigid facial motion related to face

pose and the non-rigid facial motion related to facial expression can be separated

successfully from a face image. In addition, the recovered non-rigid facial motion

is composed of the movements of a set of prominent facial features relative to the

neutral face. Since these non-rigid facial feature movements are closely related to

the generation of facial expressions, the facial expressions can be recognized from

them intuitively. In this chapter, based on the non-rigid facial motions recovered

from the face images, a computational model is constructed to model and under-

stand the facial expressions with the use of Dynamic Bayesian Networks (DBN)

[128]. Although a facial expression recognition framework with the use of DBN has

been introduced [128], no working system has ever been built to recognize facial

expressions in practice. Therefore, in this chapter, efforts are focused on developing

a real-time working system based on DBN. Finally, a facial expression recognition

system is built so that the six basic facial expressions can be recognized successfully

under natural head movements in real time.

7.2 Facial Expressions with AUs

The Facial Action Coding System (FACS) developed by Ekman et al. [24] is

the most comprehensive method of coding facial expressions. With the use of Action

Units (AUs) defined in FACS, all possible visually distinguishable facial movements

can be coded successfully into either a single AU or a combination of different AUs.

In essence, a facial expression can be expressed as a combination of different AUs

uniquely. Specifically, for the six basic facial expressions, each facial expression is

characterized by a set of different AUs as shown in Table 7.1. For example, AU12

(lip corner puller) can be directly associated with an expression of “happy,” and

AU9 can be directly associated with an expression of “disgust.”


Table 7.1: The association of six basic facial expressions with AUs

Expressions    Associated AUs
Happy          6, 12, 25, 26
Anger          2, 4, 7, 17, 23, 24, 25, 26
Sadness        1, 4, 7, 15, 17
Disgust        9, 10, 17, 25, 26
Fear           1, 2, 4, 20, 25, 26
Surprise       1, 2, 5, 26, 27

7.3 Coding AUs with Feature Movement Parameters

A single AU describes a type of specific facial appearance changes that occur

with muscular contractions in certain facial regions. Usually, facial appearance

changes can be revealed directly from the movements of the facial features involved

with changes in appearance. Therefore, the movements of facial features can be

utilized to quantitatively measure and code the AUs. The recovered non-rigid facial

motion is composed of the movements of a set of twenty-eight selected facial features

as shown in Figure 7.1; hence, it can be used to code the AU directly.

Figure 7.1: The spatial geometry of the selected facial features marked by the dark dots

Since an AU is related to several different facial features, Table 7.2 groups different facial features into the AUs relevant to the six basic facial expressions. In addition, a set of parameters named “Feature Movement Parameters” or “FMPs,” defined and derived from the movements of the facial features, are shown in Table 7.2. In total, there are 33 FMPs defined. Each face image is normalized to the same scale


as the one in the neutral facial expression before the FMPs are extracted.

7.4 Modelling Spatial Dependency

Tables 7.1 and 7.2 deterministically characterize the relationships between facial expressions and AUs as well as the relationships between FMPs and AUs. To

account for the uncertainties associated with facial feature movement measurements

and facial expressions, the otherwise deterministic relations are cast in a probabilistic

framework by a static BN model, as shown in Figure 7.2. The static BN model of

the facial expression consists of three layers: expression layer, facial AU layer, and

FMP layer.

Figure 7.2: The BN model of six basic facial expressions. In this model, “HAP” represents “Happy,” “ANG” represents “Anger,” “SAD” represents “Sad,” “DIS” represents “Disgust,” “FEA” represents “Fear,” and “SUR” represents “Surprise”

The expression layer consists of hypothesis variable C, including six states

c1, c2, · · · , c6, which represent the six basic expressions, and a set of attribute vari-

ables denoted as HAP , ANG, SAD, DIS, FEA and SUR corresponding to the six

basic facial expressions, as shown in Figure 7.2. The goal of this level of abstraction is to find the probability distribution over the six facial expressions, where the belief in each class state ci represents the probability of expression ci given the observed FMPs.


Table 7.2: The association between facial action units and facial feature movement parameters (FMPs)

AUs    Description           Name of FMP               Value              FMPs
AU1    Inner brow raiser     raise l i eyebrow         Dy(1.3)            F1
                             raise r i eyebrow         Dy(1.4)            F2
AU2    Outer brow raiser     raise l o eyebrow         Dy(1.1)            F3
                             raise r o eyebrow         Dy(1.6)            F4
AU4    Brow lower            lower l i eyebrow         Dy(1.3)            F5
                             lower r i eyebrow         Dy(1.4)            F6
                             lower l m eyebrow         Dy(1.2)            F7
                             lower r m eyebrow         Dy(1.5)            F8
                             squeeze l eyebrow         Dx(1.3)            F9
                             squeeze r eyebrow         Dx(1.4)            F10
AU5    Upper lid raiser      raise l t eyelid          Dy(2.2)            F11
                             raise r t eyelid          Dy(2.8)            F12
AU6    Cheek raiser          raise l eyecorner         Dy(2.1)            F13
                             raise r eyecorner         Dy(2.11)           F14
AU7    Lid tightener         close l eye               Dy(2.2)−Dy(2.3)    F15
                             close r eye               Dy(2.8)−Dy(2.9)    F16
AU9    Nose wrinkler         stretch l nose            Dy(3.2)            F17
                             stretch r nose            Dy(3.3)            F18
AU10   Upper lip raiser      raise t m lip             Dy(4.4)            F19
                             raise t l lip             Dy(4.2)            F20
                             raise t r lip             Dy(4.6)            F21
AU12   Lip corner puller     raise l c lip             Dy(4.1)            F22
                             raise r c lip             Dy(4.8)            F23
                             stretch l c lip           Dx(4.1)            F24
                             stretch r c lip           Dx(4.8)            F25
AU15   Lip corner depressor  lower l c lip             Dy(4.1)            F26
                             lower r c lip             Dy(4.8)            F27
AU17   Chin raiser           raise b m lip             Dy(4.5)            F28
AU20   Lip stretcher         stretch l c lip           Dx(4.1)            F24
                             stretch r c lip           Dx(4.8)            F25
                             raise b m lip             Dy(4.5)            F28
AU23   Lip tightener         tight l c lip             Dx(4.1)            F29
                             tight r c lip             Dx(4.8)            F30
AU24   Lip pressor           lower t m lip             Dy(4.4)            F31
                             raise b m lip             Dy(4.5)            F28
AU25   Lips part             open jaw (slight)         Dy(4.4)−Dy(4.5)    F32
                             lower b midlip (slight)   Dy(4.5)            F33
AU26   Jaw drop              open jaw (middle)         Dy(4.4)−Dy(4.5)    F32
                             lower b midlip (middle)   Dy(4.5)            F33
AU27   Mouth stretch         open jaw (large)          Dy(4.4)−Dy(4.5)    F32
                             lower b midlip (large)    Dy(4.5)            F33

Note: functions Dx and Dy extract the x and y components of a facial feature movement respectively (see Figure 7.1 for the facial feature definition). In the FMP names, "l" represents "left," "r" represents "right," "i" represents "inner," "o" represents "outer," "m" represents "middle," "t" represents "top," "b" represents "bottom," and "c" represents "corner."
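For completeness, the AU-to-FMP association in Table 7.2 can be kept in a simple lookup structure; the excerpt below is a Python sketch that transcribes a few rows of the table and adds nothing beyond them.

    # Partial AU -> FMP mapping transcribed from Table 7.2 (excerpt only).
    AU_TO_FMPS = {
        "AU1":  ["F1", "F2"],                  # inner brow raiser
        "AU2":  ["F3", "F4"],                  # outer brow raiser
        "AU12": ["F22", "F23", "F24", "F25"],  # lip corner puller
        "AU20": ["F24", "F25", "F28"],         # lip stretcher (shares FMPs with AU12 and AU17)
        "AU27": ["F32", "F33"],                # mouth stretch
    }

    def fmps_for_au(au):
        # Return the FMPs observed for a given action unit.
        return AU_TO_FMPS.get(au, [])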


The AU layer is analogous to a linguistic description of the relationship between AUs and facial expressions in Table 7.1. Each expression category, which is actually an attribute node in the classification layer, consists of a set of AUs. These AUs contribute visual cues to the understanding of the facial expression. The lowest layer of the model is the sensory data layer containing the FMPs, as given in Table 7.2. All the FMPs are observable, and they are connected to the corresponding AUs. The value of each FMP is segmented into three ranges to differentiate the intensity of an individual muscular action (e.g., low, middle, high). The ranges of variation are determined by statistically analyzing the Cohn-Kanade facial expression database [55].
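As an illustration of this three-range quantization, the sketch below discretizes an FMP value into low/middle/high states; the percentile-based cut points are an assumption made here for illustration, whereas the thesis derives the ranges statistically from the Cohn-Kanade database.

    import numpy as np

    def learn_fmp_bins(training_values, low_q=33, high_q=66):
        # Cut points separating "low", "middle" and "high" intensity for one FMP.
        # Percentile thresholds are an illustrative choice, not the thesis method.
        return np.percentile(training_values, [low_q, high_q])

    def discretize_fmp(value, cuts):
        lo, hi = cuts
        if value <= lo:
            return 0    # low intensity
        elif value <= hi:
            return 1    # middle intensity
        return 2        # high intensity

    cuts = learn_fmp_bins(np.random.randn(300))   # e.g. values from training sequences
    state = discretize_fmp(0.7, cuts)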

Since the relationship between facial motion behaviors and facial expressions is determined by human psychology, this theoretically alleviates the influence of inter-personal variations on the facial expression model. Hence, the topology of the BN facial expression model is invariant over time. Nevertheless, the model needs to be parameterized by the conditional probabilities of the intermediate nodes. The conditional probabilities of AUs for a given facial expression are based on the statistical results produced by a group of AU coders through visual inspection. The parameters of the conditional probabilities in the FMP layer are estimated by Maximum Likelihood Estimation from the FMPs extracted from 300 image sequences covering the six basic expressions in the Cohn-Kanade facial expression database [55]. Since each conditional probability in the AU layer is known, the remaining parameters are guaranteed to converge to a local maximum of the likelihood surface.
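To illustrate how such a layered model can be evaluated, the following deliberately simplified sketch performs inference by enumeration on a tiny, tree-shaped stand-in for the three-layer BN: the CPTs are random placeholders, only a few AUs and FMPs are included, and each FMP is given a single AU parent, none of which holds for the actual model described above.

    import numpy as np

    EXPRESSIONS = ["HAP", "ANG", "SAD", "DIS", "FEA", "SUR"]
    rng = np.random.default_rng(0)

    prior_C = np.full(6, 1.0 / 6)                    # P(C), uniform prior
    AUS = ["AU1", "AU12", "AU27"]                    # tiny subset for illustration
    # P(AU state | C): shape (6 expressions, 2 AU states), random placeholders
    P_au_given_C = {au: rng.dirichlet(np.ones(2), size=6) for au in AUS}
    FMP_PARENT = {"F1": "AU1", "F22": "AU12", "F32": "AU27"}
    # P(FMP state | AU state): shape (2 AU states, 3 intensity states)
    P_fmp_given_au = {f: rng.dirichlet(np.ones(3), size=2) for f in FMP_PARENT}

    def posterior_expression(evidence):
        # evidence: dict mapping FMP name -> observed discrete state (0, 1 or 2)
        post = prior_C.copy()
        for au in AUS:
            children = [f for f, par in FMP_PARENT.items() if par == au and f in evidence]
            msg = np.zeros(6)
            for c in range(6):
                for a in range(2):        # sum out the hidden AU state
                    lik = np.prod([P_fmp_given_au[f][a, evidence[f]] for f in children])
                    msg[c] += P_au_given_C[au][c, a] * lik
            post *= msg
        return post / post.sum()

    print(dict(zip(EXPRESSIONS, posterior_expression({"F1": 2, "F22": 0, "F32": 1}))))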

7.5 Modelling Temporal Dynamics

Facial expression often reveals not only the nature of the deformation of the face, but also the relative timing of facial actions and their temporal evolution. A facial action occurs when a muscular contraction begins and increases in intensity. The apex of the process is indicated by the maximum excursion of the muscle, while the offset is observed in the relaxation of the muscular action. Modelling such a temporal course of facial expressions allows us to better understand the facial representation of human emotion at each stage of its development.

The static BN model of facial expression works with visual evidence and beliefs at a single time instant, and it lacks the ability to capture the temporal dependencies between consecutive occurrences of an expression in an image sequence. In contrast, a Dynamic Bayesian Network (DBN) can be utilized to achieve spatio-temporal analysis and interpretation of facial expressions.

Our DBN model is made up of interconnected time slices of a static BN, and two neighboring time slices are linked by a first-order HMM. The relative timing of facial actions during the emotional evolution process is expressed by moving a time frame in accordance with the frame motion of a video sequence, so that the visual information at the previous time instant provides diagnostic support for the current expression hypothesis. Eventually, the values for the current hypothesis are inferred from the combined information of the current visual cues through causal dependencies in the current time slice, as well as from the preceding evidence obtained through temporal dependencies. Figure 7.3 shows the temporal dependencies derived by linking the top nodes of the BN model given in Figure 7.2.


Figure 7.3: The temporal links of the DBN for modelling facial expression (two time slices are shown, since the structure repeats by "unrolling" the two-slice BN). Node notations are given in Figure 7.2

The expression hypothesis obtained from the preceding time slice serves as a priori information for the current hypothesis, and it is integrated with the current data to produce a posteriori estimate of the current facial expression. More details can be found in [128].

Therefore, after the DBN facial expression model is parameterized, given a

set of FMPs measured from a face image, the facial expression can be inferred

successfully from the DBN model via belief propagation over time.
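A single step of this temporal inference can be sketched as a forward update in the spirit of the first-order HMM linking described above; the sticky transition matrix and the example likelihood vector are illustrative assumptions, not the learned parameters of the model.

    import numpy as np

    def dbn_forward_update(prev_belief, transition, slice_likelihood):
        # prev_belief:      P(expression at t-1 | evidence up to t-1), shape (6,)
        # transition:       first-order HMM matrix, transition[i, j] = P(E_t = j | E_{t-1} = i)
        # slice_likelihood: likelihood of the current slice's FMP evidence for each
        #                   expression, e.g. an unnormalized output of the static BN
        predicted = prev_belief @ transition       # temporal support from the previous slice
        posterior = predicted * slice_likelihood   # combine with the current visual cues
        return posterior / posterior.sum()

    # Example with a "sticky" transition matrix (an illustrative assumption).
    A = 0.8 * np.eye(6) + (0.2 / 6) * np.ones((6, 6))
    A /= A.sum(axis=1, keepdims=True)
    belief = np.full(6, 1.0 / 6)
    belief = dbn_forward_update(belief, A, np.array([0.9, 0.02, 0.02, 0.02, 0.02, 0.02]))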

7.6 Experiment Results

The output of the DBN model is a probability distribution over the six basic facial expressions as a function of the image frame. If the six basic facial expressions are equally likely, the face is in a completely neutral state; otherwise, the facial expression is identified as the one with the highest probability. However, if multiple facial expressions show pronounced probabilities, either the subject is performing a blended emotion or the system is confusing the facial actions.
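The decision rule just described can be sketched as follows; the tolerance used to declare a neutral face and the ratio used to flag a blended emotion are illustrative thresholds, not values taken from the thesis.

    import numpy as np

    def interpret_distribution(p, neutral_tol=0.05, blend_ratio=0.8):
        # p: posterior over the six basic expressions for the current frame.
        labels = ["HAP", "SUP", "SAD", "ANG", "DIS", "FEA"]
        if np.all(np.abs(p - 1.0 / len(p)) < neutral_tol):
            return "NEU"                          # near-uniform distribution: neutral face
        order = np.argsort(p)[::-1]
        best, second = order[0], order[1]
        if p[second] >= blend_ratio * p[best]:
            # two comparably strong expressions: blended emotion or confusion
            return labels[best] + "+" + labels[second]
        return labels[best]

    print(interpret_distribution(np.array([0.55, 0.30, 0.04, 0.04, 0.04, 0.03])))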

Figure 7.4: Upper: a video sequence with 700 frames containing the six basic facial expressions; only 8 snapshots (frames 21, 70, 161, 294, 392, 483, 602 and 672) are shown for illustration. Bottom: the output probability distributions (emotional intensities) over the six basic facial expressions, obtained by sampling the sequence every 7 frames

Figure 7.4 illustrates the system output, showing the temporal course of the facial expressions obtained by sampling a video sequence every 7 frames. The sequence has 700 frames and contains the six basic facial expressions plus the neutral states between them. Although there is a certain amount of confusion (e.g., between surprise and fear) due to ambiguity in their appearance and errors involved in tracking, the overall performance in modelling the dynamics of emotional expressions is good.

The inability of current facial expression recognition systems to correlate and reason about facial temporal information over time is an impediment to providing a coherent overview of the dynamic behavior of facial expressions in an image sequence. The proposed approach enables an automated facial expression recognition system not only to recognize the facial expressions, but also to model their temporal behavior, so that the various stages of the development of a human emotion can be visually analyzed and dynamically interpreted by machine.

To further quantify the performance of our technique, we compare the automated facial expression recognition against manual recognition on another 700-frame sequence. The manually labelled frames are compared with the results from our automated system, and Table 7.3 summarizes the confusion statistics against this visually inspected ground truth. The results show good performance on this sequence; however, our primary goal is to capture the temporal course and intensity of facial expressions across an image sequence rather than to maximize recognition accuracy on individual images.

Table 7.3: Confusion statistics from the 700-frame sequence

GROUND            RECOGNIZED FACIAL EXPRESSIONS
TRUTH     HAP   SUP   SAD   ANG   DIS   FEA   NEU   Tot.
HAP        96     0     0     0     0     0    10    106
SUP         0    78     0     4     0     0     8     90
SAD         0     0    92     0     0     8     5    105
ANG         0     0     0    80     0     0     6     86
DIS         3     7     0     0    64     0     3     77
FEA         0     9    13     0     0    73     8    103
NEU         0     0     0     0     0     0   133    133
Tot.       99    94   105    84    64    81   173    700

Note: NEU denotes neutral.
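Confusion statistics of this kind can be accumulated frame by frame as in the short sketch below; the label set mirrors Table 7.3, while the example frame labels are made up for illustration.

    import numpy as np

    LABELS = ["HAP", "SUP", "SAD", "ANG", "DIS", "FEA", "NEU"]

    def confusion_matrix(ground_truth, recognized):
        # Rows: ground-truth label per frame; columns: label recognized by the system.
        idx = {name: i for i, name in enumerate(LABELS)}
        cm = np.zeros((len(LABELS), len(LABELS)), dtype=int)
        for g, r in zip(ground_truth, recognized):
            cm[idx[g], idx[r]] += 1
        return cm

    cm = confusion_matrix(["HAP", "HAP", "SUP", "FEA"], ["HAP", "NEU", "SUP", "SAD"])
    overall_accuracy = cm.diagonal().sum() / cm.sum()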


7.6.1 Processing Speed

The proposed facial expression modelling is implemented in C++ on a PC with a 2.80 GHz Xeon (TM) CPU and 1.00 GB of RAM. The resolution of the captured images is 320 × 240 pixels, and the complete facial expression recognition system, integrated with the facial feature tracker as well as the facial motion decomposition, runs comfortably at approximately 20 fps.

7.7 Chapter Summary

In this chapter, based on the non-rigid facial motion recovered from face images, a probabilistic model is constructed to model and understand the six basic facial expressions with the use of Dynamic Bayesian Networks (DBNs). A real-time facial expression recognition system has been successfully built, so that the six basic facial expressions can be recognized under natural head movements in real time. Compared to most existing facial expression recognition systems, our system allows natural head movements, which goes beyond the current state of the art in real-time facial expression recognition research.


CHAPTER 8

Conclusion

This thesis addresses the problems of real-time and non-intrusive human facial behavior understanding for Human Computer Interaction. Computer vision techniques characterizing three typical facial behaviors, namely eye gaze, head gesture, and facial expression, are developed in this thesis. Several fundamental issues associated with each computer vision technique are addressed. In addition, we also make theoretical contributions in several areas of computer vision, including object detection and tracking, motion analysis and estimation, and pose estimation.

Specifically, the main contributions of the thesis are summarized as follows:

1. We present a new real time eye detection and tracking methodology that

works under variable and realistic lighting conditions as well as various face

orientations.

2. From the detected eye images, we propose an improved eye gaze tracking algorithm that allows natural head movement, with minimum personal calibration.

3. Via the proposed Case-Based Reasoning with a confidence paradigm, a robust visual tracking framework is proposed to track faces under significant changes in lighting, scale, facial expression and head movement.

4. With the use of robust feature representation via Gabor wavelets as well as

the global shape constraint among the facial features, twenty-eight prominent

facial features are detected and tracked simultaneously.

5. Based on the set of tracked facial features, a robust decomposition method is presented to separate the rigid head motion from the non-rigid facial motion in 2D face images.

6. Based on the recovered non-rigid facial motions, a DBN model is utilized to

recognize six basic facial expressions under natural head movement.


Experiments were conducted to test these proposed algorithms with numerous subjects under different lighting conditions. The algorithms were found to be robust, reliable and accurate.

Based on our findings and experimental results, future research could focus on improving each technique individually, as follows. For gaze tracking, high-resolution cameras can be utilized to improve gaze estimation accuracy, and new techniques are needed to further expand the allowable volume of head movement for more natural interaction between humans and computers. Facial feature tracking currently fails when some of the tracked facial features are occluded under large head orientations; therefore, an important future objective is to improve the robustness of facial feature tracking for near-profile face images. Finally, for the proposed CBR-based face tracking system, automatic maintenance of the face case base requires further study.

Besides further improving each proposed technique, an important future research goal is to build a prototype system that integrates all the component techniques proposed in this research. Combined with a probabilistic user model, the prototype system may be used to infer the user's needs, intentions and affective states for effective human computer interaction.


BIBLIOGRAPHY

[1] Applied Science Laboratories. http://www.a-s-l.com.
[2] LC Technologies, Inc. http://www.eyegaze.com.
[3] SensoMotoric. http://www.smi.de.
[4] A. Aamodt and E. Plaza. Case-based reasoning: foundational issues, methodological variations, and system approaches. AI Communications, 7(1):39–59, 1994.
[5] B. Bascle and A. Blake. Separability of pose and expression in facial tracking and animation. In International Conference on Computer Vision, pages 323–328, 1998.
[6] D. Beymer and M. Flickner. Eye gaze tracking using an active stereo head. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, 2003.
[7] M. Black and Y. Yacoob. Recognizing facial expression in image sequences using local parameterized models of image motion. IJCV, 25(1):23–48, 1997.
[8] A. Blake, R. Curwen, and A. Zisserman. A framework for spatio-temporal control in the tracking of visual contours. International Journal of Computer Vision, 11(2):127–145, 1993.
[9] A. Bobick, S. Intille, J. Davis, F. Baird, C. Pinhanez, L. Campbell, Y. Ivanov, A. Schutte, and A. Wilson. The KidsRoom: A Perceptually-Based Interactive and Immersive Story Environment. Technical Report 398, E15, 20 Ames Street, Cambridge, MA 02139, December 1996.
[10] M. Burl, T. Leung, and P. Perona. Face localization via shape statistics. In International Conference on Automatic Face and Gesture Recognition, Zurich, Switzerland, 1995.
[11] M. La Cascia, S. Sclaroff, and V. Athitsos. Fast, reliable head tracking under varying illumination: An approach based on robust registration of texture-mapped 3D models. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22, 2000.
[12] D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean shift. In IEEE Conference on CVPR00, 2000.

[13] W. S. Cooper. Use of optimal estimation theory, in particular the Kalman filter, in data analysis and signal processing. Rev. Sci. Instrument, 57(11):2862–2869, 1986.
[14] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. In European Conference on Computer Vision, Berlin, 1998.
[15] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.
[16] J. D. Daugman. Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression. IEEE Transactions on ASSP, (36):1169–1197, 1988.
[17] D. Decarlo, D. Metaxas, and M. Stone. An anthropometric face model using variational techniques. In 25th Annual Conference on Computer Graphics and Interactive Techniques, pages 67–74, 1998.
[18] A. T. Duchowski. Eye Tracking Methodology: Theory and Practice. Springer Verlag, 2002.
[19] Y. Ebisawa. Improved video-based eye-gaze detection method. IEEE Transactions on Instrumentation and Measurement, 47(2):948–955, 1998.
[20] Y. Ebisawa, M. Ohtani, and A. Sugioka. Proposal of a zoom and focus control method using an ultrasonic distance-meter for video-based eye-gaze detection under free-hand condition. In Proceedings of the 18th Annual International Conference of the IEEE Eng. in Medicine and Biology Society, 1996.
[21] Y. Ebisawa and S. Satoh. Effectiveness of pupil area detection technique using two light sources and image difference method. In Proceedings of the 15th Annual Int. Conf. of the IEEE Eng. in Medicine and Biology Society, 1993.
[22] G. Edwards. New software makes eyetracking viable: you can control computers with your eyes, 1998.
[23] P. Eisert and B. Girod. Model-based estimation of facial expression parameters from image sequences. In International Conference on Image Processing, pages 418–421, 1997.
[24] P. Ekman and W. Friesen. The Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Inc., San Francisco, CA, 1978.

[25] I. A. Essa and A. P. Pentland. Coding, analysis, interpretation, and recognition of facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):757–763, 1997.
[26] G. C. Feng and P. C. Yuen. Variance projection function and its application to eye detection for human face recognition. International Journal of Computer Vision, 19:899–906, 1998.
[27] G. C. Feng and P. C. Yuen. Multi-cues eye detection on gray intensity image. Pattern Recognition, 34:1033–1046, 2001.
[28] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[29] A. W. Fitzgibbon and R. B. Fisher. A buyer's guide to conic fitting. In 5th British Machine Vision Conference, pages 513–522, 1995.
[30] A. Gee and R. Cipolla. Determining the gaze of faces in images. Image and Vision Computing, 12:639–948, 1994.
[31] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. London, Academic Press, 1981.
[32] P. E. Gill, W. Murray, and M. H. Wright. Numerical Linear Algebra and Optimization. Addison-Wesley Publishing Company, 1991.
[33] I. D. Gluck. Optics. New York, Holt, Rinehart and Winston, 1964.
[34] S. B. Gokturk, J. Y. Bouguet, and R. Grzeszczuk. A data-driven model for monocular face tracking. In IEEE International Conference on Computer Vision, Vancouver, B.C., Canada, 2001.
[35] S. B. Gokturk, C. Tomasi, B. Girod, and J. Y. Bouguet. Model-based face tracking for view-independent facial expression recognition. In Fifth IEEE International Conference on Automatic Face and Gesture Recognition, Washington DC, 2002.
[36] P. W. Hallinan. Recognizing human eyes. In Proceedings of SPIE, Vol. 1570: Geometric Methods in Computer Vision, pages 212–226, 1991.
[37] A. Haro, M. Flickner, and I. Essa. Detecting and tracking eyes by using their physiological properties, dynamics, and appearance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000.
[38] R. Herpers, M. Michaelis, K. H. Lichtenauer, and G. Sommer. Edge and keypoint detection in facial regions. In Proceedings of the 2nd IEEE International Conference on Automatic Face and Gesture Recognition, 1996.

[39] J. Ho, K. Lee, M. Yang, and D. Kriegman. Visual tracking using learned linear subspaces. In IEEE Conference on CVPR04, 2004.
[40] J. Huang, D. Li, X. Shao, and H. Wechsler. Pose discrimination and eye detection using support vector machines (SVMs). In Proceedings of NATO-ASI on Face Recognition: From Theory to Applications, pages 528–536, 1998.
[41] J. Huang and H. Wechsler. Eye detection using optimal wavelet packets and radial basis functions (RBFs). International Journal of Pattern Recognition and Artificial Intelligence, 13(7):1009–1025, 1999.
[42] W. Huang and R. Mariani. Face detection and precise eyes location. In Proceedings of the International Conference on Pattern Recognition, 2000.
[43] T. E. Hutchinson. Eye movement detection with improved calibration and speed. U.S. patent 4950069, 1990.
[44] T. E. Hutchinson, K. P. White Jr., K. C. Reichert, and L. A. Frey. Human-computer interaction using eye-gaze input. In IEEE Transactions on Systems, Man, and Cybernetics, volume 19, pages 1527–1533, 1989.
[45] K. Hyoki, M. Shigeta, N. Tsuno, Y. Kawamuro, and T. Kinoshita. Quantitative electro-oculography and electroencephalography as indices of alertness. Electroencephalography and Clinical Neurophysiology, 106:213–219, 1998.
[46] A. Hyrshykari, P. Majaranta, A. Aaltonen, and K. Raiha. Design issues of iDict: A gaze-assisted translation aid, 2000.
[47] R. J. K. Jacob. The use of eye movements in human computer interaction techniques: What you look at is what you get. ACM Transactions on Information Systems, 9(3):152–169, 1991.
[48] R. J. K. Jacob. Eye-movement-based human-computer interaction techniques: Towards non-command interfaces. Volume 4, pages 151–190. Ablex Publishing Corporation, Norwood, NJ, 1993.
[49] R. J. K. Jacob and K. S. Karn. Eye tracking in human-computer interaction and usability research: Ready to deliver the promises. In The Mind's Eyes: Cognitive and Applied Aspects of Eye Movements, J. Hyona, R. Radach, H. Deubel (Eds.). Oxford, Elsevier Science, 2003.
[50] A. D. Jepson, D. J. Fleet, and T. F. El-Maraghi. Robust online appearance models for visual tracking. IEEE Transactions on PAMI, 25(10):1296–1311, 2003.

[51] Q. Ji and X. Yang. Real time visual cues extraction for monitoring driver vigilance. In Proceedings of the International Workshop on Computer Vision Systems, Vancouver, Canada, 2001.
[52] Q. Ji and X. Yang. Real-time eye, gaze, and face pose tracking for monitoring driver vigilance. Real Time Imaging, pages 357–377, 2002.
[53] Q. Ji and Z. Zhu. Eye and gaze tracking for interactive graphic display. In 2nd International Symposium on Smart Graphics, 2002.
[54] M. Kampmann and L. Zhang. Estimation of eye, eyebrow and nose features in videophone sequences. In International Workshop on Very Low Bitrate Video Coding, 1998.
[55] T. Kanade, J. F. Cohn, and Y. Tian. Comprehensive database for facial expression analysis. In Proceedings of the International Conference on Face and Gesture Recognition, 2000.
[56] S. Kawato and J. Ohya. Real-time detection of nodding and head-shaking by directly detecting and tracking the "between-eyes". In Proceedings of the IEEE 4th International Conference on Automatic Face and Gesture Recognition, 2000.
[57] S. Kawato and J. Ohya. Two-step approach for real-time eye tracking with a new filtering technique. In Proceedings of the International Conference on System, Man, and Cybernetics, pages 1366–1371, 2000.
[58] S. Kawato and N. Tetsutani. Detection and tracking of eyes for gaze-camera control. In Proceedings of the 15th International Conference on Vision Interface, 2002.
[59] S. Kawato and N. Tetsutani. Real-time detection of between-the-eyes with a circle frequency filter. In Proceedings of the Asian Conference on Computer Vision, 2002.
[60] M. LaCascia, S. Sclaroff, and V. Athitsos. Fast, reliable head tracking under varying illumination: An approach based on registration of texture-mapped 3D models. IEEE Transactions on PAMI, 22(4):322–336, 2000.
[61] K. M. Lam and H. Yan. Locating and extracting the eye in human face images. Pattern Recognition, 29:771–779, 1996.
[62] D. B. Leake. Case-based reasoning. AAAI Press/The MIT Press, 1996.
[63] K. Lee and D. Kriegman. Online learning of probabilistic appearance manifolds for video-based recognition and tracking. In IEEE Conference on CVPR05, 2005.

[64] T. S. Lee. Image representation using 2D Gabor wavelets. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 18(10):959–971, 1996.
[65] H. Li, P. Roivainen, and R. Forchheimer. 3D motion estimation in model-based facial image coding. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 15(6):545–555, 1993.
[66] J. Lim, D. Ross, R. Lin, and M. Yang. Incremental learning for visual tracking. In NIPS04, 2004.
[67] S. P. Liversedge and J. M. Findlay. Saccadic eye movements and cognition. Trends in Cognitive Science, 4(1):6–14, 2000.
[68] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence, 1981.
[69] P. Maes, T. Darrell, B. Blumberg, and A. Pentland. The ALIVE system: Wireless, full-body interaction with autonomous agents. ACM Multimedia Systems, pages 105–112, 1997.
[70] M. F. Mason, B. M. Hood, and C. N. Macrae. Look into my eyes: Gaze direction and person memory. Memory, 12:637–643, 2004.
[71] Y. Matsumoto, T. Ogasawara, and A. Zelinsky. Behavior recognition based on head pose and gaze direction measurement. In Proceedings of the 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2000.
[72] I. Matthews, T. Ishikawa, and S. Baker. The template update problem. IEEE Transactions on PAMI, 26(6):810–815, 2004.
[73] P. S. Maybeck. Stochastic Models, Estimation and Control, volume 1. Academic Press, Inc., 1979.
[74] S. Milekic. The more you look the more you get: intention-based interface using gaze-tracking. In Bearman, D., Trant, J. (eds.), Museums and the Web 2002: Selected Papers from an International Conference, Archives and Museum Informatics, 2002.
[75] L. P. Morency, A. Rahimi, and T. Darrell. Adaptive view-based appearance model. In IEEE Conference on CVPR03, 2003.
[76] C. H. Morimoto, A. Amir, and M. Flickner. Detecting eye position and gaze from a single camera and 2 light sources. In Proceedings of the International Conference on Pattern Recognition, 2002.

[77] C. H. Morimoto, D. Koons, A. Amir, and M. Flickner. Frame-rate pupil detector and gaze tracker. In IEEE ICCV'99 Frame-rate Workshop, 1999.
[78] C. H. Morimoto, D. Koons, A. Amir, and M. Flickner. Pupil detection and tracking using multiple light sources. Image and Vision Computing, 18:331–336, 2000.
[79] C. H. Morimoto and M. Mimica. Eye gaze tracking techniques for interactive applications. Computer Vision and Image Understanding, Special Issue on Eye Detection and Tracking, 98(1):4–24, 2005.
[80] C. H. Morimoto and M. Flickner. Real-time multiple face detection using active illumination. In Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition 2000, Grenoble, France, 2000.
[81] C. H. Morimoto, D. Koons, A. Amir, and M. Flickner. Pupil detection and tracking using multiple light sources. Technical Report RJ-10117, IBM Almaden Research Center, 1998.
[82] M. Motwani and Q. Ji. 3D face pose discrimination using wavelets. In Proceedings of the IEEE International Conference on Image Processing, Thessaloniki, Greece, 2001.
[83] M. Nixon. Eye spacing measurement for facial recognition. In Proceedings of the Society of Photo-Optical Instrument Engineers, 1985.
[84] T. Ohno, N. Mukawa, and A. Yoshikawa. Freegaze: A gaze tracking system for everyday gaze interaction. In Proceedings of the Symposium on ETRA 2002, 2002.
[85] S. Or, W. Luk, K. Wong, and I. King. An efficient iterative pose estimation algorithm. Image and Vision Computing, 16:353–362, 1998.
[86] C. W. Oyster. The human eye: Structure and function. Sinauer Associates, Inc., 1999.
[87] A. Pentland. Looking at people. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):107–119, 2000.
[88] A. Pentland. Perceptual intelligence. Communications of the ACM, 43(3):35–44, 2000.
[89] A. Pentland and T. Choudhury. Face recognition for smart environments. IEEE Computer, pages 50–55, 2000.
[90] A. Pentland, B. Moghaddam, and T. Starner. View-based and modular eigenspaces for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1994.

[91] R. W. Picard, E. Vyzas, and J. Healey. Toward machine emotional intelligence: Analysis of affective physiological state. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(10), 2001.
[92] A. Rahimi, L. P. Morency, and T. Darrell. Reducing drift in parametric motion tracking. In IEEE Conference on ICCV01, 2001.
[93] M. Reinders, R. Koch, and J. Gerbrands. Locating facial features in image sequences using neural networks. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, Killington, USA, 1997.
[94] M. J. Reinders, R. W. Koch, and J. Gerbrands. Locating facial features in image sequences using neural networks. In International Conference on Automatic Face and Gesture Recognition, 1996.
[95] N. Sarris, N. Grammalidis, and M. G. Strintzis. FAP extraction using three-dimensional motion estimation. IEEE Transactions on Circuits and Systems for Video Technology, 12(10):865–876, 2002.
[96] R. C. Schank. Dynamic Memory: A Theory of Reminding and Learning in Computers and People. Cambridge University Press, 1983.
[97] K. Schittkowski. NLQPL: A FORTRAN-subroutine solving constrained nonlinear programming problems. Annals of Operations Research, 5:485–500, 1985.
[98] H. Schneiderman and T. Kanade. A statistical method for 3D object detection applied to faces and cars. In IEEE Conference on Computer Vision and Pattern Recognition, 2000.
[99] J. Sherrah and S. Gong. Exploiting context in gesture recognition, 1999.
[100] J. Sherrah and S. Gong. VIGOUR: A system for tracking and recognition of multiple people and their activities, 2000.
[101] S. W. Shih and J. Liu. A novel approach to 3-D gaze tracking using stereo cameras. In IEEE Transactions on Syst. Man and Cybern., Part B, volume 34, pages 234–245, 2004.
[102] S. A. Sirohey and A. Rosenfeld. Eye detection. Technical Report CAR-TR-896, Center for Automation Research, University of Maryland, College Park, MD, 1998.
[103] S. A. Sirohey and A. Rosenfeld. Eye detection in a face image using linear and nonlinear filters. Pattern Recognition, 34:1367–1391, 2001.

[104] K. H. Tan, D. Kriegman, and H. Ahuja. Appearance based eye gaze estimation. In Proceedings of the IEEE Workshop on Applications of Computer Vision, pages 191–195, 2002.
[105] Y. Tian, T. Kanade, and J. Cohn. Recognizing action units for facial expression analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):97–115, 2001.
[106] Y. Tian, T. Kanade, and J. F. Cohn. Dual-state parametric eye tracking. In Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition, 2000.
[107] K. Toyama and A. Blake. Probabilistic tracking with exemplars in a metric space. IJCV, 48(1):9–19, 2002.
[108] K. Toyama, R. S. Feris, J. Gemmell, and V. Kruger. Hierarchical wavelet networks for facial feature localization. In International Conference on Automatic Face and Gesture Recognition, Washington D.C., USA, 2002.
[109] E. Trucco and A. Verri. Introductory techniques for 3D computer vision. 1998.
[110] L. Vacchetti, V. Lepetit, and P. Fua. Stable real-time 3D tracking using online and offline information. IEEE Transactions on PAMI, 26(10):1391–1391, 2004.
[111] P. Viola and M. Jones. Robust real-time object detection. IJCV, 57(2):137–154, 2004.
[112] J. Waite and J. M. Vincent. A probabilistic framework for neural network facial feature location. British Telecom Technology Journal, 10(3):20–29, 1992.
[113] J. Wang, E. Sung, and R. Venkateswarlu. Eye gaze estimation from a single image of one eye. In Proceedings of the International Conference on Computer Vision, 2003.
[114] C. Ware and H. H. Mikaelian. An evaluation of an eye tracker as a device for computer input. In ACM Conference on Human Factors in Computing Systems and Graphics Interface, Toronto, 1987.
[115] K. Waters, J. Rehg, M. Loughlin, S. Kang, and D. Terzopoulos. Visual sensing of humans for active public interfaces, 1998.
[116] X. Wei, Z. Zhu, L. Yin, and Q. Ji. A real-time face tracking and animation system. In IEEE Workshop on Face Processing in Video, Washington DC, USA, 2004.

[117] L. Wiskott, J. M. Fellous, N. Kruger, and C. V. Malsburg. Face recognition by elastic graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 19(7), 1997.
[118] J. Xiao, T. Kanade, and J. F. Cohn. Robust full-motion recovery of head by dynamic templates and re-registration techniques. In IEEE Conference on AFGR02, 2002.
[119] X. Xie, R. Sudhakar, and H. Zhuang. On improving eye feature extraction using deformable templates. Pattern Recognition, 27:791–799, 1994.
[120] J. Yao and W. Cham. Efficient model-based linear head motion recovery from movies. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington DC, 2004.
[121] P. Yao, G. Evans, and A. Calway. Using affine correspondence to estimate 3-D facial pose. In Proceedings of the IEEE International Conference on Image Processing, pages 919–922, 2001.
[122] A. Yilmaz, K. Shafique, and M. Shah. Estimation of rigid and non-rigid facial motion using anatomical face model. In Proceedings of the International Conference on Pattern Recognition, Quebec City, QC, Canada, 2002.
[123] D. H. Yoo and M. J. Chung. A novel non-intrusive eye gaze estimation using cross-ratio under large head motion. Computer Vision and Image Understanding, Special Issue on Eye Detection and Tracking, 98(1):25–51, 2005.
[124] A. Yuille, P. Hallinan, and D. Cohen. Feature extraction from faces using deformable templates. International Journal of Computer Vision, 8(2):99–111, 1992.
[125] S. Zhai, C. Morimoto, and S. Ihde. Manual and gaze input cascade (MAGIC) pointing. In ACM CHI'99, Pittsburgh, PA, USA, 1999.
[126] S. Zhai, C. H. Morimoto, and S. Ihde. Manual and gaze input cascaded (MAGIC) pointing. In ACM SIGCHI Human Factors in Computing Systems Conference, 1999.
[127] L. Zhang. Estimation of eye and mouth corner point positions in a knowledge-based coding system. In Proceedings of SPIE, Vol. 2952, pp. 21-18, 1996.
[128] Y. Zhang and Q. Ji. Active and dynamic information fusion for facial expression understanding from image sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 27(5), 2005.

[129] Z. Zhang. Feature-based facial expression recognition: Experiments with a multi-layer perceptron. Technical Report INRIA 3354, 1998.
[130] Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330–1334, 2000.
[131] J. Zhu and J. Yang. Subpixel eye gaze tracking. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, Washington D.C., pages 131–136, 2002.
[132] Z. Zhu, K. Fujimura, and Q. Ji. Real-time eye detection and tracking under various light conditions. In Symposium on Eye Tracking Research and Applications, 2002.
[133] Z. Zhu and Q. Ji. Eye and gaze tracking for interactive graphic display. Machine Vision and Applications, 15(3):139–148, 2004.
[134] Z. Zhu and Q. Ji. Robust real-time eye detection and tracking under variable lighting conditions and various face orientations. Computer Vision and Image Understanding, Special Issue on Eye Detection and Tracking, 38(1):124–154, 2005.
[135] Z. Zhu, Q. Ji, K. Fujimura, and K. Lee. Combining Kalman filtering and mean shift for real time eye tracking under active IR illumination. In International Conference on Pattern Recognition, 2002.