
(09CS026)

GazeLib: A low Cost Implementation of Real-time Gaze Tracking Framework

(Volume 1 of 1 )

Student Name : Ng King Yui

Student No. :

Programme Code : BScCS

Supervisor : Prof IP, Ho Shing Horace

Date : 12 April, 2010

City University of Hong Kong Department of Computer Science BSCCS/BSCS Final Year Project 2009-2010

Final Report

For Official Use Only

A low Cost Implementation of Real-time Gaze Tracking Framework Final Report

- Page 2 of 81-

Declaration

I have read the project guidelines and I understand the meaning of academic

dishonesty, in particular plagiarism and collusion. I hereby declare that the work I

submitted for my final year project, entitled:

GazeLib: A low Cost Implementation of Real-time Gaze Tracking Framework

does not involve academic dishonesty. I give permission for my final year project work

to be electronically scanned and if found to involve academic dishonesty, I am aware of

the consequences as stated in the Project Guidelines.

Student Name: Ng King Yui Signature: _____________

Student ID : Date : _____________



Abstract

Eye gaze reflects a person’s attention over time, which makes it a powerful cue for determining what that person finds interesting. Eyes are therefore not only indispensable to human communication but also offer great potential for building more natural and direct human-computer interaction. However, most available gaze trackers depend on either purpose-built software or high-end hardware, and high cost has always been a barrier to the widespread adoption of gaze tracking technology. Moreover, the majority of gaze trackers adopt the corneal reflection technique, which actively illuminates the eye region with infrared (IR) light and requires quasi-stable lighting conditions to operate. In addition, potential eye hazards may arise from long-term or close-proximity IR exposure. To solve these problems, this project proposes a robust hybrid method that integrates model-based and feature-based tracking approaches. The core of the proposed method applies the Active Shape Model fitting technique to locate facial features. Eye features are then extracted by image processing, including filtering and ellipse fitting. To make the framework publicly available, it is packaged as a programming library and released as an open-source package. Evaluation experiments show that the system prototype is capable of real-time remote gaze tracking under several lighting conditions with low-cost, off-the-shelf webcams while maintaining acceptable accuracy. The proposed method significantly improves usability and reduces the cost of gaze tracking technology, which is an important step towards bringing it to the mass market.



Acknowledgments

I would like to thank my supervisor, Prof. Horace H.S. IP, for his advice and valuable support throughout the development of this project. This project could not have been completed without his guidance and patience.

I would also like to thank Dr. Ken C.K. LAW and Dr. Joe C.H. YUEN from the Department of Computer Science, and Dr. Lionel P.K. SUN from the Department of Mathematics, for their generous support and guidance.

Furthermore, many thanks go to the organizations, including the FG-NET consortium and DTU IMM, that prepared and published the annotated face databases used in this project, and to those who allowed their faces to be used.



A low Cost Implementation of Real-time Gaze Tracking Framework

Final Report

K.Y. Ng

Deliverable date: 12 April, 2010



Table of Contents

Chapter 1: Introduction

Chapter 2: Literature Review
    2.1 Gaze Tracking
        2.1.1 Biological structure of human eye
        2.1.2 Mathematical eye model
    2.2 Eye Tracking Techniques
        2.2.1 Electro-Oculography (EOG)
        2.2.2 Scleral Contact lens/ Search Coils
        2.2.3 Video-Oculography with Corneal Reflection
        2.2.4 Video-Oculography with visual light
    2.3 Video-based gaze tracking hardware settings
        2.3.1 Head-mount
        2.3.2 Table-mount
    2.4 Potential hazards with IR

Chapter 3: Methodology
    3.1 Introduction
    3.2 Active Shape model (ASM)
    3.3 Active appearance model (AAM)
    3.4 POSIT
    3.5 RANSAC

Chapter 4: Design & Implementation
    4.1 Introduction
    4.2 Development Environment
    4.3 System design
        4.3.1 Architecture
        4.3.2 Conceptual Class diagram
    4.4 System implementation
        4.4.1 Overall system flow
        4.4.2 Active Shape Model building
        4.4.3 Face detection and tracking
        4.4.4 Head poses estimation
        4.4.5 Eye feature extraction
        4.4.6 Gaze Estimation

Chapter 5: Results & Comments
    5.1 Introduction
    5.2 Testing environment
    5.3 Performance of face tracking
        5.3.1 Presence of distractions
        5.3.2 Various ambient lighting conditions
        5.3.3 Different facial expressions
        5.3.4 Blurred inputted image
    5.4 Performance of eye features extraction
        5.4.1 Presence of distractions
        5.4.2 Various ambient lighting conditions
        5.4.3 Various iris colors
    5.5 Performance of head poses estimation
        5.5.1 Pose estimation results using POSIT
        5.5.2 Pose estimation results using LK Optical flow
        5.5.3 Compare of using two approaches
    5.6 Performance of whole system
        5.6.1 Speed
        5.6.2 Tracking with off-the-shelf equipment
        5.6.3 Gaze tracking accuracy
    5.6 Case study: GazePad
        5.6.1 Motivations
        5.6.2 Interface Design Concepts
        5.6.3 Operation
        5.6.3 Performances

Chapter 6: Conclusions
    6.1 Critical reviews
        6.1.1 Achievements
        6.1.2 Limitations
    6.2 Future work
    6.3 Application areas
    6.4 Project feedback

References



Revision History:

Date        Author(s)  Comments
09-04-2010  Jack Ng    First draft version
11-04-2010  Jack Ng    First release
16-01-2011  Jack Ng    Changed the title; modified the Abstract; corrected typos and grammar errors



Chapter 1

Introduction



The eyes are the most important human sensory organs; more than half of all sensory impressions come from them. Moreover, gaze is a powerful cue for determining what might be interesting to the observer (Duchowski, 2003). Generally speaking, eye gaze indicates a person’s attention over time. Eyes are therefore not only indispensable to human communication but also offer great potential for building more natural and direct human-computer interaction (Jacob and Karn, 2003). Since gaze information has valuable applications in human-computer interaction and user intention detection, various gaze tracking algorithms have been proposed, and some of them have been commercialized (Daunys et al, 2006).

Gaze tracking, which originated from research on eye movement (Jacob, 1995), is defined as the continuous process of measuring the "Point of Regard" (PoR) or the "Line of Sight" (LoS) of the eye (ITU Gaze group, 2009). Tracking eye gaze involves two main procedures: eye tracking and gaze estimation. Eye tracking is the process of detecting and tracking relevant features (e.g. the pupil center) in the eye image. Gaze estimation is the mathematical procedure that translates these image features into gaze coordinates.

With recent advances in computer vision, gaze tracking is often regarded as a solved problem. The corneal reflection method is commonly adopted by popular gaze tracking algorithms (Daunys and Ramanauskas, 2004; Goni, 2004; Li et al, 2005) and commercial gaze tracking products (EyeTech Digital Systems, 2009; LC Technologies, 2009). Infrared (IR) light actively illuminates the eye region and produces a speck of light reflected by the cornea, known as the corneal reflection or glint. The corneal reflection remains stationary during eye movement, so gaze can be estimated from the relative position of the pupil and the glints in the eye image captured by the camera. Gaze tracking with active IR illumination can be divided into two types: remote tracking and head-mounted systems. Remote tracking systems, which are common among commercial products, usually employ a high-end camera for image capture. They achieve high accuracy and tolerate a few degrees of head movement. A head-mounted system is built into a helmet or special glasses together with the IR lighting device and camera, so the whole system follows the user’s head movement. Li et al (2005) show that satisfactory tracking results can still be obtained even when a low-resolution camera is used. Most low-cost gaze tracking solutions are therefore based on the head-mounted approach (San Agustin et al, 2009; Li et al, 2005).

However, gaze tracking systems based on IR illumination have several limitations. First, most IR-based eye trackers cannot operate properly in the presence of another light source, because the method relies heavily on IR illumination and quasi-stable lighting conditions are the minimal prerequisite (Villanueva et al, 2008). As a result, this approach is only suitable for indoor use and is not recommended for users wearing glasses. Second, the positions of the IR light source and the camera need to be carefully calibrated before tracking, which is inconvenient in a home environment. Third, IR light is used to produce corneal reflections because it is barely visible to the human eye. This improves the user experience, but it also means that the eye’s natural aversion response, which protects it against bright light, is not triggered. Concerns about long periods of IR exposure have therefore been raised (Mulvey et al, 2008). Current infrared safety standards do not yet provide guidelines on long-term or close-proximity IR exposure, so potential eye hazards may exist.

More recently, various gaze tracking algorithms that do not use IR light have also been proposed (Villanueva et al, 2008). Kohlbecher et al (2008) proposed a gaze tracking algorithm that infers eye gaze from the shape of the iris through ellipse fitting. Again, however, high-end hardware components are required.

As mentioned above, most gaze tracking systems rely on purpose-built or high-end hardware and software, which vary between manufacturers (Bates et al, 2005). The high cost of hardware and software has always been a barrier to the widespread adoption of gaze tracking technology. The market study conducted by Jordansen et al (2005) reported that, as of 2005, an eye tracking system in Europe cost from EUR 4,100 to EUR 17,900, which is around HK$47,200 to HK$207,640. Furthermore, the same study reported that the main target user groups of commercial gaze tracking products are research organizations and people with disabilities, such as ALS or locked-in syndrome. Widespread integration of eye tracking into consumer-grade human-computer interfaces is rarely seen.

This project aims to bring gaze tracking technology to consumer-grade human-computer interfaces by reducing its price, emphasizing ease of use, and improving its extensibility, flexibility and mobility. Instead of relying on active IR illumination and corneal reflections, it employs a robust facial-feature-based gaze tracking approach proposed by Chen et al (2008), in which 2D facial features are tracked and then used to estimate gaze. In contrast to Chen et al (2008), the proposed system requires only a single uncalibrated camera rather than a stereo camera, and no hardware modifications (e.g. building an IR LED grid), so off-the-shelf components can be used. Because the method works without IR light, the gaze tracking system can operate under both indoor and outdoor conditions, and because no active illumination is required, wearing glasses is no longer a problem. With low-cost, off-the-shelf hardware, the price can be reduced by a factor of about one hundred: a webcam costs only about HK$100 to HK$500. Once the price drops to this range, gaze tracking interfaces are expected to become commonplace (Jordansen et al, 2005), and gaze tracking technology has the potential to revolutionize the future development of human-computer interaction. The framework is packaged as a programming library and released to the public as an open-source package, which hides the complicated implementation details from developers. Developers can build gaze tracking applications with only a few function calls.

The report begins with a review of the historical and theoretical background of gaze tracking techniques. The literature review is followed by a detailed description and justification of the proposed method. Experimental results are then presented to demonstrate the performance of the gaze tracking system, and a case study of an application prototype built on top of the library is discussed in detail. Finally, conclusions are drawn on the achievements and limitations of the project, and future work is suggested.



Chapter 2

Literature Review

Gaze Tracking

Eye Tracking Techniques

Video-based gaze tracking hardware settings

Potential hazards with IR



2.1 Gaze Tracking

Generally speaking, eye tracking is the process of measuring eye position and movement. This project is concerned with gaze tracking rather than eye tracking, but the report also reviews a range of eye tracking and face tracking techniques, which are closely related to gaze tracking. The term “gaze tracking” is used instead of “eye tracking” when referring to the measurement of the gaze direction or "Point of Regard". Knowledge of the biology and psychology of the human visual system is essential for understanding how positional information about the eye is turned into gaze information.

2.1.1 Biological structure of human eye

Figure 2.1: The Anatomy of the Eye

(Quade, 2009)

The eye is regarded as one of the most complex organs in the human body. Its operation can be compared to that of a camera. Light rays from an object enter the eye through a small hole called the pupil, pass through a focusing lens, and are finally focused on the retina. The ciliary muscles change the thickness of the lens (i.e. adjust its focal length) in order to focus objects at various distances on the retina. The iris, which gives the eye its colored ring, controls the amount of light entering the eye. The retina is a membrane containing numerous photoreceptors (rods and cones) that lies on the rear inner surface of the eyeball, similar to the film in a camera. The photoreceptors transform light energy into electrical impulses, or neural signals, which are then transmitted to the visual processing part of the brain through the optic nerve (Hyrskykari, 2006).



Figure 2.2: Cross section of a human eye

(Hyrskykari, 2006)

2.1.1.1 Visual angle

Visual angle (angular size) is the angle that a viewed object subtends at the eye. Given the object’s height s and the distance d from the lens to the object, the visual angle α can be calculated using the formula (Hyrskykari, 2006):

\[ \alpha = 2 \arctan\left(\frac{s}{2d}\right) \]

Figure 2.3: The visual angle

(Hyrskykari, 2006)
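As a quick numerical check of the visual angle, the following Python sketch evaluates it in both directions. It assumes the standard geometric relation α = 2·arctan(s / (2d)) for an object of height s viewed at distance d; the function names are hypothetical and are not part of GazeLib.

```python
import math


def visual_angle_deg(size: float, distance: float) -> float:
    """Visual angle in degrees of an object of height `size` seen at `distance`.

    Both arguments must use the same length unit (e.g. millimetres).
    """
    if size < 0 or distance <= 0:
        raise ValueError("size must be non-negative and distance positive")
    return math.degrees(2.0 * math.atan(size / (2.0 * distance)))


def size_on_screen(angle_deg: float, distance: float) -> float:
    """Inverse relation: the extent that subtends `angle_deg` at `distance`."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return 2.0 * distance * math.tan(math.radians(angle_deg) / 2.0)
```

For example, at an assumed viewing distance of 600 mm, `size_on_screen(1.0, 600.0)` gives about 10.5 mm, so a gaze error of one degree corresponds to roughly a centimetre on the screen.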

2.1.1.2 Field of view

Field of view (or field of vision) is defined as the horizontal angular (alternatively linear or areal) extent of a given scene that is seen by the eyes, and it is determined by the placement of the eyes. Two types can be distinguished: the field of view of an individual eye and the field of view of the overlapping portion of both eyes (the binocular field). The human field of view is between 160 and 208 degrees, and that of an individual eye is between 120 and 180 degrees (Savas, 2005).



2.1.1.3 Visual acuity

Visual acuity refers to a person’s ability to perceive spatial detail. Visual acuity decreases with distance from the fovea. A normal young person has a visual acuity on the order of minutes, or sometimes seconds, of visual angle, but acuity decreases with age. Although visual acuity is measured in minutes of arc, gaze cannot be estimated to the same accuracy, because gaze cannot be regarded as a single sharp point in a scene. When a point in a scene falls onto the center of the fovea, not only that point but also the surrounding area that falls onto the rest of the fovea is perceived sharply. In addition, a person can shift visual attention without moving the eyes (Hyrskykari, 2006). For these reasons, some error remains even when the exact point of the scene that falls onto the fovea is tracked. This error is reported to be approximately one degree by Jacob and Karn (2003) and two degrees by Duchowski (2003).

Figure 2.4: The visual acuity of the eye (Hyrskykari, 2006)

2.1.1.4 Movement of eye

Eye movements can be divided into three types: saccadic movements, smooth pursuit movements and convergent movements. Smooth pursuit movements occur when the eyes follow a moving object in the field of view. Convergent movements keep both eyes focused on the same object. The eyes do not only follow objects smoothly, however, but also perform sudden jumps from one point to another; these are called saccadic eye movements, or saccades. Saccades are among the fastest movements the human body can make. The eyes can rotate at about 500 degrees per second and perform more than a hundred thousand saccades a day (Savas, 2005). Saccades are driven by three pairs of muscles attached to the outside of the eyeball, which are arranged to produce three kinds of rotation: horizontal (left-right), vertical (up-down) and about the line of sight. The intervals between saccades, during which the perception of visual objects occurs, are called fixations. Fixation duration varies between tasks but typically averages around 250 ms. The eyes are not entirely steady throughout a fixation; they continue to perform movements on a smaller scale, which makes fixations more complicated to recognize (Hyrskykari, 2006).

Figure 2.5: Eye’s directions of saccadic movements (Oculomotor Research Group, 2006)
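One common way to separate fixations from saccades in recorded gaze data is a velocity threshold. The Python sketch below illustrates this idea only; it is not the method used in this project. The threshold of 100 degrees per second, the minimum fixation duration of 100 ms and all names are assumptions chosen for illustration. Angular distance is approximated by the Euclidean distance between successive gaze angles, which is reasonable only for small angles.

```python
import math
from typing import List, Tuple

# One gaze sample: (time in seconds, horizontal angle in degrees, vertical angle in degrees)
Sample = Tuple[float, float, float]


def detect_fixations(samples: List[Sample],
                     velocity_threshold: float = 100.0,
                     min_duration: float = 0.1) -> List[Tuple[float, float]]:
    """Return (start_time, end_time) pairs of fixations found in `samples`.

    A step between two consecutive samples belongs to a fixation when its
    angular speed is below `velocity_threshold` (degrees per second).
    Runs of such steps shorter than `min_duration` seconds are discarded.
    """
    fixations = []
    start = None  # start time of the fixation currently being built
    end = None    # time of the last sample added to that fixation
    for prev, cur in zip(samples, samples[1:]):
        dt = cur[0] - prev[0]
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        speed = math.hypot(cur[1] - prev[1], cur[2] - prev[2]) / dt
        if speed < velocity_threshold:
            if start is None:
                start = prev[0]
            end = cur[0]
        else:
            # A fast step (saccade) closes the current fixation, if any.
            if start is not None and end - start >= min_duration:
                fixations.append((start, end))
            start = None
    if start is not None and end - start >= min_duration:
        fixations.append((start, end))
    return fixations
```

The small movements that occur within a fixation stay well below the threshold in this scheme, which is why a speed criterion rather than an exact-position criterion is used.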

2.1.2 Mathematical eye model

Formulating the eye as a mathematical model is necessary for the precise description and calculation of gaze. The optical axis is defined as an imaginary line passing through the eyeball center and the pupil center. The visual axis is defined as the line joining the center of the fovea and the lens, and it makes an angle with the optical axis. Daunys et al (2005) reported that in a typical adult human eye, the fovea lies about 4-5 degrees temporally and about 1.5 degrees below the point where the optical axis intersects the retina.



Figure 2.5: Mathematical model framework of eye (Daunys et al, 2006)
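The angular offset between the optical and visual axes can be illustrated with a small Python sketch. It assumes that each axis is described by a horizontal (yaw) and a vertical (pitch) angle in degrees and that the visual axis is obtained by adding fixed offsets to the optical axis. The default offsets of 5 and 1.5 degrees follow the figures quoted in Section 2.1.2; the coordinate convention, the signs of the offsets (which depend on which eye is modeled) and the function names are assumptions.

```python
import math


def direction(yaw_deg: float, pitch_deg: float):
    """Unit vector of an axis; yaw = pitch = 0 points along +z (straight ahead)."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    return (math.cos(pitch) * math.sin(yaw),
            math.sin(pitch),
            math.cos(pitch) * math.cos(yaw))


def visual_axis(optical_yaw_deg: float, optical_pitch_deg: float,
                alpha_deg: float = 5.0, beta_deg: float = 1.5):
    """Visual axis obtained by offsetting the optical axis.

    alpha_deg and beta_deg are the assumed horizontal and vertical offsets.
    """
    return direction(optical_yaw_deg + alpha_deg, optical_pitch_deg + beta_deg)


def angle_between_deg(u, v) -> float:
    """Angle in degrees between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))
```

With the default offsets and the optical axis pointing straight ahead, the two axes differ by a little over five degrees, which suggests why a tracker that measures only the optical axis needs a per-user correction to recover the line of sight.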



2.2 Eye Tracking Techniques

Human-computer interfaces controlled by the user’s gaze are a new branch of eye tracking research. Eye tracking itself, however, has a long history: eye detection and tracking techniques have been developed and used in medical and psychological research for more than half a century, though not for everyday computer interfaces (Duchowski, 2003). The first step in designing a gaze tracking method is to track the pupil or iris center and determine the degree of eye movement in the face image. Current eye tracking approaches can be classified into at least four categories, which are explained below.

2.2.1 Electro-Oculography (EOG)

Electro-Oculography (EOG) has been widely used for eye movement tracking over the past forty years and is still frequently used in clinical environments today. There is a potential difference of approximately 1 mV between the cornea and the fundus. EOG evaluates eye movement by measuring the electric potential differences of the skin around the eyes. Because this technique measures eye movement relative to the user’s head position, it is not well suited to measuring the point of regard. Although EOG is cheap and non-invasive, it is not a reliable method for quantitative measurement, because the electrical signal can change even when there is no eye movement and is also affected by metabolic changes in the eye (Savas, 2005).

Figure 2.6: An EOG implemented eye tracker

(EagleEyes Project, 2009)



2.2.2 Scleral Contact lens/ Search Coils

This method uses a contact lens fitted with a wire coil, which is attached directly to the eye. An electric potential difference is induced when the coil moves in a magnetic field, and the eye’s movement is calculated by measuring this induced potential. The search coil method gives very high temporal and spatial resolution, which allows small eye movements to be measured. Although it is the most precise eye tracking method, it is invasive. It is therefore rarely used in clinical environments and is mostly found in research settings (Duchowski, 2003).

Figure 2.7: Eye Tracking System CS681 (Primelec, D. Florin., 2009). Labelled parts: search coil; coil frame (350 mm); turntable with Primelec Coil System, Neurology Dept., University Hospital Zurich.

2.2.3 Video-Oculography with Corneal Reflection

When a fixed light source actively illuminates the eye region, light reflections known as “Purkinje images” are formed on the cornea. Infrared (IR) light is usually used as the light source because it is barely visible to the human eye and therefore does not distract the user. The first Purkinje image (called the “glint”) is captured by the eye tracker using a calibrated, infrared-sensitive camera. The position of the glint remains constant under minor head movement, so eye rotation is reflected by the relative position of the pupil centre and the glint. The viewer’s Point of Regard (PoR) can therefore be calculated using this property. There are two general eye tracking techniques based on active eye illumination: “Bright Pupil Tracking” and “Dark Pupil Tracking”. The difference between the two lies in the location of the light source (Daunys, 2006; Duchowski, 2003; Glenstrup and Engell-Nielsen, 1995). Both techniques produce a high contrast between iris and pupil in the captured image, which allows robust eye tracking, but they share two problems, which are discussed in Section 2.2.3.3.

Figure 2.8: The four Purkinje images formed when light is directed at the eye (Glenstrup and Engell-Nielsen, 1995)
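How a Point of Regard might be computed from the pupil centre and the glint can be sketched as follows. The sketch assumes a widely used scheme that is not described in the text above: the user looks at a few known screen targets, and a second-order polynomial that maps the pupil-glint vector to screen coordinates is fitted by least squares. The polynomial form and all function names are assumptions made for illustration; only NumPy's `column_stack` and `linalg.lstsq` are real library calls.

```python
import numpy as np


def design_matrix(vectors):
    """Second-order polynomial terms of pupil-glint vectors, shape (N, 2) -> (N, 6)."""
    v = np.asarray(vectors, dtype=float).reshape(-1, 2)
    dx, dy = v[:, 0], v[:, 1]
    return np.column_stack([np.ones_like(dx), dx, dy, dx * dy, dx ** 2, dy ** 2])


def calibrate(pupil_glint_vectors, screen_points):
    """Fit the mapping from calibration data; returns a (6, 2) coefficient matrix."""
    a = design_matrix(pupil_glint_vectors)
    b = np.asarray(screen_points, dtype=float).reshape(-1, 2)
    if a.shape[0] < a.shape[1]:
        raise ValueError("at least six calibration points are needed")
    coeffs, *_ = np.linalg.lstsq(a, b, rcond=None)
    return coeffs


def estimate_gaze(coeffs, pupil_centre, glint):
    """Map one pupil centre and glint position (image pixels) to a screen point."""
    vector = np.asarray(pupil_centre, dtype=float) - np.asarray(glint, dtype=float)
    return (design_matrix(vector) @ coeffs)[0]
```

A nine-point calibration grid is a typical choice for such a fit, since it gives more equations than the six unknown coefficients per screen axis.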

2.2.3.1 Bright Pupil Tracking

If the illumination is coaxial with the optical path, the eye acts as a retroreflector: light reflected off the retina returns in the same direction as the incoming light, similar to the red-eye effect in photography. This phenomenon, known as the bright pupil effect, makes the pupil appear as a very bright spot and the iris as a dark disc in the captured image. This approach works better for people with blue irises (Tobii Technology AB, 2009).



Figure 2.9: A gold corner cube retroreflector (Retroreflector, 2009)

Figure 2.10: Bright pupil formed on the captured image (Daunys, 2006)

2.2.3.2 Dark Pupil Tracking

If the illumination source is offset from the optical path, the light reflected from the retina does not return along the direction of the incoming light, so the pupil appears dark in the captured image. This approach works better for people with dark eyes (Tobii Technology AB, 2009).

Figure 2.11: Working principle of a corner reflector (EyeTracking, 2009)

Figure 2.12: Eye region image with corneal reflex (Daunys, 2006)

2.2.3.3 Problems

The high contrast between iris and pupil allows robust eye tracking for all iris colors, but this technique has two problems. First, the contrast between the pupil and the rest of the eye area in the image is reduced when other external light sources are present, for example outdoors, which makes it hard for the tracking algorithm to determine the boundaries of eye features. Second, when the user wears glasses or contact lenses, multiple glints appear, which makes it hard for the algorithm to find the true corneal reflection (Daunys, 2006).




2.2.4 Video-Oculography with visual light

This approach relies on image analysis algorithms alone rather than on active illumination. Images captured by a calibrated camera under normal lighting are fed directly to the algorithm for gaze estimation. The various algorithms proposed in this category can be classified into three main types: deformable-template-based, appearance-based and feature-based methods. Deformable-template-based and appearance-based methods attempt to fit a predefined model to the image, while feature-based methods attempt to fit image features to a fixed model (Daunys, 2006).

2.2.4.1 Deformable templates based

The deformable template tracking method is based on a manually predefined generic template which is matched to the image. The correlation value computed between a candidate image and the predefined template is used to determine whether an eye is present. This approach is accurate and easy to implement, but it cannot deal effectively with variation in scale, pose and shape. Moreover, matching a template is computationally demanding and a high contrast image is required (Savas, 2005).
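The correlation test described above can be illustrated with a minimal sketch. It assumes a rigid (non-deforming) template and small grey images stored as nested lists; the function names `normalized_correlation` and `best_match` are hypothetical and are not taken from the cited work.

```python
import math

def normalized_correlation(patch, template):
    """Normalized cross-correlation of two equal-sized grey images (nested lists)."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den if den > 0 else 0.0  # flat patches carry no information

def best_match(image, template):
    """Slide the template over the image; return (score, row, col) of the best position."""
    th, tw = len(template), len(template[0])
    best = (-2.0, 0, 0)
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            score = normalized_correlation(patch, template)
            if score > best[0]:
                best = (score, r, c)
    return best
```

The exhaustive sliding search shows why template matching is computationally demanding: every candidate position requires a full correlation, and a fixed template has no way to follow changes in scale or pose.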

2.2.4.2 Appearance based

The appearance based tracking method uses statistical analysis and machine learning to find the relevant characteristics of eye and non-eye images. The learned characteristics take the form of distribution models or discriminant functions, which are used for eye detection. A distribution model is a probabilistic framework in which Bayesian classification or maximum likelihood is used to classify a candidate image as eye or non-eye. However, the high dimensionality of images makes a direct implementation of Bayesian classification infeasible. A discriminant function is derived by projecting the high dimensional image to a lower dimensional space, which is then used for image classification. PCA and the Hidden Markov Model are the most commonly used appearance based techniques (Savas, 2005).

2.2.4.3 Feature based

Feature based tracking methods extract particular features, such as the color distribution of the eye region or feature points of the eye in the image, to perform identification. Feature based tracking consists of feature extraction and feature mapping. Typical feature based tracking algorithms are the particle filter, Gabor filters, Kalman filtering and mean shift (Zhou et al, 2008).

2.3 Video-based gaze tracking hardware settings

The video-based gaze tracking approach is the main concern of this project; therefore, only the hardware settings adopted by video-based gaze trackers are reviewed. Generally speaking, video-based gaze trackers can be classified into two types, head-mount trackers and remote (or table-mount) trackers, based on whether the cameras are attached to the subject’s head or positioned remotely.

2.3.1 Head-mount

A head-mounted gaze tracker estimates gaze direction relative to the user’s head position. Applications that involve fast head movements, as well as low cost gaze tracking solutions (Winfield, 2005; The system I4Contro, 2009), tend to employ the head-mount approach, since the camera is placed at close range to the user’s eye. On the other hand, the higher level of intrusion makes this type of tracker unsuitable for computer control (Daunys et al, 2006).

Figure 2.13: openEyes system using head-mounted device (Winfield, 2005)

2.3.2 Table-mount

Table-mount gaze trackers track the head position and orientation in 2D or 3D space. This type of system does not require any attachment to the user and allows free head movement within certain limits, so it is more suitable for computer control. However, the accuracy of gaze estimation is lower compared with a head-mount system. A high-resolution camera is usually preferred for remote tracking (Daunys et al, 2006).



2.4 Potential hazards with IR

The spectral emission of the infrared LEDs employed in most IR based eye trackers is usually limited to the near infrared band (IR-A, 780-1400 nm). IR-A band LEDs have been tested, and the results clearly show no hazard to the eye for viewing over a short period of time (a few hours) based on current national and international ocular exposure limits for infrared optical radiation. However, explicit guidelines regarding long period or close proximity exposure of the eye to IR have not yet been addressed in any current infrared safety standard. The potential hazards of IR therefore remain an open question (Mulvey et al, 2008). Moreover, Mulvey et al (2008) reported that emissions outside the IR-A range are possible if the source is a conventional incandescent lamp or discharge lamp that has been filtered to block most of the visible light and transmit IR-A.

Figure 2.15: The different photo-biological effects of optical radiation (Mulvey et al, 2008)



Chapter 3

Methodology

Introduction

Active shape model

Active appearance model

POSIT

RANSAC



3.1 Introduction

Instead of relying on active IR illumination and corneal reflections, a robust facial feature based gaze estimation approach is proposed, built on the ASM face tracking algorithm and a proposed eye feature extraction algorithm. To achieve our aims, the proposed approach draws on several algorithms and solutions in computer vision. This chapter briefly reviews the techniques and concepts adopted in our system.

3.2 Active Shape model (ASM)

Automatic and accurate location of facial features is a difficult problem in computer vision. The variety of human faces, expressions, facial hair, glasses, poses, and lighting contributes to the complexity of the problem. The Active Shape Model (ASM) is one solution to this problem. An Active Shape Model is a statistical model of shape which iteratively deforms to fit the object in a new image. The shape is constrained by a Statistical Shape Model, which can only be deformed in ways seen in a training set of annotated examples. The ASM must first be trained on a set of manually landmarked images. After training, the statistical shape model can be used to extract feature points on a face. The search involves the following steps:

1. Locating each landmark independently.

2. Correcting the locations of each landmark if necessary by looking at how the

landmarks are located with respect to each other.
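The correction in step 2 can be sketched as follows. The sketch assumes the usual linear formulation of a statistical shape model (a mean shape plus weighted orthonormal modes, with each weight limited to a few standard deviations); this formulation and the names `constrain_shape`, `modes` and `eigenvalues` are assumptions for illustration and are not taken from ASMLibrary.

```python
import math

def constrain_shape(candidate, mean_shape, modes, eigenvalues, limit=3.0):
    """Project independently located landmarks onto the shape model.

    candidate, mean_shape: flat lists [x1, y1, x2, y2, ...].
    modes: orthonormal variation modes learned from training shapes (assumed).
    eigenvalues: variance of each mode in the training set (assumed).
    """
    diff = [c - m for c, m in zip(candidate, mean_shape)]
    shape = list(mean_shape)
    for mode, lam in zip(modes, eigenvalues):
        b = sum(d * p for d, p in zip(diff, mode))   # weight of this mode
        bound = limit * math.sqrt(lam)
        b = max(-bound, min(bound, b))               # stay within shapes seen in training
        shape = [s + b * p for s, p in zip(shape, mode)]
    return shape
```

Any displacement that the modes cannot express is discarded, which is how the position of one landmark ends up correcting the others.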

Figure 3.1: ASM template fitting process (Cootes, 2009)

3.3 Active appearance model (AAM)

The Active Appearance Model (AAM) merges the shape and texture models into a single model of appearance. The AAM contains a statistical model of the shape and grey-level appearance of the object of interest. Matching the model to the image involves finding the model parameters that minimize the difference between the image and a synthesized model (Cootes, 2009).

Figure 3.2: AAM template matching process (Cootes, 2009)

3.4 POSIT

“Pose from Orthography and Scaling with Iteration”, also known as POSIT, is an algorithm used to estimate the pose of a known object in three dimensions. It was originally proposed by DeMenthon, D. (1993). To compute an object’s pose, at least four non-coplanar points on the object and their corresponding 2D projections in the image must be found. The first part of the algorithm, pose from orthography and scaling (POS), finds the perspective scaling of the known object and from it computes an approximate pose. However, the approximation from POS is not very accurate. Therefore, the observed points are reprojected at the pose calculated by POS, and the POS algorithm is started again with these new point positions. Typically, the true object pose can be recovered within four or five iterations (Bradski and Kaehler, 2008).
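The POS and iteration loop can be sketched in a few lines. This is a minimal illustration and not the OpenCV implementation: it assumes exactly four non-coplanar model points (so that a 3x3 system can be solved directly instead of using a pseudo-inverse), image coordinates measured from the principal point, and a known focal length in pixels. The helper names are hypothetical.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def solve3(rows, rhs):
    """Solve the 3x3 linear system rows * v = rhs via the cross-product inverse."""
    a, b, c = rows
    bc, ca, ab = cross(b, c), cross(c, a), cross(a, b)
    det = dot(a, bc)
    return tuple((rhs[0] * bc[n] + rhs[1] * ca[n] + rhs[2] * ab[n]) / det
                 for n in range(3))

def posit(model, image, focal, iterations=20):
    """Return (rotation rows, translation) for 4 model points and their projections.

    model[0] is the reference point of the object; image[n] is the (x, y)
    projection of model[n].
    """
    vecs = [tuple(m[n] - model[0][n] for n in range(3)) for m in model[1:]]
    x0, y0 = image[0]
    eps = [0.0, 0.0, 0.0]          # eps = 0 gives the plain POS approximation
    for _ in range(iterations):
        xs = [image[n + 1][0] * (1 + eps[n]) - x0 for n in range(3)]
        ys = [image[n + 1][1] * (1 + eps[n]) - y0 for n in range(3)]
        I = solve3(vecs, xs)
        J = solve3(vecs, ys)
        s1, s2 = math.sqrt(dot(I, I)), math.sqrt(dot(J, J))
        i = tuple(v / s1 for v in I)
        j = tuple(v / s2 for v in J)
        k = cross(i, j)
        tz = focal / ((s1 + s2) / 2)            # depth of the reference point
        eps = [dot(v, k) / tz for v in vecs]    # perspective correction terms
    return (i, j, k), (x0 * tz / focal, y0 * tz / focal, tz)
```

The first pass with all correction terms at zero is the POS step; each later pass re-solves the same system with corrected image coordinates, which corresponds to restarting POS with new point positions.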



3.5 RANSAC

RANSAC, an abbreviation for "Random Sample Consensus", was first published by Fischler and Bolles in 1981. RANSAC is an algorithm for robustly fitting a model to data in the presence of many outliers. The inputs to the RANSAC algorithm are a set of observed data values, a parameterized model which can be fitted to the data, and some threshold parameters. Each iteration consists of the following steps:

1. Select a random subset S of the original data as hypothetical inliers.

2. Fit the model to the hypothetical inliers.

3. Test all other data against the fitted model; any point that fits the estimated model well is also considered a hypothetical inlier.

4. Re-estimate the model from the new set of hypothetical inliers if it contains sufficient points.

5. Evaluate the model by estimating the error of the inliers relative to the model.

Steps 1-5 are repeated N times, each time producing either a rejected model or a refined model. The refined model is kept if its error is lower than that of the last saved model. After N iterations, the saved model is the best fit found (Fisher, 2009).
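The five steps can be sketched for the simplest possible model, a straight line y = a*x + b. The choice of a line model, the parameter defaults and the function names below are assumptions made for illustration only.

```python
import random

def fit_line(points):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(points)
    sx = sum(p[0] for p in points)
    sy = sum(p[1] for p in points)
    sxx = sum(p[0] ** 2 for p in points)
    sxy = sum(p[0] * p[1] for p in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

def ransac_line(points, n_iter=100, threshold=0.5, min_inliers=10, seed=0):
    rng = random.Random(seed)
    best_model, best_error = None, float("inf")
    for _ in range(n_iter):
        sample = rng.sample(points, 2)                # step 1: random subset
        if sample[0][0] == sample[1][0]:
            continue                                  # vertical pair: no y = a*x + b fit
        a, b = fit_line(sample)                       # step 2: fit to the subset
        inliers = [p for p in points                  # step 3: collect the consensus set
                   if abs(p[1] - (a * p[0] + b)) < threshold]
        if len(inliers) < min_inliers:
            continue                                  # rejected model
        a, b = fit_line(inliers)                      # step 4: re-estimate
        error = sum((p[1] - (a * p[0] + b)) ** 2      # step 5: evaluate
                    for p in inliers) / len(inliers)
        if error < best_error:
            best_model, best_error = (a, b), error
    return best_model
```

Because each hypothesis is built from only two points, a single clean sample is enough to recover the line even when a sizeable fraction of the data are outliers.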



Chapter 4

Design & Implementation

Introduction

Development environment

System design

System implementation



4.1 Introduction

The design and implementation of the gaze-tracking framework library named “GazeLib” are presented in detail. The main capabilities of the library are:

1. Face detection

2. Annotated facial feature points tracking

3. Head pose estimation

4. Feature extraction of both eyes individually (pupil or iris center and radius)

5. 2D trajectory extraction of both eyes individually (measuring the pupil center)

6. Gaze tracking of both eyes individually

7. Blink detection of both eyes individually

In addition, output of the tracking results as a video sequence (e.g. facial model fitting result, eye trajectory extraction, etc.) is also implemented, which may provide useful information for further analysis or studies.

4.2 Development Environment

The system is developed using a notebook with the following configurations:

Hardware configurations:

1. CPU: Intel Core 2 Duo (P7500) 1.66 GHz, FSB 800 MHz

2. RAM: 2GB DDR2-667

3. MON: 13.3-inch LCD widescreen display, 1280 x 800 pixel resolution

4. CAM:

a. Built-in iSight, 640 x 480 pixel resolution

b. E-tiger, 640 x 480 pixel resolution

Software configurations:

1. Microsoft Visual C++ 2008

2. OpenCV 2.0

3. ASMLibrary 4.09



4.3 System design

4.3.1 Architecture

The system architecture is illustrated in Figure 4.1. The system is based on the OpenCV library and ASM Library. The OpenCV (Open Source Computer Vision) programming library originates from Intel and provides many algorithms for computer vision processing. ASM Library (Active Shape Model Library SDK) is a C++ implementation of the Active Shape Model framework developed by YAO Wei. This library contains algorithms for training and building the statistical model together with the fitting algorithms.

“GazeLib” is packaged and released as an open source programming library framework. Developers can build their own applications making use of gaze tracking technology based on “GazeLib” without any prior knowledge. “GazeLib” aims not only to achieve high reusability, but also to make it possible to create revolutionary applications that will set the bar for the next generation of gaze tracking applications.

Figure 4.1: System architecture



The system is divided into three components:

1. Face detection and tracking component

2. Eye tracking component

3. Gaze to screen coordinates mapping component

All components need to work together correctly in order to have a working gaze tracking system.

4.3.2 Conceptual Class diagram

Figure 4.2: System class diagram



4.4 System implementation

4.4.1 Overall system flow

Figure 4.3: System flowchart



The flowchart summarizes the main system flow as a whole: face detection, facial feature tracking, head pose estimation, eye feature extraction and the gaze mapping process. Each algorithm is discussed in detail in later sections.

4.4.2 Active Shape Model building

At the very beginning, a one-time Active Shape Model (ASM) training process needs to be carried out before anything else can work correctly.

Number of landmarks

Figure 4.4: Mean error versus number of landmarks (Milborrow and Nicolls 2008)

The number of landmark points in the model directly affects the fitting result. Milborrow and Nicolls (2008) conducted experiments on point-to-point error against the number of landmarks; the results of the study show that one way to improve the mean fit is to increase the number of landmarks in the model, because fitting a landmark tends to help fitting other landmarks. Fitting results are therefore improved by increasing the number of landmarks. Meanwhile, the search time increases roughly linearly with the number of landmarks.

As a result, images of various individuals in different head poses, manually annotated with 68 facial landmark points, were used by the active shape model training algorithm to build the 2D statistical model. The face images used in ASM training were taken from the FG-NET AGING DATABASE and the DTU IMM face database.



Figure 4.5: Manually annotated 68 landmarks face model



4.4.3 Face detection and tracking

The face detection algorithm is based on the Viola-Jones classifier implemented in the OpenCV library. The facial feature point tracking algorithm is a model-based face tracking method.

Low cost tracking equipment such as a webcam can only offer low resolution capture, which is an inherent limitation of the tracking system. Fortunately, the eye tracker does not require a very accurate feature point fitting result. A satisfactory result is enough to define the search windows for eye feature extraction. In addition, the whole system operates in real time, so the tracking speed directly affects the performance of the tracker. Thus, high stability and efficiency with acceptable accuracy are the main concerns in designing the face tracking algorithm.

The active appearance model (AAM) was originally designed for facial feature extraction. Compared with the active shape model (ASM), AAM is more stable and accurate. AAM face tracking is stable provided that the face is nearly frontal, but it does not work well for tracking a face at an angle: it was found that AAM cannot deform to the right shape after head rotation. ASM tracking is not as stable as AAM, but a face rotated by a moderate angle can still be tracked, although not a face rotated by a large angle.



Figure 4.6: Fitting results comparison between AAM and ASM

In addition, the computational cost of AAM is much higher than that of ASM. The experimental results show that AAM achieves only 5 fps on average while ASM achieves 9.5 fps on average. The results indicate that AAM is not a good choice for implementing a real-time gaze tracking system.

Figure 4.7: Performances comparison between AAM and ASM

Experiments and observations show that the stability and quality of the model fitting can be improved by reducing noise and detail in the input image. The following figure shows the fitting results when ASM fitting is performed directly, and when a Gaussian filter or a median filter is applied before ASM fitting. The results show that the fitting performance with the median filter is the best of the three. Therefore, the input image is first passed through the median filter for noise reduction. A median filter can reduce the noise in an image while preserving the edges, which is the reason for choosing it over other filters. In order to further increase the speed, the image is scaled down by half before active shape model fitting.
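The edge-preserving behaviour of the median filter can be seen in a small sketch. It assumes a 3x3 window on a grey image stored as nested lists, and it halves the image by simple decimation; the window size and the scaling method actually used in GazeLib are not stated, so both are assumptions.

```python
def median_filter3(image):
    """3x3 median filter on a 2D list; border pixels are copied unchanged."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            window = sorted(image[rr][cc]
                            for rr in range(r - 1, r + 2)
                            for cc in range(c - 1, c + 2))
            out[r][c] = window[4]       # middle of the 9 sorted values
    return out

def downscale_half(image):
    """Halve width and height by keeping every second pixel (assumed method)."""
    return [row[::2] for row in image[::2]]
```

A single bright noise pixel never reaches the middle of the sorted window, so it disappears, while a step edge keeps its position because the majority of values on each side of the edge is unchanged.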

Figure 4.8: Face fitting result with different filters

Before active shape model fitting, the face is first detected using the Viola-Jones classifier. If a face is present, the mean model shape is initialized from the VJ classifier detection result. Finally, the active shape model is fitted iteratively to the image until it converges or the maximum allowed number of iterations is reached.

Locating facial landmarks is equivalent to locating facial features, since the landmarks outline the facial features; this is the main advantage of adopting this approach. As the 68 landmarks are distributed over the whole face, not only the feature points around the eyes are tracked. Therefore, this tracking method is not limited to eye region extraction, but can also be extended to the extraction and tracking of other facial features such as the mouth.

Figure 4.9: Head detection and tracking algorithm flowchart



4.4.4 Head pose estimation

There are two approaches to implementing the head pose estimation algorithm. The first approach uses the POSIT algorithm provided in the OpenCV library. The second approach uses LK optical flow and the RANSAC algorithm together with POSIT to perform head pose estimation. The details of the two approaches are discussed in the following sections.

4.4.4.1 Approach 1: Using POSIT only

This approach is relatively simple compared with the second approach. The head tracker keeps fitting the facial landmarks to the input image. Once the user initiates the 3D head pose estimation algorithm, the landmark fitting result from head tracking is used to build a 3D head model and the POSIT object. In the next round of execution, the POSIT object together with the face fitting result is input to the POSIT algorithm to perform 3D head pose estimation, which yields a 3D rotation matrix and a translation vector.

Figure 4.10: POSIT head pose estimation algorithm flowchart

4.4.4.2 Approach 2: Using LK Optical flow + RANSAC + POSIT

The second approach is more complicated than the first. It is triggered by the user; otherwise the system just keeps performing face tracking and eye feature extraction. When the algorithm is first triggered, numerous feature points are marked on the input face image. The marked feature points are treated as the model point set, which is later used to initialize the POSIT object. The marked feature points are tracked using the LK optical flow algorithm. The RANSAC algorithm is then applied to the successfully tracked feature points. The following steps are performed:

1. A subset of the successfully tracked points is randomly selected to perform 3D pose estimation.

2. All other successfully tracked points are tested against the fitted model to see whether they fit the estimated model well.

3. Points within a certain distance threshold are considered inliers to the model.

These steps are repeated several times, and the set with the largest number of inliers is considered the best tracked point set.

Finally, the POSIT object is built on the fly using the model points and the best tracked points, and the POSIT algorithm is used to estimate the 3D head pose. After the estimation process, the outliers in the model point set are deleted.



Figure 4.11: Using LK Optical flow, RANSAC with POSIT to estimate head pose



4.4.5 Eye feature extraction

The pupil region can be extracted using edge detection and an ellipse fitting algorithm. In order to obtain a good fitting result, some preprocessing steps need to be carried out before edge detection.

The first step is to convert the color image to grey scale in order to facilitate edge detection. Experiments showed that a stronger edge image can be obtained by using a single component of a color space; for example, using the B component of RGB is better than converting the whole RGB image to grey scale. After conversion to grey scale, a median filter is applied to reduce noise while preserving sharp edges. Histogram equalization is then performed to spread out the brightness values of the image, which increases the image contrast.
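Two of the preprocessing steps, taking the B component and equalizing the histogram, can be sketched as follows. The sketch assumes pixels stored as (R, G, B) tuples and 8-bit grey levels; the function names are hypothetical, and the system itself relies on OpenCV rather than on code like this.

```python
def blue_channel(rgb_image):
    """Keep only the B component of an image stored as rows of (R, G, B) tuples."""
    return [[pixel[2] for pixel in row] for row in rgb_image]

def equalize_histogram(grey, levels=256):
    """Spread the grey values of a 2D list of ints over the range [0, levels - 1]."""
    hist = [0] * levels
    for row in grey:
        for value in row:
            hist[value] += 1
    cdf, running = [], 0
    for count in hist:                  # cumulative distribution of grey values
        running += count
        cdf.append(running)
    total = cdf[-1]
    cdf_min = min(c for c in cdf if c > 0)
    if total == cdf_min:
        return [row[:] for row in grey]  # single grey level: nothing to spread
    scale = (levels - 1) / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lut[value] for value in row] for row in grey]
```

After equalization the darkest value present maps to 0 and the brightest to 255, so a narrow band of brightness values is stretched over the full range before thresholding.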

Human iris colors range from brown to green, blue, dark brown, etc. The pupil region is always black regardless of the iris color. There is strong contrast between a dark iris or pupil and the surrounding area. Therefore a binary threshold can be applied to the grey scale eye image to eliminate edges formed by lower intensity areas.

Figure 4.12: Image threshold results of different color iris

Edge detection is performed on the thresholded image. Since more than one edge curve may exist, a knowledge based method is implemented in order to extract the best fit iris/pupil ellipse. The best fit ellipse is selected using the following steps:

1. Select an edge from the detected edges.

2. Perform ellipse fitting on the selected edge curve.

3. Mark the fitting as valid if the ratio of the major radius to the minor radius is less than three; otherwise ignore the fitting result and continue to the next iteration.

4. Repeat the above steps until all edges have been examined.

5. Select the valid ellipse with the largest area.
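The selection rule in the steps above can be sketched as follows, assuming that ellipse fitting has already produced one (cx, cy, major radius, minor radius) tuple per edge curve. The tuple layout and the name `select_best_ellipse` are hypothetical.

```python
import math

def select_best_ellipse(ellipses, max_ratio=3.0):
    """Return the valid ellipse with the largest area, or None if none is valid.

    ellipses: list of (cx, cy, major_radius, minor_radius), one per edge curve.
    """
    best, best_area = None, 0.0
    for ellipse in ellipses:
        cx, cy, major, minor = ellipse
        if minor <= 0 or major / minor >= max_ratio:
            continue                        # too elongated to be an iris or pupil
        area = math.pi * major * minor
        if area > best_area:
            best, best_area = ellipse, area
    return best
```

The ratio test removes long thin curves such as eyelid edges before the area comparison is made.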

After the best fitted ellipse is extracted, the remaining steps determine whether the ellipse is a valid iris. The testing criteria are as follows:

1. Result ellipse width > inputted eye region image width / 2

2. Result ellipse width inputted eye region image height

4. Result ellipse height



4.4.6 Gaze Estimation

The screen is treated as a rectangle of n * m pixels. The eye movement range corresponding to the corner points of the screen is assumed to form a perfect rectangle of i * j pixels. A simple gaze estimation method using ratio mapping is adopted. The details are discussed in the following sections.

4.4.6.1 System calibration

Since a ratio based screen coordinate mapping is used to estimate the eye gaze, a calibration procedure is performed in order to obtain the pupil center positions in pixels (with reference to the capturing screen) corresponding to the edges of the screen, from which the eye movement rectangle is calculated. The user is required to focus on different calibration points displayed on the screen during the system calibration process.

Figure 4.14: Nine point calibration procedure (the red dots show the calibration points, the blue numbers give the calibration sequence, and the brown characters inside () define the optimal criteria for that particular calibration point)

The steps are as follows:

1. One calibration point is displayed on the screen at a time.

2. The pupil positions corresponding to that point are recorded over a certain time period while the user focuses on it.

3. An optimal point is selected from the recorded point set to represent the pupil position corresponding to that screen calibration point.

p.s. The optimal point is the point closest to the limit; each calibration point has its own definition of optimal. The details are shown in Figure 4.14. In experiments, this definition of optimal proved more accurate than taking the average of the recorded set of points.

4. The above steps are repeated until the pupil center positions for all calibration points are obtained.
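Step 3 can be sketched under one explicit assumption: that "closest to the limit" means the recorded pupil position lying furthest in a direction associated with the calibration point (for example towards smaller x and smaller y for one corner). The exact criteria are defined in the calibration figure and are not reproduced here, so the direction vectors and the name `optimal_point` are hypothetical.

```python
def optimal_point(recorded, direction):
    """Return the recorded pupil centre lying furthest along `direction`.

    recorded: list of (x, y) pupil centres captured for one calibration point.
    direction: (dx, dy) pointing towards the assumed limit for that point.
    """
    dx, dy = direction
    return max(recorded, key=lambda p: p[0] * dx + p[1] * dy)
```

Compared with averaging, an extreme point is not pulled inwards by samples recorded while the eye was still moving towards the target, which is one possible reading of why the report found it more accurate.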

Figure 4.15: system calibration flowchart



4.4.6.2 Gaze estimation

For gaze estimation, a very simple mapping is performed. First, the screen is treated as a rectangle of n * m pixels, and the eye movement range corresponding to the corner points of the screen is assumed to form a perfect rectangle (the gaze rectangle) of i * j pixels. The gaze rectangle is calculated by fitting a rectangle to the pupil centers obtained in the system calibration procedure.

The pupil center and gaze rectangle coordinates are measured with reference to the capturing screen coordinates. By solving the above two equations, the mapping from gaze to screen coordinates can be obtained.
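A ratio mapping of this kind can be sketched as follows. The sketch assumes an axis-aligned gaze rectangle taken as the bounding box of the calibrated pupil centers, a non-degenerate rectangle (i and j greater than zero), no mirroring between camera and screen, and clamping of positions that fall outside the rectangle; none of these details is confirmed by the text, and the function names are hypothetical.

```python
def fit_gaze_rectangle(pupil_centres):
    """Bounding rectangle (x, y, i, j) of the calibrated pupil centres."""
    xs = [p[0] for p in pupil_centres]
    ys = [p[1] for p in pupil_centres]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)

def map_to_screen(pupil, gaze_rect, screen_size):
    """Map a pupil centre to screen coordinates by its relative position."""
    gx, gy, i, j = gaze_rect
    n, m = screen_size
    rx = min(1.0, max(0.0, (pupil[0] - gx) / i))   # horizontal ratio in [0, 1]
    ry = min(1.0, max(0.0, (pupil[1] - gy) / j))   # vertical ratio in [0, 1]
    return rx * n, ry * m
```

For example, with a gaze rectangle of 20 * 10 pixels and a 1280 * 800 screen, a pupil center in the middle of the rectangle maps to the middle of the screen.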

Figure 4.16: The pupil center to screen coordinate mapping



Chapter 5

Results & Comments

Introduction

Testing environment

Performance evaluations



5.1 Introduction

In this chapter, the performance of the whole system and of its subsystems is evaluated. First, the properties of the testing environment are defined. Then, the performance of each component is explained. Since the result obtained from the previous step is the input of the next step, the performance of the whole system is highly dependent on the performance of each component. For example, in order to extract the pupil center of each eye, the system must first localize the facial feature points correctly; the eye search region is then calculated directly from the face tracking result. Finally, the eye features can be determined. At the end of this chapter, the performance of an application called “GazePad”, which is built on our gaze tracking library, is studied and discussed.

5.2 Testing environment

The system is tested on a notebook and a PC workstation with the same software configuration but different hardware:

Notebook configurations:

1. CPU: Intel Core 2 Duo (P7500) 1.66 GHz, FSB 800 MHz


3. MON: 13.3-inch LCD widescreen display, 1280 x 800 pixel resolution

4. OS : Windows 7 Professional 32-bit

PC workstation configurations:

1. CPU: Intel Core 2 Duo (E8500) 3.17 GHz, FSB 1333 MHz


3. MON: 21-inch LCD widescreen display, 1280 x 800 pixel resolution

4. OS : Windows Server 2008 Enterprise 32-bit

Camera used:

1. Macbook built-in iSight, 640 x 480 pixel resolution

2. E-tiger, 640 x 480 pixel resolution

3. Polar Net-Cam, 640 x 480 pixel resolution



Software configurations:

1. Microsoft Visual C++ 2008

2. OpenCV 2.0

3. ASMLibrary 4.09

Figure 5.1: Testing environment hardware settings

5.3 Performance of face tracking

5.3.1 Presence of distractions

ASM face fitting can handle a partly occluded face, because points on the unoccluded part help to fit the points on the occluded part. This also indicates that the number of points directly affects the fitting accuracy: the more points used in training the ASM, the more stable and accurate the fitting result. However, increasing the number of points also reduces the fitting speed. Since the face tracker is implemented using the active shape model, the face model can be fitted to the image as long as most of the face is present, which makes the tracker robust against distractions. As a result, the presence of other faces, hand movements across the face or wearing glasses will not cause the algorithm to lose track of the face. Experiments with different distractions were conducted to show the performance of the face tracking algorithm.



5.3.1.1 Glasses

Experiments were conducted on ASM face fitting when the user is wearing glasses. Two types of glasses were tested: glasses with a non-black frame and glasses with a black frame. The results show that the head tracker performs well when the user wears glasses, regardless of the type.

Figure 5.2: Tracking with a pair of non-black color frame glasses

Figure 5.3: Tracking with a pair of black color frame glasses

5.3.1.2 Passing hand

Unlike other tracking techniques, for example Camshift or LK optical flow, ASM is not easily distracted by an object of similar color moving across the tracked object. ASM can also be used to track a face undergoing large movements, which LK optical flow is not able to do.



Figure 5.4: Face tracking with a hand occluding part of the face

5.3.1.3 Multiple faces

Our face tracking algorithm is able to handle the presence of other faces. Only the face that is closest to the central axis of the capturing screen is considered, while all other faces are ignored. The results also indicate that the face tracking algorithm can be applied to different people.
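The face selection rule can be sketched as follows, assuming that the detector returns one (x, y, width, height) rectangle per face and that the "central axis" is the vertical centre line of the captured frame. Both the rectangle layout and the name `select_central_face` are assumptions for illustration.

```python
def select_central_face(faces, frame_width):
    """Return the face rectangle nearest the vertical centre line, or None.

    faces: list of (x, y, w, h) rectangles from the face detector.
    """
    if not faces:
        return None
    centre = frame_width / 2
    return min(faces, key=lambda f: abs(f[0] + f[2] / 2 - centre))
```

With a 640 pixel wide frame, a face centred at x = 315 is chosen over faces centred at x = 80 and x = 530.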

Figure 5.5: Face tracking with multiple faces



5.3.1.4 Face-like structures

The face tracking result is not distracted by face-like structures, since the ASM only fits faces close to those in the training set.

Figure 5.6: Face tracking with face-like structure present

5.3.2 Various ambient lighting conditions

5.3.2.1 Indoor

Different indoor ambient lighting conditions were simulated by adjusting the brightness to different levels relative to the normal lighting level. The results show that the face tracking algorithm can be applied under different lighting conditions, which means it works well under natural lighting fluctuations. However, the tracking result is not very stable or accurate in conditions that are too bright or too dark.

Figure 5.7: Face tracking in various lighting simulation



5.3.2.2 Outdoor with complex background

The face tracker was tested in an outdoor environment with a complex background and moving objects present. It works well outdoors.

Figure 5.8: Face tracking in an outdoor environment

5.3.3 Different facial expressions

The trained active shape model can also be deformed to fit facial expressions. Some of the tracking results are shown in the following figure. The ASM fitting result is good as long as the training samples cover a large range of facial expressions.

Figure 5.9: ASM model fitting with different facial expressions



5.3.4 Blurred input image

Slight blurring of the input image does not affect the fitting result, provided that the edges of the face can be recognized. However, the face shape model may sometimes converge to the wrong shape.

Figure 5.10: ASM and iris fitting on blurred image

5.4 Performance of eye features extraction

5.4.1 Presence of distractions

The result of eye feature extraction depends heavily on the face fitting result. The eye tracker works well as long as the face fitting result is good, since the active shape model defines the search region of the eye tracker. If the deformable model converges to the wrong shape, the eye feature extraction result will be wrong.

Figure 5.11: Searching regions of two eyes

In some cases, the eye tracker is not affected even if the active shape model converges to the wrong shape, provided that the upper part of the shape model is almost in the right position.



Figure 5.12: Eye features extraction with a wrongly converged shape model

5.4.1.1 Glasses

The face tracking result is not much affected by different types of glasses. However, eye feature extraction can be greatly affected by the frame color of the glasses. For example, when the user wears glasses with a black frame, the extracted features may sometimes be wrong, since the iris fitting algorithm selects the fitted ellipse with the largest area.

Figure 5.13: Features extraction result affected by black color glasses frame

Most of the time, the feature extraction is correct. The following experiments show the extraction results.



Figure 5.14: Eye features extraction with a pair of non-black color frame glasses

Figure 5.15: Eye features extraction with a pair of black color frame glasses

5.4.1.2 Passing hand

As discussed earlier, eye feature extraction depends heavily on the face tracking result. Since the face tracker can handle an object of similar color passing across the user’s face, the eye features can be extracted correctly by the eye tracker.



Figure 5.16: Face tracking with a hand occluding part of the face

5.4.1.3 Face-like structures and multiple faces

Since the search region input to the eye tracker is defined by the fitted face shape model, the presence of multiple faces or face-like structures has no effect on eye feature extraction.

Figure 5.17: Eye features extraction with face-like structure present



5.4.2 Various ambient lighting conditions

Lighting is the factor that affects the extraction result the most. Extremely high or low brightness causes errors in the color thresholding process, which degrades the extraction quality and performance. Our eye tracker performs well in the following conditions.

5.4.2.1 Indoor

Different indoor ambient lighting conditions were simulated by adjusting the brightness to different levels relative to the normal lighting level, in the same way as in the face tracking lighting simulation. The results show that eye feature extraction does not work well in an environment that is too dark. Features can still be extracted, but stability and accuracy are degraded.

Figure 5.18: Eye features extraction in various lighting simulation

Tracking performs poorly in the dark environment because the color contrast between the iris region and the other regions of the eye is no longer sharp. As a result, a poor threshold image of the eye region is produced, which causes the iris fitting algorithm to perform poorly.
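
The effect described above can be illustrated with a minimal sketch. The gray levels, the brightness factors and the fixed threshold used here are assumptions chosen for illustration and are not values taken from our implementation. The sketch only shows why a fixed threshold stops separating the iris from the rest of the eye once the image becomes too dark.

```python
# Illustrative sketch: why a fixed threshold fails when the image is too dark.
# All numbers below are assumed for illustration only.

def scale_brightness(pixels, factor):
    """Simulate a lighting change by scaling 8-bit gray levels."""
    return [min(255, int(p * factor)) for p in pixels]

def threshold_dark(pixels, threshold):
    """Mark pixels darker than the threshold as iris candidates (1)."""
    return [1 if p < threshold else 0 for p in pixels]

# One pixel row across an eye: sclera, iris, sclera (assumed gray levels).
eye_row = [200, 200, 40, 40, 40, 200, 200]
THRESHOLD = 80  # assumed fixed threshold

normal_mask = threshold_dark(scale_brightness(eye_row, 1.0), THRESHOLD)
dark_mask = threshold_dark(scale_brightness(eye_row, 0.25), THRESHOLD)

print(normal_mask)  # only the iris pixels are marked
print(dark_mask)    # every pixel falls below the threshold
```

Under normal brightness only the three iris pixels are marked, while in the darkened row the whole row is marked, so no iris boundary is left for a fitting step to work with.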

5.4.2.2 Outdoor with complex background

IR-based gaze trackers do not perform well outdoors, since IR is easily affected by the presence of other light sources. Our tracker is based on ambient color and can therefore work well in an outdoor environment.


Figure 5.19: Eye feature extraction in an outdoor environment

5.4.3 Various iris colors

Our eye feature extraction algorithm was originally designed for eyes with a dark iris, but in reality human iris colors cover a wide range. The algorithm can be applied to eyes with differently colored irises without code modification. The only difference is that the algorithm then extracts the pupil contour rather than the iris region. This works because the pupil region is black regardless of the iris color, so the pupil center and region can still be extracted. This change does not significantly affect gaze estimation, since the gaze estimation algorithm is based on the pupil center.
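
The idea that the darkest region gives the pupil regardless of iris color can be sketched as follows. This is a simplified stand-in, not our contour fitting algorithm: it takes the centroid of all pixels below an assumed darkness threshold, and the function name `pupil_center`, the threshold and the small test patch are hypothetical.

```python
# Illustrative sketch: locate the pupil center as the centroid of the darkest
# pixels. The threshold and the tiny test image are assumptions.

def pupil_center(gray, threshold=50):
    """Return the (x, y) centroid of pixels darker than threshold, or None."""
    xs, ys = [], []
    for y, row in enumerate(gray):
        for x, value in enumerate(row):
            if value < threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # no dark region found, e.g. eye closed or over-exposed
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# 5x5 eye patch: bright sclera (220), light-colored iris (120), black pupil (10).
eye = [
    [220, 220, 220, 220, 220],
    [220, 120, 120, 120, 220],
    [220, 120,  10, 120, 220],
    [220, 120, 120, 120, 220],
    [220, 220, 220, 220, 220],
]
center = pupil_center(eye)
print(center)
```

Because the light-colored iris pixels stay above the threshold, only the pupil contributes to the centroid, which is the property the paragraph above relies on.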


Figure 5.20: Pupil fitting with different iris colors

5.5 Performance of head poses estimation

Two approaches to head pose estimation were proposed earlier, and each has its pros and cons. The results are illustrated and discussed in detail in this section.

5.5.1 Pose estimation results using POSIT

Head pose is recovered using the POSIT algorithm. In general, the resulting 3D rotation matrix and translation vector are correct but not very precise. In addition, the ASM tracking is not very stable, which results in constant fluctuation in the head pose estimation result. The estimated result consistently shows errors of +/- 10 degrees.

Figure 5.21: Head pose estimation result
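
For readers who want to relate the 3D rotation matrix to head angles, the following minimal sketch converts a rotation matrix into roll, yaw and pitch. The axis convention R = Rz(roll) * Ry(yaw) * Rx(pitch) is an assumption made for this example; the convention used by a particular POSIT implementation may differ. The helper `angles_to_rotation` exists only to check the conversion with a round trip.

```python
import math

def rotation_to_angles(R):
    """Return (roll, yaw, pitch) in degrees from a 3x3 rotation matrix.

    Assumed convention: R = Rz(roll) * Ry(yaw) * Rx(pitch).
    Valid while |yaw| < 90 degrees (no gimbal lock).
    """
    yaw = math.atan2(-R[2][0], math.hypot(R[0][0], R[1][0]))
    roll = math.atan2(R[1][0], R[0][0])
    pitch = math.atan2(R[2][1], R[2][2])
    return tuple(math.degrees(a) for a in (roll, yaw, pitch))

def angles_to_rotation(roll, yaw, pitch):
    """Build R = Rz(roll) * Ry(yaw) * Rx(pitch) from angles in degrees."""
    z, y, x = (math.radians(a) for a in (roll, yaw, pitch))
    cz, sz = math.cos(z), math.sin(z)
    cy, sy = math.cos(y), math.sin(y)
    cx, sx = math.cos(x), math.sin(x)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]

# Round trip: a head rolled 5, turned 20 and tilted -10 degrees.
R = angles_to_rotation(5.0, 20.0, -10.0)
roll, yaw, pitch = rotation_to_angles(R)
print(round(roll, 3), round(yaw, 3), round(pitch, 3))
```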


5.5.2 Pose estimation results using LK Optical flow

This approach estimates the 3D head pose using POSIT together with LK optical flow and RANSAC. The estimation result is very stable and accurate, since it does not depend on the ASM face fitting result. However, the feature points are easily disturbed by an object moving across the face or by rapid head movement.

Figure 5.22: The rotational angles (roll, yaw and pitch) are calculated from the head pose estimation result

5.5.3 Comparison of the two approaches

The first approach uses only the POSIT algorithm, while the second approach uses POSIT together with LK optical flow and RANSAC. The tracking accuracy and stability of the second approach are better than those of the first. However, our system adopts the first approach as its head pose estimation algorithm because of the following factors:


1. The second approach gives a better estimation result, but it is computationally more costly than the first approach. It operates at only half the speed of the first approach (shown in the following figure). Since our gaze tracking system operates in real time, the second approach is not affordable.

2. In the second approach, tracked feature points removed by RANSAC are not recovered automatically, while the first approach does not have this problem.

3. The second approach tolerates neither distractions (e.g. a moving hand) nor large head movements during tracking, since LK optical flow will lose track of the feature points.

Figure 5.23: Comparison of two head pose estimation approaches


Figure 5.24: A crossing hand on the face

5.6 Performance of whole system

In this project, a real-time gaze tracking library with 3D head pose estimation was developed. The performance of each step is critical to the final output. In this section, the performance of the whole system is tested.

5.6.1 Speed

The system was tested in the two computer environments described in section 5.2. All components were evaluated as a whole. Our gaze tracking system achieves an overall average of nearly 8 fps in the desktop testing environment and an average of 5 fps in the laptop testing environment. The result shows that the frame rate increases with processor speed.
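
An average frame rate of this kind can be computed from per-frame timestamps, as in the minimal sketch below. The function name `average_fps` and the sample timestamps are hypothetical; in a live tracking loop the timestamps would be recorded once per processed frame, for example with `time.perf_counter()`.

```python
def average_fps(timestamps):
    """Average frames per second from per-frame timestamps in seconds."""
    if len(timestamps) < 2:
        raise ValueError("need at least two frame timestamps")
    elapsed = timestamps[-1] - timestamps[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    # N timestamps span N - 1 frame intervals.
    return (len(timestamps) - 1) / elapsed

# Hypothetical capture: 17 frames spaced 0.125 s apart.
stamps = [i * 0.125 for i in range(17)]
fps = average_fps(stamps)
print(fps)
```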


Figure 5.25: Measuring speed in frames per second during tracking (nearly 8 fps achieved)

5.6.2 Tracking with off-the-shelf equipment

The gaze tracking library was tested with various off-the-shelf monitors and low-cost webcams. The webcams used are listed in the following figure. None of them was calibrated before use.

Figure 5.26: The webcams used in the system development and testing

5.6.3 Gaze tracking accuracy

The final step in our system is to estimate the eye gaze. The gaze estimation algorithm depends heavily on the face tracking and eye feature extraction results. In our case, since the ASM fitting is not very stable, the positions of all landmarks in the shape model shift slightly from frame to frame, which slightly affects the eye feature extraction result. As a result, the gaze estimate appears to have a slight pulsating movement. The overall position of the estimate is correct despite the pulsating movement errors. The following figure shows the result of a user focusing on the same nine positions used in the system calibration.


Figure 5.27: Gaze mapping for focusing on the nine calibration points’ position

(The green cross indicates the left eye gaze, while the blue cross indicates the right eye gaze)

5.6 Case study: GazePad

GazePad is designed as a test bed application for exploring the possibilities of using eye gaze as an input control. It is a simple application built on top of our gaze tracking library. GazePad acquires gaze information from the gaze tracking library, and the gaze screen coordinates are then mapped onto a character pad in order to perform letter input. GazePad is intuitive to use, and no learning or training is required.

Figure 5.28: GazePad operating environment


5.6.1 Motivations

Consider a person whose entire body is paralyzed (including the mouth, facial movements, etc.) but whose thinking and language processes remain intact. The person's brain is literally locked into a barely functioning body. How can such a person communicate with the outside world? Traditionally, several alternative communication methods can be used. Some of them are listed below:

1. Using a simple blinking system, like blink once for "yes" and twice for "no".

2. Using a more complex Morse code blinking system.

3. With a vocal communication partner together with the simple blinking system, the

communication partner keeps constantly saying "Is it an A? Is it a B?" etc. Blink

once for "yes" and twice for "no".

4. Using an alphabet card board with a vocal communication partner, similar to the

above approach, partner keeps saying "Is the letter in the 1st row? Is the letter in

the 2nd row?" etc.

5. Some paralyzed people who still have free head movement can use a head

mounted stick for typing (shown in the following figure).

6. The most convenient method is using gaze tracking control like Prof. Stephen

Hawking.

Figure 5.29: A disabled patient using a head-mounted stick for typing (www.skymyworlds.com, 2009)

Figure 5.30: Prof. Hawking with his gaze control (Wikipedia, 2009)

All of the above methods are very slow and inconvenient, except for gaze control. There are strong indications that gaze tracking has the potential to become an important component of human-machine interfaces. Ironically, although some commercial gaze trackers have been successfully rolled out in the market, those systems and equipment are very expensive and not affordable for most people. High cost and specially designed components (with no alternative choice) are the main reasons that keep the technology away from the people who actually need it. If the price of gaze communication systems can be brought down to an affordable range, gaze control could become a preferred means of control for a large group of people (Jordansen et al, 2005).

A gaze typing system called “GazePad” was developed based on low-cost off-the-shelf components that can be bought in most consumer hardware stores. We hope it will help make these people's lives easier and more meaningful.

The main user groups are people with motor neuron disease (MND) (e.g. whole-body paralysis) and amyotrophic lateral sclerosis (ALS).

5.6.2 Interface Design Concepts

Gaze control cannot be as accurate as a mouse when pointing at a particular object (e.g. a small button). A conventional on-screen keyboard layout therefore cannot be adopted, because its buttons are small and closely packed. Instead, some alternative input systems used in touch screen mobile phones were studied, since they do not demand high accuracy. For example, Q9 Chinese and Q9 English are laid out on a 3x3 matrix.

Figure 5.31: QCode Chinese input system (www.qcode.com, 2009)


Figure 5.32: Character board used by paralyzed people to communicate (Univ. of Washington, 2009)

After testing the accuracy of our gaze tracking library and studying other alternative input methods, we found that the larger the buttons, the better. We therefore decided to divide the whole screen into a 3x3 matrix, similar to the QCode input method. The letters are placed across the whole screen, as on the character board. Our design also borrows from the concept of the QWERTY keyboard by grouping letters according to their usage frequency. The letters are ranked by usage frequency in order to ensure prompt and speedy input.

Figure 5.33: QWERTY keyboard (www.computerhope.com, 2009)
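
The mapping from a gaze screen coordinate to one of the nine cells can be sketched as below. This is a minimal illustration rather than the GazePad source code; the function name `gaze_to_cell`, the clamping of off-screen estimates and the 1280x1024 screen size are assumptions.

```python
def gaze_to_cell(x, y, screen_w, screen_h, rows=3, cols=3):
    """Map a gaze point in screen pixels to a (row, col) cell, both 0-based."""
    # Clamp so that estimates slightly off-screen still land in an edge cell.
    x = min(max(x, 0), screen_w - 1)
    y = min(max(y, 0), screen_h - 1)
    col = int(x * cols / screen_w)
    row = int(y * rows / screen_h)
    return row, col

# Assumed 1280x1024 screen.
print(gaze_to_cell(100, 100, 1280, 1024))    # top-left cell
print(gaze_to_cell(640, 512, 1280, 1024))    # center cell
print(gaze_to_cell(1500, 1023, 1280, 1024))  # off-screen x is clamped
```

With cells this large, a gaze estimate that shifts by a few tens of pixels between frames usually stays inside the same cell, which is the reason for preferring large buttons.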


5.6.3 Operation

When the user wants to input a particular character, the user must focus on the cell that contains that character for a period of time (two seconds in our case) to enter the single-character selection sub-page. With this selection method, every letter can be typed in two steps.

Figure 5.34: GazePad operating screen

For example, to type the letter “s”, I first focus on the cell in the first row, second column, and the single-character selection page is entered after 2 seconds. I then focus on the letter “s” in the second row, third column for 2 seconds. After that, the letter “s” is typed.

Figure 5.35: Letter selection process
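
The two-step dwell selection walked through above can be sketched as follows. The class `DwellSelector` and the simulated gaze samples are hypothetical and only illustrate the timing logic; cells are given as 0-based (row, column) pairs, so the first row, second column is (0, 1).

```python
class DwellSelector:
    """Report a cell once the gaze has stayed on it for dwell_time seconds."""

    def __init__(self, dwell_time=2.0):
        self.dwell_time = dwell_time
        self.current = None  # cell currently looked at
        self.since = None    # time at which the gaze entered that cell

    def update(self, cell, now):
        """Feed one gaze sample; return the selected cell or None."""
        if cell != self.current:
            # Gaze moved to another cell: restart the timer.
            self.current, self.since = cell, now
            return None
        if now - self.since >= self.dwell_time:
            self.since = now  # re-arm so the next selection needs a new dwell
            return cell
        return None

# Simulated samples: (time in seconds, cell as (row, col)).
samples = [
    (0.0, (0, 1)), (1.0, (0, 1)), (2.0, (0, 1)),  # step 1: pick the group cell
    (2.5, (1, 2)), (3.5, (1, 2)), (4.5, (1, 2)),  # step 2: pick the letter
]
selector = DwellSelector(dwell_time=2.0)
selections = []
for t, cell in samples:
    chosen = selector.update(cell, t)
    if chosen is not None:
        selections.append(chosen)
print(selections)
```

The first selection would open the single-character page and the second would type the letter, so each character costs at least two dwell periods.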


Apart from letter input, GazePad also supports symbol and number input. As on a mobile phone, the user changes the input mode in order to type symbols and numbers.

Figure 5.36: Different input modes are supported

5.6.3 Performances

In the typing experiment, the string “cityu computer science. jack ^.^”, which contains both letters and symbols, was input using GazePad. The recorded time was about 3.3 minutes, which means that on average about 9 characters can be input per minute. In addition, every letter can be selected in two steps.
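
The typing rate can be checked with a short calculation. It assumes the test string is exactly as printed above and that spaces are counted as typed characters.

```python
# Worked check of the typing rate; the string and the time are from the text,
# counting spaces as characters is an assumption.
text = "cityu computer science. jack ^.^"
minutes = 3.3  # reported typing time

chars = len(text)       # letters, symbols and spaces
rate = chars / minutes  # characters per minute

print(chars, round(rate, 1))
```

Under these assumptions the string has 32 characters, giving roughly 9.7 characters per minute, which is in line with the reported figure of about 9.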


Chapter 6

Conclusions

Critical reviews

Future work

Application area

Project feedback


6.1 Critical reviews

In this project, a gaze tracking library with facial feature tracking, eye feature extraction and head pose estimation was developed. The facial feature tracking is based on active shape model fitting, which is a model-based face tracking algorithm. The active shape model was trained on numerous face images of different individuals, each manually annotated with 68 feature points. The shape resulting from the face model fitting is then used to extract the search region for each of the two eyes individually. The extraction of detailed

infor
