
1077-2626 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TVCG.2019.2938165, IEEE Transactions on Visualization and Computer Graphics


Realtime and Accurate 3D Eye Gaze Capture with DCNN-based Iris and Pupil Segmentation

Zhiyong Wang, Jinxiang Chai, and Shihong Xia

Abstract—This paper presents a realtime and accurate method for 3D eye gaze tracking with a monocular RGB camera. Our key idea is to train a deep convolutional neural network (DCNN) that automatically extracts the iris and pupil pixels of each eye from input images. To achieve this goal, we combine the power of Unet [1] and Squeezenet [2] to train an efficient convolutional neural network for pixel classification. In addition, we track the 3D eye gaze state in the Maximum A Posteriori (MAP) framework, which sequentially searches for the most likely state of the 3D eye gaze at each frame. When eye blinking occurs, the eye gaze tracker can produce an inaccurate result. We further extend the convolutional neural network for eye close detection in order to improve the robustness and accuracy of the eye gaze tracker. Our system runs in realtime on desktop PCs and smart phones. We have evaluated our system on live videos and Internet videos, and our results demonstrate that the system is robust and accurate for various genders, races, lighting conditions, poses, shapes and facial expressions. A comparison against Wang et al. [3] shows that our method advances the state of the art in 3D eye tracking using a single RGB camera.

Index Terms—3D eye gaze tracking, convolutional neural network, facial capture


1 INTRODUCTION

Eye gaze animation is an important component of facial animation, and tracking the eye gaze motion of real people is one of the most appealing approaches to generating eye gaze animations. An ideal solution to the problem of eye gaze tracking is using a single RGB camera to capture the 3D states of eye gaze, as it is a low-cost approach and can potentially use legacy and uncontrolled videos.

Recently, researchers in computer graphics and vision have taken great steps in facial performance capture and eye gaze tracking. Notably, Wang and colleagues [3] present the first realtime 3D eye gaze tracking system by applying random forest classification to extract iris and pupil pixels. However, random forest classifiers have two major limitations. First, they are not very accurate and often produce wrong results in pixel annotation. Second, they require excessive memory and are therefore not appropriate for mobile applications.

This paper introduces a realtime and accurate 3D eye gaze capture method that addresses the limitations of using random forest classifiers for 3D eye gaze tracking. Our key idea is to train a deep convolutional neural network that automatically extracts the iris and pupil pixels of each eye from input images. To achieve this goal, we combine the power of Unet [1] and Squeezenet [2] to train an efficient convolutional neural network for pixel classification. The pipeline of our system is shown in Fig. 1. We start the process by automatically detecting the important 2D facial features and optical flow constraints for each frame. The extracted facial features and optical flow constraints are then used to reconstruct 3D head poses and large-scale facial deformations using multilinear expression deformation models. We formulate the 3D eye gaze tracker in the Maximum A Posteriori (MAP) framework, which sequentially infers the most likely state of the 3D eye gaze at each frame. When eye blinking occurs, the eye gaze tracker can produce an inaccurate result. We therefore also extend our DCNN model for eye close detection and use it to improve the accuracy and robustness of 3D eye gaze tracking.

• Zhiyong Wang and Shihong Xia are with the Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China. They are also with the University of Chinese Academy of Sciences. E-mail: {wangzhiyong, xsh}@ict.ac.cn
• Jinxiang Chai is with Texas A&M University. E-mail: [email protected]

Our system runs in real time on both PCs and smart phones. We have evaluated our system on live videos and Internet videos, and our results demonstrate that the system is robust, accurate and efficient for various races, genders, lighting conditions, head poses, shapes and facial expressions. To assess the quality of the captured eye gaze, we compare the result with ground truth data annotated by human subjects. We evaluate the importance of the key components of our 3D eye gaze tracker and show that our system achieves state-of-the-art accuracy by comparison with a state-of-the-art system [3]. Finally, we show the application of our performance capture system in performance-based facial animation and realtime gaze data capture.

1.1 Contributions

Our work is made possible by two main technical contributions:

• We present an efficient convolutional neural network for iris and pupil classification and eye close detection. More specifically, we combine the power of Unet [1] and Squeezenet [2] to train a compact convolutional neural network suitable for realtime mobile applications.

• We integrate our iris and pupil segmentation as well as eye close detection into a 3D eye tracking framework, achieving a state-of-the-art result on realtime 3D eye tracking with a single RGB camera.

2 BACKGROUND

Our realtime facial performance capture system simultaneously tracks 3D head poses, facial expression deformations and 3D eye gazes using a single RGB camera.


[Fig. 1 diagram: input video frames → 2D feature detection, optical flow computation and edge detection → 3D facial reconstruction; eye image patches → DCNN based segmentation and eye close detection → 3D eye gaze tracking → output: eye gaze states, 3D head pose, facial expression.]

Fig. 1: The pipeline of our system.

Therefore, we focus our discussion on methods and systems developed for acquiring 3D facial performances, image segmentation and eye gaze motions.

2.1 Facial Performance Capture

Facial performance capture has been explored for decades in computer vision and graphics. The traditional methods for facial performance capture can be divided into three categories: marker-based motion capture systems [4], [5], marker-less facial capture systems that use depth and/or color data obtained from structured light systems [6]–[9], and multiview stereo reconstruction [10]–[13]. Recent advances in 3D depth sensing have enabled a number of facial performance capture techniques using RGBD cameras [14]–[20].

A more attractive solution for facial performance capture is using a single RGB camera, as it is low-cost and easy to set up. These methods [21] first locate semantic facial landmarks in a video frame and then use them to drive 3D facial animation. Recently, several techniques have been proposed for locating/tracking facial landmarks, including the constrained local model [22], [23] and boosted regression [24]–[28].

Recently, Cao and colleagues [29], [30] extend the idea of cascaded shape regression [24] for 3D facial capture. They first propose an online adaptation method to regress and refine the camera parameters and user identity during tracking. Then, Cao and colleagues [30] further extend the idea to reconstruct high-fidelity facial details such as wrinkles in realtime using shading cues.

In addition to online techniques, shape-from-shading techniques have been proposed to capture detailed and dynamic 3D faces [30]–[32] in offline systems. Wu et al. [33] present an anatomically-constrained local face model to reconstruct facial performances at very high quality. Wen et al. [34] use an eyelid linear model to reconstruct the detailed motion of the eyelid.

None of these systems capture 3D eye gaze movement from 2D images, which is the focus of our work. It is also worth mentioning that our framework is flexible in that it can be incorporated with any 3D facial performance capture system to capture coordinated movements between 3D head poses, facial deformations and eye gazes.

2.2 Eye Gaze Tracking

Eye gaze tracking and eye detection have a long history in the fields of human computer interaction and computer vision. Previous methods mainly focused on 2D eye gaze detection and tracking. These methods can be classified into two categories: IR-illumination based approaches and image based approaches. The IR based approaches make use of the corneal reflection under IR illumination to efficiently detect the iris and pupil regions, while the image-based approaches focus on reconstructing the eye gaze from the appearance of the human eyes.

The simplicity and effectiveness of the IR based methods [35] have made them among the most successful approaches for eye gaze capture. Almost all commercial eye trackers [36]–[38] are based on this technique. However, the IR based methods require the users to wear special glasses or set up an IR device for eye gaze capture. Moreover, the IR based methods often focus on 2D eye gaze capture. Therefore, these methods are not flexible for capturing 3D eye gazes in RGB videos.

The traditional image-based methods can be divided into three categories: template-matching methods [39], [40], appearance-based methods [41], [42], and feature-based methods [43], [44]. However, none of these methods is robust enough to handle variations in subjects, head poses, facial expressions and lighting conditions. Wood et al. [45] introduced a 3D eye region model for modeling the shape and texture of the region around the eyes. They successfully achieved redirection of the eye gaze for a single RGB image or video sequence. However, owing to the complexity


of the derivative of the loss function, they used numerical central derivatives in optimization, making the system unable to work in realtime. Berard et al. [46] built a system to acquire the shape and texture of an eye at very high resolution for offline usage. Recent approaches mainly focus on cascaded shape regression [24] and deep learning-based methods [47]–[51] for pupil center detection. By adding two extra landmarks at the centers of the pupils, these methods formulate the problem as a regression problem, and they are able to detect the facial feature points and pupil centers simultaneously. Cheng et al. [52] improved the gaze estimation performance for asymmetric eyes. Avisek et al. [53] trained a gaze estimator on synthetic images and then adopted an adversarial approach such that features of synthetic and real images are indistinguishable. Park et al. [54] estimated 3D gaze direction with an intermediate pictorial representation. Both our approach and theirs take pixel-wise classification as an intermediate pictorial representation rather than regressing for the final target directly. There are two main differences between the two methods: first, we extract the mask of the iris and pupil and then estimate the eye gaze direction by model based tracking, while their intermediate pictorial representation is defined by the eye gaze direction. Second, our algorithm works on video input, while their algorithm estimates the eye gaze direction from a single color image.

Recently, Wang et al. [3] introduced the first realtime 3D eye gaze tracking system for a single RGB camera. In particular, they apply random forest classifiers to extract iris and pupil pixels and use them to sequentially infer the most likely state of the 3D eye gaze at each frame in the MAP framework. More recently, Wen et al. [55] introduced a method for estimating eyeball motions from RGBD inputs by minimizing the differences between a rendered eyeball and a recorded RGB image.

Our work is most similar to Wang et al. [3] because both systems utilize the labeled iris and pupil pixels to sequentially track the 3D eye gaze state in the MAP framework. The most important difference between the two methods is that we use a deep convolutional neural network rather than random forest classifiers for the iris and pupil pixel classification. Fig. 5 and Fig. 8 show that the DCNN achieves much more accurate results than random forest classifiers. Another benefit of using a DCNN for 3D eye tracking is that it is often much more compact than random forest classifiers and therefore is more suitable for memory-constrained platforms such as smart phones. Another difference is that we use the same DCNN, rather than a separately learned random forest classifier, for eye close detection. Moreover, we use optical flow constraints in addition to 2D facial features for 3D facial reconstruction, and we use the Jaccard distance rather than the Euclidean distance between the mask centers to measure the difference between the hypothesized and observed masks of the iris and pupil pixels. We find that the system achieves better results with these two improvements.

Our work on iris and pupil annotation is motivated by the recent success of image segmentation using deep convolutional neural networks. In recent years, DCNNs have advanced the state of the art in image segmentation [1], [56]–[60]. Long et al. [56] showed that a fully convolutional network (FCN), trained end-to-end, pixels-to-pixels on semantic segmentation, could achieve good results without further machinery. Unet [1] and Segnet [58] extended this idea by applying an encoder-decoder architecture for semantic pixelwise segmentation.

We extend the deep convolutional neural network to solve the iris and pupil pixel classification problem for 3D eye gaze tracking. In particular, we combine the power of Squeezenet [2] and Unet [1] to train a compact neural network suitable for 3D eye gaze tracking in mobile applications.

3 OVERVIEW

Our goal is to build a realtime 3D eye gaze tracking system using a single RGB camera. The problem is challenging because an accurate estimate of a 3D eye gaze often requires accurate estimates of 3D head poses and facial deformations around the eyes. Another challenge is handling significant occlusion in eye gaze observations, because the pupil and iris regions are often occluded by the eyelids. In addition, the loss of depth information in the projection from 3D to 2D and unknown camera parameters, lighting conditions and iris color further complicate the reconstruction problem. To address these challenges, we propose an end-to-end facial performance capture system that simultaneously tracks 3D head poses, eye gazes and facial expression deformations with automatic eyeball calibration using a single RGB camera. The whole system consists of four main components summarized as follows. The pipeline of our system is shown in Fig. 1.

2D face and 3D facial reconstruction. We first automatically detect and track important facial features in monocular video sequences using recently developed DCNN based methods [61]. In addition, we compute the optical flow of each pixel in the face region using a recently developed optical flow estimation method [62]. We then perform a data-driven 3D facial reconstruction technique to reconstruct the 3D head pose and large-scale expression deformations using multilinear expression deformation models.

Iris and pupil segmentation. We introduce an efficient user-independent pixel classifier to automatically annotate the iris and pupil pixels in the eye region, which is bounded by the detected facial landmarks around the eye. We combine the power of Unet [1] and Squeezenet [2] to train a compact convolutional neural network suitable for realtime mobile applications. The size of the trained DCNN model is critical for mobile applications because of the limited computational resources on smart phones. We extend the DCNN by introducing an additional output branch of the network for eye close detection, as shown in Fig. 3. In addition, we extract the outer contour of the iris (i.e., the limbus) to further improve the robustness and accuracy of our gaze tracker.

Automatic eyeball calibration. To track the eye gaze states in the video sequences, we need to estimate the geometric shape and position of the eyeballs and the radius of the iris region. We approximate the geometric shape of the eyeball by a sphere. The radius of the eyeball is set to 12.5 mm, which is the average for an adult's eyeballs. We introduce an automatic eyeball calibration procedure that automatically selects a proper input frame for calibration and uses it to estimate the 3D positions of the eyeballs and the radius of the iris.

3D eye gaze tracking. We use the spherical coordinates of each pupil center on the eyeball to represent the state of the 3D eye gaze. We track the state of the eye gaze using the extracted 2D iris and pupil pixels, the outer contour of the iris and the estimated 3D head pose. We formulate the problem in the MAP framework and apply a sampling based method to search for the most likely state of the eye gaze.


4 2D FACE ALIGNMENT AND 3D FACIAL RECONSTRUCTION

Following [3], [26], [32], we reconstruct the 3D head poses and facial deformations from 2D image features, using multilinear face models [24] as the underlying face prior. There are two main differences between our method and theirs: first, we apply a DCNN-based method similar to [61] rather than a random forest based method for 2D face alignment to obtain better robustness and accuracy; second, we apply optical flow constraints for more accurate 3D facial reconstruction.

The detected landmarks are often too sparse to reconstruct the complete facial deformations, which leads to missing information in regions where landmarks are absent, e.g., the cheek and forehead regions. We therefore use temporal coherence as an additional cue for the reconstruction. Inspired by [63] and [64], we apply the fast optical flow estimation method [62] to neighboring frames of the input video inside the face region to extract the motion flow. We then map the motion flow to each vertex projection in screen space by bilinear interpolation. This correspondence between vertices and pixels gives additional cues for the 3D reconstruction.

Assuming an ideal pinhole camera model, the 2D projections of the points on the face can be represented as

P_{2d} = \Pi ( R ( C_r \times_2 w_{id} \times_3 w_{exp} ) + T )    (1)

where R and T are the global rotation and translation of the face model, w_{id} and w_{exp} are the weights for identity and expression, and \Pi is the intrinsic matrix of the camera, which projects the 3D points to the 2D image. We formulate the reconstruction problem as an optimization problem, that is,

\arg\min_{R, T, w_{id}, w_{exp}}  E_{features} + w_{opt} E_{opt} + w_{regid} E_{id} + w_{regexp} E_{exp}    (2)

E_{features}, E_{id} and E_{exp} are described by previous methods [3], [26]. For the optical flow constraint term E_{opt}, we have the following objective function:

E_{opt} = \| p^{t-1}_{2d,i} + U_i - p^t_{2d,i} \|^2    (3)

where p^{t-1}_{2d,i} and p^t_{2d,i} are the 2D projections of the i-th landmark in frame t-1 and frame t, and U_i is the corresponding optical flow of point p^{t-1}_{2d,i} between the two frames. We solve the optimization problem with the Ceres solver [65].
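To make the optical flow term concrete, the following is a minimal NumPy sketch of how E_opt in Eq. 3 could be evaluated for a set of tracked landmarks. The array names, landmark count and toy values are illustrative assumptions, not part of the authors' implementation.

```python
import numpy as np

def optical_flow_energy(p_prev, flow, p_curr):
    """Evaluate the optical flow term E_opt of Eq. 3.

    p_prev : (N, 2) 2D projections of the landmarks in frame t-1
    flow   : (N, 2) optical flow sampled at p_prev (bilinear interpolation)
    p_curr : (N, 2) 2D projections of the landmarks in frame t
    Returns the sum of squared residuals over all landmarks.
    """
    residual = p_prev + flow - p_curr          # per-landmark 2D residual
    return float(np.sum(residual ** 2))

# toy usage with illustrative values
p_prev = np.array([[100.0, 120.0], [140.0, 118.0]])
flow   = np.array([[  1.5,  -0.5], [  1.2,  -0.4]])
p_curr = np.array([[101.2, 119.6], [141.0, 117.8]])
print(optical_flow_energy(p_prev, flow, p_curr))
```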

5 IRIS AND PUPIL SEGMENTATION

In this section, we describe how we extract the features of the iris from the eye image patches. This is one of the main differences between our method and that of Wang et al. [3]. Eye movement is rather complex, including, for example, fixation, saccade and smooth pursuit [66]. When an eye saccade occurs, a temporal tracking method can be vulnerable to error accumulation. To address this challenge, we use a DCNN-based segmentation method to perform a per-frame pixel extraction of the iris and pupil region.

Many researchers [47]–[49], [67] formulate the pupil detection problem as a regression problem. However, as the iris and pupil are often occluded by the eyelids, regressing the position of the pupil from the image may lead to an unstable result. Following [3], we formulate the problem as a classification problem of labeling which pixels in the eye region belong to the iris and pupil region. Then, we track the eye gaze to update the eye state.


Fig. 2: The iris and pupil segmentation. (a): the input image; (b): the segmentation result highlighted by green masks. In addition, our convolutional neural network also outputs the eye close status.

5.1 Eye Image Patch Alignment

To simplify the problem, we first align all the eye image patches to the mean shape. We calculate the similarity transformation based on the feature points on the eyelid, as shown in Fig. 4. As human eyes are symmetrical, we mirror the right eye image patches and adjust the indices and coordinates of the eyelid feature points accordingly. We align all the image patches to the mean shape of the feature points of the left eye. Then, we train the network with the aligned image patches. For testing, the image patches of the right eyes are first mirrored and then aligned by feature points. We then obtain the segmentation and eye close detection results from the trained network. Finally, an inverse transformation of the image alignment is used to convert the result back to the original image space.
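As an illustration of this alignment step, the sketch below uses OpenCV to estimate a similarity transform from the eyelid landmarks of an eye patch to the mean shape, warp the patch, and later map a result mask back to the original image space. The patch size and the exact warping calls are assumptions for the example; the paper does not prescribe a specific implementation.

```python
import cv2
import numpy as np

PATCH_SIZE = (96, 48)  # (width, height) of the aligned patch, an assumption

def align_eye_patch(image, eyelid_pts, mean_shape, is_right_eye=False):
    """Warp an eye patch to the mean eyelid shape with a similarity transform."""
    pts = eyelid_pts.astype(np.float32).copy()
    if is_right_eye:
        # mirror the right eye horizontally so both eyes share one model
        image = cv2.flip(image, 1)
        pts[:, 0] = image.shape[1] - 1 - pts[:, 0]
    M, _ = cv2.estimateAffinePartial2D(pts, mean_shape.astype(np.float32))
    aligned = cv2.warpAffine(image, M, PATCH_SIZE)
    return aligned, M

def unalign_mask(mask, M, original_size):
    """Map a segmentation mask from aligned space back to the original patch."""
    M_inv = cv2.invertAffineTransform(M)
    return cv2.warpAffine(mask, M_inv, original_size)
```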

5.2 Iris and Pupil Segmentation

In this section, we describe how we solve the image segmentation problem for the aligned image patches. We apply a convolutional neural network to predict the probability that each pixel belongs to the iris and pupil region and, at the same time, obtain the eye close detection result.

DCNN-based segmentation. Given an input eye image patch, the goal of image segmentation is to partition the image into multiple sets of pixels. More precisely, it aims to assign a label to every pixel in the image such that pixels belonging to the same object have the same label. The network takes an eye image patch as input and then generates a probability map with the same size as the input image, where each pixel of the probability map indicates the probability that the corresponding pixel of the input image belongs to the iris and pupil region. An example is shown in Fig. 2.

Eye Close Detection. Eye blinking and eyelid occlusions are natural and frequent actions during facial capture. When eye blinking or eyelid occlusion occurs, the segmentation result and edge map of the eye gaze detector may become unstable, which leads to an inaccurate 3D eye gaze result. We use eye close detection to handle these failure cases by adding an additional branch to the network. After the result of the eye close detection is obtained, if we find that one eye of the user is closed, we use the state of the open eye for both eyes. If both eyes of the user are closed, we use the eye states from the previous frame to obtain a stable visualization.
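This fallback strategy can be summarized as a small decision rule. The sketch below is our reading of the logic described above, with a hypothetical 0.5 probability threshold for deciding that an eye is closed.

```python
def resolve_gaze_states(left_state, right_state, p_left_closed, p_right_closed,
                        prev_left_state, prev_right_state, closed_thresh=0.5):
    """Replace the gaze state of a closed eye, as described in the text."""
    left_closed = p_left_closed > closed_thresh
    right_closed = p_right_closed > closed_thresh
    if left_closed and right_closed:
        # both eyes closed: reuse the states from the previous frame
        return prev_left_state, prev_right_state
    if left_closed:
        # left eye closed: copy the state of the open right eye
        return right_state, right_state
    if right_closed:
        return left_state, left_state
    return left_state, right_state
```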

Network Structure. We combine the power of Unet [1] and Squeezenet [2] to solve the problem. We apply the convolutional encoder-decoder framework of Unet for the segmentation problem and use two branches on top of the encoder network to learn the segmentation and eye close detection simultaneously.


[Fig. 3 diagram: input image 48×96×3; encoder with conv1 and fire1–fire4 modules interleaved with 2×2 max pooling (24×48×8, 12×24×16, 6×12×32); decoder with fire1_d–fire3_d modules and 2×2 up-convolutions (24×48×8, 48×96×4) producing a 48×96×1 segmentation output; a 2×1 eye close branch; inset: the 'fire' module, a 1×1 squeeze convolution feeding concatenated 1×1 and 3×3 expand convolutions.]

Fig. 3: The network structure of our system. We use an encoder-decoder framework similar to Unet [1] to solve the problem. We show the structure of the 'fire' module [2] in the bottom left corner. A 'fire' module comprises a squeeze convolution layer (which has only 1×1 filters) feeding into an expand layer that has a mix of 1×1 and 3×3 convolution filters. The number of parameters of the network is considerably decreased by this structure.


Fig. 4: Image alignment for eye patches. (a): the eye image patch; (b): the mean shape; (c): the aligned image. We compute a similarity transformation between the eyelid feature points of (a) and (b) to warp the test image to the mean shape and obtain the aligned image (c).

We use a decoder network for segmentation and a single fully connected layer for eye close detection. The network structure is shown in Fig. 3. We replace the 3×3 convolutional layers in Unet with the 'fire' module of Squeezenet to reduce the number of parameters and the running time of the network, which is important for mobile applications. The 'fire' modules reduce the number of parameters from 500k to 50k and the floating-point operations from 132.53M to 5.81M while still obtaining similar segmentation results.

The size of the original input image patch is 48×96, with 3 channels. We use the 'fire' module at every image resolution level. We apply max pooling layers between neighboring image levels and multiply the channel number by 2 after each level. After 4 levels of the encoder, we obtain a feature map with a size of 6×12, with 32 channels. The eye close detection branch is a single 2×1 fully connected layer for predicting the probability that the eye in the image patch is closed.

For the decoder branch, our goal is to upsample the feature map to the size of the original image. We use 2×2 up-convolutions to upsample the feature map at every level. After each upsampling, we concatenate the upsampled feature image with the corresponding encoder feature image to combine the more explicit information from the higher layers and the location information from the lower layers of the encoder. Corresponding to the pooling layers in the encoder, the 2×2 up-convolutions are used four times to obtain a segmentation result with the same resolution as the input images. We only show the convolutional layers, up-convolutional layers and pooling layers in Fig. 3; rectified linear units (ReLUs) are used after every convolutional and up-convolutional layer.
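For readers who prefer code, the following PyTorch sketch mirrors the architecture described above: 'fire' modules in an encoder-decoder with skip connections, a per-pixel segmentation head, and a small fully connected eye-close head. The original network was trained in Caffe, and this sketch uses three resolution levels and small channel counts for brevity (the paper's encoder reaches a 6×12×32 bottleneck over four levels), so treat it as an approximation rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Fire(nn.Module):
    """SqueezeNet 'fire' module: 1x1 squeeze, then concatenated 1x1/3x3 expand."""
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, 1)
        self.expand1 = nn.Conv2d(squeeze_ch, expand_ch // 2, 1)
        self.expand3 = nn.Conv2d(squeeze_ch, expand_ch // 2, 3, padding=1)

    def forward(self, x):
        x = F.relu(self.squeeze(x))
        return F.relu(torch.cat([self.expand1(x), self.expand3(x)], dim=1))

class EyeNet(nn.Module):
    """U-Net-style encoder-decoder built from fire modules, with two heads."""
    def __init__(self):
        super().__init__()
        self.enc1 = Fire(3, 4, 8)       # 48x96, 8 channels
        self.enc2 = Fire(8, 4, 16)      # 24x48, 16 channels
        self.enc3 = Fire(16, 8, 32)     # 12x24, 32 channels (bottleneck here)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec2 = Fire(32, 8, 16)     # concat(16 upsampled, 16 skip)
        self.up1 = nn.ConvTranspose2d(16, 8, 2, stride=2)
        self.dec1 = Fire(16, 4, 8)      # concat(8 upsampled, 8 skip)
        self.seg_head = nn.Conv2d(8, 1, 1)             # per-pixel iris/pupil logit
        self.close_head = nn.Linear(32 * 12 * 24, 2)   # eye open/closed logits

    def forward(self, x):                 # x: (B, 3, 48, 96)
        e1 = self.enc1(x)                 # (B, 8, 48, 96)
        e2 = self.enc2(self.pool(e1))     # (B, 16, 24, 48)
        e3 = self.enc3(self.pool(e2))     # (B, 32, 12, 24)
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))   # (B, 16, 24, 48)
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))   # (B, 8, 48, 96)
        seg = torch.sigmoid(self.seg_head(d1))                 # probability map
        close = self.close_head(e3.flatten(1))                 # (B, 2)
        return seg, close

net = EyeNet()
seg, close = net(torch.randn(1, 3, 48, 96))
print(seg.shape, close.shape)   # (1, 1, 48, 96) and (1, 2)
```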

Network Training. We use the cross-entropy loss to measure the discrepancy between the ground truth and the segmentation result, which can be written as

E = \arg\min_{\theta} - \sum_{i \in \Omega} [ w_i \log(P_i) + (1 - w_i) \log(1 - P_i) ]    (4)

where \Omega is the set of all pixels in the image patch, \theta denotes the parameters of the neural network, and w_i is the given label of pixel i, with w_i = 1 for iris and pupil pixels and w_i = 0 for other pixels. P_i is the output of the convolutional neural network described above, i.e., the predicted probability that pixel i belongs to the iris and pupil region. The loss term for eye close detection is the cross-entropy loss on the output label. The input images and their corresponding segmentation maps are used to train the network with the stochastic gradient descent implementation of Caffe [68]. The weights of the segmentation loss and the eye close label loss are set to 2 and 50, respectively. We set the initial learning rate, momentum, batch size and weight decay to 0.00002, 0.9, 32 and 1e-7, respectively. We drop the learning rate by a factor of 1.5 after every 10k iterations and train the network for 100k iterations.
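A minimal sketch of the combined training loss follows, assuming the two-head network sketched above and the loss weights reported in the text (2 for segmentation, 50 for the eye-close label). The authors trained with Caffe's SGD; this illustration uses PyTorch loss functions instead.

```python
import torch
import torch.nn.functional as F

def combined_loss(seg_prob, seg_gt, close_logits, close_gt,
                  w_seg=2.0, w_close=50.0):
    """Weighted sum of the per-pixel cross-entropy (Eq. 4) and the eye-close loss.

    seg_prob     : (B, 1, H, W) predicted iris/pupil probabilities in [0, 1]
    seg_gt       : (B, 1, H, W) binary ground truth masks
    close_logits : (B, 2) logits of the eye-close branch
    close_gt     : (B,)   0 = open, 1 = closed
    """
    seg_loss = F.binary_cross_entropy(seg_prob, seg_gt)   # Eq. 4, averaged over pixels
    close_loss = F.cross_entropy(close_logits, close_gt)
    return w_seg * seg_loss + w_close * close_loss

# illustrative usage with random tensors
seg_prob = torch.rand(4, 1, 48, 96)
seg_gt = (torch.rand(4, 1, 48, 96) > 0.8).float()
close_logits = torch.randn(4, 2)
close_gt = torch.randint(0, 2, (4,))
print(float(combined_loss(seg_prob, seg_gt, close_logits, close_gt)))
```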


We show a simple comparison between our result and the result from [3] in Fig. 5. It shows that our method achieves a more accurate result. A more complete comparison is discussed in Section 7.2.


Fig. 5: Comparison between our result and the result from the state-of-the-art method [3]. (a): the original input image; (b): our result; (c): the result from the state-of-the-art method [3]. It is clear that our method achieves a better result.

6 3D EYE GAZE TRACKING

In this section, we describe our tracking procedure for the 3D eye gaze. We follow the same approach as [3], with the exception of the segmentation loss term and the automatic calibration. We describe it here for completeness of the paper. Given a 3D head pose and the extracted 2D iris and pupil mask, we first automatically calibrate the parameters of the eye gaze, including the positions of the eyeball centers and the radius of the iris. Then, we describe our model based tracking method. We formulate the problem in the Maximum A Posteriori (MAP) framework and apply a sampling based algorithm to search for the globally optimal solution of the problem.

Following [3], we represent the eye gaze state V as:

V = ( P_x, P_y, P_z, s, \theta, \phi )    (5)

where P_x, P_y, P_z represent the 3D eye center in face model space, s is the radius of the iris and pupil, and \theta and \phi are the spherical coordinates of the iris center on the eyeball. With these six variables, the state of one eye is determined in the current frame. \theta and \phi are updated frame by frame, while P_x, P_y, P_z and s are constant within the same video. We calibrate these parameters before the tracking starts.

6.1 Automatic Eyeball Calibration

As the eyeball center and iris radius are person-specific, we calibrate these parameters (P_x, P_y, P_z, and s) before the tracking starts. We perform the calibration only once for each capture because these parameters are constant for the same subject.

After the user appears on camera, we start reconstructing the head poses and the 3D facial deformations simultaneously. Once we find that the user is facing towards the camera, looking forward, and with both eyes open in the current frame, we select the current frame for calibration. The criterion for facing towards the camera is defined as cos α > 0.98, where α is the angle between the camera direction and the facing direction of the user. The criteria for looking forward and eye open are d_eyelid > α·d_corner and d_center > β·d_corner, where d_eyelid is the distance between the centers of the feature points of the upper and lower eyelids, d_corner is the distance between the inner and outer eye corners, and d_center is the distance between the mean-shift center of the 2D segmentation result and the center of the 2D feature points on the eyelid. In practice, α and β are set to 0.22 and 0.1, respectively.
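A sketch of the calibration-frame selection test, coded directly from the criteria and thresholds stated above; the direction vectors and distance inputs are illustrative assumptions, and the inequalities follow the text as printed.

```python
import numpy as np

def is_calibration_frame(cam_dir, face_dir, d_eyelid, d_corner, d_center,
                         cos_thresh=0.98, alpha=0.22, beta=0.1):
    """Return True if the frame satisfies the calibration criteria in the text.

    cam_dir, face_dir : 3D unit-length direction vectors
    d_eyelid : distance between the centers of upper and lower eyelid landmarks
    d_corner : distance between the inner and outer eye corners
    d_center : distance between the mean-shift center of the segmentation mask
               and the center of the eyelid landmarks
    """
    facing_camera = float(np.dot(cam_dir, face_dir)) > cos_thresh
    eye_open = d_eyelid > alpha * d_corner
    looking_forward = d_center > beta * d_corner   # criterion as stated in the text
    return facing_camera and eye_open and looking_forward
```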

Once we have selected a specific frame for calibration, we back-project the mean-shift center of the 2D mask and the edge pixels onto the 3D model using the 3D head pose to obtain the 3D position of the iris center and the 3D iris radius. We could fit a sphere from the image appearance for the calibration; however, since the eyeballs are not completely visible, the estimated result could be inaccurate. To obtain better robustness, we use a fixed eyeball radius of r = 12.5 mm, which is the average for adults. Finally, we add a vector of (0, 0, −r) to the center of the iris to obtain the 3D eyeball center.

6.2 3D Eye Gaze Tracking

Once the positions of the eyeball centers and the radius of the iris are calibrated, we can perform eye gaze tracking using the iris and pupil mask and the iris edge to obtain the spherical coordinates of the eye gaze for each frame.

In addition to fixation and smooth pursuit, the human eye gaze can perform saccades, which means that temporal tracking can lead to error accumulation or tracking loss. On the other hand, extracting the pupil centers from features of the current frame, though robust, may not be temporally smooth.

We combine the iris features and the temporal prior into an MAP framework to address these challenges. We then solve the problem with a sampling based algorithm because of the difficulty of computing the derivatives.

The optimization problem can be written as:

x_t^* = \arg\max_{x_t} P( x_t | O_t, x_{t-1} )    (6)

where x_t and x_{t-1} are the eye states of the current frame t and the previous frame t-1, and O_t are the detected iris features in the current frame. Assume that O_t and x_{t-1} are conditionally independent given x_t. Using Bayes' rule, the objective function can be written as:

x_t^* = \arg\max_{x_t} P( O_t | x_t ) P( x_{t-1} | x_t )    (7)

where P(O_t | x_t) measures how well state x_t fits the extracted iris features O_t, and P(x_{t-1} | x_t) measures how well the states x_t and x_{t-1} satisfy the temporal distribution.

The iris features from the eye image patches consist of two parts: the iris and pupil mask and the iris edge map. The iris feature likelihood is formulated as follows:

P( O_t | x_t ) \propto \exp( - w_{mask} E_{mask} - w_{edge} E_{edge} )    (8)

where w_{mask} and w_{edge} are the weights of the mask term and the edge term, which are set to 3 and 1, respectively.

Iris and pupil mask likelihood. The mask term measures the difference between the observed iris mask and the hypothesized mask by the Jaccard distance loss [69], [70], which is obtained by subtracting the Intersection over Union (IoU) from 1. It can be written as:

E_{mask} = 1 - \frac{ |A \cap B| }{ |A \cup B| }    (9)

where A is the intersection of the eye mask and the 2D iris mask from the image segmentation, and B is the intersection of the eye mask and the hypothesized 2D iris, as shown in Fig. 6. This objective function is 0 when the two masks coincide entirely and 1 when the two masks have no overlap.
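As a concrete example, E_mask of Eq. 9 can be computed from two binary masks as follows; the rasterization of the hypothesized iris disk and the clipping to the eye region are assumed to happen elsewhere.

```python
import numpy as np

def jaccard_distance(observed_mask, hypothesized_mask):
    """E_mask = 1 - IoU of two binary masks (Eq. 9), both already clipped
    to the eye region."""
    a = observed_mask.astype(bool)
    b = hypothesized_mask.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0                      # no overlap information available
    intersection = np.logical_and(a, b).sum()
    return 1.0 - intersection / union
```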

Fig. 6: The definition of the Jaccard distance loss. (a): the segmentation result of the iris and pupil; (b): the hypothesized mask of the iris and pupil; (c): overlapping display of the two masks. The Jaccard distance loss is defined as 1 − intersection/union. In (c), we denote the sizes of the green, yellow and red areas as G, Y, and R. The Jaccard distance loss can then be written as (R + G) / (R + G + Y).

Iris edge likelihood. In addition to the iris and pupil segmentation information, we apply the Canny edge operator [71] to extract the iris edge map, following [3]. The edge term measures the discrepancies between the edge maps of the observed and rendered images. We use the trimmed chamfer distance to measure the discrepancies between the hypothesized edge and the observed edge: we sum the K smallest distances among the hypothesized edge pixels, with K = 0.6 × (number of hypothesized edge pixels), to obtain a robust result.
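The trimmed chamfer term can be sketched as below with a KD-tree over the observed edge pixels; the 0.6 trimming ratio follows the text, while the use of SciPy's cKDTree is our own choice for the example.

```python
import numpy as np
from scipy.spatial import cKDTree

def trimmed_chamfer(hyp_edge_pts, obs_edge_pts, keep_ratio=0.6):
    """Sum of the K smallest nearest-neighbour distances from the hypothesized
    edge pixels to the observed Canny edge pixels, K = keep_ratio * #hyp pixels."""
    tree = cKDTree(obs_edge_pts)
    dists, _ = tree.query(hyp_edge_pts)       # nearest observed edge pixel
    k = max(1, int(keep_ratio * len(hyp_edge_pts)))
    return float(np.sort(dists)[:k].sum())
```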

Dynamic likelihood. Following [3], we also apply a dynamic likelihood term to measure the temporal coherence of the eye gaze motion. We apply a truncated distance loss as the dynamic likelihood, which can be written as:

P( x_{t-1} | x_t ) \propto \exp( - \min( d_{sphere}( x_t, x_{t-1} ), \tau ) )    (10)

where d_{sphere}(x_t, x_{t-1}) is the great-circle distance between the two spherical coordinates. The threshold \tau is set to 0.14 rad (8 degrees). The truncated loss increases with the distance when the distance is smaller than \tau but becomes flat when the distance is larger than \tau. This term effectively ensures the smoothness of the eye gaze motion while still allowing eye saccades to be tracked.

Optimization. The derivatives of the objective function in Eq. 7 are nontrivial to compute. We back-project the 2D mean-shift center of the iris and pupil mask onto the 3D eyeball to initialize the eye gaze state. Similar to [3], we solve the problem using a sampling-based algorithm: we evaluate the probability of each sampled state with the objective function described in Eq. 7.
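Putting the terms together, the sampling-based search can be sketched as below. The candidate-generation scheme (Gaussian perturbations around the back-projected initialization), the sample count, and the spherical-coordinate convention are our assumptions; the likelihood terms follow Eqs. 8–10 with the weights given above, and e_mask and e_edge stand for hypothetical routines that render a candidate state and evaluate the mask and edge energies.

```python
import numpy as np

def great_circle_distance(a, b):
    """Angular distance between two (theta, phi) spherical coordinates,
    assuming theta is latitude and phi is longitude on the eyeball."""
    t1, p1 = a
    t2, p2 = b
    cos_d = (np.sin(t1) * np.sin(t2) +
             np.cos(t1) * np.cos(t2) * np.cos(p1 - p2))
    return float(np.arccos(np.clip(cos_d, -1.0, 1.0)))

def map_search(init_state, prev_state, e_mask, e_edge,
               w_mask=3.0, w_edge=1.0, tau=0.14, n_samples=200, sigma=0.05):
    """Sample gaze states around the initialization and keep the MAP estimate.

    e_mask, e_edge : callables returning the mask/edge energies of a state
    init_state, prev_state : (theta, phi) tuples
    """
    rng = np.random.default_rng(0)
    candidates = [init_state] + [
        (init_state[0] + rng.normal(0, sigma),
         init_state[1] + rng.normal(0, sigma)) for _ in range(n_samples)]
    best_state, best_logp = None, -np.inf
    for s in candidates:
        log_likelihood = -(w_mask * e_mask(s) + w_edge * e_edge(s))     # Eq. 8
        log_prior = -min(great_circle_distance(s, prev_state), tau)     # Eq. 10
        logp = log_likelihood + log_prior
        if logp > best_logp:
            best_state, best_logp = s, logp
    return best_state
```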

7 EVALUATIONS AND RESULTS

We have demonstrated the power of our system on both live video streams captured by a web camera and video sequences downloaded from the Internet. We have also tested our system on smart phones. In addition, we have evaluated the effectiveness and accuracy of the tracking results of our system by comparison with state-of-the-art methods and evaluated the importance of the key components of our system. Our system achieves realtime performance on an Intel(R) i7(R) CPU at 3.70 GHz with an NVIDIA GTX 1080Ti graphics card. For each input frame, it takes 13 ms for 2D and 3D facial expression deformation tracking, 2 ms for iris and pupil segmentation and 3 ms for eye gaze tracking for each eye.

7.1 Test on Real Data

Test on online video streams and Internet videos. We have tested our eye gaze tracker on different subjects (Fig. 9 and Fig. 10) to demonstrate its robustness and effectiveness. In the online test, we use a web camera with a resolution of 800 × 600. The accompanying video shows that the system can track the eye gaze robustly and accurately, even under fast eye gaze rolling, extreme illumination changes and large head pose changes. The system is fully automatic. The results are best seen in the accompanying video.

Test on smart phones. Our system can also run on smart phones. We have tested our system on an iPhone 8 and achieved a frame rate of 14 frames per second (fps). Fig. 7 shows an example of our result on the iPhone. Please refer to the accompanying video for the animation results.

Fig. 7: Our result on a smart phone. We achieve a frame rate of 14 frames per second (fps) on an iPhone 8.

7.2 Comparison with the state-of-the-art methods

In this section, we evaluate the effectiveness and accuracy of our system by comparison against the state-of-the-art 3D eye gaze tracking method of Wang et al. [3] and a state-of-the-art image-based eye gaze estimation method [54]. The comparison is evaluated on video sequences containing subjects under various poses and lighting conditions.

Comparison with Wang et al. [3]. We first show the superiority of our DCNN based segmentation over the random forest of [3] for 2D iris and pupil extraction in Fig. 8. Our method achieves a better result in all the cases for 2D iris and pupil segmentation. The probability of eye close predicted by our network is labeled at the bottom of every image. Moreover, our model requires only 50k parameters, because we have only one small fully connected layer and the 'fire' modules significantly reduce the number of parameters in the convolutional layers. For comparison, the method of [3] requires millions of parameters.

To compare the 3D tracking results with the ground truth, we project the 3D positions of the pupil centers to the 2D image. The image frames with closed eyes are excluded from the evaluation because the ground truth data for those frames are hard to obtain. We then manually label the 2D pupil centers for the remaining image frames to obtain the ground truth data for comparison.



Fig. 8: Our 2D segmentation results with various genders, head poses, and eye states. The input images (top rows), the results of our DCNN based segmentation (middle rows), and the results from the random forest of [3] (bottom rows). We also show the eye close probability predicted by the network at the bottom of the images.

We measure the distance between the projected 2D pupil centers and the ground truth. The error metric is defined as:

E = \frac{ \| p_l - p'_l \| + \| p_r - p'_r \| }{ 2 \| p_l - p_r \| }    (11)

where p_l and p_r are the projected 2D pupil centers of the left and right eyes and p'_l and p'_r are the corresponding ground truth positions.

Fig. 11 shows that our method obtains a steeper cumulative error curve than that of Wang et al. [3] on both video clips, which demonstrates that our method is more robust and accurate. Fig. 12 shows that our method also obtains qualitatively more accurate results.

Comparison with Park et al. [54]. We have evaluated the effectiveness of our algorithm by comparing with a state-of-the-art method in image-based gaze estimation [54]. To obtain ground truth 3D eye gaze data, we manually labeled pupil centers in images and used them along with 3D eyeball centers estimated from the images to reconstruct 3D eye gaze directions. We measured the errors of the eye gaze directions in degrees.

We evaluated both methods on the same sets of training data and testing data. We used the code provided by the authors [54] for comparison. We computed the errors of eye gaze estimation for both methods. The results are shown in Table 1. Our errors are at approximately the same level as theirs. It is worth pointing out that our algorithm is designed to achieve real-time performance on mobile phones. Therefore, our deep learning network is much smaller than that of [54]: it has only 22% of the parameters and 6.8% of the floating-point operations of [54].

7.3 Evaluation of the key components

We evaluate the key components of our 3D facial and eye gaze tracker in this section.


Fig. 9: Our 3D facial and eye gaze tracking results with various genders, head poses, and races.


Fig. 10: More results.


Fig. 11: Quantitative comparison with the results of Wang et al. [3] on two video clips. We achieve a more accurate result mainly because the DCNN-based segmentation is more accurate than the random forest-based method.

TABLE 1: Mean error of the eye gaze directions in degrees.

method              mean error
our method          8.4241
Park et al. [54]    6.1909

Evaluation on 3D facial reconstruction. Although 3D facial reconstruction is not our contribution, we would like to discuss the benefit of the optical flow term in the objective function. We captured a head-mounted video of an actor, in which the actor's head pose is constant throughout the video. We report the standard deviations of the reconstructed head pose parameters with/without the optical flow term in Table 2. The result with the optical flow term is more consistent than that without it. We also show the distribution of the translation on the z-axis in Fig. 13.

Evaluation on 3D eye gaze tracking. In the eye gaze tracking, we apply the IoU loss rather than the distance between the hypothesized and observed iris centers that is used by [3].


Fig. 12: Qualitative comparison with the results of Wang et al. [3]. (a): the original input image; (b): our result; (c): the result from [3]. Both methods share the same calibration of the 3D eye center and iris radius. We achieve a more accurate result than [3].

[Fig. 13 plot: histogram of the head-pose translation on the z-axis (range 26.5–30.5); y-axis: percentage of occurrence (0–0.6); two curves: with and without the optical flow term.]

Fig. 13: Distribution of the translation on the z-axis (the axis that is perpendicular to the camera plane) of the head pose parameters with/without the optical flow term in a head-mounted video.

We have evaluated the importance of this improvement for the eye gaze tracking process. We track the motion of the eye gaze with the two methods separately and compare the results in Table 3. The results show that this measurement improves the performance.

7.4 Application

We now discuss the application of our facial and eye gaze performance capture system in realtime eye gaze capture and visualization.

Realtime eye gaze capture. With the reconstructed 3D head pose and the eye gaze states from our tracking system, we can locate eye gaze points (fixations) on a screen at runtime and visualize the points.

Before the capture starts, we ask the actor to look at nine predefined points on the screen (top left, top middle, top right, middle left, center of the screen, middle right, bottom left, bottom middle and bottom right). We track the global head poses and eye gaze directions and then obtain the eye gaze point on the camera image plane as the intersection of the eye gaze direction and the camera image plane. We then compute a linear transformation between the coordinates on the camera image plane and the screen plane by formulating a least squares problem. We visualize the eye gaze points on the screen as a white point. We show an example in Fig. 14. This result is best seen in the accompanying video.
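The mapping from image-plane gaze points to screen coordinates can be obtained with an ordinary least-squares fit. The affine (2×3) parameterization below is one plausible form of the linear transformation mentioned above, using the nine calibration correspondences; the paper does not specify the exact parameterization, so treat this as an illustrative sketch.

```python
import numpy as np

def fit_screen_mapping(gaze_pts, screen_pts):
    """Fit screen = A @ [x, y, 1]^T from calibration correspondences.

    gaze_pts   : (N, 2) gaze intersections with the camera image plane
    screen_pts : (N, 2) known screen coordinates of the calibration targets
    Returns a (2, 3) affine matrix in the least-squares sense.
    """
    X = np.hstack([gaze_pts, np.ones((len(gaze_pts), 1))])   # (N, 3)
    A, *_ = np.linalg.lstsq(X, screen_pts, rcond=None)       # (3, 2)
    return A.T                                               # (2, 3)

def gaze_to_screen(A, gaze_pt):
    return A @ np.array([gaze_pt[0], gaze_pt[1], 1.0])

# nine-point calibration example with synthetic data
gaze = np.random.rand(9, 2)
screen = gaze * [1920, 1080] + [5.0, -3.0]        # hypothetical relationship
A = fit_screen_mapping(gaze, screen)
print(gaze_to_screen(A, gaze[0]), screen[0])
```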


TABLE 2: Standard deviations of the head pose parameters with/without the optical flow term.

            roll     pitch    yaw      tx       ty       tz
without     0.0091   0.0063   0.0255   0.0040   0.0165   0.6284
with        0.0071   0.0080   0.0236   0.0033   0.0145   0.2865

TABLE 3: MSE of the pupil position for the center distance loss and the IoU loss. The error is normalized by the distance between the pupils.

                        MSE
center distance loss    0.0388
IoU loss                0.0381

Fig. 14: Example of tracking eye gaze points (fixations) on a screen when the user is watching a video of a cheetah.

8 CONCLUSION

In this paper, we demonstrate an end-to-end realtime system that captures 3D head poses, facial expression deformations and 3D eye gaze states using a single RGB camera. Our key idea is to train a deep convolutional neural network for automatic iris and pupil annotation and eye close detection. We formulate the 3D eye gaze tracker in the Maximum A Posteriori (MAP) framework, which sequentially infers the most likely state of the 3D eye gaze in each frame. Our system advances the state of the art in 3D eye gaze tracking using a single RGB camera. This approach is appealing for facial and eye gaze performance capture because it is fast, fully automatic and accurate. Our system runs in real time on desktop PCs and smart phones. We have tested our system on both live video streams and Internet videos, demonstrating that it is accurate and robust under a variety of uncontrolled lighting conditions and across significant differences in gender, race, pose, shape and facial expression.


Zhiyong Wang received the BSc degree in automation from Tsinghua University (THU), China, in 2011. He is currently working toward the PhD degree in computer science at the University of Chinese Academy of Sciences, supervised by Prof. Shihong Xia.


Jinxiang Chai received the PhD degree in computer science from Carnegie Mellon University (CMU). He is currently an associate professor in the Department of Computer Science and Engineering at Texas A&M University. His primary research is in the area of computer graphics and vision, with broad applications in other disciplines such as virtual and augmented reality, robotics, human-computer interaction, and biomechanics. He received an NSF CAREER award for his work on the theory and practice of Bayesian motion synthesis.

Shihong Xia received the PhD degree in computer science from the University of Chinese Academy of Sciences. He is currently a professor at the Institute of Computing Technology, Chinese Academy of Sciences (ICT, CAS), and director of the human motion laboratory. His primary research is in the area of computer graphics, virtual reality and artificial intelligence.