Hybrid On-line 3D Face and Facial Actions Tracking in RGBD Video Sequences

Hai X. Pham, Vladimir Pavlovic
Department of Computer Science

Rutgers University, Piscataway, NJ 08854
Emails: [hxp1,vladimir]@cs.rutgers.edu

Abstract—In this paper, we propose a hybrid model-based tracker for simultaneous tracking of 3D head pose and facial actions in sequences of texture and depth frames. Our tracker utilizes a generic wireframe model, the Candide-3, to represent facial deformations. This wireframe model is initially fitted to the first frame by an Iterative Closest Point algorithm. Given the result on the first frame, our tracking algorithm combines the Iterative Closest Point technique and an Appearance Model for head pose and facial actions tracking. The tracker is capable of adapting on-line to changes in the appearance of the target, so a prior training process is avoided. Furthermore, the tracking system works automatically, without any intervention from human operators.

I. INTRODUCTION

Tracking human faces has remained an active research area in the computer vision community for a long time due to its usefulness in a number of applications, such as video surveillance, expression analysis and human-computer interaction. An automatic vision-based tracking system is desirable, and such a system should be capable of recovering the head pose and facial features, or facial actions. This is a non-trivial task because of the highly deformable nature of faces and their rich variability in appearance.

Face tracking has been performed accurately by feature-based trackers [1][2]. In [1], a two-stage approach was developed for 3D head pose and facial feature tracking in monocular video sequences: 3D facial deformations are learned from stereo data and the features are tracked by optical flow. In [2], tracking is obtained via a set of linear predictors modeling pixel intensities, and a statistical method is used to identify relevant areas to locate feature points.

Also based on feature points and their local information, Active Shape Models (ASMs) [3] use a point distribution model to capture shape variations. The shape consists of a set of landmarks with their local appearance distributions. Shape parameters are updated by iteratively searching for the best nearby match for each landmark point. ASMs are simple and fast. However, they require a large amount of annotated training data to learn shape models, and by their inherent approach ASMs use image information only sparsely.

Active Appearance Models (AAMs) were proposed to overcome the limitations of ASMs by adopting statistical texture models. In [4][5], both shape and texture distribution models are learned from training data. Alternatively, in [6][7], a template is used in place of shape models and texture distribution models are mapped onto that template. In contrast to ASMs, parameters are updated iteratively by minimizing the difference between the incoming texture and the texture mean learned from training data. To avoid the training process of AAMs, some authors have proposed On-line Appearance Models (OAMs) [8][9] for tracking. In these works, the appearance distribution is constantly updated to reflect changes over time and to compensate for drifting. However, a good initialization is required, and it is often performed manually.

In recent years, affordable range sensors have been introduced, such as the Microsoft™ Kinect and similar depth cameras. Face tracking research has since adopted 3D deformable model fitting techniques to take advantage of the now available depth information [10][11]. These works are based on the Iterative Closest Point (ICP) procedure [12][13][14]. In [10], an ICP-based regularized maximum likelihood deformable model fitting algorithm was developed. Feature points are tracked across frames using optical flow and integrated into the framework to track facial movements. This method, while efficient, relies on optical flow tracking of feature points, which often deteriorates over time, causing inaccurate deformations. Following a more graphics-based approach, the authors of [11] developed a face tracking system aimed toward facial animation. Initially, a set of specific facial poses is acquired from the human subject. These poses are then combined to form the final pose in the tracking process. The weights of the shapes are obtained from a maximum a posteriori estimate. Their tracking algorithm is robust; however, it requires a set of specifically designed blendshapes, and there is still a short off-line training procedure before it can work.

In this paper, we propose a hybrid on-line 3D face tracker that takes advantage of both texture and depth information and is capable of tracking 3D head pose and facial actions simultaneously. First, we employ a generic deformable model, the Candide-3, in our ICP fitting framework. Second, we introduce a strategy to automatically initialize the tracker using the depth information. Lastly, we propose a hybrid tracking framework that combines ICP and OAM to exploit the strengths of both techniques. The ICP algorithm, aided by optical flow to correctly follow large head movements, robustly tracks the head pose across frames using depth information and provides a good initialization for OAM. In return, the OAM algorithm maintains the texture model of the face, corrects any drifting incurred by ICP and moves the 3D shape closer to the correct deformation, which then provides ICP with a good initialization in the next frame.


Fig. 1. Candide-3 Wireframe Model and 7 action deformation units. (a) The base shape. (b) Upper Lip Raiser. (c) Jaw Drop. (d) Lip Stretcher. (e) Inner Brow Lowerer. (f) Lip Corner Depressor. (g) Outer Brow Raiser. (h) Lip Presser.

Throughout this paper, it is assumed that the alignment and projection between the color frame, the depth frame and 3D world coordinates are known.

II. THE CANDIDE-3 WIREFRAME MODEL

A. Face Modeling and Parameterization

Candide-3 is a simple, generic wireframe model (WFM) developed by J. Ahlberg [15]. The Candide-3 WFM consists of 113 vertices and 184 triangles constructed from these vertices that define its surface, as shown in Fig. 1(a).

The Candide model was originally designed for model-based face coding, and its deformations are directly related to face biometry and facial expressions. The 3D shape of the model is controlled by a set of Action Deformation Units (AUs) and Shape Deformation Units (SUs). Since every vertex can be transformed independently, each vertex of the model is reshaped according to:

g = p0 + Sσ + Aα   (1)

where p0 is the base coordinates of a vertex p, and S and A are the shape and action deformation matrices associated with vertex p, respectively. σ is the vector of shape deformation parameters and α is the vector of action deformation parameters. In general, the transformation of a vertex given global motion, including rotation R and translation t, is defined as:

p′ = R(p0 + Sσ + Aα) + t   (2)

Notice that we exclude the scaling factor s due to: 1) human heads have similar size in most cases, thus we can calibrate the scale beforehand, and 2) shape deformations will take care of micro-scaling through the fitting procedure. The geometry of the model is parameterized by the vector u = [θx, θy, θz, tx, ty, tz, σ^T, α^T]^T, where θx, θy, θz are three rotation angles and tx, ty, tz are three translation values corresponding to the three axes x, y and z.
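As an illustration, the per-vertex transform of Eq. (2) could be sketched as follows; the array shapes, the Euler-angle convention and the helper names are assumptions made for this sketch, not part of the Candide-3 specification.

```python
import numpy as np

def euler_to_matrix(rx, ry, rz):
    """Rotation matrix from three Euler angles (x-y-z order assumed)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def transform_vertex(p0, S, A, sigma, alpha, angles, t):
    """Eq. (2): p' = R (p0 + S*sigma + A*alpha) + t for a single vertex.
    p0: (3,) base coordinates; S: (3, n_su); A: (3, n_au); t: (3,)."""
    R = euler_to_matrix(*angles)
    return R @ (p0 + S @ sigma + A @ alpha) + t
```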

B. Shape Deformation Formulation

Let us assume that the current shape is defined by a transforming function T: P = T(P0, u). Given new data, we would like to find a new parameter vector u′ such that the new shape P′ = T(P0, u′) fits the new data best. As in any fitting procedure, we want to find a displacement vector ∆u such that u′ = u + ∆u. Substituting u′ into Eq. (2) we have:

P′ = R′[P0 + S(σ + ∆σ) + A(α + ∆α)] + t + ∆t
R′ = R(θx + ∆θx, θy + ∆θy, θz + ∆θz)
∆t = (∆tx, ∆ty, ∆tz)^T   (3)

We make another assumption that the rotation from the current shape P to the new shape P′ is relatively small, and we can use the following approximation:

R(θx + ∆θx, θy + ∆θy, θz + ∆θz) = ∆R · R(θx, θy, θz)   (4)

∆R ≈ I + ∆θx·Rx + ∆θy·Ry + ∆θz·Rz   (5)

Rx = [ 0 0 0; 0 0 1; 0 −1 0 ],   Ry = [ 0 0 −1; 0 0 0; 1 0 0 ],   Rz = [ 0 1 0; −1 0 0; 0 0 0 ]

Substituting (4) and (5) back into (3), we can compute the new shape P′ as:

P′ ≈ P + (Σ_{i=x,y,z} ∆θi·Ri)·P + S∆σ + A∆α + (Σ_{i=x,y,z} ∆θi·Ri)·(S∆σ + A∆α) + ∆t   (6)

From now on, when finding the transformation from shape P to shape P′, we will solve for the unknown vector (∆θ^T, ∆t^T, ∆σ^T, ∆α^T)^T, whose initial values are zeros.
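A minimal sketch of the linearized update of Eqs. (5)-(6), assuming the per-vertex deformation matrices are stacked row-wise into single (3N x n) matrices S and A; the function name and data layout are illustrative only.

```python
import numpy as np

# Small-rotation generators from Eq. (5), written exactly as in the text.
R_x = np.array([[0, 0, 0], [0, 0, 1], [0, -1, 0]], dtype=float)
R_y = np.array([[0, 0, -1], [0, 0, 0], [1, 0, 0]], dtype=float)
R_z = np.array([[0, 1, 0], [-1, 0, 0], [0, 0, 0]], dtype=float)

def predict_new_shape(P, S, A, d_theta, d_sigma, d_alpha, d_t):
    """Linearized shape update of Eq. (6).
    P: (N, 3) current vertices; S: (3N, n_su); A: (3N, n_au); d_t: (3,)."""
    dR = d_theta[0] * R_x + d_theta[1] * R_y + d_theta[2] * R_z   # this is ΔR − I
    deform = (S @ d_sigma + A @ d_alpha).reshape(-1, 3)           # per-vertex SΔσ + AΔα
    return P + P @ dR.T + deform + deform @ dR.T + d_t
```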

C. Initialization and Tracking Steps

As stated in Section II-A, the deformations of the Candide model are designed to cover face biometry (SUs) and facial expressions (AUs). Each person has a unique set of corresponding shape units, whereas facial expressions, or action units, are person-independent. However, according to the transformation in Eq. (6), we optimize both shape and action parameters simultaneously, and it is difficult if not impossible to find the unique set of shape parameters for the test subject. Thus, we use the first frame to calculate the shape parameters; from the second frame onward, we keep the shape parameters fixed and only optimize the action parameters.

Initialization: set the action parameters to zero and optimize over (∆θ^T, ∆t^T, ∆σ^T)^T. This requires a neutral expression in the first frame.

Tracking: keep the shape parameters unchanged and optimize over (∆θ^T, ∆t^T, ∆α^T)^T. In this paper, we track 7 facial actions, as shown in Fig. 1(b-h).

III. INITIALIZATION

In this section, first, we take advantage of the depth information to efficiently find the initial head pose using an SVD-based registration technique and a set of feature points detected in the input texture frame. Next, we introduce our ICP fitting procedure to optimize the transformation parameters on the input point cloud.


Fig. 2. Six point pairs are used in the SVD registration method. Six corner points (red circles) of the model (left) are matched with six corresponding points (green) in the input image (right).

A. Head Pose Estimation

Before the ICP algorithm starts, we find an approximate rigid transformation (rotation, translation) to roughly fit the wireframe model to the face point cloud in the input frame. We use the SVD-based registration method of two 3D point sets in [16] for this purpose.

From observation, we choose six points, the left/right eye corners and the mouth corners, to form these two point sets, since these points can be detected with high accuracy in the input color frame and their relative positions to each other are consistent. Fig. 2 depicts the two point sets. In this work, we use the Viola & Jones face detector [17] and an ASM alignment algorithm [18] to detect these points, but any other algorithm will suffice. In short, we 1) find the feature points in the input color frame; 2) project these points into 3D world coordinates to form a 3D point set; and 3) find the initial rotation R0 and translation t0 using the SVD method.
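For reference, the closed-form SVD alignment of [16] used in this step could look like the following sketch; the function name and the reflection guard are our own, while the method itself is the standard least-squares fit of two ordered 3D point sets.

```python
import numpy as np

def rigid_registration_svd(model_pts, data_pts):
    """Least-squares rigid alignment of two ordered 3D point sets [16].
    model_pts, data_pts: (K, 3) arrays of corresponding points.
    Returns R0, t0 such that R0 @ model + t0 approximates data."""
    mu_m = model_pts.mean(axis=0)
    mu_d = data_pts.mean(axis=0)
    H = (model_pts - mu_m).T @ (data_pts - mu_d)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                        # guard against reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_m
    return R, t
```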

We also pick some landmark points returned by the ASM and form corresponding point pairs with a subset of vertices of the Candide model, and calculate the initial shape parameters by minimizing the total squared distance between the two point sets, in other words, by solving a linear least-squares problem. This step is in fact a simple ICP iteration.
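Under the assumption that the selected landmark vertices and their shape-deformation rows are stacked vertex-major into p0 and S, this linear least-squares step could be sketched as below; the names and stacking order are illustrative, not the paper's exact formulation.

```python
import numpy as np

def initial_shape_params(S, p0, landmarks, R0, t0):
    """Linear least-squares estimate of the shape parameters (Sec. III-A):
    solve R0 (p0 + S sigma) + t0 ~= d for sigma over K landmark vertices.
    S: (3K, n_su); p0, landmarks: flattened (3K,) in vertex-major order."""
    K = landmarks.size // 3
    R_block = np.kron(np.eye(K), R0)                 # block-diagonal rotation
    A = R_block @ S                                  # linear term in sigma
    b = landmarks - np.tile(t0, K) - R_block @ p0    # constants moved to the rhs
    sigma, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sigma
```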

B. Iterative Closest Point for Initialization

After computing the initial estimates in the previous section, we implement an ICP algorithm [19] to optimize the rotation, translation and shape deformation parameters simultaneously. Given a set of N vertices pi of the 3D wireframe model, a transformation function T and the corresponding points di from the input point cloud, we minimize the following error:

E(u) = Σ_{i=1..N} (di − T(pi, u))²   (7)

The optimal registration is given by minimizing E(u) over u:

û = arg min_u Σ_{i=1..N} (di − T(pi, u))²   (8)

In general, the ICP algorithm iterates two steps: starting from an initial estimate of u, it repeats the C-step, which computes the correspondence set {di} from the input point cloud, and the T-step, which updates the transformation by calculating new values of u, until convergence.

C-step: in this step, given the point cloud D, we find the correspondence di of each vertex pi such that di = arg min_{dj∈D} ‖dj − p′i‖², where p′i = T(pi, ut) and ut are the current parameters. We make use of the matching method in [13]. Searching for closest points is accelerated using a K-D tree. After we have found all correspondences di, we discard as outliers the pairs for which ‖di − p′i‖² ≥ Dmax. The threshold Dmax is updated after each ICP iteration; for more details please refer to [13]. At the end of this process we retrieve a subset of Nd pairs from the initial correspondence set. This filtering makes ICP more robust, and we do not have to treat vertices hidden by the head pose in any special way: they are rejected automatically, since their corresponding pairs have larger distances than the rest.
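A sketch of this correspondence step, assuming a SciPy K-D tree for the nearest-neighbour search; the variable names are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def c_step(point_cloud, model_vertices, d_max):
    """One ICP correspondence step: nearest neighbours in the point cloud,
    with pairs farther than the threshold rejected as outliers.
    point_cloud: (M, 3); model_vertices: (N, 3), already transformed by T(p, u_t)."""
    tree = cKDTree(point_cloud)                 # accelerates closest-point search
    dists, idx = tree.query(model_vertices)     # closest cloud point for every vertex
    keep = dists ** 2 < d_max                   # discard pairs with ||d_i - p'_i||^2 >= Dmax
    return model_vertices[keep], point_cloud[idx[keep]], keep
```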

T-step: in this step we find the optimal parameters:

û_{t+1} = arg min_u Σ_{i=1..Nd} (di − T(pi, u))²

Both steps try to reduce the error; thus convergence to a local minimum is guaranteed.

In our framework, minimizing the error E(u) in the T-step is a non-linear least-squares problem, as T is a non-linear function. We choose the Levenberg-Marquardt algorithm (LMA) [20] to solve the optimization in Eq. (8). It is more robust than the standard Gauss-Newton method, and LMA can find the solution even when it starts far from the final optimum, which suits our tracking problem, where changes between frames may be drastic in some cases. Interested readers are encouraged to refer to [12] and [20] for more details. Note that in our ICP procedure, the Jacobian matrix can be computed analytically using the formulation in Eq. (6).
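As a sketch, the T-step could be solved with an off-the-shelf Levenberg-Marquardt routine; here `transform` stands for the model transformation T(p, u) and is an assumed callable, and SciPy's finite-difference Jacobian is used in place of the analytic one mentioned above.

```python
import numpy as np
from scipy.optimize import least_squares

def t_step(d_pts, p_verts, transform, u_current):
    """One ICP T-step: refit the parameter vector u by Levenberg-Marquardt.
    `transform(p_verts, u)` returns an (Nd, 3) array of transformed vertices;
    d_pts are the matched cloud points from the C-step."""
    def residuals(u):
        return (d_pts - transform(p_verts, u)).ravel()   # stack per-point 3D errors

    # method='lm' is SciPy's Levenberg-Marquardt; an analytic Jacobian from
    # Eq. (6) could be supplied through the `jac` argument instead.
    result = least_squares(residuals, u_current, method='lm')
    return result.x
```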

Algorithm 1: The initialization algorithm.
Data: input point cloud, texture image of the first frame
Result: parameters vector u1
Estimate initial head pose;
while not converged do
    Find correspondence set from point cloud;
    Discard outliers from the set;
    Update transformation;
end

IV. TRACKING

We introduce a two-step optimization procedure for our model-based tracking framework. First, an ICP algorithm calculates the deformation parameters from the point cloud of the current frame. After that, an appearance-based algorithm optimizes the deformation parameters once more over the current texture frame to rectify any error incurred by ICP.

A. Iterative Closest Point for Tracking

The ICP procedure in the tracking phase is basically the same as in the initialization phase; the only difference is in the first iteration. The rigid motion, especially the translation of the head from frame t−1 to frame t, may be large, and if we applied the parameters retrieved from the previous frame directly as the initialization to ICP, the algorithm would likely not converge to the correct solution. To quickly compensate for this possibly large movement, we use the following method. Given the transformed shape in frame t−1, we project every 3D vertex of the Candide model to the color frame t−1 to form a set of Nf 2D points x_i^{2D} (Nf ≤ 113). We then track this set of feature points from color frame t−1 to frame t using optical flow. After the positions of these feature points in the current frame are recovered, we back-project them to 3D world coordinates to form a set of Nf 3D correspondences x_i^{3D} and use them in the first iteration of ICP, instead of searching for closest points in the point cloud. The ICP tracking procedure is given in Algorithm 2; the optical-flow bootstrapping step is sketched after the algorithm.

Algorithm 2: The ICP algorithm for tracking.
Data: input point cloud, current parameters vector u_{t−1}
Result: new parameters vector u_t
while not converged do
    if this is the first iteration then
        Track 2D vertices to current frame;
        Project 2D vertices to 3D world coordinates;
        Form the set of correspondences;
    else
        Find correspondence set from point cloud;
    end
    Discard outliers from the set;
    Update transformation;
end
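The first-iteration, optical-flow-based correspondence step could be sketched as follows; the pinhole intrinsics (fx, fy, cx, cy), the depth-unit convention and the function name are assumptions for illustration, and bounds checks are omitted.

```python
import numpy as np
import cv2

def flow_correspondences(prev_gray, curr_gray, prev_pts_2d, depth, fx, fy, cx, cy):
    """Track the projected model vertices from frame t-1 to frame t with
    pyramidal Lucas-Kanade optical flow, then back-project the tracked pixels
    to 3D using the depth frame aligned to the color frame."""
    pts_prev = prev_pts_2d.astype(np.float32).reshape(-1, 1, 2)
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts_prev, None)
    pts_curr = pts_curr.reshape(-1, 2)
    status = status.ravel().astype(bool)

    corr_3d = []
    for (u, v), ok in zip(pts_curr, status):
        z = depth[int(round(v)), int(round(u))] if ok else 0
        if z > 0:                                   # keep only valid depth readings
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            corr_3d.append((x, y, z))
        else:
            corr_3d.append(None)                    # vertex skipped in this iteration
    return corr_3d
```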

In this tracking step, we keep the facial action deformations human-like when drifting happens by enforcing upper and lower bounds on the seven action deformation parameters. We use a variant of the Levenberg-Marquardt algorithm with box constraints [21] for the optimization.

However, the optical flow of these feature points deteriorates after each frame. After several frames, these feature points may drift far away from their true positions. Thus, we implement an appearance-based algorithm, explained further in Sec. IV-B, to adjust the transformation parameters closer to the correct values. If the feature points in frame t are placed correctly, there will be very little drift in frame t+1, and the ICP algorithm will likely converge to the right solution. Furthermore, since ICP has provided a good initial estimate for the appearance-based algorithm, the latter will also converge faster.

B. On-line Appearance Model

The On-line Appearance Model (OAM) [8][9] is developed upon the principles of the original Active Appearance Models [4][5][6][7]. In contrast to traditional AAMs, the mean texture is defined by a fixed-size template, and it is updated over time using a particle filter-like approach.

Given a new input image I and the corresponding deformation parameter vector u, we apply a piece-wise affine warping function ψ(I, u), similar to AAMs, to map the pixels of each triangle of the 3D wireframe model onto the corresponding triangle of the 2D template. The function results in a fixed-size observation texture vector χ. This texture vector is then normalized to χ ~ N(0, 1) to partially compensate for different lighting conditions. The warping process is demonstrated in Fig. 3.

χ ← ψ(I, u)   (9)

Fig. 3. The face patch template and piece-wise affine warping. (a) The model fitting result in the first frame. (b) The face patch template built from the first frame; eye regions are excluded.

We do not track eyelids and irises, thus we exclude their pixels from the template to eliminate their effect on the optimization.
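A minimal sketch of the warp ψ(I, u) on a grayscale image: each wireframe triangle, projected into the current image, is affinely mapped onto its fixed location in the template, in the standard AAM-style manner. Vertex projection and the eye mask are omitted, and all names are illustrative.

```python
import numpy as np
import cv2

def warp_to_template(image, src_pts_2d, dst_pts_2d, triangles, template_size):
    """Piece-wise affine warp onto the fixed-size template; returns the
    z-normalized observation vector chi. `image` is assumed grayscale."""
    h, w = template_size
    template = np.zeros((h, w), dtype=np.float32)
    for tri in triangles:                                  # tri: indices of 3 vertices
        src = src_pts_2d[tri].astype(np.float32)
        dst = dst_pts_2d[tri].astype(np.float32)
        M = cv2.getAffineTransform(src, dst)               # per-triangle affine map
        warped = cv2.warpAffine(image.astype(np.float32), M, (w, h))
        mask = np.zeros((h, w), dtype=np.uint8)
        cv2.fillConvexPoly(mask, dst.astype(np.int32), 1)
        template[mask == 1] = warped[mask == 1]
    chi = template.ravel()
    return (chi - chi.mean()) / (chi.std() + 1e-8)         # chi ~ N(0, 1) normalization
```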

In the OAM framework, the appearance model at time t, i.e. the mean texture µt, varies over time and models the appearance in all observations χ up to frame t−1. All normalized texture vectors are modelled using a multivariate Gaussian distribution with independent variables and a diagonal covariance matrix Σ. Each pixel follows a Gaussian distribution χi ~ N(µi, σi). The observation likelihood at time t is defined as:

p(It | ut) = p(χt | ut) = ∏_{i=1..n} N(χ_{i,t}; µ_{i,t}, σ_{i,t})   (10)

After recovering the deformation parameter vector ut of the current frame, we also obtain the new observation χt. We update the Gaussian distribution from frame t to frame t+1 using a recursive filtering technique:

µ_{i,t+1} = (1 − α)·µ_{i,t} + α·χ_{i,t}   (11)

σ²_{i,t+1} = (1 − α)·σ²_{i,t} + α·(χ_{i,t} − µ_{i,t})²   (12)
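The recursive update of Eqs. (11)-(12) is a direct per-pixel filter; a transcription with illustrative names is:

```python
import numpy as np

def update_appearance(mu, sigma2, chi, alpha=0.1):
    """Recursive update of the on-line appearance model, Eqs. (11)-(12).
    mu, sigma2: per-pixel mean and variance of the template; chi: new observation."""
    mu_next = (1.0 - alpha) * mu + alpha * chi
    sigma2_next = (1.0 - alpha) * sigma2 + alpha * (chi - mu) ** 2
    return mu_next, sigma2_next
```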

The initial mean texture µ1 is constructed at the first frame, after we have recovered the shape deformation parameters. α is set to 0.1 in our experiments. To find the optimal deformation parameters given a new frame, we minimize the Mahalanobis distance between the warped texture and the current appearance mean:

ût = arg min_{ut} Σ_{i=1..n} ((χ(ut)_i − µ_{i,t}) / σ_{i,t})²   (13)

This is a non-linear least-squares problem, and we apply the same LMA optimization as in ICP. The only difference is that we cannot compute the Jacobian matrix analytically, so we use numerical approximations, similar to [9]. Moreover, since the computation of the Jacobian matrix in OAM is expensive, we keep it unchanged during the optimization and only update it once per frame, after the deformation parameters are recovered.
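Putting Eq. (13) together with the per-frame Jacobian, the OAM adjustment could be sketched as below; `warp` is an assumed callable returning the normalized texture χ(u) for parameters u, and passing a precomputed finite-difference Jacobian is our illustration of keeping it fixed during the optimization.

```python
import numpy as np
from scipy.optimize import least_squares

def oam_refine(u_icp, warp, mu, sigma, jac_fixed=None):
    """OAM adjustment step: minimize the Mahalanobis distance of Eq. (13),
    starting from the ICP estimate u_icp."""
    def residuals(u):
        return (warp(u) - mu) / sigma               # per-pixel Mahalanobis residuals

    # A finite-difference Jacobian computed once per frame can be reused by
    # passing it as a constant callable; otherwise SciPy estimates it anew.
    jac = (lambda u: jac_fixed) if jac_fixed is not None else '2-point'
    result = least_squares(residuals, u_icp, jac=jac, method='lm')
    return result.x
```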

V. EXPERIMENTS

For quantitative and qualitative results, we carry out experiments on synthetic RGBD sequences that we render from the BU-4DFE dataset. We also demonstrate the tracking performance on real RGBD sequences captured with the Microsoft™ Kinect and Creative™ Senz3D depth cameras.


Algorithm 3: OAM optimization algorithm.
Data: input texture image, current parameters vector ut calculated by ICP
Result: adjusted parameters vector ut
while not converged do
    Warp texture;
    Update transformation;
end
Update the Jacobian matrix;

TABLE I. COMPARISON OF MEDIAN TRACKING ERROR (IN PIXELS)

          Angry   Disgust  Fear    Happy   Sad     Surprise  All
ICP       5.82    6.87     6.11    7.20    6.02    6.95      6.48
Hybrid    5.78    6.83     6.04    6.89    5.91    6.42      6.41

A. Synthetic Dataset

The BU-4DFE dataset consists of 101 subjects; 6 sequences corresponding to 6 expressions were recorded for each subject, making a total of 606 sequences. The 446 sequences in which subjects have a neutral expression in the first frame were used for testing. In each sequence, the face is first gradually moved to its left, then it is moved back to the right, and the motion repeats. The head is also rotated to the left side by up to 30°, then it rotates to the right side until the angle reaches 30° again. We use this rendering setup to test the robustness and stability of our tracker under head pose variation. The average head size is about 200x250 pixels.

Test results are shown in Table I. We tested our hybrid tracker in comparison with a pure ICP tracker, which only implements the ICP tracking algorithm of Sec. IV-A. We took the median separately for each expression category, as well as the median over all test sequences. On average, both the ICP tracker and the hybrid tracker give similar medians and averages of error; this means the ICP-based tracker aided by optical flow performs fairly well on BU-4DFE. However, these numbers do not fully reflect the quality of the facial deformations, since the vertices may be close to the ground truth landmarks but in the wrong places. Figure 4 demonstrates tracking performance on the BU-4DFE dataset. We can see that the hybrid tracker performs marginally better than the ICP tracker regarding facial deformations.

B. Real Data

Our current implementation runs at 16 fps, which is not fast enough for live streams, so we captured sequences for off-line tests. With the Kinect, both color and depth frames were recorded at 640x480 resolution at 1 m distance, so the data has moderate quality; the depth frame is aligned to the color frame using OpenNI2. The Senz3D camera, however, can only capture depth frames at 320x240, while its color frames are 640x480 and up. Due to the assumption of our tracker that both aligned frames must have the same resolution, we generated a synthetic 320x240 texture frame by mapping the original color frame onto the depth frame, as shown in Fig. 6(a). This synthetic texture does not retain the fidelity of the original. The point cloud retrieved from the Senz3D (Fig. 6b) is also sparser than the point cloud captured by the Kinect (Fig. 5b).

Figure 5(c) demonstrates fine tracking performance on a Kinect sequence. Figure 6(c) shows the result on the Senz3D sequence; considering that both the point cloud and the texture are noisy and slightly misaligned, the performance of the hybrid tracker is acceptable.

VI. CONCLUSION

In this paper we presented a hybrid 3D tracking framework that can be used to simultaneously track head pose and facial actions from RGBD video sequences. The tracker can start automatically from an arbitrary face pose given a few available feature points. Our framework combines Iterative Closest Point and an On-line Appearance Model, which complement each other and make the tracker more stable and accurate. In future work, we plan to develop a 3D head pose estimator to initialize the tracker from any arbitrary pose, extend our framework with eyelid and iris tracking, and improve the speed, robustness and performance of our tracker.

REFERENCES

[1] S. Gokturk, J. Bouguet, and R. Grzeszczuk, "A data-driven model for monocular face tracking," in ICCV, vol. 2, 2001, pp. 701–708.

[2] E.-J. Ong and R. Bowden, "Robust facial feature tracking using shape-constrained multi-resolution selected linear predictors," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, pp. 1844–1859, 2011.

[3] T. Cootes, C. Taylor, D. Cooper, and J. Graham, "Active shape models - their training and application," Comput. Vis. Image Underst., vol. 61, pp. 39–59, 1995.

[4] T. Cootes, G. Edwards, and C. Taylor, "Active appearance models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, pp. 681–684, 2001.

[5] I. Matthews and S. Baker, "Active appearance models revisited," Int. J. Comput. Vis., vol. 60, pp. 135–164, 2004.

[6] J. Ahlberg, "Face and facial feature tracking using the active appearance algorithm," in 2nd European Workshop on Advanced Video-Based Surveillance Systems (AVBS), London, UK, 2001, pp. 89–93.

[7] F. Dornaika and J. Ahlberg, "Fast and reliable active appearance model search for 3d face tracking," IEEE Trans. Syst., Man, Cybern., vol. 34, no. 4, pp. 1838–1853, 2004.

[8] F. Dornaika and J. Orozco, "Real-time 3d face and facial feature tracking," J. Real-time Image Proc., pp. 35–44, 2007.

[9] J. Orozco, O. Rudovic, J. Gonzalez, and M. Pantic, "Hierarchical on-line appearance-based tracking for 3d head pose, eyebrows, lips, eyelids and irises," Image and Vis. Comput., vol. 31, no. 4, pp. 322–340, 2013.

[10] Q. Cai, D. Gallup, C. Zhang, and Z. Zhang, "3d deformable face tracking with a commodity depth camera," in ECCV, 2010.

[11] T. Weise, S. Bouaziz, H. Li, and M. Pauly, "Realtime performance-based facial animation," in SIGGRAPH, 2011.

[12] M. Breidt, H. Bulthoff, and C. Curio, "Robust semantic analysis by synthesis of 3d facial motion," in IEEE FG'11, 2011, pp. 713–719.

[13] Z. Zhang, "Iterative point matching for registration of free-form curves and surfaces," Int. J. Comput. Vis., vol. 13, no. 2, pp. 119–152, 1994.

[14] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3d faces," in SIGGRAPH, 1999.

[15] J. Ahlberg, "An updated parameterized face," Image Coding Group, Dept. of Electrical Engineering, Linkoping University, Tech. Rep.

[16] K. S. Arun, T. S. Huang, and S. D. Blostein, "Least-squares fitting of two 3d point sets," IEEE Trans. Pattern Anal. Mach. Intell., vol. 9, no. 5, pp. 698–700, 1987.

[17] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in CVPR, 2001.

[18] J. Saragih, S. Lucey, and J. Cohn, "Deformable model fitting by regularized landmark mean-shift," Int. J. Comput. Vis., 2011.


Fig. 4. Tracking results on 4 sequences of 4 different expressions from 4 subjects. Odd and even columns are results of the hybrid tracker and the ICP tracker, respectively. In sequence (a), the eyebrows are pulled out of place (frame 40) and the mouth deforms incorrectly (frame 85) with the ICP tracker; the hybrid tracker performs well. In sequence (b), which is relatively difficult, the hybrid tracker tracks the facial actions correctly. Frame 20 of sequence (c) shows a very good deformation of the mouth area. In (d), the model returned by the ICP tracker drifts upward, whereas the hybrid tracker performs reliably.

Fig. 5. Tracking results on a Kinect sequence. a) The model is drawn upon the first color frame. b) The model fits to the point cloud in the first frame. c) First and second rows show the performance of the hybrid tracker and the ICP tracker, respectively. The ICP tracker does not track the mouth deformation correctly.

Fig. 6. Tracking results on the Senz3D sequence. The fitting is skewed toward the right side of the face due to color/depth misalignment, but the tracker can still manage to model facial actions. a) and b) The fitting result in the first frame. c) Performance of the two trackers. The ICP tracker fails to model the mouth correctly.

[19] A. W. Fitzgibbon, "Robust registration of 2d and 3d point sets," Image and Vis. Comput., vol. 21, no. 13-14, pp. 1145–1153, 2003.

[20] D. Marquardt, "An algorithm for least-squares estimation of nonlinear parameters," SIAM J. Appl. Math., vol. 11, no. 2, pp. 431–441, 1963.

[21] C. Kanzow, N. Yamashita, and M. Fukushima, "Levenberg-Marquardt methods for constrained nonlinear equations with strong local convergence properties," J. Comp. App. Math., vol. 172, pp. 375–397, 2004.