
3D lip shapes from video: A combined physical–statistical model ¹

Sumit Basu *, Nuria Oliver, Alex Pentland

Perceptual Computing Section, The MIT Media Laboratory, E15-383, 20 Ames St., Cambridge, MA 02139, USA

Received 30 January 1998; received in revised form 10 June 1998; accepted 16 July 1998

Abstract

Tracking human lips in video is an important but notoriously difficult task. To accurately recover their motions in 3D from any head pose is an even more challenging task, though still necessary for natural interactions. Our approach is to build and train 3D models of lip motion to make up for the information we cannot always observe when tracking. We use physical models as a prior and combine them with statistical models, showing how the two can be smoothly and naturally integrated into a synthesis method and a MAP estimation framework for tracking. We have found that this approach allows us to accurately and robustly track and synthesize the 3D shape of the lips from arbitrary head poses in a 2D video stream. We demonstrate this with numerical results on reconstruction accuracy, examples of static fits, and audio-visual sequences. © 1998 Elsevier Science B.V. All rights reserved.

Keywords: Lip models; Deformable/non-rigid models; Finite element models; Analysis-synthesis models; Training models from video; Model-based tracking

1. Introduction

It is well-known that lips play a significant role in spoken communication. Summerfield's classic 1979 study (Summerfield, 1979) showed how the presence of the lips alone (without tongue or teeth) raised word intelligibility in noisy conditions from 22.7% to 54% on average and up to a maximum of 71%. Not only can the lip shape be used to reduce noise and enhance intelligibility for human/machine speech understanding, but it is also useful as a significant feature for the understanding of expression. However, to realize any of these applications, it is necessary to robustly and accurately track the lips in 3D.

Why is 3D so critical? Consider the canonical task of speech recognition at a distance. By moving the microphone from the headset to the desktop and using the lip shapes to account for the new channel noise, we can free the user from cumbersome headgear. However, if he must be facing straight into the camera in order for the system to work, we have achieved only minimal freedom. In natural conversation and expression, we move our heads constantly, both in translation and rotation. If we cannot contend with this simple fact, we will never reach the unconstrained interfaces we desire.

Speech Communication 26 (1998) 131–148

* Corresponding author. E-mail: [email protected].
¹ Speech files available. See http://www.elsevier.nl/locate/specom.

0167-6393/98/$ – see front matter © 1998 Elsevier Science B.V. All rights reserved.
PII: S0167-6393(98)00055-7


Computer vision techniques have been developed that accurately track the head's 3D rigid motions (as in our previous work (Basu et al., 1996; Jebara and Pentland, 1997)), leaving the formidable task of tracking the remaining 3D non-rigid deformations.

In this paper, we develop a method for successfully facing this difficult problem. One of the most vexing issues surrounding the lip tracking problem has been the poor quality of the available data – contours, color, flow, etc., are all obscured at some point or other by lighting, the speed of motion, and so on. Our approach is thus to build and rely on strong models of the lip shape to correct for anomalies in the data. In essence, our model learns the permissible space of lip motions. The incoming data from the video stream is then regularized by this model – we find the permissible lip shape that can best account for the data. In this way, we remain robust to the unavoidable noise in the raw features. To build and train this model, we start by giving a lip-shaped mesh generic physical characteristics using the Finite Element Method (FEM). This acts as a physically based ``prior'' (i.e., locally elastic behavior) on how things move. We then train this model with 3D data of real lip motions and blend the physical prior with the statistical characteristics of this data. Finally, we use this physical–statistical model in a MAP estimation framework to find the locally most probable lip shape that accounts for the incoming data. Along the way, we have also developed a full-fledged synthesis model – by moving the model through the permissible lip space, we can generate images of the 3D lips in motion.

Through this method, we have been able to robustly and accurately track lip shapes in 3D from arbitrary head poses in a video stream. We will demonstrate our results with an illustration of the learned lip subspace, numerical figures on reconstruction accuracy, examples of static fits of the model, and audio-visual sequences demonstrating the tracking and synthesis in action.

1.1. Background

In looking at the prior work on lip modeling and tracking, there are two major groups of models. The first of these contains the models developed for analysis, usually intended for input into a combined audio–visual speech recognition system. The underlying assumption behind most of these models is that the head will be viewed from only one known pose. As a result, these models are only two-dimensional. Many are based directly on image data (Coianiz et al., 1995; Kass et al., 1988); others use such low level features to form a parametrized description of the lip shape (Adjoudani and Benoît, 1995).

Some of the most interesting work done in this area has been in using a statistically trained model of lip variations. Bregler and Omohundro (1995) and Luettin et al. (1996), for example, model the subspace of lip bitmaps and contours respectively. However, since these are 2D models, the changes in the apparent lip shape due to rigid rotations have to be modeled as complex changes in the lip pose. In our work, we begin by extending this philosophy to 3D. By modeling the true three-dimensional nature of the lips, variations that look complex and highly nonlinear from a 2D perspective become far simpler. With a 3D model, we can rotate the model to match the observed pose, modeling only the actual variations in lip shape.

The other category of lip models includes those designed for synthesis and facial animation. These lip models are usually part of a larger facial animation system, and the lips themselves often have a limited repertoire of motions (Lee et al., 1995). To their credit, these models are mostly in 3D. For many of the models, though, the control parameters are defined by hand. A few are based on the actual physics of the lips: they attempt to model the physical material and musculature in the mouth region (Essa, 1995; Waters and Frisbie, 1995). Unfortunately, the musculature of the mouth is extremely complicated and has proved to be very difficult to model accurately. Even if the modeling were accurate, this approach would still result in a difficult control problem. Humans do not have independent control of all of these facial muscles: normal motions are a small subspace of the possible muscle states. Some models have tried to approximate this subspace by modeling key lip positions (visemes) and then interpolating between them (for example, (Waters and Frisbie, 1995)). However, this limits the accuracy of the resulting lip shapes, since only the key positions are learned from data.


We hope to fill the gap in these approaches with our 3D model, which can be used for both analysis and synthesis. We start with a 3D shape model and generic physics, but then deform this initial model with real 3D data to learn the correct modes of variation, i.e., all of the deformation modes that occur in the observations. In this way, we not only address the problem of parametrizing the model's motions, but also that of control. Because we learn only the modes that are observed, we end up with degrees of freedom that correspond only to plausible motions. This yields powerful advantages for both tracking and synthesizing lip shapes. For tracking, it means we only need to search along the learned degrees of freedom (a 10-dimensional space, for example, instead of a 612-dimensional one for the unconstrained mesh), and also that we remain robust to anomalies in the data. For synthesis, it means we have a small number of ``control knobs'' to produce any lip shape we may need. Furthermore, as we will show, we also have a model of the probability density of the lip shape within this parametric space. We can thus trade off the likelihood of the model with the strength of the observations to find an optimally probable lip shape given the data.

2. The model

In the following section, we give a brief description of the choice of the model shape and the physics used. A more detailed account of the FEM and particulars of our implementation are provided in Appendix A.

The underlying representation of our initial model is a mesh in the shape of the lips. At the initial stage, before any training has occurred, we have no learned notion of the lip shape. We thus simply extracted the region surrounding the mouth in a Viewpoint Data Labs model of the human head and made a few minor changes to aid the physical modeling steps ahead. The final model has 336 faces and 204 nodes, resulting in 612 degrees of freedom (three per node). The initial shape is shown in Fig. 1. Similarly, we have no real idea what the inherent degrees of freedom of the lips are. However, we do know something about how the lip material behaves, namely that it acts in a locally elastic way. When one portion of the lips is pulled on, the surrounding region stretches with it. We express this notion mathematically in our model by using the FEM. We use this method to give this initial mesh the properties of a generic elastic material – i.e., we treat the mesh as if it were formed from a rubber sheet. The resulting first-stage model is a ``physical prior'' for our training stages to come. It clearly does not have the overall correct shape modes of the lips, but when we receive 3D point locations from our training data, it tells us how to move the points in between.

2.1. More on the FEM

The FEM is a numerical technique for approximating the physics of an arbitrarily complex body by breaking it into many small pieces (elements) whose individual physics are very simple.

Fig. 1. Initial shape of the model: front, partial profile and full profile views.


The individual stress–strain matrices of the elements can then be assembled into a single, overall matrix expressing the static equilibrium equation

Ku = f,    (1)

where the displacements u and forces f are in a global coordinate system. The FEM thus linearizes the physics of the body around a given point. It is important to choose this nominal point carefully, as the linearization is only valid in a limited neighborhood of the point. Initially, the physics for the model were derived using a thin-shell model with the shape in Fig. 1 as a nominal point. However, because this shape is not very realistic, we used the training data and the deformation strategy described in the sections below to deform this model to a data frame that had the lips in a typical rest (closed) position. The physics were then relinearized about this point. The resulting model was the one used in all of the training methods described below, and is shown in Fig. 2.

Appendix A and Basu (1997) give further details on the physical modeling techniques that were used and how they were applied to the lip model.

2.2. Understanding the meaning of the model

It is important here to understand the difference between a physically-based and a physiological model. We are not attempting to construct a physiological model, and thus we do not claim that our model has any simple relation to the actual stiffnesses of the skin, muscle, and other tissue that make up the mouth region. Our model is a thin shell structure, while the actual lips are clearly volumetric in nature. What we do claim is that our model (after training) can accurately account for the visible observations of the mouth. The ``learned physics'' that we discuss here corresponds to learning the modes and distributions of deformations that account for these observations. The framework of the physical model is simply a means of modeling these observations that allows us to conveniently describe the interrelations between different parts of the structure.

3. The observations

To train this model to have the correct 3D variations of the lips, it was necessary to have accurate 3D data. Also, in order to observe natural motions, it was not acceptable to affix reflective markers or other cumbersome objects to the lips.

Fig. 2. Final linearization point for the model: front, partial profile and full profile views. The FEM matrix K used for all training computations was derived from this shape.

Fig. 3. Locations of marked points on the face.


To satisfy these criteria, seventeen points were marked on the face with ink: sixteen on the lips and one on the nose. The placement of these points is shown in Fig. 3. The points were chosen to obtain a maximally informative sampling of the 3D motions of the lips.

Once the points were marked, two views of the points were taken by using a camera-mirror setup to ensure perfect synchronization between the two views. The points were tracked over 150 frames at a 30 Hz frame rate using supervised normalized correlation. The two views were then used to reconstruct the 3D location of the points. Finally, the points were transformed into a head-aligned coordinate system to prevent the rigid motion of the head from aliasing with the non-rigid motions of the lips. See (Basu, 1997) for further details on these methods.
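As an illustration of the two-view reconstruction step, the sketch below triangulates a single point by linear (DLT) triangulation. This is not the authors' exact procedure (the details are in Basu (1997)); the projection matrices P1 and P2 and the example geometry are hypothetical, with the mirror treated as a second, virtual camera.

```python
# Minimal sketch of linear (DLT) two-view triangulation; not the authors' exact code.
# P1, P2 are assumed 3x4 projection matrices; xy1, xy2 are matching 2D observations.
import numpy as np

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate(P1, P2, xy1, xy2):
    A = np.vstack([
        xy1[0] * P1[2] - P1[0],
        xy1[1] * P1[2] - P1[1],
        xy2[0] * P2[2] - P2[0],
        xy2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)       # least-squares null vector of A
    X = Vt[-1]
    return X[:3] / X[3]               # homogeneous -> Euclidean 3D point

# Hypothetical example: two cameras displaced along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.2, -0.1, 4.0])
print(triangulate(P1, P2, project(P1, X_true), project(P2, X_true)))  # ~[0.2, -0.1, 4.0]
```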

We attempted to include as great a variety of lip motions as possible within this brief period. To this end, several utterances using all of the English vowels and the major fricative positions were spoken during the tracking period. Clearly, 150 frames from one subject is still not enough to cover all possible lip motions, but it is enough to provide the model with the initial training necessary to cover a significant subset of motions. Methods for continuing the training using other forms of input data will be discussed in a later section.

4. Training the model

In order to relate the training data to the model, the correspondence between data points and model nodes had to be defined. This was a simple process of examining a video frame containing the marked points and finding the nodes on the lip model that best matched them in a structural sense. The difference between the goal locations of these points (i.e., the observed 3D point locations) and their current location in the model is then the displacement goal, u_g.

4.1. Reaching the displacement goals

The issue was then how to reach these displacement goals. The recorded data points constrained 48 degrees of freedom (16 points on the lips with three degrees of freedom each). However, the other 564 degrees of freedom were left open. There are an infinite number of ways to reach the displacement goals – as long as the constrained points reach the goals, the other points are free to do anything. However, we wanted the physically correct solution: to pin down the constrained points and let the other points go to their equilibrium locations.

Mathematically, this idea translates to the constraint of minimum strain. Given the set of constrained point displacements, our solution must minimize the strain felt throughout the structure. This solution is thus a physically-based smoothing operation: the physics of the model are used to smooth out the regions where we have no observation data by minimizing the strain in the model.

Fortunately, in the finite element framework, this solution can be found analytically and with little computation. If we denote the K^{-1} matrix with only the rows pertaining to the constrained degrees of freedom as P, the desired solution can be put in the form of the standard underconstrained least-squares problem. We wish to minimize

f^T f,    (2)

with the constraint

P f = u_g,    (3)

which results in the following solution (Gelb, 1974):

f = P^T (P P^T)^{-1} u_g.    (4)

If we apply this f to the mesh, we will have the desired minimum-strain displacement.
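A minimal numerical sketch of this step is given below, assuming a precomputed inverse stiffness matrix and a list of constrained degrees of freedom; the variable names are illustrative rather than taken from the authors' implementation.

```python
# Sketch of the minimum-strain fit of Eqs. (2)-(4), assuming K_inv is the
# (612x612 here) inverse stiffness matrix and constrained_dofs indexes the
# observed degrees of freedom.
import numpy as np

def min_strain_displacement(K_inv, constrained_dofs, u_goal):
    """Displacement field meeting the goals on the constrained DOFs with minimum strain."""
    P = K_inv[constrained_dofs, :]            # rows of K^-1 for the observed DOFs
    f = P.T @ np.linalg.solve(P @ P.T, u_goal)   # f = P^T (P P^T)^-1 u_g
    return K_inv @ f                          # displacements of all nodes

# Toy usage with a small random positive-definite stiffness matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((12, 12))
K = A @ A.T + 12 * np.eye(12)
u = min_strain_displacement(np.linalg.inv(K), [0, 3, 7], np.array([0.1, -0.2, 0.05]))
print(u[[0, 3, 7]])   # ~ [0.1, -0.2, 0.05]
```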

4.2. Modeling the observations

Once we have all the displacements for all of the frames, the observed deformations can be related to a subset of the ``correct'' physics of the model. We began with the default physics (i.e., fairly uniform stiffness, only adjacent nodes connected) and have now seen how the model can be deformed with point observations. The results of these deformations can now be used to form a new, ``learned'' K matrix.


Martin et al. (1998) described the connection between the strain matrix and the covariance of the displacements R_u: if we consider the components of the force to be IID with unit variance, we have

R_u = K^{-2}.    (5)

We can now take this mapping in the opposite direction. Given the sample covariance matrix R_u, we can find K^{-1} by taking its positive definite square root, i.e., diagonalizing the matrix into S \Lambda S^T (where each column of S is an eigenvector and \Lambda is the diagonal matrix of eigenvalues) and then reforming it with the square roots of the eigenvalues. We can then use the resulting ``sample K^{-1}'' to represent the learned physics from the observations. Forces can now be applied to this matrix to calculate the most likely displacement given the observations.

However, because we only have a small number of training observations (150) and a large number of degrees of freedom (612), we could at best observe 150 independent degrees of freedom. Furthermore, noise in the observations makes it unreasonable to estimate even this many modes. We thus take only the 10 linear modes that account for the greatest amount of variance in the input data (i.e., those with the largest eigenvalues of the covariance matrix). These modes are found by performing principal components analysis (PCA) on the sample covariance matrix. The modal covariance and K^{-1} matrices can then be reconstructed using these modes. We thus have a parametric description of the subspace of lip shapes (the modes) and a probability measure for the subspace (the modal covariance matrix).
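The sketch below condenses this training step, assuming a matrix U of minimum-strain displacement fields (one row per training frame, 150 × 612 in our case); the function and variable names are illustrative.

```python
# Sketch of learning the modal model: sample covariance of the displacements,
# PCA for the dominant modes, and the "learned" K^-1 as the positive-definite
# square root of the retained covariance (R_u = K^-2, Eq. (5)).
import numpy as np

def learn_modes(U, num_modes=10):
    u_mean = U.mean(axis=0)
    R_u = np.cov(U - u_mean, rowvar=False)        # sample covariance of displacements
    eigvals, eigvecs = np.linalg.eigh(R_u)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:num_modes] # keep the largest-variance modes
    modes, lambdas = eigvecs[:, order], eigvals[order]
    K_inv_modal = modes @ np.diag(np.sqrt(lambdas)) @ modes.T
    return u_mean, modes, lambdas, K_inv_modal
```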

Frontal, partial profile, and full profile views of the mean displacement (ū) and some of the first few modes are shown in Fig. 4. Though we are only using the first ten modes, it was found that these account for 99.2% of the variance in the data. We should thus be able to reconstruct most shape variations from these modes alone.

5. Tracking the lips in raw video

At this point, we have a parametric model of the permissible lip shapes and a probability model for the resulting subspace.

Fig. 4. Front, partial profile and full profile views of the mean displacement and some characteristic modes.


The remaining task is to fit this model to the raw video stream in the absence of special markings on the lips or face. As mentioned earlier, our approach will be to find the lip shape within the learned subspace that best accounts for the incoming data. In statistical terms, this means finding the parameters with the highest a posteriori probability given the observations and our prior model. Intuitively, though, it simply means balancing the potentially noisy data from our observations with our learned notion of what shapes are permissible.

Any of a number of features (or a combination thereof) could be used as observations in our framework – color classification, optical flow, contours, tracked points, etc. For this implementation, we have chosen to use only the color content of the various regions, as it is a robust and easily computable candidate. Of course, this feature will not directly give us any kind of shape information – it will only give us the probabilities of each pixel belonging to the color classes, f_model = f(color | model). From our statistical perspective, though, it is clear how this data should be used. We wish to find the set of parameters p* for our model that maximizes its posterior probability given the observations

p* = \arg\max_p f(p | O) = \arg\max_p \frac{f(O | p) f(p)}{f(O)};    (6)

we can neglect the denominator in the last expression, since it will be the same for all p, leaving us with

p* = \arg\max_p f(O | p) f(p).    (7)

We can further simplify this by taking the logarithm. Because the logarithm is a monotonic function, maximizing the log of the expression in Eq. (7) is equivalent to maximizing the original expression

p* = \arg\max_p \log[f(O | p) f(p)] = \arg\max_p [\log f(O | p) + \log f(p)].    (8)

Another piece of information we have is the color class of each point on our model. As shown in the figures above, the model contains the lips and some surrounding skin, and we know a priori which triangular faces belong to which class. If we now project the model in state p into the camera view, we can compute the term f(O(x, y) | p) for each point in the visible surface of the model. This value is simply the probability of the observed color value at (x, y) belonging to the same class as the point in the model that is projected onto it. To find the overall probability of the model in this state, we simply take the product of the observations (making an assumption of independence, which we will modify later on) over the model region and postmultiply by the prior value of the model being in state p,

f(p | O) = a \prod_n f(O_n(x, y) | p) f(p),    (9)

where a represents a constant factor to account for f(O). In the log domain, this product becomes a sum,

\log f(p | O) = \sum_n \log f(O_n(x, y) | p) + \log f(p) + \log a.    (10)

This gives us a measure of the posterior probability of the model being in a given state. We will show in a later section how this quantity can be decomposed for efficient computation. This still leaves the problem of finding the optimal state without searching the entire subspace. We approach this through gradient ascent: at each step, we compute the direction in the parameter space that will most increase the fit of our model to the data. We then take a step in this direction.

In order to apply these ideas to our tracking problem, we first train models of the color classes for the skin and lips. Next, we compute the probability maps for the image (i.e., 2D maps whose entries are the probability values of the given class). The model is then initially positioned based on the rigid pose and geometry of the head. From this initial fit, we find the gradient of the optimization function in Eq. (10) in the parameter space and climb to a local maximum of the posterior probability. For the next frame, we then begin the ascent at these parameter values.


We will describe this process in detail in the following sections. Note that this technique is much more stable and faster than the local-contribution method described in our earlier work (Basu et al., 1998).

5.1. Training the color classes

The statistical models of the lip and skin colors are derived by first collecting a hundred or so RGB sample points from each class for a given user. For this paper, the samples were picked by hand, but it is a simple matter to acquire them automatically, for example using a system such as (Oliver et al., 1997). The distributions are modeled as mixtures of Gaussians and are estimated using the Expectation–Maximization algorithm, using three components for the lips and one for the skin. In the past, we have modeled these classes in an intensity-normalized color space (Basu et al., 1998), but have found that this produces less accurate features. This is especially true in dark areas where dividing by the intensity has the dangerous effect of amplifying the camera noise. Furthermore, there are subjects for whom the lip color is quite similar to the skin color, in which case it is the intensity alone which differentiates the classes. For these reasons, we now model the color classes in the full RGB space.
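A minimal sketch of this color-modeling step is shown below using scikit-learn's EM-based Gaussian mixtures; lip_rgb and skin_rgb stand for hand-labelled (N × 3) arrays of RGB samples and are assumptions of this example, not names from the original system.

```python
# Sketch of the color-class training: one Gaussian mixture per class, fit by EM.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_color_models(lip_rgb, skin_rgb):
    lip_model = GaussianMixture(n_components=3, covariance_type='full').fit(lip_rgb)
    skin_model = GaussianMixture(n_components=1, covariance_type='full').fit(skin_rgb)
    return lip_model, skin_model

def probability_map(model, image):
    """Per-pixel likelihood of an (H x W x 3) RGB image under a color model."""
    h, w, _ = image.shape
    log_p = model.score_samples(image.reshape(-1, 3).astype(float))
    return np.exp(log_p).reshape(h, w)
```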

Once we have these models, we can apply them to the relevant regions of the image to produce probability maps for the lip and the skin classes. Fig. 5 shows the probability maps for the lip (f_lips(x, y)) and skin (f_skin(x, y)) classes for a typical input image.

These probability maps tend to be somewhat noisy due to camera noise, as can be seen in Fig. 5. To compensate for this, we convolve them with a 7 × 7 normalized Hamming kernel, resulting in smoothed probability maps. Note also that the lighting conditions can often obscure the color information available. For example, the right side of the face in Fig. 5 is too dark to provide any salient color cues. However, because we have learned the subspace of permissible lip shapes, the lip shape can be accurately estimated using only the available information (i.e., the left side of the face). The quality of the tracking results under this particular lighting condition can be viewed in the first audio-visual sequence.
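The smoothing step amounts to a single 2D convolution; a sketch (assuming a separable kernel built from a Hamming window and normalized to sum to one) is given below.

```python
# Sketch of the probability-map smoothing with a normalized 7x7 Hamming kernel.
import numpy as np
from scipy.signal import convolve2d

def smooth_probability_map(prob_map, size=7):
    w = np.hamming(size)
    kernel = np.outer(w, w)
    kernel /= kernel.sum()                       # normalize so values stay probabilities
    return convolve2d(prob_map, kernel, mode='same', boundary='symm')
```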

5.2. Projecting the model into the camera view

The 2D projection of the model into the camera view is found using a pinhole camera model with a calibrated focal length. The rigid pose of the model is related to the camera view by six rigid parameters: three for rotation and three for translation. In addition, there are three scaling parameters (in x, y and z) that fit the lip shape to a particular user – these parameters are of course constant for a given user. For the results shown in this paper, these parameters were fit by hand in the first frame and the head was kept rigid throughout the sequence. We are currently integrating the head-tracking system in (Jebara and Pentland, 1997) to automatically determine the rigid position of the head. Given this initial rigid fit, we iteratively deform the model along the learned non-rigid modes to maximize the probability of the model state given the observations as described in the sections below.
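The projection itself is straightforward; the sketch below scales, rotates and translates the model vertices and applies the perspective divide. The pose R, t, the per-axis scale and the focal length are placeholders for the values produced by the rigid fit described above.

```python
# Sketch of the pinhole projection of the model into the camera view.
import numpy as np

def project_model(vertices, R, t, scale, focal_length):
    """vertices: (N x 3) model points; returns (N x 2) image coordinates."""
    pts = (vertices * scale) @ R.T + t                # scale, rotate, translate into camera frame
    return focal_length * pts[:, :2] / pts[:, 2:3]    # perspective divide

# Hypothetical usage: identity pose, unit scale, 500-pixel focal length.
verts = np.array([[0.0, 0.0, 10.0], [1.0, -0.5, 10.0]])
print(project_model(verts, np.eye(3), np.zeros(3), np.ones(3), 500.0))
```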

5.3. Measuring the model probability

In order to measure the probability of the current model state given the observations, we need to compute the expression in Eq. (10), which sums the log probability of the observations.

Fig. 5. Original image, lip probability map and skin probability map.


We can break this expression up into a sum of sums over the pixels of the faces (the triangular facets) of the model

\log f(p | O) = \sum_i \sum_{face_i} \log f(O(x, y) | p) + \log f(p) + \log a.    (11)

Furthermore, we can reduce the computation by only sampling each face at its centroid and multiplying by the visible area of the face, which can be easily computed as the 3D area multiplied by the dot product of the view vector and the face normal (faces that are not visible do not contribute to this sum). This area A_i is thus multiplied by the value at the center of the patch, \log f(O(\bar{x}_i, \bar{y}_i) | p). Note that this scheme approximates the function as being constant over the face. This is reasonable for our situation because of the small size of the faces (and thus the small variation in the function surface over them – see Fig. 8). At this point, we also introduce a scaling factor c that multiplies the resulting value for each face. By multiplying each centroid probability by the visible surface area (of A_i pixels²), we are making the assumption that we have A_i statistically independent samples of the value at the centroid, which is clearly not the case. If we do not adjust for the mutual dependence of these samples, their contribution will overwhelm the prior (\log f(p)) and we would trust the data entirely. The small number of samples we are actually taking and the lack of true independence among these values does not warrant this level of trust. We thus scale all of the areas by a parameter c which compensates for the dependence among the observations. This makes the effect of our observations commensurate with that of our prior, still giving each face a contribution proportional to its visible area.

The resulting scaled total sum is

\log f(p | O) = \sum_n c A_n \log f(O(\bar{x}_n, \bar{y}_n) | p) + \log f(p) + \log a.    (12)

We can now simplify f(O(x, y) | p) to f_lips(x, y) or f_skin(x, y), depending on whether the given face in the model is a lip face or a skin face. This breaks the sum into two pieces:

\log f(p | O) = \sum_{\text{lip faces}} c A_n \log f_lips(\bar{x}_n, \bar{y}_n) + \sum_{\text{skin faces}} c A_n \log f_skin(\bar{x}_n, \bar{y}_n) + \log f(p) + \log a.    (13)

Lastly, to evaluate the prior term f(p), we use a Gaussian model with the learned covariance R_{u,modal}, which takes the form

f(p) = \frac{1}{(2\pi)^5 |R_u|^{0.5}} \exp\left(-0.5\, p^T R_u^{-1} p\right).    (14)

The log of this quantity is then simply

\log f(p) = C - 0.5\, p^T R_u^{-1} p,    (15)

where C is the constant log of the scaling for the Gaussian. This can further be simplified to the sum of eleven terms, since R_u is diagonal,

\log f(p) = C - 0.5 \sum_{i=1}^{10} \frac{p_i^2}{\lambda_i}.    (16)

The overall posterior probability of the model can thus be written as a set of simple sums, where the new constant D absorbs C and \log a:

\log f(p | O) = \sum_{\text{lip faces}} c A_m \log f_lips(\bar{x}_m, \bar{y}_m) + \sum_{\text{skin faces}} c A_n \log f_skin(\bar{x}_n, \bar{y}_n) - 0.5 \sum_{i=1}^{10} \frac{p_i^2}{\lambda_i} + D.    (17)–(19)
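For concreteness, the sketch below evaluates this posterior for one candidate model state, assuming the projected face centroids, their visible areas, their class labels and the smoothed log-probability maps are already available; the value of c and the variable names are illustrative.

```python
# Sketch of evaluating the posterior of Eqs. (17)-(19) for one model state.
import numpy as np

def log_posterior(centroids, areas, is_lip_face, log_f_lips, log_f_skin,
                  p, lambdas, c=0.05):
    """centroids: (N x 2) image coords of visible face centers; areas: (N,)
    visible areas in pixels^2; p: modal parameters; lambdas: mode variances."""
    total = 0.0
    for (x, y), area, lip in zip(centroids, areas, is_lip_face):
        log_map = log_f_lips if lip else log_f_skin
        total += c * area * log_map[int(round(y)), int(round(x))]   # data term
    total += -0.5 * np.sum(p ** 2 / lambdas)                        # Gaussian prior
    return total                                                    # constant D omitted
```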

5.4. Iterating to a solution

Now that the posterior probability of the model shape can be computed, we need a means of improving it. We do this by iteratively moving in the direction that will most improve this posterior. In general, finding the explicit gradient can be quite expensive, and approximations to the gradient are used to save computation (as in our previous work (Basu et al., 1998)).


However, in this case, because we have found a low-dimensional parametric space that our model is allowed to move in, we only need to compute the gradient along these dimensions. Furthermore, because our modes are linear, we do not need to relinearize our model at every step. As a result, we have found that we can compute the explicit gradients for our model very efficiently. Because this corresponds to the optimal direction to move the model, it converges much more quickly and smoothly to a probabilistic maximum than our previous method.

We thus wish to find the direction of optimal ascent in our parametric space. Mathematically, we seek the quantity

\frac{d \log f(p | O)}{dp}.    (20)

Since f(O) is constant for a given frame, we can apply Bayes' rule and then break this up as follows:

\frac{d \log f(p | O)}{dp} = \frac{d \log f(O | p)}{dp} + \frac{d \log f(p)}{dp}.    (21)

We will deal with these terms separately to simplify the development. The first term can be rewritten as a chain of partial derivatives,

\frac{d \log f(O | p)}{dp} = \frac{\partial \log f(O | p)}{\partial \bar{x}_n} \, \frac{\partial \bar{x}_n}{\partial p},    (22)

which can be written more explicitly as

\frac{d \log f(O | p)}{dp} = c \left[ A_1 \frac{\partial \log f(O(\bar{x}_1, \bar{y}_1) | p)}{\partial \bar{x}_1} \;\; \cdots \;\; A_N \frac{\partial \log f(O(\bar{x}_N, \bar{y}_N) | p)}{\partial \bar{y}_N} \right] \begin{bmatrix} \partial \bar{x}_1 / \partial p_1 & \partial \bar{x}_1 / \partial p_2 & \cdots \\ \partial \bar{y}_1 / \partial p_1 & \partial \bar{y}_1 / \partial p_2 & \cdots \\ \vdots & & \ddots \\ & & \partial \bar{y}_N / \partial p_M \end{bmatrix},    (23)

where N is the number of faces and M is the number of modes being used. The first term in this product is composed of the x and y derivatives of the log probability map, scaled by the visible face areas A_n and c. The columns of the second term are simply the in-plane components of the modes computed at the centroids of each face. To find the latter, the 3D modes are rotated and scaled by the rigid transform/scaling discussed earlier. Because the modes are linear, though, for a given head pose, this second term is constant. Even when the head moves, it can be updated at minimal computation cost (the modes are simply transformed by a 3 × 3 rotation/scaling matrix).

We now go on to deal with the second term in Eq. (21). The gradient of this term is

\frac{d \log f(p)}{dp} = -\left( R_u^{-1} p \right)^T.    (24)

We now use these results to take a step in the direction of the overall gradient,

p = p + \beta \frac{d \log f(p | O)}{dp},    (25)

where \beta is sufficiently small to account for the non-linearities in the posterior surface in most cases. To ensure that we continue moving upwards in probability, though, the log probability is computed after each step. If it has decreased, we go back and use a smaller (half) value of \beta. This ascent process is continued until we have converged to a local maximum, which typically occurs in less than twenty iterations. The resulting estimated shape is then used as the initial shape for the next input frame.
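The sketch below shows the shape of this ascent loop, with step halving whenever the posterior decreases; posterior(p) and gradient(p) stand for Eqs. (17)–(19) and (21), and the step size and tolerance are illustrative.

```python
# Sketch of gradient ascent with step halving, as described above.
import numpy as np

def fit_shape(p0, posterior, gradient, beta=1e-3, max_iters=20, tol=1e-6):
    p, best = p0.copy(), posterior(p0)
    for _ in range(max_iters):
        step = beta * gradient(p)
        candidate = p + step
        value = posterior(candidate)
        while value < best and np.linalg.norm(step) > tol:
            step *= 0.5                      # back off if the posterior decreased
            candidate = p + step
            value = posterior(candidate)
        if value <= best + tol:
            break                            # converged to a local maximum
        p, best = candidate, value
    return p
```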

6. Results

In this section, the reconstruction and tracking capabilities of our method are demonstrated. We first provide numerical results that show the capability of our model to accurately reconstruct 3D lip shapes from 2D data. Examples are then provided of using the tracking method described above to capture the lip shape from a 2D video stream and reconstruct the 3D shape. This is shown both with example fits in static frames and with audio-visual sequences. The advantages of the modal form of our model are also discussed.

6.1. Reconstruction capabilities

As noted above, one of the major arguments behind the 3D representation was that a small number of observations from any viewpoint could be used to find a good estimate of the model shape.


This is because the subspace of permissible lip shapes has been learned. Without the model, the 2D observations would leave far too many degrees of freedom unconstrained. With the model, as we will show, all degrees of freedom can accurately be reconstructed. We demonstrate this by reconstructing the 3D shape using only x–y (frontal view) data and only y–z (side view) data.

The mean-squared reconstruction errors per degree of freedom were found for two cases of 2D observation scenarios and are shown in Table 1. The results are given in the coordinate system of the model, in which the model is 2.35 units across, 2.83 units wide and 0.83 units deep. The reconstruction error shown is from using only the first ten modes. Note that these results were obtained using a cross-validation method, so the errors reported are on frames outside the training set.

The rows of the table correspond to what measurements were used to reconstruct the full 3D shape. In the first row, only the data available from a frontal view was used for the estimation. In other words, the x and y coordinates of all points were observed, but the z coordinates were not. Fig. 6 shows the dimensions of the data points used for the estimation with a `+' marker and those not used with an `o' marker. Front and side views of the reconstruction for the point locations are also shown.

Table 1
Reconstruction error per DOF (in normalized coordinates)

Data used       3D reconstruction error
xy (frontal)    6.70 × 10^{-3}
yz (profile)    7.13 × 10^{-4}

Fig. 6. Data used and a sample reconstruction for frontal view reconstruction experiment. Data point dimensions used (x and y) are labeled with a `+'; dimensions not used are labeled with an `o'.


The second row shows the results of using only the data available from a profile (side) view. Here, the y and z components of the data were observed, but the x locations were not. Fig. 7 shows the coordinates used and a sample reconstruction for this experiment.

Note that in both cases (see Table 1), the reconstruction is quite accurate in terms of mean-squared error. This shows that the ten learned modes are a sufficiently strong characterization to accurately reconstruct the 3D lip shape from 2D data. It is interesting to note that the y–z data provides much better performance than the x–y case. This is understandable in that there was a significant amount of depth variation in the test frames. Because many x–y variations can occur at a variety of z positions, the depth variation is not observable from the frontal plane. As a result, the y–z data provides more information about the lip shape in these cases. Since our model is a full 3D representation, it can take advantage of this disparity (or any other advantageous 3D pose) when these observations are available.
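The paper does not spell out the exact estimator used for these experiments; the sketch below shows one natural way to perform such a reconstruction under the modal model, by fitting the ten modal parameters to only the observed dimensions by least squares and then rebuilding all coordinates. u_mean and modes are the quantities learned in Section 4, and observed_dofs is an assumed index list.

```python
# Sketch of reconstructing the full 3D shape from partial (e.g., x-y only)
# observations using the learned modes; one possible estimator, not
# necessarily the authors' exact one.
import numpy as np

def reconstruct_from_partial(u_obs, observed_dofs, u_mean, modes):
    """u_obs: observed displacements for the DOFs listed in observed_dofs."""
    A = modes[observed_dofs, :]                      # rows of the modes we can see
    p, *_ = np.linalg.lstsq(A, u_obs - u_mean[observed_dofs], rcond=None)
    return u_mean + modes @ p                        # full 3D reconstruction
```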

6.2. Tracking and reconstruction results

In this section, several examples are provided of using our algorithm to estimate the 3D lip shape. We begin with a detailed example (Fig. 8). The first frame shows the mouth image we are trying to fit. The next frame shows the initial placement of the model on the image. The third frame shows the gradient direction resulting from the observations. In the last frame we see the final, converged result after 20 iterations.

Figs. 9–12 show some other frames with the initial image, the final converged fit, and the profile view of the estimated model.

Fig. 7. Data used and a sample reconstruction for profile (side) view reconstruction experiment. Data point dimensions used (y and z) are labeled with a `+'; dimensions not used are labeled with an `o'.


The audio-visual sequences these frames are taken from, along with the tracking and reconstruction views, are available on the Elsevier website ². We recommend viewing these sequences both at normal speed, in order to see the natural quality of the motions, and at reduced speed, in order to see the quality of the fit from frame to frame.

6.3. Advantages of the modal representation

It is worth noting here the flexibility that we gain from having a modal representation. As we have already described, the first few modes account for the greatest variation in lip shape, whereas the last few contribute the least. The more modes we use, the more accurately we can fit the shape (though the gain in accuracy drops with every mode). The fewer modes we use, the more robustly we can reject noise, since we only move along the directions of the greatest variation. Increasing or decreasing the number of modes we use for tracking is thus like moving a slider between accuracy (many modes) and robustness (few modes).

One of the main reasons we have used a small number of modes (10) thus far is to be robust to noisy data. When clean data is available, though, we expect to be able to fit many more modes for a higher accuracy of fit. This points towards a much more convenient method of continuing to train the model when clean data is available. We can produce such clean data by carefully controlling the lighting or by improving colorspace separation by using colored lipstick. When such ``clean'' color data is available, we can fit many more than ten modes with high accuracy. Because point marking and tracking is then no longer necessary, we can easily train on large volumes of data in this way.

² Speech files available. See http://www.elsevier.nl/locate/specom.

Fig. 8. From initial image to final fit.


Fig. 9. Initial image, final fit and 3D reconstruction.

Fig. 10. Initial image, final fit and 3D reconstruction.

Fig. 11. Initial image, final fit and 3D reconstruction.

Fig. 12. Initial image, final fit and 3D reconstruction.


We can then use data collected with this technique to find an improved set of generic modes (modeling the lip variations of many users) or a specialized set of modes (i.e., for high-detail tracking of an individual user).

On the other hand, in a video coding application, we may be receiving very noisy data (from a low quality camera) and we want to keep the number of transmitted bytes very low. For such an application, we want to use an even smaller number of modes (perhaps only two or three) to maximize robustness and data reduction.

The modal representation thus gives us the powerful capability of moving smoothly between high accuracy and high robustness, allowing us to adapt to the quality of the available data and the bandwidth of the communication channel.

7. Conclusions and future directions

We have presented a method for estimating and reconstructing the 3D shape of human lips from raw video data. This method began with a physical model with generic physical properties – a rubber sheet in the shape of the lips. 3D observations were then used to train this initial model with the true variations of human lip shapes, smoothly going from a physical to a hybrid physical–statistical model. This model fit naturally into a MAP estimation framework, which was then used for tracking and resynthesis of 3D lip shapes. We have shown through static and video examples how the observations in raw data can be accurately tracked, and have also demonstrated the ability of our model to accurately reconstruct 3D lip shapes from sparse 2D data. With the method presented here, we can now accurately and robustly track 3D lip shapes from 2D video data taken from an arbitrary pose.

This capability allows us to look towards a number of new directions in the future. The first step will be to integrate a 3D head pose tracker so that the model does not need to be initialized by hand in the first frame and so it can deal with large head motions within a sequence. The next step will be to incorporate this model as a feature of an audio-visual speech recognition system, allowing users to have truly unconstrained head motion while speaking. Other directions we wish to pursue include applying this method to other non-rigid areas of the face, such as the eyes and eyebrows, so that we can track and resynthesize the entire face without marking it with points. The methods described here can then be used for analysis/synthesis of all facial motion, which leads to many video coding and computer graphics applications.

Acknowledgements

This paper is dedicated to the memory of our friend and colleague Christian Benoît. His enthusiasm played a key role in our continued efforts in this research. In addition, the first author would like to thank the National Science Foundation for their support in the form of a Graduate Research Fellowship. The second author would like to thank La Caixa Foundation. Thanks also to Viewpoint Data Labs, who provided the model from which the initial model of the lips in Fig. 1 was extracted.

Appendix A. Using the finite element method

In this appendix, we give an overview of the finite element method (for the static equilibrium case), the specifics of our model, and the steps involved in applying our model to the lips.

A.1. The finite element method

The FEM is a numerical method for approximating the physics of an arbitrarily complex body. The central idea is to break the body into many small pieces whose individual physics are simple and then to assemble the individual physics into a complete model. In contrast to the finite difference method, the finite element method also models the physics of the material between the nodes. This is possible because the method gives us interpolation functions that let us describe the entire modeled body in a piecewise analytic manner. Given this piecewise representation, the constitutive relations are integrated over the entire body to find the overall stress–strain relationship. These interpolation functions are written in vector form as


u(x) = H(x) u,    (26)

where u represents the values of the function to be interpolated at the nodes, H(x) is the interpolation matrix, and u(x) is the analytic representation of the function in the local coordinates of the element. In our case, the function we are interested in is the strain \epsilon (internal force) resulting from a given deformation. We can find this using the relation

\epsilon(x) = B(x) u,    (27)

where u now represents the displacements at each node and B is a combination of H(x) above and the stress–strain relationship of the material. It can be obtained by appropriately differentiating and recombining the rows of H given the stress modes specified by \epsilon. To find the stiffness matrix K for an entire element, we integrate this relationship over the volume of the element,

k_e = \int B^T C B \, dV,    (28)

where C describes the stress–strain relationship between the different degrees of freedom in the element. For each such matrix, we have the relationship

k_e u = f,    (29)

where u represents the nodal displacements and f the strain resulting from these displacements. Note that the stresses and the strains are both expressed in the local coordinate system at this point. Each of these element matrices can be transformed by a matrix \lambda, which transforms the global coordinate system to the local one,

\lambda = \begin{bmatrix} \hat{\imath} \rightarrow \\ \hat{\jmath} \rightarrow \\ \hat{k} \rightarrow \end{bmatrix},    (30)

where \hat{\imath}, \hat{\jmath} and \hat{k} are unit vectors in the local x, y and z directions, respectively. Because these vectors are orthonormal, \lambda^{-1} is simply the transpose of the above matrix. \lambda^T thus transforms from the local coordinate system to the global one.

Note that the matrix above transforms only three degrees of freedom: to apply it to the strain matrix for an entire element (which is nine-by-nine), we must repeat the same \lambda in the following block-diagonal form:

T = \begin{bmatrix} \lambda & & \\ & \lambda & \\ & & \lambda \end{bmatrix}.    (31)

The matrix T can then be applied to the element strain matrix to produce the strain matrix in the global coordinate system

k'_e = T^T k_e T.    (32)

In the expanded form on the right-hand side, we can see how in a vector post-multiplication (by a global displacement) this k'_e first transforms the vector to the local coordinate system (with T), applies the stress–strain relation (with k_e), and transforms the resulting force back into the global coordinate system (with T^T).

The resulting transformed strain matrices now have aligned degrees of freedom and can be assembled into a single, overall matrix such that

Ku = f,    (33)

where the displacements and forces are now in the global coordinate system.

Further details of this method are described in many references on finite elements, including (Bathe, 1982; Zienkiewicz and Cheung, 1967). Note that at the current time, we are not considering the higher order effects of dynamics (the mass and damping matrices) and thus do not describe them here.
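As an illustration of Eqs. (30)-(33), the sketch below rotates each element stiffness matrix into global coordinates and scatter-adds it into the overall K; the element list and node indexing are assumed inputs rather than the authors' data structures.

```python
# Sketch of the element-to-global assembly of the stiffness matrix K.
import numpy as np

def assemble_global_K(num_nodes, elements):
    """elements: list of (node_ids, k_e_local, local_axes), where k_e_local is 9x9
    and local_axes is a 3x3 matrix whose rows are the local x, y, z unit vectors."""
    K = np.zeros((3 * num_nodes, 3 * num_nodes))
    for node_ids, k_e, lam in elements:
        T = np.kron(np.eye(3), lam)                  # block-diagonal transform, Eq. (31)
        k_e_global = T.T @ k_e @ T                   # rotate into global coordinates, Eq. (32)
        dofs = np.concatenate([[3 * n, 3 * n + 1, 3 * n + 2] for n in node_ids])
        K[np.ix_(dofs, dofs)] += k_e_global          # scatter-add into the global K
    return K
```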

A.2. Model specifics

For this application, a thin-shell model was chosen. We constructed the model by beginning with a 2D plane-stress isotropic material formulation (Zienkiewicz and Cheung, 1967) and adding a strain relationship for the out-of-plane components. For each triangular element, then, the six in-plane degrees of freedom are related with a six-by-six matrix k_{xy}, while the out-of-plane degrees of freedom are related by the three-by-three k_z. In order to preserve the linearity of our model while maintaining the use of flat elements, we treat these two modes as being decoupled. They are thus assembled into the total k_e as shown in block-matrix form below:


k_e = \begin{bmatrix} k_{xy} & \\ & k_z \end{bmatrix}.    (34)

We built the 2D k_{xy} using the formulation as described by Zienkiewicz and Cheung (1967) and Bathe (1982). This formulation has the following stress modes:

\epsilon = \begin{bmatrix} \epsilon_x \\ \epsilon_y \\ \gamma_{xy} \end{bmatrix} = \begin{bmatrix} \partial u / \partial x \\ \partial v / \partial y \\ \partial u / \partial y + \partial v / \partial x \end{bmatrix},    (35)

where u and v correspond to displacements along the x and y dimensions of the local coordinate system. Using the relation \epsilon(x) = B(x) u and the overall displacement vector

u_e = \begin{bmatrix} u_1 & v_1 & u_2 & v_2 & u_3 & v_3 \end{bmatrix}^T,    (36)

we can solve for B. In addition, the material matrix C is

C = \frac{E (1 - \nu)}{(1 + \nu)(1 - 2\nu)} \begin{bmatrix} 1 & \frac{\nu}{1-\nu} & 0 \\ \frac{\nu}{1-\nu} & 1 & 0 \\ 0 & 0 & \frac{1-2\nu}{2(1-\nu)} \end{bmatrix},    (37)

where E is the elastic modulus and \nu is Poisson's ratio. For the lip model, Poisson's ratio was chosen to be 0.01. Since the elastic modulus E is a constant multiplier of the entire material matrix, it can be used to vary the stiffness of the element as a whole. As a result, a default value of 1.0 was used for this parameter. Elements that were to be more or less stiff than the default material were then assigned larger and smaller values, respectively.

The next step is to relate the out-of-plane degrees of freedom. It is important at this stage to consider the desired behavior of the material. If it were important for nodes to be able to move independently out of the plane without causing strain in the adjoined nodes, the k_z of Eq. (34) should be diagonal. In this case, however, it is desired that ``pulling'' on a given node has the effect of ``pulling'' its neighbors along with it. As a result, we construct the following k_z:

k_z = \frac{E (1 - \nu)}{(1 + \nu)(1 - 2\nu)} \begin{bmatrix} 1 & -0.5 & -0.5 \\ -0.5 & 1 & -0.5 \\ -0.5 & -0.5 & 1 \end{bmatrix}.    (38)

Consider this matrix in terms of the basic relation k_e u_e = f_e. A positive displacement (out of the plane) in only one of these degrees of freedom produces negative forces (into the plane) in the other two. This means that stretching one node out of the plane without moving the other two would require forces pushing the other two down. When equal displacements are applied to all three nodes, there is zero strain on the element, resulting in a rigid motion mode out of the plane. Though the out-of-plane degrees of freedom are decoupled from the in-plane components, this mechanism acts to relate the strain energy to the deformation of the element due to out-of-plane displacements. The greater the disparity in out-of-plane displacements (i.e., the greater the stretching of the element due to such motions), the greater the strain energy produced by this k_z.

Once the degrees of freedom are properly aligned as described above, the resulting material has the approximate physics of many small plane-strain elements hooked together. The in-plane forces of one element can pull on both the in-plane and out-of-plane components of its neighbors, and vice versa.

Once the complete K matrix has been assembled, we have a linear approximation to the relationship between the displacement and the resulting strain. We can now invert this relationship to find the displacements produced by an applied force (external strain). However, the matrix cannot be inverted as is: it is necessarily singular, as there are several displacement vectors that produce no stress in the body (i.e., they exist in the nullspace of K). These are the modes of rigid motion. Consider, for example, a uniform displacement applied to all of the nodes. This would clearly produce no strain on the body. As a result, a minimum set of nodes must be ``grounded'' (i.e., held fixed) to prevent these singularities. For a 3D body, two nodes (6 DOF) must be grounded. This amounts to removing the rows and columns corresponding to the degrees of freedom for these nodes. The remaining K_s has full rank and can be inverted to provide the desired strain–stress relation:

K_s^{-1} f_s = u_s.    (39)


While K is easy to compute and is band diagonal (due to the limited interconnections between nodes), finding its inverse is an expensive calculation. We thus want to take this inverse only once, at a point where it is appropriate to linearize the physics.
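A small sketch of this grounding and inversion step is given below; grounded_dofs is an assumed index list holding the six degrees of freedom of the two fixed nodes.

```python
# Sketch of grounding the fixed DOFs and inverting the reduced matrix (Eq. (39)).
import numpy as np

def grounded_inverse(K, grounded_dofs):
    free = np.setdiff1d(np.arange(K.shape[0]), grounded_dofs)
    K_s = K[np.ix_(free, free)]
    return free, np.linalg.inv(K_s)   # K_s^-1 maps forces to displacements of the free DOFs
```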

A.3. Applying the method to the lips

The method described above can be directly applied to the mesh in Fig. 1, resulting in a physical model in that shape made up of a uniform elastic material. However, in order to accentuate certain properties of the lips, some additional information was added to the model. First, in order to maintain smoothness in the inner contours of the lips, the faces along the inside ridges of the lips were made twice as stiff as the default material. In addition, to allow relatively free deformation of the lips while still maintaining the necessary rigid constraints, a thin strip of low-stiffness elements was added to the top of the lips stretching back into the oral cavity. The nodes at the far end of this strip were then fixed in 3D. Lastly, since the FEM linearizes the physics of a body around a given point, the initial K matrix (Fig. 1) was used to deform the model to a more natural state of the lips (see Fig. 2), as described in the main text. The K and K_s^{-1} matrices used for the training were formed at this point to allow a more effective range for the linearized physics.

References

Adjoudani, A., Benoît, C., 1995. On the integration of auditory and visual parameters in an HMM-based ASR. In: Stork, D., Hennecke, M. (Eds.), Speechreading by Man and Machine. Springer, Berlin, pp. 461–472.

Basu, S., 1997. A three-dimensional model of human lip motion. Master's Thesis, MIT Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge.

Basu, S., Essa, I., Pentland, A., 1996. Motion regularization for model-based head tracking. In: Proc. 13th Internat. Conf. on Pattern Recognition, Vienna, 25–29 August 1996. IEEE Computer Society Press, Los Alamitos, Vol. C, pp. 611–616.

Basu, S., Oliver, N., Pentland, A., 1998. 3D modeling and tracking of human lips. In: Proc. Internat. Conf. on Computer Vision, Bombay, 4–7 January 1998. IEEE Computer Society Press, Los Alamitos, pp. 337–343.

Bathe, K., 1982. Finite Element Procedures in Engineering Analysis. Prentice-Hall, Englewood Cliffs.

Bregler, C., Omohundro, S., 1995. Nonlinear image interpolation using manifold learning. In: Tesauro, G., Touretzky, D., Leen, T. (Eds.), Advances in Neural Information Processing Systems, Vol. 7. MIT Press, Cambridge, pp. 401–408.

Coianiz, T., Torresani, L., Caprile, B., 1995. 2D deformable models for visual speech analysis. In: Stork, D., Hennecke, M. (Eds.), Speechreading by Man and Machine. Springer, Berlin, pp. 391–398.

Essa, I., 1995. Analysis, interpretation and synthesis of facial expressions. Ph.D. Thesis, MIT Department of Media Arts and Sciences, Massachusetts Institute of Technology, Cambridge.

Gelb, A. (Ed.), 1974. Applied Optimal Estimation. MIT Press, Cambridge.

Jebara, T., Pentland, A., 1997. Parametrized structure from motion for 3D adaptive feedback tracking of faces. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition '97, San Juan, 17–19 June 1997. IEEE Computer Society Press, Los Alamitos, pp. 144–150.

Kass, M., Witkin, A., Terzopoulos, D., 1988. Snakes: Active contour models. Internat. J. Computer Vision 1 (4), 321–331.

Lee, Y., Terzopoulos, D., Waters, K., 1995. Realistic modeling for facial animation. In: Proc. SIGGRAPH '95, Orlando, 6–11 August 1995. Addison-Wesley, Reading, MA, pp. 55–62.

Luettin, J., Thacker, N., Beet, S., 1996. Visual speech recognition using active shape models and hidden Markov models. In: Proc. Internat. Conf. on Acoustics, Speech and Signal Processing 1996, Atlanta. IEEE Press, New York, pp. 817–820.

Martin, J., Pentland, A., Sclaroff, S., Kikinis, R., 1998. Characterization of neuropathological shape deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (2), 97–112.

Oliver, N., Pentland, A., Bérard, F., 1997. LAFTER: Lips and face real time tracker. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition '97, San Juan, 17–19 June 1997. IEEE Computer Society Press, Los Alamitos, pp. 123–129.

Summerfield, Q., 1979. Use of visual information for phonetic perception. Phonetica 36, 314–331.

Waters, K., Frisbie, J., 1995. A coordinated muscle model for speech animation. In: Davis, W., Prusinkiewicz, P. (Eds.), Graphics Interface '95. Canadian Human–Computer Communications Society, Ontario, pp. 163–170.

Zienkiewicz, O.C., Cheung, Y.K., 1967. The Finite Element Method in Structural and Continuum Mechanics. McGraw-Hill, London.
