
Estimating coloured 3D face models from single images: An example based approach

Thomas Vetter and Volker Blanz

Max-Planck-Institut für biologische Kybernetik, Spemannstr. 38, 72076 Tübingen, Germany
{thomas.vetter,volker.blanz}@tuebingen.mpg.de

Abstract. In this paper we present a method to derive 3D shape and surface texture of a human face from a single image. The method draws on a general flexible 3D face model which is “learned” from examples of individual 3D face data (Cyberware scans). In an analysis-by-synthesis loop, the flexible model is matched to the novel face image.

From the coloured 3D model obtained by this procedure, we can generate new images of the face across changes in viewpoint and illumination. Moreover, nonrigid transformations which are represented within the flexible model can be applied, for example changes in facial expression.

The key problem for generating a flexible face model is the computation of dense correspondence between all given 3D example faces. A new correspondence algorithm is described which is a generalization of common algorithms for optic flow computation to 3D face data.

1 Introduction

An almost infinite number of different images can be generated from the human face. Any system that tries to analyse images of faces is confronted with the problem of separating different sources of image variation. For example, in order to identify the face of a person, the system must be able to ignore changes in illumination, orientation and facial expression, but it must be highly sensitive to attributes that are characteristic of identity. One approach to solving this problem is to transform the input image into a feature vector which is then compared to stored representations. In these ‘bottom-up’ strategies, the crucial problem is to choose appropriate features and criteria for comparison. An alternative approach is to build an explicit model of the variations of images. Image analysis is then performed by matching the image model to a novel image, thereby coding the novel image in terms of the known model.

This paper focuses on the analysis and synthesis of images of a specific object class, that is, on images of human faces. A model of an entire object class, such as faces or cars, in which all objects share a general similarity, can be learned from a set of prototypical examples. Such a model, intended for the analysis and synthesis of images of a specific class of objects, must solve two problems simultaneously:

H. Burkhardt, B. Neumann (Eds.): Computer Vision – ECCV '98, Vol. II, LNCS 1407, pp. 499–513, 1998. © Springer-Verlag Berlin Heidelberg 1998


– The model must be able to synthesize images that cover the whole range of possible images of the class.

– Matching the model to a novel image must be possible while avoiding local minima.

Recently, two-dimensional image-based face models have been constructed and applied for the synthesis of rigid and nonrigid face transformations [1,2,3]. These models exploit prior knowledge from example images of prototypical faces and work by building flexible image-based representations (active shape models) of known objects as linear combinations of labeled examples. These representations are used for image search, recognition and synthesis [3]. The underlying coding of an image of a new object or face is based on linear combinations of prototypical images in terms of both two-dimensional shape (warping fields) and color values at corresponding locations (texture).

For the problem of synthesizing novel views from a single example image of a face, we have developed over the last years the concept of linear object classes [4]. This image-based method allows us to compute novel views of a face from a single image. On the one hand, the method draws on a general flexible image model which can be learned automatically from example images, and on the other hand, on an algorithm that allows this flexible model to be matched to a novel face image. The novel image can then be described or coded by means of the internal model parameters which are necessary to reconstruct the image. The design of the model also allows new views of the face to be synthesized.

In this paper we replace the two-dimensional image model by a three-dimensional flexible face model. A flexible three-dimensional face model will lead on the one hand to a more efficient data representation, and on the other hand to a better generalization to new illumination conditions.

In all these techniques, it is crucial to establish the correspondence between each example face and a single reference face, either by matching image points in the two-dimensional approach, or surface points in the three-dimensional case. Correspondence is a key step posing a difficult problem. However, for images of objects which share many common features, such as faces all seen from a single specific viewpoint, automated techniques seem feasible. Techniques applied in the past can be separated into two groups: those which establish the correspondence for a small number of feature points only, and those computing the correspondence for every pixel in an image. In the first approach, models of particular features such as the eye corners or the whole chin line are developed offline and then matched to a new image [3]. The second technique computes the correspondence for each pixel in an image by comparing this image to a reference image using methods derived from optical flow computation [2,5].

In this paper, the method of dense correspondence, which we have already applied successfully to face images [4], will be extended to three-dimensional face data. Firstly, we describe the flexible three-dimensional face model and compare it to the two-dimensional image models we used earlier. Secondly, we describe an algorithm to compute dense correspondence between individual 3D models of human faces. Thirdly, we describe an algorithm that allows us to match the flexible face model to a novel image. Finally, we show examples of synthesizing new images from a single image of a face and describe future improvements.

2 Flexible 3D face models

In this section we will give a formulation of a flexible three-dimensional face model which captures prior knowledge about faces by exploiting the general similarity among faces. The model is a straightforward extension of the linear object class approach as described earlier [4]. Prototypical examples are linked to a general class model that captures regularities specific to this object class.

Three-dimensional models: In computer graphics, at present the most realistic three-dimensional face representations consist of a 3D mesh describing the geometry, and a texture map capturing the color data of a face. These representations of individual faces are obtained either by three-dimensional scanning devices or by means of photogrammetric techniques from several two-dimensional images of a specific face [6,7]. Synthesis of new faces by interpolation between such face representations was already demonstrated in the pioneering work of Parke (1974). Recently, the idea of forming linear combinations of faces has been used and extended to a general three-dimensional flexible face model for the analysis and synthesis of two-dimensional facial images [1,4].

Shape model: The three-dimensional geometry of a face is represented by a shape vector S = (X1, Y1, Z1, X2, ..., Yn, Zn)^T ∈ R^{3n} that contains the X, Y, Z-coordinates of its n vertices. The central assumption for the formation of a flexible face model is that a set of M example faces Si is available. Additionally, it is assumed that all these example faces Si consist of the same number n of consistently labeled vertices; in other words, all example faces are in full correspondence (see the next section on correspondence). Usually this labeling is defined on an average face shape, which is obtained iteratively and is often denoted as the reference face Sref. Additionally, all faces are assumed to be aligned in an optimal way by rotating and translating them in three-dimensional space. Under these assumptions, a new face geometry Smodel can be generated as a linear combination of the M example shape vectors Si, each weighted by a coefficient ci:

$$S_{model} = \sum_{i=1}^{M} c_i S_i . \qquad (1)$$

The linear shape model allows any given shape S to be approximated as a linear combination with coefficients that are computed by projecting S onto the example shapes Si. The coefficients ci of the projection then define a coding of the original shape vector in the vector space spanned by all examples.
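To make this coding step concrete, here is a minimal Python/NumPy sketch (ours, not from the paper; the array layout and the use of a least-squares solver are assumptions) that recovers the coefficients ci for a given shape and reconstructs it according to equation (1):

import numpy as np

def code_shape(S, S_examples):
    # Project a shape vector S (length 3n) onto the span of the M example
    # shapes (rows of S_examples, shape (M, 3n)) by solving the
    # least-squares problem min_c || S - sum_i c_i S_i ||^2.
    c, *_ = np.linalg.lstsq(S_examples.T, S, rcond=None)
    return c

def reconstruct_shape(c, S_examples):
    # Equation (1): S_model = sum_i c_i S_i.
    return S_examples.T @ c

Texture vectors Ti are coded in exactly the same way, with coefficients bi (equation (2) below).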


Fig. 1. Three-dimensional head data represented in cylindrical coordinates result in a data format which consists of two 2D images. One image is the texture map (top right), and in the other image the geometry is coded as radii data (bottom right).

Texture model: The second component of a flexible three-dimensional face or head model is texture information, which is usually stored in a texture map. A texture map is simply a two-dimensional color pattern storing the brightness or color values (ideally only the albedo of the surface). This pattern can be recorded in a scanning process or generated synthetically. A u, v coordinate system associates the texture map with the modeled surface: the texture map is defined in the two-dimensional u, v coordinate system, and for polygonal surfaces as defined by a shape vector S, each vertex has an assigned u, v texture coordinate. Between vertices, u, v coordinates are interpolated.

The linear texture model described in [1] starts from a set of M example face textures Ti. Equivalently to the shape model described earlier, it is assumed that all M textures Ti consist of the same number n of consistently labeled texture values, that is, all textures are in full correspondence. For texture synthesis, linear models are used again. A new texture Tmodel is generated as the weighted sum of the M given example textures Ti as follows:

$$T_{model} = \sum_{i=1}^{M} b_i T_i . \qquad (2)$$

Equivalently to the linear expansion of shape vectors (equation 1), the linear expansion of textures can be understood and used as an efficient coding scheme for textures. A new texture can be coded by its M projection coefficients bi in the ‘texture vector space’ spanned by the M basis textures.

3 3D Correspondence with Optical Flow

In order to construct a flexible 3D face model, it is crucial to establish correspondence between a reference face and each individual face example. For all vertices of the reference face, we have to find the corresponding vertex location on each face in the dataset. If, for example, vertex j in the reference face is located on the tip of the nose, with a 3D position described by the vector components Xj, Yj, Zj in Sref, then we have to store the position of the tip of the nose of face i in the vector components Xj, Yj, Zj of Si. In general, this is a difficult problem, and it is difficult to formally specify what correct correspondence is supposed to be. However, assuming that there are no categorical differences, such as some faces having a beard and others not, an automatic method for computing the correspondence is feasible.

The key idea of the work described in this section is to modify an existing optical flow algorithm to match points on the surfaces of three-dimensional objects instead of points in 2D images.

Optical Flow Algorithm

In video sequences, in order to estimate the velocities of scene elements with respect to the camera, it is necessary to compute the vector field of optical flow, which defines the displacements (δx, δy) = (x2 − x1, y2 − y1) between points p1 = (x1, y1) in the first image and corresponding points p2 = (x2, y2) in the following image.

A variety of different optical flow algorithms have been designed to solve this problem (for a review see [8]). Unlike temporal sequences taken from one scene, a comparison of images of completely different scenes or faces may violate a number of important assumptions made in optical flow estimation. However, some optical flow algorithms can still cope with this more difficult matching problem.

In previous studies [4], we built flexible image models of faces based on correspondence between images, using a coarse-to-fine gradient-based method [9] and following an implementation described in [10]: for every point (x, y) in an image I(x, y), the algorithm attempts to minimize the error term

$$E(x, y) = \sum_{R(x,y)} (I_x\,\delta x + I_y\,\delta y - \delta I)^2$$

with respect to δx, δy, where Ix, Iy are the spatial image derivatives and δI is the difference of grey levels of the two compared images. The region R is a 5x5 pixel neighbourhood of (x, y). Solving this optimization problem is achieved in a single iteration.

Since this is only a crude approximation to the overall matching problem, an iterative coarse-to-fine strategy is required. The algorithm starts with an estimation of correspondence on low-resolution versions of the two input images. The resulting flow field is used as an initial value for the computation on the next higher level of resolution. Iteratively, the algorithm proceeds to full resolution.

In our applications, results were dramatically improved if images on each level of resolution were computed not only by downsampling the original (Gaussian pyramid), but also by band-pass filtering (Laplacian pyramid). The Laplacian pyramid was computed from the Gaussian pyramid adopting the algorithm proposed by [11].
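For illustration, the following sketch (ours; it uses NumPy/SciPy stand-ins for the filters of [9,10] and omits the pyramid loop) solves the 5x5 least-squares problem above for every pixel of one resolution level:

import numpy as np
from scipy.ndimage import uniform_filter

def flow_one_level(I1, I2):
    # Spatial derivatives of the first image and grey-level difference.
    Iy, Ix = np.gradient(I1)
    dI = I2 - I1
    # Sums over the 5x5 neighbourhood R(x, y), via a box filter.
    s = lambda a: uniform_filter(a, size=5)
    A11, A12, A22 = s(Ix * Ix), s(Ix * Iy), s(Iy * Iy)
    b1, b2 = s(Ix * dI), s(Iy * dI)
    # Solve the per-pixel 2x2 normal equations for (dx, dy).
    det = A11 * A22 - A12 ** 2
    det = np.where(np.abs(det) < 1e-8, np.inf, det)  # aperture problem: flow -> 0
    dx = (A22 * b1 - A12 * b2) / det
    dy = (A11 * b2 - A12 * b1) / det
    return dx, dy

In the full algorithm this solve is embedded in the coarse-to-fine loop described above, with the input images taken from the Laplacian rather than the Gaussian pyramid.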

Three-dimensional face representations

The adaptation and extension of this optical flow algorithm to face surfaces in 3D is straightforward due to the fact that these two-dimensional manifolds can be parameterized in terms of two variables: in a cylindrical representation (see figure 1), faces are described by radius and color values at each angle φ and height h. Images, on the other hand, consist of grey-level values in image coordinates x, y. Thus, in both cases correspondence can be expressed by a mapping C : R^2 → R^2 in parameter space.

In order to compute correspondence between different heads, both texture and geometry were considered simultaneously. The optical flow algorithm described above had to be modified in the following way: instead of comparing scalar grey-level functions I(x, y), our modification of the algorithm attempts to find the best fit for the vector function

$$F(h, \phi) = \big(\, radius(h, \phi),\ red(h, \phi),\ green(h, \phi),\ blue(h, \phi) \,\big)^T$$

in the norm

$$\| (radius, red, green, blue)^T \|^2 = w_1 \cdot radius^2 + w_2 \cdot red^2 + w_3 \cdot green^2 + w_4 \cdot blue^2 .$$

The coefficients w1...w4 correct for the different contrasts in range and color values, assigning approximately the same weight to variations in radius as to variations in all color channels taken together. Radius values can be replaced by other surface properties, such as Gaussian curvature or surface normals, in order to represent the shapes of faces more appropriately.

The displacement between corresponding surface points is expressed by a correspondence function

$$C(h, \phi) = (\delta h(h, \phi),\ \delta \phi(h, \phi)). \qquad (3)$$
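A sketch of the corresponding modification (ours, reusing the helpers of flow_one_level above; the channel stacking and unit weights are assumptions) accumulates the weighted normal equations over all four channels of F:

def flow_one_level_3d(F1, F2, w=(1.0, 1.0, 1.0, 1.0)):
    # F1, F2: arrays of shape (4, H, W) holding (radius, red, green, blue)
    # sampled over the (h, phi) grid. The 2x2 normal equations of the
    # scalar case are summed over the channels with weights w1..w4.
    s = lambda a: uniform_filter(a, size=5)
    A11 = A12 = A22 = b1 = b2 = 0.0
    for k in range(4):
        Fy, Fx = np.gradient(F1[k])
        dF = F2[k] - F1[k]
        A11 = A11 + w[k] * s(Fx * Fx)
        A12 = A12 + w[k] * s(Fx * Fy)
        A22 = A22 + w[k] * s(Fy * Fy)
        b1 = b1 + w[k] * s(Fx * dF)
        b2 = b2 + w[k] * s(Fy * dF)
    det = A11 * A22 - A12 ** 2
    det = np.where(np.abs(det) < 1e-8, np.inf, det)
    # Returns (delta h, delta phi) of the correspondence function C(h, phi).
    return (A22 * b1 - A12 * b2) / det, (A11 * b2 - A12 * b1) / det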

Interpolation in low-contrast areas

It is well known that in areas with no contrast or with strongly oriented intensity gradients, the problem of optical flow computation cannot be uniquely solved based on local image properties alone (the aperture problem). In our extension of the algorithm to surfaces of human faces, there is no structure to define correct correspondence on the cheeks, along the eyebrows and in many other areas, and indeed the method described so far yields spurious results there. While these problems might remain undetected in a simple morph between two faces, they still have a significant impact on the quality of the flexible face model.

The ambiguities of correspondence caused by the aperture problem can be resolved if the flow field is required to be smooth. In a number of optical flow algorithms, smoothness constraints have been implemented as part of an iterative optimization of correspondence [12,13,14].

In our algorithm, smoothing is performed as a separate process after the estimation of flow on each level of resolution. For the smoothed flow field (δh′(h, φ), δφ′(h, φ)), an energy function is minimized using conjugate gradient descent such that, on the one hand, flow vectors are kept as close to constant as possible over the whole domain, and on the other hand, as close as possible to the flow field (δh(h, φ), δφ(h, φ)) obtained in the computation described above. The first condition is enforced by quadratic potentials that increase with the square distances between each individual flow vector and its four neighbours. These interconnections have equal strength over the whole domain. The second condition is enforced by quadratic potentials that depend on the square distance between (δh′(h, φ), δφ′(h, φ)) and (δh(h, φ), δφ(h, φ)) at every position (h, φ). These potentials vary over the (h, φ) domain: if the gradient of colour and radius values, weighted in the way described above, is above a given threshold, the coupling factor is set to a fixed, high value in the direction along the gradient, and to zero in the orthogonal direction. This allows the flow vector to move along an edge during the relaxation process. In areas with gradients below threshold, the potential vanishes, so the flow vector depends on its neighbours only. With our choice of the threshold, only 5% of all flow vectors were set free in the low-resolution step, but 85% in the final full-resolution computation.
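A much-simplified sketch of this relaxation (ours; the paper couples along the gradient direction and minimizes with conjugate gradient descent, whereas here the coupling is a scalar per-pixel weight and the fixed point is reached by Jacobi iterations):

def smooth_flow(flow, weight, lam=1.0, iters=200):
    # flow: raw field of shape (2, H, W); weight: (H, W) data-coupling
    # factor, high where the weighted gradient exceeds the threshold and
    # zero in low-contrast areas, where the vector follows its neighbours.
    out = flow.copy()
    for _ in range(iters):
        p = np.pad(out, ((0, 0), (1, 1), (1, 1)), mode='edge')
        nbr = (p[:, :-2, 1:-1] + p[:, 2:, 1:-1] +
               p[:, 1:-1, :-2] + p[:, 1:-1, 2:]) / 4.0
        # Fixed point of lam*(out - nbr) + weight*(out - flow) = 0.
        out = (lam * nbr + weight * flow) / (lam + weight)
    return out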

4 Matching the flexible 3D model to a 2D image

Based on an example set of faces which are already in correspondence, new 3D shape vectors Smodel and texture maps Tmodel can be generated by varying the coefficients ci and bi in equations (1) and (2). Combining model shape and model texture results in a complete 3D face representation which can be rendered to a new model image Imodel. This model image is not fully specified by the model parameters ci and bi; it also depends on projection parameters pj and on the parameters rj of surface reflectance properties and illumination used for rendering. For the general problem of matching the model to a novel image Inovel we define the following error function:

$$E(c, b, p, r) = \frac{1}{2} \sum_{x,y} \big[ I_{novel}(x, y) - I_{model}(x, y) \big]^2 \qquad (4)$$

where the sum is over all pixels (x, y) in the images, Inovel is the novel image being matched and Imodel is the current guess for the model image for a specific parameter setting (c, b, p, r). Minimizing the error yields the model image which best fits the novel image with respect to the L2 norm.

However, the optimization of the error function in equation (4) is extremely difficult for several reasons. First, the function is not linear in most of the parameters; second, the number of parameters is large (> 100); and additionally, the whole computation is extremely expensive, since it requires rendering the three-dimensional face model to an image for each evaluation of the error function.

In this paper we simplify the problem by assuming that the illumination parameters r and the projection parameters p, such as viewpoint, are known. This assumption allows us to reduce the amount of rendering and also to use image modeling techniques developed earlier [15,16]. By rendering images from all example faces under fixed illumination and projection parameters, the flexible 3D model is transformed into a flexible 2D face model. This allows us to generate new model images depicting faces in the requested spatial orientation and under the known illumination. After matching this flexible 2D model to the novel image (see below), the optimal model parameters are used within the flexible 3D model to generate a three-dimensional face representation which best matches the novel target image.

4.1 Linear image model

To build the flexible 2D model, we first render all 3D example faces under the given projection and illumination parameters to images I0, I1, ..., IM. Let I0 be the reference image, and let positions within I0 be parameterized by (u, v). Pixelwise correspondences between I0 and each example image are mappings sj : R^2 → R^2 which map the points of I0 onto Ij, i.e. sj(u, v) = (x, y), where (x, y) is the point in Ij which corresponds to (u, v) in I0. We refer to sj as a correspondence field and, interchangeably, as the 2D shape vector of the vectorized Ij.

The 2D correspondence sj between each pixel in the rendered reference image I0 and its corresponding location in each rendered example image Ij can be computed directly from the projection P of the differences of the 3D shapes between all 3D faces and the reference face, P Sj − P S0. Warping image Ij onto the reference image I0, we obtain tj as:

$$t_j(u, v) = I_j \circ s_j(u, v) \quad \Leftrightarrow \quad I_j(x, y) = t_j \circ s_j^{-1}(x, y).$$

So {tj} is the set of shape-normalized prototype images, referred to as texture vectors. They are normalized in the sense that their shape is the same as the shape of the chosen reference image.

The flexible image model is the set of images Imodel, parameterized by c = [c0, c1, ..., cM] and b = [b0, b1, ..., bM] such that

$$I_{model} \circ \Big( \sum_{i=0}^{M} c_i s_i \Big) = \sum_{j=0}^{M} b_j t_j . \qquad (5)$$

The summation $\sum_{i=0}^{M} c_i s_i$ constrains the 2D shape of every model image to be a linear combination of the example 2D shapes. Similarly, the summation $\sum_{j=0}^{M} b_j t_j$ constrains the texture of every model image to be a linear combination of the example textures.

For any values of ci and bi, a model image can be rendered by computing $(x, y) = \sum_{i=0}^{M} c_i s_i(u, v)$ and $g = \sum_{j=0}^{M} b_j t_j(u, v)$ for each (u, v) in the reference image. Then the pixel (x, y) is rendered by assigning Imodel(x, y) = g, that is, by warping the texture into the model shape.
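As a sketch of this rendering step (ours; it assumes the shape fields store absolute image positions, coefficients that sum to roughly one so that positions remain valid, and crude nearest-pixel splatting, since the paper does not specify the resampling):

def render_model_image(c, b, shapes, textures, out_shape):
    # shapes: (M+1, 2, H, W) with s_i(u, v) = (x, y); textures: (M+1, H, W).
    xy = np.tensordot(c, shapes, axes=1)   # model 2D shape, (2, H, W)
    g = np.tensordot(b, textures, axes=1)  # model texture at (u, v), (H, W)
    out = np.zeros(out_shape)
    xs = np.clip(np.rint(xy[0]).astype(int), 0, out_shape[1] - 1)
    ys = np.clip(np.rint(xy[1]).astype(int), 0, out_shape[0] - 1)
    out[ys, xs] = g   # warp the texture into the model shape
    return out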


4.2 Matching a 2D face model to an image

For matching the flexible image model to a novel image, we use the method described in [15,16]. In 2D, the error function defined in equation (4) reduces to a function of the model parameters c and b:

$$E(c, b) = \frac{1}{2} \sum_{x,y} \big[ I_{novel}(x, y) - I_{model}(x, y) \big]^2 .$$

In order to compute Imodel (see equation (5)), the shape transformation $(\sum c_i s_i)$ has to be inverted, or one has to work in the coordinate system (u, v) of the reference image, which is computationally more efficient. Therefore, the shape transformation (given some estimated values for c and b) is applied to both Inovel and Imodel. From equation (5) we obtain

$$E = \frac{1}{2} \sum_{u,v} \Big[ I_{novel} \circ \Big( \sum_{i=0}^{M} c_i s_i(u, v) \Big) - \sum_{j=0}^{M} b_j t_j(u, v) \Big]^2 .$$

Minimizing this error yields the model image which best fits the novel image with respect to the L2 norm. The optimal model parameters c and b are found by a stochastic gradient descent algorithm [17], a method that is fast and has a low tendency to be caught in local minima. The robustness of the algorithm is further improved using a coarse-to-fine approach [11]. In addition to the texture pyramids, separate resolution pyramids are computed for the displacement fields s in x and y; a schematic sketch of the parameter search follows below.

Separate matching of facial subregions

In the framework described above, the flexible face model has M degrees of freedom for texture and M for shape, if M is the number of example faces. The number of degrees of freedom can be increased by dividing faces into independent subregions which are optimized independently [18], for example into eyes, nose, mouth and a surrounding region. Once correspondence is established, it is sufficient to define these regions on the reference face. In the linear object class, this segmentation is equivalent to splitting the vector space of faces into independent subspaces. The process of fitting the flexible face model to given images is modified in the following ways: first, model parameters c and b are estimated as described above, based on the whole image. Starting from these initial values, each segment is then optimized independently, with its own parameters c and b. The final 3D model is generated by computing linear combinations for each segment separately and blending the segments at the borders according to an algorithm proposed for images by [11,19].
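A schematic sketch of the parameter search (ours; eval_error is assumed to compute the reference-frame error above, and [17] describes the actual stochastic gradient method, which we only imitate here with noisy finite-difference gradients on random parameter subsets):

def stochastic_gradient_match(eval_error, c, b, steps=2000, lr=1e-3,
                              n_probe=8, eps=1e-4, seed=0):
    # Probing only a few random parameters per step keeps the number of
    # expensive error evaluations low, and the resulting gradient noise
    # helps the search avoid local minima.
    rng = np.random.default_rng(seed)
    theta, nc = np.concatenate([c, b]), len(c)
    for _ in range(steps):
        e0 = eval_error(theta[:nc], theta[nc:])
        grad = np.zeros_like(theta)
        for i in rng.choice(theta.size, size=n_probe, replace=False):
            t = theta.copy()
            t[i] += eps
            grad[i] = (eval_error(t[:nc], t[nc:]) - e0) / eps
        theta = theta - lr * grad
    return theta[:nc], theta[nc:]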

5 Novel view synthesis

After matching the 2D image model to the novel image, the 2D model parameters c and b can be used in the three-dimensional flexible face model as defined in equations (1) and (2). The justification of this parameter transfer is discussed in detail under the aspect of linear object classes in [4]. The output of the 3D flexible face model is an estimate of the three-dimensional shape from the two-dimensional image. Since this result is a complete 3D face model, new images can be rendered from any viewpoint or under any illumination condition.

Fig. 2. Correspondence between faces allows us to map expression changes from one face to the other. The difference between the two expressions in the top row (SMILE = EXPRESSION 2 − EXPRESSION 1) is mapped onto the left face in the lower row, multiplied by a factor of 1 (center) or 1.4 (lower right).

The correspondence between all faces within this flexible model allows non-rigid face transitions ‘learned’ from one face to be mapped onto all other faces in the model. In figure 2, the transformation for a smile is extracted from one person and then mapped onto the face of another person. Computing the correspondence between two examples of one person's face, one showing the face smiling and the other showing a neutral expression, results in a correspondence field or deformation field which captures the spatial displacement of each vertex in the model according to the smile. This expression-specific correspondence field is formally identical to the correspondence fields between different persons described earlier. Such a ‘smile vector’ can now be added to or subtracted from each face which is in correspondence with one of the originals, making a neutral-looking face smile, or giving a smiling face a more emotionless expression.
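In vector terms this transfer is simple arithmetic on corresponding shape (and texture) vectors; a minimal sketch (ours, names hypothetical):

def apply_expression(S_target, S_neutral, S_smiling, strength=1.0):
    # 'Smile vector' learned from one person: the deformation field
    # between a smiling and a neutral scan of the same face.
    smile = S_smiling - S_neutral
    # Valid because all faces are in full correspondence; strength 1.4
    # exaggerates the smile as in figure 2, negative values remove it.
    return S_target + strength * smile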

6 Data set

We used a 3D data set obtained from 200 laser-scanned (Cyberware™) heads of young adults (100 male and 100 female).

The laser scans provide head structure data in a cylindrical representation, with radii of surface points sampled at 512 equally spaced angles and at 512 equally spaced vertical distances.

Fig. 3. Three-dimensional reconstruction of a face from a single two-dimensional image of known orientation and illumination. The test image is generated from an original 3D scan (top) which is not part of the training set of 100 faces. Using the flexible face model derived from the training set, the test image can be reconstructed by optimizing parameters in a 2D matching process. These model parameters also describe the estimated 3D structure of the reconstructed head. The bottom row illustrates manipulations of surface properties, illumination and facial expression (left to right).

Additionally, the RGB color values were recorded at the same spatial resolution and stored in a texture map with 8 bits per channel. All faces were without makeup, accessories, and facial hair. After the head hair was removed digitally (but with manual editing), individual heads were represented by approximately 70000 vertices and the same number of color values.

We split our data set of 200 faces randomly into a training and a test set, each consisting of 100 faces. The training set was used to ‘learn’ a flexible model. From the test set, images were rendered showing the faces 30° from frontal and using mainly ambient light. The image size used in the experiments was 256-by-256 pixels with 8 bits per color channel.
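To make the cylindrical format concrete, a small conversion sketch (ours; the axis conventions and the vertical sampling distance are placeholders for the scanner's calibration):

def cylindrical_to_shape_vector(radii, height_step=1.0):
    # radii[h, a]: surface radius at 512 heights h and 512 angles a.
    H, A = radii.shape
    phi = np.linspace(0.0, 2.0 * np.pi, A, endpoint=False)
    X = radii * np.cos(phi)[None, :]
    Z = radii * np.sin(phi)[None, :]
    Y = np.broadcast_to((np.arange(H) * height_step)[:, None], radii.shape)
    # Interleave to S = (X1, Y1, Z1, ..., Xn, Yn, Zn) as in section 2.
    return np.stack([X, Y, Z], axis=-1).reshape(-1)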

Fig. 4. In order to recover more of the characteristics of a test image (left), after optimizing model parameters for the whole face (center), separate sets of parameters are optimized independently for different facial subregions (right). As subregions, we used eyes, nose, mouth and the surrounding area. Improvements can be seen in the reconstructed shape of the lips and eyes.

7 Results

Correspondence between all 3D example face models and a reference face model was computed automatically. The results were correct (by visual inspection) for almost all 200 faces; in only 7 cases did obvious correspondence errors occur.

Figure 3 shows an example of a three-dimensional face reconstruction from a single image of known orientation and illumination. After matching the model to the test image, the model parameters were used to generate a complete three-dimensional face reconstruction. Figure 3 illustrates the range of images that can be obtained when a full 3D head is reconstructed, including both manipulations of viewpoint, surface material or illumination, and changes of internal properties such as facial expression.

At present, evaluation of the three-dimensional face reconstructions from single images is based on visual inspection only. Out of 100 reconstructions, 72 faces were highly similar and often hard to distinguish from the original in a 30° view. In 27 cases, persons were still easy to identify, but the images displayed some differences, for example in the shape of the cheeks or jaw. Only one reconstruction showed a face clearly unlike the original, yet a face that could very well exist.


We rated the example shown in figure 3 as highly similar, but within this category it is average. While the reconstructed head appears very similar to the original image in a 30° view, it is not surprising that front and profile views reveal a number of differences.

Since texture vectors in the flexible 2D face model only code points that are part of the face, no particular background colour is specified. Performance should be roughly the same for any background that produces clear contours, and indeed results are almost identical for test images with black and light brown backgrounds.

The flexible model approach can also be applied to faces that are partially occluded, for example by sunglasses, or to persons with unusual features such as beards or moles. In each iteration of the matching process, contributions are ignored for those pixels which show the largest disagreement in color values with respect to the original image [15]. Along with a relatively stable reconstruction of all visible areas, the system yields an estimate of the appearance of occluded areas, based on the information available in the image and the internal structure of the face model (figure 5). Moreover, the approach allows detection of conspicuous image areas. Conspicuousness of an individual pixel can be measured in different ways, for example by the plain disagreement of reconstructed colour values with the original image, by the change in model parameters required to fit the original (used in figure 5), or by the loss in overall matching quality when fitting this specific pixel within the flexible face model. The second and third measures take into account that in some regions small changes of the model parameters lead to considerable changes in color value, while other regions show only small variations.
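The per-iteration occlusion handling can be sketched as a masking step (ours; the fraction of pixels to discard is an assumed tuning parameter):

def robust_residual(I_novel, I_model, discard_frac=0.1):
    # Color disagreement per pixel; occluders such as sunglasses produce
    # the largest residuals and are excluded from the parameter update.
    r = I_novel - I_model
    mag = np.abs(r).sum(axis=-1) if r.ndim == 3 else np.abs(r)
    r[mag >= np.quantile(mag, 1.0 - discard_frac)] = 0.0
    return r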

Fig. 5. Reconstruction of a partly occluded face (left: original; center: reconstruction; right: conspicuous pixels). In the matching process, pixels with a large disagreement in colour values were ignored. No segmentation was applied. The approach allows detection of conspicuous image areas (right, see text).


8 Conclusions

We have presented a method for approximating the three-dimensional shape of a face from just a single image. In an analysis-by-synthesis loop, a flexible 3D face model is matched to a novel image. The novel image can then be described or coded by means of the model parameters that reconstruct the image. Prior knowledge of the three-dimensional appearance of faces, derived from an example set of faces, allows new images of a face to be predicted. The results presented in this paper are preliminary. We plan to apply a more sophisticated evaluation of reconstruction quality based on ratings by naive human subjects and on automated similarity measures.

Clearly, the present implementation, with its intermediate step of generating a complete 2D face model, cannot be the final solution. Next, we plan, for each iteration step, to form linear combinations in our 3D representation, render an image from this model and then compare it with the target image. This requires several changes in our matching procedure to keep the computational costs tolerable.

Conditions in our current matching experiments were simplified in two ways. Firstly, all images were rendered from our test set of 3D face scans. Secondly, projection parameters and illumination conditions were known. The extension of the method to face images taken under arbitrary conditions, in particular to any photograph, will require several improvements. Along with a larger number of free parameters in the matching procedure, model representations need to be more sophisticated, especially in terms of the statistical dependence of the parameters. Reliable results on a wider variety of faces, such as different age groups or races, can only be obtained with an extended data set.

While the construction of 3D models from a single image is very difficult and often an ill-posed problem in a bottom-up approach, our example-based technique allows us to obtain satisfying results by means of a maximum likelihood estimate. The ambiguity of the problem is reduced when several images of an object are available, a fact that is exploited in stereo or motion-based techniques. In our framework, we can make use of this additional information by simultaneously optimizing the model parameters c and b for all images, while the camera and lighting parameters p and r are adjusted for each image separately.

The method presented in this paper appears to be complementary to non-model-based techniques such as stereo. While our approach is limited to results within a fixed model space, such techniques are often not reliable in areas with little structure. In the future, we plan to combine the benefits of both techniques.

References

1. C. Choi, T. Okazaki, H. Harashima, and T. Takebe, "A system of analyzing and synthesizing facial images," in Proc. IEEE Int. Symposium of Circuits and Systems (ISCAS91), pp. 2665–2668, 1991.

2. D. Beymer, A. Shashua, and T. Poggio, "Example-based image analysis and synthesis," A.I. Memo No. 1431, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1993.

3. A. Lanitis, C. Taylor, T. Cootes, and T. Ahmad, "Automatic interpretation of human faces and hand gestures using flexible models," in Proc. International Workshop on Face and Gesture Recognition (M. Bichsel, ed.), (Zurich, Switzerland), pp. 98–103, 1995.

4. T. Vetter and T. Poggio, "Linear object classes and image synthesis from a single example image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 733–742, 1997.

5. D. Beymer and T. Poggio, "Image representation for visual learning," Science, vol. 272, pp. 1905–1909, 1996.

6. F. Parke, "A parametric model of human faces," doctoral thesis, University of Utah, Salt Lake City, 1974.

7. T. Akimoto, Y. Suenaga, and R. Wallace, "Automatic creation of 3D facial models," IEEE Computer Graphics and Applications, vol. 13, no. 3, pp. 16–22, 1993.

8. J. Barron, D. Fleet, and S. Beauchemin, "Performance of optical flow techniques," Int. Journal of Computer Vision, pp. 43–77, 1994.

9. J. Bergen, P. Anandan, K. Hanna, and R. Hingorani, "Hierarchical model-based motion estimation," in Proceedings of the European Conference on Computer Vision, (Santa Margherita Ligure, Italy), pp. 237–252, 1992.

10. J. Bergen and R. Hingorani, "Hierarchical motion-based frame rate conversion," technical report, David Sarnoff Research Center, Princeton, NJ 08540, 1990.

11. P. Burt and E. Adelson, "The Laplacian pyramid as a compact image code," IEEE Transactions on Communications, no. 31, pp. 532–540, 1983.

12. B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, pp. 185–203, 1981.

13. E. Hildreth, The Measurement of Visual Motion. Cambridge: MIT Press, 1983.

14. H. H. Nagel, "Displacement vectors derived from second order intensity variations in image sequences," Computer Vision, Graphics and Image Processing, vol. 21, pp. 85–117, 1983.

15. M. Jones and T. Poggio, "Model-based matching by linear combination of prototypes," A.I. Memo, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1996.

16. T. Vetter, M. J. Jones, and T. Poggio, "A bootstrapping algorithm for learning linear models of object classes," in IEEE Conference on Computer Vision and Pattern Recognition – CVPR'97, (Puerto Rico, USA), IEEE Computer Society Press, 1997.

17. P. Viola, "Alignment by maximization of mutual information," A.I. Memo No. 1548, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1995.

18. T. Vetter, "Synthesis of novel views from a single face image," International Journal of Computer Vision, in press.

19. P. Burt and E. Adelson, "Merging images through pattern decomposition," in Applications of Digital Image Processing VIII, no. 575, pp. 173–181, SPIE – The International Society for Optical Engineering, 1985.