
Fitting an Active Appearance Model Generated from a 3D Morphable Model

Master Thesis

Frank Preiswerk
[email protected]

Version 1.1


A thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science in Computer Science

Department of Computer Science
Computer Graphics and Vision Research Group

University of Basel
Switzerland

http://www.cs.unibas.ch

Basel, July 2008 Frank Preiswerk


Acknowledgments

I would like to express my gratitude to the members of the Graphics and Vision Research Group for the pleasant atmosphere during the six months of preparation time for this work. I am particularly grateful to my supervisor Reinhard Knothe for all his efforts and to Brian Amberg for many helpful discussions. I would also like to acknowledge Pascal Paysan, who provided me with a large 3D Morphable Model database. Furthermore I want to thank Prof. Dr. T. Vetter for his "open-door policy" and for making this master thesis possible.

In addition I would like to acknowledge the work of my fellow student Daniel Jaeger, who proof-read parts of this thesis.

My deepest and sincere thanks go to my wonderful girlfriend Franziska for her love and patience. She was always there when I needed her.

Above all I want to thank my parents, who gave me the invaluable privilege in life to pursue the education of my choice. During that time they have supported me in every possible way and have always been there for me. It is to them that I dedicate this work.


“Any sufficiently advanced technology is indistinguishable from magic.”

Arthur C. Clarke, 1917–2008


Contents

1 Introduction
1.1 Motivation
1.2 Mathematical Notations
1.3 Overview
1.4 Comparison of 2D and 3D Models
1.4.1 Active Appearance Models (AAMs)
1.4.2 3D Morphable Models (3DMMs)
1.4.3 Comparison of Representational Power
1.5 Model-Based Vision
1.5.1 A Simple Example for Model Fitting

2 Prior Work
2.0.2 Shape-Free Models
2.0.3 Shape Models

3 Inverse Compositional Image Alignment
3.1 The Lucas-Kanade Algorithm
3.1.1 Problem Formulation
3.1.2 One-Dimensional Case
3.1.3 Generalization to Higher Dimensions
3.2 Compositional Image Alignment
3.2.1 Forwards Compositional Image Alignment
3.2.2 Inverse Compositional Image Alignment
3.2.3 Applying Inverse Compositional Image Alignment to AAMs

4 Automatic AAM Generation
4.0.4 Landmark Extraction

5 Results and Discussion
5.1 Fitting Results
5.2 Discussion
5.2.1 Thoughts about ICIA
5.2.2 Thoughts about AAMs
5.2.3 Model Building

6 Conclusion
6.0.4 Propositions for Future Work

Bibliography

Appendices

A Principal Component Analysis

B Gauss-Newton Algorithm


Chapter 1

Introduction

During the past decades, face recognition has evolved into an active field of research in computer vision. Face recognition can be described as the process of extracting information about a face in an image. This is a fundamental problem because it constitutes an essential building block of the human-computer interface. In the future, computers will not only be able to recognize the identity of a face, but also expressions like joy or anger, in order to react in a certain way. This extends the range of possible applications from well-known examples like surveillance or access control to more complex feedback systems.

There is certainly a long way to go until we reach this level of artificial intelligence. However, computer vision, and the problem of face recognition in particular, is at the very core of such ambitions. To address this problem, three-dimensional models such as 3D Morphable Models (3DMMs) [1] as well as two-dimensional approaches, for instance Active Appearance Models (AAMs) [2], have been developed in the area of model-based computer vision. As opposed to pure face detection and identification, such models can be used for 3D face reconstruction and animation.

For both AAMs and 3DMMs, algorithms have been developed to fit the models to input images, yielding a set of parameters that give a reconstruction of the input in terms of the model. In the case of a 3DMM this is a complete 3D reconstruction of the input face. However, the good result comes at a higher computational cost compared to 2D approaches. There are several reasons for this. A 3D model is a high-resolution structure of several tens of thousands of vertices, whereas 2D models typically consist of less than one hundred vertices only. The high resolution and the additional degrees of freedom in 3D, combined with the steepest descent algorithm that is normally used for matching a 3DMM to images, result in a high number of derivatives to be computed in every iteration.

At the same time, the other side of the coin for the faster fitting speed of 2D models


is the lack of information about depth and illumination. Another drawback lies in model construction, because the process involves labeling a set of example images: so far, the user has to manually select a set of landmark points (e.g. eyes, nose, mouth, contour) on a set of examples. With the 3D Morphable Model, on the other hand, a three-dimensional scan of a face is sufficient and requires only minor human input for building a model, because it can take advantage of an existing powerful registration algorithm [1].

1.1 Motivation

The arguments in the introduction show that two-dimensional and three-dimensional approaches to face recognition both have advantages and disadvantages. Ideally, we wish to combine the strengths of two-dimensional models (speed through fewer degrees of freedom and lower resolution) and three-dimensional models (higher fitting accuracy, explicit control of 3D pose, semi-automatic model construction) in order to obtain a better fitting system.

This thesis makes a first step towards combining the strengths of both approaches: the speed of AAMs and the power of 3DMMs. It proposes a method for automatic construction of a 2D Active Appearance Model based on a 3D Morphable Model. Additionally, the model is fitted to images using the Inverse Compositional Image Alignment algorithm [3], a recent algorithmic addition that increases the fitting speed of AAMs significantly.

The final goal is to accelerate fitting of a 3DMM by first fitting an AAM to the input and then using the result to make an initial guess of the 3D parameters, in order to obtain a good initialization for the 3DMM. Currently, this initialization has to be done manually.

Figure 1.1 shows an overview of the proposed solution for this task. First, a 3DMM is used to automatically generate a set of sample images to learn the AAM. One or more 3D heads are selected that shall be included in the Active Appearance Model (1). The parameter vector p3D here defines the shape, texture and pose of a face in 3D space. On the surface of the shape we can define a number of landmarks for eyes, mouth and more that will be used to construct a 2D shape model later. For every parameter vector p3D a 2D image can be generated by rendering the 3D object to an image (2). Besides the rendered image we can also obtain the position of the landmarks in the image plane. These steps can be repeated for any number of 3D parameters p3D, thereby introducing variability in identity and pose to the set of projected samples. Every projected image together with the corresponding landmarks constitutes an input sample for the generation of an Active Appearance Model (3).

Figure 1.1: Combined 2D/3D model fitting system

Once an AAM has been generated, it can be used on a novel image I_input(x, y) by applying an optimization algorithm to the image error
\[
E = \sum_{x,y} \left[ I_{model}(x, y) - I_{input}(x, y) \right]^2,
\]
thereby finding a set of AAM parameters that best describe the input (4).

As mentioned above, every image resulting from step (2) is a function of the 3D parameters p3D, since it is defined by them (together with the projection). There is thus a (non-linear) relation between p3D and p2D. By learning this relation from the training samples, e.g. using a Support Vector Machine (SVM), it is possible to create a reconstruction from 2D back to 3D (5). Finally, the learned function can be applied to the parameters of an AAM fit in order to make a 3D reconstruction from a novel image (6).

A number of sub-problems can be identified in figure 1.1:

• Model generation: An Active Appearance Model must be generated from a 3D Morphable Model.


• Model fitting: The model must be fitted to an input image.

• Interpretation of parameters: The final parameters of a 2D model fit must be mapped to 3D parameters.

This thesis addresses the first two sub-problems.

1.2 Mathematical Notations

Throughout this document we make use of the following mathematical notations, unless explicitly specified otherwise.

Variables written in normal font represent scalar values, bold lower-case variables represent vectors and bold upper-case variables represent matrices:

\[
\begin{aligned}
x, X &\in \mathbb{R} &&\text{are scalar values,}\\
\mathbf{x} &= (x_1, \dots, x_n)^T &&\text{is a vector } (\mathbf{x} \in \mathbb{R}^n),\\
\mathbf{X} &= \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix} &&\text{is a matrix } (\mathbf{X} \in \mathbb{R}^{m \times n}). \end{aligned} \tag{1.1}
\]

The scalar product of two vectors \(\mathbf{x}, \mathbf{y}\) is denoted as \(\mathbf{x} \cdot \mathbf{y} = \sum_i x_i\, y_i\).

Scalar-valued functions, i.e. functions that map to \(\mathbb{R}\), are usually written in lower case, although sometimes upper-case letters are used:

\[
\begin{aligned}
f(x) &\ \text{is a scalar-valued function } (\mathbb{R} \to \mathbb{R}),\\
f(\mathbf{x}) &\ \text{is a scalar-valued function } (\mathbb{R}^n \to \mathbb{R}),\\
F(x) &\ \text{is a scalar-valued function } (\mathbb{R} \to \mathbb{R}).
\end{aligned} \tag{1.2}
\]

Vector-valued functions are written in upper-case bold letters:
\[
\mathbf{F}(\mathbf{x}) \ \text{is a vector-valued function } (\mathbb{R}^n \to \mathbb{R}^m). \tag{1.3}
\]

In particular, we often use I(x) to refer to the scalar-valued image intensity function I : ℝ² → ℝ that maps pixel positions x = (x, y) to gray levels I(x). Another example is the vector-valued warp function W(x), W : ℝ² → ℝ², which maps pixel positions to pixel positions.


The derivative of a scalar-valued function F(x), F : ℝ → ℝ, is denoted
\[
\frac{dF(x)}{dx}, \tag{1.4}
\]
or simply F′(x). The partial derivative of the i-th component of a vector-valued function \(\mathbf{f}(\mathbf{x})\), \(\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m\), with respect to \(x_j\) is written
\[
\frac{\partial f_i(\mathbf{x})}{\partial x_j}. \tag{1.5}
\]

The gradient of f(x) is written
\[
\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} = \nabla f(\mathbf{x}), \tag{1.6}
\]
where \(\frac{\partial}{\partial \mathbf{x}}\) is the gradient operator as a column vector:
\[
\frac{\partial}{\partial \mathbf{x}} = \left[ \frac{\partial}{\partial x_1} \ \frac{\partial}{\partial x_2} \ \dots \ \frac{\partial}{\partial x_n} \right]^T. \tag{1.7}
\]

1.3 Overview

This thesis is structured as follows.

• Prior Work presents relevant approaches to 2D model building and fitting that have been developed so far.

• Inverse Compositional Image Alignment gives an insight into the algorithm that I adapt in my work.

• Automatic AAM Generation explains how an Active Appearance Model can be automatically generated from a 3D Morphable Model.

• Results and Discussion evaluates model fitting in a number of test cases and discusses the results.

• Conclusion summarizes this work and gives an outlook on possible improvements and extensions.

1.4 Comparison of 2D and 3D Models

In this section the concepts of Active Appearance Models (AAMs) and 3D Morphable Models (3DMMs) are introduced. Sections 1.4.1, 1.4.2 and 1.4.3, in particular the simplifications in notation used to highlight the similarities and differences between 2D and 3D models, are based on the work of Xiao et al. [4].


1.4.1 Active Appearance Models (AAMs)

An Active Appearance Model consists of a linear 2D shape model and a linear appearance model.

Linear 2D shape model The shape of an AAM is defined by 2D vertex locations and a triangulation based on the vertices. The vertices are characteristic landmarks on training images that have either been labeled manually or extracted using an algorithm. The shape of an AAM can be written as the matrix of the n 2D coordinates that make up the mesh:

\[
\mathbf{s} = \begin{pmatrix} u_1 & u_2 & \dots & u_n \\ v_1 & v_2 & \dots & v_n \end{pmatrix}. \tag{1.8}
\]

In a linear model, this matrix is usually expressed as a base shape s0 plus a linear combination of m shape matrices:

\[
\mathbf{s} = \mathbf{s}_0 + \sum_{i=1}^{m} p_i\, \mathbf{s}_i, \tag{1.9}
\]

where the parameters pi are the shape parameters of s.

AAMs are computed by applying PCA to a set of training shapes (see appendix A). The base shape s0 is the mean shape and the si are the orthonormal eigenvectors corresponding to the m largest eigenvalues.
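As an illustration, here is a minimal numpy sketch of this construction (helper names are hypothetical; training shapes are assumed to be given as already-aligned, flattened landmark vectors):

```python
import numpy as np

def build_shape_model(shapes, m):
    """Build the linear shape model of equation 1.9.

    shapes: (k, 2n) array, each row a flattened, aligned landmark
            configuration (u1, ..., un, v1, ..., vn).
    m:      number of shape basis vectors to keep.
    """
    s0 = shapes.mean(axis=0)            # base shape = mean shape
    X = shapes - s0                     # mean-free data
    # Rows of Vt are the orthonormal eigenvectors of the covariance
    # matrix, sorted by decreasing eigenvalue (PCA via SVD).
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return s0, Vt[:m]

def shape_instance(s0, S, p):
    """Evaluate s = s0 + sum_i p_i * s_i (equation 1.9)."""
    return s0 + p @ S
```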

Linear appearance model The appearance of an AAM is defined within the base mesh s0. We follow the "abuse of terminology" in [4] and let s0 also denote the set of pixels u = (u, v)ᵀ that lie inside the base shape s0. Then the appearance A(u) is an image defined over the pixels u ∈ s0. An appearance instance is a base appearance A0(u) plus a linear combination of l appearance images Ai(u):

\[
A(\mathbf{u}) = A_0(\mathbf{u}) + \sum_{i=1}^{l} \lambda_i\, A_i(\mathbf{u}), \tag{1.10}
\]

where the λi are the appearance parameters. Similar to the shape, the appearance parameters are obtained by applying PCA to the (shape-normalized) training images [2].

Image formation model AAMs use an image formation model that incorporates a 2D similarity transform N(s; q) to produce the final image. Given


a shape parameter vector p = (p1, . . . , pm), equation 1.9 is used to create the shape s of the AAM. The shape is then mapped into the image using N(s; q), with parameters q = (q1, q2, q3, q4)ᵀ encoding rotation, scaling and translation. Similarly, the appearance A(u) is computed from the appearance parameter vector λ = (λ1, . . . , λl)ᵀ using equation 1.10. The final AAM instance is then created by warping the appearance A(u) from the base mesh s0 to the model shape mesh in the image, N(s; q).

The pair of meshes s0 and N(s; q) define a piecewise affine warp, which we denote W(u; p; q). For each triangle in s0 there is a unique affine warp to the vertices of the corresponding triangle in N(s; q). We will discuss AAMs in more detail in section 2.0.3.
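A minimal sketch of this per-triangle affine warp, expressed through barycentric coordinates (function names are hypothetical): a pixel is written relative to its triangle in s0 and then mapped to the corresponding triangle of the target mesh.

```python
import numpy as np

def barycentric(u, a, b, c):
    """Barycentric coordinates of the 2D point u in triangle (a, b, c)."""
    T = np.column_stack([b - a, c - a])   # 2x2 edge matrix
    beta, gamma = np.linalg.solve(T, u - a)
    return 1.0 - beta - gamma, beta, gamma

def warp_point(u, tri_src, tri_dst):
    """Map u from a triangle of s0 to the corresponding triangle of
    N(s; q) with the unique affine warp between the two triangles."""
    alpha, beta, gamma = barycentric(u, *tri_src)
    a, b, c = tri_dst
    return alpha * a + beta * b + gamma * c
```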

1.4.2 3D Morphable Models (3DMMs)

A 3D Morphable Model consists of a linear 3D shape model and a linear appearance model.

Linear 3D shape model The shape of a 3DMM is defined by 3D vertex locations and a triangulation based on the vertices. The vertices are usually obtained from 3D range scans. The shape of a 3DMM can be written as the matrix of the n 3D coordinates that make up the mesh:

\[
\mathbf{s} = \begin{pmatrix} x_1 & x_2 & \dots & x_n \\ y_1 & y_2 & \dots & y_n \\ z_1 & z_2 & \dots & z_n \end{pmatrix}. \tag{1.11}
\]

In a linear model, this matrix is usually expressed as a base shape s0 plus a linear combination of m shape matrices:

\[
\mathbf{s} = \mathbf{s}_0 + \sum_{i=1}^{m} p_i\, \mathbf{s}_i, \tag{1.12}
\]

where the parameters pi are the shape parameters of s.

3DMMs are computed by applying PCA to a set of training shapes. The base shape s0 is the mean shape and the si are the orthonormal eigenvectors corresponding to the m largest eigenvalues.

Linear appearance model The appearance of a 3DMM is defined within a 2D triangulated mesh that has the same topology (vertex connectivity) as the base mesh s0. Let s∗0 denote the set of pixels u = (u, v)ᵀ that lie inside this 2D mesh. Then the appearance A(u) is an image defined over the pixels u ∈ s∗0. An


appearance instance is a base appearance A0(u) plus a linear combination of l appearance images Ai(u):

\[
A(\mathbf{u}) = A_0(\mathbf{u}) + \sum_{i=1}^{l} \lambda_i\, A_i(\mathbf{u}), \tag{1.13}
\]

where the λi are the appearance parameters. Similar to the shape, the appearance parameters are obtained by applying PCA to the training images that are acquired from the 3D range scanner and warped onto the 2D triangulated mesh s∗0.

Image formation model 3DMMs use a projection matrix as image formation model. The goal is to convert the shape s into a 2D mesh to produce the final image. We use the weak perspective model, which is defined by the matrix
\[
\mathbf{P} = \begin{pmatrix} i_x & i_y & i_z \\ j_x & j_y & j_z \end{pmatrix} \tag{1.14}
\]
and an offset of the origin (ox, oy). The weak perspective model is a good approximation if the face is not too close to the camera. In order to avoid distortion, we require the projection axes i and j to be orthonormal, i.e. i · j = 0. The projection of a 3D point x = (x, y, z)ᵀ is

\[
\mathbf{u} = \mathbf{P}\,\mathbf{x} = \begin{pmatrix} i_x & i_y & i_z \\ j_x & j_y & j_z \end{pmatrix} \mathbf{x} + \begin{pmatrix} o_x \\ o_y \end{pmatrix}. \tag{1.15}
\]

A 3DMM image instance is then generated by first computing a 3D shape using equation 1.12. The shape vertices are then projected to the image plane. The appearance is generated using equation 1.13 and finally warped onto the 2D mesh defined by the projected shape.
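A small sketch of the weak perspective projection of equation 1.15 (the function name is hypothetical):

```python
import numpy as np

def weak_perspective(points3d, i_axis, j_axis, offset):
    """Project 3D points with the weak perspective model (eq. 1.15).

    points3d: (n, 3) array of 3D vertex positions.
    i_axis, j_axis: (3,) projection axes with i . j = 0.
    offset: (2,) origin offset (ox, oy).
    Returns the (n, 2) image-plane coordinates.
    """
    P = np.stack([i_axis, j_axis])      # the 2x3 projection matrix
    return points3d @ P.T + offset
```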

1.4.3 Comparison of Representational Power

Xiao et al. [4] have shown that 2D models can represent any visual phenomenon that 3D models can. The set of 2D images that can be produced from the 3D model is
\[
\begin{pmatrix} i_x & i_y & i_z \\ j_x & j_y & j_z \end{pmatrix} \cdot \left( \mathbf{s}_0 + \sum_{i=1}^{m} p_i\, \mathbf{s}_i \right), \tag{1.16}
\]

where (ix, iy, iz), (jx, jy, jz) and all pi vary over their allowed range of values. In order to prove that this variation can also be expressed by a 2D model, the


projection matrix can be decomposed into a sum of 6 matrices:
\[
\begin{pmatrix} i_x & i_y & i_z \\ j_x & j_y & j_z \end{pmatrix}
= i_x \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}
+ i_y \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}
+ \dots
+ j_z \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \tag{1.17}
\]

Using this decomposition, equation 1.16 can be expressed as a linear combination of
\[
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \cdot \mathbf{s}_i, \quad
\begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix} \cdot \mathbf{s}_i, \quad \dots \tag{1.18}
\]
for i = 0, 1, . . . , m and all 6 matrices of the decomposition in equation 1.17. This gives a total of 6 × (m + 1) combinations. Using these we can define a 2D model that can describe the same image variation as the 3D model, for example by choosing the shape vectors

\[
\mathbf{s}_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \cdot \mathbf{s}_0, \quad
\mathbf{s}_2 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \cdot \mathbf{s}_1, \quad \dots \tag{1.19}
\]

This proof by Xiao et al. [4] shows very nicely that 2D models can indeed represent any visual phenomenon that 3D models can, although they can also represent shapes that cannot be expressed by the 3D model, so additional constraints must be imposed. In any case, this proof is only of limited significance in the context of traditional AAM approaches, where images are manually labeled and the 3D shape vectors si are never available to build such a fully expressive model. We will therefore use the 3D model to improve the traditional approach to AAMs, where the model is not explicitly constructed from 3D information but rather learned from 2D samples.

1.5 Model-Based Vision

Machines can analyze images on different levels. Low-level techniques extract lines and edges by analyzing the spatial variance of the pixel intensities, usually by computing image gradients. Sophisticated algorithms like the Canny edge detector [5] have been developed that can find salient features in images, as depicted in figure 1.2. Low-level approaches are limited in the sense that these algorithms do not have any knowledge about the object to be recognized.

Model-based approaches introduce such knowledge. They are usually tailored to a specific object class, e.g. the class of human faces. A model for faces can be anything from very simple rigid models, like a single image template, to more complex, deformable models.

Figure 1.2: Canny edge detection algorithm. (a) Colour photograph of a scene; (b) the same image filtered with a Canny edge detector.

Optimization algorithms usually make use of lower-level image analysis methods, most notably image gradients and edge detectors, in order to find a solution. For the example of a simple model consisting only of a single image template mentioned above, the parameters might describe the translation, rotation and scaling needed to fit this template to an image. For more sophisticated models, the parameters also describe texture variation as well as non-rigid change to the shape's geometry.

Knowledge about the object in question can also be introduced on the parameters of such a model. For example, statistical knowledge can be used to constrain the parameters to a certain range. This is called prior knowledge.

1.5.1 A Simple Example for Model Fitting

A model can be written as a function F(x, p), F : ℝⁿ → ℝᵐ, where x is a vector of observations and p is the vector of parameters that govern the model. A simple example of a model for the class of quadratic functions is

\[
f(x, \mathbf{p}) = p_1 + p_2 x + p_3 x^2. \tag{1.20}
\]

Fitting this parabola to a set of observed data (xi, yi), i = 1, . . . , m, requires finding a set of parameters p that minimize some error measure, usually the squared distance, between the model and the observed data:
\[
F := \sum_{i=1}^{m} [f(x_i, \mathbf{p}) - y_i]^2 = \sum_{i=1}^{m} \left[ (p_1 + p_2 x_i + p_3 x_i^2) - y_i \right]^2. \tag{1.21}
\]


The partial derivatives are
\[
\begin{aligned}
\frac{\partial F}{\partial p_1} &= 2 \sum_{i=1}^{m} \left[ (p_1 + p_2 x_i + p_3 x_i^2) - y_i \right] = 0\\
\frac{\partial F}{\partial p_2} &= 2 \sum_{i=1}^{m} \left[ (p_1 + p_2 x_i + p_3 x_i^2) - y_i \right] x_i = 0\\
\frac{\partial F}{\partial p_3} &= 2 \sum_{i=1}^{m} \left[ (p_1 + p_2 x_i + p_3 x_i^2) - y_i \right] x_i^2 = 0
\end{aligned} \tag{1.22}
\]

The solution with respect to p is
\[
\begin{pmatrix} p_1 \\ p_2 \\ p_3 \end{pmatrix}
=
\begin{pmatrix}
m & \sum_{i=1}^{m} x_i & \sum_{i=1}^{m} x_i^2 \\
\sum_{i=1}^{m} x_i & \sum_{i=1}^{m} x_i^2 & \sum_{i=1}^{m} x_i^3 \\
\sum_{i=1}^{m} x_i^2 & \sum_{i=1}^{m} x_i^3 & \sum_{i=1}^{m} x_i^4
\end{pmatrix}^{-1}
\begin{pmatrix}
\sum_{i=1}^{m} y_i \\
\sum_{i=1}^{m} x_i y_i \\
\sum_{i=1}^{m} x_i^2 y_i
\end{pmatrix}. \tag{1.23}
\]

Figure 1.3 depicts a least-squares fit of a quadratic function to a set of data points. In this example the solution is found in a single step. Usually, the functions to be optimized are more complex than just quadratic functions and thus cannot be solved directly. Often a reasonable value for the parameters is given and an algorithm, e.g. a gradient descent method, is used to iteratively improve the model parameters until convergence.

Figure 1.3: Least-squares fit of a quadratic function (green) on a set of data points (blue).
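For illustration, a minimal numpy sketch of the closed-form solution of equation 1.23 on synthetic data (variable names are hypothetical):

```python
import numpy as np

# Synthetic observations of a noisy parabola y = 1 + 2x + 3x^2.
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 50)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0.0, 0.3, x.size)

# Normal equations of equation 1.23: V^T V is exactly the 3x3 matrix
# of power sums, and V^T y the right-hand side.
V = np.column_stack([np.ones_like(x), x, x**2])
p = np.linalg.solve(V.T @ V, V.T @ y)
print(p)                                # approx. [1, 2, 3]
```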


Chapter 2

Prior Work

In the early days of computer vision research, approaches concentrated on the recognition of rigid objects, i.e. objects that move under a two-dimensional similarity transform such as rotation, translation or scaling [6]. This is certainly an insufficient choice for face reconstruction, as the shape of a face possesses internal degrees of freedom that encode identity and expression. Accurate face recognition requires deformable models.

In contrast to face detection, where the output of an algorithm might simply be a binary label stating whether or not an image contains a face and its approximate position, face recognition reveals more from images. A perfect face recognition algorithm would return a comprehensive description of all properties of a face. Properties are typically geometry, texture, position and pose, but could be anything from gender to the number of moles in the face. In practice, these properties are formulated in a model.

There has been quite some research in model-based face recognition during the past years. In this chapter we discuss the most important approaches. Between the different methods, a rough distinction can be made into shape-free models and shape models.

2.0.2 Shape-Free Models

Shape-free models represent faces not by means of a geometry but only using a holistic model of pixel intensities. The most important work in this domain is called Eigenfaces, by Turk and Pentland [7]. It proposes an information theory approach of coding and decoding face images. The principal components of the distribution of faces are computed by applying principal component analysis (PCA) (see appendix A) on a set of training images. These training images are assumed to be captured under "controlled" conditions, meaning that only little variability in pose and illumination is allowed.


This is necessary because PCA conceptually associates pixel positions with features, which makes it crucial that the pose is approximately the same in all images.

Let I(x) be a gray-level image of N columns and M rows. This image can be written as a vector of length N × M. Denote this vector t. This vector describes a point in N × M dimensional space. The authors assume that images of faces are not randomly distributed in this huge space but rather form a relatively low-dimensional subspace called face space. The average of the complete set of training images t1, t2, . . . , tm is

\[
\mathbf{s}_0 = \frac{1}{m} \sum_{i=1}^{m} \mathbf{t}_i, \tag{2.1}
\]
where s0 is the vector of averages. PCA is then applied to the covariance matrix of the mean-free data,
\[
\mathbf{C} = \frac{1}{m} \sum_{i=1}^{m} (\mathbf{t}_i - \mathbf{s}_0)(\mathbf{t}_i - \mathbf{s}_0)^T \tag{2.2}
\]
\[
= \mathbf{A}\mathbf{A}^T, \tag{2.3}
\]
where \(\mathbf{A} = [(\mathbf{t}_1 - \mathbf{s}_0), (\mathbf{t}_2 - \mathbf{s}_0), \dots, (\mathbf{t}_m - \mathbf{s}_0)]\).

The output of the PCA is a matrix U of m orthonormal basis vectors ui of the same length as the original image vectors ti, and their associated eigenvalues λi. The eigenvalues define the "usefulness" of the eigenvectors because they describe the variance along a particular direction. Because the eigenvectors themselves are images of the same dimension as the original images, and because of their face-like appearance, the authors have named them Eigenfaces. Figure 2.1 shows examples of Eigenfaces built from a database of ten faces.

Face identification can then be done by projecting a novel image t into the face space by the operation

\[
p_i = \mathbf{u}_i^T (\mathbf{t} - \mathbf{s}_0), \quad i = 1, \dots, m. \tag{2.4}
\]

The set of parameters p = [p1, . . . , pm]ᵀ forms the parameter vector for this image t. The image can be expressed using only this parameter vector and the Eigenfaces:

\[
\mathbf{t} \approx \sum_{i=1}^{m} p_i\, \mathbf{u}_i, \tag{2.5}
\]

Figure 2.1: Examples of a data set (left) and of the Eigenfaces (right) of a model built from 10 images. (a) Two example images from the data set; (b) the first 5 Eigenfaces corresponding to the largest eigenvalues (in ascending order).

where "≈" can only be replaced with "=" if t is in face space; otherwise a reconstruction error occurs. By removing the columns ui in U with the smallest corresponding eigenvalues λi, an optimal dimensionality reduction is achieved. The loss of variability corresponds exactly to the sum of the dropped eigenvalues. Often only between 5 and 7 principal components are used for recognition, so the complexity of the model and the computational cost can be reduced.

It is also possible to find (not just encode) faces in images by extracting a smaller image patch from the image at every pixel location and projecting this patch into the Eigenface model. The assumption is that face-like images keep their appearance when reconstructed using 2.5, while for non-face-like images the residual t − t̂ will be large. Although the method can find the position of faces in images and do identification on them at the same time, it is not easily possible to extract information like pose from the face. Extensions like a pre-registration of all faces, by warping them onto a reference face (defined by a set of control points) before computing the model, do not improve the system. For face identification this tends to make the situation worse, because this shape normalization removes specific features in faces.
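A condensed sketch of the projection and reconstruction steps (equations 2.4 and 2.5), assuming the training images are given as flattened row vectors; the mean is added back in the reconstruction, and all names are hypothetical:

```python
import numpy as np

def build_eigenfaces(T, k):
    """T: (m, N*M) matrix of flattened training images.
    Returns the mean image s0 and the k leading Eigenfaces."""
    s0 = T.mean(axis=0)
    _, _, Vt = np.linalg.svd(T - s0, full_matrices=False)
    return s0, Vt[:k]                   # rows of Vt are the u_i

def reconstruction_error(t, s0, U):
    """Project a flattened image into face space (eq. 2.4), reconstruct
    it (eq. 2.5, plus the mean) and return |t - t_hat| as a measure of
    how face-like the patch is."""
    p = U @ (t - s0)                    # eq. 2.4
    t_hat = s0 + p @ U                  # eq. 2.5 (mean added back)
    return np.linalg.norm(t - t_hat)
```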

2.0.3 Shape Models

"Hand-Crafted" Models

Various researchers have built models based on geometric shapes as sub-components that together define the object of interest. Yuille et al. [8] proposed a method for eye fitting where the eye is described by a parametrized template. Basically, the model consists of a circle for the pupil and two parabolic curves for the contour.


The optimization method is very specific to eyes; the number of dark pixels in the circle and the number of bright pixels between circle and parabolic curves are both maximized. A hierarchical optimization scheme can be used, meaning that first the whole constellation is fitted to an image and then the individual parts can be optimized individually. Lipson et al. have applied a similar method to CT images [9]. Such models can perform well on objects to which they are tailored, but they completely lack generality.

Articulated Models

If the object of interest can be decomposed into a number of rigid sub-parts and if the occlusion between different parts is small, articulated models can be a reasonable choice. These sub-parts are connected by sliding or rotating joints. A typical example is a model of the human body with corpus, head, arms and legs being the sub-parts. Individual components can be recognized individually, and a global consistency check then ensures that the relative configuration of the detected parts is allowed. Beinglass and Wolfson [10] have developed such a system using a generalized Hough transform.

In the context of faces, articulated objects have not proven to be successful. One of the reasons is that faces cannot be divided into rigid sub-parts very well, as they vary non-rigidly under pose change.

Active Contour Models

Active Contour Models, or Snakes, are flexible models that can be fitted to image features. Mathematically, an Active Contour Model is a parametrized energy-minimizing spline. The term spline comes from the flexible spline devices used by shipbuilders and draftsmen to draw smooth shapes. A spline is a special function defined piecewise by polynomials that approximates a function using a set of control points. Compared to polynomials of higher order, splines avoid heavy oscillation at the ends (known as Runge's phenomenon).

The approach uses an energy equation that maximizes smoothness of the spline (i.e. minimizes curvature) while finding the maximum image gradient along an image contour. The authors state: "Because of the way the contours slither while minimizing their energy, we call them snakes."

Representing the position of a snake parametrically by
\[
\mathbf{v}(s) = (x(s), y(s)), \tag{2.6}
\]


we can write its energy functional as
\[
E_{snake} = \int_0^1 E_{snake}(\mathbf{v}(s))\, ds = \int_0^1 E_{internal}(\mathbf{v}(s)) + E_{image}(\mathbf{v}(s)) + E_{constraint}(\mathbf{v}(s))\, ds. \tag{2.7}
\]

E_internal is a weighted squared sum of first- and second-order derivatives,
\[
E_{internal} = \alpha |\mathbf{v}'|^2 + \beta |\mathbf{v}''|^2, \tag{2.8}
\]

that controls stretching (weight α) and curvature (weight β) of the snake. The second term, E_image, is a weighted sum that attracts the snake to salient features:
\[
E_{image} = w_{line} E_{line} + w_{edge} E_{edge} + w_{term} E_{term}, \tag{2.9}
\]
where E_line = I(x, y) attracts the snake to either bright or dark lines (depending on the sign of w_line), and E_edge = −|∇I(x, y)|² attracts the snake to contours with large image gradients. E_term attracts the snake to line terminations.
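A discretized sketch of these terms (equations 2.8 and 2.9), with the derivatives approximated by finite differences along the snake points; only the edge part of the image term is included, and the interface is hypothetical:

```python
import numpy as np

def snake_energy(v, grad_mag_sq, alpha=0.1, beta=0.1, w_edge=1.0):
    """Discretized snake energy for contour points v of shape (n, 2).

    grad_mag_sq: callable mapping (k, 2) points to |grad I|^2 values.
    Internal term (eq. 2.8): alpha*|v'|^2 + beta*|v''|^2, with the
    derivatives approximated by finite differences.
    """
    d1 = np.diff(v, n=1, axis=0)        # v'
    d2 = np.diff(v, n=2, axis=0)        # v''
    e_internal = alpha * (d1**2).sum() + beta * (d2**2).sum()
    e_image = -w_edge * grad_mag_sq(v).sum()   # E_edge = -|grad I|^2
    return e_internal + e_image
```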

Snakes have a number of drawbacks. Most importantly, no knowledge about the object class is built into the model, which has the consequence that snakes can degenerate to arbitrary shapes. Hinton et al. [11] have extended snakes to have preferred "home" locations to give the snake a certain default shape. A related problem is the trade-off between model specificity and variability. By defining the weights in equations 2.8 and 2.9, one implicitly defines some sort of prior knowledge. This, however, does not really incorporate real class knowledge.

Active Shape Models

Active Shape Models (ASMs) address the lack of knowledge about the shape in the previous approaches. The method consists of a set of points, called landmark points, that explicitly describe the object of interest. An Active Shape Model is built by systematically placing such landmarks on a set of training images. Afterwards the complete set is aligned using Procrustes analysis to minimize the variance in distance between equivalent points. Finally, principal component analysis (PCA) (see appendix A) is applied to the training shapes in order to obtain a Point Distribution Model (PDM).

Bookstein [12] describes such landmark points for biological and medical specimens. He describes them in terms of their usefulness. They can be reduced to three different types:


1. Points marking parts of the object with particular application-dependent significance, such as the center of an eye.

2. Points marking application-independent things, such as the highest point on an object in a particular orientation, or curvature extrema.

3. Other points which can be interpolated from points of type 1 and 2; for instance, points marked at equal distances around a boundary between two type 1 landmarks.

Landmarks of type 1 are easier to identify precisely, but those of type 2 and 3 are also necessary to define a flexible shape in sufficient detail. Points of type 3 could be replaced by interpolating splines in order to represent boundaries with a minimal set of landmarks; however, using enough points of type 3 is as effective and computationally more efficient.

After a user has manually labeled a set of m images, an iterative process is used to align the shapes. First, all shapes are rotated, translated and scaled to align with the first shape in the set. Then the following steps are repeated: the mean shape is calculated, and orientation, scale and origin of the current mean are normalized to suitable defaults. Every shape is then realigned with the current mean. This iteration is performed until convergence. The alignment between two shapes xi

and xj is optimized based on
\[
E_j = (\mathbf{x}_i - M(s_j, \theta_j)[\mathbf{x}_j] - \mathbf{t}_j)^T \mathbf{W} (\mathbf{x}_i - M(s_j, \theta_j)[\mathbf{x}_j] - \mathbf{t}_j), \tag{2.10}
\]

where
\[
M(s, \theta)\begin{bmatrix} x_{jk} \\ y_{jk} \end{bmatrix} = \begin{pmatrix} (s\cos\theta)\, x_{jk} - (s\sin\theta)\, y_{jk} \\ (s\sin\theta)\, x_{jk} + (s\cos\theta)\, y_{jk} \end{pmatrix} \tag{2.11}
\]
and
\[
\mathbf{t}_j = (t_{xj}, t_{yj}, \dots, t_{xj}, t_{yj})^T \tag{2.12}
\]

are the rotation (θ) and scaling (s) as well as the translation (tj) needed to map one shape to the other, and W is a diagonal matrix of weights. The weighting is motivated by the observation that some points move around more than others in the training set. By weighting points that tend to be more stationary in the set, matching such points in different shapes will be a priority. The normalization is required for the algorithm to converge, because without it there are 4(m − 1) constraints on 4m variables (θ, s, tx, ty for each of the m shapes) and the algorithm is ill-conditioned.
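A simplified sketch of one such alignment (an unweighted stand-in for minimizing equation 2.10, i.e. W = identity; helper names are hypothetical):

```python
import numpy as np

def similarity_transform(shape, s, theta, t):
    """Apply M(s, theta)[x] + t of equation 2.11 to a (n, 2) shape."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return s * shape @ R.T + t

def align(src, dst):
    """Least-squares similarity alignment of src onto dst (both (n, 2))."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    # Complex-number trick: treat each centered point as x + iy; the
    # optimal scale+rotation is one complex factor z = s*exp(i*theta).
    a = (src - mu_s) @ np.array([1.0, 1.0j])
    b = (dst - mu_d) @ np.array([1.0, 1.0j])
    z = (a.conj() @ b) / (a.conj() @ a)
    s, theta = np.abs(z), np.angle(z)
    t = mu_d - similarity_transform(mu_s[None, :], s, theta, 0.0)[0]
    return s, theta, t
```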

Similar to the Eigenface approach, the reasonable assumption is made that, when shapes are represented as 2n-dimensional vectors, the set of shapes forms an


ellipsoidal cloud in this space, called the Allowable Shape Domain. The center and major axes of this ellipsoid are modeled using PCA. The computations performed are very similar to what is done with Eigenfaces; however, the statistics are not captured from pixels but from landmark positions. The mean is calculated using

\[
\boldsymbol{\mu} = \frac{1}{m} \sum_{i=1}^{m} \mathbf{x}_i, \tag{2.13}
\]

and the principal axes using the covariance matrix
\[
\mathbf{C} = \frac{1}{m} \sum_{i=1}^{m} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T. \tag{2.14}
\]

The singular value decomposition yields
\[
\mathbf{C} = \mathbf{U}\mathbf{S}\mathbf{V}^T. \tag{2.16}
\]

Again, the eigenvectors corresponding to small eigenvalues can be removed from the basis U, yielding a compressed version of the model. Any point in the Allowable Shape Domain can be approximated by
\[
\mathbf{x} = \boldsymbol{\mu} + \sum_{i=1}^{m-t} p_i\, \mathbf{u}_i, \tag{2.17}
\]

where t is the number of eigenvectors that are left out and the pi are the coefficients of the shape. The coefficients are obtained by projecting a shape x into the space spanned by U:
\[
\begin{pmatrix} p_1 \\ \vdots \\ p_n \end{pmatrix} = \mathbf{U}^T (\mathbf{x} - \boldsymbol{\mu}). \tag{2.18}
\]

Since the eigenvalues λi correspond to the variance along the i-th dimension, the value of pi can be restricted to a certain size. A limit of ±3√λi covers about 99.7 percent of the population of a Gauss distribution; however, soft constraints, i.e. Tikhonov regularization, can lead to better results.

Fitting To fit an Active Shape Model to images, displacements for each model point are computed individually based on the gradient information perpendicular to the model boundary. This gives a vector of updates to the current


Figure 2.2: A rectangular Active Shape Model (black) with model point offsets yielding a shape that is not within the Allowable Shape Domain (red). This shape is projected into the span of the model to give a new shape (green).

positions of the model points,
\[
\Delta \mathbf{x} = (dx_1, dy_1, \dots, dx_n, dy_n), \tag{2.19}
\]

which, in general, does not yield a new shape (x + dx) that is within the Allowable Shape Domain (see figure 2.2). Therefore, x and x + dx are aligned using the same method as in the registration process, yielding a set of parameters (θ, s, t) that are used to remove the difference in rotation, scaling and translation between x and x + dx. The remaining difference between the two shapes can only be explained by local deformation of one of the shapes. These residual adjustments are finally projected into model space,
\[
\Delta \mathbf{p} = \mathbf{U}^T d\mathbf{x}, \tag{2.20}
\]

to give a new set of parameter values pᵢ₊₁ = pᵢ + W∆p, where W again is a diagonal matrix of weights for each mode.
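A condensed sketch of one such parameter update (equations 2.19, 2.20 and the ±3√λᵢ limits from above; the alignment step and the weight matrix are simplified away):

```python
import numpy as np

def asm_update(p, dx, U, eigvals, w=1.0):
    """One Active Shape Model parameter update.

    p:       current shape parameters (k,).
    dx:      residual landmark displacements (2n,), already freed of
             rotation, scaling and translation.
    U:       (2n, k) matrix of shape eigenvectors (columns).
    eigvals: (k,) eigenvalues, for the +-3*sqrt(lambda_i) limits.
    """
    dp = U.T @ dx                       # eq. 2.20
    p_new = p + w * dp                  # (scalar stand-in for W)
    limit = 3.0 * np.sqrt(eigvals)
    # Clamp to stay inside the Allowable Shape Domain.
    return np.clip(p_new, -limit, limit)
```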

This process is repeated until no significant change in the shape results.

Typical applications for Active Shape Models are the location of resistors on a circuit board or the detection of hands in images. Various researchers have applied ASMs to face recognition. Lanitis et al. [13] use an ASM to locate a face in an image and then warp the image to a normalized frame, in which a model of the intensities of the shape-free face is used to interpret the image. This texture model is used to interpret the face that was found using the ASM, but it is not used for fitting the ASM itself. Thus, a natural extension is to use a holistic model of pixel intensities in combination with an ASM during the fitting phase. This idea leads to Active Appearance Models.


Active Appearance Models

Active Appearance Models (AAMs) combine an Active Shape Model and a model of pixel intensities. The basic idea is to learn the image difference patterns corresponding to changes in each shape parameter in order to use them for updating the model estimate.

First, a Point Distribution Model is built, fully in line with Active Shape Models. This gives a linear model for the shape,

\[
\mathbf{s} = \mathbf{s}_0 + \sum_{i=1}^{m} p_i\, \mathbf{s}_i. \tag{2.21}
\]

Then a triangulation algorithm is applied to s, yielding a triangulated two-dimensional mesh. Any mesh s together with the mean (or base) mesh s0 thus defines a piecewise affine warp. If the number of triangles in the mesh is n, the piecewise warp consists of n affine warps.

Afterwards the image is sampled at the positions inside s and warped from s to s0 so that the landmarks coincide. Implementing this warp means iterating over all pixels in the shape s0 and finding out to which triangle each pixel belongs. It can then be warped with the affine warp of that triangle. Note that this process results in a deformation of the shapes and thus of the textures.

Once all example images have been warped into the reference frame s0, a PCA is applied to the pixel intensities, yielding a linear appearance model

\[
A(\mathbf{x}) = A_0 + \sum_{i=1}^{l} \lambda_i\, A_i \tag{2.22}
\]

that is defined in the region within the mean shape s0. Note the similarity to the Eigenface approach: the Ai correspond to Eigenfaces.

Optionally, a combined PCA can be computed on the concatenated vectors [p, λ]ᵀ to remove correlations between the shape and pixel variations. We will, however, stick to the case where texture and shape models are kept separate.

An instance of the model can then be generated for a given p and λ by first generating the shape-free image according to equation 2.22 and then warping this image to the shape defined by p according to equation 2.21. We denote the warp from the mean shape s0 to an instance of the model with parameters p as W(x, p), where x is the set of pixels within the base shape s0. The following equation summarizes this process:

M(W(x,p)) = A(x). (2.23)


Figure 2.3: A previously unseen image (left) and its reconstruction from the Active Appearance Model (right), reprinted from [2]

M is the image containing the model instance. For every pixel x, W(x, p) gives its destination in the model instance and A(x) defines the appearance value at this destination.

Following the abuse of terminology in [3], we sometimes write W(s0, p), where s0 refers to the set of pixels that lie in the base shape.

Approximating a Previously Unseen Face Given a new image labeled with the same landmarks that are used in the model, a reconstruction can be made using the AAM. The shape vector s containing the landmarks can be projected into the shape model using

\[
\mathbf{p} = \mathbf{U}^T (\mathbf{s} - \mathbf{s}_0). \tag{2.24}
\]

The input image is then sampled in the region within this shape and the result is projected into the texture model using

\[
\boldsymbol{\lambda} = \mathbf{A}^T (\mathbf{I} - \mathbf{A}_0), \tag{2.25}
\]

where A = (A1, . . . , Al) is the (possibly reduced) matrix of Eigenfaces. The obtained set of parameters (p, λ) defines a reconstruction of the input using the AAM. Using equation 2.23 the final image of the reconstruction is computed. Figure 2.3 gives an example.
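In code, this projection step is just two matrix products (a sketch with hypothetical names; the image is assumed to be already sampled and warped into the reference frame):

```python
import numpy as np

def approximate_face(s, s0, U, I_warped, A0, A):
    """Project a labeled image into the shape and texture models.

    s, s0:     input and mean landmark vectors (2n,).
    U:         (2n, m) shape eigenvector matrix.
    I_warped:  image sampled inside s, warped to s0 (flattened).
    A0, A:     mean appearance (npix,) and (npix, l) Eigenface matrix.
    Returns the parameter pair (p, lam) of equations 2.24 and 2.25.
    """
    p = U.T @ (s - s0)                  # eq. 2.24
    lam = A.T @ (I_warped - A0)         # eq. 2.25
    return p, lam
```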

Fitting If the set of landmarks in a new image is not given, the model has to find an appropriate fit. If a reasonable starting position is known, this can be performed using the following iterative algorithm.


Figure 2.4: Visualization of the rows of the regression matrix corresponding to the rotation and translation parameters (sx, sy, tx, ty), reprinted from [2]

The goal is to minimize the squared distance between the input image and the model reconstruction,
\[
E_I = \sum_{x,y} [I_{input}(x, y) - I_{model}(x, y)]^2 := \sum_{x,y} [I_{error}(x, y)]^2. \tag{2.26}
\]

This error image I_error depends on the model parameters. The key assumption that is made at this point is that the spatial pattern of I_error encodes information about how the parameters should be updated in order to obtain a better fit. The relation between the error image and the correct parameter updates is assumed to be constant and linear. By randomly perturbing the model parameters for images where the true parameters are known, a set of patterns for I_error is collected and a linear regression on the data is performed:
\[
\Delta \mathbf{s} = \mathbf{R}_s I_{error} \tag{2.27}
\]
\[
\Delta \mathbf{t} = \mathbf{R}_t I_{error} \tag{2.28}
\]
where ∆s and ∆t are the perturbations in the parameters, the I_error are the error images computed in the shape-free reference frame, and Rs, Rt are linear regression matrices.

Four extra parameters are added to the regression, representing a similarity transform using (sx, sy, tx, ty), where sx = s cos(θ) and sy = s sin(θ); s is a scaling factor and tx, ty are translations. Figure 2.4 shows the rows of the regression matrix corresponding to these four parameters (the values are shifted so that dark areas represent negative values, bright areas positive). They represent the weights given to different areas of the error image when updating the model parameters during the iteration. Note the similarity between the last two images and image gradients in x- and y-direction. We will later see how we can compute the images in figure 2.4 analytically from the model itself instead of learning them from perturbations of model parameters.


Given an initial set of model parameters, in particular for the similarity transform, the algorithm itself then consists of the following loop:

• Evaluate the error image I_error = I_input − I_model
• Evaluate the error E = |I_error|²
• Compute the parameter updates ∆s = R_s I_error and ∆t = R_t I_error
• Sample the image at s′ = s + k∆s and t′ = t + k∆t and calculate a new error E′
• If E′ < E, accept the current guess
• Otherwise, try again for smaller values of k

The procedure is repeated until no improvement E′ < E can be achieved.
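A sketch of this loop (the `model` object that renders the instance and computes the shape-free error image is hypothetical, as are the step sizes k):

```python
import numpy as np

def regression_fit(I_input, model, Rs, Rt, s, t, max_iter=50):
    """Ad-hoc AAM fitting loop of Cootes et al. (sketch)."""
    for _ in range(max_iter):
        I_error = model.error_image(I_input, s, t)
        E = np.sum(I_error**2)
        ds, dt = Rs @ I_error, Rt @ I_error   # predicted updates
        for k in (1.0, 0.5, 0.25, 0.125):     # shrink step if needed
            I_try = model.error_image(I_input, s + k * ds, t + k * dt)
            if np.sum(I_try**2) < E:          # accept current guess
                s, t = s + k * ds, t + k * dt
                break
        else:
            break                             # no improvement: stop
    return s, t
```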

A variant of this optimization first matches the model to a lowpass-filtered version of the image. This can increase the convergence range and give an initialization for images of higher frequency.

The problem with the fitting approach by Cootes et al. is that the relationship between the error image and the correct parameter updates is, although approximately linear in a close neighborhood of the optimum, not constant. The result of such a wrong assumption is a so-called ad-hoc fitting algorithm that can be outperformed by a method that takes this fact into account, as we will see in the next chapter.


Chapter 3

Inverse Compositional Image Alignment

The Inverse Compositional Image Alignment (ICIA) algorithm by Matthews and Baker [14, 3] is a fitting algorithm for Active Appearance Models that addresses the incorrect assumption, made in the approach of Cootes et al., of a constant linear relationship between the image error and the model parameter updates. In the context of this thesis, ICIA is prior work, but because it plays a central role, a separate chapter is dedicated to it.

In figure 3.1, a simple counterexample (reprinted from [3]) is given that proves the invalidity of the assumption of a constant relationship between I_error and the model parameter updates. In this figure, a simple AAM and two input images are shown. Both images yield the same error image when warped to the reference frame. The algorithm of Cootes et al. in this case applies the same parameter updates for both inputs. However, the correct parameter updates that should be applied to the models are actually not equal.

A natural way to improve the fitting algorithm could be to use a standard gradient descent algorithm like Blanz and Vetter use for the 3D Morphable Model [1]. In gradient descent methods, however, the equivalents of Rs and Rt are the Jacobian and the Hessian, which depend on the current model parameters p and λ; because both need to be recomputed in each iteration, such algorithms cannot be fast by definition.

ICIA puts Active Appearance Model fitting into the framework of Lucas-Kanade image alignment [15], an analytical image alignment method, and extends it to become as efficient as the ad-hoc fitting algorithm by Cootes et al.


Figure 3.1: Two input images for a simple AAM that yield the same error image but require different parameter updates. (Reprinted from [3])

In order to get more insight into the ICIA algorithm, it is necessary to first discuss the Lucas-Kanade and the Forwards Compositional Image Alignment algorithms before we move on to see how ICIA extends current gradient descent approaches with the advantages of efficient ad-hoc fitting algorithms.

3.1 The Lucas-Kanade Algorithm

This algorithm was introduced in 1981 by Lucas and Kanade [15]. It is still a standard method for image alignment and optical flow calculation.

3.1.1 Problem Formulation

Given two functions F(x) and G(x), F, G : ℝ² → ℝ, that give the image intensity at position x in two images, we want to find the disparity vector h that minimizes the difference between F(x + h) and G(x). For our purpose, the difference measure is the L2 norm
\[
\sqrt{\sum_{\mathbf{x}} [G(\mathbf{x}) - F(\mathbf{x} + \mathbf{h})]^2}. \tag{3.1}
\]

This approach, like all other image alignment algorithms discussed here, requires a reasonable initial guess for h. The performance heavily depends on this choice.


Figure 3.2: In the one-dimensional case, the Lucas-Kanade algorithm finds the horizontal disparity h between two curves F(x) and G(x).

3.1.2 One-Dimensional Case

If x is one-dimensional, we want to find the (scalar) horizontal disparity h between F(x) and G(x) = F(x + h). Figure 3.2 gives an overview of the problem. We can write the Taylor expansion of F(x + h) around x,
\[
F(x + h) \approx F(x) + F'(x) \cdot h. \tag{3.2}
\]

Isolating h gives
\[
h \approx \frac{G(x) - F(x)}{F'(x)}. \tag{3.3}
\]

Two things are important here. First, this linear approximation is only adequate if h is small enough. Second, equation 3.3 depends on x. It is therefore necessary to combine all estimates of h. This is done using the weighted average

\[
h = \sum_{x} w(x)\, \frac{G(x) - F(x)}{F'(x)} \bigg/ \sum_{x} w(x). \tag{3.4}
\]

In this formula, [G(x) − F(x)]/F′(x) is the estimate of h at position x and w(x) a weighting factor. Because this estimate is good where F(x) is nearly linear, and conversely worse where F(x) is non-linear, it is reasonable to use the inverse of the curvature F″(x) as weighting factor:

\[
w(x) = \frac{1}{F''(x)}. \tag{3.5}
\]

The denominator \(\sum_x w(x)\) in equation 3.4 is a normalization factor.


This estimate allows for an iterative refinement of h by alternately computing a new h using equation 3.4 and moving F(x) to F(x + h). The recurrence relation is
\[
h_0 = 0, \tag{3.6}
\]
\[
h_{n+1} = h_n + \sum_{x} w(x)\, \frac{G(x) - F(x + h_n)}{F'(x + h_n)} \bigg/ \sum_{x} w(x). \tag{3.7}
\]

An alternative derivation of the method avoids the division by zero in 3.3 if the curve is level. We can use equation 3.2 and search for the h that minimizes the L2 difference
\[
E = \sum_{x} [G(x) - F(x + h)]^2 = \sum_{x} [G(x) - F(x) - F'(x) \cdot h]^2. \tag{3.8}
\]

The goal is to find \(\arg\min_h \sum_x [G(x) - F(x + h)]^2\), which can be done through the derivative
\[
0 = \frac{d}{dh} E \tag{3.9}
\]
\[
= \frac{d}{dh} \sum_{x} [G(x) - F(x) - F'(x)\, h]^2 \tag{3.10}
\]
\[
= \sum_{x} 2 F'(x)\, [G(x) - F(x) - F'(x)\, h]. \tag{3.11}
\]

Solving for h gives
\[
h \approx \frac{\sum_x F'(x)\, [G(x) - F(x)]}{\sum_x F'(x)^2}. \tag{3.12}
\]

This new approximation of h replaces equation 3.3. Now the algorithm generalizes to two dimensions, as we will see shortly. Note that in this linear approximation of h, all estimates are already combined, because the error term in 3.8 is a sum over all values of x, although we still have to weight the result with w(x). This gives the final recurrence relation

\[
h_0 = 0, \tag{3.13}
\]
\[
h_{n+1} = h_n + \frac{\sum_x w(x)\, F'(x + h_n)\, [G(x) - F(x + h_n)]}{\sum_x w(x)\, F'(x + h_n)^2}, \tag{3.14}
\]


where w(x) is the same weighting function as in equation 3.5.
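A minimal numpy sketch of this one-dimensional iteration (equations 3.12-3.14, with uniform weights w(x) = 1 for simplicity; F and G are assumed callable on shifted positions):

```python
import numpy as np

def lucas_kanade_1d(F, G, x, h0=0.0, n_iter=20, eps=1e-3):
    """Estimate h with G(x) ~ F(x + h), iterating equation 3.14
    with uniform weights w(x) = 1."""
    h = h0
    for _ in range(n_iter):
        Fs = F(x + h)
        dF = (F(x + h + eps) - F(x + h - eps)) / (2 * eps)  # F'(x+h)
        h += np.sum(dF * (G(x) - Fs)) / np.sum(dF * dF)
    return h

# Example from [15]: two shifted sinusoids.
x = np.linspace(0.0, 2 * np.pi, 200)
print(lucas_kanade_1d(np.sin, lambda u: np.sin(u + 0.8), x))  # ~0.8
```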

Like with the fitting algorithm proposed by Cootes et al., the authors suggest a coarse-to-fine strategy. Using a lowpass-filtered image, an approximate fit can be obtained, which in turn can be improved by iteratively incorporating higher frequencies. A very intuitive reason why this improves the convergence range is given in [15] for F(x) = sin(x) and G(x) = F(x + h) = sin(x + h), where it turns out that convergence is given for misregistrations as large as one-half wavelength.

3.1.3 Generalization to Higher Dimensions

We will now generalize the Lucas-Kanade algorithm to higher dimensions, that is, to vectors x, h ∈ Rn. The error function remains unchanged,

E = ∑x [G(x) − F(x + h)]²,

The generalized Taylor approximation is

F(x + h) ≈ F(x) + (∂F/∂x)(x) h, (3.15)

and the derivative with respect to h becomes

0 = ∂E/∂h (3.16)
≈ ∂/∂h ∑x [G(x) − F(x) − (∂F/∂x)(x) h]² (3.17)
= −2 ∑x (∂F/∂x)ᵀ [G(x) − F(x) − (∂F/∂x) h]. (3.18)

This gives the higher-dimensional equivalent of equation 3.12,

h ≈ [ ∑x (∂F/∂x)ᵀ(∂F/∂x) ]⁻¹ [ ∑x (∂F/∂x)ᵀ[G(x) − F(x)] ] (3.19)
= H⁻¹ ∑x [∇F(x)]ᵀ[G(x) − F(x)]. (3.20)


H is the Gauss-Newton approximation to the Hessian matrix (see appendix B). The fact that we minimize a least squares expression and that the Hessian matrix is replaced by this approximation makes the Lucas-Kanade algorithm a member of the class of Gauss-Newton methods.
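For concreteness, here is a minimal sketch of the Gauss-Newton update in equations 3.19/3.20 for the special case of a pure 2D translation h. It assumes images on a common pixel grid and uses scipy for interpolation; the function name and structure are illustrative, not the thesis code:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def lk_translation(F, G, n_iters=100, tol=1e-4):
    """Estimate the translation h with G(x) ~ F(x + h)."""
    h = np.zeros(2)                      # (row, col) offset
    rows, cols = np.indices(F.shape)
    for _ in range(n_iters):
        # Sample F at the shifted positions x + h.
        Fh = map_coordinates(F, [rows + h[0], cols + h[1]], order=1)
        # Image gradient of the shifted template, one image per component.
        gy, gx = np.gradient(Fh)
        J = np.stack([gy.ravel(), gx.ravel()], axis=1)   # N x 2
        err = (G - Fh).ravel()
        H = J.T @ J                      # Gauss-Newton Hessian, eq. 3.19
        dh = np.linalg.solve(H, J.T @ err)
        h += dh
        if np.linalg.norm(dh) < tol:
            break
    return h
```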

Until now, we have limited the discussion of the disparity between the images to translations only. A good image registration technique should be able to cope with arbitrary translation, rotation, scaling and shearing. This can be achieved by minimizing

E = ∑x [G(x) − F(Ax + h)]². (3.21)

A is an affine transform matrix. The linear approximation in equation 3.15 becomes

F((A + ∆A)x + (h + ∆h)) ≈ F(Ax + h) + (∆Ax + ∆h) (∂F/∂x). (3.22)

Minimizing this expression instead of 3.8 can be done analogously to the steps described above.

Application to AAMs

Applied to Active Appearance Models, the Lucas-Kanade energy term (equation 3.8) takes the form

∑x [A0(x) − I(W(x; p + ∆p))]². (3.23)

To apply the Taylor expansion we need to know the derivative of I(W(x; p)) with respect to the model parameters p. Applying the chain rule gives

∂I(W(x; p))/∂p = (∂I/∂x)(W(x; p)) · (∂W/∂p) = ∇I (∂W/∂p), (3.24)

where ∇I is the gradient of the input image evaluated at W(x; p) and ∂W/∂p is the Jacobian of the warp evaluated at (x; p). The Taylor expansion is then

∑x [A0(x) − I(W(x; p)) − ∇I (∂W/∂p) ∆p]². (3.25)


The counterpart of equation 3.19 is then

∆p = H⁻¹ ∑x [∇I (∂W/∂p)]ᵀ [A0(x) − I(W(x; p))]. (3.26)

Matthews and Baker refer to the Lucas-Kanade method as the forwards-additive algorithm because in each iteration, ∆p is added to the current estimate of p. The forwards part denotes the direction of the warp W from the reference frame into the image coordinate frame.

The Lucas-Kanade algorithm is slow because the gradient ∇I and the Jacobian ∂W/∂p depend on p, so they have to be recomputed in each iteration together with the (approximation to the) Hessian matrix.

3.2 Compositional Image Alignment

Instead of an increment ∆p to be added to the current p, compositional algorithms compute an incremental warp W(x; ∆p) to be composed with the current warp W(x; p). Compositional algorithms can be further distinguished into the Forwards Compositional and the Inverse Compositional algorithm.

3.2.1 Forwards Compositional Image Alignment

The Forwards Compositional algorithm computes an incremental warp W(x; ∆p) to be composed with the current warp W(x; p). The expression to be minimized is then

∑x [A0(x) − I(W(W(x; ∆p); p))]². (3.27)

A solution for ∆p has to be found in order to compute the incremental warp W(x; ∆p). To update the current warp with this incremental warp, the two have to be composed, resulting in the recurrence relation

W(x; p)←W(x; p) ◦W(x; ∆p). (3.28)

The incremental warp W(x; ∆p) is said to be in the “image” direction because W(x; p) projects from the reference frame into the image frame. The Taylor expansion becomes

∑x [A0(x) − I(W(W(x; 0); p)) − ∇I(W(x; p)) (∂W/∂p) ∆p]², (3.29)


where p = 0 is defined to be the identity warp, W(x; 0) = x. Compared to equation 3.25 there are two important differences. First, the gradient is computed on I(W(x; p)) instead of I(x). Second, and most importantly, the Jacobian is constant because it is evaluated at W(x; 0). This results in a performance gain, since the Jacobian can now be precomputed. This is the key point of the Forwards Compositional algorithm.

3.2.2 Inverse Compositional Image Alignment

Changing the roles of the template and the input image in equation 3.27 yields another improvement. Doing so results in the following expression to be minimized:

∑x [I(W(x; p)) − A0(W(x; ∆p))]², (3.30)

and the following update rule

W(x; p)←W(x; p) ◦W(x; ∆p)−1. (3.31)

Instead of computing the incremental warp with respect to I(W(x; p)), it is computed with respect to the template A0(x). The incremental warp now points in the opposite direction: from the current position in the input image I(W(x; p)), it points back into the reference frame of A0(x).

Consequently, the roles of I and A0 are also switched in the Taylor expansion,

∑x [I(W(x; p)) − A0(W(x; 0)) − ∇A0 (∂W/∂p) ∆p]², (3.32)

and in the solution for ∆p (the counterpart of equation 3.26),

∆p = H⁻¹ ∑x [∇A0 (∂W/∂p)]ᵀ [I(W(x; p)) − A0(x)]. (3.33)

Hence the Hessian is now constant too, since none of its components depends on p:

H = ∑x [∇A0 (∂W/∂p)]ᵀ [∇A0 (∂W/∂p)]. (3.34)

The result is that significant parts of equation 3.33 do not vary during the iteration and can thus be precomputed.

Additionally, the authors use a method called Project Out [16] in order to make the optimization independent of appearance variation. This reduces the number of parameters to optimize and makes the algorithm faster and more robust.


3.2.3 Applying Inverse Compositional Image Alignment to AAMs

We will now discuss how the different components of equation 3.33 can be computed in order to apply the Inverse Compositional Image Alignment algorithm to AAMs.

Piecewise Affine Warping I(W(x; p)) is computed by backwards warping the input image with W(x; p). For every pixel in the base mesh s0 the corresponding pixel in the input image is computed and stored in a new image defined over the base mesh s0; the input image I is said to be sampled at the positions W(x; p). Every pixel in the base mesh s0 lies in a triangle. To perform the above sampling, each triangle has to be warped from its location in the base mesh to its location in the mesh defined by the current shape parameters p. The triangulated base mesh s0 thus defines a set of piecewise affine warps, one for each triangle. Choose an arbitrary triangle from the base mesh and denote its vertices by

(x⁰ᵢ, y⁰ᵢ)ᵀ, (x⁰ⱼ, y⁰ⱼ)ᵀ, (x⁰ₖ, y⁰ₖ)ᵀ. (3.35)

Given a shape parameter vector p = (p1, . . . , pn), the vertices of the corresponding shape can be computed using equation 2.17:

s = s0 + ∑ⁿᵢ₌₁ pᵢsᵢ. (3.36)

Choose the same triangle in s and denote its vertices by

(xᵢ, yᵢ)ᵀ, (xⱼ, yⱼ)ᵀ, (xₖ, yₖ)ᵀ. (3.37)

These two triangles define an affine warp from s0 to s. The method to compute this warp uses barycentric coordinates. Any pixel x in the triangle in s0 can be written in barycentric coordinates as

x = (x, y)ᵀ = (x⁰ᵢ, y⁰ᵢ)ᵀ + α[(x⁰ⱼ, y⁰ⱼ)ᵀ − (x⁰ᵢ, y⁰ᵢ)ᵀ] + β[(x⁰ₖ, y⁰ₖ)ᵀ − (x⁰ᵢ, y⁰ᵢ)ᵀ] (3.38)

with

α = [ (x − x⁰ᵢ)(y⁰ₖ − y⁰ᵢ) − (y − y⁰ᵢ)(x⁰ₖ − x⁰ᵢ) ] / [ (x⁰ⱼ − x⁰ᵢ)(y⁰ₖ − y⁰ᵢ) − (y⁰ⱼ − y⁰ᵢ)(x⁰ₖ − x⁰ᵢ) ],

β = [ (y − y⁰ᵢ)(x⁰ⱼ − x⁰ᵢ) − (x − x⁰ᵢ)(y⁰ⱼ − y⁰ᵢ) ] / [ (x⁰ⱼ − x⁰ᵢ)(y⁰ₖ − y⁰ᵢ) − (y⁰ⱼ − y⁰ᵢ)(x⁰ₖ − x⁰ᵢ) ]. (3.39)


The warp for the pixels in the triangle (x⁰ᵢ, y⁰ᵢ)ᵀ, (x⁰ⱼ, y⁰ⱼ)ᵀ, (x⁰ₖ, y⁰ₖ)ᵀ of the base mesh s0 is then

W(x; p) = (xᵢ, yᵢ)ᵀ + α[(xⱼ, yⱼ)ᵀ − (xᵢ, yᵢ)ᵀ] + β[(xₖ, yₖ)ᵀ − (xᵢ, yᵢ)ᵀ], (3.40)

where (xᵢ, yᵢ)ᵀ, (xⱼ, yⱼ)ᵀ, (xₖ, yₖ)ᵀ are the vertices of the corresponding triangle in s.
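As a concrete illustration, the following minimal sketch (illustrative names, not the thesis code) computes the barycentric coordinates of equation 3.39 and applies the warp of equation 3.40 to a single pixel:

```python
import numpy as np

def barycentric(x, v_i, v_j, v_k):
    """alpha, beta of pixel x in triangle (v_i, v_j, v_k), eq. 3.39."""
    d = ((v_j[0] - v_i[0]) * (v_k[1] - v_i[1])
         - (v_j[1] - v_i[1]) * (v_k[0] - v_i[0]))
    alpha = ((x[0] - v_i[0]) * (v_k[1] - v_i[1])
             - (x[1] - v_i[1]) * (v_k[0] - v_i[0])) / d
    beta = ((x[1] - v_i[1]) * (v_j[0] - v_i[0])
            - (x[0] - v_i[0]) * (v_j[1] - v_i[1])) / d
    return alpha, beta

def warp_pixel(x, tri0, tri):
    """W(x; p) for one pixel, eq. 3.40. tri0/tri are 3x2 vertex arrays."""
    alpha, beta = barycentric(x, *tri0)
    return tri[0] + alpha * (tri[1] - tri[0]) + beta * (tri[2] - tri[0])

# Sanity check: a triangle vertex is mapped to its counterpart.
tri0 = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
tri = np.array([[1.0, 2.0], [12.0, 1.0], [2.0, 11.0]])
print(warp_pixel(np.array([10.0, 0.0]), tri0, tri))  # -> [12. 1.]
```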

Warp Jacobian The warp Jacobian can be computed by applying the chain rule to W(x; p):

∂W/∂p = ∑ᵛᵢ₌₁ [ (∂W/∂xᵢ)(∂xᵢ/∂p) + (∂W/∂yᵢ)(∂yᵢ/∂p) ], (3.41)

since the shape parameters define the positions of the mesh vertices xᵢ = (xᵢ, yᵢ)ᵀ and thus the position of all pixels under the warp W(x; p). In other words, the destination of a pixel under W(x; p) depends on the shape parameters p through the vertices of the mesh s.

The first components are the Jacobians of the warp with respect to the model vertices. Applying the partial derivative to equation 3.40 gives

∂W/∂xᵢ = (1 − α − β, 0)ᵀ, ∂W/∂yᵢ = (0, 1 − α − β)ᵀ. (3.42)

The Jacobian describes for each shape vertex (xᵢ, yᵢ) how much the pixels within the base mesh s0 change their destination when this model vertex is moved under W(x; p). Figure 3.3 depicts the Jacobian for a selected mesh vertex. A similar figure can be found in [3].

To calculate the second component of the Jacobian ∂W/∂p, the derivative of the mesh vertices with respect to the shape parameters ∂xᵢ/∂p, we take a look at equation 3.36.

A particular shape has the form

s = (x1, y1, x2, y2, . . . , xm, ym)ᵀ. (3.43)

Rewriting equation 3.36 for a particular tuple xᵏ = (xₖ, yₖ) gives

xᵏ = s0ᵏ + ∑ⁿᵢ₌₁ pᵢsᵢᵏ, (3.44)


Figure 3.3: The Jacobians ∂W/∂xᵢ and ∂W/∂yᵢ for a selected vertex of the mesh. The top row represents the x-component, the bottom row the y-component.

where sᵏ denotes the k-th tuple of a shape vector s. The derivative of this vertex with respect to pᵢ is simply

∂xᵏ/∂pᵢ = sᵢᵏ, (3.45)

or, with xᵏ = (xₖ, yₖ) written out and the parameter index renamed to j,

∂xₖ/∂pⱼ = sⱼˣᵏ, ∂yₖ/∂pⱼ = sⱼʸᵏ, (3.46)

where sⱼˣᵏ and sⱼʸᵏ denote the x- and y-components of the k-th tuple (vertex) of the shape vector sⱼ.

For the complete set of parameters the result is

∂xᵢ/∂p = (s₁ˣⁱ, s₂ˣⁱ, . . . , sₙˣⁱ),
∂yᵢ/∂p = (s₁ʸⁱ, s₂ʸⁱ, . . . , sₙʸⁱ). (3.47)


Figure 3.4: Jacobian images (x and y component) of the first shape parameter from a simple model consisting mainly of movement in x-direction.

We now have all the parts together to compute the final warp Jacobian

∂W/∂p = ∑ᵛᵢ₌₁ [ (∂W/∂xᵢ)(∂xᵢ/∂p) + (∂W/∂yᵢ)(∂yᵢ/∂p) ]. (3.48)

Conceptually this Jacobian is a tensor, because it has 2 rows, n columns and a third dimension for the pixels. In practice, it is a set of 2 × n images. Figure 3.4 depicts the Jacobian images (x and y component) of the first shape parameter from a simple model consisting mainly of rotational movement in x-direction. Bright pixels represent areas with larger movement.
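In an implementation, the Jacobian image for parameter j is simply the piecewise linear interpolation of the shape mode sⱼ over the base mesh, because equation 3.42 assigns each pixel the barycentric weights of its triangle's three vertices. A minimal sketch of this assembly follows; the inputs tri_idx, weights and S are hypothetical names for precomputed quantities:

```python
import numpy as np

def warp_jacobian(tri_idx, weights, S):
    """
    tri_idx : (N, 3) int, vertex indices of each pixel's triangle
    weights : (N, 3) float, barycentric weights (1-alpha-beta, alpha, beta)
    S       : (n, v, 2) float, shape modes s_1..s_n over v vertices
    returns : (N, 2, n) Jacobian dW/dp per pixel
    """
    n = S.shape[0]
    N = tri_idx.shape[0]
    J = np.zeros((N, 2, n))
    for j in range(n):
        # Interpolate shape mode j linearly inside each triangle:
        # sum over the three vertices of weight * (s_j^x, s_j^y).
        mode = S[j][tri_idx]                       # (N, 3, 2)
        J[:, :, j] = np.einsum('pk,pkc->pc', weights, mode)
    return J
```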

Warp Inversion The update to the model parameters ∆p is computed with respect to the template A0 and not with respect to I(W(x; p)). Hence the warp W(x; ∆p) must be inverted. Applying a Taylor expansion to the warp function gives

W(x; ∆p) = W(x; 0) + (∂W/∂p) ∆p + O(∆p²) (3.49)
= x + (∂W/∂p) ∆p + O(∆p²), (3.50)

and thus

W(x; ∆p) ◦ W(x; −∆p) = x − (∂W/∂p) ∆p + (∂W/∂p) ∆p = x + O(∆p²). (3.51)

Consequently, the warp can be inverted, to first order in ∆p, by writing

W(x; ∆p)⁻¹ = W(x; −∆p). (3.52)


Warp Composition The composition of two warps is not strictly defined. Matthews and Baker propose the following composition in [3]. From the warp inversion in equation 3.52 it can be seen that the parameters of W(x; ∆p)⁻¹ are −∆p:

∆s0 = −∑ⁿᵢ₌₁ ∆pᵢsᵢ. (3.53)

This warp moves the base mesh s0, but eventually the current mesh s must be adjusted. Therefore these offsets must be transformed from the base mesh s0 to the mesh of our current guess s. Only then can we find the destination of the mesh under the composed warp W(x; p) ◦ W(x; ∆p)⁻¹.

To compute this transformation, first note that for every vertex (x⁰ᵢ, y⁰ᵢ)ᵀ there is an offset (∆x⁰ᵢ, ∆y⁰ᵢ)ᵀ. This offset either points into one of the triangles adjacent to the vertex (or onto the border between two triangles), or it points outside the model border. The question at this point is which triangle should be used to warp the offset from s0 to s. We stick to the same definition as Matthews and Baker [14] and warp the offsets (∆x⁰ᵢ, ∆y⁰ᵢ)ᵀ separately with the affine warps of all triangles adjacent to the shape vertex (x⁰ᵢ, y⁰ᵢ)ᵀ and average the result.

Like the example in figure 2.2, this shape is generally not in the span of the model. It can be projected into the model by

p′ᵢ = sᵢ · (s + ∆s − s0), (3.54)

given that the shape vectors sᵢ are orthonormal.

Appearance Variation The attentive reader might be suspicious about the fact that no parameters for appearance variation are included in the minimization term:

∑x [I(W(x; p)) − A0(W(x; ∆p))]². (3.55)

Indeed the ICIA algorithm is invariant to appearance variation. This is achieved by applying a technique called Project Out, originally proposed by Hager and Belhumeur [17]. Minimizing both shape and appearance of the model means finding the minimum of

∑x∈s0 [A0(x) + ∑ᵐᵢ₌₁ λᵢAᵢ(x) − I(W(x; p))]² (3.56)


with respect to the shape parameters p and the appearance parameters λ. This sum can be split into the linear subspace span(Aᵢ) spanned by the base appearances Aᵢ and its orthogonal complement span(Aᵢ)⊥:

‖A0(x) + ∑ᵐᵢ₌₁ λᵢAᵢ(x) − I(W(x; p))‖²span(Aᵢ)⊥ + ‖A0(x) + ∑ᵐᵢ₌₁ λᵢAᵢ(x) − I(W(x; p))‖²span(Aᵢ), (3.57)

where ‖·‖L denotes the L2 norm of the vector projected into the linear subspace L. This expression simplifies to

‖A0(x) − I(W(x; p))‖²span(Aᵢ)⊥ + ‖A0(x) + ∑ᵐᵢ₌₁ λᵢAᵢ(x) − I(W(x; p))‖²span(Aᵢ), (3.58)

because in the first term only the components in span(Aᵢ)⊥ are considered, and so any component in span(Aᵢ) can be dropped. The point now is that the first term does not depend on λ. The first term can therefore be minimized on its own, and the resulting value for p can then be used to minimize the second term. The minimum of the second term is simply the projection of the error image onto the orthonormal base appearances:

λᵢ = ∑x∈s0 Aᵢ(x) · [I(W(x; p)) − A0(x)], i.e. λ = Aᵀ[I(W(x; p)) − A0(x)], (3.59)

where A = [A1, . . . , Am] is the matrix of base appearance vectors.

The first term in equation 3.58 can be minimized with only a minor change to the update step in equation 3.33: we have to work in span(Aᵢ)⊥. Although the error image is projected into span(Aᵢ)⊥ in that term, in practice this projection does not even have to be computed. The most efficient way to work in the linear subspace span(Aᵢ)⊥ is to project ∇A0 (∂W/∂p) into span(Aᵢ)⊥, because the result is the same and this can be executed in a precomputation step. The reason why this works is that the error image is multiplied with ∇A0 (∂W/∂p) in equation 3.33, and the result of this inner product is unchanged if either of the two vectors is projected into span(Aᵢ)⊥.

The projection of a vector x into span(A)⊥ is (I − AAᵀ)x. The projection above is computed similarly,

SDⱼ := ∇A0 (∂W/∂pⱼ) − ∑ᵐᵢ₌₁ [ ∑x∈s0 Aᵢ(x) · ∇A0 (∂W/∂pⱼ) ] Aᵢ(x), (3.60)


yielding the so-called steepest descent images.
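A minimal sketch of this precomputation, assuming flattened images, an orthonormal appearance basis, and illustrative names (not the thesis implementation):

```python
import numpy as np

def steepest_descent_images(grad_A0, J, A):
    """
    grad_A0 : (N, 2)    gradient of the template per pixel
    J       : (N, 2, n) warp Jacobian per pixel
    A       : (N, m)    orthonormal appearance basis A_1..A_m
    returns : (N, n) projected-out steepest descent images SD_j
              and the (n, n) Gauss-Newton Hessian (cf. eq. 3.34)
    """
    # grad(A0) * dW/dp, one N-vector per warp parameter.
    SD = np.einsum('pc,pcn->pn', grad_A0, J)
    # Subtract the component inside span(A_i), eq. 3.60.
    SD = SD - A @ (A.T @ SD)
    # The Hessian built from the projected images is constant as well.
    H = SD.T @ SD
    return SD, H
```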

Global Transform It was mentioned in previous chapters that the training shapes are usually normalized using a Procrustes analysis that removes scaling, rotational and translational components among the training set. To allow the model to move in such ways, this variation needs to be modeled explicitly. A 2D similarity transform

N(x; q) = ( (1 + a)   −b    ) ( x )   ( tx )
          (   b     (1 + a) ) ( y ) + ( ty ),    (3.61)

expressed in the same notation as the warp W(x; p), is integrated into the image formation process. In this formula, a = k cos(φ) − 1 and b = k sin(φ) define scaling and rotation, and tx, ty describe a translation. A straightforward way to incorporate this warp into a linear model is

N(s0; q) = s0 + ∑⁴ᵢ₌₁ qᵢs*ᵢ, (3.62)

where the shape vectors are s*1 = (x⁰1, y⁰1, . . . , x⁰ᵥ, y⁰ᵥ), s*2 = (−y⁰1, x⁰1, . . . , −y⁰ᵥ, x⁰ᵥ), s*3 = (1, 0, . . . , 1, 0) and s*4 = (0, 1, . . . , 0, 1), and the parameter vector q is given by q1 = a, q2 = b, q3 = tx, q4 = ty. The AAM image is then computed by combining the warps N ◦ W:

M(N(W(x0; p); q)) = A(x) = A0(x) + ∑ᵐᵢ₌₁ λᵢAᵢ(x). (3.63)

Finally, the set of vectors s* needs to be orthonormalized. We will later see why this is necessary.

The computation of the Jacobian of N ◦ W is exactly as described for ∂W/∂p (equation 3.48). The same holds true for the warp inversion. We end up with an additional set of 4 steepest descent images, which can be included in equation 3.33 in order to solve for ∆q.

Warp Composition Revisited Now that two separate warps N ◦ W are used, the composition is more tricky and somewhat vaguely defined. Once a new pair (∆p, ∆q) is computed, the same techniques as before are used to invert the warp and to combine N ◦ W(s0; −∆q, −∆p) and N ◦ W(s0; q, p) in order to compute the destination of the base mesh. Again this leads to a shape s† that has to be projected into the span of the model:

N ◦ W(s0; q, p) = s†. (3.64)

Therefore this equation must be solved for p and q. To understand how this is done, the warp is written as

N ◦ W(s0; q, p) = N(s0 + ∑ⁿᵢ₌₁ pᵢsᵢ; q). (3.65)

Using equation 3.61, this can be rewritten as

N(s0; q) + ( (1 + a)   −b    ) ∑ⁿᵢ₌₁ pᵢsᵢ, (3.66)
            (   b     (1 + a) )

which is again an abuse of notation; the second term means that every pair of coordinates (xᵢ, yᵢ) of the sum is multiplied by the matrix prior to the addition with the first term. A further expansion leads to

N(s0; q) + [ (1 + a) I₂ ∑ⁿᵢ₌₁ pᵢsᵢ ] + [ b R ∑ⁿᵢ₌₁ pᵢsᵢ ], (3.67)

where I₂ denotes the 2 × 2 identity matrix and R = ( 0 −1; 1 0 ) a rotation by 90°.

The second term is orthogonal to the s*ᵢ because it is a linear combination of the vectors sᵢ, which are orthogonal to the s*ᵢ. The third term is orthogonal to the s*ᵢ for similar reasons [3]. Since N(s0; q) = s0 + ∑⁴ᵢ₌₁ qᵢs*ᵢ, and because of the orthogonality between this term containing the s*ᵢ and both terms containing the sᵢ, this expression can be solved for q by projecting onto the s*ᵢ:

qᵢ = s*ᵢ · (s† − s0). (3.68)

Given the new q, the normalizing warp can be inverted and the result projected onto the shape basis to obtain p:

pᵢ = sᵢ · (N(s†; q)⁻¹ − s0). (3.69)

To summarize, the Inverse Compositional Image Alignment algorithm with appearance variation and a normalizing transform consists of the following steps.

Pre-Computation:


• Compute the gradient ∇A0 of the template A0(x)

• Compute the Jacobians ∂W/∂p and ∂N/∂q at the mean shape

• Compute the steepest descent images SDⱼ, j = 1, . . . , n + 4

• Compute the Hessian H

Iteration:

1. Warp the input to compute I(N(W(x; p); q))

2. Multiply all SDⱼ with the error image: ∑x∈s0 SDⱼ(x) · [I(N(W(x; p); q)) − A0(x)] for j = 1, . . . , n + 4

3. Compute the parameter updates (∆p, ∆q) by multiplying the above vector by the inverse Hessian

4. Update the warp: (N ◦ W)(x; q, p) ← (N ◦ W)(x; q, p) ◦ (N ◦ W)(x; ∆q, ∆p)⁻¹

The iteration is performed until no significant change to the parameters occurs. Updating the appearance parameters according to equation 3.59 is an optional step, because the optimization steps above are independent of appearance variation. A schematic sketch of the loop is given below.
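The following is one possible shape of that loop, a sketch only: the helpers warp_image and compose_inverse are hypothetical placeholders for the piecewise affine warp and warp composition machinery described above, as are all other names, and the ordering of the p and q entries in the update vector is a convention:

```python
import numpy as np

def icia_fit(I, A0, SD, H_inv, warp_image, compose_inverse,
             p, q, max_iters=70, tol=1e-6):
    """Schematic ICIA fitting loop with precomputed SD and H_inv."""
    for _ in range(max_iters):
        # 1. Backwards-warp the input onto the base mesh.
        Iw = warp_image(I, p, q)          # I(N(W(x; p); q))
        # 2. Correlate the error image with the steepest descent images.
        e = (Iw - A0).ravel()
        b = SD.T @ e                      # length n + 4
        # 3. Parameter updates via the precomputed inverse Hessian.
        d = H_inv @ b
        dp, dq = d[:-4], d[-4:]
        # 4. Compose the current warp with the inverted incremental warp.
        p, q = compose_inverse(p, q, dp, dq)
        if np.linalg.norm(d) < tol:
            break
    return p, q
```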

The time complexity of the iteration steps is as follows. Denote the number of pixels in s0 as N and the number of parameters of the warp as n. Then step 1 is O(nN), step 2 is O(N + nN), step 3 is O(n³) and step 4 is O(n²) (because the composition of W(x; p) and W(x; ∆p) can be written as a bilinear combination). Together, the time complexity of the iteration steps is O(nN + n³), compared to O(n²N + n³) for the Lucas-Kanade algorithm. For more details about the complexity of the algorithm, an in-depth study of Inverse Compositional Image Alignment in general, and a comparison to Lucas-Kanade alignment and other algorithms, consult “Lucas-Kanade 20 Years On: A Unifying Framework” [14].


Chapter 4

Automatic AAM Generation

A big drawback of Active Appearance Models is the task of labeling the training set with a number of landmarks. Not only does this require human labour, it is also error-prone. Although there has been research towards automatic AAM generation [18], no method has been proposed to date that accurately selects the landmarks in the training data in an automated fashion. The method we propose uses a 3D Morphable Model in order to fully automate this process. Not only does this method save a considerable amount of time, it is also more precise and less error-prone.

The 3D Morphable Model

The 3D Morphable Model (3DMM) is a three-dimensional statistical model of human faces introduced by Blanz and Vetter [1]. A set of 3D scans of human faces is registered using an optical flow algorithm, and a gradient descent algorithm is used to fit the model to an input image. The model has parameters for both shape and texture, so the concept is similar to AAMs.

A simplified representation of 3D Morphable Models is given in chapter 1.4.2. We will stick to the same simplifications in this chapter, with only one exception: sometimes we refer to 3DMM shapes as vectors of length 3 · n instead of matrices of size 3 × n as in chapter 1.4.2.

A 3D Morphable Model is more flexible than 2D approaches for a number of reasons. On the one hand, illumination conditions and 3D pose can both be controlled explicitly. Another reason is the much higher resolution of the mesh: a 3DMM typically has several thousand vertices, compared to typically fewer than 200 vertices in an AAM. However, fitting a Morphable Model is also computationally far more expensive.


Figure 4.1: Example of a subject scanned using a structured-light scanner. The left image is the texture captured by color cameras. The right image is the surface information acquired by the scanner.

We can make use of a 3DMM in order to learn an AAM. Normally, the training data for an AAM consists of a set of images of persons in different poses. Such images can be generated by a 3DMM very precisely, thanks to correspondence and control of rotation and translation parameters. We replace the manual preprocessing that is necessary with traditional AAMs by a fully automated process. The model that was used to produce the training data for the AAM is obtained from scans of human faces. The scans are captured by a structured-light scanner that produces a high resolution mesh (97577 vertices) per person. Figure 4.1 depicts an example scan and the corresponding appearance. All scans are later processed by an algorithm that computes correspondence using optical flow [1]. The result is a database of human faces in dense point-to-point correspondence, i.e. for every vertex (xᵢ, yᵢ, zᵢ) in s we can access the same index in any other shape s′ in the database to get the (anatomically) corresponding location (x′ᵢ, y′ᵢ, z′ᵢ) on the surface s′.

4.0.4 Landmark Extraction

Correspondence makes it possible to define the indices of vertices on the 3D mesh that we want to use for the AAM. We have extended the set of vertices that was used by Matthews and Baker [3]. The extension consists of two additional points around the nose that give more stability to our method. The triangulation is computed using a Delaunay triangulation algorithm.

Figure 4.2: Eye region of a 3D mesh with selected landmarks.

Figure 4.2 shows a closeup of the mesh structure with some of the vertices that we use for building the AAM highlighted. Selecting these landmarks in 3D has to be done manually. This is equivalent to the landmark selection step in previous approaches, but with a significant difference: thanks to correspondence within the 3DMM, it needs to be done only once. The set of landmarks can then be applied to any face in the 3DMM under any pose.

The process of acquiring samples from the 3DMM is then as follows: for every face to be included in the AAM we generate a set of renderings of predefined poses. From the renderer we further extract the position of all landmark points in the image plane and store them in a list. The result is a set of annotated images ready to be used for building an AAM.

The landmarks that we extract can be divided into two disjoint groups:

• Fixed Landmarks correspond to distinct features like eyes, nose and mouth (sometimes referred to as anatomic landmarks). They have a fixed position on the 3D mesh.

• Dynamic Landmarks do not correspond to features and move along the surface depending on the pose parameters. Points along the contour of the face are typical examples of dynamic landmarks.

Figure 4.3: Inner (cheek) and outer (nose) default locations for a selected trajectory on the right cheek. The contour landmark point is searched along this trajectory.

Dynamic landmarks pose a problem for our approach. Even for humans it is sometimes not easy to define the contour in the image of a face. Figure 4.4 depicts two renderings of the mean face in different poses. In one of them the contour can be detected using the surface normal; in the other, the position of the contour on the left cheek of the face is not well defined. We use the following technique to extract reasonable vertices in that area: for every dynamic landmark we have defined an outer and an inner default landmark. On a trajectory between these landmarks we search for a vertex whose surface normal is perpendicular to the camera direction. If such a vertex can be found, we select it. If the trajectory does not contain any such vertex, we select the outer landmark. Figure 4.3 displays the inner and outer landmarks of a trajectory on the cheek, and a sketch of the selection rule is given below.

Using this technique we have built a variety of different models, ranging from person-specific models to models of multiple persons. For all of them we have sampled a 3DMM at a number of values for rotations around the x- and y-axis and extracted the landmarks as described above in order to obtain a set of annotated training images. After the data has been generated, the procedure is just like with a conventional AAM: PCA is applied to the shape vertices, the textures are warped onto the mesh defined by the mean shape, and on these shape-normalized textures PCA is applied to obtain the appearance model.
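A minimal sketch of the contour-landmark selection just described; all inputs and names are hypothetical, and the real implementation operates on the rendered 3DMM mesh:

```python
import numpy as np

def contour_landmark(trajectory, normals, view_dir, eps=0.02):
    """
    trajectory : list of vertex indices from outer to inner landmark
    normals    : (v, 3) unit surface normals of the rotated mesh
    view_dir   : (3,) unit vector pointing towards the camera
    """
    for idx in trajectory:
        # Pick the first vertex whose normal is (nearly) perpendicular
        # to the camera direction, i.e. a silhouette point.
        if abs(np.dot(normals[idx], view_dir)) < eps:
            return idx
    return trajectory[0]   # fall back to the outer default landmark
```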

Figure 4.4: Two renderings of the mean face in different poses. In the left image the contour is well defined on both sides of the face. In the right image, a contour has to be defined somewhere on the left cheek.

Figure 4.5: Overview of fixed (black) and dynamic (red) landmarks.

Figure 4.6: Comparison of a sample from a 3D Morphable Model and the extracted landmarks: (a) projected face with landmarks, (b) extracted landmarks, (c) triangulated landmarks.

We had originally warped the textures of all samples to the mean shape. However, experiments have shown that this is not really necessary. We can always render every sample from a frontal position, which gives a good appearance with lower overall distortion than samples that are first rotated and then warped back to the frontal mean. The frontal texture can then be used to reconstruct rotated views. Especially in the case of ICIA fitting, where the input is always warped onto the mean shape (which usually corresponds to a frontal view), this is probably a reasonable thing to do.


Chapter 5

Results and Discussion

5.1 Fitting Results

We have carried out a number of experiments using a person-specific model to evaluate our approach.

The model was built from a person that was recorded in frontal pose as well as rotated by 30° to both sides. Figure 5.1 depicts these samples as well as the resulting 2D model. The model has three components; however, most of the variability is encoded in the first component. Using only this component, rotations in the interval [−30°, +30°] can be reconstructed realistically. The resolution of the input images for both training and fitting is 512 × 512 and the dimension of the model is 181 × 199 pixels. The number of pixels within the base shape is approximately 28,000. The data used for all experiments is semi-synthetic. This means that although we fit to images that are not in the span of the AAM, the data can still be considered synthetic, because all images are derived from a 3D Morphable Model and not from real photographs. In particular, all test images have a constant black background.

Convergence Range

The goal of the first experiment was to find out how much initial translational offset the model can compensate. To assess this convergence range we have systematically shifted the model in both x- and y-direction from the position of the face in the input image. The range of translations was [−50, +50] × [−50, +50] in steps of 10 pixels, resulting in a total of 121 fits. The experiment was carried out on the same face that was used to build the model.


Figure 5.1: Person-specific model. The top row depicts the training images from the 3D model; the second and third rows depict the mean shape (middle) and reconstructed views (left, right) from variation in the first shape component.


Figure 5.2: The images that were used for the first experiment: (a) frontal view, (b) rotated view (15°), (c) rotated view (30°).

First we presented the model with a frontal image of the subject. The experiment was then repeated on 15° and 30° rotated versions of the same individual. Note that the 15° rotated face was not part of the training samples for the AAM and is thus not in its span. Figure 5.2 depicts these images. The model was initialized using the mean shape in all experiments.

Tables 5.1, 5.2 and 5.3 display the results. The numbers indicate how many iterations were needed until the squared distance between the vertices of the model and the vertices of the ground truth fell below a threshold. Dashes (-) indicate that the model diverged, converged to a local minimum away from the ground truth, or stopped because no solution was found after 70 iterations.

We can see in table 5.1 that, for the frontal image, the model converges for offsets in x-direction of approximately 30–40 pixels. For negative x-offsets, the convergence range is slightly better, because neither the model nor the input is perfectly symmetric. It stands out that the model is more sensitive to y-translations. Especially for negative y-offsets, the number of iterations increases significantly. The reason for this can be found in the input image: forehead and hair attract the model to wrong contours. The same phenomenon, possibly much stronger, is to be expected in all directions when the model is fitted to images with arbitrary structures in the background. Overall, the algorithm converged at 30 of the total 121 positions.

Table 5.2 displays the result for the experiment on the 15° rotated version. The convergence range is now larger for positive x-offsets. This can be explained by the fact that a vertex on the nose is used as the reference for initialization between input and model. When the face is rotated, the distance between the nose and the right border of the model increases while it decreases on the other side, yielding more support for the model on the right side than on the left side. Overall, the algorithm converged at 36 of the total 121 positions.

Table 5.3 displays the result for the experiment on the 30° rotated version. The convergence range for x-offsets is now visibly smaller. The cluster in this table is shifted even further to the right, for the same reason as above. Overall, the algorithm converged at 19 of the total 121 positions.

Performance on the 15° image is almost twice as good as on the 30° image, even though the 15° image was not used to build the model. On the other hand, the distance from initialization to optimal parameters is larger for the latter, since the mean shape is used as the initialization for all experiments, so this result is not very surprising.

The observation of different sensitivities to offsets in x- and y-direction was the subject of an additional experiment. We have repeated the same experiment as in table 5.1, but without shape variation, in order to assess whether the compensation for larger offsets in x-direction stems from the correlation between the first shape component and changes to the x-coordinates of the model. The result in table 5.4 shows that this is not the case. In fact, the experiment revealed that the opposite effect occurs: the algorithm converged at 37 of the 121 positions. This is because the lack of shape variability results in a lower probability of getting stuck in a local minimum.

Relation Between Pose and Model Parameters

In order to explore the relation between the shape parameters and the rotation of the input, we have carried out the following experiment. The model was fitted to a total of six images: the frontal view and rotations in x-direction of ±15° and ±30°. The initialization was again performed based on the nose vertex, and random perturbations were introduced. Figure 5.4 shows a plot of the first and second shape components for all samples that converged. The plot shows nicely that an approximately quadratic relation exists between these two shape parameters and that different pose angles can be separated according to these values.


x-offset →   −50  −40  −30  −20  −10    0  +10  +20  +30  +40  +50
y-offset ↓
  −50          -    -    -    -    -    -    -    -    -    -    -
  −40          -    -    -    -    -    -    -    -    -    -    -
  −30          -    -    -    -    -    -    -    -    -    -    -
  −20          -    -    -    -    -    -    -    -    -    -    -
  −10          -    -   55   45   33   33   36   44   55    -    -
    0          -   57   38   27   14   12   15   27   38    -    -
  +10          -   53   34   23    5    1    8   24   34    -    -
  +20          -    -   38   25    7    3    9   25   36    -    -
  +30          -    -    -    -    -    -    -    -    -    -    -
  +40          -    -    -    -    -    -    -    -    -    -    -
  +50          -    -    -    -    -    -    -    -    -    -    -

Table 5.1: Convergence range for the person-specific AAM with shape variation and global transform, applied to a frontal view of a face (figure 5.2(a)).

x-offset →   −50  −40  −30  −20  −10    0  +10  +20  +30  +40  +50
y-offset ↓
  −50          -    -    -    -    -    -    -    -    -    -    -
  −40          -    -    -    -    -    -    -    -    -    -    -
  −30          -    -    -    -    -    -    -    -    -    -    -
  −20          -    -    -    -    -    -    -    -    -    -    -
  −10          -    -    -    -   53   40   38   43   47   56    -
    0          -    -    -   40   26   18   15   19   27   37   50
  +10          -    -    -   36   23    9    3    9   20   32   51
  +20          -    -   59   36   25   10    5   10   22   33    -
  +30          -    -    -   39   29   24   35   39    -    -    -
  +40          -    -    -   55   52    -    -    -    -    -    -
  +50          -    -    -    -    -    -    -    -    -    -    -

Table 5.2: Convergence range for the person-specific AAM with shape variation and global transform, applied to a 15° rotated face in x-direction.


x-offset →   −50  −40  −30  −20  −10    0  +10  +20  +30  +40  +50
y-offset ↓
  −50          -    -    -    -    -    -    -    -    -    -    -
  −40          -    -    -    -    -    -    -    -    -    -    -
  −30          -    -    -    -    -    -    -    -    -    -    -
  −20          -    -    -    -    -    -    -    -    -    -    -
  −10          -    -    -    -    -    -    -    -    -    -    -
    0          -    -    -    -    -    -   64   48   55   65    -
  +10          -    -    -    -    -    -   34   12   17   24   37
  +20          -    -    -    -    -    -   33   17   17   25   35
  +30          -    -    -    -    -    -   55   21   25   64    -
  +40          -    -    -    -    -    -    -   46    -    -    -
  +50          -    -    -    -    -    -    -    -    -    -    -

Table 5.3: Convergence range for the person-specific AAM with shape variation and global transform, applied to a 30° rotated face in x-direction.

x-offset →   −50  −40  −30  −20  −10    0  +10  +20  +30  +40  +50
y-offset ↓
  −50          -    -    -    -    -    -    -    -    -    -    -
  −40          -    -    -    -    -    -    -    -    -    -    -
  −30          -    -    -    -    -    -    -    -    -    -    -
  −20          -    -    -    -    -    -    -    -    -    -    -
  −10          -    -    -   49   39   38   44   66    -    -    -
    0          -   69   33   23   17   15   19   26   42    -    -
  +10          -   40   24   14    6    1    6   14   24   54    -
  +20          -   50   28   17    8    4   10   18   28   44    -
  +30          -    0   59   33   21   21   21   43    -    -    -
  +40          -    -    -    -    -    -    -    -    -    -    -
  +50          -    -    -    -    -    -    -    -    -    -    -

Table 5.4: Convergence range for the person-specific AAM without shape variation, applied to a frontal view of a face.


Figure 5.3: Sequence of a fit to the 15° rotated sample with x-offset and y-offset equal to +20: (a) input and model initialization, (b) after 3 iterations, (c) after 30 iterations, (d) after 35 iterations, (e) after 40 iterations, (f) after 45 iterations.


Figure 5.4: First (horizontal axis) and second (vertical axis) shape coefficients for samples that converged on inputs of different rotations in x-direction.

5.2 Discussion

5.2.1 Thoughts about ICIA

A few points in the approach of Matthews and Baker are inherently vague, or at least remain unclear.

Computation of Changes to the Current Shape

Matthews and Baker do not prove that the method used to transform changes to the base mesh vertices into changes to the current mesh vertices always makes sense. In particular, several other methods for estimating the changes to the vertices in s are possible, such as warping the vertices in ∆s0 with only one triangle instead of using all adjacent triangles. It would be an interesting task to study how well this method for warp composition actually performs and how it depends on the topology of the mesh.

Furthermore, the authors assume that the sets of shape vectors sᵢ and s*ᵢ are orthogonal to each other. They also concede that, “due to various sources of noise”, they never really are. Whether it is still valid to proceed without orthonormalizing the complete set of vectors, and how this affects the performance, remains an open question. In our implementation the four normalizing transform vectors s*ᵢ are orthonormalized separately from the shape vectors.

Appearance Invariance

Another point of concern is appearance invariance. What is actually presented as an advantage of the approach might as well be a drawback in some situations. The fact that the gradient is computed on the mean appearance and not on the input might lower the specificity of the algorithm as the number of individuals increases. A comparison with the Lucas-Kanade algorithm could confirm or refute this suspicion.

Failure of ICIA

There are cases where the ICIA algorithm does exactly the opposite of what one would expect an image alignment algorithm to do. Consider the very simple model in figure 5.5 and the corresponding input. In the first iteration, the image is warped to the base mesh through I(N(W(x; p); q)) (left column). Consider the pixel that is highlighted in blue. The gradient in the back-warped input at this pixel points towards the center of the circle in the input image (middle column, top). However, the gradient at this position of the warped input is estimated using the gradient of the model (middle column, bottom). This gradient points in approximately the opposite direction, and so the template is moved in the wrong direction. This behaviour is repeated until there is no overlap between model and input (right column). Although this phenomenon appears only when there is little overlap between model and input, it still shows a weakness of ICIA compared to forwards-additive and forwards-compositional methods.


Figure 5.5: A simple model with a single radial gradient appearance component. The ICIA algorithm is applied to fit the model to an input that is in the span of the model but shifted from the initial model position (left). In the first iteration, the input is warped to the base shape through I(N(W(x; p); q)) (top center). Comparing the gradient of the warped input with the same pixel location in the base appearance shows that the gradients point in opposite directions. The model will thus move away from the input until the overlap is eliminated.

5.2.2 Thoughts about AAMs

Non-linearity

The data that we generate using our automatic 2D model generation method is non-linear, just like when input images are hand-labeled. The perspective projection that is used both when taking photographs and when rendering from a 3D model is a non-linear function because of the perspective division. In computer graphics we have the possibility to use an orthographic projection in order to get rid of this non-linearity. Nevertheless, the variability in shape is still non-linear, because the position of a vertex on the head depends non-linearly on the pose angle. Figure 5.6 explains this in more detail. In this simplified case, a linear approximation of the dependency between the coordinate of a projected point on the image plane and the pose angle of the head is good if the point is near the center of the image plane. The further we drift away to the sides, the worse a linear approximation becomes.

The fact that many landmarks in an AAM are positioned on features that are near the center of the face (eyes, nose, mouth) explains why AAMs can approximate rotations to some degree. The result of the experiment in table 5.1 confirms that a linear approximation to this non-linear head movement is good around the frontal pose (perpendicular to the image plane).

Figure 5.6: A schematic view of a head from the top and an image plane in front of the face. For simplification, orthographic projection is used. The figure illustrates that the position of a vertex on the image plane is non-linearly dependent on the pose angle. (a) Equidistant movements in the camera plane result in unequal changes in both angle and distance on the surface. (b) Equidistant changes to the angle result in unequal movements in the camera plane.

Computation of Gradient The computation of the gradient is problematic at the model borders. Convolving A0(x) with a conventional Sobel filter

Gx = ( +1  0  −1 )        Gy = ( +1  +2  +1 )
     ( +2  0  −2 )             (  0   0   0 )
     ( +1  0  −1 )             ( −1  −2  −1 )    (5.1)

produces artificial gradients at the model border, because pixels from the background are taken into account. This is something that should be avoided in order to increase generality. The problem is solved by computing a simplified version of the gradient according to

Gx = (  0  0   0 )        Gy = ( 0  +1  0 )
     ( +1  0  −1 )             ( 0   0  0 )
     (  0  0   0 )             ( 0  −1  0 )    (5.2)

One-sided gradients are computed at the model border: there we use dx = [f(x + h) − f(x)]/h instead of the central difference dx = [f(x + h) − f(x − h)]/(2h) (and similarly for the remaining three border cases and for y). A better method could be to apply the Sobel filters where possible and only apply the filters in equation 5.2 at the border.
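A minimal sketch of such a mask-aware gradient (x-component only; the sign and scale conventions, the explicit loops and all names are illustrative, not the thesis implementation):

```python
import numpy as np

def masked_gradient_x(img, mask):
    """x-gradient that never reads pixels outside the model mask."""
    gx = np.zeros_like(img, dtype=float)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            if not mask[y, x]:
                continue
            left = x > 0 and mask[y, x - 1]
            right = x < w - 1 and mask[y, x + 1]
            if left and right:     # interior: central difference
                gx[y, x] = (img[y, x + 1] - img[y, x - 1]) / 2.0
            elif right:            # one-sided at the left border
                gx[y, x] = img[y, x + 1] - img[y, x]
            elif left:             # one-sided at the right border
                gx[y, x] = img[y, x] - img[y, x - 1]
    return gx
```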


Figure 5.7: Comparison of the Sobel filter (top row) and the gradient method used here (bottom row).

5.2.3 Model Building

A more sophisticated method could be used to extract the contour vertices. Until now, the trajectories have to be defined manually. They could just as well be searched for on the mesh surface using a clever algorithm: the projection of the line from the outer to the inner default vertex onto the mesh surface could be computed and used for automatic trajectory generation, or a variant of a greedy search algorithm could be applied. This might be a challenge, because e.g. the Dijkstra algorithm and similar methods are not trivial to extend to 3D meshes.

We have observed that the mean shape becomes visibly asymmetric if larger variations of x-axis rotations are incorporated in a model. Although we do not expect the mean to be perfectly symmetric, the amount of asymmetry observed is suspiciously large. This is probably due to a certain instability in the dynamic contour extraction algorithm, e.g. contour landmarks on one side of the face are not always computed in the same way as landmarks computed on the other side for the opposite rotation. Due to lack of time, we could not yet investigate this problem further.


Chapter 6

Conclusion

During the six month preparation time for this thesis, a method for the automatic transformation of a 3D Morphable Model (3DMM) into an Active Appearance Model (AAM) was developed. To evaluate the approach, the model was matched to images using Inverse Compositional Image Alignment (ICIA) in a number of synthetic test cases. The results are encouraging and show that AAMs are capable of describing 3D pose variation of faces, but they demand further investigation.

6.0.4 Propositions for Future Work

There are a number of improvements that could be made based on this work. These include the following.

Speed Optimization

First of all, code improvements are necessary in order to achieve fitting speeds similar to those in [3]. During the development of the fitting software for this thesis, the focus was on automated model construction, analysis and the implementation of ICIA, and less on speed.

Dependency Between Shape Parameter and Pose

Based on the preliminary experiments in this thesis, an in-depth study of the relation between the pose parameters and the shape coefficients of the model is of high priority.

Regularization

The probability distribution of the model parameters along the individual directions of the orthogonal shape and texture spaces is known, since the data is transformed using PCA (see appendix A). This information can be used to restrict the ICIA algorithm to only produce shapes and appearances that are “likely to be faces”. Restricting the values of the parameters to 3σ, i.e. three standard deviations around the mean, covers around 99.7% of the area under a Gauss curve. It can be seen in some examples that the shape currently degenerates if the initialization is bad. Regularization avoids this phenomenon to some degree and improves overall robustness. Regularization is a very natural thing to do and one of the benefits of statistical models. Only due to the lack of time must this be postponed to the list of future work.
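Such a constraint could be as simple as the following clamp on the PCA coefficients (a minimal sketch; names are illustrative):

```python
import numpy as np

def clamp_parameters(p, eigenvalues, k=3.0):
    """Clamp each coefficient to k standard deviations around the mean."""
    sigma = np.sqrt(eigenvalues)
    return np.clip(p, -k * sigma, k * sigma)
```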

Hierarchical Processing

The convergence range can be improved by fitting the model to low-pass filtered images first, i.e. images of lower pixel resolution, and subsequently to images of higher resolution. A multiresolution strategy reduces the likelihood of falling into a local minimum and extends the convergence radius.

Combination with Face Detection System

A powerful extension would be to combine the system with a face detector. We have experimented with an open-source library for face detection, and the results are encouraging. Information about the location and size of faces in images can be obtained and used as an initial guess for an AAM.


Bibliography

[1] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In SIGGRAPH ’99: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pages 187–194, New York, NY, USA, 1999. ACM Press/Addison-Wesley Publishing Co.

[2] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. In Proceedings of the European Conference on Computer Vision (ECCV), 1998.

[3] Iain Matthews and Simon Baker. Active appearance models revisited. International Journal of Computer Vision, 60(2):135–164, November 2004.

[4] Jing Xiao, S. Baker, I. Matthews, and T. Kanade. Real-time combined 2D+3D active appearance models. In Proceedings CVPR ’04, volume 2, pages 535–542, 2004.

[5] J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698, November 1986.

[6] Roland T. Chin and Charles R. Dyer. Model-based recognition in robot vision. ACM Computing Surveys, 18(1):67–108, 1986.

[7] M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. In Proceedings CVPR ’91, pages 586–591, 1991.

[8] A. L. Yuille, D. S. Cohen, and P. W. Hallinan. Feature extraction from faces using deformable templates. In Proceedings CVPR ’89, pages 104–109, 1989.

[9] P. Lipson, Alan L. Yuille, D. O’Keeffe, J. Cavanaugh, J. Taaffe, and D. Rosenthal. Deformable templates for feature extraction from medical images. In ECCV ’90: Proceedings of the First European Conference on Computer Vision, pages 413–417, London, UK, 1990. Springer-Verlag.

[10] A. Beinglass and H. J. Wolfson. Articulated object recognition, or: how to generalize the generalized Hough transform. In Proceedings CVPR ’91, pages 461–466, 1991.

[11] Geoffrey E. Hinton, Christopher K. I. Williams, and Michael D. Revow. Adaptive elastic models for hand-printed character recognition. In John E. Moody, Steve J. Hanson, and Richard P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 512–519. Morgan Kaufmann Publishers, Inc., 1992.

[12] Fred L. Bookstein. Morphometric Tools for Landmark Data: Geometry and Biology. Cambridge University Press, June 1997.

[13] Andreas Lanitis, Chris J. Taylor, and Timothy F. Cootes. Automatic interpretation and coding of face images using flexible models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):743–756, July 1997.

[14] Simon Baker and Iain Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56(3):221–255, February 2004.

[15] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings IJCAI ’81, pages 674–679, 1981.

[16] G. D. Hager and P. N. Belhumeur. Efficient region tracking with parametric models of geometry and illumination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(10):1025–1039, 1998.

[17] Gregory D. Hager and Peter N. Belhumeur. Efficient region tracking with parametric models of geometry and illumination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(10):1025–1039, 1998.

[18] S. Baker, I. Matthews, and J. Schneider. Automatic construction of active appearance models as an image coding problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10):1380–1384, 2004.


Appendix A

Principal Component Analysis

Principal Component Analysis (PCA) is a method to decorrelate the covariance matrix of data in order to extract directions of high variability and to remove dimensions that encode little or no information. For a set of vectors

vᵢ ∈ Rᵐ, i = 1, . . . , n, (A.1)

we can write the mean

µ = (1/n) ∑ⁿᵢ₌₁ vᵢ (A.2)

and subtract the mean from all vectors

xᵢ = vᵢ − µ (A.3)

to obtain a new mean-free set of vectors. Written in matrix form, we get the matrix of mean-free data

X = [x1, . . . , xn] ∈ Rᵐˣⁿ. (A.4)

The step of subtracting the mean can be understood as centering the data around the origin of the coordinate system. The covariance matrix of X is

CX = (1/n) XXᵀ = (1/n) ∑ⁿᵢ₌₁ xᵢxᵢᵀ ∈ Rᵐˣᵐ. (A.5)

The covariance matrix can be diagonalized using singular value decomposition (SVD). For an n-by-m matrix A with real or complex entries there exists a factorization of the form

A = USVᵀ, (A.6)

where U and V are orthonormal matrices and S is a diagonal matrix containing the singular values. A = USVᵀ is called the singular value decomposition of A. In the case that A is a symmetric, positive semi-definite matrix, the singular values equal the eigenvalues of A:

Auᵢ = λᵢuᵢ. (A.7)

Since covariance matrices are always positive semi-definite, C_X can be decomposed according to

\[ C_X = U S V^T \qquad \text{(A.8)} \]

in order to obtain the matrix of eigenvalues S and a new basis U. By projecting the data into the space spanned by the principal components of the covariance matrix, we effectively decorrelate the data:

\[ c_i = U^T x_i, \qquad c_i = \begin{pmatrix} c_{i1} \\ \vdots \\ c_{im} \end{pmatrix}, \qquad c_{ij} = \langle u_j, x_i \rangle. \qquad \text{(A.9)} \]

We can write this for all samples at once as the matrix operation

\[ C = U^T X, \qquad C = [c_1, \dots, c_n]. \qquad \text{(A.10)} \]

The vectors u_i in U can be sorted according to the size of the corresponding eigenvalue λ_i. Vectors corresponding to eigenvalues equal to zero can be omitted without any loss of information. Omitting the vectors that correspond to the smallest eigenvalues results in an optimal lossy compression in the least-squares sense.
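To make the projection and truncation concrete, the following is a minimal NumPy sketch (an illustration added here, not part of the original text); the random data, the variable names and the choice of k = 3 retained components are assumptions for the example.

```python
import numpy as np

# Illustrative mean-free data: the columns of X are n = 10 samples in R^5.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 10))
X = X - X.mean(axis=1, keepdims=True)   # center the data, cf. (A.3)

C = (X @ X.T) / X.shape[1]              # covariance matrix C_X, cf. (A.5)
eigvals, U = np.linalg.eigh(C)          # eigendecomposition of the symmetric C_X

order = np.argsort(eigvals)[::-1]       # sort the basis by decreasing eigenvalue
eigvals, U = eigvals[order], U[:, order]

k = 3                                   # keep only the k strongest components
U_k = U[:, :k]
coeffs = U_k.T @ X                      # projection c_i = U^T x_i, cf. (A.9)
X_hat = U_k @ coeffs                    # rank-k reconstruction, optimal in the least-squares sense
print(np.linalg.norm(X - X_hat))        # residual of the lossy compression
```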

Computing the SVD of the covariance matrix can be very costly because the size of C_X is m-by-m. In many cases m ≫ n, i.e. the number of dimensions is much larger than the number of samples. It is, however, possible to decompose C_X without actually computing this large matrix. The following equation holds:

\begin{align}
C_X &= \frac{1}{n} X X^T && \text{(A.11)} \\
    &= \frac{1}{n} U W V^T V W U^T && \text{(A.12)} \\
    &= \frac{1}{n} U W^2 U^T, && \text{(A.13)}
\end{align}


where W is the matrix of singular values of X. So first computing the SVD of X and then obtaining the eigenvalues according to

\[ \lambda_i = \frac{1}{n} \omega_i^2 \qquad \text{(A.14)} \]

gives the same result as computing the SVD directly on C_X.
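The following sketch (again an illustration with assumed names, not from the original text) implements this shortcut; for m ≫ n, the thin SVD of the m-by-n matrix X is far cheaper than decomposing the m-by-m matrix C_X.

```python
import numpy as np

def pca_via_svd(X):
    """PCA without ever forming the m-by-m covariance matrix.

    The columns of X are assumed to be mean-free samples x_i in R^m.
    Returns the basis U and the eigenvalues lambda_i = omega_i^2 / n
    of C_X, cf. (A.14).
    """
    n = X.shape[1]
    U, w, Vt = np.linalg.svd(X, full_matrices=False)  # thin SVD: X = U W V^T
    return U, w**2 / n

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))     # m = 1000 dimensions, only n = 20 samples
X = X - X.mean(axis=1, keepdims=True)
U, eigvals = pca_via_svd(X)

# Sanity check against the direct route through the covariance matrix.
C = (X @ X.T) / X.shape[1]
direct = np.linalg.eigvalsh(C)[::-1][:X.shape[1]]
print(np.allclose(eigvals, direct))     # True (up to numerical precision)
```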


Appendix B

Gauss-Newton Algorithm

The Gauss-Newton algorithm is a variant of the Newton method for function approximation and minimization. To make things clear, we will first discuss the Newton method and then explain the difference to the Gauss-Newton method.

The Newton Method

The Newton method is an iterative algorithm for finding a zero (called a root) of a function by linearizing it around a point of interest x_n and solving for the intercept between this tangent line and the x-axis. The intercept gives a new value x_{n+1}, which is used for the next iteration.

The linearization of f(x) around x is performed using a Taylor expansion. The Taylor expansion of a real- or complex-valued function f(x) that is infinitely differentiable in a neighborhood of a is

\[ f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x-a)^n = f(a) + \frac{f'(a)}{1!}(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f^{(3)}(a)}{3!}(x-a)^3 + \dots \qquad \text{(B.1)} \]

Often f(x) is equal to its Taylor expansion in a sufficiently close neighborhood of x, which is the reason why the Taylor series is important for function approximation and minimization. Taking the first two terms of the Taylor expansion gives

\[ f(x+h) \approx f(x) + f'(x) \cdot h, \qquad \text{(B.2)} \]

and setting this approximation to zero (we are looking for a root) and solving for h gives

\[ h = -\frac{f(x)}{f'(x)}. \qquad \text{(B.3)} \]


This leads to the recurrence relation

\[ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}, \qquad \text{(B.4)} \]

which is iterated until convergence.
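As a minimal illustration of the recurrence (B.4) (not from the original text; the example function, derivative and tolerance are assumptions):

```python
def newton_root(f, df, x0, tol=1e-10, max_iter=50):
    """Find a root of f by iterating x_{n+1} = x_n - f(x_n) / f'(x_n), cf. (B.4)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:   # stop once the update is negligibly small
            break
    return x

# Example: the positive root of f(x) = x^2 - 2 is sqrt(2) = 1.41421...
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))
```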

Besides finding roots of functions, the Newton method can be used to find local minima and maxima of functions. If x is a stationary point (local minimum, local maximum or inflection point) of a function f(x), then x is a root of the first-order derivative f'(x). By applying the Newton method to f'(x), the stationary point can be found. We use the Taylor expansion of f(x+h),

\[ f(x+h) \approx f(x) + f'(x)\,h + \frac{f''(x)}{2} h^2, \qquad \text{(B.5)} \]

where we have included the second-order term of the series. Differentiating this Taylor expansion with respect to h and setting the result to zero gives

\[ \frac{\mathrm{d} f(x+h)}{\mathrm{d} h} = f'(x) + f''(x)\,h = 0. \qquad \text{(B.6)} \]

By solving this linear equation we find the extremum of the current approximation. If the initial guess x_0 is chosen sufficiently close to the actual extremum of f(x), we can find this extremum using the recurrence relation

\[ x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}. \qquad \text{(B.7)} \]

Depending on the sign of f''(x), this is either a minimum (f''(x) > 0) or a maximum (f''(x) < 0).

Intuitively, we approximate f(x) by a quadratic function around x and solve for the extremum of this parabola; in this way we take steps toward the extremum of f(x). If f(x) happens to be a quadratic function, the solution is found in a single step.
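A sketch of the corresponding minimization recurrence (B.7), again an illustration with an assumed example function; since the example is quadratic, a single step reaches the minimum:

```python
def newton_minimize(df, d2f, x0, tol=1e-10, max_iter=50):
    """Find a stationary point by iterating x_{n+1} = x_n - f'(x_n) / f''(x_n), cf. (B.7)."""
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: f(x) = (x - 3)^2 + 1 is quadratic, so one step suffices.
x_star = newton_minimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0)
print(x_star)   # 3.0; f''(x) = 2 > 0, so this is a minimum
```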

This scheme can be generalized to higher dimensions. The first-order derivative f'(x) is replaced by the gradient ∇f(x), and the second-order derivative is replaced by the Hessian matrix of second-order derivatives, H. The update rule becomes

\[ x_{n+1} = x_n - H^{-1} \nabla f(x_n). \qquad \text{(B.8)} \]


The fact that the second-order derivatives have to be computed in every iteration is a drawback of the Newton method. In particular, computing the inverse of the Hessian is a very expensive operation. Additionally, if the Hessian is close to a non-invertible matrix, numerical instabilities occur. If f(x) has a certain structure, it is possible to overcome these problems, as we will see in the next section.

The Gauss-Newton Algorithm

The Gauss-Newton algorithm is a variant of the Newton method that can only be applied to least-squares problems. The difference to the Newton method is that the Hessian H is approximated instead of being computed exactly. Computing second-order derivatives becomes expensive if the dimensionality of the data is high, since the number of elements in the Hessian is quadratic in the number of dimensions. In the previous section we derived the update rule of the Newton method for higher dimensions,

\[ x_{n+1} = x_n - H^{-1} \nabla f(x_n). \qquad \text{(B.9)} \]

We will now compute both the gradient and the Hessian analytically in order to examine the special structure of least-squares problems of the form

\[ F(x, p) = \sum_{i=1}^{n} \bigl( f(x_i, p) - y_i \bigr)^2 =: \sum_{i=1}^{n} r_i^2, \qquad \text{(B.10)} \]

where (x_1, y_1), ..., (x_n, y_n) are a set of data points and p = (p_1, ..., p_m) is a vector of model parameters that governs the model function f(x). If n > m, the model will, in general, not be able to fit the data perfectly, and so the goal is to minimize the resulting error. We can apply the chain rule, (u(v(x)))' = u'(v(x)) v'(x), with u(x) = \sum_{i=1}^{n} x_i^2 and v(x) = r(x), to compute the elements of the gradient:

\[ \frac{\partial F(x, p)}{\partial p_j} = 2 \sum_{i=1}^{n} r_i \frac{\partial r_i}{\partial p_j}. \qquad \text{(B.11)} \]

The gradient is therefore

\[ \nabla F = 2 \sum_{i=1}^{n} \begin{pmatrix} r_i \frac{\partial r_i}{\partial p_1} \\ \vdots \\ r_i \frac{\partial r_i}{\partial p_m} \end{pmatrix}. \qquad \text{(B.12)} \]


The Hessian is computed by differentiating the elements of the gradient with respect to the parameters:

\[ H_{jk} = 2 \sum_{i=1}^{n} \left( \frac{\partial r_i}{\partial p_j} \frac{\partial r_i}{\partial p_k} + r_i \frac{\partial^2 r_i}{\partial p_j \, \partial p_k} \right). \qquad \text{(B.13)} \]

By ignoring the second-order terms in the above equation, an approximation of H is obtained:

\[ H_{jk} \approx 2 \sum_{i=1}^{n} J_{ij} J_{ik}. \qquad \text{(B.14)} \]

The J_{ik} are the entries of the Jacobian J_r of r. In terms of J_r, the gradient and the Hessian can be written in matrix notation as

\[ \nabla F = 2 J_r^T r = 2 \begin{pmatrix} \frac{\partial r_1}{\partial p_1} & \cdots & \frac{\partial r_n}{\partial p_1} \\ \vdots & & \vdots \\ \frac{\partial r_1}{\partial p_m} & \cdots & \frac{\partial r_n}{\partial p_m} \end{pmatrix} \begin{pmatrix} r_1 \\ \vdots \\ r_n \end{pmatrix}, \qquad H \approx 2 J_r^T J_r = 2 \begin{pmatrix} \frac{\partial r_1}{\partial p_1} & \cdots & \frac{\partial r_n}{\partial p_1} \\ \vdots & & \vdots \\ \frac{\partial r_1}{\partial p_m} & \cdots & \frac{\partial r_n}{\partial p_m} \end{pmatrix} \begin{pmatrix} \frac{\partial r_1}{\partial p_1} & \cdots & \frac{\partial r_1}{\partial p_m} \\ \vdots & & \vdots \\ \frac{\partial r_n}{\partial p_1} & \cdots & \frac{\partial r_n}{\partial p_m} \end{pmatrix}. \qquad \text{(B.15)} \]

If we write the recurrence relation of the Newton method (equation B.9) using these quantities, the factors of two cancel and we get

\[ x_{n+1} = x_n - (J_r^T J_r)^{-1} J_r^T r. \qquad \text{(B.16)} \]

The approximation of the Hessian is good if the quadratic term is orders of magnitude smaller than the linear term. This is the case if the function values r_i are small in magnitude, at least around the minimum, or if the function is only mildly non-linear, so that the second-derivative term stays relatively small.
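To tie the pieces together, here is a minimal NumPy sketch of a Gauss-Newton fit using the update (B.16); it is an illustration, not part of the original text, and the exponential model, the finite-difference Jacobian and all names are assumptions for the example.

```python
import numpy as np

def gauss_newton(residual, p0, n_iter=20, eps=1e-7):
    """Minimize sum(r(p)**2) via p <- p - (J^T J)^{-1} J^T r, cf. (B.16).

    The Jacobian J of the residual vector is approximated by forward
    finite differences, so no second derivatives are ever formed.
    """
    p = np.asarray(p0, dtype=float)
    for _ in range(n_iter):
        r = residual(p)
        J = np.empty((r.size, p.size))          # J[i, k] = d r_i / d p_k
        for k in range(p.size):
            dp = np.zeros_like(p)
            dp[k] = eps
            J[:, k] = (residual(p + dp) - r) / eps
        # Solve the normal equations instead of inverting J^T J explicitly.
        p = p - np.linalg.solve(J.T @ J, J.T @ r)
    return p

# Example: fit the model f(x, p) = p[0] * exp(p[1] * x) to noisy samples.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = 2.0 * np.exp(-1.5 * x) + 0.01 * rng.standard_normal(x.size)

residual = lambda p: p[0] * np.exp(p[1] * x) - y   # r_i = f(x_i, p) - y_i
print(gauss_newton(residual, p0=[1.0, -1.0]))      # approx. [2.0, -1.5]
```

Solving the normal equations J_r^T J_r h = J_r^T r with a linear solver, rather than explicitly inverting J_r^T J_r, is both cheaper and numerically safer.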