Dynamic Structure from Motion
using Uncalibrated Cameras and
Unsegmented Scenes
Thesis for the degree of
DOCTOR of PHILOSOPHY
by
Lior Wolf
SUBMITTED TO THE SENATE OF
THE HEBREW UNIVERSITY OF JERUSALEM
Aug 2003
This work has been carried out at the School of Computer Science and Engineering,
The Hebrew University of Jerusalem, Jerusalem, Israel, under the supervision of
Prof. Amnon Shashua
ACKNOWLEDGMENTS
First and foremost, I give special thanks to my advisor, Amnon Shashua. I was fortunate to work
under his supervision and to have the opportunity to learn not just from his wide knowledge in many
areas, but also from his choices and actions. I greatly enjoyed every single one of our meetings, and
consider myself very lucky and most honored to have been one of Amnon’s students.
I would like to thank Shmuel Peleg for inspiring me to work in the area of computer vision, and for
bringing me to the lab. I would like to thank the other professors at the lab: Daphna Weinshall, Mike
Werman and Yair Weiss for valuable advice.
I would like to thank Michal Irani, Andrew Zisserman, and Richard Hartley for supporting me and
appreciating my work. I would like to thank Peter Sturm and Yoram Singer for inspiring discussions.
I would like to thank Yoni Wexler, Anat Levin and Assaf Zomet for cooperating with me on several
projects. I would also like to thank Shay Avidan, Moshe Ben-Ezra, Yaron Caspi, Adiel Ben-Shalom and
Jeremy Kaminsky for sharing their experience and knowledge with me. I would like to thank many other
lab members for their help and friendship throughout the years.
Thanks to my parents for bearing with me during the busy, stressful times.
Finally, my gratitude to my wife Aya, for her endless love and support, and my sons Guy and Tom for
the happiness they give me.
Abstract

Much work has been done in the last decade by the Computer Vision community in understanding the
geometry of images of a rigid scene taken by a moving camera. The case of a scene containing motion
has been largely ignored. Since most video footage aims at capturing events, and hence motion, the need
for handling dynamic scenes has become apparent.
Our work deals with discovering the geometrical models and the mathematical tools that we can use
to analyze views of such scenes. In particular, we focus on the extraction of information about multiple
independently moving objects. Unlike previous work which at best tried to ignore such moving objects,
we show that valuable information can be extracted from such motions.
The main mathematical tools that we have used are projective algebra and multi-linear tensors. Pro-
jective algebra has long been used to model the process of imaging of rigid scenes. From this model
multiple views' invariants can be derived to describe, for example, stereo vision. It was shown that these
invariants can be described most generally by using multi-linear tensors. In our work, we further use
multi-linear tensors to model dynamic scenes.
In order to handle dynamic scenes we have often lifted the model of the scene to higher projective
spaces. In these higher spaces, we were able to derive novel multi-linear invariants. We then found ways
to decompose these invariants to unravel underlying information such as scene structure, motion in the
scene and the motion of the camera.
The research presented in this thesis appears in the following papers [46, 57, 33, 56, 58, 61]. These
papers are reprinted as chapters of this thesis:
Chapter 2 describes the recovery of structure and motion of a scene containing points moving along
coplanar lines. Appeared in ECCV 2000 - European Conference on Computer Vision [46].
Chapter 3 describes the recovery of structure and motion of a scene containing points in 3D moving
with constant velocity, and other scenarios. Appeared in the International Journal on Computer Vision (IJCV),
48(1), 2002 [57].
Chapter 4 describes the derivation of multilinear constraints used to recognize a moving object from a
single view. Appeared in CVPR 2001 - IEEE Conf. on Computer Vision and Pattern Recognition [33].
Chapter 5 describes the analysis of a scene containing multiple moving planes using the double
algebra. Appeared in ICCV 99 - International Conference on Computer Vision [56].
Chapter 6 describes the analysis of a scene containing two independently moving objects. Appeared
in CVPR 2001 - IEEE Conf. on Computer Vision and Pattern Recognition [58].
Chapter 7 describes the analysis of a dynamic scene viewed from two unknown cameras which are
held fixed relative to one another. Appeared in the Post-ECCV 2002 Workshop on Vision and Modeling of
Dynamic Scenes [61].
Chapter 8 describes the analysis of a scene viewed by 3D imaging devices. Some of it follows
our publication which appeared in ICPR 2001 - International Conference on Pattern Recognition [59],
offering more compact solutions, but most of it covers completely new scenarios.
Chapter 9 describes the use of tools from representation theory to derive a general solution for
many counting problems. These counting problems arise in the geometric analysis of both static and
dynamic scenes. This chapter, which was not published elsewhere, also contains a short introduction to
representation theory.
Contents
1 Introduction 1
1.1 Classical Structure from Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Projective spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Tensorial notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Extensors and the Join Operation . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Early approaches to dynamic structure from motion . . . . . . . . . . . . . . . . . . . . 5
1.3 Methods for dealing with independently moving points . . . . . . . . . . . . . . . . . . 7
1.4 Methods for dealing with multiple moving objects . . . . . . . . . . . . . . . . . . . . . 10
1.5 Unpublished chapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Homography Tensors 17
3 Projection Matrices from Pk to P2 33
4 Action indexing using dynamic shape tensors 49
5 A common transversal solution for independently translating planes 59
6 The segmentation matrix 69
7 Synchronization and reconstruction from fixed cameras viewing a dynamic scene 79
8 “3D to 3D” alignment 99
8.1 Derivation of Jtensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8.2 The Minimal Jtensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
8.3 The 3D Constant Velocity Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8.3.1 Decomposing the constant velocity tensor . . . . . . . . . . . . . . . . . . . . . 106
8.4 The Translating Lines Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
9 Counting Problems for Multilinear Constraints 111
9.1 A Representation Theory Digest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
9.2 The 8-point Shape Tensor Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
9.3 Dynamic Pn → Pn Mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
9.4 The Structure of V(n, m, k) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
9.4.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
10 Conclusions 125
10.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
10.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Chapter 1
Introduction
This introduction includes an overview of the evolution of the structure from motion (SFM) analysis
of dynamic scenes. The field of SFM has matured greatly over the last decade, but most of the work was
confined to static scenes, or to known camera motion. Motion in the scene was mostly treated as noise,
and was usually ignored. The work done in the field of dynamic structure from motion does more than
just provide the tools for analyzing dynamic scenes - it also exploits the motion in the scene in order to
extract further information about the world. Thus, not only is the motion not treated as a disturbing
element, it is also turned into an advantage.
1.1 Classical Structure from Motion
The aim of this research is to extend the field of the classical structure from motion (SFM) to dynamic
scenes. Classical SFM [25, 14] deals with static scenes viewed from moving cameras and its goal is to
recover the scene’s structure (reconstruction) and the camera ego-motion (inverse reconstruction). Our
goal is to deal with cases where both the cameras and the objects in the scene move and answer similar
questions.
The remarkable thing is that these new questions are solved using the same tools that have been
traditionally used in SFM, namely: projective spaces, tensorial notation, and the double algebra.
Projective spaces have been very successful in representing the geometry of the imaging process. In
the imaging process a 3D point in the world (X, Y, Z)^T is mapped to a 2D point on the image plane
(x, y)^T. Working with the pin-hole camera model, and assuming that the camera axes are aligned with
the world coordinate system, and that the image plane has been transformed to a "standard" coordinate
system, the imaging mapping is simply x = X/Z, y = Y/Z.
The ratio in the above formulas causes the problem to be non-linear in nature and hence difficult to
handle. The use of projective algebra provides ways not only to make the imaging mapping linear, but
also to represent large families of transformations of the world coordinate system or of the image plane
as linear mappings as well.
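As a small numerical illustration (my own sketch, not thesis code), homogeneous coordinates turn the ratio-based imaging map into a linear one: the world point is lifted to a 4-vector and a 3 x 4 matrix does the rest, up to scale.

```python
import numpy as np

# A world point (X, Y, Z) is lifted to (X, Y, Z, 1) in P^3; a 3x4 camera
# matrix M then maps it LINEARLY to a homogeneous image point in P^2.
M = np.array([[1.0, 0, 0, 0],        # the "standard" camera [I | 0]
              [0, 1.0, 0, 0],
              [0, 0, 1.0, 0]])

P = np.array([2.0, 4.0, 8.0, 1.0])   # world point (X, Y, Z) = (2, 4, 8)
p = M @ P                            # homogeneous image point, up to scale
x, y = p[0] / p[2], p[1] / p[2]      # back to inhomogeneous coordinates
print(x, y)                          # prints: 0.25 0.5, i.e. X/Z and Y/Z
```

The division is deferred to a final dehomogenization step, so everything before it (including changes of world or image coordinates) stays linear.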
Our main tools in the study of these projective spaces are tensors. These tensors are a generalization
of matrices: every entry in a matrix has two indices, whereas in a tensor it can have any number of
indices. We use the strength of tensorial notation to separate the measurements from the ego-motion
in order to build constraints. We then use these constraints to recover the ego-motion and then achieve
reconstruction.
Another tool which is used is the double algebra. The double algebra is used to describe linear
subspaces as single objects. The term extensor is used to describe a linear space spanned by several
points. A point will be an extensor of step 1, a line an extensor of step 2, and an extensor of step 3 will be
referred to as a plane. Hyper-planes are extensors of step k in P^k.
1.1.1 Projective spaces
We will be working with projective spaces, P^k. A point in P^k is defined by k + 1 numbers, not all
zero, that form a coordinate vector defined up to a scale factor. The dual projective space is defined as
the space of hyper-planes and is also represented by k + 1 numbers. A point p in a projective space
is said to coincide with a hyper-plane s if and only if p^T s = 0, i.e., their scalar (dot) product vanishes. In
other words, the set of hyper-planes coincident with the point p is represented by the coordinate vectors
s that satisfy p^T s = 0, and vice versa: a point represented by the coordinate vector p can be thought of
as the set of hyper-planes through it.
In the projective space P^k, any k + 2 points in general position can be uniquely mapped to any other
k + 2 points in the same projective space. Such a mapping is called a collineation and is represented by
a (k + 1) × (k + 1) invertible matrix, defined up to scale. A collineation is defined by k + 2 pairs of
matching points, each pair providing k linear constraints on the entries of the collineation matrix.
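For P^2 this count can be checked directly with a direct-linear-transform sketch (a standard construction, added here for illustration): k + 2 = 4 point pairs, each contributing k = 2 linear constraints from the vanishing cross product q x (Hp) = 0, determine the 3 x 3 collineation up to scale.

```python
import numpy as np

# DLT estimation of a P^2 collineation (homography) from 4 point pairs.
# Each pair (x, y) -> (u, v) contributes two rows of a linear system in
# the 9 entries of H; the null vector of the 8x9 system is H up to scale.
def fit_homography(src, dst):
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.array(rows, dtype=float))
    return Vt[-1].reshape(3, 3)          # null vector = collineation entries

src = [(0, 0), (1, 0), (0, 1), (1, 1)]
dst = [(0, 0), (2, 0), (0, 2), (2, 2)]   # the points scaled by 2
H = fit_homography(src, dst)
H = H / H[2, 2]                          # fix the free scale
print(np.round(H, 6))                    # ~ diag(2, 2, 1)
```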
A linear mapping from one projective space P^k to another projective space P^l is given by an (l + 1) ×
(k + 1) projection matrix. For example, a projection matrix from P^3 to P^2 is given by a 3 × 4 matrix; this
specific projection matrix is also known as the camera matrix and is used to model the process of imaging.
Any matching between a point in P^k and a point in P^l provides l constraints on the projection matrix
between these spaces.
In this work, the term center of a projection matrix refers to the null space of the projection
matrix. If the projection matrix is from P^k to P^l then the center of the projection matrix is the
rank k − l linear subspace (or, using another terminology, the extensor of step k − l, see below) in P^k
which is mapped by this projection matrix to zero.
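For the familiar P^3 -> P^2 camera this center is a single point, the camera center, which can be read off as the null vector of the matrix. A small sketch (illustrative values, not from the thesis):

```python
import numpy as np

# The center of a 3x4 projection matrix is its one-dimensional null
# space: the unique point of P^3 that is mapped to zero (has no image).
M = np.array([[1.0, 0, 0, -2],      # a camera placed at (2, 3, 4)
              [0, 1.0, 0, -3],
              [0, 0, 1.0, -4]])
_, _, Vt = np.linalg.svd(M)
center = Vt[-1]                     # null vector of M
center = center / center[3]         # normalize the homogeneous scale
print(np.round(center, 6))          # [2. 3. 4. 1.], and M @ center = 0
```

For a general P^k -> P^l matrix the same SVD yields a (k − l)-dimensional null space rather than a single vector.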
1.1.2 Tensorial notations
It is often more convenient to use tensor notations to represent linear operations. In these notations
the coordinates of a point are specified with superscripts, i.e., in P^2, p^i = (p^1, p^2, p^3). These are called
contravariant vectors. A hyper-plane in P^k is called a covariant vector and is represented by subscripts,
i.e., in P^2, s_j = (s_1, s_2, s_3). Indices repeated in covariant and contravariant forms are summed over, i.e.,
p^i s_i = p^1 s_1 + p^2 s_2 + p^3 s_3. This is known as a contraction. For example, if p is a point incident to a line
s in P^2, then p^i s_i = 0.
Vectors are also termed 1-valence tensors. 2-valence tensors (matrices) have two indices and the
transformation they represent depends on the covariant-contravariant positioning of the indices. For
example, a_i^j is a mapping from points to points (a collineation, for example), and from hyper-planes
(lines in P^2) to hyper-planes, since a_i^j p^i = q^j and a_i^j s_j = r_i (in matrix form: Ap = q and A^T s = r);
a_ij maps points to hyper-planes; and a^ij maps hyper-planes to points. When viewed as a matrix the row
and column positions are determined accordingly: in a_i^j and a_ij the index i runs over the columns and
j runs over the rows, thus b_j^k a_i^j = c_i^k is BA = C in matrix form. An outer-product of two 1-valence
tensors (vectors), a^i b_j, is a 2-valence tensor c_i^j whose entries are a^i b_j; note that in matrix form
C = ba^T. A 3-valence tensor has three indices, say H_i^jk. The positioning of the indices reveals the
geometric nature of the mapping: for example, p^i s_j H_i^jk must be a point because the i, j indices drop out
in the contraction process and we are left with a contravariant vector (the index k is a superscript). Thus,
H_i^jk maps a point in the first coordinate frame and a line in the second coordinate frame into a point
in the third coordinate frame. The trifocal tensor of multiple-view geometry is an example of such a
tensor. A single contraction, say p^i H_i^jk, of a 3-valence tensor leaves us with a matrix. Note that when p
is (1, 0, 0), (0, 1, 0), or (0, 0, 1) the result is a "slice" of the tensor.
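These contractions map directly onto `numpy.einsum`; the following sketch (my illustration, with a random tensor standing in for a trifocal tensor) checks both the point-and-line contraction and the slice property.

```python
import numpy as np

# A 3-valence tensor H_i^{jk} contracted with a point p^i and a line s_j
# leaves a single contravariant index k: a point in the third frame.
rng = np.random.default_rng(0)
H = rng.standard_normal((3, 3, 3))    # axis 0 = i, axis 1 = j, axis 2 = k
p = rng.standard_normal(3)            # point p^i
s = rng.standard_normal(3)            # line s_j
q = np.einsum('ijk,i,j->k', H, p, s)  # double contraction -> point q^k

# A single contraction leaves a matrix, a linear combination of the
# three slices H[0], H[1], H[2] of the tensor:
G = np.einsum('ijk,i->jk', H, p)
print(np.allclose(G, p[0]*H[0] + p[1]*H[1] + p[2]*H[2]))   # True
```

Contracting the remaining matrix with s recovers q, since the contractions can be performed in any order.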
1.1.3 Extensors and the Join Operation
The mathematical component of our work deals with intersecting and joining subspaces for the pur-
pose of finding common transversals in the 8-dimensional projective space P^8. A convenient way
to do so is to treat a k-dimensional subspace as a single object (instead of as a collection of k basis
vectors), which is done using Grassmann coordinates, also known as an extensor of step k. Generally, the
algebra of extensors with the operations of intersection ("meet") and union ("join") is also known as the
double algebra or Grassmann-Cayley algebra. These were first introduced in the context of multiple-
view geometry by [7, 18, 17] and also in the context of projection matrices P^k → P^2 [57]. A concise
introduction to extensors and the operations of meet and join can be found in [51, 3].
An extensor of step k describes a subspace of dimension k of some n-dimensional vector space V.
All extensors of step k lie in the linear space ∧^k(V), which is of dimension (n choose k). The join
operator (∨) is a multilinear antisymmetric operator that takes two extensors of steps j and k and
produces an extensor of step j + k. The join extensor is associated with the direct sum of the linear
spaces associated with the two extensors. This join extensor vanishes if the two generating extensors
intersect. If e_1, e_2, ..., e_n is a basis of V then the basis for ∧^k(V) is given by the (n choose k)
basis elements:

{e_j1 ∨ e_j2 ∨ ... ∨ e_jk | 1 ≤ j1 < ... < jk ≤ n}
Let A = span{a_1, ..., a_k} be a k-dimensional subspace of V where a_1, ..., a_k is some choice of basis.
The step-k extensor A = a_1 ∨ ... ∨ a_k, also denoted by A = a_1 a_2 ... a_k, is an element of the vector space
∧^k(V):

A = Σ_{1 ≤ j1 < ... < jk ≤ n} A_{j1,...,jk} e_j1 ∨ ... ∨ e_jk

where the scalars A_{j1,...,jk} are the k × k minors:

A_{j1,...,jk} = det | a_{1,j1} a_{1,j2} ... a_{1,jk} ; a_{2,j1} a_{2,j2} ... a_{2,jk} ; ... ; a_{k,j1} a_{k,j2} ... a_{k,jk} |

Thus the extensor A has (n choose k) coefficients (the choices of k × k minors from the k × n matrix whose rows
consist of a_1, ..., a_k). The extensor A represents the subspace A, as we note that

A = {u ∈ V | A ∨ u = 0}

(all (k+1) × (k+1) minors vanish, therefore u ∈ span{a_1, ..., a_k}), while on the other hand the determinant
expansions are invariant to a change of basis of A.
Let A = a_1 ... a_k and B = b_1 ... b_j be extensors of steps k, j representing subspaces A, B, with k + j ≤
n. Then A ∨ B = a_1 ... a_k b_1 ... b_j is non-zero (at least one coefficient does not vanish) iff the set
a_1, ..., a_k, b_1, ..., b_j is linearly independent (i.e., A ∩ B = {0}). In this case,

A + B = A ∨ B = span{a_1, ..., a_k, b_1, ..., b_j}

Thus, the algebraic join of extensors corresponds to the geometric join of linear subspaces. Con-
versely, in case k + j > n the subspaces A, B always have a non-vanishing intersection: a (k + j − n)-
dimensional linear space. Thus, it is possible to define a "meet" operation A ∧ B which is a linear
combination of extensors of step k + j − n.
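A hedged sketch of the coefficient/join machinery (my own illustration, using a line and a point in P^3): Grassmann coordinates are the k x k minors of the spanning matrix, the join is the extensor of the stacked matrices, and the join vanishes exactly when the two subspaces intersect.

```python
import numpy as np
from itertools import combinations

# Grassmann coordinates of a step-k extensor: all k x k minors of the
# k x n matrix of spanning vectors, in lexicographic column order.
def extensor(rows):
    A = np.asarray(rows, dtype=float)
    k, n = A.shape
    return np.array([np.linalg.det(A[:, cols])
                     for cols in combinations(range(n), k)])

a1, a2 = [1, 0, 0, 0], [0, 1, 0, 0]     # a 2-subspace of R^4 (a line in P^3)
b = [0, 0, 1, 0]                        # a 1-subspace (a point in P^3)

join = extensor([a1, a2, b])            # step-3 extensor: the joined plane
print(np.any(np.abs(join) > 1e-9))      # True: the point is not on the line

on_line = extensor([a1, a2, [3, 4, 0, 0]])   # joining with a point ON the line
print(np.allclose(on_line, 0))          # True: the join vanishes
```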
1.2 Early approaches to dynamic structure from motion
Almost every system which computes structure from motion has to deal with motion in the scene.
However, the existing SFM techniques were designed for static scenes. A dominant approach to dealing
with scene motion (e.g. [30]) is to try to separate the static background from the moving objects. The
camera motion (ego-motion) is recovered using the background, the images are registered, and the mov-
ing objects can be segmented out. However, this object/background segmentation is a difficult task by
itself, and solutions are usually based on the dominance of the background. The tools we develop in this
work are aimed at treating moving points and static points alike, i.e., at working with unsegmented scenes. One of
the advantages of such an approach is the ability to work with scenes containing no dominant regions, or
even no static or rigid regions.
A more systematic alternative to the approach described in this thesis is the work done by Torr [52].
In that work the case of multiple moving objects is considered. Using a sampling algorithm (such as
RANSAC) and model fitting techniques, several models are fit to explain the motion of feature
points. This approach has the advantage that in principle the number of moving objects and their com-
plexity (they can be either degenerate, e.g. planar, or not) need not be specified in advance. In this work we
present analytical solutions, rather than statistical solutions, to deal with dynamic scenes. This prefer-
ence enables us to treat, for example, the case of points moving independently, while not being clustered
into rigid objects.
Some analytical solutions were suggested in the past for the case of a dynamic scene containing sev-
eral rigid objects moving independently. The factorization-based motion segmentation of Costeira and
Kanade [12] is applicable to the affine (parallel projection) camera model. There it was shown that the
measurement matrix of all points across a sequence of images lies in a linear subspace whose dimension
is determined by the number of independent bodies. The motion of each body lies in a separate subspace,
and the data can be rearranged in order to separate the subspaces (see also [31]).
In contrast to the factorization based approaches, in our work we assume a full projective model (but
also address the affine model). Moreover, factorization based approaches need more than the minimal
number of views, requiring point tracks to be maintained over many frames. For example, in the case
of two independently moving rigid objects we require 2 views, even for the projective camera case.
The factorization approach would have required at least four views, and usually many more views are
required in order to get stable results.
In parallel to our work, Fitzgibbon and Zisserman [19] also demonstrated that some benefits arise from
considering a dynamic scene. They addressed the situation of several segmented independently moving
objects, and showed how to combine constraints on the cameras' internal parameters arising from several
objects. The solutions given in that work were nonlinear minimizations. In our work we usually ignore
the problem of self calibration. We usually perform reconstructions only up to a projective reconstruc-
tion. In some cases, such as the constant velocity case, we perform an Affine reconstruction, from which
a Euclidean reconstruction is readily obtained [25]. This enables us to propose solutions which are "lin-
ear" in nature (i.e., do not require solving nonlinear systems of equations). Nonlinear minimization
is performed only to handle noise.
1.3 Methods for dealing with independently moving points
The first work to deal with the case of points moving independently in space, and not just of points
spread among several rigid bodies, was done by Avidan and Shashua [2]. This work and those that
followed [43, 40] considered the case they call "trajectory triangulation". In SFM, the term triangulation
refers to the recovery of the locations of 3D points from the image measurements in the case where
the camera parameters are known ("the calibrated case"). In trajectory triangulation, the point in 3D is
allowed to move along some parametric path, for example a line or a conic section. Note that every
single image contains information from only one instant in time. Without adding constraints on the type of
motion, the problem of trajectory triangulation is inherently ill-posed.
In [2], the trajectory of each point is linear. Each trajectory line is represented by its Plucker
coordinates. Given the original 3 × 4 camera projection matrices M(i) it is possible to build 3 × 6
line projection matrices M̃(i) which project each 3D line L to a 2D image line l, according to l ≅ M̃L
[17]. The three rows of M̃(i) are the result of the "meet" [3] operation on pairs of rows of the original
3 × 4 camera projection matrix, i.e., each row of M̃ represents the line of intersection of the two planes
represented by the corresponding rows of M.
We can try to extend the results of the trajectory triangulation scheme to the uncalibrated case in a
straight-forward way. Let P be the moving point along the straight line L such that in the j'th view we
observe the projection p_j of P. The point p_j has to be on the image of the trajectory line. Thus,
p_j^T M̃(j) L = 0 for all views of P, where M̃(j) is the 3 × 6 line projection matrix of the j'th view.
The determinant of the 6 × 6 matrix whose rows are p_j^T M̃(j) must
vanish. This determinant is a multilinear expression in the measurements p_j, and can be expressed as a
tensor. The resulting tensor has 3^6 = 729 elements and thus would require 728 matching points across 6 views
in order to obtain a linear solution. Naturally, this situation is unwieldy application-wise. Considering
even more complex types of trajectories, such as conic trajectories, gives rise to even less tractable solutions.
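The vanishing-determinant constraint can be verified numerically. In this sketch (my own construction: the 3 x 6 line-projection matrix is built by pushing the six basis Plucker matrices through the identity [l]_x ~ M L M^T, rather than by the meet operation), a point moves along a 3D line while six random cameras observe it, and the 6 x 6 matrix of rows p_j^T M̃(j) turns out rank deficient.

```python
import numpy as np

# Build the 3x6 line-projection matrix of a camera M by mapping each of
# the six basis Plucker matrices E_ab to its image line.
def line_projection(M):
    basis = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    cols = []
    for (a, b) in basis:
        E = np.zeros((4, 4)); E[a, b], E[b, a] = 1.0, -1.0
        lx = M @ E @ M.T
        cols.append([lx[2, 1], lx[0, 2], lx[1, 0]])
    return np.array(cols).T                  # 3 x 6

def plucker(A, B):                           # Plucker 6-vector of line (A, B)
    L = np.outer(A, B) - np.outer(B, A)
    return np.array([L[0,1], L[0,2], L[0,3], L[1,2], L[1,3], L[2,3]])

rng = np.random.default_rng(2)
A = np.array([0.0, 1.0, 2.0, 1.0])           # trajectory line A + t(B - A)
B = np.array([1.0, 1.0, 1.0, 1.0])
rows = []
for t in range(6):                           # six views; the point moves each frame
    M = rng.standard_normal((3, 4))
    p = M @ (A + t * (B - A))                # image of the moving point at time t
    rows.append(p @ line_projection(M))

s = np.linalg.svd(np.array(rows), compute_uv=False)
print(s[-1] / s[0] < 1e-9)                   # True: the 6x6 matrix is singular
```

The common null vector of the rows is exactly the Plucker vector of the trajectory line, which is what trajectory triangulation recovers.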
We deal with this situation by adding more constraints to the nature of the motion. Sometimes we
constrain the scene to be planar, sometimes we constrain the trajectories to be parallel, and sometimes
we constrain the motion to be of constant velocity.
A solution for linear trajectories in a planar scene with unknown homographies is given in [46] and
is described in chapter 2. Similarly to other work presented in this thesis, the major effort is put not on
the derivation of the tensorial constraint, but on its analysis. The major question is how to recover the
camera motion from this tensor. Another question is how general the points and their motion must be
in order to recover the tensor without ambiguity.
Interestingly, the study of the tensor associated with linear planar trajectories is dual to the study of the
tensor associated with slices of the quadrifocal tensor [47]. The latter tensor is the result of contracting
the quadrifocal tensor with a single line, and represents the situation where three lines intersect in one
point. The planar trajectory triangulation tensor represents the situation where three points lie on one
line. These dual situations are projective situations, i.e., they are invariant to a projective transformation
of the coordinate system [37].
In chapter 3 we describe work which incorporated Affine constraints into the motion of the moving
points. One type of constraint is the constant velocity constraint; another is the pure translation constraint.
Combined with other projective constraints, and applied in both 2D and 3D settings, a wealth of tensors
are described.
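The constant velocity case illustrates the lifting idea of chapter 3 concretely. In this sketch (my own toy example of the construction, not thesis code), a point X(t) = X0 + tV seen by a static camera is equivalent to a static point (X0; V; 1) in P^6 seen through a time-varying 3 x 7 projection matrix:

```python
import numpy as np

# A constant-velocity point under a static camera M = [A | a] satisfies
#   M (X0 + tV; 1) = A X0 + t A V + a = [A | t*A | a] (X0; V; 1),
# i.e. a STATIC lifted point in P^6 under a 3x7 projection per frame.
rng = np.random.default_rng(3)
M = rng.standard_normal((3, 4))
A_, a = M[:, :3], M[:, 3]
X0 = np.array([1.0, 2.0, 3.0])            # initial position
V = np.array([0.1, -0.2, 0.3])            # constant velocity

lifted = np.concatenate([X0, V, [1.0]])   # static point in P^6
for t in range(4):
    Mt = np.hstack([A_, t * A_, a[:, None]])   # lifted 3x7 projection
    direct = M @ np.append(X0 + t * V, 1.0)    # ordinary imaging at time t
    print(np.allclose(Mt @ lifted, direct))    # True at every t
```

Once the problem is written this way, the machinery of P^k -> P^2 projection matrices applies, and because the constant-velocity assumption is an Affine invariant, the recovered structure is Affine rather than merely projective.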
In order to handle dynamic scenes we have often lifted the model of the scene to higher projective
spaces. In these higher spaces, we were able to derive novel multi-linear constraints. The tensor defini-
tion and recovery is just the first stage. We then analyze the structure of each tensor to get a projective
reconstruction in the higher projective space. We then further analyze the tensors to unravel the under-
lying information such as scene structure, motion in the scene and the motion of the camera. Since we
start with Affine invariants such as constant velocity, we end up with an Affine structure, and not merely
a projective structure.
After modelling the problem as a projection from one projective space to another, the derivation of
the underlying tensors is automatic. However, the next stages of decomposing the tensors are the most
demanding stages. The first stage of obtaining the projective structure is independent of the underlying
problem, and depends only on the projective spaces at hand (except for some degenerate cases which
are a result of a specific modelling). Nevertheless, no method exists for automatically achieving this
decomposition. We have developed some tools for handling these decompositions, such as the “joint
epipoles”, which we believe are applicable to a wide range of such tasks, but the general method is left
for further research.
In classic SFM, besides the multiple view invariants (the tensors), there is what is sometimes con-
sidered to be their dual: multi-point tensors, called the "shape tensors" [54]. In chapter 4 we show
how to find shape tensors for any kind of projection matrix, and use those to index single images ac-
cording to the action being photographed. The idea is to represent an action (e.g. "sitting", "walking")
as a combination of trajectories of different parts of the body. Given an image, we check whether the
configuration of body parts in it matches one of the models for which we have built dynamic shape
tensors.
In parallel to our work, Han and Kanade proposed solutions for the constant velocity case using factor-
ization. They first proposed a solution for the Affine projection model [22], and then for the projective
camera model [23]. Both solutions are based on factorization, which has the limitations mentioned
above. Also, the factorization for the projective case is not guaranteed to converge to the correct solution.
These methods require more views than the minimal number, but have the advantage that they
provide a way to incorporate information from many views at once.
Following our work, Wexler and Shashua [55] used the homography tensors presented in chapter 2
to synthesize a new view of a dynamic scene. Their work assumes constant velocity and could take
advantage of the results presented in chapter 3. Levin and Shashua [32] considered the infinitesimal
motion model as the camera model to derive similar results for linear trajectories.
In [26], the results of chapter 3 are rederived using a method which can be considered a descendant
of the relative affine framework [45]. In our view their framework is much less intuitive as a general-
ization of the classical SFM techniques. For example, in their proposed framework the center of the
projection matrix is always a point, whereas in our formalization it is the null space of the projection
matrix. The authors claim to generalize and complete our results, but in fact just show several new
examples of tensors, ignoring both the intractability of large tensors and the problem of decomposing
those tensors.
1.4 Methods for dealing with multiple moving objects
As stated above, the majority of previous work which handled dynamic scenes has focused on multiple
moving objects rather than on independently moving points. These previous approaches were either
sampling based or limited to the Affine projection model.
In chapter 3 we described, using the framework of lifting the problem to a higher projective space,
some solutions to handle multiple moving objects. These solutions were confined to the case where the
relative motion between the objects was a pure translation. In chapter 5 we continue to study the pure
translation case.
Consider two views of a scene containing multiple moving bodies. Each moving body is associated
with its own fundamental matrix. We show that if the bodies move relative to one another by pure translation,
all these fundamental matrices reside in a 3-dimensional subspace of R^9. We seek to generalize
the classic result that two homographies associated with two planar parts of the scene are sufficient to re-
cover the static fundamental matrix. We show that five homographies associated with five planar bodies
are sufficient for the recovery of the 3-dimensional subspace mentioned above. Once it is recovered we
are able to recover the fundamental matrix associated with each of the views as well as the homography
at infinity (hence we achieve an Affine reconstruction of the dynamic scene).
We solve this problem by associating with each homography the subspace of the fundamental matrices
which conform to this homography. Each homography provides 6 linear constraints on the elements
of the fundamental matrix, hence the dimension of this linear subspace is also 3. All the subspaces
associated with the homographies arising from one dynamic pure translation scene have to intersect the
subspace of all possible fundamental matrices of the scene (this last subspace is in fact the subspace of
fundamental matrices which conform to the homography at infinity).
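The 6-constraints-per-homography count can be checked with standard two-view facts (this sketch is my illustration, not thesis code): a fundamental matrix F conforms to a plane homography H iff H^T F is antisymmetric, and requiring the symmetric part of H^T F to vanish gives 6 linear equations on the 9 entries of F, leaving a 3-dimensional subspace.

```python
import numpy as np

def skew(t):
    return np.array([[0., -t[2], t[1]], [t[2], 0., -t[0]], [-t[1], t[0], 0.]])

rng = np.random.default_rng(4)
t = np.array([1.0, 2.0, 3.0])                       # translation between views
R = np.linalg.qr(rng.standard_normal((3, 3)))[0]    # rotation between views
n, d = np.array([0.0, 0.0, 1.0]), 5.0               # a world plane n^T X = d
H = R + np.outer(t, n) / d                          # homography induced by the plane
F = skew(t) @ R                                     # the true fundamental matrix
print(np.allclose(H.T @ F + F.T @ H, 0))            # True: F conforms to H

# The 6 equations sym(H^T F) = 0 as a linear system in the 9 entries of F:
rows = []
for i in range(3):
    for j in range(i, 3):
        r = np.zeros(9)
        for k in range(3):
            r[3*k + j] += H[k, i]                   # coefficient of F[k, j]
            r[3*k + i] += H[k, j]                   # coefficient of F[k, i]
        rows.append(r)
C = np.array(rows)                                  # 6 x 9 constraint matrix
print(np.linalg.matrix_rank(C), np.allclose(C @ F.ravel(), 0))   # 6 True
```

A 9 − 6 = 3 dimensional null space remains, matching the dimension count in the text.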
Conceptually, this problem of finding a linear subspace which intersects several linear subspaces is
a general form of the trajectory triangulation described above. We solve this problem using the double
algebra. This work is unique in the sense that, until now, most of the work in computer vision which
used the double algebra used it to derive results for which other proofs already existed. In this work the double
algebra is not just an elegant tool, but also an inherent part of the solution.
The analytic approach we use to handle dynamic scenes is not limited to scenes with pure translation.
In chapter 6 we describe a solution to the two-view multibody problem, where the motion between the
two views can be general.
As mentioned above, the multibody problem was the first problem to be considered in the field of
dynamic SFM. Solutions were either sampling based, or constrained to the Affine projection model. In
chapter 6 we propose a different approach. Each body is associated with a different invariant (funda-
mental matrix). Given image measurements of a point in one image and of a corresponding point in a
second image, we know that one of the two invariants must vanish. Since we do not know which one of
these vanishes for each point (this is the segmentation problem) we just multiply the two invariants. The
product has to vanish for points on both bodies.
This simple scheme bears a difficulty: each original invariant was linear in each one of the point mea-
surements. The product invariant is bilinear in each one of these points. We handle this by representing
the invariant as being linear in the second order monomials of the point measurements. We then show
how to decompose this representation to the fundamental matrices of each body.
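The lifting step can be illustrated numerically: the product of two epipolar residuals is exactly a bilinear form in the second-order monomials of the two points, with the 9×9 coefficient matrix given by a Kronecker product. The following sketch is illustrative only; F1 and F2 are random stand-ins for the two bodies' fundamental matrices, and the variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_fundamental(rng):
    # A random rank-2 matrix, standing in for a fundamental matrix.
    U, _, Vt = np.linalg.svd(rng.standard_normal((3, 3)))
    return U @ np.diag([1.0, 0.5, 0.0]) @ Vt

F1, F2 = random_fundamental(rng), random_fundamental(rng)

# Random homogeneous image points (no epipolar constraint assumed).
p  = rng.standard_normal(3)   # point in the first image
pp = rng.standard_normal(3)   # corresponding point in the second image

# Product of the two epipolar residuals ...
product = (pp @ F1 @ p) * (pp @ F2 @ p)

# ... equals a single form, linear in the second-order monomials:
# (p' (x) p')^T (F1 (x) F2) (p (x) p)
S = np.kron(F1, F2)           # 9x9 "product" matrix
lifted = np.kron(pp, pp) @ S @ np.kron(p, p)

assert np.isclose(product, lifted)
```

Because the lifted form is linear in the unknown 9×9 matrix, each point match contributes one linear equation on it, which is what makes a linear estimation scheme possible.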
Although the new invariant has many desirable properties, such as using point measurements from
both bodies at once and insensitivity to degenerate bodies, its dependence on the second-order monomials
makes it less stable to compute than single fundamental matrices. To overcome this problem, we
suggest a nonlinear minimization technique for its computation.
All of the work described in this thesis was done using the projective camera model, except for the
work described in chapter 7. In this work we use the Affine camera model to derive our results, and just
briefly describe how to generalize this to the projective camera model.
A 3D reconstruction of a dynamic non-rigid scene from features in two cameras usually requires
synchronization and correspondences between the cameras. These may be hard to achieve due to occlusions,
a wide baseline, different zoom scales, etc. In chapter 7 we present an algorithm for reconstructing
a non-rigid scene from sequences acquired by two uncalibrated, non-synchronized, fixed Affine cameras.
This algorithm assumes that (possibly) different points are tracked in the two sequences. The only
constraint used to relate the two cameras is that every 3D point tracked in one sequence can be described
as a linear combination of some of the 3D points tracked in the other sequence. This constraint lies
somewhere between the independently moving points problem presented in the previous section and
the multiple rigid bodies problem presented in this section.
We present algorithms for synchronizing the two sequences and reconstructing the 3D points tracked
in both views. Outlier points are automatically detected and discarded. The algorithm can also handle
both 3D objects and planar objects in a unified framework without the need for model selection.
Following the work of Irani [27], we were able to derive a "direct method" version of our synchronization
algorithm. This version does not use point tracks as its input; instead it uses only gray-level
measurements from the images. By avoiding the point-tracking stage, we were able to suggest a very
simple technique for action indexing.
In parallel to the work presented in chapter 5, Manning and Dyer [36] presented a solution for a
simplified scenario. In their work it is assumed that two fundamental matrices of the dynamic scene have
already been recovered, and what remains is to solve for the homography at infinity.
Following the work described in chapter 6, Vidal et al. [53] described solutions for more than two
rigid bodies viewed by two cameras. Although this generalization is possible, the resulting invariants are
intractable.
Bartoli [4] used the idea presented in chapter 6 to combine two types of constraints arising from the
dynamic and static parts of a combined scene. The dynamic part consists of points moving along planar
lines all intersecting at a point. This scenario can be modelled as a projection from P^3 to P^2, hence the
resulting invariant is similar to the fundamental matrix.
Caspi and Irani [11] proposed a different solution for the problem of synchronizing two fixed cameras
viewing a dynamic scene. Their solution assumes the existence of a point that is tracked in
both sequences. Starting with a small number of point tracks, they search over all pairs of possible
matchings to find the best matching pair of tracks, and use its measurements across time to compute the
fundamental matrix.
Zelnik-Manor and Irani [62] revisited the setting proposed in chapter 7 as part of a comprehensive
study of rank conditions on measurement matrices. Using the more restrictive assumption that the same
points are tracked by both cameras, they were able to obtain very accurate synchronization results with
a similar algorithm. Under their assumption the whole reconstruction problem becomes a matching
problem, and using a brute-force search they were able to show that this matching is stable. Using the
reprojection method we describe in chapter 7, this search can be avoided.
1.5 Unpublished chapters
In chapters 2-7 we focus on scenes captured by 2D images. In chapter 8 we consider scenes which
have been captured by 3D imaging devices (e.g., structured-light systems, stereo systems). We consider
scenarios where points move along straight lines, similarly to chapter 2; points which move with constant
velocity, similarly to chapter 6; and 3D lines which move in pure translation. The results obtained follow
the same use of tools as the rest of the thesis, and this unpublished chapter is rather technical in
nature.
Chapter 9, which follows, is theoretical. In this chapter, which was developed with the help of Prof.
Roy Meshulam (The Technion) and Prof. Gil Kalai (The Hebrew University), we use representation
theory as a tool for solving "counting questions". These questions appear whenever we resort to
multi-linear algebra to derive invariants. Examples of such questions are: "How many points in
general position are needed to solve linearly for the fundamental matrix?", "How many constraints on
the elements of the trifocal tensor can we obtain using images of points lying on one plane in 3D?",
"What is the rank of the estimation matrix whose rows are the outer products of three images of a point
in 3D?".
In the past, these questions were solved one at a time. As we began to present more and more tensors,
each with its own counting problems, the need for a general tool for solving them became apparent.
In chapter 9 we approach one such abstract question. Its general solution is shown to
have applications to the analysis of constraints arising from both static scenes and dynamic scenes.
Chapter 2
Homography Tensors
Homography Tensors: On Algebraic Entities That Represent Three Views of Static or Moving Planar Points
Amnon Shashua, Lior Wolf
Published in Proc. of the European Conference on Computer Vision (ECCV)
June 2000.
Chapter 3
Projection Matrices from P^k to P^2
On Projection Matrices P^k → P^2, k = 3, 4, 5, 6, and their Applications in Computer Vision.
Lior Wolf, Amnon Shashua
Published in the International Journal of Computer Vision (IJCV)
48(1) 2002.
Chapter 4
Action indexing using dynamic shape tensors
Time-varying Shape Tensors for Scenes with Multiply Moving Points
Anat Levin, Lior Wolf, Amnon Shashua
Published in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)
Dec. 2001.
Chapter 5
A common transversal solution for
independently translating planes
Affine 3-D Reconstruction from Two Projective Images of Independently Translating Planes
Lior Wolf, Amnon Shashua
Published in The Eighth IEEE International Conference on Computer Vision (ICCV)
June 2001.
Chapter 6
The segmentation matrix
Two-body Segmentation from Two Perspective Views
Lior Wolf, Amnon Shashua
Published in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)
Dec. 2001.
Chapter 7
Synchronization and reconstruction from fixed
cameras viewing a dynamic scene
Correspondence-free Synchronization and Reconstruction in a Non-rigid Scene
Lior Wolf, Assaf Zomet
Published in the Post-ECCV 2002 Workshop on Vision and Modeling of Dynamic Scenes
May 2002.
Chapter 8
“3D to 3D” alignment
Consider the classic problem of "3D to 3D" alignment of point sets. One is given a set of 3D points
P_1, ..., P_n measured by some device, such as one based on structured light [48] or on triangulation from
a stereo rig of cameras. The measuring device has changed its position in space (while the set of 3D
points has remained static in space) and the corresponding 3D positions are P'_1, ..., P'_n, i.e., the measured
points have undergone a coordinate transformation. In a projective setting, five of these matching
pairs in general position are sufficient to recover the 4×4 collineation A such that AP_i ≅ P'_i, i = 1, ..., n.
In a rigid-motion setting, the coordinate transformation consists of translation and rotation, which can be
recovered using 4 matching points; elegant techniques using SVD have been developed for this
purpose [21].
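The SVD-based solution for the static rigid case can be sketched as follows. This is a minimal illustration of the standard Arun-style technique, not an algorithm from this thesis; all names are ours:

```python
import numpy as np

def rigid_align(P, Q):
    """Least-squares R, t with Q_i ~ R P_i + t, via the SVD of the
    cross-covariance (Arun/Kabsch-style).  P, Q: (n, 3) matched 3D points."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                    # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Correct a possible reflection so that det(R) = +1.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t

# Synthetic check: recover a known rotation and translation exactly.
rng = np.random.default_rng(1)
P = rng.standard_normal((10, 3))
a = 0.7
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
Q = P @ R_true.T + t_true
R, t = rigid_align(P, Q)
assert np.allclose(R, R_true) and np.allclose(t, t_true)
```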
In this section we introduce "dynamic" versions of the 3D-to-3D alignment problem. The first dynamic
version allows any number of points to move along straight-line paths during the motion
of the measuring device. Points that remain in place are called static and points that move are called
dynamic. There may be any number of dynamic points, including the possibility that all points are
dynamic, and the system need not know in advance which points are static and which are dynamic
(the unsegmented configuration). Under these conditions we wish to find the projective coordinate changes
across two motions of the measuring device.
A previous work [59] derived a 4×4×4 family of tensors, referred to as join tensors, or Jtensors for
short, that capture the dynamic 3D-to-3D alignment problem. A matching triplet P, P', P'' of a point
measured at three time instances contributes a linear measurement for the Jtensor, regardless of whether
the physical point in space is dynamic or static while the measuring device changes positions. The
linear constraints add up to a 4-dimensional null space of Jtensors; that is, there exist 4 distinct Jtensors
which are linearly recovered from matching points. These Jtensors, however, are not minimal. We will
derive a minimal tensor which requires fewer measurements and gives us one tensor per
estimation matrix.
We will also consider the constant velocity case. For points that are either static or move at constant
velocity we introduce a smaller tensor that brings some advantages. First, the tensor is smaller, so
fewer measurements are required for its recovery. Second, since constant velocity is an affine invariant,
we will recover the change in coordinate system up to an affine (not projective) ambiguity. Our approach
here is closest to the one presented in [57], where a full projective camera was used. Having 3D
information allows us to use fewer views and measurements than the solution given there.
The third case we consider is that of translating lines. In 3D measurements, such
as range data, points are not always well defined. We consider a case where, instead of tracking points,
we track lines. The motion of the lines is restricted in such a way that every point on each line moves in
the same direction as the other points on that line. In this case only two views are needed in order to recover
the relative position of the coordinate systems.
In a separate work, Sturm [50] derived multiple-view tensors of this family to deal with the case
where points in 3D move along linear paths which are constrained to intersect some line. A full analysis
is given, including the analysis of ambiguities and the possibility of performing Euclidean calibration
using these tensors.
8.1 Derivation of Jtensors
Let X be some point in 3D space with a coordinate vector P. Let P' be the coordinate representation
of the point X at some other time instance (say, after the measuring device has changed its viewing position)
and let P'' be the coordinate representation of X at a third time instance. Let A, B be the collineations
mapping the second and third coordinate representations back to the first, i.e., P ≅ AP'
and P ≅ BP''.
If the point X happens to move along some straight-line path during the change of coordinate systems,
then P, AP', BP'' do not coincide, but they form a rank-2 matrix:

rank [ P  AP'  BP'' ] = 2

And for every column vector V we have

det [ P  AP'  BP''  V ] = 0     (8.1)
Note that because V is spanned by a basis of size four, we can obtain at most four linearly independent
constraints on some object consisting of A, B from a triplet of matching points P, P', P''. Note also that
the null vector of a 4×3 matrix can be represented by 3×3 determinant expansions. For example, let
X, Y, Z be the three column vectors of a 4×3 matrix; then the vector W representing the plane defined by
the points X, Y, Z is:

w_1 =  det [ x_2 y_2 z_2 ; x_3 y_3 z_3 ; x_4 y_4 z_4 ]
w_2 = −det [ x_1 y_1 z_1 ; x_3 y_3 z_3 ; x_4 y_4 z_4 ]
w_3 =  det [ x_1 y_1 z_1 ; x_2 y_2 z_2 ; x_4 y_4 z_4 ]
w_4 = −det [ x_1 y_1 z_1 ; x_2 y_2 z_2 ; x_3 y_3 z_3 ]
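This signed-minor expansion is easy to verify numerically; a short sketch (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 3))    # columns are the points X, Y, Z

# Null vector W via signed 3x3 minors: w_i = (-1)^i * det(M with row i deleted)
# (0-indexed signs +,-,+,- match the w_1..w_4 expansion above).
W = np.array([(-1) ** i * np.linalg.det(np.delete(M, i, axis=0))
              for i in range(4)])

# W is the plane through the three points: it is orthogonal to every column,
# because W . v equals det([v, X, Y, Z]), which vanishes for v in {X, Y, Z}.
assert np.allclose(W @ M, 0.0, atol=1e-9)
```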
We can write the relationship between W and X, Y, Z as a tensor operation as follows:

w_i = ε_ilmu x^l y^m z^u

where the entries of ε consist of +1, −1, 0 in the appropriate places. We will refer to ε as the "cross-product"
tensor. Note that the determinant of a 4×4 matrix whose columns consist of [X, Y, Z, T] can
be compactly written as:

t^i x^l y^m z^u ε_ilmu.
Using the cross-product tensor we can write the constraint of eqn. 8.1 as follows:

0 = det [ P  AP'  BP''  V ] = P^i ε_ilmu (A^l_j P'^j)(B^m_k P''^k) V^u = P^i P'^j P''^k (ε_ilmu A^l_j B^m_k V^u)
Note that the tensor form allows us to separate the measurements P, P', P'' from the unknowns A, B,
and we denote the expression in parentheses:

J_ijk = ε_ilmu A^l_j B^m_k V^u     (8.2)

as the "join"¹ tensor, or Jtensor for short. Note that for every choice of the vector V we get a Jtensor. As
previously mentioned, since V is spanned by a basis of dimension four, there are 4 such tensors; each
tensor is defined by the constraints:

P^i P'^j P''^k J_ijk = 0.
These are linear constraints on the 64 elements of the join tensors. Because there are four Jtensors, the
linear system of equations for solving for J_ijk from the matching triplets P, P', P'' has a 4-dimensional
null space. The vectors of the null space are spanned by the Jtensors. In practical terms, given N ≥ 60
matching triplets P, P', P'', each triplet contributes one linear equation P^i P'^j P''^k J_ijk = 0 on the 64
¹The join operator is the exterior product of the Grassmann-Cayley algebra. The join of three 3D points is the plane which
contains the three points.
entries of J_ijk. The eigenvectors associated with the four smallest eigenvalues of the estimation matrix
are the Jtensors of the dynamic 3D-to-3D alignment problem.
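As a sanity check, the constraint underlying eqns. (8.1)-(8.2) can be verified numerically. In this sketch A, B, V are chosen at random, and a synthetic point moves along a straight 3D line; the construction and names are ours:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(3)

# 4D Levi-Civita ("cross-product") tensor eps[i, l, m, u].
eps = np.zeros((4, 4, 4, 4))
for perm in permutations(range(4)):
    inv = sum(perm[a] > perm[b] for a in range(4) for b in range(a + 1, 4))
    eps[perm] = (-1) ** inv          # sign = parity of the permutation

A = rng.standard_normal((4, 4))      # collineations (invertible w.h.p.)
B = rng.standard_normal((4, 4))
V = rng.standard_normal(4)           # one arbitrary choice of V

# One Jtensor: J_ijk = eps_ilmu A^l_j B^m_k V^u   (eqn. 8.2)
J = np.einsum('ilmu,lj,mk,u->ijk', eps, A, B, V)

# Synthetic dynamic point moving along the 3D line Q + t*D.
Ainv, Binv = np.linalg.inv(A), np.linalg.inv(B)
Q, D = rng.standard_normal(4), rng.standard_normal(4)
P   = Q + 0.3 * D                    # first time instance
Pp  = Ainv @ (Q + 1.1 * D)           # so that A @ Pp lies on the line
Ppp = Binv @ (Q - 0.7 * D)           # so that B @ Ppp lies on the line

# P, A@Pp, B@Ppp span only 2 dimensions, so the contraction vanishes.
residual = np.einsum('i,j,k,ijk->', P, Pp, Ppp, J)
assert abs(residual) < 1e-8
```

In an estimation setting, each such triplet would contribute the row `np.kron(P, np.kron(Pp, Ppp))` of the 64-column estimation matrix; the residual above is exactly this row dotted with `J.ravel()`.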
We see that at least 60 point measurements are needed to solve for the Jtensors. In case all of the
measurements arise from dynamic points, these points should be distributed along at least 10 lines,
5 of which can hold up to 8 dynamic points, while the remaining 5 can hold up to 4 dynamic points. A
tool for showing this kind of argument is representation theory. Using this tool, along the guidelines
of chapter 9, one can observe that the size of the subspace of constraints spanned by static points is 20,
and that from each line one can extract four "static" constraints and four constraints outside that static
subspace, which explains the above result.
More information about the Jtensor is given in [59]. The other main results shown there are: (i)
tensor slices and the extraction of the constituent collineations A, B from the four Jtensors; (ii) the use
of Jtensors for direct mapping between coordinate systems (without extracting A, B along the way); (iii)
the use of Jtensors to distinguish between dynamic and static points; and (iv) the relationship between
the numbers of static and dynamic points needed for estimating the Jtensors in the unsegmented and
segmented configurations.
8.2 The Minimal Jtensor
By building the estimation matrix from the outer product of the 3D points we get four constraints,
which is the number of constraints that exist when a point lies on a line. In this case the line can be seen
as the join of P and AP', and the point as BP''. The fact that we got four constraints suggests that the
Jtensors we found were not minimal. A minimal Jtensor would use a smaller estimation matrix and
would therefore need fewer measurements.

We can limit the constraints to the single constraint for the intersection of two lines. This is done
by taking, instead of the point BP'', the ray which connects some camera center C with this point. This
camera is entirely arbitrary and may be chosen at will.
Assuming that we choose a camera M as stated above, we take C ≅ null(M). There exist 4×3
matrices which transform a point on the image, MP'', to a point in 3D which is the intersection of the
ray associated with this point (BP'' ∨ BC) and some plane. We choose one of these matrices and call it
O. We know that the ray above and the line P ∨ AP' must intersect, so:

det [ P  AP'  BOMP''  BC ] = 0
This expression is multi-linear in P, P' and MP'', and gives us the minimal Jtensor N_ijk. The indices
i and j run from 1 to 4, and the last index k runs from 1 to 3. The minimal Jtensor's equation is:

N_ijk = ε_ilmu A^l_j (BO)^m_k (BC)^u     (8.3)
Alternatively, we can give the minimal Jtensor another equation, this time by noticing that after all
the 3D points are transformed to the third coordinate system, the points, and therefore their images under
the camera M, are collinear:

N_ijk = ε_lmk (MB^{-1})^l_i (MB^{-1}A)^m_j     (8.4)

where the three-index epsilon ε_lmk is simply the anti-symmetric (cross-product) tensor of R^3.
The estimation matrix for this tensor is made of the outer products of P, P' and MP'' for any
pre-chosen rank-three camera matrix M. Thus, in order to solve for the minimal Jtensor, one needs
3·4² − 1 = 47 measurements, a significant reduction from the 60 measurements needed for
the Jtensor above. Given the same 3D measurements, one can choose another projection matrix M and
compute several such minimal Jtensors.
We can recover the collineations A and B by noticing, for example, that P^i (A^{-1}P)^j N_ijk = 0_3.
Therefore, for any slice of the minimal Jtensor, S_δ = δ^k N_ijk, we have A^T S_δ + S_δ^T A = 0, which gives us
10 equations on A per slice of the minimal Jtensor. The recovery of the second collineation B is only
slightly more complicated.
8.3 The 3D Constant Velocity Tensor
We now consider the constant velocity case. Let the collineations between the world coordinate
system and the sensor coordinate systems be A_i, i = 0, 1, 2. This time we cannot assume A_0 to be the
identity, since we allow our tensors to have projective coordinate systems, and constant velocity is an
affine invariant. A point in 3D, (X, Y, Z)^T, is moving at a constant velocity (dX, dY, dZ)^T. Its
location in the sensors (in projective coordinates), P_i, is given by:

P_i ≅ A_i [ (X, Y, Z, 1)^T + i (dX, dY, dZ, 0)^T ] ≅ Ã_i (X, Y, Z, 1, dX, dY, dZ)^T

where Ã_i is composed of the columns A^1_i, A^2_i, A^3_i, A^4_i of A_i:

Ã_i ≅ [ A^1_i  A^2_i  A^3_i  A^4_i  iA^1_i  iA^2_i  iA^3_i ]
We can now pass three hyper-planes through the point in P^6 for each measurement. By taking three
such hyper-planes from each of the first two points, and one from the third point, we derive the constraint:

det [ (P^4_0  0  0  −P^1_0) Ã_0
      (0  P^4_0  0  −P^2_0) Ã_0
      (0  0  P^4_0  −P^3_0) Ã_0
      (P^4_1  0  0  −P^1_1) Ã_1
      (0  P^4_1  0  −P^2_1) Ã_1
      (0  0  P^4_1  −P^3_1) Ã_1
      L^T_2 Ã_2 ] = 0

This constraint is multi-linear in the measurements P_0, P_1, L_2 and has the form P^i_0 P^j_1 L_2k A^k_ij = 0.

The size of the resulting tensor is 4³. Since we can take any plane through the point in the third
coordinate system, 21 point matches across 3 views are sufficient to solve for this tensor linearly.
Note that it is also possible to use an arbitrary camera in the third view here. Instead of the PPp
(point in P^3 - point in P^3 - point in P^2) tensor we obtained for the non-constant-velocity case, here we
get a PPl tensor, where l is any line through the point p. This gives a tensor of size 3×4², which
needs 24 point matches in order to solve.
Extracting information from this tensor, such as the collineations A_i and the structure of the scene, is
done along the lines described in [57]. A first reconstruction is done in P^7; then a second reconstruction
is done in P^3 for the 3D structure and collineations. Since constant velocity is an affine invariant, the
last reconstruction is an affine reconstruction. Note that in the case of constant velocity there is an
ambiguity in defining static points. This is because performing two constant-velocity motions one after
the other gives a new constant-velocity motion. Therefore, we cannot distinguish between a translation of
the coordinate frames and adding the same translation to all the points. This ambiguity can be resolved
using one static point.

We will first describe some properties of the constant velocity tensor that enable the
projective reconstruction, and then show how to achieve the reconstruction itself.
8.3.1 Decomposing the constant velocity tensor
We have formulated our problem as a projection problem from P^6 to P^3. This gives us a 4×7
projection matrix ("camera"). The analogue of an image plane for this type of "camera" is an "image
space" (an extensor of step 4). This projection matrix has a center, which is simply its null space: an extensor
of step 7 − 4 = 3. The image of this "camera center" in some other view is a plane subspace of
the other view's image space. This is analogous to the well-known epipole of epipolar geometry.
Consider the contractions O^k ≅ P^i_0 P^j_1 A^k_ij. This type of contraction is called point transfer. It can
easily be shown (by multiplying both sides with a plane L_m through the third point P^m_2) that this
contraction generates the point in the last view, P_2, or 0 in degenerate cases.

Let us now turn our attention to slices of the tensor. P^i_0 A^k_ij is clearly a 4×4 matrix. It is the collineation
between views 2 and 3 of the space (an extensor of step 4) which connects the first projection matrix's center
(a plane: an extensor of step 3) and the point in the first "image space".
When trying to solve for projection matrices from the multi-linear constraints in the classic case, the
epipoles play a major role. Here each epipole is a plane. For example, the epipole in view 2 associated
with view 1 is the plane e_01 spanned by the points

e_01 ≅ A_1 null(A_0),

where the A_i are the projection matrices.
In multi-view geometry the epipole in the first image is transformed to the epipole in the second
image by any valid homography between the views. This is because the line in 3D which connects
the camera centers intersects the image planes at fixed points. Here, the extensor of step 6 which
connects the two projection centers intersects each image space (an extensor of step 4) in an extensor of
step 3 (6 + 4 − 7 = 3), which is a plane. The epipoles are transformed from one view to another by
any valid collineation between them.
Assume that H and J are two collineations between views 1 and 2 (in order to find these we need the
tensors between views three, one and two, and not the tensor between views one, two and three). Using
dual collineations (we transform planes, not points):

e_10 ≅ H^{-T} e_01 ≅ J^{-T} e_01

Therefore e_01 is a generalized eigenvector of H^{-T} and J^{-T}. We can use this property to find the
epipoles from collineations: the epipole is a generalized eigenvector of any pair of valid
dual collineations.
Up to a projective transformation, the projection matrices from P^6 to P^3 can be chosen as:

w_0 ≅ [ I_{4×4}  0_{4×3} ]
w_1 ≅ [ H_01  e^1_10  e^2_10  e^3_10 ]

where H_01 is any collineation between views one and two, and e^i_10, i = 1..3, are three points on the plane
which is the epipole in view two associated with camera one.

This choice of projection matrices is actually the choice of the eight points of the standard basis
of P^6, which we are free to choose. The first four points are taken from the space associated with the
collineation H_01; the fifth, sixth and seventh points are taken from the center of the first projection matrix;
the scales between the epipoles/homography determine the missing point of the basis in P^6.
Having w_0 and w_1, we can use the tensor to find w_2. Note that the tensor elements are multi-linear
expressions in the projection matrices; since w_0 and w_1 are known, we are left with a linear
expression in w_2.

Using the tensorial constraints alone, we cannot do better in the larger space than a projective
reconstruction (of that larger space), due to the gauge invariance. However, additional information arises
from the nature of the underlying scenario.

For the case of the constant velocity tensors, we know that the matrices Ã_i have a special structure:
they have columns which are repeated, multiplied by some scalar. This gives us linear constraints on a
transformation that will bring all the w_i to this structure. The rest follows very similarly to what is
done in [57].
8.4 The Translating Lines Matrix
Consider a moving sensor capturing a scene composed of lines moving in 3D. The motion of the lines
is constrained in such a way that each point on a line moves in the same direction. For example, the
lines lie on rigid objects that undergo translation.

The constraint is derived from the fact that if we compensate for the motion of the sensor, the line
before and the line after the motion both reside on the same plane. In other words, these two lines
intersect (possibly at infinity).

Two lines represented in Plücker coordinates intersect if the dot product of the first line with
some known permutation of the second line vanishes; the permutation depends on the order of elements
in the Plücker coordinates. This can be written as l^T_2 J l_1 = 0, where the matrix J is a known permutation
matrix.
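This intersection test can be sketched as follows. We assume, for illustration, the (direction; moment) ordering of Plücker coordinates, for which J simply swaps the two 3-blocks; other orderings give a different permutation:

```python
import numpy as np

def plucker(p, q):
    """Plücker coordinates (direction; moment) of the line through p and q."""
    return np.concatenate([q - p, np.cross(p, q)])

# J swaps the direction and moment blocks; l2^T J l1 = 0 iff the lines meet.
J = np.block([[np.zeros((3, 3)), np.eye(3)],
              [np.eye(3), np.zeros((3, 3))]])

a = np.array([0.0, 0.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 1.0, 0.0])
d = np.array([1.0, 1.0, 1.0])

l_ab, l_ac, l_cd = plucker(a, b), plucker(a, c), plucker(c, d)

meets = l_ac @ J @ l_ab     # lines share the point a -> must vanish
skew  = l_cd @ J @ l_ab     # a skew pair -> nonzero
assert np.isclose(meets, 0.0) and not np.isclose(skew, 0.0)
```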
Changing the coordinate system for points by some collineation A changes the coordinate system of
the Plücker lines derived from these points by an induced 6×6 collineation Ā. The rows of Ā are the
Plücker coordinates of the lines made out of every pair of rows of the point collineation A. Combining
this with the previous equation we get l^T_2 Ā^T J l_1 = 0, where l_1 is a line before the
motion in the first sensor, and l_2 is the line after the motion in the second sensor.

This constraint is multi-linear in the lines, and we can solve for Ā^T J from 35 line matches between
two views. If points are known to be on a translating object, then we can obtain lines by choosing every two
such points as a line. Experiments show that two objects are sufficient to solve for the tensor linearly.
Since we know J, which is a permutation matrix and therefore full rank, we can solve for the line
collineation Ā from the tensor. The point collineation A can then be recovered by noticing that every
row of A is the intersection of three rows of Ā, each representing a line.
Note that in the common case, where all the points on each line move in the same direction and
at the same velocity per line, we have an ambiguous situation. The ambiguity has two components.
First, we can recover the change in coordinate system only up to some translation of all the points.
Second, we can recover the change in coordinate system only up to some unknown scale, i.e., every point
(X_i, Y_i, Z_i, 1)^T can be transformed into (λX_i, λY_i, λZ_i, 1)^T for any fixed λ without changing
the property that every line before the motion intersects the corresponding line after the motion.

Both ambiguities can be shown to arise from the following relation. Let (X_i, Y_i, Z_i, 1)^T, i = 1, 2,
be two points which both move at a constant velocity (dX_12, dY_12, dZ_12, 0)^T. Then for every
common arbitrary translation (dA, dB, dC, 0)^T, and for any scale factor λ:
det [ X_1  X_2  λX_1 + dX_12 + dA   λX_2 + dX_12 + dA
      Y_1  Y_2  λY_1 + dY_12 + dB   λY_2 + dY_12 + dB
      Z_1  Z_2  λZ_1 + dZ_12 + dC   λZ_2 + dZ_12 + dC
       1    1          1                   1          ] = 0

(This can be seen by subtracting the first two columns of the matrix and comparing with the difference
of the last two: the two difference vectors are proportional.)
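A quick numerical check of this identity, with random values (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
P1, P2 = rng.standard_normal(3), rng.standard_normal(3)  # (X_i, Y_i, Z_i)
d = rng.standard_normal(3)       # shared motion (dX_12, dY_12, dZ_12)
shift = rng.standard_normal(3)   # arbitrary common translation (dA, dB, dC)
lam = 1.7                        # arbitrary scale factor

M = np.ones((4, 4))
M[:3, 0] = P1
M[:3, 1] = P2
M[:3, 2] = lam * P1 + d + shift
M[:3, 3] = lam * P2 + d + shift

# Column 3 - column 4 = lam * (column 1 - column 2), so the matrix is
# rank-deficient and its determinant always vanishes.
assert np.isclose(np.linalg.det(M), 0.0)
```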
This ambiguity, which reduces the rank of the estimation matrix to 31 instead of 35, can be
overcome by using two static points, for example. From each static point we derive additional constraints
on the estimation matrix by choosing any line through the point in the first frame and any line
through the same point in the second frame.
Chapter 9
Counting Problems for Multilinear
Constraints
Multilinear constraints in computer vision are of growing interest in Structure from Motion
(SFM), indexing and graphics. Many of the applications where multiple measurements are involved,
like multiple-view geometry of static and dynamic scenes, indexing functions into 3D data-sets, and the
separation of various attributes/modalities such as "content" and "style", have a multilinear form. As a
result, a growing amount of work has been published on the various aspects of these algebraic functions
and their applications; see Hartley & Zisserman, 2000 and Faugeras & Luong, 2001 for recent
summaries of the various multi-linear maps and their associated tensors.
In this paper we raise a general question and demonstrate its relevance to current research on
multilinearity in computer vision. The question takes the following form: Let V be a complex n-dimensional
space, and for m ≥ k consider the GL(V)-module V(n, m, k) ⊂ V^⊗m defined by

V(n, m, k) = Span{ v_1 ⊗ · · · ⊗ v_m ∈ V^⊗m : dim Span{v_1, . . . , v_m} ≤ k }.

We would like to determine dim V(n, m, k) for any choice of n, m ≥ k. We will show that this question
appears, in one disguised form or another, in a number of vision problems and, for example, focus on
two of those problems: (i) the analysis of constraints in single-view indexing functions (the 8-point shape
tensor), and (ii) the analysis of the constraints in dynamic P^n → P^n mappings, i.e., where the point sets
are allowed to move within a k-dimensional subspace while the n-dimensional space is being multiply
projected (multiple views) onto copies of the m-dimensional space.
We then derive the solution to the general problem using tools from representation theory. We
describe the general notation in the next section (and provide a brief primer on representation theory
in the appendix), followed by a detailed description of the two problems mentioned above and the
way they are mapped to the question of dim V(n, m, k); we then derive the structure
and dimension of the GL(V)-module V(n, m, k) by counting irreducibles, followed by examples of its
application to some instances of dynamic P^n → P^n mappings.
9.1 A Representation Theory Digest
In this section we briefly recall some relevant facts concerning the representation theory of the general
linear group. For a thorough introduction see Fulton & Harris, 1991.

Let V be a finite n-dimensional vector space over the complex numbers. The collection of invertible
n×n matrices is denoted by GL(n); this is the group of automorphisms of V, denoted GL(V). The
vector space V^⊗m (the m-fold tensor product) is spanned by decomposable tensors of the form v_1 ⊗ · · · ⊗ v_m,
where the vectors v_i are in V. Hence the dimension of V^⊗m is n^m. The vector space V^⊕m is the m-fold
direct sum of V, and is thus of dimension nm.
The exterior power ∧^m V of V, n ≥ m, is the vector space spanned by the m×m minors of the
n×m matrix [v_1, ..., v_m], where the vectors v_i are in V. Hence the dimension of ∧^m V is (n choose m). The
exterior powers are the images of the map V^×m → V^⊗m given by

(v_1, · · · , v_m) → Σ_{σ∈S_m} sgn(σ) v_σ(1) ⊗ · · · ⊗ v_σ(m)

where S_m denotes the symmetric group (of permutations of m letters).
The symmetric powers Sym^m V are the images of the map V^×m → V^⊗m given by

(v_1, · · · , v_m) → Σ_{σ∈S_m} v_σ(1) ⊗ · · · ⊗ v_σ(m)

Hence the vector space Sym^m V is of dimension (n+m−1 choose m). Note that

V ⊗ V = Sym^2 V ⊕ ∧^2 V

with the appropriate dimensions: n² = (n+1 choose 2) + (n choose 2). This decomposition into irreducibles
(see later) does not hold for V^⊗m, m > 2. The remainder of this section is devoted to the notation needed
for representing V^⊗m as a decomposition of irreducibles.
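The dimension counts above can be checked directly; a small sketch:

```python
from math import comb

# dim(V (x) V) = dim(Sym^2 V) + dim(Wedge^2 V) for every n:
# n^2 = C(n+1, 2) + C(n, 2)
for n in range(1, 20):
    assert n * n == comb(n + 1, 2) + comb(n, 2)

# Exterior and symmetric power dimensions for a 4-dimensional space:
n, m = 4, 3
assert comb(n, m) == 4             # dim Wedge^3 V
assert comb(n + m - 1, m) == 20    # dim Sym^3 V
```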
A representation of a group G on a complex finite-dimensional space U is a homomorphism from G to
GL(U), the group of linear automorphisms of U. The action of g ∈ G on u ∈ U is denoted by g·u. The
G-module U is irreducible if it contains no non-trivial G-invariant subspaces. Any finite-dimensional
representation of a compact group G can be decomposed as a direct sum of irreducible representations.
This basic property, called complete reducibility, also holds for all holomorphic representations of the
general linear group GL(V).
The main focus of this chapter is the space

V(n, m, k) = Span{ v_1 ⊗ ··· ⊗ v_m ∈ V^{⊗m} : dim Span{v_1, ..., v_m} ≤ k }.

Since V(n, m, k) is invariant under the GL(V) action given by g · (v_1 ⊗ ··· ⊗ v_m) = g(v_1) ⊗ ··· ⊗ g(v_m), it is natural to study its structure by decomposing it into irreducible GL(V)-modules.
The description of the finite dimensional irreducible representations (irreps) of GL(V) depends on the combinatorics of partitions and Young diagrams, which we now describe.
A partition of m is an ordered set λ = (λ_1, ..., λ_k) such that λ_1 ≥ ... ≥ λ_k ≥ 1 and Σ λ_i = m. A partition is represented by its Young diagram (also called shape), which consists of k left-aligned rows of boxes with λ_i boxes in row i. The conjugate partition µ = (µ_1, ..., µ_r) to a partition λ is defined by interchanging rows and columns in the Young diagram; without reference to the diagram, µ_i is the number of terms in λ that are greater than or equal to i.
An assignment of the numbers {1, ..., m} to the boxes of the diagram of λ, one number to each box, is called a tableau. A tableau in which all the rows and columns of the diagram are increasing is called a standard tableau. We denote by f_λ the number of standard tableaux on λ, i.e., the number of ways to fill the Young diagram of λ with the numbers from 1 to m such that all rows and columns are increasing. Let (i, j) denote the coordinates of the boxes of the diagram, where i = 1, ..., k denotes the row number and j denotes the column number, i.e., j = 1, ..., λ_i in the i'th row. The hook length h_{ij} of a box at position (i, j) in the diagram is the number of boxes directly below, plus the number of boxes to the right, plus 1 (without reference to the diagram, h_{ij} = λ_i + µ_j − i − j + 1). Then,

f_λ = m! / ∏_{(i,j)} h_{ij}

where the product of the hook lengths is over all boxes of the diagram. We denote by d_λ(n) the number of semi-standard tableaux, which is the number of ways to fill the diagram with the numbers from 1 to n such that all rows are non-decreasing and all columns are increasing. We have:

d_λ(n) = ∏_{(i,j)} (n − i + j) / h_{ij} .
Let S_m denote the symmetric group on {1, ..., m}. The group algebra CS_m is the algebra spanned by the elements of S_m:

CS_m = { Σ_{σ∈S_m} α_σ σ | α_σ ∈ C }

where addition and multiplication are defined as follows:

α (Σ_{σ∈S_m} α_σ σ) + β (Σ_{σ∈S_m} β_σ σ) = Σ_{σ∈S_m} (α α_σ + β β_σ) σ

and

(Σ_{σ∈S_m} α_σ σ) (Σ_{τ∈S_m} β_τ τ) = Σ_{g∈S_m} ( Σ_{g=στ} α_σ β_τ ) g

for α, β, α_σ, β_σ ∈ C.
Let t be a tableau on λ (a numbering of the boxes of the diagram) and let P(t) denote the group of all permutations σ ∈ S_m which permute only the rows of t. Similarly, let Q(t) denote the group of permutations that preserve the columns of t. Let a_t, b_t be two elements in the group algebra CS_m defined as:

a_t = Σ_{g∈P(t)} g ,   b_t = Σ_{g∈Q(t)} sgn(g) g.
The group algebra CS_m acts on V^{⊗m} on the right by permuting factors, i.e., (v_1 ⊗ ··· ⊗ v_m) · σ = v_{σ(1)} ⊗ ··· ⊗ v_{σ(m)}. For a general shape λ and a tableau t on λ, the image of a_t, V^{⊗m} · a_t, is the subspace:

V^{⊗m} · a_t = Sym^{λ_1}V ⊗ ··· ⊗ Sym^{λ_k}V ⊂ V^{⊗m}

and the image of b_t is

V^{⊗m} · b_t = ∧^{µ_1}V ⊗ ··· ⊗ ∧^{µ_r}V ⊂ V^{⊗m}

where µ is the conjugate partition to λ. The Young symmetrizer is defined by c_t = a_t · b_t ∈ CS_m. The image of the Young symmetrizer

S_t(V) = V^{⊗m} · c_t

is the Schur module associated to t and is an irreducible GL(V)-module. The isomorphism type of S_t(V) depends only on the shape λ, so we may write S_t(V) = S_λ(V). It turns out that all the polynomial irreps of GL(V) are of the form S_λ(V) for some m and a partition λ ⊢ m.
Let T_λ denote the set of standard tableaux on λ; then the direct sum decomposition of V^{⊗m} into irreducible GL(V)-modules is given by

V^{⊗m} = ⊕_{λ⊢m} ⊕_{t∈T_λ} S_t(V) ≅ ⊕_{λ⊢m} S_λ(V)^{⊕f_λ}.

Since d_λ(n) = dim S_λ(V) it follows that

dim V^{⊗m} = n^m = Σ_{λ⊢m} d_λ(n) f_λ.
For example, consider n = m = 3, i.e., V ⊗ V ⊗ V where dim V = 3. There are three possible partitions λ of 3: (3), (1, 1, 1) and (2, 1). From the above, S_{(3)}(V) = Sym^3 V and S_{(1,1,1)}(V) = ∧^3 V. There are two (f_{(2,1)} = 2) standard tableaux for λ = (2, 1), namely 12/3 and 13/2 (rows listed top to bottom, separated by a slash, with boxes numbered left to right). There are eight (d_{(2,1)}(3) = 8) semi-standard tableaux, which are: 11/2, 11/3, 12/2, 12/3, 13/2, 13/3, 22/3 and 23/3. We have the decomposition:

V ⊗ V ⊗ V = Sym^3 V ⊕ ∧^3 V ⊕ (S_{(2,1)}V)^{⊕2}

with the appropriate dimensions: 27 = 10 + 1 + (8 + 8).
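These counts are easy to reproduce with a short script. The sketch below (the helper names are my own) implements the hook length formulas for f_λ and d_λ(n) and checks the dimension count for n = m = 3:

```python
from math import factorial

def partitions(m, max_part=None):
    """All partitions of m as non-increasing tuples."""
    if max_part is None:
        max_part = m
    if m == 0:
        yield ()
        return
    for first in range(min(m, max_part), 0, -1):
        for rest in partitions(m - first, first):
            yield (first,) + rest

def hook(shape, i, j):
    """Hook length of box (i, j) (0-indexed): boxes below + boxes right + 1."""
    mu_j = sum(1 for row in shape if row > j)   # conjugate part
    return (shape[i] - j - 1) + (mu_j - i - 1) + 1

def f(shape):
    """Number of standard tableaux: m! / product of hook lengths."""
    m = sum(shape)
    p = 1
    for i, row in enumerate(shape):
        for j in range(row):
            p *= hook(shape, i, j)
    return factorial(m) // p

def d(shape, n):
    """Number of semi-standard tableaux: product of (n - i + j) / h_ij."""
    num, den = 1, 1
    for i, row in enumerate(shape):
        for j in range(row):
            num *= n - i + j
            den *= hook(shape, i, j)
    return num // den

print(f((2, 1)), d((2, 1), 3))                       # 2 8
print(sum(f(p) * d(p, 3) for p in partitions(3)))    # 27 = 10 + 1 + 2*8
```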
9.2 The 8-point Shape Tensor Problem
In this section we will make the connection between the question of dim V(n, m, k) and a riddle regarding the internal structure of the 8-point shape tensor. Shape tensors were first introduced in Carlsson 1995, Weinshall et al. 1996, Carlsson & Weinshall 1998, with the basic idea that single-view invariants of a 3D scene can be obtained by algebraically eliminating the viewing position (camera) parameters given a sufficient number of points. Later, the same analysis was conducted in a reduced (but practical in vision applications) setting where a reference plane is identified in advance (Irani & Anandan 1996, Irani et al. 1998, Criminisi et al. 1998, Rother & Carlsson 2001), which is the case we will focus on here.
The problem setting is as follows. Let P_i = (X_i, Y_i, Z_i, W_i)^T ∈ P^3, i = 1, ..., 8, denote 8 points in 3D projective space and let M be a 3 × 4 projection matrix, thus p_i ≅ M P_i, where p_i ∈ P^2 are the corresponding image points in the 2D projective plane. We wish to algebraically eliminate the camera parameters (the matrix M) by having a sufficient number of points. This can be done succinctly if we first make a change of basis: let the coplanar points be denoted by P_1, ..., P_4 with the coordinates (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (1, 1, 1, 0), which is appropriate when P_1, ..., P_4 are indeed coplanar. Let the image undergo a projective change of coordinates such that the corresponding points p_1, ..., p_4 are assigned e_1 = (1, 0, 0), e_2 = (0, 1, 0), e_3 = (0, 0, 1), e_4 = (1, 1, 1), respectively. Given this setup the camera matrix M contains only 4 non-vanishing entries:

M = [ δ  0  0  α
      0  δ  0  β
      0  0  δ  γ ]
Let M = (α, β, γ, δ)^T ∈ P^3 be a point (representing the camera) and let P_i be the projection matrix:

P_i = [ W_i  0   0   X_i
       0   W_i  0   Y_i
       0   0   W_i  Z_i ]

As in the general case we have the duality p_i ≅ M P_i = P_i M, where the roles of the motion (the camera) and the shape have been switched. Let l_i, l'_i be two distinct lines passing through the image point p_i, i.e., p_i^T l_i = 0 and p_i^T l'_i = 0; therefore we have l_i^T P_i M = 0 and l'_i^T P_i M = 0. For i = 5, ..., 8 we therefore have E M = 0, where:

E = [ l_5^T P_5
      ···
      l_8^T P_8
      l'_5^T P_5
      ···
      l'_8^T P_8 ]          (9.1)
Therefore the determinant of any 4 rows of E must vanish. The choice of the 4 rows can include 2 points, 3 points, or 4 points (on top of the 4 basis points P_1, ..., P_4), and each such choice determines a multilinear constraint whose coefficients are arranged in a tensor. The 8-point tensor arises when 4 points are chosen: by choosing one row from each point we obtain a vanishing determinant involving 4 points, which provides 16 constraints (per view) l^5_i l^6_j l^7_k l^8_t Q^{ijkt} = 0 for the 81 coefficients of the tensor Q^{ijkt}. The indices i, j, k, t follow the covariant-contravariant notation (upper indices represent points, lower represent lines) and the summation convention (contraction) u_i v^i = u_1 v^1 + u_2 v^2 + ... + u_n v^n. The tensor contains 81 coefficients; however, they satisfy internal "synthetic" linear constraints. Exactly how many constraints there are is an open problem, which we will show boils down to the question of dim V(n, m, k).
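The elimination above can be checked numerically. The following sketch (my own; the variable names are hypothetical) draws a random camera point (α, β, γ, δ) and four random points P_5, ..., P_8, verifies the duality p_i ≅ M P_i = P_i M, and confirms that the stacked matrix E of eq. (9.1) annihilates the camera vector, so every 4 × 4 determinant of its rows vanishes:

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.standard_normal(4)                    # camera point (alpha, beta, gamma, delta)
alpha, beta, gamma, delta = m
M = np.array([[delta, 0, 0, alpha],           # camera matrix with 4 non-vanishing entries
              [0, delta, 0, beta],
              [0, 0, delta, gamma]])

rows = []
for _ in range(4):                            # points P_5, ..., P_8
    X, Y, Z, W = rng.standard_normal(4)
    P = np.array([[W, 0, 0, X],               # dual projection matrix P_i
                  [0, W, 0, Y],
                  [0, 0, W, Z]])
    p = P @ m                                 # image point
    assert np.allclose(M @ np.array([X, Y, Z, W]), p)   # duality: M P_i = P_i M
    l1 = np.cross(p, [1.0, 0.0, 0.0])         # two lines through p (l . p = 0)
    l2 = np.cross(p, [0.0, 1.0, 0.0])
    rows.append(l1 @ P)
    rows.append(l2 @ P)

E = np.vstack(rows)                           # the 8 x 4 matrix of eq. (9.1)
print(np.allclose(E @ m, 0))                  # True: rank(E) <= 3
```

Since E M = 0 with M ≠ 0, the rank of E is at most 3 and every 4 × 4 minor vanishes, as claimed.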
Since P_1, ..., P_4 are coplanar we have the constraint P_i^T n = 0, i = 1, ..., 4, and, due to our choice of coordinates, n = (0, 0, 0, 1)^T. Consider the family of camera matrices M = u n^T for all choices of u = (u_1, u_2, u_3)^T. In other words, the 4'th column of M consists of the arbitrary vector u and all other entries vanish. Thus MP either vanishes or is equal to u (up to scale) for all P. Let l_i, l'_i be lines through u; therefore

l_i^T M P = l_i^T P M = 0
l'_i^T M P = l'_i^T P M = 0

for all points P, and dually for all projection matrices P. Therefore the 4 × 4 determinants of E vanish regardless of the P_i. We have a single 3 × 3 × 3 × 3 tensor Q^{ijkt} responsible for the 16 quadlinear constraints l^5_i l^6_j l^7_k l^8_t Q^{ijkt} = 0 (we have a choice of 2 lines for each point, thus 16 constraints). From the discussion above, the four lines contracted by the tensor are all coincident with the arbitrary point u. Therefore, the question is: what is the dimension of the set of constraints l^5_i l^6_j l^7_k l^8_t Q^{ijkt} = 0 where the lines are arbitrary but form a 2-dimensional subspace?
Recall the definition of V(n, m, k) and set n = 3, m = 4, k = 2:

V(3, 4, 2) = Span{ v_1 ⊗ v_2 ⊗ v_3 ⊗ v_4 | dim Span{v_1, ..., v_4} ≤ 2 }

where v_1, ..., v_4 are vectors in R^3. Our question regarding the number of synthetic constraints is equivalent to the question: what is the dimension of V(3, 4, 2)?
9.3 Dynamic P^n → P^n Mappings
Consider a configuration of points Q_i ∈ P^{n−1}, i = 1, ..., q, undergoing a projective mapping Q_i → Q'_i. Then it is well known that Q'_i ≅ A Q_i, where A ∈ GL(n) is some invertible n × n matrix. However, consider the following "complication", where each point Q_i may change its position within a k-dimensional subspace (k = 1 means that Q_i is fixed, k = 2 means that Q_i may change its position along some line in P^{n−1}, and so forth), and we are given m > 2 observations Q_i^{(j)}, where j = 1, ..., m. In other words, the observations Q_i^{(j)} are generated by a combination of "global" (unknown) transformations A_i ∈ GL(n) and "local" (unknown) movements within (unknown) subspaces of dimension up to k < m. The task is to recover the global transformations A_i from the observations.
The definition above is a generalization of particular cases which were introduced in the past under the name of "dynamic" Structure from Motion (SFM), or SFM of multiply moving points; the relevant literature includes Avidan & Shashua 2000, Shashua & Wolf 2000, Wolf et al. 2000, Manning & Dyer 1999, Wexler & Shashua 2000, Han & Kanade 2000, 2001, Segal & Shashua 2000, Wolf & Shashua 2002. For instance, Shashua & Wolf 2000 consider the case where n = 3 (the points Q_i belong to the 2D projective plane), m = 3 and k = 2. In other words, a configuration of coplanar points is viewed by a moving camera and the points move along arbitrary straight lines (k = 2) or stay fixed ("static", k = 1) while the camera changes position. It was shown there that the image observations (across three views) satisfy a 3 × 3 × 3 tensorial constraint, where in the case where all points are moving along lines, 26 observations are sufficient for a unique solution to the tensor; when all points are static (without being labeled as such), those observations fill a 10-dimensional subspace (thus at least 16 points should be dynamic for a unique solution from the observations). In a later paper (Wolf et al. 2000) the case of "dynamic 3D to 3D" alignment was introduced, where n = 4, m = 3, k = 2. In that case, the observations are governed by a 4 × 4 × 4 tensor, where the observations from moving points fill a 60-dimensional space (thus there are 4 tensors satisfying the constraints), and static points fill a 20-dimensional space.
Among the various aspects of those tensors, one important aspect is the counting of the constraints necessary for a solution. Some of those counting issues, even in the particular low-dimensional examples given above, are not obvious. The matter becomes fairly subtle when dealing with general dynamic P^n → P^n mappings, where the issue of counting constraints is an open problem.
We observe that since tensor products commute with linear transformations, the issue of dimension counting is independent of the matrices A_i ∈ GL(n). Therefore, the general problem of counting the constraints of a dynamic P^{n−1} → P^{n−1} mapping is isomorphic to the question of dim V(n, m, k), where in this case n ≥ m ≥ k.
When we compute the constraints of dynamic mappings we have other limitations which are not described in Shashua & Wolf 2000 and Wolf et al. 2000, and which can also be described in the V(n, m, k) framework. For example, in the case of dynamic P^2 → P^2 alignment, the collection of measurements arising from triplets of matching points must span the 2D plane. We may ask: what is the largest number of collinear points allowed, beyond which the solution becomes degenerate? In other words, the question is how many points moving on the same straight-line path will generate linearly independent constraints. The answer is dim V(2, 3, 2); note that n = 2 because the effective dimension of the vector space is 2, even though the points are defined in the 2D projective plane (i.e., n = 3). Likewise,
in the case of dynamic P^3 → P^3 alignment the maximal number of points allowed on a single line is also dim V(2, 3, 2), and out of these points dim V(2, 3, 1) static points will give us linearly independent constraints (in both cases).
From the examples above we have that dim V(3, 3, 2) = 26 and dim V(4, 3, 2) = 60 (points moving along straight-line paths), and dim V(3, 3, 1) = 10 and dim V(4, 3, 1) = 20 (static points), for the 2D and 3D cases, respectively.
In the following section we analyze the structure of V(n, m, k) and as a result determine dim V(n, m, k) for any choice of n, m ≥ k.
9.4 The Structure of V(n, m, k)
So far we have presented two (unrelated) vision problems which are isomorphic to the dim V(n, m, k) question. We will provide below the statement and proof about the structure of V(n, m, k). The statement appears very similar to the classic result (Section 9.1) of decomposing V^{⊗m} into irreducible GL(V)-modules:

V^{⊗m} = ⊕_{λ⊢m} ⊕_{t∈T_λ} S_t(V),

with the difference that not all diagrams are included, only those diagrams λ for which λ_{k+1} = 0.

Claim 1

V(n, m, k) = ⊕_{λ_{k+1}=0} S_λ(V)^{⊕f_λ}.

In particular

dim V(n, m, k) = Σ_{λ_{k+1}=0} f_λ d_λ(n).
Proof: Suppose λ ⊢ m and λ_{k+1} = 0. Let t be the tableau given by t(i, j) = Σ_{l=1}^{i−1} λ_l + j. Noting that V(n, r, 1) = Sym^r V, it follows that

V^{⊗m} · a_t = Sym^{λ_1}V ⊗ ··· ⊗ Sym^{λ_k}V = V(n, λ_1, 1) ⊗ ··· ⊗ V(n, λ_k, 1) ⊂ V(n, m, k).

Therefore,

S_t(V) = V^{⊗m} · a_t · b_t ⊂ V(n, m, k) · b_t ⊂ V(n, m, k)

hence,

⊕_{λ_{k+1}=0} S_λ(V)^{⊕f_λ} ⊂ V(n, m, k).
To show the other direction, let (·, ·) be a hermitian form on V and let the induced form on V^{⊗m} be given by

(u_1 ⊗ ··· ⊗ u_m, v_1 ⊗ ··· ⊗ v_m) = ∏_{i=1}^{m} (u_i, v_i).

Note that

(u_1 ∧ ··· ∧ u_m, v_1 ⊗ ··· ⊗ v_m) = (1/m!) (u_1 ∧ ··· ∧ u_m, v_1 ∧ ··· ∧ v_m) = (1/m!) det[(u_i, v_j)]_{i,j=1}^{m}.

Let λ ⊢ m with λ_{k+1} ≠ 0; then the conjugate partition µ = (µ_1 ≥ µ_2 ≥ ... ≥ µ_l) satisfies µ_1 ≥ k + 1. Let l_j = Σ_{r=1}^{j} µ_r and let t be the tableau given by t(i, j) = l_{j−1} + i. Then

S_t(V) = V^{⊗m} · a_t · b_t ⊂ V^{⊗m} · b_t = ∧^{µ_1}V ⊗ ··· ⊗ ∧^{µ_l}V.

Suppose now that v_1, ..., v_m ∈ V satisfy dim Span{v_1, ..., v_m} ≤ k. Then v_1 ∧ ··· ∧ v_{µ_1} = 0, and therefore for any u_1, ..., u_m ∈ V

((u_1 ⊗ ··· ⊗ u_m) · b_t, v_1 ⊗ ··· ⊗ v_m) = ∏_{r=1}^{l} (1/µ_r!) ( u_{l_{r−1}+1} ∧ ··· ∧ u_{l_r} , v_{l_{r−1}+1} ∧ ··· ∧ v_{l_r} ) = 0.

It follows that V(n, m, k) is orthogonal to

⊕_{λ_{k+1}≠0} S_λ(V)^{⊕f_λ}

hence,

dim V(n, m, k) ≤ dim ⊕_{λ_{k+1}=0} S_λ(V)^{⊕f_λ}.
Claim 1 can be used to give explicit formulas for dim V(n, m, k) when either k or m − k is small. In the latter case we write

dim V(n, m, k) = n^m − Σ_{λ_{k+1}≠0} f_λ d_λ(n)

and note that the partitions of m with λ_{k+1} ≠ 0 correspond to all partitions of all numbers up to m − k − 1.
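Assuming Claim 1, dim V(n, m, k) reduces to a finite sum over the partitions of m with at most k parts. A minimal sketch (the helper names are mine):

```python
from math import factorial

def partitions(m, max_part=None):
    """All partitions of m as non-increasing tuples."""
    if max_part is None:
        max_part = m
    if m == 0:
        yield ()
        return
    for first in range(min(m, max_part), 0, -1):
        for rest in partitions(m - first, first):
            yield (first,) + rest

def f_and_d(shape, n):
    """(f_lambda, d_lambda(n)) via the hook length formulas of Section 9.1."""
    hooks, d_num = 1, 1
    for i, row in enumerate(shape):
        for j in range(row):
            mu_j = sum(1 for r in shape if r > j)     # conjugate part
            hooks *= row - j + mu_j - i - 1           # hook length of box (i, j)
            d_num *= n - i + j
    return factorial(sum(shape)) // hooks, d_num // hooks

def dim_V(n, m, k):
    """Claim 1: sum of f_lambda * d_lambda(n) over partitions with <= k parts."""
    return sum(f * d for f, d in
               (f_and_d(lam, n) for lam in partitions(m) if len(lam) <= k))

print(dim_V(3, 4, 2), dim_V(3, 3, 2), dim_V(3, 3, 1))   # 72 26 10
```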
9.4.1 Examples
To calculate dim V(n, m, m−1) note that only λ = (1^m) must be excluded, thus:

f_{(1^m)} = 1 ,   d_{(1^m)}(n) = C(n, m)

hence,

dim V(n, m, m−1) = n^m − C(n, m).

To calculate dim V(n, m, m−2) we must exclude, in addition to the above, the partition (2, 1^{m−2}), thus:

f_{(2,1^{m−2})} = m − 1 ,   d_{(2,1^{m−2})}(n) = (m − 1) C(n+1, m)

hence,

dim V(n, m, m−2) = n^m − [ C(n, m) + (m − 1)^2 C(n+1, m) ].

To calculate dim V(n, m, m−3) we must exclude, in addition to the above, the partitions (3, 1^{m−3}) and (2^2, 1^{m−4}), thus:

f_{(3,1^{m−3})} = C(m−1, 2) ,   d_{(3,1^{m−3})}(n) = C(m−1, 2) C(n+2, m)

f_{(2^2,1^{m−4})} = m(m − 3)/2 ,   d_{(2^2,1^{m−4})}(n) = ((m − 3) n / 2) C(n+1, m−1)

Hence,

dim V(n, m, m−3) = n^m − [ C(n, m) + (m − 1)^2 C(n+1, m) + C(m−1, 2)^2 C(n+2, m) + (m (m − 3)^2 n / 4) C(n+1, m−1) ].
With these in mind, we can easily resolve the first of the open problems, which is the number of synthetic constraints of the 8-point shape tensor with 4 coplanar points. We have seen that the answer is dim V(3, 4, 2):

dim V(3, 4, 2) = Σ_{λ_3=λ_4=0} f_λ d_λ(3),

where λ = (λ_1, ..., λ_4) is a partition of 4, i.e., λ_1 ≥ λ_2 ≥ λ_3 ≥ λ_4 and Σ_i λ_i = 4. We therefore have only three partitions which satisfy λ_3 = λ_4 = 0 to consider: λ = (4), (2, 2), (3, 1). Thus, f_{(4)} = 1, d_{(4)}(3) = 15, f_{(2,2)} = 2, d_{(2,2)}(3) = 6, f_{(3,1)} = 3 and d_{(3,1)}(3) = 15. Therefore, dim V(3, 4, 2) = 15 + 12 + 45 = 72.
We can also verify the special cases of dynamic P^2 → P^2 and P^3 → P^3 mappings by substituting the values of n, m, k in the formulas above. For example: dim V(3, 3, 2) = 27 − 1 = 26 and dim V(4, 3, 2) = 64 − 4 = 60 (points moving along straight-line paths), and dim V(3, 3, 1) = 27 − (1 + 4 · 4) = 10 and dim V(4, 3, 1) = 64 − (4 + 4 · 10) = 20 (static points). Also dim V(2, 3, 2) = 8 − 0 = 8 points moving along one line path, out of which up to dim V(2, 3, 1) = 8 − [0 + 4] = 4 static points on this line will give us linearly independent constraints.
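As an independent sanity check (my own sketch, not part of the thesis), these dimensions can also be estimated numerically: sample random decomposable tensors whose factors lie in a random k-dimensional subspace, stack their flattened coordinates, and take the matrix rank:

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_dim(n, m, k, samples=400):
    """Numerical rank of the span of random v1 (x) ... (x) vm with dim span{v_i} <= k."""
    rows = []
    for _ in range(samples):
        B = rng.standard_normal((n, k))            # random k-dimensional subspace of R^n
        vs = [B @ rng.standard_normal(k) for _ in range(m)]
        t = vs[0]
        for v in vs[1:]:
            t = np.multiply.outer(t, v)            # build the decomposable tensor
        rows.append(t.ravel())
    return int(np.linalg.matrix_rank(np.array(rows)))

print(empirical_dim(3, 4, 2))   # 72: the 8-point shape tensor count
print(empirical_dim(3, 3, 1))   # 10
print(empirical_dim(4, 3, 2))   # 60
```

With a few hundred generic samples the rank of the stacked matrix equals dim V(n, m, k) with overwhelming probability, reproducing the values derived above.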
Chapter 10
Conclusions
10.1 Discussion
The list of possible scenarios suggested in this thesis is by no means final, and the tools presented in
this work are not limited to the use of images. We hope that many more applications will emerge that
will employ the results obtained here.
A guiding line we used throughout the work is to keep the tensors tractable. We never presented any tensor whose estimation matrix is larger than the one needed in order to compute the quadrifocal tensor in a straightforward way. Nevertheless, the dynamic invariants are less stable to compute than the static ones. There is a need for a normalization of the image coordinate system, for the use of sampling-based outlier rejection, and for the use of non-linear estimation techniques where applicable. In general we performed our experiments using standard point tracking software such as the OpenCV [38] KLT point tracker. Using standard normalization [24] and LMedS sampling [39] we were able to compute our invariants. The reason why we stress this point is because the future development of the dynamic structure from motion field is bounded by the applicability of the proposed solutions.
Ignoring this question and considering only theoretical questions about projection matrices from one projective space to another, some general tools still have to be developed. In chapter 9, an understanding of the general invariants governing the relations between m points in P^n spanning an extensor of step k was achieved. The Htensor [46] and the Jtensor [59] are examples of such invariants. However, the case of projections from one subspace to another still has gaps.
Those gaps are not in the derivation of single invariants; those are well understood, and automatically achievable. The problems regard the counting of invariants in the general case (e.g., "how many invariants are there for n views in P^l of extensors of step m in P^k confined to move on an extensor of step r?"), and determining the minimal conditions on the measurements needed in order to recover these invariants (e.g., "up to 10 points on one hyperplane").
Another problem which needs a general solution is the problem of reconstructing structure from tensorial invariants. Currently there is no algorithm for recovering structure in P^k from measurements after projections to P^l. We speculate that using epipoles where possible, and the joint epipoles developed here otherwise, could solve this problem. This still deserves a rigorous proof. A step toward a solution would be to notice that by combining two projection matrices from P^k to P^l we get a projection matrix from P^k to P^{2l}. Proceeding in this direction we can always build projection matrices which have centers of lower dimension than their image planes.
Apart from chapter 7, which uses very general assumptions, our solutions were aimed more toward handling moving objects, and less at handling deformable objects. Deformable objects are usually treated as statistical objects rather than as geometrical objects. Still, some work has been done which extracts geometric information, such as camera ego-motion, from scenes containing deformable objects. In [5, 6] methods were presented for modeling a deformable object, such as a human face, by using a small number of basis shapes and their linear combinations. Using a factorization-based method, it is possible to recover the deformable shape and the camera ego-motion for the orthographic camera model. One can imagine incorporating both shape basis and view information into some very large projection matrices, yielding a solution for the projective case. A more tractable solution would be to add information such as a second camera, symmetry of the recovered shape, or limiting the camera ego-motion.
In chapters 4 and 7 methods for action recognition were proposed. In both cases an action indexing function was learned from examples. In chapter 4 it was assumed that the locations of some feature points in the images of the examined body were known. In chapter 7, no correspondence was required, and even a direct brightness-based method was proposed. Although the correspondence-free method is very appealing, there is a price to be paid in the accuracy of the resulting method. Having no knowledge introduced in advance on the structure of the moving body produces a lot of ambiguity in the resulting indexing. This can be seen, for example, by comparing the correspondence-free synchronization results given in chapter 7 to the perfect results obtained using a similar method with correspondences, shown in [62].
We would be interested in exploring the possibility of learning to index motion with the use of prior knowledge, but without having it incorporated manually into the system. The resulting system would use the input examples twice: first to learn the type of variability in the whole dataset, and then to learn specific action indexing functions. An example of such a first stage would be to learn to identify those body parts which bear the most information about the type of motion in most examples (i.e., the system will learn that "hands" are important and will learn to recognize them in the images).
10.2 Summary
This thesis addresses problems concerning the recovery of the geometry of a dynamic scene viewed by a moving uncalibrated camera.
As there are many ways to model dynamics, many different scenarios are considered, and several types of solutions are suggested. These solutions employ and generalize classical structure from motion techniques; thus the resulting body of work lies within the structure from motion field.
Out of the contributions made in this work we would like to point out the following:
• Identifying analytical solutions for dynamic SFM problems in the uncalibrated case.
• A systematic way to model dynamic scenes and to derive the multiple-view and multiple-point invariants for those scenes.
• Developing tools for the analysis of the resulting invariants and their degenerate configurations. This study, in its most general form, relied on tools from representation theory.
• Developing tools for the decomposition of those invariants, in order to compute camera motion, such as the recovery of the "joint epipoles".
Bibliography
[1] S. Avidan and A. Shashua. Threading Fundamental Matrices. In Proc. of the European Conference on Computer Vision, June 1998, Freiburg, Germany.
[2] S. Avidan and A. Shashua. Trajectory triangulation: 3D reconstruction of moving points from a monocular image sequence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(4):348–357, 2000.
[3] M. Barnabei, A. Brini, and G.C. Rota. On the exterior calculus of invariant theory. Journal of Algebra, 96:120–160, 1985.
[4] A. Bartoli. The geometry of dynamic scenes: on coplanar and convergent linear motions embedded in a 3D static scene. In The 13th British Machine Vision Conference (BMVC), Sep. 2002.
[5] M. Brand. Morphable 3D models from video. In CVPR, Kauai, Hawaii, pages II:456–463, 2001.
[6] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3D shape from image streams. In CVPR, Hilton Head, SC, June 13-15, 2000, pages II:690–696.
[7] S. Carlsson. The Double Algebra: An Effective Tool for Computing Invariants in Computer Vision. In Applications of Invariance in Computer Vision, J.L. Mundy, A. Zisserman, D. Forsyth (Eds.), Springer-Verlag, Berlin Heidelberg, 1994.
[8] S. Carlsson. Duality of reconstruction and positioning from projective views. In Proceedings of the Workshop on Scene Representations, Cambridge, MA, June 1995.
[9] S. Carlsson and D. Weinshall. Dual computation of projective shape and camera positions from multiple images. International Journal of Computer Vision, 27(3), 1998.
[10] C. Rother and S. Carlsson. Linear Multi View Reconstruction and Camera Recovery. In Proceedings of the International Conference on Computer Vision, Vancouver, Canada, July 2001.
[11] Y. Caspi, D. Simakov and M. Irani. Feature-Based Sequence-to-Sequence Matching. In Vision and Modelling of Dynamic Scenes Workshop, with ECCV 2002, Copenhagen.
[12] J.P. Costeira and T. Kanade. A multibody factorization method for independently moving objects. International Journal of Computer Vision, 29(3):159–179, 1998.
[13] A. Criminisi, I. Reid, and A. Zisserman. Duality, rigidity and planar parallax. In Proceedings of the European Conference on Computer Vision, Freiburg, Germany, 1998. Springer, LNCS 1407.
[14] O.D. Faugeras. Three-Dimensional Computer Vision: A Geometric Viewpoint. MIT Press, 1993.
[15] O.D. Faugeras. Stratification of three-dimensional vision: projective, affine and metric representations. Journal of the Optical Society of America, 12(3):465–484, 1995.
[16] O. Faugeras and Q.T. Luong, with contributions from T. Papadopoulo. The Geometry of Multiple Images. MIT Press, 2001.
[17] O.D. Faugeras and B. Mourrain. On the geometry and algebra of the point and line correspondences between N images. In Proceedings of the International Conference on Computer Vision, Cambridge, MA, June 1995.
[18] O.D. Faugeras and T. Papadopoulo. Grassmann-Cayley algebra for modeling systems of cameras and the algebraic equations of the manifold of trifocal tensors. INRIA Rapport de recherche no. 3225, July 1997.
[19] A.W. Fitzgibbon and A. Zisserman. Multibody Structure and Motion: 3-D Reconstruction of Independently Moving Objects. In Proceedings of the European Conference on Computer Vision (ECCV), Dublin, Ireland, June 2000.
[20] W. Fulton and J. Harris. Representation Theory: A First Course. Springer-Verlag, New York, 1991.
[21] G.H. Golub and C.F. Van Loan. Matrix Computations, second edition. Johns Hopkins University Press, 1989, p. 582.
[22] M. Han and T. Kanade. Reconstruction of a Scene with Multiple Linearly Moving Objects. In Proc. of Computer Vision and Pattern Recognition, June 2000.
[23] M. Han and T. Kanade. Multiple motion scene reconstruction from uncalibrated views. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV '01), July 2001.
[24] R.I. Hartley. In Defense of the Eight-Point Algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(6), 1997.
[25] R.I. Hartley and A. Zisserman. Multiple View Geometry. Cambridge University Press, 2000.
[26] K. Huang, R. Fossum and Y. Ma. Generalized Rank Conditions in Multiple View Geometry with Applications to Dynamical Scenes. In Proceedings of the European Conference on Computer Vision (ECCV), Copenhagen, Denmark, May 2002.
[27] M. Irani. Multi-Frame Optical Flow Estimation Using Subspace Constraints. In IEEE International Conference on Computer Vision (ICCV), Corfu, September 1999.
[28] M. Irani and P. Anandan. Parallax geometry of pairs of points for 3D scene analysis. In Proceedings of the European Conference on Computer Vision, LNCS 1064, pages 17–30, Cambridge, UK, April 1996. Springer-Verlag.
[29] M. Irani, P. Anandan, and D. Weinshall. From reference frames to reference planes: Multiview parallax geometry and applications. In Proceedings of the European Conference on Computer Vision, Freiburg, Germany, 1998. Springer, LNCS 1407.
[30] M. Irani, B. Rousso, and S. Peleg. Computing Occluding and Transparent Motions. International Journal of Computer Vision, 12(1):5–16, January 1994.
[31] K. Kanatani. Motion Segmentation by Subspace Separation and Model Selection. In International Conference on Computer Vision (ICCV), Vancouver, Canada, July 2001.
[32] A. Levin and A. Shashua. Reconstruction of Dynamic 3D Motion from a Monocular Sequence of Infinitesimal Motion. Submitted to ICCV 2001.
[33] A. Levin, L. Wolf and A. Shashua. Time-varying Shape Tensors for Scenes with Multiply Moving Points. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Dec. 2001, Hawaii.
[34] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings IJCAI, pages 674–679, Vancouver, Canada, 1981.
[35] R.A. Manning and C.R. Dyer. Interpolating view and scene motion by dynamic view morphing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 388–394, Fort Collins, CO, June 1999.
[36] R.A. Manning and C.R. Dyer. Affine Calibration from Moving Objects. In The Eighth IEEE International Conference on Computer Vision (ICCV), June 2001.
[37] J.L. Mundy, A. Zisserman, D. Forsyth (Eds.). Applications of Invariance in Computer Vision. Springer-Verlag, Berlin Heidelberg, 1994.
[38] Open Source Computer Vision Library. http://www.intel.com/research/mrl/research/cvlib/
[39] P.J. Rousseeuw. Least Median of Squares Regression. Journal of the American Statistical Association, vol. 79, pp. 871–880, 1984.
[40] D. Segal and A. Shashua. 3D Reconstruction from Tangent-of-Sight Measurements of a Moving Object Seen from a Moving Camera. In Proc. of the European Conference on Computer Vision (ECCV), June 2000, Dublin, Ireland.
[41] A. Shashua. Trilinear tensor: The fundamental construct of multiple-view geometry and its applications. In G. Sommer and J.J. Koenderink, editors, Algebraic Frames For The Perception Action Cycle, number 1315 in Lecture Notes in Computer Science. Springer, 1997. Proceedings of the workshop held in Kiel, Germany, Sep. 1997.
[42] A. Shashua and S. Avidan. The rank 4 constraint in multiple view geometry. In Proceedings of the European Conference on Computer Vision, Cambridge, UK, April 1996.
[43] A. Shashua, S. Avidan and M. Werman. Trajectory Triangulation over Conic Sections. In International Conference on Computer Vision (ICCV), Sep. 1999.
[44] A. Shashua, R. Meshulam, L. Wolf, A. Levin and G. Kalai. On Representation Theory in Computer Vision Problems. Technical Report 2002-44, Leibniz Center for Research, School of Computer Science and Eng., The Hebrew University of Jerusalem, July 2002.
[45] A. Shashua and N. Navab. Relative affine structure: Canonical model for 3D from 2D geometry and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9):873–883, 1996.
[46] A. Shashua and L. Wolf. Homography tensors: On algebraic entities that represent three views of static or moving planar points. In Proceedings of the European Conference on Computer Vision (ECCV), Dublin, Ireland, June 2000.
[47] A. Shashua and L. Wolf. On the Structure and Properties of the Quadrifocal Tensor. In Proceedings of the European Conference on Computer Vision, Dublin, Ireland, June 2000.
[48] C.C. Slama, editor. Manual of Photogrammetry, Fourth Edition. American Society of Photogrammetry and Remote Sensing, Falls Church, Virginia, USA, 1980.
[49] M.E. Spetsakis and Y. Aloimonos. A Multi-frame Approach to Visual Motion Perception. International Journal of Computer Vision, pages 245–255, 1991.
[50] P. Sturm. Structure and Motion for Dynamic Scenes: The Case of Points Moving in Planes. In European Conference on Computer Vision (ECCV), May 2002.
[51] B. Sturmfels. Algorithms in Invariant Theory. Springer-Verlag, Wien/New York, 1993.
[52] P.H.S. Torr. Geometric motion segmentation and model selection. Phil. Trans. Roy. Soc. A, 356:1321–1340, 1998.
[53] R. Vidal, Y. Ma, S. Soatto, and S. Sastry. Two-view Multibody Structure from Motion. International Journal of Computer Vision, special issue on dynamic vision.
[54] D. Weinshall, M. Werman and A. Shashua. Duality of Multi-Point and Multi-Frame Geometry: Fundamental Shape Matrices and Tensors. In ECCV, April 1996.
[55] Y. Wexler and A. Shashua. On the synthesis of dynamic scenes from reference views. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, South Carolina, June 2000.
[56] L. Wolf and A. Shashua. Affine 3-D Reconstruction from Two Projective Images of Independently Translating Planes. In The Eighth IEEE International Conference on Computer Vision, July 2001.
[57] L. Wolf and A. Shashua. On Projection Matrices P^k → P^2, k = 3, 4, 5, 6, and their Applications in Computer Vision. International Journal of Computer Vision (IJCV), 48(1), 2002.
[58] L. Wolf and A. Shashua. Two-body Segmentation from Two Perspective Views. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Dec. 2001, Hawaii.
[59] L. Wolf, A. Shashua and Y. Wexler. Join Tensors: On 3D-to-3D Alignment of Dynamic Sets. In Proc. of the Int. Conf. on Pattern Recognition (ICPR), Sep. 2000, Barcelona, Spain.
[60] L. Wolf and A. Zomet. Sequence to Sequence Self Calibration. In European Conference on Computer Vision (ECCV), May 2002, Copenhagen, Denmark.
[61] L. Wolf and A. Zomet. Correspondence-free Synchronization and Reconstruction in a Non-rigid Scene. In Workshop on Vision and Modeling of Dynamic Scenes (with ECCV 2002).
[62] L. Zelnik-Manor and M. Irani. Degeneracies, Dependencies and their Implications in Multi-body and Multi-Sequence Factorizations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2003.