Dynamic Structure from Motion
using Uncalibrated Cameras and
Unsegmented Scenes
Thesis for the degree of
DOCTOR of PHILOSOPHY
by
Lior Wolf
SUBMITTED TO THE SENATE OF
THE HEBREW UNIVERSITY OF JERUSALEM
Aug 2003
This work has been carried out at the School of Computer Science and Engineering,
The Hebrew University of Jerusalem, Jerusalem, Israel, under the supervision of
Prof. Amnon Shashua
ACKNOWLEDGMENTS
First and foremost, I give special thanks to my advisor, Amnon Shashua. I was fortunate to work
under his supervision and to have the opportunity to learn not just from his wide knowledge in many
areas, but also from his choices and actions. I greatly enjoyed every single one of our meetings, and
consider myself very lucky and most honored to have been one of Amnon’s students.
I would like to thank Shmuel Peleg for inspiring me to work in the area of computer vision, and for
bringing me to the lab. I would like to thank the other professors at the lab: Daphna Weinshall, Mike
Werman and Yair Weiss for valuable advice.
I would like to thank Michal Irani, Andrew Zisserman, and Richard Hartley for supporting me and
appreciating my work. I would like to thank Peter Sturm and Yoram Singer for inspiring discussions.
I would like to thank Yoni Wexler, Anat Levin and Assaf Zomet for cooperating with me on several
projects. I would also like to thank Shay Avidan, Moshe Ben-Ezra, Yaron Caspi, Adiel Ben-Shalom and
Jeremy Kaminsky for sharing their experience and knowledge with me. I would like to thank many other
lab members for their help and friendship throughout the years.
Thanks to my parents for bearing with me during the busy, stressful times.
Finally, my gratitude to my wife Aya, for her endless love and support, and my sons Guy and Tom for
the happiness they give me.
Abstract

Much work has been done in the last decade by the Computer Vision community in understanding the
geometry of images of a rigid scene taken by a moving camera. The case of a scene containing motion
has been largely ignored. Since most video footage aims at capturing events, and hence motion, the need
for handling dynamic scenes has become apparent.
Our work deals with discovering the geometrical models and the mathematical tools that we can use
to analyze views of such scenes. In particular, we focus on the extraction of information about multiple
independently moving objects. Unlike previous work which at best tried to ignore such moving objects,
we show that valuable information can be extracted from such motions.
The main mathematical tools that we have used are projective algebra and multi-linear tensors. Pro-
jective algebra has long been used to model the process of imaging of rigid scenes. From this model
multiple views' invariants can be derived to describe, for example, stereo vision. It was shown that these
invariants can be described most generally by using multi-linear tensors. In our work, we further use
multi-linear tensors to model dynamic scenes.
In order to handle dynamic scenes we have often lifted the model of the scene to higher projective
spaces. In these higher spaces, we were able to derive novel multi-linear invariants. We then found ways
to decompose these invariants to unravel underlying information such as scene structure, motion in the
scene and the motion of the camera.
The research presented in this thesis appears in the following papers [46, 57, 33, 56, 58, 61]. These
papers are reprinted as chapters of this thesis:
Chapter 2 describes the recovery of structure and motion of a scene containing points moving along
coplanar lines. Appeared in ECCV 2000 - European Conference on Computer Vision [46].
Chapter 3 describes the recovery of structure and motion of a scene containing points in 3D moving
with constant velocity, and other scenarios. Appeared in the International Journal on Computer Vision (IJCV),
48(1), 2002 [57].
Chapter 4 describes the derivation of multilinear constraints used to recognize a moving object from a
single view. Appeared in CVPR 2001 - IEEE Conf. on Computer Vision and Pattern Recognition [33].
Chapter 5 describes the analysis of a scene containing multiple moving planes using the double
algebra. Appeared in ICCV 99 - International Conference on Computer Vision [56].
Chapter 6 describes the analysis of a scene containing two independently moving objects. Appeared
in CVPR 2001 - IEEE Conf. on Computer Vision and Pattern Recognition [58].
Chapter 7 describes the analysis of a dynamic scene viewed from two unknown cameras which are
held fixed relative to one another. Appeared in the Post-ECCV 2002 Workshop on Vision and Modeling of
Dynamic Scenes [61].
Chapter 8 describes the analysis of a scene viewed by 3D imaging devices. Some of it follows
our publication which appeared in ICPR 2001 - International Conference on Pattern Recognition [59],
offering more compact solutions, but most of it covers completely new scenarios.
Chapter 9 describes the use of tools from representation theory to derive a general solution for
many counting problems. These counting problems arise in the geometric analysis of both static and
dynamic scenes. This chapter, which was not published elsewhere, also contains a short introduction to
representation theory.
Contents
1 Introduction 1
1.1 Classical Structure from Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Projective spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Tensorial notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Extensors and the Join Operation . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Early approaches to dynamic structure from motion . . . . . . . . . . . . . . . . . . . . 5
1.3 Methods for dealing with independently moving points . . . . . . . . . . . . . . . . . . 7
1.4 Methods for dealing with multiple moving objects . . . . . . . . . . . . . . . . . . . . . 10
1.5 Unpublished chapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Homography Tensors 17
3 Projection Matrices from Pk to P2 33
4 Action indexing using dynamic shape tensors 49
5 A common transversal solution for independently translating planes 59
6 The segmentation matrix 69
7 Synchronization and reconstruction from fixed cameras viewing a dynamic scene 79
8 “3D to 3D” alignment 99
8.1 Derivation of Jtensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8.2 The Minimal Jtensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
8.3 The 3D Constant Velocity Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8.3.1 Decomposing the constant velocity tensor . . . . . . . . . . . . . . . . . . . . . 106
8.4 The Translating Lines Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
9 Counting Problems for Multilinear Constraints 111
9.1 A Representation Theory Digest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
9.2 The 8-point Shape Tensor Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
9.3 Dynamic Pn → Pn Mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
9.4 The Structure of V(n, m, k) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
9.4.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
10 Conclusions 125
10.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
10.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Chapter 1
Introduction
This introduction includes an overview of the evolution of the structure from motion (SFM) analysis
of dynamic scenes. The field of SFM has matured greatly over the last decade, but most of the work was
confined to static scenes, or to known camera motion. Motion in the scene was mostly treated as noise,
and was usually ignored. The work done in the field of dynamic structure from motion does more than
just provide the tools for analyzing dynamic scenes - it also exploits the motion in the scene in order to
extract further information about the world. Thus, not only is the motion not treated as a disturbing
element, it is also turned into an advantage.
1.1 Classical Structure from Motion
The aim of this research is to extend the field of the classical structure from motion (SFM) to dynamic
scenes. Classical SFM [25, 14] deals with static scenes viewed from moving cameras and its goal is to
recover the scene’s structure (reconstruction) and the camera ego-motion (inverse reconstruction). Our
goal is to deal with cases where both the cameras and the objects in the scene move and answer similar
questions.
The remarkable thing is that these new questions are solved using the same tools that have been
traditionally used in SFM, namely: projective spaces, tensorial notation, and the double algebra.
Projective spaces have been very successful in representing the geometry of the imaging process. In
the imaging process a 3D point in the world (X, Y, Z)^T is mapped to a 2D point on the image plane
(x, y)^T. Working with the pin-hole camera model, and assuming that the camera axes are aligned with
the world coordinate system, and that the image plane has been transformed to a "standard" coordinate
system, the imaging mapping is simply x = X/Z, y = Y/Z.
The ratio in the above formulas causes the problem to be non-linear in nature and hence difficult to
handle. The use of projective algebra provides ways not only to make the imaging mapping linear, but
also to represent large families of transformations of the world coordinate system or of the image plane
as linear mappings as well.
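As a small numerical illustration (my own sketch, not thesis code), homogeneous coordinates turn the ratio-based imaging map into a linear one: the world point is lifted to a 4-vector and a 3 x 4 matrix does the rest, up to scale.

```python
import numpy as np

# A world point (X, Y, Z) is lifted to (X, Y, Z, 1) in P^3; a 3x4 camera
# matrix M then maps it LINEARLY to a homogeneous image point in P^2.
M = np.array([[1.0, 0, 0, 0],        # the "standard" camera [I | 0]
              [0, 1.0, 0, 0],
              [0, 0, 1.0, 0]])

P = np.array([2.0, 4.0, 8.0, 1.0])   # world point (X, Y, Z) = (2, 4, 8)
p = M @ P                            # homogeneous image point, up to scale
x, y = p[0] / p[2], p[1] / p[2]      # back to inhomogeneous coordinates
print(x, y)                          # prints: 0.25 0.5, i.e. X/Z and Y/Z
```

The division is deferred to a final dehomogenization step, so everything before it (including changes of world or image coordinates) stays linear.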
Our main tools in the study of these projective spaces are tensors. These tensors are a generalization
of matrices: every entry in a matrix has two indices, whereas in a tensor it can have any number of
indices. We use the strength of tensorial notation to separate the measurements from the ego-motion
in order to build constraints. We then use these constraints to recover the ego-motion and then achieve
reconstruction.
Another tool which is used is the double algebra. The double algebra is used to describe linear
subspaces as single objects. The term extensor is used to describe a linear space spanned by several
points. A point will be an extensor of step 1, a line an extensor of step 2, and an extensor of step 3 will be
referred to as a plane. Hyper-planes are extensors of step k in P^k.
1.1.1 Projective spaces
We will be working with projective spaces, P^k. A point in P^k is defined by k + 1 numbers, not all
zero, that form a coordinate vector defined up to a scale factor. The dual projective space is defined as
the space of hyper-planes and is also represented by k + 1 numbers. A point p in a projective space
is said to coincide with a hyper-plane s if and only if p^T s = 0, i.e., their scalar (dot) product vanishes. In
other words, the set of hyper-planes coincident with the point p is represented by the coordinate vectors
s that satisfy p^T s = 0, and vice versa: a point represented by the coordinate vector p can be thought of
as the set of hyper-planes through it.
In the projective space P^k, any k + 2 points in general position can be uniquely mapped to any other
k + 2 points in the same projective space. Such a mapping is called a collineation and is represented by
a (k + 1) × (k + 1) invertible matrix, defined up to scale. A collineation is defined by k + 2 pairs of
matching points, each pair providing k linear constraints on the entries of the collineation matrix.
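For P^2 this count can be checked directly with a direct-linear-transform sketch (a standard construction, added here for illustration): k + 2 = 4 point pairs, each contributing k = 2 linear constraints from the vanishing cross product q x (Hp) = 0, determine the 3 x 3 collineation up to scale.

```python
import numpy as np

# DLT estimation of a P^2 collineation (homography) from 4 point pairs.
# Each pair (x, y) -> (u, v) contributes two rows of a linear system in
# the 9 entries of H; the null vector of the 8x9 system is H up to scale.
def fit_homography(src, dst):
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.array(rows, dtype=float))
    return Vt[-1].reshape(3, 3)          # null vector = collineation entries

src = [(0, 0), (1, 0), (0, 1), (1, 1)]
dst = [(0, 0), (2, 0), (0, 2), (2, 2)]   # the points scaled by 2
H = fit_homography(src, dst)
H = H / H[2, 2]                          # fix the free scale
print(np.round(H, 6))                    # ~ diag(2, 2, 1)
```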
A linear mapping from one projective space P^k to another projective space P^l is given by an (l + 1) ×
(k + 1) projection matrix. For example, a projection matrix from P^3 to P^2 is given by a 3 × 4 matrix; this
specific projection matrix is also known as the camera matrix and is used to model the process of imaging.
Any matching between a point in P^k and a point in P^l provides l constraints on the projection matrix
between these spaces.
In this work, the term center of a projection matrix refers to the null space of the projection
matrix. If the projection matrix is from P^k to P^l then the center of the projection matrix is the
rank k − l linear subspace (or, using another terminology, the extensor of step k − l, see below) in P^k
which is mapped by this projection matrix to zero.
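For the familiar P^3 -> P^2 camera this center is a single point, the camera center, which can be read off as the null vector of the matrix. A small sketch (illustrative values, not from the thesis):

```python
import numpy as np

# The center of a 3x4 projection matrix is its one-dimensional null
# space: the unique point of P^3 that is mapped to zero (has no image).
M = np.array([[1.0, 0, 0, -2],      # a camera placed at (2, 3, 4)
              [0, 1.0, 0, -3],
              [0, 0, 1.0, -4]])
_, _, Vt = np.linalg.svd(M)
center = Vt[-1]                     # null vector of M
center = center / center[3]         # normalize the homogeneous scale
print(np.round(center, 6))          # [2. 3. 4. 1.], and M @ center = 0
```

For a general P^k -> P^l matrix the same SVD yields a (k − l)-dimensional null space rather than a single vector.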
1.1.2 Tensorial notations
It is often more convenient to use tensor notations to represent linear operations. In these notations
the coordinates of a point are specified with superscripts, i.e., in P^2, p^i = (p^1, p^2, p^3). These are called
contravariant vectors. A hyper-plane in P^k is called a covariant vector and is represented by subscripts,
i.e., in P^2, s_j = (s_1, s_2, s_3). Indices repeated in covariant and contravariant forms are summed over, i.e.,
p^i s_i = p^1 s_1 + p^2 s_2 + p^3 s_3. This is known as a contraction. For example, if p is a point incident to a line
s in P^2, then p^i s_i = 0.
Vectors are also termed 1-valence tensors. 2-valence tensors (matrices) have two indices and the
transformation they represent depends on the covariant-contravariant positioning of the indices. For
example, a_i^j is a mapping from points to points (a collineation, for example), and from hyper-planes
(lines in P^2) to hyper-planes, since a_i^j p^i = q^j and a_i^j s_j = r_i (in matrix form: Ap = q and A^T s = r);
a_ij maps points to hyper-planes; and a^ij maps hyper-planes to points. When viewed as a matrix the row
and column positions are determined accordingly: in a_i^j and a_ij the index i runs over the columns and
j runs over the rows, thus b_j^k a_i^j = c_i^k is BA = C in matrix form. An outer-product of two 1-valence
tensors (vectors), a^i b_j, is a 2-valence tensor c_i^j whose entries are a^i b_j; note that in matrix form
C = ba^T. A 3-valence tensor has three indices, say H_i^jk. The positioning of the indices reveals the
geometric nature of the mapping: for example, p^i s_j H_i^jk must be a point because the i, j indices drop out
in the contraction process and we are left with a contravariant vector (the index k is a superscript). Thus,
H_i^jk maps a point in the first coordinate frame and a line in the second coordinate frame into a point
in the third coordinate frame. The trifocal tensor of multiple-view geometry is an example of such a
tensor. A single contraction, say p^i H_i^jk, of a 3-valence tensor leaves us with a matrix. Note that when p
is (1, 0, 0), (0, 1, 0), or (0, 0, 1) the result is a "slice" of the tensor.
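These contractions map directly onto `numpy.einsum`; the following sketch (my illustration, with a random tensor standing in for a trifocal tensor) checks both the point-and-line contraction and the slice property.

```python
import numpy as np

# A 3-valence tensor H_i^{jk} contracted with a point p^i and a line s_j
# leaves a single contravariant index k: a point in the third frame.
rng = np.random.default_rng(0)
H = rng.standard_normal((3, 3, 3))    # axis 0 = i, axis 1 = j, axis 2 = k
p = rng.standard_normal(3)            # point p^i
s = rng.standard_normal(3)            # line s_j
q = np.einsum('ijk,i,j->k', H, p, s)  # double contraction -> point q^k

# A single contraction leaves a matrix, a linear combination of the
# three slices H[0], H[1], H[2] of the tensor:
G = np.einsum('ijk,i->jk', H, p)
print(np.allclose(G, p[0]*H[0] + p[1]*H[1] + p[2]*H[2]))   # True
```

Contracting the remaining matrix with s recovers q, since the contractions can be performed in any order.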
1.1.3 Extensors and the Join Operation
The mathematical component of our work deals with intersecting and joining subspaces for the pur-
pose of finding common transversals in the 8-dimensional projective space P^8. A convenient way
to do so is to treat a k-dimensional subspace as a single object (instead of as a collection of k basis
vectors), which is done using Grassmann coordinates, also known as an extensor of step k. Generally, the
algebra of extensors with the operations of intersection ("meet") and union ("join") is also known as the
double algebra or Grassmann-Cayley algebra. These were first introduced in the context of multiple-
view geometry by [7, 18, 17] and also in the context of projection matrices P^k → P^2 [57]. A concise
introduction to extensors and the operations of meet and join can be found in [51, 3].
An extensor of step k describes a subspace of dimension k of some n-dimensional vector space V.
All extensors of step k lie in the linear space ∧^k(V), which is of dimension (n choose k). The join
operator (∨) is a multilinear antisymmetric operator that takes two extensors of steps j and k and
produces an extensor of step j + k. The join extensor is associated with the direct sum of the linear
spaces associated with the two extensors. This join extensor vanishes if the two generating extensors
intersect. If e_1, e_2, ..., e_n is a basis of V then the basis for ∧^k(V) is given by the (n choose k)
basis elements:

{e_j1 ∨ e_j2 ∨ ... ∨ e_jk | 1 ≤ j1 < ... < jk ≤ n}
Let A = span{a_1, ..., a_k} be a k-dimensional subspace of V where a_1, ..., a_k is some choice of basis.
The step-k extensor A = a_1 ∨ ... ∨ a_k, also denoted by A = a_1 a_2 ... a_k, is an element of the vector space
∧^k(V):

A = Σ_{1 ≤ j1 < ... < jk ≤ n} A_{j1,...,jk} e_j1 ∨ ... ∨ e_jk

where the scalars A_{j1,...,jk} are the k × k minors:

A_{j1,...,jk} = det | a_{1,j1} a_{1,j2} ... a_{1,jk} ; a_{2,j1} a_{2,j2} ... a_{2,jk} ; ... ; a_{k,j1} a_{k,j2} ... a_{k,jk} |

Thus the extensor A has (n choose k) coefficients (the choices of k × k minors from the k × n matrix whose rows
consist of a_1, ..., a_k). The extensor A represents the subspace A, as we note that

A = {u ∈ V | A ∨ u = 0}

(all (k+1) × (k+1) minors vanish, therefore u ∈ span{a_1, ..., a_k}), while on the other hand the determinant
expansions are invariant to a change of basis of A.
Let A = a_1 ... a_k and B = b_1 ... b_j be extensors of steps k, j representing subspaces A, B, with k + j ≤
n. Then A ∨ B = a_1 ... a_k b_1 ... b_j is non-zero (at least one coefficient does not vanish) iff the set
a_1, ..., a_k, b_1, ..., b_j is linearly independent (i.e., A ∩ B = {0}). In this case,

A + B = A ∨ B = span{a_1, ..., a_k, b_1, ..., b_j}

Thus, the algebraic join of extensors corresponds to the geometric join of linear subspaces. Con-
versely, in case k + j > n the subspaces A, B always have a non-vanishing intersection: a (k + j − n)-
dimensional linear space. Thus, it is possible to define a "meet" operation A ∧ B which is a linear
combination of extensors of step k + j − n.
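A hedged sketch of the coefficient/join machinery (my own illustration, using a line and a point in P^3): Grassmann coordinates are the k x k minors of the spanning matrix, the join is the extensor of the stacked matrices, and the join vanishes exactly when the two subspaces intersect.

```python
import numpy as np
from itertools import combinations

# Grassmann coordinates of a step-k extensor: all k x k minors of the
# k x n matrix of spanning vectors, in lexicographic column order.
def extensor(rows):
    A = np.asarray(rows, dtype=float)
    k, n = A.shape
    return np.array([np.linalg.det(A[:, cols])
                     for cols in combinations(range(n), k)])

a1, a2 = [1, 0, 0, 0], [0, 1, 0, 0]     # a 2-subspace of R^4 (a line in P^3)
b = [0, 0, 1, 0]                        # a 1-subspace (a point in P^3)

join = extensor([a1, a2, b])            # step-3 extensor: the joined plane
print(np.any(np.abs(join) > 1e-9))      # True: the point is not on the line

on_line = extensor([a1, a2, [3, 4, 0, 0]])   # joining with a point ON the line
print(np.allclose(on_line, 0))          # True: the join vanishes
```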
1.2 Early approaches to dynamic structure from motion
Almost every system which computes structure from motion has to deal with motion in the scene.
However, the existing SFM techniques were designed for static scenes. A dominant approach to dealing
with scene motion (e.g. [30]) is to try to separate the static background from the moving objects. The
camera motion (ego-motion) is recovered using the background, the images are registered, and the mov-
ing objects can be segmented out. However, this object/background segmentation is a difficult task by
itself, and solutions are usually based on the dominance of the background. The tools we develop in this
work are aimed at treating moving points and static points alike, i.e., at working with unsegmented scenes. One of
the advantages of such an approach is the ability to work with scenes containing no dominant regions, or
even no static or rigid regions.
A more systematic alternative to the approach described in this thesis is the work done by Torr [52].
In that work the case of multiple moving objects is considered. Using a sampling algorithm (such as
RANSAC) and model fitting techniques, several models are fit to explain the motion of feature
points. This approach has the advantage that in principle the number of moving objects and their com-
plexity (they can be either degenerate, e.g. planar, or not) need not be specified in advance. In this work we
present analytical solutions, rather than statistical solutions, to deal with dynamic scenes. This prefer-
ence enables us to treat, for example, the case of points moving independently, while not being clustered
into rigid objects.
Some analytical solutions were suggested in the past for the case of a dynamic scene containing sev-
eral rigid objects moving independently. The factorization-based motion segmentation of Costeira and
Kanade [12] is applicable to the affine (parallel projection) camera model. There it was shown that the
measurement matrix of all points across a sequence of images lies in a linear subspace whose dimension
is determined by the number of independent bodies. The motion of each body lies in a separate subspace,
and the data can be rearranged in order to separate the subspaces (see also [31]).
In contrast to the factorization based approaches, in our work we assume a full projective model (but
also address the affine model). Moreover, factorization based approaches need more than the minimal
number of views, requiring point tracks to be maintained over many frames. For example, in the case
of two independently moving rigid objects we require 2 views, even for the projective camera case.
The factorization approach would have required at least four views, and usually many more views are
required in order to get stable results.
In parallel to our work, Fitzgibbon and Zisserman [19] also demonstrated that some benefits arise from
considering a dynamic scene. They addressed the situation of several segmented independently moving
objects, and showed how to combine constraints on the cameras' internal parameters arising from several
objects. The solutions given in that work were nonlinear minimizations. In our work we usually ignore
the problem of self calibration. We usually perform reconstructions only up to a projective reconstruc-
tion. In some cases, such as the constant velocity case, we perform an Affine reconstruction, from which
a Euclidean reconstruction is readily obtained [25]. This enables us to propose solutions which are "lin-
ear" in nature (i.e., do not require solving nonlinear systems of equations). Nonlinear minimization
is performed only to handle noise.
1.3 Methods for dealing with independently moving points
The first work to deal with the case of points moving independently in space, and not just of points
spread among several rigid bodies, was done by Avidan and Shashua [2]. This work and those that
followed [43, 40] considered the case they call "trajectory triangulation". In SFM, the term triangulation
refers to the recovery of the locations of 3D points from the image measurements in the case where
the camera parameters are known ("the calibrated case"). In trajectory triangulation, the point in 3D is
allowed to move along some parametric path, for example a line or a conic section. Note that every
single image contains information from only one instant in time. Without adding constraints on the type of
motion, the problem of trajectory triangulation is inherently ill-posed.
In [2], the trajectory of each point is linear. Each trajectory line is represented by its Plucker
coordinates. Given the original 3 × 4 camera projection matrices M(i) it is possible to build 3 × 6
line projection matrices M̃(i) which project each 3D line L to a 2D image line l, according to l ≅ M̃L
[17]. The three rows of M̃(i) are the result of the "meet" [3] operation on pairs of rows of the original
3 × 4 camera projection matrix, i.e., each row of M̃ represents the line of intersection of the two planes
represented by the corresponding rows of M.
We can try to extend the results of the trajectory triangulation scheme to the uncalibrated case in a
straight-forward way. Let P be the moving point along the straight line L such that in the j'th view we
observe the projection p_j of P. The point p_j has to be on the image of the trajectory line. Thus,
p_j^T M̃(j) L = 0 for all views of P, where M̃(j) is the 3 × 6 line projection matrix of the j'th view.
The determinant of the 6 × 6 matrix whose rows are p_j^T M̃(j) must
vanish. This determinant is a multilinear expression in the measurements p_j, and can be expressed as a
tensor. The resulting tensor has 3^6 = 729 elements and thus would require 728 matching points across 6 views
in order to obtain a linear solution. Naturally, this situation is unwieldy application-wise. Considering
even more complex types of trajectories, such as conic trajectories, gives rise to even less tractable solutions.
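The vanishing-determinant constraint can be verified numerically. In this sketch (my own construction: the 3 x 6 line-projection matrix is built by pushing the six basis Plucker matrices through the identity [l]_x ~ M L M^T, rather than by the meet operation), a point moves along a 3D line while six random cameras observe it, and the 6 x 6 matrix of rows p_j^T M̃(j) turns out rank deficient.

```python
import numpy as np

# Build the 3x6 line-projection matrix of a camera M by mapping each of
# the six basis Plucker matrices E_ab to its image line.
def line_projection(M):
    basis = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    cols = []
    for (a, b) in basis:
        E = np.zeros((4, 4)); E[a, b], E[b, a] = 1.0, -1.0
        lx = M @ E @ M.T
        cols.append([lx[2, 1], lx[0, 2], lx[1, 0]])
    return np.array(cols).T                  # 3 x 6

def plucker(A, B):                           # Plucker 6-vector of line (A, B)
    L = np.outer(A, B) - np.outer(B, A)
    return np.array([L[0,1], L[0,2], L[0,3], L[1,2], L[1,3], L[2,3]])

rng = np.random.default_rng(2)
A = np.array([0.0, 1.0, 2.0, 1.0])           # trajectory line A + t(B - A)
B = np.array([1.0, 1.0, 1.0, 1.0])
rows = []
for t in range(6):                           # six views; the point moves each frame
    M = rng.standard_normal((3, 4))
    p = M @ (A + t * (B - A))                # image of the moving point at time t
    rows.append(p @ line_projection(M))

s = np.linalg.svd(np.array(rows), compute_uv=False)
print(s[-1] / s[0] < 1e-9)                   # True: the 6x6 matrix is singular
```

The common null vector of the rows is exactly the Plucker vector of the trajectory line, which is what trajectory triangulation recovers.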
We deal with this situation by adding more constraints to the nature of the motion. Sometimes we
constrain the scene to be planar, sometimes we constrain the trajectories to be parallel, and sometimes
we constrain the motion to be of constant velocity.
A solution for linear trajectories in a planar scene with unknown homographies is given in [46] and
is described in chapter 2. Similarly to other work presented in this thesis, the major effort is put not on
the derivation of the tensorial constraint, but on its analysis. The major question is how to recover the
camera motion from this tensor. Another question is how general the points and their motion must be
in order to recover the tensor without ambiguity.
Interestingly, the study of the tensor associated with linear planar trajectories is dual to the study of the
tensor associated with slices of the quadrifocal tensor [47]. The latter tensor is the result of contracting
the quadrifocal tensor with a single line, and represents the situation where three lines intersect in one
point. The planar trajectory triangulation tensor represents the situation where three points lie on one
line. These dual situations are projective situations, i.e., they are invariant to a projective transformation
of the coordinate system [37].
In chapter 3 we describe work which incorporated Affine constraints into the motion of the moving
points. One type of constraint is the constant velocity constraint; another is the pure translation constraint.
Combined with other projective constraints, and applied in both 2D and 3D settings, a wealth of tensors
are described.
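The constant velocity case illustrates the lifting idea of chapter 3 concretely. In this sketch (my own toy example of the construction, not thesis code), a point X(t) = X0 + tV seen by a static camera is equivalent to a static point (X0; V; 1) in P^6 seen through a time-varying 3 x 7 projection matrix:

```python
import numpy as np

# A constant-velocity point under a static camera M = [A | a] satisfies
#   M (X0 + tV; 1) = A X0 + t A V + a = [A | t*A | a] (X0; V; 1),
# i.e. a STATIC lifted point in P^6 under a 3x7 projection per frame.
rng = np.random.default_rng(3)
M = rng.standard_normal((3, 4))
A_, a = M[:, :3], M[:, 3]
X0 = np.array([1.0, 2.0, 3.0])            # initial position
V = np.array([0.1, -0.2, 0.3])            # constant velocity

lifted = np.concatenate([X0, V, [1.0]])   # static point in P^6
for t in range(4):
    Mt = np.hstack([A_, t * A_, a[:, None]])   # lifted 3x7 projection
    direct = M @ np.append(X0 + t * V, 1.0)    # ordinary imaging at time t
    print(np.allclose(Mt @ lifted, direct))    # True at every t
```

Once the problem is written this way, the machinery of P^k -> P^2 projection matrices applies, and because the constant-velocity assumption is an Affine invariant, the recovered structure is Affine rather than merely projective.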
In order to handle dynamic scenes we have often lifted the model of the scene to higher projective
spaces. In these higher spaces, we were able to derive novel multi-linear constraints. The tensor defini-
tion and recovery is just the first stage. We then analyze the structure of each tensor to get a projective
reconstruction in the higher projective space. We then further analyze the tensors to unravel the under-
lying information such as scene structure, motion in the scene and the motion of the camera. Since we
start with Affine invariants such as constant velocity, we end up with an Affine structure, and not merely
a projective structure.
After modelling the problem as a projection from one projective space to another, the derivation of
the underlying tensors is automatic. However, the next stages of decomposing the tensors are the most
demanding stages. The first stage of obtaining the projective structure is independent of the underlying
problem, and depends only on the projective spaces at hand (except for some degenerate cases which
are a result of a specific modelling). Nevertheless, no method exists for automatically achieving this
decomposition. We have developed some tools for handling these decompositions, such as the “joint
epipoles”, which we believe are applicable to a wide range of such tasks, but the general method is left
for further research.
In classic SFM, besides the multiple view invariants (the tensors), there is what is sometimes con-
sidered to be their dual: multi-point tensors, called the "shape tensors" [54]. In chapter 4 we show
how to find shape tensors for any kind of projection matrix, and use those to index single images ac-
cording to the action being photographed. The idea is to represent an action (e.g. "sitting", "walking")
as a combination of trajectories of different parts of the body. Given an image, we check whether the
configuration of body parts in it matches one of the models for which we have built dynamic shape
tensors.
In parallel to our work, Han and Kanade proposed solutions for the constant velocity case using factor-
ization. They first proposed a solution for the Affine projection model [22], and then for the projective
camera model [23]. Both solutions are based on factorization, which has the limitations mentioned
above. Also, the factorization for the projective case is not guaranteed to converge to the correct solution.
These methods require more views than the minimal number, but have the advantage that they
provide a way to incorporate information from many views at once.
Following our work, Wexler and Shashua [55] used the homography tensors presented in chapter 2
to synthesize a new view of a dynamic scene. Their work assumes constant velocity and could take
advantage of the results presented in chapter 3. Levin and Shashua [32] considered the infinitesimal
motion model as the camera model to derive similar results for linear trajectories.
In [26], the results of chapter 3 are rederived using a method which can be considered a descendant
of the relative affine framework [45]. In our view their framework is much less intuitive as a general-
ization of the classical SFM techniques. For example, in their proposed framework the center of the
projection matrix is always a point, whereas in our formalization it is the null space of the projection
matrix. The authors claim to generalize and complete our results, but in fact just show several new
examples of tensors, ignoring both the intractability of large tensors and the problem of decomposing
those tensors.
1.4 Methods for dealing with multiple moving objects
As stated above, the majority of previous work which handled dynamic scenes has focused on multiple
moving objects rather than on independently moving points. These previous approaches were either
sampling based or limited to the Affine projection model.
In chapter 3 we described, using the framework of lifting the problem to a higher projective space,
some solutions to handle multiple moving objects. These solutions were confined to the case where the
relative motion between the objects was a pure translation. In chapter 5 we continue to study the pure
translation case.
Consider two views of a scene containing multiple moving bodies. Each moving body is associated
with its own fundamental matrix. We show that if the bodies move relative to one another by pure translation,
all these fundamental matrices reside in a 3-dimensional subspace of R^9. We seek to generalize
the classic result that two homographies associated with two planar parts of the scene are sufficient to re-
cover the static fundamental matrix. We show that five homographies associated with five planar bodies
are sufficient for the recovery of the 3-dimensional subspace mentioned above. Once it is recovered we
are able to recover the fundamental matrix associated with each of the views as well as the homography
at infinity (hence we achieve an Affine reconstruction of the dynamic scene).
We solve this problem by associating with each homography the subspace of the fundamental matrices
which conform to this homography. Each homography provides 6 linear constraints on the elements
of the fundamental matrix, hence the dimension of this linear subspace is also 3. All the subspaces
associated with the homographies arising from one dynamic pure translation scene have to intersect the
subspace of all possible fundamental matrices of the scene (this last subspace is in fact the subspace of
fundamental matrices which conform to the homography at infinity).
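The 6-constraints-per-homography count can be checked with standard two-view facts (this sketch is my illustration, not thesis code): a fundamental matrix F conforms to a plane homography H iff H^T F is antisymmetric, and requiring the symmetric part of H^T F to vanish gives 6 linear equations on the 9 entries of F, leaving a 3-dimensional subspace.

```python
import numpy as np

def skew(t):
    return np.array([[0., -t[2], t[1]], [t[2], 0., -t[0]], [-t[1], t[0], 0.]])

rng = np.random.default_rng(4)
t = np.array([1.0, 2.0, 3.0])                       # translation between views
R = np.linalg.qr(rng.standard_normal((3, 3)))[0]    # rotation between views
n, d = np.array([0.0, 0.0, 1.0]), 5.0               # a world plane n^T X = d
H = R + np.outer(t, n) / d                          # homography induced by the plane
F = skew(t) @ R                                     # the true fundamental matrix
print(np.allclose(H.T @ F + F.T @ H, 0))            # True: F conforms to H

# The 6 equations sym(H^T F) = 0 as a linear system in the 9 entries of F:
rows = []
for i in range(3):
    for j in range(i, 3):
        r = np.zeros(9)
        for k in range(3):
            r[3*k + j] += H[k, i]                   # coefficient of F[k, j]
            r[3*k + i] += H[k, j]                   # coefficient of F[k, i]
        rows.append(r)
C = np.array(rows)                                  # 6 x 9 constraint matrix
print(np.linalg.matrix_rank(C), np.allclose(C @ F.ravel(), 0))   # 6 True
```

A 9 − 6 = 3 dimensional null space remains, matching the dimension count in the text.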
Conceptually, this problem of finding a linear subspace which intersects several linear subspaces is
a general form of the trajectory triangulation described above. We solve this problem using the double
algebra. This work is unique in the sense that, until now, most of the work in computer vision which
used the double algebra used it to derive results for which other proofs already existed. In this work the double
algebra is not just an elegant tool, but also an inherent part of the solution.
The analytic approach we use to handle dynamic scenes is not limited to scenes with pure translation.
In chapter 6 we describe a solution to the two-view multibody problem, where the motion between the
two views can be general.
As mentioned above, the multibody problem was the first problem to be considered in the field of
dynamic SFM. Solutions were either sampling based, or constrained to the Affine projection model. In
chapter 6 we propose a different approach. Each body is associated with a different invariant (funda-
mental matrix). Given image measurements of a point in one image and of a corresponding point in a
second image, we know that one of the two invariants must vanish. Since we do not know which one of
these vanishes for each point (this is the segmentation problem) we just multiply the two invariants. The
product has to vanish for points on both bodies.
This simple scheme bears a difficulty: each original invariant was linear in each one of the point mea-
surements. The product invariant is bilinear in each one of these points. We handle this by representing
the invariant as being linear in the second order monomials of the point measurements. We then show
how to decompose this representation to the fundamental matrices of each body.
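The lifting step can be illustrated numerically: the product of two epipolar residuals is exactly a bilinear form in the second-order monomials of the two points, with the 9×9 coefficient matrix given by a Kronecker product. The following sketch is illustrative only; F1 and F2 are random stand-ins for the two bodies' fundamental matrices, and the variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_fundamental(rng):
    # A random rank-2 matrix, standing in for a fundamental matrix.
    U, _, Vt = np.linalg.svd(rng.standard_normal((3, 3)))
    return U @ np.diag([1.0, 0.5, 0.0]) @ Vt

F1, F2 = random_fundamental(rng), random_fundamental(rng)

# Random homogeneous image points (no epipolar constraint assumed).
p  = rng.standard_normal(3)   # point in the first image
pp = rng.standard_normal(3)   # corresponding point in the second image

# Product of the two epipolar residuals ...
product = (pp @ F1 @ p) * (pp @ F2 @ p)

# ... equals a single form, linear in the second-order monomials:
# (p' (x) p')^T (F1 (x) F2) (p (x) p)
S = np.kron(F1, F2)           # 9x9 "product" matrix
lifted = np.kron(pp, pp) @ S @ np.kron(p, p)

assert np.isclose(product, lifted)
```

Because the lifted form is linear in the unknown 9×9 matrix, each point match contributes one linear equation on it, which is what makes a linear estimation scheme possible.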
Although the new invariant has many desirable properties, such as using point measurements from
both bodies at once and insensitivity to degenerate bodies, its dependence on the second-order monomials
makes it less stable to compute than single fundamental matrices. To overcome this problem, we
suggest a nonlinear minimization technique for its computation.
All of the work described in this thesis was done using the projective camera model, except for the
work described in chapter 7. In this work we use the Affine camera model to derive our results, and just
briefly describe how to generalize this to the projective camera model.
A 3D reconstruction of a dynamic non-rigid scene from features in two cameras usually requires
synchronization and correspondences between the cameras. These may be hard to achieve due to occlusions,
a wide baseline, different zoom scales, etc. In chapter 7 we present an algorithm for reconstructing
a non-rigid scene from sequences acquired by two uncalibrated, non-synchronized, fixed Affine cameras.
This algorithm assumes that (possibly) different points are tracked in the two sequences. The only
constraint used to relate the two cameras is that every 3D point tracked in one sequence can be described
as a linear combination of some of the 3D points tracked in the other sequence. This constraint lies
somewhere between the independently moving points problem presented in the previous section and
the multiple rigid bodies problem presented in this section.
We present algorithms for synchronizing the two sequences and reconstructing the 3D points tracked
in both views. Outlier points are automatically detected and discarded. The algorithm can also handle
both 3D objects and planar objects in a unified framework without the need for model selection.
Following the work of Irani [27], we were able to derive a "direct method" version of our synchronization
algorithm. This version does not use point tracks as its input; instead it uses only gray-level
measurements from the images. By avoiding the point-tracking stage, we were able to suggest a very
simple technique for action indexing.
In parallel to the work presented in chapter 5, Manning and Dyer [36] presented a solution for a
simplified scenario. In their work it is assumed that two fundamental matrices of the dynamic scene have
already been recovered, and what remains is to solve for the homography at infinity.
Following the work described in chapter 6, Vidal et al. [53] described solutions for more than two
rigid bodies viewed by two cameras. Although this generalization is possible, the resulting invariants are
intractable.
Bartoli [4] used the idea presented in chapter 6 to combine two types of constraints arising from the
dynamic and static parts of a combined scene. The dynamic part consists of points moving along planar
lines all intersecting at a point. This scenario can be modelled as a projection from P^3 to P^2, hence the
resulting invariant is similar to the fundamental matrix.
Caspi and Irani [11] proposed a different solution for the problem of synchronizing two fixed cameras
viewing a dynamic scene. Their solution assumes the existence of a point that is tracked in
both sequences. Starting with a small number of point tracks, they search over all pairs of possible
matchings to find the best matching pair of tracks, and use its measurements across time to compute the
fundamental matrix.
Zelnik-Manor and Irani [62] revisited the setting proposed in chapter 7 as part of a comprehensive
study of rank conditions on measurement matrices. Using the more restrictive assumption that the same
points are tracked by both cameras, they were able to obtain very accurate synchronization results with
a similar algorithm. Under their assumption the whole reconstruction problem becomes a matching
problem, and using a brute-force search they were able to show that this matching is stable. Using the
reprojection method we describe in chapter 7, this search can be avoided.
1.5 Unpublished chapters
In chapters 2-7 we focus on scenes captured by 2D images. In chapter 8 we consider scenes which
have been captured by 3D imaging devices (e.g., structured-light systems, stereo systems). We consider
scenarios where points move along straight lines, similarly to chapter 2; points which move with constant
velocity, similarly to chapter 6; and 3D lines which move in pure translation. The results obtained follow
the same use of tools as the rest of the thesis, and this unpublished chapter is rather technical in
nature.
Chapter 9, which follows, is theoretical. In this chapter, which was developed with the help of Prof.
Roy Meshulam (The Technion) and Prof. Gil Kalai (The Hebrew University), we use representation
theory as a tool for solving "counting questions". These questions appear whenever we resort to
multi-linear algebra to derive invariants. Examples of such questions are: "How many points in
general position are needed to solve linearly for the fundamental matrix?", "How many constraints on
the elements of the trifocal tensor can we obtain using images of points lying on one plane in 3D?",
"What is the rank of the estimation matrix whose rows are the outer products of three images of a point
in 3D?".
In the past, these questions were solved one at a time. As we began to present more and more tensors,
each with its own counting problems, the need for a general tool for solving them became apparent.
In chapter 9 we approach one such abstract question. Its general solution is shown to
have applications to the analysis of constraints arising from both static scenes and dynamic scenes.
Chapter 2
Homography Tensors
Homography Tensors: On Algebraic Entities That Represent Three Views of Static or Moving Planar Points
Amnon Shashua, Lior Wolf
Published in Proc. of the European Conference on Computer Vision (ECCV)
June 2000.
Chapter 3
Projection Matrices from P^k to P^2
On Projection Matrices P^k → P^2, k = 3, 4, 5, 6, and their Applications in Computer Vision.
Lior Wolf, Amnon Shashua
Published in the International Journal of Computer Vision (IJCV)
48(1) 2002.
Chapter 4
Action indexing using dynamic shape tensors
Time-varying Shape Tensors for Scenes with Multiply Moving Points
Anat Levin, Lior Wolf, Amnon Shashua
Published in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)
Dec. 2001.
Chapter 5
A common transversal solution for
independently translating planes
Affine 3-D Reconstruction from Two Projective Images of Independently Translating Planes
Lior Wolf, Amnon Shashua
Published in The Eighth IEEE International Conference on Computer Vision (ICCV)
June 2001.
Chapter 6
The segmentation matrix
Two-body Segmentation from Two Perspective Views
Lior Wolf, Amnon Shashua
Published in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)
Dec. 2001.
Chapter 7
Synchronization and reconstruction from fixed
cameras viewing a dynamic scene
Correspondence-free Synchronization and Reconstruction in a Non-rigid Scene
Lior Wolf, Assaf Zomet
Published in the Post-ECCV 2002 Workshop on Vision and Modeling of Dynamic Scenes
May 2002.
Chapter 8
“3D to 3D” alignment
Consider the classic problem of "3D to 3D" alignment of point sets. One is given a set of 3D points
P_1, ..., P_n measured by some device, such as one based on structured light [48] or on triangulation from
a stereo rig of cameras. The measuring device has changed its position in space (while the set of 3D
points has remained static in space) and the corresponding 3D positions are P'_1, ..., P'_n, i.e., the measured
points have undergone a coordinate transformation. In a projective setting, five of these matching
pairs in general position are sufficient to recover the 4×4 collineation A such that AP_i ≅ P'_i, i = 1, ..., n.
In a rigid-motion setting, the coordinate transformation consists of translation and rotation, which can be
recovered using 4 matching points; elegant techniques using SVD have been developed for this
purpose [21].
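The SVD-based solution for the static rigid case can be sketched as follows. This is a minimal illustration of the standard Arun-style technique, not an algorithm from this thesis; all names are ours:

```python
import numpy as np

def rigid_align(P, Q):
    """Least-squares R, t with Q_i ~ R P_i + t, via the SVD of the
    cross-covariance (Arun/Kabsch-style).  P, Q: (n, 3) matched 3D points."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                    # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Correct a possible reflection so that det(R) = +1.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t

# Synthetic check: recover a known rotation and translation exactly.
rng = np.random.default_rng(1)
P = rng.standard_normal((10, 3))
a = 0.7
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
Q = P @ R_true.T + t_true
R, t = rigid_align(P, Q)
assert np.allclose(R, R_true) and np.allclose(t, t_true)
```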
In this section we introduce "dynamic" versions of the 3D-to-3D alignment problem. The first dynamic
version allows any number of points to move along straight-line paths during the motion
of the measuring device. Points that remain in place are called static and points that move are called
dynamic. There may be any number of dynamic points, including the possibility that all points are
dynamic, and the system need not know in advance which points are static and which are dynamic
(the unsegmented configuration). Under these conditions we wish to find the projective coordinate changes
across two motions of the measuring device.
A previous work [59] derived a 4×4×4 family of tensors, referred to as join tensors, or Jtensors for
short, that capture the dynamic 3D-to-3D alignment problem. A matching triplet P, P', P'' of a point
measured at three time instances contributes a linear measurement for the Jtensor, regardless of whether
the physical point in space is dynamic or static while the measuring device changes positions. The
linear constraints add up to a 4-dimensional null space of Jtensors; that is, there exist 4 distinct Jtensors
which are linearly recovered from matching points. These Jtensors, however, are not minimal. We will
derive a minimal tensor which requires fewer measurements and gives us one tensor per
estimation matrix.
We will also consider the constant velocity case. For points that are either static or move at constant
velocity we introduce a smaller tensor that brings some advantages. First, the tensor is smaller, so
fewer measurements are required for its recovery. Second, since constant velocity is an affine invariant,
we will recover the change in coordinate system up to an affine (not projective) ambiguity. Our approach
here is closest to the one presented in [57], where a full projective camera was used. Having 3D
information allows us to use fewer views and measurements than the solution given there.
The third case we consider is that of translating lines. In 3D measurements, such
as range data, points are not always well defined. We consider a case where, instead of tracking points,
we track lines. The motion of the lines is restricted in such a way that every point on each line moves in
the same direction as the other points on that line. In this case only two views are needed in order to recover
the relative position of the coordinate systems.
In a separate work, Sturm [50] derived multiple-view tensors of this family to deal with the case
where points in 3D move along linear paths which are constrained to intersect some line. A full analysis
is given, including the analysis of ambiguities and the possibility of performing Euclidean calibration
using these tensors.
8.1 Derivation of Jtensors
Let X be some point in 3D space with a coordinate vector P. Let P' be the coordinate representation
of the point X at some other time instance (say, after the measuring device has changed its viewing position)
and let P'' be the coordinate representation of X at a third time instance. Let A, B be the collineations
mapping the second and third coordinate representations back to the first, i.e., P ≅ AP'
and P ≅ BP''.
If the point X happens to move along some straight-line path during the change of coordinate systems,
then P, AP', BP'' do not coincide, but they form a rank-2 matrix:

rank [ P  AP'  BP'' ] = 2

And for every column vector V we have

det [ P  AP'  BP''  V ] = 0     (8.1)
Note that because V is spanned by a basis of size four, we can obtain at most four linearly independent
constraints on some object consisting of A, B from a triplet of matching points P, P', P''. Note also that
the null vector of a 4×3 matrix can be represented by 3×3 determinant expansions. For example, let
X, Y, Z be the three column vectors of a 4×3 matrix; then the vector W representing the plane defined by
the points X, Y, Z is:

w_1 =  det [ x_2 y_2 z_2 ; x_3 y_3 z_3 ; x_4 y_4 z_4 ]
w_2 = −det [ x_1 y_1 z_1 ; x_3 y_3 z_3 ; x_4 y_4 z_4 ]
w_3 =  det [ x_1 y_1 z_1 ; x_2 y_2 z_2 ; x_4 y_4 z_4 ]
w_4 = −det [ x_1 y_1 z_1 ; x_2 y_2 z_2 ; x_3 y_3 z_3 ]
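This signed-minor expansion is easy to verify numerically; a short sketch (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 3))    # columns are the points X, Y, Z

# Null vector W via signed 3x3 minors: w_i = (-1)^i * det(M with row i deleted)
# (0-indexed signs +,-,+,- match the w_1..w_4 expansion above).
W = np.array([(-1) ** i * np.linalg.det(np.delete(M, i, axis=0))
              for i in range(4)])

# W is the plane through the three points: it is orthogonal to every column,
# because W . v equals det([v, X, Y, Z]), which vanishes for v in {X, Y, Z}.
assert np.allclose(W @ M, 0.0, atol=1e-9)
```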
We can write the relationship between W and X, Y, Z as a tensor operation as follows:

w_i = ε_ilmu x^l y^m z^u

where the entries of ε consist of +1, −1, 0 in the appropriate places. We will refer to ε as the "cross-product"
tensor. Note that the determinant of a 4×4 matrix whose columns consist of [X, Y, Z, T] can
be compactly written as:

t^i x^l y^m z^u ε_ilmu.
Using the cross-product tensor we can write the constraint of eqn. 8.1 as follows:

0 = det [ P  AP'  BP''  V ] = P^i ε_ilmu (A^l_j P'^j)(B^m_k P''^k) V^u = P^i P'^j P''^k (ε_ilmu A^l_j B^m_k V^u)
Note that the tensor form allows us to separate the measurements P, P', P'' from the unknowns A, B,
and we denote the expression in parentheses:

J_ijk = ε_ilmu A^l_j B^m_k V^u     (8.2)

as the "join"¹ tensor, or Jtensor for short. Note that for every choice of the vector V we get a Jtensor. As
previously mentioned, since V is spanned by a basis of dimension four, there are 4 such tensors; each
tensor is defined by the constraints:

P^i P'^j P''^k J_ijk = 0.
These are linear constraints on the 64 elements of the join tensors. Because there are four Jtensors, the
linear system of equations for solving for J_ijk from the matching triplets P, P', P'' has a 4-dimensional
null space. The vectors of the null space are spanned by the Jtensors. In practical terms, given N ≥ 60
matching triplets P, P', P'', each triplet contributes one linear equation P^i P'^j P''^k J_ijk = 0 on the 64
¹The join operator is the exterior product of the Grassmann-Cayley algebra. The join of three 3D points is the plane which
contains the three points.
entries of J_ijk. The eigenvectors associated with the four smallest eigenvalues of the estimation matrix
are the Jtensors of the dynamic 3D-to-3D alignment problem.
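As a sanity check, the constraint underlying eqns. (8.1)-(8.2) can be verified numerically. In this sketch A, B, V are chosen at random, and a synthetic point moves along a straight 3D line; the construction and names are ours:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(3)

# 4D Levi-Civita ("cross-product") tensor eps[i, l, m, u].
eps = np.zeros((4, 4, 4, 4))
for perm in permutations(range(4)):
    inv = sum(perm[a] > perm[b] for a in range(4) for b in range(a + 1, 4))
    eps[perm] = (-1) ** inv          # sign = parity of the permutation

A = rng.standard_normal((4, 4))      # collineations (invertible w.h.p.)
B = rng.standard_normal((4, 4))
V = rng.standard_normal(4)           # one arbitrary choice of V

# One Jtensor: J_ijk = eps_ilmu A^l_j B^m_k V^u   (eqn. 8.2)
J = np.einsum('ilmu,lj,mk,u->ijk', eps, A, B, V)

# Synthetic dynamic point moving along the 3D line Q + t*D.
Ainv, Binv = np.linalg.inv(A), np.linalg.inv(B)
Q, D = rng.standard_normal(4), rng.standard_normal(4)
P   = Q + 0.3 * D                    # first time instance
Pp  = Ainv @ (Q + 1.1 * D)           # so that A @ Pp lies on the line
Ppp = Binv @ (Q - 0.7 * D)           # so that B @ Ppp lies on the line

# P, A@Pp, B@Ppp span only 2 dimensions, so the contraction vanishes.
residual = np.einsum('i,j,k,ijk->', P, Pp, Ppp, J)
assert abs(residual) < 1e-8
```

In an estimation setting, each such triplet would contribute the row `np.kron(P, np.kron(Pp, Ppp))` of the 64-column estimation matrix; the residual above is exactly this row dotted with `J.ravel()`.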
We see that at least 60 point measurements are needed to solve for the Jtensors. In case all of the
measurements arise from dynamic points, these points should be distributed along at least 10 lines,
5 of which can hold up to 8 dynamic points, while the remaining 5 can hold up to 4 dynamic points. A
tool for showing this kind of argument is representation theory. Using this tool, along the guidelines
of chapter 9, one can observe that the size of the subspace of constraints spanned by static points is 20,
and that from each line one can extract four "static" constraints and four constraints outside that static
subspace, which explains the above result.
More information about the Jtensor is given in [59]. The other main results shown there are: (i)
tensor slices and the extraction of the constituent collineations A, B from the four Jtensors; (ii) the use
of Jtensors for direct mapping between coordinate systems (without extracting A, B along the way); (iii)
the use of Jtensors to distinguish between dynamic and static points; and (iv) the relationship between
the numbers of static and dynamic points needed for estimating the Jtensors in the unsegmented and
segmented configurations.
8.2 The Minimal Jtensor
By building the estimation matrix from the outer product of the 3D points we get four constraints,
which is the number of constraints that exist when a point lies on a line. In this case the line can be seen
as the join of P and AP', and the point as BP''. The fact that we got four constraints suggests that the
Jtensors we found were not minimal. A minimal Jtensor would use a smaller estimation matrix and
would therefore need fewer measurements.

We can limit the constraints to the single constraint for the intersection of two lines. This is done
by taking, instead of the point BP'', the ray which connects some camera center C with this point. This
camera is entirely arbitrary and may be chosen at will.
Assuming that we choose a camera M as stated above, we take C ≅ null(M). There exist 4×3
matrices which transform a point on the image, MP'', to a point in 3D which is the intersection of the
ray associated with this point (BP'' ∨ BC) and some plane. We choose one of these matrices and call it
O. We know that the ray above and the line P ∨ AP' must intersect, so:

det [ P  AP'  BOMP''  BC ] = 0
This expression is multi-linear in P, P' and MP'', and gives us the minimal Jtensor N_ijk. The indices
i and j run from 1 to 4, and the last index k runs from 1 to 3. The minimal Jtensor's equation is:

N_ijk = ε_ilmu A^l_j (BO)^m_k (BC)^u     (8.3)
Alternatively, we can give the minimal Jtensor another equation, this time by noticing that after all
the 3D points are transformed to the third coordinate system, the points, and therefore their images under
the camera M, are collinear:

N_ijk = ε_lmk (MB^{-1})^l_i (MB^{-1}A)^m_j     (8.4)

where the three-index epsilon ε_lmk is simply the anti-symmetric (cross-product) tensor of R^3.
The estimation matrix for this tensor is made of the outer products of P, P' and MP'' for any
pre-chosen rank-three camera matrix M. Thus, in order to solve for the minimal Jtensor, one needs
3·4² − 1 = 47 measurements, a significant reduction from the 60 measurements needed for
the Jtensor above. Given the same 3D measurements, one can choose another projection matrix M and
compute several such minimal Jtensors.
We can recover the collineations A and B by noticing, for example, that P^i (A^{-1}P)^j N_ijk = 0_3.
Therefore, for any slice of the minimal Jtensor, S_δ = δ^k N_ijk, we have A^T S_δ + S_δ^T A = 0, which gives us
10 equations on A per slice of the minimal Jtensor. The recovery of the second collineation B is only
slightly more complicated.
8.3 The 3D Constant Velocity Tensor
We now consider the constant velocity case. Let the collineations between the world coordinate
system and the sensor coordinate systems be A_i, i = 0, 1, 2. This time we cannot assume A_0 to be the
identity, since we allow our tensors to have projective coordinate systems, and constant velocity is an
affine invariant. A point in 3D, (X, Y, Z)^T, is moving at a constant velocity (dX, dY, dZ)^T. Its
location in the sensors (in projective coordinates), P_i, is given by:

P_i ≅ A_i [ (X, Y, Z, 1)^T + i (dX, dY, dZ, 0)^T ] ≅ Ã_i (X, Y, Z, 1, dX, dY, dZ)^T

where Ã_i is composed of the columns A^1_i, A^2_i, A^3_i, A^4_i of A_i:

Ã_i ≅ [ A^1_i  A^2_i  A^3_i  A^4_i  iA^1_i  iA^2_i  iA^3_i ]
We can now pass three hyper-planes through the point in P^6 for each measurement. By taking three
such hyper-planes from each of the first two points, and one from the third point, we derive the constraint:

det [ (P^4_0  0  0  −P^1_0) Ã_0
      (0  P^4_0  0  −P^2_0) Ã_0
      (0  0  P^4_0  −P^3_0) Ã_0
      (P^4_1  0  0  −P^1_1) Ã_1
      (0  P^4_1  0  −P^2_1) Ã_1
      (0  0  P^4_1  −P^3_1) Ã_1
      L^T_2 Ã_2 ] = 0

This constraint is multi-linear in the measurements P_0, P_1, L_2 and has the form P^i_0 P^j_1 L_2k A^k_ij = 0.

The size of the resulting tensor is 4³. Since we can take any plane through the point in the third
coordinate system, 21 point matches across 3 views are sufficient to solve for this tensor linearly.
Note that it is also possible to use an arbitrary camera in the third view here. Instead of the PPp
(point in P^3 - point in P^3 - point in P^2) tensor we obtained for the non-constant-velocity case, here we
get a PPl tensor, where l is any line through the point p. This gives a tensor of size 3×4², which
needs 24 point matches in order to solve.
Extracting information from this tensor, such as the collineations A_i and the structure of the scene, is
done along the lines described in [57]. A first reconstruction is done in P^7; then a second reconstruction
is done in P^3 for the 3D structure and collineations. Since constant velocity is an affine invariant, the
last reconstruction is an affine reconstruction. Note that in the case of constant velocity there is an
ambiguity in defining static points. This is because performing two constant-velocity motions one after
the other gives a new constant-velocity motion. Therefore, we cannot distinguish between a translation of
the coordinate frames and adding the same translation to all the points. This ambiguity can be resolved
using one static point.

We will first describe some properties of the constant velocity tensor that enable the
projective reconstruction, and then show how to achieve the reconstruction itself.
8.3.1 Decomposing the constant velocity tensor
We have formulated our problem as a projection problem from P^6 to P^3. This gives us a 4×7
projection matrix ("camera"). The analogue of an image plane for this type of "camera" is an "image
space" (an extensor of step 4). This projection matrix has a center, which is simply its null space: an extensor
of step 7 − 4 = 3. The image of this "camera center" in some other view is a plane subspace of
the other view's image space. This is analogous to the well-known epipole of epipolar geometry.
Consider the contractions O^k ≅ P^i_0 P^j_1 A^k_ij. This type of contraction is called point transfer. It can
easily be shown (by multiplying both sides with a plane L_m through the third point P^m_2) that this
contraction generates the point in the last view, P_2, or 0 in degenerate cases.

Let us now turn our attention to slices of the tensor. P^i_0 A^k_ij is clearly a 4×4 matrix. It is the collineation
between views 2 and 3 of the space (an extensor of step 4) which connects the first projection matrix's center
(a plane: an extensor of step 3) and the point in the first "image space".
When trying to solve for projection matrices from the multi-linear constraints in the classic case, the
epipoles play a major role. Here each epipole is a plane. For example, the epipole in view 2 associated
with view 1 is the plane e_01 spanned by the points

e_01 ≅ A_1 null(A_0),

where the A_i are the projection matrices.
In multi-view geometry the epipole in the first image is transformed to the epipole in the second
image by any valid homography between the views. This is because the line in 3D which connects
the camera centers intersects the image planes at fixed points. Here, the extensor of step 6 which
connects the two projection centers intersects each image space (an extensor of step 4) in an extensor of
step 3 (6 + 4 − 7 = 3), which is a plane. The epipoles are transformed from one view to another by
any valid collineation between them.
Assume that H and J are two collineations between views 1 and 2 (in order to find these we need the
tensors between views three, one and two, and not the tensor between views one, two and three). Using
dual collineations (we transform planes, not points):

e_10 ≅ H^{-T} e_01 ≅ J^{-T} e_01

Therefore e_01 is a generalized eigenvector of H^{-T} and J^{-T}. We can use this property to find the
epipoles from collineations: the epipole is a generalized eigenvector of any pair of valid
dual collineations.
Up to a projective transformation, the projection matrices from P^6 to P^3 can be chosen as:

w_0 ≅ [ I_{4×4}  0_{4×3} ]
w_1 ≅ [ H_01  e^1_10  e^2_10  e^3_10 ]

where H_01 is any collineation between views one and two, and e^i_10, i = 1..3, are three points on the plane
which is the epipole in view two associated with camera one.

This choice of projection matrices is actually the choice of the eight points of the standard basis
of P^6, which we are free to choose. The first four points are taken from the space associated with the
collineation H_01; the fifth, sixth and seventh points are taken from the center of the first projection matrix;
the scales between the epipoles/homography determine the missing point of the basis in P^6.
Having w_0 and w_1, we can use the tensor to find w_2. Note that the tensor elements are multi-linear
expressions in the projection matrices; since w_0 and w_1 are known, we are left with a linear
expression in w_2.

Using the tensorial constraints alone, we cannot do better in the larger space than a projective
reconstruction (of that larger space), due to the gauge invariance. However, additional information arises
from the nature of the underlying scenario.

For the case of the constant velocity tensors, we know that the matrices Ã_i have a special structure:
they have columns which are repeated, multiplied by some scalar. This gives us linear constraints on a
transformation that will bring all the w_i to this structure. The rest follows very similarly to what is
done in [57].
8.4 The Translating Lines Matrix
Consider a moving sensor capturing a scene composed of lines moving in 3D. The motion of the lines
is constrained in such a way that each point on a line moves in the same direction. For example, the
lines lie on rigid objects that undergo translation.

The constraint is derived from the fact that if we compensate for the motion of the sensor, the line
before and the line after the motion both reside on the same plane. In other words, these two lines
intersect (possibly at infinity).

Two lines represented in Plücker coordinates intersect if the dot product of the first line with
some known permutation of the second line vanishes; the permutation depends on the order of elements
in the Plücker coordinates. This can be written as l^T_2 J l_1 = 0, where the matrix J is a known permutation
matrix.
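This intersection test can be sketched as follows. We assume, for illustration, the (direction; moment) ordering of Plücker coordinates, for which J simply swaps the two 3-blocks; other orderings give a different permutation:

```python
import numpy as np

def plucker(p, q):
    """Plücker coordinates (direction; moment) of the line through p and q."""
    return np.concatenate([q - p, np.cross(p, q)])

# J swaps the direction and moment blocks; l2^T J l1 = 0 iff the lines meet.
J = np.block([[np.zeros((3, 3)), np.eye(3)],
              [np.eye(3), np.zeros((3, 3))]])

a = np.array([0.0, 0.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 1.0, 0.0])
d = np.array([1.0, 1.0, 1.0])

l_ab, l_ac, l_cd = plucker(a, b), plucker(a, c), plucker(c, d)

meets = l_ac @ J @ l_ab     # lines share the point a -> must vanish
skew  = l_cd @ J @ l_ab     # a skew pair -> nonzero
assert np.isclose(meets, 0.0) and not np.isclose(skew, 0.0)
```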
Changing the coordinate system for points by some collineation A changes the coordinate system of
the Plücker lines derived from these points by an induced 6×6 collineation Ā. The rows of Ā are the
Plücker coordinates of the lines made out of every pair of rows of the point collineation A. Combining
this with the previous equation we get l^T_2 Ā^T J l_1 = 0, where l_1 is a line before the
motion in the first sensor, and l_2 is the line after the motion in the second sensor.

This constraint is multi-linear in the lines, and we can solve for Ā^T J from 35 line matches between
two views. If points are known to be on a translating object, then we can obtain lines by choosing every two
such points as a line. Experiments show that two objects are sufficient to solve for the tensor linearly.
Since we know J, which is a permutation matrix and therefore full rank, we can solve for the line
collineation Ā from the tensor. The point collineation A can then be recovered by noticing that every
row of A is the intersection of three rows of Ā, each representing a line.
Note that in the common case, where all the points on each line move in the same direction and
at the same velocity per line, we have an ambiguous situation. The ambiguity has two components.
First, we can recover the change in coordinate system only up to some translation of all the points.
Second, we can recover the change in coordinate system only up to some unknown scale, i.e., every point
(X_i, Y_i, Z_i, 1)^T can be transformed into (λX_i, λY_i, λZ_i, 1)^T for any fixed λ without changing
the property that every line before the motion intersects the corresponding line after the motion.

Both ambiguities can be shown to arise from the following relation. Let (X_i, Y_i, Z_i, 1)^T, i = 1, 2,
be two points which both move at a constant velocity (dX_12, dY_12, dZ_12, 0)^T. Then for every
common arbitrary translation (dA, dB, dC, 0)^T, and for any scale factor λ:
det [ X_1  X_2  λX_1 + dX_12 + dA   λX_2 + dX_12 + dA
      Y_1  Y_2  λY_1 + dY_12 + dB   λY_2 + dY_12 + dB
      Z_1  Z_2  λZ_1 + dZ_12 + dC   λZ_2 + dZ_12 + dC
       1    1          1                   1          ] = 0

(This can be seen by subtracting the first two columns of the matrix and comparing with the difference
of the last two: the two difference vectors are proportional.)
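A quick numerical check of this identity, with random values (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
P1, P2 = rng.standard_normal(3), rng.standard_normal(3)  # (X_i, Y_i, Z_i)
d = rng.standard_normal(3)       # shared motion (dX_12, dY_12, dZ_12)
shift = rng.standard_normal(3)   # arbitrary common translation (dA, dB, dC)
lam = 1.7                        # arbitrary scale factor

M = np.ones((4, 4))
M[:3, 0] = P1
M[:3, 1] = P2
M[:3, 2] = lam * P1 + d + shift
M[:3, 3] = lam * P2 + d + shift

# Column 3 - column 4 = lam * (column 1 - column 2), so the matrix is
# rank-deficient and its determinant always vanishes.
assert np.isclose(np.linalg.det(M), 0.0)
```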
This ambiguity, which reduces the rank of the estimation matrix to 31 instead of 35, can be
overcome by using two static points, for example. From each static point we derive additional constraints
on the estimation matrix by choosing any line through the point in the first frame and any line
through the same point in the second frame.
Chapter 9
Counting Problems for Multilinear
Constraints
Multilinear constraints in computer vision are of growing interest in Structure from Motion
(SFM), indexing and graphics. Many of the applications where multiple measurements are involved,
like multiple-view geometry of static and dynamic scenes, indexing functions into 3D data-sets, and the
separation of various attributes/modalities such as "content" and "style", have a multilinear form. As a
result, a growing amount of work has been published on the various aspects of these algebraic functions
and their applications; see Hartley & Zisserman, 2000 and Faugeras & Luong, 2001 for recent
summaries of the various multi-linear maps and their associated tensors.
In this paper we raise a general question and demonstrate its relevance to current research on
multilinearity in computer vision. The question takes the following form: Let V be a complex n-dimensional
space, and for m ≥ k consider the GL(V)-module V(n, m, k) ⊂ V^⊗m defined by

V(n, m, k) = Span{ v_1 ⊗ · · · ⊗ v_m ∈ V^⊗m : dim Span{v_1, . . . , v_m} ≤ k }.

We would like to determine dim V(n, m, k) for any choice of n, m ≥ k. We will show that this question
appears, in one disguised form or another, in a number of vision problems and, for example, focus on
two of those problems: (i) the analysis of constraints in single-view indexing functions (the 8-point shape
tensor), and (ii) the analysis of the constraints in dynamic P^n → P^n mappings, i.e., where the point sets
are allowed to move within a k-dimensional subspace while the n-dimensional space is being multiply
projected (multiple views) onto copies of the m-dimensional space.
We then derive the solution to the general problem using tools from representation theory. We
describe the general notation in the next section (and provide a brief primer on representation theory
in the appendix), followed by a detailed description of the two problems mentioned above and the
way they are mapped to the question of dim V(n, m, k); we then derive the structure
and dimension of the GL(V)-module V(n, m, k) by counting irreducibles, followed by examples of its
application to some instances of dynamic P^n → P^n mappings.
9.1 A Representation Theory Digest
In this section we briefly recall some relevant facts concerning the representation theory of the general
linear group. For a thorough introduction see Fulton & Harris, 1991.

Let V be a finite n-dimensional vector space over the complex numbers. The collection of invertible
n×n matrices is denoted by GL(n); this is the group of automorphisms of V, denoted GL(V). The
vector space V^⊗m (the m-fold tensor product) is spanned by decomposable tensors of the form v_1 ⊗ · · · ⊗ v_m,
where the vectors v_i are in V. Hence the dimension of V^⊗m is n^m. The vector space V^⊕m is the m-fold
direct sum of V, and is thus of dimension nm.
The exterior power ∧^m V of V, n ≥ m, is the vector space spanned by the m×m minors of the
n×m matrix [v_1, ..., v_m], where the vectors v_i are in V. Hence the dimension of ∧^m V is (n choose m). The
exterior powers are the images of the map V^×m → V^⊗m given by

(v_1, · · · , v_m) → Σ_{σ∈S_m} sgn(σ) v_σ(1) ⊗ · · · ⊗ v_σ(m)

where S_m denotes the symmetric group (of permutations of m letters).
The symmetric powers Sym^m V are the images of the map V^×m → V^⊗m given by

(v_1, · · · , v_m) → Σ_{σ∈S_m} v_σ(1) ⊗ · · · ⊗ v_σ(m)

Hence the vector space Sym^m V is of dimension (n+m−1 choose m). Note that

V ⊗ V = Sym^2 V ⊕ ∧^2 V

with the appropriate dimensions: n² = (n+1 choose 2) + (n choose 2). This decomposition into irreducibles
(see later) does not hold for V^⊗m, m > 2. The remainder of this section is devoted to the notation needed
for representing V^⊗m as a decomposition of irreducibles.
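The dimension counts above can be checked directly; a small sketch:

```python
from math import comb

# dim(V (x) V) = dim(Sym^2 V) + dim(Wedge^2 V) for every n:
# n^2 = C(n+1, 2) + C(n, 2)
for n in range(1, 20):
    assert n * n == comb(n + 1, 2) + comb(n, 2)

# Exterior and symmetric power dimensions for a 4-dimensional space:
n, m = 4, 3
assert comb(n, m) == 4             # dim Wedge^3 V
assert comb(n + m - 1, m) == 20    # dim Sym^3 V
```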
A representation of a group G on a complex finite-dimensional space U is a homomorphism from G to
GL(U), the group of linear automorphisms of U. The action of g ∈ G on u ∈ U is denoted by g·u. The
G-module U is irreducible if it contains no non-trivial G-invariant subspaces. Any finite-dimensional
representation of a compact group G can be decomposed as a direct sum of irreducible representations.
This basic property, called complete reducibility, also holds for all holomorphic representations of the
general linear group GL(V).
The main focus of this chapter is the space

V(n, m, k) = Span{ v_1 ⊗ ··· ⊗ v_m ∈ V^{⊗m} : dim Span{v_1, ..., v_m} ≤ k }.

Since V(n, m, k) is invariant under the GL(V) action given by g · (v_1 ⊗ ··· ⊗ v_m) = g(v_1) ⊗ ··· ⊗ g(v_m), it is natural to study its structure by decomposing it into irreducible GL(V)-modules.
The description of the finite dimensional irreducible representations (irreps) of GL(V) depends on the combinatorics of partitions and Young diagrams, which we now describe.
A partition of m is an ordered set λ = (λ_1, ..., λ_k) such that λ_1 ≥ ... ≥ λ_k ≥ 1 and Σ λ_i = m. A partition is represented by its Young diagram (also called shape), which consists of k left-aligned rows of boxes with λ_i boxes in row i. The conjugate partition µ = (µ_1, ..., µ_r) to a partition λ is defined by interchanging rows and columns in the Young diagram; without reference to the diagram, µ_i is the number of terms in λ that are greater than or equal to i.
An assignment of the numbers {1, ..., m} to the boxes of the diagram of λ, one number to each box, is called a tableau. A tableau in which all the rows and columns of the diagram are increasing is called a standard tableau. We denote by f_λ the number of standard tableaux on λ, i.e., the number of ways to fill the Young diagram of λ with the numbers from 1 to m such that all rows and columns are increasing. Let (i, j) denote the coordinates of the boxes of the diagram, where i = 1, ..., k denotes the row number and j denotes the column number, i.e., j = 1, ..., λ_i in the i'th row. The hook length h_{ij} of a box at position (i, j) in the diagram is the number of boxes directly below, plus the number of boxes to the right, plus 1 (without reference to the diagram, h_{ij} = λ_i + µ_j − i − j + 1). Then,

f_λ = m! / ∏_{(i,j)} h_{ij}

where the product of the hook lengths is over all boxes of the diagram. We denote by d_λ(n) the number of semi-standard tableaux, which is the number of ways to fill the diagram with the numbers from 1 to n such that all rows are non-decreasing and all columns are increasing. We have:

d_λ(n) = ∏_{(i,j)} (n − i + j) / h_{ij} .
Let S_m denote the symmetric group on {1, ..., m}. The group algebra CS_m is the algebra spanned by the elements of S_m:

CS_m = { Σ_{σ∈S_m} α_σ σ | α_σ ∈ C }

where addition and multiplication are defined as follows:

α (Σ_{σ∈S_m} α_σ σ) + β (Σ_{σ∈S_m} β_σ σ) = Σ_{σ∈S_m} (α α_σ + β β_σ) σ

and

(Σ_{σ∈S_m} α_σ σ) (Σ_{τ∈S_m} β_τ τ) = Σ_{g∈S_m} ( Σ_{g=στ} α_σ β_τ ) g

for α, β, α_σ, β_σ ∈ C.
Let t be a tableau on λ (a numbering of the boxes of the diagram) and let P(t) denote the group of all permutations σ ∈ S_m which permute only the rows of t. Similarly, let Q(t) denote the group of permutations that preserve the columns of t. Let a_t, b_t be two elements in the group algebra CS_m defined as:

a_t = Σ_{g∈P(t)} g ,   b_t = Σ_{g∈Q(t)} sgn(g) g.
The group algebra CS_m acts on V^{⊗m} on the right by permuting factors, i.e., (v_1 ⊗ ··· ⊗ v_m) · σ = v_{σ(1)} ⊗ ··· ⊗ v_{σ(m)}. For a general shape λ and a tableau t on λ, the image of a_t, V^{⊗m} · a_t, is the subspace:

V^{⊗m} · a_t = Sym^{λ_1}V ⊗ ··· ⊗ Sym^{λ_k}V ⊂ V^{⊗m}

and the image of b_t is

V^{⊗m} · b_t = ∧^{µ_1}V ⊗ ··· ⊗ ∧^{µ_r}V ⊂ V^{⊗m}

where µ is the conjugate partition to λ. The Young symmetrizer is defined by c_t = a_t · b_t ∈ CS_m. The image of the Young symmetrizer

S_t(V) = V^{⊗m} · c_t

is the Schur module associated to t and is an irreducible GL(V)-module. The isomorphism type of S_t(V) depends only on the shape λ, so we may write S_t(V) = S_λ(V). It turns out that all the polynomial irreps of GL(V) are of the form S_λ(V) for some m and a partition λ ⊢ m.
Let T_λ denote the set of standard tableaux on λ; then the direct sum decomposition of V^{⊗m} into irreducible GL(V)-modules is given by

V^{⊗m} = ⊕_{λ⊢m} ⊕_{t∈T_λ} S_t(V) ≅ ⊕_{λ⊢m} S_λ(V)^{⊕f_λ}.

Since d_λ(n) = dim S_λ(V) it follows that

dim V^{⊗m} = n^m = Σ_{λ⊢m} d_λ(n) f_λ.
For example, consider n = m = 3, i.e., V ⊗ V ⊗ V where dim V = 3. There are three possible partitions λ of 3: (3), (1, 1, 1) and (2, 1). From the above, S_{(3)}(V) = Sym^3 V and S_{(1,1,1)}(V) = ∧^3 V. There are two (f_{(2,1)} = 2) standard tableaux for λ = (2, 1), namely 12/3 and 13/2 (rows listed top to bottom, separated by a slash, with boxes numbered left to right). There are eight (d_{(2,1)}(3) = 8) semi-standard tableaux, which are: 11/2, 11/3, 12/2, 12/3, 13/2, 13/3, 22/3 and 23/3. We have the decomposition:

V ⊗ V ⊗ V = Sym^3 V ⊕ ∧^3 V ⊕ (S_{(2,1)}V)^{⊕2}

with the appropriate dimensions: 27 = 10 + 1 + (8 + 8).
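These counts are easy to reproduce with a short script. The sketch below (the helper names are my own) implements the hook length formulas for f_λ and d_λ(n) and checks the dimension count for n = m = 3:

```python
from math import factorial

def partitions(m, max_part=None):
    """All partitions of m as non-increasing tuples."""
    if max_part is None:
        max_part = m
    if m == 0:
        yield ()
        return
    for first in range(min(m, max_part), 0, -1):
        for rest in partitions(m - first, first):
            yield (first,) + rest

def hook(shape, i, j):
    """Hook length of box (i, j) (0-indexed): boxes below + boxes right + 1."""
    mu_j = sum(1 for row in shape if row > j)   # conjugate part
    return (shape[i] - j - 1) + (mu_j - i - 1) + 1

def f(shape):
    """Number of standard tableaux: m! / product of hook lengths."""
    m = sum(shape)
    p = 1
    for i, row in enumerate(shape):
        for j in range(row):
            p *= hook(shape, i, j)
    return factorial(m) // p

def d(shape, n):
    """Number of semi-standard tableaux: product of (n - i + j) / h_ij."""
    num, den = 1, 1
    for i, row in enumerate(shape):
        for j in range(row):
            num *= n - i + j
            den *= hook(shape, i, j)
    return num // den

print(f((2, 1)), d((2, 1), 3))                       # 2 8
print(sum(f(p) * d(p, 3) for p in partitions(3)))    # 27 = 10 + 1 + 2*8
```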
9.2 The 8-point Shape Tensor Problem
In this section we will make the connection between the question of dim V(n, m, k) and a riddle regarding the internal structure of the 8-point shape tensor. Shape tensors were first introduced in Carlsson 1995, Weinshall et al. 1996, Carlsson & Weinshall 1998, with the basic idea that single-view invariants of a 3D scene can be obtained by algebraically eliminating the viewing position (camera) parameters given a sufficient number of points. Later, the same analysis was conducted in a reduced (but practical in vision applications) setting where a reference plane is identified in advance (Irani & Anandan 1996, Irani et al. 1998, Criminisi et al. 1998, Rother & Carlsson 2001), which is the case we will focus on here.
The problem setting is as follows. Let P_i = (X_i, Y_i, Z_i, W_i)^T ∈ P^3, i = 1, ..., 8, denote 8 points in 3D projective space and let M be a 3 × 4 projection matrix, thus p_i ≅ M P_i, where p_i ∈ P^2 are the corresponding image points in the 2D projective plane. We wish to algebraically eliminate the camera parameters (the matrix M) by having a sufficient number of points. This can be done succinctly if we first make a change of basis: let the coplanar points be denoted by P_1, ..., P_4 with the coordinates (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (1, 1, 1, 0), which is appropriate when P_1, ..., P_4 are indeed coplanar. Let the image undergo a projective change of coordinates such that the corresponding points p_1, ..., p_4 are assigned e_1 = (1, 0, 0), e_2 = (0, 1, 0), e_3 = (0, 0, 1), e_4 = (1, 1, 1), respectively. Given this setup the camera matrix M contains only 4 non-vanishing entries:

M = [ δ  0  0  α
      0  δ  0  β
      0  0  δ  γ ]
Let M = (α, β, γ, δ)^T ∈ P^3 be a point (representing the camera) and let P_i be the projection matrix:

P_i = [ W_i  0   0   X_i
       0   W_i  0   Y_i
       0   0   W_i  Z_i ]

As in the general case we have the duality p_i ≅ M P_i = P_i M, where the roles of the motion (the camera) and the shape have been switched. Let l_i, l'_i be two distinct lines passing through the image point p_i, i.e., p_i^T l_i = 0 and p_i^T l'_i = 0; therefore we have l_i^T P_i M = 0 and l'_i^T P_i M = 0. For i = 5, ..., 8 we therefore have E M = 0, where:

E = [ l_5^T P_5
      ···
      l_8^T P_8
      l'_5^T P_5
      ···
      l'_8^T P_8 ]          (9.1)
Therefore the determinant of any 4 rows of E must vanish. The choice of the 4 rows can include 2 points, 3 points, or 4 points (on top of the 4 basis points P_1, ..., P_4), and each such choice determines a multilinear constraint whose coefficients are arranged in a tensor. The 8-point tensor arises when 4 points are chosen: by choosing one row from each point we obtain a vanishing determinant involving 4 points, which provides 16 constraints (per view) l^5_i l^6_j l^7_k l^8_t Q^{ijkt} = 0 for the 81 coefficients of the tensor Q^{ijkt}. The indices i, j, k, t follow the covariant-contravariant notation (upper indices represent points, lower represent lines) and the summation convention (contraction) u_i v^i = u_1 v^1 + u_2 v^2 + ... + u_n v^n. The tensor contains 81 coefficients; however, they satisfy internal "synthetic" linear constraints. Exactly how many constraints there are is an open problem, which we will show boils down to the question of dim V(n, m, k).
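The elimination above can be checked numerically. The following sketch (my own; the variable names are hypothetical) draws a random camera point (α, β, γ, δ) and four random points P_5, ..., P_8, verifies the duality p_i ≅ M P_i = P_i M, and confirms that the stacked matrix E of eq. (9.1) annihilates the camera vector, so every 4 × 4 determinant of its rows vanishes:

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.standard_normal(4)                    # camera point (alpha, beta, gamma, delta)
alpha, beta, gamma, delta = m
M = np.array([[delta, 0, 0, alpha],           # camera matrix with 4 non-vanishing entries
              [0, delta, 0, beta],
              [0, 0, delta, gamma]])

rows = []
for _ in range(4):                            # points P_5, ..., P_8
    X, Y, Z, W = rng.standard_normal(4)
    P = np.array([[W, 0, 0, X],               # dual projection matrix P_i
                  [0, W, 0, Y],
                  [0, 0, W, Z]])
    p = P @ m                                 # image point
    assert np.allclose(M @ np.array([X, Y, Z, W]), p)   # duality: M P_i = P_i M
    l1 = np.cross(p, [1.0, 0.0, 0.0])         # two lines through p (l . p = 0)
    l2 = np.cross(p, [0.0, 1.0, 0.0])
    rows.append(l1 @ P)
    rows.append(l2 @ P)

E = np.vstack(rows)                           # the 8 x 4 matrix of eq. (9.1)
print(np.allclose(E @ m, 0))                  # True: rank(E) <= 3
```

Since E M = 0 with M ≠ 0, the rank of E is at most 3 and every 4 × 4 minor vanishes, as claimed.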
Since P_1, ..., P_4 are coplanar we have the constraint P_i^T n = 0, i = 1, ..., 4, and, due to our choice of coordinates, n = (0, 0, 0, 1)^T. Consider the family of camera matrices M = u n^T for all choices of u = (u_1, u_2, u_3)^T. In other words, the 4'th column of M consists of the arbitrary vector u and all other entries vanish. Thus MP either vanishes or is equal to u (up to scale) for all P. Let l_i, l'_i be lines through u; therefore

l_i^T M P = l_i^T P M = 0
l'_i^T M P = l'_i^T P M = 0

for all points P, and dually for all projection matrices P. Therefore the 4 × 4 determinants of E vanish regardless of the P_i. We have a single 3 × 3 × 3 × 3 tensor Q^{ijkt} responsible for the 16 quadlinear constraints l^5_i l^6_j l^7_k l^8_t Q^{ijkt} = 0 (we have a choice of 2 lines for each point, thus 16 constraints). From the discussion above, the four lines contracted by the tensor are all coincident with the arbitrary point u. Therefore, the question is: what is the dimension of the set of constraints l^5_i l^6_j l^7_k l^8_t Q^{ijkt} = 0 where the lines are arbitrary but form a 2-dimensional subspace?
Recall the definition of V(n, m, k) and set n = 3, m = 4, k = 2:

V(3, 4, 2) = Span{ v_1 ⊗ v_2 ⊗ v_3 ⊗ v_4 | dim Span{v_1, ..., v_4} ≤ 2 }

where v_1, ..., v_4 are vectors in R^3. Our question regarding the number of synthetic constraints is equivalent to the question: what is the dimension of V(3, 4, 2)?
9.3 Dynamic P^n → P^n Mappings
Consider a configuration of points Q_i ∈ P^{n−1}, i = 1, ..., q, undergoing a projective mapping Q_i → Q'_i. Then it is well known that Q'_i ≅ A Q_i, where A ∈ GL(n) is some invertible n × n matrix. However, consider the following "complication", where each point Q_i may change its position within a k-dimensional subspace (k = 1 means that Q_i is fixed, k = 2 means that Q_i may change its position along some line in P^{n−1}, and so forth), and we are given m > 2 observations Q_i^{(j)}, where j = 1, ..., m. In other words, the observations Q_i^{(j)} are generated by a combination of "global" (unknown) transformations A_i ∈ GL(n) and "local" (unknown) movements within (unknown) subspaces of dimension up to k < m. The task is to recover the global transformations A_i from the observations.
The definition above is a generalization of particular cases which were introduced in the past under the name of "dynamic" Structure from Motion (SFM), or SFM of multiply moving points; the relevant literature includes Avidan & Shashua 2000, Shashua & Wolf 2000, Wolf et al. 2000, Manning & Dyer 1999, Wexler & Shashua 2000, Han & Kanade 2000, 2001, Segal & Shashua 2000, Wolf & Shashua 2002. For instance, Shashua & Wolf 2000 consider the case where n = 3 (the points Q_i belong to the 2D projective plane), m = 3 and k = 2. In other words, a configuration of coplanar points is viewed by a moving camera and the points move along arbitrary straight lines (k = 2) or stay fixed ("static", k = 1) while the camera changes position. It was shown there that the image observations (across three views) satisfy a 3 × 3 × 3 tensorial constraint, where in the case where all points are moving along lines, 26 observations are sufficient for a unique solution to the tensor; when all points are static (without being labeled as such), those observations fill a 10-dimensional subspace (thus at least 16 points should be dynamic for a unique solution from the observations). In a later paper (Wolf et al. 2000) the case of "dynamic 3D to 3D" alignment was introduced, where n = 4, m = 3, k = 2. In that case, the observations are governed by a 4 × 4 × 4 tensor, where the observations from moving points fill a 60-dimensional space (thus there are 4 tensors satisfying the constraints), and static points fill a 20-dimensional space.
Among the various aspects of those tensors, one important aspect is the counting of the constraints necessary for a solution. Some of those counting issues, even in the particular low-dimensional examples given above, are not obvious. The matter becomes fairly subtle when dealing with general dynamic P^n → P^n mappings, where the issue of counting constraints is an open problem.
We observe that since tensor products commute with linear transformations, the issue of dimension counting is independent of the matrices A_i ∈ GL(n). Therefore, the general problem of counting the constraints of a dynamic P^{n−1} → P^{n−1} mapping is isomorphic to the question of dim V(n, m, k), where in this case n ≥ m ≥ k.
When we compute the constraints of dynamic mappings we have other limitations which are not described in Shashua & Wolf 2000 and Wolf et al. 2000, and which can also be described in the V(n, m, k) framework. For example, in the case of dynamic P^2 → P^2 alignment, the collection of measurements arising from triplets of matching points must span the 2D plane. We may ask: what is the largest number of collinear points allowed, beyond which the solution becomes degenerate? In other words, the question is how many points moving on the same straight-line path will generate linearly independent constraints. The answer is dim V(2, 3, 2); note that n = 2 because the effective dimension of the vector space is 2, even though the points are defined in the 2D projective plane (i.e., n = 3). Likewise,
in the case of dynamic P^3 → P^3 alignment the maximal number of points allowed on a single line is also dim V(2, 3, 2), and out of these points dim V(2, 3, 1) static points will give us linearly independent constraints (in both cases).
From the examples above we have that dim V(3, 3, 2) = 26 and dim V(4, 3, 2) = 60 (points moving along straight-line paths), and dim V(3, 3, 1) = 10 and dim V(4, 3, 1) = 20 (static points), for the 2D and 3D cases, respectively.
In the following section we analyze the structure of V(n, m, k) and as a result determine dim V(n, m, k) for any choice of n, m ≥ k.
9.4 The Structure of V(n, m, k)
So far we have presented two (unrelated) vision problems which are isomorphic to the dim V(n, m, k) question. We will provide below the statement and proof about the structure of V(n, m, k). The statement appears very similar to the classic result (Section 9.1) of decomposing V^{⊗m} into irreducible GL(V)-modules:

V^{⊗m} = ⊕_{λ⊢m} ⊕_{t∈T_λ} S_t(V),

with the difference that not all diagrams are included, only those diagrams λ for which λ_{k+1} = 0.

Claim 1

V(n, m, k) = ⊕_{λ_{k+1}=0} S_λ(V)^{⊕f_λ}.

In particular

dim V(n, m, k) = Σ_{λ_{k+1}=0} f_λ d_λ(n).
Proof: Suppose λ ⊢ m and λ_{k+1} = 0. Let t be the tableau given by t(i, j) = Σ_{l=1}^{i−1} λ_l + j. Noting that V(n, r, 1) = Sym^r V, it follows that

V^{⊗m} · a_t = Sym^{λ_1}V ⊗ ··· ⊗ Sym^{λ_k}V = V(n, λ_1, 1) ⊗ ··· ⊗ V(n, λ_k, 1) ⊂ V(n, m, k).

Therefore,

S_t(V) = V^{⊗m} · a_t · b_t ⊂ V(n, m, k) · b_t ⊂ V(n, m, k)

hence,

⊕_{λ_{k+1}=0} S_λ(V)^{⊕f_λ} ⊂ V(n, m, k).
To show the other direction, let (·, ·) be a hermitian form on V and let the induced form on V^{⊗m} be given by

(u_1 ⊗ ··· ⊗ u_m, v_1 ⊗ ··· ⊗ v_m) = ∏_{i=1}^{m} (u_i, v_i).

Note that

(u_1 ∧ ··· ∧ u_m, v_1 ⊗ ··· ⊗ v_m) = (1/m!) (u_1 ∧ ··· ∧ u_m, v_1 ∧ ··· ∧ v_m) = (1/m!) det[(u_i, v_j)]_{i,j=1}^{m}.

Let λ ⊢ m with λ_{k+1} ≠ 0; then the conjugate partition µ = (µ_1 ≥ µ_2 ≥ ... ≥ µ_l) satisfies µ_1 ≥ k + 1. Let l_j = Σ_{r=1}^{j} µ_r and let t be the tableau given by t(i, j) = l_{j−1} + i. Then

S_t(V) = V^{⊗m} · a_t · b_t ⊂ V^{⊗m} · b_t = ∧^{µ_1}V ⊗ ··· ⊗ ∧^{µ_l}V.

Suppose now that v_1, ..., v_m ∈ V satisfy dim Span{v_1, ..., v_m} ≤ k. Then v_1 ∧ ··· ∧ v_{µ_1} = 0, and therefore for any u_1, ..., u_m ∈ V

((u_1 ⊗ ··· ⊗ u_m) · b_t, v_1 ⊗ ··· ⊗ v_m) = ∏_{r=1}^{l} (1/µ_r!) ( u_{l_{r−1}+1} ∧ ··· ∧ u_{l_r} , v_{l_{r−1}+1} ∧ ··· ∧ v_{l_r} ) = 0.

It follows that V(n, m, k) is orthogonal to

⊕_{λ_{k+1}≠0} S_λ(V)^{⊕f_λ}

hence,

dim V(n, m, k) ≤ dim ⊕_{λ_{k+1}=0} S_λ(V)^{⊕f_λ}.
Claim 1 can be used to give explicit formulas for dim V(n, m, k) when either k or m − k is small. In the latter case we write

dim V(n, m, k) = n^m − Σ_{λ_{k+1}≠0} f_λ d_λ(n)

and note that the partitions of m with λ_{k+1} ≠ 0 correspond to all partitions of all numbers up to m − k − 1.
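Assuming Claim 1, dim V(n, m, k) reduces to a finite sum over the partitions of m with at most k parts. A minimal sketch (the helper names are mine):

```python
from math import factorial

def partitions(m, max_part=None):
    """All partitions of m as non-increasing tuples."""
    if max_part is None:
        max_part = m
    if m == 0:
        yield ()
        return
    for first in range(min(m, max_part), 0, -1):
        for rest in partitions(m - first, first):
            yield (first,) + rest

def f_and_d(shape, n):
    """(f_lambda, d_lambda(n)) via the hook length formulas of Section 9.1."""
    hooks, d_num = 1, 1
    for i, row in enumerate(shape):
        for j in range(row):
            mu_j = sum(1 for r in shape if r > j)     # conjugate part
            hooks *= row - j + mu_j - i - 1           # hook length of box (i, j)
            d_num *= n - i + j
    return factorial(sum(shape)) // hooks, d_num // hooks

def dim_V(n, m, k):
    """Claim 1: sum of f_lambda * d_lambda(n) over partitions with <= k parts."""
    return sum(f * d for f, d in
               (f_and_d(lam, n) for lam in partitions(m) if len(lam) <= k))

print(dim_V(3, 4, 2), dim_V(3, 3, 2), dim_V(3, 3, 1))   # 72 26 10
```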
9.4.1 Examples
To calculate dim V(n, m, m−1) note that only λ = (1^m) must be excluded, thus:

f_{(1^m)} = 1 ,   d_{(1^m)}(n) = C(n, m)

hence,

dim V(n, m, m−1) = n^m − C(n, m).

To calculate dim V(n, m, m−2) we must exclude, in addition to the above, the partition (2, 1^{m−2}), thus:

f_{(2,1^{m−2})} = m − 1 ,   d_{(2,1^{m−2})}(n) = (m − 1) C(n+1, m)

hence,

dim V(n, m, m−2) = n^m − [ C(n, m) + (m − 1)^2 C(n+1, m) ].

To calculate dim V(n, m, m−3) we must exclude, in addition to the above, the partitions (3, 1^{m−3}) and (2^2, 1^{m−4}), thus:

f_{(3,1^{m−3})} = C(m−1, 2) ,   d_{(3,1^{m−3})}(n) = C(m−1, 2) C(n+2, m)

f_{(2^2,1^{m−4})} = m(m − 3)/2 ,   d_{(2^2,1^{m−4})}(n) = ((m − 3) n / 2) C(n+1, m−1)

Hence,

dim V(n, m, m−3) = n^m − [ C(n, m) + (m − 1)^2 C(n+1, m) + C(m−1, 2)^2 C(n+2, m) + (m (m − 3)^2 n / 4) C(n+1, m−1) ].
With these in mind, we can easily resolve the first of the open problems, which is the number of synthetic constraints of the 8-point shape tensor with 4 coplanar points. We have seen that the answer is dim V(3, 4, 2):

dim V(3, 4, 2) = Σ_{λ_3=λ_4=0} f_λ d_λ(3),

where λ = (λ_1, ..., λ_4) is a partition of 4, i.e., λ_1 ≥ λ_2 ≥ λ_3 ≥ λ_4 and Σ_i λ_i = 4. We therefore have only three partitions which satisfy λ_3 = λ_4 = 0 to consider: λ = (4), (2, 2), (3, 1). Thus, f_{(4)} = 1, d_{(4)}(3) = 15, f_{(2,2)} = 2, d_{(2,2)}(3) = 6, f_{(3,1)} = 3 and d_{(3,1)}(3) = 15. Therefore, dim V(3, 4, 2) = 15 + 12 + 45 = 72.
We can also verify the special cases of dynamic P^2 → P^2 and P^3 → P^3 mappings by substituting the values of n, m, k in the formulas above. For example: dim V(3, 3, 2) = 27 − 1 = 26 and dim V(4, 3, 2) = 64 − 4 = 60 (points moving along straight-line paths), and dim V(3, 3, 1) = 27 − (1 + 4 · 4) = 10 and dim V(4, 3, 1) = 64 − (4 + 4 · 10) = 20 (static points). Also dim V(2, 3, 2) = 8 − 0 = 8 points moving along one line path, out of which up to dim V(2, 3, 1) = 8 − [0 + 4] = 4 static points on this line will give us linearly independent constraints.
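As an independent sanity check (my own sketch, not part of the thesis), these dimensions can also be estimated numerically: sample random decomposable tensors whose factors lie in a random k-dimensional subspace, stack their flattened coordinates, and take the matrix rank:

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_dim(n, m, k, samples=400):
    """Numerical rank of the span of random v1 (x) ... (x) vm with dim span{v_i} <= k."""
    rows = []
    for _ in range(samples):
        B = rng.standard_normal((n, k))            # random k-dimensional subspace of R^n
        vs = [B @ rng.standard_normal(k) for _ in range(m)]
        t = vs[0]
        for v in vs[1:]:
            t = np.multiply.outer(t, v)            # build the decomposable tensor
        rows.append(t.ravel())
    return int(np.linalg.matrix_rank(np.array(rows)))

print(empirical_dim(3, 4, 2))   # 72: the 8-point shape tensor count
print(empirical_dim(3, 3, 1))   # 10
print(empirical_dim(4, 3, 2))   # 60
```

With a few hundred generic samples the rank of the stacked matrix equals dim V(n, m, k) with overwhelming probability, reproducing the values derived above.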
Chapter 10
Conclusions
10.1 Discussion
The list of possible scenarios suggested in this thesis is by no means final, and the tools presented in
this work are not limited to the use of images. We hope that many more applications will emerge that
will employ the results obtained here.
A guiding line we used throughout the work is to keep the tensors tractable. We never presented any tensor whose estimation matrix is larger than the one needed in order to compute the quadrifocal tensor in a straightforward way. Nevertheless, the dynamic invariants are less stable to compute than the static ones. There is a need for a normalization of the image coordinate system, for the use of sampling-based outlier rejection, and for the use of non-linear estimation techniques where applicable. In general we performed our experiments using standard point tracking software such as the OpenCV [38] KLT point tracker. Using standard normalization [24] and LMedS sampling [39] we were able to compute our invariants. The reason why we stress this point is because the future development of the dynamic structure from motion field is bounded by the applicability of the proposed solutions.
Ignoring this question and considering only theoretical questions about projection matrices from one projective space to another, some general tools still have to be developed. In chapter 9, an understanding of the general invariants governing the relations between m points in P^n spanning an extensor of step k was achieved. The Htensor [46] and the Jtensor [59] are examples of such invariants. However, the case of projections from one subspace to another still has gaps.
Those gaps are not in the derivation of single invariants; those are well understood, and automatically achievable. The problems regard the counting of invariants in the general case (e.g., "how many invariants are there for n views in P^l of extensors of step m in P^k confined to move on an extensor of step r?"), and determining the minimal conditions on the measurements needed in order to recover these invariants (e.g., "up to 10 points on one hyperplane").
Another problem which needs a general solution is the problem of reconstructing structure from tensorial invariants. Currently there is no algorithm for recovering structure in P^k from measurements after projections to P^l. We speculate that using epipoles where possible, and the joint epipoles developed here otherwise, could solve this problem. This still deserves a rigorous proof. A step toward a solution would be to notice that by combining two projection matrices from P^k to P^l we get a projection matrix from P^k to P^{2l}. Proceeding in this direction we can always build projection matrices which have centers of lower dimension than their image planes.
Apart from chapter 7, which uses very general assumptions, our solutions were aimed more toward handling moving objects, and less at handling deformable objects. Deformable objects are usually treated as statistical objects rather than as geometrical objects. Still, some work has been done which extracts geometric information, such as camera ego-motion, from scenes containing deformable objects. In [5, 6] methods were presented for modeling a deformable object, such as a human face, by using a small number of basis shapes and their linear combinations. Using a factorization-based method, it is possible to recover the deformable shape and the camera ego-motion for the orthographic camera model. One can imagine incorporating both shape basis and view information into some very large projection matrices, yielding a solution for the projective case. A more tractable solution would be to add information such as a second camera, symmetry of the recovered shape, or limiting the camera ego-motion.
In chapters 4 and 7 methods for action recognition were proposed. In both cases an action indexing function was learned from examples. In chapter 4 it was assumed that the locations of some feature points in the images of the examined body were known. In chapter 7, no correspondence was required, and even a direct brightness-based method was proposed. Although the correspondence-free method is very appealing, there is a price to be paid in the accuracy of the resulting method. Having no knowledge introduced in advance on the structure of the moving body produces a lot of ambiguity in the resulting indexing. This can be seen, for example, by comparing the correspondence-free synchronization results given in chapter 7 to the perfect results obtained using a similar method with correspondences, shown in [62].
We would be interested in exploring the possibility of learning to index motion with the use of prior knowledge, but without having it incorporated manually into the system. The resulting system would use the input examples twice: first to learn the type of variability in the whole dataset, and then to learn specific action indexing functions. An example of such a first stage would be to learn to identify those body parts which bear the most information about the type of motion in most examples (i.e., the system will learn that "hands" are important and will learn to recognize them in the images).
10.2 Summary
This thesis addresses problems concerning the recovery of the geometry of a dynamic scene viewed by a moving uncalibrated camera.
As there are many ways to model dynamics, many different scenarios are considered, and several types of solutions are suggested. These solutions employ and generalize classical structure from motion techniques; thus the resulting body of work lies within the structure from motion field.
Out of the contributions made in this work we would like to point out the following:
• Identifying analytical solutions for dynamic SFM problems in the uncalibrated case.
• A systematic way to model dynamic scenes and to derive the multiple-view and multiple-point invariants for those scenes.
• Developing tools for the analysis of the resulting invariants and their degenerate configurations. This study, in its most general form, relied on tools from representation theory.
• Developing tools for the decomposition of those invariants, in order to compute camera motion, such as the recovery of the "joint epipoles".
Bibliography
[1] S. Avidan and A. Shashua. Threading Fundamental Matrices. In Proc. of the European Conference on Computer Vision, June 1998, Freiburg, Germany.
[2] S. Avidan and A. Shashua. Trajectory triangulation: 3D reconstruction of moving points from a monocular image sequence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(4):348–357, 2000.
[3] M. Barnabei, A. Brini, and G.C. Rota. On the exterior calculus of invariant theory. Journal of Algebra, 96:120–160, 1985.
[4] A. Bartoli. The geometry of dynamic scenes: on coplanar and convergent linear motions embedded in a 3D static scene. In The 13th British Machine Vision Conference (BMVC), Sep. 2002.
[5] M. Brand. Morphable 3D models from video. In CVPR, Kauai, Hawaii, pages II:456–463, 2001.
[6] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3D shape from image streams. In CVPR, Hilton Head, SC, June 13-15, 2000, pages II:690–696.
[7] S. Carlsson. The Double Algebra: An Effective Tool for Computing Invariants in Computer Vision. In Applications of Invariance in Computer Vision, J.L. Mundy, A. Zisserman, D. Forsyth (Eds.), Springer-Verlag, Berlin Heidelberg, 1994.
[8] S. Carlsson. Duality of reconstruction and positioning from projective views. In Proceedings of the Workshop on Scene Representations, Cambridge, MA, June 1995.
[9] S. Carlsson and D. Weinshall. Dual computation of projective shape and camera positions from multiple images. International Journal of Computer Vision, 27(3), 1998.
[10] C. Rother and S. Carlsson. Linear Multi View Reconstruction and Camera Recovery. In Proceedings of the International Conference on Computer Vision, Vancouver, Canada, July 2001.
[11] Y. Caspi, D. Simakov and M. Irani. Feature-Based Sequence-to-Sequence Matching. In Vision and Modelling of Dynamic Scenes Workshop, with ECCV 2002, Copenhagen.
[12] J.P. Costeira and T. Kanade. A multibody factorization method for independently moving objects. International Journal of Computer Vision, 29(3):159–179, 1998.
[13] A. Criminisi, I. Reid, and A. Zisserman. Duality, rigidity and planar parallax. In Proceedings of the European Conference on Computer Vision, Freiburg, Germany, 1998. Springer, LNCS 1407.
[14] O.D. Faugeras. Three-Dimensional Computer Vision: A Geometric Viewpoint. MIT Press, 1993.
[15] O.D. Faugeras. Stratification of three-dimensional vision: projective, affine and metric representations. Journal of the Optical Society of America, 12(3):465–484, 1995.
[16] O. Faugeras and Q.T. Luong, with contributions from T. Papadopoulo. The Geometry of Multiple Images. MIT Press, 2001.
[17] O.D. Faugeras and B. Mourrain. On the geometry and algebra of the point and line correspondences between N images. In Proceedings of the International Conference on Computer Vision, Cambridge, MA, June 1995.
[18] O.D. Faugeras and T. Papadopoulo. Grassmann-Cayley algebra for modeling systems of cameras and the algebraic equations of the manifold of trifocal tensors. INRIA Rapport de recherche no. 3225, July 1997.
[19] A.W. Fitzgibbon and A. Zisserman. Multibody Structure and Motion: 3-D Reconstruction of Independently Moving Objects. In Proceedings of the European Conference on Computer Vision (ECCV), Dublin, Ireland, June 2000.
[20] W. Fulton and J. Harris. Representation Theory: A First Course. Springer-Verlag, New York, 1991.
[21] G.H. Golub and C.F. Van Loan. Matrix Computations, second edition. Johns Hopkins University Press, 1989, p. 582.
[22] M. Han and T. Kanade. Reconstruction of a Scene with Multiple Linearly Moving Objects. In Proc. of Computer Vision and Pattern Recognition, June 2000.
[23] M. Han and T. Kanade. Multiple motion scene reconstruction from uncalibrated views. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV '01), July 2001.
[24] R.I. Hartley. In Defense of the Eight-Point Algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(6), 1997.
[25] R.I. Hartley and A. Zisserman. Multiple View Geometry. Cambridge University Press, 2000.
[26] K. Huang, R. Fossum and Y. Ma. Generalized Rank Conditions in Multiple View Geometry with Applications to Dynamical Scenes. In Proceedings of the European Conference on Computer Vision (ECCV), Copenhagen, Denmark, May 2002.
[27] M. Irani. Multi-Frame Optical Flow Estimation Using Subspace Constraints. In IEEE International Conference on Computer Vision (ICCV), Corfu, September 1999.
[28] M. Irani and P. Anandan. Parallax geometry of pairs of points for 3D scene analysis. In Proceedings of the European Conference on Computer Vision, LNCS 1064, pages 17–30, Cambridge, UK, April 1996. Springer-Verlag.
[29] M. Irani, P. Anandan, and D. Weinshall. From reference frames to reference planes: Multiview parallax geometry and applications. In Proceedings of the European Conference on Computer Vision, Freiburg, Germany, 1998. Springer, LNCS 1407.
[30] M. Irani, B. Rousso, and S. Peleg. Computing Occluding and Transparent Motions. International Journal of Computer Vision, 12(1):5–16, January 1994.
[31] K. Kanatani. Motion Segmentation by Subspace Separation and Model Selection. In International Conference on Computer Vision (ICCV), Vancouver, Canada, July 2001.
[32] A. Levin and A. Shashua. Reconstruction of Dynamic 3D Motion from a Monocular Sequence of Infinitesimal Motion. Submitted to ICCV 2001.
[33] A. Levin, L. Wolf and A. Shashua. Time-varying Shape Tensors for Scenes with Multiply Moving Points. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Dec. 2001, Hawaii.
[34] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings IJCAI, pages 674–679, Vancouver, Canada, 1981.
[35] R.A. Manning and C.R. Dyer. Interpolating view and scene motion by dynamic view morphing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 388–394, Fort Collins, CO, June 1999.
[36] R.A. Manning and C.R. Dyer. Affine Calibration from Moving Objects. In The Eighth IEEE International Conference on Computer Vision (ICCV), June 2001.
[37] J.L. Mundy, A. Zisserman, D. Forsyth (Eds.). Applications of Invariance in Computer Vision. Springer-Verlag, Berlin Heidelberg, 1994.
[38] Open Source Computer Vision Library. http://www.intel.com/research/mrl/research/cvlib/
[39] P.J. Rousseeuw. Least Median of Squares Regression. Journal of the American Statistical Association, vol. 79, pp. 871–880, 1984.
[40] D. Segal and A. Shashua. 3D Reconstruction from Tangent-of-Sight Measurements of a Moving Object Seen from a Moving Camera. In Proc. of the European Conference on Computer Vision (ECCV), June 2000, Dublin, Ireland.
[41] A. Shashua. Trilinear tensor: The fundamental construct of multiple-view geometry and its applications. In G. Sommer and J.J. Koenderink, editors, Algebraic Frames For The Perception Action Cycle, number 1315 in Lecture Notes in Computer Science. Springer, 1997. Proceedings of the workshop held in Kiel, Germany, Sep. 1997.
[42] A. Shashua and S. Avidan. The rank 4 constraint in multiple view geometry. In Proceedings of the European Conference on Computer Vision, Cambridge, UK, April 1996.
[43] A. Shashua, S. Avidan and M. Werman. Trajectory Triangulation over Conic Sections. In International Conference on Computer Vision (ICCV), Sep. 1999.
[44] A. Shashua, R. Meshulam, L. Wolf, A. Levin and G. Kalai. On Representation Theory in Computer Vision Problems. Technical Report 2002-44, Leibniz Center for Research, School of Computer Science and Eng., The Hebrew University of Jerusalem, July 2002.
[45] A. Shashua and N. Navab. Relative affine structure: Canonical model for 3D from 2D geometry and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9):873–883, 1996.
[46] A. Shashua and L. Wolf. Homography tensors: On algebraic entities that represent three views of static or moving planar points. In Proceedings of the European Conference on Computer Vision (ECCV), Dublin, Ireland, June 2000.
[47] A. Shashua and L. Wolf. On the Structure and Properties of the Quadrifocal Tensor. In Proceedings of the European Conference on Computer Vision, Dublin, Ireland, June 2000.
[48] C.C. Slama, editor. Manual of Photogrammetry, Fourth Edition. American Society of Photogrammetry and Remote Sensing, Falls Church, Virginia, USA, 1980.
[49] M.E. Spetsakis and Y. Aloimonos. A Multi-frame Approach to Visual Motion Perception. International Journal of Computer Vision, pages 245–255, 1991.
[50] P. Sturm. Structure and Motion for Dynamic Scenes: The Case of Points Moving in Planes. In European Conference on Computer Vision (ECCV), May 2002.
[51] B. Sturmfels. Algorithms in Invariant Theory. Springer-Verlag, Wien/New York, 1993.
[52] P.H.S. Torr. Geometric motion segmentation and model selection. Phil. Trans. Roy. Soc. A, 356:1321–1340, 1998.
[53] R. Vidal, Y. Ma, S. Soatto, and S. Sastry. Two-view Multibody Structure from Motion. International Journal of Computer Vision, special issue on dynamic vision.
[54] D. Weinshall, M. Werman and A. Shashua. Duality of Multi-Point and Multi-Frame Geometry: Fundamental Shape Matrices and Tensors. In ECCV, April 1996.
[55] Y. Wexler and A. Shashua. On the synthesis of dynamic scenes from reference views. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, South Carolina, June 2000.
[56] L. Wolf and A. Shashua. Affine 3-D Reconstruction from Two Projective Images of Independently Translating Planes. In The Eighth IEEE International Conference on Computer Vision, July 2001.
[57] L. Wolf and A. Shashua. On Projection Matrices P^k → P^2, k = 3, 4, 5, 6, and their Applications in Computer Vision. International Journal of Computer Vision (IJCV), 48(1), 2002.
[58] L. Wolf and A. Shashua. Two-body Segmentation from Two Perspective Views. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Dec. 2001, Hawaii.
[59] L. Wolf, A. Shashua and Y. Wexler. Join Tensors: On 3D-to-3D Alignment of Dynamic Sets. In Proc. of the Int. Conf. on Pattern Recognition (ICPR), Sep. 2000, Barcelona, Spain.
[60] L. Wolf and A. Zomet. Sequence to Sequence Self Calibration. In European Conference on Computer Vision (ECCV), May 2002, Copenhagen, Denmark.
[61] L. Wolf and A. Zomet. Correspondence-free Synchronization and Reconstruction in a Non-rigid Scene. In Workshop on Vision and Modeling of Dynamic Scenes (with ECCV 2002).
[62] L. Zelnik-Manor and M. Irani. Degeneracies, Dependencies and their Implications in Multi-body and Multi-Sequence Factorizations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2003.