[ieee 2014 ieee winter conference on applications of computer vision (wacv) - steamboat springs, co,...
NRSfM using Local Rigidity
Ali Rehan, Aamer Zaheer, Ijaz Akhter, Arfah Saeed,
Muhammad Haris Usmani, Bilal Mahmood, and Sohaib Khan
Syed Babar Ali School of Science and Engineering
Lahore University of Management Sciences, Lahore, Pakistan.
cvlab.lums.edu.pk/LocalRigidity/
Abstract
In this paper we show that typical nonrigid structure can
often be approximated well as locally rigid sub-structures
in time and space. Specifically, we assume that: 1) the
structure can be approximated as rigid in a short local time
window and 2) some point-pairs stay relatively rigid in
space, maintaining a fixed distance between them during
the sequence. First, we use the triangulation constraints
in rigid SfM over a sliding time window to get an initial
estimate of the nonrigid 3D structure. Then we automatically
identify relatively rigid point-pairs in this structure,
and use their length-constancy simultaneously with
triangulation constraints to refine the structure estimate.
Local factorization inherently handles small camera motion,
short sequences and significant natural occlusions gracefully,
performing better than nonrigid factorization methods.
We show more stable and accurate results as compared
to the state of the art on even short sequences starting from
15 frames only, containing camera rotations as small as 2°
and up to 50% contiguous missing data.
1. Introduction
The motion of 2D points in a video can be used to infer
their 3D structure if the camera is moving. This 'Structure
from Motion' (SfM) problem is well-posed in the rigid
case, when the points do not move in 3D, due to the
triangulation constraints arising from multiple 2D observations
of the same 3D point [15]. The problem of Non-Rigid Structure
From Motion (NRSfM) is much harder because the motion
of 3D points makes triangulation constraints inapplicable.
Since every 3D point at one time is observed exactly
once in a 2D image, the number of unknowns far exceeds
the number of constraints, making the problem ill-posed.
To make the problem well-posed, the shape or trajectory of
3D points is often constrained to lie in a low-dimensional
subspace [2, 16, 18, 1, 3]. These global factorization based
methods contain inherent instabilities, and work well only
for large camera motion and long tracked sequences. These
practical shortcomings have not received much interest in
the literature. Unlike the rigid case, a practical and stable
solution to NRSfM remains an open problem.
Figure 1. Modeling nonrigid trajectory to be locally rigid: (a)
Rigid Structure from Motion works on the principle of
triangulation. (b) Triangulation is inapplicable because the 3D
point has changed its position. (c) The local rigidity assumption
approximates the 3D position of the point through triangulation.
(d) Reconstructing 3D structure using spatiotemporal local
rigidity: every window represents local rigidity in time and every
line connecting two skeleton points represents local rigidity in
space. (e) Relatively rigid points are discovered automatically.
In this paper, we observe that typical nonrigid structure
can often be approximated well as locally rigid substructures
in both time and space. Specifically, we make two
assumptions: 1) that the structure is rigid over a short local
time window, and 2) that some pairs of points stay relatively
rigid, maintaining a fixed distance between them for
the duration of the sequence. Exploiting the first assumption,
we use the triangulation constraints in rigid SfM over
a sliding time window to get an initial estimate of the
nonrigid 3D structure. The second assumption is then used to
automatically discover relatively rigid point-pairs, and their
length-constancy is used simultaneously with the triangulation
constraints to refine the structure estimate.
Figure 2. A step-by-step summary of our algorithm
The main insight of our approach is that the stability of
NRSfM can be improved by using rigid factorization locally,
which essentially applies triangulation constraints to
compute an approximation of the nonrigid 3D structure, as
illustrated in Figure 1. These triangulation constraints were
considered inapplicable for the NRSfM problem, but we identify
them to be a powerful local approximation. The nonlinear
refinement using length constraints additionally handles cases
with even smaller baselines. The main steps of our algorithm are
illustrated in Figure 2.
Occlusions or missing observations occur naturally and
frequently in motion capture sequences. While several earlier
papers deal with missing data [16, 7, 4, 8], they typically
assume a random uncorrelated set of points to be missing in
the image observation matrix. In reality, occlusions occur in
chunks, because of parts disappearing behind the body and
remaining hidden for multiple frames. We show that handling
this contiguous missing data is more challenging than
what is often simulated in nonrigid SfM papers and deteriorates
the accuracy of results considerably. Our approach
naturally handles missing data: unlike current methods,
we do not need to impute a full image observation matrix for
the method to work. We show results on sequences with up
to 50% data simulated to be missing in contiguous chunks
rather than at random points.
Our major contributions are: 1) a novel constraint for
NRSfM in the form of local rigidity, 2) stable results even
with small camera motions and short sequences, and 3) a
local factorization approach that inherently handles natural
occlusions gracefully. These improvements point towards
the potential for extension to practical nonrigid
reconstruction.
2. Related Work
Nonrigid structure from motion is a well-studied area in
Computer Vision. The most prominent approach to solve
this problem is to constrain the 3D deformable structure to
lie in a low-dimensional linear subspace. Bregler et al.'s
seminal work in this direction proposed shape compactness
as the constraint to make NRSfM well-posed [2]. Akhter
et al. introduced a trajectory based approach as dual to
shape compactness [1]. Gotardo and Martinez later combined
the shape and trajectory constraints and also extended
the shape model to include nonlinear shape basis [5]. Dai
et al. improved the shape basis approach through a new
optimization framework [3]. Recently Lee et al. [6] proposed
the Procrustean normal distribution to model nonrigid
deformations. The practical limitations common to all these
methods include the inability to handle small camera motions,
short input sequences and realistic occlusions.
Local rigidity in space has been used previously for
reconstruction of an articulated skeleton [17]. Articulated
reconstruction based on bone-length constraints was augmented
with the trajectory approach by [8]. These methods
require the input skeleton to be provided, while we
automatically recover relatively rigid pairs of points which may
or may not lie on a rigid subpart of the structure. Further,
the articulated trajectory approach requires knowledge of
camera motions in cases of a moving camera. The same
assumption has also been exploited for deformable surface
reconstruction [11, 12, 13, 14]. Perhaps most notable is the
fact that we are able to reconstruct deformable surfaces as
well as different kinds of articulated and non-articulated
deformable structures with the same generic method.
Repetition of frames was explored by Rabaud and Belongie
[9] and Zhu et al. [19] to learn the shape space of
nonrigid structure. Though repetition of structure forms the
basis of local rigidity in time, both these methods are
fundamentally different from ours because they require
repetitions to be far apart for numerical stability, while rigidity
in a temporal window is much more widely applicable.
3. Method
In this section we first discuss the proposed local rigidity
constraints in space and time. Then we describe our algorithm
to optimize these constraints, and finally we present
the generalization of our approach to handle missing data.
3.1. Local Rigidity Constraints
Nonrigid structure at time instance t can be represented
as a concatenation of 3D coordinates of P points as
$S_t = [X_t^1, \ldots, X_t^P]$, where $X_t^j = [x_t^j, y_t^j, z_t^j]^T$
denotes the 3D coordinates of the j-th point at the t-th time
instance. The overall structure of the F frames can be
represented as a vertical concatenation of instantaneous
structures as
$$S_{3F \times P} = [S_1^T, \ldots, S_F^T]^T.$$
Analogous to structure S, measured 2D locations are contained
in a 2F x P measurement matrix W. The imaging
process is modeled by an orthographic camera, where the
camera matrix at a time instance is denoted by the 2 x 3
matrix $R_t$. The rows in $R_t$ have norm equal to 1 and are
orthogonal to each other. We denote the vertical concatenation
of instantaneous camera matrices as a matrix R, i.e.
$$R_{2F \times 3} = [R_1^T, \ldots, R_F^T]^T.$$
We assume that the nonrigid structure is locally rigid in
a window of N frames, where $N \ll F$. Considering the
frames in the interval $(t - N/2, t + N/2)$, an approximate
relation between 3D structure and the image observation
can be described as
$$W_{(2t-N:2t+N-1)} = R_{(2t-N:2t+N-1)} S_t, \tag{1}$$
where the notation $W_{(i:j)}$ denotes consecutive rows in W
from index i to j. Note that in Equation 1 we have
mean-centered the observations to remove the translation
component, following the technique presented in [15]. The
complete nonrigid structure can be modeled by varying t from
1 to F.
The estimation of the 3D structure can be done using the
factorization technique proposed in [15]. The basic idea is
to estimate a rank-3 factorization of $W_{(2t-N:2t+N-1)}$ such
that $W_{(2t-N:2t+N-1)} = \hat{R}_t \hat{S}_t$ and solve for an
unknown matrix $Q_t$ such that
$$R_{(2t-N:2t+N-1)} = \hat{R}_t Q_t, \qquad S_t = Q_t^{-1} \hat{S}_t. \tag{2}$$
Since the 2 x 3 camera matrix $R_t$ consists of orthogonal rows
of norm equal to 1, $R_t R_t^T = I_{2 \times 2}$. This gives rise to
the following orthonormality constraints for the estimation
of $Q_t$:
$$\hat{R}_t^{i} Q_t Q_t^T (\hat{R}_t^{i})^T = I_{2 \times 2}, \tag{3}$$
where $\hat{R}_t^{i}$ denotes the i-th 2 x 3 block of $\hat{R}_t$ and
i = 1, 2, ..., N. Orthonormality constraints can be
used to estimate the rectification matrices $Q_t$, and
consequently $S_t$ can be estimated.
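The windowed rank-3 factorization described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `local_rank3_factorization`, the 0-based frame index, the odd window size N, and the even split of singular values between the two factors are all assumptions made here.

```python
import numpy as np

def local_rank3_factorization(W, t, N):
    """Rank-3 factorization of the window of N frames centred on frame t.

    W is 2F x P with rows (2f, 2f+1) holding the x and y coordinates of
    frame f (0-based). N is assumed odd and the window must fit inside W.
    """
    half = N // 2
    rows = slice(2 * (t - half), 2 * (t + half + 1))  # frames t-half .. t+half
    Ww = W[rows]
    Ww = Ww - Ww.mean(axis=1, keepdims=True)          # remove translation
    U, s, Vt = np.linalg.svd(Ww, full_matrices=False)
    R_hat = U[:, :3] * np.sqrt(s[:3])                 # 2N x 3 affine motion
    S_hat = np.sqrt(s[:3])[:, None] * Vt[:3]          # 3 x P affine shape
    return R_hat, S_hat
```

The factors are only defined up to an invertible 3 x 3 matrix, which is the role played by $Q_t$ in the text.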
We observe that many nonrigid objects also exhibit local
rigidity over space, which should also be exploited.
Consequently, certain inter-point distances should remain
constant over time. If the connectivity graph of such points
is provided, additional constraints, called bone length
constraints, can be imposed on the nonrigid structure, hence
making the estimation more stable. We consider two points
$X_t^j$ and $X_t^k$ which remain rigid with respect to each other
for all values of t. Let $\hat{X}_t^j$ and $\hat{X}_t^k$ denote the j-th
and k-th columns in $\hat{S}_t$. Since $Q_t^{-1} \hat{X}_t^j = X_t^j$ and
$Q_t^{-1} \hat{X}_t^k = X_t^k$, these length constraints can be
written as
$$\| Q_t^{-1} (\hat{X}_t^j - \hat{X}_t^k) \|_2 = \mu_{jk} \tag{4}$$
for all values of t, where $\mu_{jk}$ is the mean length between
$X_t^j$ and $X_t^k$, i.e.
$\mu_{jk} = \frac{1}{F} \sum_{t=1}^{F} \| Q_t^{-1} (\hat{X}_t^j - \hat{X}_t^k) \|_2$,
and $\|\cdot\|_2$ denotes the Euclidean norm. Hence, Equation 4
enforces the constraint that the length between points $X_t^j$ and
$X_t^k$ should remain constant throughout the sequence.
Equations 3 and 4 provide constraints for the estimation of
$Q_t$, and consequently $S_t$ can be estimated.
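As an illustration of the length constraint, the deviations of each pair's length from its mean over the sequence can be evaluated as below. This is a sketch under stated assumptions: the name `bone_length_residuals`, the list-based inputs, and the representation of pairs as (j, k) index tuples are hypothetical and not taken from the paper.

```python
import numpy as np

def bone_length_residuals(S_hats, Qs, pairs):
    """Deviation of each pair's length from its mean over the sequence.

    S_hats: list of F affine shapes (3 x P); Qs: list of F rectification
    matrices (3 x 3); pairs: list of (j, k) point-index tuples.
    Returns an F x len(pairs) array of (length - mean length).
    """
    j, k = np.asarray(pairs).T
    lengths = np.array([
        # np.linalg.solve(Q, D) computes Q^{-1} D without forming the inverse
        np.linalg.norm(np.linalg.solve(Q, S_hat[:, j] - S_hat[:, k]), axis=0)
        for S_hat, Q in zip(S_hats, Qs)
    ])
    mu = lengths.mean(axis=0)   # mean length of each pair over all frames
    return lengths - mu
```

For a perfectly rigid pair and correct rectification matrices, every entry of the corresponding column is zero.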
3.2. Proposed Optimization
In order to optimize these constraints, we form a squared
error cost function using Equations 3 and 4 and minimize it
using Quasi-Newton optimization. To initialize the
optimization, we use the orthonormality constraints given
in Equation 3. Substituting $G_t = Q_t Q_t^T$, these constraints
become linear in $G_t$. We use linear least squares to estimate
$G_t$, and then $Q_t$ is estimated using Cholesky factorization.
The estimated $Q_t$s are used to initialize the joint nonlinear
optimization.
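The linear initialization can be sketched as follows. The helper names `_row` and `metric_upgrade` are hypothetical, and the parametrization of the symmetric matrix $G_t$ by its six unique entries is one common choice, assumed here rather than taken from the paper.

```python
import numpy as np

def _row(a, b):
    """Coefficients of a^T G b in (g11, g12, g13, g22, g23, g33) for symmetric G."""
    return np.array([a[0] * b[0],
                     a[0] * b[1] + a[1] * b[0],
                     a[0] * b[2] + a[2] * b[0],
                     a[1] * b[1],
                     a[1] * b[2] + a[2] * b[1],
                     a[2] * b[2]])

def metric_upgrade(R_hat):
    """Estimate Q from the 2N x 3 affine motion factor via G = Q Q^T."""
    A, b = [], []
    for i in range(R_hat.shape[0] // 2):
        r1, r2 = R_hat[2 * i], R_hat[2 * i + 1]
        A += [_row(r1, r1), _row(r2, r2), _row(r1, r2)]  # unit norms, zero dot
        b += [1.0, 1.0, 0.0]
    g = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)[0]
    G = np.array([[g[0], g[1], g[2]],
                  [g[1], g[3], g[4]],
                  [g[2], g[4], g[5]]])
    # Cholesky raises LinAlgError if the least-squares G is not positive definite,
    # which can happen with noisy or nearly degenerate data.
    return np.linalg.cholesky(G)   # lower-triangular Q with Q Q^T = G
```

Any Q satisfying $Q Q^T = G_t$ is valid up to a rotation, so the lower-triangular Cholesky factor is only one representative.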
The initial estimate of the $Q_t$s is also used for automatic
skeleton estimation. For this purpose, we use the $Q_t$s to
estimate the instantaneous structures $S_t$ using Equation 2. We
then estimate the variances in lengths of all possible pairs
and select the 2P pairs with the least length variation. The
selected pairs are taken as the rigid point-pairs.
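The pair selection step amounts to ranking all point pairs by the variance of their length over time. A minimal sketch, assuming the instantaneous structures are stacked in an F x 3 x P array (the layout and the name `select_rigid_pairs` are choices made here):

```python
import numpy as np
from itertools import combinations

def select_rigid_pairs(structures):
    """Return the 2P point pairs whose length varies least over the sequence.

    structures: F x 3 x P array of instantaneous 3D structures.
    """
    F, _, P = structures.shape
    pairs = np.array(list(combinations(range(P), 2)))        # all P(P-1)/2 pairs
    diff = structures[:, :, pairs[:, 0]] - structures[:, :, pairs[:, 1]]
    lengths = np.linalg.norm(diff, axis=1)                    # F x num_pairs
    order = np.argsort(lengths.var(axis=0))                   # most rigid first
    return pairs[order[:2 * P]]
```

Because all pairs are enumerated, the cost grows quadratically with P, which is manageable for the motion capture point counts discussed here.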
The estimated rigid point-pairs and the initial estimate
of the $Q_t$s are then used to optimize the orthonormality
constraints and bone length constraints given in Equations 3 and
4. We use a truncated cost function to penalize the variations
in bone lengths. We estimate the 80th percentile of
bone length variances and truncate the larger costs to the
cost of the 80th percentile. Please note that our approach
penalizes variations in bone lengths rather than enforcing
them as hard constraints. Therefore, small variations
in bone lengths are allowed and a precise bone connectivity
graph is not required. Hence optimizing the orthonormality and
bone length constraints gives us the rectification transforms
$Q_t$. Finally, the instantaneous structures $S_t$ are estimated
using Equation 2.
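One possible reading of the truncated cost is sketched below. It assumes that the truncation is applied per pair to the summed squared length residuals; the paper does not spell out this detail, so the function `truncated_bone_cost` and its input layout are illustrative only.

```python
import numpy as np

def truncated_bone_cost(residuals, percentile=80.0):
    """Bone-length cost with per-pair costs capped at a percentile.

    residuals: F x num_pairs array of (length - mean length) values.
    """
    per_pair = (residuals ** 2).sum(axis=0)     # one squared-error cost per pair
    cap = np.percentile(per_pair, percentile)   # cost at the chosen percentile
    return np.minimum(per_pair, cap).sum()      # larger costs are truncated
```

Capping the cost limits the influence of pairs that were selected as rigid but in fact deform, which is consistent with the remark that a precise connectivity graph is not required.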
3.3. Rotation Alignment
It should be noted that the instantaneous structures $S_t$
estimated through the above optimization may not be aligned
with each other. This is because the factorization given in
Equation 1 can only be recovered up to a 3 x 3 orthogonal
transform.
In fact, rotation ambiguity is inherent to the problem of
NRSfM if we look at the basic constraints equation:
(5)
(6)
If we take any block diagonal rotation matrix $U_{3F \times 3F}$,
whose 3 x 3 diagonal blocks represent arbitrary 3D rotations
while the rest of the elements are zeros, then:
$$\begin{aligned}
W &= RS && (7) \\
  &= R U^T U S && (8) \\
  &= R' S', && (9)
\end{aligned}$$
where R' still represents a correct truncated rotation matrix
and R'S' = W. It implies that the same 2D observations
could have been generated by arbitrarily rotating the
structure such that the cameras were rotated by an inverse of
these arbitrary per-frame rotations, leading to a per-frame
rotation ambiguity instead of a desirable recovery of all the
structures up to the same rotation alignment.
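The ambiguity can be checked numerically on a per-frame basis, which is equivalent to the block diagonal form. The helper `rotate_per_frame` below is a hypothetical name used only for this illustration.

```python
import numpy as np

def rotate_per_frame(Rs, Ss, Us):
    """Apply an arbitrary rotation U_t to each frame.

    Rs: list of 2 x 3 cameras, Ss: list of 3 x P structures, Us: list of
    3 x 3 rotations. Returns R'_t = R_t U_t^T and S'_t = U_t S_t.
    """
    R_new = [R @ U.T for R, U in zip(Rs, Us)]
    S_new = [U @ S for S, U in zip(Ss, Us)]
    return R_new, S_new
```

Since $U_t^T U_t = I$, each product $R'_t S'_t$ equals $R_t S_t$, and the rows of $R'_t$ remain orthonormal, so the image observations cannot distinguish between the two solutions.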
To fix the alignment of the $S_t$s, we estimate the camera
matrix $R_t$ using the following linear constraint:
$$W_{(2t-1:2t+1)} = R_t S_t. \tag{10}$$
The third row of the camera matrix is estimated by the cross
product of its first two rows. Then we multiply this rotation
with $S_t$ and bring the structure into its canonical view. Thus
this approach is equivalent to a static camera observing
a rotating nonrigid object.
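A sketch of this alignment step is given below, assuming the per-frame observations and the estimated structure are available as 2 x P and 3 x P arrays. The function name `align_to_camera` and the use of ordinary least squares are assumptions; the sketch also does not re-orthonormalize the estimated camera rows.

```python
import numpy as np

def align_to_camera(W_t, S_t):
    """Estimate the 2 x 3 camera from W_t = R_t S_t and rotate S_t into its frame.

    W_t: 2 x P image observations of frame t; S_t: 3 x P estimated structure.
    """
    S_c = S_t - S_t.mean(axis=1, keepdims=True)
    W_c = W_t - W_t.mean(axis=1, keepdims=True)
    # Solve S_c^T R_t^T = W_c^T in the least-squares sense.
    R_t = np.linalg.lstsq(S_c.T, W_c.T, rcond=None)[0].T
    R_full = np.vstack([R_t, np.cross(R_t[0], R_t[1])])   # third row by cross product
    return R_full @ S_c
```

After this step the first two rows of the returned structure reproduce the centred image observations, which is the static camera interpretation described above.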
Our approach is in contrast to most nonrigid structure
from motion approaches, which model the imaging process
as if a rotating camera is observing a nonrigid object. It
should be noted that both of these interpretations are
perfectly valid and are equivalent from the image observation
point of view.
3.4. Occlusion Handling
In order to formulate a sequential approach to handle
missing data in nonrigid structure from motion, we partition
the image observation matrix into small overlapping
batches of F frames each (F is typically 20-30). Each
batch consists of only the points whose tracks are completely
visible in all F frames. We run the above algorithm
on each batch and estimate the corresponding 3D structure.
By concatenating the 3D structure for each batch, we get
the overall nonrigid structure. We use overlapping
reconstructed points to Procrustes-align one chunk onto another.
After alignment, overlapping points are averaged in the
concatenated structure.
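The alignment of one batch onto another can be sketched with the standard SVD solution of the orthogonal Procrustes problem. The sketch assumes a rigid alignment (rotation and translation, no scale or reflection), which the paper does not specify; `procrustes_align` is a hypothetical name.

```python
import numpy as np

def procrustes_align(X_ref, X_new):
    """Rigidly align X_new (3 x M) onto X_ref (3 x M) using shared points.

    Returns (Rot, trans) such that Rot @ X_new + trans approximates X_ref.
    """
    mu_r = X_ref.mean(axis=1, keepdims=True)
    mu_n = X_new.mean(axis=1, keepdims=True)
    H = (X_new - mu_n) @ (X_ref - mu_r).T          # 3 x 3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Force a proper rotation (determinant +1) rather than a reflection.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    Rot = Vt.T @ D @ U.T
    return Rot, mu_r - Rot @ mu_n
```

The recovered transform would be estimated from the points shared by two overlapping batches and then applied to the whole of the newer batch before averaging.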
[Figure 3 plot: reconstruction error versus percentage of missing data in chunks (10 to 50).]
Figure 3. Effect of missing data in chunks: We generate synthetic
occlusions by deleting chunks of length = 25 at random positions
in 2D tracks. We plot the amount of missing data in percentage
versus the reconstruction error. Plots show that LRA clearly
outperforms MP [7] and KSTA [5].
This completes the description of our method, based on
the key insight that local inaccuracy of the model can be traded
off for stability of estimation in challenging scenarios, as
demonstrated in the following section.
4. Quantitative and Qualitative Evaluation
We did extensive quantitative and qualitative evaluation
of the proposed method on both synthetic and real datasets.
For quantitative evaluation, we took Motion Capture
sequences and generated synthetic orthographic images by
rotating the camera around the z-axis with an angle θ. Synthetic
images formed the image observation matrix W. We ran
our algorithm on W and estimated the nonrigid structure.
We applied the estimated camera matrices on the 3D structure
and mimicked a static camera scenario. We reported errors
as the mean Euclidean distance between the estimated
structure and the ground truth in millimeters, where the
maximum bone length in these datasets is roughly 200 mm.
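The synthetic setup and error measure can be sketched as below. The camera convention is an assumption made for this illustration: the camera orbits the world z-axis, its horizontal image axis lies in the xy-plane and its vertical image axis is the world z-axis. The names `synthetic_orthographic` and `mean_point_error` are hypothetical.

```python
import numpy as np

def synthetic_orthographic(S_seq, theta_deg):
    """Project an F x 3 x P sequence with a camera orbiting the z-axis.

    The camera advances by theta_deg degrees per frame. Returns the
    2F x P observation matrix W with translation removed per frame.
    """
    rows = []
    for f, S in enumerate(S_seq):
        a = np.deg2rad(theta_deg * f)
        R = np.array([[-np.sin(a), np.cos(a), 0.0],   # horizontal image axis
                      [0.0, 0.0, 1.0]])               # vertical image axis
        rows.append(R @ (S - S.mean(axis=1, keepdims=True)))
    return np.vstack(rows)

def mean_point_error(S_est, S_gt):
    """Mean Euclidean distance between corresponding 3D points (F x 3 x P)."""
    return np.linalg.norm(S_est - S_gt, axis=1).mean()
```

The error is expressed in the units of the input, so millimetre motion capture data yields an error in millimetres.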
4.1. Handling Missing Data
To study the effect of missing data, we generate synthetic
occlusions in chunks and evaluate our sequential approach.
We choose the stretch, yoga, pickup, and drink datasets,
with 300 frames each. We generate synthetic images with
per-frame camera motion = 5° and generate occlusions by
deleting chunks of length = 25 at random positions in 2D
tracks. We set N = 5 and batch length F = 20. In Figure
3, we plot the reconstruction error of our approach versus
the percentage of missing data in chunks. We also compare
our results against the Kernel Shape Trajectory Approach
(KSTA) by Gotardo and Martinez [5] and Metric Projections
(MP) by Paladini et al. [7]. In Figure 4 we report a
qualitative comparison of these results. Results clearly show
that our approach (LRA) outperforms KSTA and MP.
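Contiguous occlusions of this kind can be simulated with a visibility mask. The sketch below is one way to do so and is not the authors' generator: chunks may overlap, the requested fraction can be exceeded by up to one chunk, and the name `chunked_occlusion_mask` and its parameters are assumptions.

```python
import numpy as np

def chunked_occlusion_mask(F, P, fraction, chunk=25, seed=0):
    """Boolean F x P mask (True = visible) with contiguous hidden chunks.

    Chunks of `chunk` consecutive frames are hidden in randomly chosen
    tracks until at least `fraction` of all entries are hidden.
    Requires F >= chunk and fraction < 1.
    """
    rng = np.random.default_rng(seed)
    visible = np.ones((F, P), dtype=bool)
    target = int(round(fraction * F * P))
    while (~visible).sum() < target:
        p = rng.integers(P)                       # track to occlude
        start = rng.integers(0, F - chunk + 1)    # first hidden frame
        visible[start:start + chunk, p] = False
    return visible
```

For example, `chunked_occlusion_mask(300, 40, 0.3)` hides roughly 30% of a 300-frame, 40-point sequence in runs of at least 25 frames.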
Figure 4. Qualitative comparison of handling missing data in
chunks: We report results on stretch, drink, pickup, and yoga
data sets containing 10%, 30%, 40%, and 50% missing data
respectively (top to bottom, 2 frames from each dataset). We set
N = 5, F = 20, and per-frame camera motion = 5°. The number
of bases (K) for KSTA [5] and MP [7] was selected as the
best value between 2 and 13, i.e. the one which gives the minimum
reconstruction error. Results show that LRA outperforms KSTA and MP.
4.2. Batch Length (F) and Camera Motion (θ)
We also analyze the numerical stability of our approach
against the batch length (F) and per-frame camera motion
and compare that against existing methods in nonrigid SfM.
In Figure 5 we analyze the effect of varying per-frame camera
motion and the number of frames on the reconstruction
error and compare it against KSTA [5] and the Block Matrix
Method (BMM) by Dai et al. [3].
We choose one MOCAP walk sequence and three synthetic
sequences, stretch, pickup and yoga, from [1]. We
divide the whole sequence into non-overlapping chunks of
length F = {15, 30, 45, 60} and generate the image observation
matrix W by varying θ from 1.5° to 6.0°. We run
our algorithm and estimate the mean reconstruction error
over all sequences and all chunks. We compare our results
against KSTA [5] and BMM [3]. For our method, we kept the
window size fixed at N = 5, whereas we varied the number
of bases (K) in Gotardo and Martinez, and Dai et al. between
2 and 13 and only report the best results of each method.
Figure 5 shows that LRA gives significantly smaller structure
reconstruction error than KSTA and BMM
for small F and θ, whereas for larger values of F and θ, all
methods become comparable.
[Figure 5 plots: four panels (F=15, F=30, F=45, F=60; N=5 in each); x-axis, camera motion per frame θ (in degrees); curves for LRA, KSTA and BMM.]
Figure 5. Effect of per-frame camera motion (θ) and number of
frames F on reconstruction error. The results are averaged on
non-overlapping chunks of length F on stretch, yoga, and pickup
data sets from [1]. LRA gives significantly smaller reconstruction
error than PTA [1], KSTA [5], and BMM [3] for small F and θ.
4.3. Stability and Model Accuracy
An important insight of our model is a tradeoff between
numerical stability and model inaccuracy. With the increase
of window size, both model error and numerical stability
increase. In Figure 6(a) we plot the mean reconstruction
error of the proposed method, LRA, by varying window size
N and per-frame camera rotation θ. We select a chunk of
length 100 frames from the pickup sequence such that it
contains the main action. We generate synthetic orthographic
images and test our algorithm. Plots show that the
reconstruction error remains constant for a fairly long range of
N, but then starts increasing. Larger values of N provide
more constraints in the optimization, hence increasing the
numerical stability. However, a large window also increases
the model error. The nearly flat area of the plot shows a
compensation between model inaccuracy and numerical stability,
whereas beyond a certain value of N, model inaccuracy
outweighs the numerical stability. LRA becomes equivalent
to rigid structure from motion for N = F.
In Figure 6(b) we analyze the effect of Gaussian noise
on reconstruction error by varying N. We select a chunk
of length 100 frames from the walk sequence and generate
synthetic orthographic images with θ = 3°. Plots demonstrate
the robustness of our approach against noise.
[Figure 6 plots: (a) Effect of N, x-axis window size (N), F = 100; (b) Effect of noise, x-axis standard deviation in noise σ (in mm).]
Figure 6. Structure reconstruction error by varying window size
(N), per-frame camera motion (θ), and observation noise (σ).
Figure (a) shows that the reconstruction error decreases with the
increase in θ and LRA remains quite insensitive to the selection of
N for a fairly long range. Figure (b) shows the robustness of LRA
against noise.
4.4. Real Results
In Figure 7 we report results on three real sequences:
Dinosaur, Matrix1 and Matrix2, having 50, 100 and 40 frames
respectively, where the first two sequences are from [1]. We
choose N equal to 19, 17, and 13 for LRA, and K equal to
3, 6 and 3 for KSTA, for Dinosaur, Matrix1 and Matrix2
respectively. K was selected as the value which gives the
best reconstruction qualitatively. Figure 7 shows that LRA
gives qualitatively better reconstruction results compared to
KSTA and BMM. Video results are available on the project
page [10].
[Figure 7 images: columns show LRA, KSTA and BMM.]
Figure 7. Reconstruction results on real sequences including
Dinosaur, Matrix1 and Matrix2. Reconstruction results in two views
are shown for two arbitrary frames for each sequence. Comparison
of our method with BMM [3] and KSTA [5] shows qualitatively
better results.
5. Conclusion
In this paper, we show that the novel constraint for
NRSfM in the form of local rigidity gives stable results in
challenging realistic scenarios with small camera motions
and shorter sequences. Moreover, we demonstrate that this
local factorization approach inherently handles natural
occlusions gracefully. These improvements are a first step
towards a practical NRSfM algorithm.
One limitation of the new approach, as compared to
the subspace reduction methods, is a slight degradation in
the quality of reconstruction in the case of a large number of
frames with large camera motion. This occurs because we trade off
model accuracy for method stability. We believe that combining
the power of local rigidity constraints with subspace
reduction can help achieve the best of both worlds, and is a
possible future direction.
6. Acknowledgements
The research was funded in part by the Higher Education
Commission of Pakistan and Lahore University of Management
Sciences.
References
[1] I. Akhter, Y. Sheikh, S. Khan, and T. Kanade. Nonrigid
Structure from Motion in Trajectory Space. NIPS, 2008.
[2] C. Bregler, A. Hertzmann, and H. Biermann. Recovering
non-rigid 3D shape from image streams. CVPR, pages 690-696,
2000.
[3] Y. Dai, H. Li, and M. He. A simple prior-free method for
non-rigid structure-from-motion factorization. CVPR, 2012.
[4] P. F. U. Gotardo and A. M. Martinez. Computing smooth
time-trajectories for camera and deformable shape in structure
from motion with occlusion. IEEE Trans. PAMI, 2011.
[5] P. F. U. Gotardo and A. M. Martinez. Kernel non-rigid
structure from motion. ICCV, 2011.
[6] M. Lee, J. Cho, C.-H. Choi, and S. Oh. Procrustean normal
distribution for non-rigid structure from motion. CVPR,
2013.
[7] M. Paladini, A. D. Bue, M. Stosic, M. Dodig, J. Xavier,
and L. Agapito. Factorization for non-rigid and articulated
structure using metric projections. CVPR, pages 2898-2905,
2009.
[8] H. S. Park and Y. Sheikh. 3d reconstruction of a smooth
articulated trajectory from a monocular image sequence. ICCV,
2011.
[9] V. Rabaud and S. Belongie. Linear embeddings in non-rigid
structure from motion. CVPR, 2009.
[10] A. Rehan. Project page - Nonrigid Structure from Motion
using Local Rigidity. cvlab.lums.edu.pk/localrigidity, 2014.
[11] M. Salzmann, R. Hartley, and P. Fua. Convex optimization
for deformable surface 3-D tracking. ICCV, 2007.
[12] J. Sanchez-Riera, J. Ostlund, P. Fua, and F. Moreno-Noguer.
Simultaneous pose, correspondence and non-rigid shape.
CVPR, 2010.
[13] A. Shaji, A. Varol, L. Torresani, and P. Fua. Simultaneous
point matching and 3d deformable surface reconstruction.
CVPR, 2010.
[14] J. Taylor, A. D. Jepson, and K. N. Kutulakos. Non-rigid
structure from locally-rigid motion. CVPR, 2010.
[15] C. Tomasi and T. Kanade. Shape and Motion from Image
Streams Under Orthography: A Factorization Method. IJCV,
9(2), 1992.
[16] L. Torresani, A. Hertzmann, and C. Bregler. Nonrigid
Structure from Motion: Estimating Shape and Motion with
Hierarchical Priors. IEEE Trans. on PAMI, 30(5), 2008.
[17] X. Wei and J. Chai. Modeling 3d human poses from
uncalibrated monocular images. ICCV, 2009.
[18] J. Xiao, J. Chai, and T. Kanade. A Closed Form Solution to
Non-Rigid Shape and Motion Recovery. IJCV, 67(2), 2006.
[19] J. Zhu, S. C. Hoi, Z. Xu, and M. R. Lyu. An effective
approach to 3d deformable surface tracking. ECCV, 2008.