[ieee 2014 ieee winter conference on applications of computer vision (wacv) - steamboat springs, co,...

NRSfM using Local Rigidity

Ali Rehan, Aamer Zaheer, Ijaz Akhter, Arfah Saeed,

Muhammad Haris Usmani, Bilal Mahmood, and Sohaib Khan

Syed Babar Ali School of Science and Engineering

Lahore University of Management Sciences, Lahore, Pakistan.

cvlab.lums.edu.pk/LocalRigidity/

Abstract

In this paper we show that typical nonrigid structure can often be approximated well as locally rigid sub-structures in time and space. Specifically, we assume that: 1) the structure can be approximated as rigid in a short local time window and 2) some point-pairs stay relatively rigid in space, maintaining a fixed distance between them during the sequence. First, we use the triangulation constraints in rigid SfM over a sliding time window to get an initial estimate of the nonrigid 3D structure. Then we automatically identify relatively rigid point-pairs in this structure, and use their length-constancy simultaneously with triangulation constraints to refine the structure estimate. Local factorization inherently handles small camera motion, short sequences and significant natural occlusions gracefully, performing better than nonrigid factorization methods. We show more stable and accurate results as compared to the state of the art, even on short sequences of only 15 frames, with camera rotations as small as 2° and up to 50% contiguous missing data.

1. Introduction

The motion of 2D points in a video can be used to infer their 3D structure if the camera is moving. This 'Structure from Motion' (SfM) problem is well-posed in the rigid case, when the points do not move in 3D, due to the triangulation constraints arising from multiple 2D observations of the same 3D point [15]. The problem of Non-Rigid Structure from Motion (NRSfM) is much harder because the motion of 3D points makes triangulation constraints inapplicable. Since every 3D point at one time is observed exactly once in a 2D image, the number of unknowns far exceeds the number of constraints, making the problem ill-posed. To make the problem well-posed, the shape or trajectory of 3D points is often constrained to lie in a low-dimensional subspace [2, 16, 18, 1, 3]. These global factorization based methods contain inherent instabilities, and work well only for large camera motion and long tracked sequences. These practical shortcomings have not received much interest in the literature. Unlike the rigid case, a practical and stable solution to NRSfM remains an open problem.

Figure 1. Modeling nonrigid trajectory to be locally rigid: (a) Rigid Structure from Motion works on the principle of triangulation. (b) Triangulation is inapplicable because the 3D point has changed its position. (c) The local rigidity assumption approximates the 3D position of the point through triangulation. (d) Reconstructing 3D structure using spatiotemporal local rigidity: every window represents local rigidity in time and every line connecting two skeleton points represents local rigidity in space. (e) Relatively rigid points are discovered automatically.

In this paper, we observe that typical nonrigid structure can often be approximated well as locally rigid substructures in both time and space. Specifically, we make two assumptions: 1) that the structure is rigid over a short local time window, and 2) that some pairs of points stay relatively rigid, maintaining a fixed distance between them for the duration of the sequence. Exploiting the first assumption, we use the triangulation constraints in rigid SfM over a sliding time window to get an initial estimate of the nonrigid 3D structure. The second assumption is then used to automatically discover relatively rigid point-pairs, and their length-constancy is used simultaneously with the triangulation constraints to refine the structure estimate.

Figure 2. A step-by-step summary of our algorithm.

The main insight of our approach is that the stability of NRSfM can be improved by using rigid factorization locally, which essentially applies triangulation constraints to compute an approximation of the nonrigid 3D structure, as illustrated in Figure 1. These triangulation constraints were considered inapplicable to the NRSfM problem, but we identify them to be a powerful local approximation. The nonlinear refinement using length constraints handles even smaller baseline cases as well. The main steps of our algorithm are illustrated in Figure 2.

Occlusions or missing observations occur naturally and frequently in motion capture sequences. While several earlier papers deal with missing data [16, 7, 4, 8], they typically assume a random uncorrelated set of points to be missing in the image observation matrix. In reality, occlusions occur in chunks, because parts disappear behind the body and remain hidden for multiple frames. We show that handling this contiguous missing data is more challenging than what is often simulated in nonrigid SfM papers and deteriorates the accuracy of results considerably. Our approach naturally handles missing data: unlike current methods, we do not need to impute a full image observation matrix for the method to work. We show results on sequences with up to 50% of the data simulated to be missing in contiguous chunks rather than at random points.

Our major contributions are: 1) a novel constraint for NRSfM in the form of local rigidity, 2) stable results even with small camera motions and short sequences, and 3) a local factorization approach that inherently handles natural occlusions gracefully. These improvements point towards the potential for extension to practical nonrigid reconstruction.

2. Related Work

Nonrigid structure from motion is a well-studied area in Computer Vision. The most prominent approach to solve this problem is to constrain the 3D deformable structure to lie in a low-dimensional linear subspace. Bregler et al.'s seminal work in this direction proposed shape compactness as the constraint to make NRSfM well-posed [2]. Akhter et al. introduced a trajectory based approach as a dual to shape compactness [1]. Gotardo and Martinez later combined the shape and trajectory constraints and also extended the shape model to include a nonlinear shape basis [5]. Dai et al. improved the shape basis approach through a new optimization framework [3]. Recently Lee et al. [6] proposed the Procrustean normal distribution to model nonrigid deformations. The practical limitations common to all these methods include the inability to handle small camera motions, short input sequences and realistic occlusions.

Local rigidity in space has been used previously for reconstruction of an articulated skeleton [17]. Articulated reconstruction based on bone-length constraints was augmented with the trajectory approach by [8]. These methods require the input skeleton to be provided, while we automatically recover relatively rigid pairs of points which may or may not lie on a rigid subpart of the structure. Further, the articulated trajectory approach requires knowledge of the camera motion when the camera is moving. The same assumption has also been exploited for deformable surface reconstruction [11, 12, 13, 14]. Perhaps most notable is the fact that we are able to reconstruct deformable surfaces as well as different kinds of articulated and non-articulated deformable structures with the same generic method.

Repetition of frames was explored by Rabaud and Belongie [9] and Zhu et al. [19] to learn the shape space of nonrigid structure. Though repetition of structure forms the basis of local rigidity in time, both these methods are fundamentally different from ours because they require repetitions to be far apart for numerical stability, while rigidity in a temporal window is much more widely applicable.

3. Method

In this section we first discuss the proposed local rigidity constraints in space and time. Then we describe our algorithm to optimize these constraints, and finally we present the generalization of our approach to handle missing data.

3.1. Local Rigidity Constraints

Nonrigid structure at time instance $t$ can be represented as a concatenation of the 3D coordinates of $P$ points as $S_t = [X_t^1, \ldots, X_t^P]$, where $X_t^j = [x_t^j, y_t^j, z_t^j]^T$ denotes the 3D coordinates of the $j$-th point at the $t$-th time instance. The overall structure of the $F$ frames can be represented as a vertical concatenation of the instantaneous structures as

$$S_{3F \times P} = [S_1^T, \ldots, S_F^T]^T.$$

Analogous to the structure $S$, the measured 2D locations are contained in a $2F \times P$ measurement matrix $W$. The imaging process is modeled by an orthographic camera, where the camera matrix at a time instance is denoted by the $2 \times 3$ matrix $R_t$. The rows of $R_t$ have norm equal to 1 and are orthogonal to each other. We denote the vertical concatenation of the instantaneous camera matrices as a matrix $R$, i.e. $R_{2F \times 3} = [R_1^T, \ldots, R_F^T]^T$.

We assume that the nonrigid structure is locally rigid in a window of $N$ frames, where $N \ll F$. Considering the frames in the interval $(t - N/2, t + N/2)$, an approximate relation between the 3D structure and the image observations can be described as

$$W_{(2t-N:2t+N-1)} = R_{(2t-N:2t+N-1)} S_t, \qquad (1)$$

where the notation $W_{(i:j)}$ denotes consecutive rows in $W$ from index $i$ to $j$. Note that in Equation 1 we have mean-centered the observations and removed the translation component, along the lines of the technique presented in [15]. The complete nonrigid structure can be modeled by varying $t$ from 1 to $F$.

The estimation of the 3D structure can be done using the factorization technique proposed in [15]. The basic idea is to estimate a rank-3 factorization $W_{(2t-N:2t+N-1)} = \hat{R}_t \hat{S}_t$ and solve for an unknown $3 \times 3$ matrix $Q_t$ such that

$$R_{(2t-N:2t+N-1)} = \hat{R}_t Q_t, \qquad S_t = Q_t^{-1} \hat{S}_t. \qquad (2)$$

Since the $2 \times 3$ camera matrix $R_t$ consists of orthogonal rows of norm equal to 1, $R_t R_t^T = I_{2 \times 2}$. This gives rise to the following orthonormality constraints for the estimation of $Q_t$:

$$\hat{R}_t^i \, Q_t Q_t^T \, (\hat{R}_t^i)^T = I_{2 \times 2}, \qquad (3)$$

where $i = 1, 2, \ldots, N$ indexes the $2 \times 3$ blocks of $\hat{R}_t$, one per frame in the window. The orthonormality constraints can be used to estimate the rectification matrices $Q_t$, and consequently $S_t$ can be estimated.
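The windowed rank-3 factorization can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `local_rigid_factorization`, the 0-based frame index, the clamping of the window at the sequence boundaries and the even split of singular values between the two factors are all assumptions of this sketch.

```python
import numpy as np

def local_rigid_factorization(W, t, N):
    """Rank-3 factorization of the N-frame window of W centred on frame t.

    W is the 2F x P measurement matrix (x-row then y-row per frame) and t is
    a 0-based frame index; N <= F is assumed. Returns (R_hat, S_hat), which
    match the true cameras and structure only up to an unknown 3 x 3 matrix.
    """
    F = W.shape[0] // 2
    half = N // 2
    # Clamp the window at the sequence boundaries (an assumption of this sketch).
    start = min(max(t - half, 0), F - N)
    Ww = W[2 * start: 2 * (start + N)]
    # Remove the per-frame translation by mean-centring each row.
    Ww = Ww - Ww.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Ww, full_matrices=False)
    root = np.sqrt(s[:3])
    R_hat = U[:, :3] * root          # 2N x 3 unrectified cameras
    S_hat = root[:, None] * Vt[:3]   # 3 x P unrectified structure
    return R_hat, S_hat

# Toy check: a rigid point cloud seen by a camera rotating about the z-axis.
rng = np.random.default_rng(0)
S = rng.standard_normal((3, 12))
S -= S.mean(axis=1, keepdims=True)
rows = []
for a in np.deg2rad(5.0) * np.arange(10):
    c, sn = np.cos(a), np.sin(a)
    rows.append(np.array([[c, -sn, 0.0], [0.0, 0.0, 1.0]]) @ S)
W = np.vstack(rows)
R_hat, S_hat = local_rigid_factorization(W, t=4, N=5)
window_error = np.abs(R_hat @ S_hat - W[4:14]).max()
```

For a truly rigid window the product `R_hat @ S_hat` reproduces the windowed measurements; for a deforming object it is only the best rank-3 approximation, which is the local rigidity approximation described above.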

We observe that many nonrigid objects also exhibit local rigidity over space, which should also be exploited. Consequently, certain inter-point distances should remain constant over time. If the connectivity graph of such points is provided, additional constraints, called bone length constraints, can be imposed on the nonrigid structure, hence making the estimation more stable. We consider two points $X_t^j$ and $X_t^k$ which remain rigid with respect to each other for all values of $t$. Let $\hat{X}_t^j$ and $\hat{X}_t^k$ denote the $j$-th and $k$-th columns in $\hat{S}_t$. Since $Q_t^{-1} \hat{X}_t^j = X_t^j$ and $Q_t^{-1} \hat{X}_t^k = X_t^k$, these length constraints can be written as

$$\| Q_t^{-1} (\hat{X}_t^j - \hat{X}_t^k) \|_2 = \mu_{jk} \qquad (4)$$

for all values of $t$, where $\mu_{jk}$ is the mean length between $X_t^j$ and $X_t^k$, i.e. $\mu_{jk} = \frac{1}{F} \sum_{t=1}^{F} \| Q_t^{-1} (\hat{X}_t^j - \hat{X}_t^k) \|_2$, and $\| \cdot \|_2$ denotes the Euclidean norm. Hence, Equation 4 enforces the constraint that the length between points $X_t^j$ and $X_t^k$ should remain constant throughout the sequence. Equations 3 and 4 provide constraints for the estimation of $Q_t$, and consequently $S_t$ can be estimated.
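To make the length constraint concrete, the residual of Equation 4 can be evaluated as in the sketch below. The function name `bone_length_residuals` and its argument layout are hypothetical; the code only computes the deviation of each pair length from its temporal mean and does not perform any optimization.

```python
import numpy as np

def bone_length_residuals(Q_list, S_hat_list, pairs):
    """Residuals of the length-constancy constraint for the given point pairs.

    Q_list[t] is the 3 x 3 rectification matrix of frame t, S_hat_list[t] the
    3 x P unrectified structure, and pairs a list of (j, k) point indices.
    Returns an F x len(pairs) array of (length - mean length over time).
    """
    j = np.array([p[0] for p in pairs])
    k = np.array([p[1] for p in pairs])
    lengths = []
    for Q, S_hat in zip(Q_list, S_hat_list):
        S = np.linalg.solve(Q, S_hat)      # S_t = Q_t^{-1} S_hat_t
        lengths.append(np.linalg.norm(S[:, j] - S[:, k], axis=0))
    lengths = np.array(lengths)
    mu = lengths.mean(axis=0)              # mean length of each pair
    return lengths - mu

# Toy check: a rigid structure distorted by a different Q in every frame.
rng = np.random.default_rng(1)
S_true = rng.standard_normal((3, 6))
Q_list = [rng.standard_normal((3, 3)) + 3.0 * np.eye(3) for _ in range(4)]
S_hat_list = [Q @ S_true for Q in Q_list]
res = bone_length_residuals(Q_list, S_hat_list, [(0, 1), (2, 5)])
```

With the correct rectification matrices the residuals of a truly rigid pair vanish, which is what the refinement step exploits.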

3.2. Proposed Optimization

In order to optimize these constraints, we form a squared error cost function from Equations 3 and 4 and minimize it using Quasi-Newton optimization. To initialize the optimization, we use the orthonormality constraints given in Equation 3. Substituting $G_t = Q_t Q_t^T$, these constraints become linear in $G_t$. We use linear least squares to estimate $G_t$, and then $Q_t$ is estimated using Cholesky factorization. The estimated $Q_t$s are used to initialize the joint nonlinear optimization.
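The linear initialization can be sketched as below, under the assumption that the symmetric matrix $G_t$ is parameterized by its six unique entries. The helper names `sym_row` and `estimate_Q` are hypothetical, and the toy data use random cameras purely to keep the linear system well conditioned.

```python
import numpy as np

def sym_row(a, b):
    """Coefficients of a^T G b in the 6 unique entries of a symmetric G."""
    return np.array([a[0] * b[0],
                     a[0] * b[1] + a[1] * b[0],
                     a[0] * b[2] + a[2] * b[0],
                     a[1] * b[1],
                     a[1] * b[2] + a[2] * b[1],
                     a[2] * b[2]])

def estimate_Q(R_hat):
    """Estimate the rectification Q from a 2N x 3 unrectified camera stack."""
    A, rhs = [], []
    for i in range(0, R_hat.shape[0], 2):
        a, b = R_hat[i], R_hat[i + 1]
        # Unit norm of both rows and orthogonality between them.
        A += [sym_row(a, a), sym_row(b, b), sym_row(a, b)]
        rhs += [1.0, 1.0, 0.0]
    g, *_ = np.linalg.lstsq(np.array(A), np.array(rhs), rcond=None)
    G = np.array([[g[0], g[1], g[2]],
                  [g[1], g[3], g[4]],
                  [g[2], g[4], g[5]]])
    # Cholesky needs a positive definite G; with noisy data it may not be,
    # in which case numpy raises numpy.linalg.LinAlgError.
    return np.linalg.cholesky(G)   # lower-triangular Q with Q Q^T = G

# Toy check: true cameras with orthonormal rows, mixed by an unknown matrix.
rng = np.random.default_rng(2)
cams = []
for _ in range(6):
    rot, _r = np.linalg.qr(rng.standard_normal((3, 3)))
    cams.append(rot[:2])
R_true = np.vstack(cams)
A_mix = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)
R_hat = R_true @ np.linalg.inv(A_mix)
Q = estimate_Q(R_hat)
R_rect = R_hat @ Q
ortho_error = max(np.abs(R_rect[i:i + 2] @ R_rect[i:i + 2].T - np.eye(2)).max()
                  for i in range(0, 12, 2))
```

The recovered `Q` differs from the mixing matrix by an orthogonal transform, so the rectified cameras are orthonormal but expressed in an arbitrary rotated frame.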

The initial estimate of the $Q_t$s is also used for automatic skeleton estimation. For this purpose, we use the $Q_t$s to estimate the instantaneous structures $S_t$ using Equation 2. We then estimate the variances in the lengths of all possible point pairs and select the $2P$ pairs with the least length variation. The selected pairs are taken as the rigid point-pairs.
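The pair selection step described above can be sketched in a few lines. The function name `select_rigid_pairs` and the `(F, 3, P)` array layout are assumptions; the default of $2P$ pairs follows the text.

```python
import numpy as np
from itertools import combinations

def select_rigid_pairs(S_seq, num_pairs=None):
    """Pick the point pairs whose 3D distance varies least over time.

    S_seq has shape (F, 3, P). Returns a list of (j, k) index pairs; by
    default 2P pairs are returned (or all pairs if fewer exist).
    """
    F, _, P = S_seq.shape
    pairs = list(combinations(range(P), 2))
    if num_pairs is None:
        num_pairs = 2 * P
    num_pairs = min(num_pairs, len(pairs))
    j = np.array([p[0] for p in pairs])
    k = np.array([p[1] for p in pairs])
    # Length of every pair in every frame: shape (F, number of pairs).
    lengths = np.linalg.norm(S_seq[:, :, j] - S_seq[:, :, k], axis=1)
    order = np.argsort(lengths.var(axis=0), kind="stable")
    return [pairs[i] for i in order[:num_pairs]]

# Toy check: points 0 and 1 form a rigid bar, points 2 and 3 move freely.
rng = np.random.default_rng(3)
F = 20
S_seq = rng.standard_normal((F, 3, 4))
shift = rng.standard_normal((F, 3))
S_seq[:, :, 0] = shift
S_seq[:, :, 1] = shift + np.array([1.0, 0.0, 0.0])
best = select_rigid_pairs(S_seq, num_pairs=1)
```

Because all pairs are ranked, the cost of this step grows quadratically with the number of points, which is usually acceptable for the point counts found in motion capture data.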

The estimated rigid point-pairs and the initial estimate of the $Q_t$s are then used to optimize the orthonormality constraints and bone length constraints given in Equations 3 and 4. We use a truncated cost function to penalize the variations in bone lengths. We estimate the 80th percentile of the bone length variances and truncate the larger costs to the cost of the 80th percentile. Please note that our approach only penalizes large variations in bone lengths rather than enforcing them as hard constraints. Therefore, small variations in bone lengths are allowed and a precise bone connectivity graph is not required. Optimizing the orthonormality and bone length constraints gives us the rectification transforms $Q_t$. Finally, the instantaneous structures $S_t$ are estimated using Equation 2.
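The truncation step alone can be illustrated as follows. This is only a sketch of clipping per-bone variances at a percentile; the function name `truncated_bone_cost` is hypothetical, and how the authors combine this term with the orthonormality cost inside the Quasi-Newton solver is not specified here.

```python
import numpy as np

def truncated_bone_cost(lengths, percentile=80.0):
    """Sum of per-bone length variances, clipped at the given percentile.

    lengths is an F x B array holding the length of each of B candidate
    bones in each of F frames.
    """
    per_bone = lengths.var(axis=0)
    cap = np.percentile(per_bone, percentile)
    # Bones that vary more than the cap contribute only the cap itself.
    return np.minimum(per_bone, cap).sum()

# Toy data: four perfectly rigid bones and one badly chosen pair.
L = np.array([[1.0, 1.0, 1.0, 1.0, 1.0],
              [1.0, 1.0, 1.0, 1.0, 11.0]])
cost = truncated_bone_cost(L)
full = L.var(axis=0).sum()
```

Clipping limits the influence of point pairs that were wrongly selected as rigid, which is why an exact connectivity graph is not needed.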

3.3. Rotation Alignment

It should be noted that the instantaneous structures $S_t$ estimated through the above optimization may not be aligned with each other. This is because the factorization given in Equation 1 can only be recovered up to a $3 \times 3$ orthogonal transform.

In fact, rotation ambiguity is inherent to the problem of NRSfM if we look at the basic constraint equation, in which $R$ denotes the $2F \times 3F$ block-diagonal matrix of per-frame cameras:

$$\begin{aligned} W &= R S && (5) \\ &= \begin{bmatrix} R_1 & & \\ & \ddots & \\ & & R_F \end{bmatrix} \begin{bmatrix} S_1 \\ \vdots \\ S_F \end{bmatrix}. && (6) \end{aligned}$$

If we take any block diagonal rotation matrix $U_{3F \times 3F}$, whose $3 \times 3$ diagonal blocks represent arbitrary 3D rotations while the rest of the elements are zeros, then:

$$\begin{aligned} W &= R S, && (7) \\ &= R U^T U S, && (8) \\ &= R' S', && (9) \end{aligned}$$

where $R' = R U^T$ still represents a correct truncated rotation matrix and $R' S' = W$. It implies that the same 2D observations could have been generated by arbitrarily rotating the structure in each frame while rotating the cameras by the inverse of these arbitrary per-frame rotations. This leads to a per-frame rotation ambiguity instead of the desirable recovery of all the structures up to the same rotation alignment.
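The ambiguity in Equations 7 to 9 can be checked numerically. The sketch below uses random toy data and a hypothetical helper `random_rotation`; it applies the block-diagonal product frame by frame rather than forming the large matrices explicitly.

```python
import numpy as np

def random_rotation(rng):
    """Random 3 x 3 rotation (orthogonal, determinant +1) via QR."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))      # fix the sign convention of the columns
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]           # flip one axis to obtain det = +1
    return q

rng = np.random.default_rng(4)
F, P = 5, 8
S_frames = [rng.standard_normal((3, P)) for _ in range(F)]
R_frames = [random_rotation(rng)[:2] for _ in range(F)]   # 2 x 3 cameras
U_frames = [random_rotation(rng) for _ in range(F)]       # arbitrary rotations

W = np.vstack([R @ S for R, S in zip(R_frames, S_frames)])
# Rotate every frame's structure by its own U_t and its camera by U_t^T.
R_alt = [R @ U.T for R, U in zip(R_frames, U_frames)]
S_alt = [U @ S for U, S in zip(U_frames, S_frames)]
W_alt = np.vstack([R @ S for R, S in zip(R_alt, S_alt)])

same_images = np.allclose(W, W_alt)
still_orthonormal = all(np.allclose(R @ R.T, np.eye(2)) for R in R_alt)
```

Both the original and the rotated solution explain the images equally well, so an additional convention is needed to bring all frames into a common alignment.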

To fix the alignment of the $S_t$s, we estimate the camera matrix $R_t$ using the following linear constraint:

$$W_{(2t-1:2t+1)} = R_t S_t. \qquad (10)$$

The third row of the camera matrix is estimated as the cross product of its first two rows. Then we multiply this rotation with $S_t$ to bring the structure into its canonical view. This is equivalent to a static camera observing a rotating nonrigid object.
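A per-frame version of this alignment might look as follows. The function name `align_to_camera` is hypothetical, the structure is assumed to be non-planar so that the least-squares problem is well posed, and the projection of the estimated camera onto orthonormal rows is an addition of this sketch that the text does not mention.

```python
import numpy as np

def align_to_camera(W_t, S_t):
    """Rotate the structure of one frame into the camera's coordinate frame.

    W_t is the 2 x P mean-centred image of the frame and S_t the 3 x P
    structure estimate. Returns (R_full, S_canonical).
    """
    # Least-squares solution of W_t = R_t S_t for the 2 x 3 camera.
    R_t = np.linalg.lstsq(S_t.T, W_t.T, rcond=None)[0].T
    # Snap to the nearest matrix with orthonormal rows (sketch-only step).
    U, _, Vt = np.linalg.svd(R_t, full_matrices=False)
    R_t = U @ Vt
    # Third row from the cross product of the first two rows.
    R_full = np.vstack([R_t, np.cross(R_t[0], R_t[1])])
    return R_full, R_full @ S_t

# Toy check with a known camera.
rng = np.random.default_rng(5)
S_t = rng.standard_normal((3, 10))
S_t -= S_t.mean(axis=1, keepdims=True)
rot, _r = np.linalg.qr(rng.standard_normal((3, 3)))
W_t = rot[:2] @ S_t
R_full, S_canonical = align_to_camera(W_t, S_t)
```

After alignment the first two rows of the canonical structure coincide with the image coordinates, which corresponds to the static-camera interpretation.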

Our approach is in contrast to most nonrigid structure from motion approaches, which model the imaging process as a rotating camera observing a nonrigid object. It should be noted that both of these interpretations are perfectly valid and are equivalent from the image observation point of view.

3.4. Occlusion Handling

In order to formulate a sequential approach to handle missing data in nonrigid structure from motion, we partition the image observation matrix into small overlapping batches of F frames each (F is typically 20-30). Each batch consists of only the points whose tracks are completely visible in all F frames. We run the above algorithm on each batch and estimate the corresponding 3D structure. By concatenating the 3D structure for each batch, we get the overall nonrigid structure. We use the overlapping reconstructed points to Procrustes-align one batch onto another. After alignment, overlapping points are averaged in the concatenated structure.
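The alignment and averaging of two overlapping batches can be sketched for a single shared frame as follows. This is a simplified illustration: the names `procrustes_rotation` and `merge_batches` are hypothetical, at least three non-collinear shared points are assumed, and the orthogonal Procrustes solution used here may include a reflection, which the text does not discuss.

```python
import numpy as np

def procrustes_rotation(A, B):
    """Orthogonal matrix T minimising ||T A - B||_F for 3 x M point sets."""
    U, _, Vt = np.linalg.svd(B @ A.T)
    return U @ Vt

def merge_batches(S_a, ids_a, S_b, ids_b):
    """Align batch b onto batch a using their shared points, then merge.

    S_a and S_b are 3 x Pa and 3 x Pb structures of the same frame; ids_a and
    ids_b list the global point index of each column. Shared points are
    averaged after alignment.
    """
    shared = [i for i in ids_a if i in ids_b]
    A = S_b[:, [ids_b.index(i) for i in shared]]
    B = S_a[:, [ids_a.index(i) for i in shared]]
    ca, cb = A.mean(axis=1, keepdims=True), B.mean(axis=1, keepdims=True)
    T = procrustes_rotation(A - ca, B - cb)
    S_b_aligned = T @ (S_b - ca) + cb
    merged = {i: S_a[:, col] for col, i in enumerate(ids_a)}
    for col, i in enumerate(ids_b):
        if i in merged:
            merged[i] = 0.5 * (merged[i] + S_b_aligned[:, col])
        else:
            merged[i] = S_b_aligned[:, col]
    ids = sorted(merged)
    return ids, np.column_stack([merged[i] for i in ids])

# Toy check: batch b sees the same shape in a rotated, shifted frame.
rng = np.random.default_rng(6)
S_full = rng.standard_normal((3, 6))
rot, _r = np.linalg.qr(rng.standard_normal((3, 3)))
ids_a, ids_b = [0, 1, 2, 3, 4], [1, 2, 3, 4, 5]
S_a = S_full[:, ids_a]
S_b = rot @ S_full[:, ids_b] + np.array([[0.5], [-1.0], [2.0]])
ids, S_merged = merge_batches(S_a, ids_a, S_b, ids_b)
```

Points seen only in one batch are carried over after alignment, so the merged structure covers the union of the points in both batches.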

Figure 3. Effect of missing data in chunks: We generate synthetic occlusions by deleting chunks of length 25 at random positions in the 2D tracks. We plot the reconstruction error against the percentage of missing data in chunks (10% to 50%). Plots show that LRA clearly outperforms MP [7] and KSTA [5].

This completes the description of our method. It is based on the key insight that local inaccuracy of the model can be traded off for stability of estimation in challenging scenarios, as demonstrated in the following section.

4. Quantitative and Qualitative Evaluation

We performed extensive quantitative and qualitative evaluation of the proposed method on both synthetic and real datasets. For quantitative evaluation, we took motion capture sequences and generated synthetic orthographic images by rotating the camera around the z-axis by an angle θ per frame. The synthetic images formed the image observation matrix W. We ran our algorithm on W and estimated the nonrigid structure. We applied the estimated camera matrices to the 3D structure to mimic a static camera scenario. We report errors as the mean Euclidean distance between the estimated structure and the ground truth in millimeters, where the maximum bone length in these datasets is roughly 200 mm.
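The synthetic setup and the error measure can be sketched as follows. The function names are hypothetical, and the choice of image axes (the rotated x-axis and the z-axis) is an assumption of this sketch, since the text only states that the camera rotates about the z-axis.

```python
import numpy as np

def synth_measurements(S_seq, theta_deg):
    """Orthographic images of an (F, 3, P) sequence under a rotating camera.

    The camera turns by theta_deg per frame about the z-axis. Returns the
    2F x P measurement matrix and the list of 2 x 3 camera matrices.
    """
    rows, cams = [], []
    for t, S in enumerate(S_seq):
        a = np.deg2rad(theta_deg) * t
        R = np.array([[np.cos(a), -np.sin(a), 0.0],
                      [0.0, 0.0, 1.0]])
        cams.append(R)
        # Mean-centre each frame so that translation drops out.
        rows.append(R @ (S - S.mean(axis=1, keepdims=True)))
    return np.vstack(rows), cams

def mean_point_error(S_est, S_gt):
    """Mean Euclidean distance between corresponding points, in input units."""
    return np.linalg.norm(S_est - S_gt, axis=1).mean()

# Toy check on random data standing in for a motion capture sequence.
rng = np.random.default_rng(7)
S_seq = rng.standard_normal((4, 3, 5))
W, cams = synth_measurements(S_seq, theta_deg=5.0)
err_zero = mean_point_error(S_seq, S_seq)
err_shift = mean_point_error(S_seq + 1.0, S_seq)
```

The error is reported in the units of the input coordinates, so millimeter-scale ground truth yields errors in millimeters.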

4.1. Handling Missing Data

To study the effect of missing data, we generate synthetic occlusions in chunks and evaluate our sequential approach. We choose the stretch, yoga, pickup, and drink datasets, with 300 frames each. We generate synthetic images with per-frame camera motion of 5° and generate occlusions by deleting chunks of length 25 at random positions in the 2D tracks. We set N = 5 and batch length F = 20. In Figure 3, we plot the reconstruction error of our approach versus the percentage of missing data in chunks. We also compare our results against the Kernel Shape Trajectory Approach (KSTA) by Gotardo and Martinez [5] and Metric Projections (MP) by Paladini et al. [7]. In Figure 4 we report a qualitative comparison of these results. The results clearly show that our approach (LRA) outperforms KSTA and MP.
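Chunked occlusions of this kind can be simulated with a visibility mask, as in the sketch below. The function name `chunk_occlusion_mask`, the random choice of track and start frame, and the stopping rule are assumptions; the chunk length of 25 and the batch length of 20 follow the text, while the number of points is an arbitrary toy value.

```python
import numpy as np

def chunk_occlusion_mask(F, P, fraction, chunk=25, seed=0):
    """Boolean F x P visibility mask with contiguous missing chunks.

    Chunks of `chunk` consecutive frames are deleted from randomly chosen
    tracks until at least `fraction` of all entries is missing.
    """
    if not (0.0 <= fraction < 1.0 and 0 < chunk <= F):
        raise ValueError("need 0 <= fraction < 1 and 0 < chunk <= F")
    rng = np.random.default_rng(seed)
    visible = np.ones((F, P), dtype=bool)
    target = int(round(fraction * F * P))
    while (~visible).sum() < target:
        p = rng.integers(P)
        start = rng.integers(F - chunk + 1)
        visible[start:start + chunk, p] = False
    return visible

# 300 frames as in the text; 15 points is a toy value.
mask = chunk_occlusion_mask(300, 15, 0.3)
missing = 1.0 - mask.mean()
# Points usable in the batch covering frames 0..19 (fully visible tracks only).
usable = np.flatnonzero(mask[0:20].all(axis=0))
```

The last line mirrors the batch construction of Section 3.4, where only tracks that are visible in every frame of a batch are reconstructed.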

4.2. Batch Length (F) and Camera Motion (θ)

We also analyze the numerical stability of our approach against the batch length (F) and per-frame camera motion (θ), and compare it against existing methods in nonrigid SfM. In Figure 5 we analyze the effect of varying the per-frame camera motion and the number of frames on the reconstruction error and compare it against KSTA [5] and the Block Matrix Method (BMM) by Dai et al. [3].

Figure 4. Qualitative comparison of handling missing data in chunks: We report results on the stretch, drink, pickup, and yoga datasets containing 10%, 30%, 40%, and 50% missing data respectively (top to bottom, 2 frames from each dataset). We set N = 5, F = 20, and per-frame camera motion = 5°. The number of bases (K) for KSTA [5] and MP [7] was selected as the value between 2 and 13 that gives the minimum reconstruction error. Results show that LRA outperforms KSTA and MP.

We choose one MOCAP walk sequence and three synthetic sequences, stretch, pickup and yoga, from [1]. We divide the whole sequence into non-overlapping chunks of length F = {15, 30, 45, 60} and generate the image observation matrix W by varying θ from 1.5° to 6.0°. We run our algorithm and estimate the mean reconstruction error over all sequences and all chunks. We compare our results against KSTA [5] and BMM [3]. For our method, we keep the window size fixed at N = 5, whereas we vary the number of bases (K) in the methods of Gotardo and Martinez and of Dai et al. between 2 and 13 and report only the best result of each method. Figure 5 shows that LRA gives a significantly smaller structure reconstruction error than KSTA and BMM for small F and θ, whereas for larger values of F and θ all methods become comparable.

Figure 5. Effect of per-frame camera motion (θ) and number of frames F on reconstruction error. Panels: F = 15, 30, 45 and 60, each with N = 5; horizontal axis: camera motion per frame θ (in degrees); curves: LRA, KSTA, BMM. The results are averaged over non-overlapping chunks of length F on the stretch, yoga, and pickup datasets from [1]. LRA gives a significantly smaller reconstruction error than PTA [1], KSTA [5], and BMM [3] for small F and θ.

4.3. Stability and Model Accuracy

An important insight of our model is a tradeoff between numerical stability and model inaccuracy. As the window size increases, both the model error and the numerical stability increase. In Figure 6(a) we plot the mean reconstruction error of the proposed method, LRA, for varying window size N and per-frame camera rotation θ. We select a chunk of 100 frames from the pickup sequence such that it contains the main action. We generate synthetic orthographic images and test our algorithm. Plots show that the reconstruction error remains constant for a fairly long range of N, but then starts increasing. Larger values of N provide more constraints in the optimization, hence increasing the numerical stability. However, a large window also increases the model error. The near-flat area of the plot shows that model inaccuracy and numerical stability compensate for each other, whereas beyond a certain value of N the model inaccuracy outweighs the gain in numerical stability. LRA becomes equivalent to rigid structure from motion for N = F.

In Figure 6(b) we analyze the effect of Gaussian noise on the reconstruction error by varying N. We select a chunk of 100 frames from the walk sequence and generate synthetic orthographic images with θ = 3°. Plots demonstrate the robustness of our approach against noise.

4.4. Real Results

In Figure 7 we report results on three real sequences: Dinosaur, Matrix1 and Matrix2, having 50, 100 and 40 frames respectively, where the first two sequences are from [1]. We choose N equal to 19, 17, and 13 for LRA, and K equal to 3, 6 and 3 for KSTA, for Dinosaur, Matrix1 and Matrix2 respectively. K was selected as the value which gives the best reconstruction qualitatively. Figure 7 shows that LRA gives qualitatively better reconstruction results compared to KSTA and BMM. Video results are available on the project page [10].

Figure 6. Structure reconstruction error for varying window size (N), per-frame camera motion (θ), and observation noise (σ), with F = 100. Panels: (a) Effect of N, horizontal axis: window size (N); (b) Effect of noise, horizontal axis: standard deviation of the noise σ (in mm). Figure (a) shows that the reconstruction error decreases with the increase in θ and that LRA remains quite insensitive to the selection of N for a fairly long range. Figure (b) shows the robustness of LRA against noise.

Figure 7. Reconstruction results on real sequences including Dinosaur, Matrix1 and Matrix2. Reconstruction results in two views are shown for two arbitrary frames of each sequence. Comparison of our method with BMM [3] and KSTA [5] shows qualitatively better results.

5. Conclusion

In this paper, we show that the novel constraint for NRSfM in the form of local rigidity gives stable results in challenging realistic scenarios with small camera motions and shorter sequences. Moreover, we demonstrate that this local factorization approach inherently handles natural occlusions gracefully. These improvements are a first step towards a practical NRSfM algorithm.

One limitation of the new approach, as compared to the subspace reduction methods, is a slight degradation in the quality of reconstruction in the case of a large number of frames with large camera motion. This occurs because we trade off model accuracy for method stability. We believe that combining the power of local rigidity constraints with subspace reduction can help achieve the best of both worlds, and is a possible future direction.

6. Acknowledgements

The research was funded in part by the Higher Education Commission of Pakistan and Lahore University of Management Sciences.

References

[1] I. Akhter, Y. Sheikh, S. Khan, and T. Kanade. Nonrigid Structure from Motion in Trajectory Space. NIPS, 2008.
[2] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3D shape from image streams. CVPR, pages 690-696, 2000.
[3] Y. Dai, H. Li, and M. He. A simple prior-free method for non-rigid structure-from-motion factorization. CVPR, 2012.
[4] P. F. U. Gotardo and A. M. Martinez. Computing smooth time-trajectories for camera and deformable shape in structure from motion with occlusion. IEEE Trans. PAMI, 2011.
[5] P. F. U. Gotardo and A. M. Martinez. Kernel non-rigid structure from motion. ICCV, 2011.
[6] M. Lee, J. Cho, C.-H. Choi, and S. Oh. Procrustean normal distribution for non-rigid structure from motion. CVPR, 2013.
[7] M. Paladini, A. D. Bue, M. Stosic, M. Dodig, J. Xavier, and L. Agapito. Factorization for non-rigid and articulated structure using metric projections. CVPR, pages 2898-2905, 2009.
[8] H. S. Park and Y. Sheikh. 3D reconstruction of a smooth articulated trajectory from a monocular image sequence. ICCV, 2011.
[9] V. Rabaud and S. Belongie. Linear embeddings in non-rigid structure from motion. CVPR, 2009.
[10] A. Rehan. Project page - Nonrigid Structure from Motion using Local Rigidity. cvlab.lums.edu.pk/LocalRigidity/, 2014.
[11] M. Salzmann, R. Hartley, and P. Fua. Convex optimization for deformable surface 3-D tracking. ICCV, 2007.
[12] J. Sanchez-Riera, J. Ostlund, P. Fua, and F. Moreno-Noguer. Simultaneous pose, correspondence and non-rigid shape. CVPR, 2010.
[13] A. Shaji, A. Varol, L. Torresani, and P. Fua. Simultaneous point matching and 3D deformable surface reconstruction. CVPR, 2010.
[14] J. Taylor, A. D. Jepson, and K. N. Kutulakos. Non-rigid structure from locally-rigid motion. CVPR, 2010.
[15] C. Tomasi and T. Kanade. Shape and Motion from Image Streams Under Orthography: A Factorization Method. IJCV, 9(2), 1992.
[16] L. Torresani, A. Hertzmann, and C. Bregler. Nonrigid Structure from Motion: Estimating Shape and Motion with Hierarchical Priors. IEEE Trans. on PAMI, 30(5), 2008.
[17] X. Wei and J. Chai. Modeling 3D human poses from uncalibrated monocular images. ICCV, 2009.
[18] J. Xiao, J. Chai, and T. Kanade. A Closed Form Solution to Non-Rigid Shape and Motion Recovery. IJCV, 67(2), 2006.
[19] J. Zhu, S. C. Hoi, Z. Xu, and M. R. Lyu. An effective approach to 3D deformable surface tracking. ECCV, 2008.
