
Motion Segmentation

Research Exam Report

Vincent Rabaud
Computer Science and Engineering Department
UCSD

    Abstract

Videos add a temporal dimension to images but also a whole new complexity to computer vision problems. For example, performing object recognition on video data is potentially easier as more angles of view and lighting are available in the video. Similarly, stereo vision, which deduces the 3D structure of a scene based on several view angles, not only benefits from the greater number of views, but also from the fact that they are taken continuously in time and in space. But when dealing with movies, by definition, objects move. Consequently, it becomes crucial to separate an object from the background or even to make the distinction between several objects before any further analysis. Challenges for such a task increase with the number of objects and the complexity of their movements, as well as with the quality of the video. We will present several techniques to recover motion from a video sequence as well as several existing methods to segment the observed motion.


Contents

1. Introduction
2. Background
   2.1. Notations
   2.2. Camera Models
   2.3. Motion Models
3. Motion Recovery
   3.1. Dense Motion Recovery
        3.1.1 Optical Flow
        3.1.2 Block Correlation
        3.1.3 Horn-Schunck Equation
        3.1.4 Pyramidal Approach
   3.2. Local Motion Recovery
        3.2.1 KLT tracker
        3.2.2 Affine Invariant Features
        3.2.3 Particles
4. Motion Clustering
   4.1. Clustering Concept
   4.2. Clustering Common Motions
        4.2.1 Graph-Based Algorithms
        4.2.2 RANSAC
   4.3. Layer-Based Techniques
        4.3.1 Basic Concept
        4.3.2 Complex Layers/Sprites
        4.3.3 Large Intraframe motion
   4.4. Full 3D Motion Clustering
        4.4.1 Unique Rigid Body
        4.4.2 Multiple Rigid Bodies
        4.4.3 Articulated Body
5. Applications
   5.1. Structure From Motion
   5.2. Video Processing
6. Discussion
7. Conclusion
A. Geometrical Transformations
B. Stereo Vision


1. Introduction

Videos add a temporal dimension to images but also a whole new complexity to computer vision problems. For example, performing object recognition on video data is potentially easier as more angles of view and lighting are available in the video. Similarly, stereo vision, which deduces the 3D structure of a scene based on several view angles, not only benefits from the greater number of views, but also from the fact that they are taken continuously in time and in space. But when dealing with movies, by definition, objects move. Consequently, it becomes crucial to separate an object from the background or even to make the distinction between several objects before any further analysis.

A way of achieving this goal is to track objects in the video [16, 17, 7]. But the main focus of tracking is to follow an object in a video sequence, which does not necessarily imply a recovery of its full motion nor knowledge about other objects in the scene. When performing an ideal motion segmentation, the movement of each pixel in the video sequence has to be determined and the resulting components of the independent motions have to be separated.

Motion segmentation is a crucial step in the analysis of video data in computer vision, but it also bears some importance in telecommunications. Motion segmentation enables a better understanding of the current scene, which opens possibilities for compression or processing (e.g. stabilization). It can also be used to describe the scene and embed the information in the signal, like in the MPEG-7 standard.

We will first present some general notions about cameras and how motion in the 3D world can be interpreted in terms of its 2D image projection via the notion of optical flow. We will then focus on several approaches for recovering the optical flow, presenting them by increasing complexity and efficiency, which almost coincides with their chronological development. We will then focus on several motion clustering techniques before describing several applications.

    2. Background

In motion segmentation, the raw footage to be analyzed is a set of images depicting one or several moving objects. We will mainly focus on the common case of successive frames from a video sequence, but some methods shown in this paper can be applied to images more separated in time.

    2.1. Notations

    In this paper, we will use the following notations:

• a capital letter is used for matrices (e.g. $A$) while a bold font refers to vectors (e.g. $\mathbf{x}$). $0_n$ refers to a vector of dimensions $n \times 1$ whose elements are all 0.

• $I$ is used for an image and $I^t$ for the image at instant $t$ in a video sequence.

Figure 1. Illustration of video segmentation (from [35]). First row: original frames of size 180 × 240. Second and third rows: segmented layers, each corresponding to a different motion in the video sequence.

• $x$ represents the pixel of coordinates $x = (x, y)^\top$ (or $(x, y, 1)^\top$ if working in homogeneous coordinates) and $I(x)$ represents the intensity of the pixel of coordinates $x$ in image $I$. Also, $\mathcal{N}(x)$ represents a set of pixels neighboring $x$.

• $I_x$ and $I_y$ are the gradient images of $I$ and are defined as follows:

$$\forall x \in I, \quad I_x(x) = \frac{\partial I}{\partial x}(x) \quad \text{and} \quad I_y(x) = \frac{\partial I}{\partial y}(x)$$

Similarly, we define the gradient "image" (which is actually a vector field):

$$\forall x \in I, \quad \nabla I(x) = \left(\frac{\partial I}{\partial x}(x), \frac{\partial I}{\partial y}(x)\right)^\top$$

• in the same way, for a video sequence we define the temporal derivative $I^t_t$ of a picture $I^t$ as:

$$I^t_t(x) = \frac{\partial I^t}{\partial t}(x)$$

These derivatives cannot be exact as the given data is, for technical reasons, discrete: pictures are taken at a certain number of frames per second and they are composed of discrete units: pixels.

• $(U, V)$ are the x and y velocity images: $(U(x), V(x))^\top$ represents the speed vector of pixel $x$ in the considered image. It is defined like the usual speed:

$$\forall x \in I, \quad U(x) = \frac{dx}{dt} \quad \text{and} \quad V(x) = \frac{dy}{dt}$$

• for the sake of simplicity, the formulas will often be written with the image notation, e.g.:

$$I_x + I_y = 0 \iff \forall x \in I, \; I_x(x) + I_y(x) = 0$$


Figure 2. A pinhole camera is a box in which one of the walls has been pierced to make a small hole through it and the opposite wall is made translucent. Assuming that the hole is indeed just a point, exactly one ray from each point in the scene passes through the pinhole and hits the wall opposite to it. This results in an inverted image of the scene (top picture). We use the frontal pinhole camera model, which is equivalent except that the image is projected before the pinhole, hence giving a non-inverted image (bottom picture).

    2.2. Camera Models

In the different cases presented in this report, the model chosen for a camera is a frontal pinhole camera, cf. Figure 2.

When considering such a model, and if the world coordinate system and the camera coordinate system are the same, a point of homogeneous coordinates $X = (X, Y, Z, 1)^\top$ in the 3D world will be mapped to a point $x = (x, y, 1)^\top$ in the image plane obeying the equation:

$$x = K \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} X$$

where $K$ is a $3 \times 3$ matrix called the intrinsic parameter matrix: it describes some internal properties of the camera (like where the optical center is located, the focal length, etc.). In a more general way, let us consider that the camera coordinate system is the world coordinate system transformed by a Euclidean transform $(R, d)$ (rotation + displacement). Therefore, a 3D point of coordinates $X$ in the world coordinate system has coordinates $X_{cam}$ in the camera referential such that:

$$X_{cam} = RX + d$$

Consequently, the coordinates of the corresponding pixel in the picture are:

$$x = \underbrace{K}_{\text{intrinsic parameters}} \; \underbrace{\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}}_{\text{canonical projection}} \; \underbrace{\begin{bmatrix} R & d \\ 0_3^\top & 1 \end{bmatrix}}_{\text{extrinsic parameters}} X$$

or, after grouping the matrices into the $3 \times 4$ projection matrix $P$, we have the general expression of camera projection:

$$x = PX \qquad (1)$$
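As an illustration of Equation (1), here is a minimal numpy sketch (the function name and the intrinsic values in the example are ours, not from the report) that assembles the projection matrix $P$ from intrinsic and extrinsic parameters and projects homogeneous 3D points:

```python
import numpy as np

def project(K, R, d, X_world):
    """Project 3D points into the image with x = K [I|0] [R d; 0 1] X (Eq. 1).

    K       : 3x3 intrinsic parameter matrix
    R, d    : rotation (3x3) and displacement (3,) of the world-to-camera transform
    X_world : Nx3 array of 3D points in world coordinates
    Returns an Nx2 array of pixel coordinates.
    """
    X_h = np.hstack([X_world, np.ones((len(X_world), 1))])      # homogeneous 4-vectors
    extrinsics = np.vstack([np.hstack([R, d.reshape(3, 1)]),
                            [0, 0, 0, 1]])                       # [R d; 0 1]
    P = K @ np.eye(3, 4) @ extrinsics                            # 3x4 projection matrix
    x_h = (P @ X_h.T).T                                          # homogeneous image points
    return x_h[:, :2] / x_h[:, 2:3]                              # divide out the scale

# Example: identity pose, focal length 500, principal point (320, 240)
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
print(project(K, np.eye(3), np.zeros(3), np.array([[0.1, -0.2, 2.0]])))
```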

    2.3. Motion Models

When a 3D point $X$ moves to another point $X'$, its corresponding pixel $x$ moves in the image plane to another pixel $x'$. The simplest model of motion considers the motion of $x$ as a vector $d$, hence giving:

$$x + d = x'$$

But this model cannot capture more complex movements like rotation. Consequently, the common choice [1], for its simplicity as well as its ability to capture relatively complex motions, is actually the following linear model:

$$x' = Hx$$

where again $x$ and $x'$ are in homogeneous coordinates. This simply means that when a rigid object moves in the 3D world, its pixels in the image undergo a homography. This homography is a $3 \times 3$ matrix but, due to the scale ambiguity, it loses one degree of freedom and only has 8 in the general case, called the projective case. Usually, this transformation is restricted to be affine and hence to have 6 degrees of freedom:

$$H = \begin{bmatrix} A & d \\ 0_2^\top & 1 \end{bmatrix}$$

A simpler motion model is a Euclidean transform: a rotation $R$ and a translation $d$. It has only 3 degrees of freedom:

$$H = \begin{bmatrix} R & d \\ 0_2^\top & 1 \end{bmatrix}$$

These different geometrical transformations are illustrated in Appendix A.
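As a small illustration of the affine model $x' = Hx$ (the rotation angle, scale and translation below are made-up values, not from the report), the 6 degrees of freedom are the four entries of $A$ and the two of $d$, and the model acts on pixels in homogeneous coordinates:

```python
import numpy as np

# A made-up affine motion H = [A d; 0 0 1]: a 5 degree rotation, a mild scaling,
# and a translation of (3, -1) pixels.
theta = np.deg2rad(5)
A = 1.02 * np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
d = np.array([3.0, -1.0])
H = np.vstack([np.hstack([A, d[:, None]]), [0.0, 0.0, 1.0]])

x = np.array([120.0, 80.0, 1.0])   # a pixel in homogeneous coordinates
x_prime = H @ x                    # x' = Hx; the last coordinate stays 1 for an affine H
print(x_prime[:2])
```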

    3. Motion Recovery

In this section, the motion is recovered either by treating the picture as a whole (hence giving a dense representation) or by focusing on certain salient features, hence obtaining sparse, but more reliable, data.

    3.1. Dense Motion Recovery

This section presents early works of motion recovery that tried to get the motion of every pixel in the image. The important notion of optical flow is also presented.

    3.1.1 Optical Flow

Optical flow is a fundamental concept used to describe motion from one picture to another. It is a set of vectors whose origin corresponds to certain pixels in the first picture and whose end is the same pixels, but on the second picture. Figure 3 illustrates this concept.

Recovering this optical flow for every pixel in the picture is necessary for motion analysis but is only a first step in motion segmentation.


Figure 3. (a) Barber's pole, (b) motion field, (c) optical flow. The example of the barber's pole is given to make the distinction between two concepts. The motion field (b) describes how a pixel moves in the real world, as projected on the image plane. Optical flow (c), on the contrary, is the apparent motion of brightness patterns in the image. In most real cases, these two concepts coincide.

    3.1.2 Block Correlation

The most intuitive method to recover the motion between two images $I^t$ and $I^{t+1}$ is a block correlation method: it considers a window $W$ in image $I^t$ and tries to find the closest matching window $W'$ in image $I^{t+1}$. It therefore tries to find:

$$\arg\min_{W'} \left\| I^t(W) - I^{t+1}(W') \right\| \qquad (2)$$

Knowing the positions of $W'$ and $W$, the optical flow transforming $W$ into $W'$ is deduced. In Equation (2), several norms can be used with comparable results: $L_1$, $L_2$, normalized cross-correlation, etc.

It is important to notice that this approach is similar to the one used in stereo vision, where two pictures of the same scene, but taken from different view points, are compared block per block in order to recover the depth. Appendix B illustrates the concept of stereo vision. Many techniques presented in this report are used in other computer vision problems, but stereo vision is probably the one most tied to optical flow recovery.
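A brute-force sketch of Equation (2) with a squared $L_2$ (SSD) score is given below; the block size, search radius and the toy frames are illustrative choices of ours:

```python
import numpy as np

def block_match(I_t, I_t1, x, y, block=8, search=7):
    """Brute-force block correlation (Eq. 2) with an SSD score.

    Returns the displacement (dx, dy) whose block in I_t1 best matches the
    block of I_t anchored at (x, y).
    """
    ref = I_t[y:y + block, x:x + block].astype(np.float64)
    best, best_d = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + block > I_t1.shape[0] or xx + block > I_t1.shape[1]:
                continue
            cand = I_t1[yy:yy + block, xx:xx + block].astype(np.float64)
            ssd = np.sum((ref - cand) ** 2)   # squared L2 norm of the residual
            if ssd < best:
                best, best_d = ssd, (dx, dy)
    return best_d

# Example: a bright square shifted by (2, 1) pixels between frames.
I_t = np.zeros((64, 64)); I_t[20:28, 20:28] = 1.0
I_t1 = np.zeros((64, 64)); I_t1[21:29, 22:30] = 1.0
print(block_match(I_t, I_t1, 20, 20))   # expected (2, 1)
```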

    3.1.3 Horn-Schunck Equation

Now, let us have a more local approach. A pixel $x$ in an image $I^t$ has a displacement vector of $dx = (dx, dy)^\top$ when it moves to a consecutive image $I^{t+dt}$. Then the following brightness constancy constraint must hold:

$$\forall x \in I^t, \quad I^{t+dt}(x + dx) = I^t(x) \qquad (3)$$

Figure 4. The aperture problem refers to a situation where the motion can only be recovered in one direction, hence some uncertainty on the optical flow. In the presented example, only the framed part of the scene is seen and it seems that the stripes are moving right while they are also moving up.

Assuming the movement is small enough, we can perform a Taylor expansion of the first term of Equation (3):

$$I^{t+dt}(x + dx) \simeq I^t(x) + \frac{\partial I^t}{\partial x}(x)\,dx + \frac{\partial I^t}{\partial y}(x)\,dy + \frac{\partial I^t}{\partial t}(x)\,dt \qquad (4)$$

Using Equation (3) in Equation (4), we obtain the approximation:

$$\forall x \in I^t, \quad \frac{\partial I^t}{\partial x}(x)\,dx + \frac{\partial I^t}{\partial y}(x)\,dy + \frac{\partial I^t}{\partial t}(x)\,dt \simeq 0$$

which can be condensed as:

$$I_x U + I_y V + I_t = 0 \qquad (5)$$

Equation (5) is known as the Horn-Schunck equation [14]. It is a linear equation that exists for each pixel $x$ and it contains two unknowns: $U(x)$ and $V(x)$. There is consequently an uncertainty in solving the equation. This ambiguity is called the aperture problem and is illustrated in Figure 4.

To completely determine the motion, it is therefore required to make some assumptions about the data. Some hypotheses are made either on the motion field, or on the image itself.

Also, the Horn-Schunck equation breaks down when it is not possible to perform a Taylor expansion, i.e. when there is too much motion between the frames. A high frame rate is therefore important for good optical flow. When the motion between frames is larger, one can downsample the images to a lower resolution.

Smoothness Constraint

A way of finding a reasonable solution to Equation (5) is the Horn-Schunck [14] method, which imposes an additional smoothness constraint on the motion matrices $(U, V)$: we want $\nabla U$ and $\nabla V$ to have a small norm. The equation describing $(U, V)$ now becomes:

$$\arg\min_{U, V} \iint (I_x U + I_y V + I_t)^2 + \alpha\left(\|\nabla U\|^2 + \|\nabla V\|^2\right) dx\, dy \qquad (6)$$


Equation (6) can then be solved using any optimization algorithm (e.g. gradient descent, Levenberg-Marquardt, etc.). This solution can be slow and can get stuck in local minima.

Local Constraint

Another popular approach is the Lucas-Kanade [22] method: it assumes that all the pixels in a window $W$ of the image have a common motion $(u_W, v_W)^\top$. This assumption is often valid between two consecutive frames of a video sequence. For each window $W$, the goal is now to find:

$$\arg\min_{u_W, v_W} \sum_{x \in W} w(x)^2 \left(I_x u_W + I_y v_W + I_t\right)^2 \qquad (7)$$

where $w(x)$ is a weighting function used to emphasize the center of the analyzed window (e.g. a Gaussian kernel). To find the minimum of Equation (7), we impose a null derivative, which leads to the equation:

$$\begin{bmatrix} \sum_{x \in W} w^2 I_x^2 & \sum_{x \in W} w^2 I_x I_y \\ \sum_{x \in W} w^2 I_x I_y & \sum_{x \in W} w^2 I_y^2 \end{bmatrix} \cdot \begin{bmatrix} u_W \\ v_W \end{bmatrix} = - \begin{bmatrix} \sum_{x \in W} w^2 I_x I_t \\ \sum_{x \in W} w^2 I_y I_t \end{bmatrix} \qquad (8)$$

The first matrix is called the second moment matrix. Equation (8) has a reliable solution if this matrix is invertible, i.e. if it has two high eigenvalues. In the "image world", it means that the window contains textured data (ideally a corner) that can disambiguate the previously mentioned aperture problem.
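A minimal numpy sketch of solving Equation (8) for a single window follows; the eigenvalue threshold is an illustrative value of ours, not one from the report:

```python
import numpy as np

def lucas_kanade_window(Ix, Iy, It, w):
    """Solve Equation (8) for one window.

    Ix, Iy, It : spatial and temporal derivative values inside the window
    w          : weights (e.g. a Gaussian kernel), same shape as the window
    Returns (u, v) or None when the second moment matrix is badly conditioned.
    """
    w2 = (w ** 2).ravel()
    ix, iy, it = Ix.ravel(), Iy.ravel(), It.ravel()
    Z = np.array([[np.sum(w2 * ix * ix), np.sum(w2 * ix * iy)],
                  [np.sum(w2 * ix * iy), np.sum(w2 * iy * iy)]])
    e = -np.array([np.sum(w2 * ix * it), np.sum(w2 * iy * it)])
    # A textured window (ideally a corner) has two large eigenvalues;
    # reject windows where the smallest one is too small.
    if np.linalg.eigvalsh(Z)[0] < 1e-3:
        return None
    return np.linalg.solve(Z, e)
```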

    3.1.4 Pyramidal Approach

The methods shown previously suffer from relying on window parameters: the height and width of $W$ need to be defined prior to the execution of the algorithm. They also rely on $I(x, y, t)$ being a continuous function through time to solve Equation (5), which can be broken easily with a poor frame rate or a rapidly moving object.

A way of relaxing this assumption is to apply the previous methods but at different scales [2]. If a pixel moves at 2 pixels/s at a certain scale, it moves at 1 pixel/s on images half the size. Hence, it "seems" to move slower and the method studied previously can be applied again. This brings the need for studying the problem on the same images but at different scales thanks to what is called a scale pyramid (cf. Figure 5).

The optical flow is computed at the highest level using one of the previous approaches, usually the Lucas-Kanade one. Then, for each level $i$:

• Take the flow $(U_{i-1}, V_{i-1})$ from level $i-1$.

• Upsample the flow to create $(U_i^*, V_i^*)$, an optical flow matching the resolution of level $i$. Multiply $(U_i^*, V_i^*)$ by the scale factor used to go from level $i-1$ to level $i$.

Figure 5. Gaussian pyramid. Each level in the pyramid is a subsampled version of the level below convolved with a Gaussian filter (in order to smooth the picture). Each downsampled optical flow is used to compute the one at a higher level. There are also methods performing several passes on the pyramid.

• Compute the transformation of the current picture by the motion $(U_i^*, V_i^*)$.

• Apply Lucas-Kanade between the obtained picture and the final picture to get the correction flow $(U_i', V_i')$.

• Add the corrections $(U_i', V_i')$ to $(U_i^*, V_i^*)$ to obtain the optical flow $(U_i, V_i)$.

The main advantage of such an approach is its ability to find a correct optical flow even in uniform regions.
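The coarse-to-fine loop above can be sketched as follows; `lk_flow` stands in for any dense Lucas-Kanade pass, and the nearest-neighbour upsampling and warping (and the assumption that each level is exactly twice the size of the previous one) are simplifications of ours:

```python
import numpy as np

def pyramidal_flow(frames_coarse_to_fine, lk_flow, scale=2):
    """Coarse-to-fine flow refinement following the steps above.

    frames_coarse_to_fine : list of (I_t, I_t1) pairs, coarsest level first,
                            each level exactly `scale` times larger than the previous
    lk_flow               : function (I_t, I_t1) -> (U, V) flow at that resolution
    """
    U = V = None
    for I_t, I_t1 in frames_coarse_to_fine:
        if U is None:
            U, V = lk_flow(I_t, I_t1)                 # flow at the coarsest level
            continue
        # Upsample the previous flow to this resolution and rescale its magnitude.
        U = np.kron(U, np.ones((scale, scale))) * scale
        V = np.kron(V, np.ones((scale, scale))) * scale
        # Warp the current frame by the upsampled flow (nearest-neighbour warp).
        ys, xs = np.indices(I_t.shape)
        src_y = np.clip((ys - V).round().astype(int), 0, I_t.shape[0] - 1)
        src_x = np.clip((xs - U).round().astype(int), 0, I_t.shape[1] - 1)
        warped = I_t[src_y, src_x]
        dU, dV = lk_flow(warped, I_t1)                # correction flow at this level
        U, V = U + dU, V + dV
    return U, V
```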

On the opposite end, there are methods [12] that have a fine-to-coarse approach. They simply work in two steps: they first fit a motion model to each pixel and they next aggregate pixels with the same motion. They repeat these steps three times by increasing the complexity of the motion model at each iteration (first translation, then affine, then full homography).

These methods try to use the fact that when a moving object is looked at closely, all the points seem to move with the same translation. They unfortunately require a good first guess for the optical flow, hence a picture with good texture.

    3.2. Local Motion Recovery

The different methods mentioned above lead to several minimization problems. But, as we have seen, the solutions found are not always optimal or unique in certain regions because there is not enough information/texture. It is therefore necessary to focus more on certain pixels in the image. Several types of these relevant pixels, called salient features, have been used to recover the motion. We have chosen to present the KLT tracker (for its historical significance) and


Figure 6. Example of the KLT tracker. (a) Picture of the original video sequence. (b) Side view of tracks of features spotted on the moving person. The tracks have different lengths and shapes, but they do belong to the same object.

the SIFT and MSER feature trackers, as they are the most robust feature descriptors to date.

    3.2.1 KLT tracker

The KLT tracker [31] can be seen as an affine version of the Lucas-Kanade algorithm (cf. Section 3.1.3). Its driving principle is to determine the affine motion parameters $(A, d)$ of local windows $W$, from an image $I^t$ to a consecutive image $I^{t+1}$. These parameters are chosen to minimize the following dissimilarity:

$$\epsilon = \int_W \left[ I^{t+1}(Ax + d) - I^t(x) \right]^2 w(x)\, dx \qquad (9)$$

where $w$ is a weight function, usually chosen to be constant or Gaussian. It is commonly assumed that only the translation $d$ matters between two frames, hence leading to the equation:

$$Zd = e \qquad (10)$$

where

$$Z = \int_W g(x)\, g^\top(x)\, w(x)\, dx \qquad e = \int_W \left[ I^t(x) - I^{t+1}(x) \right] g(x)\, w(x)\, dx$$

with

$$g(x) = \begin{bmatrix} \frac{\partial}{\partial x} I^{t+1}(x) \\ \frac{\partial}{\partial y} I^{t+1}(x) \end{bmatrix}$$

Equation (10) will have a reliable solution if the second moment matrix $Z$ is full rank. Practically, the good windows are chosen to be the ones leading to a $Z$ whose minimal eigenvalue is above a threshold. Once this eigenvalue drops below that threshold, the feature track is terminated.
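The window-selection criterion just described can be sketched as follows; the window size and eigenvalue threshold are illustrative values of ours:

```python
import numpy as np

def good_feature_windows(Ix, Iy, win=15, tau=1e-2):
    """Keep windows whose second moment matrix Z has a large minimal eigenvalue.

    Ix, Iy : gradient images of the frame where features are spawned
    Returns a list of (row, col) corners of retained windows.
    """
    kept = []
    h, w = Ix.shape
    for r in range(0, h - win, win):
        for c in range(0, w - win, win):
            gx = Ix[r:r + win, c:c + win].ravel()
            gy = Iy[r:r + win, c:c + win].ravel()
            Z = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                          [np.sum(gx * gy), np.sum(gy * gy)]])
            if np.linalg.eigvalsh(Z)[0] > tau:   # smallest eigenvalue above threshold
                kept.append((r, c))
    return kept
```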

(a) SIFT stands for Scale Invariant Feature Transform. It is a local histogram of the gradient distribution and its use has been widespread due to its partial invariance to scale, affine transform and illumination.

(b) MSER stands for Maximally Stable Extremal Regions. An extremal region is a connected component of pixels which are all brighter (MSER+) or darker (MSER-) than all the pixels on the region boundary. Such a feature, after some normalization [6], can be made invariant to affine transformations.

Figure 7. The SIFT (a) and MSER (b) are affine invariant features.

    3.2.2 Affine Invariant Features

More efficient features than corners have recently been used to analyze motion. Though they are usually used for tracking [29], they can also be used to determine accurately the motion in a video sequence.

We will focus on the SIFT [21] and MSER [23] features because they are the most efficient descriptors and also because they have recently been used together for motion analysis [28]. Figure 7 illustrates these descriptors.

When used in motion segmentation, these features are computed for every frame of the sequence and are then matched between frames in order to determine if a feature propagates from frame to frame. Once these features are tracked, their trajectories are fed to the motion segmentation algorithms we will describe later, starting from Section 4.
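As a sketch of this frame-to-frame matching step (assuming an OpenCV build that provides `cv2.SIFT_create`; the ratio-test value is the usual heuristic, not a number from the report):

```python
import cv2

def match_sift_between_frames(frame_t, frame_t1, ratio=0.75):
    """Match SIFT keypoints between two consecutive (grayscale) frames."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(frame_t, None)
    kp2, des2 = sift.detectAndCompute(frame_t1, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des1, des2, k=2)
    tracks = []
    for m, n in matches:
        if m.distance < ratio * n.distance:          # Lowe's ratio test
            tracks.append((kp1[m.queryIdx].pt, kp2[m.trainIdx].pt))
    return tracks   # list of ((x, y) at t, (x, y) at t+1) correspondences
```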

Nonetheless, these methods are more complex to compute and are often overkill for a conventional task. Usually, the much faster KLT tracker, or even the Lucas-Kanade approach, are used.

    3.2.3 Particles

Finally, some methods [26, 24] have recently combined the quality of optical flow (which is a two-frame process) with feature trackers (which give good results on several frames but only for certain points).

These methods take a more realistic approach to feature tracking by accepting that some features can


Figure 8. Each plot denotes a pair of adjacent frames. The algorithm propagates particles from one frame to the next according to the flow field, excluding particles (blue) that lie within the flow field's occlusion mask. The algorithm then adds links (red curves), optimizes all particle positions, and prunes particles with high error after optimization. Finally, the algorithm inserts new particles (yellow) in gaps between existing particles. (from [26])

appear/disappear, get weaker, or get tracked with some uncertainty. Figure 8 illustrates some rules used to help the features propagate from frame to frame.

With these approaches, the spatio-temporal space is populated with more features, and of a better quality too. It is an active area of research.

    4. Motion Clustering

The previous section showed how a reasonable optical flow could be obtained from a video sequence. In order to segment this motion into different objects, several techniques can be used: generic methods common to other machine learning problems, or algorithms specially tuned to the case of motion segmentation.

    4.1. Clustering Concept

Our goal is to group pixels of a video sequence into clusters of coherent motion. So far, we have seen how to compute the optical flow, i.e. the motion of the pixels in the picture. We still do not know how to retrieve the 3D motion from the optical flow, but is it necessary? Can't we cluster the optical flow as it is?

Let us consider that the camera is orthographic (i.e. there is no perspective effect), that the moving objects are rigid and that they move parallel to the image plane (i.e. we use a translational model). Then, all the pixels belonging to a common object will have the same optical flow. Consequently, any clustering algorithm (K-means [13], mean-shift clustering [19], etc.) can be applied to cluster the motion into coherent groups.

Nonetheless, these methods are very constrained and they perform poorly in the presence of noise. They also require the number of clusters as an input.
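Under the idealized assumptions above, the clustering step is just K-means on the per-pixel flow vectors; a plain numpy sketch (with an externally supplied number of clusters k, as noted) could look like this:

```python
import numpy as np

def kmeans_flow_clusters(U, V, k=2, iters=20, seed=0):
    """Cluster per-pixel flow vectors with plain K-means."""
    rng = np.random.default_rng(seed)
    flow = np.stack([U.ravel(), V.ravel()], axis=1).astype(np.float64)  # one (u, v) per pixel
    centers = flow[rng.choice(len(flow), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(flow[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                             # nearest center
        for j in range(k):
            if np.any(labels == j):
                centers[j] = flow[labels == j].mean(axis=0)   # recompute centers
    return labels.reshape(U.shape)
```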

    4.2. Clustering Common Motions

    4.2.1 Graph-Based Algorithms

A simple clustering of the optical flow cannot work as it does not take any spatial information into consideration. To remedy this problem, an approach is to build a graph from the video data and its optical flow, and to perform a graph cut in order to have the pixels segmented into different motions. For a graph $G = (V, E)$, a partitioning into two sets $A$ and $B$ can be achieved by minimizing the cut criterion:

$$\mathrm{cut}(A, B) = \sum_{u \in A, v \in B} w(u, v)$$

where $w(u, v)$ is the weight between nodes $u$ and $v$. Shi et al. [27] propose to create a graph linking pixels spatio-temporally. The nodes are the pixels themselves, while the weights of the edges increase with color similarity, motion similarity and spatial proximity. They also use a different cut criterion named normalized cut and defined as follows:

$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{asso}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{asso}(B, V)}$$

where $\mathrm{asso}(A, V) = \sum_{u \in A, t \in V} w(u, t)$ is the total connection from nodes in $A$ to all nodes in the graph. Kleinberg & Tardos demonstrate that such labelling problems correspond to finding the maximum a posteriori labelling of a class of Markov Random Fields [20]. Some other approaches [34, 35] directly use this MRF formulation to assign motion labels to pixels.

The problem is known to be NP-complete, and the best one can hope for in polynomial time is an approximation. While popular techniques, like loopy belief propagation, can be slow, Boykov, Veksler and Zabih [3] have recently developed a polynomial time algorithm that finds a solution with error at most two times that of the optimal solution.

This approach works well for simple motions, like translations, but fails otherwise or with objects with many colors.

    4.2.2 RANSAC

Finally, we describe a method that can be used alone or as a first step before further motion clustering, as we will see in further sections: RANSAC [10] (for RANdom SAmple Consensus). The key idea of this algorithm is to consider that among all the data to analyze, there are inliers, that follow a certain model, and outliers, that don't. In the case of motion segmentation [37], RANSAC takes random sets of points, computes their average motion and determines how many other points comply with this motion within a certain tolerance. The quality of a model is then determined by the number of inliers as well as by the error obtained with these inliers.


The main advantages of this method are its speed and the fact that it can be used with different motion models (Euclidean, affine, projective).
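For the simplest (translational) motion model, a RANSAC round reduces to the sketch below; the iteration count and pixel tolerance are illustrative values of ours:

```python
import numpy as np

def ransac_translation(p_t, p_t1, iters=200, tol=2.0, seed=0):
    """RANSAC with a translational motion model.

    p_t, p_t1 : Nx2 arrays of matched point positions at t and t+1
    Returns (best translation d, boolean inlier mask).
    """
    rng = np.random.default_rng(seed)
    best_d, best_inliers = None, np.zeros(len(p_t), dtype=bool)
    for _ in range(iters):
        i = rng.integers(len(p_t))                 # minimal sample: one correspondence
        d = p_t1[i] - p_t[i]                       # hypothesized translation
        err = np.linalg.norm((p_t + d) - p_t1, axis=1)
        inliers = err < tol
        if inliers.sum() > best_inliers.sum():     # keep the hypothesis with most inliers
            best_d, best_inliers = d, inliers
    if best_inliers.any():                         # refit on the inliers of the best model
        best_d = (p_t1[best_inliers] - p_t[best_inliers]).mean(axis=0)
    return best_d, best_inliers
```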

    4.3. Layer-Based Techniques

In order to increase the stability of the previous algorithms, and also to have satisfactory results in cases where the optical flow is not smooth (e.g. occlusion of an object by another), more complex techniques have relaxed the assumptions by assuming that the objects are somewhat planar and can be assimilated to layers in the picture. These techniques can also deal with transparent objects.

    4.3.1 Basic Concept

In a basic layer-based approach [33, 9], it is assumed that the current picture $I^t$ of a video sequence can be explained by the superposition of $n$ different layers $L_i$. The goal is then to recover the pixels that form the layer $L_i^t$ at each instant $t$, as well as their corresponding motion $H_i^t$, where $H_i^t$ is a homography matrix as specified in Section 2.3.

The optical flow is assumed to be computed from one of the methods shown in the previous sections. Therefore, each pixel $x$ of a layer $L_i$ needs to verify:

$$x + \begin{bmatrix} U^t(x) \\ V^t(x) \\ 1 \end{bmatrix} = H_i^t x \qquad (11)$$

If the composition of a layer $L_i$ were known, its motion could be computed with a least squares minimization on all the equations (11) obtained for each of its pixels:

$$H_i^t = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} + \sum_{x \in L_i} \begin{bmatrix} U^t(x) \\ V^t(x) \\ 1 \end{bmatrix} x^\top \left( \sum_{x \in L_i} x x^\top \right)^{-1} \qquad (12)$$

In order to determine the composition of each layer, each image $I^t$ is first overclustered into random patches $P_i$ of contiguous pixels (e.g., into square patches). The motion of each patch is then determined using Equation (12) and all the different motions are then partitioned, using a simple clustering algorithm like K-means (assuming the number of objects in the scene is known). These recovered dominant motions are the ones of the layers.

Each pixel $x$ is finally assigned to the layer $L_i$ whose motion best matches its own.
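A per-patch implementation of Equation (12) is a one-liner in numpy; the sketch below keeps the report's convention for the flow vector and simply notes how the result would be used:

```python
import numpy as np

def patch_motion(xs, ys, U, V):
    """Least-squares motion of one patch, following Equation (12).

    xs, ys : pixel coordinates of the patch
    U, V   : flow values at those pixels
    Returns the 3x3 motion matrix H of the patch.
    """
    X = np.stack([xs, ys, np.ones_like(xs)], axis=0).astype(np.float64)   # 3xN homogeneous pixels
    F = np.stack([U, V, np.ones_like(U)], axis=0).astype(np.float64)      # 3xN flow terms as in Eq. (11)
    return np.eye(3) + (F @ X.T) @ np.linalg.inv(X @ X.T)

# Each patch's H can then be clustered (e.g. with K-means on its entries) to find
# the dominant layer motions, and each pixel assigned to the closest layer.
```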

    4.3.2 Complex Layers/Sprites

The previous section showed an example of a layer-based approach. Nonetheless, this method does not deal well with an unknown number of objects, transparent objects, non-rigid objects and the evolution of the shape of a layer. The most advanced work [18] to solve these issues proceeds as follows.

In the case of a unique layer $L$ undergoing a transformation $T$, each frame $I^t$ of a video sequence can be explained by the following compositing equation:

$$I^t = T I_M^t \times T I_S^t + T \bar{I}_M^t \times I_B^t + \text{noise}$$

    where:

• $\times$ is the per-element multiplication operator (.* in MATLAB):

$$I = J \times K \iff \forall x, \; I(x) = J(x) K(x)$$

• $T I$ describes picture $I$ after transformation $T$.

• $I_M^t$ is the mask of layer $L$. It has values between 0 and 1 in order to model the transparency (1 being totally opaque and 0 totally transparent).

• $I_S^t$ is the image describing layer $L$.

• $\bar{I}_M^t = 1 - I_M^t$

• $I_B^t$ is the background image.

    For L layers, layerL being the closest from the cameraand layer1 being the background, the compositing equationsimply becomes:

    It =L∑

    l=1

    ((L∏

    i=l+1

    TiItMi

    )∗ TlItMl ∗ TlI

    tSl

    )+ noise

    Figure9 illustrates this complex compositing formula.The unknownItMl andI

    tSl

    are then assumed to be Gaus-sian distributed and inferred in a complex EM procedurewhose details are beyond the scope of this paper.

    4.3.3 Large Intraframe motion

One of the main advantages of a layer-based approach is its ability to be used for sequences with a high intra-frame motion or a low frame rate. An example is given in [35] and we detail this approach below.

The different steps of [35] are similar to the previous approaches except that the algorithms used are more robust, hence able to handle more complex cases.

First, many corner features are computed using the Forstner operator [11] in both of the images $I$ and $I'$ between which we want to compute the motion. Then, these feature vectors are matched using the $L_1$ distance: the pairs of matching points then define a motion from image $I$ to image $I'$. This step is therefore comparable to the optical flow step in the previous layer-based methods.


Figure 9. An image is formed by composing multiple layers of translated sprite appearance maps using translated sprite masks. (from [18])

Next, the principal motions in the picture are computed using RANSAC on the different recovered motions. Consequently, the computed features are assigned to a unique layer $L_i$ whose motion $H_i$ is now known.

So far, this method is comparable to [32] but now a dense assignment assigns each pixel $x$ from images $I$ and $I'$ to a layer $L_{l(x)}$, where $l : I \to \{1, \ldots, k\}$ is the mapping function. These identities are given by minimizing the following criterion:

$$\sum_x \left[ I(x) - I'(H_{l(x)} x) \right]^2 + \lambda \sum_x \sum_{y \in \mathcal{N}(x)} s_{xy} \left[ 1 - \delta_{l(x) l(y)} \right] \qquad (13)$$

In Equation (13), known as a Generalized Potts model, the first term simply ensures that the intensity of a pixel is preserved if it stays on the same layer. The second one, weighted by a constant $\lambda$, ensures the smoothness of the assignment: similar pixels (with a similarity defined by $s_{xy}$) should be assigned to the same layer (hence the Kronecker delta function $\delta_{l(x) l(y)} = 1$ if $l(x) = l(y)$, 0 otherwise). Here, the similarity is simply based on the distance between the pixels and their proximity in intensity:

$$s_{xy} = \exp\left[ -\frac{\|x - y\|^2}{\mathrm{constant}_1} - \frac{(I(x) - I(y))^2}{\mathrm{constant}_2} \right]$$

4.4. Full 3D Motion Clustering

Finally, the quality of the motion determination can be improved by imposing some constraints on the observed objects. These can either describe the kind of motion they have (e.g. rigid, articulated, etc.) or the objects themselves (e.g. they are human beings, etc.). Of course, the looser the constraints are, the better. These methods usually rely on a very good feature tracker.

    4.4.1 Unique Rigid Body

One of the first studied cases, which is now assumed to be solved, is the case of a rigid object, which was first developed by [30] for the case of a single static object viewed by a moving camera. Here we will reformulate the method in such a way that a static camera observes a scene with a moving object. Also, whereas the translation component of motion is first eliminated in the Tomasi-Kanade formulation, we will retain that component in our formulation in order to draw similarities with the multi-body case.

Let us consider a set of $N$ 3D points $X_i$ belonging to a rigid object that undergoes a rigid motion $(R_t, d_t)$ between time 0 and time $t$, in a video sequence of $T$ frames. Then, at an instant $t$, the position of each $X_i$ in the camera frame is:

$$\begin{bmatrix} R_t & d_t \\ 0_3^\top & 1 \end{bmatrix} \begin{bmatrix} X_i \\ 1 \end{bmatrix}$$

Therefore, if we stack all these equations for each instant in time and each feature, we get:

$$\underbrace{\begin{bmatrix} x_{11} & \cdots & x_{1N} \\ y_{11} & \cdots & y_{1N} \\ \vdots & & \vdots \\ x_{T1} & \cdots & x_{TN} \\ y_{T1} & \cdots & y_{TN} \end{bmatrix}}_{W} = \underbrace{\begin{bmatrix} i_1^\top & d_{x1} \\ j_1^\top & d_{y1} \\ \vdots & \vdots \\ i_T^\top & d_{xT} \\ j_T^\top & d_{yT} \end{bmatrix}}_{M} \underbrace{\begin{bmatrix} X_1 & \cdots & X_N \end{bmatrix}}_{S}$$

$$W = MS \qquad (14)$$

where $(x_{ti}, y_{ti})$ are the feature image positions, vectors $i_t = [i_{xt}, i_{yt}, i_{zt}]^\top$ and $j_t = [j_{xt}, j_{yt}, j_{zt}]^\top$ are the first two rows of the rotation matrix at instant $t$, and $(d_{xt}, d_{yt})$ are the $X$ and $Y$ coordinates of the position of the object's coordinate frame, in the camera frame, at the same instant.

What is remarkable about Equation (14) is its ability to explain the position of all the features at every instant in time (measurement matrix $W$) by the multiplication of two rank 4 matrices: the motion matrix $M$ and the shape matrix $S$. In the original formulation [30], as the translational component is first removed, these two matrices have rank at most 3.

Therefore, when given a video sequence (i.e. when given a matrix $W$), recovering the shape and motion is now equivalent to decomposing $W$ into $M$ and $S$. To do so, the Singular Value Decomposition of $W$ is computed:

$$W = U \Sigma V^\top$$

where $\Sigma$ is only a $4 \times 4$ diagonal matrix (as $W$ has to be of rank 4). By setting $M = U \Sigma^{\frac{1}{2}} A$ and $S = A^{-1} \Sigma^{\frac{1}{2}} V^\top$, the $4 \times 4$ matrix $A$ is now the only unknown. Finally, $A$ can be recovered by using two motion constraints:

• the first three elements of each row of $M$ form a unit norm vector and the first and second sets of such sub-rows are pairwise orthogonal.

• in orthography, the projection of the 3D centroid of object features into the image plane is the centroid of the feature points. Applying this property leads to the recovery of the right side of $M$.

The recovery only requires a least squares minimization and we refer the reader to [30] for more details.
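The SVD step of this factorization is easy to sketch in numpy; the metric-upgrade matrix $A$, which enforces the two constraints above, is deliberately left out, so this is only the first half of the method:

```python
import numpy as np

def factorize_measurements(W, rank=4):
    """Rank-constrained factorization of the measurement matrix W ≈ M S (Eq. 14).

    M and S are returned up to the unknown 4x4 matrix A (and its inverse).
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank, :]       # keep the top `rank` components
    sqrt_s = np.diag(np.sqrt(s))
    M = U @ sqrt_s                                       # motion matrix (up to A)
    S = sqrt_s @ Vt                                      # shape matrix (up to A^{-1})
    return M, S
```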


Figure 10. Reordering process of the shape interaction matrix. (from [8])

    4.4.2 Multiple Rigid Bodies

The method shown previously generalizes easily to several bodies [8, 39] thanks to the shape interaction matrix:

$$Q = V V^\top$$

    It can easily be shown that:

• $Q$ is uniquely computable from $W$, without needing to know the segmentation $(M, S)$.

• permuting columns of $W$ does not change the set of values $\{Q_{ij}\}$.

• the values $\{Q_{ij}\}$ are invariant to motion.

• each element of $Q$ provides important information about whether a pair of features belongs to the same object: if $Q_{ij}$ is non zero, the features $i$ and $j$ belong to the same object; otherwise, they do not.

Therefore, in a multi rigid body situation, the shape interaction matrix is computed, and its elements are then grouped together as shown in Figure 10.
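Computing $Q$ from $W$ takes a few lines; the thresholding of $|Q_{ij}|$ below is an illustrative stand-in for the reordering/grouping step of Figure 10:

```python
import numpy as np

def shape_interaction_matrix(W, rank=4):
    """Compute Q = V V^T from the measurement matrix W (multibody case).

    In the noise-free case, Q_ij is (near) zero exactly when features i and j
    belong to different rigid objects, so thresholding |Q| and reordering its
    rows/columns groups the features.
    """
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    V = Vt[:rank, :].T                    # N x rank right singular vectors
    Q = V @ V.T
    same_object = np.abs(Q) > 1e-6        # candidate "same object" pairs
    return Q, same_object
```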

The single and multi-body cases are solved algebraically but are almost impossible to use with real cases as they are very sensitive to noise. They also assume that all the features have the same life span, which does not happen in practice.

    4.4.3 Articulated Body

Finally, the latest works in motion segmentation [38, 15] try to analyze the more realistic case of articulated bodies.

They usually rely on a good segmentation of the features into rigid parts. Let us assume we have two limbs with recovered motions:

$$\begin{bmatrix} i_1^\top & d_{x1} \\ j_1^\top & d_{y1} \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} i_2^\top & d_{x2} \\ j_2^\top & d_{y2} \end{bmatrix} \qquad (15)$$

By comparing these two matrices, the motion of the two limbs can be correlated:

• if the link between the limbs is a joint, the two matrices of Equation (15) must have $d_1 = d_2$ under the same coordinate system. So, the motions of the two limbs lie in different linear subspaces but have a 1-dimensional intersection.

• if the link is an axis, the two matrices of Equation (15) must have $d_1 = d_2$ and exactly one rotation column the same under a proper coordinate system. Here, the motions of the two limbs lie in different linear subspaces but have a 2-dimensional intersection.

At the very end of the spectrum, some complete models (weak [24] or strong [25]) can also be fit to the data, but these methods usually require a significant manual input and can also be considered as tracking and not motion segmentation anymore.

Some works [36, 5, 4] also try to solve for the non-rigid body case, but are based on a certain number of key shapes that are assumed to be rigid.

    5. Applications

In our last section, we will show some representative applications of motion segmentation. While a direct application of motion segmentation is "tracking", motion segmentation is actually overkill for this task: as mentioned earlier, tracking only requires following an object from one frame to the next without really focusing on its position or motion. That is why we decided to show applications that use motion segmentation at its full potential.

    5.1. Structure From Motion

Motion segmentation can simply be used to display the motion and the shape of the objects that have been recovered. This is known in the movie industry as motion tracking and it can be used for special effects, like matte compositing or avatar creation (like the character Jar-Jar in "Star Wars" or Gollum in "The Lord of the Rings", cf. Figure 11).

Figure 11. Motion tracking for 3D avatar creation. (from "The Lord of the Rings")

Also, the knowledge of motion between frames can be useful to infer some structure about the moving objects or even about the background. For example, information from motion segmentation can be used for video object deletion


as shown in Figure 1. For this figure, it is also important to notice that no additional frames beyond the three shown were used as input.

Finally, once a video sequence is segmented into moving objects, it becomes easy to search for the different appearances of an object: this process is named "video google" [28]. Let us imagine a whole movie has been segmented into moving objects. Then, when selecting an object in one frame, all its instances are retrieved in the sequence in which it is present, hence creating many other instances of the object to be matched in the whole movie. Next, all these possible appearances of the object are matched with all the different segmented objects of the entire movie (cf. Figure 12).

Figure 12. Video google. Top row: the query frame with query region (side of the van) selected by the user. Second row: the automatically associated keyframes and outlined query regions. Next four rows: example frames retrieved from the entire movie "Groundhog Day" by the object-level query. Note that views of the van from the back and front are retrieved. This is not possible with wide-baseline matching methods alone using only the side of the van visible in the query image. (from [28])

    5.2. Video Processing

Motion information is heavily used in modern video compression algorithms like DivX or the MPEG-4 Part 10 standard (also known as H.264). When knowing which objects are moving in the scene, these objects can be encoded once and for all in sprites and then only the transformation matrix from one frame to the next needs to be stored in the compressed file (cf. Figure 13). A trivial case of memory gain appears in a static scene where the background is stored as one picture and all the motion in time is just null.

Also, the motions of the objects contain information that can be added to the video file, like in the multimedia content

Figure 13. Top row: two consecutive images of the video sequence. Bottom row: corresponding optical flow. The optical flow shows the relevant information (the non-null vectors) and therefore the only one that should be encoded. (picture from the book "H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia" by Iain E. G. Richardson)

description standard MPEG-7. This standard uses XML to store metadata, and can be attached to timecode in order to tag particular events, or synchronise events to motion, for example.

Finally, the motion of a video sequence can be of great help when trying to stabilize the picture. The principal motion of the picture is known and can be assumed to be the one of the background. Then, by smoothing this motion through time, all the other motions of the video can be aligned on this smoothed main motion, hence giving an overall non-jittery sequence (cf. Figure 14).
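A very reduced sketch of this smoothing step, considering only the translational part of the dominant motion (the window length is an arbitrary choice of ours):

```python
import numpy as np

def smooth_background_motion(translations, window=15):
    """Smooth the per-frame background translation with a moving average and
    return the per-frame correction to apply to de-jitter the sequence.

    translations : Tx2 array, estimated background translation of each frame
    """
    kernel = np.ones(window) / window
    smoothed = np.column_stack([
        np.convolve(translations[:, 0], kernel, mode="same"),
        np.convolve(translations[:, 1], kernel, mode="same"),
    ])
    return smoothed - translations   # shift to apply to each frame
```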

Figure 14. Top row: jittery camera motion. Bottom row: smoothed video sequence (product "Stable Eyes" from the company http://ovation.co.uk)


6. Discussion

By analyzing the different steps of motion segmentation, we have been able to cast light on the different problems arising as well as on some possible solutions.

First of all, motion has to be determined. If this process is performed densely, it is fast but unreliable for low-textured regions or on motion boundaries. Therefore additional constraints, e.g. smoothness, or more efficient approaches, e.g. a scale pyramid, are needed to improve the quality of the results. Nonetheless, this often turns out to be insufficient and the information has to be made more reliable by analyzing some salient features. The computation of these features can lead to a usable optical flow, with a quality depending on the complexity of the feature type.

Once the visual motion of the pixels in the video sequence has been determined, it has to be translated into object motion. If the objects are supposed to be somewhat planar, techniques can solve for the motion grouping efficiently, even in a dense manner. Unfortunately, objects seldom behave like planes in the 3D world. Therefore, reconstruction of the motion, which happens at the same time as shape reconstruction, can be computed, but only by relying on constantly visible features whose motion is free of noise, thus making the process hardly usable in practice. Finally, the real case of a non-rigid object, even a simple or articulated one, remains unsolved.

    7. Conclusion

Throughout this report we have listed different techniques for finding the different motions in a video sequence. Even though motion can often not be retrieved fully for every pixel in the images, it can be deduced fairly well for some salient features. Once this information is known to be reliable, further analysis leads to segmentation of the objects present in the scene by making assumptions ranging from none to complex object structure hypotheses.

We have also been able to make the link between motion segmentation and other computer vision applications, like stereo vision, by revealing shared techniques as well as common goals.

Finally, the presentation of all these different techniques was a good opportunity for showing the evolution of motion segmentation over the past two decades and discussing its current state as well as its possible future developments.

    A. Geometrical Transformations

Figure 15. Shapes which are equivalent to a cube for the different geometric ambiguities.

The following table describes, for each 2D transformation, the corresponding matrix, the number of degrees of freedom and the number of points required to fully recover the transformation.

Transformation | Matrix | d.o.f. | # points to recover the matrix
Projective | $\begin{bmatrix} A & d \\ p^\top & 1 \end{bmatrix}$ | 8 | 4
Affine | $\begin{bmatrix} A & d \\ 0_2^\top & 1 \end{bmatrix}$ | 6 | 3
Metric | $\begin{bmatrix} sR & d \\ 0_2^\top & 1 \end{bmatrix}$ | 4 | 2
Euclidean | $\begin{bmatrix} R & d \\ 0_2^\top & 1 \end{bmatrix}$ | 3 | 2

where:

• $A$ is an affine matrix

• $p$ and $d$ are vectors

• $s$ is a scale factor

• $R$ is a rotation matrix


B. Stereo Vision

Figure 16. The different steps of stereo vision: (a) original pair, (b) rectified pair, (c) disparity map. (from University of North Carolina, M. Pollefeys' class)

Stereo vision takes multiple images of a scene (Figure 16(a)), and after what is called epipolar rectification, outputs two comparable images of the same scene, cf. Figure 16(b). The goal is then to match each pixel of the left picture to its corresponding one in the right picture. The pixels closest to the camera will have a greater disparity than the ones in the back: in Figure 16(c), the lighter the pixels, the greater the disparity. Therefore, by knowing the disparity between the pictures, the depth of the scene can be deduced.

    References

[1] J. Bergen, P. Burt, R. Hingorani, and S. Peleg. Computing two motions from three frames. In Proc. 3rd Int'l. Conf. Computer Vision, pages 27–32, 1990.

[2] M. Black. Robust incremental optical flow. 1992.

[3] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. In ICCV (1), pages 377–384, 1999.

[4] M. Brand. Morphable 3D models from video. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, pages II: 456–463, 2001.

[5] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3D shape from image streams. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, pages II: 690–696, 2000.

[6] O. Chum, J. Matas, and Š. Obdržálek. Epipolar geometry from three correspondences. In O. Drbohlav, editor, Computer Vision — CVWW'03: Proceedings of the 8th Computer Vision Winter Workshop, pages 83–88, Prague, Czech Republic, February 2003. Czech Pattern Recognition Society.

[7] D. Comaniciu, V. Ramesh, and P. Meer. Kernel-based object tracking. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(5):564–577, May 2003.

[8] J. Costeira and T. Kanade. A multibody factorization method for independently moving objects. Int'l. Journal of Computer Vision, 29(3):159–179, September 1998.

[9] T. Darrell and A. Pentland. Robust estimation of a multi-layered motion representation. pages 173–178, 1991.

[10] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. In M. A. Fischler and O. Firschein, editors, Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, pages 726–740. Kaufmann, Los Altos, CA, 1981.

[11] W. Forstner and E. Gulch. A fast operator for detection and precise location of distinct points, corners and centres of circular features. pages 281–305, 1987.

[12] M. Galun, A. Apartsin, and R. Basri. Multiscale segmentation by combining motion and intensity cues. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, pages I: 256–263, 2005.

[13] R. M. Haralick and L. G. Shapiro. Computer and Robot Vision, Volume II. Addison-Wesley, 1993.

[14] B. Horn and B. Schunck. Determining optical flow. pages 144–156, 1981.


[15] M. Irani and P. Anandan. All about direct methods. In International Workshop on Vision Algorithms, pages 267–277, 1999.

[16] M. Isard and A. Blake. Icondensation: Unifying low-level and high-level tracking in a stochastic framework. In Proc. 5th Europ. Conf. Comput. Vision, page I: 893, 1998.

[17] M. Isard and J. MacCormick. Bramble: A Bayesian multiple-blob tracker. In Proc. 8th Int'l. Conf. Computer Vision, pages II: 34–41, 2001.

[18] N. Jojic and B. Frey. Learning flexible sprites in video layers. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, pages I: 199–206, 2001.

[19] Y. Ke, R. Sukthankar, and M. Hebert. Efficient temporal mean shift for activity recognition in video. In NIPS Workshop on Activity Recognition and Discovery, 2005.

[20] J. M. Kleinberg and E. Tardos. Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields. In IEEE Symposium on Foundations of Computer Science, pages 14–23, 1999.

[21] D. Lowe. Distinctive image features from scale-invariant keypoints. Int'l. Journal of Computer Vision, 60(2):91–110, November 2004.

[22] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In IJCAI81, pages 674–679, 1981.

[23] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In BMVC, 2002.

[24] V. Rabaud and S. Belongie. Counting crowded moving objects. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, 2006.

[25] D. Ramanan and D. Forsyth. Finding and tracking people from the bottom up. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, pages II: 467–474, 2003.

[26] P. Sand and S. Teller. Particle video: Long-range motion estimation using point trajectories. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, 2006.

[27] J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. In Proc. 6th Int'l. Conf. Computer Vision, pages 1154–1160, 1998.

[28] J. Sivic, F. Schaffalitzky, and A. Zisserman. Object level grouping for video shots. In Proc. 8th Europ. Conf. Comput. Vision, pages Vol II: 85–98, 2004.

[29] F. Tang and H. Tao. Object tracking with dynamic feature graphs. VS-PETS, October 2005.

[30] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: A factorization method. Int'l. Journal of Computer Vision, 9(2):137–154, November 1992.

[31] C. Tomasi and J. Shi. Good features to track. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, pages 593–600, 1994.

[32] P. Torr. Geometric motion segmentation and model selection, 1998.

[33] J. Wang and E. Adelson. Layered representation for motion analysis. In Vismod, 1993.

[34] Y. Weiss. Smoothness in layers: Motion segmentation using nonparametric mixture estimation. pages 520–526, 1997.

[35] J. Wills, S. Agarwal, and S. Belongie. What went where. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, pages I: 37–44, 2003.

[36] J. Xiao, J. Chai, and T. Kanade. A closed-form solution to nonrigid shape and motion recovery. In Proc. 8th Europ. Conf. Comput. Vision, 2004.

[37] J. Yan and M. Pollefeys. Articulated motion segmentation using RANSAC with priors. In ICCV Workshop on Dynamical Vision, 2005.

[38] J. Yan and M. Pollefeys. Automatic kinematic chain building from feature trajectories of articulated objects. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, 2006.

[39] L. Zelnik-Manor and M. Irani. Degeneracies, dependencies and their implications in multi-body and multi-sequence factorizations. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, pages II: 287–293, 2003.
