
Motion Characterization of a Dynamic Scene

Arun Balajee Vasudevan1, Srikanth Muralidharan1, Shiva Pratheek Chintapalli1 and Shanmuganathan Raman2

1Electrical Engineering, Indian Institute of Technology Jodhpur, Rajasthan, India
2Electrical Engineering, Indian Institute of Technology Gandhinagar, Gujarat, India

{arunbalajeev, srikanth, shiva}@iitj.ac.in, [email protected]

Keywords: Image and Video Analysis, Scene Understanding, Segmentation and Grouping

Abstract: Given a video, there are many algorithms to separate static and dynamic objects present in the scene. The proposed work is focused on classifying the dynamic objects further as having either repetitive or non-repetitive motion. In this work, we propose a novel approach to achieve this challenging task by processing the optical flow fields corresponding to the video frames of a dynamic natural scene. We design an unsupervised learning algorithm which uses functions of the flow vectors to design the feature vector. The proposed algorithm is shown to be effective in classifying a scene into static, repetitive, and non-repetitive regions. The proposed approach finds significance in various vision and computational photography tasks such as video editing, video synopsis, and motion magnification.

1 Introduction

Computer vision involves estimation of scene information, which human vision can perceive very easily, from images and videos using efficient algorithms. One of the challenging problems in computer vision is the identification of the type of motion a given object exhibits in a natural scene. Analysing the motions of different objects in a scene might be a trivial task for a human being. However, it is extremely complex for a computer. This complexity is due to the differences in the appearance of the objects and the different types of motion each object may undergo at a given time. Given a digital image/video, computer vision researchers strive to perform high level vision tasks such as recognition and segmentation.

However, the digital video captured is just a 2D projection of the 3D scene being captured (Peterson, 2010). A set of consecutive video frames provides the necessary information for the segmentation of a scene depending on the types of motion present. A scene may have static, repetitive, and non-repetitive motion regions. Algorithms based on optical flow yield flow fields that form the basis for designing a feature vector for each pixel location. For a general natural scene, the displacement flow vectors of objects could exhibit a wide range of variations. Hence, segmenting such scenes is a major challenge. Segmentation of different motion regions in a dynamic scene has various applications such as removal of occlusion, scene categorization and understanding, video editing, video synopsis, and motion magnification, to name a few.

Figure 1: (a) Video of a dynamic scene, (b) the frames of a video corresponding to a dynamic scene, (c), (d), and (e) illustrate the repetitive, static, and non-repetitive regions of the scene, and (f) output segmentation from the proposed algorithm.

Sampled video frames of a scene having a rotating wheel are shown in Fig. 1(a). Some of the frames extracted from the video are shown in Fig. 1(b). The extracted frames indicate the presence of a repetitive object in the scene (wheel) and also show the presence of non-repetitive motion in an object (hand). The scene also has static regions. Fig. 1(c), 1(d), and 1(e) depict the various types of motion regions present in this scene. The presence of non-repetitive motion (like the one shown in Fig. 1(e)) is common in real world scenes. Our approach aims at classifying these different types of motion automatically given a video corresponding to any natural scene.

We address this problem by designing a novel segmentation algorithm. We design robust feature vectors to separate repetitive, non-repetitive, and static regions in a natural scene. This classification enables us to categorise these regions and use them to solve other computer vision problems. Our approach can also find application in video compression. Thus, the proposed approach has a diverse set of applications in vision, computational photography, and video processing. For instance, redundant recording of repetitive motion in a scene, such as a rotating fan, a flowing river, or periodic sea waves, can be avoided by proper segmentation of the repetitive part of the scene from the video. This is one of the fruitful applications of the proposed three-label segmentation.

The primary contributions of the proposed work are:

1. Design of a novel feature descriptor to classify the static, repetitive, and non-repetitive motion regions in a scene.

2. Design of an unsupervised learning framework for bottom-up segmentation of these regions.

3. The feature vectors are modelled as functions of the contents of a space-time volume with finite time support for efficient performance.

4. The approach does not depend on object surface properties such as reflectance, texture, color, etc., and predicts the type of motion an object undergoes even for objects with different appearances.

In Section 2, a brief description of the related research work is provided. Section 3 contains a description of the design of the feature descriptor for unsupervised learning. In Section 4, the results of applying the proposed feature vector on several dynamic scene datasets are described. Section 5 presents the challenges facing the proposed algorithm. Section 6 provides directions regarding future work, and Section 7 concludes the present work.

2 Related Work

Lucas-Kanade (Lucas and Kanade, 1981) and Horn-Schunck (Horn and Schunck, 1981) are the standard optical flow algorithms that mostly act as building blocks to separate static and dynamic objects in a scene. These algorithms operate efficiently under the assumption that the objects undergo small displacements in successive frames. However, even if this assumption is satisfied, the algorithms may not give good results for scenes which violate the brightness constancy assumption and scenes which have transparency, depth discontinuities, and specular reflections. Some of these shortcomings have been overcome in the robust estimation approach of (Black and Anandan, 1996). In order to get more accurate results in situations involving large object displacements, a more recent work on optical flow uses a hierarchical region based matching (Brox and Malik, 2011).

The flow fields obtained from an optical flow algorithm serve as the basic ingredients in recent algorithms developed in the domains of video processing and computational photography. To extend its usefulness to situations involving large displacements, the Large Displacement Optical Flow (LDOF) algorithm was developed (Brox and Malik, 2011). This algorithm also uses a feature descriptor. An additional constraint in the energy optimization equation, along with the other constraints, is used to establish the matching criteria in LDOF.

There are motion based segmentation algorithms that aid in background subtraction (Stauffer and Grimson, 1999). These approaches create a Gaussian mixture model for each pixel and segment static and dynamic pixels using a threshold. The existing algorithms in video synopsis rely on the extraction of dynamic objects in the scene (Pritch et al., 2008). They extract these interesting objects from the video and store them along with their timing information. The video is then condensed so that only the interesting objects are shown with their timing information.

Figure 2: The Proposed Approach

The recent work on editing small movements using motion analysis of image pyramids involves the study of phase variations of motion dependent parameters. The phase variations are processed temporally to remove or enhance minute changes over time. The processing does not involve optical flow estimation and is therefore suitable for processing videos of scenes where optical flow based approaches may fail (Wadhwa et al., 2013).

One of the recent works in the domain of moving object segmentation is given by Van den Bergh and Van Gool (Van den Bergh and Van Gool, 2012). Their paper presents a novel approach of combining color, depth, and motion information (optical flow) for segmenting the object using superpixels. This approach takes into consideration the 3D position of the object and its direction of motion while segmenting the object. Another notable work on motion segmentation is given by Ochs and Brox (Ochs and Brox, 2012). Their paper presents an efficient approach of separating moving objects from a video using spectral clustering.

3 The Proposed Approach

We use two of our own datasets of dynamic scenes, each containing a set of seven images captured in burst mode with a DSLR camera. Additionally, we use the YUPenn dataset of dynamic scenes consisting of 14 different scenes (Derpanis and Wildes, 2012). We sampled a set of 5-7 images from each dynamic scene video present in the YUPenn dataset. In some scenes, we come across objects that exhibit a large displacement, for which standard algorithms such as Lucas-Kanade usually fail. Therefore, we use the velocity vectors obtained from the LDOF algorithm in these cases. For motions with a relatively small displacement compared to the object size, the Lucas-Kanade algorithm is sufficient to achieve good results. The choice of flow vectors is made manually depending on the type of motion: small or large displacement movements. The optical flow algorithm applied on every two consecutive images gives a 2D vector field consisting of optical flow vectors.

Since the optical flow of a frame is computed with reference to another frame, we consider the first image as the reference image. We use the optical flow vectors for the segmentation of the reference image into the different motion regions corresponding to the natural scene. We use the divergence of the flow vector and the gradient of the magnitude of the flow vector at every point in the image to construct the feature vector. The divergence of a vector field is the rate at which flux exits a given region of space. The gradient represents the magnitude and direction of maximum variation of the scalar field (image).

Let

Pi(x,y) = (Vx, Vy, ∇·V, ∇x|V|, ∇y|V|)    (1)

Pi(x,y) = (p1, p2, p3, p4, p5)    (2)

where Pi(x,y) is the feature vector at a given pixel location (x,y) for the ith frame, V = (Vx, Vy) is the optical flow vector at (x,y), ∇·V is its divergence, and ∇x|V| and ∇y|V| are the components of the gradient of the flow magnitude.

Here, p1 and p2 are the optical flow components at (x,y). After applying the vector operations on the flow vectors, we obtain the divergence and the gradient of the flow magnitude for every pixel in the image. p3 corresponds to the divergence, while p4 and p5 represent the two components of the gradient of the flow magnitude of the given frame (image), as shown in Equation 1.
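The per-pixel feature maps of Equation 1 can be sketched with NumPy as follows; the function name flow_features and the use of np.gradient for the spatial derivatives are our own illustrative choices, not part of the paper:

```python
import numpy as np

def flow_features(Vx, Vy):
    """Per-pixel feature maps from a dense optical flow field.

    Vx, Vy: 2D arrays of the horizontal/vertical flow components.
    Returns an (H, W, 5) array with the five channels of Equation 1:
    (Vx, Vy, divergence, d|V|/dx, d|V|/dy).
    """
    # Divergence of the flow field: dVx/dx + dVy/dy.
    # np.gradient returns derivatives along (rows, cols) = (y, x).
    dVx_dy, dVx_dx = np.gradient(Vx)
    dVy_dy, dVy_dx = np.gradient(Vy)
    div = dVx_dx + dVy_dy

    # Gradient of the flow magnitude |V|.
    mag = np.sqrt(Vx**2 + Vy**2)
    dmag_dy, dmag_dx = np.gradient(mag)

    return np.stack([Vx, Vy, div, dmag_dx, dmag_dy], axis=-1)
```

For a linear expanding field such as V = (x, y), the divergence channel evaluates to 2 everywhere, which is a quick sanity check on the axis ordering.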

For better classification, we need to add finite temporal support using information from successive images. Consequently, we take about 5-6 frames from a part of a 5 second video. We take the variance of the divergence of the flow vectors and of the gradient of the flow magnitude, as in Equation 1, at each pixel across the video frames in the temporal window. An intuitive reason behind using the variance of the divergence and gradient is the difference in the variations of their values for different regions. For static regions, the variation will be negligible because the magnitudes of the velocity vectors in these regions will be close to zero over the temporal support of about 5 seconds, as in the example shown in Fig. 3. Non-repetitive motion regions will have a high variation due to their continuous variation in movement, while repetitive motion regions will have a continuous but smaller variation in their velocity vectors.

Let Q be the feature vector that is used in the unsupervised learning algorithm. There is a significance behind bringing the Q vector in Equation 3 into the logarithmic domain: the margins for three-region clustering form better in the logarithmic domain, as can be clearly seen from Fig. 3(b) and 3(f).

Let

Qi(x,y) = (q1, q2, q3, q4, q5),    (3)

where qj = log(a + σ²(pj)), j = 1, 2, 3, 4, 5, σ² is the variance of a feature component across the space-time volume over the finite time support, and a is a very small non-zero, positive constant (a = 0.1 for Fig. 3). We include this constant to avoid the condition when the variance becomes close to zero (especially in static regions), as the logarithm would otherwise diverge. This parameter is purely experimental; we varied its value from 0.01 to 1 for the different cases in Fig. 4 depending on the corresponding segmentation output.
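A minimal sketch of the log-variance feature of Equation 3, assuming the per-frame feature maps Pi have been stacked along a leading time axis (the function name is hypothetical):

```python
import numpy as np

def temporal_log_feature(P_stack, a=0.1):
    """Q feature of Equation 3: elementwise log(a + variance over time).

    P_stack: array of shape (T, H, W, 5) holding the feature maps Pi
    over a temporal window of T frames (5-6 frames in this work).
    a: small positive constant that keeps the logarithm finite in
    static regions, where the variance is close to zero.
    """
    var = np.var(P_stack, axis=0)   # per-pixel variance across the window
    return np.log(a + var)          # shape (H, W, 5)
```

In a perfectly static region the stack is constant over time, so every component of Q collapses to log(a), which is what separates the static cluster from the two moving ones.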

The number of pixels even in a low resolution image is high, which makes optimization at the pixel level quite difficult. Superpixels group a set of neighbouring pixels which share similar properties. Using superpixels, we preserve the boundaries of the objects in the image while reducing the effective number of primitives in the image (Ren and Malik, 2003). This increases the computational efficiency of motion classification in a scene. The superpixel based approach also improves classification accuracy by computing features for each superpixel rather than for individual pixels. When we plot the divergence of the optical flow vector field, we may get noisy results, as observed in Fig. 3(c). Superpixels help in the suppression of this noise: we take the average of the feature vectors estimated within a superpixel, and this operation improves the accuracy of the scene segmentation.

Figure 3: (a) One of the input frames, (b) 3-class segmentation using K-means clustering on Vx and Vy of V in Equation 1, (c) applying LDOF at the pixel level leads to noisy segmentation, (d) result of applying superpixels on the reference image, (e) expected result (ground truth), (f) segmentation result using the proposed algorithm (white - repetitive, black - non-repetitive, gray - static).

Every pixel within a superpixel is uniformly assigned a particular feature vector. The individual elements are calculated as the mean of the feature vector values corresponding to the pixels present in the superpixel. This feature vector is used in the unsupervised learning algorithm that clusters the feature points into three different classes. In our experiments, we use the K-means clustering algorithm for testing the designed feature vector. This enables the segmentation of the scene according to the reference image.
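The superpixel averaging and clustering steps can be sketched as below. The helper names are ours, and the minimal K-means (with a deterministic farthest-point initialization) merely stands in for whatever library implementation is used in practice (e.g. scikit-learn's KMeans):

```python
import numpy as np

def superpixel_features(Q, labels):
    """Average the per-pixel feature map Q over each superpixel.

    Q: (H, W, D) feature map; labels: (H, W) integer superpixel ids,
    assumed to cover 0..n-1. Returns an (n, D) matrix of mean features.
    """
    n = labels.max() + 1
    D = Q.shape[-1]
    counts = np.bincount(labels.ravel(), minlength=n)
    sums = np.zeros((n, D))
    for d in range(D):
        sums[:, d] = np.bincount(labels.ravel(),
                                 weights=Q[..., d].ravel(), minlength=n)
    return sums / counts[:, None]

def kmeans(X, k=3, iters=50):
    """Minimal K-means with farthest-point initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        # Next center: the point farthest from all current centers.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        assign = d.argmin(axis=1)       # nearest-center assignment
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign
```

Each superpixel then carries one D-dimensional feature vector, and the three K-means labels map back onto the image as the static, repetitive, and non-repetitive regions.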

Adding more information to the feature vector, by including the divergence of each set of flow vectors obtained from more frames, improves the classification result. The feature vector comprises functions of the variance of the optical flow vectors, the divergence of the flow vectors, and the gradient of the flow magnitude. This makes the flow vector a 5-dimensional motion flow vector (5DMFV). Optical flow vectors obtained from different methods are compared by analysing the resultant classification against the ground truth image created for the scene (see Fig. 3). Applying K-means clustering with 3 clusters on this 5DMFV categorises the scene into 3 classes - static, repetitive, and non-repetitive motion - if present. As seen in Fig. 3, certain scenes have significant variation in motion, from static and small regular motion to random motion. This is the reason we group the entire 5DMFV space into 3 clusters.

Figure 4: Various datasets and segmentation results (white - repetitive, black - non-repetitive, gray - static)

4 Results

Fig. 5 depicts the comparison of our approach with one of the recent works in motion segmentation based on point trajectories. Column 1 of Fig. 5 has a frame extracted from a video of a real world scene. The figures in columns 2 and 3 are the results obtained from the binaries of the motion segmentation approach of Ochs et al. (Ochs and Brox, 2012). We present the results of our proposed approach with just the optical flow vector as the feature vector in column 4, while column 5 shows the results obtained using the 5DMFV. These scenes contain different kinds of motion, such as the rotational motion of a wheel, a moving hand, a train, and a fountain, among others.

As seen from the first row, the reference image has a static background, repetitive motion of the wheel, and non-repetitive motion of the hand. Although the approach of (Ochs and Brox, 2012) is able to segment the hand clearly, the wheel segmentation is not perfect. Fig. 5(d) shows that the segmentation result contains many errors. Fig. 5(e) depicts the result of the proposed approach, which segments the reference image into three regions comprising static, repetitive, and non-repetitive motions. Similarly, we have the fountain scene with a repetitive fountain and non-repetitive movement of far-off trees. We also consider the railway scene with constant locomotive motion and irregular tree movement due to the emitted smoke. In the fountain scene, the work of (Ochs and Brox, 2012) seems to perform better, with fewer erroneous regions in the trajectory clustering image, when compared to Fig. 5(j), which segments the three parts with some errors. However, the proposed approach produces an appropriate segmentation of the railway scene, with white as the repetitive locomotive motion, black as the tree movement, and gray as the static region, as shown in Fig. 5(o) in comparison to Fig. 5(l, m). We conclude from Fig. 5 that the proposed approach leads to promising results. Thus, we present a visual comparison of the above methods. A quantitative comparison is not possible because the state-of-the-art method deals only with segmentation of the static and dynamic parts of a scene.

Figure 5: (a, f, k) Real video sequences of Wheel, Fountain, and Train, (b, g, l) results obtained from the binaries of (Ochs and Brox, 2012), (c, h, m) trajectory clustering on the video sequences from Ochs et al. (Ochs and Brox, 2012), (d, i, n) results of the proposed approach using Lucas-Kanade optical flow vectors as feature vectors, (e, j, o) results of the proposed approach using the 5-dimensional feature vector. Here, white depicts repetitive motion, black depicts non-repetitive motion, and gray indicates static regions in the scene.

We apply our algorithm to a set of five consecutive frames of the wheel scene. In Fig. 3, the wheel exhibits repetitive motion while the arm exhibits non-repetitive motion. One of the frames of the video used is shown in Fig. 3(a). We use the first image in our set as the reference image. Fig. 3(c) shows that the segmentation at the pixel level is noisy. In the reference image, superpixels are calculated as shown in Fig. 3(d), and these superpixels are employed for the segmentation. Upon application of our algorithm, the result shown in Fig. 3(f) is obtained, which matches closely with the approximate ground truth image (Fig. 3(e)). Fig. 4 depicts two representative images from each dataset that were used and their corresponding outputs.

Figure 6: (a) and (b) Consecutive frames of the river scene, (c) the result of applying superpixels on the reference image, and (d) segmentation result using the proposed algorithm (white - repetitive, black - non-repetitive, gray - static).

When we apply the proposed algorithm on the river dataset (Fig. 6), we observe that the distant part of the river is classified as static, the nearer part as non-repetitive, and the middle part as repetitive. For a pinhole camera, the 2D perspective projection is given by (fX/Z, fY/Z), where f is the distance of the optical center from the image plane and Z is the depth of the scene point. When an object is far away, depth variations are not perceptible and the projection reduces to an orthographic projection. In this case, the 3D scene point is projected as (X, Y). When an object is near, the depth variation is appreciable and the projection becomes a strong perspective projection. In cases where the depth is slightly larger, the projection becomes a weak perspective projection. Therefore, variation in depth within the same object may result in different classifications of its parts. Fig. 6(d) shows that the distant region of the river is classified as static, the nearer region as having non-repetitive motion, and the rest of the region as having repetitive motion.
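The depth dependence described above can be made concrete with a small numerical sketch (the focal length and motion values below are arbitrary illustrative choices): under the pinhole model, a world displacement dX parallel to the image plane at depth Z moves the image point by roughly f·dX/Z, so the same physical motion shrinks toward zero for distant objects.

```python
def image_displacement(f, dX, Z):
    """Approximate image-plane displacement of a point moving by dX
    parallel to the image plane at depth Z (pinhole camera with focal
    length f). Follows from differentiating the projection f*X/Z."""
    return f * dX / Z

f = 50.0                       # focal length, arbitrary units (assumed)
dX = 1.0                       # identical world motion at every depth
print(image_displacement(f, dX, 10.0))    # near object: large image motion
print(image_displacement(f, dX, 1000.0))  # far object: nearly static image
```

This is why the distant part of the river produces flow magnitudes close to zero and falls into the static cluster, while the same water motion nearby produces large, irregular flow.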

5 Challenges

There may be scenes with very large object displacements where both the standard optical flow methods and the LDOF algorithm yield less accurate results. Even though a scene may contain regions of non-repetitive motion, they may go unidentified because of their absence in the sampled images. The algorithm may also fail, for instance, when there is lightning in the scene. This is expected, as optical flow algorithms work under the assumption of constant brightness.

A change of lighting conditions in the scene leads to errors in the segmentation, since optical flow algorithms depend on the brightness values at the pixel locations. Segmentation problems arise in such exceptional cases of complex natural scenes. The use of unsupervised learning such as K-means clustering also gives rise to the problem of different partitions resulting in different clusters.

6 Future Work

Classification of motion in a dynamic scene remains a promising research direction when the scene is affected by drastic illumination changes. In some of the previously considered examples, we saw that the illumination of the scene keeps fluctuating, which leads to poor results upon application of the proposed algorithm (Fig. 4). There may also be problems due to variations in camera parameters such as aperture, focal length, and shutter speed. We plan to extend the proposed approach for use in video synopsis and motion magnification in the future.

7 Conclusion

The proposed approach segments the scene into static, repetitive, and non-repetitive motion regions, and is effective for sampling rates between 1 per 30 frames and 1 per 5 frames. For scenes containing large displacements, LDOF gives better results. The approach fails in scenes where the lighting conditions change, as the brightness constancy assumption does not hold true. We also face difficulty in classification when the depth of an object varies widely. We hope to customize this approach to other computer vision applications involving segmentation of different objects based on the motion they exhibit.

REFERENCES

Black, M. J. and Anandan, P. (1996). The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding, 63(1):75-104.

Brox, T. and Malik, J. (2011). Large displacement optical flow: descriptor matching in variational motion estimation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(3):500-513.

Derpanis, K. G. and Wildes, R. (2012). Spacetime texture representation and recognition based on a spatiotemporal orientation analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(6):1193-1205.

Horn, B. K. and Schunck, B. G. (1981). Determining optical flow. Artificial Intelligence, 17(1):185-203.

Lucas, B. D. and Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In IJCAI, pages 674-679.

Ochs, P. and Brox, T. (2012). Higher order motion models and spectral clustering. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 614-621. IEEE.

Peterson, B. (2010). Understanding Exposure: How to Shoot Great Photographs with Any Camera. Amphoto Books.

Pritch, Y., Rav-Acha, A., and Peleg, S. (2008). Nonchronological video synopsis and indexing. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(11):1971-1984.

Ren, X. and Malik, J. (2003). Learning a classification model for segmentation. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 10-17, vol. 1.

Stauffer, C. and Grimson, W. E. L. (1999). Adaptive background mixture models for real-time tracking. In Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on, volume 2. IEEE.

Van den Bergh, M. and Van Gool, L. (2012). Real-time stereo and flow-based video segmentation with superpixels. In Applications of Computer Vision (WACV), 2012 IEEE Workshop on, pages 89-96. IEEE.

Wadhwa, N., Rubinstein, M., Durand, F., and Freeman, W. T. (2013). Phase-based video motion processing. ACM Trans. Graph. (Proceedings SIGGRAPH 2013), 32(4).