Moving Objects Segmentation in Video Sequence based on Bayesian network

Thach-Thao Duong Faculty of Information Technology

University of Science Ho Chi Minh City, Vietnam [email protected]

Anh-Duc Duong Faculty of Information Technology

University of Science Ho Chi Minh City, Vietnam

[email protected]

Abstract—This paper proposes an improvement to a moving-object segmentation method for video sequences based on a Bayesian network. The method integrates temporal and spatial features through a Bayesian network over three fields: the motion vector field, the intensity segmentation field, and the object video segmentation field. A Markov random field enforces spatial connectivity between regions. The improvement concentrates on the MAP estimation procedure in order to obtain exact segmentation results. Iterative MAP estimation can introduce additional error into the estimation procedure and degrade the convergence of the algorithm. This paper proposes a non-iterative estimation as an improvement to this algorithm. The non-iterative MAP estimation does not need the previous segmentation result, so an inaccurate segmentation result from a former stage does not affect the current segmentation stage. Additionally, the non-iterative MAP estimation is designed to fit the original model, so it does not contradict the underlying theory. Experiments show that the improvement outperforms the original version and gives good results on several benchmark video sequences.

Keywords: video segmentation, Bayesian network, Markov random field, MAP estimation, moving objects.

I. INTRODUCTION

These days, the rapid development of the Internet and multimedia, digital cameras, and large data storage hardware creates the need to process multimedia data such as video. Video object segmentation is a complicated problem and plays a key role in essential multimedia applications such as video retrieval, modern video compression, intelligent surveillance, and so forth.

There are several segmentation methods for video sequences. The classification of these methods is not very clear, because segmentation is a complicated procedure involving smaller sub-problems whose solutions relate to and support each other. Nevertheless, they are commonly divided into three groups: background modeling methods, motion-based methods, and spatial-temporal methods. Compared with image segmentation, video sequences additionally provide temporal information. Image segmentation employs spatial information such as intensity, edges, and color. The problem in video segmentation is how to connect temporal information with spatial information in one model so that the result is coherent in terms of spatial-temporal features.

Background subtraction is a common segmentation method for video sequences with a static background [19]. It usually uses a reference background [11] to detect moving objects. However, it is not appropriate for practical video sequences in which the background is dynamic. For this case, probability-based methods are applied to model the background [7] [9] [10] [29]. A Kalman filter [16], a single Gaussian model [17], a mixture of Gaussians [25], kernel density estimation [4] [5], a dynamic hidden Markov model [26], and mean-shift [8] [21] have been employed to represent complex backgrounds. Mean-shift is an effective and low-cost method for background modeling. However, the remaining problem is how to extract the exact objects when there are shadows or strong illumination in the scene [2] [22]. The coherence between spatial and temporal information [24] is considered to deal with this problem, and MAP-MRF [18] is used to model both background and foreground based on spatial and temporal information. Motion-based segmentation uses temporal information such as optical flow [27] to group moving pixels with clustering methods such as K-means [29]. Expectation maximization or graph cuts [32] have been applied to layer the motion observations [13] [14] [23] [20]. This approach is suitable for both static and dynamic backgrounds. However, it does not give good results at borders, where the moving directions differ; spatial information is needed to obtain exact borders. Spatial-temporal segmentation combines spatial features and motion features in clustering pixels [1] [31]. An MRF model [8] [15] based on two consecutive frames was proposed to handle occlusions. Graph cuts [6] formulate segmentation as a min-cut/max-flow problem on a graph based on energy minimization; this is currently among the most effective segmentation methods.

This paper presents an improvement to a segmentation method based on spatial-temporal information. The original method was proposed by Wang and Yang [30]; in that work, a Bayesian network was employed to represent the spatial-temporal constraints. Section II introduces an overview of the original model. Section III describes the improvement, which is a non-iterative MAP estimation. Section IV shows results and discussion. The conclusion is presented in Section V.


II. MODEL

The base method is the statistical model proposed by Wang and Yang [30]. Their method combines temporal and spatial information via a Bayesian network through three fields: the motion vector field, the intensity segmentation field, and the object video segmentation field. A Markov random field enforces spatial connectivity between regions.

A. Model representation

Assuming that there is no variation of illumination and no occlusion, the model can be presented as

y_k(x) = g_k(x) + n_k(x)    (1)

g_k(x) = g_{k-1}(x - d_k(x))    (2)

where y_k(x) is the observed intensity of the pixel at position x in the k-th frame, d_k(x) is the motion vector from the (k-1)-th frame to the k-th frame, g_k(x) is the true intensity at position x, and n_k(x) is Gaussian noise with mean 0 and variance σ².

Given three consecutive frames g_{k-1}, g_k, g_{k+1} and applying Bayes' rule, it is necessary to estimate the conditional probability of the three fields: the motion vector field d_k, the intensity segmentation field s_k, and the video segmentation field z_k.
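As a concrete illustration of (1) and (2), the following numpy sketch synthesizes an observed frame from its predecessor by motion compensation plus additive Gaussian noise. It is an illustrative reconstruction, not the authors' code; the constant motion vector, the clamped border handling, and the noise level sigma are assumptions.

import numpy as np

def synthesize_frame(g_prev, d, sigma=2.0):
    # Illustrates y_k(x) = g_k(x) + n_k(x) with g_k(x) = g_{k-1}(x - d_k(x))
    # for a constant integer motion vector d = (dy, dx).
    h, w = g_prev.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(ys - d[0], 0, h - 1)           # backward-warped row index
    src_x = np.clip(xs - d[1], 0, w - 1)           # backward-warped column index
    g_k = g_prev[src_y, src_x]                     # noise-free intensity g_k(x)
    n_k = np.random.normal(0.0, sigma, g_k.shape)  # Gaussian noise n_k(x)
    return g_k + n_k                               # observed intensity y_k(x)

g_prev = np.random.rand(64, 64) * 255
y_k = synthesize_frame(g_prev, d=(1, 2))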

Figure 1: Bayesian Model for video segmentation [30].

Applying Bayes' rule and the chain rule [12], the maximum a posteriori (MAP) estimate of the three fields is

(d_k*, s_k*, z_k*) = arg max p(d_k, s_k, z_k | g_{k-1}, g_k, g_{k+1})    (3)

B. Spatio-temporal constraint

(4)

(5)

The backward DFD [28] and forward DFD are

DFD_b(x) = g_k(x) - g_{k-1}(x - d_k(x))    (6)

DFD_f(x) = g_k(x) - g_{k+1}(x + d_k(x))    (7)

The correlation coefficient of the backward and forward DFDs is

(8)
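A short numpy sketch of (6) and (7), assuming the motion field d_k stores integer (dy, dx) displacements per pixel and borders are clamped (both are implementation assumptions, not from the paper):

import numpy as np

def displaced_frame_differences(g_prev, g_cur, g_next, d):
    # d has shape (H, W, 2) holding integer (dy, dx) per pixel.
    h, w = g_cur.shape
    ys, xs = np.mgrid[0:h, 0:w]
    by = np.clip(ys - d[..., 0], 0, h - 1)
    bx = np.clip(xs - d[..., 1], 0, w - 1)
    fy = np.clip(ys + d[..., 0], 0, h - 1)
    fx = np.clip(xs + d[..., 1], 0, w - 1)
    dfd_b = g_cur - g_prev[by, bx]   # backward DFD, Eq. (6)
    dfd_f = g_cur - g_next[fy, fx]   # forward DFD, Eq. (7)
    return dfd_b, dfd_f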

(9)

where s_k(x) = l is the label of a region, μ_l is the mean intensity of region l, and σ_l² is the variance of each region. The segmentation field follows a Gibbs distribution,

p(s_k) = (1/Z) exp( - Σ_{c∈C} V_c(s_k) )    (10)

where C is the set of cliques c, Z is a normalizing constant, and V_c is the potential function for clique c, which depends only on the pixels inside the clique.
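A minimal sketch of the Gibbs energy in (10) with pairwise two-pixel cliques; the Potts-type potential (a constant penalty for unequal neighboring labels) is an assumed choice standing in for the paper's exact potential:

import numpy as np

def gibbs_energy(labels, beta=1.0):
    # Sum of pairwise clique potentials V_c over horizontal and vertical
    # two-pixel cliques; p(labels) is proportional to exp(-energy).
    diff_h = labels[:, 1:] != labels[:, :-1]   # horizontal cliques
    diff_v = labels[1:, :] != labels[:-1, :]   # vertical cliques
    return beta * (diff_h.sum() + diff_v.sum())

labels = np.random.randint(0, 4, size=(32, 32))
print(gibbs_energy(labels))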

(11)

where δ(·,·) is the Kronecker delta function and ||·|| denotes the Euclidean distance.

(12)

The term in (12) takes charge of a splitting force that divides regions into smaller regions, while

(13)

takes charge of a merging force between regions. A weighting factor regulates the constraint on the intensity segmentation field.

C. Iterative MAP Estimation

The procedure to compute the minimum proposed by Wang and Yang [30] has two steps.

Figure 2: Iterative Estimation Model.

Step 1: Update d_k and s_k when z_k is known. From the proposed Bayesian model, d_k and s_k are conditionally independent given the video segmentation field and the three consecutive frames. The joint estimation is

(14)

Applying the chain rule, the MAP estimation becomes

(15)

(16)

Step 2: Update z_k when d_k and s_k are known:

(17)
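The two-step procedure can be summarized by the alternation sketched below. This is a schematic skeleton only: update_ds and update_z are placeholder callables standing for the MAP updates in (15)-(17), not the authors' implementation.

def iterative_map_estimation(frames, z_init, update_ds, update_z, n_iters=5):
    # Alternate Step 1 (estimate d_k, s_k given z_k) and
    # Step 2 (estimate z_k given d_k, s_k) for a fixed number of sweeps.
    g_prev, g_cur, g_next = frames
    z = z_init
    for _ in range(n_iters):
        d, s = update_ds(z, g_prev, g_cur, g_next)  # Step 1, Eqs. (15)-(16)
        z = update_z(d, s, g_prev, g_cur, g_next)   # Step 2, Eq. (17)
    return d, s, z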

D. Local Optimization

To increase the accuracy at borders, local optimization is applied in the border area of the intensity segmentation field. The term in (13) is replaced by

(18)

(19)

(20)

(21)

III. NON-ITERATIVE ESTIMATION

The iterative estimation has some limitations. First of all, the video segmentation field may be wrong but still takes part in the computation of the intensity segmentation field and the motion field. This can introduce additional error into the estimation procedure and degrade the convergence of the algorithm. This paper proposes a non-iterative estimation as an improvement to the algorithm.

A. Non-iterative estimation model

Figure 3: Non-iterative estimation model.

Step 1: Estimate d_k and s_k without knowing the video segmentation field in advance. From the Bayesian model, d_k and s_k are conditionally independent given the three consecutive frames. The joint estimation can be

(22)

Applying the chain rule, the MAP estimation becomes

(23)

(24)

Step 2: Update z_k when the estimates of the motion vector field d_k and the intensity segmentation field s_k are known. The corresponding term in (19) is replaced by applying the Gibbs distribution and a potential function on two neighbouring points to compute the conditional probability:

(25)

Therefore, the local optimization in (19), (20), and (21) becomes

(26)

(27)

(28)
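By contrast, the non-iterative estimation runs each stage exactly once, so a possibly wrong z_k never feeds back into the motion and intensity estimates. A schematic sketch under the same placeholder conventions as the iterative skeleton above:

def non_iterative_map_estimation(frames, estimate_ds, estimate_z):
    # Step 1: d_k and s_k from the three frames alone, Eqs. (23)-(24).
    # Step 2: z_k from the estimated d_k and s_k, Eqs. (26)-(28).
    g_prev, g_cur, g_next = frames
    d, s = estimate_ds(g_prev, g_cur, g_next)  # no z_k required
    z = estimate_z(d, s, g_cur)                # single pass, no feedback
    return d, s, z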

B. Non-iterative estimation algorithm

Algorithm to estimate the motion vector field in frame g_k:

Step 1: For each pixel x(i, j) and each candidate motion vector, estimate the corresponding term, where ms is the size of the motion vector search bound.

Step 2: Compute the motion vector for pixel x.

Step 3: For each pixel x, consider all pixels y in the neighborhood N_x of x and the observed motion vectors of the pixels y. Compute as in (26).

Step 4: Compute the motion vector for pixel x.

Algorithm to compute the intensity segmentation field in frame g_k:

Step 1: Initialize the mean value for each of the m intensity levels, i = 1..m, and assign an initial intensity segmentation label to each pixel x:

μ_i = MAXGRAY·(i + 1/2)/m

s_k(x) = g_k(x)·m/MAXGRAY

where MAXGRAY is the maximum grayscale level.

Step 2: For each pixel x, consider each intensity segmentation label s_k(x) = 1..m and observe the pixels in the neighborhood N_x. Compute as in (27).

Step 3: Assign the intensity segmentation label s_k(x) to pixel x.

Step 4: Update the means and go back to Step 2 (loop for a fixed number of iterations).
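A sketch of this loop as k-means-style re-estimation, reading the level index as 0-based so the initial means fall at bin centers; the neighborhood term of (27) is omitted here, so this is a simplified illustration rather than the full algorithm:

import numpy as np

def intensity_segmentation(g, m=4, maxgray=255, n_iters=10):
    # Step 1: initialize means mu_i = MAXGRAY*(i + 1/2)/m and labels
    # s_k(x) = g_k(x)*m/MAXGRAY.
    mu = np.array([maxgray * (i + 0.5) / m for i in range(m)])
    s = np.clip((g * m / maxgray).astype(int), 0, m - 1)
    for _ in range(n_iters):
        s = np.argmin(np.abs(g[..., None] - mu), axis=-1)  # Steps 2-3
        for i in range(m):                                 # Step 4: update means
            if np.any(s == i):
                mu[i] = g[s == i].mean()
    return s, mu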

Algorithm to estimate the video segmentation field in frame g_k:

Step 1: Compute the distance transformation at each pixel in frame g_k.

Step 2: For each pixel x with motion vector d_k(x), distance transformation value DT(x), and the video segmentation label for frame g_k, consider all possible video segmentation labels z_k(x) = 0..m. Compute as in (28).

Step 3: Assign the video segmentation label to pixel x.
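Step 1 requires a distance transformation; a sketch using scipy's Euclidean distance transform, where deriving the border map from label changes in s_k is an assumed implementation detail:

import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_to_borders(s):
    # DT(x): distance from each pixel to the nearest segmentation border,
    # where borders are pixels whose label differs from a neighbor's.
    border = np.zeros(s.shape, dtype=bool)
    border[:, 1:] |= s[:, 1:] != s[:, :-1]
    border[1:, :] |= s[1:, :] != s[:-1, :]
    # distance_transform_edt measures distance to the nearest zero element,
    # so invert the mask: border pixels become zeros.
    return distance_transform_edt(~border)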

C. Initialization and parameters

We assume that an initial video segmentation is known in advance. To obtain the initial video segmentation, the authors use MAP estimation [5] and the method proposed by Wang and Adelson [29]. The video segmentation field can also be obtained by manual labeling or automatic image segmentation. The authors use the parameter selection method proposed by Chang [3]. The set of parameters (λ1, λ2, λ3, λ4) was defined by equalizing the contributions of the terms in the MAP estimation. This paper uses the parameter set published by the authors.

IV. RESULTS AND DISCUSSION

Segmentation results in Fig. 4d-f are obtained using the same parameter set (λ1, λ2, λ3, λ4, λ5) = (1, 12, 4, 16, 0.625) with the iterative estimation, the improved iterative estimation following (27) and (28), and the non-iterative estimation, respectively. The segmentation result in Fig. 4g is obtained using the parameter set (λ1, λ2, λ3, λ4, λ5) = (1, 12, 4, 32, 0.3125) with the iterative estimation.

The result in Fig. 4d has much noise, whereas the result in Fig. 4e is good, with little noise and exact borders. Similarly, Fig. 4e and 4f show exact segmentation at the borders, whereas the result in Fig. 4g is not exact at the border of the right arm. The segmentation result in Fig. 4f, obtained with non-iterative MAP estimation, is better than the three other results.

The experiment was performed with non-iterative MAP estimation on several benchmark sequences for video segmentation, such as “husky”, “football”, and “bus”. The algorithm works well, especially for sequences with moving objects or a moving background. Under rapid motion, the algorithm fails to keep track of large movements; these sequences were therefore chosen for their moderate movement of objects and background. “bus” has few objects, whereas “husky” and “football” have many moving objects. Additionally, “football” has complicated movement and much more occlusion. “bus” is a special sequence because the objects and the camera move in the same direction, so the objects appear stationary against a moving background. In the “husky” and “football” sequences, the camera moves quickly and suddenly in some phases. Experiments show good performance on “husky” and “bus”.

Comparing the motion vector field in Fig. 5d with the video segmentation field in Fig. 5g, objects with fast movement are segmented better than slowly moving objects; the reason is that the motion vectors of slowly moving objects are not strong enough to split them from the background. Fig. 5e shows the intensity segmentation with 4 levels, in which pixels with the same intensity level belong to the same region. Fig. 5f shows the distance from each pixel to its nearest border: the higher the gray value, that is, the brighter the pixel, the farther it is from the border.

In the “bus” sequence, border information that is lost in Fig. 6e (border information can be lost because of over-segmentation; in this case, the border at the head of the bus) is recovered by motion information. Borders are segmented exactly when the spatial and temporal information are coherent with each other, for example the human body and the dog in Fig. 5g and the bus head in Fig. 6g. The algorithm works well even in large and complex texture areas, for example the backgrounds in Fig. 5 and Fig. 6.

Figure 4: Results for “table tennis” sequence. (a-c) Three consecutive frames 41, 42, 43; (d) result with iterative MAP; (e) result with improved iterative MAP; (f) result with non-iterative MAP; (g) result with iterative MAP.

Figure 5: Results for “husky” sequence with iterative estimation. (a-c) Three consecutive frames 54, 55, 56; (d) motion vector field; (e) intensity segmentation field; (f) distance transformation image; (g) video segmentation field.

Figure 6: Results for “bus” sequence with iterative estimation. (a-c) Three consecutive frames 30, 31, 32; (d) motion vector field; (e) intensity segmentation field; (f) distance transformation image; (g) video segmentation field.

As the results in Fig. 5 and Fig. 6 show, the improved algorithm performs better when the spatial-temporal features within a region are coherent, even without a good intensity segmentation.

Fig. 7a-b indicates that differences in spatial information cause region splitting. Fig. 7c-d shows that small regions with nearly the same temporal and spatial information are merged into a single region. The results for the “football” sequence in Fig. 8, frames 1 to 4, show that object borders are preserved even when objects move quickly and occlude each other.

Fig. 9 shows the segmentation results for frames 23 to 32, in which the camera moves quickly and suddenly. The object borders are not exact during this stage, and the segmentation becomes chaotic. When the camera changes direction suddenly, the borders are no longer exact, because the model is based on a simple affine camera movement. When both the camera and the objects move at the same time, the motion vector field does not reveal the real movement of the objects. Similarly, in the “husky” sequence in Fig. 10, the camera begins to rotate and zoom closer to the objects from Fig. 10c onward; as a result, object borders are violated and regions are fragmented.

Figure 7: Results for “husky” sequence in frames 49, 50 and 264, 267.

Figure 8: Results for “football” sequence with frames 1, 2 and 3, 4.

To sum up, the experiments show that the results conform to the proposed model. Firstly, the intensity segmentation field produces exact borders in coherent spatial-temporal regions. In some cases, similar regions may belong to different segmented objects. The spatial constraint may be weaker when the motion information in the same region is not coherent. This is the reason why a lost border can be recovered by motion

Figure 9: Results for “football” sequence with frames from 23 to 31.

Figure 10: Results for “husky” sequence with frames from 166 to 174.

information. Secondly, the proposed method cannot produce exact borders when the camera or the objects move in a complex manner. However, it is a promising method for segmenting coherent spatial-temporal regions. Thirdly, when objects stop moving and stay still, the motion vectors are weak; therefore, the border is violated and disappears. In this case, the corresponding parameter should be increased, but the change should be weighed so as to balance the contributions to the MAP estimation.

V. CONCLUSION AND FUTURE WORK

This paper approaches video object segmentation based on spatial and temporal features. A Bayesian model and a Markov random field are employed to represent the relationship between spatial and temporal information. The Bayesian model combines the interactions among the motion vector, intensity segmentation, and video segmentation fields. The Markov random field is applied to group coherent spatial-temporal pixels into regions. The paper studies and improves the MAP estimation procedure. The experiments show that the non-iterative estimation is better: it converges to the segmentation result even with small parameters, while retaining the quality of regions with coherent spatial-temporal information.

To make this algorithm more practical, its complexity needs to be optimized. It is not a general segmentation algorithm for all video sequences, and it may need some assumptions to reduce the cost. Additionally, the simple affine motion model is not suitable for sequences in which the camera and objects move in complicated ways; the model should be verified for this complex case. In the future, our research will aim to model complicated camera movements and to optimize the algorithm.

REFERENCES

[1] Bergen L. and Meyer F. (2000), "A novel approach to depth ordering in monocular image sequences", CVPR, pp. 536-541.

[2] Boult T. E., Micheals R. J., Gao X., and Eckmann M. (2001), "Into the woods: Visual surveillance of noncooperative and camouflaged targets in complex outdoor settings", Proc. IEEE, vol. 89, pp. 1382-1402.

[3] Chang M. M., Tekalp A. M., and Sezan M. I. (1997), "Simultaneous motion estimation and segmentation", IEEE Trans. Image Processing, vol. 6, pp. 1326-1333.

[4] Elgammal A., Duraiswami R., Harwood D., and Davis L. S. (2002), “Background and foreground modeling using nonparametric kernel density estimation for visual surveillance,” Proc. IEEE, vol. 90, pp. 1151-1163.

[5] Elgammal A., Harwood D., and Davis L. (2000), “Non-parametric model for background subtraction”, Proceedings of European Conference on Computer Vision, vol. 2, pp. 751–767.

[6] Freedman D. and Turek M. W. (2005), "Illumination-invariant tracking via graph cuts", Proc. Conf. Comp. Vision Pattern Rec., pp. 10-17.

[7] Grimson W. E. L., Stauffer C., Romano R., and Lee L. (1998), "Using adaptive tracking to classify and monitor activities in a site", Proc. Conf. Comp. Vision Pattern Rec., pp. 22-29.

[8] Han B., Comaniciu D. and Davis L. (2004), “Sequential kernel density approximation through mode propagation: Applications to background modeling”, ACCV.

[9] Haritaoglu I., Harwood D., and Davis L.S. (1998), “W4: A real time system for detecting and tracking people”, Computer Vision and Pattern Recognition, pp. 962–967.

[10] Isard M. and Blake A. (1998), "A mixed-state Condensation tracker with automatic model-switching", Proc. Int'l Conf. Computer Vision, pp. 107-112.

[11] Jain R. and Nagel H. (1979), "On the analysis of accumulative difference pictures from image sequences of real world scenes", IEEE Trans. Pattern Anal. Machine Intell., 1(2).

[12] Jensen F. V. (2001), Bayesian Networks and Decision Graphs, Springer-Verlag.

[13] Jepson A. D., Fleet D. J., and Black M. J. (2002), "A layered motion representation with occlusion and compact spatial support", Proc. European Conf. Computer Vision, pp. 692-706.

[14] Jojic N. and Frey B. J. (2001), "Learning flexible sprites in video layers", Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 199-206.

[15] Kamijo S., Ikeuchi K., and Sakauchi M. (2001), “Segmentations of spatio-temporal images by spatio-temporal Markov random field model”, Proc. EMMCVPR Workshop, pp. 298-313.

[16] Koller D., Weber J., and Malik J. (1994), "Robust multiple car tracking with occlusion reasoning", Proceedings of the European Conference on Computer Vision, pp. 189-196.

[17] Lee D. S. (2005), “Effective Gaussian mixture learning for video background subtraction”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 827–832.

[18] Mahamud S. (2006), "Comparing belief propagation and graph cuts for novelty detection", Proc. Conf. Comp. Vision Pattern Rec., vol. 1, pp. 1154-1159.

[19] McIvor A. M. (2000), “Background subtraction techniques”, In Proc. of Image and Vision Computing.

[20] Pentland A. and Darrell T. (1991), "Cooperative robust estimation using layers of support", TR-163, MIT Media Lab, Vision and Modeling Group.

[21] Piccardi M. and Jan T. (2004), “Mean-shift background image modelling”, Proceedings of International Conference on Image Processing, pp. 3399–3402.

[22] Prati A., Mikic I., Trivedi M., and Cucchiara R. (2003), “Detecting moving shadows: Algorithms and evaluation”, IEEE Trans. Patt. Anal. Mach. Intel., vol. 25, pp. 918-923.

[23] Seki M., Wada T., Fujiwara H. and Sumi K. (2003), “Background subtraction based on cooccurrence of image variations”, Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 65-72.

[24] Sheikh Y. and Shah M. (2005), "Bayesian modeling of dynamic scenes for object detection", IEEE Trans. Pattern Anal. Machine Intell., 27(11), pp. 603-619.

[25] Stauffer C. and Grimson W. E. L. (2000), “Learning patterns of activity using real-time tracking”, IEEE Trans. Patt. Anal. Mach. Intel., vol. 22, pp. 747-757.

[26] Stenger B., Ramesh V., Paragios N., Coetzee F., and Buhmann J. M. (2001), “Topology free hidden Markov models: application to background modeling”, Proc. Int’l Conf. Computer Vision, vol. 1, pp. 294-301.

[27] Tian Y. and Hampapur A. (2005), “Robust salient motion detection with complex background for real-time video surveillance”, Workshop on Motion and Video Computing.

[28] Tekalp A. M. (1995), Digital Video Processing, Prentice Hall.

[29] Wang J. Y. A. and Adelson E. H. (1994), "Representing moving images with layers", IEEE Trans. Image Processing, vol. 3, pp. 625-637.

[30] Wang Y., Loe K. F., and Tan T. (2003), "Video segmentation based on graphical models", Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 335-342.

[31] Wardhani A. and Gonzalez R. (1999), "Image Structure Analysis for CBIR", Proc. Digital Image Computing: Techniques and Applications, pp. 166-168.

[32] Xiao J. and Shah M. (2005), “Accurate motion layer segmentation and matting”, Proc. Conf. Comp. Vision Pattern Rec., pp. 698-703.