

Multi-instance Object Segmentation with Occlusion Handling

Yi-Ting Chen^1, Xiaokai Liu^{1,2}, Ming-Hsuan Yang^1

^1 University of California at Merced. ^2 Dalian University of Technology.

In this paper, we present a multi-instance object segmentation algorithm to tackle occlusions. As an object is split into two parts by an occluder, it is nearly impossible to group the two separate regions into an instance by purely bottom-up schemes. To address this problem, we propose to incorporate top-down category-specific reasoning and shape prediction through exemplars into an intuitive energy minimization framework.

We start by finding the occluding regions, i.e., the overlaps between two instances. For example, the overlap between the person and the motorbike in Figure 1 gives the occluding region, i.e., the leg of the person. To find these regions, we need to parse and categorize the two overlapping instances. Recently, Hariharan et al. [3] proposed a simultaneous detection and segmentation (SDS) algorithm that shows a significant improvement in the segmentation classification task. This classification capability provides powerful top-down category-specific reasoning to tackle occlusions. We then use the categorized segmentation hypotheses obtained by SDS to infer occluding regions by checking whether two of the top-scoring categorized segmentation proposals overlap. If they do, we record the overlap in the occluding region set.
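As a minimal illustration of this overlap test, the Python sketch below records every non-empty pairwise intersection between proposal masks as an occluding region. The set-of-pixels mask representation and the helper name are our own choices for illustration, not details from the paper:

```python
def find_occluding_regions(masks):
    """masks: list of proposal masks, each a set of (x, y) pixel coordinates.
    Returns every non-empty pairwise overlap as an occluding region."""
    occluding_set = []
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            overlap = masks[i] & masks[j]
            if overlap:  # the two instances occlude each other
                occluding_set.append(overlap)
    return occluding_set

# Toy example: a "person" mask and a "motorbike" mask sharing a leg pixel.
person = {(0, 0), (0, 1), (1, 1)}
motorbike = {(1, 1), (2, 1), (2, 2)}
print(find_occluding_regions([person, motorbike]))  # [{(1, 1)}]
```

In the full method, only pairs of top-scoring categorized proposals are tested, so the quadratic loop stays cheap.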

On the other hand, the classification capability is used to generate category-specific likelihood maps and to find the corresponding category-specific exemplar sets to better estimate the shape of objects. The category-specific likelihood maps indicate the locations of objects of different categories in a probabilistic manner. As bottom-up segmentations tend to undershoot (e.g., missing parts of an object) and overshoot (e.g., containing background clutter), we enhance these segments through a non-parametric, data-driven shape predictor based on chamfer matching [4]. An overview of the exemplar-based shape prediction is shown in Figure 2.
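A bare-bones chamfer score between an exemplar edge template and an image edge map can be sketched as below. Note this is a simplified stand-in: the paper uses fast directional chamfer matching [4], which additionally compares edge orientations, and the point-list edge representation here is an assumption for illustration:

```python
def chamfer_score(template_edges, image_edges):
    """Average distance from each template edge point to its nearest
    image edge point (lower = better shape match). Edge maps are
    given as lists of (x, y) points."""
    total = 0.0
    for tx, ty in template_edges:
        total += min(((tx - ix) ** 2 + (ty - iy) ** 2) ** 0.5
                     for ix, iy in image_edges)
    return total / len(template_edges)

image = [(2, x) for x in range(5)]    # a horizontal edge in the image
aligned = [(2, x) for x in range(5)]  # exemplar aligned with that edge
shifted = [(4, x) for x in range(5)]  # exemplar two pixels off

print(chamfer_score(aligned, image))  # 0.0
print(chamfer_score(shifted, image))  # 2.0
```

In practice the inner minimization is done with a precomputed distance transform of the edge map rather than a brute-force nearest-point search.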

The inferred occluding regions, shape predictions, and class-specific likelihood maps are formulated into an energy minimization framework to obtain the desired segmentation candidates (e.g., Figure 1(d)). Let y_p denote the label of a pixel p in an image and y denote the vector of all y_p. The energy function given the foreground-specific appearance model A_i is defined as

E(y; A_i) = \sum_{p \in P} U_p(y_p; A_i) + \sum_{(p,q) \in N} V_{p,q}(y_p, y_q),   (1)

where P denotes all pixels in an image, N denotes the pairs of adjacent pixels, U_p(·) is the unary term, and V_{p,q}(·) is the pairwise term. Our unary term U_p(·) is a linear combination of several terms and is written as

U_p(y_p; A_i) = -\alpha_{A_i} \log p(y_p; c_p, A_i) - \alpha_O \log p(y_p; O) - \alpha_{P_{c_j}} \log p(y_p; P_{c_j}).   (2)
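In code, the unary term (2) is simply a weighted sum of negative log-potentials. The sketch below uses illustrative weights and probability values; the α weights are free parameters of the method and their settings here are ours:

```python
import math

def unary_term(p_app, p_occ, p_cls, a_app=1.0, a_occ=1.0, a_cls=1.0):
    """U_p(y_p; A_i) as in Eq. (2): p_app, p_occ, p_cls are the values
    of the three potentials at pixel p for a candidate label y_p."""
    return -(a_app * math.log(p_app)
             + a_occ * math.log(p_occ)
             + a_cls * math.log(p_cls))

# All three cues agree with the label -> low energy.
print(round(unary_term(0.9, 0.9, 0.9), 3))  # 0.316
# The occlusion cue contradicts the label -> much higher energy.
print(round(unary_term(0.9, 0.1, 0.9), 3))  # 2.513
```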

For the pairwise term V_{p,q}(y_p, y_q), we follow the definition in GrabCut [5]. The first potential p(y_p; c_p, A_i) evaluates how likely a pixel of color c_p is to take label y_p based on a foreground-specific appearance model A_i. As in [5], an appearance model A_i consists of two Gaussian mixture models, one for the foreground and one for the background. Each foreground-specific appearance model A_i is initialized using a foreground mask f_i.
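Putting the pieces of Eq. (1) together on a toy labeling: the sketch below evaluates the total energy, with a constant Potts penalty deliberately standing in for GrabCut's contrast-sensitive pairwise term, and placeholder unary costs instead of the three potentials of Eq. (2):

```python
def total_energy(y, unary, neighbors, smoothness=1.0):
    """Eq. (1) on a toy problem. y: dict pixel -> label in {0, 1};
    unary: dict (pixel, label) -> cost; neighbors: adjacent pixel pairs N.
    A constant Potts penalty replaces GrabCut's contrast-sensitive term."""
    data = sum(unary[(p, y[p])] for p in y)
    pairwise = sum(smoothness for p, q in neighbors if y[p] != y[q])
    return data + pairwise

# Two adjacent pixels: 'a' prefers foreground (1), 'b' background (0).
unary = {('a', 1): 0.1, ('a', 0): 2.0, ('b', 1): 2.0, ('b', 0): 0.1}
print(round(total_energy({'a': 1, 'b': 0}, unary, [('a', 'b')]), 2))  # 1.2
print(round(total_energy({'a': 1, 'b': 1}, unary, [('a', 'b')]), 2))  # 2.1
```

Note the trade-off the energy encodes: disagreeing labels pay the pairwise penalty, but forcing agreement pays a large unary cost for pixel 'b'.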

The second potential p(y_p; O) accounts for occlusion handling in the proposed energy minimization framework, where O denotes the occluding region set. Given a foreground mask f_i and its score s^{c_j}_{f_i} in class c_j, we check the corresponding score of the region f_i \setminus O^*, i.e., the foreground mask f_i with one of the possible occluding regions O^* in O removed. When s^{c_j}_{f_i \setminus O^*} > s^{c_j}_{f_i}, the pixels p in the occluding region O^* are discouraged from being associated with the foreground mask f_i. In this case, we add a penalty to the energy of the occluding region when the pixel location p is labeled foreground, and subtract the penalty when it is labeled background.
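The signed penalty described above can be sketched as follows; the penalty magnitude here is a free parameter of our illustration (in the energy it is scaled by α_O):

```python
def occlusion_potential(score_without, score_with, y_p, penalty=1.0):
    """Energy contribution of a pixel p inside an occluding region O*.
    score_without = s^{c_j}_{f_i \\ O*}, score_with = s^{c_j}_{f_i}.
    Only applies when removing O* improves the class score."""
    if score_without <= score_with:
        return 0.0  # O* looks like part of the object; no adjustment
    return penalty if y_p == 1 else -penalty  # push p toward background

print(occlusion_potential(0.8, 0.5, y_p=1))  # 1.0  (foreground penalized)
print(occlusion_potential(0.8, 0.5, y_p=0))  # -1.0 (background rewarded)
print(occlusion_potential(0.5, 0.8, y_p=1))  # 0.0  (score did not improve)
```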

This is an extended abstract. The full paper is available at the Computer Vision Foundation webpage.

Figure 1: Segmentation quality comparison. Given an image (a), our method (d) can handle the occlusion caused by the leg of the person, while MCG [1] (c) includes the leg of the person as part of the motorbike. Moreover, the segment in (c) is classified as a bicycle using class-specific classifiers, whereas our segment is correctly classified as a motorbike.

[Figure 2 panels: exemplar templates, matched points (corner detector + chamfer matching, match scores 0.28/0.64/0.30), inferred masks M_1, ..., M_n, and the resulting shape prior S_n.]

Figure 2: Overview of the exemplar-based shape predictor. This figure shows an example in which the shape predictor uses top-down class-specific shape information to remove the overshooting on the back of the horse.

The third potential p(y_p; P_{c_j}) corresponds to one of the class-specific likelihood maps P_{c_j}. Because of the probabilistic nature of the class-specific likelihood map P_{c_j}, we set the third potential as p(y_p; P_{c_j}) = P_{c_j}^{y_p} (1 - P_{c_j})^{1 - y_p}.
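Written out, this third potential is a per-pixel Bernoulli likelihood whose success probability is the likelihood-map value:

```python
def class_potential(P_cj, y_p):
    """p(y_p; P_cj) = P_cj^{y_p} * (1 - P_cj)^{1 - y_p}."""
    return P_cj ** y_p * (1.0 - P_cj) ** (1 - y_p)

# A likelihood-map value of 0.7 at pixel p:
print(class_potential(0.7, 1))            # 0.7 (foreground label)
print(round(class_potential(0.7, 0), 2))  # 0.3 (background label)
```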

Finally, we iteratively minimize the energy function (1) as in [5]. The parameters of the foreground-specific appearance model are updated in each iteration until the energy function converges.
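The alternation can be sketched generically: refit the appearance model from the current labeling, re-solve for the labels, and stop when the energy plateaus. The function names below are placeholders for the actual GMM-fitting and graph-cut steps, and the toy instantiation is ours:

```python
def iterate_minimize(y0, fit_model, solve_labels, energy,
                     tol=1e-6, max_iter=50):
    """Alternating minimization in the style of GrabCut [5].
    fit_model(y) -> appearance model A; solve_labels(A) -> new labeling;
    energy(y, A) -> scalar energy monitored for convergence."""
    y, prev = y0, float('inf')
    for _ in range(max_iter):
        A = fit_model(y)      # e.g., refit the two GMMs from labeling y
        y = solve_labels(A)   # e.g., graph cut with updated unary terms
        e = energy(y, A)
        if prev - e < tol:    # energy stopped decreasing: converged
            break
        prev = e
    return y

# Toy instantiation (a scalar "labeling" halved each round, energy y^2):
# the loop drives the value toward 0 and stops once progress is below tol.
result = iterate_minimize(8.0,
                          fit_model=lambda y: y / 2,
                          solve_labels=lambda A: A,
                          energy=lambda y, A: y * y)
```

Each half-step can only lower (or keep) the energy, which is why the alternation converges; the same monotonicity argument underlies GrabCut's iterated graph cuts.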

The implementation and evaluation of the proposed algorithm are described in detail in the paper. We demonstrate the effectiveness of the proposed algorithm by comparing against SDS on the challenging PASCAL VOC segmentation dataset [2]. The experimental results show that the proposed algorithm achieves favorable performance. Moreover, the results suggest that high-quality segmentations improve detection accuracy significantly.

[1] Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T. Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. In CVPR, 2014.

[2] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge (VOC). IJCV, 88(2):303–338, 2010.

[3] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Simultaneous detection and segmentation. In ECCV, 2014.

[4] Ming-Yu Liu, Oncel Tuzel, Ashok Veeraraghavan, and Rama Chellappa. Fast directional chamfer matching. In CVPR, 2010.

[5] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In SIGGRAPH, 2004.