[hal-00857918, version 1, 4 Sep 2013] Parameter Estimation and Energy Minimization for Region-based Semantic Segmentation



Parameter Estimation and Energy Minimization for Region-based Semantic Segmentation

M. Pawan Kumar, Haithem Turki, Dan Preston, and Daphne Koller

Abstract—We consider the problem of parameter estimation and energy minimization for a region-based semantic segmentation model. The model divides the pixels of an image into non-overlapping connected regions, each of which is assigned a label indicating its semantic class. In the context of energy minimization, the main problem we face is the large number of putative pixel-to-region assignments. We address this problem by designing an accurate linear programming based approach for selecting the best set of regions from a large dictionary. The dictionary is constructed by merging and intersecting segments obtained from multiple bottom-up over-segmentations. The linear program is solved efficiently using dual decomposition. In the context of parameter estimation, the main problem we face is the lack of fully supervised data. We address this issue by developing a principled framework for parameter estimation using diverse data. More precisely, we propose a latent structural support vector machine formulation, where the latent variables model any missing information in the human annotation. Of particular interest to us are three types of annotations: (i) images segmented using generic foreground or background classes; (ii) images with bounding boxes specified for objects; and (iii) images labeled to indicate the presence of a class. Using large, publicly available datasets we show that our methods are able to significantly improve the accuracy of the region-based model.

1 INTRODUCTION

SEMANTIC segmentation offers a useful representation of the scene depicted in an image by assigning each pixel to a specific semantic class (for example, 'person', 'building' or 'tree'). It is a long-standing goal of computer vision, and is an essential building block of many of the most ambitious applications, such as autonomous driving, robot navigation, content-based image retrieval and surveillance. Its importance can be assessed by its long history in the computer vision literature.

Encouraged by their success in low-level vision applications such as image denoising and stereo reconstruction [1], earlier efforts focused on pixel-level models, where each pixel was assigned a label using features extracted from a regularly shaped patch around it [2], [3], or at an offset from it [4]. However, the features extracted from such patches are not reliable in the presence of background clutter. For example, a patch around a boundary pixel of a tree may contain sky or building pixels. This may inhibit an algorithm from using the fact that trees are mostly green.

To avoid this problem of pixel-based methods, researchers have started to develop region-based models. Such models divide an image into regions, where each region is a set of connected pixels. The reasoning is that if the regions are large enough to provide reliable discriminative features, but small enough that all the pixels within a region belong to the same semantic class, then we can obtain an accurate segmentation of the image. The first region-based models [5], [6], [7] for high-level vision defined the regions of the image as the segments obtained using a standard bottom-up over-segmentation approach [8], [9]. However, since over-segmentation approaches are agnostic to the task at hand, these regions may not capture the boundaries between the scene entities accurately. To address this issue, some works have suggested heuristics for selecting a good over-segmentation [5], [6]. However, even the best over-segmentation approach is unlikely to provide regions of sufficient accuracy.

In order to obtain regions that capture the boundaries accurately, some efforts have been made to combine multiple over-segmentations. For example, Pantofaru et al. [10] suggested taking the intersection of multiple over-segmentations. However, such an approach results in very small regions. Other models suggest using overlapping regions [11], [12], [13], [14]. However, the pixels within the overlap can support two contradicting hypotheses (that is, they can belong to two different semantic classes), thereby overcounting the data.

• M. Pawan Kumar is with the Center for Visual Computing, Ecole Centrale Paris, Châtenay-Malabry, 92295, France. E-mail: [email protected]
• Haithem Turki, Dan Preston and Daphne Koller are with the Computer Science Department, Stanford University, Stanford, CA, 94305. E-mail: {hturki,dpreston,koller}@cs.stanford.edu

In this work, we use the region-based model of Gould et al. [15], which consists of two layers. The variables of the first layer correspond to the image pixels. A labeling of the first layer (that is, an assignment of values to the variables) divides the pixels into contiguous, non-overlapping regions. The variables of the second layer correspond to the regions defined by the first layer. A labeling of the second layer assigns each region to a unique semantic class, thereby providing the segmentation of the image. The real-valued energy function of the model measures the desirability of an overall labeling (that is, the combined labeling of the first and second layers): the lower the energy, the better the labeling. Given an image, its semantic segmentation is inferred by minimizing the energy over all possible labelings. The advantage of the above formulation is two-fold: (i) in addition to the segmentation, the regions themselves are inferred (including the number of regions), which implies that, unlike in other models, the regions are task dependent; and (ii) the regions are connected and non-overlapping, thereby avoiding data overcounting.

While the region-based model of Gould et al.

[15] possesses several desirable qualities, its practical deployment poses two formidable problems. First, minimizing the energy associated with this model is extremely challenging due to the large number of possible pixel-to-region assignments. Furthermore, the regions are required to be connected, a difficult constraint to impose [16], [17]. Second, the energy function consists of several thousand parameters that need to be estimated accurately in order to obtain good segmentations via energy minimization.

In order to address the difficulty posed by energy

minimization, Gould et al. [15] proposed a method that constructs a large dictionary of putative regions using multiple over-segmentations obtained by changing the parameters of a bottom-up approach. Specifically, the putative regions are defined as sets of pixels obtained by merging and intersecting the segments with each other. While merging segments together provides large regions, their intersection with small segments ensures that they align well with the boundary. The set of non-overlapping regions of the image is selected from the dictionary by minimizing the energy function using a simple greedy algorithm. However, while we would expect the dictionary itself to contain accurate regions, the greedy algorithm is susceptible to getting stuck in a bad local minimum. In order to alleviate this issue, we formulate the problem of selecting the best set of regions from a large dictionary as an integer program and design an accurate linear programming (LP) relaxation for it. Furthermore, we show how the LP relaxation can be solved efficiently by suitably modifying the dual decomposition framework [18].

In order to estimate the parameters of the region-based model, Gould et al. [15] devised a piecewise learning approach, called closed loop learning, which relies on a fully supervised training dataset. However, we argue that their approach suffers from two significant drawbacks. First, piecewise learning can result in a bad solution. Second, and more importantly, the collection of fully supervised data is an onerous task, as reflected by the small size of the current datasets for semantic segmentation [15], [19]. In order to overcome these problems, we propose the use of diverse data, where the level of annotation varies among the training samples, from pixelwise segmentation to bounding boxes and image-level labels. We design a principled framework for learning with diverse data, with the aim of exploiting the varying degrees of information in the different datasets to the fullest. Specifically, we formulate the parameter learning problem using a latent structural support vector machine (LSVM) [20], [21], where the latent variables model any missing information in the annotation. In order to optimize the objective function of the LSVM using the self-paced learning algorithm [22], we modify our LP relaxation based energy minimization algorithm such that it can complete the annotation of weakly supervised images.

We demonstrate the efficacy of our novel energy minimization and parameter estimation methods using large, publicly available datasets. Earlier versions of this article have appeared as [23], [24].

2 THE REGION-BASED MODEL

We begin by providing a formal description of the two-layer region-based model of Gould et al. [15]. The first layer of the model consists of random variables V^P that correspond to the pixels of a given image X. Each random variable can take one label from the set L^P = {1, 2, · · · , R}, which represents the region to which the pixel belongs. Here R is the total number of regions in a given image (which has to be inferred automatically). A labeling of the first layer is a vector Y^P that divides the pixels into connected, non-overlapping regions. The second layer consists of random variables V^R(Y^P) that correspond to the regions defined by Y^P. Each random variable can take one label from the set L^R = {1, 2, · · · , L}, where L is the total number of semantic classes we consider. The desired segmentation is provided by a labeling Y^R of the second layer. A labeling of the entire model is denoted by Y = (Y^P, Y^R). The energy of the model consists of two types of potentials, unary and pairwise, which we define in terms of the image features and the model parameters. The potentials and the energy function are described in detail in the following subsections.

2.1 Unary Potential

For each variable v_r ∈ V^R(Y^P) (corresponding to a region r) we define a unary potential θ_r(Y^R_r; X) for assigning it the label Y^R_r. The value of the unary potential depends on the parameters as well as the features extracted from the image. In more detail, let u_r(X) denote the features extracted from the pixels belonging to the region r, which can capture shape, appearance and texture information (for example, green regions are likely to be grass or tree, while blue regions are likely to be sky). We refer the interested reader to [15] for details regarding the features used in this paper. The unary potential of assigning the region r to a semantic class i is defined as

θ_r(i; X) = w_i^⊤ u_r(X),   (1)

where w_i is the unary parameter corresponding to the class i.

2.2 Pairwise Potential

For each pair of neighboring variables, whose corresponding regions share at least one boundary pixel, we define a pairwise potential θ_rr′(Y^R_r, Y^R_r′; X) for assigning labels Y^R_r and Y^R_r′ to v_r and v_r′ respectively. Similar to the unary potential, the value of the pairwise potential depends on the parameters and the image features. In more detail, let p_rr′(X) refer to the features extracted using the pixels belonging to regions r and r′, which can capture contrast and contextual information (for example, boats are likely to be above water, while cars are likely to be above road). We refer the interested reader to [15] for details regarding the features used in this paper. The pairwise potential for assigning the regions r and r′ to semantic classes i and j respectively is defined as

θ_rr′(i, j; X) = w_ij^⊤ p_rr′(X),   (2)

where w_ij is the pairwise parameter corresponding to the classes i and j.

2.3 Energy Function

Given an image X and a labeling Y, we define the unary feature for a semantic class i as

Ψ_i(X, Y) = Σ_{r=1}^{R} δ(Y^R_r = i) u_r(X),   (3)

that is, it is the sum of the features over all the regions that are assigned the class i in the labeling Y. Similarly, we define the pairwise feature for the semantic classes i and j as

Ψ_ij(X, Y) = Σ_{(r,r′)∈E^R(Y^P)} δ(Y^R_r = i) δ(Y^R_r′ = j) p_rr′(X),   (4)

that is, it is the sum of the features over all pairs of neighboring regions that are assigned the classes i and j respectively in the labeling Y. Here, E^R(Y^P) is the set of pairs of regions that share at least one boundary pixel, as defined by Y^P. Using the unary and pairwise features, we define the joint feature vector of the input X and output Y of the model as

Ψ(X, Y) = [Ψ_i(X, Y), ∀i; Ψ_ij(X, Y), ∀i, j].   (5)

In other words, the joint feature vector of the input and the output is the concatenation of the unary features for all semantic classes and the pairwise features for all pairs of semantic classes. Similarly, we define the parameter of the model as

w = [w_i, ∀i; w_ij, ∀i, j],   (6)

that is, the parameter is the concatenation of the unary parameters for all semantic classes and the pairwise parameters for all pairs of semantic classes. The energy of the model can then be written concisely using the parameter and the joint feature vector as follows:

E(Y; X, w) = w^⊤ Ψ(X, Y)
           = Σ_{v_r ∈ V^R(Y^P)} θ_r(Y^R_r) + Σ_{(v_r, v_r′) ∈ E^R(Y^P)} θ_rr′(Y^R_r, Y^R_r′).   (7)

Note that we have dropped the input X from the notation of the individual potentials to avoid clutter.
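The identity in equation (7), namely that the sum of unary and pairwise potentials equals the dot product w^⊤Ψ(X, Y), can be checked numerically. The sketch below uses a made-up toy instance (three regions, two classes, two-dimensional features); every number is an illustrative stand-in for u_r(X), p_rr′(X) and w, not a value from the paper.

```python
# Toy instance: 3 regions, 2 semantic classes, 2-dim features. All numbers
# are illustrative stand-ins for u_r(X), p_rr'(X) and the parameters w.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vadd(a, b):
    return tuple(x + y for x, y in zip(a, b))

L = 2
u = {0: (1.0, 0.5), 1: (0.2, 0.3), 2: (0.7, 0.9)}   # u_r(X)
p = {(0, 1): (0.4, 0.1), (1, 2): (0.6, 0.2)}        # p_rr'(X), neighbouring pairs
w_un = {0: (0.3, -0.2), 1: (-0.1, 0.4)}             # w_i
w_pw = {(i, j): (0.1 * i - 0.2 * j, 0.05) for i in range(L) for j in range(L)}  # w_ij
Y = {0: 0, 1: 1, 2: 1}                              # region labels Y^R_r

# Form 1: sum of potentials theta_r(Y_r) + theta_rr'(Y_r, Y_r').
E_pot = sum(dot(w_un[Y[r]], u[r]) for r in u) \
      + sum(dot(w_pw[(Y[r], Y[s])], p[(r, s)]) for (r, s) in p)

# Form 2: w . Psi(X, Y), with Psi assembled from equations (3) and (4).
Psi_un = {i: (0.0, 0.0) for i in range(L)}
for r in u:
    Psi_un[Y[r]] = vadd(Psi_un[Y[r]], u[r])
Psi_pw = {(i, j): (0.0, 0.0) for i in range(L) for j in range(L)}
for (r, s), feat in p.items():
    Psi_pw[(Y[r], Y[s])] = vadd(Psi_pw[(Y[r], Y[s])], feat)

E_feat = sum(dot(w_un[i], Psi_un[i]) for i in range(L)) \
       + sum(dot(w_pw[ij], Psi_pw[ij]) for ij in Psi_pw)

assert abs(E_pot - E_feat) < 1e-9  # both forms of equation (7) agree
```

This equivalence is what allows the same energy to drive both the inference of Section 3 and max-margin estimation of the parameters w.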

3 ENERGY MINIMIZATION

For the model described above, we now consider the problem of obtaining the labeling Y* that minimizes the energy function. This will provide us with the best set of regions for the task (Y^P) as well as their labels (Y^R). In this section, we assume that the parameters w have been provided. In the next section, we will describe a max-margin framework for estimating the parameters using a diversely annotated training dataset.

As noted earlier, the main difficulty in energy minimization arises from the fact that there are many possible labelings Y^P that group pixels into regions. Specifically, for a given H × W image there can be as many as HW regions (that is, each pixel is a region). Hence the total number of possible labelings Y^P is (HW)^{HW}. Furthermore, we also need to ensure that the inferred labeling Y^P provides connected regions, which is well known to be a difficult constraint to impose [16], [17].

In order to overcome these problems, we make use of bottom-up over-segmentation approaches. Specifically, we minimize the energy using the following two steps: (i) construct a large dictionary of connected putative regions using multiple over-segmentations; and (ii) select the set of regions that minimize the energy (that is, infer Y^P and Y^R). We begin by describing our algorithm for selecting regions from a dictionary. We then provide details of the dictionary that we found to be effective in our experiments.

3.1 Region Selection as Optimization

Given a dictionary of regions, we wish to select a subset of regions such that: (i) the entire image is explained by the selected regions; (ii) no two selected regions overlap with each other; and (iii) the energy E(Y; X, w) is minimized. Note that the dictionary itself may contain overlapping regions of any shape or size. We do not place any restrictions on it other than the assumption that it contains at least one set of disjoint regions that explains the entire image. We formulate the above task as an integer program and provide an accurate linear programming (LP) relaxation for it.

3.1.1 Integer Programming Formulation

Before describing the integer program, we need to set up some notation. We denote the dictionary of regions by D. The intersection of all the regions in D defines a set of super-pixels S. The set of all regions that contain a super-pixel s ∈ S is denoted by C(s) ⊆ D. Finally, the set of neighboring regions is denoted by E, where two regions r and r′ are considered neighbors of each other (that is, (r, r′) ∈ E) if they do not overlap and share at least one boundary pixel.

To formulate our problem as an integer program, we define binary variables y_r(i) for each region r ∈ D and i ∈ L^I = L^R ∪ {0}. These variables indicate whether a particular region is selected and, if so, which label it takes. Specifically, if y_r(0) = 1 then the region r is not selected; otherwise, if y_r(i) = 1 where i ∈ L^R, then the region r is assigned the label i. Similarly, we define binary variables y_rr′(i, j) for all neighboring regions (r, r′) ∈ E such that y_rr′(i, j) = y_r(i) y_r′(j). Furthermore, we extend the unary and pairwise potentials to the augmented label set L^I by setting

θ_r(i) = 0 if i = 0, and θ_rr′(i, j) = 0 if i = 0 or j = 0,   (8)

leaving them unchanged for labels in L^R.
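The notation above is straightforward to realize in code: representing each region in D as a set of pixels, the super-pixels S fall out as maximal groups of pixels with identical region membership, and C(s) is exactly that shared membership. A minimal sketch, with made-up region names and pixel ids:

```python
from collections import defaultdict

# Hypothetical dictionary D: region name -> set of pixel ids (made-up values).
D = {"r1": {1, 2, 3}, "r2": {3, 4}, "r3": {4, 5}}

# A pixel's "signature" is the set of regions containing it; pixels sharing a
# signature form one cell of the intersection of all regions, i.e. a super-pixel.
signature = defaultdict(set)
for name, pixels in D.items():
    for px in pixels:
        signature[px].add(name)

cells = defaultdict(set)                 # signature -> super-pixel (pixel set)
for px, names in signature.items():
    cells[frozenset(names)].add(px)

S = [frozenset(px) for px in cells.values()]               # the super-pixels
C = {frozenset(px): names for names, px in cells.items()}  # C(s) for each s

assert frozenset({1, 2}) in S            # pixels 1 and 2 lie only in r1
assert C[frozenset({3})] == frozenset({"r1", "r2"})
```

In the real system the regions come from merged and intersected segments of multiple over-segmentations, so |D| and |S| are large; the representation above is only meant to make the notation concrete.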

The problem of obtaining the desired subset of regions can then be formulated as the following integer program:

min_{y ∈ SELECT(D)} Σ_{r∈D, i∈L^I} θ_r(i) y_r(i) + Σ_{(r,r′)∈E, i,j∈L^I} θ_rr′(i, j) y_rr′(i, j),   (9)

where the feasible set SELECT(D) is specified using the following constraints:

y_r(i), y_rr′(i, j) ∈ {0, 1},   ∀ r, r′ ∈ D, i, j ∈ L^I,
Σ_{i∈L^I} y_r(i) = 1,   ∀ r ∈ D,
Σ_{j∈L^I} y_rr′(i, j) = y_r(i),   ∀ (r, r′) ∈ E, i ∈ L^I,
Σ_{i∈L^I} y_rr′(i, j) = y_r′(j),   ∀ (r, r′) ∈ E, j ∈ L^I,
Σ_{r∈C(s)} Σ_{i∈L^R} y_r(i) = 1,   ∀ s ∈ S.   (10)

The first set of constraints ensures that the variables y are binary. The second constraint implies that each region r should be assigned exactly one label from the set L^I. The third and fourth constraints enforce y_rr′(i, j) = y_r(i) y_r′(j). The final constraint, which we call the covering constraint, restricts each super-pixel to be covered by exactly one selected region.
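For intuition, problem (9) with constraints (10) can be verified by exhaustive enumeration on a toy dictionary; enumeration is of course only feasible at this scale, which is why the LP relaxation below is needed. The regions, neighbors and potential values here are all made up for illustration.

```python
import itertools

# Tiny illustrative instance of problem (9): two super-pixels s1, s2; a
# dictionary D with regions r1 = {s1}, r2 = {s2}, r3 = {s1, s2}; two semantic
# classes L^R = {1, 2}; label 0 means "not selected". All potentials are
# made-up numbers.
D = ["r1", "r2", "r3"]
covers = {"s1": {"r1", "r3"}, "s2": {"r2", "r3"}}   # C(s) for each super-pixel
E_pairs = [("r1", "r2")]                            # neighbouring (non-overlapping) regions
LI = [0, 1, 2]                                      # augmented label set L^I

theta = {"r1": {1: 0.5, 2: 1.0},                    # theta_r(i); zero when i = 0
         "r2": {1: 0.9, 2: 0.1},
         "r3": {1: 0.8, 2: 1.2}}
theta_pw = {("r1", "r2"): {(1, 1): 0.6, (1, 2): -0.3, (2, 1): 0.2, (2, 2): 0.4}}

def energy(lab):
    e = sum(theta[r].get(lab[r], 0.0) for r in D)
    for (r, rp) in E_pairs:
        e += theta_pw[(r, rp)].get((lab[r], lab[rp]), 0.0)
    return e

def feasible(lab):
    # Covering constraint: each super-pixel is covered by exactly one selected region.
    return all(sum(lab[r] != 0 for r in covers[s]) == 1 for s in covers)

labelings = (dict(zip(D, c)) for c in itertools.product(LI, repeat=len(D)))
best = min((lab for lab in labelings if feasible(lab)), key=energy)

assert best == {"r1": 1, "r2": 2, "r3": 0}  # select r1 and r2, leave r3 out
assert abs(energy(best) - 0.3) < 1e-9
```

Note how the covering constraint rules out selecting r3 together with r1 or r2, exactly as the overlap structure demands.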

3.1.2 Linear Programming Relaxation

Problem (9) is NP-hard since y is constrained to be binary (which specifies a non-convex feasible region). However, we can obtain an approximate solution to the above problem by relaxing the constraint on y such that y_r(i) and y_rr′(i, j) take (possibly fractional) values in the interval [0, 1]. The resulting LP relaxation is similar to the standard relaxation for energy minimization in pairwise random fields [25], [26], with the exception of the additional covering constraints. However, this relaxation is very weak when the pairwise potentials are not submodular (roughly speaking, when they encourage neighboring regions to take different labels) [27]. For example, consider the case where each region is either selected or not (that is, |L^I| = 2). For two neighboring regions r and r′, the pairwise potential θ_rr′(·, ·) is 0 if one or both regions are not selected, and θ_rr′(1, 1) otherwise (as defined by equation (8)). If θ_rr′(1, 1) > 0 then the neighboring regions are encouraged to take different labels; that is, the pairwise potentials are non-submodular. This results in frustrated cycles, for which the standard LP relaxation provides a weak approximation [28] (we tested this empirically for our problem, but do not include the results due to space limitations).

There are two common ways to handle non-submodular problems in the literature: (i) applying the roof duality relaxation [28], [29]; and (ii) using message passing algorithms [30], [31], [32] based on cycle inequalities [33]. Unfortunately, neither of these methods is directly applicable in our case. Specifically, roof duality does not allow us to incorporate the covering constraints, and adding cycle inequalities still results in a weak approximation. For example, see Fig. 1, which shows a clique formed by three neighboring regions (r1, r2 and r3) along with all the regions that overlap with at least one of them. Consider the case where |L^I| = 2 and θ_rr′(1, 1) > 0, thereby resulting in non-submodular pairwise potentials (shown in Fig. 1(b)). Since each super-pixel needs to be covered by exactly one selected region, it follows that the energy of the optimal assignment for this clique will be strictly greater than 0 (as at least two neighboring regions have to be selected). However, the LP relaxation, with all the cycle inequalities added in, still provides a fractional solution whose objective function value is 0 (shown in Fig. 1(c)).

Our example shows that cycle inequalities are not sufficient to make the relaxation tight. Instead, we require all the constraints that define the marginal polytope [26] (the convex hull of the valid integral labelings) of the entire clique. We note here that the recent work of Sontag et al. [32] also advocates the use of


Fig. 1. (a) Neighboring regions r1, r2 and r3, shown using solid lines, each consist of a single super-pixel. The regions shown using dashed lines are formed by two super-pixels: specifically, r4 = r1 ∪ r2, r5 = r1 ∪ r3 and r6 = r2 ∪ r3. (b) The potentials corresponding to the clique of size 6 formed by the regions. The branches (horizontal lines) along the trellis (the vertical line on top of a region) represent the different labels that each region may take; we consider a two-label case here. The unary potential θ_r(i) is shown next to the i-th branch of the trellis on top of region r. The pairwise potential θ_rr′(i, j) is shown next to the connection between the i-th branch of r and the j-th branch of r′. The only non-zero potential, θ_rr′(1, 1) > 0, corresponds to selecting both the regions r and r′. The optimal labeling of the clique must have an energy greater than 0, since at least two neighboring regions must be selected. (c) The optimal solution of the LP relaxation. The value of y_r(i) is shown next to the i-th branch of r, and the value of y_rr′(i, j) is shown next to the connection between the i-th and j-th branches of r and r′ respectively. Note that the solution satisfies all cycle inequalities, that is, Σ_{(r,r′)∈E_C} [y_rr′(0, 0) + y_rr′(1, 1)] ≥ 1, where E_C is a cycle. Hence the solution lies within the feasible region of the relaxation. However, it can easily be verified that its objective function value is 0, thereby proving that the relaxation is not tight.

such constraints. However, in their experiments they found cliques of size three (which are also cycles of size three) to be sufficient. In contrast, we use cliques that are formed by three neighboring regions along with all the regions that overlap with at least one of the three regions. To the best of our knowledge, ours is one of the first examples in vision where constraints on cliques are essential for obtaining a good solution. Although a large number of constraints are required to specify the marginal polytope of a clique (their exact form is not important for this work), we show how the overall relaxation can be solved efficiently.

3.2 Solving the Relaxation

We use the dual decomposition framework, which is well known in the optimization community [18] and has recently been introduced in computer vision [34]. We begin by describing the general framework and then specify how it can be modified to efficiently solve our relaxation.

3.2.1 Dual Decomposition

Consider the following convex optimization problem: min_{z∈F} Σ_{k=1}^{M} g_k(z), where F represents the convex feasible region of the problem. The above problem is equivalent to the following: min_{z_k∈F, z} Σ_k g_k(z_k), s.t. z_k = z. Introducing the additional variables z_k allows us to obtain the dual problem as

max_{λ_k} min_{z_k∈F, z} Σ_k g_k(z_k) + Σ_k λ_k^⊤ (z_k − z),   (11)

where the λ_k are the Lagrange multipliers. Differentiating the dual function with respect to z, we obtain the constraint Σ_k λ_k = 0, which implies that we can discard z from the above problem. The simplified form of the dual suggests the following strategy for solving it. We start by initializing λ_k such that Σ_k λ_k = 0. Keeping the values of λ_k fixed, we solve the following slave problems: min_{z_k∈F} (g_k(z_k) + λ_k^⊤ z_k). Upon obtaining the optimal solutions z_k* of the slave problems, we update the values of λ_k by projected subgradient descent, where the subgradient with respect to λ_k is z_k*. In other words, we update λ_k ← λ_k + η_t z_k*, where η_t is the learning rate at iteration t. In order to satisfy the constraint Σ_k λ_k = 0, we project the value of λ_k as λ_k ← λ_k − (Σ_k λ_k)/M. Under fairly general conditions, this iterative strategy, known as dual decomposition, converges to the globally optimal solution of the original problem. We refer the reader to [18] for details.
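The update rules above can be exercised on a tiny decomposable problem: minimizing a sum of two linear functions over the box [0, 1]^2, where each slave has a closed-form solution. The cost vectors and step-size schedule below are made-up choices for this sketch, not values from the paper.

```python
# Illustrative dual decomposition for min over z in [0,1]^2 of g1(z) + g2(z),
# with linear slaves g_k(z) = c_k . z. Cost vectors and learning rate are
# made-up illustrative choices.
c = [(2.0, -1.0), (-3.0, -1.0)]              # c_1, c_2; joint optimum is z = (1, 1)
M, DIM = len(c), 2
lam = [[0.0] * DIM for _ in range(M)]        # Lagrange multipliers, sum_k lam_k = 0

def solve_slave(ck, lk):
    # argmin over the box [0,1]^DIM of a linear objective: decide per coordinate.
    return [1.0 if ck[d] + lk[d] < 0 else 0.0 for d in range(DIM)]

for t in range(500):
    zs = [solve_slave(c[k], lam[k]) for k in range(M)]
    eta = 1.0 / (1 + t)                      # decaying learning rate eta_t
    for k in range(M):                       # subgradient step: lam_k += eta * z_k
        for d in range(DIM):
            lam[k][d] += eta * zs[k][d]
    for d in range(DIM):                     # projection: restore sum_k lam_k = 0
        mean = sum(lam[k][d] for k in range(M)) / M
        for k in range(M):
            lam[k][d] -= mean

zs = [solve_slave(c[k], lam[k]) for k in range(M)]
assert zs[0] == zs[1] == [1.0, 1.0]          # slaves agree: consensus reached
assert abs(sum((c[0][d] + c[1][d]) * zs[0][d] for d in range(DIM)) + 3.0) < 1e-9
```

Once the slaves agree, their common solution is feasible for the original problem and, by weak duality, optimal; while they disagree, the primal-dual gap quantifies the remaining suboptimality.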

3.2.2 Dual Decomposition for Selecting Regions

When using dual decomposition, it is crucial to select slave problems whose optimal solutions can be computed quickly. With this in mind, we choose three types of slave problems. Below we describe each of these slave problems and justify their use by providing an efficient method to optimize them.

The first type of slave problem is similar to the one used for energy minimization in pairwise random fields. Specifically, each slave problem is defined using a subset of regions D_T ⊆ D and edges E_T ⊆ E that form a tree. For each such graph (D_T, E_T) we define the following problem:

min_{y≥0} Σ_{r∈D_T, i} (θ_r(i)/n_r + λ^1_r(i)) y_r(i) + Σ_{(r,r′)∈E_T, i,j} (θ_rr′(i, j)/n_rr′ + λ^1_rr′(i, j)) y_rr′(i, j),   (12)
s.t. Σ_{i∈L^I} y_r(i) = 1,   ∀ r ∈ D_T,
Σ_{j∈L^I} y_rr′(i, j) = y_r(i),   ∀ (r, r′) ∈ E_T, i ∈ L^I,
Σ_{i∈L^I} y_rr′(i, j) = y_r′(j),   ∀ (r, r′) ∈ E_T, j ∈ L^I,

where n_r and n_rr′ are the total number of slave problems involving r ∈ D and (r, r′) ∈ E respectively. As the LP relaxation is tight for trees (that is, it has integral solutions) [25], [26], the above problem can be solved efficiently using belief propagation [35].
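Since the tree-structured slave is an ordinary MAP problem on a tree, it can be solved exactly by min-sum message passing. The sketch below runs the forward/backward recursion on a small chain of three regions with made-up potential values (standing in for θ/n + λ) and checks the result against brute force.

```python
import itertools

# Min-sum dynamic programming (belief propagation) on a chain of regions.
# Potentials are made-up numbers; label 0 means "not selected".
labels = [0, 1, 2]
unary = [{0: 1.0, 1: 0.4, 2: 0.9},       # theta-hat for region 0
         {0: 0.6, 1: 0.7, 2: 0.2},       # region 1
         {0: 0.5, 1: 0.3, 2: 0.8}]       # region 2

def pair(i, j):                          # theta-hat_rr'(i, j)
    return 0.5 if (i != 0 and j != 0) else 0.0

# Forward pass: m[r][j] = best cost of the prefix ending with region r at label j.
m = [dict(unary[0])]
back = []
for r in range(1, len(unary)):
    cur, bp = {}, {}
    for j in labels:
        best_i = min(labels, key=lambda i: m[-1][i] + pair(i, j))
        cur[j] = m[-1][best_i] + pair(best_i, j) + unary[r][j]
        bp[j] = best_i
    m.append(cur)
    back.append(bp)

# Backward pass: recover the argmin labeling.
lab = [min(labels, key=lambda j: m[-1][j])]
for bp in reversed(back):
    lab.append(bp[lab[-1]])
lab.reverse()

# Sanity check against brute-force enumeration.
def cost(ls):
    return sum(unary[r][ls[r]] for r in range(3)) + sum(pair(ls[r], ls[r + 1]) for r in range(2))

assert abs(cost(lab) - min(cost(ls) for ls in itertools.product(labels, repeat=3))) < 1e-9
```

The same recursion, run leaf to root on a general tree, solves slave (12) exactly; each coordinate of λ plays the role of a message from the master problem.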


The second type of slave problem corresponds to the covering constraints. Specifically, for each s ∈ S we define:

min_{y≥0} Σ_{r∈C(s), i} (θ_r(i)/n_r + λ^2_r(i)) y_r(i)   (13)
s.t. Σ_{r∈C(s)} Σ_{i∈L^R} y_r(i) = 1,
Σ_{i∈L^I} y_r(i) = 1.   (14)

It can be verified that the constraint matrix for the above problem is totally unimodular. In other words, the optimal solution is integral and, in fact, can be found efficiently in O(L|C(s)|) time (where L is the number of region labels and |C(s)| is the number of regions in D that cover s).
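The O(L|C(s)|) scan can be written down directly: start from the cost of leaving every region in C(s) unselected, then apply the single cheapest swap of one region to a semantic label. The θ-hat values below (standing in for θ_r(i)/n_r + λ^2_r(i)) are made up; a brute-force check confirms the closed form.

```python
import itertools

# Closed-form solution of the covering slave: exactly one region in C(s) is
# selected with a semantic label; the rest take label 0. theta_hat stands in
# for theta_r(i)/n_r + lambda^2_r(i); the numbers are illustrative.
C_s = ["r1", "r2", "r3"]
LR = [1, 2]
theta_hat = {"r1": {0: 0.2, 1: 0.5, 2: -0.1},
             "r2": {0: -0.3, 1: 0.1, 2: 0.4},
             "r3": {0: 0.0, 1: -0.6, 2: 0.3}}

# O(L|C(s)|) scan: cost = sum of label-0 terms + best swap to a real label.
base = sum(theta_hat[r][0] for r in C_s)
r_star, i_star = min(((r, i) for r in C_s for i in LR),
                     key=lambda ri: theta_hat[ri[0]][ri[1]] - theta_hat[ri[0]][0])
best_cost = base + theta_hat[r_star][i_star] - theta_hat[r_star][0]

# Brute-force check over all feasible assignments (exactly one region selected).
def cost(assign):
    return sum(theta_hat[r][assign[r]] for r in C_s)

feas = [dict(zip(C_s, lab)) for lab in itertools.product([0] + LR, repeat=3)
        if sum(l != 0 for l in lab) == 1]
assert abs(best_cost - min(cost(a) for a in feas)) < 1e-9
```

Because λ^2 can take either sign, the "unselected" costs need not be zero, which is why the scan works with cost differences rather than raw potentials.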

The third type of slave problem corresponds to the clique constraints defined in §3.1. Specifically, for every clique defined by D_Q ⊆ D and E_Q ⊆ E, we specify:

min_{y ∈ M_Q}  Σ_{r∈D_Q, i} ( θ_r(i)/n_r + λ³_r(i) ) y_r(i) + Σ_{(r,r′)∈E_Q, i,j} ( θ_{rr′}(i,j)/n_{rr′} + λ³_{rr′}(i,j) ) y_{rr′}(i,j),    (15)

where M_Q is the marginal polytope of the clique. Note that since M_Q is the convex hull of all valid integral labelings and the objective function is linear, it follows that the above problem has an integral optimal solution. Furthermore, it corresponds to the minimization of a sparse higher-order function [36], [37]. In other words, although there are O((L+1)^{|D_Q|}) possible labelings of the clique, only a small fraction of them are valid. For example, consider the clique shown in Fig. 1. If region r4 is selected, then regions r1 and r2 cannot be selected, since they overlap with r4. Furthermore, we also know that exactly one of r1 and r4 must be selected. Using similar observations, the number of valid labelings of a general clique used in our relaxation (that is, a clique formed by three neighboring regions along with all the regions that overlap with at least one of them) can be shown to be O((L|C(s_Q)|)³). Here, s_Q is the super-pixel s that is covered by at least one region in D_Q and has the largest corresponding set C(s). Note that we can also use cliques defined by m neighboring regions, which can be optimized in O((L|C(s_Q)|)^m) time (we omit the derivation of the time complexity due to lack of space). However, we found m = 3 to be sufficient to obtain a tight relaxation.

We iteratively solve the above slave problems and update λ. Upon convergence, we obtain the primal solution (that is, the subset of regions and their labels) in a manner similar to [30], [34], [36]. Briefly, this involves sequentially considering each super-pixel (in some arbitrary order) and picking the best region for it (according to the values of the dual variables λ). The primal-dual gap provides us with an estimate of how good our solution is. Typically, our approach yields a very small gap.

3.3 Generating the Dictionaries

Ideally, we would like to form a dictionary that consists of all the regions obtained by merging and intersecting the segments provided by a bottom-up approach. However, this would clearly result in a very large dictionary that cannot be handled even by our efficient algorithm. Instead, we iteratively search over the regions using the following strategy. We initialize our dictionary with the regions obtained from one (very coarse) over-segmentation. In the subsequent iterations, we consider two different types of dictionaries (similar to [15]). The first dictionary consists of the current set of regions D_cur (those that have provided the best explanation of the image until now, in terms of the energy) as well as all the regions obtained by merging two neighboring regions in D_cur. The second type of dictionary uses multiple over-segmentations to define the regions. Specifically, in addition to D_cur, it also consists of all the regions obtained by merging every segment from the over-segmentations with all its overlapping and neighboring regions in D_cur. Similarly, it also contains all the regions obtained by intersecting every segment with all its overlapping regions in D_cur. While using the first dictionary results in larger regions (by merging neighboring regions together), the second dictionary allows us to correct any mistakes (merging two incompatible regions) by considering intersections of segments and regions.

It is worth noting that, unlike most of the previous works [7], [12], [38], [39], [40], our dictionaries define regions that are not just segments obtained by a bottom-up approach. This additional degree of freedom is important for obtaining regions that provide discriminative features while respecting scene boundaries.

Although dual decomposition is in general an efficient strategy for solving convex problems, when the size of the dictionary is too large, it may not be practically feasible. In order to improve the efficiency of our overall energy minimization algorithm, we use two heuristics, based on a shrinking strategy [41] and a move-making strategy [42] respectively. Since these heuristics are not central to the understanding of the remainder of the paper, we provide their descriptions in the appendices.
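The merge-and-intersect dictionary growth of §3.3 can be sketched as follows. Regions are represented as frozensets of pixel indices, and "neighbouring" is approximated by sharing pixels — an assumption for illustration; the paper uses true image adjacency:

```python
def grow_dictionary(current, segments):
    """One iteration of dictionary growth (illustrative sketch).

    Dictionary 1: the current regions plus merges of neighbouring
    current regions. Dictionary 2: additionally, merges and
    intersections of bottom-up segments with overlapping regions.
    Both kinds of candidate are accumulated into one set here.
    """
    cur = list(current)
    grown = set(cur)
    # Merge pairs of overlapping/neighbouring current regions.
    for a in cur:
        for b in cur:
            if a is not b and a & b:
                grown.add(a | b)
    # Combine bottom-up segments with overlapping current regions.
    for seg in segments:
        for r in cur:
            if seg & r:
                grown.add(seg | r)   # merge segment with region
                grown.add(seg & r)   # intersect segment with region
    return grown
```

The returned set would then be scored by energy minimization, and the best-explaining regions kept as the next D_cur.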

4 PARAMETER ESTIMATION

Given a training dataset that consists of images with different types of ground-truth annotations, our goal is to learn accurate parameters for the region-based model. To this end, we design a principled framework for learning with diverse data, with the aim of exploiting the varying degrees of information in the different datasets to the fullest: from the cleanliness


of pixelwise segmentations, to the vast availability of bounding boxes and image-level labels. Specifically, we formulate the parameter learning problem using a latent structural support vector machine (LSVM) [20], [21], where the latent variables model any missing information in the annotation. For this work, we focus on three types of missing information: (i) the specific class of a pixel labeled using a generic foreground or background class (for example, images from the VOC segmentation dataset [19] or the Stanford background dataset [15]); (ii) the segmentation of an image annotated with bounding boxes of objects (for example, images from the VOC detection dataset [19]); and (iii) the segmentation of an image labeled to indicate the presence of a class (for example, images from the ImageNet dataset [43]). We show how the energy minimization algorithm described in the previous section can be suitably modified to obtain an accurate local minimum of the optimization problem corresponding to the LSVM using the self-paced learning algorithm [22].

4.1 Learning with Generic Classes

To simplify the discussion, we first focus on the case where the ground-truth annotation of an image specifies a pixelwise segmentation that includes generic foreground or background labels. As will be seen in subsequent subsections, the other cases of interest, where the ground-truth only specifies bounding boxes for objects or image-level labels, will be handled by solving a series of LSVM problems that deal with generic class annotations.

4.1.1 Notation

As mentioned earlier in section 2, we denote an image by X and a labeling (that is, a segmentation) by Y. The joint feature vector is denoted by Ψ(X, Y), and the energy of a segmentation is equal to w⊤Ψ(X, Y), where w are the parameters of the model. The best segmentation of an image is obtained using energy minimization, that is, Y* = argmin_Y w⊤Ψ(X, Y).

We denote the training dataset as T = {(X_k, A_k), k = 1, ..., n}, where X_k is an image and A_k is the corresponding annotation. For each image X with annotation A, we specify a set of latent, or hidden, variables H such that {A, H} defines a labeling Y of the model. In other words, for each pixel p labeled using the generic foreground (background) class, the latent variable H_p models its specific foreground (background) class.

4.1.2 Learning as Risk Minimization

Given the dataset T, we learn the parameters w by training an LSVM. Briefly, an LSVM minimizes an upper bound on a user-specified risk, or loss, ∆(A, {Â, Ĥ}). Here, A is the ground-truth and {Â, Ĥ} is the predicted segmentation for a given set of parameters. In this work, we specify the loss using the overlap score, which is the measure of accuracy for the VOC challenge [19]. For an image labeled using specific foreground classes and a generic background (the label set denoted by F), we define the loss function as

∆(A, {Â, Ĥ}) = 1 − (1/|F|) Σ_{i∈F} |P_i(A) ∩ P_i({Â, Ĥ})| / |P_i(A) ∪ P_i({Â, Ĥ})|,    (16)

where the function P_i(·) returns the set of all the pixels labeled using class i. Note that when i is the generic background, P_i({Â, Ĥ}) is the set of all pixels labeled using any specific background class. A similar loss function can be defined for images labeled using specific background classes and a generic foreground (the label set B). Formally, the parameters are learned by solving the following non-convex optimization problem:

min_{w, ξ_k ≥ 0}  (λ/2) ||w||² + (1/n) Σ_{k=1}^{n} ξ_k,    (17)

s.t.  w⊤Ψ(X_k, {Â_k, Ĥ_k}) − min_H w⊤Ψ(X_k, {A_k, H}) ≥ ∆(A_k, {Â_k, Ĥ_k}) − ξ_k,  ∀ Â_k, Ĥ_k.

Intuitively, the problem requires that, for every image, the energy of the ground-truth annotation, augmented with the best value of its latent variables, should be less than the energy of all other labelings. The desired margin between the two energy values is proportional to the loss.
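For reference, the overlap-score loss of Eq. (16) can be computed directly from two pixel-to-label maps. In this sketch, a class absent from both segmentations contributes a perfect overlap of 1 — an assumption for the illustration:

```python
def overlap_loss(gt, pred, label_set):
    """Overlap-score loss of Eq. (16): one minus the mean
    intersection-over-union across the label set.
    `gt` and `pred` map pixel id -> label."""
    total = 0.0
    for lbl in label_set:
        gt_px = {p for p, l in gt.items() if l == lbl}
        pr_px = {p for p, l in pred.items() if l == lbl}
        union = gt_px | pr_px
        # Absent-from-both classes count as perfect overlap (assumption).
        total += len(gt_px & pr_px) / len(union) if union else 1.0
    return 1.0 - total / len(label_set)
```

For two labels where each has an intersection-over-union of 0.5, the loss is 0.5.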

4.1.3 Learning an LSVM

Algorithm 1 describes the main steps of learning an LSVM using our recently proposed self-paced learning (SPL) algorithm [22]. Unlike the concave-convex procedure (CCCP) [20], [21] used in previous works, which treats all images equally, SPL automatically chooses a set of easy images at each iteration (in step 4), and uses only those images to update the parameters.

Following our earlier work [22], we choose the initial threshold σ_0 such that half the images are considered easy in the first iteration, and set the annealing factor µ = 1.3. These settings have been shown to work well in practice for a large number of applications [22]. In order to avoid learning from images whose latent variables are never imputed correctly, we measure the accuracy of the model after each iteration using a validation set (different from the test set). We report test results using the parameters that are the most accurate on the validation set.

Let us take a closer look at what is needed to learn the parameters of our model. The first step of SPL requires us to impute the latent variables of each training image X, given the ground-truth annotation A, by solving problem (18). In other words, this step completes the segmentation to provide us with a positive example. We call this annotation-consistent inference. The second step requires us to find a segmentation with low energy but high loss (a negative


Algorithm 1 The self-paced learning algorithm for LSVM.

input: T, w_0, σ_0, µ, ε.
1: w ← w_0, σ ← σ_0.
2: repeat
3:   Impute the latent variables as
       H_k = argmin_H w⊤Ψ(X_k, {A_k, H}).    (18)
4:   Compute the slack variables ξ_k as
       ξ_k = max_{Â_k, Ĥ_k} ∆(A_k, {Â_k, Ĥ_k}) − w⊤Ψ(X_k, {Â_k, Ĥ_k}) + w⊤Ψ(X_k, {A_k, H_k}).    (19)
     Using ξ_k, define variables v_k = δ(ξ_k ≤ σ), where δ(·) = 1 if its argument is true and 0 otherwise.
5:   Update the parameters by solving the following problem, which only considers easy images:
       min_{w, ξ_k ≥ 0}  (λ/2) ||w||² + (1/n) Σ_{k=1}^{n} v_k ξ_k,    (20)
       s.t.  w⊤( Ψ(X_k, {Â_k, Ĥ_k}) − Ψ(X_k, {A_k, H_k}) ) ≥ ∆(A_k, {Â_k, Ĥ_k}) − ξ_k,  ∀ Â_k, Ĥ_k.
6:   Change the threshold σ ← σµ.
7: until the decrease in objective (17) is below tolerance ε and all images have been labeled as easy.

example) by solving problem (19). We call this loss-augmented inference. The third step requires solving the convex optimization problem (20). In this work, we solve it using stochastic subgradient descent [44], where the subgradient for a given image X is the joint feature vector Ψ(X, {A*, H*}), where {A*, H*} is obtained by solving problem (19). We refer the interested reader to [44] for details.

To summarize, in order to learn from generic class segmentations, we require two types of inference algorithms: annotation-consistent inference and loss-augmented inference. In the remainder of this subsection, we describe these inference algorithms for the region-based model. However, we note that both types of inference algorithm can be designed for several existing models by suitably modifying their energy minimization algorithms, which would enable us to learn their parameters using datasets labeled with generic classes.
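The outer loop of Algorithm 1 can be sketched as follows, with the imputation, slack computation, and convex update abstracted behind callables. All four callables are placeholders for the paper's actual steps, problems (18)-(20):

```python
def self_paced_learning(examples, impute, slack, update_w, w0,
                        sigma0=1.0, mu=1.3, max_outer=20):
    """Skeleton of self-paced learning for LSVM (sketch).

    impute(w, x)    -> imputed latent variables (problem (18)).
    slack(w, x, h)  -> slack xi_k of Eq. (19).
    update_w(w, easy) -> parameters from the easy set (problem (20)).
    """
    w, sigma = w0, sigma0
    for _ in range(max_outer):
        latent = [impute(w, x) for x in examples]
        xis = [slack(w, x, h) for x, h in zip(examples, latent)]
        # v_k = 1 iff xi_k <= sigma: keep only the easy examples.
        easy = [(x, h) for (x, h), xi
                in zip(zip(examples, latent), xis) if xi <= sigma]
        w = update_w(w, easy)
        sigma *= mu  # anneal the threshold so harder examples enter
        if len(easy) == len(examples):
            break
    return w
```

A toy 1-D instance (scalar "parameters", slack = distance to the current estimate) shows the hard outlier being excluded at first and absorbed once σ has been annealed enough.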

4.1.4 Annotation-Consistent Inference

The goal of annotation-consistent inference is to impute the latent variables that minimize the energy under the constraint that they do not contradict the ground-truth annotation (which specifies a pixelwise segmentation using generic classes). In other words, a pixel marked as a specific class must belong to a region labeled as that class. Furthermore, a pixel marked as generic foreground (background) must be labeled using a specific foreground (background) class.

For a given dictionary of regions D, annotation-consistent inference is equivalent to the following integer program (IP):

min_{y ∈ SELECT(D)}  θ⊤y   s.t.  ∆(A, y) = 0.    (21)

The set SELECT(D) refers to the set of all valid assignments to y, as specified in equation (10). The constraint ∆(A, y) = 0 (where we have overloaded ∆ for simplicity to consider y as its argument) ensures that the imputed latent variables are consistent with the ground-truth. Similar to the energy minimization problem discussed in the previous section, the above IP is solved approximately by relaxing the elements of y to take values between 0 and 1, resulting in a linear program (LP).

Fig. 2(a) shows examples of the segmentations obtained using annotation-consistent inference over different iterations of SPL. Note that the segmentations obtained are able to correctly identify the specific classes of pixels labeled using generic classes. The quality of the segmentations, together with the ability of SPL to select correct images to learn from, results in an accurate set of parameters.

4.1.5 Loss-Augmented Inference

The goal of loss-augmented inference is to find a labeling that minimizes the energy while maximizing the loss (as shown in problem (19)), which can be formulated as the following IP:

min_{y ∈ SELECT(D)}  θ⊤y − ∆(A, y).    (22)

Unfortunately, relaxing y to take fractional values in the interval [0, 1] for the above problem does not result in an LP. This is due to the dependence of ∆ on the labeling y in both its numerator and denominator (see equation (16)). We address this issue by adopting a two-stage strategy: (i) replace ∆ by another loss function that results in an LP relaxation; and (ii) using the solution of the first stage as an accurate initialization, solve problem (22) via iterated conditional modes (ICM). For the first stage, we use the macro-average error over all classes as the loss function, that is,

∆′(A, {Â, Ĥ}) = 1 − (1/|L|) Σ_{i∈L} |P_i(A) ∩ P_i({Â, Ĥ})| / |P_i(A)|,    (23)

where L is the appropriate label set (F for images labeled using specific foreground and generic background, B for images labeled using specific background and generic foreground). Note that the denominator of ∆′ does not depend on the predicted labeling. Hence, it can be absorbed into the unary potentials, leading to a pairwise energy minimization problem, which can be solved using our LP relaxation.
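Absorbing ∆′ into the unaries amounts to adding 1/(|L| · |P_i(A)|) to the unary of each pixel's ground-truth label (the constant 1 can be dropped), so that minimizing θ⊤y − ∆′ penalizes agreement with the ground truth. A pixel-level sketch — the paper works with region unaries, so the dictionary-based layout here is an illustrative assumption:

```python
def loss_augment_unaries(unaries, gt, labels):
    """Fold the macro-average loss of Eq. (23) into pixel unaries.

    unaries[p][i]: unary potential of pixel p for label i.
    gt[p]: ground-truth label of pixel p.
    Returns a new table with 1/(|L| * |P_i(A)|) added to each pixel's
    ground-truth label, leaving the input untouched.
    """
    counts = {i: sum(1 for p in gt if gt[p] == i) for i in labels}
    out = {p: dict(u) for p, u in unaries.items()}  # shallow-safe copy
    for p in gt:
        out[p][gt[p]] += 1.0 / (len(labels) * counts[gt[p]])
    return out
```

The augmented unaries can then be handed to the same LP relaxation used for plain energy minimization.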


Fig. 2. Labelings obtained using annotation-consistent inference during different iterations of SPL. (a) Images annotated with generic classes (columns: image, annotation, iterations 1, 3 and 6). Column 2 shows the annotation (where checkered patterns indicate generic classes). In columns 3-5, pixels labeled with the correct specific class by annotation-consistent inference are shown in white, while pixels labeled with the wrong specific class are shown in black (we labeled these images with specific-class annotations only for the purpose of illustration; these annotations were not used during training). A blue surrounding box on the labeling implies that the example was selected as easy by SPL, while a red surrounding box indicates that it wasn't selected during the specified iteration. Note that SPL discards the image where the cow (row 2) is incorrectly labeled. (b) Images annotated using bounding boxes (columns: image with box, inferred annotation, iterations 1, 2 and 4). Column 2 shows the annotation obtained using bounding-box inference. Note that the objects have been accurately segmented. Furthermore, SPL discards the image where the sky (row 2) is incorrectly labeled.

In our experiments, ICM converged within very few iterations (typically less than 5) when initialized in this manner. As will be seen in section 5, the approximate subgradients provided by the above loss-augmented inference were sufficient to obtain an accurate set of parameters.

4.2 Learning with Bounding Boxes

We now focus on learning specific-class segmentation from training images with user-specified bounding boxes for instances of some classes. To simplify our description, we make the following assumptions: (i) the image contains only one bounding box, which provides us with the location information for an instance of a specific foreground class; and (ii) all the pixels that lie outside the bounding box belong to the generic background class. We note that our approach can be trivially extended to handle cases where these assumptions do not hold (for example, bounding boxes for background classes or multiple boxes per image).

The major obstacle in using bounding box annotations is the lack of a readily available loss function that compares a bounding box B to a pixelwise segmentation {A, H}. Note that it would be unwise to use a loss function that compares two bounding boxes (the ground-truth and one derived from the predicted segmentation), as such a function would not be compatible with the overlap-score loss used in the previous section. In other words, minimizing it would not necessarily improve the segmentation accuracy. We address this issue by adopting a simple, yet effective, strategy that solves a series of LSVM problems for generic classes. Our approach consists of three steps:

• Given an image X and its bounding box annotation B, we infer its segmentation Y^B using the current set of parameters (say, the parameters learned from generic class segmentation data). The segmentation is obtained by minimizing an objective function that augments the energy of the model with terms that encourage the segmentation to agree with the bounding box (see details below).
• Using the above segmentation, we define a generic class annotation A of the image (see details below).
• The annotations A are used to learn the parameters.

The new parameters then update the segmentation, and the entire process is repeated until convergence (that is, until the segmentations stop changing). Note that the third step simply involves learning an LSVM as described in the previous section. We describe the first two steps in more detail below.

4.2.1 Using the Bounding Box for Segmentation.

We assume that the specified bounding box is tight (a safe assumption for most datasets) and penalize any row and column of the bounding box that is not covered by the corresponding class. Here, a row or column is said to be covered if it contains a sufficient number of pixels s that have been assigned to the corresponding class of the bounding box. Formally, given a bounding box B of class c, we define an annotation A′ such that all the pixels p inside the bounding box


have no label specified in A′ (denoted by A′_p = 0) and all the pixels outside the bounding box have a generic background label (consistent with our assumption). Furthermore, we define latent variables H that model the specific semantic class of each pixel. Using A′ and B, we estimate H as follows:

H = argmin_H  w⊤Ψ(X, {A′, H}) + Σ_q κ_q I_q(H, c).    (24)

Here q indexes the rows and columns of B, and I_q is an indicator function for whether q is covered by the latent variables H. The penalty κ_q has a high value κ_max for the center row and center column, and decreases at a linear rate to κ_min at the boundary. In our experiments, we set κ_max = 10 κ_min and cross-validated the value of κ_min using a small set of images. We found that our method produced very similar segmentations over a large range of κ_min values. We refer to problem (24) as bounding-box inference.
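The penalty schedule described above — κ_max = 10 κ_min at the centre row or column, decaying linearly to κ_min at the box boundary — can be written as, for instance:

```python
def kappa_schedule(num_rows, kappa_min):
    """Per-row (or per-column) coverage penalties kappa_q for
    bounding-box inference: highest at the centre of the box,
    decreasing linearly to kappa_min at the boundary (sketch)."""
    kappa_max = 10.0 * kappa_min   # the paper's setting
    centre = (num_rows - 1) / 2.0
    out = []
    for q in range(num_rows):
        frac = abs(q - centre) / centre if centre > 0 else 0.0
        out.append(kappa_max - frac * (kappa_max - kappa_min))
    return out
```

For a three-row box this gives the boundary rows the minimum penalty and the centre row the maximum.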

Note that I_q adds a higher-order potential to the energy of the model, since its value depends on the labels of all the pixels in a particular row or column. However, the potential is sparse (that is, it only takes a non-zero value for a small number of labelings). Hence, the above problem can be optimized efficiently for any semantic segmentation model [36]. In the following, we describe bounding-box inference for our region-based model.

4.2.2 Bounding-Box Inference

Given a dictionary of regions D and a bounding box B of class c, we obtain the segmentation by solving the LP relaxation of the following IP:

min_{y ∈ SELECT(D), y_q ∈ {0,1}}  θ⊤y + Σ_q κ_q (1 − y_q),    (25)

s.t.  ∆(A′, y) = 0,   y_q ≤ Σ_{r∈C(q)} y_r(c).

Here y_q is a boolean variable whose value is the complement of the indicator function I_q in problem (24). Note that, for our model, a row or a column is considered covered if at least one region overlapping with it is assigned the class c. The loss function ∆ is measured over all pixels that lie outside the bounding box, which are assumed to belong to the generic background class. Fig. 2(b) shows some example annotations obtained from bounding-box inference, together with the results of annotation-consistent inference during different iterations of SPL. The quality of the annotations and the ability of SPL to select good images ensure that our model is trained without noise.

Lempitsky et al. [45] have suggested a method to obtain a binary segmentation of an image with a user-specified bounding box. However, our setting differs from theirs in that, unlike the low-level vision model used in [45] (likelihoods from RGB values, contrast-dependent penalties), we use a more sophisticated high-level model, which encodes information about specific classes and their pairwise relationships using a region-based representation. Hence, we can resort to a much simpler optimization strategy and still obtain accurate segmentations.

4.2.3 From Segmentation to Annotation.

Let Y^B = {A′, H} denote the labeling obtained from bounding-box inference. Using Y^B we define an annotation A as follows. For each pixel p inside the bounding box that was labeled as class c, that is, Y^B_p = c, we define A_p = c. For pixels p inside the bounding box such that Y^B_p ≠ c, we define A_p = 0. In other words, during annotation-consistent inference these pixels can belong to any class, foreground or background. The reason for specifying A_p in this manner is that, while we are fairly certain that the pixels labeled Y^B_p = c do belong to class c, due to the lack of information in the annotation we are not sure which class the other pixels belong to. Not labeling such pixels prevents the mistakes made during bounding-box inference from being used to learn the parameters. Finally, for all the pixels outside the bounding box we set A_p = A′_p, that is, they are labeled as generic background.
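The mapping from the bounding-box segmentation Y^B to the annotation A can be sketched as follows; representing pixels as dictionary keys and using the string 'bg' as a stand-in for the generic background label are illustrative assumptions:

```python
def box_segmentation_to_annotation(seg, box_pixels, c):
    """Derive the generic-class annotation A from a bounding-box
    segmentation Y^B. Inside the box: keep label c where predicted,
    leave other pixels unlabelled (0). Outside the box: generic
    background ('bg'). `seg` maps pixel id -> predicted label."""
    ann = {}
    for p, lbl in seg.items():
        if p in box_pixels:
            ann[p] = c if lbl == c else 0
        else:
            ann[p] = 'bg'
    return ann
```

Only the confident class-c pixels inside the box constrain annotation-consistent inference; everything else is left open or forced to generic background.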

4.3 Learning with Image-Level Labels

We use a three-step process similar to the one described above for bounding boxes in order to take advantage of the numerous images with image-level labels, which indicate the presence of a class. In more detail, given an image containing class c, we define an annotation A′ that does not specify a label for any pixel of the image (that is, A′_p = 0 for all p). We estimate the value of the latent variables H that model the specific semantic class of each pixel by solving the following image-label inference problem:

H = argmin_H  w⊤Ψ(X, {A′, H}) + κ_max I(H, c).    (26)

Here I is an indicator function for whether the image is covered by the latent variables H. Similar to a row or a column of a bounding box, an image is considered covered if a sufficient number of pixels s are assigned the label c. Once again, the function I introduces a sparse higher-order potential that can be handled efficiently [36]. This would allow us to learn a general semantic segmentation model using the method described above. For the region-based model, we perform image-label inference as follows.

Given a dictionary D and an image containing class c, we obtain a segmentation by solving the LP relaxation of the following IP:

min_{y ∈ SELECT(D), ȳ ∈ {0,1}}  θ⊤y − κ_max ȳ,   s.t.  ȳ ≤ Σ_{r∈D} y_r(c).    (27)

The value of ȳ is the complement of the indicator function I in problem (26). Once again, SPL reduces


the noise during training by selecting images with correct annotations and latent variables.

To obtain an annotation A from the above segmentation, we label all pixels p belonging to class c as A_p = c. For the rest of the pixels, we define A_p = 0. The annotations obtained in this manner are used to refine the parameters, and the entire process is repeated until convergence.

5 EXPERIMENTS

We now demonstrate the efficacy of our energy minimization and parameter estimation approaches using large, publicly available datasets.

5.1 Energy Minimization

We compare our LP relaxation based approach for energy minimization with three baselines: (i) regions obtained by the intersection of multiple over-segmentations of an image; (ii) regions defined by the segments of the best single over-segmentation (in terms of the energy of the model); and (iii) regions selected from a dictionary similar to ours by the method of [15] (using the code provided by the authors). For each method, we use the same set of parameters, which are learned by the closed loop learning (CLL) technique of [15].

5.1.1 Dataset

We use the publicly available Stanford background dataset [15]. It consists of 715 images (collected from standard datasets such as PASCAL, MSRC and geometric context) whose pixels have been labeled as belonging to one of seven background classes or a generic foreground class. For each image we use three over-segmentations obtained by employing different kernels for the standard mean-shift algorithm [9]. Similar to [15], we split the dataset into 572 images for training and 143 images for testing. We report results on four different splits.

5.1.2 Results

We evaluate the accuracy of the labeling obtained by measuring the percentage of pixels whose labels matched the ground-truth. Here, the label of a pixel is the label assigned to the region to which it belongs. Table 1 shows the average energy and accuracy for the different approaches. Note that all region-based methods outperform the pixel-based approach described in [15] (which provides comparable results to [4]). However, the choice of the regions greatly affects the value of the energy and hence the accuracy of the segmentation. By using large dictionaries and an accurate LP relaxation, our approach provides a statistically significant improvement (using a paired t-test with p = 0.05) over the other methods, both in terms of energy and accuracy. Our algorithm takes less than 10 minutes per image on average on a 2.4 GHz processor.

TABLE 1
The mean and standard deviation of the energy and the pixel-wise accuracy over four folds.

Method          Energy          Accuracy
Pixel           -               76.65 ± 1.20
Intersection    16150 ± 2005    76.84 ± 1.34
Segmentation    6796 ± 833      77.85 ± 1.50
[15]            4815 ± 592      78.52 ± 1.40
Our Method      1630 ± 306      79.42 ± 1.41

The first row corresponds to a pixel-based model [15]. The second row uses regions obtained by intersecting the three over-segmentations. The third row shows the results obtained using the best single over-segmentation (in terms of the energy function). The fourth row corresponds to the method described in [15]. The fifth row shows our method's results. Using multiple over-segmentations to define an accurate dictionary and selecting the regions with our LP relaxation based method results in a lower-energy labeling that provides better accuracy.

Fig. 3 shows some example segmentations obtained by the various approaches. Note that the regions corresponding to the intersection of over-segmentations respect the boundaries of the scene entities. However, they are too small to provide reliable features. Even using the best single over-segmentation results in regions that are not large enough. The method of [15] overcomes this problem to some extent by using an accurate dictionary of regions. However, as it is prone to getting stuck in a bad local minimum, it sometimes selects regions that result in a high-energy labeling. Our approach addresses this deficiency by allowing us to make large moves at each iteration.

5.2 Parameter Estimation

Despite obtaining a substantial improvement in terms of the energy, our LP relaxation approach only results in a small improvement in terms of the segmentation accuracy. This is due to the fact that the parameters learned using CLL are not suited to our inference approach. To alleviate this, we use our LSVM formulation to estimate the parameters using large, publicly available datasets that specify varying levels of annotation for the training images.

5.2.1 Generic Class Annotations

Comparison. We show the advantage of the LSVM formulation over CLL, which was specially designed for the region-based model, for the problem of learning a specific-class segmentation model using generic class annotations.

Datasets. We use two datasets: (i) the VOC2009 segmentation dataset, which provides us with annotations consisting of 20 specific foreground classes and a generic background; and (ii) SBD, which provides us with annotations consisting of 7 specific background classes and a generic foreground. Thus, we consider 27 specific classes, which results in a harder learning problem compared to methods that use only the


Fig. 3. Examples of semantic segmentation obtained using different types of regions (columns: image, pixel-based, intersection, best single over-segmentation, [15], our method). Our method provides large regions that align with the scene boundaries.

VOC2009 segmentation dataset or SBD. The total size of the dataset is 1846 training (1274 from VOC2009 and 572 from SBD), 278 validation (225 from VOC2009 and 53 from SBD) and 840 test (750 from VOC2009 and 90 from SBD) images. For CLL, the validation set is used to learn the pairwise potentials and several hyper-parameters, while for LSVM it is used for early stopping (see section 4.1).

Results. Tables 2 and 3 (rows 1 and 2) show the accuracies obtained for the VOC2009 and SBD test images respectively. The accuracies are measured using the overlap score, that is, 1 − ∆(A, {Â, Ĥ}), where A is the ground-truth and {Â, Ĥ} is the predicted segmentation. While both CLL and LSVM produce specific-class segmentations of all the test images, we use generic classes while measuring performance due to the lack of specific-class ground-truth annotations. Note that LSVM provides better accuracies for nearly all the object classes in VOC2009 (17 of 21 classes). For SBD, LSVM provides a significant boost in performance for 'sky', 'road', 'grass' and 'foreground'. With the exception of 'building', the accuracies for the other classes are comparable. The reason for the poor performance on the 'mountain' class is that several 'mountain' pixels are labeled as 'tree' in SBD (which confuses both learning algorithms). Our results convincingly demonstrate the advantage of using LSVM.

5.2.2 Bounding Box Annotations

Comparison. We now compare the model learned using only generic class annotations with the model that is learned by also considering bounding box annotations. In keeping with the spirit of SPL, we use the previous LSVM model (learned using easier examples) as initialization for learning with the additional bounding boxes.

Datasets. In addition to VOC2009 and SBD, we use some of the bounding box annotations that were introduced in the VOC2010 detection dataset. Our criteria for choosing the images are that (i) they were not present in the VOC2009 detection dataset (which was used to obtain detection-based image features for the model); and (ii) none of their bounding boxes overlapped with each other. This provides us with an additional 1564 training images that have not previously been used to learn a segmentation model.

Results. Tables 2 and 3 (row 3) show the accuracies obtained by training on the above dataset for VOC2009 and SBD respectively. Once again, we observe an improvement in the accuracies for nearly all the VOC2009 classes (18 of 21 classes) compared to the LSVM trained using only generic class annotations. For SBD, we obtain a significant boost for ‘tree’, ‘water’ and ‘foreground’, while the accuracies of ‘road’, ‘grass’ and ‘mountain’ remain (almost) unchanged.

Table 3. Accuracies for the SBD test set. See the caption of Table 2 for an explanation of the various methods.

5.2.3 Image-Level Annotations

Comparison. We compare the model learned using generic class and bounding box annotations with the model that is learned by also considering image-level labels. Once again, following the idea of SPL closely, we use the model learned in the previous subsection as an initialization for learning with the additional image-level labels.

Datasets. In addition to SBD, the VOC2009 segmentation dataset and the VOC2010 detection dataset, we use a subset of the ImageNet [43] dataset, which provides tags indicating the presence of an object class in an image. Due to time limitations, we restrict ourselves to 1000 randomly chosen images from ImageNet. Our only criterion for choosing the images was that they must contain at least 1 of the 20 foreground classes from the VOC datasets.

Table 2. Accuracies for the VOC2009 test set. The first row shows the results obtained using CLL [15] with a combination of VOC2009 and SBD training images. The second row shows the results obtained using SPL for LSVM with the same training set. The third row shows the results obtained using an additional 1564 bounding box annotations. The fourth row shows the results obtained by further augmenting the training dataset with 1000 image-level annotations. The best accuracy for each class is underlined. The fifth row shows the results obtained when the LSVM is learned using CCCP on the entire dataset.

Fig. 4. Example segmentations (columns: input image, CLL, our method). The first two rows show the results obtained for images from the VOC2009 test set. Note that, unlike our approach, which learns the parameters using LSVM on a large dataset, CLL mislabels background pixels with the wrong specific classes. The last two rows show images from the SBD test set. While our approach is able to identify most of the foreground pixels correctly, CLL mislabels them as background.

Results. Tables 2 and 3 (row 4) show the accuracies obtained by training on the above dataset for VOC2009 and SBD respectively. For the VOC2009 segmentation test set, the final model learned from all the training images provides the best accuracy for 12 of the 21 classes. Compared to the model learned using generic class labels and bounding boxes, we obtain a significant improvement for 13 classes by incorporating image-level annotations. Of the remaining 8 classes, the accuracies are comparable for ‘bird’, ‘boat’, ‘chair’ and ‘train’. For the SBD test set, the model trained using all the data obtains the highest accuracy for 5 of the 8 classes. Fig. 4 shows examples of the specific-class segmentations obtained using our method. Note that the parameters learned using our approach on a large dataset are able to correctly identify the specific classes of pixels.

5.2.4 SPL vs. CCCP

Comparison. We now test the hypothesis that our SPL algorithm is better suited for learning with diverse data than the previously used CCCP algorithm.

Datasets. We use the entire dataset, consisting of strongly supervised images from the SBD and VOC2009 segmentation datasets and weakly supervised images from the ImageNet and VOC2010 detection datasets.

Results. Tables 2 and 3 (row 5) show the accuracies obtained using CCCP for VOC2009 and SBD respectively. Note that CCCP does not provide any improvement over CLL, which is trained using only the strongly supervised images, in terms of the average overlap score for VOC2009. While the overlap score improves for the SBD dataset, the improvement is significantly better when using SPL (row 4). These results convincingly demonstrate that, unlike CCCP, SPL is able to handle the noise inherent in the problem of learning with diverse data.

6 DISCUSSION

The region-based model of Gould et al. [15] presents an intuitive definition of regions, namely, regions are sets of connected pixels that do not overlap with each other and minimize the energy for the task under consideration. To aid the use of this model, we proposed an energy minimization algorithm that avoids the problem of getting stuck in a bad local minimum. Furthermore, we presented a principled LSVM framework for learning the parameters of the model using diverse data, and convincingly demonstrated its benefits using large, publicly available datasets.



Our LP relaxation approach can also be used to perform energy minimization in other region-based models, such as those developed for geometric reconstruction [39], object discovery [46], and object detection and segmentation [38], [40]. Furthermore, the accuracy of the method may also help in the development of a successful unified scene understanding model.

While our parameter estimation approach focused

on only three types of annotations, it can be extended to handle other types of data. For example, instead of just a two-level hierarchy (the generic classes and the specific classes), we can consider a general hierarchy of labels, such as the one defined in ImageNet [43] (‘Ferrari’ and ‘Honda’ sub-classes for ‘car’). Such a hierarchy can be viewed as a tree over labels. Given a pixel p annotated with a non-leaf label l, we can specify a latent variable Hp that models its leaf-level label (where the leaves are restricted to lie in the sub-tree rooted at l). The loss function and the inference algorithms for generic classes can be trivially modified to deal with this more general case.

Our ongoing work is in two directions: (i) dealing

with noisy labels, such as the labels obtained from Google Images or Flickr; and (ii) improving the efficiency of SPL by exploiting the fact that very easy images can be discarded during later iterations. Both these directions are aimed towards learning a segmentation model from the millions of freely available images on the Internet.
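The label-hierarchy extension discussed above restricts the latent variable Hp to the leaves of the sub-tree rooted at the annotated label l. A minimal sketch of that restriction follows; the dictionary-based tree representation and the ‘car’/‘Ferrari’/‘Honda’ example are illustrative assumptions:

```python
def leaf_candidates(hierarchy, label):
    """Leaf-level values the latent variable H_p may take when a pixel
    is annotated with (possibly non-leaf) label `label`: the leaves of
    the sub-tree rooted at `label`.

    hierarchy: dict mapping each label to its list of children
    (leaves map to an empty list).
    """
    children = hierarchy.get(label, [])
    if not children:                      # `label` is already a leaf
        return [label]
    leaves = []
    for child in children:                # gather leaves of each sub-tree
        leaves.extend(leaf_candidates(hierarchy, child))
    return leaves
```

For instance, with `{'car': ['Ferrari', 'Honda'], 'Ferrari': [], 'Honda': []}`, a pixel annotated ‘car’ has candidate leaf labels ['Ferrari', 'Honda'], while a pixel already annotated ‘Honda’ has only itself.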

REFERENCES

[1] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother, “A comparative study of energy minimization methods for Markov random fields with smoothness-based priors,” PAMI, 2008.

[2] X. He, R. Zemel, and M. Carreira-Perpinan, “Multiscale conditional random fields for image labeling,” in CVPR, 2004.

[3] S. Konishi and A. Yuille, “Statistical cues for domain specific image segmentation with performance analysis,” in CVPR, 2000.

[4] J. Shotton, J. Winn, C. Rother, and A. Criminisi, “TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation,” in ECCV, 2006.

[5] D. Larlus and F. Jurie, “Combining appearance models and Markov random fields for category level object segmentations,” in CVPR, 2008.

[6] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie, “Objects in context,” in ICCV, 2007.

[7] A. Saxena, M. Sun, and A. Ng, “Make3D: Learning 3D scene structure from a single still image,” PAMI, 2008.

[8] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, “From contours to regions: An empirical evaluation,” in CVPR, 2009.

[9] D. Comaniciu and P. Meer, “Mean shift analysis and applications,” in ICCV, 1999.

[10] C. Pantofaru, C. Schmid, and M. Hebert, “Object recognition by integrating multiple segmentations,” in ECCV, 2008.

[11] J. Gonfaus, X. Boix, J. Van de Weijer, A. Bagdanov, J. Serrat, and J. Gonzalez, “Harmony potentials for joint classification and segmentation,” in CVPR, 2010.

[12] P. Kohli, L. Ladicky, and P. Torr, “Robust higher order potentials for enforcing label consistency,” in CVPR, 2008.

[13] L. Ladicky, C. Russell, P. Kohli, and P. Torr, “Associative hierarchical CRFs for object class image segmentation,” in ICCV, 2009.

[14] F. Li, J. Carreira, and C. Sminchisescu, “Object recognition as ranking holistic figure-ground hypotheses,” in CVPR, 2010.

[15] S. Gould, R. Fulton, and D. Koller, “Decomposing a scene into geometric and semantically consistent regions,” in ICCV, 2009.

[16] S. Nowozin and C. Lampert, “Global connectivity potentials for random field models,” in CVPR, 2009.

[17] S. Vicente, V. Kolmogorov, and C. Rother, “Graph cut based image segmentation with connectivity priors,” in CVPR, 2008.

[18] D. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.

[19] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” IJCV, 2010.

[20] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” in CVPR, 2008.

[21] C.-N. Yu and T. Joachims, “Learning structural SVMs with latent variables,” in ICML, 2009.

[22] M. P. Kumar, B. Packer, and D. Koller, “Self-paced learning for latent variable models,” in NIPS, 2010.

[23] M. P. Kumar and D. Koller, “Efficiently selecting regions for scene understanding,” in CVPR, 2010.

[24] M. P. Kumar, H. Turki, D. Preston, and D. Koller, “Learning specific-class segmentation from diverse data,” in ICCV, 2011.

[25] C. Chekuri, S. Khanna, J. Naor, and L. Zosin, “A linear programming formulation and approximation algorithms for the metric labelling problem,” Disc. Math., 2005.

[26] M. Wainwright, T. Jaakkola, and A. Willsky, “MAP estimation via agreement on trees: Message passing and linear programming,” IEEE Info. Theory, 2005.

[27] P. Hammer, “Some network flow problems solved with pseudo-Boolean programming,” Operations Research, 1965.

[28] C. Rother, V. Kolmogorov, V. Lempitsky, and M. Szummer, “Optimizing binary MRFs via extended roof duality,” in CVPR, 2007.

[29] E. Boros and P. Hammer, “Pseudo-Boolean optimization,” Discrete Applied Mathematics, 2002.

[30] N. Komodakis and N. Paragios, “Beyond loose LP-relaxations: Optimizing MRFs by repairing cycles,” in ECCV, 2008.

[31] M. P. Kumar and P. Torr, “Efficiently solving convex relaxations for MAP estimation,” in ICML, 2008.

[32] D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss, “Tightening LP relaxations for MAP using message passing,” in UAI, 2008.

[33] F. Barahona and A. Mahjoub, “On the cut polytope,” Mathematical Programming, 1986.

[34] N. Komodakis, N. Paragios, and G. Tziritas, “MRF optimization via dual decomposition: Message-passing revisited,” in ICCV, 2007.

[35] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[36] N. Komodakis and N. Paragios, “Beyond pairwise energies: Efficient optimization for higher-order MRFs,” in CVPR, 2009.

[37] C. Rother, P. Kohli, W. Feng, and J. Jia, “Minimizing sparse higher order functions of discrete variables,” in CVPR, 2009.

[38] C. Gu, J. Lim, P. Arbelaez, and J. Malik, “Recognition using regions,” in CVPR, 2009.

[39] D. Hoiem, A. Efros, and M. Hebert, “Geometric context from a single image,” in ICCV, 2005.

[40] L.-J. Li, R. Socher, and L. Fei-Fei, “Towards total scene understanding: Classification, annotation and segmentation in an automatic framework,” in CVPR, 2009.

[41] T. Joachims, “Making large-scale SVM learning practical,” in Advances in Kernel Methods. MIT Press, 1999.

[42] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” PAMI, 2001.

[43] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.

[44] S. Shalev-Shwartz, Y. Singer, and N. Srebro, “Pegasos: Primal estimated sub-gradient solver for SVM,” in ICML, 2007.

[45] V. Lempitsky, P. Kohli, C. Rother, and T. Sharp, “Image segmentation with a bounding box prior,” in ICCV, 2009.

[46] B. Russell, A. Efros, J. Sivic, W. Freeman, and A. Zisserman, “Using multiple segmentations to discover objects and their extent in image collections,” in CVPR, 2006.
