
Smooth Image Segmentation by Nonparametric Bayesian Inference

Peter Orbanz and Joachim M. Buhmann

Institute of Computational Science, ETH Zurich
{porbanz, jbuhmann}@inf.ethz.ch

Abstract. A nonparametric Bayesian model for histogram clustering is proposed to automatically determine the number of segments when Markov Random Field constraints enforce smooth class assignments. The nonparametric nature of this model is implemented by a Dirichlet process prior to control the number of clusters. The resulting posterior can be sampled by a modification of a conjugate-case sampling algorithm for Dirichlet process mixture models. This sampling procedure estimates segmentations as efficiently as clustering procedures in the strictly conjugate case. The sampling algorithm can process both single-channel and multi-channel image data. Experimental results are presented for real-world synthetic aperture radar and magnetic resonance imaging data.

1 Introduction

Unsupervised data clustering and image segmentation models usually assume that an appropriate number of classes is either known a priori or specified by the data analyst. More sophisticated methods automatically select the number of clusters, e.g. by resampling strategies [1]. Recently, nonparametric Bayesian models based on Dirichlet processes have successfully been applied to machine learning problems such as natural language processing [2] and object categorization [3]. These models perform automatic model selection by supporting a range of prior choices for the number of classes; the different resulting models are then scored by the likelihood according to the observed data.

The question of how automatic model selection can be performed in image segmentation for models such as Markov random fields plays an important role in computer vision; see e.g. [4] for recent work employing a Bayesian information criterion. Our approach, which is based on Dirichlet processes, combines spatial constraints on class labels with an estimate of a preferred number of clusters. The smoothness constraints are modeled as a Markov random field (MRF) on a neighborhood graph. To combine MRF image models for segmentation with a nonparametric selection of the segment number, the Dirichlet process prior is enhanced by a smoothness constraint on the label field. Local feature histograms are extracted from the image and grouped by histogram clustering. Adjacent image patches are assigned to the same cluster with high probability if they are neighbors with respect to the neighborhood graph of the MRF.

A. Leonardis, H. Bischof, and A. Pinz (Eds.): ECCV 2006, Part I, LNCS 3951, pp. 444–457, 2006. © Springer-Verlag Berlin Heidelberg 2006


The paper is organized as follows: Sec. 2 briefly reviews Dirichlet process mixture (MDP) models and their application to data clustering. We discuss their combination with MRFs in Sec. 3, and the histogram clustering model used for application to image segmentation in Sec. 4. Sec. 5 proposes an MCMC sampling algorithm to sample the combined model. Experimental results are given in Sec. 6.

2 Data Clustering with MDP Models

The statistical model considered in this work is a Dirichlet process mixture (MDP) model [5]. MDP approaches belong to a class of models referred to as nonparametric Bayesian models. An MDP clustering model consists of three principal ingredients: a parametric likelihood function F, a probability distribution G_0, which is referred to as the base measure, and a Dirichlet process DP(αG_0) parameterized by the base measure and a positive constant α ∈ R+. In this article, the base measure G_0 will generally be assumed to be infinite. Under the MDP model, a set of distinct classes is assumed to generate the observed data x_1, . . . , x_n. Each class has a generative distribution, described by the likelihood F. Each cluster (indexed by k) is characterized by a parameter value θ*_k, so the data within the cluster is generated according to x ∼ F(· | θ*_k). This makes MDP models conceptually similar to finite parametric mixture models. MDP models generate the parameter values θ*_k, which characterize the classes, according to a Dirichlet process DP(αG_0). In contrast to parametric mixture models, the number of classes is not a constant, and will change during the sampling process.

Formally, models based on Dirichlet processes draw a distribution G at random from a stochastic process [5]. The sample values drawn by means of the DP, the mixture parameters θ_1, . . . , θ_n, are assumed to be generated by the distribution G:

\theta_1, \ldots, \theta_n \sim G \quad \text{with} \quad G \sim \mathrm{DP}(\alpha G_0) . \qquad (1)

The practical applicability of the process, however, is based on the observation that the distribution G can be integrated out. Given a set of samples θ_1, . . . , θ_n, a new sample θ_{n+1} has a closed-form conditional distribution:

\theta_{n+1} \mid \theta_1, \ldots, \theta_n \;\sim\; \frac{1}{n+\alpha} \sum_{i=1}^{n} \delta_{\theta_i}(\theta_{n+1}) \;+\; \frac{\alpha}{n+\alpha}\, G_0(\theta_{n+1}) , \qquad (2)

where δ_θ denotes the Dirac measure concentrated at θ. Therefore, sampling the Dirichlet process generates random values in the domain of the base measure G_0, but with a different distribution than the one specified by G_0.

A draw from the distribution (2) will, with probability n/(n+α), yield a sample value which has already occurred. (Provided that G_0 is infinite, a draw from the second term in (2) will generate a previously unobserved value with probability one.) If any two samples θ_i, θ_j are identical, the corresponding Dirac measures coincide. One may therefore group the samples θ_1, . . . , θ_n into N_C ≤ n classes


containing identical values. Each class k ∈ {1, . . . , N_C} is characterized by its associated sample value, denoted θ*_k. Denoting the number of samples in group k by n_k, the distribution (2) may be rewritten as a sum over clusters rather than individual samples:

p_{n+1}(\theta_{n+1} \mid \theta_1, \ldots, \theta_n) := \sum_{k=1}^{N_C} \frac{n_k}{n+\alpha}\, \delta_{\theta^*_k}(\theta_{n+1}) \;+\; \frac{\alpha}{n+\alpha}\, G_0(\theta_{n+1}) . \qquad (3)

The distribution may be regarded as a mixture model. It contains N_C finite (degenerate) components, which correspond to the clusters already created, and the base measure component, which is responsible for the creation of new classes. The probability of occurrence for each cluster is proportional to its size. The probability for a new class to be created is adjusted by means of the DP parameter α. Definition (3) also implies that DP(αG_0) can be sampled efficiently, if we provide an algorithm to sample the base measure G_0.
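As an illustration of how (3) can be sampled in practice, the following Python sketch draws a sequence of parameters from DP(αG_0); the one-dimensional Gaussian base measure and all names are our own illustrative choices, not part of the model.

    import numpy as np

    def sample_dp(n, alpha, base_measure, rng=np.random.default_rng(0)):
        """Draw theta_1, ..., theta_n from DP(alpha * G0) via the conditional (3)."""
        thetas, counts, assignments = [], [], []   # cluster parameters, sizes, labels
        for i in range(n):
            # existing cluster k has weight n_k, a new cluster has weight alpha
            weights = np.array(counts + [alpha], dtype=float)
            k = rng.choice(len(weights), p=weights / weights.sum())
            if k == len(thetas):
                thetas.append(base_measure(rng))   # new value drawn from G0
                counts.append(1)
            else:
                counts[k] += 1
            assignments.append(k)
        return thetas, counts, assignments

    # Example: G0 = N(0, 10^2); larger alpha yields more clusters on average.
    thetas, counts, _ = sample_dp(500, alpha=1.0,
                                  base_measure=lambda rng: rng.normal(0.0, 10.0))
    print(len(thetas), "clusters generated for 500 samples")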

Data generation (of n data values x_1, . . . , x_n) according to an MDP model can be summarized by

x_i \sim F(\,\cdot \mid \theta_i\,), \qquad \theta_i \sim p_i(\theta_i \mid \theta_1, \ldots, \theta_{i-1}) . \qquad (4)

Inference of this model is not as straightforward as sampling (3), since the observed data is x_1, . . . , x_n, whereas the DP distribution is conditional on θ_1, . . . , θ_n. The generative model in (4) has to be sampled conditional on the observed data x_i. A sampling algorithm as described in Sec. 5 obtains estimates of the parameters θ_i. MDP models perform automatic model selection, since the number of clusters is determined by the dynamics of the process, i.e., it is not an input parameter. New classes are generated during the sampling process. When sampling the parameter θ_i for a given data value x_i, the data value may be assigned to an existing class k (by setting θ_i := θ*_k). The probability for this to happen depends on the likelihood F(x_i|θ*_k) and on the number of points already assigned to the class in question (since large classes, with a large value of n_k, are more probable than small ones). If the cluster distribution provides a good description of the data, the probability of assignment is high, since the likelihood F assumes a large value. If this is not the case for any existing cluster, a new cluster is created for the data value with high probability. Generation of a new cluster corresponds to sampling from the base measure term in (3).

The applicability of MDP models to clustering problems in machine learning and computer vision may be best illustrated by the following observation: Any parametric mixture model of the form

m(x \mid t_1, \ldots, t_K, c_1, \ldots, c_K) = \sum_{k=1}^{K} c_k\, r(x \mid t_k) \qquad (5)

can be used within the MDP clustering framework by setting F = r and placing a suitable prior G_0 on the parameter t_k. The prior serves as the base measure.


The class parameters t_k are substituted by samples θ*_k generated by a process DP(αG_0) as described above, and the cluster sizes n_k are analogous to the mixture weights c_k in the parametric case. Given a set of observed data values x_1, . . . , x_n, sampling the MDP model will result in a set of estimates θ_1, . . . , θ_n for the corresponding class parameters. By grouping identical values, the parameter estimates implicitly determine the number of clusters, class assignments of the data and the mixture proportions of the model.

3 Markov Random Field Constraints and Dirichlet Process Models

This section describes how Markov random field models can be integrated with an MDP clustering approach. Our objective is to obtain a model capable of combining the clustering and model selection performed by the MDP with smoothness constraints on the class labels. The model is applicable to any clustering problem for which it is reasonable to assume a spatially coherent class structure, such as segmentation of noisy images: To obtain smooth segments, an MRF constraint encourages adjacent points in the image to be assigned to the same class.

Consider a clustering problem with vectorial input data x_1, . . . , x_n. Each point x_i is assumed to be generated according to a parameter vector θ_i. Two points are considered to originate from the same cluster if their respective parameter vectors are identical. The cluster assignment of feature x_i is denoted by S_i ∈ {1, . . . , N_C}. We will use the notation θ_{−i} (or S_{−i}) to denote the set of all parameters (or cluster assignments) with the value corresponding to feature i removed. The MDP clustering model for this problem is once again defined by a likelihood F and a base measure G_0 to parameterize the Dirichlet process.

To combine the MDP model with an MRF, we restrict the choice of MRF constraints to pairwise difference priors [6], which are commonly used to model spatial smoothness of the label field. The MRF definition is based on an undirected neighborhood graph N, and we write l ∈ ∂(i) to denote that the feature of index l is a neighbor of feature i. The MRF prior Π consists of two components,

\Pi(\theta) \;\propto\; P(\theta)\, M(\theta) . \qquad (6)

P is a parametric prior on the parameter θ, which will be referred to as the initial prior. It is used to model initial beliefs about which parameter values are likely to occur. M is an MRF contribution term of the form M(θ_i) ∝ exp(−H(θ_1, . . . , θ_n)), H being a cost function defined on the neighborhood graph N. The term M is used to model constraints such as smoothness, which are conditional on the neighborhood of a feature. M defines a pairwise difference prior if the cost function assumes the form H(θ_i | θ_{−i}) = Σ_{l ∈ ∂(i)} w_{il} Φ(θ_i − θ_l), where Φ is a non-negative, even function and w_{il} are weights associated with the edges of the graph. Conditional on θ_{−i}, the prior for θ_i is given by

\Pi(\theta_i \mid \theta_{-i}) \;\propto\; P(\theta_i \mid \theta_{-i})\, \exp\!\left(-H(\theta_i \mid \theta_{-i})\right) . \qquad (7)


In the above relations, normalization constants have been neglected, because in many practical cases, M will be improper.

The MDP approach and MRF constraints are combined by drawing the initial prior P in (6) from a DP. The resulting generative model is summarized by

x_i \sim F(x_i \mid \theta_i), \qquad \theta_i \sim M(\theta_i \mid \theta_{-i})\, P(\theta_i), \qquad P \sim \mathrm{DP}(\alpha G_0) . \qquad (8)

To obtain a conditional form of this model, i.e. a form in which the random measure P does not occur explicitly, the conditional MDP prior (3) is substituted into (7). For a fixed-size data set x_1, . . . , x_n, the sequential form (3) of the conditional prior is rewritten as a prior for θ_i given the remaining parameter values:

p_n(\theta_i \mid \theta_{-i}) \;\propto\; \sum_{k=1}^{N_C} n_{-i,k}\, \delta_{\theta^*_k}(\theta_i) \;+\; \alpha\, G_0(\theta_i) , \qquad (9)

where n_{-i,k} denotes the number of observations assigned to cluster k when x_i is removed from the set. The conditional form of the combined MDP/MRF prior is then given by

\Pi(\theta_i \mid \theta_{-i}) \;\propto\; p_n(\theta_i \mid \theta_{-i})\, \exp\!\left(-H(\theta_i \mid \theta_{-i})\right) . \qquad (10)

Smoothness constraints for clustering problems are formulated on the cluster assignments, so the MRF cost function is a function defined on labels. A cost function modeling spatial smoothness measures whether or not neighboring features are assigned to the same cluster. This binary notion of similarity between neighbors is expressed by cost functions of the general form

H(S_i \mid S_{-i}) = \sum_{l \in \partial(i)} \delta_{S_i, S_l}\, \phi(S_1, \ldots, S_n) , \qquad (11)

as proposed by Geman et al. [7]. A special property of the MDP setting is the one-to-one correspondence between cluster labels and cluster parameters (since two sites belong to the same cluster if and only if their class parameters θ are identical). The correspondence admits an equivalent formulation of the cost function (11) in terms of class parameters:

H(\theta_i \mid \theta_{-i}) = \sum_{l \in \partial(i)} \delta_{\theta_i, \theta_l}\, \phi(\theta_1, \ldots, \theta_n) . \qquad (12)

Combination of the resulting MRF with the conditional MDP prior (9) affects only the first, finite term, because the support of H is a subset of {θ_1, . . . , θ_n}. A random value θ ∼ G_0 drawn from an infinite base measure will be different from any value in supp(H) with probability one, and therefore

M(\theta_i \mid \theta_{-i})\, G_0(\theta_i) = G_0(\theta_i) \qquad (13)


almost surely. The relation holds irrespective of any particular choice of G_0 and H. Intuitively, (13) expresses the modeling assumption that the MRF constraint should encourage uniform assignments of neighbors. The MRF contribution is non-trivial for a given label S_i only if one or more neighbors of x_i are assigned to the same class as x_i. Since a draw from the base measure will always result in the creation of a new class, the MRF term does not affect the base measure term.

4 The Histogram Clustering Model

The primary focus of this article is on histogram clustering, with application to image segmentation. The input features are composed of a set of n histograms h_i = (h_{i1}, . . . , h_{iN_bins}), h_{ij} ∈ N_0, representing local intensity distributions of a digital image. They replace the data values x_1, . . . , x_n in the previous sections. Each histogram is associated with a pixel location in the image, referred to as a site. All histograms contain an identical number N_counts of values.

Given a vector θ_i of bin probabilities, a random histogram h_i is multinomially distributed with density

F(h_i \mid \theta_i) = \frac{1}{Z_M(h_i)} \exp\!\left( \sum_{j=1}^{N_{\mathrm{bins}}} h_{ij} \log \theta_{ij} \right) .

Each vector θ_i is assumed to be drawn from the respective conjugate prior, a Dirichlet distribution

G_0(\theta_i \mid \beta, \pi) = \frac{1}{Z_D(\beta, \pi)} \exp\!\left( \sum_{j=1}^{N_{\mathrm{bins}}} (\beta \pi_j - 1) \log \theta_{ij} \right) ,

where β ∈ R+ and π is a vector representing a finite probability distribution on N_bins elements.
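For reference, both log-densities can be evaluated directly in log-space; Z_M and Z_D are the multinomial and Dirichlet normalization constants, and the function names below are our own illustrative choices.

    import numpy as np
    from scipy.special import gammaln

    def log_multinomial(h, theta):
        """log F(h | theta) for an integer histogram h with bin probabilities theta."""
        log_zm = gammaln(h + 1.0).sum() - gammaln(h.sum() + 1.0)   # log Z_M(h)
        return np.dot(h, np.log(theta)) - log_zm

    def log_dirichlet(theta, beta, pi):
        """log G0(theta | beta, pi) for concentration beta and base distribution pi."""
        a = beta * pi
        log_zd = gammaln(a).sum() - gammaln(a.sum())               # log Z_D(beta, pi)
        return np.dot(a - 1.0, np.log(theta)) - log_zd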

To apply MRF constraints to the image segmentation problem, two features are defined as neighbors in the MRF neighborhood graph N if their associated sites are neighbors in the image. These neighborhoods are either of size D = 4 (two horizontal and two vertical neighbors) or D = 8 (all direct neighbors), cropped at the image boundaries. The cost function is of the form (12). For the sake of simplicity, φ in (12) is chosen to depend only on a scale parameter λ (defined once for the whole image) and the size of the neighborhood:

H(\theta_i \mid \theta_{-i}) = \lambda \sum_{l \in \partial(i)} \left( D - \delta_{\theta_i, \theta_l} \right) , \qquad (14)

where D = 4 or D = 8, respectively. Thus, exp(−H) = exp(−λD) for feature h_i if no neighbor is assigned to the same cluster. If one or more neighbors are assigned to the same class, exp(−H) will increase and thus favor the assignment.
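A direct evaluation of the weight exp(−H) in (14) for a candidate label of site i, given the current labels of its neighbors, might look as follows; λ, D and the neighbor list are placeholders.

    import math

    def mrf_weight(candidate, neighbor_labels, lam, D):
        """exp(-H) from Eq. (14); grows when more neighbors carry the candidate label."""
        H = lam * sum(D - (candidate == l) for l in neighbor_labels)
        return math.exp(-H)

    # Example: 4-neighborhood (D = 4), three of four neighbors share the candidate label.
    print(mrf_weight(2, [2, 2, 2, 5], lam=0.5, D=4))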

The model may be extended to the case of multiple histograms available at each site. This extension makes the method applicable to color images, where a single one-dimensional histogram is drawn from each color channel at each site, and to radar images with multiple channels representing different frequency bands. Another possible application is the inclusion of additional filter information, by applying a filter transform to the image and drawing histograms from the filter response. For example, texture information may be included in the form of Gabor filter response histograms. Suppose that C histograms h^l_i = (h^l_{i1}, . . . , h^l_{iN_bins}), l = 1, . . . , C, are available at each site i. First consider the basic parametric Bayesian model without the DP, consisting of the


multinomial likelihood and Dirichlet prior in the single-channel case. To model multiple channels, the different channels are assumed to be independent. Each marginal histogram h^l_i is parameterized by its own vector θ^l_i of bin probabilities, and we write θ_i := (θ^1_i, . . . , θ^C_i). Due to independence, the joint likelihood F(h^1_i, . . . , h^C_i | θ^1_i, . . . , θ^C_i) factors into a product over the channel likelihoods F(h^l_i | θ^l_i). Each parameter vector θ^l_i is drawn from a Dirichlet distribution G^l_0(θ | β^l, π^l), resulting in the model

F(h \mid \theta)\, G_0(\theta) = \prod_{l=1}^{C} F(h^l_i \mid \theta^l_i)\, G^l_0(\theta^l_i \mid \beta^l, \pi^l) . \qquad (15)

The MDP/MRF generative model for multichannel data is then obtained by substituting F(h|θ) and G_0(θ) into the generative model (8).

5 Sampling

The algorithm proposed here to sample the combined MDP/MRF model is a Markov chain Monte Carlo procedure similar to the algorithm proposed by MacEachern [8] for sampling MDP models with a conjugate likelihood/base measure pair. Each iteration samples a set of cluster assignments S_1, . . . , S_n for all sites. New estimates of the cluster parameters θ*_k are then sampled conditional on the assignments S_i and the observed data. Due to the way in which the finitely supported cost function of the MRF acts on the MDP model, some key formulas reduce to the conjugate case. As a consequence, the sampling approach remains applicable despite the fact that the constrained model is not conjugate. It is easily extended to the case of multiple channels.

To sample a cluster assignment S_i given a current set of parameters θ_1, . . . , θ_n and the datum x_i, the posterior probability of occurrence for each class is computed by integrating the complete model over θ_i:

\int_{\Omega_\theta} \exp\!\left(-H(\theta_i \mid \theta_{-i})\right) F(x_i \mid \theta_i) \left( \sum_{k=1}^{N_C} n_{-i,k}\, \delta_{\theta^*_k}(\theta_i) + \alpha\, G_0(\theta_i) \right) d\theta_i
\;=\; \sum_{k=1}^{N_C} n_{-i,k}\, \exp\!\left(-H(\theta^*_k \mid \theta_{-i})\right) F(x_i \mid \theta^*_k) \;+\; \alpha \int_{\Omega_\theta} F(x_i \mid \theta_i)\, G_0(\theta_i)\, d\theta_i . \qquad (16)

Since H(θ|θ_{−i}) ≠ 0 only if θ ∈ {θ*_1, . . . , θ*_{N_C}}, exp(−H(θ|θ_{−i})) ≠ 1 holds only on a set of Lebesgue measure zero. Such a set does not affect the value of the integral, and the MRF contribution term may therefore be neglected in the base measure integral, as we have done above. Each term in (16) corresponds to a single cluster (with the integral involving the base measure G_0 corresponding to the creation of a new group), and we define cluster proportions by setting

q_{i0} := \alpha \int_{\Omega_\theta} F(x_i \mid \theta_i)\, G_0(\theta_i)\, d\theta_i ,
\qquad
q_{ik} := n_{-i,k}\, \exp\!\left(-H(\theta^*_k \mid \theta_{-i})\right) F(x_i \mid \theta^*_k) . \qquad (17)


These proportions are transformed into cluster probabilities by normalization,

q_{ik} := \frac{q_{ik}}{\sum_{j=0}^{N_C} q_{ij}} . \qquad (18)

A cluster assignment S_i is sampled by sampling from the finite probability distribution defined by the vector (q_{i0}, . . . , q_{iN_C}). In the second step, new values for the cluster parameters θ*_k are chosen by sampling from the class posterior, i.e. the posterior based on all data values currently assigned to the given class:

\theta^*_k \sim G_0(\theta^*_k) \prod_{i \mid S_i = k} F(x_i \mid \theta^*_k) . \qquad (19)

The combined MDP/MRF model is thus sampled by the following algorithm:

Algorithm 1 (MDP/MRF Sampling)
Initialize: Generate θ ∼ G_0 and set θ_i = θ for i = 1, . . . , n.
Repeat:
1. For i = 1, . . . , n:
   (a) If x_i is the only feature assigned to its cluster k = S_i, remove this cluster.
   (b) For k = 0, . . . , N_C, compute the component probabilities q_{ik} according to eqs. (17) and (18).
   (c) Draw a random index k according to the finite distribution (q_{i0}, . . . , q_{iN_C}).
   (d) Assignment:
       - If k ∈ {1, . . . , N_C}, assign x_i to cluster k.
       - If k = 0, create a new cluster for x_i.
2. For each cluster k = 1, . . . , N_C: Update the cluster parameters θ*_k given the class assignments S_1, . . . , S_n by sampling

\theta^*_k \sim G_0(\theta^*_k) \prod_{i \mid S_i = k} F(x_i \mid \theta^*_k) . \qquad (20)
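The following condensed Python sketch performs one sweep of Algorithm 1, specialized to the multinomial/Dirichlet histogram model of Sec. 4 (the closed-form marginal and posterior it uses are derived below in Eqs. (21) and (22)). The data layout, helper names and use of numpy are our own illustrative choices, not part of the original formulation.

    import numpy as np
    from scipy.special import gammaln

    def log_marginal(h, beta, pi):
        """log of Eq. (21) without alpha and without 1/Z_M(h), which cancels in Eq. (18)."""
        a = beta * pi
        return (gammaln(a + h).sum() - gammaln((a + h).sum())
                - gammaln(a).sum() + gammaln(a.sum()))

    def gibbs_sweep(H, labels, thetas, neighbors, alpha, beta, pi, lam, D, rng):
        """One sweep of Algorithm 1. H: (n, Nbins) histogram array, labels: numpy int array of
        assignments, thetas: dict cluster id -> bin probabilities, neighbors: neighbor index lists."""
        n = H.shape[0]
        for i in range(n):
            old = labels[i]
            labels[i] = -1                          # (a) remove site i; drop its cluster if empty
            if not np.any(labels == old):
                del thetas[old]
            ks = list(thetas.keys())
            logq = []                               # (b) log q_ik for existing clusters, Eq. (17)
            for k in ks:
                nk = np.sum(labels == k)
                mrf = -lam * sum(D - (labels[l] == k) for l in neighbors[i])   # -H, Eq. (14)
                logq.append(np.log(nk) + mrf + np.dot(H[i], np.log(thetas[k])))
            logq.append(np.log(alpha) + log_marginal(H[i], beta, pi))          # log q_i0, Eq. (21)
            logq = np.array(logq)                   # (c) normalize (Eq. 18) and draw an index
            p = np.exp(logq - logq.max()); p /= p.sum()
            choice = rng.choice(len(p), p=p)
            if choice == len(ks):                   # (d) new cluster, parameter from its posterior
                k_new = max(thetas, default=-1) + 1
                thetas[k_new] = rng.dirichlet(beta * pi + H[i])
                labels[i] = k_new
            else:
                labels[i] = ks[choice]
        for k in list(thetas):                      # step 2: resample theta*_k from Eq. (22)
            thetas[k] = rng.dirichlet(beta * pi + H[labels == k].sum(axis=0))
        return labels, thetas

A full segmentation run would repeat gibbs_sweep until burn-in (cf. Sec. 6) and read the segment labels off the final assignments.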

In the histogram clustering model introduced above for the single-channel case, F is a multinomial distribution, G_0 a Dirichlet distribution and the observed data x_i are the histograms h_i. Due to the conjugacy of F and G_0, the integral required for the computation of q_{i0} may be solved analytically:

q_{i0} = \alpha \int_{\Omega_\theta} F(x_i \mid \theta_i)\, G_0(\theta_i)\, d\theta_i = \alpha\, \frac{Z_D(h_i + \beta \pi)}{Z_D(\beta \pi)\, Z_M(h_i)} . \qquad (21)

Conjugacy also implies that the class posterior (20) is a Dirichlet distribution, with the prior parameters updated by the data assigned to the cluster:

G_0(\theta^*_k \mid \beta \pi) \prod_{i \mid S_i = k} F(x_i \mid \theta^*_k) \;=\; G_0\!\left( \theta^*_k \,\Big|\, \sum_{i \mid S_i = k} h_i + \beta \pi \right) . \qquad (22)


Efficient sampling algorithms based on gamma samples are available for this distribution [9], which ensures the feasibility of step 2 of the algorithm.
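One standard gamma-based construction (see [9]) draws independent gamma variates with the Dirichlet parameters as shapes and normalizes them; a minimal sketch, with illustrative names:

    import numpy as np

    def sample_dirichlet(a, rng=np.random.default_rng()):
        """One draw from Dirichlet(a) via independent Gamma(a_j, 1) variates."""
        g = rng.gamma(shape=np.asarray(a, dtype=float))
        return g / g.sum()

    # For step 2 of Algorithm 1, a = beta * pi + sum of the histograms assigned to cluster k.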

In the case of multiple channels, products of multinomial and Dirichlet distributions have to be substituted for F and G_0 in the derivation above, assuming that the different channels are statistically independent. Since the MRF term is defined on class labels, it applies to all channels, rather than to each individual channel. The cluster proportions are computed according to

q_{i0} := \alpha \int_{\Omega_\theta} \prod_{l=1}^{C} \left( F(h^l_i \mid \theta^l_i)\, G^l_0(\theta^l_i \mid \beta^l, \pi^l) \right) d\theta_i \;=\; \alpha \prod_{l=1}^{C} \frac{Z_D(h^l_i + \beta^l \pi^l)}{Z_D(\beta^l \pi^l)\, Z_M(h^l_i)} ,
\qquad
q_{ik} := n_{-i,k}\, \exp\!\left(-H(\theta^*_k \mid \theta_{-i})\right) \prod_{l=1}^{C} F(x^l_i \mid \theta^{*l}_k) . \qquad (23)

The class posterior turns into a product of Dirichlet distributions, each of which may be sampled individually:

\prod_{l=1}^{C} \left( G^l_0(\theta^{*l}_k \mid \beta^l \pi^l) \prod_{i \mid S_i = k} F(h^l_i \mid \theta^{*l}_k) \right) \;=\; \prod_{l=1}^{C} G^l_0\!\left( \theta^{*l}_k \,\Big|\, \sum_{i \mid S_i = k} h^l_i + \beta^l \pi^l \right) . \qquad (24)

Sampling of the Dirichlet process for the multichannel model is thus conducted by parallel Dirichlet process sampling procedures applied to the individual channels. The channels couple through the class assignments S_i, and through the MRF contribution defined on these labels.

6 Experimental Results

The experiments presented in this section were conducted on two classes of noisy images, synthetic aperture radar (SAR) and magnetic resonance imaging (MRI) data. Aside from the visual quality of the segmentations, we especially study two model selection questions: (i) How does the hyperparameter of the Dirichlet process influence the model selection (i.e. the number of segments selected)? (ii) How do results compare to other model selection methods?

The histograms used in the experiments shown here were extracted from a digital image by centering a square window around each pixel on an equidistant grid and sorting the intensity values of all pixels within the window into a histogram. Choosing the size of the histogram window generally results in a trade-off between regularity and detail: Using a large window will smooth segmentation results, but coarsen the resolution. Small windows preserve detail, but usually give less robust segmentation results. Using a model with a smoothness constraint permits the choice of small windows. For the experiments shown below, histograms were obtained from a five-by-five pixel sliding window, centered at each node of a rectangular grid of width two.
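A minimal sketch of this extraction step (grayscale image, 5×5 window, grid stride 2); the bin count and the decision to skip boundary sites instead of cropping the window are our own simplifications.

    import numpy as np

    def local_histograms(image, n_bins=32, win=5, stride=2):
        """Intensity histogram of a win x win window centered at each grid site."""
        img = np.asarray(image, dtype=float)
        edges = np.linspace(img.min(), img.max() + 1e-9, n_bins + 1)
        r = win // 2
        sites, hists = [], []
        for y in range(r, img.shape[0] - r, stride):
            for x in range(r, img.shape[1] - r, stride):
                patch = img[y - r:y + r + 1, x - r:x + r + 1]
                hists.append(np.histogram(patch, bins=edges)[0])
                sites.append((y, x))
        return np.array(sites), np.array(hists)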

The nonparametric Bayesian model selection strategy introduced in the previous sections is compared with the stability method [10, 1], a competitive model


Fig. 1. Segmentation results on real-world radar data. Original image (left), unconstrained MDP segmentation (middle), MDP segmentation with smoothness constraint (right).

Fig. 2. A SAR image with a high noise level and ambiguous segments (left). Solutions without (middle) and with smoothing (right).

selection technique for clustering. Stability is a cross-validation based wrapper method for an arbitrary clustering algorithm chosen by the user. The method repeatedly computes clustering solutions on randomly chosen subsets of the input data, and evaluates the predictive power of the obtained cluster model on the remaining data. An instability index is computed for different numbers of clusters, which measures how unstable cluster solutions are under the random split procedure. The chosen model is the one for which the instability index is minimal. Usually, a local rather than the global minimum is chosen, since stability algorithms are known to preferentially estimate a global minimum for very simple solutions (often only two classes). Consider, for example, intensity-based image segmentation: A two-class segmentation, which simply splits the image into light and dark regions, tends to be highly stable with respect to the random split procedure, but is usually not the desired solution.
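A schematic of such a stability computation (a simplified sketch, not the exact procedure of [10, 1]): split the data at random, cluster both halves, transfer the first solution to the second half, and measure label disagreement after an optimal label matching. The callables cluster and predict are placeholders for a user-supplied clustering algorithm; labels are assumed to lie in {0, ..., k-1}.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def disagreement(a, b, k):
        """Fraction of points whose labels differ after the best one-to-one label matching."""
        cm = np.zeros((k, k))
        for x, y in zip(a, b):
            cm[x, y] += 1
        rows, cols = linear_sum_assignment(-cm)      # maximize agreement
        return 1.0 - cm[rows, cols].sum() / len(a)

    def instability(X, k, cluster, predict, n_splits=20, rng=np.random.default_rng(0)):
        """cluster(X, k) -> (labels, model); predict(model, X) -> labels."""
        scores = []
        for _ in range(n_splits):
            idx = rng.permutation(len(X)); half = len(X) // 2
            A, B = X[idx[:half]], X[idx[half:]]
            _, model_a = cluster(A, k)               # solution on the first half
            labels_b, _ = cluster(B, k)              # independent solution on the second half
            scores.append(disagreement(predict(model_a, B), labels_b, k))
        return float(np.mean(scores))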

The MDP/MRF method applied for image segmentation employs a multinomial likelihood. To obtain a valid comparison, the algorithm chosen for use with stability is an EM algorithm which estimates a mixture of multinomial


Fig. 3. MR frontal view image of a monkey's head. Original image (left), smoothed MDP segmentation (middle), original image overlaid with segment boundaries (right).

Fig. 4. Segmentation result for multichannel data: A SAR image with three channels (left), segmentation result obtained with the MDP/MRF model, and the original image overlaid with segment boundaries (right).

distributions (also known as the ACM algorithm [11]). Figs. 1 and 2 show results for two SAR images. Segments of the image in Fig. 1 are well separated. As the results show, segmentation quality for noisy data can be improved significantly by a smoothness constraint. Fig. 2 provides an example of ambiguous, poorly separated segments. In this case, both the unconstrained and constrained segmentation results are of limited quality. Another type of noisy data, an MR image, is shown in Fig. 3 together with its (constrained) segmentation result. Fig. 4 shows segmentation results obtained with the multichannel version of the algorithm on a SAR image consisting of three separate frequency bands.

The burn-in phase of the Gibbs sampling algorithm is assumed to have terminated once the number of assignments changed per iteration remains stable below 1% of the total number of sites. This condition is usually met after at most 500-1000 iterations. The behavior of the class assignments during the sampling process is visualized by the plot in Fig. 5. In both cases, the algorithm takes about 600 iterations to stabilize (the curves become constant apart from fluctuations). The splitting behavior of the algorithm differs significantly between the two cases: In the unconstrained case, large batches of sites are reassigned at



Fig. 5. Cluster sizes during the sampling process for the unconstrained and smoothed version of the MDP method. The number of sites assigned to each cluster (vertical axis) is drawn against the number of iterations (horizontal axis), with each graph representing a cluster. Left: Radar image (Fig. 1), no smoothing. Right: Same image, with smoothing.

Table 1. Number of clusters chosen by the algorithm on two radar images for different values of the hyperparameter

α                          1e-10  1e-9  1e-8  1e-7  1e-6  1e-5  1e-4  1e-3
Image Fig. 1   MDP             2     4     4     6     5     4     5     6
               smoothed        2     2     3     4     4     4     4     4
Image Fig. 2   MDP             4     3     4     7     6     5     5     9
               smoothed        2     2     3     4     5     3     3     5

once to new clusters (visible as jumps in the diagram). In the constrained case, assignments change gradually.
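The burn-in criterion described above (fewer than 1% of sites changing their assignment per sweep) amounts to a one-line check between consecutive sweeps; the function name and threshold argument are only illustrative.

    import numpy as np

    def burned_in(prev_labels, labels, tol=0.01):
        """True if fewer than a fraction tol of the sites changed their cluster assignment."""
        return np.mean(np.asarray(prev_labels) != np.asarray(labels)) < tol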

The influence of the DP hyperparameter α is shown in Tab. 1. In general, the number of clusters increases for larger values of α (i.e. when the probability is high that a new cluster is created by the DP). When the smoothing constraint is activated, the number of clusters becomes more stable with respect to changes of α than without smoothing. We note that the number of clusters selected is more volatile for the poorly separated image in Fig. 2.

For comparison of the model selection results, the stability method has been applied to the two SAR images in Figs. 1 and 2. The resulting instability indices for two to nine clusters are given in Tab. 2. For the image in Fig. 1, the local minimum of the instability index is assumed for five clusters, with the solutions N_C = 3, 4, 5 within range of the error bars. This outcome is comparable to the result of the smoothed MDP model, which (except for very small values of α) selects three or four clusters. The unconstrained MDP model tends to select a larger number of clusters. Since the instability index is obtained by averaging over results on random subsets, one should expect its results to be conservative. This is indeed the case, since the smoothed MDP approach produces a comparable number of segments as the stability method does without smoothing. Now consider the image in Fig. 2, for which MDP results, even in the smoothed case, are rather unstable (cf. Tab. 1). The local minimum


Table 2. Stability indices computed with ACM clustering on two radar images for different numbers of clusters

N_C   Stability index (Image Fig. 1)   Stability index (Image Fig. 2)
2     0.0012 ± 0.0009                  0.0003 ± 0.3341
3     0.3359 ± 0.2324                  0.1765 ± 0.2856
4     0.3204 ± 0.2113                  0.1233 ± 0.3481
5     0.2947 ± 0.0884                  0.1436 ± 0.1929
6     0.4740 ± 0.0867                  0.2933 ± 0.3437
7     0.5164 ± 0.0434                  0.2907 ± 0.3007
8     0.5598 ± 0.0728                  0.3532 ± 0.2889
9     0.6637 ± 0.0512                  0.3378 ± 0.2801

of the instability index is assumed at N_C = 4, but the whole range of computed solutions (N_C = 2, . . . , 9) is within one standard deviation of the local minimum. Thus both the MDP/MRF approach and the stability method give unreliable results on an image with a high noise level and poorly discernible segments. Both methods are constructed around the same probabilistic model of the data (a multinomial histogram clustering model). We therefore conclude that the reliability of model selection results depends, for both approaches, on the ability of the clustering model to resolve differences between segment distributions.

7 Discussion

There exists a considerable number of DP-based models [12] with a wide range of applications in statistics and, more recently, natural language processing and document retrieval [13, 2]. To our knowledge, this paper summarizes the first attempt both to apply the Dirichlet nonparametric approach to image segmentation, and to combine it with Markov random fields, the standard Bayesian approach to image processing and spatial statistics.

We believe that a wide range of applications for MDP models may emerge in computer vision. Despite their mathematical intricacies, the fact that these models may be regarded as mixture distributions with a variable number of mixture components (cf. Sec. 2) makes them an intuitive and powerful tool for probabilistic modeling. Instead of the multinomial distribution employed in our histogram clustering approach, any type of parametric likelihood may be used with the MDP model. If the base measure is set to the respective conjugate prior, standard sampling algorithms are applicable. For example, a nonparametric analogue of the widely used k-means algorithm may be obtained by choosing a Gaussian of fixed, uniform covariance as the likelihood and a Gaussian prior on the mean parameter as the base measure. For applications requiring fast inference, sampling algorithms may be substituted by more efficient approximate methods [14].
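As an illustration of the Gaussian case mentioned above, the two conjugate quantities the sampler needs (the marginal likelihood entering q_i0 and the posterior over a cluster mean) have familiar closed forms; σ², τ² and the function names below are our own illustrative notation.

    import numpy as np

    def log_marginal_gauss(x, mu0, sigma2, tau2):
        """log of the integral of N(x | mu, sigma2 I) N(mu | mu0, tau2 I) dmu,
        which equals log N(x | mu0, (sigma2 + tau2) I)."""
        v = sigma2 + tau2
        return -0.5 * (x.size * np.log(2.0 * np.pi * v) + np.sum((x - mu0) ** 2) / v)

    def sample_cluster_mean(xs, mu0, sigma2, tau2, rng):
        """Draw the cluster mean from its Gaussian posterior given the assigned points xs (rows)."""
        prec = xs.shape[0] / sigma2 + 1.0 / tau2
        mean = (xs.sum(axis=0) / sigma2 + mu0 / tau2) / prec
        return rng.normal(mean, np.sqrt(1.0 / prec))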

We have shown how to combine the MDP clustering model with a spatial smoothness constraint. We would like to emphasize that this nonparametric framework is applicable to any type of mixture component distribution, and our sampling algorithm remains applicable for any conjugate likelihood/base measure pair.


Our experiments confirm what the structure of the model suggests: The ability of the parametric model used within the nonparametric framework to resolve differences between segments determines the quality of segmentation results. It also determines how stable model selection results are with respect to changes of the hyperparameter.

In summary, the MDP approach can be regarded as a model selection framework built in the style of a wrapper method around an application-dependent parametric model. Additionally, it may be equipped with a smoothness constraint for image segmentation. The comparison with the stability framework based on cross-validation yields consistent results for the number of clusters.

References

1. Lange, T., Roth, V., Braun, M., Buhmann, J.M.: Stability-based validation of clustering solutions. Neural Computation 16 (2004) 1299–1323
2. Zaragoza, H., Hiemstra, D., Tipping, D., Robertson, S.: Bayesian extension to the language model for ad hoc information retrieval. In: Proc. SIGIR 2003. (2003)
3. Sudderth, E., Torralba, A., Freeman, W.T., Willsky, A.S.: Describing visual scenes using transformed Dirichlet processes. In Weiss, Y., Scholkopf, B., Platt, J., eds.: Advances in Neural Information Processing Systems 18, MIT Press (2006)
4. Murtagh, F., Raftery, A.E., Starck, J.L.: Bayesian inference for multiband image segmentation via model-based cluster trees. Image and Vision Computing 23 (2005) 587–596
5. Antoniak, C.E.: Mixtures of Dirichlet processes with applications to Bayesian nonparametric estimation. Annals of Statistics 2 (1974) 1152–1174
6. Besag, J., Green, P., Higdon, D., Mengersen, K.: Bayesian computation and stochastic systems. Statistical Science 10 (1995) 3–66
7. Geman, D., Geman, S., Graffigne, C., Dong, P.: Boundary detection by constrained optimization. IEEE Trans. on Pat. Anal. Mach. Intel. 12 (1990) 609–628
8. MacEachern, S.N.: Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics: Simulation and Computation 23 (1994) 727–741
9. Devroye, L.: Non-uniform random variate generation. Springer (1986)
10. Breckenridge, J.: Replicating cluster analysis: Method, consistency and validity. Multivariate Behavioral Research 24 (1989) 147–161
11. Puzicha, J., Hofmann, T., Buhmann, J.M.: Histogram clustering for unsupervised segmentation and image retrieval. Pattern Recognition Letters 20 (1999) 899–909
12. MacEachern, S., Muller, P.: Efficient MCMC schemes for robust model extensions using encompassing Dirichlet process mixture models. In Ruggeri, F., Rios Insua, D., eds.: Robust Bayesian Analysis. Springer (2000)
13. Blei, D.M., Griffiths, T.L., Jordan, M.I., Tenenbaum, J.B.: Hierarchical topic models and the nested Chinese restaurant process. In Thrun, S., Saul, L., Scholkopf, B., eds.: Advances in Neural Information Processing Systems 16, MIT Press (2004)
14. Blei, D.M., Jordan, M.I.: Variational methods for the Dirichlet process. In: Proceedings of the 21st International Conference on Machine Learning. (2004)