
Modeling Clutter Perception using Parametric Proto-object Partitioning

Chen-Ping Yu
Department of Computer Science
Stony Brook University
[email protected]

Wen-Yu Hua
Department of Statistics
Pennsylvania State University
[email protected]

Dimitris Samaras
Department of Computer Science
Stony Brook University
[email protected]

Gregory J. Zelinsky
Department of Psychology
Stony Brook University
[email protected]

Abstract

Visual clutter, the perception of an image as being crowded and disordered, affects aspects of our lives ranging from object detection to aesthetics, yet relatively little effort has been made to model this important and ubiquitous percept. Our approach models clutter as the number of proto-objects segmented from an image, with proto-objects defined as groupings of superpixels that are similar in intensity, color, and gradient orientation features. We introduce a novel parametric method of clustering superpixels by modeling a mixture of Weibulls on Earth Mover's Distance statistics, then taking the normalized number of proto-objects following partitioning as our estimate of clutter perception. We validated this model using a new 90-image dataset of real-world scenes rank ordered by human raters for clutter, and showed that our method not only predicted clutter extremely well (Spearman's ρ = 0.8038, p < 0.001), but also outperformed all existing clutter perception models and even a behavioral object segmentation ground truth. We conclude that the number of proto-objects in an image affects clutter perception more than the number of objects or features.

1 Introduction

Visual clutter, defined colloquially as a "confused collection" or a "crowded disorderly state", is a dimension of image understanding that has implications for applications ranging from visualization and interface design to marketing and image aesthetics. In this study we apply methods from computer vision to quantify and predict human visual clutter perception.

The effects of visual clutter have been studied most extensively in the context of an object detection task, where models attempt to describe how increasing clutter negatively impacts the time taken to find a target object in an image [19][25][29][18][6]. Visual clutter has even been suggested as a surrogate measure for the set size effect, the finding that search performance often degrades with the number of objects in a scene [32]. Because human estimates of the number of objects in a scene are subjective and noisy - one person might consider a group of trees to be an object (a forest or a grove) while another person might label each tree in the same scene as an "object", or even each trunk or branch of every tree - it may be possible to capture this seminal search relationship in an objectively defined measure of visual clutter [21][25]. One of the earliest attempts to model visual clutter used edge density, i.e. the ratio of the number of edge pixels in an image to image size [19].


Figure 1: How can we quantify set size or the number of objects in these scenes, and would this object count capture the perception of scene clutter?

The subsequent feature congestion model ignited interest in clutter perception by estimating image complexity in terms of the density of intensity, color, and texture features in an image [25]. However, recent work has pointed out limitations of the feature congestion model [13][21], leading to the development of alternative approaches to quantifying visual clutter [25][5][29][18].

Our approach is to model visual clutter in terms of proto-objects: regions of locally similar features that are believed to exist at an early stage of human visual processing [24]. Importantly, proto-objects are not objects, but rather the fragments from which objects are built. In this sense, our approach finds a middle ground between features and objects. Previous work used blob detectors to segment proto-objects from saliency maps for the purpose of quantifying shifts of visual attention [31], but this method is limited in that it results in elliptical proto-objects that do not capture the complexity or variability of shapes in natural scenes. Alternatively, it may be possible to apply standard image segmentation methods to the task of proto-object discovery. While we believe this approach has merit (see Section 4.3), it is also limited in that the goal of these methods is to approximate a human-segmented ground truth, where each segment generally corresponds to a complete and recognizable object. For example, in the Berkeley Segmentation Dataset [20] people were asked to segment each image into 2 to 20 equally important and distinguishable things, which results in many segments being actual objects. However, one rarely knows the number of objects in a scene, and ambiguity in what constitutes an object has even led some researchers to suggest that obtaining an object ground truth for natural scenes is an ill-posed problem [21].

Our clutter perception model uses a parametric method of proto-object partitioning that clusters superpixels, and requires no object ground truth. In summary, we create a graph having superpixels as nodes, then compute feature similarity distances between adjacent nodes. We use Earth Mover's Distance (EMD) [26] to perform pair-wise comparisons of feature histograms over all adjacent nodes, and model the EMD statistics with a mixture of Weibulls to solve an edge-labeling problem, which identifies and removes between-cluster edges to form isolated superpixel groups that are subsequently merged. We refer to these merged image fragments as proto-objects. Our approach is based on the novel finding that EMD statistics can be modeled by a Weibull distribution (Section 2.2), which allows us to model such similarity distance statistics with a mixture of Weibull distributions, resulting in extremely efficient and robust superpixel clustering in the context of our model. Our method runs in linear time with respect to the number of adjacent superpixel pairs, and has an end-to-end run time of 15-20 seconds for a typical 0.5 megapixel image, a size that many supervised segmentation methods cannot yet accommodate using desktop hardware [2][8][14][23][34].

2 Proto-object partitioning

2.1 Superpixel pre-processing and feature similarity

To merge similar fragments into a coherent proto-object region, the term fragment and the measure of coherence (similarity) must be defined. We define an image fragment as a group of pixels that share similar low-level image features: intensity, color, and orientation. This conforms with processing in the human visual system, and also makes a fragment analogous to an image superpixel, which is a perceptually meaningful atomic region that contains pixels similar in color and texture [30]. However, superpixel segmentation methods in general produce a fixed number of superpixels from an image, and groups of nearby superpixels may belong to the same proto-object due to the intended over-segmentation.


Therefore, we extract superpixels as image fragments for pre-processing, and subsequently merge similar superpixels into proto-objects. We define a pair of adjacent superpixels as belonging to a coherent proto-object if they are similar in all three low-level image features. Thus we need to determine, for each of the three features, a similarity threshold that separates the similarity distance values into "similar" and "dissimilar" populations, as detailed in Section 2.2.

In this work, the similarity statistics are based on comparing histograms of intensity, color, and orientation features from an image fragment. The intensity feature is a 1D 256-bin histogram, the color feature is a 76×76 (8-bit color) 2D histogram using hue and saturation from the HSV colorspace, and the orientation feature is a symmetrical 1D 360-bin histogram using gradient orientations, similar to the HOG feature [10]. All three feature histograms are normalized to have the same total mass, such that bin counts sum to one.
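To make the feature definitions concrete, the following is a minimal sketch of how such per-superpixel histograms could be computed with NumPy and OpenCV; the helper name `superpixel_histograms`, the OpenCV hue range (0-179), and the exact gradient operator are assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the per-superpixel feature histograms described above.
import numpy as np
import cv2

def superpixel_histograms(img_bgr, mask):
    """Return unit-mass intensity, hue-saturation, and orientation histograms
    for the pixels selected by a boolean superpixel mask."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)

    # 1D, 256-bin intensity histogram.
    h_int, _ = np.histogram(gray[mask], bins=256, range=(0, 256))

    # 2D hue-saturation histogram (76x76 bins, as in the text).
    h_col, _, _ = np.histogram2d(hsv[..., 0][mask], hsv[..., 1][mask],
                                 bins=76, range=[[0, 180], [0, 256]])

    # 1D, 360-bin gradient-orientation histogram (HOG-like).
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 360
    h_ori, _ = np.histogram(ang[mask], bins=360, range=(0, 360))

    # Normalize each histogram to unit mass, as required for EMD.
    norm = lambda h: h.astype(float) / max(h.sum(), 1)
    return norm(h_int), norm(h_col.ravel()), norm(h_ori)
```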

We use Earth Mover's Distance (EMD) to compute the similarity distance between feature histograms [26], which is known to be robust to partially matching histograms. For any pair of adjacent superpixels v_a and v_b, their normalized feature similarity distances for each of the intensity, color, and orientation features are computed as x_{n;f} = EMD(v_{a;f}, v_{b;f}) / EMD_f, where x_{n;f} denotes the similarity (0 is exactly the same, and 1 means completely opposite) between the nth pair (n = 1, ..., N) of nodes v_a and v_b under feature f ∈ {i, c, o} for intensity, color, and orientation. EMD_f is the maximum possible EMD for each of the three image features; it is well defined in this situation: the largest difference between intensities is black to white, between colors is hues that are 180° apart, and between orientations is a horizontal gradient against a vertical gradient. Therefore, EMD_f normalizes x_{n;f} ∈ [0, 1]. In the subsequent sections, we explain our proposed method for finding the adaptive similarity threshold from x_f, the EMDs of all pairs of adjacent nodes.
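Below is a hedged sketch of the normalized distance x_{n;f} for the 1D intensity feature only, using SciPy's 1D Wasserstein distance (which coincides with EMD for unit-mass 1D histograms); the function name and the use of SciPy are assumptions, and the 2D color feature would need a general EMD solver that is not shown here.

```python
# Sketch: normalized intensity similarity distance between two superpixels.
import numpy as np
from scipy.stats import wasserstein_distance

def intensity_similarity(h_a, h_b):
    """x_{n;i} in [0, 1]: EMD between two unit-mass 256-bin intensity
    histograms, divided by the maximum possible EMD (black vs. white)."""
    levels = np.arange(256)
    emd = wasserstein_distance(levels, levels, u_weights=h_a, v_weights=h_b)
    emd_max = 255.0   # distance between delta histograms at 0 and 255
    return emd / emd_max
```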

2.2 EMD statistics and the Weibull distribution

Any pair of adjacent superpixels is either similar enough to belong to the same proto-object, or the two belong to different proto-objects, as separated by the adaptive similarity threshold γ_f, which is different for every image. We formulate this as an edge-labeling problem: given a graph G = (V, E), where v_a ∈ V and v_b ∈ V are two adjacent nodes (superpixels) with edge e_{a,b} ∈ E, a ≠ b, between them, also the nth edge of G, the task is to label the binary indicator variable y_{n;f} = I(x_{n;f} < γ_f) on edge e_{a,b}, such that y_{n;f} = 1 if x_{n;f} < γ_f, meaning v_{a;f} and v_{b;f} are similar (belong to the same proto-object), and y_{n;f} = 0 otherwise, meaning v_{a;f} and v_{b;f} are dissimilar (belong to different proto-objects). Once γ_f is computed, removing the edges for which y_f = 0 results in isolated clusters of locally similar image patches, which are the desired proto-objects.

Intuitively, any pair of adjacent nodes is either within the same proto-object cluster or between different clusters (y_{n;f} ∈ {1, 0}); therefore we consider two populations (the within-cluster edges and the between-cluster edges) to be modeled from the density of x_f in a given image. In theory, this would mean that the density of x_f is a distribution exhibiting bi-modality, such that the left mode corresponds to the set of x_f that are considered similar and coherent, while the right mode contains the set of x_f that represent dissimilarity. At first thought, applying k-means with k = 2 or a mixture of two Gaussians would allow estimation of the two populations. However, there is no evidence that similarity distances follow symmetrical or normal distributions. In the following, we argue that the similarity distances x_f computed by EMD follow a Weibull distribution, a member of the Exponential family that is skewed in shape.

We define
$$\mathrm{EMD}(P, Q) = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} f'_{ij}\, d_{ij}}{\sum_{i=1}^{m}\sum_{j=1}^{n} f'_{ij}},$$
with an optimal flow f′_{ij} such that ∑_j f′_{ij} ≤ p_i, ∑_i f′_{ij} ≤ q_j, ∑_{i,j} f′_{ij} = min(∑_i p_i, ∑_j q_j), and f′_{ij} ≥ 0, where P = {(x_1, p_1), ..., (x_m, p_m)} and Q = {(y_1, q_1), ..., (y_n, q_n)} are the two signatures to be compared, and d_{ij} denotes a dissimilarity metric (i.e. the L2 distance) between x_i and y_j in R^d. When P and Q are normalized to have the same total mass, EMD becomes identical to the Mallows distance [17], defined as
$$M_p(X, Y) = \Big(\frac{1}{n}\sum_{i=1}^{n} |x_i - y_i|^p\Big)^{1/p},$$
where X and Y are sorted vectors of the same size; the Mallows distance is thus an Lp-norm based distance measure. Furthermore, Lp-norm based distance metrics are Weibull distributed if the two feature vectors being compared are correlated and non-identically distributed [7]. We show that our feature assumptions are satisfied in Section 4.1. Hence, we can separately model each feature of x_f as a mixture of two Weibull distributions, and compute the corresponding γ_f as the boundary location between the two components of the mixture. Although the Weibull distribution has been used to model actual image features such as texture and edges [12][35], it has not been used to model EMD similarity distance statistics until now.
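As a purely illustrative check of the EMD/Mallows identity quoted above (not part of the model), the following snippet compares SciPy's 1D EMD with the M_1 formula on two equal-size, equal-weight samples:

```python
# Numeric check: 1D EMD equals the Mallows distance (p = 1) for equal-size samples.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x, y = rng.random(1000), rng.random(1000)

mallows_1 = np.mean(np.abs(np.sort(x) - np.sort(y)))   # M_1(X, Y)
emd_1 = wasserstein_distance(x, y)                     # 1D EMD / W_1

print(abs(mallows_1 - emd_1) < 1e-9)                   # True
```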

2.3 Weibull mixture model

Our Weibull mixture model (WMM) takes the following general form:

$$W_K(x; \theta) = \sum_{k=1}^{K} \pi_k\, \phi_k(x; \theta_k), \qquad \phi(x; \alpha, \beta, c) = \frac{\beta}{\alpha}\Big(\frac{x - c}{\alpha}\Big)^{\beta - 1} e^{-\left(\frac{x - c}{\alpha}\right)^{\beta}} \qquad (1)$$

where θ_k = (α_k, β_k, c_k) is the parameter vector for the kth mixture component, φ denotes the three-parameter Weibull pdf with scale (α), shape (β), and location (c) parameters, and the mixing parameters π_k satisfy ∑_k π_k = 1. In this case, our two-component WMM contains a 7-parameter vector θ = (α_1, β_1, c_1, α_2, β_2, c_2, π) that yields the following complete form:

$$W_2(x; \theta) = \pi\, \frac{\beta_1}{\alpha_1}\Big(\frac{x - c_1}{\alpha_1}\Big)^{\beta_1 - 1} e^{-\left(\frac{x - c_1}{\alpha_1}\right)^{\beta_1}} + (1 - \pi)\, \frac{\beta_2}{\alpha_2}\Big(\frac{x - c_2}{\alpha_2}\Big)^{\beta_2 - 1} e^{-\left(\frac{x - c_2}{\alpha_2}\right)^{\beta_2}} \qquad (2)$$
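For reference, Eq. 2 can be written directly as code. This is a sketch using NumPy; an equivalent single-component density is available as scipy.stats.weibull_min with shape β, scale α, and location c.

```python
# Eq. 2 as code: three-parameter Weibull pdf and the two-component mixture.
import numpy as np

def weibull_pdf(x, alpha, beta, c):
    """Three-parameter Weibull density with scale alpha, shape beta, location c."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    m = x > c                        # the density is zero at or below the location c
    z = (x[m] - c) / alpha
    out[m] = (beta / alpha) * z ** (beta - 1) * np.exp(-z ** beta)
    return out

def wmm2_pdf(x, theta):
    """theta = (alpha1, beta1, c1, alpha2, beta2, c2, pi)."""
    a1, b1, c1, a2, b2, c2, pi = theta
    return pi * weibull_pdf(x, a1, b1, c1) + (1 - pi) * weibull_pdf(x, a2, b2, c2)
```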

To estimate the parameters of W_2(x; θ), we tested two optimization methods: maximum likelihood estimation (MLE) and nonlinear least squares minimization (NLS). Both MLE and NLS require an initial parameter vector θ′ to begin the optimization, and the choice of θ′ is crucial to convergence to the optimal parameter vector θ. In our case, the initial guess is quite well defined: for any node of a specific feature v_{j;f} and its set of adjacent neighbors v^N_{j;f} = N(v_{j;f}), the neighbor that is most similar to v_{j;f} is the most likely to belong to the same cluster as v_{j;f}, and this is especially true under an over-segmentation scenario. Therefore, the initial guess for the first mixture component φ_{1;f} is the MLE of φ_{1;f}(θ′_{1;f}; x′_f), where x′_f = {min(EMD(v_{j;f}, v^N_{j;f})) | v_{j;f}; j = 1, ..., z, f ∈ {i, c, o}}, z is the total number of superpixels, and x′_f ⊂ x_f. After obtaining θ′_1 = (α′_1, β′_1, c′_1), several candidate θ′_2 can be computed for re-start purposes via MLE from the data satisfying Pr(x_f | θ′_1) > p, where Pr is the cumulative distribution function and p is a range of percentiles. Together, they form the complete initial guess parameter vector θ′ = (α′_1, β′_1, c′_1, α′_2, β′_2, c′_2, π′), where π′ = 0.5.
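A possible implementation of this initialization is sketched below; the input format (`dist`, a mapping from each superpixel to the EMDs of its adjacent pairs for one feature) and the use of a single percentile rather than a range of re-start percentiles are simplifying assumptions.

```python
# Sketch of the initial guess theta' described above (hypothetical data layout).
import numpy as np
from scipy.stats import weibull_min

def initial_guess(dist, percentile=0.75):
    # x'_f: the most-similar-neighbor distance of every superpixel.
    x_prime = np.array([min(d) for d in dist.values()])
    b1, c1, a1 = weibull_min.fit(x_prime)          # MLE of phi_1: (shape, loc, scale)

    # x_f: all adjacent-pair EMDs; take the upper tail under phi_1 for phi_2.
    all_x = np.concatenate([np.asarray(d) for d in dist.values()])
    tail = all_x[weibull_min.cdf(all_x, b1, loc=c1, scale=a1) > percentile]
    b2, c2, a2 = weibull_min.fit(tail)             # MLE of phi_2 from the tail

    # theta' = (alpha1, beta1, c1, alpha2, beta2, c2, pi'), with pi' = 0.5.
    return np.array([a1, b1, c1, a2, b2, c2, 0.5])
```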

2.3.1 Parameter estimation

Maximum likelihood estimation (MLE) estimates the parameters by maximizing the log-likelihood function of the observed samples. The log-likelihood function of W_2(x; θ) is given by:

$$\ln L(\theta; x) = \sum_{n=1}^{N} \ln\Big\{ \pi\, \frac{\beta_1}{\alpha_1}\Big(\frac{x_n - c_1}{\alpha_1}\Big)^{\beta_1 - 1} e^{-\left(\frac{x_n - c_1}{\alpha_1}\right)^{\beta_1}} + (1 - \pi)\, \frac{\beta_2}{\alpha_2}\Big(\frac{x_n - c_2}{\alpha_2}\Big)^{\beta_2 - 1} e^{-\left(\frac{x_n - c_2}{\alpha_2}\right)^{\beta_2}} \Big\} \qquad (3)$$

Due to the complexity of this log-likelihood function and the presence of the location parameters c_{1,2}, we adopt the Nelder-Mead method as a derivative-free optimization of MLE that performs parameter estimation by direct search [22][16], minimizing the negative log-likelihood function of Eq. 3.
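A minimal sketch of this step, assuming the `wmm2_pdf` helper sketched earlier and an initial vector θ′ from the previous section, might look as follows.

```python
# Sketch: minimize the negative log-likelihood of Eq. 3 with Nelder-Mead.
import numpy as np
from scipy.optimize import minimize

def fit_wmm2_mle(x, theta0):
    def nll(theta):
        pdf = wmm2_pdf(x, theta)                    # defined in the earlier sketch
        if np.any(pdf <= 0) or not (0 < theta[-1] < 1):
            return np.inf                           # reject invalid parameters
        return -np.sum(np.log(pdf))
    res = minimize(nll, theta0, method='Nelder-Mead')
    return res.x
```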

For the NLS optimization method, x_f is first approximated with a histogram, much like a box filter that smoothes a curve. The appropriate histogram bin-width for data representation is computed by w = 2(IQR)n^{-1/3}, where IQR is the interquartile range of the data with n observations [15]. This allows us to fit a two-component WMM to the height of each bin with NLS as a curve-fitting problem, which is a robust alternative to MLE when the noise level can be reduced by some approximation scheme. We then find the least-squares minimizer using the trust-region method [27][28]. Both the Nelder-Mead MLE algorithm and the NLS method are detailed in the supplementary material.
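The NLS variant could be sketched as follows, again reusing `wmm2_pdf`; the choice of SciPy's trust-region-reflective solver and the density-normalized histogram are assumptions consistent with, but not dictated by, the description above.

```python
# Sketch: histogram approximation of x_f and trust-region least-squares fit.
import numpy as np
from scipy.optimize import least_squares

def fit_wmm2_nls(x, theta0):
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    w = 2 * iqr * len(x) ** (-1 / 3)                 # bin width: w = 2*IQR*n^(-1/3)
    bins = np.arange(x.min(), x.max() + w, w)
    heights, edges = np.histogram(x, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])

    residuals = lambda theta: wmm2_pdf(centers, theta) - heights
    res = least_squares(residuals, theta0, method='trf')   # trust-region reflective
    return res.x
```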

Figure 2 shows the WMM fit using the Nelder-Mead MLE method. In addition to the good fit of the mixture model to the data, it also shows that the right-skewed data (EMD statistics) is remarkably Weibull, further validating that EMD statistics follow a Weibull distribution both in theory and in experiments.


Figure 2: (a) original image, (b) after superpixel pre-processing [1] (977 initial segments), (c) final proto-object partitioning result (150 segments). Each final segment is shown with its mean RGB value to approximate proto-object perception. (d) W_2(x_f; θ_f) optimized using the Nelder-Mead algorithm for intensity, (e) color, and (f) orientation, based on the image in (b). The red lines indicate the individual Weibull components, and the blue line is the density of the mixture W_2(x_f; θ_f).

2.4 Visual clutter model with model selection

At times, the dissimilar population can be highly mixed in with the similar population, and the density of x_f then resembles a single Weibull in shape, as in Figure 2d. Therefore, we fit both a single Weibull and a two-component WMM to x_f, and apply the Akaike Information Criterion (AIC) to prevent possible over-fitting by the two-component WMM. AIC tends to place a heavier penalty on the simpler model, which is suitable in our case to ensure that preference is placed on the two-population mixture models. For models optimized using MLE, the standard AIC is used; for the NLS cases, the corrected AIC (AICc) for smaller sample sizes (generally when n/k ≤ 40) with the residual sum of squares (RSS) is used, defined as AICc = n ln(RSS/n) + 2k + 2k(k+1)/(n−k−1), where k is the number of model parameters and n is the number of samples.
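The two criteria, as defined above, written as a short sketch:

```python
# Model selection quantities: standard AIC for MLE fits, AICc with RSS for NLS fits.
import numpy as np

def aic_mle(nll, k):
    """AIC from a minimized negative log-likelihood with k parameters."""
    return 2 * k + 2 * nll

def aicc_nls(rss, n, k):
    """Corrected AIC for an NLS fit with n histogram bins, k parameters, and RSS."""
    return n * np.log(rss / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)
```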

The optimal γf can then be determined as follows:

$$\gamma_f = \begin{cases} \max(x, \varepsilon) \;\;\text{s.t.}\;\; \pi_1 \phi_{1;f}(x \mid \theta_{1;f}) = \pi_2 \phi_{2;f}(x \mid \theta_{2;f}), & \text{if } \mathrm{AIC}(W_2) \le \mathrm{AIC}(W_1) \\[4pt] \max\!\big(\alpha_1 (-\ln(1 - \tau))^{1/\beta_1},\, \varepsilon\big), & \text{otherwise} \end{cases} \qquad (4)$$

In the first case, when the mixture model is preferred, the optimal γ_f is the crossing point between the mixture components, and the equality can be solved in linear time by searching over the values of the vector x_f; in the second case, when the single Weibull is preferred by model selection, γ_f is calculated from the inverse CDF of W_1, which gives the location of a given percentile parameter τ. Note that γ_f is lower bounded by a tolerance parameter ε in both cases to prevent unusual behavior when an image is nearly blank (γ_f ∈ [ε, 1]), making τ and ε the only model parameters in our framework.
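A sketch of Eq. 4, reusing the `weibull_pdf` helper from the earlier snippet; the default values τ = 0.8 and ε = 0.14 are the best-performing MLE settings reported in Section 4.2, and the linear search over x for the crossing point follows the description above.

```python
# Sketch of Eq. 4: per-feature adaptive similarity threshold gamma_f.
import numpy as np

def similarity_threshold(x, theta2, theta1, aic2, aic1, tau=0.8, eps=0.14):
    if aic2 <= aic1:                                   # two-component WMM preferred
        a1, b1, c1, a2, b2, c2, pi = theta2
        diff = (pi * weibull_pdf(x, a1, b1, c1)
                - (1 - pi) * weibull_pdf(x, a2, b2, c2))
        crossing = x[np.argmin(np.abs(diff))]          # pi1*phi1(x) = pi2*phi2(x)
        return max(crossing, eps)
    alpha1, beta1, _ = theta1                          # single Weibull preferred
    return max(alpha1 * (-np.log(1 - tau)) ** (1 / beta1), eps)
```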

We perform Principal Component Analysis (PCA) on the similarity distance values x_f of intensity, color, and orientation and obtain a combined distance value by projecting x_f onto the first principal component, so that the relative importance of each distance feature is captured by its variance through PCA. This projected distance is used to construct a minimum spanning tree over the superpixels to form the structure of graph G, which weakens the inter-cluster connectivity by removing cycles and other excessive graph connections. Finally, each edge of G is labeled according to Section 2.2 given the computed γ_f, such that an edge is labeled 1 (similar) only if the pair of superpixels is similar in all three features. Edges labeled 0 (dissimilar) are removed from G to form isolated clusters (proto-objects), and our visual clutter model produces a normalized clutter measure between 0 and 1 by dividing the number of proto-objects by the initial number of superpixels, making it invariant to different scales of superpixel over-segmentation.
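Putting the pieces together, the following is a simplified end-to-end sketch of the partitioning and clutter measure, assuming the per-edge EMDs and the three thresholds have already been computed; SciPy's sparse-graph routines stand in for whatever MST and connected-component implementations the authors actually used, and the PC sign fix is an added assumption.

```python
# Sketch: PCA projection, MST over superpixels, edge labeling, clutter measure.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def clutter_score(edges, x_i, x_c, x_o, gammas, n_superpixels):
    """edges: (N, 2) integer array of adjacent superpixel pairs; x_i, x_c, x_o:
    per-edge EMDs; gammas: (gamma_i, gamma_c, gamma_o) similarity thresholds."""
    # Project the three distance features onto their first principal component.
    X = np.column_stack([x_i, x_c, x_o])
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ vt[0]
    if np.corrcoef(proj, X.sum(axis=1))[0, 1] < 0:     # fix the arbitrary PC sign
        proj = -proj

    # Minimum spanning tree over the adjacency graph (upper-triangular weights).
    w = proj - proj.min() + 1e-9                       # strictly positive weights
    r = np.minimum(edges[:, 0], edges[:, 1])
    c = np.maximum(edges[:, 0], edges[:, 1])
    g = coo_matrix((w, (r, c)), shape=(n_superpixels, n_superpixels))
    mst = minimum_spanning_tree(g).tocoo()

    # Keep an MST edge only if the pair is "similar" in all three features.
    idx = {(a, b): n for n, (a, b) in enumerate(zip(r, c))}
    keep = np.array([(x_i[idx[(a, b)]] < gammas[0]) and
                     (x_c[idx[(a, b)]] < gammas[1]) and
                     (x_o[idx[(a, b)]] < gammas[2])
                     for a, b in zip(mst.row, mst.col)], dtype=bool)

    pruned = coo_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                        shape=(n_superpixels, n_superpixels))
    n_proto, _ = connected_components(pruned, directed=False)
    return n_proto / n_superpixels                     # normalized clutter measure
```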

3 Dataset and ground truth

Various in-house image datasets have been used in previous work to evaluate models of visual clutter. The feature congestion model was evaluated on 25 images of US city/road maps and weather maps [25]; the models in [5] and [29] were evaluated on another 25 images consisting of 6, 12, or 24 synthetically generated objects arranged into a grid; and the model from [18] used 58 images of six map or chart categories (airport terminal maps, flowcharts, road maps, subway maps, topographic charts, and weather maps). In each of these datasets, every image must be rank ordered for visual clutter with respect to every other image in the set by the same human subject, which is a tiring and time-consuming process. This rank ordering is essential for a clutter perception experiment, as it establishes a stable clutter metric that is meaningful across participants, but it limits the dataset size to the number of images each individual observer can handle. Absolute clutter scales are undesirable because different raters might use different ranges on such a scale.

We created a comparatively large clutter perception dataset consisting of 90 800×600 real-world images sampled from the SUN Dataset [33], for which there exist human segmentations of objects and object counts. These object segmentations serve as one of the ground truths in our study. The high resolution of these images is also important for the accurate perception and assessment of clutter. The 90 images were selected to constitute six groups based on their ground truth object counts, with 15 images in each group. Specifically, group 1 had images with object counts in the 1-10 range, group 2 had counts in the 11-20 range, and so on up to group 6 with counts in the 51-60 range.

These 90 images were rated in the laboratory by 15 college-aged participants whose task was to order the images from least to most perceived visual clutter. This was done by displaying each image one at a time and asking participants to insert it into an expanding set of previously rated images. Participants were encouraged to take as much time as they needed, and were allowed to freely scroll through the existing set of clutter-rated images when deciding where to insert the new image. A different random sequence of images was used for each participant (to control for biases and order effects), and the entire task lasted approximately one hour. The average correlation (Spearman's rank-order correlation) over all pairs of participants was 0.6919 (p < 0.001), indicating good agreement among raters. We used the median ranked position of each image as the ground truth for clutter perception in our experiments.

4 Experiment and results

4.1 Image feature assumptions

In their demonstration that similarity distances adhere to a Weibull distribution, Burghouts et al. [7] derived and related Lp-norm based distances from the statistics of sums [3][4]: for non-identical and correlated random variables X_i, the sum ∑_{i=1}^{N} X_i is Weibull distributed if the X_i are upper-bounded with finite N, where X_i = |s_i − T_i|^p, N is the dimensionality of the feature vector, i is the index, and s, t ∈ T are different sample vectors of the same feature.

The three image features used in this model are finite and upper bounded, and we follow the procedure from [7] with the L2 distance to determine whether they are correlated. We consider distances from one reference superpixel feature vector s to 100 other randomly selected superpixel feature vectors T (of the same feature), and compute the differences at index i to obtain the random variable X_i = |s_i − T_i|^p. Pearson's correlation is then used to determine the relationship between X_i and X_j, i ≠ j, at a confidence level of 0.05. This procedure is repeated 500 times per image for all three feature types over all 90 images. As predicted, we found an almost perfect correlation between feature value differences for each of the features tested (Intensity: 100%, Hue: 99.2%, Orientation: 98.97%). This confirms that the similarity distances computed from the low-level image features used in this study follow a Weibull distribution.
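A sketch of this correlation test for a single reference superpixel; the exact sampling protocol (which dimension pairs are tested and how the 500 repetitions are organized) is an assumption.

```python
# Sketch: fraction of dimension pairs (X_i, X_j) that are significantly correlated.
import numpy as np
from scipy.stats import pearsonr

def correlated_fraction(ref, others, p=2, n_pairs=500, alpha=0.05, seed=0):
    """ref: (d,) feature vector; others: (100, d) randomly selected feature vectors."""
    rng = np.random.default_rng(seed)
    X = np.abs(others - ref) ** p        # X_i = |s_i - T_i|^p, one column per dimension i
    hits = 0
    for _ in range(n_pairs):
        i, j = rng.choice(X.shape[1], size=2, replace=False)
        _, pval = pearsonr(X[:, i], X[:, j])
        hits += pval < alpha
    return hits / n_pairs
```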


Method         WMM-mle  WMM-nls  MS [9]  GB [11]  PL [6]  ED [19]  FC [25]  # Obj   C3 [18]
Spearman's ρ   0.8038   0.7966   0.7262  0.6612   0.6439  0.6231   0.5337   0.5255  0.4810

Table 1: Correlations between human clutter perception and all the evaluated methods. WMM is the Weibull mixture model underlying our proto-object partitioning approach, with both optimization methods.

4.2 Model evaluation

We ran our model with different parameter settings of ε ∈ {0.01, 0.02, ..., 0.20} and τ ∈ {0.5, 0.6, ..., 0.9}, using SLIC superpixels [1] initialized at 1000 seeds. We then correlated the number of proto-objects formed after superpixel merging with the ground truth behavioral clutter perception estimates by computing Spearman's rank correlation (Spearman's ρ), following the convention of [25][5][29][18].

A model using MLE as the optimization method achieved the highest correlation, ρ = 0.8038, p < 0.001, with ε = 0.14 and τ = 0.8. Because we did not have separate training/testing sets, we performed 10-fold cross-validation and obtained an average testing correlation of r = 0.7599, p < 0.001. When optimized using NLS, the model achieved a maximum correlation of ρ = 0.7966, p < 0.001, with ε = 0.14 and τ = 0.4, and the corresponding 10-fold cross-validation yielded an average testing correlation of r = 0.7375, p < 0.001. The high cross-validation averages indicate that our model is highly robust and generalizes to unseen data.
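The evaluation itself reduces to a rank correlation; a sketch, with hypothetical file names standing in for the model scores and the median human ranks:

```python
# Sketch: Spearman rank correlation between model scores and human clutter ranks.
import numpy as np
from scipy.stats import spearmanr

scores = np.load('model_clutter_scores.npy')       # hypothetical file names
human_ranks = np.load('median_clutter_ranks.npy')  # one value per image (90 total)

rho, pval = spearmanr(scores, human_ranks)
print(f"Spearman's rho = {rho:.4f}, p = {pval:.3g}")
```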

It is worth pointing out that the optimal value of the tolerance parameter ε showed a peak correlation at 0.14. To the extent that this is meaningful and extends to people, it suggests that visual clutter perception may ignore feature dissimilarity on the order of 14% when deciding whether two adjacent regions are similar and should be merged.

We compared our model to four other state-of-the-art models of clutter perception: the feature congestion model [25], the edge density method [19], the power-law model [6], and the C3 model [18]. Table 1 shows that our model significantly outperformed all of these previously reported methods. The relatively poor performance of the recent C3 model was surprising, and can probably be attributed to the previous evaluation of that model using charts and maps rather than arbitrary realistic scenes (personal communication with the authors). Collectively, these results suggest that a model that merges superpixels into proto-objects best describes human clutter perception, and that the benefit of using a proto-object model for clutter prediction is not small; our model resulted in an improvement of at least 15% over existing models of clutter perception. Although we did not record run-time statistics for the other models, our model, implemented in Matlab¹, had an end-to-end (excluding superpixel pre-processing) run time of 15-20 seconds on 800×600 images running on a Win7 Intel Core i7 computer with 8 GB RAM.

4.3 Comparison to image segmentation methods

We also attempted to compare our method to state-of-the-art image segmentation algorithms such as gPb-ucm [2], but found that the method was unable to process our image dataset using either an Intel Core i7 machine with 8 GB RAM or an Intel Xeon machine with 16 GB RAM, at the high image resolutions required by our behavioral clutter estimates. A similar limitation was found for image segmentation methods that utilize gPb contour detection as pre-processing, such as [8][14], while [23][34] took 10 hours on a single image and did not converge.

Therefore, we limit our evaluation to mean-shift [9] and the graph-based method [11], as they are able to produce variable numbers of segments based on unsupervised partitioning of the 90 images in our dataset. Despite using the best dataset-wide parameter settings for these unsupervised methods, our method remains the model most highly correlated with the clutter perception ground truth, as shown in Table 1, and the methods that allow quantification of a proto-object set size (WMM, mean-shift, and graph-based) outperformed all of the previous clutter models.

We also correlated the number of objects segmented by humans (as provided in the SUN Dataset) with the clutter perception ground truth, denoted as # Obj in Table 1.

¹ Code is available at mysbfiles.stonybrook.edu/~cheyu/projects/proto-objects.html


Figure 3: Top: Four images from our dataset, rank ordered for clutter perception by human raters; median clutter rank order from left to right: 6, 47, 70, 87. Bottom: Corresponding images after parametric proto-object partitioning; median clutter rank order from left to right: 7, 40, 81, 83.

Interestingly, despite object count being a human-derived estimate, it produced among the lowest correlations with clutter perception. This suggests that clutter perception is not determined simply by the number of objects in a scene; it is the proto-object composition of those objects that is important.

5 Conclusion

We proposed a model of visual clutter perception based on a parametric image partitioning method that is fast and able to work on large images. This method of segmenting proto-objects from an image using a mixture of Weibull distributions is also novel in that it models similarity distance statistics rather than feature statistics obtained directly from pixels. Our work also contributes to the behavioral understanding of clutter perception. We showed that our model is an excellent predictor of human clutter perception, outperforming all existing clutter models, and that it predicts clutter perception better than even a behavioral segmentation of objects. This suggests that clutter perception is best described at the proto-object level, a level intermediate to that of objects and features. Moreover, our work suggests a means of objectively quantifying a behaviorally meaningful set size for scenes, at least with respect to clutter perception. We also introduced a new and validated clutter perception dataset consisting of a variety of scene types and object categories. This dataset, the largest and most comprehensive to date, will likely be used widely in future model evaluation and method comparison studies. In future work we plan to extend our parametric partitioning method to general image segmentation and data clustering problems, and to use our model to predict human visual search behavior and other behaviors that might be affected by visual clutter.

6 Acknowledgment

We appreciate the authors of [18] for sharing and discussing their code, Dr. Burghouts for providing detailed explanations of the feature assumption part of [7], and Dr. Matthew Asher for providing the human search performance data from their work in Journal of Vision, 2013. This work was supported by NIH Grant R01-MH063748 to G.J.Z., NSF Grant IIS-1111047 to G.J.Z. and D.S., and the SUBSAMPLE Project of the DIGITEO Institute, France.

References

[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE TPAMI, 2012.
[2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE TPAMI, 2010.
[3] E. Bertin. Global fluctuations and Gumbel statistics. Physical Review Letters, 2005.
[4] E. Bertin and M. Clusel. Generalised extreme value statistics and sum of correlated variables. Journal of Physics A, 2006.
[5] M. J. Bravo and H. Farid. Search for a category target in clutter. Perception, 2004.
[6] M. J. Bravo and H. Farid. A scale invariant measure of clutter. Journal of Vision, 2008.
[7] G. J. Burghouts, A. W. M. Smeulders, and J.-M. Geusebroek. The distribution family of similarity distances. In NIPS, 2007.
[8] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In CVPR, 2010.
[9] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE TPAMI, 2002.
[10] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[11] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. IJCV, 2004.
[12] J.-M. Geusebroek and A. W. Smeulders. A six-stimulus theory for stochastic texture. IJCV, 2005.
[13] J. M. Henderson, M. Chanceaux, and T. J. Smith. The influence of clutter on real-world scene search: Evidence from search efficiency and eye movements. Journal of Vision, 2009.
[14] A. Ion, J. Carreira, and C. Sminchisescu. Image segmentation by figure-ground composition into maximal cliques. In ICCV, 2011.
[15] A. J. Izenman. Recent developments in nonparametric density estimation. Journal of the American Statistical Association, 1991.
[16] J. C. Lagarias, J. A. Reeds, M. H. Wright, and P. E. Wright. Convergence properties of the Nelder-Mead simplex method in low dimensions. SIAM Journal on Optimization, 1998.
[17] E. Levina and P. Bickel. The earth mover's distance is the Mallows distance: some insights from statistics. In ICCV, 2001.
[18] M. C. Lohrenz, J. G. Trafton, R. M. Beck, and M. L. Gendron. A model of clutter for complex, multivariate geospatial displays. Human Factors, 2009.
[19] M. L. Mack and A. Oliva. Computational estimation of visual complexity. In the 12th Annual Object, Perception, Attention, and Memory Conference, 2004.
[20] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
[21] M. B. Neider and G. J. Zelinsky. Cutting through the clutter: searching for targets in evolving complex scenes. Journal of Vision, 2011.
[22] J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 1965.
[23] S. R. Rao, H. Mobahi, A. Y. Yang, S. Sastry, and Y. Ma. Natural image segmentation with adaptive texture and boundary encoding. In ACCV, 2009.
[24] R. A. Rensink. Seeing, sensing, and scrutinizing. Vision Research, 2000.
[25] R. Rosenholtz, Y. Li, and L. Nakano. Measuring visual clutter. Journal of Vision, 2007.
[26] Y. Rubner, C. Tomasi, and L. J. Guibas. A metric for distributions with applications to image databases. In ICCV, 1998.
[27] T. Steihaug. The conjugate gradient method and trust regions in large scale optimization. SIAM Journal on Numerical Analysis, 1983.
[28] P. L. Toint. Towards an efficient sparsity exploiting Newton method for minimization. Sparse Matrices and Their Uses, 1981.
[29] R. van den Berg, F. W. Cornelissen, and J. B. T. M. Roerdink. A crowding model of visual clutter. Journal of Vision, 2009.
[30] O. Veksler, Y. Boykov, and P. Mehrani. Superpixels and supervoxels in an energy optimization framework. In ECCV, 2010.
[31] M. Wischnewski, A. Belardinelli, W. X. Schneider, and J. J. Steil. Where to look next? Combining static and dynamic proto-objects in a TVA-based model of visual attention. Cognitive Computation, 2010.
[32] J. M. Wolfe. Visual search. Attention, 1998.
[33] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
[34] A. Y. Yang, J. Wright, Y. Ma, and S. Sastry. Unsupervised segmentation of natural images via lossy data compression. CVIU, 2008.
[35] V. Yanulevskaya and J.-M. Geusebroek. Significance of the Weibull distribution and its sub-models in natural image statistics. In Int. Conference on Computer Vision Theory and Applications, 2009.
